Web Scraping with LLaMA 3: Turn Any Website into Structured JSON (2025 Guide)

Use LLaMA 3 to convert messy, unstructured HTML into clean, structured JSON and sidestep common scraping hurdles.

Traditional web scraping methods often break when website layouts change or anti-bot protections get stricter. In this guide, you’ll learn a more resilient, AI-powered approach using LLaMA 3—Meta’s powerful open-weight language model—to extract structured data from almost any website and convert it into clean, usable JSON.

Let’s get started.

Why Use LLaMA 3 for Web Scraping?

LLaMA 3 (released in April 2024) is Meta’s open-weight large language model, initially available in 8B and 70B parameter sizes. Subsequent iterations (LLaMA 3.1, 3.2, and 3.3) extended the family to sizes ranging from 1B to 405B parameters and significantly improved performance and contextual understanding, supporting a wide range of use cases and hardware capacities.

Traditional web scraping methods rely on static selectors like XPath or CSS, which can easily break when website structures change. In contrast, LLaMA 3 enables intelligent data extraction by understanding content contextually—just like a human would.

This makes it ideal for:

  • Handling pages where layouts and elements frequently change—such as eCommerce sites like Amazon
  • Parsing complex and unstructured HTML
  • Reducing the need for custom data parsing logic for each website
  • Creating more resilient scrapers that don’t break with every website update
  • Keeping your scraped data within your environment—crucial for sensitive information

Learn more about using AI for web scraping.

Prerequisites

Before diving into LLM web scraping, make sure you have the following in place:

  • Python 3 installed
  • Basic Python knowledge (you don’t need to be an expert)
  • A compatible operating system: macOS 11 Big Sur or later, Linux, or Windows 10 or later
  • Adequate hardware resources (see model selection details below)

Installing Ollama

Ollama is a lightweight tool that simplifies downloading, setting up, and running large language models locally.

ollama-llm-download-installation-page

To get started:

  1. Visit the official Ollama website
  2. Download and install the application for your operating system
  3. Important: During installation, Ollama will prompt you to run a terminal command—don’t run it yet. We’ll choose the right model version first.

Choosing Your LLaMA Model

Start by browsing Ollama’s model library to choose the LLaMA version that best fits your hardware and use case.

For most users, llama3.1:8b offers the best balance between performance and efficiency. It’s lightweight, capable, and requires approximately 4.9 GB of disk space and 6–8 GB of RAM. It runs smoothly on most modern laptops.

If you’re working with a more powerful machine and need greater reasoning capabilities or extended context length, consider scaling up to larger models like 70B or 405B. These require significantly more memory and compute power.

Pulling and Running the Model

To download and initialize the LLaMA 3.1 (8B) model, run the following command:

ollama run llama3.1:8b

Once the model is downloaded, you’ll see a simple interactive prompt:

>>> Send a message (/? for help)

You can test the model with a quick query:

>>> who are you?
I am LLaMA, an AI assistant developed by Meta AI...

A successful response like the one above confirms that the model is properly installed. Type /bye to exit the prompt.

Next, start the Ollama server by running:

ollama serve

This command launches a local Ollama instance at http://127.0.0.1:11434/. Leave this terminal window open, as the server must stay running in the background.

To verify it’s working, open your browser and go to that URL—you should see the message “Ollama is running”.
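
If you prefer to check from code, here’s a quick, optional sanity check in Python. It assumes the default Ollama port (11434) and uses two endpoints that ship with Ollama: the root endpoint, which returns a plain-text status message, and /api/tags, which lists the models you’ve pulled locally.

import requests

# Root endpoint: should print "Ollama is running"
print(requests.get("http://127.0.0.1:11434/", timeout=5).text)

# /api/tags lists locally available models, so you can confirm llama3.1:8b is present
print(requests.get("http://127.0.0.1:11434/api/tags", timeout=5).json())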

Building an LLM-Powered Amazon Scraper

In this section, we’ll build a scraper that extracts product details from Amazon—one of the most challenging targets due to its dynamic content and strong anti-bot protections.

amazon-office-chair-product-page

We’ll extract key details like:

  • Product title
  • Current/original price
  • Discount
  • Rating & reviews
  • Description & features
  • Availability & ASIN

The AI-Powered Multi-Stage Workflow

To overcome the limitations of traditional scraping—especially on complex eCommerce sites like Amazon—our LLaMA-powered scraper follows a smart, multi-stage workflow:

  1. Browser Automation – Use Selenium to load the page and render dynamic content
  2. HTML Extraction – Identify and extract the container that includes product details
  3. Markdown Conversion – Convert the HTML to Markdown to reduce token count and improve LLM efficiency
  4. LLM Processing – Use a structured prompt with LLaMA to extract clean, structured JSON
  5. Output Handling – Store the extracted JSON for downstream use or analysis

Let’s now walk through the process step by step. Note that these examples use Python for its simplicity and popularity, but you can achieve similar results using JavaScript or another language of your choice.

Step 1 – Install Required Libraries

First, install the necessary Python libraries:

pip install requests selenium webdriver-manager markdownify
  • requests – A popular Python HTTP client, used here to send API calls to the local LLM service
  • selenium – Automates the browser, ideal for JavaScript-heavy websites
  • webdriver-manager – Automatically downloads and manages the correct ChromeDriver version
  • markdownify – Converts HTML into Markdown

Step 2 – Initialize the Headless Browser

Set up a headless browser using Selenium:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options
)
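
Depending on the target site, you may want to harden this basic setup. The flags below are optional tweaks rather than requirements, and the user-agent string is just an illustrative example; pass the resulting options into webdriver.Chrome exactly as shown above.

from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # newer headless mode in recent Chrome releases
options.add_argument("--window-size=1920,1080")  # render at a desktop-sized viewport
options.add_argument(
    # Illustrative desktop user agent; swap in one matching your Chrome version
    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)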

Step 3 – Extract the Product HTML

Amazon product details are rendered dynamically and wrapped inside a <div id="ppd"> container. We’ll wait for this section to load, then extract its HTML:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 15)
product_container = wait.until(
    EC.presence_of_element_located((By.ID, "ppd"))
)

# Extract the full HTML of the product container
page_html = product_container.get_attribute("outerHTML")

This approach:

  • Waits for JavaScript-rendered content (like prices and ratings)
  • Targets only the relevant product section—ignoring headers, footers, and sidebars

Check out our complete guide on how to scrape Amazon product data in Python.
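
Amazon occasionally serves layout variants or interstitial pages where the #ppd container never appears. If you want the scraper to degrade gracefully instead of crashing, one option (a sketch, not part of the original flow) is to fall back to the full rendered page:

from selenium.common.exceptions import TimeoutException

try:
    product_container = wait.until(
        EC.presence_of_element_located((By.ID, "ppd"))
    )
    page_html = product_container.get_attribute("outerHTML")
except TimeoutException:
    # Noisier input for the LLM, but better than failing outright
    page_html = driver.page_source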

Step 4 – Convert HTML to Markdown

Amazon pages contain deeply nested HTML that is inefficient for LLMs to process. A key optimization is converting this HTML to clean Markdown, which dramatically reduces token count and improves comprehension.

When you run the complete script, two files will be generated: amazon_page.html and amazon_page.md. Try pasting both into the Token Calculator Tool to compare their token counts.

As shown below, the HTML contains around 270,000 tokens:

token-calculator-html-tokens

The Markdown version? Just ~11,000 tokens:

token-calculator-markdown-tokens

This 96% reduction leads to:

  • Cost efficiency – Fewer tokens mean lower API or compute costs
  • Faster processing – Less input data = quicker LLM responses
  • Improved accuracy – Cleaner, flatter text helps the model extract structured data more precisely

Read more on why AI agents prefer Markdown over HTML.

Here’s how to do the conversion in Python:

from markdownify import markdownify as md

clean_text = md(page_html, heading_style="ATX")
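
If you’d rather estimate token counts in code instead of pasting files into an online calculator, a rough approximation is possible with the tiktoken library (pip install tiktoken). Its cl100k_base encoding is an OpenAI tokenizer, not LLaMA’s, so treat the numbers as ballpark figures rather than exact counts.

import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")  # approximation only; not LLaMA's tokenizer

html_tokens = len(encoder.encode(page_html))
markdown_tokens = len(encoder.encode(clean_text))
print(f"HTML: ~{html_tokens} tokens, Markdown: ~{markdown_tokens} tokens")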

Step 5 – Create the Data Extraction Prompt

A well-structured prompt is critical for getting consistent, clean JSON output from the LLM. Below is a prompt that instructs the model to return only valid JSON in a predefined format:

PROMPT = (
    "You are an expert Amazon product data extractor. Your task is to extract product data from the provided content. "
    "Return ONLY valid JSON with EXACTLY the following fields and formats:\n\n"
    "{\n"
    '  "title": "string – the product title",\n'
    '  "price": number – the current price (numerical value only),\n'
    '  "original_price": number or null – the original price if available,\n'
    '  "discount": number or null – the discount percentage if available,\n'
    '  "rating": number or null – the average rating (0–5 scale),\n'
    '  "review_count": number or null – total number of reviews,\n'
    '  "description": "string – main product description",\n'
    '  "features": ["string"] – list of bullet point features,\n'
    '  "availability": "string – stock status",\n'
    '  "asin": "string – 10-character Amazon ID"\n'
    "}\n\n"
    "Return ONLY the JSON without any additional text."
)

Step 6 – Call the LLM API

With Ollama running locally, you can send the Markdown text to your LLaMA instance via its HTTP API:

import requests
import json

response = requests.post(
    "<http://localhost:11434/api/generate>",
    json={
        "model": "llama3.1:8b",
        "prompt": f"{PROMPT}\\n\\n{clean_text}",
        "stream": False,
        "format": "json",
        "options": {
            "temperature": 0.1,
            "num_ctx": 12000,
        },
    },
    timeout=250,
)

raw_output = response.json()["response"].strip()
product_data = json.loads(raw_output)

What each option does:

  • temperature – Set to 0.1 for deterministic output (ideal for JSON formatting)
  • num_ctx – Defines the maximum context length; a 12,000-token window is sufficient for most Amazon product pages
  • stream – When False, the API returns the full response after processing
  • format – Specifies the output format (JSON)
  • model – Indicates which LLaMA version to use

Since the converted Markdown typically contains around 11,000 tokens, it’s important to set the context window (num_ctx) accordingly. While increasing this value allows you to handle longer inputs, it also increases RAM usage and slows down processing. Only increase the context limit if your product pages are especially long or if you have the compute resources to support it.
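
If your input length varies a lot from page to page, you can size the context window from the Markdown itself. The snippet below is a heuristic sketch, assuming roughly four characters per token for English text plus headroom for the prompt and the model’s JSON response; it is not an exact token count. You can then pass the resulting num_ctx into the options dictionary of the API call above.

# Heuristic: ~4 characters per token, plus headroom for the prompt and response
estimated_tokens = len(clean_text) // 4
num_ctx = min(max(8192, estimated_tokens + 2000), 32768)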

Step 7 – Save the Results

Finally, save the structured product data to a JSON file:

with open("product_data.json", "w", encoding="utf-8") as f:
    json.dump(product_data, f, indent=2, ensure_ascii=False)

Step 8 – Execute the Script

To run your scraper, provide an Amazon product URL and call your scraping function:

if __name__ == "__main__":
    url = "<https://www.amazon.com/Black-Office-Chair-Computer-Adjustable/dp/B00FS3VJAO>"

    # Call your function to scrape and extract product data
    scrape_amazon_product(url)

Step 9 – Full Code Example

Below is the complete Python script that combines all the steps into a cohesive, end-to-end workflow:

import json
import logging
import time
from typing import Final, Optional, Dict, Any

import requests
from markdownify import markdownify as html_to_md
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

# Configuration constants
LLM_API_CONFIG: Final[Dict[str, Any]] = {
    "endpoint": "<http://localhost:11434/api/generate>",
    "model": "llama3.1:8b",
    "temperature": 0.1,
    "context_window": 12000,
    "stream": False,
    "timeout_seconds": 220,
}

DEFAULT_PRODUCT_DATA: Final[Dict[str, Any]] = {
    "title": "",
    "price": 0.0,
    "original_price": None,
    "discount": None,
    "rating": None,
    "review_count": None,
    "description": "",
    "features": [],
    "availability": "",
    "asin": "",
}

PRODUCT_DATA_EXTRACTION_PROMPT: Final[str] = (
    "You are an expert Amazon product data extractor. Your task is to extract product data from the provided content. "
    "Return ONLY valid JSON with EXACTLY the following fields and formats:\n\n"
    "{\n"
    '  "title": "string - the product title",\n'
    '  "price": number - the current price (numerical value only),\n'
    '  "original_price": number or null - the original price if available,\n'
    '  "discount": number or null - the discount percentage if available,\n'
    '  "rating": number or null - the average rating (0-5 scale),\n'
    '  "review_count": number or null - total number of reviews,\n'
    '  "description": "string - main product description",\n'
    '  "features": ["string"] - list of bullet point features,\n'
    '  "availability": "string - stock status",\n'
    '  "asin": "string - 10-character Amazon ID"\n'
    "}\n\n"
    "Return ONLY the JSON without any additional text."
)

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[logging.StreamHandler()],
)

def initialize_web_driver(headless: bool = True) -> webdriver.Chrome:
    """Initialize and return a configured Chrome WebDriver instance."""
    options = Options()
    if headless:
        options.add_argument("--headless=new")

    service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service, options=options)

def fetch_product_container_html(product_url: str) -> Optional[str]:
    """Retrieve the HTML content of the Amazon product details container."""
    driver = initialize_web_driver()
    try:
        logging.info(f"Accessing product page: {product_url}")
        driver.set_page_load_timeout(15)
        driver.get(product_url)

        # Wait for the product container to appear
        container = WebDriverWait(driver, 5).until(
            EC.presence_of_element_located((By.ID, "ppd"))
        )
        return container.get_attribute("outerHTML")
    except Exception as e:
        logging.error(f"Error retrieving product details: {str(e)}")
        return None
    finally:
        driver.quit()

def extract_product_data_via_llm(markdown_content: str) -> Optional[Dict[str, Any]]:
    """Extract structured product data from markdown text using LLM API."""
    try:
        logging.info("Extracting product data via LLM API...")
        response = requests.post(
            LLM_API_CONFIG["endpoint"],
            json={
                "model": LLM_API_CONFIG["model"],
                "prompt": f"{PRODUCT_DATA_EXTRACTION_PROMPT}\\n\\n{markdown_content}",
                "format": "json",
                "stream": LLM_API_CONFIG["stream"],
                "options": {
                    "temperature": LLM_API_CONFIG["temperature"],
                    "num_ctx": LLM_API_CONFIG["context_window"],
                },
            },
            timeout=LLM_API_CONFIG["timeout_seconds"],
        )
        response.raise_for_status()

        raw_output = response.json()["response"].strip()
        # Clean JSON output if it's wrapped in markdown code blocks
        if raw_output.startswith(("```json", "```")):
            raw_output = raw_output.split("```")[1].strip()
            if raw_output.startswith("json"):
                raw_output = raw_output[4:].strip()

        return json.loads(raw_output)

    except requests.exceptions.RequestException as e:
        logging.error(f"LLM API request failed: {str(e)}")
        return None
    except json.JSONDecodeError as e:
        logging.error(f"Failed to parse LLM response: {str(e)}")
        return None
    except Exception as e:
        logging.error(f"Unexpected error during data extraction: {str(e)}")
        return None

def scrape_amazon_product(
    product_url: str, output_file: str = "product_data.json"
) -> None:
    """Scrape an Amazon product page and save extracted data along with HTML and Markdown to files."""
    start_time = time.time()
    logging.info(f"Starting scrape for: {product_url}")

    # Step 1: Fetch product page HTML
    product_html = fetch_product_container_html(product_url)
    if not product_html:
        logging.error("Failed to retrieve product page content")
        return

    # Optional: save HTML for debugging
    with open("amazon_product.html", "w", encoding="utf-8") as f:
        f.write(product_html)

    # Step 2: Convert HTML to Markdown
    product_markdown = html_to_md(product_html)

    # Optional: save Markdown for debugging
    with open("amazon_product.md", "w", encoding="utf-8") as f:
        f.write(product_markdown)

    # Step 3: Extract structured data via LLM
    product_data = (
        extract_product_data_via_llm(product_markdown) or DEFAULT_PRODUCT_DATA.copy()
    )

    # Step 4: Save JSON results
    try:
        with open(output_file, "w", encoding="utf-8") as json_file:
            json.dump(product_data, json_file, indent=2, ensure_ascii=False)
        logging.info(f"Successfully saved product data to {output_file}")
    except IOError as e:
        logging.error(f"Failed to save JSON results: {str(e)}")

    elapsed_time = time.time() - start_time
    logging.info(f"Completed in {elapsed_time:.2f} seconds")

if __name__ == "__main__":
    # Example usage
    test_url = (
        "<https://www.amazon.com/Black-Office-Chair-Computer-Adjustable/dp/B00FS3VJAO>"
    )
    scrape_amazon_product(test_url)

When the script runs successfully, it saves the extracted product data to a file named product_data.json. The output will look something like this:

{
    "title": "Home Office Chair Ergonomic Desk Chair Mesh Computer Chair with Lumbar Support Armrest Executive Rolling Swivel Adjustable Mid Back Task Chair for Women Adults, Black",
    "price": 36.98,
    "original_price": 41.46,
    "discount": 11,
    "rating": 4.3,
    "review_count": 58112,
    "description": "Office chair comes with all hardware and tools, and is easy to assemble in about 10–15 minutes. The high-density sponge cushion offers flexibility and comfort, while the mid-back design and rectangular lumbar support enhance ergonomics. All components are BIFMA certified, supporting up to 250 lbs. The chair includes armrests and an adjustable seat height (17.1\"–20.3\"). Its ergonomic design ensures a perfect fit for long-term use.",
    "features": [
        "100% mesh material",
        "Quick and easy assembly",
        "High-density comfort seat",
        "BIFMA certified quality",
        "Includes armrests",
        "Ergonomic patented design"
    ],
    "availability": "In Stock",
    "asin": "B00FS3VJAO"
}

Voila! Messy HTML turns into clean JSON — that’s the magic of LLMs in web scraping.
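
Because LLM output is probabilistic, it can help to run a quick sanity check on the parsed JSON before using it downstream. Here’s a minimal sketch; the required keys simply mirror the schema defined in the prompt.

import logging

REQUIRED_KEYS = {
    "title", "price", "original_price", "discount", "rating",
    "review_count", "description", "features", "availability", "asin",
}

def validate_product_data(data: dict) -> list:
    """Return a list of problems found in the extracted product data."""
    problems = []
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if not isinstance(data.get("title"), str) or not data.get("title"):
        problems.append("title should be a non-empty string")
    if not isinstance(data.get("price"), (int, float)):
        problems.append("price should be numeric")
    if not isinstance(data.get("features"), list):
        problems.append("features should be a list")
    if not (isinstance(data.get("asin"), str) and len(data["asin"]) == 10):
        problems.append("asin should be a 10-character string")
    return problems

issues = validate_product_data(product_data)
if issues:
    logging.warning("LLM output failed validation: %s", issues)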

Overcoming Anti-Bot Measures

When running the above web scraping bot, you’ll likely encounter Amazon’s anti-bot measures, such as CAPTCHA challenges:

amazon-captcha-anti-bot-challenge

This highlights a key limitation: While our LLaMA-based workflow excels at parsing HTML, accessing that content is still a challenge on sites with advanced anti-bot protections.

To overcome this, you’ll need to bypass Amazon CAPTCHAs and handle other web scraping challenges.

That’s where Bright Data’s Scraping Browser comes in—a purpose-built solution for handling the complexities of modern web environments, including reliably unlocking even the most protected websites where traditional tools fail.

Learn more: Scraping Browser vs. Headless Browsers

Why Use Bright Data Scraping Browser?

The Bright Data Scraping Browser is a headless, cloud-based browser with built-in proxy infrastructure and advanced unblocking capabilities, purpose-built for scaling modern web scraping projects. It’s part of Bright Data’s Unlocker scraping suite.

Here’s why developers and data teams choose it:

  • Reliable TLS fingerprints and stealth evasion techniques
  • Built-in IP rotation powered by a 150M+ IP proxy network
  • Automatic CAPTCHA solving
  • Reduced infrastructure overhead – no costly cloud setups or ongoing maintenance
  • Native support for Playwright, Puppeteer, and Selenium
  • Unlimited scalability for high-volume data extraction

The best part? You can integrate it into your existing workflow with just a few lines of code.

Read why more companies are shifting to cloud-based web scraping.

Setting Up Scraping Browser

To get started with Scraping Browser:

  1. Create a Bright Data account (new users receive a $5 credit after adding a payment method). In your dashboard, go to Proxies & Scraping and click Get started.

     brightdata-scraping-solutions-dashboard

  2. Create a new zone (e.g., test_browser) and enable features like Premium domains and CAPTCHA solver.

     brightdata-create-scraping-browser-zone

  3. Copy the Selenium URL from your dashboard.

     brightdata-selenium-connection-credentials

Modifying Your Code for Scraping Browser

Update your initialize_web_driver function to connect via the Scraping Browser:

from selenium.webdriver import Remote
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.chromium.remote_connection import ChromiumRemoteConnection

SBR_WEBDRIVER = "https://username:password@host:port"

def initialize_web_driver():
    options = ChromeOptions()
    sbr_connection = ChromiumRemoteConnection(SBR_WEBDRIVER, "goog", "chrome")
    driver = Remote(sbr_connection, options=options)
    return driver

That’s it: your scraper now routes through Bright Data’s infrastructure and handles Amazon and other anti-bot systems with ease.

Explore more advanced features in the Scraping Browser documentation.

Next Steps and Alternative Solutions

To extend the capabilities of your LLaMA-powered scraper or explore other implementations, consider the following improvements and alternatives:

  • Make the script reusable: Accept the target URL and prompt as command-line arguments for flexible use (see the sketch after this list)
  • Secure your credentials: Store your Scraping Browser credentials in a .env file and load them securely using python-dotenv (also shown below)
  • Add multi-page support: Implement logic for crawling through multiple pages and handling pagination
  • Scrape more websites: Use Scraping Browser’s anti-detection features to scrape other eCommerce platforms
  • Extract data from Google services: Build dedicated scrapers for Google Flights, Google Search, and Google Trends, or use Bright Data’s SERP API for ready-to-use search data
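
For the first two points, here’s a minimal sketch. It assumes you install python-dotenv (pip install python-dotenv), keep a local .env file, and use an illustrative variable name SBR_WEBDRIVER for the Scraping Browser connection string; adapt the names to your setup.

import argparse
import os

from dotenv import load_dotenv  # pip install python-dotenv

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="LLM-powered Amazon product scraper")
    parser.add_argument("url", help="Amazon product URL to scrape")
    parser.add_argument("--output", default="product_data.json", help="Output JSON file path")
    return parser.parse_args()

if __name__ == "__main__":
    load_dotenv()  # reads variables from a local .env file into the environment
    sbr_webdriver_url = os.getenv("SBR_WEBDRIVER")  # illustrative variable name; use it in your driver setup
    args = parse_args()
    scrape_amazon_product(args.url, output_file=args.output)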

If you prefer managed solutions or want to explore other LLM-driven methods, the following options may be suitable:

  1. Scraping with Gemini
  2. Scraping with Perplexity
  3. Build an AI Scraper With Crawl4AI and DeepSeek

Conclusion

This guide provides a solid foundation for building resilient web scrapers using LLaMA 3. By combining the reasoning capabilities of large language models with advanced scraping tools, you can extract structured data from complex websites with minimal effort.

Avoiding detection and blocking is one of the biggest challenges in web scraping. The Bright Data Scraping Browser addresses this by automatically handling dynamic rendering, fingerprinting, and anti-bot protection, and it’s part of a broader suite of tools designed to support scalable data extraction.

Sign up today to test Bright Data’s full suite of scraping and proxy tools for free! No credit card required.