Crawl4AI: Asynchronous Web Data Extraction Made Easy
In this tutorial, we demonstrate how to harness Crawl4AI, a modern, Python‑based web crawling toolkit, to extract structured data from web pages directly within Google Colab. Leveraging the power of asyncio for asynchronous I/O, httpx for HTTP requests, and Crawl4AI’s built‑in AsyncHTTPCrawlerStrategy, we bypass the overhead of headless browsers while still parsing complex HTML via JsonCssExtractionStrategy.
Installing Dependencies
We start by installing the necessary dependencies, including Crawl4AI and httpx, using the following code:
!pip install -U crawl4ai httpx
Configuring HTTPCrawlerConfig
Next, we configure the HTTPCrawlerConfig to define our HTTP crawler’s behavior, using a GET request with a custom User-Agent, gzip/deflate encoding only, automatic redirects, and SSL verification.
# Import paths per Crawl4AI's async API; these may vary slightly by version.
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, HTTPCrawlerConfig
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

http_cfg = HTTPCrawlerConfig(
    method="GET",
    headers={
        "User-Agent": "crawl4ai-bot/1.0",
        "Accept-Encoding": "gzip, deflate"
    },
    follow_redirects=True,
    verify_ssl=True
)
crawler_strategy = AsyncHTTPCrawlerStrategy(browser_config=http_cfg)
Defining Extraction Schema
We define a JSON‑CSS extraction schema targeting each quote block (div.quote) and its child elements (span.text, small.author, div.tags a.tag), then initialize a JsonCssExtractionStrategy with that schema.
schema = {
    "name": "Quotes",
    "baseSelector": "div.quote",
    "fields": [
        {"name": "quote", "selector": "span.text", "type": "text"},
        {"name": "author", "selector": "small.author", "type": "text"},
        {"name": "tags", "selector": "div.tags a.tag", "type": "text"}
    ]
}
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=False)
run_cfg = CrawlerRunConfig(extraction_strategy=extraction_strategy)
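The strategy returns each page's matches as a JSON string on the result's extracted_content attribute. As a quick sanity check, here is a minimal sketch of parsing that string, using a hypothetical sample record rather than data fetched by the crawler; the parsed structure is a list of dicts keyed by the field names in the schema above:

```python
import json

# Hypothetical example of what extracted_content might look like for one page,
# shaped by the "Quotes" schema: one dict per matched div.quote block.
sample_extracted = json.dumps([
    {
        "quote": "“Simplicity is the ultimate sophistication.”",
        "author": "Leonardo da Vinci",
        "tags": "wisdom",
    },
])

# This mirrors the json.loads() step in the crawl loop below.
items = json.loads(sample_extracted)
print(items[0]["author"])  # → Leonardo da Vinci
```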
Orchestrating the Crawl
This asynchronous function orchestrates the HTTP‑only crawl: it spins up an AsyncWebCrawler with our AsyncHTTPCrawlerStrategy, iterates through each page URL, and safely awaits crawler.arun(), handling any request or JSON‑parsing errors and collecting the extracted quote records into a single pandas DataFrame for downstream analysis.
import asyncio
import json
import pandas as pd

async def crawl_quotes_http(max_pages=5):
    all_items = []
    async with AsyncWebCrawler(crawler_strategy=crawler_strategy) as crawler:
        for p in range(1, max_pages + 1):
            url = f"https://quotes.toscrape.com/page/{p}/"
            try:
                res = await crawler.arun(url=url, config=run_cfg)
            except Exception as e:
                print(f"\nPage {p} failed outright: {e}")
                continue
            if not res.extracted_content:
                print(f"\nPage {p} returned no content, skipping")
                continue
            try:
                items = json.loads(res.extracted_content)
            except Exception as e:
                print(f"\nPage {p} JSON-parse error: {e}")
                continue
            print(f"\nPage {p}: {len(items)} quotes")
            all_items.extend(items)
    return pd.DataFrame(all_items)
Running the Crawl
Finally, we kick off the crawl_quotes_http coroutine on Colab’s existing asyncio loop, fetching three pages of quotes, and then display the first few rows of the resulting pandas DataFrame to verify that our crawler returned structured data as expected. Because the notebook kernel already runs an event loop, we use top-level await (supported in Colab and IPython) rather than run_until_complete.
df = await crawl_quotes_http(max_pages=3)
df.head()
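Once the DataFrame is in hand, persisting it for later analysis is a one-liner with pandas. A minimal sketch, using hypothetical records in place of the crawler's output (the quotes.csv filename is our choice):

```python
import pandas as pd

# Hypothetical records standing in for the crawler's output.
df = pd.DataFrame([
    {"quote": "A", "author": "X", "tags": "t1"},
    {"quote": "A", "author": "X", "tags": "t1"},  # duplicate row
    {"quote": "B", "author": "Y", "tags": "t2"},
])

# Re-crawled pages can repeat records, so drop exact duplicates before saving.
df = df.drop_duplicates().reset_index(drop=True)
df.to_csv("quotes.csv", index=False)
print(len(df))  # → 2
```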
In conclusion, by combining Google Colab’s zero-config environment with Python’s asynchronous ecosystem and Crawl4AI’s flexible crawling strategies, we have developed a fully automated pipeline for scraping and structuring web data in minutes. Whether you need to spin up a quick dataset of quotes, build a refreshable news‑article archive, or power a RAG workflow, Crawl4AI’s blend of httpx, asyncio, JsonCssExtractionStrategy, and AsyncHTTPCrawlerStrategy delivers both simplicity and scalability. Check out the Colab Notebook for this tutorial.