Crawl4AI: Asynchronous Web Data Extraction Made Easy
In this tutorial, we demonstrate how to harness Crawl4AI, a modern, Python‑based web crawling toolkit, to extract structured data from web pages directly within Google Colab. Leveraging the power of asyncio for asynchronous I/O, httpx for HTTP requests, and Crawl4AI’s built‑in AsyncHTTPCrawlerStrategy, we bypass the overhead of headless browsers while still parsing complex HTML via JsonCssExtractionStrategy.
Installing Dependencies
We start by installing the necessary dependencies, including Crawl4AI and httpx, using the following code:
!pip install -U crawl4ai httpx
Configuring HTTPCrawlerConfig
Next, we configure the HTTPCrawlerConfig to define our HTTP crawler’s behavior, using a GET request with a custom User-Agent, gzip/deflate encoding only, automatic redirects, and SSL verification.
# Import paths per Crawl4AI's async API; these may vary slightly by version.
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, HTTPCrawlerConfig
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

http_cfg = HTTPCrawlerConfig(
    method="GET",
    headers={
        "User-Agent": "crawl4ai-bot/1.0",
        "Accept-Encoding": "gzip, deflate"
    },
    follow_redirects=True,
    verify_ssl=True
)
crawler_strategy = AsyncHTTPCrawlerStrategy(browser_config=http_cfg)
Defining Extraction Schema
We define a JSON‑CSS extraction schema targeting each quote block (div.quote) and its child elements (span.text, small.author, div.tags a.tag), then initialize a JsonCssExtractionStrategy with that schema.
schema = {
    "name": "Quotes",
    "baseSelector": "div.quote",
    "fields": [
        {"name": "quote", "selector": "span.text", "type": "text"},
        {"name": "author", "selector": "small.author", "type": "text"},
        {"name": "tags", "selector": "div.tags a.tag", "type": "text"}
    ]
}
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=False)
run_cfg = CrawlerRunConfig(extraction_strategy=extraction_strategy)
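The strategy returns each page's matches as a JSON string on the result's extracted_content attribute. As a quick sanity check, here is a minimal sketch of parsing that string, using a hypothetical sample record rather than data fetched by the crawler; the parsed structure is a list of dicts keyed by the field names in the schema above:

```python
import json

# Hypothetical example of what extracted_content might look like for one page,
# shaped by the "Quotes" schema: one dict per matched div.quote block.
sample_extracted = json.dumps([
    {
        "quote": "“Simplicity is the ultimate sophistication.”",
        "author": "Leonardo da Vinci",
        "tags": "wisdom",
    },
])

# This mirrors the json.loads() step in the crawl loop below.
items = json.loads(sample_extracted)
print(items[0]["author"])  # → Leonardo da Vinci
```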
Orchestrating the Crawl
This asynchronous function orchestrates the HTTP‑only crawl: it spins up an AsyncWebCrawler with our AsyncHTTPCrawlerStrategy, iterates through each page URL, and safely awaits crawler.arun(), handling any request or JSON‑parsing errors and collecting the extracted quote records into a single pandas DataFrame for downstream analysis.
import asyncio
import json
import pandas as pd

async def crawl_quotes_http(max_pages=5):
    all_items = []
    async with AsyncWebCrawler(crawler_strategy=crawler_strategy) as crawler:
        for p in range(1, max_pages + 1):
            url = f"https://quotes.toscrape.com/page/{p}/"
            try:
                res = await crawler.arun(url=url, config=run_cfg)
            except Exception as e:
                print(f"\nPage {p} failed outright: {e}")
                continue
            if not res.extracted_content:
                print(f"\nPage {p} returned no content, skipping")
                continue
            try:
                items = json.loads(res.extracted_content)
            except Exception as e:
                print(f"\nPage {p} JSON-parse error: {e}")
                continue
            print(f"\nPage {p}: {len(items)} quotes")
            all_items.extend(items)
    return pd.DataFrame(all_items)
Running the Crawl
Finally, we kick off the crawl_quotes_http coroutine on Colab’s existing asyncio loop, fetching three pages of quotes, and then display the first few rows of the resulting pandas DataFrame to verify that our crawler returned structured data as expected. Because the notebook kernel already runs an event loop, we use top-level await (supported in Colab and IPython) rather than run_until_complete.
df = await crawl_quotes_http(max_pages=3)
df.head()
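Once the DataFrame is in hand, persisting it for later analysis is a one-liner with pandas. A minimal sketch, using hypothetical records in place of the crawler's output (the quotes.csv filename is our choice):

```python
import pandas as pd

# Hypothetical records standing in for the crawler's output.
df = pd.DataFrame([
    {"quote": "A", "author": "X", "tags": "t1"},
    {"quote": "A", "author": "X", "tags": "t1"},  # duplicate row
    {"quote": "B", "author": "Y", "tags": "t2"},
])

# Re-crawled pages can repeat records, so drop exact duplicates before saving.
df = df.drop_duplicates().reset_index(drop=True)
df.to_csv("quotes.csv", index=False)
print(len(df))  # → 2
```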
In conclusion, by combining Google Colab’s zero-config environment with Python’s asynchronous ecosystem and Crawl4AI’s flexible crawling strategies, we have developed a fully automated pipeline for scraping and structuring web data in minutes. Whether you need to spin up a quick dataset of quotes, build a refreshable news‑article archive, or power a RAG workflow, Crawl4AI’s blend of httpx, asyncio, JsonCssExtractionStrategy, and AsyncHTTPCrawlerStrategy delivers both simplicity and scalability. Check out the Colab Notebook for this tutorial.