Recommended Open Source AI Crawlers#
01. Crawl4AI#
Crawl4AI simplifies asynchronous web crawling and data extraction, making it fast and efficient and well suited to AI and LLM applications.
Key Features:#
- 100% Open Source and Free: The entire codebase is open source and free to use.
- Lightning Fast Performance: Crawls quickly and reliably, outperforming many paid services.
- LLM-Friendly Output: Outputs data in JSON, HTML, or Markdown format.
- Multi-Browser Support: Seamlessly works with Chromium, Firefox, and WebKit.
- Simultaneous URL Crawling: Processes multiple websites at once for efficient data extraction.
- Full Media Support: Easily extracts images, audio, video, and all HTML media tags.
- Link Extraction: Retrieves all internal and external links for deeper data mining.
- Metadata Retrieval: Captures page titles, descriptions, and other metadata.
- Customizable: Add features for authentication, headers, or custom page modifications.
- Anonymity Support: Customize the user agent to crawl anonymously.
- Screenshot Support: Takes page screenshots, backed by robust error handling.
- Custom JavaScript: Executes custom scripts on the page before extraction.
- Structured Data Output: Generates well-structured JSON data based on rules.
- Intelligent Extraction: Uses LLM, clustering, regular expressions, or CSS selectors for accurate data scraping.
- Proxy Support: Accesses protected content through secure proxies.
- Session Management: Easily handles multi-page navigation.
- Image Optimization: Supports lazy loading and responsive images.
- Dynamic Content Handling: Manages lazy-loaded and interactive pages.
- LLM-Friendly Headers: Passes custom headers for LLM-specific interactions.
- Precise Extraction: Optimizes results using keywords or directives.
- Flexible Settings: Adjust timeouts and delays for smoother crawling.
- Iframe Support: Extracts content from iframes for deeper data extraction.
02. ScrapeGraphAI#
ScrapeGraphAI is a Python web-scraping library that uses LLMs and graph-based logic to build scraping pipelines for websites or local documents (XML, HTML, JSON, Markdown, etc.).
03. LLM Scraper#
LLM Scraper is a TypeScript library for LLM-based web scraping, with code-generation capabilities.
Key Features:#
- Supports Local or MaaS Providers: Compatible with local models (Ollama, GGUF) and hosted providers (OpenAI) via the Vercel AI SDK.
- Fully Type Safe: Implemented in TypeScript, with schemas defined using Zod.
- Based on the Playwright Framework: Supports streaming objects as they are extracted.
- Code Generation: Can generate standalone scraping code from a schema.
- Four Data Formatting Modes:
  - HTML: loads raw HTML.
  - Markdown: loads Markdown.
  - Text: loads extracted text (via Readability.js).
  - Image: loads a screenshot (multimodal models only).
04. Crawlee Python#
Crawlee is a Python library for web crawling and browser automation. It extracts web data for AI, LLM, RAG, or GPT pipelines and can download HTML, PDF, JPG, PNG, and other files from websites. It works with BeautifulSoup, Playwright, and raw HTTP, supports both headful and headless modes, and offers proxy rotation.
05. CyberScraper 2077#
CyberScraper 2077 is a web scraping tool built on OpenAI, Gemini, or local large models, designed for precise and efficient data extraction. It suits data analysts, tech enthusiasts, and anyone who needs simpler access to online information.
Key Features:#
- AI-Based Extraction: Utilizes AI models to intelligently understand and parse web content.
- Sleek Streamlit Interface: User-friendly GUI built with Streamlit.
- Multi-Format Support: Exports data in JSON, CSV, HTML, SQL, or Excel formats.
- Tor Network Support: Securely scrapes .onion sites with automatic routing and security features.
- Incognito Mode: Uses incognito browser parameters to help avoid detection as a bot.
- LLM Support: Provides functionality supporting various LLMs.
- Asynchronous Operations: Runs scrapes asynchronously for fast execution.
- Intelligent Parsing: The AI understands page content and parses it accurately rather than blindly scraping.
- Caching: Implements content and query-based caching using LRU caching and custom dictionaries to reduce redundant API calls.
- Supports Uploading to Google Sheets: Easily uploads extracted CSV data to Google Sheets.
- Captcha Bypass: Can bypass captchas by appending `captcha` to the end of the URL (currently works only locally, not in Docker).
- Current Browser: Uses local browser environment to help bypass 99% of bot detection.
- Proxy Mode (Coming Soon): Built-in proxy support to help bypass network restrictions.
- Multi-Page Browsing: Navigates across web pages and scrapes data from each one.