andrewji8

5 Open Source Web Crawler Projects Based on LLM

01. Crawl4AI

Crawl4AI is an asynchronous web crawler that makes web data extraction simple and efficient, which makes it a good fit for AI and LLM applications.

Key Features:

  • 100% Open Source and Free: Fully open source code.
  • Lightning Fast Performance: Crawls quickly and reliably; the project claims it outperforms many paid services.
  • Built for AI and LLMs: Outputs data in LLM-friendly formats (JSON, cleaned HTML, or Markdown).
  • Multi-Browser Support: Seamlessly works with Chromium, Firefox, and WebKit.
  • Simultaneous URL Crawling: Processes multiple websites at once for efficient data extraction.
  • Full Media Support: Easily extracts images, audio, video, and all HTML media tags.
  • Link Extraction: Retrieves all internal and external links for deeper data mining.
  • Metadata Retrieval: Captures page titles, descriptions, and other metadata.
  • Customizable: Add features for authentication, headers, or custom page modifications.
  • User Agent Customization: Supports custom user agent settings.
  • Screenshot Support: Takes page snapshots, with robust error handling.
  • Custom JavaScript: Executes custom scripts on the page before extracting results.
  • Structured Data Output: Generates well-structured JSON data based on rules.
  • Intelligent Extraction: Uses LLM, clustering, regular expressions, or CSS selectors for accurate data scraping.
  • Proxy Support: Accesses protected content through secure proxies.
  • Session Management: Easily handles multi-page navigation.
  • Image Optimization: Supports lazy loading and responsive images.
  • Dynamic Content Handling: Manages lazy loading of interactive pages.
  • LLM-Friendly Headers: Passes custom headers for LLM-specific interactions.
  • Precise Extraction: Optimizes results using keywords or directives.
  • Flexible Settings: Adjust timeouts and delays for smoother crawling.
  • Iframe Support: Extracts content from iframes for deeper data extraction.
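
The link extraction feature above separates internal links from external ones. A minimal sketch of that idea using only the Python standard library follows. It is not Crawl4AI's implementation or API; the names `LinkCollector` and `split_links` are hypothetical, and "internal" is assumed here to mean "same host as the page URL".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def split_links(html, base_url):
    """Return (internal, external) lists of absolute http(s) links."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    internal, external = [], []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        # Assumption: a link is internal when its host matches the page host.
        (internal if parts.netloc == base_host else external).append(absolute)
    return internal, external
```

Calling `split_links(page_html, "https://example.com/index.html")` returns two lists that a crawler could use to decide which pages to follow next.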

02. ScrapeGraphAI

ScrapeGraphAI is a Python web scraping library that uses LLMs and graph logic to build scraping pipelines for websites or local documents (XML, HTML, JSON, Markdown, etc.).
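
To make the "graph logic" idea concrete, here is a minimal sketch of a scraping pipeline expressed as a graph of nodes that pass a shared state along. It does not use ScrapeGraphAI's API: every name (`run_graph`, `fetch_node`, `parse_node`, `extract_node`) is hypothetical, the fetch step returns inline HTML instead of downloading anything, and the LLM is assumed to be any callable supplied by the caller.

```python
import re


def fetch_node(state):
    # A real node would download state["url"]; this sketch uses inline HTML.
    return {**state, "html": "<h1>Example title</h1><p>Body text</p>"}


def parse_node(state):
    # Strip tags and collapse whitespace to get plain text.
    text = re.sub(r"<[^>]+>", " ", state["html"])
    return {**state, "text": " ".join(text.split())}


def extract_node(state):
    # Stand-in for the LLM call: state["llm"] is any callable (prompt, text) -> str.
    return {**state, "answer": state["llm"](state["prompt"], state["text"])}


NODES = {"fetch": fetch_node, "parse": parse_node, "extract": extract_node}
EDGES = {"fetch": "parse", "parse": "extract"}  # node -> next node


def run_graph(nodes, edges, entry, state):
    """Run nodes from `entry`, following `edges` until a node has no successor."""
    current = entry
    while current is not None:
        state = nodes[current](state)
        current = edges.get(current)
    return state
```

Because each step is a node and each hand-off is an edge, swapping the parser or adding a step means editing `NODES` and `EDGES` rather than rewriting the pipeline.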

03. LLM Scraper

LLM Scraper is a TypeScript library that uses LLMs to extract structured data from web pages, with code generation capabilities.

Key Features:

  • Supports Local or MaaS Providers: Works with local models (Ollama, GGUF), OpenAI, and Vercel AI SDK providers.
  • Fully Type Safe: Implemented in TypeScript, with schemas defined using Zod.
  • Based on Playwright Framework: Also supports streaming objects.
  • Code Generation: Supports code generation capabilities.
  • Four Data Formatting Modes:
    • HTML: For loading raw HTML.
    • Markdown: For loading Markdown.
    • Text: For loading extracted text (using Readability.js).
    • Image: For loading screenshots (multi-modal only).
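
LLM Scraper itself is TypeScript and relies on Zod for its schemas. As a rough Python analogy of the same idea (checking the model's JSON output against a declared schema before using it), here is a minimal sketch. The names `STORY_SCHEMA` and `validate` are hypothetical and are not part of LLM Scraper.

```python
import json

# Hypothetical schema: field name -> required Python type.
STORY_SCHEMA = {"title": str, "points": int}


def validate(raw_json, schema):
    """Parse JSON text and check it has exactly the schema's fields and types."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise TypeError("expected a JSON object")
    if set(data) != set(schema):
        raise ValueError(f"expected fields {sorted(schema)}, got {sorted(data)}")
    for name, expected_type in schema.items():
        if not isinstance(data[name], expected_type):
            raise TypeError(f"field {name!r} should be {expected_type.__name__}")
    return data
```

Rejecting malformed output at this boundary means the rest of the program can trust the shape of the extracted data, which is the benefit the Zod schemas provide on the TypeScript side.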

04. Crawlee Python

Crawlee is a web scraping and browser automation library for Python. It extracts web page data for use with AI, LLMs, RAG, or GPTs, and can download HTML, PDF, JPG, PNG, and other files from websites. It works with BeautifulSoup, Playwright, and raw HTTP, supports both headless and headful modes, and offers proxy rotation.
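
Proxy rotation, mentioned above, can be illustrated with a simple round-robin sketch. This is not Crawlee's API: the `ProxyRotator` class and the proxy URLs in the usage note are hypothetical, and real rotation strategies may also track failures or sessions.

```python
from itertools import cycle


class ProxyRotator:
    """Hands out proxies from a fixed list in round-robin order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(list(proxies))

    def next_proxy(self):
        # Each call returns the next proxy, wrapping around at the end.
        return next(self._pool)

    def assign(self, urls):
        """Pair each URL with the proxy that should be used to fetch it."""
        return [(url, self.next_proxy()) for url in urls]
```

For example, `ProxyRotator(["http://proxy-a:8000", "http://proxy-b:8000"]).assign(urls)` spreads consecutive requests across the two placeholder proxies so that no single address carries all the traffic.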


05. CyberScraper 2077

CyberScraper 2077 is a web scraping tool powered by OpenAI, Gemini, or local LLMs. It is designed for precise and efficient data extraction and suits data analysts, tech enthusiasts, and anyone who wants simpler access to online information.

Key Features:

  • AI-Based Extraction: Utilizes AI models to intelligently understand and parse web content.
  • Streamlit Interface: User-friendly GUI built with Streamlit.
  • Multi-Format Support: Exports data in JSON, CSV, HTML, SQL, or Excel formats.
  • Tor Network Support: Securely scrapes .onion sites with automatic routing and security features.
  • Stealth Mode: Uses stealth mode parameters to help avoid detection as a bot.
  • LLM Support: Works with a range of LLMs.
  • Asynchronous Operations: Runs asynchronously for fast execution.
  • Intelligent Parsing: Scrapes content as if directly extracting from primary memory.
  • Caching: Implements content and query-based caching using LRU caching and custom dictionaries to reduce redundant API calls.
  • Supports Uploading to Google Sheets: Easily uploads extracted CSV data to Google Sheets.
  • Captcha Bypass: Can bypass captchas by appending -captcha to the end of the URL (currently only works when run locally, not in Docker).
  • Current Browser: Uses local browser environment to help bypass 99% of bot detection.
  • Proxy Mode (Coming Soon): Built-in proxy support to help bypass network restrictions.
  • Multi-Page Navigation: Navigates across web pages and scrapes data from each one.