A Collection of Crawlers and Crawler Frameworks Implemented in Various Languages (Crawler Programming Languages) - FinClip Official Site

User submission 4466 2022-10-11

A collection of awesome web crawlers, spiders, and resources in different languages.

Contents

Python, Java, C#, JavaScript, PHP, C++, C, Ruby, R, Erlang, Perl, Go, Scala

Python

Scrapy - A fast, high-level screen scraping and web crawling framework.
    django-dynamic-scraper - Creating Scrapy scrapers via the Django admin interface.
    Scrapy-Redis - Redis-based components for Scrapy.
    scrapy-cluster - Uses Redis and Kafka to create a distributed on-demand scraping cluster.
    distribute_crawler - Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider.
pyspider - A powerful spider system.
CoCrawler - A versatile web crawler built using modern tools and concurrency.
cola - A distributed crawling framework.
Demiurge - PyQuery-based scraping micro-framework.
Scrapely - A pure-Python HTML screen-scraping library.
feedparser - Universal feed parser.
you-get - Dumb downloader that scrapes the web.
Grab - Site scraping framework.
MechanicalSoup - A Python library for automating interaction with websites.
portia - Visual scraping for Scrapy.
crawley - Pythonic crawling / scraping framework based on non-blocking I/O operations.
RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.
MSpider - A simple, easy spider using gevent and JS rendering.
brownant - A lightweight web data extracting framework.
PSpider - A simple spider framework in Python 3.
Gain - Web crawling framework based on asyncio for everyone.
sukhoi - Minimalist and powerful web crawler.
spidy - The simple, easy-to-use command line web crawler.
newspaper - News, full-text, and article metadata extraction in Python 3.
aspider - An async web scraping micro-framework based on asyncio.

Java

ACHE Crawler - An easy-to-use web crawler for domain-specific search.
Apache Nutch - Highly extensible, highly scalable web crawler for production environments.
    anthelion - A plugin for Apache Nutch to crawl semantic annotations within HTML pages.
Crawler4j - Simple and lightweight web crawler.
JSoup - Scrapes, parses, manipulates, and cleans HTML.
websphinx - Website-specific processors for HTML information extraction.
Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
Gecco - An easy-to-use lightweight web crawler.
WebCollector - Simple interfaces for crawling the web; you can set up a multi-threaded web crawler in less than 5 minutes.
Webmagic - A scalable crawler framework.
Spiderman - A scalable, extensible, multi-threaded web crawler.
    Spiderman2 - A distributed web crawler framework that supports JS rendering.
Heritrix3 - Extensible, web-scale, archival-quality web crawler project.
SeimiCrawler - An agile, distributed crawler framework.
StormCrawler - An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm.
Spark-Crawler - Evolving Apache Nutch to run on Spark.
webBee - A DFS web spider.

C#

ccrawler - Built with C# 3.5. It contains a simple web content categorizer extension that can separate web pages according to their content.
SimpleCrawler - Simple spider based on multithreading and regular expressions.
DotnetSpider - A cross-platform, lightweight spider developed in C#.
Abot - C# web crawler built for speed and flexibility.
Hawk - Advanced crawler and ETL tool written in C#/WPF.
SkyScraper - An asynchronous web scraper / web crawler using async / await and Reactive Extensions.

JavaScript

scraperjs - A complete and versatile web scraper.
scrape-it - A Node.js scraper for humans.
simplecrawler - Event-driven web crawler.
node-crawler - Node-crawler has a clean, simple API.
js-crawler - Web crawler for Node.js; both HTTP and HTTPS are supported.
webster - A reliable web crawling framework which can scrape AJAX and JS-rendered content in a web page.
x-ray - Web scraper with pagination and crawler support.
node-osmosis - HTML/XML parser and web scraper for Node.js.
web-scraper-chrome-extension - Web data extraction tool implemented as a Chrome extension.
supercrawler - Define custom handlers to parse content. Obeys robots.txt, rate limits, and concurrency limits.
headless-chrome-crawler - Headless Chrome crawls with jQuery support.
Squidwarc - High-fidelity, user-scriptable, archival crawler that uses Chrome or Chromium with or without a head.

PHP

Goutte - A screen scraping and web crawling library for PHP.
    laravel-goutte - Laravel 5 Facade for Goutte.
dom-crawler - The DomCrawler component eases DOM navigation for HTML and XML documents.
pspider - Parallel web crawler written in PHP.
php-spider - A configurable and extensible PHP web spider.
spatie/crawler - An easy-to-use, powerful crawler implemented in PHP. Can execute JavaScript.
crawlzone/crawlzone - Crawlzone is a fast asynchronous internet crawling framework for PHP.

C++

open-source-search-engine - A distributed open source search engine and spider/crawler written in C/C++.

C

httrack - Copy websites to your computer.

Ruby

Nokogiri - A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support.
upton - A batteries-included framework for easy web scraping. Just add CSS (or do more).
wombat - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
RubyRetriever - RubyRetriever is a web crawler, scraper, and file harvester.
Spidr - Spider a site, multiple domains, certain links, or infinitely.
Cobweb - Web crawler with very flexible crawling options, standalone or using Sidekiq.
mechanize - Automated web interaction and crawling.

R

rvest - Simple web scraping for R.

Erlang

ebot - A scalable, distributed, and highly configurable web crawler.

Perl

web-scraper - Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions.

Go

pholcus - A distributed, high-concurrency, and powerful web crawler.
gocrawl - Polite, slim, and concurrent web crawler.
fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
go_spider - An awesome Go concurrent crawler (spider) framework.
dht - BitTorrent DHT protocol and DHT spider.
ants-go - An open source, distributed, RESTful crawler engine in Go.
scrape - A simple, higher-level interface for Go web scraping.
creeper - The Next Generation Crawler Framework (Go).
colly - Fast and elegant scraping framework for Gophers.
ferret - Declarative web scraping.
Dataflow kit - Extract structured data from web pages. Website scraping.

Scala

crawler - Scala DSL for web crawling.
scrala - Scala crawler (spider) framework, inspired by Scrapy.
ferrit - Ferrit is a web crawler service written in Scala using Akka, Spray, and Cassandra.
