A Collection of Web Crawlers and Crawler Frameworks Implemented in Various Languages (Crawler Programming Languages)

Submitted by a reader, 2022-10-11

A collection of awesome web crawlers, spiders, and related resources in different languages.

Contents

Python
Java
C#
JavaScript
PHP
C++
C
Ruby
R
Erlang
Perl
Go
Scala

Python

Scrapy - A fast high-level screen scraping and web crawling framework.
  django-dynamic-scraper - Creating Scrapy scrapers via the Django admin interface.
  Scrapy-Redis - Redis-based components for Scrapy.
  scrapy-cluster - Uses Redis and Kafka to create a distributed on-demand scraping cluster.
  distribute_crawler - Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider.
pyspider - A powerful spider system.
CoCrawler - A versatile web crawler built using modern tools and concurrency.
cola - A distributed crawling framework.
Demiurge - PyQuery-based scraping micro-framework.
Scrapely - A pure-Python HTML screen-scraping library.
feedparser - Universal feed parser.
you-get - Dumb downloader that scrapes the web.
Grab - Site scraping framework.
MechanicalSoup - A Python library for automating interaction with websites.
portia - Visual scraping for Scrapy.
crawley - Pythonic crawling / scraping framework based on non-blocking I/O operations.
RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.
MSpider - A simple, easy spider using gevent and JS rendering.
brownant - A lightweight web data extracting framework.
PSpider - A simple spider framework in Python 3.
Gain - Web crawling framework based on asyncio for everyone.
sukhoi - Minimalist and powerful web crawler.
spidy - The simple, easy-to-use command line web crawler.
newspaper - News, full-text, and article metadata extraction in Python 3.
aspider - An async web scraping micro-framework based on asyncio.
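
Most of the tools listed here automate the same basic steps: fetch a page, extract its links, and decide which ones to visit next. As a rough illustration of the link-extraction step only, here is a minimal sketch that uses nothing but the Python standard library. The `LinkExtractor` class, the `extract_links` helper, and the sample HTML and URLs are all made up for this example and are not taken from any of the listed projects.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # used to resolve relative links
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links, then drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(html, base_url):
    """Hypothetical helper: return all link targets found in `html`."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Made-up sample page; a real crawler would download this over HTTP.
sample = '<a href="/docs">Docs</a> <a href="https://example.org/a#top">A</a>'
print(extract_links(sample, "https://example.com/index.html"))
```

A real crawler would add fetching, deduplication, politeness rules, and error handling on top of this, which is where the frameworks in this list come in.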

Java

ACHE Crawler - An easy-to-use web crawler for domain-specific search.
Apache Nutch - Highly extensible, highly scalable web crawler for production environments.
  anthelion - A plugin for Apache Nutch to crawl semantic annotations within HTML pages.
Crawler4j - Simple and lightweight web crawler.
JSoup - Scrapes, parses, manipulates and cleans HTML.
websphinx - Website-Specific Processors for HTML information extraction.
Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
Gecco - An easy-to-use lightweight web crawler.
WebCollector - Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.
Webmagic - A scalable crawler framework.
Spiderman - A scalable, extensible, multi-threaded web crawler.
  Spiderman2 - A distributed web crawler framework that supports JS rendering.
Heritrix3 - Extensible, web-scale, archival-quality web crawler project.
SeimiCrawler - An agile, distributed crawler framework.
StormCrawler - An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm.
Spark-Crawler - Evolving Apache Nutch to run on Spark.
webBee - A DFS web spider.

C#

ccrawler - Built with C# 3.5. It contains a simple web content categorizer extension that can separate web pages according to their content.
SimpleCrawler - Simple spider based on multithreading and regular expressions.
DotnetSpider - A cross-platform, lightweight spider developed in C#.
Abot - C# web crawler built for speed and flexibility.
Hawk - Advanced crawler and ETL tool written in C#/WPF.
SkyScraper - An asynchronous web scraper / web crawler using async / await and Reactive Extensions.

JavaScript

scraperjs - A complete and versatile web scraper.
scrape-it - A Node.js scraper for humans.
simplecrawler - Event-driven web crawler.
node-crawler - Node-crawler has a clean, simple API.
js-crawler - Web crawler for Node.js; both HTTP and HTTPS are supported.
webster - A reliable web crawling framework which can scrape AJAX and JS-rendered content in a web page.
x-ray - Web scraper with pagination and crawler support.
node-osmosis - HTML/XML parser and web scraper for Node.js.
web-scraper-chrome-extension - Web data extraction tool implemented as a Chrome extension.
supercrawler - Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.
headless-chrome-crawler - Headless Chrome crawls with jQuery support.
Squidwarc - High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head.

PHP

Goutte - A screen scraping and web crawling library for PHP.
  laravel-goutte - Laravel 5 Facade for Goutte.
dom-crawler - The DomCrawler component eases DOM navigation for HTML and XML documents.
pspider - Parallel web crawler written in PHP.
php-spider - A configurable and extensible PHP web spider.
spatie/crawler - An easy-to-use, powerful crawler implemented in PHP. Can execute JavaScript.
crawlzone/crawlzone - Crawlzone is a fast asynchronous internet crawling framework for PHP.

C++

open-source-search-engine - A distributed open source search engine and spider/crawler written in C/C++.

C

httrack - Copy websites to your computer.

Ruby

Nokogiri - A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support.
upton - A batteries-included framework for easy web scraping. Just add CSS (or do more).
wombat - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
RubyRetriever - RubyRetriever is a web crawler, scraper & file harvester.
Spidr - Spider a site, multiple domains, certain links or infinitely.
Cobweb - Web crawler with very flexible crawling options, standalone or using Sidekiq.
mechanize - Automated web interaction & crawling.

R

rvest - Simple web scraping for R.

Erlang

ebot - A scalable, distributed and highly configurable web crawler.

Perl

web-scraper - Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions.

Go

pholcus - A distributed, high-concurrency and powerful web crawler.
gocrawl - Polite, slim and concurrent web crawler.
fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
go_spider - An awesome Go concurrent crawler (spider) framework.
dht - BitTorrent DHT Protocol && DHT Spider.
ants-go - An open source, distributed, RESTful crawler engine in Golang.
scrape - A simple, higher-level interface for Go web scraping.
creeper - The Next Generation Crawler Framework (Go).
colly - Fast and Elegant Scraping Framework for Gophers.
ferret - Declarative web scraping.
Dataflow kit - Extract structured data from web pages. Website scraping.
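
Several entries in this collection mention following robots.txt policies and crawl delays. As a rough idea of what that involves, the sketch below checks URLs against a robots.txt file with Python's standard `urllib.robotparser` module. The robots.txt content, the `example-bot` user agent, the `make_policy` helper, and the URLs are all invented for illustration; this is not how any of the listed Go projects implement it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string instead of
# being downloaded, so the example runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def make_policy(robots_txt):
    """Hypothetical helper: build a parser from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy = make_policy(ROBOTS_TXT)
agent = "example-bot"  # made-up user agent name

# A polite crawler asks before each request and waits between requests.
print(policy.can_fetch(agent, "https://example.com/index.html"))
print(policy.can_fetch(agent, "https://example.com/private/data"))
print(policy.crawl_delay(agent))
```

In a real crawler the delay value would feed a per-host scheduler, so that requests to the same site are spaced out while other hosts are fetched in between.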

Scala

crawler - Scala DSL for web crawling.
scrala - Scala crawler (spider) framework, inspired by Scrapy.
ferrit - Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.
