DotNetSpider, a .NET Standard web crawling framework similar to WebMagic and Scrapy

Reader submission 719 2022-10-31

DotnetSpider

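Disclaimer: Like Scrapy, the well-known Python framework, this framework exists only to help developers simplify their workflow and improve productivity. Do not use it for anything that violates national law. The authors of this framework bear no responsibility for anything its users do.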

DotnetSpider is a .NET Standard web crawling library: a lightweight, efficient, and fast high-level web crawling and scraping framework.

If you want the latest beta packages, add the MyGet feed:

DESIGN

DEVELOPMENT ENVIRONMENT

Visual Studio 2017 (15.3 or later) or JetBrains Rider

.NET Core 2.2 or later

Docker

MySQL

    docker run --name mysql -d -p 3306:3306 --restart always -e MYSQL_ROOT_PASSWORD=1qazZAQ! mysql:5.7

Redis (optional)

    docker run --name redis -d -p 6379:6379 --restart always redis

SQL Server

    docker run --name sqlserver -d -p 1433:1433 --restart always -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=1qazZAQ!' mcr.microsoft.com/mssql/server:2017-latest

PostgreSQL (optional)

    docker run --name postgres -d -p 5432:5432 --restart always -e POSTGRES_PASSWORD=1qazZAQ! postgres

MongoDB (optional)

    docker run --name mongo -d -p 27017:27017 --restart always mongo

RabbitMQ

    docker run -d --restart always --name rabbimq -p 4369:4369 -p 5671-5672:5671-5672 -p 25672:25672 -p 15671-15672:15671-15672 \
        -e RABBITMQ_DEFAULT_USER=user -e RABBITMQ_DEFAULT_PASS=password \
        rabbitmq:3-management

Docker remote API for Mac

    docker run -d --restart always --name socat -v /var/run/docker.sock:/var/run/docker.sock -p 2376:2375 bobrik/socat TCP4-LISTEN:2375,fork,reuseaddr UNIX-CONNECT:/var/run/docker.sock

HBase

    docker run -d --restart always --name hbase -p 20550:8080 -p 8085:8085 -p 9090:9090 -p 9095:9095 -p 16010:16010 dajobe/hbase

MORE DOCUMENTS

https://github.com/dotnetcore/DotnetSpider/wiki

SAMPLES

Please see the Project DotnetSpider.Sample in the solution.

BASIC USAGE

Basic usage code

ADDITIONAL USAGE: Configurable Entity Spider

View the complete code

    // Generic type arguments (the parts in <...>) were lost from the original listing
    // and are restored here as assumptions; check them against DotnetSpider.Sample.
    // DotnetSpider namespaces are omitted; see the sample project for the full list.
    using System;
    using System.Collections.Generic;
    using System.ComponentModel.DataAnnotations;
    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.Extensions.Logging;
    using Microsoft.Extensions.Options;

    public class EntitySpider : Spider
    {
        public static async Task RunAsync()
        {
            var builder = Builder.CreateDefaultBuilder<EntitySpider>();
            builder.UseSerilog();
            builder.UseQueueDistinctBfsScheduler<HashSetDuplicateRemover>();
            await builder.Build().RunAsync();
        }

        public EntitySpider(IOptions<SpiderOptions> options, SpiderServices services, ILogger<Spider> logger)
            : base(options, services, logger)
        {
        }

        protected override async Task InitializeAsync(CancellationToken stoppingToken)
        {
            // Parse each page into CnblogsEntry entities, then store them.
            AddDataFlow(new DataParser<CnblogsEntry>());
            AddDataFlow(GetDefaultStorage());
            // The "网站" (website) property is read back below through SelectorType.Environment.
            await AddRequestsAsync(
                new Request("https://news-blogs.com/n/page/1/", new Dictionary<string, object> {{"网站", "博客园"}}),
                new Request("https://news-blogs.com/n/page/2/", new Dictionary<string, object> {{"网站", "博客园"}}));
        }

        protected override (string Id, string Name) GetIdAndName()
        {
            // "博客园" is the Chinese name of the Cnblogs site.
            return (Guid.NewGuid().ToString(), "博客园");
        }

        [Schema("cnblogs", "news")]
        [EntitySelector(Expression = ".//div[@class='news_block']", Type = SelectorType.XPath)]
        // "类别" (category) is extracted once per page and shared by every entity on it.
        [GlobalValueSelector(Expression = ".//a[@class='current']", Name = "类别", Type = SelectorType.XPath)]
        [FollowRequestSelector(XPaths = new[] {"//div[@class='pager']"})]
        public class CnblogsEntry : EntityBase<CnblogsEntry>
        {
            protected override void Configure()
            {
                HasIndex(x => x.Title);
                // The second argument marks the (WebSite, Guid) index as unique.
                HasIndex(x => new {x.WebSite, x.Guid}, true);
            }

            public int Id { get; set; }

            [Required]
            [StringLength(200)]
            [ValueSelector(Expression = "类别", Type = SelectorType.Environment)]
            public string Category { get; set; }

            [Required]
            [StringLength(200)]
            [ValueSelector(Expression = "网站", Type = SelectorType.Environment)]
            public string WebSite { get; set; }

            [StringLength(200)]
            [ValueSelector(Expression = "//title")]
            // Strip the " - 博客园" site suffix from the page title.
            [ReplaceFormatter(NewValue = "", OldValue = " - 博客园")]
            public string Title { get; set; }

            [StringLength(40)]
            [ValueSelector(Expression = "GUID", Type = SelectorType.Environment)]
            public string Guid { get; set; }

            [ValueSelector(Expression = ".//h2[@class='news_entry']/a")]
            public string News { get; set; }

            [ValueSelector(Expression = ".//h2[@class='news_entry']/a/@href")]
            public string Url { get; set; }

            [ValueSelector(Expression = ".//div[@class='entry_summary']")]
            public string PlainText { get; set; }

            [ValueSelector(Expression = "DATETIME", Type = SelectorType.Environment)]
            public DateTime CreationTime { get; set; }
        }
    }
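The entity class above pairs one EntitySelector XPath, which picks out each news block, with ValueSelector XPaths that start with `.//` and so appear to be evaluated relative to that block. The following is a minimal sketch of that relative-selection idea using only the .NET base class library (`System.Xml.XPath`), not DotnetSpider's own selector implementation. The names `RelativeSelectorSketch` and `Extract` are hypothetical, and the sketch assumes the input is well-formed XHTML, which real pages often are not.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical helper, not part of DotnetSpider.
public static class RelativeSelectorSketch
{
    // Returns one (Title, Url) pair per news block found in the document.
    public static List<(string Title, string Url)> Extract(string xhtml)
    {
        var doc = XDocument.Parse(xhtml);
        var results = new List<(string Title, string Url)>();

        // Step 1: the "entity" XPath selects every block that represents one record.
        foreach (var block in doc.XPathSelectElements("//div[@class='news_block']"))
        {
            // Step 2: the "value" XPath starts with ".//", so it only searches
            // inside the current block rather than the whole page.
            var link = block.XPathSelectElement(".//h2[@class='news_entry']/a");
            if (link == null)
            {
                continue; // Block without a headline link: skip it.
            }

            var href = (string)link.Attribute("href") ?? string.Empty;
            results.Add((link.Value.Trim(), href));
        }

        return results;
    }
}
```

Calling `RelativeSelectorSketch.Extract` on a page with several `news_block` divs yields one tuple per block, which mirrors how each block would map to one `CnblogsEntry` row in the listing above.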

Distributed spider

Read this document

Puppeteer downloader

Coming soon

NOTICE

When you use the Redis scheduler, update your Redis config:

    timeout 0
    tcp-keepalive 60

AREAS FOR IMPROVEMENTS

QQ Group: 477731655
Email: zlzforever@163.com
