Gain: a Python crawler framework based on asyncio, uvloop and aiohttp

Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp.

Requirements

Python 3.5+

Installation

pip install gain

pip install uvloop (Linux only)

Usage

Write spider.py:

from gain import Css, Item, Parser, Spider
import aiofiles


class Post(Item):
    # CSS selectors for the fields to extract from each post page.
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        # Append the extracted title to a local file.
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            await f.write(self.results['title'])


class MySpider(Spider):
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    start_url = 'https://blog.scrapinghub.com/'
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
               Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]


MySpider.run()
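The save coroutine above persists only the post title. A minimal variation, assuming self.results also exposes the text matched by the content selector as a string (only the title lookup appears in the example, so this is an assumption), could append both fields per post:

# Hypothetical variation of the Post item above: persist both extracted
# fields. Assumes self.results maps each Css field name to its text.
import aiofiles
from gain import Css, Item


class Post(Item):
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            # One tab-separated line per crawled post.
            await f.write(self.results['title'] + '\t' + self.results['content'] + '\n')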

Or use XPathParser:

from gain import Css, Item, Parser, XPathParser, Spider


class Post(Item):
    title = Css('.breadcrumb_last')

    async def save(self):
        print(self.title)


class MySpider(Spider):
    start_url = 'https://mydramatime.com/europe-and-us-drama/'
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    parsers = [
        XPathParser('//span[@class="category-name"]/a/@href'),
        XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
        XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post)
    ]
    proxy = 'https://localhost:1234'


MySpider.run()

You can add a proxy setting to the spider, as shown in the example above.
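For reference, a minimal sketch that isolates the proxy option, reusing the Parser and Post pattern from the first example (the proxy address here is a hypothetical placeholder, not one from the project docs):

# Minimal sketch: same spider structure as the first example, with only
# the proxy attribute added. The address below is a placeholder.
from gain import Css, Item, Parser, Spider


class Post(Item):
    title = Css('.entry-title')

    async def save(self):
        print(self.results['title'])


class MySpider(Spider):
    start_url = 'https://blog.scrapinghub.com/'
    concurrency = 5
    proxy = 'http://localhost:8080'  # hypothetical proxy address
    parsers = [Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]


MySpider.run()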

Run python spider.py to start the spider.

Example

The examples are in the /example/ directory.

Contribution

Open an issue or submit a pull request.
