2022-10-25
Gain: a Python crawler framework based on asyncio, uvloop and aiohttp
Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp.
Requirements
Python 3.5+
Installation
pip install gain
pip install uvloop (Linux only)
Usage
Write spider.py:
```python
from gain import Css, Item, Parser, Spider
import aiofiles


class Post(Item):
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            await f.write(self.results['title'])


class MySpider(Spider):
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    start_url = 'https://blog.scrapinghub.com/'
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
               Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]


MySpider.run()
```
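Each `Parser` above takes a URL regex: the first pattern matches pagination pages to follow, the second matches article pages to hand to the `Post` item. The routing idea can be sketched with the standard `re` module (the `route` helper below is illustrative, not part of gain's API):

```python
import re

# The two URL patterns from the spider example above.
PAGE_RE = re.compile(r'https://blog\.scrapinghub\.com/page/\d+/')
POST_RE = re.compile(r'https://blog\.scrapinghub\.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/')


def route(url):
    """Hypothetical router: decide which parser a URL would belong to."""
    if PAGE_RE.fullmatch(url):
        return 'follow'  # pagination page: crawl it for more links
    if POST_RE.fullmatch(url):
        return 'post'    # article page: parsed into a Post item
    return None          # anything else is ignored


print(route('https://blog.scrapinghub.com/page/2/'))                  # follow
print(route('https://blog.scrapinghub.com/2016/04/20/scrapy-tips/'))  # post
```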
Or use XPathParser:
```python
from gain import Css, Item, XPathParser, Spider


class Post(Item):
    title = Css('.breadcrumb_last')

    async def save(self):
        print(self.title)


class MySpider(Spider):
    start_url = 'https://mydramatime.com/europe-and-us-drama/'
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    parsers = [
        XPathParser('//span[@class="category-name"]/a/@href'),
        XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
        XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post)
    ]
    proxy = 'https://localhost:1234'


MySpider.run()
```
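The XPath expressions above select `href` attributes from category, pagination, and title links. The same extraction idea can be sketched with the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset (no `contains()`, and attributes are read via `.get()` after selecting the element; the tiny document below is a made-up stand-in for the site's HTML):

```python
import xml.etree.ElementTree as ET

# Minimal stand-in markup; a real spider would receive full HTML pages.
html = """
<div>
  <span class="category-name"><a href="/category/europe/">Europe</a></span>
  <div class="mini-left">
    <div class="mini-title"><a href="/drama/foo/">Foo</a></div>
  </div>
</div>
"""

root = ET.fromstring(html)

# Select elements with an attribute predicate, then read @href manually.
category_links = [a.get('href')
                  for a in root.findall(".//span[@class='category-name']/a")]
title_links = [a.get('href')
               for a in root.findall(".//div[@class='mini-title']/a")]

print(category_links)  # ['/category/europe/']
print(title_links)     # ['/drama/foo/']
```

A full crawler would typically use lxml for complete XPath 1.0 support, including the `contains()` predicates used in the spider above.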
You can add a proxy setting to the spider as shown above.
Run python spider.py to start crawling.
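While the spider runs, the `concurrency = 5` setting in both examples caps how many requests are in flight at once. The mechanism can be sketched with a plain `asyncio.Semaphore` (a minimal stand-in with a fake fetch, not gain's actual implementation):

```python
import asyncio

CONCURRENCY = 5  # mirrors the spider's `concurrency = 5` setting


async def fetch(url, sem, counter):
    async with sem:  # at most CONCURRENCY tasks get past this point
        counter['active'] += 1
        counter['peak'] = max(counter['peak'], counter['active'])
        await asyncio.sleep(0.01)  # stand-in for a real aiohttp request
        counter['active'] -= 1
        return url


async def crawl(urls):
    sem = asyncio.Semaphore(CONCURRENCY)
    counter = {'active': 0, 'peak': 0}
    results = await asyncio.gather(*(fetch(u, sem, counter) for u in urls))
    return results, counter['peak']


urls = [f'https://example.com/{i}' for i in range(20)]
results, peak = asyncio.run(crawl(urls))
print(peak)  # peak concurrency, never more than CONCURRENCY
```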
Example
The examples are in the /example/ directory.
Contribution
Open an issue or submit a pull request.