xcrawler一个基于Python requests库的轻量级Web爬虫框架

Reader submission · 712 · 2022-10-30

A light-weight web crawler framework: xcrawler

Introduction

xcrawler is a light-weight web crawler framework. Some of its design concepts are borrowed from the well-known framework Scrapy. I'm very interested in web crawling, but I'm still a newbie to web scraping; I built this project to learn more of the basics of web crawling and of the Python language.

Features

- Simple: extremely easy to customize your own spider
- Fast: multiple requests are spawned concurrently with the ThreadPoolDownloader or ProcessPoolDownloader
- Flexible: different scheduling strategies are provided: FIFO, FILO, or priority based
- Extensible: write your own extensions to make your crawler much more powerful
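The concurrency model behind the pool-based downloaders can be illustrated with Python's standard `concurrent.futures`. This is only a sketch of the general pattern, not xcrawler's actual internals; the `fetch` function here is a stand-in for a real `requests.get` call:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    # Stand-in for a real HTTP download via requests.get(url).
    return 'response for {}'.format(url)

urls = ['https://example.com/page/{}'.format(i) for i in range(5)]

# Spawn multiple requests concurrently, as a ThreadPoolDownloader would:
# submit every URL to the pool, then collect results as they complete.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(fetch, url) for url in urls]
    responses = [f.result() for f in as_completed(futures)]

print(len(responses))
```

A ProcessPoolDownloader would follow the same shape with `ProcessPoolExecutor`, trading thread-level concurrency for separate worker processes.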

Install

Create a virtual environment for your project, then activate it:

```shell
virtualenv crawlenv
source crawlenv/bin/activate
```

Download and install this package:

```shell
pip install git+https://github.com/0xE8551CCB/xcrawler.git
```

Quick start

Define your own spider:

```python
from xcrawler import BaseSpider


class DoubanMovieSpider(BaseSpider):
    name = 'douban_movie'
    custom_settings = {}
    start_urls = ['https://movie.douban.com']

    def parse(self, response):
        # extract items from response
        # yield new requests
        # yield new items
        pass
```

Define your own extension:

```python
class DefaultUserAgentExtension(object):
    config_key = 'DEFAULT_USER_AGENT'

    def __init__(self):
        self._user_agent = ''

    def on_crawler_started(self, crawler):
        if self.config_key in crawler.settings:
            self._user_agent = crawler.settings[self.config_key]

    def process_request(self, request, spider):
        if not request or 'User-Agent' in request.headers or not self._user_agent:
            return request
        logger.debug('[{}]{} adds default user agent: '
                     '{!r}'.format(spider, request, self._user_agent))
        request.headers['User-Agent'] = self._user_agent
        return request
```

Define a pipeline to store scraped items:

```python
import json


class JsonLineStoragePipeline(object):
    def __init__(self):
        self._file = None

    def on_crawler_started(self, crawler):
        path = crawler.settings.get('STORAGE_PATH', '')
        if not path:
            raise FileNotFoundError('missing config key: `STORAGE_PATH`')
        self._file = open(path, 'a+')

    def on_crawler_stopped(self, crawler):
        if self._file:
            self._file.close()

    def process_item(self, item, request, spider):
        if item and isinstance(item, dict):
            # file objects have no `writeline` method; use `write`
            self._file.write(json.dumps(item) + '\n')
```

Config the crawler:

```python
settings = {
    'download_timeout': 16,
    'download_delay': .5,
    'concurrent_requests': 10,
    'storage_path': '/tmp/hello.jl',
    'default_user_agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) '
                          'AppleWebKit/603.3.8 (KHTML, like Gecko) Version'
                          '/10.1.2 Safari/603.3.8',
    'global_extensions': {0: DefaultUserAgentExtension},
    'global_pipelines': {0: JsonLineStoragePipeline}
}

crawler = Crawler('DEBUG', **settings)
crawler.crawl(DoubanMovieSpider)
```

Bingo, you are ready to go now:

```python
crawler.start()
```
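The `parse` method in the spider above is left empty; in practice it turns a response body into items and follow-up requests. A minimal sketch of the extraction half, using only the standard library — the HTML snippet and the `MovieTitleParser` class are illustrative inventions, not part of xcrawler's API:

```python
from html.parser import HTMLParser


class MovieTitleParser(HTMLParser):
    """Collects the text of every <a class="movie"> link."""

    def __init__(self):
        super().__init__()
        self._in_movie_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a' and ('class', 'movie') in attrs:
            self._in_movie_link = True

    def handle_data(self, data):
        if self._in_movie_link:
            self.titles.append(data.strip())

    def handle_endtag(self, tag):
        if tag == 'a':
            self._in_movie_link = False


html = '<div><a class="movie">Movie A</a><a class="movie">Movie B</a></div>'
parser = MovieTitleParser()
parser.feed(html)

# Each extracted title would become one item yielded from `parse`.
items = [{'title': t} for t in parser.titles]
print(items)
```

In a real spider you would run this kind of extraction over `response`'s body, `yield` each item dict, and `yield` new requests for the links you want to follow.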

License

xcrawler is licensed under the MIT license. Please feel free to use it, and happy crawling!
