Grab: A Python Web Scraping Framework


Grab

Installation

$ pip install -U grab

See details about installing Grab on different platforms here: http://docs.grablib.org/en/latest/usage/installation.html

Support

Documentation: https://grablab.org/docs/

Russian telegram chat: https://t.me/grablab_ru

English telegram chat: https://t.me/grablab

To report a bug, please use the GitHub issue tracker: https://github.com/lorien/grab/issues

What is Grab?

Grab is a Python web scraping framework. Grab provides a number of helpful methods for performing network requests, scraping websites, and processing the scraped content:

- Automatic cookies (session) support
- HTTP and SOCKS proxy with/without authorization (see the sketch after this list)
- Keep-Alive support
- IDN support
- Tools to work with web forms
- Easy multipart file uploading
- Flexible customization of HTTP requests
- Automatic charset detection
- Powerful API to extract data from the DOM tree of HTML documents with XPATH queries
- Asynchronous API to make thousands of simultaneous queries. This part of the library is called Spider. See the list of Spider features below.
- Python 3 ready
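A minimal sketch putting a few of these features together in one session, assuming Grab's documented proxy and proxy_type options; the proxy address is a placeholder, not a real endpoint:

import logging

from grab import Grab

logging.basicConfig(level=logging.DEBUG)

g = Grab()
# Route requests through a proxy (placeholder address; drop the setup()
# call if you have no proxy). proxy_type may also be 'socks5'.
g.setup(proxy='127.0.0.1:3128', proxy_type='http')
g.go('https://example.com')             # cookies from this session are kept automatically
print(g.doc.select('//title').text())   # XPATH query against the parsed DOM tree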

Spider is a framework for writing web-site scrapers. Features:

- Rules and conventions to organize the request/parse logic in separate blocks of code
- Multiple parallel network requests
- Automatic processing of network errors (failed tasks go back to the task queue)
- You can create network requests and parse responses with the Grab API (see above)
- HTTP proxy support
- Caching network results in permanent storage
- Different backends for the task queue (in-memory, redis, mongodb); a sketch follows this list
- Tools to debug and collect statistics
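A short sketch of using a persistent task-queue backend instead of the default in-memory one. The setup_queue() call and its backend argument follow the Grab documentation, but the exact connection parameters and the database name here are illustrative assumptions:

from grab.spider import Spider, Task


class QueueSpider(Spider):
    def prepare(self):
        # Called once before the crawl starts: switch the task queue from the
        # default in-memory backend to MongoDB, so pending tasks persist.
        # 'spider_queue' is a placeholder database name.
        self.setup_queue(backend='mongodb', database='spider_queue')

    def task_generator(self):
        yield Task('page', url='https://example.com')

    def task_page(self, grab, task):
        print(grab.doc.select('//title').text())


bot = QueueSpider(thread_number=2)
bot.run()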

Grab Example

import logging

from grab import Grab

logging.basicConfig(level=logging.DEBUG)

g = Grab()
g.go('https://github.com/login')
# Fill in and submit the login form ('****' are placeholders for real credentials)
g.doc.set_input('login', '****')
g.doc.set_input('password', '****')
g.doc.submit()
g.doc.save('/tmp/x.html')
# Confirm the login worked: the signout button must exist
g.doc('//ul[@id="user-links"]//button[contains(@class, "signout")]').assert_exists()
home_url = g.doc('//a[contains(@class, "header-nav-link name")]/@href').text()
repo_url = home_url + '?tab=repositories'
g.go(repo_url)
# List the account's repositories with absolute URLs
for elem in g.doc.select('//h3[@class="repo-list-name"]/a'):
    print('%s: %s' % (elem.text(), g.make_url_absolute(elem.attr('href'))))

Grab::Spider Example

import logging

from grab.spider import Spider, Task

logging.basicConfig(level=logging.DEBUG)


class ExampleSpider(Spider):
    def task_generator(self):
        # Seed the queue with one search task per language
        for lang in 'python', 'ruby', 'perl':
            url = 'https://google.com/search?q=%s' % lang
            yield Task('search', url=url, lang=lang)

    def task_search(self, grab, task):
        # Handler for 'search' tasks: print each result's cite element
        print('%s: %s' % (task.lang,
                          grab.doc('//div[@class="s"]//cite').text()))


bot = ExampleSpider(thread_number=2)
bot.run()
