Creeper - Next-Generation Go Crawler Framework

User submission, 2022-10-20

About

Creeper is a next-generation crawler that fetches web pages according to a creeper script. It is a cross-platform, embeddable crawler that you can use in a news app, a subscription program, and so on.

Warning: This project is still at an early stage of development. Please do not use it in a production environment.

Get Started

Installation

$ go get github.com/wspl/creeper

Hello World!

Create hacker_news.crs

page(@page=1) = "https://news.ycombinator.com/news?p={@page}"

news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href

Then, create main.go

package main

import "github.com/wspl/creeper"

func main() {
	c := creeper.Open("./hacker_news.crs")
	c.Array("news").Each(func(c *creeper.Creeper) {
		println("title: ", c.String("title"))
		println("site: ", c.String("site"))
		println("link: ", c.String("link"))
		println("===")
	})
}

Build and run. The console will print something like:

title: Samsung chief Lee arrested as S.Korean corruption probe deepens
site: reuters.com
link: http://reuters.com/article/us-southkorea-politics-samsung-group-idUSKBN15V2RD
===
title: ReactOS 0.4.4 Released
site: reactos.org
link: https://reactos.org/project-news/reactos-044-released
===
title: FeFETs: How this new memory stacks up against existing non-volatile memory
site: semiengineering.com
link: http://semiengineering.com/what-are-fefets/

Script Spec

Town

A town is a lambda-like expression that stores an (im)mutable string. Most of the time it is used to store a URL.

page(@page=1, ext) = "https://news.ycombinator.com/news?p={@page}&ext={ext}"

To use a town, call it as you would call a function:

news[]: page(ext="Hello World!") -> $("tr.athing")

You might have noticed that the @page parameter is not passed in the call above. That is because it is a special parameter.

In a town definition, an expression such as name="something" means that the parameter name has the default value "something".
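The defaulting rule can be illustrated with a small standalone Go sketch. It does not use Creeper's own code: the Town type, its Call method, and the placeholder regular expression are hypothetical names introduced here, and the choices to leave values unescaped and to expand a missing parameter to an empty string are assumptions rather than documented Creeper behavior.

```go
package main

import (
	"fmt"
	"regexp"
)

// Town is a hypothetical stand-in for a town definition: a string template
// plus the default values declared in its parameter list.
type Town struct {
	Template string
	Defaults map[string]string
}

// placeholder matches {name} and {@name} inside a template.
var placeholder = regexp.MustCompile(`\{(@?\w+)\}`)

// Call expands the template in a single pass. An argument supplied by the
// caller wins over the default; a parameter with neither expands to "".
func (t Town) Call(args map[string]string) string {
	return placeholder.ReplaceAllStringFunc(t.Template, func(m string) string {
		name := m[1 : len(m)-1] // strip the surrounding braces
		if v, ok := args[name]; ok {
			return v
		}
		return t.Defaults[name]
	})
}

func main() {
	page := Town{
		Template: "https://news.ycombinator.com/news?p={@page}&ext={ext}",
		Defaults: map[string]string{"@page": "1"},
	}
	// Mirrors page(ext="Hello World!"): @page falls back to its default.
	fmt.Println(page.Call(map[string]string{"ext": "Hello World!"}))
	// prints: https://news.ycombinator.com/news?p=1&ext=Hello World!
}
```

The value is inserted verbatim here; a real crawler would most likely need to URL-encode it before making the request.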

Incidentally, @page is a parameter that is incremented automatically when the current page has no more content.
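One way to picture this auto-increment is the loop below. It is only a sketch under stated assumptions: fetchItems and collect are hypothetical helpers, the page data is hard-coded instead of fetched over the network, and stopping at the first empty page is a guess at a sensible termination rule, not something the Creeper documentation specifies.

```go
package main

import "fmt"

// fetchItems is a hypothetical stand-in for fetching and parsing one page.
// Pages 1 and 2 hold items; every other page is empty.
func fetchItems(page int) []string {
	data := map[int][]string{
		1: {"a", "b"},
		2: {"c"},
	}
	return data[page]
}

// collect starts at page start and moves to the next page each time the
// current one is exhausted. It stops at the first empty page or once limit
// items have been gathered.
func collect(start, limit int, fetch func(int) []string) []string {
	var out []string
	for page := start; len(out) < limit; page++ {
		items := fetch(page)
		if len(items) == 0 {
			break // assumed termination rule: an empty page ends the crawl
		}
		for _, item := range items {
			if len(out) == limit {
				break
			}
			out = append(out, item)
		}
	}
	return out
}

func main() {
	fmt.Println(collect(1, 10, fetchItems)) // prints: [a b c]
}
```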

Node

Nodes form a tree that represents the structure of the data you are going to crawl.

news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href

As in YAML, the node hierarchy is expressed by indentation.
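To make the indentation rule concrete, here is a minimal Go sketch of turning indented lines into a tree. It is not Creeper's parser: Node, indentOf, and parseNodes are hypothetical names, and counting each leading tab or space as one level of indentation is a simplifying assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical parsed node: its line of text and its children.
type Node struct {
	Line     string
	Children []*Node
}

// indentOf counts the leading space and tab characters of a line.
func indentOf(line string) int {
	return len(line) - len(strings.TrimLeft(line, " \t"))
}

// parseNodes attaches each line to the nearest preceding line that has a
// smaller indentation, using a stack of currently open ancestors.
func parseNodes(src string) []*Node {
	type frame struct {
		indent int
		node   *Node
	}
	root := &Node{}
	stack := []frame{{indent: -1, node: root}} // root is never popped
	for _, line := range strings.Split(src, "\n") {
		if strings.TrimSpace(line) == "" {
			continue
		}
		indent := indentOf(line)
		// Close every open node that is not an ancestor of this line.
		for stack[len(stack)-1].indent >= indent {
			stack = stack[:len(stack)-1]
		}
		n := &Node{Line: strings.TrimSpace(line)}
		parent := stack[len(stack)-1].node
		parent.Children = append(parent.Children, n)
		stack = append(stack, frame{indent: indent, node: n})
	}
	return root.Children
}

func main() {
	src := "news[]: page -> $(\"tr.athing\")\n" +
		"\ttitle: $(\".title a.storylink\").text\n" +
		"\tsite: $(\".title span.sitestr\").text\n"
	for _, n := range parseNodes(src) {
		fmt.Println(n.Line)
		for _, child := range n.Children {
			fmt.Println("  child:", child.Line)
		}
	}
}
```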

Node Name

Every node has a name. title is a field name and represents an ordinary string value. news[] is an array name and represents a parent structure that contains multiple child items.

Page

A page indicates where to fetch the field data from. It can be a town expression or a field reference.

Field references are an advanced use of nodes; you can find the details in ./eh.crs.

If a node has both a page and a fun, the page goes on the left of -> and the fun on the right, that is, page -> fun.

Fun

A fun describes how the data is processed.

The supported funs are:

| Name | Parameters | Description |
| --- | --- | --- |
| $ | (selector: string) | Relative CSS selector (select from parent node) |
| $root | (selector: string) | Absolute CSS selector (select from body) |
| html | | inner HTML |
| text | | inner text |
| outerHTML | | outer HTML |
| attr | (attr: string) | attribute value |
| style | | style attribute value |
| href | | href attribute value |
| src | | src attribute value |
| class | | class attribute value |
| id | | id attribute value |
| calc | (prec: int) | calculate arithmetic expression |
| match | (regexp: string) | match first sub-string via regular expression |
| expand | (regexp: string, target: string) | expand matched strings to target string |
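The list above only gives one-line descriptions of match and expand, so the sketch below shows one plausible reading of them using Go's standard regexp package. The helper names matchFirst and expand are hypothetical, and it is an assumption that Creeper's target string follows the same $1-style syntax as Go's Regexp.ExpandString.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchFirst returns the first substring of s matched by pattern,
// or "" when nothing matches.
func matchFirst(pattern, s string) string {
	return regexp.MustCompile(pattern).FindString(s)
}

// expand finds the first match of pattern in s and renders target with it,
// where $1, $2, ... in target refer to the capture groups of that match.
// It returns "" when nothing matches.
func expand(pattern, target, s string) string {
	re := regexp.MustCompile(pattern)
	m := re.FindStringSubmatchIndex(s)
	if m == nil {
		return ""
	}
	return string(re.ExpandString(nil, target, s, m))
}

func main() {
	text := "42 points by alice"
	fmt.Println(matchFirst(`\d+`, text))                         // prints: 42
	fmt.Println(expand(`(\d+) points by (\w+)`, "$2: $1", text)) // prints: alice: 42
}
```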

Author

Plutonist

impl.moe · Github @wspl
