
A Simple, Easy-to-Use Node.js Crawler Framework

Tags: crawler, easy to use, framework | 2022-06-04 17:06:43 | 757 views | 八月长安


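This is an article introducing a crawler framework, so I will skip the backstory. No "I was recently working on a project", no "I want to share something new I learned". Backstory reads nicely, but introductions of that kind tend to be so basic that they cannot be applied to anything real, and a crawler without queue support is not a serious crawler. So I will start with an example that shows how simple and usable this framework is.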

Step 1: Install Crawl-pet

Node.js needs no introduction. Install crawl-pet with npm:


$ npm install crawl-pet -g --production

Run it and the program guides you through the configuration. On the first run it generates an info.json file in the project directory.


$ crawl-pet

> Set project dir: ./test-crawl-pet
> Create crawl-pet in ./test-crawl-pet [y/n]: y
> Set target url: http://foodshot.co/
> Set save rule [url/simple/group]: url
> Set file type limit: 
> The limit: not limit
> Set parser rule module:
> The module: use default crawl-pet.parser

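The test site used here, http://foodshot.co/, shares freely licensed food photos, and the image quality is excellent. It is used here only for testing and learning; feel free to test against a different site.

With the default parser the project is already runnable. Let's see it in action: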


$ crawl-pet -o ./test-crawl-pet

[Image: the local directory structure after the download]

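Step 2: Write your own parser

Now let's look at how to write your own parser. There are three ways to set one up:

When creating a new project, enter the path to your own parser at the Set parser rule module prompt.
Edit the parser entry in info.json.
The simplest: create a parser.js file directly in the project directory.

Use crawl-pet to create a parser template: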


$ crawl-pet --create-parser ./test-crawl-pet/parser.js

Open the ./test-crawl-pet/parser.js file:


// crawl-pet can use cheerio for page analysis, should you need it
const cheerio = require("cheerio")

exports.header = function(options, crawler_handle) {
}

exports.body = function(url, body, response, crawler_handle) {
  // Capture the value of every href or src attribute in the page
  const re = /\b(href|src)\s*=\s*["']([^'"#]+)/ig
  let m = null
  while ((m = re.exec(body)) !== null) {
    const href = m[2]
    if (/\.(png|gif|jpg|jpeg|mp4)\b/i.test(href)) {
      // Queue a file download
      crawler_handle.addDown(href)
    } else if (!/\.(css|js|json|xml|svg)/.test(href)) {
      // Queue a page to be parsed
      crawler_handle.addPage(href)
    }
  }
  // Always call this once parsing has finished
  crawler_handle.over()
}
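
To make the link rules in the template easier to check, here is a minimal sketch that applies the same regular expressions to a sample HTML string without running the crawler. The helper name classifyLinks and the sample markup are illustrative assumptions, not part of crawl-pet.

```javascript
// Hypothetical helper (not part of crawl-pet): sort the links found in a
// page body into files to download and pages to parse, using the same
// patterns as the template parser.
function classifyLinks(body) {
  const linkPattern = /\b(href|src)\s*=\s*["']([^'"#]+)/ig
  const downloads = []
  const pages = []
  let match = null
  while ((match = linkPattern.exec(body)) !== null) {
    const href = match[2]
    if (/\.(png|gif|jpg|jpeg|mp4)\b/i.test(href)) {
      downloads.push(href)   // media file: would go to addDown
    } else if (!/\.(css|js|json|xml|svg)/.test(href)) {
      pages.push(href)       // anything else that is not a static asset: would go to addPage
    }
  }
  return { downloads, pages }
}

// Made-up sample markup for illustration
const sample = `
  <link href="/static/site.css">
  <a href="/page/2">next</a>
  <img src="/img/cake.jpg">
  <a href="#top">top</a>
  <script src="/static/app.js"></script>
`
const result = classifyLinks(sample)
console.log(result.downloads) // [ '/img/cake.jpg' ]
console.log(result.pages)     // [ '/page/2' ]
```

The stylesheet and script are filtered out by the second pattern, and the "#top" link never matches because the capture group excludes "#".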

There is a configuration to share at the end; those in the know, read on.

Step 3: Inspect the crawled data

Given a file already downloaded to disk, find the URL it was downloaded from


$ crawl-pet -f ./test-crawl-pet/photos.foodshot.co/*.jpg

View the waiting queue


$ crawl-pet -l queue

View the list of downloaded files

$ crawl-pet -l down
# View the 5 entries after entry 0 in the downloaded list
$ crawl-pet -l down,0,5
# The --json flag prints the output as JSON
$ crawl-pet -l down,0,5 --json

View the list of parsed pages; the arguments are the same as for the downloaded list

$ crawl-pet -l page
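
The list argument packs a list name, a start index and a count into one comma-separated value. As a rough sketch of how such a value reads, the hypothetical helper below splits it into its parts. The function name, the defaults (0 and -1, borrowed from the example in the help text) and the validation are assumptions for illustration; crawl-pet's own parsing may differ.

```javascript
// Hypothetical helper (not crawl-pet's code): split a --list value such as
// "down,0,5" into its list name, start index and count.
function parseListSpec(spec) {
  // Defaults are an assumption based on the help example "[page/down/queue],0,-1"
  const [name, start = "0", count = "-1"] = spec.split(",")
  if (!["page", "down", "queue"].includes(name)) {
    throw new Error(`unknown list name: ${name}`)
  }
  return { name, start: Number(start), count: Number(count) }
}

console.log(parseListSpec("down,0,5")) // { name: 'down', start: 0, count: 5 }
```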

Those are the basic features. Now take a look at its help output.

The crawler framework is open source; the GitHub repository is at https://github.com/wl879/Crawl-pet


$ crawl-pet --help

 Crawl-pet options help:

 -u, --url  string    Destination address
 -o, --outdir string    Save the directory, Default use pwd
 -r, --restart      Reload all page
 --clear        Clear queue
 --save   string    Save file rules following options
          = url: Save the path consistent with url
          = simple: Save file in the project path
          = group: Save 500 files in one folder
 --types   array    Limit download file type
 --limit   number=5   Concurrency limit
 --sleep   number=200   Concurrent interval
 --timeout  number=180000  Queue timeout
 --proxy   string    Set up proxy
 --parser  string    Set crawl rule, it's a js file path!
          The default load the parser.js file in the project path
 --maxsize  number    Limit the maximum size of the download file
 --minwidth  number    Limit the minimum width of the download file
 --minheight  number    Limit the minimum height of the download file
 -i, --info       View the configuration file
 -l, --list  array    View the queue data 
          e.g. [page/down/queue],0,-1
 -f, --find  array    Find the download URL of the local file
 --json        Print result to json format
 -v, --version      View version
 -h, --help       View help
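
The --limit and --sleep options describe a queue that caps concurrency and spaces out work. The sketch below shows one common way to build such a queue in Node.js. It is not crawl-pet's implementation: the names wait and runQueue are made up, and treating the sleep value as a pause before a worker picks up its next task is an assumption.

```javascript
// Hypothetical sketch, not crawl-pet's implementation.
function wait(ms) {
  return new Promise(resolve => setTimeout(resolve, ms))
}

// Run async task functions with at most `limit` in flight. Each worker
// pauses `sleep` milliseconds before taking its next task.
async function runQueue(tasks, { limit = 5, sleep = 200 } = {}) {
  const results = new Array(tasks.length)
  let next = 0
  async function worker() {
    while (next < tasks.length) {
      const index = next++            // claim the next task
      results[index] = await tasks[index]()
      if (next < tasks.length) await wait(sleep)
    }
  }
  const workers = Array.from({ length: Math.min(limit, tasks.length) }, worker)
  await Promise.all(workers)
  return results                      // same order as the input tasks
}

// Usage: six fake downloads, two at a time
let inFlight = 0
let peak = 0
const jobs = [1, 2, 3, 4, 5, 6].map(n => async () => {
  inFlight++
  peak = Math.max(peak, inFlight)
  await wait(10)                      // stands in for a network request
  inFlight--
  return n * 2
})
runQueue(jobs, { limit: 2, sleep: 5 }).then(results => {
  console.log(results, "peak concurrency:", peak)
})
```

Because Node.js runs this code on a single thread, claiming a task with next++ needs no locking; the workers only interleave at await points.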

Finally, a configuration to share


$ crawl-pet -u https://www.reddit.com/r/funny/ -o reddit --save group

info.json


{
 "url": "https://www.reddit.com/r/funny/",
 "outdir": ".",
 "save": "group",
 "types": "",
 "limit": "5",
 "parser": "my_parser.js",
 "sleep": "200",
 "timeout": "180000",
 "proxy": "",
 "maxsize": 0,
 "minwidth": 0,
 "minheight": 0,


 "cookie": "over18=1"
}

my_parser.js


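exports.body = function(url, body, response, crawler_handle) {
  // Capture the value of every data-url, href or src attribute in the page
  const re = /\b(data-url|href|src)\s*=\s*["']([^'"#]+)/ig
  let m = null
  while ((m = re.exec(body)) !== null) {
    const href = m[2]
    // Skip thumbnails, user pages, icons and static assets
    if (/thumb|user|icon|\.(css|json|js|xml|svg)\b/i.test(href)) {
      continue
    }
    // Queue media files for download
    if (/\.(png|gif|jpg|jpeg|mp4)\b/i.test(href)) {
      crawler_handle.addDown(href)
      continue
    }
    // Follow only links that point into a subreddit
    if (/reddit\.com\/r\//i.test(href)) {
      crawler_handle.addPage(href)
    }
  }
  crawler_handle.over()
}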
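
A parser like this can be tried offline by passing it a stub in place of crawler_handle. The sketch below is one way to do that. The stub implements only addDown, addPage and over, the three methods this article uses, and the real handle may expose more. createRecordingHandle, parseBody and the sample markup are assumptions for illustration.

```javascript
// Hypothetical stub: records what a parser asks the crawler to do.
function createRecordingHandle() {
  const calls = { downloads: [], pages: [], finished: false }
  return {
    calls,
    addDown(url) { calls.downloads.push(url) },
    addPage(url) { calls.pages.push(url) },
    over() { calls.finished = true },
  }
}

// Same rules as my_parser.js, inlined so this sketch runs on its own.
function parseBody(url, html, response, crawler_handle) {
  const re = /\b(data-url|href|src)\s*=\s*["']([^'"#]+)/ig
  let m = null
  while ((m = re.exec(html)) !== null) {
    const href = m[2]
    if (/thumb|user|icon|\.(css|json|js|xml|svg)\b/i.test(href)) continue
    if (/\.(png|gif|jpg|jpeg|mp4)\b/i.test(href)) {
      crawler_handle.addDown(href)
      continue
    }
    if (/reddit\.com\/r\//i.test(href)) crawler_handle.addPage(href)
  }
  crawler_handle.over()
}

// Made-up sample markup; the hosts are placeholders
const page = `
  <a href="https://www.reddit.com/r/funny/comments/abc/">post</a>
  <img src="https://i.example.com/thumbs/small.jpg">
  <div data-url="https://i.example.com/full/cat.png"></div>
  <a href="https://www.reddit.com/user/someone">profile</a>
  <link href="/static/main.css">
`
const handle = createRecordingHandle()
parseBody("https://www.reddit.com/r/funny/", page, null, handle)
console.log(handle.calls)
```

With this sample, only the full-size image is queued for download and only the subreddit link is queued as a page; the thumbnail, the user profile and the stylesheet are skipped by the first filter.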

If you know reddit, that is all there is to it.

