Using scrapy-splash on JD.com

Tags: 京东 (JD.com), scrapy, splash | 2023-01-31 00:01:21 | 664 views | Author: 薄情痞子

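This is my first blog post. If anything is poorly written, please point it out so we can all improve together!

Introduction to scrapy-splash

The scrapy-splash module is built on Splash. Splash is a JavaScript rendering service: a lightweight browser with an HTTP API, implemented in Python on top of Twisted and Qt. Twisted and Qt give the service asynchronous processing so that it can take advantage of WebKit's concurrency. Splash's features are: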

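Processes multiple web pages in parallel
Returns HTML results and/or renders pages as screenshots
Turns off image loading or uses Adblock Plus rules to make rendering faster
Runs JavaScript to process page content
Supports Lua scripts
Lets you develop Splash Lua scripts in Splash-Jupyter Notebooks
Returns detailed rendering information in HAR format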
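
As a rough illustration of the "lightweight browser with an HTTP API" idea, the sketch below builds a request URL for Splash's render.html endpoint using only the standard library. The host is the one that appears in this article's settings.py and will differ per setup; build_render_url is a hypothetical helper name, and the wait and images values are only illustrative.

```python
from urllib.parse import urlencode


def build_render_url(splash_url, page_url, wait=2.0, images=False):
    """Return a Splash render.html URL for page_url (hypothetical helper)."""
    params = {
        "url": page_url,               # page to render
        "wait": wait,                  # seconds to wait after the page loads
        "images": 1 if images else 0,  # 0 turns image loading off
    }
    return splash_url.rstrip("/") + "/render.html?" + urlencode(params)


# The host below is taken from the article's settings.py; adjust it to your setup.
render_url = build_render_url("http://192.168.99.100:8050", "https://search.jd.com")
print(render_url)
```

Fetching that URL with a browser or an HTTP client should return the rendered HTML, provided a Splash instance is actually running at that address.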

Reference: https://www.cnblogs.com/jclian91/p/8590617.html

Setup

The Scrapy framework
Splash: Windows users install Docker through a virtual machine; Linux users install Docker directly

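Page analysis

First, go to https://search.jd.com/ and search for the books you want. Here, Python 3.7 books are used as the example.

After clicking search, you can see that JD loads the book data with JavaScript: scrolling down loads more books. (The data can also be obtained through JD's API.)

First comes the simulated search. Inspecting the page gives the elements to target (in the scripts below, the #keyword input and the .input_submit button):

Then comes the simulated scroll. An element at the bottom of the page (class bottom-search in the scripts below) is chosen as the scroll target:

Start crawling

Lua script that simulates the search click so that the page count can be read from the result page: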

function main(splash, args)
  -- Skip images to speed up rendering
  splash.images_enabled = false
  splash:set_user_agent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36')
  assert(splash:go(args.url))
  splash:wait(0.5)
  -- Type the keyword into the search box
  local input = splash:select("#keyword")
  input:send_text('python3.7')
  splash:wait(0.5)
  -- Click the search button
  local form = splash:select('.input_submit')
  form:click()
  splash:wait(2)
  -- Scroll to the bottom so the remaining results are loaded
  splash:runjs("document.getElementsByClassName('bottom-search')[0].scrollIntoView(true)")
  splash:wait(6)
  return splash:html()
end

The script that simulates scrolling down is similar:

function main(splash, args)
  splash.images_enabled = false
  splash:set_user_agent('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36')
  assert(splash:go(args.url))
  splash:wait(2)
  splash:runjs("document.getElementsByClassName('bottom-search')[0].scrollIntoView(true)")
  splash:wait(6)
  return splash:html()
end
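
Before wiring a script like this into Scrapy, it can help to try it against Splash directly. The sketch below builds a POST request for Splash's execute endpoint, which accepts a JSON body whose lua_source field holds the script and whose other fields are exposed to the script as args. The host comes from this article's settings.py; the helper names and the shortened search URL are assumptions made for illustration.

```python
import json
from urllib.request import Request, urlopen

# The scroll script from the article, without the user-agent line for brevity.
LUA_SCROLL = """
function main(splash, args)
  splash.images_enabled = false
  assert(splash:go(args.url))
  splash:wait(2)
  splash:runjs("document.getElementsByClassName('bottom-search')[0].scrollIntoView(true)")
  splash:wait(6)
  return splash:html()
end
"""


def build_execute_request(splash_url, page_url, lua_source):
    """Build (but do not send) a POST request for Splash's execute endpoint."""
    payload = {"lua_source": lua_source, "url": page_url}
    return Request(
        splash_url.rstrip("/") + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_rendered_html(request, timeout=60):
    """Send the request to a running Splash instance and return the HTML text."""
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


request = build_execute_request(
    "http://192.168.99.100:8050",  # assumed Splash address, as in settings.py
    "https://search.jd.com/Search?keyword=python3.7&enc=utf-8",
    LUA_SCROLL,
)
print(request.get_method(), request.full_url)
```

Calling fetch_rendered_html(request) only works when Splash is reachable at that address, so it is left out of the example run.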

Pick the elements you want to extract by inspecting the page. The full source:

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from ..items import JdsplashItem


# First request: type the keyword, click search, then scroll to the bottom
lua_script = '''
function main(splash, args)
  splash.images_enabled = false
  splash:set_user_agent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36')
  assert(splash:go(args.url))
  splash:wait(0.5)
  local input = splash:select("#keyword")
  input:send_text('python3.7')
  splash:wait(0.5)
  local form = splash:select('.input_submit')
  form:click()
  splash:wait(2)
  splash:runjs("document.getElementsByClassName('bottom-search')[0].scrollIntoView(true)")
  splash:wait(6)
  return splash:html()
end
'''

# Following pages: load the result URL and scroll to the bottom
lua_script2 = '''
function main(splash, args)
  splash.images_enabled = false
  splash:set_user_agent('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36')
  assert(splash:go(args.url))
  splash:wait(2)
  splash:runjs("document.getElementsByClassName('bottom-search')[0].scrollIntoView(true)")
  splash:wait(6)
  return splash:html()
end
'''


class JdBookSpider(scrapy.Spider):
    name = 'jd'
    allowed_domains = ['search.jd.com']
    start_urls = ['https://search.jd.com']

    def start_requests(self):
        # Open the search page and run the search
        for each in self.start_urls:
            yield SplashRequest(each, callback=self.parse, endpoint='execute',
                                args={'lua_source': lua_script})

    def parse(self, response):
        price = response.css('div.gl-i-wrap div.p-price i::text').getall()
        page_num = response.xpath("//span[@class= 'p-num']/a[last()-1]/text()").get()
        # The XPath function string(arg) returns the string value of its argument,
        # which can be a number, a boolean, or a node-set.
        # Perhaps this is where XPath is more refined than CSS selectors.
        name = response.css('div.gl-i-wrap div.p-name').xpath('string(.//em)').getall()
        # comment = response.css('div.gl-i-wrap div.p-commit').xpath('string(.//strong)').getall()
        comment = response.css('div.gl-i-wrap div.p-commit strong a::text').getall()
        publishstore = response.css('div.gl-i-wrap div.p-shopnum a::attr(title)').getall()
        href = [response.urljoin(i) for i in response.css('div.gl-i-wrap div.p-img a::attr(href)').getall()]
        for each in zip(name, price, comment, publishstore, href):
            # Create a fresh item per book instead of mutating one shared item
            item = JdsplashItem()
            item['name'] = each[0]
            item['price'] = each[1]
            item['comment'] = each[2]
            item['p_store'] = each[3]
            item['href'] = each[4]
            yield item
        # Continue from the second page; if no pager was found, stop after page one
        last_page = int(page_num) if page_num else 1
        url = 'https://search.jd.com/Search?keyword=python3.7&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&page=%d&s=%d&click=0'
        for each_page in range(1, last_page):
            yield SplashRequest(url % (each_page * 2 + 1, each_page * 60), callback=self.s_parse,
                                endpoint='execute', args={'lua_source': lua_script2})

    def s_parse(self, response):
        price = response.css('div.gl-i-wrap div.p-price i::text').getall()
        name = response.css('div.gl-i-wrap div.p-name').xpath('string(.//em)').getall()
        comment = response.css('div.gl-i-wrap div.p-commit strong a::text').getall()
        publishstore = response.css('div.gl-i-wrap div.p-shopnum a::attr(title)').getall()
        href = [response.urljoin(i) for i in response.css('div.gl-i-wrap div.p-img a::attr(href)').getall()]
        for each in zip(name, price, comment, publishstore, href):
            item = JdsplashItem()
            item['name'] = each[0]
            item['price'] = each[1]
            item['comment'] = each[2]
            item['p_store'] = each[3]
            item['href'] = each[4]
            yield item
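
The pagination step in the spider can be looked at on its own. The sketch below reproduces the mapping used in the code above: for each follow-up page, the page parameter steps through odd values (3, 5, 7, ...) and s through multiples of 60. The post does not explain why, so treat the mapping as taken from the article's code rather than from any documented JD API; follow_up_urls is a hypothetical helper name.

```python
SEARCH_URL = ('https://search.jd.com/Search?keyword=python3.7&enc=utf-8'
              '&qrst=1&rt=1&stop=1&vt=2&page=%d&s=%d&click=0')


def follow_up_urls(last_page):
    """Return the URLs for result pages 2..last_page using the article's mapping."""
    urls = []
    for n in range(1, last_page):
        page = n * 2 + 1   # odd values: 3, 5, 7, ...
        start = n * 60     # the s offset used in the spider
        urls.append(SEARCH_URL % (page, start))
    return urls


# With a pager that reports 4 pages, three follow-up requests are generated.
for url in follow_up_urls(4):
    print(url)
```

Keeping the mapping in one small function makes it easy to adjust if JD changes its URL scheme.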

Configuration of each file:

items.py:

import scrapy


class JdsplashItem(scrapy.Item):
    # Fields collected for each book
    name = scrapy.Field()     # book title
    price = scrapy.Field()    # price
    p_store = scrapy.Field()  # shop name (title attribute of the shop link)
    comment = scrapy.Field()  # comment count text
    href = scrapy.Field()     # link to the product page
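
The spider fills these fields by zipping five parallel lists. Because zip stops at the shortest list, a product with a missing field (for example, no comment count) could silently shift values between products. The sketch below checks the list lengths before combining them. It uses plain dicts and made-up sample values rather than real scraped data, and build_rows is a hypothetical helper, not part of Scrapy.

```python
FIELDS = ("name", "price", "comment", "p_store", "href")


def build_rows(columns):
    """Combine parallel lists into one dict per product, refusing uneven lists."""
    lengths = {field: len(columns[field]) for field in FIELDS}
    if len(set(lengths.values())) > 1:
        raise ValueError("column lengths differ: %r" % (lengths,))
    return [dict(zip(FIELDS, values))
            for values in zip(*(columns[field] for field in FIELDS))]


# Placeholder values, not real JD data.
rows = build_rows({
    "name": ["Book A", "Book B"],
    "price": ["59.00", "79.00"],
    "comment": ["1000+", "200+"],
    "p_store": ["Shop A", "Shop B"],
    "href": ["https://example.com/1", "https://example.com/2"],
})
print(rows[0]["name"], rows[0]["price"])
```

A stricter alternative would be to loop over each product card and extract every field relative to that card, so one missing value cannot affect its neighbours.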

settings.py:

# Address of the Splash server
SPLASH_URL = 'http://192.168.99.100:8050'
# Enable the two Splash downloader middlewares and adjust the order of HttpCompressionMiddleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
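
The settings above are the ones the post lists. The scrapy-splash README additionally recommends the entries below, so a fuller settings.py might include them. Whether they are needed for this particular spider is not stated in the post, so this is a sketch to adapt rather than a required configuration.

```python
# Additional settings recommended by the scrapy-splash README (not in the original post)
SPIDER_MIDDLEWARES = {
    # Avoids storing repeated Splash arguments (such as long Lua scripts) many times
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# Duplicate filter that takes Splash arguments into account
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Only relevant when Scrapy's HTTP cache is enabled
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```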

Finally, run the code and you can see that the book data has been scraped:

Article link: https://lsjlt.com/news/182739.html (please cite the source link when reposting)
