
How to Crawl NetEase News with Scrapy in Python

2023-06-14 06:06:45 · 408 views · 独家记忆


This article walks through crawling NetEase News (news.163.com) with Scrapy in Python, from creating the project to writing the spider.

1. Create the project

Run the following in a command-line window:

    scrapy startproject scrapytest

Scrapy then creates the project files automatically.

2. Edit items.py

Open the items.py file that Scrapy generated:

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html

    import scrapy


    class ScrapytestItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        pass

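Edit the class to declare the information to collect, for example the news title, URL, time, source, source URL, and body text. The version used here defines five fields: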

    class ScrapytestItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()       # news headline
        timestamp = scrapy.Field()   # publication time
        category = scrapy.Field()    # news category
        content = scrapy.Field()     # body text
        url = scrapy.Field()         # page URL

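3. Define the spider: create a spider template

3.1 Create a crawl spider template

From the project root directory, create a spider from the crawl template. Check the command for typos; -t selects the template named after it, here crawl. The command generates a codingce.py file in the spiders folder.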

scrapy genspider -t crawl codingce news.163.com

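Compared with the basic template, the 'crawl' template adds a link extractor and a set of crawl rules, which makes it easier to follow links and crawl deeper pages.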

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class CodingceSpider(CrawlSpider):
        name = 'codingce'
        allowed_domains = ['163.com']
        start_urls = ['http://news.163.com/']

        rules = (
            Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            item = {}
            #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
            #item['name'] = response.xpath('//div[@id="name"]').get()
            #item['description'] = response.xpath('//div[@id="description"]').get()
            return item

3.2 Background: selectors

Scrapy selectors support both XPath and CSS. Some XPath examples:

    /html/head/title
    /html/head/title/text()
    //td                      (two slashes select at any depth)
    //div[@class='mine']
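To get a feel for these expressions without starting a crawl, expressions of the same shape can be tried with the standard library's xml.etree.ElementTree. This is a stand-in rather than Scrapy's own selector: ElementTree supports only a subset of XPath, paths are written relative to the root element, and text() is replaced by the .text attribute. The sample page is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for illustration.
PAGE = """
<html>
  <head><title>Sample news</title></head>
  <body>
    <table><tr><td>first</td><td>second</td></tr></table>
    <div class="mine">my block</div>
    <div class="other">other block</div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)  # root is the <html> element

# /html/head/title : an absolute path, written relative to the root here
title = root.find("./head/title")

# /html/head/title/text() : ElementTree exposes the text as an attribute
title_text = title.text

# //td : every <td> at any depth
cells = [td.text for td in root.findall(".//td")]

# //div[@class='mine'] : keep only elements whose attribute has this value
mine = [div.text for div in root.findall(".//div[@class='mine']")]

print(title_text)
print(cells)
print(mine)
```

In a Scrapy callback the original expressions would be passed to response.xpath(...) unchanged.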

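3.3 Analyze the page content

In Google Chrome, open the NetEase News site and choose View Page Source to confirm that the fields declared in items.py can actually be obtained. (Those fields were chosen in the first place by checking what the page source contains.)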

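Use Inspect to confirm that the title, time, URL, source URL, and content each correspond to a tag on the page. The elements checked here are the body, the title, the time, and the category; the spider code in section 4 reads them with the CSS selectors .post_content p, h2, .post_info, and .post_crumb a respectively.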

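4. Edit the generated spider file

4.1 Imports

Open the generated spider file and write the code. In addition to the three imports the template already contains, import ScrapytestItem from scrapytest.items. This is where packages come in: the __init__.py file created with the project marks the folder as a package, so it can be imported directly without installing anything.

Note: the spider class (CodingceSpider here) must inherit from CrawlSpider, matching the 'crawl' template used to create it.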

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    from scrapytest.items import ScrapytestItem


    class CodingceSpider(CrawlSpider):
        name = 'codingce'
        allowed_domains = ['163.com']
        start_urls = ['http://news.163.com/']

        rules = (
            Rule(LinkExtractor(allow=r'.*\.163\.com/\d{2}/\d{4}/\d{2}/.*\.html'),
                 callback='parse_item', follow=True),
        )

        # Named parse_item rather than parse: Scrapy's documentation advises
        # against using parse as a CrawlSpider rule callback.
        def parse_item(self, response):
            item = {}
            # Join the paragraph texts; skip pages with too little body text.
            content = '<br>'.join(response.css('.post_content p::text').getall())
            if len(content) < 100:
                return None
            return item

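The Rule takes three arguments here. The first, LinkExtractor(allow=r'.*\.163\.com/\d{2}/\d{4}/\d{2}/.*\.html'), holds the regular expression that decides which links are extracted, and is the core part to write. The second, callback, names the method that handles each matched page. The third, follow, controls whether links found on matched pages are followed in turn. Scrapy's documentation advises against naming the callback parse in a CrawlSpider; parse_item, the name the template uses, is a safer choice.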
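Before running a crawl, the allow pattern can be tried against a few URLs with Python's re module to see which links it would accept. A minimal sketch; the URLs below are made up for illustration and are not taken from the site.

```python
import re

# The allow pattern used in the Rule.
ARTICLE_URL = re.compile(r'.*\.163\.com/\d{2}/\d{4}/\d{2}/.*\.html')

# Hypothetical URLs, invented for this example only.
candidates = [
    "https://news.163.com/21/0105/10/EXAMPLE0001.html",  # article-shaped
    "https://news.163.com/",                              # front page
    "https://news.163.com/special/topic/",                # no date segments
]

# Keep only the URLs in which the pattern is found.
matches = [url for url in candidates if ARTICLE_URL.search(url)]
print(matches)
```

Only the first URL has the two-digit, four-digit, two-digit path segments followed by a .html file name, so it is the only one kept.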

Final code

    from datetime import datetime
    import re

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    from scrapytest.items import ScrapytestItem


    class CodingceSpider(CrawlSpider):
        name = 'codingce'
        allowed_domains = ['163.com']
        start_urls = ['http://news.163.com/']

        rules = (
            # Follow links that look like article pages and pass them to parse_item.
            Rule(LinkExtractor(allow=r'.*\.163\.com/\d{2}/\d{4}/\d{2}/.*\.html'),
                 callback='parse_item', follow=True),
        )

        # Named parse_item rather than parse: Scrapy's documentation advises
        # against using parse as a CrawlSpider rule callback.
        def parse_item(self, response):
            # Join the paragraph texts; skip pages with too little body text.
            content = '<br>'.join(response.css('.post_content p::text').getall())
            if len(content) < 100:
                return None

            title = response.css('h2::text').get()
            # The last breadcrumb link is taken as the category.
            crumbs = response.css('.post_crumb a::text').getall()
            category = crumbs[-1] if crumbs else None

            # Pull 'YYYY-MM-DD HH:MM:SS' out of the info line, if present.
            time_text = response.css('.post_info::text').get() or ''
            match = re.search(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', time_text)
            timestamp = datetime.fromisoformat(match.group()) if match else None

            # Field names must match those declared in items.py.
            item = ScrapytestItem()
            item['title'] = title
            item['timestamp'] = timestamp
            item['category'] = category
            item['content'] = content
            item['url'] = response.url
            return item
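The string handling in the final code (taking the last breadcrumb and extracting the timestamp) can be exercised with plain strings, without running a crawl. A minimal sketch: the helper names parse_timestamp and last_or_none and the sample strings are assumptions for illustration, not part of the original spider.

```python
import re
from datetime import datetime

TIMESTAMP_RE = re.compile(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}')


def parse_timestamp(time_text):
    """Return the first 'YYYY-MM-DD HH:MM:SS' in time_text as a datetime, or None."""
    if not time_text:
        return None
    match = TIMESTAMP_RE.search(time_text)
    if match is None:
        return None
    return datetime.fromisoformat(match.group())


def last_or_none(values):
    """Return the last element of a list, or None when the list is empty."""
    return values[-1] if values else None


# Hypothetical strings standing in for what the selectors might return.
sample_info = "\n    2021-01-05 10:20:30 source: example\n"
print(parse_timestamp(sample_info))
print(parse_timestamp("no date here"))
print(last_or_none(["Home", "News", "Domestic"]))
print(last_or_none([]))
```

Returning None instead of raising lets the spider decide whether to skip a page or keep it with a missing field.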


Article link: https://lsjlt.com/news/268772.html (please cite this link when reposting)
