Python Scrapy in Practice: Scraping 古诗文网 (gushiwen.cn)

2024-04-02 19:04:59, 722 views, by 泡泡鱼


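Requirements

Using Python and the Scrapy framework, scrape poem data from 古诗文网 (gushiwen.cn): each poem's title, author, dynasty, text, and translation. The crawl has to proceed page by page, 4 pages in total. The URL of the first page is https://www.gushiwen.cn/default_1.aspx.

1. Creating the Scrapy project

First create the Scrapy project and the spider.

In the target directory, create a project named prose: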

scrapy startproject prose

Change into the project directory, then create a spider named gs whose crawl scope is limited to gushiwen.cn:

cd prose
scrapy genspider gs gushiwen.cn

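2. Global configuration: settings.py

Edit settings.py as follows:

① Do not obey the robots.txt rules (ROBOTSTXT_OBEY = False)

② Set the download delay to 1 second (DOWNLOAD_DELAY = 1)

③ Add request headers and enable the item pipeline

④ Set the log level: LOG_LEVEL = "WARNING"

The complete file: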

# Scrapy settings for prose project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'prose'

SPIDER_MODULES = ['prose.spiders']
NEWSPIDER_MODULE = 'prose.spiders'

LOG_LEVEL = "WARNING"


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'prose (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'prose.middlewares.ProseSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'prose.middlewares.ProseDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'prose.pipelines.ProsePipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
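
The fixed one-second delay set above is the simplest way to slow the crawl down. As a sketch of an alternative, the AutoThrottle settings that are commented out in the file could be enabled so that Scrapy adjusts the delay to the server's response times. The specific values below are illustrative assumptions, not settings from the original project:

```python
# settings.py fragment (illustrative values, not from the original project)

# With AutoThrottle enabled, DOWNLOAD_DELAY acts as the lower bound for the delay
DOWNLOAD_DELAY = 1

# Let Scrapy adapt the delay to the latency it observes
AUTOTHROTTLE_ENABLED = True
# Delay used for the first requests, in seconds
AUTOTHROTTLE_START_DELAY = 1
# Upper bound for the delay when the server is slow, in seconds
AUTOTHROTTLE_MAX_DELAY = 10
# Average number of parallel requests to aim for per remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```

For a four-page crawl the fixed delay is probably enough; adaptive throttling becomes more useful when the number of detail pages grows.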

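3. The spider program (gs.py)

The first step is page analysis, which is not covered in detail here.

The spider is the core of the project and the part that needs the most editing.

First, change the start URL to the first page of the list to be scraped, rather than the gushiwen.cn home page.

To recap the requirement: for each poem we want the title, author, dynasty, text, and translation, and the crawl proceeds page by page.

The title, author, dynasty, text, and translation all live inside the same <div> tag.

To demonstrate two different techniques, the fields are collected in two ways.

The title, author, dynasty, and text are extracted directly in the for loop over the list page. Inside that loop a try…except statement prevents an error when an index is applied to an empty result.

For the translation, we define a separate parse_detail function and pass it as the callback to scrapy.Request().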
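
The guard used in the first technique can be illustrated without Scrapy. In the sketch below, plain lists stand in for the results of .getall(); the helper names split_source and join_text are hypothetical, and the sample strings are made up for illustration:

```python
def split_source(source):
    """Return (author, dynasty), or None when either entry is missing."""
    try:
        return source[0], source[1]
    except IndexError:
        # An empty or one-element list means the block is not a full poem entry
        return None


def join_text(nodes):
    """Concatenate text nodes and trim the surrounding whitespace."""
    return ''.join(nodes).strip()


print(split_source(['李白', '〔唐代〕']))  # ('李白', '〔唐代〕')
print(split_source([]))                   # None
print(join_text(['\n床前明月光，', '疑是地上霜。\n']))  # 床前明月光，疑是地上霜。
```

Catching only IndexError, rather than every exception, keeps genuine bugs visible while still skipping blocks that lack an author or dynasty.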

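Pagination works as follows: once all the data needed from a page has been collected (that is, after the for loop finishes), we read the next-page link from the current page and check whether it is empty. If it is not empty, we call scrapy.Request() again with that link and with parse as the callback. If it is empty, this is the last page and the crawl ends there.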
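
The stop condition and the handling of the link can be sketched with the standard library alone. The helper name next_page_url and the sample href values are assumptions for illustration; the actual href format on the site may differ, which is why the sketch resolves the link against the current page URL instead of assuming it is absolute:

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Return the absolute URL of the next page, or None on the last page."""
    if not href:
        # No "next page" link was found: the crawl should stop here
        return None
    # Works for both relative and already-absolute links
    return urljoin(current_url, href)


page_1 = 'https://www.gushiwen.cn/default_1.aspx'
print(next_page_url(page_1, '/default_2.aspx'))
print(next_page_url(page_1, None))
```

Inside a spider, response.urljoin(href) performs the same resolution against the URL of the response.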

The full code:

import scrapy
from prose.items import ProseItem


class GsSpider(scrapy.Spider):
    name = 'gs'
    allowed_domains = ['gushiwen.cn']
    start_urls = ['https://www.gushiwen.cn/default_1.aspx']

    # Parse a list page
    def parse(self, response):
        # Each div with class="sons" corresponds to one poem
        div_list = response.xpath('//div[@class="left"]/div[@class="sons"]')
        for div in div_list:
            try:
                # Poem title
                title = div.xpath('.//b/text()').get()
                # Author and dynasty
                source = div.xpath('.//p[@class="source"]/a/text()').getall()
                # Author
                author = source[0]
                # Dynasty
                dynasty = source[1]
            except IndexError:
                # Author or dynasty is missing, so this block is not a poem: skip it
                continue
            content_list = div.xpath('.//div[@class="contson"]//text()').getall()
            content_plus = ''.join(content_list).strip()
            # URL of the poem's detail page
            detail_url = div.xpath('.//p/a/@href').get()
            if not detail_url:
                continue
            # The href may be relative, so resolve it against the current page
            detail_url = response.urljoin(detail_url)
            item = ProseItem(title=title, author=author, dynasty=dynasty, content_plus=content_plus, detail_url=detail_url)
            yield scrapy.Request(
                url=detail_url,
                callback=self.parse_detail,
                meta={'prose_item': item}
            )

        # Follow the next-page link; if there is none, this was the last page
        next_url = response.xpath('//a[@id="amore"]/@href').get()
        if next_url:
            next_url = response.urljoin(next_url)
            print(next_url)
            yield scrapy.Request(
                url=next_url,
                callback=self.parse
            )

    # Parse a detail page
    def parse_detail(self, response):
        item = response.meta['prose_item']
        translation = response.xpath('//div[@class="sons"]/div[@class="contyishang"]/p//text()').getall()
        item['translation'] = ''.join(translation).strip()
        yield item

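4. Data structure: items.py

The ProseItem class is defined here so that the spider above can use it. (Note that the spider imports this module; if your IDE cannot resolve the import, you may need to mark the appropriate folder as the source root.)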

import scrapy


class ProseItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Title
    title = scrapy.Field()
    # Author
    author = scrapy.Field()
    # Dynasty
    dynasty = scrapy.Field()
    # Poem text
    content_plus = scrapy.Field()
    # URL of the detail page
    detail_url = scrapy.Field()
    # Translation
    translation = scrapy.Field()
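
As a sketch of an alternative way to model the same record, a standard-library dataclass can carry the six fields with type hints. The class name ProseRecord and the sample values are hypothetical and not part of the original project. The itemadapter library used in the pipeline template also accepts dataclass objects, but the spider would then have to set fields as attributes instead of using item['translation']:

```python
from dataclasses import dataclass, asdict


@dataclass
class ProseRecord:
    title: str
    author: str
    dynasty: str
    content_plus: str
    detail_url: str
    translation: str = ''  # filled in later from the detail page


# Sample values are made up for illustration
record = ProseRecord('静夜思', '李白', '〔唐代〕', '床前明月光', 'https://example.com/poem')
record.translation = 'sample translation'
print(asdict(record))
```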

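5. Pipeline: pipelines.py

The pipeline is where the data storage logic goes. Each item is written to gs.txt as one JSON object per line.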

from itemadapter import ItemAdapter
import json


class ProsePipeline:
    def __init__(self):
        self.f = open('gs.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Convert the item to a dict, then to a JSON string
        item_json = json.dumps(ItemAdapter(item).asdict(), ensure_ascii=False)
        self.f.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        self.f.close()
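
The effect of ensure_ascii=False in the pipeline can be seen in isolation. The following is a minimal sketch with a made-up record; the helper name to_json_line is hypothetical:

```python
import json


def to_json_line(record):
    """Serialize one record as a single line of JSON, keeping Chinese text readable."""
    return json.dumps(record, ensure_ascii=False) + '\n'


sample = {'title': '静夜思', 'author': '李白'}

# With ensure_ascii=False the characters are written as-is
print(to_json_line(sample), end='')
# With the default (ensure_ascii=True) they are written as \uXXXX escapes
print(json.dumps(sample))
```

Both forms decode to the same data with json.loads; the first is simply easier to read when the output file is opened in a text editor.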

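6. Running the program: start.py

Define a small script that runs the crawl command, so the spider can be started from an editor instead of a terminal.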

from scrapy import cmdline

cmdline.execute('scrapy crawl gs'.split())

After the program runs, the data we need is saved in a text file named gs.txt.
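
Because the pipeline writes one JSON object per line, the results can be loaded back with a few lines of Python. This is a minimal sketch; the function names parse_json_lines and load_items are hypothetical, and the file name gs.txt is the one used by the pipeline:

```python
import json


def parse_json_lines(lines):
    """Decode an iterable of JSON lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def load_items(path='gs.txt'):
    """Read the file written by the pipeline and return the scraped items."""
    with open(path, encoding='utf-8') as f:
        return parse_json_lines(f)


# Made-up lines standing in for the contents of gs.txt
demo = ['{"title": "静夜思", "author": "李白"}\n', '\n']
print(parse_json_lines(demo))
```

Calling load_items() after a crawl returns a list with one dict per poem, which is a convenient starting point for checking how many items were collected.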


Article link: https://lsjlt.com/news/118196.html (please cite this link when reposting)
