[Learning Python Crawlers] python

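Install with pip: pip install scrapy
Possible problem:
Problem/solution: error: Microsoft Visual C++ 14.0 is required. (On Windows this usually means the Microsoft C++ build tools are missing.)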
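
A quick way to confirm that the install went through is to ask Python whether the package can be found. This is only a minimal sketch using the standard library; the helper name is_installed is made up for illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located by the import system."""
    return importlib.util.find_spec(module_name) is not None


# Report whether Scrapy is importable in the current environment.
print("scrapy", "OK" if is_installed("scrapy") else "missing")
```

If this prints "missing", the pip step above did not complete in the interpreter you are running.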

Example demo (tutorial / Chinese tutorial documentation)
Step 1: Create the project directory

scrapy startproject tutorial

Step 2: Enter the tutorial directory and create the spider

scrapy genspider baidu www.baidu.com

Step 3: Create the item container: copy items.py in the project and rename the copy to BaiduItems.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class BaiduItems(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

Step 4: Edit spiders/baidu.py to extract data with XPath

# -*- coding: utf-8 -*-
import scrapy
# Import the item container
from tutorial.BaiduItems import BaiduItems

class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['www.readingbar.net']
    start_urls = ['http://www.readingbar.net/']

    def parse(self, response):
        # One item per <li> found inside any <ul>
        for sel in response.xpath('//ul/li'):
            item = BaiduItems()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
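
To see what the three expressions in parse pick out, the same idea can be tried on a small snippet without running a crawl. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in, not Scrapy's selectors, and the sample markup and the helper name extract_items are invented for illustration. Each field ends up as a list, which mirrors the fact that .extract() returns a list of strings.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not taken from any real site).
SAMPLE = """
<html><body>
  <ul>
    <li><a href="/book/1">Book One</a> first description</li>
    <li><a href="/book/2">Book Two</a> second description</li>
  </ul>
</body></html>
"""


def extract_items(markup):
    """Mimic the spider's three XPath expressions with ElementTree."""
    root = ET.fromstring(markup.strip())
    items = []
    for li in root.findall(".//ul/li"):          # counterpart of //ul/li
        links = li.findall("a")                  # direct <a> children
        # Text nodes owned by <li> itself: text before the first child
        # plus the text trailing each child element.
        own_text = [li.text] + [child.tail for child in li]
        items.append({
            "title": [a.text for a in links if a.text],               # a/text()
            "link": [a.get("href") for a in links if a.get("href")],  # a/@href
            "desc": [t for t in own_text if t and t.strip()],         # text()
        })
    return items


for item in extract_items(SAMPLE):
    print(item)
```

Because every value is a list, the exported file will contain lists as well; taking the first element is a common follow-up when a single string is wanted.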

Step 5: Fix the blank result when crawling the Baidu homepage by editing settings.py

# Set the user agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'

# Ignore robots.txt so its rules no longer block requests (shown in the debug log)
ROBOTSTXT_OBEY = False
# Export feeds as UTF-8 so that saved data is not garbled
FEED_EXPORT_ENCODING = 'utf-8'
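
The "garbled" output that FEED_EXPORT_ENCODING addresses is usually not corruption: when the setting is left unset, Scrapy's JSON feed writes non-ASCII characters as \uXXXX escapes. The sketch below reproduces the same effect with the standard library's json module. It illustrates the escaping only and is not Scrapy's exporter code; the sample title is made up.

```python
import json

# Hypothetical item, shaped like what the spider yields.
item = {"title": ["百度一下，你就知道"]}

# Default behaviour: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(item)
# With ensure_ascii=False the characters are kept as-is.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data; only the readability of the saved file differs.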

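Final step: run the crawl command and save the data to the specified file
Running it may fail with: No module named 'win32api'. That module is part of the pywin32 package; download and install the version that matches your Python.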

scrapy crawl baidu -o baidu.json
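
Once the crawl has finished, the exported file can be checked from Python. The following is a minimal sketch that assumes the output was saved as baidu.json in the current directory; the helper names load_items and first_values are hypothetical.

```python
import json
from pathlib import Path


def load_items(path):
    """Return the list of item dicts written by the JSON feed export."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def first_values(item):
    """Collapse the lists produced by .extract() to their first string."""
    return {key: (value[0].strip() if value else "") for key, value in item.items()}


# Only runs when an export is present in the current directory.
if Path("baidu.json").exists():
    for item in load_items("baidu.json"):
        print(first_values(item))
```

Note that -o appends to an existing file, which leaves a .json export invalid after a second run, so delete the old file before crawling again.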

Deep crawl of the Baidu homepage and the pages linked from its navigation menu

# -*- coding: utf-8 -*-
import scrapy

# Replace scrapyProject with your own project package (tutorial in the steps above)
from scrapyProject.BaiduItems import BaiduItems

class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    # The nav tabs point to other domains; list them here or their requests are filtered out
    allowed_domains = [
        'www.baidu.com',
        'v.baidu.com',
        'map.baidu.com',
        'news.baidu.com',
        'tieba.baidu.com',
        'xueshu.baidu.com'
    ]
    start_urls = ['https://www.baidu.com/']

    def parse(self, response):
        item = BaiduItems()
        item['title'] = response.xpath('//title/text()').extract()
        yield item
        for sel in response.xpath('//a[@class="mnav"]'):
            # BaiduItems must also declare nav and href fields, or these assignments raise KeyError
            item = BaiduItems()
            item['nav'] = sel.xpath('text()').extract()
            item['href'] = sel.xpath('@href').extract()
            yield item
            # Build a new request from the extracted nav URL and run the callback on its response
            if item['href']:
                yield scrapy.Request(response.urljoin(item['href'][0]), callback=self.parse_newpage)

    # Deep extraction: the page title of each tab
    def parse_newpage(self, response):
        item = BaiduItems()
        item['title'] = response.xpath('//title/text()').extract()
        yield item
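
Two details of the deep crawl are easy to check in isolation: nav links may be relative or protocol-relative and need to be resolved against the page URL, and a request is only followed when its host matches an entry in allowed_domains or is a subdomain of one. The sketch below approximates both with the standard library. The helper names absolute and is_allowed are invented, and the matching rule is a simplified approximation of Scrapy's offsite filtering, not its actual code.

```python
from urllib.parse import urljoin, urlparse


def absolute(base_url, href):
    """Resolve a possibly relative or protocol-relative href against the page URL."""
    return urljoin(base_url, href)


def is_allowed(url, allowed_domains):
    """Approximate the offsite check: exact host match or a subdomain of an entry."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


page = "https://www.baidu.com/"
for href in ("//news.baidu.com", "/more/", "https://tieba.baidu.com/index.html"):
    url = absolute(page, href)
    print(url, is_allowed(url, ["www.baidu.com", "news.baidu.com"]))
```

Under this rule, listing only www.baidu.com does not cover tieba.baidu.com, which is why the spider enumerates each tab's host; a single baidu.com entry would match all of them.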

Crawling past a login
a. Solving image CAPTCHAs: pytesseract
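
As a rough idea of how pytesseract fits in, the sketch below reads a CAPTCHA image and keeps only letters and digits from the OCR result. It assumes the pytesseract and Pillow packages plus the Tesseract binary are installed; the helper names and the image path are hypothetical, and real CAPTCHAs often need more preprocessing than a grayscale conversion.

```python
import re


def clean_captcha_text(raw):
    """Keep only ASCII letters and digits; OCR output often carries spaces and newlines."""
    return re.sub(r"[^0-9A-Za-z]", "", raw)


def solve_captcha(image_path):
    """Run OCR on an image file and return the cleaned text."""
    # Imported here so the helper above works without the OCR dependencies.
    from PIL import Image
    import pytesseract

    image = Image.open(image_path).convert("L")  # grayscale tends to help OCR
    return clean_captcha_text(pytesseract.image_to_string(image))


# Example call (captcha.png is a placeholder path):
# print(solve_captcha("captcha.png"))
```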
