scrapy-redis: A Distributed Bilibili Crawler

Tags: distributed, scrapy, redis | 2023-01-31 | 八月长安


Abstract

In Scrapy, every request URL gets a fingerprint, and that fingerprint is what decides whether a URL has already been requested. Fingerprinting is on by default, so each URL is fetched only once. When we crawl from several machines in a distributed setup, we still need such a fingerprint so the spiders do not collect duplicate data; but Scrapy keeps its default fingerprints locally on each machine. We can therefore store the fingerprints in Redis instead, and use a Redis set to test whether a request is a duplicate.
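
To make the idea concrete, here is a minimal sketch of Redis-set deduplication using the redis-py client. The key name 'bilibiliapp:dupefilter' follows scrapy-redis's default '<spider>:dupefilter' naming, but the SHA-1-of-URL fingerprint is a simplification of what RFPDupeFilter really computes (it also hashes the method and body):

import hashlib

import redis

r = redis.StrictRedis(host='127.0.0.1', port=6379)

def request_seen(url):
    # Reduce the URL to a fixed-size fingerprint.
    fp = hashlib.sha1(url.encode('utf-8')).hexdigest()
    # SADD returns 1 if the fingerprint was new, 0 if it was already in the set.
    return r.sadd('bilibiliapp:dupefilter', fp) == 0

print(request_seen('https://api.bilibili.com/x/relation/stat?vmid=1'))  # False: first time
print(request_seen('https://api.bilibili.com/x/relation/stat?vmid=1'))  # True: duplicate

Because the set lives in Redis rather than on one machine's disk, every crawler process pointed at the same Redis instance shares the same notion of "already requested".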

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for bilibili project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'bilibili'

SPIDER_MODULES = ['bilibili.spiders']
NEWSPIDER_MODULE = 'bilibili.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'bilibili (+http://www.yourdomain.com)'

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'bilibili.middlewares.BilibiliSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'bilibili.middlewares.BilibiliDownloaderMiddleware': 543,
    'bilibili.middlewares.randomUserAgentMiddleware': 400,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'bilibili.pipelines.BilibiliPipeline': 300,
    # RedisPipeline serializes every item into a Redis list so other
    # processes can consume them; give it a priority distinct from the
    # pipeline above.
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# scrapy-redis configuration: every machine pointed at the same Redis
# instance shares one request queue and one fingerprint set.
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
REDIS_URL = 'redis://@127.0.0.1:6379'
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
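
With these settings, scrapy-redis keeps all shared state in Redis under keys derived from the spider name. A quick redis-py sketch to watch the crawl from any machine; the key names assume scrapy-redis's default '%(spider)s:requests' / '%(spider)s:dupefilter' / '%(spider)s:items' scheme:

import redis

r = redis.StrictRedis.from_url('redis://@127.0.0.1:6379')
# PriorityQueue stores pending requests in a sorted set.
print(r.zcard('bilibiliapp:requests'))
# RFPDupeFilter stores request fingerprints in a set.
print(r.scard('bilibiliapp:dupefilter'))
# RedisPipeline pushes serialized items onto a list.
print(r.llen('bilibiliapp:items'))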

spider.py

# -*- coding: utf-8 -*-
import json
import re

import scrapy
from bilibili.items import BilibiliItem

class BilibiliappSpider(scrapy.Spider):
    name = 'bilibiliapp'
    # allowed_domains = ['www.bilibili.com']
    # start_urls = ['http://www.bilibili.com/']
    def start_requests(self):
        # Walk through user ids 1..299 and request each user's follower stats.
        for i in range(1, 300):
            url = 'https://api.bilibili.com/x/relation/stat?vmid={}&jsonp=jsonp&callback=__jp3'.format(i)
            url_ajax = 'https://space.bilibili.com/{}/'.format(i)
            # For a GET request, use scrapy.Request(url=..., callback=...).
            req = scrapy.Request(url=url, callback=self.parse, meta={'id': i})
            # Send the user's space page as the referer header.
            req.headers['referer'] = url_ajax
            yield req


    def parse(self, response):
        # The response is JSONP (__jp3({...})); strip the callback wrapper
        # and keep only the JSON object inside.
        comm = re.compile(r'({.*})')
        text = re.findall(comm, response.text)[0]
        data = json.loads(text)
        follower = data['data']['follower']
        following = data['data']['following']
        id = response.meta.get('id')
        # Next, fetch the list of videos the user has submitted.
        url = 'https://space.bilibili.com/ajax/member/getSubmitVideos?mid={}&page=1&pagesize=25'.format(id)
        yield scrapy.Request(url=url, callback=self.getsubmit, meta={
            'id': id,
            'follower': follower,
            'following': following
        })

    def getsubmit(self, response):
        data = json.loads(response.text)
        # 'tlist' is a dict of the categories the user has posted in,
        # or an empty list when the user has no submissions.
        tlist = data['data']['tlist']
        tlist_list = []
        if tlist != []:
            for t in tlist.values():
                tlist_list.append(t['name'])
        else:
            tlist_list = ['无爱好']  # '无爱好' means "no interests"
        follower = response.meta.get('follower')
        following = response.meta.get('following')
        id = response.meta.get('id')
        # Finally, fetch the user's profile (name, sex, level, birthday).
        url = 'https://api.bilibili.com/x/space/acc/info?mid={}&jsonp=jsonp'.format(id)
        yield scrapy.Request(url=url, callback=self.space, meta={
            'id': id,
            'follower': follower,
            'following': following,
            'tlist_list': tlist_list
        })

    def space(self, response):
        data = json.loads(response.text)
        name = data['data']['name']
        sex = data['data']['sex']
        level = data['data']['level']
        birthday = data['data']['birthday']
        tlist_list = response.meta.get('tlist_list')
        # Map each Bilibili category name to its item field and emit a 0/1
        # flag per category, instead of fourteen hand-written elif branches.
        categories = {
            '动画': 'animation', '生活': 'Life', '音乐': 'Music',
            '游戏': 'Game', '舞蹈': 'Dance', '纪录片': 'Documentary',
            '鬼畜': 'Ghost', '科技': 'science', '番剧': 'Opera',
            '娱乐': 'entertainment', '影视': 'Movies', '国创': 'National',
            '数码': 'Digital', '时尚': 'fashion',
        }
        item = BilibiliItem()
        item['name'] = name
        item['sex'] = sex
        item['level'] = level
        item['birthday'] = birthday
        item['follower'] = response.meta.get('follower')
        item['following'] = response.meta.get('following')
        for field in categories.values():
            item[field] = 0
        for tlist in tlist_list:
            if tlist in categories:
                item[categories[tlist]] = 1
        yield item
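
The spider imports BilibiliItem from bilibili.items, but the original post does not show items.py. A minimal version consistent with every field assigned above would look like this (a sketch, not necessarily the author's file):

# items.py — one Field per value the spider assigns.
import scrapy

class BilibiliItem(scrapy.Item):
    name = scrapy.Field()
    sex = scrapy.Field()
    level = scrapy.Field()
    birthday = scrapy.Field()
    follower = scrapy.Field()
    following = scrapy.Field()
    # One 0/1 flag per Bilibili submission category.
    animation = scrapy.Field()
    Life = scrapy.Field()
    Music = scrapy.Field()
    Game = scrapy.Field()
    Dance = scrapy.Field()
    Documentary = scrapy.Field()
    Ghost = scrapy.Field()
    science = scrapy.Field()
    Opera = scrapy.Field()
    entertainment = scrapy.Field()
    Movies = scrapy.Field()
    National = scrapy.Field()
    Digital = scrapy.Field()
    fashion = scrapy.Field()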

Setting up a User-Agent pool

import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class randomUserAgentMiddleware(UserAgentMiddleware):

    # Pool of desktop browser user agents to rotate through. Note: inside
    # brackets no line-continuation backslashes are needed, and the original
    # was missing the comma after the first entry, which silently glued the
    # first two strings together.
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Pick a random UA from the pool for every outgoing request.
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)
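
Similarly, bilibili.pipelines.BilibiliPipeline is enabled in the settings but never shown. A minimal placeholder that appends each item to a JSON-lines file would be (again a sketch; the author's real pipeline may persist items elsewhere):

# pipelines.py — illustrative only; the output filename is made up.
import json

class BilibiliPipeline(object):

    def open_spider(self, spider):
        self.file = open('bilibili.jsonl', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()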

Git repository: https://github.com/18370652038/scrapy-bilibili
