
Configuring a Selenium Crawler on a Linux Server

Tags: server, linux, selenium | 2023-12-23 20:12:26 | 533 views | 安东尼


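Configuring a Selenium crawler with a proxy IP on a Linux server

Running a crawler on a Linux server can sometimes work wonders, but setting up the crawler environment and a proxy IP on a Linux server (that is, a machine with no graphical interface) differs considerably from the usual Windows setup. This article records the author's experience configuring Selenium and Google Chrome on a Linux server and running a crawler through a proxy IP. It gives solutions for the pitfalls the author ran into, for readers' reference.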

1. Base environment

Operating system: Ubuntu 20.04

Python: 3.7

Proxy IP: Clash (for configuring Clash on Linux, see the article "Linux服务器基于代理IP的爬虫" on 西南小游侠's CSDN blog)

2. Installing and using Google Chrome

First, working as root, download Chrome directly from Google's site and install it:

sudo apt-get install libxss1 libappindicator1 libindicator7
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome*.deb
sudo apt-get install -f

After installation, the launcher script has to be modified so that Chrome can run under root. Open the file /opt/google/chrome/google-chrome and find the command:

exec -a "$0" "$HERE/chrome" "$@"

Append flags to the end of it so that it becomes:

exec -a "$0" "$HERE/chrome" "$@" --user-data-dir --no-sandbox
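The same edit can be scripted instead of made by hand. The following is only a sketch: the names `EXTRA_FLAGS`, `EXEC_LINE`, and `patch_launcher` are hypothetical, and it assumes the launcher contains the exec line exactly as quoted here.

```python
import re

# Flags that this article appends to the launcher's exec line.
EXTRA_FLAGS = "--user-data-dir --no-sandbox"

# Matches only an unpatched exec line, i.e. one that ends right after "$@".
EXEC_LINE = re.compile(
    r'^([ \t]*exec -a "\$0" "\$HERE/chrome" "\$@")[ \t]*$',
    re.MULTILINE,
)


def patch_launcher(script_text: str) -> str:
    """Return script_text with EXTRA_FLAGS appended to the exec line.

    Text that is already patched is returned unchanged, because the
    pattern does not match a line with anything after "$@".
    """
    return EXEC_LINE.sub(lambda m: f"{m.group(1)} {EXTRA_FLAGS}", script_text)
```

To apply it, read /opt/google/chrome/google-chrome as root, pass its contents through `patch_launcher`, and write the result back. Keeping a backup copy of the original file first is a sensible precaution.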

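Next, test whether Chrome can be used with the following command:

google-chrome --headless --remote-debugging-port=9222 https://chromium.org --disable-gpu

If the command runs and produces output rather than an error, Chrome is usable.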
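The check can also be run from Python. This is a minimal sketch, not the author's method: it swaps `--remote-debugging-port` for Chrome's `--dump-dom` flag so that the process prints the page's DOM and exits, and the function names are hypothetical.

```python
import subprocess


def build_headless_command(url: str, binary: str = "google-chrome") -> list:
    """Build a headless Chrome command that prints the page's DOM to stdout."""
    return [binary, "--headless", "--disable-gpu", "--dump-dom", url]


def chrome_can_render(url: str, binary: str = "google-chrome",
                      timeout: float = 60.0) -> bool:
    """Return True if headless Chrome exits cleanly and prints some HTML."""
    try:
        result = subprocess.run(
            build_headless_command(url, binary),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, not executable, or Chrome hung past the timeout.
        return False
    return result.returncode == 0 and b"<html" in result.stdout.lower()
```

Calling `chrome_can_render("https://chromium.org")` returns a plain True or False, which is convenient in a deployment script.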

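Next, some file permissions have to be changed so that non-root accounts can use Chrome as well.

Log in to a non-root account and run the command above. It fails with an error:

[Screenshot: error message]

The error occurs because starting Chrome needs to modify the /tmp/Crashpad folder, and this account has no permission to do so. The fix is to change the permissions of /tmp/Crashpad. As root, run: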

chmod -R 777 /tmp/Crashpad

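Run Chrome again, and another error appears:

[Screenshot: error message]

The fix is the same: change the folder permissions. Here the author simply opened up the whole /opt folder: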

chmod -R 777 /opt

Run Chrome once more, and it now starts successfully.
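Before changing permissions, it can help to see which directories the current account actually cannot modify. The sketch below only reports and changes nothing. The function name `unwritable_dirs` is hypothetical, and the two paths are the ones named in this article.

```python
import os


def unwritable_dirs(paths):
    """Return the existing directories in paths that the current user cannot modify."""
    blocked = []
    for path in paths:
        # Creating files inside a directory needs both write and execute permission.
        if os.path.isdir(path) and not os.access(path, os.W_OK | os.X_OK):
            blocked.append(path)
    return blocked


if __name__ == "__main__":
    # Paths named in this article; run this as the non-root account.
    for path in unwritable_dirs(["/tmp/Crashpad", "/opt"]):
        print("no write access:", path)
```

Directories that do not exist yet are skipped rather than reported, since Chrome may still be able to create them.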

3. Installing and using Selenium

First, install the WebDriver (ChromeDriver) that matches your installed Chrome version. Start by checking the Chrome version:

google-chrome --version

Then, based on the Chrome version number, download the matching WebDriver from CNPM Binaries Mirror (npmmirror.com) and extract it to a directory of your choice.
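Picking the right driver comes down to reading the version number out of the `google-chrome --version` output. A small sketch of that step follows. The function names are hypothetical, and the version string in the usage note is only an illustrative example.

```python
import re
import subprocess


def parse_chrome_version(version_output: str) -> str:
    """Extract the dotted version number from `google-chrome --version` output."""
    match = re.search(r"(\d+(?:\.\d+)+)", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return match.group(1)


def major_version(version: str) -> int:
    """Return the first component of a dotted version string as an int."""
    return int(version.split(".")[0])


def installed_chrome_version(binary: str = "google-chrome") -> str:
    """Ask the installed Chrome binary for its version."""
    completed = subprocess.run(
        [binary, "--version"], stdout=subprocess.PIPE, check=True
    )
    return parse_chrome_version(completed.stdout.decode())
```

For an output such as "Google Chrome 114.0.5735.90", `parse_chrome_version` gives "114.0.5735.90" and `major_version` gives 114. The downloaded driver should carry the same major version.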

Next, install the third-party Python library selenium, which can be done directly with conda:

conda install selenium

A Selenium-based crawler can then be run from code. Here is an example:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Selenium 3-style call; replace 'path_for_webdriver' with the path to the extracted driver
wd = webdriver.Chrome(executable_path='path_for_webdriver')
wd.get("https://www.baidu.com")
content = wd.page_source
url = wd.current_url
print(url)
print(content)
wd.quit()

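However, running this code as it stands fails with an error:

[Screenshot: error message]

This error is fairly common. Searching online shows that Chrome needs extra startup arguments, but different systems need different ones. Some need only one to three, while the author's system would not start until all five of the following were added:

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument("--remote-debugging-port=9222")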

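With these arguments added, the crawler runs successfully. The complete example is:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
# Headless mode; required on a Linux server without a GUI
chrome_options.add_argument('--headless')
# Do not use the GPU; some machines do not support it
chrome_options.add_argument('--disable-gpu')
# Required for Chrome to run (as root)
chrome_options.add_argument('--no-sandbox')
# Use /tmp instead of /dev/shm for shared memory
chrome_options.add_argument('--disable-dev-shm-usage')
# Fixed DevTools port; the author's machine would not start without the last two flags
chrome_options.add_argument("--remote-debugging-port=9222")
# Optional: a browser-like user agent to make the crawler less recognizable
chrome_options.add_argument("--user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36")

# Selenium 3-style call; replace 'path_for_webdriver' with the path to the extracted driver
wd = webdriver.Chrome(chrome_options=chrome_options, executable_path='path_for_webdriver')
wd.get("https://www.baidu.com")
content = wd.page_source
url = wd.current_url
print(url)
print(content)
wd.quit()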
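The example in this section uses the Selenium 3 keyword arguments `executable_path` and `chrome_options`, which Selenium 4.10 and later no longer accept. For readers on a newer Selenium, a rough equivalent using the `Service` and `options=` arguments might look like the sketch below. The names `CHROME_FLAGS`, `make_driver`, and `fetch` are hypothetical.

```python
# Flags from the article's example, kept in one place so they can be reused.
CHROME_FLAGS = [
    "--headless",
    "--disable-gpu",
    "--no-sandbox",
    "--disable-dev-shm-usage",
    "--remote-debugging-port=9222",
]


def make_driver(driver_path: str, extra_flags=()):
    """Create a headless Chrome driver using the Selenium 4 Service/options API."""
    # Imported here so the flag list can be inspected without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    for flag in list(CHROME_FLAGS) + list(extra_flags):
        options.add_argument(flag)
    service = Service(executable_path=driver_path)
    return webdriver.Chrome(service=service, options=options)


def fetch(url: str, driver_path: str):
    """Return (final_url, page_source), always closing the browser."""
    driver = make_driver(driver_path)
    try:
        driver.get(url)
        return driver.current_url, driver.page_source
    finally:
        driver.quit()
```

Wrapping the page load in try/finally means the Chrome process is closed even when the request fails, which matters on a server where stray browser processes accumulate.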

4. A Selenium crawler using a proxy IP

The previous article already set up a Clash-based proxy IP, so all that remains is to add one argument to the crawler code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument("--remote-debugging-port=9222")
# Added line: route traffic through the local Clash proxy
chrome_options.add_argument('--proxy-server=http://127.0.0.1:7890')
chrome_options.add_argument("--user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36")

# Selenium 3-style call; replace 'path_for_webdriver' with the path to the extracted driver
wd = webdriver.Chrome(chrome_options=chrome_options, executable_path='path_for_webdriver')
try:
    wd.get("https://www.google.com")
    content = wd.page_source
    url = wd.current_url
    print(url)
    print(content)
finally:
    # Always close the browser, even if the request fails
    wd.quit()

Run the code, and Google can now be reached successfully.

Source: https://blog.csdn.net/UIBE_day_day_up/article/details/128989600
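If the page fails to load, one quick thing to rule out is that nothing is listening on the proxy port. The sketch below checks that before the browser is launched. It assumes the proxy listens on 127.0.0.1:7890, the address used in the code of this section, and the function names are hypothetical.

```python
import socket


def proxy_is_listening(host: str = "127.0.0.1", port: int = 7890,
                       timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


def proxy_argument(host: str = "127.0.0.1", port: int = 7890) -> str:
    """Build the Chrome --proxy-server flag for an HTTP proxy."""
    return f"--proxy-server=http://{host}:{port}"
```

A crawler script could call `proxy_is_listening()` first and only add `proxy_argument()` to the Chrome options when it returns True. Note that this confirms only that the port is open, not that the proxy can actually reach the target site.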
