首页 > 资讯 > 后端开发 > Python >python调用mrjob实现hadoo

810

分享到

python调用mrjob实现hadoo

python mrjob hadoo 2023-01-31 06:01:14 810人浏览八月长安

Python 官方文档：入门教程 => 点击学习

摘要

咱们一般写mapReduce是通过java和streaming来写的，身为pythoner的我，java不会，没办法就用streaming来写mapreduce日志分析。这里要介绍一个模块，是基于streaming搞的东西。mrjob 可

咱们一般写mapReduce是通过java和streaming来写的，身为pythoner的我，

java不会，没办法就用streaming来写mapreduce日志分析。这里要介绍一个

模块，是基于streaming搞的东西。

mrjob 可以让用 Python 来编写 MapReduce 运算，并在多个不同平台上运行，你可以：

使用纯 Python 编写多步的 MapReduce 作业
在本机上进行测试
在 hadoop 集群上运行

pip 的安装方法：

pip install mrjob

我测试的脚本

#coding:utf-8
from mrjob.job import MRJob
import re
#xiaorui.cc
#Word_RE = re.compile(r"[\w']+")
WORD_RE = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")
class MRWordFreqCount(MRJob):
    def mapper(self, word, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1
    def combiner(self, word, counts):
        yield word, sum(counts)
    def reducer(self, word, counts):
        yield word, sum(counts)
if __name__ == '__main__':
    MRWordFreqCount.run()

用法算简单：

python i.py -r inline input1 input2 input3 > out 命令可以将处理多个文件的结果输出到out文件里面。

本地模拟hadoop运行：python 1.py -r local <input> output

这个会把结果输出到output里面，这个output必须写。

hadoop集群上运行：python 1.py -r hadoop <input> output

执行脚本～

[root@kspc ~]# python mo.py -r local  <10.7.17.7-dnsquery.log.1> output
no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/mo.root.20131224.040935.241241
reading from STDIN
writing to /tmp/mo.root.20131224.040935.241241/step-0-mapper_part-00000
> /usr/bin/python mo.py --step-num=0 --mapper /tmp/mo.root.20131224.040935.241241/input_part-00000 | sort | /usr/bin/python mo.py --step-num=0 --combiner > /tmp/mo.root.20131224.040935.241241/step-0-mapper_part-00000
writing to /tmp/mo.root.20131224.040935.241241/step-0-mapper_part-00001
> /usr/bin/python mo.py --step-num=0 --mapper /tmp/mo.root.20131224.040935.241241/input_part-00001 | sort | /usr/bin/python mo.py --step-num=0 --combiner > /tmp/mo.root.20131224.040935.241241/step-0-mapper_part-00001
Counters from step 1:
  (no counters found)
writing to /tmp/mo.root.20131224.040935.241241/step-0-mapper-sorted
> sort /tmp/mo.root.20131224.040935.241241/step-0-mapper_part-00000 /tmp/mo.root.20131224.040935.241241/step-0-mapper_part-00001
writing to /tmp/mo.root.20131224.040935.241241/step-0-reducer_part-00000
> /usr/bin/python mo.py --step-num=0 --reducer /tmp/mo.root.20131224.040935.241241/input_part-00000 > /tmp/mo.root.20131224.040935.241241/step-0-reducer_part-00000
writing to /tmp/mo.root.20131224.040935.241241/step-0-reducer_part-00001
> /usr/bin/python mo.py --step-num=0 --reducer /tmp/mo.root.20131224.040935.241241/input_part-00001 > /tmp/mo.root.20131224.040935.241241/step-0-reducer_part-00001
Counters from step 1:
  (no counters found)
Moving /tmp/mo.root.20131224.040935.241241/step-0-reducer_part-00000 -> /tmp/mo.root.20131224.040935.241241/output/part-00000
Moving /tmp/mo.root.20131224.040935.241241/step-0-reducer_part-00001 -> /tmp/mo.root.20131224.040935.241241/output/part-00001
Streaming final output from /tmp/mo.root.20131224.040935.241241/output
removing tmp directory /tmp/mo.root.20131224.040935.241241

执行的时候，资源的占用情况。

发现一个很奇妙的东西，mrjob居然调用shell下的sort来排序。。。。

为了更好的理解mrjob的用法，再来个例子。

from mrjob.job import MRJob
#from xiaorui.cc
class MRWordFrequencyCount(MRJob):
#把东西拼凑起来
    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1
#总结kv
    def reducer(self, key, values):
        yield key, sum(values)
if __name__ == '__main__':
    MRWordFrequencyCount.run()

看下结果：

下面是官网给的一些个用法：

我们可以看到他是支持hdfs和s3存储的！

Running your job different ways

The most basic way to run your job is on the command line:

$ python my_job.py input.txt

By default, output will be written to stdout.

You can pass input via stdin, but be aware that mrjob will just dump it to a file first:

$ python my_job.py < input.txt

You can pass multiple input files, mixed with stdin (using the - character):

$ python my_job.py input1.txt input2.txt - < input3.txt

By default, mrjob will run your job in a single Python process. This provides the friendliest debugging experience, but it’s not exactly distributed computing!

You change the way the job is run with the -r/--runner option. You can use -rinline (the default), -rlocal, -rhadoop, or -remr.

To run your job in multiple subprocesses with a few Hadoop features simulated, use -rlocal.

To run it on your Hadoop cluster, use -rhadoop.

If you have Elastic MapReduce configured (see Elastic MapReduce Quickstart), you can run it there with -remr.

Your input files can come from HDFS if you’re using Hadoop, or S3 if you’re using EMR:

$ python my_job.py -r emr s3://my-inputs/input.txt
$ python my_job.py -r hadoop hdfs://my_home/input.txt

您可能感兴趣的文档:

--结束END--

本文标题: python调用mrjob实现hadoo

本文链接: https://lsjlt.com/news/190529.html(转载时请注明来源链接)

有问题或投稿请发送至: 邮箱/279061341@qq.com QQ/279061341

回答

如何调试操作系统的错误？
操作系统

2023-11-15发布

回答

操作系统中的I/O系统是如何实现的？
操作系统

2023-11-15发布

回答

如何实现操作系统的内存管理？
操作系统

2023-11-15发布

回答

什么是虚拟内存，它对操作系统有什么影响？
操作系统

2023-11-15发布

回答

ASP中的MVC架构和WebForms架构有什么区别和使用场景？
ASP.NET

2023-11-15发布

回答

ASP中的数据验证和数据校验有什么不同？
ASP.NET

2023-11-15发布

回答

ASP中的ADO对象和DAO对象有什么区别和使用方法？
ASP.NET

2023-11-15发布

回答

Node.js中的包管理器NPM是什么？如何使用它进行依赖管理？
node.js

2023-11-15发布

回答

Vue.js中的动态组件是什么？如何使用它来动态渲染组件？
VUE

2023-11-15发布

回答

如何使用Vue.js实现懒加载和预加载？
VUE

2023-11-15发布

python调用mrjob实现hadoo

Running your job different ways

python调用mrjob实现hadoo

Python Web 实现Ajax调用

用python实现调用jar包

实现正确实现Python调用jar包

C#使用IronPython调用Python的实现

python 与c++相互调用实现

Python调用Pandas实现Excel读取

Python怎么实现链式调用

Python 实现异步调用函数

python调用wcf服务实现网

Python如何实现链式调用

python中如何实现链式调用

python函数递归调用的实现

Java中调用Python的实现示例

python在windows调用svn-pysvn的实现

Python中怎么实现调试器调试

python如何实现链式函数调用

python如何实现API的调用详解

python中实现链式调用的案例

python调用excel_vba的两种实现方式

python分析数据的方法是什么

如何使用Python实现抽奖小程序

python copy函数的作用是什么

python ffmpeg模块怎么安装和使用

python进程池创建队列的方法是什么

python无法运行文件的原因有哪些

python can't open file报错怎么解决

python keyerror错误怎么解决

python字符串处理与应用的方法有哪些

python全局变量如何定义