Version: 0.7.0
Welcome to Pydoop, a package that provides a Python API for Hadoop MapReduce and HDFS. Pydoop has several advantages [1] over Hadoop's built-in solutions for Python programming, i.e., Hadoop Streaming and Jython. Being a CPython package, it gives you access to all standard library and third-party modules, some of which may not be available for other Python implementations (e.g., NumPy). In addition, Pydoop provides a Python HDFS API which, to the best of our knowledge, is not available in other solutions.
In addition to its MapReduce and HDFS APIs, Pydoop provides a solution for easy Hadoop scripting that lets you work in a way similar to Dumbo. This mechanism lowers the programming effort to the point that you may find yourself writing simple three-line, throw-away Hadoop scripts!
For simple tasks such as word counting, for instance, your code would look like this:
def mapper(k, text, writer):
    for word in text.split():
        writer.emit(word, 1)

def reducer(word, count, writer):
    writer.emit(word, sum(map(int, count)))
See the Pydoop Script page for details.
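The same three-argument mapper/reducer signature scales to slightly richer logic with no extra boilerplate. As an illustrative sketch (the case normalization and length filter are our own additions, not part of the original example):

def mapper(k, text, writer):
    for word in text.split():
        word = word.lower()  # case-insensitive counting (our addition)
        if len(word) >= 3:  # skip very short words (our addition)
            writer.emit(word, 1)

def reducer(word, counts, writer):
    writer.emit(word, sum(map(int, counts)))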
For more complex applications, you can use the full API. Here is how a word count would look:
from pydoop.pipes import Mapper, Reducer, Factory, runTask

class WordCountMapper(Mapper):

    def __init__(self, context):
        super(WordCountMapper, self).__init__(context)
        context.setStatus("initializing")
        self.input_words = context.getCounter("WORDCOUNT", "INPUT_WORDS")

    def map(self, context):
        words = context.getInputValue().split()
        for w in words:
            context.emit(w, "1")
        context.incrementCounter(self.input_words, len(words))

class WordCountReducer(Reducer):

    def reduce(self, context):
        s = 0
        while context.nextValue():
            s += int(context.getInputValue())
        context.emit(context.getInputKey(), str(s))

runTask(Factory(WordCountMapper, WordCountReducer))
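Since summing word counts is associative and commutative, the reducer above can also double as a combiner to cut down on intermediate data. A minimal sketch, assuming Factory accepts a combiner_class keyword argument (it mirrors the C++ pipes TemplateFactory):

# reuse the reducer as a combiner (assumes the combiner_class keyword)
runTask(Factory(WordCountMapper, WordCountReducer,
                combiner_class=WordCountReducer))

Applications written against the full API run as Hadoop Pipes tasks; see the tutorial for the submission procedure.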
Pydoop includes a high-level HDFS API that simplifies common tasks such as copying files and directories and navigating the file system. Here is a brief snippet that shows some of these functionalities:
>>> import pydoop.hdfs as hdfs
>>> hdfs.mkdir("test")
>>> hdfs.dump("hello", "test/hello.txt")
>>> hdfs.cp("test", "test.copy")
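Reading the data back is just as concise. A short continuation of the session above, assuming the load (read a whole file into a string) and rmr (recursive remove) helpers behave as their names suggest:

>>> hdfs.load("test/hello.txt")
'hello'
>>> hdfs.rmr("test.copy")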
See the tutorial for more examples.
Pydoop is developed and maintained by researchers at CRS4 – Distributed Computing group. If you use Pydoop as part of your research work, please cite the HPDC 2010 paper.
Footnotes
[1] Simone Leo and Gianluigi Zanetti. Pydoop: a Python MapReduce and HDFS API for Hadoop. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pages 819–825, 2010.