
Pydoop Documentation

Version: 0.7.0

Welcome to Pydoop. Pydoop is a package that provides a Python API for Hadoop MapReduce and HDFS. Pydoop has several advantages [1] over Hadoop's built-in solutions for Python programming, i.e., Hadoop Streaming and Jython: being a CPython package, it gives you access to the full standard library and to third-party modules, some of which (e.g., NumPy) may not be available for other Python implementations; in addition, Pydoop provides a Python HDFS API which, to the best of our knowledge, is not available in other solutions.

Easy Hadoop Scripting

In addition to its MapReduce and HDFS API, Pydoop provides a solution for easy Hadoop scripting that lets you work in a way similar to Dumbo. This mechanism lowers the programming effort to the point that you may find yourself writing simple three-line throw-away Hadoop scripts.

For a simple task such as word counting, your code would look like this:

def mapper(k, text, writer):
  # emit each word of the input line with a count of 1
  for word in text.split():
    writer.emit(word, 1)

def reducer(word, count, writer):
  # sum all the counts emitted for this word
  writer.emit(word, sum(map(int, count)))

See the Pydoop Script page for details.
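The only interface these functions rely on is the writer object's emit(key, value) method, plus the fact that the framework hands the reducer each key together with the values emitted for it. The following is a purely illustrative sketch of how you could exercise the two functions outside Hadoop; the ListWriter class and the manual grouping step are stand-ins invented for this example, not part of Pydoop:

class ListWriter(object):
  # stand-in for the writer object that Pydoop Script passes in (illustration only)
  def __init__(self):
    self.pairs = []
  def emit(self, key, value):
    self.pairs.append((key, value))

# simulate the map phase on two input lines
map_writer = ListWriter()
for line in ["a rose is a rose", "is a rose"]:
  mapper(None, line, map_writer)

# group values by key, as Hadoop does between the map and reduce phases
grouped = {}
for key, value in map_writer.pairs:
  grouped.setdefault(key, []).append(value)

# simulate the reduce phase
reduce_writer = ListWriter()
for key, values in sorted(grouped.items()):
  reducer(key, values, reduce_writer)

print(reduce_writer.pairs)  # [('a', 3), ('is', 2), ('rose', 3)]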

Full-fledged Hadoop API

For more complex applications, you can use the full API. Here is how a word count would look:

from pydoop.pipes import Mapper, Reducer, Factory, runTask

class WordCountMapper(Mapper):

  def __init__(self, context):
    super(WordCountMapper, self).__init__(context)
    context.setStatus("initializing")
    # register a custom counter for the total number of input words
    self.input_words = context.getCounter("WORDCOUNT", "INPUT_WORDS")

  def map(self, context):
    # emit each word of the input value with a count of 1
    words = context.getInputValue().split()
    for w in words:
      context.emit(w, "1")
    context.incrementCounter(self.input_words, len(words))

class WordCountReducer(Reducer):

  def reduce(self, context):
    # sum all the counts emitted for the current key
    s = 0
    while context.nextValue():
      s += int(context.getInputValue())
    context.emit(context.getInputKey(), str(s))

runTask(Factory(WordCountMapper, WordCountReducer))

High-level HDFS API

Pydoop includes a high-level HDFS API that simplifies common tasks such as copying files and directories and navigating through the file system. Here is a brief snippet showing some of this functionality:

>>> import pydoop.hdfs as hdfs
>>> hdfs.mkdir("test")
>>> hdfs.dump("hello", "test/hello.txt")
>>> hdfs.cp("test", "test.copy")

See the tutorial for more examples.
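Reading data back is just as direct. The following is a minimal follow-up sketch, assuming the load, ls and rmr helpers (the natural counterparts of the dump, mkdir and cp calls shown above) are available in this Pydoop version:

import pydoop.hdfs as hdfs

# read the file written above back into a string
content = hdfs.load("test/hello.txt")
print(content)  # the string written by hdfs.dump above

# list the copied directory; entries are returned as absolute HDFS paths
for path in hdfs.ls("test.copy"):
  print(path)

# recursively remove both trees
hdfs.rmr("test")
hdfs.rmr("test.copy")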

How to Cite

Pydoop is developed and maintained by researchers at CRS4's Distributed Computing group. If you use Pydoop as part of your research work, please cite the HPDC 2010 paper [1].


Footnotes

[1] Simone Leo, Gianluigi Zanetti. Pydoop: a Python MapReduce and HDFS API for Hadoop. Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pages 819–825, 2010.