WordCount can be considered the “hello world” of MapReduce: a simple application that counts the occurrences of each word in a set of text files. It is featured as a MapReduce programming tutorial in both the original MapReduce paper [1] and the Hadoop documentation.
Source code for the WordCount examples is located under examples/wordcount in the Pydoop distribution.
This example includes only the bare minimum required to run the application; the whole program is just 14 lines of code:
from pydoop.pipes import Mapper, Reducer, Factory, runTask

class WordCountMapper(Mapper):

    def map(self, context):
        words = context.getInputValue().split()
        for w in words:
            context.emit(w, "1")

class WordCountReducer(Reducer):

    def reduce(self, context):
        s = 0
        while context.nextValue():
            s += int(context.getInputValue())
        context.emit(context.getInputKey(), str(s))

if __name__ == "__main__":
    runTask(Factory(WordCountMapper, WordCountReducer))
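Since the context objects are normally constructed and driven by the Hadoop Pipes runtime, it may help to see the same logic exercised by hand. The following is a purely illustrative, self-contained rehearsal: the Fake* context classes and Simple* components are stand-ins written for this sketch and are not part of Pydoop or Hadoop.

# Illustrative only: minimal stand-ins for the context objects that the
# Pipes framework normally provides (these classes are NOT part of Pydoop).

class FakeMapContext(object):

    def __init__(self, value):
        self._value = value
        self.emitted = []

    def getInputValue(self):
        return self._value

    def emit(self, key, value):
        self.emitted.append((key, value))

class FakeReduceContext(object):

    def __init__(self, key, values):
        self._key = key
        self._values = iter(values)
        self._current = None
        self.emitted = []

    def getInputKey(self):
        return self._key

    def nextValue(self):
        # Advance to the next value for the current key, if any.
        try:
            self._current = next(self._values)
            return True
        except StopIteration:
            return False

    def getInputValue(self):
        return self._current

    def emit(self, key, value):
        self.emitted.append((key, value))

class SimpleMapper(object):
    # Same logic as WordCountMapper above, minus the Pydoop base class.
    def map(self, context):
        words = context.getInputValue().split()
        for w in words:
            context.emit(w, "1")

class SimpleReducer(object):
    # Same logic as WordCountReducer above, minus the Pydoop base class.
    def reduce(self, context):
        s = 0
        while context.nextValue():
            s += int(context.getInputValue())
        context.emit(context.getInputKey(), str(s))

mctx = FakeMapContext("the quick brown fox jumps over the lazy dog")
SimpleMapper().map(mctx)
print(mctx.emitted)   # [('the', '1'), ('quick', '1'), ..., ('dog', '1')]

rctx = FakeReduceContext("the", ["1", "1"])
SimpleReducer().reduce(rctx)
print(rctx.emitted)   # [('the', '2')]

In a real run, the framework instantiates your Mapper and Reducer through the Factory and calls map() once per input record and reduce() once per key, with the key's values grouped by the shuffle phase.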
This is a more verbose version of the above example, written to demonstrate most of Pydoop’s MapReduce and HDFS features; in particular, it shows how to write your own RecordReader, RecordWriter and Partitioner components.
The RecordReader, RecordWriter and Partitioner classes are Python reimplementations of their default Java counterparts, i.e., the ones the framework uses if you don’t provide your own. As such, they are not needed for the application to work: they have been included only to provide a tutorial on writing additional MapReduce components.
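To give a flavor of what such a component looks like, here is a minimal sketch of a Python Partitioner that mimics the behavior of Hadoop’s default hash partitioner (hash the key, fold onto the number of reducers). It is written against the pydoop.pipes API used above, but the exact constructor and method signatures assumed here (__init__(self, context) and partition(self, key, num_reduces)) should be checked against the actual example code in the distribution.

from pydoop.pipes import Partitioner

class WordCountPartitioner(Partitioner):

    def __init__(self, context):
        super(WordCountPartitioner, self).__init__(context)

    def partition(self, key, num_reduces):
        # Mimic the default Java HashPartitioner: hash the key, make the
        # result non-negative, and map it onto one of the reducers.
        return (hash(key) & 2147483647) % num_reduces

As with the mapper and reducer, a component like this takes effect only if its class is passed to the Factory; the code under examples/wordcount/bin shows how the full example wires everything together.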
For further details, take a look at the code in the examples/wordcount/bin subdirectory of the Pydoop distribution.