Pydoop includes a command line tool that, for simpler use cases, hides all the details of running a regular Pydoop job. It reduces a simple text-processing job to writing two Python functions (or just one, if you don’t need a reducer) in a module and launching it like this:
pydoop script myscript.py hdfs_input hdfs_output
The rest is magic.
To convert some text to lower case, create a module lowercase.py:
def mapper(_, value, writer):
    writer.emit("", value.lower())
Now run it over your data:
pydoop script --num-reducers 0 -t '' lowercase.py hdfs_input hdfs_output
Note that we want a map-only job, so we set the number of reducers to 0; since the output keys are empty, we also pass -t '' to avoid a separator between key and value in the output.
Next, the classic word count. Write your map and reduce functions in wordcount.py:
def mapper(_, text, writer):
    for word in text.split():
        writer.emit(word, 1)

def reducer(word, count, writer):
    writer.emit(word, sum(map(int, count)))
Notice that in the reducer we had to convert the values to int since all data come in as strings.
Run the example:
pydoop script wordcount.py hdfs_input hdfs_output
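Since the mapper and reducer are plain Python functions, you can also sanity-check them locally before submitting the job. Here is a minimal sketch, where FakeWriter is a made-up stand-in for the writer object that Pydoop Script passes to your functions (it is not part of Pydoop):

from wordcount import mapper, reducer

class FakeWriter(object):
    # made-up stand-in for the writer object; only emit is needed here
    def emit(self, k, v):
        print("%s\t%s" % (k, v))

w = FakeWriter()
mapper(None, "the quick brown fox jumps over the lazy dog", w)
# values reach the reducer as strings, hence the int conversion in wordcount.py
reducer("the", ["1", "1"], w)  # prints "the" and "2" separated by a tab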
Suppose that, as in the example above, we want to count word occurrences, but in addition we also want to count the total number of words. For this “global” count we can use Hadoop counters.
Here’s our code in wordcount_with_total.py:
def mapper(_, text, writer):
    wordlist = text.split()
    for word in wordlist:
        writer.emit(word, 1)
    writer.count("num words", len(wordlist))

def reducer(word, count, writer):
    writer.emit(word, sum(map(int, count)))
Then run it:
pydoop script wordcount_with_total.py hdfs_input hdfs_output
The counter value will show on the JobTracker’s job page and will be present in the job logs.
As a slightly contrived example, suppose you want to select all lines containing a substring given at run time. Create a module grep.py:
def mapper(_, text, writer, conf):  # notice the fourth 'conf' argument
    if text.find(conf['grep-expression']) >= 0:
        writer.emit("", text)
Then run it:
pydoop script --num-reducers 0 -t '' -D grep-expression=my_substring grep.py hdfs_input hdfs_output
Here we set the job property grep-expression to my_substring using the -D argument. Also, as in the first case, we want a map-only job and we don’t want a delimiter between output key (empty) and value.
This is a more domain-specific problem. We have some DNA sequencing data in SAM format. We’d like to calculate the nucleotide composition of the sequenced sample.
Our module, base_histogram.py:
def mapper(_, samrecord, writer):
    seq = samrecord.split("\t", 10)[9]  # extract the DNA sequence
    for c in seq:
        writer.emit(c, 1)
    writer.count("bases", len(seq))  # count all the bases

def reducer(key, ivalue, writer):
    writer.emit(key, sum(map(int, ivalue)))
Run it:
pydoop script base_histogram.py hdfs_input hdfs_output
Pydoop Script makes it easy to solve simple problems: it lets you write quick (even throw-away) scripts that perform straightforward manipulations or analyses of your data, especially if it’s text-based.
If your algorithm can be expressed as two simple functions that are stateless, or whose state fits in a few module variables, then Pydoop Script is worth considering.
If you need something more sophisticated, then consider using the full Pydoop API.
pydoop script MODULE INPUT OUTPUT
MODULE is the file (on your local file system) containing your map and reduce functions, in Python.
INPUT and OUTPUT are HDFS paths, the former pointing to your input data and the latter to your job’s output directory.
Command line options are shown in the following table.
| Short | Long | Meaning |
|---|---|---|
| -m | --map-fn | Name of the map function within the module (default: mapper) |
| -r | --reduce-fn | Name of the reduce function within the module (default: reducer) |
| -t | --kv-separator | Key-value separator string in the final output (default: the tab character) |
| | --num-reducers | Number of reduce tasks. Specify 0 to only perform the map phase (default: 3 * num task trackers) |
| -D | | Set a property value, such as -D mapred.compress.map.output=true |
In addition to the options listed above, you can pass any of the generic Hadoop options to the script tool, but you must pass them after the pydoop script options listed above.
| Option | Meaning |
|---|---|
| -conf <configuration file> | Specify an application configuration file |
| -fs <local\|namenode:port> | Specify a namenode |
| -jt <local\|jobtracker:port> | Specify a job tracker |
| -files <list of files> | Comma-separated files to be copied to the MapReduce cluster |
| -libjars <list of jars> | Comma-separated jar files to include in the classpath |
| -archives <list of archives> | Comma-separated archives to be unarchived on the compute machines |
Here is the word count example modified to ignore stop words. The stop words are identified in a file that is distributed to all the nodes using the standard Hadoop -files option.
Code:
with open('stop_words.txt') as f:
    STOP_WORDS = frozenset(l.strip() for l in f if not l.isspace())

def mapper(_, v, writer):
    for word in v.split():
        if word in STOP_WORDS:
            writer.count("STOP_WORDS", 1)
        else:
            writer.emit(word, 1)

def reducer(word, icounts, writer):
    writer.emit(word, sum(map(int, icounts)))
Command line:
pydoop script word_count.py alice.txt hdfs_output -files stop_words.txt
While this script works, it has an obvious weakness: the stop word list is loaded as soon as the module is imported, so it is loaded even when executing the reducer. If this is a concern, we could trigger the loading from the mapper function, as in the sketch below, or write a full Pydoop application, which would give us all the control we need to load the list only when required.
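Here is a minimal sketch of the first approach, where the loading is triggered lazily from the mapper (the _load_stop_words helper is our own addition, not a Pydoop feature):

STOP_WORDS = None

def _load_stop_words():
    # load the list on first use, so reduce tasks never pay the cost
    global STOP_WORDS
    if STOP_WORDS is None:
        with open('stop_words.txt') as f:
            STOP_WORDS = frozenset(l.strip() for l in f if not l.isspace())
    return STOP_WORDS

def mapper(_, v, writer):
    stop_words = _load_stop_words()
    for word in v.split():
        if word in stop_words:
            writer.count("STOP_WORDS", 1)
        else:
            writer.emit(word, 1)

def reducer(word, icounts, writer):
    writer.emit(word, sum(map(int, icounts)))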
In this section we assume you’ll be using the default TextInputFormat and TextOutputFormat record reader/writer. You may select a different input or output format by configuring the appropriate Hadoop properties.
The mapper function in your module will be called for each record in your input data. It receives three parameters:

1. the input key;
2. the input value (with the default TextInputFormat, a line of your input text);
3. the writer object described below.
The reducer function will be called once for each unique key produced by your map function. It also receives three parameters:

1. the key;
2. an iterable over all the values emitted for that key (the values arrive as strings);
3. the writer object.
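Putting the two signatures together, a bare-bones module might look like this (a hypothetical job that counts how many times each distinct input line occurs):

def mapper(key, value, writer):
    # called once per input record; with TextInputFormat, value is a line of text
    writer.emit(value, 1)

def reducer(key, values, writer):
    # called once per distinct key; values is an iterable over the emitted values (as strings)
    writer.emit(key, sum(map(int, values)))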
The key and value you emit from your reducer will be joined by the key-value separator and written to the final output. You may customize the key-value separator with the --kv-separator command line argument.
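For example, you could get comma-separated output from the word count job shown earlier with something like:

pydoop script -t ',' wordcount.py hdfs_input hdfs_output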
The writer object given as the third parameter to both the mapper and reducer functions has the following methods (parameter names here are descriptive; see the API documentation for the exact signatures):

- emit(key, value): emit a (key, value) pair (intermediate output if called from the mapper, final output if called from the reducer);
- count(counter_name, amount): increment the Hadoop counter counter_name by amount;
- status(message): update the task's status message;
- progress(): notify the framework that the task is making progress.
The latter two methods are useful for keeping your task alive when the computation for a single record might exceed Hadoop’s timeout interval: Hadoop kills a task if, for a number of milliseconds set through the mapred.task.timeout property (default: 600000, i.e., 10 minutes), it neither reads an input, writes an output, nor updates its status string.
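As a sketch, a mapper whose per-record computation is slow could report back periodically like this (expensive_analysis is a stand-in for your own long-running code; status and progress are the writer methods listed above):

import time

def expensive_analysis(field):
    # stand-in for a long-running computation
    time.sleep(1)
    return len(field)

def mapper(_, record, writer):
    writer.status("analyzing record")
    for i, field in enumerate(record.split("\t")):
        writer.emit(str(i), expensive_analysis(field))
        writer.progress()  # tell Hadoop the task is still alive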
If desired, Pydoop Script lets you access the values of your program's configuration properties through a dict-like configuration object, which is passed as the fourth parameter to your functions (if the function accepts four parameters instead of three). To see the methods available, check out the API documentation.
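For instance, here is a sketch of a mapper that reads a job property set with -D, with a fallback when the property is missing (the min-length property name is made up, and we assume the dict-like object raises KeyError for absent keys):

def mapper(_, value, writer, conf):
    # 'min-length' is a made-up property, e.g. set with -D min-length=80
    try:
        min_length = int(conf['min-length'])
    except KeyError:
        min_length = 10
    if len(value) >= min_length:
        writer.emit("", value)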
If you’d like to give your map and reduce functions names different from mapper and reducer, you may do so, but you must tell the script tool. Use the --map-fn and --reduce-fn command line arguments to select your customized names.
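For example, with a hypothetical module my_algo.py that uses its own names:

# my_algo.py -- the same word count as before, with non-default function names
def tokenize(_, line, writer):
    for word in line.split():
        writer.emit(word, 1)

def add_up(word, counts, writer):
    writer.emit(word, sum(map(int, counts)))

You would then run it with:

pydoop script --map-fn tokenize --reduce-fn add_up my_algo.py hdfs_input hdfs_output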
Your program may not need a reduce function. In that case, specify --num-reducers 0 on the command line: your map output will go directly to the output formatter and be written to your final output, with keys and values separated by the key-value separator.