Pydoop applications are run like any other Hadoop Pipes application (e.g., C++ ones such as this wordcount example). To start, you will need a working Hadoop cluster. If you don’t have one available, you can bring up a single-node Hadoop cluster on your machine by following the Hadoop quickstart guide. Configure Hadoop for “Pseudo-Distributed Operation” and start the daemons as explained in the guide.
Your pipes command line may look something like this:
${HADOOP_HOME}/bin/hadoop pipes -conf conf.xml -input input -output output
The paths input and output are the HDFS directories where the application will read its input and write its output, respectively. The configuration file, read from the local file system, is an XML document consisting of a simple name = value property list, explained below.
Here’s an example of a configuration file:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.pipes.executable</name>
    <value>app_executable</value>
  </property>
  <property>
    <name>mapred.job.name</name>
    <value>app_name</value>
  </property>
  <property>
    <name>hadoop.pipes.java.recordreader</name>
    <value>true</value>
  </property>
  <property>
    <name>hadoop.pipes.java.recordwriter</name>
    <value>true</value>
  </property>
  [...]
</configuration>
The meaning of these properties is as follows:
In the job configuration file you can also set application-specific properties; their values will be accessible at run time through the JobConf object.
Finally, you can include general Hadoop properties (e.g., mapred.reduce.tasks). See the Hadoop documentation for a list of the available properties and their meanings.
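Since the configuration file is just a flat list of name/value pairs, it can also be generated from a script. The following is a minimal sketch using only the Python standard library; the helper name build_conf and the property myapp.threshold are hypothetical and serve only as illustration, while the other property names are the ones shown in the example above.

```python
import xml.etree.ElementTree as ET


def build_conf(properties):
    """Return a Hadoop-style <configuration> document as a string.

    build_conf is a hypothetical helper, not part of Pydoop or Hadoop.
    """
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode")


conf_xml = build_conf({
    # Properties from the example configuration file
    "hadoop.pipes.executable": "app_executable",
    "mapred.job.name": "app_name",
    "hadoop.pipes.java.recordreader": "true",
    "hadoop.pipes.java.recordwriter": "true",
    # A general Hadoop property
    "mapred.reduce.tasks": 2,
    # Hypothetical application-specific property (illustration only)
    "myapp.threshold": 10,
})
print(conf_xml)
```

Writing conf_xml to a local file (e.g., conf.xml) would produce a document that can be passed to the -conf option.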
Note
You can also configure property values on the command line with the -D property.name=value syntax. You may find this more convenient when scripting or temporarily overriding a specific property value. If you specify all required properties with -D switches, the XML configuration file is not necessary.
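When scripting, the -D form lends itself to building the command line programmatically. Here is a minimal sketch; the function pipes_command is a hypothetical helper, and it assumes that the -D switches should be placed right after the pipes subcommand and before -input and -output, which is how Hadoop's generic options are normally given.

```python
import os


def pipes_command(input_dir, output_dir, properties, hadoop_home=None):
    """Build the argument list for a 'hadoop pipes' run configured via -D.

    pipes_command is a hypothetical helper, not part of Pydoop or Hadoop.
    """
    if hadoop_home is None:
        hadoop_home = os.environ.get("HADOOP_HOME", "")
    cmd = [os.path.join(hadoop_home, "bin", "hadoop"), "pipes"]
    # Assumption: -D switches come before the pipes-specific options
    for name, value in properties.items():
        cmd += ["-D", "%s=%s" % (name, value)]
    cmd += ["-input", input_dir, "-output", output_dir]
    return cmd


cmd = pipes_command(
    "input",
    "output",
    {
        "hadoop.pipes.executable": "app_executable",
        "mapred.job.name": "app_name",
        "hadoop.pipes.java.recordreader": "true",
        "hadoop.pipes.java.recordwriter": "true",
    },
    hadoop_home="/opt/hadoop",  # hypothetical installation path
)
print(" ".join(cmd))
```

The resulting list could then be handed to subprocess.call to launch the job without an XML configuration file.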
Before running your application, you need to perform the following steps:
The examples subdirectory of Pydoop’s distribution root contains several examples of Python scripts that integrate all of the above steps into a convenient command line tool. Documentation for the examples is in the Examples section.