Apache Mahout is a set of machine learning tools, which deal with classification, clustering, recommendations, and other related stuff. We just bought a new book called Mahout In Action which is full of good examples and general machine learning advice; you can find it here. It’s pretty neat and it’s growing quickly, so I decided to take the time to learn about it.
Mahout functions as a set of MapReduce jobs. It integrates cleanly with Hadoop, and this makes it very attractive for doing text analysis on a large scale. Simpler queries, for instance getting the average response time from a customer, are probably better suited for Hive.
Most examples I’ve seen use Mahout as sort of a black box. The command line just forwards arguments to various Driver classes, which then work their magic. All input and output seems to be through HDFS, and Mahout also uses intermediate temp directories inside HDFS. I tried changing one of the Driver classes to work with HBase data, but the amount of work that seemed to be necessary was non-trivial.
I decided to work with Enron email data set because it’s reasonably large and it tells a story about fraud and corruption. Their use of keywords like ‘Raptor’ and ‘Death Star’ in place of other more descriptive phrases makes topic analysis pretty interesting.
Please read ‘Important things to watch out for’ at the bottom of this post first if you want to follow along.
This is what I did to get the Enron mail set to be analyzed using the LDA algorithm (Latent Dirchlet Allocation), which looks for common topics in a corpus of text data:
- The Enron emails are stored in the maildir format, a directory tree of text emails. In order to process the text, it first needs to be converted to SequenceFiles. A SequenceFile is a file format used extensively by Hadoop, and it contains a series of key/value pairs. One way to convert a directory of text to SequenceFiles is to use Mahout’s
./bin/mahout seqdirectory -i file:
///home/georges/enron_mail_20110402 -o /data/enron_seq
This can take a little while for large amounts of text, maybe 15 minutes. The SequenceFiles produced have key/value pairs where the key is the path of the file and the value is the text from that file.
- Later on I wrote my own Java code which parsed out the mail headers to prevent them from interfering with the results. It is fairly simple to write a MapReduce task to quickly produce your own SequenceFiles. Also note that there are many other possible sources of text data, for instance Lucene indexes. There’s a list of ways to input text data here.
- I needed to tokenize the SequenceFiles into vectors. Vectors in text analysis are a technical idea that I won’t get into, but these particular vectors are just simple term frequencies.
./bin/mahout seq2sparse -i /data/enron_seq -o /data/enron_vec_tf --norm
-wt tf -seq
This command may need changing depending on what text analysis algorithm you’re using. Most algorithms would require tf-idf instead, which weights the term frequency against the size of the email. This took 5 minutes on a 10-node AWS Hadoop cluster. (I set the cluster up using StarCluster, another neat tool for managing EC2 instances.)
- I ran the LDA algorithm:
./bin/mahout lda -i /dev/enron_vec_tf/tf-vectors -o /data/enron_lda -x
x is the max number of iterations for the algorithm. k is the number of topics to display from the corpus. This took a little under 2 hours on my cluster.
- List the LDA topics:
./bin/mahout ldatopics -i /data/enron_lda/state-
This command is a bit of pain because it doesn’t really error when you have an incorrect parameter, it just does nothing. Here’s some of the output I got:
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-
i [p(i|topic_0) =
information [p(information|topic_0) =
you [p(you|topic_1) =
you [p(you|topic_2) =
we [p(we|topic_2) =
your [p(your|topic_2) =
meeting [p(meeting|topic_2) =
message [p(message|topic_2) =
you [p(you|topic_3) =
your [p(your|topic_3) =
i [p(i|topic_4) =
john [p(john|topic_4) =
jones [p(jones|topic_4) =
jpg [p(jpg|topic_4) =
johnson [p(johnson|topic_4) =
- Looks like the data needs a lot of munging to provide more useful results. Still, you can see the relationship between some of the words in each topic.
I recommend playing around with the examples in the examples/bin directory in the Mahout folder.
Important things to watch out for
- I ran out of heap space once I asked Mahout to do some real work. I needed to increase the heap size for child MapReduce processes. How to do this is basically described here. You only need the -Xmx option, and I went for 2 gigabytes:
You may also want to set MAHOUT_HEAPSIZE to 2048, but I’m not sure how much this matters.
- Some environment variables weren’t set on my StarCluster instance by default, and the warnings are subtle. HADOOP_HOME is particularly important. If HADOOP_HOME is not set, MapReduce jobs will run as local jobs. There were weird exceptions accessing HDFS, and your jobs won’t show up in the job tracker. They do warn you in the console output for the job, but it’s easy to miss. JAVA_HOME is also important but it will explicitly error and tell you to set this. HADOOP_CONF_DIR should be set to $HADOOP_HOME/conf. For some reason it assumes you want HADOOP_HOME/src/conf instead if you don’t specify. Also set MAHOUT_HOME to your mahout directory. This is important so it can add its jar files to the CLASSPATH correctly.
- I ended up compiling Mahout from source. The stable version of Mahout had errors I couldn’t really explain. File system mismatches or vector mismatches or something like that. I’m not 100% sure that it’s necessary, but it probably won’t hurt. Compilation is pretty simple, ‘mvn clean install’, but you will probably want to add ‘-DskipTests’ because the tests take a long time.