All posts by george

Latent Text Algorithms

The basic idea behind this kind of analysis is that there are certain latent topics in a body of text. Some words like car and automobile have a very similar meaning which means they are used in similar contexts. There is a lot of redundancy in language, and with enough effort you can group similar words together into topics.

Math behind the idea

Words are represented as vectors (see vector space model), which are a combination of direction and magnitude. Each word starts out pointing to its own dimension with magnitude 1, which means there is a huge number of dimensions (maybe hundreds of thousands, one for every word that comes up in the data set you’re working on). The basic problem is to flatten these large number of dimensions into a smaller number which are easier to manage and understand.

These vectors are represented as a matrix. In linear algebra, there is the idea of a basis, which is the set of vectors that describe the space you’re working in. For example, in everyday 3D space your basis would have one vector pointing in each dimension.

For another example, you could have a basis which is two vectors that describe a 2D plane. This space can be described in 3 dimensions, like how a piece of paper exists in the real world. But if all you’re dealing with is a 2D plane, you’re wasting a lot of effort and dealing with a lot of noise doing calculations for a 3D space.

Essentially, algorithms that do latent analysis attempt to flatten the really large space of all possible words into a smaller space of topics.

A fairly good explanation of the math involved is in the ‘Skillicorn – Understanding Complex Datasets‘ book, in chapters 2 and 3.

For a Mahout example, see Setting up Mahout. There are also examples in the examples/bin directory for topic modelling and clustering.

Latent Semantic Indexing example

Mahout doesn’t currently support this algorithm. Maybe because it was patented until recently? Hard to parallelize? In any case, there’s a Java library called SemanticVectors which makes use of it to enhance your Lucene search.

Note: I think there’s a technical distinction between topic modelling and LSI, but I’m not sure what it is. The ideas are similar, in any case

It’s just a JAR so you just need to add it to CLASSPATH. However, you need to also install Lucene and have it in your CLASSPATH (both lucene-demo and lucene-core jars.)

  • Index the Enron data (or some other directory tree of text files) into Lucene:  java org.apache.lucene.demo.IndexFiles -index PATH_TO_LUCENE_INDEX -docs PATH_TO_ENRON_DOCS
  • Create LSI term frequency vectors from that Lucene data (I think this took a while, but I did it overnight so I’m not sure): java PATH_TO_LUCENE_INDEX 
  • Search the LSI index. There are different searchtypes, I did ‘sum’, the default: java fraud

Here’s my result:

The input is fairly dirty. Mail headers are not stripped out and some emails are arbitrarily cut up at the 80th column (which may explain ‘rivatives’ at the bottom instead of ‘derivatives’). Still, it can be pretty useful

How to Set Up Apache Mahout

Apache Mahout is a set of machine learning tools, which deal with classification, clustering, recommendations, and other related stuff. We just bought a new book called Mahout In Action which is full of good examples and general machine learning advice; you can find it here. It’s pretty neat and it’s growing quickly, so I decided to take the time to learn about it.

Mahout functions as a set of MapReduce jobs. It integrates cleanly with Hadoop, and this makes it very attractive for doing text analysis on a large scale. Simpler queries, for instance getting the average response time from a customer, are probably better suited for Hive.

Most examples I’ve seen use Mahout as sort of a black box. The command line just forwards arguments to various Driver classes, which then work their magic. All input and output seems to be through HDFS, and Mahout also uses intermediate temp directories inside HDFS. I tried changing one of the Driver classes to work with HBase data, but the amount of work that seemed to be necessary was non-trivial.


I decided to work with Enron email data set because it’s reasonably large and it tells a story about fraud and corruption. Their use of keywords like ‘Raptor’ and ‘Death Star’ in place of other more descriptive phrases makes topic analysis pretty interesting.

Please read ‘Important things to watch out for’ at the bottom of this post first if you want to follow along.

This is what I did to get the Enron mail set to be analyzed using the LDA algorithm (Latent Dirchlet Allocation), which looks for common topics in a corpus of text data:

  • The Enron emails are stored in the maildir format, a directory tree of text emails. In order to process the text, it first needs to be converted to SequenceFiles. A SequenceFile is a file format used extensively by Hadoop, and it contains a series of key/value pairs. One way to convert a directory of text to SequenceFiles is to use Mahout’s seqdirectory command:
    ./bin/mahout seqdirectory -i file:///home/georges/enron_mail_20110402 -o /data/enron_seq

    This can take a little while for large amounts of text, maybe 15 minutes. The SequenceFiles produced have key/value pairs where the key is the path of the file and the value is the text from that file.

  • Later on I wrote my own Java code which parsed out the mail headers to prevent them from interfering with the results. It is fairly simple to write a MapReduce task to quickly produce your own SequenceFiles. Also note that there are many other possible sources of text data, for instance Lucene indexes. There’s a list of ways to input text data here.
  • I needed to tokenize the SequenceFiles into vectors. Vectors in text analysis are a technical idea that I won’t get into, but these particular vectors are just simple term frequencies.
    ./bin/mahout seq2sparse -i /data/enron_seq -o /data/enron_vec_tf --norm 2 -wt tf -seq

    This command may need changing depending on what text analysis algorithm you’re using. Most algorithms would require tf-idf instead, which weights the term frequency against the size of the email. This took 5 minutes on a 10-node AWS Hadoop cluster. (I set the cluster up using StarCluster, another neat tool for managing EC2 instances.)

  • I ran the LDA algorithm:
    ./bin/mahout lda -i /dev/enron_vec_tf/tf-vectors -o /data/enron_lda -x 20 -k 10

    x is the max number of iterations for the algorithm. k is the number of topics to display from the corpus. This took a little under 2 hours on my cluster.

  • List the LDA topics:
    ./bin/mahout ldatopics -i /data/enron_lda/state-4 --dict /data/enron_vec_tf/dictionary.file-0 -w 5 --dictionaryType sequencefile

    This command is a bit of pain because it doesn’t really error when you have an incorrect parameter, it just does nothing. Here’s some of the output I got:

    MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
    Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
    MAHOUT-JOB: /data/mahout-distribution-0.5/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
    Topic 0
    i [p(i|topic_0) = 0.023824791149925677
    information [p(information|topic_0) = 0.004141992353710214
    i'm [p(i'm|topic_0) = 0.0012614859683494856
    i'll [p(i'll|topic_0) = 7.433430267661564E-4
    i've [p(i've|topic_0) = 4.22765928967555E-4
    Topic 1
    you [p(you|topic_1) = 0.013807669181244436
    you're [p(you're|topic_1) = 3.431068629183266E-4
    you'll [p(you'll|topic_1) = 1.0412948245383297E-4
    you'd [p(you'd|topic_1) = 8.39664771688153E-5
    you'all [p(you'all|topic_1) = 1.5437174634592594E-6
    Topic 2
    you [p(you|topic_2) = 0.03938587430317399
    we [p(we|topic_2) = 0.010675333661142919
    your [p(your|topic_2) = 0.0038312042763726448
    meeting [p(meeting|topic_2) = 0.002407369369715602
    message [p(message|topic_2) = 0.0018055376982080878
    Topic 3
    you [p(you|topic_3) = 0.036593494258252174
    your [p(your|topic_3) = 0.003970284840960353
    i'm [p(i'm|topic_3) = 0.0013595988902916712
    i'll [p(i'll|topic_3) = 5.879175074800994E-4
    i've [p(i've|topic_3) = 3.9887853536102604E-4
    Topic 4
    i [p(i|topic_4) = 0.027838628233581693
    john [p(john|topic_4) = 0.002320786569676983
    jones [p(jones|topic_4) = 6.79365597839018E-4
    jpg [p(jpg|topic_4) = 1.5296038761774956E-4
    johnson [p(johnson|topic_4) = 9.771211326361852E-5
  • Looks like the data needs a lot of munging to provide more useful results. Still, you can see the relationship between some of the words in each topic.

I recommend playing around with the examples in the examples/bin directory in the Mahout folder.

Important things to watch out for

  • I ran out of heap space once I asked Mahout to do some real work. I needed to increase the heap size for child MapReduce processes. How to do this is basically described here. You only need the -Xmx option, and I went for 2 gigabytes:

    You may also want to set MAHOUT_HEAPSIZE to 2048, but I’m not sure how much this matters.

  • Some environment variables weren’t set on my StarCluster instance by default, and the warnings are subtle. HADOOP_HOME is particularly important. If HADOOP_HOME is not set, MapReduce jobs will run as local jobs. There were weird exceptions accessing HDFS, and your jobs won’t show up in the job tracker. They do warn you in the console output for the job, but it’s easy to miss. JAVA_HOME is also important but it will explicitly error and tell you to set this. HADOOP_CONF_DIR should be set to $HADOOP_HOME/conf. For some reason it assumes you want HADOOP_HOME/src/conf instead if you don’t specify. Also set MAHOUT_HOME to your mahout directory. This is important so it can add its jar files to the CLASSPATH correctly.
  • I ended up compiling Mahout from source. The stable version of Mahout had errors I couldn’t really explain. File system mismatches or vector mismatches or something like that. I’m not 100% sure that it’s necessary, but it probably won’t hurt. Compilation is pretty simple, ‘mvn clean install’, but you will probably want to add ‘-DskipTests’ because the tests take a long time.

Developing Android Apps: 10 Performance-Boosting Strategies

In working on my spare-time app BostonBusMap, I’ve repeatedly run into problems where I need to cut down on memory usage or running time. It’s not difficult to hit memory limitations on Android, especially on earlier devices which may allow only 16MB per app. Garbage collection can cause noticeable hiccups (especially before Android 2.3), which is one reason why memory usage can impact an app’s sluggishness. Here’s a list of things I did to make my app more responsive:

  1. First, read the Android performance guidelines. It’s also worth rereading this page from time to time because newer Android versions have more advanced capabilities that may influence the optimizations you make.
  2. One other important thing that bears repeating from that page is, don’t optimize unless you can show that there’s a bottleneck, because optimizing is a lot of work and can make code less maintainable
  3. Memory allocations are expensive. Be lazy in allocating objects if they are only used in edge cases
  4. Use DDMS. This is available within Eclipse and as a standalone tool:
    1. Profile against smaller amounts of data at first, to make the experience less painful and time consuming for you. Debugging will slow your app down in general
    2. Click on the green cylindrical icon to the left (called “Update Heap”) to turn on heap tracking for a thread. This will update whenever a garbage collection occurs (you can also force garbage collections) Continue reading Developing Android Apps: 10 Performance-Boosting Strategies

Working around history.back() bug in BlackBerry Torch browser widget

I’m on the team working on Pylon, a smartphone app which allows the user to approve Nintex tasks, among other things. It currently uses BlackBerry Webworks, which is a platform that uses HTML and JavaScript in the UI. On one screen we wanted functionality that would send the user back to the previous screen (as if they clicked the back button). The way to do this in JavaScript is:




However, in my testing this did nothing on the BlackBerry Torch running OS 6, or on the Torch simulator. I found a workaround which works on most OS 5 and 6 devices (see Caveat for the exception.)

Basically, I needed to write a Webworks extension function to trigger the behavior from Webworks. Continue reading Working around history.back() bug in BlackBerry Torch browser widget

Numbers in Emacs

found this page:

Registers are useful tools when writing KeyboardMacros. Notice how you can store numbers in registers, insert them again, and increment them in the register.

See the Node Keeping Numbers in Registers in the EmacsManual.

Example session:

‘C-u 1000 C-x r n x’
Store 1000 in register x
`C-x (‘
Start a keyboard macro
‘C-x r i x’
Insert the number in register x into the buffer
`C-x r + x’
Increment the number by one
`<end> <newline>
Do something else
`C-x )’
End keyboard macro definition

Now run ‘C-u 50 C-x e’ to run the macro 50 times.

See IncrementNumber, ReplaceRegexp, ReplaceCount or RenumberList for alternative that don’t involve registers. Continue reading Numbers in Emacs

MBTA Bus Locator

I wrote an app for my G1 to get information from the real-time MBTA bus location feed and display it in Google Maps as bus icons. It was an interesting experience to develop for.

The Android eclipse environment has its fair share of hiccups unfortunately, simple things like how a down arrow usually behaves, or not being able to scroll all the way to the bottom of a list using your mouse scrollwheel. The emulator and eclipse’s debugging capabilities were really helpful, and robust for the most part. The most difficult thing in developing for it was figuring out how to do what I wanted to do, what functions and classes to use. I guess that’s inherent in developing for a new platform.

The app is pretty simple. It’s one screen, with most of it taken up by Google Maps and a bar at the top containing space for a refresh button and a status update box. It defaults to Watertown square when it starts, although that could easily be changed to use GPS or something else. When the app starts and when the refresh button is pushed, it retrieves a list of bus locations from the feed XML. Those locations are sorted by distance to get the 20 closest buses to where the center of the map is. (The distance is currently computed by sqrt(latitude^2 + longitude^2). This is slightly incorrect since a difference in latitude may be smaller or larger than a difference in longitude. I’ll fix that soon using this. Overlays are then created using a bus picture I got from Wikipedia and inserted into Google Maps. Continue reading MBTA Bus Locator

Best Practices for String Comparison

I was interested in the efficiency of string comparison. For example, when performing a case-insensitive comparison, is it better to use the String.Equals method with StringComparison.InvariantCultureIgnoreCase, or the simple trick of converting both strings to upper- or lower-case? I found an MSDN article, New Recommendations for Using Strings in Microsoft .NET 2.0, that discusses all this and more.

In a nutshell:

  • Use an overload that takes the StringComparison enumeration
  • If you don’t care about the linguistic content, as in the majority of cases, use StringComparison.Ordinal or StringComparison.OrdinalIgnoreCase for the best performance. An ordinal comparison runs the fastest because it compares bytes directly and applies no linguistic rule to compute the result.

Going back to my original question, using String.InvariantCultureIgnoreCase is not as efficient because while it specifies culture-independence, the strings are still compared in a linguistic sense. And when you normalize casing with String.ToUpperInvariant or String.ToLowerInvariant, the extra step means that the comparison is not as fast as using StringComparison.OrdinalIgnoreCase.

Switching on Strings in C++ and C

I was wondering what the performance hit of switching on strings was. This is interesting to know:

In C++ and C they only allow switching on integral values. I imagine in certain cases the compiler could make a jump statement which used the integer directly (maybe as part of the jump address, positioned relatively from the current one). That’s probably why it’s not allowed for character arrays or objects in general in C++. At least that’s my guess.

Another interesting link:

An interesting if unrelated link:

Refactoring XmlWriter

I found this link interesting:

XmlWriter and XmlReader are convenient and one of the fastest ways of creating XML, but they are kind of ugly sometimes. I wrote a class that encapsulated these methods into XmlFilter, which makes it easy to change pieces of an XML file without writing the same scaffolding over and over again, and without using an XmlDocument (which is quite a bit slower and consumed more memory than I preferred.)

The next release of Officewriter should be significantly faster (roughly 2 to 4 times as fast, according to current benchmarks) than the 4.0 release for parsing Excel 2007 files. This is mostly because I replaced a section of code that used the XmlDocument with a custom written replacement which was more specific to my needs. I don’t think the .NET framework (at least parts of it) was written with performance in mind, but I’ll get into that in a later blog post.