Creating a Simple Connection Consumer in SharePoint

There are a million articles about using seven different interfaces and fourteen .wsp deployments to make an entirely custom connection provider and consumer. However, I couldn’t find one about how to create a simple connection consumer that filters based on a SharePoint list. I also couldn’t find anything about changing the brand-new double-headed arrow icon that SharePoint replaced the radio button with. Turns out they can go in the same simple solution:

  • Setting up the Consumer – This consumer will take a single row of information from a SharePoint list. It’s implemented in “MyWebPart.cs” and not in a user control.
//Create the data row that the list information will fill
DataRowView _row = null;

//Declare your consumer. Note that "Row" can be anything - it's just the term that
//SharePoint will use when you enable the connection
[ConnectionConsumer("Row")]
//Set up the actual connection
public void SetConnect(IWebPartRow provider)
{
    //"ReceiveRow" is the method you'll create to interpret the data
    RowCallback callback = new RowCallback(ReceiveRow);
    //This is where the data comes in
    provider.GetRowData(callback);
}

This code sets up the connection consumer and the DataRowView field that will hold the incoming row. Note that the “provider” parameter is supplied by the SharePoint connection.

  • Capturing your data – This is where you can actually use the filter – manipulate, save, and query to your heart’s content. Below is a sample ReceiveRow method:
public void ReceiveRow(object row)
{
    //Set your local _row equal to the row passed by SharePoint
    _row = (DataRowView)row;

    //Make sure it isn't null
    if (_row != null)
    {
        //Open a fresh SPSite/SPWeb from the context web's URL; the using blocks
        //dispose of them for you, and the context web itself is never disposed
        SPWeb contextWeb = SPControl.GetContextWeb(Context);
        using (SPSite site = new SPSite(contextWeb.Url))
        {
            using (SPWeb web = site.OpenWeb())
            {
                //Convert the DataRowView into a DataRow, so you can actually use it
                DataRow dr = _row.Row;

                //I need to allow unsafe updates to execute my query
                web.AllowUnsafeUpdates = true;

                //Query against the row data
                SPQuery query = new SPQuery();
                query.Query = string.Format(
                    @"<Where>
                        <Eq>
                            <FieldRef Name='MyColumn' />
                            <Value Type='Lookup'>{0}</Value>
                        </Eq>
                    </Where>", dr["mycolumnvalue"].ToString());

                SPList list = web.Lists["MyList"];
                SPListItemCollection items = list.GetItems(query);

                /*********
                 * In here you do all you want with that filtered item list.
                 * Convert it to a data table, pass the list to a user control.
                 * After all, it's your filter!
                 ********/

                //Disable unsafe updates when you're done
                web.AllowUnsafeUpdates = false;
            }
        }
    }
}

That’s all there is to it. You can fill a data structure in the ReceiveRow method and pass it on to a user control the same way you would pass any other value.
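
For example, here is a minimal sketch of that hand-off. The MyFilterResults control, its path, and its Results property are hypothetical placeholders for your own user control; GetDataTable() is the standard SPListItemCollection helper.

//Hypothetical user control (MyFilterResults.ascx) exposing a public Results property
private MyFilterResults _control;

protected override void CreateChildControls()
{
    //The path and class name are placeholders for your own .ascx
    _control = (MyFilterResults)Page.LoadControl("~/_CONTROLTEMPLATES/MyWebPartProject/MyFilterResults.ascx");
    Controls.Add(_control);
}

//Then, inside ReceiveRow after the query runs:
//    EnsureChildControls();
//    _control.Results = items.GetDataTable();
//The user control can bind that DataTable to a grid or repeater as usual.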

  • Customization – Here’s a little bonus – how to update the radio button images with the same filter .wsp.
    • In the hive, the double-headed arrow radio buttons are the following two files:
      • RBSEL.gif
      • RBUNSEL.gif
    • If you want to replace them, create a folder in your solution package with the following path: MyWebPartProject > TEMPLATE > IMAGES
    • Rename your “on” radio button “RBSEL” and save it as a gif
    • Rename your “off” radio button “RBUNSEL” and save it as a gif
    • Place both of them in the IMAGES folder.
    • When you deploy, it will overwrite the default arrows.

This change OVERWRITES the default SharePoint images. Only do this if you want to update all of the radio buttons on the farm. Otherwise you will have to restore the double-headed arrow icons, and it won’t be fun.

 

[Image via the San Francisco Weekly]

How to import text on multiple lines in Excel

Let’s say that you have the string “Top sales person in the Central region” in your database. You want it to look like the following in a cell in an Excel worksheet:

Highlights:
Top sales person in the Central region

How do you do this?

Splitting text over multiple lines in Excel

In order to display text on multiple lines in a cell in Excel, two conditions must be met:

  1. The cell must be formatted with “Wrap Text”
  2. The text must contain the new line character

If you press ALT+ENTER in Excel, Excel automatically formats the cell with “Wrap Text” and inserts a new line character into the cell. But this approach won’t work if you are importing your data from an outside source, for example, if you are importing data with OfficeWriter’s ExcelTemplate object.

Formatting the cell with “Wrap Text” is as easy as right-clicking the cell, going to Format Cells > Alignment and checking ‘Wrap Text’. The next question is how to get the new line character into the cell.

Option 1: Use an Excel formula to concatenate the new line character to the text in the cell

In the example, we need to append “Highlights:” and the new line character to the text that’s already there. Let’s say that the text is in cell D8. Then the formula would be =CONCATENATE("Highlights:", CHAR(10), D8). If the formula had to be applied to a series of cells, where you weren’t sure if there would be a comment or not, then you could wrap that formula in an IF formula: =IF(LEN(D8)>0, CONCATENATE("Highlights:", CHAR(10), D8), "").

What if the text from the database needed to be split over multiple lines?

Let’s suppose the text in the database already contained the “Highlights:” prefix: “Highlights: Top sales person in the Central region”. Then how do you break the string apart?

First we need to grab the “Highlights:” part. We can employ Excel’s LEFT(text, N_chars) function, which grabs the N left-most characters:

=LEFT(D8, 11) will return “Highlights:”

Next we need to grab just the second part. We can use Excel’s MID(text, start_index, chars) function to get a specific sub-string:

=MID(D8, 13, LEN(D8)-11) will return “Top sales person in the Central region”.

We can concatenate these together with the new line character: =CONCATENATE(LEFT(D8,11), CHAR(10), MID(D8, 13, LEN(D8)-11)). We can also wrap it in the similar IF formula if we only wanted to apply this formula if there was a comment. (Otherwise you will end up with #VALUE!).

Option 2: Add the new line character to the text before it’s imported into the file

(For example, manipulating the data in .NET code before importing it into a file using ExcelTemplate).

Just add the newline character to your text: “Top sales person in the Central region” –> “Highlights:\nTop sales person in the Central region”. When the text is imported, Excel will respect the new line character. Make sure that the cell is formatted with “Wrap Text” ahead of time.
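
As a rough .NET sketch of Option 2 (the file names and the "Comments" data source name are placeholders, and this assumes the usual ExcelTemplate open/bind/process/save workflow):

using System.Data;
using SoftArtisans.OfficeWriter.ExcelWriter;

//Build the cell text with a newline (CHAR(10) == '\n') before binding it.
//The template cell that receives this value - assumed here to be marked with a
//%%=Comments.Comment data marker - should already have Wrap Text turned on.
public void ExportComments()
{
    DataTable table = new DataTable();
    table.Columns.Add("Comment", typeof(string));
    table.Rows.Add("Highlights:\n" + "Top sales person in the Central region");

    ExcelTemplate xlt = new ExcelTemplate();
    xlt.Open("CommentsTemplate.xlsx");      //placeholder template file
    xlt.BindData(table, "Comments", xlt.CreateDataBindingProperties());
    xlt.Process();
    xlt.Save("CommentsReport.xlsx");        //placeholder output file
}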


Stories from the WIT Trenches: Ann Millspaugh

[This is the seventh in a series of posts exploring the personal stories of real women in technology. Every woman in tech overcame, at the very least, statistical odds to be here; this blog series aims to find out why, and what they found along the way. Like a number of our interviewees, Ann Millspaugh (t|ln) entered the tech world after college. In less than two years, the former Luddite went from reluctant Drupal admin to passionate advocate of STEM education for girls. She’s currently co-organizer of the Columbia Heights Community Wireless Network and the Online Community Manager for the EdLab Group. If reading her story inspires you to share yours, please feel free to email me.]

1)      Can you take us back to your “eureka!” moment—a particular instance or event that got you interested in technology?

To be honest, I don’t think I can claim to be a “woman in technology”. At best, I’m a woman learning technology, and probably more importantly, how to think about technology. For a lot of people, especially “Millennials” and “digital natives,” there’s something almost noble about being averse to technology – there’s an attitude that “I haven’t submitted myself to this trend yet” or “I’m grounding myself outside of this consumer-driven society.” I’m not saying this as a condescending outsider – I used to feel that way.

Do I feel like I’m now a tech guru who is going to invent the next Linux system? No. But, I do feel like I can be a contributor, and for me, that feeling of empowerment is critical to the way people use and adapt to technology. It’s not about seeing technology as old or new, good or bad, but comprehensively seeing technology for what it is– the resources creating the product, the labor assembling the product, the ingenuity and creativity in software development, and the behavioral trends in the actual usage of these products rather than a cold, static piece of materialism. For me, it’s been fascinating to begin thinking about technology as a tool to improve, analyze and assess behavioral patterns, and that’s what began to get me interested in technology.

2)      Growing up, did you have any preconceived perceptions of the tech world and the kinds of people who lived in it?

Yes, I undoubtedly had preconceptions about the tech world. I started out as one of those people who saw technology as an inhibitor of real-world interaction. Computers were draining, for those anti-social types. I was pretty extreme – I even had a phase in college where I refused to be in pictures because I thought they were too distracting. I think technology can be seen this way – as a way to be self-indulgent and unnecessarily inconvenienced, a byproduct of a consumer-driven society.

It becomes an either-or: either I’m a technology person or I’m not. I think it’s important to realize that just because you don’t dream about coding or you don’t want to eat, sleep, and breathe at a computer doesn’t mean you can’t enjoy computer science. Somehow technology never enters into a realm of moderation; it’s a binary of hacking 24/7 or waiting in line for the Geek Squad. Science and technology fields are like any career – there are people who are obsessed, but there are also plenty of people who live a balanced life.

3)      When did you first start working with tech? Was it by choice?

I was always interested in writing, and over the course of several jobs, realized that writing (as well as many of the arts) is now completely intertwined with technology; it’s almost impossible to pursue those fields without having at least a basic technological background. For me, it was a begrudgingly slow progression over to the tech side. But that mindset ultimately came from a lack of understanding. For example, I’ve always liked learning languages, and learning HTML and CSS was just like learning another language. It never occurred to me that the skills I already had could be translated into a STEM field, and that I would actually like it!

4)      Did you experience any personal or systemic setbacks at any point of your academic or professional career?

Like I said before, I started working with technology by accident – I never saw myself as someone interested in technology, or even particularly apt in technology. In fact, when I was in college, computer science classes were at the bottom of my list, for no particular reason except for my perceptions about computer science. I read an interesting book, Unlocking the Clubhouse: Women in Computer Science, which talked about the implicit socialization processes that drive women away from CS, and technology at large (having a computer in your son’s room versus your daughter’s room; taking your son to fix the car with you). These small actions create superficial gender associations that build and become a heavily weighted reality over time. In a lot of ways I feel like the epitome of those socialization processes – I was never bad at science or math, and in retrospect, I’d have to say it was the accumulation of unconscious decisions and stereotypes that drew me away from the field. I would say that was my biggest setback, that I didn’t explore the field until after college.

5)      Whom do you look to as mentors and/or sources of inspiration in your field?

The open source development communities have been incredibly inspiring to me. Everyone is so authentically collaborative: people work together for the sole purpose of making software easier and more accessible to people – for free. And most people do this in their spare time! You can post a question and have a response within seconds, find tutorials and rank suggestions. It’s this incredible network that continually expands through connective curiosity; you rarely see anyone pitching their company or bragging about their latest contribution. There’s a “we want to keep making this better” attitude that drives people to recognize how much more powerful collaboration is than siloed, individual production. No copyrights here!

6)      Why do you think the rate of attrition for women in software engineering is higher than that of women in most other tech fields?

The perception of computer science and programming. There are lots of studies suggesting that women tend to be more emotionally driven; technology, particularly software engineering, can have the perception of being cold, isolating, and distant from immediate applicability. I think it’s important to stop thinking about technology as a new, revolutionary entity. In my opinion, technology doesn’t revolutionize the way people behave. Fundamentally, people want the same things they’ve wanted for hundreds of years – to communicate, connect, and understand – and technology enables these things to happen at an increasingly accelerated rate. If we start to think about technology through this lens, I think many more people, men and women, will be drawn to the field.

7)      Do you have any suggestions for how to get more girls interested in computers and computer science? Is this important to you?

Hopefully by now, it’s evident that yes – this is important to me! Working with the EdLab Group, I’ve been reading and researching how to make STEM fields more appealing to girls. There are a lot of ways to pursue this, one of the most cited examples being that girls enjoy contextualizing information in real-world examples. Rather than solving for a variable in an algorithm, ask girls, “How can this algorithm be applied to make Georgia’s healthcare system more efficient?”

While this is a successful strategy, I also think attributing certain characteristics to STEM competency can be a slippery slope. Bart Franke, a teacher at the Chicago Laboratory High School who boasts a female enrollment of 50% in his computer science classes, recently gave a presentation about his success, citing, “I teach girls, I don’t teach to girls.” As soon as you make distinctions as a woman, a minority, a socio-economically disadvantaged person, etc… you cause people to self-identify in a way that can perpetuate certain stereotypes. Even though gender, ethnicity or socio-economic status is undoubtedly a significant individual and collective characteristic, there are times where this emphasis is appropriate and then there are times where it’s irrelevant and distracting.

How to Export a SharePoint List to Word Using Word Export Plus

We asked EMC’s Paul Forsthoff (b|t) to give us his honest opinion of OfficeWriter’s Word Export Plus solution. IOHO, he did a masterful job. The full review is available on his Everything SharePoint blog.

I recently had the opportunity to check out SoftArtisans OfficeWriter product. The OfficeWriter product exposes an API that allows information from custom ASP.NET applications to be consumed and used to dynamically and programmatically build Microsoft Word documents and Microsoft Excel spreadsheets.

The OfficeWriter API is a .NET library that allows you to read, manipulate and generate Microsoft Word and Microsoft Excel documents from your own applications. The OfficeWriter product can integrate with SharePoint 2010, allowing you to export SharePoint list data into Microsoft Word and Excel documents.

SoftArtisans provides easy-to-understand sample code, videos, and pre-built SharePoint solutions that make getting started with the product trivial.

For this tutorial I’ll demonstrate deploying, configuring and testing Word Export Plus in a SharePoint 2010 environment. Word Export Plus is a SharePoint solution that demonstrates the usage of the OfficeWriter API in SharePoint 2010. This solution adds a new context menu (custom action) button to list items, allowing you to export the list data to a pre-formatted Word template that you can design yourself in Word, or have automatically generated by Word Export Plus. [Read more…]

Boston’s Big Datascape, Part 3: StreamBase, Attivio, InsightSquared, Paradigm4, Localytics


[Excerpted from the Riparian Data blog]

This ongoing series examines some of the key, exciting players in Boston’s emerging Big Data arena. The companies I’m highlighting differ in growth stages, target markets and revenue models, but converge around their belief that the data is the castle, and their tools the keys. You can read about the first ten companies here and here.

11) StreamBase

  • Products: StreamBase Complex Event Processing Platform lets you build applications for analyzing real-time streaming data alongside historical data. StreamBase LiveView adds an in-memory data warehouse and a BI front-end to the equation, essentially giving you live (well, a few milliseconds behind) BI.
  • Founders: Richard Tibbetts (t|ln), Michael Stonebraker
  • Technologies used: Complex Event Processing, StreamSQL, cloud storage, pattern-matching, in-memory data warehouse, end-user query interface
  • Target Industries: Capital Markets, Intelligence and Security, MMO, Internet and Mobile Commerce, Telecommunications and Networking
  • Location: Lexington, MA

[read the full post at the Riparian Data blog]

Paul Forsthoff Reviews OfficeWriter’s Word Export Plus Solution for SharePoint

The following is an excerpt from a review of OfficeWriter’s Word Export Plus solution for SharePoint, written by Paul Forsthoff, Senior Practice Consultant at EMC Global Services. Read the full review here.

I recently had the opportunity to check out SoftArtisans’ OfficeWriter product. The OfficeWriter product exposes an API that allows information from custom ASP.NET applications to be consumed and used to dynamically and programmatically build Microsoft Word documents and Microsoft Excel spreadsheets.

[Read the full review here.]

Latent Text Algorithms

The basic idea behind this kind of analysis is that there are certain latent topics in a body of text. Some words like car and automobile have a very similar meaning which means they are used in similar contexts. There is a lot of redundancy in language, and with enough effort you can group similar words together into topics.

Math behind the idea

Words are represented as vectors (see vector space model), which are a combination of direction and magnitude. Each word starts out pointing to its own dimension with magnitude 1, which means there is a huge number of dimensions (maybe hundreds of thousands, one for every word that comes up in the data set you’re working on). The basic problem is to flatten this large number of dimensions into a smaller number that is easier to manage and understand.

These vectors are represented as a matrix. In linear algebra, there is the idea of a basis, which is the set of vectors that describe the space you’re working in. For example, in everyday 3D space your basis would have one vector pointing in each dimension.

For another example, you could have a basis which is two vectors that describe a 2D plane. This space can be described in 3 dimensions, like how a piece of paper exists in the real world. But if all you’re dealing with is a 2D plane, you’re wasting a lot of effort and dealing with a lot of noise doing calculations for a 3D space.

Essentially, algorithms that do latent analysis attempt to flatten the really large space of all possible words into a smaller space of topics.
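
As a rough sketch of the usual mechanism, latent semantic indexing does this flattening with a truncated singular value decomposition of the term-document matrix:

    $$A \approx U_k \Sigma_k V_k^{\top}$$

Here A has one row per term and one column per document, and k (the number of retained dimensions, i.e. the “topics”) is far smaller than the number of terms, so each document ends up described by just k coordinates instead of one per word.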

A fairly good explanation of the math involved is in the ‘Skillicorn – Understanding Complex Datasets‘ book, in chapters 2 and 3.

For a Mahout example, see Setting up Mahout. There are also examples in the examples/bin directory for topic modelling and clustering.

Latent Semantic Indexing example

Mahout doesn’t currently support this algorithm. Maybe because it was patented until recently? Hard to parallelize? In any case, there’s a Java library called SemanticVectors which makes use of it to enhance your Lucene search.

Note: I think there’s a technical distinction between topic modelling and LSI, but I’m not sure what it is. The ideas are similar, in any case.

It’s just a JAR, so you only need to add it to your CLASSPATH. However, you also need to install Lucene and have it in your CLASSPATH (both the lucene-demo and lucene-core jars).

  • Index the Enron data (or some other directory tree of text files) into Lucene:  java org.apache.lucene.demo.IndexFiles -index PATH_TO_LUCENE_INDEX -docs PATH_TO_ENRON_DOCS
  • Create LSI term frequency vectors from that Lucene data (I think this took a while, but I did it overnight so I’m not sure): java pitt.search.semanticvectors.BuildIndex PATH_TO_LUCENE_INDEX 
  • Search the LSI index. There are different search types; I used ‘sum’, the default: java pitt.search.semanticvectors.Search fraud

Here’s my result:

The input is fairly dirty. Mail headers are not stripped out, and some emails are arbitrarily cut off at the 80th column (which may explain ‘rivatives’ at the bottom instead of ‘derivatives’). Still, it can be pretty useful.

New England Database Summit 2012: Too Big to Flail?


[Image via John Hugg]

Overview

This year’s New England Database Summit was held in the Stata Center at MIT. If you haven’t been there, it’s a pretty neat building, with some rather odd architecture. I’d say the conference attendees were 70% academic, primarily researchers and professors from MIT, Brown, the University of Wisconsin-Madison, and a little bit of Northeastern and Harvard. The other 30% were businesses—I spotted people from Hadapt, Novartis, Endeca, IBM, VoltDB, and TokuTek. Facebook’s head MySQL guy, Mark Callaghan, was there to give one of the keynotes. Microsoft and EMC were the sponsors, and a bunch of lectures came from Dave DeWitt’s U-Wisconsin/Jim Gray Systems Lab crew. About half of the talks concerned NoSQL and Hadoop, while the other half covered traditional databases (mostly MySQL) with a smattering of hardware in between. Overall I found it enlightening to see what may be coming down the pipeline.

Keynote – Johannes Gehrke (Cornell) on Declarative Data Driven Coordination

The heart of this talk was a set of extensions to SQL that basically allows one to make an “entangled query.” An example entangled query might be “what are the set of classes I can take, which will all have one friend of mine in them.” As one’s set of classes depends upon others’ sets, the queries to determine the sets are said to be entangled. Other examples given were for MMO raid scheduling, wedding gift registries (“what are the set of gifts not yet purchased”), booking plane tickets with conditions (“What are the sets of flights that go from New York to Boston in the morning, on JetBlue, and I want to sit next to Larry Page”). The system was still trying to keep ACID, although due to not being able to really resolve anything until the other side makes a choice, it’s really eventually consistent. The flip side of these queries was entangled transactions. Rather than booking a specific flight or seat, one might just book whatever is “best” in the set from an entangled query. One wouldn’t actually know what was booked until later. Guaranteed booking in case one gets an empty set was actually a piece of future work, which was a little surprising. It looks to me like this could be very interesting and helpful, but it still has some kinks that need to be worked out. External constraints are a hugely limiting factor, and multiple nested constraints (A will only sit with B, who won’t sit with C, who will only sit with A) make the entire thing very difficult to solve or it will fall apart in short order. At least one person asked about this in a roundabout way, and didn’t get a satisfactory answer.

Session 1 – Daniel Bruckner (MIT) on Curating Data at Scale: The Data Tamer System

This session was about a mostly automated process for cleaning up raw web data and putting it into proper columns for querying. I have a little bit of experience with this, and didn’t see anything that was particularly revolutionary. The case study used was goby.com, which uses this technique under the hood. Basically, web spiders using regular expressions can mass scrape sites to collect pieces of data, like the price of admission, where it is, what it’s called, when it’s open, contact info, etc. This raw data can then be sorted and attached to a specific thing (with goby a thing is an event or place). One piece of technology I found rather neat was that they don’t necessarily preset the schema of what properties a thing has, instead adding them as new ones appear. It wasn’t clear to me that this was automated, but their UI for managing the data was pretty slick. The “mostly automated” part came up at the end of the talk, where it was revealed that 10 to 20% of the data needs to be manually filtered by an analyst, which isn’t surprising when dealing with messy data.

Session 2 – Willis Lang (U-Wisconsin) on Energy-Conscious Data Management Systems

This was a mostly academic talk. Basically, no one is really looking at the trade-offs between power and performance in software. It was shown that a certain set of techniques, like specific map/reduce tasks, certain searches, etc., might not be as performant but will still reach a specific required performance level while using less power. Much of the session was spent detailing the actual experiments done to show this. The power savings chosen for a rather arbitrary performance level came out to about 12%.

Session 3 – Jeong-Hyon Hwang (SUNY Albany) on G* – A Parallel System for Efficiently Managing Large Graphs

This was another academic talk, but the premise was rather interesting if it ever comes to fruition. G* is a graph API for managing graph data that runs efficiently on Hadoop. It is scheduled for release in August 2012, and will be open source. There is not currently a way to test it, and its end performance and stability are unknown, but for those with a need to analyze graph data, G* could prove very helpful.

Session 3 – Richard Tibbetts (StreamBase) on StreamBase LiveView

This was a business/marketing session. LiveView is a pretty compelling product. They are capable of continuously streaming live data (financial feeds, mostly) into constantly updated views that have been designed by user analysts. As changes happen in the streaming data, the actual data view and results a user sees are changed before their eyes. Currently, LiveView is in production with a few unnamed large financial firms. Some random  performance targets given out were that they can handle 50k updates/sec, with 100ms propagation time to the end user. I specifically asked if this was a target or if they were actually meeting these numbers, and they claimed to be “destroying them.” Future work includes dealing with messier data, more events and alerting for data changes, and incorporating more pre-built analytics (possibly from other organizations).

Session 5 – David Karger (MIT) on Documents with Databases Inside Them

This was a purely academic talk with a neat premise. Dido consists of an entire data viewer application with attached data, and a WYSIWYG editor for the data viewer, which are appended to an html document. Essentially, this turns the document into a web application, but it doesn’t need any sort of network connection as all the functionality is captured locally in the document. If a user wanted to make a new application, all he/she needs to do is copy the file, and possibly change the underlying data set. This came with an argument against cloud computing – instead of running an application on the cloud, just make every document include a lightweight, usable application  that can be edited and created locally. One caveat was that applications with massive underlying data sets, even if much of these were never viewed (eg Google Maps), couldn’t possibly be stored locally, and in these big data cases they’d have to be fetched from the cloud. A call to action was made to make httpd for SQL and Hadoop so that accessing data sources is as easy as making an html page.

Keynote 2 – Mark Callaghan (Facebook) on Why Performance is Overrated

Facebook still maintains an absolutely massive number of MySQL instances, and Callaghan is one of Facebook’s ten (!) operations guys who basically keep the damn thing running. The primary point of the talk was to say that at Facebook, they don’t care about peak performance, but instead care about average/constant performance. The rest of the lecture was devoted to how they do everything in their power to avoid stalls in processing. Much of what was presented was very specific to Facebook’s needs. Their current git log for MySQL has 452 changes, most of which are being kept internal to the company, but they do occasionally submit patches for MySQL itself. Since the Oracle acquisition this process has become slower. Towards the end of the talk, Callaghan mentioned that they do have a need to transfer MySQL data into Hadoop, but they are still doing batch transactions to do it because none of the other technology really works.

Session 6 – Daniel Abadi (Hadapt) on Turning Hadoop into an All-Purpose Data Processing Platform

First: Daniel Abadi is one of the fastest speakers I’ve ever heard, while still being understandable. He must have condensed a one-hour lecture into 20 minutes. The lecture consisted of a general overview of Hadapt’s datastore system. Basically, Hadapt feels that having a traditional DBMS next to a Hadoop instance is silly architecture, and is trying to bring some of the DBMS-like features into Hadoop. They aren’t actually sure yet what the best way to store the data is. Abadi had a few hecklers towards the end, as Hadapt has apparently been flip-flopping on this issue. (I believe right now they are using a hacked up version of PostgreSQL.)

Session 7 – Mohamed Y. Eltabakh (WPI) on CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop

This was an academic talk. CoHadoop allows for the creation of localization attributes for files which take better advantage of rack awareness and data placement for more efficient map reduce tasks on Hadoop. There are some kinks in it, but it looks pretty solid and will probably eventually find its way into mainline Hadoop in some form.

Session 8 – Andy Pavlo (Brown) on Making Fast Databases Faster

This was an interesting/hilarious lecture. Pavlo started behind a door with a bodyguard in front of it. The bodyguard was the full suit, sunglasses, and ear communicator type. He opened the door and escorted Pavlo to the podium, where he remained standing guard during the entire presentation (as well as threatening anyone who asked a question). The point of this schtick was that every time Pavlo’s advisor, Stan Zdonik, asked him to fix a specific problem with H-Store (the academic version of VoltDB), Pavlo would instead fix something that made the occurrence of said problem less likely, without ever fixing the actual problem. With each fix, Zdonik became more irate until he “beat him senseless with a 2×4.” Poor Pavlo had to take a restraining order out against Zdonik, hence the bodyguard. Anyways. The actual presentation focused on three optimizations that I believe have all made their way into H-Store at this point, namely better distributing partitions such that distributed transactions are less likely to occur. One of the neat aspects of doing this is they are using machine learning to determine whether or not a transaction is distributed, and how long that may take, to better use other nodes that might be waiting for that transaction to finish. This is all available on GitHub.

Session 9 – Alvin Cheung (MIT) on Automatic Partitioning of Database Applications

The idea here is that a developer can not always see ahead of time when something should be performed by a client application or when it should be run as a stored procedure on a database. Combined with the source code and some profiling information, the authors created a tool that will basically spit out two separate programs – one to run as the application itself, and a second program containing all of the stored procedures to be called. There is some amount of heap and object transfer logic that is generated as well. It was unclear to me how much profiling information was necessary – it wasn’t all automatically captured, and I could imagine that for significantly complex systems determining such information would be difficult.

Session 10 – Jaeyoung Do (U-Wisconsin) on Racing to the Peak: Fast Restart for SSD Buffer Pool Extension

This was a purely academic talk. The authors devised an alternate scheme that doesn’t sacrifice performance compared to other schemes while making SSD’s reach their peak performance rates faster through buffer pool shenanigans. Basically they lazily write to disk instead of forcing copies to disk, and have a log they can use to replay events in case of disk failure.

Session 13 – Yandong Mao (MIT) on Cache Craftiness for Fast Multicore Key-Value Storage

This was pretty gimmicky, but interesting nonetheless. Robert Morris and a few of his students created a single core key value store called MassTree (Massachusetts + Tree) that achieves “kick ass performance”: 5 million inserts per second and 7 million gets per second. The highlight had to be when these performance numbers were compared to VoltDB – the MassTree had a giant bar next to the tiny sliver of VoltDB. Of course, the slide was a joke. These performance numbers will pretty much plummet the second any sort of multi-core or IO scenarios begin, as it relied on keeping the entire data store in memory. Some of the techniques may be applicable to lower layers of other databases, however.

Session 14 – Ross Shaull (Brandeis) on Retro: Modular and Efficient Retrospection in a Database

This was an academic talk that has some amount of administrative use, as well. The author devised a method of adding snapshots to any database which does not currently possess them by using components common to any modern database. It does require an additional set of servers for managing and storing the snapshots, but adding these features through the Retro method to a specific database (I think it was PostgreSQL) only took about 200 lines of modification on the database source (plus all the retro specific code).

 

How to Set Up Apache Mahout

Apache Mahout is a set of machine learning tools, which deal with classification, clustering, recommendations, and other related stuff. We just bought a new book called Mahout In Action which is full of good examples and general machine learning advice; you can find it here. It’s pretty neat and it’s growing quickly, so I decided to take the time to learn about it.

Mahout functions as a set of MapReduce jobs. It integrates cleanly with Hadoop, and this makes it very attractive for doing text analysis on a large scale. Simpler queries, for instance getting the average response time from a customer, are probably better suited for Hive.

Most examples I’ve seen use Mahout as sort of a black box. The command line just forwards arguments to various Driver classes, which then work their magic. All input and output seems to be through HDFS, and Mahout also uses intermediate temp directories inside HDFS. I tried changing one of the Driver classes to work with HBase data, but the amount of work that seemed to be necessary was non-trivial.

Example

I decided to work with the Enron email data set because it’s reasonably large and it tells a story about fraud and corruption. Their use of keywords like ‘Raptor’ and ‘Death Star’ in place of other more descriptive phrases makes topic analysis pretty interesting.

Please read ‘Important things to watch out for’ at the bottom of this post first if you want to follow along.

This is what I did to get the Enron mail set to be analyzed using the LDA algorithm (Latent Dirichlet Allocation), which looks for common topics in a corpus of text data:

  • The Enron emails are stored in the maildir format, a directory tree of text emails. In order to process the text, it first needs to be converted to SequenceFiles. A SequenceFile is a file format used extensively by Hadoop, and it contains a series of key/value pairs. One way to convert a directory of text to SequenceFiles is to use Mahout’s seqdirectory command:
    ./bin/mahout seqdirectory -i file:///home/georges/enron_mail_20110402 -o /data/enron_seq

    This can take a little while for large amounts of text, maybe 15 minutes. The SequenceFiles produced have key/value pairs where the key is the path of the file and the value is the text from that file.

  • Later on I wrote my own Java code which parsed out the mail headers to prevent them from interfering with the results. It is fairly simple to write a MapReduce task to quickly produce your own SequenceFiles. Also note that there are many other possible sources of text data, for instance Lucene indexes. There’s a list of ways to input text data here.
  • I needed to tokenize the SequenceFiles into vectors. Vectors in text analysis are a technical idea that I won’t get into, but these particular vectors are just simple term frequencies.
    ./bin/mahout seq2sparse -i /data/enron_seq -o /data/enron_vec_tf --norm 2 -wt tf -seq

    This command may need changing depending on what text analysis algorithm you’re using. Most algorithms would require tf-idf instead, which weights the term frequency against the size of the email. This took 5 minutes on a 10-node AWS Hadoop cluster. (I set the cluster up using StarCluster, another neat tool for managing EC2 instances.)

  • I ran the LDA algorithm:
    ./bin/mahout lda -i /data/enron_vec_tf/tf-vectors -o /data/enron_lda -x 20 -k 10

    x is the max number of iterations for the algorithm. k is the number of topics to display from the corpus. This took a little under 2 hours on my cluster.

  • List the LDA topics:
    ./bin/mahout ldatopics -i /data/enron_lda/state-4 --dict /data/enron_vec_tf/dictionary.file-0 -w 5 --dictionaryType sequencefile

    This command is a bit of a pain because it doesn’t really error when you have an incorrect parameter – it just does nothing. Here’s some of the output I got:

    MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
    Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
    HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf
    MAHOUT-JOB: /data/mahout-distribution-0.5/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
    Topic 0
    ===========
    i [p(i|topic_0) = 0.023824791149925677
    information [p(information|topic_0) = 0.004141992353710214
    i'm [p(i'm|topic_0) = 0.0012614859683494856
    i'll [p(i'll|topic_0) = 7.433430267661564E-4
    i've [p(i've|topic_0) = 4.22765928967555E-4
    Topic 1
    ===========
    you [p(you|topic_1) = 0.013807669181244436
    you're [p(you're|topic_1) = 3.431068629183266E-4
    you'll [p(you'll|topic_1) = 1.0412948245383297E-4
    you'd [p(you'd|topic_1) = 8.39664771688153E-5
    you'all [p(you'all|topic_1) = 1.5437174634592594E-6
    Topic 2
    ===========
    you [p(you|topic_2) = 0.03938587430317399
    we [p(we|topic_2) = 0.010675333661142919
    your [p(your|topic_2) = 0.0038312042763726448
    meeting [p(meeting|topic_2) = 0.002407369369715602
    message [p(message|topic_2) = 0.0018055376982080878
    Topic 3
    ===========
    you [p(you|topic_3) = 0.036593494258252174
    your [p(your|topic_3) = 0.003970284840960353
    i'm [p(i'm|topic_3) = 0.0013595988902916712
    i'll [p(i'll|topic_3) = 5.879175074800994E-4
    i've [p(i've|topic_3) = 3.9887853536102604E-4
    Topic 4
    ===========
    i [p(i|topic_4) = 0.027838628233581693
    john [p(john|topic_4) = 0.002320786569676983
    jones [p(jones|topic_4) = 6.79365597839018E-4
    jpg [p(jpg|topic_4) = 1.5296038761774956E-4
    johnson [p(johnson|topic_4) = 9.771211326361852E-5
  • Looks like the data needs a lot of munging to provide more useful results. Still, you can see the relationship between some of the words in each topic.

I recommend playing around with the examples in the examples/bin directory in the Mahout folder.

Important things to watch out for

  • I ran out of heap space once I asked Mahout to do some real work. I needed to increase the heap size for child MapReduce processes. How to do this is basically described here. You only need the -Xmx option, and I went for 2 gigabytes:
    <property>
       <name>mapred.child.java.opts</name>
       <value>
         -Xmx2048M
       </value>
     </property>

    You may also want to set MAHOUT_HEAPSIZE to 2048, but I’m not sure how much this matters.

  • Some environment variables weren’t set on my StarCluster instance by default, and the warnings are subtle. HADOOP_HOME is particularly important. If HADOOP_HOME is not set, MapReduce jobs will run as local jobs. There were weird exceptions accessing HDFS, and your jobs won’t show up in the job tracker. They do warn you in the console output for the job, but it’s easy to miss. JAVA_HOME is also important but it will explicitly error and tell you to set this. HADOOP_CONF_DIR should be set to $HADOOP_HOME/conf. For some reason it assumes you want HADOOP_HOME/src/conf instead if you don’t specify. Also set MAHOUT_HOME to your mahout directory. This is important so it can add its jar files to the CLASSPATH correctly.
  • I ended up compiling Mahout from source. The stable version of Mahout had errors I couldn’t really explain. File system mismatches or vector mismatches or something like that. I’m not 100% sure that it’s necessary, but it probably won’t hurt. Compilation is pretty simple, ‘mvn clean install’, but you will probably want to add ‘-DskipTests’ because the tests take a long time.

Creating a Virtual Machine in SCVMM Self-Service Portal v1

System Center Virtual Machine Manager ships with an add-on called the Self-Service Portal. The goal of the Self-Service Portal is to provide a hassle-free way for end users to create and manage their own virtual machines in Hyper-V. Below are the instructions and screenshots an end user would use to create a virtual machine in v1 of the Self-Service Portal that shipped with System Center Virtual Machine Manager 2008 R2.

These steps assume that you have Self-Service portal running in your environment and that you have already created SCVMM templates for users to deploy.

How to Create a virtual machine in the Self-Service Portal

  1. Log in to the Self-Service Portal.
  2. Under the Create menu, click New Computer.
  3. Under the Creation Source, choose a Template from the available list.
  4. Enter the display name and description of the virtual machine.
  5. Enter the computer name (this will be the name referenced on the network). Be wary of adding a trailing space in the computer name or it will fail to create.
  6. Click Create. If successful, a pop-up should indicate that the virtual machine was created successfully.
  7. Wait for the virtual machine to be deployed to the host. You can view the estimated completion time from the properties -> latest job tab.
  8. Connect to the virtual machine.

Screenshots

 

Additional Tips

  • Configure your template or use group policy to make your virtual machines accessible over RDP. The connection method used in the Self-Service Portal is difficult to use. It requires an ActiveX control to be installed on the client and is generally inferior to RDP (fixed screen size, no copy/paste, etc.).
  • Use a naming convention to track virtual machines created through the Self-Service Portal (i.e. hv-xx). Internally, we created a small ASP.NET web application that automatically generates a virtual machine name for users when they click a button – a minimal sketch of the idea follows below.
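
For example, a bare-bones generator along those lines might look like this. All names here are hypothetical, and a real version would persist the counter (e.g. in a database) so numbers survive application restarts and aren’t reused.

//Hypothetical "hv-xx" name generator for the ASP.NET app's button click handler
public static class VmNameGenerator
{
    private static int _counter = 0;
    private static readonly object _lock = new object();

    public static string NextName()
    {
        lock (_lock)
        {
            _counter++;
            //Produces names like "hv-01", "hv-02", ...
            return string.Format("hv-{0:D2}", _counter);
        }
    }
}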

 

 

 
