Boston’s Big Datascape, Part 2: Nasuni, VoltDB, Lexalytics, Tokutek, Cloudant

[Excerpted from the Riparian Data blog]

This ongoing series examines some of the key, exciting players in Boston’s emerging Big Data arena. The companies I’m highlighting differ in growth stages, target markets and revenue models, but converge around their belief that the data is the castle, and their tools the keys. You can read about the first five companies here.

6) Nasuni

  • Product: Nasuni is a cloud enterprise storage system. Their Nasuni Filers propagate data from a local disk cache to cloud storage, essentially giving users a unified file share in the cloud that doesn’t require replication of file servers.
  • Founder: Andres Rodriguez
  • Technologies used: on-premise storage, UniFS™ file system, VMs, cloud storage
  • Target Industries: Manufacturing, Construction, Legal, Education
  • Location: Natick, MA

 

7) VoltDB

  • Product: VoltDB is an in-memory relational database designed to handle millions of operations per second (125k TPS per commodity server) with near-perfect fault tolerance and automatic scale-out. It comes in three flavors: an Enterprise edition, a startup/ISV edition, and a community edition.
  • Founder: Michael Stonebraker (ln)
  • Technologies used: in-memory DBMS, OLTP, ACID, SQL
  • Target industries: Capital Markets, Digital Advertising, Online Games, Network Services
  • Location: Billerica, MA

[Read the full post]

Boston Hadoop Meetup Group: The Trumpet of the Elephant

Heheh. But seriously, if you live in the Boston area and are working with Hadoop, or interested in working with Hadoop, or just think the name is fun to say, you should absolutely clear your calendar the night of February 15. Why? Because it’s the first Boston Hadoop Meetup Group since November, and judging by the presenter line-up, it’s going to be a doozie (or an Oozie, if you want to get all topical).

First up, MapR’s Chief Application Architect Ted Dunning (t|l) on using Machine Learning within Hadoop. I’m really excited about this one.

Second, Cloudera Systems Engineer Adam Smieszy (t|l) on integrating Hadoop into your existing data management and analysis workflows.

Last, Hadapt’s CTO Philip Wickline (t|ln) “will give a high-level discussion about the differences between HBase and Hive, and about transactional versus analytical workloads more generally speaking, and dive into the systems required for each type of workload.”

Each talk will run about 15-20 minutes, with time for Q&A after, followed by (free) beer and mingling.

The Boston Hadoop Meetup Group is organized by Hadapt’s Reed Shea (t|l). Hadapt is doing some very, very cool stuff with unstructured and structured data processing and analytics, cool enough that founder/Chief Scientist Daniel Abadi took teaching leave from Yale to turn his research into a product.

This particular meetup is sponsored by Hadapt, MapR, Cloudera and Fidelity, and is being held at Fidelity’s downtown office from 6 to about 8:30 pm. For more information and to sign up, visit the event page.

See you there!

Better living through user scripts

The best feature of the web is the fact that once a page is served to your client, it’s yours.  There’s no such thing as a closed-source web page, only one that you haven’t looked at.  Better yet, any reasonably modern browser can inject your custom CSS and javascript into a web page after it’s been downloaded, to extend or enhance it any way you see fit.  As a celebration of the open web, here are three client-side additions to JIRA’s Greenhopper that I wrote to make my daily life better.

Don’t Resolve Tasks

Our Greenhopper workflow deals mostly with Stories and Tasks.  Both of those are stored as JIRA issues, so they can both be either Closed or Resolved.  The way we do things, it doesn’t make sense to Resolve a task, only ever Close it.  This short bit of CSS will remove the option to Resolve a task when it gets dragged into the Done column:

/* Hide all of the transition options in the pop-up that appears
   when an issue is dragged into the Done column... */
.aui-popup .gh-aui ul li
{
  display:none
}
/* ...then show only the first option, leaving Close as the sole choice */
.aui-popup .gh-aui ul :first-child
{
  display:inline-block
}

Highlight Tasks By User

Greenhopper sets different colors on different issue types. This is nice, but I realized that what I really want is to be able to quickly distinguish tasks by who’s working on them, rather than what they are. The following javascript will run through the Task Board and assign classes to each subtask based on who’s assigned to it (and run again whenever the page updates from dragging an issue):

function addClasses() {
    // Find every issue card on the Greenhopper task board.
    var subtasks = document.getElementsByClassName("gh-issue");

    for (var i = 0; i < subtasks.length; i++) {
        var issue_body = subtasks[i].children[0].children[1];
        var assignee = null;
        // Look through the issue's fields for the assignee and normalize the
        // name into a lowercase, whitespace-free class name.
        for (var j = 0; j < issue_body.children.length; j++) {
            if (issue_body.children[j].getAttribute('data-fieldid') == 'assignee') {
                assignee = issue_body.children[j].innerText.replace(' ', '').replace('\n', '').toLowerCase();
            }
        }

        // Tag the issue with the assignee's name so CSS can style it per user.
        if (assignee != null && subtasks[i].children[0].className.indexOf(assignee) < 0) {
            subtasks[i].children[0].className += " " + assignee;
        }
    }
}

// Re-run whenever the board re-renders (e.g. after dragging an issue).
document.addEventListener("DOMNodeInserted", addClasses);
addClasses();

Once that’s in place, I can add a little CSS to style different users.  For example, this turns all issues assigned to me green and bold:

div.gh-issue-inner.seankermes {
  background-color:#384 !important;
  font-weight:bold;
}

Colored Columns

Our Greenhopper task board has five columns at the moment, and there are enough issues on it that when I scroll down past the column headers, I can lose track of which issues are in which column based solely on their position. This script sets the background of each column in the task board to a different color to make the columns easier to tell apart:

function colorColumns() {
    // One background color per column; the palette cycles if there are more than five.
    var colors = ["#e0e0d7", "#cca36e", "#614a48", "#f5e1a4", "#99948b"];
    var cols = document.getElementsByClassName("gh-step-col");
    for (var i = 0; i < cols.length; i++) {
        cols[i].style.backgroundColor = colors[i % 5];
    }
}
// Re-apply whenever the board re-renders.
document.addEventListener("DOMNodeInserted", colorColumns);
colorColumns();

All the code here can be installed in your very own browser, either by default (Chrome handles .user.js files as extensions) or with browser add-ons like Stylish and Greasemonkey.

#Meme15 Assignment 2: All A’Twitter

A new monthly blog series has entered the #sqlfamily. The brainchild of Jason Strate (b|t), “#Meme15” focuses on the ways social networks can further our professional development. This month’s assignment is one dear to my own heart (and brain. And fingers): Twitter. I’ve written before about what Twitter can do for your company: how it can give high-tech B2Bs personality, credibility and new leads. What I haven’t covered as much is what it can do for you, the employee. There are two questions in the assignment:

 

  • Why should average Jane or Joe professional consider using Twitter?
  • What benefit have you seen in your career because of Twitter?

As a person whose primary job responsibilities involve social media, I’m going to go with the first option—for an excellent answer to the second, check out Stacia Misner’s response.

So, why should you, the non-Social-Media-Marketer/Specialist/Strategist/etc., use Twitter? In short, there are three main reasons: to build relationships, gain knowledge and enhance your public image.

In slightly longer, Twitter is a public conversation, a place to learn, share and connect. Someone posts a link to a blog post about Power View; you read it and learn something new about Power View (animated data points, oh my!). Someone asks a question about stored procedures, aka your pride and joy, and you answer them. Bonds form between the teachers and the taught, the @er and @ed, tweeter and retweeter—but they can also form, albeit more loosely, between all of the above and their networks of listeners. When you perform any activity on Twitter, from favoriting a Tweet to organizing a Tweetup, it deepens your digital profile to anyone who thinks to look or happens to listen at the right time.

Twitter allows you to join (or start!) non-geographically-restricted communities grouped around any interest or combination of interests. It lets you play pin the avatar on the body at conferences. It’s a virtual kickstarter for eventual IRL relationships. For all the banality of some of its content, Twitter’s function as a connector is far from trivial.

 [#Meme15 logo by Matt Velic]

Boston’s Big Datascape, Part 1

[Excerpted from the Riparian Data blog]
Big Data, or the technologies, languages, databases and platforms used to efficiently store, analyze and extract conclusions from massive data sets, is a Big Trend right now. Why? In a nutshell, because a) we are generating ever-increasing amounts of data, and b) we keep learning faster, easier and more accurate ways of handling and extracting business value from it. On Wall Street, some investment banks and hedge funds are incorporating sentiment analysis of web documents into their trading strategies. In healthcare, companies like WellPoint, Explorys and Apixio are using distributed computing to mine health records, practice guidelines, studies and medical/service costs to more accurately and affordably insure, diagnose and treat patients.

Unsurprisingly, Silicon Valley is big data’s epicenter, but Boston, long a bastion of Life Sciences, Healthcare, High Tech and Higher Ed, is becoming an important player, particularly in the storage and analytics arenas. This series aims to spotlight some of the current and future game changers. These companies differ in growth stages, target markets and revenue models, but converge around their belief that the data is the castle, and their tools the keys.

1) Recorded Future

  • Product: Recorded Future is an API that scans, analyzes and visualizes the sentiment and momentum of specified references in publicly available web documents (news sites, blogs, govt. sites, social media sites, etc.)
  • Founder/CEO: Christopher Ahlberg
  • Technologies used: JSON, real-time data feeds, predictive modeling, sentiment analysis
  • Target Industries: Financial Services, Competitive Intelligence, Defense Intelligence
  • Location: Cambridge, MA

2) Hadapt

  • Product: The Hadapt Adaptive Analytical Platform is a single system for processing, querying and analyzing both structured and unstructured data. The platform doesn’t need connectors, and supports SQL queries.
  • Founders: Justin Borgman (CEO); Dr. Daniel Abadi (Chief Scientist)
  • Technologies used: Hadoop, SQL, Adaptive Query Execution™
  • Target Industries: Financial Services, Healthcare, Telecom, Government

[Read the full post]

Manual Joins in Hadoop

I recently learned how to perform joins with MapReduce. You usually won’t have to do this yourself, since tools such as Hive or Pig can do it much more easily, but it’s still a cool idea, so I’ll discuss the overall concept here. Let’s take the following two tables, each of which exists across one or more files. They contain information about employees at a company.

Employees (PersonID / Name / Age)

100   Bobby   24
101   Charles 54
102   Jenny   23
103   Oswald  41
104   Cindy   30

Pets (PersonID / Pet type / Pet name)

100   Dog   Knuckles
101   Snake Jenny
103   Cat   Uncle Naptime
102   Bird  Mitzy
102   Bird  Bessy
100   Dog   Chuckles
103   Cat   Sir Borington

We want to join these two tables to associate a person’s name with their pet’s name. So we need to perform a join using the PersonID. There are generally two ways to perform a join in MapReduce: one uses both a mapper and a reducer, while the other uses only a mapper. Both ways have their pros and cons.

Mapper and Reducer

So our mappers will read in all the data from both tables and spit out results using the PersonID as the key, regardless of which table each one happens to be processing. The combined results from all mappers will look like this (a rough sketch of such a mapper follows the table):

Key | Value
100 | 100   Bobby   24
101 | 101   Charles 54
102 | 102   Jenny   23
103 | 103   Oswald  41
104 | 104   Cindy   30
100 | 100  Dog   Knuckles
101 | 101  Snake Jenny
103 | 103  Cat   Uncle Naptime
102 | 102  Bird  Mitzy
102 | 102  Bird  Bessy
103 | 103  Cat   Sir Borington
100 | 100  Dog   Chuckles
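
Here is a minimal sketch of such a mapper in plain Java MapReduce. It assumes whitespace-separated input lines whose first field is the PersonID; the class name and field handling are illustrative, not taken from a real job:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Reads lines from either Employees or Pets and emits them keyed by PersonID,
// so every record about a given person ends up at the same reducer.
public class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text personId = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Both tables start with the PersonID column, so the key is simply
        // the first whitespace-separated field.
        String[] fields = line.toString().trim().split("\\s+");
        if (fields.length == 0 || fields[0].isEmpty()) {
            return; // skip blank lines
        }
        personId.set(fields[0]);
        context.write(personId, line); // value is the whole record, unfiltered
    }
}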

Obviously, you can and should filter out unneeded information; I’m just not doing it here. All the values for a given key are then grouped together and handed to a single reduce call. If we look at the reducer responsible for key 100, it will have values like so:

100  Bobby   24
100 Dog   Knuckles
100 Dog   Chuckles

This reducer now has Bobby’s record, including his age, and all the other records are his pets. It can output:

Bobby Knuckles
Bobby Chuckles

… Or whatever output you want. You can easily perform any kind of join this way; it’s all a matter of filtering it how you want. In the above example, one reducer will get the key 104, which corresponds to Cindy, who doesn’t have any pets. So that reducer gets the single value of “104   Cindy   30”. If we want to perform an inner join, that reducer can emit nothing. If we want an outer join or a left join, it could emit “Cindy  null” or some such. You really have a lot of flexibility. A matching reducer sketch follows.
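
Here is a minimal reducer sketch to match, doing an inner join only. It assumes the whole record was passed through as the value and uses a crude check (only employee rows end in a numeric age) to tell the two record types apart; class names are again illustrative:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Joins one person's employee record with all of their pet records,
// emitting "name<TAB>petName" pairs (an inner join on PersonID).
public class JoinReducer extends Reducer<Text, Text, Text, NullWritable> {

    @Override
    protected void reduce(Text personId, Iterable<Text> records, Context context)
            throws IOException, InterruptedException {
        String employeeName = null;
        List<String> petNames = new ArrayList<String>();

        for (Text record : records) {
            // Split into PersonID, second field, and "the rest" (pet names
            // like "Uncle Naptime" can contain spaces).
            String[] fields = record.toString().trim().split("\\s+", 3);
            if (fields.length < 3) {
                continue;
            }
            if (fields[2].matches("\\d+")) {
                employeeName = fields[1];   // Employees: id, name, age
            } else {
                petNames.add(fields[2]);    // Pets: id, type, pet name
            }
        }

        // Inner join: emit nothing if this person has no pets or no employee row.
        if (employeeName != null) {
            for (String pet : petNames) {
                context.write(new Text(employeeName + "\t" + pet), NullWritable.get());
            }
        }
    }
}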

Secondary sorting

One downside to the above example is that the line containing the person’s name may not be the first value in the list. This requires the reducer to examine each value to determine whether it contains the person information or more pet information. However, Hadoop supports secondary sorting, which means you can sort composite keys differently from how you partition and group them. So, for instance, in the above example, we could create a composite key in the mappers that includes a little bit more information, like so:

Key | Value
100#Emp | 100   Bobby   24
101#Emp | 101   Charles 54
102#Emp | 102   Jenny   23
103#Emp | 103   Oswald  41
104#Emp | 104   Cindy   30
100#Pet | 100  Dog   Knuckles
101#Pet | 101  Snake Jenny
103#Pet | 103  Cat   Uncle Naptime
102#Pet | 102  Bird  Mitzy
102#Pet | 102  Bird  Bessy
103#Pet | 103  Cat   Sir Borington
100#Pet | 100  Dog   Chuckles

Then the job can be configured to partition and group the keys based on the number before the ‘#’ but sort based on the entire key. This way, each reducer still gets all the values for a given PersonID, but the employee record will always be first, since “#Emp” sorts before “#Pet”. Of course, secondary sorting has much, much more potential than this, but like I said, this is an incredibly simple look at basic MapReduce joining. A rough sketch of the partitioning and grouping wiring follows.
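
As a rough sketch of that wiring, using plain Text keys of the form “100#Emp” (the partitioner, comparator, and job setup here are my own illustration, not code from the post):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Partition on the PersonID part of the "id#tag" key, so every record for
// one person lands on the same reducer regardless of its #Emp/#Pet tag.
public class IdPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String id = key.toString().split("#")[0];
        return (id.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Group reducer input on the PersonID part only, so "100#Emp" and "100#Pet"
// arrive in the same reduce() call. The default sort on the full key already
// puts "#Emp" before "#Pet", which is what guarantees the employee row is first.
class IdGroupingComparator extends WritableComparator {
    protected IdGroupingComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String idA = a.toString().split("#")[0];
        String idB = b.toString().split("#")[0];
        return idA.compareTo(idB);
    }
}

// Hypothetical job wiring:
//   job.setPartitionerClass(IdPartitioner.class);
//   job.setGroupingComparatorClass(IdGroupingComparator.class);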

Map Only

One major downside to the Mapper and Reducer method is that it creates a lot of intermediate data; in fact, there is a line of intermediate data for every line read from the files. This results in considerable network traffic within your Hadoop cluster. So one other way to do a join is with mappers only. The only requirement is that every mapper have a complete copy of one table or the other. That’s a tall order if both tables are massive, but ideal if one is massive and the other is rather small. So in our running example, let’s assume that every mapper has a copy of the Employees table. Then, as each mapper runs through its portion of the Pets table, it can check each incoming entry for a match in the Employees table and emit any matches. This can be much faster and consumes minimal network traffic, but again, it requires that every mapper hold a full copy of one of the tables. A sketch of this map-only approach follows.
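
Here is a sketch of the map-only approach. It assumes the small Employees table has already been shipped to every node (for example via the distributed cache) and is readable from a local file; the file name and class names are illustrative:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side join: every mapper loads the (small) Employees table into memory,
// then streams through its split of the (large) Pets table and joins locally.
public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Map<String, String> nameById = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
        // "employees.txt" is assumed to have been distributed to each node,
        // e.g. via job.addCacheFile(...) when the job was configured.
        BufferedReader reader = new BufferedReader(new FileReader("employees.txt"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.trim().split("\\s+");
                if (fields.length >= 2) {
                    nameById.put(fields[0], fields[1]); // PersonID -> name
                }
            }
        } finally {
            reader.close();
        }
    }

    @Override
    protected void map(LongWritable offset, Text petLine, Context context)
            throws IOException, InterruptedException {
        String[] fields = petLine.toString().trim().split("\\s+", 3);
        if (fields.length < 3) {
            return;
        }
        String name = nameById.get(fields[0]);
        if (name != null) { // inner join: drop pets with no matching employee
            context.write(new Text(name + "\t" + fields[2]), NullWritable.get());
        }
    }
}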

[image via: africanbudgetsafaris.com]

Dealing with Flash Modules in SharePoint

There are a number of problems adding Flash to SharePoint 2010. A few issues and resolutions follow:

PROBLEM: Flash files won’t open from a document library

  • What’s actually happening here is that SharePoint adds special headers to disallow applications and scripts from being run in the browser.
  • This is a security measure to keep users from uploading dangerous content.
  • The user gets prompted to download the file instead.

SOLUTION:

PROBLEM: I can’t access SWF files from the 14 hive! Continue reading Dealing with Flash Modules in SharePoint

What’s New in OfficeWriter 8.0?

[cross-posted from officewriter.com]

OfficeWriter 8.0 was just released! So, what’s new in the most significant release of OfficeWriter in years? Here’s the breakdown:

  • XLSX Support – Complete support for Excel 2007/2010 (XLSX) files in the ExcelApplication API
    • Programmatically create, manipulate, and read XLSX, XLS, and DOC
    • Run on your server with confidence – OfficeWriter is designed for high performance and scale
    • Build sophisticated Excel and Word reporting features into your applications
  • RTF/HTML import – Import arbitrary RTF and HTML documents into Word reports
    • Quickly and easily import markup into Word reports
    • Supports DOCX and DOC files
  • Enhanced documentation – New layout, new organization, new tutorials. We made it easier than ever for developers to find information they’re looking for at http://wiki.softartisans.com Continue reading What’s New in OfficeWriter 8.0?
