All posts by cameron

sinatra rubygems

How to Set Up Sinatra on Bluehost


I recently had to install Sinatra on Bluehost. It proved troublesome, so I’m documenting what I did. One curious handicap was that I could not SSH into Bluehost for silly administrative reasons, so everything below was done through cPanel. Here’s what I did:

Install the needed RubyGems

First, from cPanel, I went into RubyGems (under Software/Services) and I installed the following packages:

  1. sinatra (version 1.3.2)
  2. tilt (version 1.3.3)
  3. rack (version 1.4.1)
  4. fcgi (version 0.8.8)

You likely already have some of these so be sure to check the list first.

Install the “.htaccess” file

From cPanel, I went to File Manager (under Files) and chose to browse the web root (note: make sure you check “Show hidden files”). In public_html, I created a new file called “.htaccess” and put the following fluff inside it:

# General Apache options
AddHandler fcgid-script .fcgi
AddHandler cgi-script .cgi
#Options +FollowSymLinks +ExecCGI

# If you don't want Rails to look in certain directories,
# use the following rewrite rules so that Apache won't rewrite certain requests
#
# Example:
#   RewriteCond %{REQUEST_URI} ^/notrails.*
#   RewriteRule .* - [L]

# Redirect all requests not available on the filesystem to Rails
# By default the cgi dispatcher is used which is very slow
#
# For better performance replace the dispatcher with the fastcgi one
#
# Example:
#   RewriteRule ^(.*)$ dispatch.fcgi [QSA,L]
RewriteEngine On

# If your Rails application is accessed via an Alias directive,
# then you MUST also set the RewriteBase in this htaccess file.
#
# Example:
#   Alias /myrailsapp /path/to/myrailsapp/public
#   RewriteBase /myrailsapp

RewriteBase /
RewriteRule ^$ index.html [QSA]
RewriteRule ^([^.]+)$ $1.html [QSA]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.*)$ dispatch.fcgi [QSA,L]

# In case Rails experiences terminal errors
# Instead of displaying this message you can supply a file here which will be rendered instead
#
# Example:
#   ErrorDocument 500 /500.html

ErrorDocument 500 "<h2>Application error</h2>Ruby application failed to start properly"

This file was taken, in large part, from the Bluehost forum post linked at the bottom of the page.

Install the “dispatch.fcgi” file

In the same directory, I created the “dispatch.fcgi” file and put the following into that:

#!/usr/bin/ruby
#
# Sample dispatch.fcgi to make Sinatra work on Bluehost
#
# http://www.sinatrarb.com/
#

require 'rubygems'

# *** CONFIGURE HERE ***
# You must put the gems on the path
ENV["GEM_HOME"] = "/home#/XXXXX/ruby/gems"

# sinatra should load now
require 'sinatra'

module Rack
  class Request
    def path_info
      @env["REDIRECT_URL"].to_s
    end
    def path_info=(s)
      @env["REDIRECT_URL"] = s.to_s
    end
  end
end

# Define your Sinatra application here
class MyApp < Sinatra::Application
  get '/hi' do
    "Hello World!"
  end
end

builder = Rack::Builder.new do
  use Rack::ShowStatus
  use Rack::ShowExceptions

  map '/' do
    run MyApp.new
  end
end

Rack::Handler::FastCGI.run(builder)

You need to replace “/home#/XXXXX” with the appropriate path on your system.

Correct all the date times in your gemspec files

At this point, most other sources said I should be done, but I kept getting an error along the lines of:

Invalid gemspec in [D:/RailsInstaller/Ruby1.8.7/lib/ruby/gems/1.8/specifications/tilt-1.3.3.gemspec]: invalid date format in specification: "2011-08-25 00:00:00.000000000Z"

So what I had to do to fix it was go to “/ruby/gems/specifications/tilt-1.3.3.gemspec” in the file manager and change the line:

s.date = %q{2011-08-25 00:00:00.000000000Z}

to

s.date = %q{2011-08-25}

If that wonky date format shows up on any other gemspecs, you’ll likely have to alter them as well.
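If you can run a script against the specifications directory (on the host, or on a local copy), the same edit can be applied to every gemspec at once. This is only a sketch: the helper names are made up for illustration, and it assumes Ruby 1.9.3 or later (for File.write) and that the date line looks like the one shown above.

```ruby
# Matches an s.date line whose value carries a time component after the date,
# e.g. s.date = %q{2011-08-25 00:00:00.000000000Z}
WONKY_DATE = /^(\s*s\.date\s*=\s*%q\{)(\d{4}-\d{2}-\d{2})[^}]+(\})/

# Returns the gemspec source with any long-form date trimmed to YYYY-MM-DD.
# Lines that already use the short form are left untouched.
def fix_gemspec_date(source)
  source.gsub(WONKY_DATE) { "#{$1}#{$2}#{$3}" }
end

# Rewrites every affected gemspec under spec_dir (hypothetical helper)
# and returns the paths of the files it changed.
def fix_gemspecs_in(spec_dir)
  Dir.glob(File.join(spec_dir, "*.gemspec")).select do |path|
    original = File.read(path)
    fixed = fix_gemspec_date(original)
    next false if fixed == original
    File.write(path, fixed)
    true
  end
end
```

Calling fix_gemspecs_in with the path to your own specifications directory would patch the files in place, so keep a backup of the directory first.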

Hopefully, this will get you up and running. It did for me anyway.

Sources

Everything I learned I learned from the following sites and posts:

africanbudgetsafaris.com

Manual Joins in Hadoop

I recently learned how to perform joins with MapReduce. You usually won’t have to do this by hand, since tools such as Hive or Pig can do it much more easily. But it’s still a cool idea, so I’ll discuss the overall concept here. Let’s take the following two tables, each of which is stored across one or more files. They contain information about employees at a company.

Employees (PersonID / name / Age)

100   Bobby   24
101   Charles 54
102   Jenny   23
103   Oswald  41
104   Cindy   30

Pets (PersonID / Pet type / Pet name)

100   Dog   Knuckles
101   Snake Jenny
103   Cat   Uncle Naptime
102   Bird  Mitzy
102   Bird  Bessy
103   Cat   Sir Borington
100   Dog   Chuckles

We want to join these two tables to associate a person’s name with their pets’ names, so we need to perform a join on PersonID. There are generally two ways to perform a join in MapReduce: one uses both a mapper and a reducer, while the other uses only a mapper. Both have their pros and cons.

Mapper and Reducer

Our mappers read in all the data from both tables and emit each record using the PersonID as the key, regardless of which table the record came from. The combined results from all mappers will look like this:

Key | Value
100 | 100   Bobby   24
101 | 101   Charles 54
102 | 102   Jenny   23
103 | 103   Oswald  41
104 | 104   Cindy   30
100 | 100  Dog   Knuckles
101 | 101  Snake Jenny
103 | 103  Cat   Uncle Naptime
102 | 102  Bird  Mitzy
102 | 102  Bird  Bessy
103 | 103  Cat   Sir Borington
100 | 100  Dog   Chuckles

Obviously, you can and should filter out unneeded information; I’m just not doing it here. All the values for a given key then go to the same reducer. If we look at the reducer responsible for key 100, it will receive these values:

100  Bobby   24
100 Dog   Knuckles
100 Dog   Chuckles

This reducer now has Bobby’s record, including his age, and every other record is one of his pets. It can output:

Bobby Knuckles
Bobby Chuckles

… Or whatever output you want. You can perform any kind of join this way; it’s all a matter of what each reducer chooses to emit. In the above example, one reducer will get the key 104, which corresponds to Cindy, who doesn’t have any pets. That reducer receives the single value “104   Cindy   30”. If we want an inner join, that reducer emits nothing. If we want an outer join or a left join, it could emit “Cindy  null” or some such. You really have a lot of flexibility.
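To make the flow concrete, here is a small in-memory simulation of the map, shuffle, and reduce steps in plain Ruby. It is not Hadoop code. The rows are the ones from the mapper output above, and the :emp / :pet tag attached to each value is an assumption added here so the reducer can tell the two kinds of record apart.

```ruby
EMPLOYEES = [
  [100, "Bobby", 24], [101, "Charles", 54], [102, "Jenny", 23],
  [103, "Oswald", 41], [104, "Cindy", 30]
]
PETS = [
  [100, "Dog", "Knuckles"], [101, "Snake", "Jenny"],
  [103, "Cat", "Uncle Naptime"], [102, "Bird", "Mitzy"],
  [102, "Bird", "Bessy"], [103, "Cat", "Sir Borington"],
  [100, "Dog", "Chuckles"]
]

# Map phase: emit [PersonID, [tag, row]] for every row of either table.
def map_rows(rows, tag)
  rows.map { |row| [row[0], [tag, row]] }
end

# Shuffle phase: collect all values that share a key.
def shuffle(pairs)
  grouped = Hash.new { |hash, key| hash[key] = [] }
  pairs.each { |key, value| grouped[key] << value }
  grouped
end

# Reduce phase for one key: pair the employee name with each pet name.
# With left_join set, an employee without pets yields [name, nil].
def reduce_join(values, left_join = false)
  employee = values.find { |tag, _row| tag == :emp }
  return [] if employee.nil?
  name = employee[1][1]
  pets = values.select { |tag, _row| tag == :pet }
  return (left_join ? [[name, nil]] : []) if pets.empty?
  pets.map { |_tag, row| [name, row[2]] }
end

# Run all three phases over both tables.
def join_tables(left_join = false)
  pairs = map_rows(EMPLOYEES, :emp) + map_rows(PETS, :pet)
  grouped = shuffle(pairs).sort_by { |key, _values| key }
  grouped.flat_map { |_key, values| reduce_join(values, left_join) }
end
```

Calling join_tables returns name/pet pairs such as ["Bobby", "Knuckles"] and leaves Cindy out entirely, while join_tables(true) also includes ["Cindy", nil].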

Secondary sorting

One downside to the above example is that the record containing the person’s name may not be the first value in the list. The reducer has to examine each value to determine whether it holds the person information or pet information. However, Hadoop supports secondary sorting, which means you can sort composite keys differently from how you partition them. In the above example, we could have the mappers create a composite key that includes a little more information, like so:

Key | Value
100#Emp | 100   Bobby   24
101#Emp | 101   Charles 54
102#Emp | 102   Jenny   23
103#Emp | 103   Oswald  41
104#Emp | 104   Cindy   30
100#Pet | 100  Dog   Knuckles
101#Pet | 101  Snake Jenny
103#Pet | 103  Cat   Uncle Naptime
102#Pet | 102  Bird  Mitzy
102#Pet | 102  Bird  Bessy
103#Pet | 103  Cat   Sir Borington
100#Pet | 100  Dog   Chuckles

Then the MapReduce job can be configured to partition and group the keys based on the number before the ‘#’ but sort based on the entire key. This way, each reducer still gets all the values for a given PersonID, but the employee record will always come first, since “#Emp” sorts before “#Pet”. Of course, secondary sorting has much, much more potential than this. But like I said, this is an incredibly simple overview of basic MapReduce joining.
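The effect of sorting on the whole key while grouping on only part of it can be imitated in plain Ruby. Again, this is a simulation rather than Hadoop configuration: the function names are invented for illustration, and the reducer assumes every PersonID has an employee record, which holds for the data above.

```ruby
# Mapper output from the table above, as [composite key, value] pairs.
MAPPED = [
  ["100#Emp", ["100", "Bobby", "24"]],
  ["101#Emp", ["101", "Charles", "54"]],
  ["102#Emp", ["102", "Jenny", "23"]],
  ["103#Emp", ["103", "Oswald", "41"]],
  ["104#Emp", ["104", "Cindy", "30"]],
  ["100#Pet", ["100", "Dog", "Knuckles"]],
  ["101#Pet", ["101", "Snake", "Jenny"]],
  ["103#Pet", ["103", "Cat", "Uncle Naptime"]],
  ["102#Pet", ["102", "Bird", "Mitzy"]],
  ["102#Pet", ["102", "Bird", "Bessy"]],
  ["103#Pet", ["103", "Cat", "Sir Borington"]],
  ["100#Pet", ["100", "Dog", "Chuckles"]]
]

# The part of the key used for partitioning and grouping (text before '#').
def natural_key(composite_key)
  composite_key.split("#").first
end

# Sort on the whole composite key, then group on the natural key.
# The index keeps the sort stable for records that share a key.
def shuffle_with_secondary_sort(pairs)
  sorted = pairs.each_with_index
                .sort_by { |(key, _value), index| [key, index] }
                .map(&:first)
  sorted.group_by { |key, _value| natural_key(key) }
end

# The first value is always the employee; the rest are that person's pets.
def reduce_sorted(values)
  employee, *pets = values
  pets.map { |pet| [employee[1], pet[2]] }
end

def join_with_secondary_sort(pairs)
  shuffle_with_secondary_sort(pairs).flat_map do |_person_id, group|
    reduce_sorted(group.map { |_key, value| value })
  end
end
```

Because “#Emp” sorts before “#Pet”, reduce_sorted never has to inspect a record to find out what kind it is. It simply treats the first value as the employee.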

Map Only

One major downside of the mapper-and-reducer method is that it creates a lot of intermediate data. In fact, there is a line of intermediate data for every line read from the files. This results in considerable network traffic within your Hadoop cluster. So another way to do a join is with mappers only. The only requirement is that every mapper has a complete copy of one of the two tables. This is a tall order if both tables are massive, but ideal if one is massive and the other is rather small. In our running example, let’s assume that every mapper has a copy of the Employees table. Then, as each mapper runs through its portion of the Pets table, it can look up each incoming entry in the Employees table and emit any matches. This can be much faster and generates minimal network traffic. But again, it requires that every mapper hold a full copy of one of the tables.
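As a rough illustration of the map-only approach, the sketch below gives each simulated mapper the whole Employees table as an in-memory hash and a slice of the Pets table to work through. The way the Pets rows are divided between the two mappers is made up for the example.

```ruby
# The small table, held in full by every mapper (PersonID => name).
EMPLOYEE_NAMES = {
  100 => "Bobby", 101 => "Charles", 102 => "Jenny",
  103 => "Oswald", 104 => "Cindy"
}

# One mapper's work: look each pet row up in the in-memory table and
# emit a [name, pet name] pair for every match. No reducer is involved.
def map_only_join(pet_rows, employee_names)
  pet_rows.each_with_object([]) do |(person_id, _type, pet_name), out|
    name = employee_names[person_id]
    out << [name, pet_name] if name
  end
end

# Two mappers, each with its own portion of the Pets table.
split_a = [[100, "Dog", "Knuckles"], [101, "Snake", "Jenny"],
           [103, "Cat", "Uncle Naptime"]]
split_b = [[102, "Bird", "Mitzy"], [102, "Bird", "Bessy"],
           [100, "Dog", "Chuckles"]]

joined = map_only_join(split_a, EMPLOYEE_NAMES) +
         map_only_join(split_b, EMPLOYEE_NAMES)
```

Each mapper's output is already final, so nothing has to be shuffled between machines. The price is the memory needed to hold the smaller table on every mapper.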

image via: africanbudgetsafaris.com

merging_traffic_ahead

Combiners: The Optional Step to MapReduce

Most of us know that Hadoop MapReduce is made up of mappers and reducers. A map task runs on a task tracker. Then all the data for each key is collected from all the mappers and sent to another task tracker for reducing, with one reduce call per key. What fewer of us know about are combiners. Combiners are an optimization that can run after mapping but before the data is partitioned by key and sent to other machines. Combiners often perform the exact same function as reducers, but only on the subset of data produced by one mapper. This gives the task tracker an opportunity to reduce the size of the intermediate data it must send along to the reducers.

For instance, take the ubiquitous word count example. Two mappers may produce results like this:

Mapper A | Mapper B
X - 1    | X - 1
Y - 1    | X - 1
Z - 1    | Z - 1
X - 1    | Y - 1
         | Y - 1

All those key-value pairs will need to be passed to the reducers to tabulate the values. But suppose the reducer is also used as a combiner (which is quite often the case), and suppose it gets called on each mapper’s results before they’re passed along:

Mapper A | Mapper B
X - 2    | X - 2
Y - 1    | Z - 1
Z - 1    | Y - 2

The traffic load has been reduced. Now all that’s left to do is call the reducers on the keys across all map results to produce:

X - 4
Z - 2
Y - 3

An important point to keep in mind is that the combiner is not always called, even when you assign one. The mapper will generally only call the combiner if the intermediate data it’s producing is getting large, perhaps to the point that it must be written to disk before it can be sent. That’s why it’s important to make sure that the combiner does not change the inherent form of the data it processes. It must produce the same sort of content that it reads in. In the above example, the combiners read (word - sum) pairs and wrote out (word - sum) pairs.
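The word count example can be replayed in a few lines of Ruby to show why the combiner has to read and write the same kind of pair. This is a simulation, not Hadoop's API. The word lists are an assumed input chosen to reproduce the per-mapper counts above, and sum_counts plays both the combiner and the reducer.

```ruby
# Words seen by each mapper, matching the counts in the example above.
MAPPER_A = %w[X Y Z X]
MAPPER_B = %w[X X Z Y Y]

# Map: emit a [word, 1] pair for every word.
def map_words(words)
  words.map { |word| [word, 1] }
end

# Used as both combiner and reducer: sum the counts for each word.
# Input and output are both [word, count] pairs, so the final totals
# are the same whether or not this runs on a mapper's output first.
def sum_counts(pairs)
  totals = Hash.new(0)
  pairs.each { |word, count| totals[word] += count }
  totals.to_a
end

# With the combiner: each mapper's output shrinks before it is sent on.
combined_a = sum_counts(map_words(MAPPER_A)) # [["X", 2], ["Y", 1], ["Z", 1]]
combined_b = sum_counts(map_words(MAPPER_B)) # [["X", 2], ["Z", 1], ["Y", 2]]
with_combiner = sum_counts(combined_a + combined_b)

# Without the combiner: all nine raw pairs go straight to the reducer.
without_combiner = sum_counts(map_words(MAPPER_A) + map_words(MAPPER_B))
```

Both paths end with X at 4, Y at 3, and Z at 2, but the combined path hands the reducer six pairs instead of nine.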

image via: wpclipart.com

speculative execution

Speculative Execution: Proceed with Caution (or Not at All)


When a job tracker receives a MapReduce job, it will divvy out tasks to several task trackers in order to complete the job. If any of those tasks fails for whatever reason (perhaps it threw an exception), then it’s up to the job tracker to restart the task on another node. This can happen up to three times before the job tracker gives up. But what happens if a task doesn’t fail, but doesn’t succeed either? What if it just hangs? Perhaps that map task received an extra large or extra tough block to work with. Maybe some other application on that task tracker is hogging the entire CPU. Maybe the task has entered an infinite loop. Either way, the task tracker continues to check in from time to time, which prevents the task from being killed outright, but it just isn’t finishing. The job tracker can’t possibly know why this task is taking longer, nor can it know when or if it will finish. What does the job tracker do?

Speculative Execution!

Without shutting down the first task, it goes to another task tracker and gives it the same task. Then it’s a race. Whoever finishes first gets to submit its results. The other is killed (a most cutthroat race). That’s it.

Speculative execution isn’t always appropriate. In fact, some people recommend that you disable it for reduce tasks entirely. Why?

mapreduce

Traversing Graphs with MapReduce

Hadoop can be used to perform breadth-first searches through graphs. One way is to run a series of MapReduce jobs, where each job processes another layer of the breadth-first search. Here is a very high-level explanation of what I mean. Suppose we have the simple graph:

 E <-- C <-- F
 ^     ^     ^
 |     |     |
 A --> B --> D

This data would likely be represented in our Hadoop cluster as a list of connections.
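As a rough sketch of the one-job-per-layer idea (not necessarily the representation the full post goes on to use), the graph above can be written as an adjacency list and searched with repeated map and reduce rounds in plain Ruby. The function names and the use of nil for "not reached yet" are assumptions made for this example.

```ruby
# The graph above as an adjacency list (node => nodes it points to).
GRAPH = {
  "A" => %w[B E], "B" => %w[C D], "C" => %w[E],
  "D" => %w[F], "E" => [], "F" => %w[C]
}

# Map: pass each node's current distance through, and offer distance + 1
# to every neighbour of a node that has already been reached.
def bfs_map(distances)
  distances.flat_map do |node, distance|
    offers = distance ? GRAPH[node].map { |nb| [nb, distance + 1] } : []
    [[node, distance]] + offers
  end
end

# Reduce: keep the smallest distance offered for each node.
# A node with no offers yet stays at nil (not reached).
def bfs_reduce(pairs)
  pairs.group_by(&:first).each_with_object({}) do |(node, group), result|
    result[node] = group.map(&:last).compact.min
  end
end

# Run one map/reduce round per layer until a round changes nothing.
def bfs(source)
  distances = {}
  GRAPH.each_key { |node| distances[node] = (node == source ? 0 : nil) }
  loop do
    updated = bfs_reduce(bfs_map(distances))
    return updated if updated == distances
    distances = updated
  end
end
```

Starting from A, the first round reaches B and E, the second reaches C and D, and the third reaches F. One more round confirms that nothing changed and the loop stops. For simplicity, every round re-emits offers from all reached nodes rather than only the newest layer.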

Scrum Debates: Story Pointing Bugs

Our company adopted Scrum as our development method over a year ago. And to this day, we’ve yet to get it entirely right. Over the months we’ve had to address issues ranging from “What do we do when critical issues come up?” to “Who should be a part of our daily scrum meeting?” to “What the hell is a story point anyway?” But one problem that our product owner recently brought up is, “Should we really be story pointing bugs?” And the answer, of course, is… well, actually, no one can really agree on an answer.

Well, let’s start out with why we all originally said yes. In any given sprint, our team will have to commit to some handful of stories and bugs off the top of the product backlog. If both the stories and bugs are story pointed to the same scale, then we have no problem deciding how much we can commit to this sprint given our previous velocities. And for a while, this worked fine.

But let’s look at this from the product owner’s point of view. Let’s say we’ve just released the next version of our product and now our product owner is looking over the backlog, trying to figure out how many sprints we’ll need to complete the required features (let’s say there are 10) we need for the next release. Each of these features gets defined as a story. For the sake of simplicity, let’s say they all have a story point value of 10, resulting in a total of 100 story points. In our past sprints, our velocity has averaged 20 story points. So our product owner can set the next release date for five sprints from now. Easy!