In keeping with my sports theme from March, on March Madness and predicting the NCAA tournament, it is time to look at this season’s sport: Baseball. As you might know, Major League Baseball’s All-Star game is tonight, so let’s use Excel to pick which league (the AL or NL) will be victorious, so we can kick back and enjoy the game.
Baseball has long been associated with using stats to predict outcomes and player performances. This was made famous in the movie Moneyball, and has its own cottage industry around helping fantasy baseball players perform better. There is no shortage of information that can be gathered about baseball, with the whole industry even having deemed the term: sabermetrics.
Before jumping into the vastness of data, however, I want to point out where we are getting the data from. There are many different sources for baseball stats, many requiring a fee, but I will be referring to the Sean Lahman Baseball Stats Database. It is open source, so you can just download a version that works for you, and run with it. I am only going to look at the players who have actually appeared in the All-Star game and their year’s performance.
The other source of data is the year-to-date stats for the All-Star players. These can be gathered from the many, many sports sites (like ESPN.com), but is a manual process. I will leave it as an exercise for the reader to copy and paste those! You can find the sum of those in the example file attached.
Just to simplify things, we are going to use the old standbys: Earned Run Averages (ERA) and Batting Averages (BA) to compare the two Major League Baseball leagues. The other consideration is to analyze data from just the “Long Ball Era,” which started in 1994. Since the sport has been around so very long, it helps to categorize the data so you get a better apples-to-apples comparison.
The first step is to get the data into your database. I used the Access file download, since SQL Server 2008 R2 can import that directly, and you don’t have to do any of the conversions. The data is also available in CSV and MySQL format. Now that we have a nice almost relational database, all we need are a couple of SQL statements to get the data for our processing needs. (They have been attached for reference.)
On to building our Excel workbook. Continue reading Baseball’s All-Star Break: Predicting the Game Using Excel