When a job tracker receives a map reduce job, it will divvy out tasks to several task trackers in order to complete the job. If any of those tasks fails for whatever reason (perhaps they threw an exception), then it’s up to the job tracker to restart the job on another slave. This process can occur up to three times before the job tracker gives up. But what happens if a task doesn’t fail, but it doesn’t succeed either? What if it just hangs? Perhaps that map task received an extra large or extra tough block to work with. Maybe some other application on that task tracker is running and it’s hogging the entire CPU. Maybe the task tracker has entered an infinite loop. Either way, the task tracker continues to check in from time to time, which prevents it from being killed outright, but it just isn’t finishing. The job tracker can’t possibly know why this task tracker is taking longer nor can it know when or if it will finish. What does the job tracker do?
Speculative Execution!
Without shutting down the first task tracker, it goes to another task tracker and gives it the same job. Then it’s a race. Whoever finishes first is the one that gets to submit its results. The other is killed (a most cutthroat race). That’s it.
Speculative execution isn’t always appropriate. In fact, some people recommend that you disable it for reduce jobs entirely. Why? Continue reading Speculative Execution: Proceed with Caution (or Not at All)