

Apache Crunch is an incredibly useful Hadoop tool for abstracting away the boilerplate produced by Java MapReduce jobs. Instead of clunky map() and reduce() methods, jobs are created as pipelines, similarly to Cascading. I've written a summary of Cascading vs Java MapReduce here, and the majority of the discussion also applies to Crunch. There's also a great discussion of Cascading vs Crunch over at Quora - basically, Cascading is good for jobs using basic data types with straightforward functionality, whereas Crunch is useful for more complex data types and algorithms.

Crunch has some pretty good credentials: it recently became a top-level Apache project and it's used in production by Spotify. Its in-built support for Avro is fantastic, and it provides enough control that it's still possible to write highly-efficient operations (with custom comparators) where required.

For the basics, read the official Apache Crunch Getting Started page. I found it to be very useful, but I think it misses one step that makes working with Crunch a breeze.

Using Local Mode with Apache Crunch

The Getting Started page requires you to have a Hadoop environment, and then requires you to spend time getting input data onto the HDFS, getting the Jar onto a job submission node… For someone just wanting to try out Crunch, it's unnecessary and could prove to be a turn-off. In fact, using Hadoop's local mode (which runs the job in a single JVM), it's possible to run a Crunch job incredibly easily on a local machine. This will only work with UNIX-based operating systems - sorry Windows users!

I've placed my modified fork of the official Crunch demo in my GitHub repository. Simply fork it as you would the Getting Started demo: I have modified one class (WordCount.java) and created another (WordCountTest.java). WordCountTest is a JUnit test: using java.io.File, commons-io's FileUtils and Hadoop's Configuration, it runs the word count locally and checks the results with assertEquals.

So RIP Java MapReduce?

The official Java MapReduce word count example is over fifty lines long! Still, as a developer, I tend to use Java MapReduce when no other solution will work. Let's not burn our Java MapReduce textbooks just yet… Java MR might be old, but it is still the only tool that can solve certain large-scale problems.
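Hadoop's local mode boils down to a couple of Configuration properties; with those set, the whole word count runs from an ordinary main() in one JVM, reading and writing the local filesystem. Here is a minimal sketch, assuming the Crunch and Hadoop (MR1-era) jars are on the classpath - the property names mapred.job.tracker and fs.default.name are the MR1-era names (on YARN the equivalent is mapreduce.framework.name=local), and the run()/main() split and DoFn body follow the shape of the official WordCount demo rather than its exact code:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;

public class WordCount {

  // The whole job as a pipeline: read, split into words, count, write.
  public static void run(String input, String output, Configuration conf) {
    Pipeline pipeline = new MRPipeline(WordCount.class, conf);
    PCollection<String> lines = pipeline.readTextFile(input);
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());
    PTable<String, Long> counts = words.count();
    pipeline.writeTextFile(counts, output);
    pipeline.done();
  }

  public static void main(String[] args) {
    // Force Hadoop's local mode: one JVM, local filesystem, no cluster.
    Configuration conf = new Configuration();
    conf.set("mapred.job.tracker", "local");
    conf.set("fs.default.name", "file:///");
    run(args[0], args[1], conf);
  }
}
```

Run it with an input file and an output directory as arguments; the counts land in part-* files under the output directory - no HDFS, no job submission node.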

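A test along these lines can drive the whole job from JUnit. To be clear, this is a hedged sketch rather than my actual WordCountTest listing: only the imports (java.io.File, commons-io's FileUtils, Hadoop's Configuration) and the assertEquals reflect the original, while the scratch paths, the WordCount.run() entry point, and the tab-separated part-file output format are assumptions.

```java
import static org.junit.Assert.assertEquals;

import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.junit.Test;

public class WordCountTest {

  @Test
  public void countsWordsUsingLocalMode() throws Exception {
    // Hypothetical scratch locations for the test run.
    File input = new File("target/test-input/words.txt");
    File output = new File("target/test-output");
    FileUtils.writeStringToFile(input, "the cat sat on the mat");
    FileUtils.deleteDirectory(output);

    // Local mode: single JVM, local filesystem, no cluster required.
    Configuration conf = new Configuration();
    conf.set("mapred.job.tracker", "local");
    conf.set("fs.default.name", "file:///");

    // Hypothetical entry point that builds and runs the Crunch pipeline.
    WordCount.run(input.getPath(), output.getPath(), conf);

    // Assuming tab-separated "word<TAB>count" lines in a single part file.
    long theCount = 0;
    for (String line : FileUtils.readLines(new File(output, "part-r-00000"))) {
      String[] parts = line.split("\t");
      if ("the".equals(parts[0])) {
        theCount = Long.parseLong(parts[1]);
      }
    }
    assertEquals(2L, theCount);
  }
}
```

The nice part is that this runs under a plain mvn test - no cluster, no HDFS uploads - which is exactly the step the Getting Started page skips.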