Developing Hadoop jobs to work with elasticrawl

08 Feb 2015   commoncrawl, java, wikireverse

This post covers how to create your own Hadoop jobs that can be launched by elasticrawl, a tool I’ve developed. Each web crawl released by Common Crawl is large, typically around 2 billion pages split across several hundred segments. Elasticrawl helps launch AWS Elastic MapReduce (EMR) jobs to process this data.