10 May 2015
docker, aws, commoncrawl, wikireverse, force12
Two weeks ago I was leaving Budapest after a fun week with two of my colleagues. We were attending Craft Conf, a two-day, multi-track conference on software craftsmanship.
In Budapest I was also working with Anne on Force12.io, which we're launching this Thursday (14th May). Force12 is a container scheduler that provides auto-scaling, and the launch includes a demo of Force12 running on EC2 Container Service.
08 Feb 2015
commoncrawl, java, wikireverse
This post covers how to create your own Hadoop jobs that can be launched by elasticrawl, a tool I've developed. Each web crawl released by Common Crawl is large, typically around 2 billion pages split across several hundred segments. Elasticrawl helps with launching AWS Elastic MapReduce jobs to process this data.
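As a starting point, here is a minimal sketch of a Hadoop job of the kind elasticrawl can launch. The class names and the line-of-URLs input format are illustrative assumptions, not elasticrawl's actual interface; it also assumes the job receives an input path and an output path as its two arguments, the usual pattern for custom JAR steps on EMR (check the elasticrawl docs for the exact argument contract).

```java
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical example job: counts pages per host from a list of URLs.
public class ExampleCrawlJob extends Configured implements Tool {

  public static class PageMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text host = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Assumption: each input line holds one URL; emit its host as the key.
      try {
        URI uri = new URI(value.toString().trim());
        if (uri.getHost() != null) {
          host.set(uri.getHost());
          context.write(host, ONE);
        }
      } catch (URISyntaxException e) {
        // Skip malformed URLs rather than failing the whole job.
      }
    }
  }

  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable v : values) {
        total += v.get();
      }
      context.write(key, new LongWritable(total));
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    // Assumption: args[0] is the input path, args[1] the output path.
    Job job = Job.getInstance(getConf(), "example-crawl-job");
    job.setJarByClass(ExampleCrawlJob.class);
    job.setMapperClass(PageMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new ExampleCrawlJob(), args));
  }
}
```

Packaged as a JAR and uploaded to S3, a job like this can be run as an EMR step; elasticrawl's role is to take care of launching the step for each crawl segment.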
23 Jan 2015
aws, commoncrawl, wikireverse
WikiReverse is a reverse web-link graph for Wikipedia articles. It consists of approximately 36 million links to 4 million Wikipedia articles from 900,000 websites. The source data was the July 2014 web crawl released by Common Crawl, which contains 3.6 billion web pages. This post is an overview of the data pipeline I created to produce the results. Processing the entire crawl with Hadoop on AWS EMR (Elastic MapReduce) spot instances cost $64. The full dataset can be downloaded from S3 as a torrent or browsed on the WikiReverse site.
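To make the "reverse web-link graph" idea concrete, here is a hedged sketch of the core MapReduce step, not the actual WikiReverse code: the mapper emits each outbound Wikipedia link keyed by the target article, and the reducer collects the pages that link to it. The class names, the link regex, and the simplified tab-separated input (the real pipeline reads Common Crawl's archive files) are all assumptions for illustration.

```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical sketch of reversing a link graph: invert
// (web page -> Wikipedia article) links into (article -> linking pages).
public class ReverseLinkSketch {

  // Matches links to English Wikipedia articles in page HTML (simplified).
  private static final Pattern WIKI_LINK =
      Pattern.compile("https?://en\\.wikipedia\\.org/wiki/([^\"'\\s#?]+)");

  public static class LinkMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text article = new Text();
    private final Text sourceUrl = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Assumption: tab-separated input line of <page URL> \t <page HTML>.
      String[] fields = value.toString().split("\t", 2);
      if (fields.length != 2) {
        return;
      }
      sourceUrl.set(fields[0]);
      Matcher m = WIKI_LINK.matcher(fields[1]);
      while (m.find()) {
        article.set(m.group(1));           // key: the linked Wikipedia article
        context.write(article, sourceUrl); // value: the page that links to it
      }
    }
  }

  public static class CollectReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      // One output line per (article, linking page) pair.
      for (Text source : values) {
        context.write(key, source);
      }
    }
  }
}
```

The shuffle phase does the actual "reversing": because the map output is keyed by the target article rather than the source page, each reducer call sees every page that links to one article.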