Common Crawl posts

Craft Conf and Force12

10 May 2015   docker, aws, commoncrawl, wikireverse, force12

Two weeks ago I was leaving Budapest after a fun week with two of my colleagues. We were attending Craft Conf, a two-day, multi-track conference on software craftsmanship.

In Budapest I was also working with Anne on Force12, which we're launching this Thursday (14th May). Force12 is a container scheduler that provides auto-scaling, and the launch includes a demo of Force12 running on EC2 Container Service.

Developing Hadoop jobs to work with elasticrawl

08 Feb 2015   commoncrawl, java, wikireverse

This post covers how to create your own Hadoop jobs that can be launched by the elasticrawl tool I've developed. Each web crawl released by Common Crawl is large, typically around 2 billion pages split across several hundred segments. Elasticrawl helps by launching AWS Elastic MapReduce (EMR) jobs to process this data.
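To give a feel for what such a job looks like, here is a minimal sketch of a Hadoop job that a tool like elasticrawl could launch as an EMR step. The class names are hypothetical, and plain TextInputFormat is a simplifying assumption: real Common Crawl jobs read WARC/WAT/WET segment files via a dedicated input format.

```java
// Minimal sketch of a custom Hadoop job that a tool like elasticrawl
// could launch as an EMR step. Class names here are hypothetical, and
// TextInputFormat is a simplification: real Common Crawl jobs use a
// WARC/WAT/WET input format to read crawl segments.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class ExampleCrawlJob {

  // Emits (token, 1) per word; stands in for real page-level parsing.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Input and output paths arrive as arguments, which is what lets a
  // launcher fill in the S3 locations of crawl segments and results.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "example-crawl-job");
    job.setJarByClass(ExampleCrawlJob.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The part that matters for a launcher is the main method: the job takes its input and output paths as plain arguments, so the tool can supply the S3 locations for each segment and for the results.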

WikiReverse data pipeline

23 Jan 2015   aws, commoncrawl, wikireverse

WikiReverse is a reverse web-link graph for Wikipedia articles. It consists of approximately 36 million links to 4 million Wikipedia articles from 900,000 websites. The source data was the July 2014 web crawl released by Common Crawl, which contains 3.6 billion web pages. This post is an overview of the data pipeline I created to produce the results. Processing the entire crawl with Hadoop cost $64, using the AWS EMR (Elastic MapReduce) service with spot instances. The full dataset can be downloaded from S3 as a torrent or browsed on the WikiReverse site.
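The reverse-link step itself maps neatly onto MapReduce: the map phase flips each (source page, Wikipedia article) pair so the shuffle groups links by article, and the reduce phase collects every source that points at each article. A minimal sketch, assuming a simplified tab-separated input produced by an earlier parsing job (the real pipeline extracts links from the crawl's WAT metadata files):

```java
// Sketch of the reverse-link step of a pipeline like WikiReverse's.
// Assumed (hypothetical) input: one "sourceUrl<TAB>wikipediaArticle"
// line per link, produced by an earlier parsing job.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReverseLinks {

  // Flips each (source, target) pair so the shuffle groups by article.
  public static class FlipMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      if (fields.length == 2) {
        // key: Wikipedia article, value: page that links to it
        context.write(new Text(fields[1]), new Text(fields[0]));
      }
    }
  }

  // Collects every source URL that links to a given article.
  public static class CollectReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text article, Iterable<Text> sources, Context context)
        throws IOException, InterruptedException {
      StringBuilder sb = new StringBuilder();
      for (Text source : sources) {
        if (sb.length() > 0) {
          sb.append(',');
        }
        sb.append(source.toString());
      }
      context.write(article, new Text(sb.toString()));
    }
  }
}
```

A driver like the one in the elasticrawl sketch above, with these mapper and reducer classes substituted in, completes the job.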
