10 May 2015
docker, aws, commoncrawl, wikireverse, force12
Two weeks ago I was leaving Budapest after a fun week with two of my colleagues. We were attending Craft Conf, a two-day, multi-track conference on software craftsmanship.
In Budapest I was also working with Anne on Force12.io, which we're launching this Thursday (14th May). Force12 is a container scheduler that provides autoscaling. The launch includes a demo of Force12 running on EC2 Container Service.
15 Feb 2015
WikiReverse.org is my open data project using Common Crawl data. It's a reverse link graph of 36 million external links to 4 million Wikipedia articles. One of the reasons I chose Wikipedia as the topic is that there is a wealth of open data both released by Wikipedia and built from its data.
One of the enhancements I'm making to the WikiReverse site is making browsing more interesting by categorising the articles and adding images. To do this I'm using data from the DBpedia project, which takes the raw data dumps released by Wikipedia and creates structured datasets for many facets of the data, including categories, images and geographical locations.
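As a rough illustration of what working with a DBpedia dataset looks like, here is a minimal sketch that pulls article-to-category pairs out of N-Triples lines, the line-per-triple format DBpedia dumps are published in. The predicate URI and the sample triple are assumptions based on the public article-categories dataset, not code from the WikiReverse project itself.

```python
import re

# One triple per line: <subject> <predicate> <object> .
TRIPLE_RE = re.compile(r'<([^>]+)> <([^>]+)> <([^>]+)> \.')

# Predicate used by DBpedia's article-categories dataset (assumed here).
SUBJECT_PREDICATE = "http://purl.org/dc/terms/subject"

def article_categories(lines):
    """Yield (article, category) pairs from N-Triples lines."""
    for line in lines:
        m = TRIPLE_RE.match(line.strip())
        if m and m.group(2) == SUBJECT_PREDICATE:
            article = m.group(1).rsplit("/", 1)[-1]    # last path segment
            category = m.group(3).rsplit(":", 1)[-1]   # text after "Category:"
            yield article, category

# Hypothetical sample line for illustration.
sample = [
    '<http://dbpedia.org/resource/Budapest> '
    '<http://purl.org/dc/terms/subject> '
    '<http://dbpedia.org/resource/Category:Capitals_in_Europe> .',
]
print(list(article_categories(sample)))  # [('Budapest', 'Capitals_in_Europe')]
```

Pairs like these can then be joined against the WikiReverse article list to attach categories to each article.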
08 Feb 2015
commoncrawl, java, wikireverse
This post covers how to create your own Hadoop jobs that can be launched with the elasticrawl tool I've developed. Each web crawl released by Common Crawl is large, typically around 2 billion pages split across several hundred segments. Elasticrawl helps with launching AWS Elastic MapReduce jobs to process this data.
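To make the segment idea concrete, here is a small sketch of how a tool in this space can split a crawl's segment list into batches so that each Elastic MapReduce job processes a manageable slice rather than the whole crawl. This is an illustration of the general approach, not elasticrawl's actual code, and the segment IDs are made up.

```python
def batch_segments(segments, max_segments_per_job):
    """Split a list of crawl segment IDs into per-job batches."""
    return [segments[i:i + max_segments_per_job]
            for i in range(0, len(segments), max_segments_per_job)]

# Hypothetical segment IDs; a real crawl has several hundred.
segments = ["1420000001.1", "1420000002.2", "1420000003.3",
            "1420000004.4", "1420000005.5"]

jobs = batch_segments(segments, 2)
print(len(jobs))  # 3 jobs: two full batches of 2 plus a final batch of 1
```

Each batch would then become the input list for one EMR job, which keeps individual jobs short and lets a failed batch be retried without reprocessing everything.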