Wikireverse posts

Craft Conf and

10 May 2015   docker, aws, commoncrawl, wikireverse, force12

Two weeks ago I was leaving Budapest after a fun week with two of my colleagues. We were attending Craft Conf, a two-day, multi-track conference on software craftsmanship.

In Budapest I was also working with Anne on Force12, which we’re launching this Thursday (14th May). Force12 is a container scheduler that provides auto scaling. On Thursday we’re launching a demo of Force12 running on EC2 Container Service.

Adding DBpedia data to WikiReverse

15 Feb 2015   ruby, wikireverse

WikiReverse is my open data project using Common Crawl data. It’s a reverse link graph with 36 million external links to 4 million Wikipedia articles. One of the reasons I chose Wikipedia as the topic is that there is a wealth of open data, both released by Wikipedia and built from their data.

One of the enhancements I’m making to the WikiReverse site is to make browsing more interesting by categorising the articles and adding images. To do this I’m using data from the DBpedia project. They take the raw data dumps released by Wikipedia and create structured datasets for many facets of the data including categories, images and geographical locations.

Developing Hadoop jobs to work with elasticrawl

08 Feb 2015   commoncrawl, java, wikireverse

This post covers how to create your own Hadoop jobs that can be launched by the elasticrawl tool I’ve developed. Each web crawl released by Common Crawl is large and typically has around 2 billion pages split over several hundred segments. Elasticrawl helps with launching AWS Elastic MapReduce jobs to process this data.
