WikiReverse data pipeline

23 Jan 2015   aws, commoncrawl, wikireverse

WikiReverse is a reverse web-link graph for Wikipedia articles. It consists of approximately 36 million links to 4 million Wikipedia articles from 900,000 websites. The source data was the July 2014 web crawl released by Common Crawl, which contains 3.6 billion web pages. This post is an overview of the data pipeline I created to produce the results. Processing the entire crawl using Hadoop cost $64 USD using the AWS EMR (Elastic MapReduce) service with spot instances. The full dataset can be downloaded from S3 as a torrent or browsed on the WikiReverse site.

Common Crawl data

Common Crawl has been releasing public web crawls since 2008. Currently they release crawl data in 3 formats. WARC files contain the complete web pages, WAT files contain just the page metadata in a JSON format and WET files contain only the plain text extracted from the web pages. The metadata includes the full link graph so I processed these files rather than the complete web pages.

The Common Crawl data is stored in AWS’s S3 service as a public dataset in their US-East-1 region.

Parsing the metadata

To parse the data I developed a Hadoop job in Java. I primarily develop in Ruby but for the Hadoop jobs I went with Java for speed and because the Hadoop Java API is richer than the streaming API. It contains extra features such as counters that were very useful when developing and running the jobs.

For parsing the metadata JSON I used the Jackson parser. It has a streaming API that allows for fast low level parsing of the JSON.

To process the data I used the EMR service. This is a managed version of Hadoop, it creates your cluster using EC2 instances. To reduce costs I used EC2 spot instances and ran the parse jobs on a weekend while the US was asleep. As at this time there is usually plenty of spot capacity in the US East region of AWS.

The metadata for the July 2014 crawl consists of 63,560 WET files that are split into 253 segments. Each segment was processed as a separate parse job. This avoids processing the whole crawl in a single job, as this would overload the Hadoop job tracker.

Another reason to process the segments individually was that if the spot price goes above your bid price your instances will be terminated. Spikes in the spot price happen quite often and this caused a couple of my jobs to fail but only the data from the in progress segment was lost.

However creating 250 parse jobs manually in the AWS web console would be very time consuming and error prone. So I developed elasticrawl to help with this. It’s a command line tool developed in Ruby that automates creating the EMR jobs.

Combining Segments

The output parsed from each segment is stored in a separate location in S3. The combine Hadoop job takes these segments and combines them into a single set of files. For the parse and combine stages I used Hadoop sequence files as the output format as they provide good performance. As with the parse jobs elasticrawl was used to launch the combine job.

Outputting as CSV

The final Hadoop job was to output the combined results as a set of gzipped tab delimited files. The raw data for each link has the

  • ISO language code of the Wikipedia site e.g. en for English Wikipedia.
  • Article name e.g. Common_Crawl
  • Page URL
  • Page hostname
  • Page title
  • Timestamp when the page was crawled

This job is simple as it just takes in the S3 location of the combined results. So it was created manually rather than using elasticrawl.

Hadoop Costs

The Hadoop jobs for all 3 steps used 279 instance hours with m3.2xlarge instances and cost $64. Using spot instances rather than on-demand instances saved $115. The costs were $39 for EMR, $18 for the spot instances and $7 for S3. So without using EMR it would have been incredibly cheap but I didn’t have a way of launching a Hadoop cluster using just EC2 instances.

ETL using Redshift

The complete results are 2GB compressed. I used yet another AWS service called Redshift to create a simple ETL process. Redshift is a high performance data warehouse built on top of Postgres.

Redshift extends the Postgres COPY command to provide very fast import / export of CSV data from S3. It also has good error handling for filtering out invalid data. In my case this was mainly due to invalid UTF-8 characters appearing in the results.

However Redshift is intended for use as a data warehouse solution. So not all Postgres features are implemented. For me the problems were there is no free text search support or support for sequence primary keys.

So I imported the raw results into a single table and used it to populate separate tables for Wikipedia languages, articles, sites, pages and links. These tables were then exported back to S3 as tab delimited files.

Import into Postgres

I was very impressed with the performance of Redshift. Unfortunately things now slowed down. As I needed to load the transformed data from S3 into a regular Postgres database and populate the database schema needed for the website.

Although I used a fairly powerful m3.2xlarge EC2 instance this took a long time. I did some tuning of Postgres but the EC2 instance needed more tuning of its storage. Instead I took the hit and ran some long running queries as background tasks. I think next time I would try using Pgloader.

The final step was to take a dump of the Postgres database and back it up to S3 and some other cloud storage services.

WikiReverse site

For the site itself I wanted to try out using a SPA framework. So the site is an AngularJS app that gets its data from the REST API. The site is hosted on S3 and served from CloudFront. The REST API is developed in Rails accessing the Postgres database. Since the data is read only the API requests are cached using Varnish.

I usually deploy to the AWS stack using Elastic Beanstalk, RDS and Elasticache. This is because we need high availability and performance. It also allows us to focus on developing new features for our customers rather than building out infrastructure. However cost was the most important factor for hosting this project. AWS was too expensive for always on servers that don’t need a SLA. So I went with Linode and created cache, web and database servers. The code for the Hadoop jobs and the elasticrawl tool are available on GitHub under the MIT license. I’ll be adding the code for the Angular app and REST API soon.


Overall I’m pretty happy with the project. I wanted to process an entire crawl and then release both the data and the code used to generate it. Hopefully people will find the Wikipedia data interesting and it will encourage more people to work with the Common Crawl data.

The main problem is how long it’s taken both in effort and elapsed time. Its been a big side project that I’ve only worked on during gaps in my freelance work. Part of the reason for that is I also wanted to build the WikiReverse site so the data can be browsed online rather than just downloaded via the torrent.

Using Angular to build the site did take a while. Partly because of the steep learning curve that many people have commented on. I also built out the whole site in Angular whereas I could have used a static site generator for the content pages and just used Angular for the dynamic data. However the front end is the weakest part of the stack for me, and building the site has helped me improve.

On the Hadoop side I was pleased the whole crawl could be parsed for $64. Using spot instances made a big difference to the costs. Hopefully the elasticrawl tool will be useful as each crawl has a lot of data to process. By parsing each segment seperately and then combining the results it means spot instances can be used and even if you lose your cluster only a small amount of data is lost.

comments powered by Disqus