This post covers how to create your own Hadoop jobs that can be launched by the elasticrawl tool I’ve developed. Each web crawl released by Common Crawl is large and typically has around 2 billion pages split over several hundred segments. Elasticrawl helps with launching AWS Elastic MapReduce jobs to process this data.
By processing each segment as a separate Hadoop job EC2 spot instances can be used which greatly reduces costs. For my wikireverse project I parsed a large crawl from July 2014 with 3.6 billion pages for $64 USD. Using spot instances rather than on-demand saved $115.
The design means if the spot price spikes above your bid price only the data for the in-process segment will be lost. The elasticrawl tool also launches combine jobs that aggregate the results of each segment into a single set of files. This post describes the minor changes needed to develop Hadoop jobs that integrate with elasticrawl.
Using elasticrawl you launch parse jobs using the parse command. The max segments and max files parameters control how many segments and files to parse. If you leave them out then all segments and all files will be parsed.
./elasticrawl parse --max-segments 2 --max-files 2
The EMR terminology can be a bit confusing at first. In EMR you launch job flows. Each job flow is a Hadoop cluster made up of EC2 instances. A job flow has one or more job steps. Each job step is a Hadoop job that runs on your cluster.
Elasticrawl launches an EMR job flow with a job step for each crawl segment. Below is an example job step. The first parameter is which class to call and this is configured in the jobs.yml file.
com.rossfairbanks.elasticrawl.examples.WordCount s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-52/segments/1418802764809.9/wet/*.warc.wet.gz s3://elasticrawl-test/data/1-parse/1423393919310/segments/1418802764809.9/ 2
In your Hadoop job you need to accept 3 parameters.
- Input Path: s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-52/segments/1418802764809.9/wet/*.warc.wet.gz
- Output Path: s3://elasticrawl-test/data/1-parse/1423393919310/segments/1418802764809.9/
- Max Files: 2
The input path is the location of the Common Crawl segment in the AWS public datasets bucket. The output path is the location in your S3 bucket where the results will be saved.
Max files is an optional parameter and controls how many files should be processed. To implement this you need to create a path filter class. Here is an example parse job that does the Hadoop word count example using the Common Crawl WET files.
The combine job takes in the results of one or more parse jobs. It combines the results of all the segments parsed into a single set of files.
Here is the elasticrawl command
./elasticrawl combine --input-jobs 1423393932467 1423393919310
It launches an EMR job flow with a single step. Again the first command is the job class and is configured in jobs.yml.
com.rossfairbanks.elasticrawl.examples.SegmentCombiner s3://elasticrawl-test/data/1-parse/1423393932467/segments/*/part-*,s3://elasticrawl-test/data/1-parse/1423393919310/segments/*/part-* s3://elasticrawl-test/data/2-combine/1423406909964/
This job takes in 2 arguments.
- Input Paths: s3://elasticrawl-test/data/1-parse/1423393932467/segments//part-,s3://elasticrawl-test/data/1-parse/1423393919310/segments//part-
- Output Paths: s3://elasticrawl-test/data/2-combine/1423406909964/
The input paths are a CSV list with the S3 locations where the parse results are stored. An input path is added for each parse job and a wildcard is used to match all segments. In your job class you need to split the CSV list and add each input path to the job.
The output path is a single S3 location where the combined results will be saved. Here is the example combine job.
Follow Up Jobs
Depending on what your analysis is doing you may need to run some follow up jobs. For my WikiReverse project I needed a 3rd output step that cleaned up the data and output it as flat files ready for import into a Postgres database.
These follow up jobs are not launched by elasticrawl as the data is not split across multiple segments. Instead you can launch an EMR job manually using the output of the combine job. As this is a single S3 location containing all of your data.
By default elasticrawl is designed to work its way through each segment in a crawl in order. However when you’re developing jobs or fixing bugs it’s often useful to re-test with the same segments.
To do this you can call the parse command using the segment list argument.
./elasticrawl parse CC-MAIN-2014-52 --segment-list 1418802764809.9 1418802764752.1 --max-files 2
Common Crawl release data in 3 formats.
- WARC files contain the full HTML and headers fpr each page in the crawl.
- WAT files contain the metadata for each page in a JSON format.
- WET files contain just the text that has been extracted from each page.
You can read more about the file formats on the Common Crawl site.
For output formats I found the best 2 options are Hadoop sequence files or gzipped text files. Sequence files are a binary format with key value pairs that can be compressed. They provide great performance but are hard to read with standard tools.
For WikiReverse I used sequence files as the output format for the combine and parse steps and only used text files for the final output step. However elasticrawl will work with any output format.
Elasticrawl is configured using 2 files that are stored in your .elasticrawl directory. The cluster.yml file controls the configuration of your Hadoop cluster. In this post I’ve written about how you can scale up your cluster to process more data.
To run your own jobs you need to change jobs.yml. You’ll need to upload your compiled JAR file to an S3 bucket. You then need to specify which classes to run for the parse and combine steps.
Finally for the parse step you need to change the input filter depending on whether you’re parsing the WARC, WAT or WET files.
Hopefully this will help if you wish to run your own Hadoop jobs with the Common Crawl data. If you have problems getting your jobs running with elasticrawl or you have a feature request please raise an issue with the project on GitHub.