Parsing Common Crawl data using Elasticrawl

03 Jan 2015   aws, commoncrawl

This is a walkthrough of how to use elasticrawl, a tool I’ve developed to help parse Common Crawl data. The latest web crawl released by Common Crawl contains approximately 2 billion pages collected in November 2014. The data is stored on AWS as an S3 public dataset.

The Common Crawl is one of the largest publicly accessible corpora of web data, and it can be processed relatively inexpensively. I parsed the complete July 2014 crawl (3.6 billion pages) for $70 USD. Crawling that many pages yourself would be vastly more expensive and time-consuming.

There are various ways to parse the data, but the elasticrawl tool creates Hadoop jobs that run on AWS using EMR (Elastic MapReduce). It is designed to work with EC2 spot instances to keep your costs down; for the July 2014 crawl this saved me $115.

Word Count example

In this walkthrough we’ll run the Hadoop Word Count example. Common Crawl releases data in 3 formats; we’ll use the WET files, which contain just the text extracted from each web page.

We’ll only parse the first 2 files from each of the first 2 segments. This is 1.9 GB of input data that will produce 80 MB of results (both gzip compressed). The exact cost will depend on the current spot price, but it should be between 40 and 80 cents.

However, you will be charged for any results and logs you keep in your S3 bucket. This will only be a few cents a month, but you should run the destroy command at the end to delete this data.

Installing elasticrawl & AWS costs

You’ll first need to install elasticrawl using the instructions on GitHub. You’ll also need an AWS account and access keys. For the access keys I recommend you create a separate IAM user that just has access to EMR.
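
If you have Ruby installed, the install is normally a one-liner via RubyGems. This is just a sketch; check the GitHub README for the current, authoritative instructions.

# assumes Ruby is installed; see the elasticrawl README for current instructions
$ gem install elasticrawl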

Running the init command

The init command sets up elasticrawl. It takes an S3 bucket name that will store your results and logs, and it will also ask for your AWS access keys. A .elasticrawl directory is created in your home directory to store your config files and a SQLite database.

$ elasticrawl init elasticrawl-2014-49
Enter AWS Access Key ID:
Enter AWS Secret Access Key: 
…

Bucket s3://elasticrawl-2014-49 created
Config dir /home/vagrant/.elasticrawl created
Config complete

Launching a parse job

We’re now ready to launch a parse job using the parse command. This takes the name of the crawl; we’re using the latest, which is CC-MAIN-2014-49. This data is from November 2014, and 49 is the week in which the crawl was started. New crawls are announced on the Common Crawl blog.
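
If you want to see which crawls are available, you can also list the public dataset with the AWS CLI. This is only a sketch: the bucket and prefix below are assumptions about where the dataset lives, so check the Common Crawl site for the current location.

# list available crawls (bucket/prefix assumed; see commoncrawl.org for the current location)
$ aws s3 ls s3://commoncrawl/crawl-data/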

Each crawl is split into multiple segments. The --max-segments option means only the first 2 of the crawl’s 136 segments will be processed. The --max-files option means only the first 2 WET files in each segment will be processed.

$ elasticrawl parse CC-MAIN-2014-49 --max-segments 2 --max-files 2
Segments
Segment: 1416400372202.67 Files: 150
Segment: 1416400372490.23 Files: 124

Job configuration
Crawl: CC-MAIN-2014-49 Segments: 2 Parsing: 2 files per segment

Cluster configuration
Master: 1 m1.medium  (Spot: 0.12)
Core:   2 m1.medium  (Spot: 0.12)
Task:   --
Launch job? (y/n)
y

Job: 1420280061691 Job Flow ID: j-2XGIK13GP7GKD

The parse command shows a confirmation message with how much data is to be parsed. It also shows the Hadoop cluster configuration including the bid price for the spot instances.

If the job is launched successfully it will show you the job name and the EMR job flow ID. The job name is the current Unix epoch time. This is the same format that Common Crawl uses for their segment names.

Viewing progress in the AWS console

Now that your job has been launched you can view its progress in the AWS console. In the EMR section your job will appear in the Cluster List.

The first thing EMR does is provision your Hadoop cluster on EC2 instances. Since we’re using spot instances, it has to bid for them. If your bid is too low, the provisioning and your job will fail. The Spot Requests page in the EC2 console lets you track the progress of your bids.
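
If you prefer the command line to the console and have the AWS CLI installed, you can check the same things there. A rough sketch, using the Job Flow ID from the parse output above:

# check the state of your spot bids
$ aws ec2 describe-spot-instance-requests

# check the cluster and its steps
$ aws emr describe-cluster --cluster-id j-2XGIK13GP7GKD
$ aws emr list-steps --cluster-id j-2XGIK13GP7GKD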

Once your EC2 instances have launched, EMR will configure your Hadoop cluster. The whole provisioning phase takes 10-12 minutes. Your Hadoop job has 2 steps, one for each of the 2 segments you are parsing.

Each segment is processed as a separate Hadoop job to avoid overloading the Hadoop master node. It also means that if your Hadoop cluster fails, you only lose the segment that is in progress. This is especially important when using spot instances: if the spot price spikes above your bid price, your instances will be terminated.

Each step will take around 5 minutes to complete. You can then view the results and logs in the S3 Console. Here is the directory structure that elasticrawl set up for the parse job.

s3://elasticrawl-2014-49/

  data/
    1-parse/
      1420280061691/
        segments/
          1416400372202.67/
          1416400372490.23/

  logs/
    1-parse/
      1420280061691/
        j-2XGIK13GP7GKD/
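
If you have the AWS CLI installed you can also list the output from the command line instead of the console. A sketch, using the bucket and job name from above:

# list the parse output for job 1420280061691
$ aws s3 ls s3://elasticrawl-2014-49/data/1-parse/1420280061691/ --recursive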

Launching a combine job

We’re now ready to combine the results of the 2 segments into a single set of results. The combine command takes in the job name from the parse command.

$ elasticrawl combine --input-jobs 1420280061691
Job configuration
Combining: 2 segments

Cluster configuration
Master: 1 m1.medium  (Spot: 0.12)
Core:   2 m1.medium  (Spot: 0.12)
Task:   --
Launch job? (y/n)
y

Job: 1420281163734 Job Flow ID: j-1583Y291J17O6

EMR will provision another Hadoop cluster using spot instances. This time the job has a single step. Here is the S3 directory structure for the combine job.

s3://elasticrawl-2014-49/

  data/
   2-combine/
      1420281163734/

  logs/
    2-combine/
      1420281163734/
        j-1583Y291J17O6/

To view the Word Count results you can download them from the S3 console.
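
Or, if you have the AWS CLI installed, you can pull the results down and inspect them from the command line. A sketch; the exact names of the part files may differ:

# download the combine output for job 1420281163734
$ aws s3 cp s3://elasticrawl-2014-49/data/2-combine/1420281163734/ results/ --recursive

# peek at the word counts (adjust the path if the part files sit in a subdirectory)
$ zcat results/*.gz | head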

Status and Reset commands

The status command lets you see how many segments you’ve parsed in each crawl and your job history.

$ elasticrawl status
Crawl Status
CC-MAIN-2014-49 Segments: to parse 134, parsed 2, total 136

Job History (last 10)
1420281163734 Launched: 2015-01-03 10:32:43 Combining: 2 segments
1420280061691 Launched: 2015-01-03 10:14:21 Crawl: CC-MAIN-2014-49 Segments: 2 Parsing: 2 files per segment

The reset command takes in a crawl name and resets it so all segments are parsed again.

$ elasticrawl reset CC-MAIN-2014-49
Reset crawl? (y/n)
y
 CC-MAIN-2014-49 Segments: to parse 136, parsed 0, total 136

Cleaning up with the destroy command

You’ll pay a monthly storage cost for any results and logs you keep in your S3 bucket. So the destroy command deletes your bucket and the .elasticrawl directory.
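
Running it is a one-liner; as with the other commands it should ask for confirmation before deleting anything (sketch only, output omitted):

$ elasticrawl destroy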

This completes the walkthrough. The final sections of this post cover how to configure elasticrawl to process more data and how to run your own Hadoop jobs with it.

Scaling up

To process an entire crawl you can run the parse command without the --max-segments and --max-files options. When processing more data you should increase the size of your cluster by editing cluster.yml in your .elasticrawl directory.

For your cluster you need to choose the instance types and how many nodes you want to run. When I parsed the July 2014 crawl I used m3.2xlarge instances as they have a good mix of CPU, memory and network performance. It’s worth doing some benchmarking of your Hadoop jobs to choose the best instance type.

How many instances you run depends on how many files are in each segment. All 3 file types are based on the WARC file format, which isn’t splittable, so I match the number of Hadoop mappers to the number of files. This means each segment is parsed in a single pass.

The EMR documentation shows the Hadoop configuration for each instance type. The m3.2xlarge instances have 12 mappers each, so for a segment with 260 files you need 22 nodes.
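
The sizing is just a ceiling division of files by mappers per node. A quick shell sketch of that calculation, using the example numbers above:

# nodes needed = ceil(files / mappers per node)
$ files=260; mappers=12
$ echo $(( (files + mappers - 1) / mappers ))
22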

This gets tricky because the number of files per segment varies. I got around this by splitting the nodes into Core and Task groups. The size of the Task group can be altered while your job is running. However, the Core group can only be made bigger, because these nodes store data as well as process it.
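
Resizing the Task group of a running cluster can be done in the EMR console, or with the AWS CLI if you have it installed. A sketch, where the cluster ID and instance group ID are placeholders you would look up for your own cluster:

# find the Task instance group ID for your cluster
$ aws emr describe-cluster --cluster-id <your cluster id>

# then resize the Task group to the number of nodes you need
$ aws emr modify-instance-groups --instance-groups InstanceGroupId=<task group id>,InstanceCount=22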

Configuring jobs

For this walkthrough we ran the elasticrawl-examples Hadoop job, which is developed in Java. This is a simple implementation of the Hadoop Word Count example using the WET files.

The job configuration is stored in jobs.yml in the .elasticrawl directory. I’m going to do a follow-up post on how to configure elasticrawl to run your own jobs. For now, forking the elasticrawl-examples repository should get you started.
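
A sketch of getting the examples code locally (the repository URL is an assumption, so use the link in the elasticrawl README, and ideally your own fork):

# URL assumed; substitute your own fork
$ git clone https://github.com/rossf7/elasticrawl-examples.git
$ cd elasticrawl-examples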

What’s next

I developed elasticrawl to help with the analysis I’ve done of the July 2014 crawl. I’ll be releasing that in the next couple of weeks. This will include releasing the Hadoop code so you can see another example of using elasticrawl.

I’m also going to try to streamline the install process using the Traveling Ruby framework that has just been released.
