WikiReverse.org is my open data project using Common Crawl data. It’s a reverse link graph with 36 million external links to 4 million Wikipedia articles. One of the reasons I chose Wikipedia as the topic is that there is a wealth of open data, both released by Wikipedia itself and built by others using their data.
One of the enhancements I’m making to the WikiReverse site is to make browsing more interesting by categorising the articles and adding images. To do this I’m using data from the DBpedia project. They take the raw data dumps released by Wikipedia and create structured datasets for many facets of the data including categories, images and geographical locations.
The DBpedia data released for 2014 is based on Wikipedia data dumps from April and May 2014. The Common Crawl data was crawled in July 2014, so there should be good overlap between the two datasets.
DBpedia have developed a detailed ontology based on the infobox data from Wikipedia. The WikiReverse API is built using Ruby on Rails, so I added a Category model that uses acts_as_tree to create a tree structure of categories.
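To illustrate the kind of parent/child structure a tree of categories gives you, here is a minimal plain-Ruby sketch. It is not the actual Rails model: the class below is a stand-in written for illustration, and the four categories are just an assumed slice of the ontology.

```ruby
# Plain-Ruby stand-in for a tree-structured Category model (illustrative only).
# Each category knows its parent and keeps a list of its children.
class Category
  attr_reader :name, :parent, :children

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
    @children = []
    parent.children << self if parent
  end

  # Ancestors ordered from the immediate parent up to the root.
  def ancestors
    parent ? [parent] + parent.ancestors : []
  end

  def root?
    parent.nil?
  end
end

# Assumed slice of the ontology, used only as sample data.
thing       = Category.new('Thing')
agent       = Category.new('Agent', thing)
person      = Category.new('Person', agent)
philosopher = Category.new('Philosopher', person)

puts philosopher.ancestors.map(&:name).join(' > ')
```

Walking up from a leaf like this is what makes it possible to browse from a specific category back to broader ones.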
The DBpedia ontology is released as an XML file in OWL format. I developed a Rake task that parses the XML file and creates the tree structure, starting at the top level class http://www.w3.org/2002/07/owl#Thing. The mapping of articles to categories is available for download from DBpedia in RDF triples format.
<http://dbpedia.org/resource/Aristotle> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Philosopher> .
To parse this data in Ruby I used the linkeddata gem. This is a meta distribution of several gems based on the RDF.rb gem, which made it easy to parse the RDF triples data.
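To show what each triple carries, here is a rough sketch that pulls the article and category out of a single line. It deliberately does not use the linkeddata gem: the regular expression is a simplified stand-in that only handles lines where the subject, predicate and object are all IRIs, and the names TRIPLE_PATTERN and parse_triple are made up for this example.

```ruby
# Simplified stand-in for a real RDF parser: matches one N-Triples line
# whose subject, predicate and object are all IRIs in angle brackets.
TRIPLE_PATTERN = /\A<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*\z/

RDF_TYPE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'

# Returns [subject, predicate, object], or nil if the line does not match.
def parse_triple(line)
  match = TRIPLE_PATTERN.match(line.strip)
  match && match.captures
end

line = '<http://dbpedia.org/resource/Aristotle> ' \
       '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ' \
       '<http://dbpedia.org/ontology/Philosopher> .'

subject, predicate, object = parse_triple(line)

if predicate == RDF_TYPE
  # The last path segment of each IRI is the article or category name.
  article  = subject.split('/').last
  category = object.split('/').last
  puts "#{article} -> #{category}"
end
```

A real parser also has to deal with literals, escapes and blank nodes, which is why a library is the better choice for the full dataset.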
DBpedia also release mappings of images to articles in RDF triples format. I used this to get the thumbnail URLs for the articles that have the most links.
<http://dbpedia.org/resource/Astronomer> <http://dbpedia.org/ontology/thumbnail> <http://commons.wikimedia.org/wiki/Special:FilePath/JohannesVermeer-TheAstronomer(1668).jpg?width=300> .
Calling the Wikimedia API & downloading images
The images in Wikimedia Commons are primarily either Public Domain or released under the Creative Commons Share Alike license. To get the license data for the images I called the Wikimedia API.
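As a sketch of what such a license lookup can look like, the code below builds a query for the imageinfo property with extmetadata and reads the LicenseShortName field from a response. The helper names are hypothetical, and the sample response is hand-written and trimmed for illustration, not a captured API reply. No request is actually sent here.

```ruby
require 'json'
require 'uri'

API_HOST = 'commons.wikimedia.org'
API_PATH = '/w/api.php'

# Builds the query URI for a file title such as "File:Example.jpg".
def license_query_uri(file_title)
  params = {
    action: 'query',
    titles: file_title,
    prop: 'imageinfo',
    iiprop: 'extmetadata',
    format: 'json'
  }
  URI::HTTPS.build(host: API_HOST, path: API_PATH,
                   query: URI.encode_www_form(params))
end

# Returns the short license name from a response body, or nil if absent.
def license_name(response_body)
  pages = JSON.parse(response_body).dig('query', 'pages') || {}
  pages.each_value do |page|
    name = page.dig('imageinfo', 0, 'extmetadata', 'LicenseShortName', 'value')
    return name if name
  end
  nil
end

title = 'File:JohannesVermeer-TheAstronomer(1668).jpg'
uri = license_query_uri(title)

# Hand-written sample in the shape of an API response (assumed values).
sample = '{"query":{"pages":{"1":{"title":"File:Example.jpg","imageinfo":' \
         '[{"extmetadata":{"LicenseShortName":{"value":"Public domain"}}}]}}}}'

puts uri
puts license_name(sample)
```

Keeping the URL building and the response parsing in separate methods makes it easy to reprocess cached responses later without calling the API again.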
I didn’t want to overload the API so I tried to follow the Wikimedia advice for accessing it. This involved identifying my bot by setting the User-Agent header to WikiReverse.org ImageBot. I also made sure the requests were single-threaded and cached the API responses to S3 so I can reprocess them if necessary.
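The caching idea can be sketched as a small fetch-or-store wrapper. This is only an illustration under assumptions: a local directory stands in for the S3 bucket, and the ResponseCache class and its fetch method are invented for this example.

```ruby
require 'digest'
require 'fileutils'
require 'tmpdir'

# Hypothetical cache: a local directory stands in for the S3 bucket.
class ResponseCache
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(dir)
  end

  # Returns the cached body for key, or runs the block once and stores
  # its result so later calls do not hit the API again.
  def fetch(key)
    path = File.join(@dir, Digest::SHA256.hexdigest(key))
    return File.read(path) if File.exist?(path)

    body = yield
    File.write(path, body)
    body
  end
end

cache = ResponseCache.new(Dir.mktmpdir)
calls = 0

# The block stands in for a single API request.
2.times do
  cache.fetch('File:Example.jpg') do
    calls += 1
    '{"query":{}}'
  end
end

puts "API calls made: #{calls}"
```

Because everything runs sequentially in one thread, there is never more than one request in flight, and a second run over the same files reads from the cache instead of the API.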
The thumbnail images are first downloaded from upload.wikimedia.org and then uploaded to an S3 bucket. As with the metadata API calls, only a single image is downloaded at a time. For images I also check the x-cache header returned by the Wikimedia Varnish server. If there is a cache miss, the bot sleeps for a second before requesting the next image.
Here is how I use open-uri to download the file as a stream while also checking the cache header.
    require 'open-uri'

    USER_AGENT = 'WikiReverse.org ImageBot'
    CACHE_HEADER = 'x-cache'

    def download_file(file_url, file_name, wait_for_miss = false)
      File.open(file_name, 'wb') do |saved_file|
        # URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
        URI.open(file_url, 'rb', 'User-Agent' => USER_AGENT) do |read_file|
          saved_file.write(read_file.read)

          if wait_for_miss
            # The header may be missing, so convert nil to an empty string.
            cache = read_file.meta[CACHE_HEADER].to_s
            sleep(1) if cache.include?('miss')
          end
        end
      end
    end
So far I’ve downloaded around 19,000 images. This means all articles that are linked to by more than 66 sites will have an image, provided the image has a valid license. About 1,300 of the images do not have a suitable license and so have not been downloaded.
The new version of the wikireverse.org site with categories and images should be ready in the next couple of weeks.