
Tuesday, March 8, 2016

Distributing the Edit History of Wikipedia Infoboxes



Aside from its value as a general-purpose encyclopedia, Wikipedia is also one of the most widely used resources for acquiring, either automatically or semi-automatically, knowledge bases of structured data. Much research has been devoted to automatically building disambiguation resources, parallel corpora and structured knowledge from Wikipedia. Still, most of those projects have been based on single snapshots of Wikipedia, extracting the attribute values that were valid at a particular point in time. So about a year ago we compiled and released a dataset that allows researchers to see how those attribute values change over time.

Figure 1. Infobox for the Republic of Palau in 2006 and 2013 showing the capital change.

Many attributes vary over time. These include the presidents of countries, the spouses of people, the populations of cities and the number of employees of companies. Every Wikipedia page has an associated history from which users can view and compare past versions. Having the historical values of infobox entries available provides an overview of the changes affecting each entry: which attributes are more likely to change over time or change with some regularity, and which ones attract more user interest and are actually updated in a timely fashion. We believe that such a resource will also be useful for training systems that learn to extract data from documents, as it will allow us to collect more training examples by matching old values of an attribute inside old pages.

For this reason, we released, in collaboration with Wikimedia Deutschland e.V., a resource containing the full edit history of infoboxes in Wikipedia pages. While this information was already available indirectly in Wikimedia’s full history dumps, the smaller size of the released dataset makes it much easier to download and process. The dataset contains 38,979,871 infobox attribute updates for 1,845,172 different entities, and it is available for download both from Google and from Wikimedia Deutschland’s Toolserver page. A description of the dataset can be found in our paper WHAD: Wikipedia Historical Attributes Data, accepted for publication in the Language Resources and Evaluation journal.
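To make the shape of the data concrete, here is a minimal sketch of how one such attribute update could be modeled in Python. The field names and the example record are our own illustration, not the actual file format, which is described in the dataset documentation and the WHAD paper.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class AttributeUpdate:
    """One infobox attribute change, modeled for illustration only;
    see the dataset documentation for the actual record layout."""
    entity: str          # Wikipedia page the infobox belongs to
    attribute: str       # infobox field, e.g. "capital"
    old_value: str       # value before the edit
    new_value: str       # value after the edit
    timestamp: datetime  # when the edit was made

# Illustrative example based on Figure 1: Palau's capital changing
# from Koror to Ngerulmud (the date below is approximate).
update = AttributeUpdate(
    entity="Palau",
    attribute="capital",
    old_value="Koror",
    new_value="Ngerulmud",
    timestamp=datetime(2006, 10, 7),
)
print(update.attribute, update.old_value, "->", update.new_value)
```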

What kind of information can be learned from this data? Some examples from preliminary analyses include the following:
  • Every country in the world has a population attribute in its Wikipedia infobox, and for more than 90% of countries it is updated at least yearly. The average error rate with respect to the yearly World Bank estimates is between two and three percent, mostly due to rounding.
  • 50% of deaths are recorded in Wikipedia infoboxes within a couple of days... but for scientists it takes 31 days to reach 50% coverage!
  • For the last episode of a TV show, the airing date is updated within 9 days for 50% of shows; for the first episode, it takes 106 days (see the sketch below for how such lag statistics can be computed).
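As a rough illustration of where figures like "X days to reach 50% coverage" come from, here is a minimal sketch that computes the median update lag. The input lags are hypothetical and not taken from the dataset.

```python
import statistics

def days_to_half_coverage(update_lags_days):
    """Number of days within which 50% of the entities had the
    attribute updated, i.e. the median update lag."""
    return statistics.median(update_lags_days)

# Hypothetical lags (in days) between an event and its infobox update.
death_update_lags = [0, 1, 1, 2, 2, 3, 5, 14, 30, 90]
print(days_to_half_coverage(death_update_lags))  # -> 2.5
```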

While infobox attribute updates will be much easier to process as they transition into the Wikidata project, we are not there yet and we believe that the availability of this dataset will facilitate the study of changing attribute values. We are looking forward to the results of those studies.

Thanks to Googler Jean-Yves Delort, and to Guillermo Garrido and Anselmo Peñas from UNED, for putting this dataset together, and to Angelika Mühlbauer and Kai Nissen from Wikimedia Deutschland for their support. Thanks also to Thomas Hofmann and Fernando Pereira for making this data release possible.

Monday, February 15, 2016

A Multilingual Corpus of Automatically Extracted Relations from Wikipedia



In Natural Language Processing, relation extraction is the task of assigning a semantic relationship between a pair of arguments. As an example, a relationship between the phrases “Ottawa” and “Canada” is “is the capital of”. These extracted relations could be used in a variety of applications ranging from Question Answering to building databases from unstructured text.
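As a minimal sketch of the kind of output a relation extraction system produces, an open relation can be represented as a simple triple of two arguments and a relation phrase. The representation below is our own illustration, not the format of the released corpus.

```python
from typing import NamedTuple

class RelationTriple(NamedTuple):
    """An open-domain relation tuple: two arguments linked by a
    free-form relation phrase (illustrative representation only)."""
    arg1: str
    relation: str
    arg2: str

# The example from the text: "Ottawa" is the capital of "Canada".
triple = RelationTriple("Ottawa", "is the capital of", "Canada")
print(triple)
```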

While relation extraction systems work accurately for English and a few other languages, where tools for syntactic analysis such as parsers, part-of-speech taggers and named entity analyzers are readily available, there is relatively little work on developing such systems for most of the world's languages, where linguistic analysis tools do not yet exist. Fortunately, because we do have translation systems between English and many other languages (such as Google Translate), we can translate text from a non-English language into English, perform relation extraction on the translation, and project the extracted relations back onto the original language.
Relation extraction in a Spanish sentence using the cross-lingual relation extraction pipeline.
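As a rough sketch of this translate-extract-project idea, the following outline shows the three steps in Python. The function names are placeholders we made up, not the actual system's API.

```python
def extract_relations_cross_lingually(sentence, source_lang,
                                      translate, extract_english_relations,
                                      project_back):
    """Sketch of the translate -> extract -> project pipeline.

    `translate`, `extract_english_relations` and `project_back` are
    placeholders for a translation system that also returns word
    alignments, an English open relation extractor, and an
    alignment-based projection step; they are not real APIs.
    """
    # 1. Translate the source-language sentence into English,
    #    keeping the word alignment between the two sentences.
    english_sentence, alignment = translate(sentence, source_lang, "en")
    # 2. Run open relation extraction on the English translation.
    english_triples = extract_english_relations(english_sentence)
    # 3. Project each extracted triple back onto the original
    #    sentence using the word alignment.
    return [project_back(t, alignment, sentence) for t in english_triples]
```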
In Multilingual Open Relation Extraction Using Cross-lingual Projection, which will appear at the 2015 Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL HLT 2015), we use this idea of cross-lingual projection to develop an algorithm that extracts open-domain relation tuples, i.e., tuples in which an arbitrary phrase can describe the relation between the arguments, in multiple languages from Wikipedia. In this work, we also evaluated the quality of the extracted relations using human annotations in French, Hindi and Russian.

Since no such publicly available corpus of multilingual relations exists, we are releasing a dataset of relations automatically extracted from the Wikipedia corpus in 61 languages, along with manually annotated relations in 3 languages (French, Hindi and Russian). It is our hope that this data will help researchers working on natural language processing and encourage novel applications in a wide variety of languages. More details on the corpus and the file formats can be found in this README file.

We wish to thank Bruno Cartoni, Vitaly Nikolaev, Hidetoshi Shimokawa, Kishore Papineni, John Giannandrea and their teams for making this data release possible. This dataset is licensed by Google Inc. under the Creative Commons Attribution-ShareAlike 3.0 License.
 