Language understanding systems are largely trained on freely available data, such as the Penn Treebank, perhaps the most widely used linguistic resource ever created. We have previously released lots of linguistic data ourselves, to contribute to the language understanding community as well as encourage further research into these areas.
Now, we’re releasing a new dataset, based on another great resource: the New York Times Annotated Corpus, a set of 1.8 million articles spanning 20 years. Of these articles, 600,000 have hand-written summaries, and more than 1.5 million are tagged with the people, places, and organizations mentioned in them. The Times encourages use of the metadata for all kinds of things, and has set up a forum to discuss related research.
We recently used this corpus to study a topic called “entity salience”. To understand salience, consider: how do you know what a news article or a web page is about? Reading comes pretty easily to people -- we can quickly identify the places or things or people most central to a piece of text. But how might we teach a machine to perform this same task? This problem is a key step towards being able to read and understand an article.
One way to approach the problem is to look for words that appear more often than their ordinary rates. For example, if you see the word “coach” 5 times in a 581-word article, and compare that to the usual frequency of “coach” -- more like 5 in 330,000 words -- you have reason to suspect the article has something to do with coaching. The term “basketball” is even more extreme, appearing 150,000 times more often than usual. This is the idea behind the famous TF-IDF weighting, long used to index web pages.
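To make the arithmetic concrete, here is a minimal sketch of the rate comparison using the “coach” numbers above. The function name `frequency_ratio` is made up for illustration, and this captures only the ratio-of-rates intuition, not the full TF-IDF weighting.

```python
def frequency_ratio(count_in_doc, doc_length, background_count, background_length):
    """Return how many times more often a term occurs in a document
    than in ordinary (background) text."""
    doc_rate = count_in_doc / doc_length                      # e.g. 5 / 581
    background_rate = background_count / background_length    # e.g. 5 / 330,000
    return doc_rate / background_rate


# Numbers for "coach" taken from the paragraph above.
coach_ratio = frequency_ratio(5, 581, 5, 330_000)
print(f"'coach' appears about {coach_ratio:.0f} times more often than usual")
```

With these numbers the ratio comes out to roughly 568, so “coach” is several hundred times more frequent in this article than in ordinary text.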
Congratulations to Becky Hammon, first female NBA coach! Image via Wikipedia.
Background information about entities ought to help us decide which of them are most salient. After all, an article’s author assumes her readers have some general understanding of the world, and probably a bit about sports too. Using background knowledge, we might be able to infer that the WNBA is a salient entity in the Becky Hammon article even though it only appears once.
To encourage research on leveraging background information, we are releasing a large dataset of annotations to accompany the New York Times Annotated Corpus, including resolved Freebase entity IDs and labels indicating which entities are salient. The salience annotations are determined by automatically aligning entities in the document with entities in accompanying human-written abstracts. Details of the salience annotations and some baseline results are described in our recent paper: A New Entity Salience Task with Millions of Training Examples (Jesse Dunietz and Dan Gillick).
Since our entity resolver works better for named entities like WNBA than for nominals like “coach” (this is the notoriously difficult word sense disambiguation problem, which we’ve previously touched on), the annotations are limited to names.
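The alignment-based labeling described above can be sketched roughly as follows. This is a simplified stand-in, not the procedure from the paper: it assumes entities have already been resolved to IDs in both the article and its abstract, and it treats an exact ID match as "aligned". The function name and the placeholder IDs are hypothetical.

```python
def label_salience(document_entities, abstract_entities):
    """Mark each document entity as salient if it also appears in the abstract.

    Both arguments are iterables of resolved entity IDs (an assumption of
    this sketch); the result maps each document entity ID to True or False.
    """
    abstract_set = set(abstract_entities)
    return {entity: entity in abstract_set for entity in document_entities}


# Placeholder IDs, invented purely for illustration.
labels = label_salience(
    document_entities=["/m/entity_a", "/m/entity_b", "/m/entity_c"],
    abstract_entities=["/m/entity_b"],
)
print(labels)
```

Only `/m/entity_b` is marked salient here, since it is the only document entity that also shows up in the abstract.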
Below is sample output for a document. The first line contains the NYT document ID and the headline; each subsequent line includes an entity index, an indicator for salience, the mention count for this entity in the document as determined by our coreference system, the text of the first mention of the entity, the byte offsets (start and end) for the first mention of the entity, and the resolved Freebase MID.
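A minimal sketch of reading records in the layout just described. Several details are assumptions rather than facts about the released files: that fields are tab-separated, that the salience indicator is `1` or `0`, and that one document's lines are passed in at a time. The names `EntityRecord` and `parse_document` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class EntityRecord:
    index: int            # entity index within the document
    salient: bool         # salience indicator
    mention_count: int    # mentions found by the coreference system
    first_mention: str    # text of the first mention
    start_byte: int       # byte offset where the first mention starts
    end_byte: int         # byte offset where the first mention ends
    mid: str              # resolved Freebase MID


def parse_document(lines: Iterable[str], sep: str = "\t") -> Tuple[str, str, List[EntityRecord]]:
    """Parse one document: a header line, then one line per entity.

    Assumes tab-separated fields and a '1'/'0' salience flag; adjust
    `sep` and the flag check if the actual files differ.
    """
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    header, rows = cleaned[0], cleaned[1:]
    doc_id, _, headline = header.partition(sep)

    entities = []
    for row in rows:
        index, salient, count, text, start, end, mid = row.split(sep)
        entities.append(
            EntityRecord(int(index), salient == "1", int(count),
                         text, int(start), int(end), mid)
        )
    return doc_id, headline, entities
```

Keeping the separator as a parameter makes it easy to adapt the sketch once the real delimiter is confirmed against the downloaded files.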

Download the data directly from Google Drive, or visit the project home page on our Google Code site for more information. We look forward to seeing what you come up with!