“Two pizzas sitting on top of a stove top oven”
“A group of people shopping at an outdoor market”
“Best seats in the house”
People can summarize a complex scene in a few words without thinking twice. It’s much more difficult for computers. But we’ve just gotten a bit closer -- we’ve developed a machine-learning system that can automatically produce captions (like the three above) to accurately describe images the first time it sees them. This kind of system could eventually help visually impaired people understand pictures, provide alternate text for images in parts of the world where mobile connections are slow, and make it easier for everyone to search on Google for images.
Recent research has greatly improved object detection, classification, and labeling. But accurately describing a complex scene requires a deeper representation of what’s going on in the scene, capturing how the various objects relate to one another and translating it all into natural-sounding language.
![]() |
Automatically captioned: “Two pizzas sitting on top of a stove top oven” |
This idea comes from recent advances in machine translation between languages, where a Recurrent Neural Network (RNN) transforms, say, a French sentence into a vector representation, and a second RNN uses that vector representation to generate a target sentence in German.
Now, what if we replaced that first RNN and its input words with a deep Convolutional Neural Network (CNN) trained to classify objects in images? Normally, the CNN’s last layer is used in a final Softmax among known classes of objects, assigning a probability that each object might be in the image. But if we remove that final layer, we can instead feed the CNN’s rich encoding of the image into a RNN designed to produce phrases. We can then train the whole system directly on images and their captions, so it maximizes the likelihood that descriptions it produces best match the training descriptions for each image.
![]() |
The model combines a vision CNN with a language-generating RNN so it can take in an image and generate a fitting natural-language caption. |
![]() |
A selection of evaluation results, grouped by human rating. |
Related Post:
a
- Take a better selfie with Lily
- Calculating Ada The Countess of Computing
- Creating a templated Binary Search Tree Class in C
- Projecting without a projector sharing your smartphone content onto an arbitrary display
- Will a robot take your job
- Hacker Tricks from Insiders A Threat to ERP Systems
- Forget Turing the Lovelace Test Has a Better Shot at Spotting AI
- A Billion Words Because todays language modeling standard should be higher
- Apple is building a car
- A step closer to quantum computation with Quantum Error Correction
- Could you fly a fighter jet with your mind
- Mounting the home directory on a different drive on the Raspberry Pi
- How Google Translate squeezes deep learning onto a phone
- The Plan to Build a Massive Online Brain for All the World’s Robots
- A Beginner’s Guide to Deep Neural Networks
- How to Copy or Hide a File inside an Image
- The life of a software engineer
- A Farewell to Orkut
- A Project on Windows NT
- Building A Visual Planetary Time Machine
- 10 awesome internet hacks to make your life better
- Google Databoard A new way to explore industry research
- How to put a flash mp3 player in blogger post
- A year and a bit with Inbox Zero
- Map of Life A preview of how to evaluate species conservation with Google Earth Engine
coherent
0 comments:
Post a Comment