An academic paper published recently by two Stanford researchers describes the state of computer image recognition and proposes a method of improving on that recognition. I first learned about the existence of the research at Metafilter, so it seems proper to place the views of that site’s commenters alongside my own as I describe the researchers’ findings and provide a bit of my own analysis.
Andrej Karpathy and Li Fei-Fei are both members of the Department of Computer Science at Stanford University. Their paper, “Deep Visual-Semantic Alignments for Generating Image Descriptions,” which can be found at the paper’s corresponding website, describes Karpathy’s and Fei-Fei’s use of a convolutional neural network (CNN) to detect images, a recurrent neural network (RNN) to create sentences, and a combination of those two neural networks to create textual representations of select images.
The researcher’s region convolutional neural network takes image data and processes it in such a way that it can tile and analyze small collections of data to form a more coherent whole. The algorithm associated with the CNN creates bounding boxes that define specific elements within an overall image, such as the racquet, dress, and bystanders in a photo where a woman is holding a tennis racquet and where two men are standing in the background. Furthermore, the researcher’s bidirectional recurrent neural network generates words and sentence fragments which are informed by their context.
The CNN essentially moves in a straight line from Point A, specific data about a small section of an image, to Point B, the generation and recognition of elements of an overall image. In contrast, the RNN uses a feedback loop to analyze the elements generated by the CNN and creates sentences based on information it learns by recycling data. This distinction, in my limited understanding of the specifics of the researcher’s methods, appears to be important because they are able to get the best of both worlds.
By using this combined approach, they were able to capitalize on the weaknesses of their predecessors and create a set of algorithms that, the researchers note, performed better than other image retrieval methods. This is where the limitations of the researcher’s method seem to begin. They note in the paper that the CNN was trained on the image database ImageNet and on 200 classes of images used in the ImageNet Detection Challenge. That baseline, while broad, is inherently limited in the sense that the CNN does not appear to freely recognize images out of thin air; it must gather data from its selected training. Therefore, as Metafilter commenter advil points out, the system is “likely to be constrained as to the kinds of images it can handle.”
To be fair, the system is impressive. But it is also inherently limited. advil also notes that the the system does not come anywhere near levels of human performance — an idea somewhat suggested in the New York Times article that discusses the research. Pushback comes from fellow Mefite tss who states, “I don’t think ‘near human’ performance is necessary for this work to be significant.” Indeed, the researchers note that their results are “encouraging,” but they also describe various areas in need of improvement. Their model, for instance, is limited to generating image elements that exist at a fixed resolution. In other words, it does not analyze pixels around any element that are of a different resolution than that element. A better method of analysis could incorporate pixels of varying resolutions to generate a more complete survey of image contents.
Several other commenters grab onto the limitation that the system needs to learn what to recognize in order to recognize similar elements at a later date. As Mefite makes clear, however, a defining feature of neural networks is that they learn about data before being called on to recognize similar data. Of the entirety of the relatively short list of comments, this one may place the paper in the most perspective. The Stanford researchers showed that their unique approach of combined modalities could generate descriptions of images at a higher level than baselines. Even if it is limited, it appears to be progress that could ultimately impact image recall through Web-based search engines or image description applications for people with vision impairments — hat tip to Mefite idiopath for those examples. Clearly, what the researchers — in fact, all of us — have on our hands is a system still in its infancy but one which shows signs of progress.
Image courtesy of Exemple_d’infractions.jpg via Wikimedia Commons