Microsoft today announced the availability of its AI-based image captioning technology through Azure Cognitive Services. The company also claims the system can now describe images as well as humans do.
The new milestone should help developers improve accessibility in their own applications. With AI-powered image captioning, users can access important content in images, such as photos returned in search results or included in a presentation. The software giant cautioned, however, that results may not be perfect all the time.
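For developers, the capability is exposed through the Computer Vision REST API in Azure Cognitive Services, whose "describe" operation returns generated captions with confidence scores. The sketch below builds such a request using only the standard library; the resource endpoint, key, and image URL are placeholders, and the exact path shown (v3.1) is an assumption about the API version in use.

```python
import json
import os
import urllib.request

def build_describe_request(endpoint, key, image_url, max_candidates=1):
    """Build an HTTP request asking the Computer Vision service to
    caption the image at image_url. Endpoint and key are assumed to
    come from your own Azure resource."""
    url = f"{endpoint}/vision/v3.1/describe?maxCandidates={max_candidates}"
    body = json.dumps({"url": image_url}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
    )

# Hypothetical resource endpoint; the key is read from the environment.
req = build_describe_request(
    "https://example.cognitiveservices.azure.com",
    os.environ.get("AZURE_CV_KEY", "<your-key>"),
    "https://example.com/photo.jpg",
)
# Sending the request with urllib.request.urlopen(req) returns JSON in
# which description.captions is a list of {"text": ..., "confidence": ...}
# entries, ready to be used as alt text.
```

An application would typically take the highest-confidence caption as the image's alt text, falling back to a generic label when the confidence is low, in line with Microsoft's caution that results may not be perfect.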
More importantly, Saqib Shaikh, a software engineering manager with Microsoft’s AI platform group, said image captioning can help people with visual disabilities by generating a photo description, commonly referred to as alt text, in a web page or document. His team also uses the system in the Seeing AI talking camera app to describe photos for people who are blind or have low vision.
“Ideally, everyone would include alt text for all images in documents, on the web, in social media – as this enables people who are blind to access the content and participate in the conversation. But, alas, people don’t. So, there are several apps that use image captioning as a way to fill in alt text when it’s missing.”
Microsoft also claims the new system is twice as good as the image captioning model that has been in use since 2015, and that it produced captions that "were more descriptive and accurate" than the ones written by people for the same images.
Later this year, the image captioning technology will also be incorporated into Microsoft Word and Outlook for Windows and Mac, and into PowerPoint for Windows, Mac, and the web. It will be interesting to see by then how the system works in the real world compared with competing AI models.