Nvidia Jarvis—a multi-modal AI SDK—fuses speech, vision, and other sensors into one system

Today, at the 5G Mobile World Conference, Nvidia co-founder and CEO Jensen Huang announced Nvidia Jarvis, a multi-modal AI software development kit that combines speech, vision, and other sensors in one AI system.

NVIDIA Jarvis is an SDK for building and deploying AI applications that fuse vision, speech and other sensors. It offers a complete workflow to build, train and deploy GPU-accelerated AI systems that can use visual cues such as gestures and gaze along with speech in context.

As noted above, Jarvis is the company's attempt to process multiple inputs from different sensors simultaneously. The rationale is that fusing these inputs helps build the context needed to accurately predict and generate responses in conversation-based AI applications. To illustrate, Nvidia gave examples in its blog post of situations where this might help:

...lip movement can be fused with speech input to identify the active speaker. Gaze can be used to understand if the speaker is engaging the AI agent or other people in the scene. Such multi-modal fusion enables simultaneous multi-user, multi-context conversations with the AI agent that need deeper understanding of the context.

In Jarvis, Nvidia has included modules that can be tweaked according to the user's requirements. For vision, Jarvis has modules for person detection and tracking, and for detection of gestures, lip activity, gaze, and body pose. For speech, the system has sentiment analysis, dialog modeling, domain and intent, and entity classification. Fusion algorithms are used to synchronize the operation of these models within the system.
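
To make the idea of fusing lip activity with speech concrete, here is a minimal Python sketch. It is not the Jarvis API, which Nvidia has not detailed here. Every name in it (LipObservation, SpeechSegment, active_speaker) is hypothetical, and it assumes a vision module that emits a per-person lip-activity score between 0 and 1, and a speech module that emits timed segments. The sketch attributes a speech segment to the person whose lips moved most while it was spoken.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LipObservation:
    """Hypothetical output of a vision module for one person at one instant."""
    person_id: str
    timestamp: float      # seconds since the start of the session
    lip_activity: float   # assumed score: 0.0 (still) to 1.0 (clearly moving)


@dataclass
class SpeechSegment:
    """Hypothetical output of a speech module: a timed stretch of speech."""
    start: float
    end: float
    text: str


def active_speaker(segment: SpeechSegment,
                   observations: List[LipObservation]) -> Optional[str]:
    """Return the person with the highest mean lip activity during the segment.

    Returns None when no lip observation falls inside the segment.
    """
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for obs in observations:
        # Only observations that overlap the speech segment in time count.
        if segment.start <= obs.timestamp <= segment.end:
            totals[obs.person_id] = totals.get(obs.person_id, 0.0) + obs.lip_activity
            counts[obs.person_id] = counts.get(obs.person_id, 0) + 1
    if not totals:
        return None
    return max(totals, key=lambda pid: totals[pid] / counts[pid])


if __name__ == "__main__":
    segment = SpeechSegment(start=1.0, end=2.0, text="turn on the lights")
    observations = [
        LipObservation("alice", 1.2, 0.9),
        LipObservation("alice", 1.8, 0.8),
        LipObservation("bob", 1.5, 0.1),
        LipObservation("bob", 3.0, 0.95),  # outside the segment, ignored
    ]
    print(active_speaker(segment, observations))
```

A production system would presumably work on learned features rather than a simple average, but the time-alignment step shown here is the core of this kind of fusion.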

Moreover, the firm claims that Jarvis-based applications work best when used in conjunction with Nvidia Neural Modules (NeMo), a framework-agnostic toolkit for creating AI applications built around neural modules. For cloud-based applications, services developed using Jarvis can be deployed using the EGX platform, which Nvidia is touting as the world's first edge supercomputer. For edge and Internet of Things use cases, Jarvis runs on the Nvidia EGX stack, which is compatible with much of the Kubernetes infrastructure available today.

Jarvis is now open for early access. If you are interested, you can log in to your Nvidia account and sign up for early access to it here.