MIT researchers build AI system that can model the world using sound in latest breakthrough


Computer vision has been researched extensively over the past few decades, primarily because of its immediate and obvious applications in building autonomous vehicles and other tools that can "see" the world as humans do. However, one area that has not seen this level of research until recently is the use of sound instead of sight to model an environment. Now, researchers at the Massachusetts Institute of Technology (MIT) have published a research paper on a machine learning (ML) model trained in this domain.

A blog post on the MIT News website explains that researchers at MIT and the MIT-IBM Watson AI Lab have collaborated to build an ML model that uses spatial acoustics to see and model the environment. Simply put, the model maps an environment by working out how a listener would hear a sound that originates at one point and propagates to different positions.

This technique has numerous benefits, since it allows the underlying 3D geometry of objects in the environment to be determined using just sound. It can then render accurate visuals to reconstruct the environment. Potential applications include virtual and augmented reality, along with augmenting AI agents so that they can use both sound and sight to better visualize their environments. For example, an underwater exploration robot could use acoustics to determine the location of certain objects better than it could with computer vision.

A graphic showing a 3D model of a room at the top and a heatmap-style visualization of sound at the bottom

The researchers have emphasized that building this ML model based on sound was considerably more complex than building one based on computer vision. This is because computer vision models leverage a property called photometric consistency, which means that an object looks roughly the same when viewed from different angles. This does not apply to sound: depending on your location and any obstacles in the way, what you hear from a source may be highly variable.

To tackle this problem, the researchers relied on two other properties: reciprocity and local geometry. The former means that even if you swap the locations of the speaker and the listener, the sound will be exactly the same. Meanwhile, local geometry mapping involved incorporating reciprocity into a neural acoustic field (NAF) to capture objects and other architectural components.
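The swap property described above can be illustrated with a small Python sketch. It is not the researchers' implementation: `raw_response` is a hypothetical stand-in for any model that predicts a response from an emitter position and a listener position, and averaging over both orderings is just one simple way to guarantee that swapping the two positions cannot change the output.

```python
import math


def raw_response(emitter, listener):
    """Hypothetical, deliberately asymmetric stand-in for an unconstrained model."""
    ex, ey = emitter
    lx, ly = listener
    distance = math.hypot(ex - lx, ey - ly)
    # The 0.1 * ex term depends only on the emitter, so swapping changes the result.
    return 1.0 / (1.0 + distance) + 0.1 * ex


def reciprocal_response(emitter, listener):
    """Average both orderings so that swapping the endpoints gives the same value."""
    return 0.5 * (raw_response(emitter, listener) + raw_response(listener, emitter))


speaker = (1.0, 2.0)
listener = (4.0, 6.0)

print(raw_response(speaker, listener), raw_response(listener, speaker))
print(reciprocal_response(speaker, listener), reciprocal_response(listener, speaker))
```

The first line of output shows two different values, while the second shows the same value twice, which is the behavior the swap property requires.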

To work in test environments, the ML model needs to be fed some visual information and spectrograms containing samples of what the audio would sound like for specified locations of the originator and the listener. Given these inputs, the model can accurately determine how the sound will change as the listener moves around the environment.
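For readers unfamiliar with the term, a spectrogram records how much energy a sound has at each frequency over successive short time windows. The article does not say how the researchers computed theirs, so the following is only a generic sketch using the Python standard library; the function name, frame size, hop size, and test tone are all assumptions chosen for illustration.

```python
import cmath
import math


def magnitude_spectrogram(signal, frame_size=64, hop=32):
    """Return a list of frames; each frame lists magnitudes for bins 0..frame_size // 2."""
    # A Hann window tapers each frame to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_size)]
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            # Naive discrete Fourier transform of the windowed frame at bin k.
            total = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n in range(frame_size))
            magnitudes.append(abs(total))
        frames.append(magnitudes)
    return frames


sample_rate = 8000  # samples per second (assumed for this example)
frame_size = 64
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(512)]

spec = magnitude_spectrogram(tone, frame_size=frame_size)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / frame_size
print(len(spec), len(spec[0]), peak_hz)
```

With these settings each frequency bin spans 125 Hz, so the 1,000 Hz test tone shows up as a peak in bin 8 of every frame.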

The research paper's lead author Andrew Luo noted that:

"If you imagine standing near a doorway, what most strongly affects what you hear is the presence of that doorway, not necessarily geometric features far away from you on the other side of the room. We found this information enables better generalization than a simple fully connected network."

Moving forward, the researchers want to further enhance the model so it can visualize bigger and more complex environments such as a building or even an entire city. In the meantime, you can read their research paper here.
