Meta shows AI-based models to improve audio in XR experiences

Researchers at Meta AI and Reality Labs have made three audiovisual understanding models available to developers, designed to make sound more realistic in mixed and virtual reality experiences: Visual-Acoustic Matching, Visually-Informed Dereverberation, and VisualVoice.

Visual-Acoustic Matching: audio and video must match coherently, otherwise a scene feels jarring to human perception. AViTAR, a self-supervised visual-acoustic matching model, transforms audio so that it matches the acoustics of the space shown in a target image.
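
As a rough sketch of the underlying idea (not Meta's released AViTAR implementation), acoustic matching can be pictured as convolving the original "dry" audio with a room impulse response that some model estimates from the target image; the `acoustic_estimator` mentioned below is a hypothetical placeholder for that image-to-impulse-response step.

```python
# Conceptual sketch only: re-render dry audio with the acoustics of a pictured
# room by convolving it with a room impulse response. The image-to-impulse-
# response model ("acoustic_estimator") is hypothetical, standing in for what a
# visual-acoustic matching network would learn.
import torch

def match_acoustics(dry_wav: torch.Tensor, rir: torch.Tensor) -> torch.Tensor:
    """FFT-convolve mono audio [T] with a room impulse response [L]."""
    n = dry_wav.shape[-1] + rir.shape[-1] - 1
    wet = torch.fft.irfft(torch.fft.rfft(dry_wav, n) * torch.fft.rfft(rir, n), n)
    return wet / wet.abs().max()  # simple peak normalization

# Usage with placeholder inputs (e.g. loaded via torchaudio.load):
# rir = acoustic_estimator(target_room_image)   # hypothetical model call
# wet = match_acoustics(dry_voice, rir)
```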

Visually-Informed Dereverberation: in the example Meta proposes, if we are watching a hologram of our daughter's ballet recital, the audio we hear should sound the same as when we watched her dance from our seat in the theater. Music carries a different reverberation in each environment, and reliving a memory feels more realistic when the sound matches that space rather than being artificially pure and crystal clear.
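
Purely as an illustration of the technique (this is not Meta's model), dereverberation can be framed as mask-based enhancement: a network conditioned on the room image and the reverberant spectrogram predicts a mask that suppresses the reverberant tail, leaving the cleaner direct sound that can later be re-rendered elsewhere. The mask here is assumed to come from such a hypothetical network.

```python
# Conceptual sketch only: apply a dereverberation mask (assumed to be predicted
# by a hypothetical visually-informed network) to reverberant mono audio and
# resynthesize the cleaned signal.
import torch

def apply_dereverb_mask(reverberant: torch.Tensor, mask: torch.Tensor,
                        n_fft: int = 512, hop: int = 128) -> torch.Tensor:
    """Apply a [freq, frames] mask in [0, 1] to mono audio [T] and invert the STFT."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(reverberant, n_fft, hop, window=window, return_complex=True)
    cleaned = spec * mask  # attenuate time-frequency bins dominated by reverb
    return torch.istft(cleaned, n_fft, hop, window=window,
                       length=reverberant.shape[-1])
```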

VisualVoice: the AI learns new skills by picking up visual and auditory cues from unlabeled videos, achieving audio-visual speech separation. In an avatar meeting in the metaverse, the AI would adjust the audio as we approached a group of people so we could hear them better, and if we moved to another group it would adapt to the new situation.
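
The general technique behind audio-visual speech separation can be sketched as follows; this toy separator is an illustrative stand-in, not the released VisualVoice architecture. A face embedding of the target speaker conditions a network that predicts a spectrogram mask isolating that speaker's voice from the mixture.

```python
# Conceptual sketch of audio-visual speech separation: a per-speaker face
# embedding conditions a small network that predicts one spectrogram mask per
# speaker from the mixed audio. Shapes and sizes are illustrative.
import torch
import torch.nn as nn

class ToySeparator(nn.Module):
    """Toy stand-in: fuses a face embedding with mixture frames to predict a mask."""
    def __init__(self, freq_bins: int = 257, face_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(freq_bins + face_dim, 256), nn.ReLU(),
            nn.Linear(256, freq_bins), nn.Sigmoid(),
        )

    def forward(self, mix_mag: torch.Tensor, face_emb: torch.Tensor) -> torch.Tensor:
        # mix_mag: [frames, freq] mixture magnitudes; face_emb: [face_dim]
        face = face_emb.expand(mix_mag.shape[0], -1)      # repeat over frames
        return self.net(torch.cat([mix_mag, face], dim=-1))  # per-speaker mask

# The predicted mask is applied to the mixture STFT and inverted to recover
# that speaker's voice, in the same way as the dereverberation sketch above.
```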

The goal of these three research models is to create mixed reality and virtual reality experiences in which AI is central to delivering realistic, immersive sound.
