Ambient speech is recovered from video of a bag of chips

In a technical paper to be presented at SIGGRAPH 2014 (10-14 Aug. 2014; Vancouver, BC, Canada), researchers at the Massachusetts Institute of Technology's Computer Science and Artificial Intelligence Laboratory (MIT CSAIL; Cambridge, MA), Microsoft Research (Redmond, WA), and Adobe Research (San Jose, CA) will describe their successful efforts to recover ambient sound from video, taken at a distance, of an object such as a bag of chips.

John Wallace

Aug. 5, 2014



The one caveat is that most of their experiments use a high-speed video system (frame rates of 2 to 20 kHz); however, they also describe and test a method that uses CMOS cameras operating at the normal 60 Hz video rate.


High frame rates

The experimental setup includes an object, a loudspeaker (placed on a separate stand from the object), the video camera, and photography lamps. The high-end cutoff frequency of the recovered audio is set by the video frame rate: under the Nyquist sampling criterion it can be no higher than half the frame rate, so higher rates lead to higher cutoff frequencies.
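
As a rough illustration of how frame rate bounds the recoverable audio, the short Python sketch below applies the standard Nyquist limit (half the sampling rate) to a few frame rates mentioned in this article. The function name is hypothetical, and the calculation assumes uniform, whole-frame sampling; it is not taken from the researchers' code.

```python
def max_recoverable_frequency_hz(frame_rate_fps: float) -> float:
    """Upper bound on recoverable audio frequency for uniform frame sampling.

    Assumption: each video frame is one sample of the vibration signal, so
    the Nyquist limit (half the sampling rate) applies directly.
    """
    if frame_rate_fps <= 0:
        raise ValueError("frame rate must be positive")
    return frame_rate_fps / 2.0


if __name__ == "__main__":
    # Frame rates quoted in the article: normal video, the chip-bag
    # experiments, and the top of the high-speed range.
    for fps in (60, 2200, 20000):
        limit = max_recoverable_frequency_hz(fps)
        print(f"{fps} FPS: audio content up to about {limit:.0f} Hz")
```

Under this simple model, ordinary 60 FPS video sampled frame by frame would capture only very low frequencies, which is why the high-speed camera matters for speech.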

Captured resolutions ranged from 192 x 192 to 700 x 700 pixels. Sound volumes ranged from 80 dB (actor's stage voice) to 110 dB (comparable to a jet engine running 100 m away). The researchers used publicly available 14-year-old code to process the videos.

In addition to ramp signals for characterization, the researchers tested the setup on human voices, including a live speaker reciting the nursery rhyme "Mary Had a Little Lamb." Most of the experiments focused on the bag of chips recorded at 2200 frames per second (FPS).

Speech recovery was successful, with results comparable to those obtained using a laser Doppler vibrometer combined with retroreflective tape. One great advantage of the video approach is that it needs neither active lighting nor retroreflective tape.

Low frame rates

Even more interesting was how the researchers took advantage of what is normally considered a disadvantage of inexpensive CMOS imagers (such as those in phones and DSLR cameras). A typical CMOS device has a "rolling shutter," in which individual lines are read out sequentially to create an image. If each line of a 60 FPS video is treated as a separate exposure, effective "frame" rates of up to about 2000 Hz can be achieved.
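
The rolling-shutter idea can be sketched as a timing model in which every image row gets its own timestamp. The Python example below is a minimal sketch under stated assumptions: the function name, the fixed per-row delay, and the numeric values are hypothetical placeholders chosen for illustration, not figures from the paper, and the model ignores any processing needed to bridge the gap between the last row of one frame and the first row of the next.

```python
def row_sample_times(frame_rate_hz, rows_per_frame, line_delay_s, n_frames):
    """Return one timestamp (in seconds) per image row, in readout order.

    Assumed model: row y of frame n is exposed at
        t = n / frame_rate_hz + y * line_delay_s
    where line_delay_s is a hypothetical fixed delay between adjacent rows.
    """
    frame_period_s = 1.0 / frame_rate_hz
    if rows_per_frame * line_delay_s > frame_period_s:
        raise ValueError("rows do not fit within one frame period")
    return [
        n * frame_period_s + y * line_delay_s
        for n in range(n_frames)
        for y in range(rows_per_frame)
    ]


if __name__ == "__main__":
    # Placeholder values: 60 FPS video, 30 rows used per frame,
    # 0.5 ms between rows (so rows within a frame arrive far faster
    # than whole frames do).
    times = row_sample_times(60.0, 30, 0.0005, 2)
    print(f"{len(times)} row samples from 2 frames")
    print(f"spacing within a frame: {times[1] - times[0]:.4f} s")
    print(f"spacing between frames: {times[30] - times[0]:.4f} s")
```

The point of the sketch is only that row-by-row readout samples the scene many times per frame; the actual sampling rate depends on the sensor's real line timing.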

With a loudspeaker playing speech and the bag of chips as the object, a processed and "denoised" signal was obtained; the resulting audio is to be made available as clips online.

The technology is patent pending.

MIT CSAIL's page on the SIGGRAPH paper: http://people.csail.mit.edu/mrub/VisualMic/

About the Author

John Wallace

Senior Technical Editor (1998-2022)

John Wallace was with Laser Focus World for nearly 25 years, retiring in late June 2022. He obtained a bachelor's degree in mechanical engineering and physics at Rutgers University and a master's in optical engineering at the University of Rochester. Before becoming an editor, John worked as an engineer at RCA, Exxon, Eastman Kodak, and GCA Corporation.