In a technical paper to be presented at SIGGRAPH 2014 (10-14 Aug. 2014; Vancouver, BC, Canada), researchers at the Massachusetts Institute of Technology's Computer Science and Artificial Intelligence Laboratory (MIT CSAIL; Cambridge, MA), Microsoft Research (Redmond, WA), and Adobe Research (San Jose, CA) will describe their successful efforts to recover ambient sound from video, shot from a distance, of an object such as a houseplant or a bag of chips.
The one caveat is that they use a high-speed (2 to 20 kHz) video system for most of their experiments; however, they do describe and try out a method using CMOS cameras that operate at the normal 60 Hz video rate.
High frame rates
The experimental setup includes an object, a loudspeaker (placed on a separate stand from the object), the video camera, and photography lamps. The high-end cutoff frequency of the recovered audio is naturally related to the video frame rate used (higher rates lead to higher cutoff frequencies).
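The link between frame rate and cutoff frequency can be sketched with the standard sampling theorem, which the article does not spell out: a signal sampled at a given rate can represent frequencies only up to half that rate. The function name below is hypothetical, and the result is an ideal upper bound; the cutoff actually achieved in the experiments may be lower.

```python
def nyquist_cutoff_hz(frame_rate_hz: float) -> float:
    """Ideal upper frequency limit (Hz) for a signal sampled at frame_rate_hz."""
    if frame_rate_hz <= 0:
        raise ValueError("frame rate must be positive")
    # Sampling theorem: frequencies above half the sampling rate alias.
    return frame_rate_hz / 2.0


# Frame rates mentioned in the article: normal video, the chip-bag
# experiments, and the top of the high-speed range.
for fps in (60, 2200, 20000):
    print(f"{fps:>6} FPS -> cutoff of at most {nyquist_cutoff_hz(fps):.0f} Hz")
```

Under this bound, ordinary 60 FPS video would capture vibrations only up to 30 Hz, which is why the high-speed camera matters for speech.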
Captured resolutions ranged from 192 × 192 to 700 × 700 pixels. Sound levels ranged from 80 dB (an actor's stage voice) to 110 dB (comparable to a jet engine running 100 m away). The researchers used publicly available 14-year-old code to process the videos.
In addition to ramp signals for characterization, the researchers tested the setup on human voices, including a live speaker reciting the poem "Mary Had a Little Lamb." Most experiments focused on the bag of chips at 2200 frames per second (FPS).
Speech recovery was successful, with results comparable to those obtained with a laser Doppler vibrometer combined with retroreflective tape. A major advantage of the video approach is that it needs no active lighting or retroreflective tape.
Low frame rates
Even more interesting was how the researchers took advantage of what is normally considered a disadvantage of inexpensive CMOS imagers (such as those in phones and DSLR cameras). A typical CMOS device has a "rolling shutter," in which individual lines are read out sequentially to create an image. If each line of a 60 FPS video is treated as a separate exposure, effective "frame" rates of up to about 2000 Hz can be achieved.
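The rolling-shutter idea can be illustrated with a minimal timing sketch. It is not the researchers' algorithm: the function names, the `readout_fraction` parameter, and the figure of 33 samples per frame are all assumptions made here for illustration. The sketch assumes rows are read out at evenly spaced times within each frame period.

```python
def row_sample_times(frame_rate_hz, rows_per_frame, num_frames, readout_fraction=1.0):
    """Approximate capture time (s) of every row in a rolling-shutter video.

    readout_fraction is an assumed share of each frame period spent reading
    rows; any remainder is idle time before the next frame starts.
    """
    if frame_rate_hz <= 0 or rows_per_frame <= 0:
        raise ValueError("frame rate and row count must be positive")
    if not 0 < readout_fraction <= 1:
        raise ValueError("readout_fraction must be in (0, 1]")
    frame_period = 1.0 / frame_rate_hz
    row_delay = readout_fraction * frame_period / rows_per_frame
    return [
        frame * frame_period + row * row_delay
        for frame in range(num_frames)
        for row in range(rows_per_frame)
    ]


def effective_sample_rate_hz(frame_rate_hz, samples_per_frame):
    """Average sampling rate when each frame yields several time samples."""
    return frame_rate_hz * samples_per_frame


# Two frames of a toy 4-row sensor at 60 FPS: 8 distinct sample times.
print(row_sample_times(60.0, 4, 2))

# Illustrative only: about 33 usable samples per frame at 60 FPS lands
# near the "about 2000 Hz" figure quoted in the article.
print(effective_sample_rate_hz(60, 33))
```

The point of the sketch is only that each row carries its own timestamp, so one frame contributes many time samples instead of one. How many of them are usable in practice depends on the sensor and on the processing, which this sketch does not model.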
With a loudspeaker playing speech and the bag of chips as an object, a processed and "denoised" signal was obtained; the resulting audio will be available as clips here.
The technology is patent pending.
MIT CSAIL's page on the SIGGRAPH paper: http://people.csail.mit.edu/mrub/VisualMic/