Interesting People mailing list archives

Google works out a fascinating, slightly scary way for AI to isolate voices in a crowd


From: "Dave Farber" <farber () gmail com>
Date: Sun, 15 Apr 2018 09:56:16 -0400




Begin forwarded message:

From: Dewayne Hendricks <dewayne () warpspeed com>
Date: April 15, 2018 at 7:51:06 AM EDT
To: Multiple recipients of Dewayne-Net <dewayne-net () warpspeed com>
Subject: [Dewayne-Net] Google works out a fascinating, slightly scary way for AI to isolate voices in a crowd
Reply-To: dewayne-net () warpspeed com

Google works out a fascinating, slightly scary way for AI to isolate voices in a crowd
Google researchers try to replicate the “cocktail party effect” for computers.
By JEFF DUNN
Apr 13 2018
<https://arstechnica.com/gadgets/2018/04/google-works-out-a-fascinating-slightly-scary-way-for-ai-to-isolate-voices-in-a-crowd/>

Google researchers have developed a deep-learning system designed to help computers better identify and isolate 
individual voices within a noisy environment.

As noted in a post on the company's Google Research Blog this week, a team within the tech giant attempted to 
replicate the cocktail party effect, or the human brain's ability to focus on one source of audio while filtering out 
others—just as you would while talking to a friend at a party.

Google's method uses an audio-visual model, so it is primarily focused on isolating voices in videos. The company 
posted a number of YouTube videos showing the tech in action.

The company says this tech works on videos with a single audio track and can isolate voices either automatically, 
based on who is talking, or manually, by having the user select the face of the person whose voice they want to hear.

Google says the visual component here is key, as the tech watches for when a person's mouth is moving to better 
identify which voices to focus on at a given point and to create more accurate individual speech tracks for the 
length of a video.

According to the blog post, the researchers developed this model by gathering 100,000 videos of "lectures and talks" 
on YouTube, extracting nearly 2,000 hours' worth of segments from those videos featuring unobstructed speech, then 
mixing that audio to create a "synthetic cocktail party" with artificial background noise added.
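The mixing step described above can be sketched in a few lines. This is an illustrative reconstruction, not Google's actual pipeline: the function name, the dB-based SNR convention, and the assumption that all clean tracks are the same length are mine.

```python
import numpy as np

def mix_synthetic_cocktail(speech_tracks, noise, noise_snr_db=10.0):
    """Sum equal-length clean speech tracks, then add background noise
    scaled so the speech-to-noise ratio equals noise_snr_db (in dB)."""
    speech_mix = np.sum(speech_tracks, axis=0)
    speech_power = np.mean(speech_mix ** 2)
    noise_power = np.mean(noise ** 2)
    # Solve for the noise gain that yields the requested SNR.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (noise_snr_db / 10)))
    return speech_mix + gain * noise

# Two stand-in "speakers" (pure tones) plus random background noise.
rng = np.random.default_rng(0)
fs = 16000
t = np.arange(fs) / fs
speaker_a = np.sin(2 * np.pi * 220 * t)
speaker_b = np.sin(2 * np.pi * 330 * t)
noise = rng.standard_normal(fs)
mixture = mix_synthetic_cocktail([speaker_a, speaker_b], noise, noise_snr_db=10.0)
```

The payoff of building mixtures this way is that the original clean tracks are known, so they can serve as ground-truth training targets for the separation network.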

Google then trained the tech to split that mixed audio by reading the "face thumbnails" of people speaking in each 
video frame and a spectrogram of that video's soundtrack. The system is able to sort out which audio source belongs 
to which face at a given time and create separate speech tracks for each speaker. Whew.
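A common way to realize "separate speech tracks for each speaker" from a single spectrogram is per-speaker masking: predict one mask per detected face, multiply it against the mixture's spectrogram, and invert each masked spectrogram back to audio. The toy sketch below uses hand-made frequency-split masks in place of a trained network's output, and the pure-tone "speakers" are stand-ins; it shows only the masking-and-inversion mechanics, not Google's model.

```python
import numpy as np
from scipy.signal import stft, istft

def separate_with_masks(mixture, masks, nperseg=256):
    """Apply one spectrogram mask per speaker to a mixed signal and
    invert each masked spectrogram back to a waveform."""
    _, _, Z = stft(mixture, nperseg=nperseg)
    tracks = []
    for mask in masks:  # in the real system, one mask per detected face
        _, track = istft(mask * Z, nperseg=nperseg)
        tracks.append(track)
    return tracks

# Toy demo: two tones far apart in frequency, split by a crude band mask.
fs = 8000
t = np.arange(fs) / fs
low = np.sin(2 * np.pi * 300 * t)    # stand-in for speaker A
high = np.sin(2 * np.pi * 2500 * t)  # stand-in for speaker B
mix = low + high

_, _, Z = stft(mix, nperseg=256)
cutoff = Z.shape[0] // 2             # hand-made split, not a learned mask
mask_a = np.zeros(Z.shape)
mask_a[:cutoff] = 1.0
mask_b = 1.0 - mask_a
rec_a, rec_b = separate_with_masks(mix, [mask_a, mask_b])
```

The hard part, of course, is predicting good masks for overlapping voices in the same frequency range, which is where the face thumbnails give the network its per-speaker signal.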

Google singled out closed-captioning systems as one area where this system could be a boon, but the company says it 
envisions "a wide range of applications for this technology" and that it is "currently exploring opportunities for 
incorporating it into various Google products." Hangouts and YouTube seem like two easy places to start. It's not 
hard to see how the tech could work when applied to a pair of smart glasses, à la Google Glass, or to voice-amplifying 
earbuds, either.

Aiding smart speakers like the Google Home in their ability to recognize individual voices seems like another use 
case, but because this model is focused on video, it would likely work better with a speaker with a display, like 
Amazon's Echo Show. Earlier this year, Google opened up the Google Assistant to "smart display" devices like the Echo 
Show, but the company hasn't released one itself.

[snip]

Dewayne-Net RSS Feed: http://dewaynenet.wordpress.com/feed/
Twitter: https://twitter.com/wa8dzp




