Google has developed a new artificial-intelligence system that can isolate an individual’s voice from among a large group of people by suppressing all other sounds. The technology could soon be available in our smartphones. Unlike humans, who can easily focus on a single voice in a noisy environment, computers are not yet capable of this kind of selective noise cancellation. Mentally muting all other voices and sounds comes naturally to us; the phenomenon is known as the cocktail party effect.
Software engineers from Google have stated that separating an audio signal into its individual speech sources, a task known as automatic speech separation, remains a major challenge for computers. Researchers have demonstrated a deep-learning audio-visual model that can isolate a single speech signal from a mixture of sounds, such as background noise and the multiple voices heard in a crowd. Using this method, the researchers computationally produced a video in which the speech of a specific individual was enhanced while all other voices were suppressed.
The process works on ordinary videos with a single audio track. A user simply selects the face of the person whose voice should be enhanced, or a context-based algorithm makes the selection automatically. This is a major breakthrough for many fields, as enhanced speech recognition enables a wide range of applications: speech enhancement and recognition in videos, video conferencing, improved hearing aids, and of course any situation where multiple people are speaking at once.
What makes this technology unique is that it combines both the audio and the visual signals of a video to separate the speech. It associates a person’s mouth movements with the sounds produced while he or she is speaking, which allows it to isolate the audio signal coming from that particular person. In difficult cases of heavily mixed speech, it is this visual signal that improves the quality of the separation.
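To make the separation step concrete, here is a minimal sketch of the general idea behind mask-based source separation: the system produces a per-frequency mask that is multiplied with the mixture’s spectrum to keep one voice and suppress the rest. This is not Google’s model; in the sketch the mask is computed from the known sources (an "oracle" mask), whereas a real audio-visual system would learn to predict it from the mixed audio and the speaker’s face. The signals here are simple sine tones standing in for a speaker and background noise.

```python
import cmath
import math

def dft(x):
    # Naive discrete Fourier transform (fine for a short toy signal).
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    # Inverse DFT; the inputs here are spectra of real signals,
    # so we keep only the real part of the result.
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]

N = 256
# Stand-ins: a low tone for the chosen speaker, a high tone for the crowd.
target = [math.sin(2 * math.pi * 5 * n / N) for n in range(N)]
interferer = [math.sin(2 * math.pi * 40 * n / N) for n in range(N)]
mixture = [a + b for a, b in zip(target, interferer)]

# Move everything to the frequency domain.
T, I, M = dft(target), dft(interferer), dft(mixture)

# Oracle ratio mask: the fraction of each frequency bin's energy
# that belongs to the target speaker.
mask = [abs(t) / (abs(t) + abs(i) + 1e-12) for t, i in zip(T, I)]

# Apply the mask to the mixture spectrum and return to the time domain.
estimate = idft([m * s for m, s in zip(mask, M)])

def correlation(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den

print(correlation(estimate, target))      # close to 1: speaker preserved
print(correlation(estimate, interferer))  # close to 0: background suppressed
```

The point of the sketch is the division of labor: computing a good mask is the hard part, and that is exactly where the visual signal helps a learned model decide which parts of the spectrum belong to the on-screen speaker.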