AI is disrupting privacy in a variety of ways, from algorithms that automatically tag you in images, to facial recognition embedded in surveillance systems, to voice generators that can put words in people’s mouths. Now, adding to this disruption, Speech2Face offers a method for figuring out what your face looks like from your voice alone.
When we listen to someone speak without seeing their face, whether on the phone or on the radio, we typically form a mental image of what they look like. Because there is a strong connection between speech and appearance, the results of this technology, while not entirely accurate, are quite impressive. The generated faces are typically inferred from cues such as age, gender (an influential factor in a person’s voice), the shape of the mouth, facial bone structure, and the structure of the lips.
In a recent study titled “Speech2Face: Learning the Face Behind a Voice,” MIT researchers reveal how they used a dataset of millions of YouTube clips to build a neural network-based model that learns vocal qualities connected with facial features from the videos. When the AI hears a new sound bite, it can use what it has learned to infer, from just a short audio recording, what the speaker’s face looks like. In the image below, the researchers present an example demonstrating physical features that are correlated with the input speech.
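The core idea can be sketched in a few lines of code. The toy below is not the authors’ implementation: it assumes a voice encoder that maps a spectrogram of an utterance to a face-feature vector, trained so that its output matches the features a pretrained face-recognition network extracts from a video frame of the same speaker. All layer sizes, the dense (rather than convolutional) encoder, and the simple L1 loss here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

class VoiceEncoder:
    """Toy stand-in for a Speech2Face-style voice encoder: two dense
    layers over a flattened spectrogram, producing a face-feature
    vector (dimensions are illustrative assumptions)."""

    def __init__(self, spec_bins=64 * 100, hidden=256, face_dim=4096):
        self.w1 = rng.normal(0.0, 0.01, (spec_bins, hidden))
        self.w2 = rng.normal(0.0, 0.01, (hidden, face_dim))

    def __call__(self, spectrogram):
        # Flatten the time-frequency grid and pass it through the layers.
        h = relu(spectrogram.reshape(-1) @ self.w1)
        return h @ self.w2  # predicted face-feature vector

def feature_loss(pred, target):
    # Training (conceptually) pulls the predicted features toward the
    # features of the speaker's actual face; sketched here as L1 distance.
    return float(np.abs(pred - target).mean())

encoder = VoiceEncoder()
spectrogram = rng.normal(size=(64, 100))  # fake spectrogram of one utterance
target = rng.normal(size=4096)            # fake "true" face features of the speaker
pred = encoder(spectrogram)
print(pred.shape, feature_loss(pred, target))
```

In the full pipeline, a separate pretrained face decoder would then reconstruct a canonical, front-facing image from the predicted feature vector; the face never has to be generated pixel-by-pixel from audio directly.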
In the “Ethical Considerations” section of the study, the researchers briefly address privacy concerns, saying that Speech2Face was trained to capture common visual traits like gender and age only when there was enough evidence in the speech to do so. To put it another way, the technology is neither attempting nor able to generate images of specific individuals. Nonetheless, the AI “may support beneficial applications, such as attaching a representative face to phone/video chats based on the speaker’s voice,” according to the researchers.
Despite the advanced technology involved, the model showed limitations in recognizing certain identities when the voice carried too little information. The researchers expect the S2F model to produce estimates of average-looking faces, not exact images of individuals. As a result, the Speech2Face images, all front-facing with neutral expressions, didn’t exactly match the people behind the voices. Yet according to the study, the images frequently captured the correct age range, ethnicity, and gender of the subjects.
This novel work is intriguing because it takes AI to a new level, allowing us to predict a face from audio recordings alone, without the use of DNA. However, there may be consequences, particularly in terms of security: such technology could be abused to impersonate someone else and cause chaos. With enough training, S2F might one day come close to guessing a specific person’s face from their voice. Either way, it’s reasonable to say that we’re witnessing a major technological advance, and it’s thrilling.
To learn more about the study, click the link below: