Facial reconstruction based solely on someone's voice is now possible. Scientists trained a neural network called Speech2Face on millions of educational videos from the internet that showed over 100,000 different people talking. According to the researchers, the AI learned associations between vocal cues and certain physical features of the human face, and then used an audio clip to model a photorealistic face matching the voice.
The details of the study were recently published on the preprint server arXiv. Although the paper has not been peer-reviewed, it is causing quite a stir online.
The researchers stated that the AI does not (yet) know exactly what a specific individual looks like based on their voice alone. Instead, the neural network recognizes certain markers in speech that point to gender, age and ethnicity, features that are shared by many people.
"As such, the model will only produce average-looking faces," the scientists wrote. "It will not produce images of specific individuals."
Although the faces generated by Speech2Face, as shown in the picture above, all facing front and with neutral expressions, didn't precisely match the people behind the voices, the reconstructions did usually capture the correct age ranges, ethnicities and genders of the individuals, according to the study.
Because the technology is still in its early stages, it is far from perfect. It showed "mixed performance" on samples with language variations. When the algorithm listened to the same man speaking Chinese in one clip and English in another, it failed to recognize that both voices belonged to the same person and instead generated two different faces, one Asian and one white. It also showed gender bias, associating low-pitched voices with male faces and high-pitched voices with female faces.
The researchers further stated that, because the AI was trained only on YouTube videos, it "does not represent equally the entire world population".
Some people, meanwhile, are concerned that their YouTube videos are being used as training data for AI. One of them is Nick Sullivan of Cloudflare in San Francisco, who unexpectedly spotted his own face among the examples used to train Speech2Face. For now, YouTube videos are widely considered to be available for researchers to use without obtaining additional permission.