Recently, a video shared on YouTube showed three startling clips of the Mona Lisa: lips moving, head turning, eyes blinking, displaying a wide array of expressions. They were created by a neural network, a type of AI that processes information much as a human brain does, trained to analyze and process images.
In recent years, there have been huge advancements in machine learning, one of which is understanding human facial features. For an AI to animate and convey those features, however, it usually requires a large number of photo and video samples of the individual. The Samsung AI Center in Moscow has now figured out how to create realistic talking heads from a single portrait photo.
They were able to train the algorithm not only to understand the general shapes of facial features but also how they move in relation to one another, and then to apply that learning to still images such as portrait paintings. The result is this amazing yet oddly disconcerting set of videos.
Details of the technology were published in a paper titled "Few-Shot Adversarial Learning of Realistic Neural Talking Head Models," in which the researchers explain that Samsung's AI underwent extensive "meta-learning" by studying a huge trove of videos. Once familiar with human faces, it can create talking heads of previously unseen people from one or a few shots of that person. For each photo, the AI detects various "landmarks" on the face: features such as the eyes, nose, and mouth, along with their lengths and shapes.
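The few-shot idea described above can be sketched in miniature: one network (an "embedder") condenses the available frames of a person into an identity vector, and another (a "generator") combines that vector with a landmark pose to produce a frame. The toy code below is an illustrative sketch only, with random weights, tiny dimensions, and hypothetical names; the real system uses deep convolutional networks trained adversarially.

```python
import numpy as np

# Illustrative sketch of the few-shot pipeline: embedder -> generator.
# All sizes and names are hypothetical, chosen for readability.

rng = np.random.default_rng(0)

EMB_DIM = 16        # size of the per-person identity embedding
LANDMARK_DIM = 10   # flattened facial-landmark coordinates (the "pose")
IMG_PIXELS = 64     # a flattened 8x8 "image" stands in for a real frame

# Embedder: maps one face frame (a flat pixel vector) to an embedding.
W_embed = rng.standard_normal((EMB_DIM, IMG_PIXELS)) * 0.1

def embed(frame):
    return np.tanh(W_embed @ frame)

def person_embedding(frames):
    # Few-shot step: average the embeddings of the K available frames.
    # K can be 1, as with the single Mona Lisa portrait.
    return np.mean([embed(f) for f in frames], axis=0)

# Generator: conditioned on the identity embedding and a landmark pose,
# produces an image of that person in that pose.
W_gen = rng.standard_normal((IMG_PIXELS, EMB_DIM + LANDMARK_DIM)) * 0.1

def generate(embedding, landmarks):
    return np.tanh(W_gen @ np.concatenate([embedding, landmarks]))

# One portrait "photo" of a previously unseen person...
portrait = rng.standard_normal(IMG_PIXELS)
identity = person_embedding([portrait])

# ...animated by a sequence of landmark poses taken from someone else.
animation = [generate(identity, rng.standard_normal(LANDMARK_DIM))
             for _ in range(3)]
print(len(animation), animation[0].shape)
```

The key design point the sketch captures is the separation of identity from motion: the same landmark sequence can drive any portrait, and any number of source frames (even one) can supply the identity.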
According to Egor Zakharov, an engineer with the Skolkovo Institute of Science and Technology and the Samsung AI Center, the Mona Lisa videos were produced by having the AI "learn" facial movement from datasets of three different people, yielding three very different animations and, in turn, lending three distinct personalities to the famous painting.
Besides the Mona Lisa, the team was also able to generate animations of other icons such as Marilyn Monroe, Salvador Dalí, and even Albert Einstein.
Generating such videos is extremely complicated: human expressions are complex and highly dynamic, which means there are millions of parameters the AI has to consider. Furthermore, human vision is very good at spotting even the smallest mistakes in such models.
According to the study, the researchers aim to achieve "perfect realism," and adding a few more photos helps produce an even more convincing result. Given that the technology is still in its early stages, it remains to be seen what kind of real-world applications it will have.