In a preprint paper published on arXiv.org, researchers at the University of California, Berkeley and Adobe Research describe the Swapping Autoencoder, a machine learning model designed specifically for image manipulation. They claim it can modify any image in a variety of ways, including texture swapping, while remaining “substantially” more efficient than previous generative models.
The researchers acknowledge that their work could be used to create deepfakes, or synthetic media in which a person in an existing image or video is replaced with someone else’s likeness. In a human perceptual study, subjects were fooled 31% of the time by images created using the Swapping Autoencoder. But they also say that proposed detectors can successfully spot images manipulated by the tool at least 73.9% of the time, suggesting the Swapping Autoencoder is no more harmful than other AI-powered image manipulation tools.
“We show that our method based on an auto-encoder model has a number of advantages over prior work, in that it can accurately embed high-resolution images in real-time, into an embedding space that disentangles texture from structure, and generates realistic output images … Each code in the representation can be independently modified such that the resulting image both looks realistic and reflects the unmodified codes,” the coauthors of the study wrote.
The researchers’ approach isn’t entirely novel; many AI models can already edit portions of images to create new ones. For example, the MIT-IBM Watson AI Lab released a tool that lets users upload photographs and customize the appearance of pictured buildings, flora, and fixtures, and Nvidia’s GauGAN can create lifelike landscape images that never existed. But these models tend to be challenging to design and computationally intensive to run.
By contrast, the Swapping Autoencoder is lightweight, using image swapping as a “pretext” task for learning an embedding space useful for image manipulation. It encodes a given image into two separate latent codes: a “structure” code and a “texture” code. During training, the structure code learns to correspond to the layout of a scene, while the texture code captures properties of the scene’s overall appearance.
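The core idea of factoring an image into a spatial structure code and a global texture code can be illustrated with a toy sketch. The `encode` and `decode` functions below are hypothetical stand-ins, not the paper's learned networks: here “structure” is just a downsampled luminance grid and “texture” is per-channel color statistics, chosen only to show how swapping the two codes combines one image's layout with another's appearance.

```python
import numpy as np

def encode(image):
    """Toy stand-in for a two-code encoder (hypothetical, not the paper's model).

    Returns a spatial "structure" code (a 4x4 luminance grid that keeps
    layout) and a global "texture" code (per-channel mean and std that
    discard layout), mimicking the two-code factorization in spirit.
    """
    h, w, _ = image.shape
    lum = image.mean(axis=2)
    # Average luminance over 4x4 blocks to keep only coarse layout.
    structure = lum.reshape(4, h // 4, 4, w // 4).mean(axis=(1, 3))
    # Global per-channel statistics stand in for appearance.
    texture = np.concatenate([image.mean(axis=(0, 1)), image.std(axis=(0, 1))])
    return structure, texture

def decode(structure, texture):
    """Toy decoder: repaint the structure grid with the texture statistics."""
    mean, std = texture[:3], texture[3:]
    s = np.kron(structure, np.ones((16, 16)))   # upsample layout to 64x64
    s = (s - s.mean()) / (s.std() + 1e-8)       # normalize the layout signal
    return s[..., None] * std + mean            # re-apply per-channel stats

# Swap: keep image A's structure, borrow image B's texture.
a = np.random.rand(64, 64, 3)
b = np.random.rand(64, 64, 3)
s_a, _ = encode(a)
_, t_b = encode(b)
hybrid = decode(s_a, t_b)
```

In the actual model both codes are learned embeddings and the decoder is a generator trained adversarially, but the same interface holds: the hybrid output inherits its layout from one input and its overall appearance from the other.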
In an experiment, the researchers trained the Swapping Autoencoder on data sets containing images of churches, animal faces, bedrooms, people, mountain ranges, and waterfalls and built a web app that offers fine-grained control over uploaded photos. The app supports global style editing and region editing as well as cloning, with a brush tool that replaces the structure code from another part of the image.
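The cloning brush amounts to editing the spatial structure code directly: entries under the brush mask are overwritten with entries copied from a source region, and the edited code is then re-decoded. A minimal array-level sketch of that operation, with an assumed 8x8 structure grid standing in for the model's learned code, might look like:

```python
import numpy as np

# Hypothetical 8x8 spatial structure code for one image (shape assumed
# for illustration; the real code is a learned tensor).
structure = np.arange(64, dtype=float).reshape(8, 8)

# The user's brush paints a 3x3 target region to be replaced.
mask = np.zeros((8, 8), dtype=bool)
mask[1:4, 1:4] = True

# Clone: copy the structure-code entries from a source 3x3 region into
# the masked target; the decoder would then render the edited layout.
source = structure[4:7, 4:7]
edited = structure.copy()
edited[mask] = source.ravel()
```

Because the texture code is left untouched, the decoded result keeps the image's overall appearance while only the layout under the brush changes, which is what makes the clone blend in rather than looking pasted on.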
“Tools for creative expression are an important part of human culture … Learning-based content creation tools such as our method can be used to democratize content creation, allowing novice users to synthesize compelling images,” the coauthors wrote.