How Was Stable Diffusion Trained?
One of the key issues surrounding text-to-image generation artificial intelligence (AI) models is that they feel like a black box: they have been trained on images pulled from the web, but we are never quite sure which ones, or to what extent. However, the team behind Stable Diffusion made the project open-source and has been extremely transparent about how the model is actually trained.
Thanks to its free and permissive licensing, Stable Diffusion has exploded in popularity and has already been incorporated into the Midjourney beta, NightCafe, and Stability AI's own DreamStudio application, as well as running on many users' own computers.
Even so, the training datasets are difficult for most people to obtain or search, because the metadata linking to millions of images is stored in obscure file formats spread across very large files.
The Data Source for Stable Diffusion
Stable Diffusion was trained on three massive datasets, all collected by LAION, a non-profit whose compute time was funded by Stability AI, Stable Diffusion's owner. All of LAION's image datasets were built from Common Crawl, a non-profit project that scrapes billions of web pages each month and releases the results as massive datasets.
The Collection Process
LAION collected all HTML image tags that featured alt-text attributes, then classified the resulting five billion image-text pairs by language. It then filtered the results into separate datasets based on resolution, the predicted likelihood of a watermark, and a predicted aesthetic score.
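LAION's real pipeline processes Common Crawl archives at enormous scale, but the core idea of the first step can be sketched in a few lines. The following Python uses the standard library's `html.parser` to pull (image URL, alt text) pairs out of a page; the sample HTML and the class name are illustrative, not part of LAION's actual tooling:

```python
from html.parser import HTMLParser

class AltTextCollector(HTMLParser):
    """Collects (image URL, alt text) pairs from <img> tags that carry alt text."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        src, alt = attrs.get("src"), attrs.get("alt")
        if src and alt:  # keep only images with a non-empty caption
            self.pairs.append((src, alt))

page = """
<html><body>
  <img src="https://example.com/cat.jpg" alt="a painting of a cat">
  <img src="https://example.com/logo.png">
  <img src="https://example.com/dog.jpg" alt="a photo of a dog on a beach">
</body></html>
"""

collector = AltTextCollector()
collector.feed(page)
for url, caption in collector.pairs:
    print(url, "->", caption)
```

Here the second image is skipped because it has no alt attribute, which is exactly why alt text made a convenient source of free captions.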
The Initial Training
Stable Diffusion's initial, low-resolution training used 256 x 256 images from LAION-2B-EN, a set of 2.3 billion English-captioned images drawn from LAION-5B's full collection of 5.85 billion image-text pairs. This was done alongside LAION-High-Resolution, another subset of LAION-5B containing 170 million images larger than 1024 x 1024, all of which were downsampled to 512 x 512 for efficiency.
Later checkpoints were trained on LAION-Aesthetics v2 5+, a 600-million-image subset of LAION-2B-EN with a predicted aesthetics score of five or higher, from which low-resolution and likely-watermarked images were filtered out.
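Conceptually, building such a subset is just applying three predicates to each image's metadata. The sketch below is a minimal illustration, assuming hypothetical record fields and cutoff values; LAION's actual schema and thresholds differ:

```python
# Illustrative metadata records; the field names and threshold values are
# assumptions for this sketch, not LAION's actual schema.
images = [
    {"url": "a.jpg", "width": 512,  "height": 512,  "p_watermark": 0.05, "aesthetic": 6.2},
    {"url": "b.jpg", "width": 256,  "height": 256,  "p_watermark": 0.10, "aesthetic": 5.8},
    {"url": "c.jpg", "width": 768,  "height": 768,  "p_watermark": 0.91, "aesthetic": 7.0},
    {"url": "d.jpg", "width": 1024, "height": 1024, "p_watermark": 0.02, "aesthetic": 4.1},
]

def aesthetics_5plus(img, min_side=512, max_p_watermark=0.8, min_aesthetic=5.0):
    """Keep images that are large enough, unlikely to be watermarked,
    and have a predicted aesthetic score of five or higher."""
    return (min(img["width"], img["height"]) >= min_side
            and img["p_watermark"] < max_p_watermark
            and img["aesthetic"] >= min_aesthetic)

subset = [img["url"] for img in images if aesthetics_5plus(img)]
print(subset)  # only a.jpg passes all three filters
```

Each of the other three records fails exactly one test (too small, likely watermarked, or below the aesthetic cutoff), which mirrors how the filtered subsets shrink relative to the full collection.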
MisterRuffian's Latent Artist and Modifier Encyclopedia lists over 1,800 artists; it can be used to search the dataset and lets researchers count how many images reference each artist's name.
Of the top twenty-five artists, those still working today include Phil Koch, Erin Hanson, and Steve Henderson. The most frequently referenced artist, however, is Thomas Kinkade.
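The counting itself amounts to a case-insensitive substring search over the captions. As a rough sketch (the captions below are toy examples, and the artist list is a small sample of the 1,800+ names in the encyclopedia), this kind of tally could look like:

```python
from collections import Counter

# Toy captions standing in for LAION alt text.
captions = [
    "a cottage at dusk, painting by thomas kinkade",
    "rolling hills in the style of Thomas Kinkade",
    "sunlit vineyard, oil painting by erin hanson",
    "mountain lake at sunrise, photo by phil koch",
    "abstract shapes, digital art",
]

# A small sample of artist names to tally.
artists = ["Thomas Kinkade", "Erin Hanson", "Phil Koch", "Steve Henderson"]

counts = Counter()
for caption in captions:
    lowered = caption.lower()
    for artist in artists:
        if artist.lower() in lowered:
            counts[artist] += 1

print(counts.most_common())
```

A real analysis would stream the metadata files rather than hold captions in memory, but the per-artist frequency counts reported for the dataset come down to this kind of matching.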
Another key training feature to consider is fictional characters. Some of the most common characters in the dataset on which Stable Diffusion was trained include Marvel Cinematic Universe characters such as Captain Marvel, Black Panther, and Captain America, alongside Batman, Superman, Luke Skywalker, Darth Vader, Han Solo, and Mickey Mouse.
Moving Forward With Stable Diffusion
It is now clear how Stable Diffusion was trained and how the most common artists, characters, and keywords were used to teach the AI to generate images from text prompts. Because the project is open-source, it is extremely flexible to work with, and anyone can analyse the references and data collected.
Stable Diffusion is one of the latest, and currently one of the best, machine learning models developed by Stability AI; it generates digital images from natural language descriptions more efficiently than any other tool. Looking for tips to make the most of Stable Diffusion, or interested in how Stable Diffusion compares with Latent Diffusion? Take a look at our latest posts.