What Does VQGAN Stand For?

AI-generated artworks are being shared widely across social media and have become especially popular within the non-fungible token (NFT) sector of the crypto industry.

Large-scale NFT collections, such as those numbering in the thousands, aren't typically hand-drawn. Instead, an algorithm generates each piece from a set of specific attributes, and each attribute has its own level of rarity associated with it.

VQGAN and CLIP are two separate machine learning algorithms that can be used together to generate images from a text prompt.

VQGAN is short for Vector Quantized Generative Adversarial Network. It is a type of neural network architecture that combines convolutional neural networks with Transformers, and its goal is to generate high-quality, high-resolution images.

While VQGAN is a generative adversarial network that is good at generating images that look similar to others, it requires some steering. This is where CLIP comes into the picture.

CLIP is a neural network that can determine how well a caption matches an image. When combined, these two algorithms can produce various forms of AI-generated art.

Training VQGAN

Because VQGAN is a hybrid Transformer model, it is shown both original and encoded samples during training.

Training VQGAN is relatively quick because its discriminator checks each section of the image separately, whereas a classic GAN's discriminator takes an all-or-nothing approach to the whole image.

The discriminator within VQGAN looks at 16 sub-images arranged in a 4x4 grid and gives the generator a thumbs-up or thumbs-down for each section as feedback for improvement. The discriminator in a classic GAN, by contrast, gives the generator a single thumbs-up or thumbs-down for the entire image.
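The per-patch feedback idea can be sketched in a few lines of Python. This is a toy stand-in, not a real discriminator: the `patch_scores` function and its contrast-based "realism" score are hypothetical, but the shape of the feedback (16 verdicts on a 4x4 grid versus one verdict for the whole image) mirrors the description above.

```python
import numpy as np

def patch_scores(image, grid=4):
    """Split an image into a grid x grid set of patches and score each one.

    The scoring function is a placeholder for a real discriminator: it just
    measures local contrast, but it returns one verdict per patch, mirroring
    VQGAN's patch-based feedback."""
    h, w = image.shape
    ph, pw = h // grid, w // grid
    scores = np.zeros((grid, grid))
    for i in range(grid):
        for j in range(grid):
            patch = image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
            scores[i, j] = patch.std()  # placeholder "realism" score
    return scores

image = np.random.rand(64, 64)
scores = patch_scores(image)       # 16 separate verdicts (4x4 grid)
classic_verdict = image.std()      # a classic GAN gives just one number
print(scores.shape)  # (4, 4)
```

The generator can use all 16 signals at once, which is part of why patch-level feedback speeds up training.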

Reverse encoding is also a seamless procedure with VQGAN. Because it is effectively a codec, it has a model that encodes an image to an embedding and a corresponding model that decodes an embedding back into an image. With a classic GAN, it is easy to produce output images: give it random numbers and it will generate a solid picture.

However, if you give VQGAN's decoder random numbers, the output image will not be coherent; VQGAN needs to be steered by some process, such as a text prompt, to generate a recognizable image.
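The encode/decode round-trip can be illustrated with a toy vector quantizer. Everything here is a simplification: the codebook is random rather than learned, and the vectors are tiny, but the core "vector quantized" idea is real: encoding snaps an input to the index of its nearest codebook entry, and decoding maps that index back to a code vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy codebook: in a real VQGAN this is learned; here it is random.
codebook = rng.normal(size=(8, 4))   # 8 code vectors of dimension 4

def encode(vec):
    """Quantize a vector to the index of its nearest codebook entry."""
    distances = np.linalg.norm(codebook - vec, axis=1)
    return int(np.argmin(distances))

def decode(index):
    """Map a codebook index back to its code vector."""
    return codebook[index]

original = rng.normal(size=4)
index = encode(original)          # discrete code, like VQGAN's latent grid
reconstruction = decode(index)    # a real decoder would render an image
                                  # from a whole grid of such codes
```

Feeding `decode` an arbitrary index always yields a valid code vector, but without steering there is nothing tying those codes to a meaningful image, which is why a guide like CLIP is needed.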

The AI System Known as CLIP

OpenAI designed and trained an artificial intelligence system known as CLIP, which stands for Contrastive Language–Image Pre-training. CLIP has an image encoder and a text encoder and can perform cross-modal semantic searches - meaning that you can use words to search images.

The interesting part, and the one that makes it useful here, is that OpenAI trained the encoders on a dataset of images paired with corresponding phrases, with the goal of having the encoded images match the encoded words.

Once trained, the image encoder converts images to embeddings: lists of 512 floating-point numbers that capture an image's general features. The text encoder converts a text phrase to a similar embedding, which can be compared against image embeddings for a semantic search.
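A semantic search over such embeddings boils down to a similarity ranking. The sketch below uses random 512-number vectors as stand-ins for real CLIP outputs (the filenames and the "encoded" prompt are invented for illustration), but the ranking by cosine similarity is the standard comparison used with CLIP embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(42)

# Stand-ins for CLIP outputs: real CLIP embeddings are 512-dimensional
# vectors produced by the trained image and text encoders.
image_embeddings = {name: rng.normal(size=512)
                    for name in ["cat.jpg", "boat.jpg", "city.jpg"]}
text_embedding = rng.normal(size=512)  # pretend this encodes a text query

# Semantic search: rank images by similarity to the text embedding.
ranked = sorted(image_embeddings,
                key=lambda name: cosine_similarity(
                    image_embeddings[name], text_embedding),
                reverse=True)
print(ranked[0])  # best-matching image under this toy similarity
```

With real encoders, the top-ranked image would be the one whose content best matches the words in the query.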

If you were to command VQGAN combined with CLIP to create an abstract painting of blocks in blue, it would make one. To produce multiple paintings, you would need to generate prompts with varying parts, such as style, subject, and color.
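Generating those varied prompts is simple string combinatorics. The part lists below are hypothetical examples; the point is that a few styles, subjects, and colors multiply into many distinct prompts.

```python
from itertools import product

# Hypothetical prompt parts; vary these to get many distinct paintings.
styles = ["abstract painting", "watercolor", "pixel art"]
subjects = ["blocks", "a forest", "a city skyline"]
colors = ["in blue", "in warm tones"]

prompts = [f"{style} of {subject} {color}"
           for style, subject, color in product(styles, subjects, colors)]

print(len(prompts))   # 3 * 3 * 2 = 18 prompts
print(prompts[0])     # "abstract painting of blocks in blue"
```

Each prompt can then be fed to VQGAN+CLIP to produce a different painting.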

If you are curious about trying out VQGAN+CLIP, you can do so through our image generator, which creates art from nothing but a text prompt. The AI will "paint" just about anything you want, making the entire process as simple as typing in a few words, be it a cultural reference, a song lyric, or any random phrase.

VQGAN and CLIP: Giving Directions 

Typically, you will need to create a custom algorithm to have CLIP steer a variant of VQGAN to create images from text prompts.

To steer VQGAN with CLIP, you can use an optimizer from the PyTorch library, such as Adam (Adaptive Moment Estimation). CLIP uses a flat embedding of 512 numbers, whereas VQGAN uses a three-dimensional embedding of 256x16x16 numbers.

The goal of this algorithm is to produce an output image that closely matches the text query, so the system starts by running the text query through CLIP's text encoder. After generating hundreds of digital paintings, you will find that not every one of them is a success; prompts from some categories simply work better than others.
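The steering loop can be sketched as an optimization over the latent. This is a heavily simplified stand-in: the linear map `W` replaces the real VQGAN decoder plus CLIP image encoder, the "text embedding" is random, and plain gradient descent stands in for PyTorch's Adam, which additionally adapts the step size per parameter.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins: in a real pipeline, W would be the VQGAN decoder followed
# by CLIP's image encoder, and the optimizer would be PyTorch's Adam.
W = rng.normal(size=(512, 16))         # fake decoder+encoder as one linear map
text_embedding = rng.normal(size=512)  # pretend CLIP-encoded text prompt
z = rng.normal(size=16)                # latent the optimizer will steer

def loss(z):
    """Squared distance between the 'image' embedding and the text embedding."""
    return float(np.sum((W @ z - text_embedding) ** 2))

lr = 5e-4
start = loss(z)
for _ in range(200):
    grad = 2 * W.T @ (W @ z - text_embedding)  # gradient of the loss w.r.t. z
    z -= lr * grad                             # plain gradient step (Adam would
                                               # adapt this per parameter)
print(loss(z) < start)  # steering reduced the mismatch
```

The real algorithm repeats the same idea: render the latent into an image, measure how far its CLIP embedding is from the prompt's embedding, and nudge the latent to shrink that distance.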

Hopefully, you now have a better understanding of what VQGAN stands for and how it can be used to generate artworks for NFT collections.

If all this seems a bit complicated, don't worry - you can always use our app for creating AI-generated NFT art, without having to worry about how it works.

Be sure to check out the other posts on our site to learn more about the field of NFTs, such as how much it costs to mint an NFT.