Help: Generating images with VQGAN+CLIP/English

From Bestiario del Hypogripho

How to generate images with VQGAN+CLIP

This article has content approached from the perspective of "real life". This article is composed of content written by Jakeukalane (and created by third parties). This article is composed of content written by Avengium (and created by third parties). This article is illustrated with images by Khang Le and by Jakeukalane, under a Creative Commons license, and no one else. This article has real bibliography that supports its content in whole or in part. This article has negligible or zero intrafictional difficulty (0). It should be suitable for all audiences.

VQGAN is a generative adversarial network. Generative Adversarial Networks, also known as GANs, are a class of machine learning frameworks designed by Ian Goodfellow and his colleagues in 2014. Two neural networks contest with each other in a game (in the form of a zero-sum game, where one agent's gain is another agent's loss).

This technique can produce images that appear authentic to human observers. For example, a synthetic image of a cat that manages to fool the discriminator (one of the functional parts of the algorithm) is likely to lead some people to accept it as a real photograph. The difference between VQGAN and previous GANs is that it allows high-resolution outputs.

CLIP (Contrastive Language-Image Pretraining) is another artificial intelligence model, which connects texts with images. That is, in VQGAN+CLIP, CLIP guides VQGAN with text inputs. Here we explain how to use it.
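The interplay can be pictured as follows: CLIP embeds both the text and the current image into a shared vector space and scores how similar they are, and the optimization then adjusts the VQGAN output to raise that score. A minimal, library-free sketch of the similarity measure (cosine similarity) typically used for this, with toy numbers standing in for real embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for CLIP's text and image embeddings:
# the closer they are, the better the image matches the prompt.
text_embedding = [0.2, 0.9, 0.1]
image_embedding = [0.25, 0.85, 0.15]
print(round(cosine_similarity(text_embedding, image_embedding), 3))
```

The real embeddings have hundreds of dimensions, but the idea is the same: the generator is nudged so this score goes up.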

VQGAN+CLIP in Google Colaboratory

By entering VQGAN+CLIP (z+quantize method with augmentations).ipynb, made in Google Colaboratory by Katherine Crowson, you can run a VQGAN model preconfigured with values and combined with a CLIP model. Here we explain how to make it work.

Previous steps

  • 1) Open the notebook in Google Colaboratory.
  • 2) At the top right, click on Conectar (that means Connect) to be assigned a machine.
  • 3) On the page there are black circles with an arrow that looks like "Play". Click on these buttons to run each of the cells.

Cells can be executed one after another, without waiting for the previous one to finish[1]. When a cell is on hold it looks like this.

  • 4) Click on the cell with the text: Licensed under the MIT License.
  • 5) Click on the cell with the text: !nvidia-smi. The data of the remote PC that will run the VQGAN+CLIP model appears here. VRAM can look like this: 0MiB / 15109MiB. The more VRAM, the more rendering power. With less than 15109MiB it might not be worth using the machine (taking on average 4 seconds per iteration, that is, four times longer than on a graphics card with 15 GiB of VRAM).
  • 6) Click on the cell with the text: Instalación de bibliotecas (that means Installation of libraries). You will see progress bars appear in that cell. Those are the installations and downloads in progress. Wait for them to finish[1].
  • 7) Click on the cell with the text: Selección de modelos a descargar (that means Selection of models to download). You can choose to download other models, but the default model imagenet_16384 is good. imagenet_1024 is also light. Time total, time spent and time left indicate the progress of the download. Wait for it to finish.
  • 8) Click on the cell with the text: Carga de bibliotecas y definiciones (that means Load of libraries and definitions).
  • 9) Click on the cell with the text: Parámetros (that means Parameters).

To the right of the parameters cell there is a text box that allows you to customize them more easily. Every time you modify the Parámetros ("Parameters") you have to rerun the cell so that it gets updated.


Name of the parameter English translation Default text Description
textos texts A fantasy world This parameter is the text that VQGAN + CLIP will interpret as the concept of the image. If you write "fire", it will draw fire, and if you write "water", it will represent water. More information in the section "text and context".
ancho width 480 The width of the image that VQGAN+CLIP will generate inside the Colab. The recommendation is not to go above 600px because the virtual machine has limited memory. It is better to upscale the result afterwards with bigjpg (or waifu2x or any other resizer). You can change the proportions, so the image won't be square (a little help here: Proportions calculator).
alto height 480 The height of the image that VQGAN+CLIP will generate inside the Colab. The recommendation is not to go above 600px because the virtual machine has limited memory. It is better to upscale the result afterwards with bigjpg (or waifu2x or any other resizer). You can also change the proportion.
modelo model imagenet_16384 This parameter decides which VQGAN model will be run. There are boxes that allow you to select a model. The one you select must have been previously downloaded. The number indicates the size of the codebook, so imagenet_16384 is (supposedly) better than imagenet_1024 (although heavier). See the section "What model is better for me?".
intervalo_imagenes image_interval 50 This tells the program to show the image result on the page every given number of iterations. If you type 50, it will print the results of iterations 0, 50, 100, 150, 200, etc.
imagen_inicial initial_image None To use an initial image, you only have to upload a file to the Colab environment (on the left side) and then modify imagen_inicial ("initial_image"), putting the exact filename. Example: sample.png. See Upload images.
imagenes_objetivo target_images None One or more images that the AI will take as "target", fulfilling the same function as putting a text on it. That is, the AI will try to imitate the image or images. They are separated with |.
seed seed -1 The seed of that image. -1 indicates that the seed will be random each time. By choosing -1 you will only see the chosen seed in the Colaboratory interface, in the cell Hacer la ejecución ("Make the execution"), like this: Using seed: 7613718202034261325 (example). If you want to find out the iterations and seeds of the images you have downloaded, they are in the image comments. On Linux, normal viewers can see the comments. In Windows the default viewers cannot see the metadata, but with Jeffrey's Image Metadata Viewer you can see them[r 1][r 2].
max_iteraciones max_iterations -1 The maximum number of iterations before the program stops. Default is -1, which means the program will not stop unless it crashes or is stopped for some other reason. It is recommended to change it to a value like 500, 600, 1000 or 3000. A higher number is sometimes not necessary (variability decreases as the number of iterations grows). Remember that these calculations are very expensive energetically (and if you leave the session doing calculations for too long you will hit a limitation in Google Colaboratory).
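As a sketch of how the seed stored in the image comments can be read back programmatically (using Pillow; the text-chunk key name "comment" is an assumption and may differ in the images the notebook actually produces, so inspect all returned keys if unsure):

```python
# Sketch: recover the "Using seed: …" comment from a generated PNG with Pillow.
# The chunk key "comment" is an assumption; check every returned key if unsure.
import os
import tempfile
from PIL import Image
from PIL.PngImagePlugin import PngInfo

def read_png_comments(path):
    """Return the PNG text chunks, where seed/iteration info is stored."""
    with Image.open(path) as im:
        return {k: v for k, v in im.info.items() if isinstance(v, str)}

# Demo: write a PNG carrying a seed comment, then read it back.
demo = os.path.join(tempfile.mkdtemp(), "demo.png")
meta = PngInfo()
meta.add_text("comment", "Using seed: 7613718202034261325")
Image.new("RGB", (8, 8)).save(demo, pnginfo=meta)
print(read_png_comments(demo))
```

This is an alternative to external viewers when you have Python at hand.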

Text and context


The AI is much better trained in English, so the context is often better when entering the input in English, but it understands (somewhat) other languages. This can be seen in the images: if you put greek temples in space as the text input, the result is better than with Templos griegos en el espacio (Spanish).

Separate entries

You can separate concepts with vertical bars, also known as pipes ( | ), and this produces two different text entries, each one with a different "hinge loss"[r 3]. This allows you to assign effects or adjectives independently to different elements.

In the execution cell they can be seen separated by commas and in quotes.

  • Text: Cosmic egg.
  • Name='cosmic egg'. (this is what the program runs).
  • Text: Bronze | space
  • Name='bronze' , 'space'.

It will produce a different result than:

  • Text: Bronze, space.
  • Name='bronze, space'.
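The splitting behavior shown above can be imitated in plain Python; this is only an illustration of how pipe-separated input becomes independent entries (the real parsing lives inside the Colab):

```python
def split_prompts(text):
    """Split a prompt on '|' into independent text entries, as the notebook does."""
    return [part.strip() for part in text.split("|")]

print(split_prompts("Bronze | space"))   # two independent entries
print(split_prompts("Bronze, space"))    # a single combined entry
```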

Use adjectives

Adjectives / styles can be used to vary the image without varying the objects we want it to draw.

There is not a fixed number of styles, there are as many styles as we can think of.

By artist
  • Beksinski style / Dali style / Van Gogh style / Giger style / Monet style / Klimt / Katsuhiro Otomo style / Goya[r 4] / Miguel Angel (Sistine Chapel style) / Joaquin Sorolla / Moebius / in Raphael style (sometimes it helps to improve the lines of the faces).
    • can also be mixed.

In Wiki-art even better results are achieved.

By art style
  • Camera qualities and distortions: 4k / chromatic aberration effect / cinematic effect / diorama / dof / depth of field / field of view / fisheye lens effect / photorealistic / hyperrealistic / raytracing / stop motion / tilt-shift photography / ultrarealistic / vignetting.
  • cell shading / flat colors / full of color / electric colors.
  • anime / comic / graphic novel / visual novel.
  • Materials: acrylic painting style / clay / coffee paint / collage / glitch / graphite drawing / gouache / illuminated manuscript / ink drawing / medieval parchment / detailed oil painting / tempera / watercolor.
  • isometric / lineart / lofi / lowpoly / photoshop / pixel art / vector art / voxel art.
  • Historical periods: baroque / German Romanticism / impressionism / Luminism / pointillism / postimpressionism / Vienna Secession.
  • You can also use type of paint by regions:
    • chinese painting / indian art / tibetan paintings / nordic mythology style / etc.
By movies or video games
  • in Ghost in the Shell style / in Star Wars style / in Ghibli style / in Metropolis film style / Death Stranding style
By rendering program

It mimics the result; it is not as if the AI were actually using those graphics engines[r 5]:

Specific effects
  • Add brdf / caustics / global-illumination / non-photorealistic / path tracing physically based rendering / raytracing / etc.
Effects / lights
  • Fog / fire / lava / shining / glow / red-hot / incandescent / iridescent / etc.
Other modifiers
  • Trending on (website): for example "trending on artstation"[r 8]
  • minimalistic / dismal / grim / liminal / surprising / black hole / diamond.

Assign weights

Percentages can also be used, and CLIP will interpret decimals (0.1, 0.5, 0.8) as weights of that concept in the drawing (1 being the total). You can also use "percentages" (without the percent symbol). Negative weights can be used to remove a color, for example.

  • It is not recommended to put weights less than -1.
  • The weights are relative to each other (the total is recalculated and does not have to coincide with the numbers that have been set). That is why it is recommended that they add up to 100% (or 1), more than anything so that we ourselves know the real weights.

Examples of weights in decimals (the parentheses indicate the error):

  • Text: rubber:0.5 | rainbow:0.5. Equivalent to rubber:50 | rainbow:50.
    • Badly done: "0.5 rubber | 0.5 rainbow". (The allocation of weights goes after the concept and after :).
    • Badly done: "awesome:100 adorable:300". (It is not separated by |)
    • Badly done: "rubber:50% | rainbow:50%". (The % are not admitted symbols).

Other example of weights (total=100):

  • Text: sky:35 | fire:35 | torment:20 | dinosaurs:10


  • Text: fantasy world | pink:0
  • Result: It doesn't have pink.

Note: Deleting a word using negative values can completely change the image, with unexpected results. If you are very specific you can achieve the desired results, but still the image will change a lot.

Not checked yet

Note 2: It is better to eliminate a concept using values of 0.

For example to remove the Unreal logo:

  • Text: … | logo of unreal:-1. It could give a satisfactory result.
Not checked yet
  • Text: … | logo of unreal:0. Could work better.


  • Text: … | logo:-1. It will give a totally different result because it is too unspecific.
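The weight syntax described in this section can be sketched in plain Python. This is only an illustration of the behavior described above (weights go after a colon, entries are separated by |, and the totals are rescaled relative to each other); it assumes positive weights for simplicity:

```python
def parse_weighted_prompt(text):
    """Split 'concept:weight' entries separated by '|' and rescale the weights.

    Illustrative only: mirrors the behavior described above (weights are
    relative to each other, so they are renormalized to fractions of 1).
    Assumes positive weights; entries without ':' default to weight 1.
    """
    entries = []
    for part in text.split("|"):
        part = part.strip()
        if ":" in part:
            concept, weight = part.rsplit(":", 1)
            entries.append((concept.strip(), float(weight)))
        else:
            entries.append((part, 1.0))
    total = sum(w for _, w in entries)
    return [(c, w / total) for c, w in entries]

print(parse_weighted_prompt("rubber:0.5 | rainbow:0.5"))
print(parse_weighted_prompt("sky:35 | fire:35 | torment:20 | dinosaurs:10"))
```

Note how "rubber:0.5 | rainbow:0.5" and "rubber:50 | rainbow:50" end up as the same relative weights after rescaling, as the section explains.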

Other advice

  • For astronomical images, a better result is achieved when weights are assigned to the parameters, thus the elements are defined much better (for example a galaxy).
  • Texts that are too short tend to go wrong, but not as much if they are very specific.
  • AI warps people's faces when you name someone specific.
  • Using images generated by VQGAN itself as the starting point is also very efficient.

Upload images

To use an initial image you first have to upload it.

  1. Go to the left side and click on "Archivos" ("Files").
  2. Select the icon that represents the upload "Subir al almacenamiento de sesión" (something like "Upload to the session storage").
  3. Upload the image you want from your file system (give it a recognizable name).
  4. The image will only remain during the session, then it will be deleted.
  5. Then you have to modify the section imagen_inicial (initial_image) or imagenes_objetivo (target_images) putting the exact filename. In the section imagenes_objetivo (target_images) you can put several images, using | as separator.

Other techniques

Adapt images to specific shapes

To adapt the final image to a specific shape, you can use starting images with color masks making that shape. (Optionally, it can simply be a white square, although it will be more accurate if you start from iteration 0.) Then select, in the text input, things that conform to that shape (for example, with a round mask: a watch, a pizza, a crystal ball, etc.).

You can add or download masks from the Collaborative Drive of DotHub.

Guide the AI to a result

  • When you reach an iteration that deviates from what you want, you can stop and use that last frame as a new imagen_inicial ("initial_image"), modifying the description a bit. That way, to a certain extent, you can "guide" the AI towards what you want.


Once you have decided the parameters:

  1. Click on the cell with the text: Hacer la ejecución… ("Make the execution…").
  2. Wait for images to appear in that cell.
  3. When you want to save an image, right-click + Save image and save it with the name you want.

Generate a video

Press the corresponding "Play" button. If the range is not specified for the video, it will generate a video with all the frames and it may take a while. To avoid this you can change the parameters in the cell, like init_frame, last_frame and also the FPS.

When the process ends, sometimes it does not load and it is not evident where the generated video is. It is in Archivos ("Files") (left sidebar).

Update: There is a new specific cell called Descargar vídeo ("Download video"), which performs the download automatically.
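Internally, turning the frames in /steps into a video amounts to running ffmpeg over the numbered PNGs. A hedged sketch of the kind of command involved (the exact flags and paths used by the Colab cell may differ):

```python
def ffmpeg_command(init_frame, last_frame, fps, output="video.mp4"):
    """Build an ffmpeg command that encodes steps/0000.png, 0001.png, … into a video.

    Sketch only: the cell in the Colab may use different flags or paths.
    """
    return [
        "ffmpeg", "-y",
        "-framerate", str(fps),
        "-start_number", str(init_frame),
        "-i", "steps/%04d.png",                 # frames are zero-padded to 4 digits
        "-frames:v", str(last_frame - init_frame + 1),
        "-pix_fmt", "yuv420p",                  # widely compatible pixel format
        output,
    ]

print(" ".join(ffmpeg_command(0, 300, 30)))
```

This also makes it clear why limiting init_frame and last_frame shortens the encoding time: fewer frames are fed to the encoder.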

Create a zip with all the images

This has not been activated in the online version yet, but with the code shown it is fully functional.

If we want to download all the steps, or we have generated too many images and it is very tedious to save them one by one, or we have simply deleted or stopped the cell where the images were, they can still be downloaded.

The intermediate steps generated (although they are not shown) are placed in the /steps folder. Downloading all of them by hand is not feasible, so we are going to put them into a single file and thus download them more easily.

  • This applies after images have been generated.
  • If the machine has been disconnected, there will be no files in /steps (most of the time), so this procedure will be useless. The images shown in the main interface may be preserved, so they could be saved as a group using the Download All Images extension. Links to this plugin are at More tools.
Name of the parameter Text by default Description
initial_image 0 Determines the first image that will be included in the zip.
final_image -1 Determines the last image that will be included in the zip. -1 means including up to the last one.
step 50 Determines the interval between one image and the next. By default it is 50, the same as in the menu above. If you set the interval to 1, it saves all the intermediate steps. The resulting zip will be very large, depending on the number of images. It also influences the time the download takes: if it is too large, it may take a long time and/or the machine may be disconnected in the middle of the download. Normally the files have not been deleted when reconnecting (but that can happen).
filename files.zip This is the name of the file to be downloaded. The images inside will also have this name plus a number. It is recommended to use distinctive names.

Once generated, the zip archive appears in the left sidebar, in Archivos (Files), and should be downloaded automatically.

Open new entries for commands

To introduce a new entry, move the mouse to the edge of a cell; as you pass over it, two tabs will appear: + Código (+ Code) and + Texto (+ Text). Click + Código (+ Code). Another option is to use Control+M B[2].

And paste this:

Code for generating the zip

# @title Create a zip with all of the images (or some)
initial_image = 0 #@param {type:"integer"}
final_image = -1 #@param {type:"integer"}
step = 50 #@param {type:"integer"}
filename = "files.zip" #@param {type:"string"}

import zipfile
from tqdm import tqdm

if final_image == -1:
  final_image = i  # i holds the last iteration number from the execution cell

zipf = zipfile.ZipFile(filename, 'w', zipfile.ZIP_DEFLATED)
for n in tqdm(range(initial_image, final_image + 1, step)):
  fname = f"{n:04}.png"
  zipf.write(f"steps/{fname}", f"{filename.split('.')[0]}-{fname}")
zipf.close()
print(filename, "created. Downloading… A download dialog will open when the download is ready.")
from google.colab import files
files.download(filename)

If you do more than one zip, sometimes you must give your browser permission to download multiple files.

Create a zip with all the images (2)

This section was an outdated version of the previous one.

See Open new entries for commands.

Important when downloading all the images: if we have many files and little space, it can fail (especially combined with very heavy models, such as COCO-Stuff, or after generating a video). If that is the case and we are not going to use the machine again with that model, we can delete the specific model (or the video, once downloaded) to make space.

Control the notebook from the keyboard / automated execution

Control the notebook from the keyboard

In order to control the notebook from the keyboard we need to understand the concept of "focus". The focus is the place where the "selection" is at any given time. If a window, element, cell, text box, etc. has focus, it means that it can directly receive commands from the keyboard.

The key combinations we will use to navigate the notebook are few: Control+Enter to run a cell, ↑/↓ to change cells, and ↹ (the tab key, above Caps Lock) to move through menus or within a cell.


1) Center the focus on the first cell (going down with ↓ once).

2) Run the second cell (Licensed under the MIT License). A menu will appear where we have to accept (↹ and Enter); we will go to the next cell and execute them successively[1] until we come to Parámetros ("Parameters"), which is a text cell, so we don't execute it right away. In Parámetros ("Parameters") we will fill in the necessary fields, moving with the tabulator (↹).

3) We can move backwards with the tabulator (Shift+↹) or forward until we reach the heading of a cell, and keep moving down with ↓ in order to execute Hacer la ejecución ("Make the execution"). If we want to save as a zip (see Create a zip with all the images) or Guardar como vídeo ("Save as video"), we move to their respective cells. Currently we would have to enter the cell manually (see Open new entries for commands).

Automated execution

Go to the section Parámetros ("Parameters") (or Selección de modelos a descargar, "Selection of models to download", if you want to use a model different from the default) and, once the desired parameters have been filled in, go to the menu Runtime → Run before or press Control+F8.

Mount Google Drive

In the side menu we click on Activar Drive ("Mount Drive"). The button will automatically add a code block. If the side menu does not load or work, you can try Open new entries for commands, with this code:

from google.colab import drive
drive.mount('/content/drive')


How to stop the execution of the program?

By default max_iteraciones is -1, which means that the program is not going to stop iterating. If you want to stop the process you can press the circular button with a grey background, a white X and a tooltip that says "borrar resultado" ("delete result"). Before doing so, make sure you have saved everything you want.

The shortcut to stop a cell is Control+M I[2].

Sometimes a cell "hangs". Then you can delete it with the delete button (trash can symbol). To restore it use Control+M Z[2].

How do I know if a cell has been executed?

If you hover over the "Play" button of a cell, you can see whether it has been executed and the result of the execution.

I uploaded an image but it is distorted from the beginning

That is because 480x480 is a square image. If you upload an image with a proportion different from 1:1, the generated image will be distorted from the beginning. Calculate the proportion of the image you are going to upload and use the same proportion in the fields ancho (width) and alto (height) if you want to avoid this distortion. You can use this Calculadora de proporciones (Proportions calculator).
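The same proportion calculation can be done with a couple of lines of Python; a sketch that scales an image's dimensions so the longer side matches the recommended 600 px limit:

```python
def fit_dimensions(orig_width, orig_height, max_side=600):
    """Scale (width, height) preserving aspect ratio so the longer side is max_side."""
    scale = max_side / max(orig_width, orig_height)
    return round(orig_width * scale), round(orig_height * scale)

print(fit_dimensions(1920, 1080))  # a 16:9 photo → (600, 338)
```

Put the resulting numbers in ancho (width) and alto (height) and the generated image will keep the uploaded image's proportion.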

Sometimes I specify a value, but it doesn't take it

This is because you have not executed the cell again. Any change in the fields of Parámetros ("Parameters") requires executing that cell again.

I specified a limit value but I want to continue with more iterations

It is not necessary to iterate again from the beginning to continue from a result (although you should know the seed, or have found it out using the tool Steganography Online): you can use the last iterated image as imagen_inicial ("initial_image"). See Guide the AI to a result.

I don't know if I have to execute everything from the beginning (the machine has disconnected)

Sometimes the packages and definitions are preserved in memory even though some time has passed, but sometimes not, and you have to start from the beginning. It can easily be checked by looking at whether the cell has been executed in the current session (see How do I know if a cell has been executed?) or by trying to execute the cell Parámetros ("Parameters") and looking at the error.

What model is better for me?

CONTENT MISSING WARNING: This section is a draft, which is (evidently) incomplete.
Be patient

Most of the models have no obvious advantages between them. They have been trained with different sets of images so they will produce different results, not necessarily better or worse.

  • ImageNet: The ImageNet project is a large visual database designed for use in visual object recognition software research. The project has hand-annotated over 14 million images to indicate which objects are rendered, and in at least one million of the images, bounding boxes are also provided. Contains over 20,000 categories with a typical category such as "balloon" or "strawberry" consisting of several hundred images (via wikipedia).
    • imagenet_1024 (913.57 MiB - FID[3]: 8.0): Uses the ImageNet dataset with a codebook of 1024 elements.
    • imagenet_16384 (934.68 MiB - FID[3]: 4.9): Uses the ImageNet dataset with a codebook of 16384 elements[r 9].

Contrary to what might be assumed, having a larger codebook does not exactly mean that the model is more powerful; it simply allows it to capture more characteristics of the images. That may or may not be good, depending on what kind of results you want to achieve. The 1024 model is a bit "freer", so to speak, when it comes to generating images. It tends to create things that are more abstract, more chaotic, and more artistic. Its "world view" has fewer categories, forcing it to abstract more.

16384 is much better for soft or minimalist backgrounds.

You can see a comparison between some images of 1024 vs 16384 in this colab: Reconstruction usage.

  • COCO-Stuff (Project - 7.86 GiB - FID[3]: 20.4): COCO-Stuff is a modification with "augmentations" of Microsoft's COCO dataset, with everyday images (streets, people, animals, interiors…).
  • faceshq (3.70 GiB): Specialized in faces.
  • Wikiart: The images of the WikiArt dataset were obtained from WikiArt.org. License: for non-commercial research purposes only. That is, it is a set trained with art paintings, so the results will generally be paintings. A similar result could be achieved with the imagenet datasets by using styles of famous painters.
    • wikiart_1024 (913.75 MiB): Version of wikiart with a codebook of 1024 elements.
    • wikiart_16384 (958.75 MiB): Version of wikiart with a codebook of 16384 elements.
  • s-flckr (3.97 GiB): Dataset from Flickr.
  • ade20k (4.61 GiB - FID[3]: 35.5) (Note: not included by default, see How can I add new models?): Dataset with semantic segmentation from MIT. It contains more than 20K scene-centric images exhaustively annotated with pixel-level objects and object part labels. There are a total of 150 semantic categories, including things like sky, roads and grass, and discrete objects like people, cars, beds, etc.

I have left VQGAN open overnight and now it won't connect to a machine with a GPU

Although "it is not doing anything" if the connection exists, it is already speding assigned time, so if you leave the machine stopped for a long time, the next time you will not be assigned a machine with GPU.


  1. Avoid leaving the machine connected longer than necessary.
  2. Wait long enough (1 day or so) for the limitation to pass.
  3. Use a different gmail account.
  4. If you have infinite patience, you can use the machine with only CPU. It takes about 30 times longer. Not recommended.

What are the limits of use of Colab?

Colab may provide free resources in part by having dynamic usage limits that sometimes fluctuate and by not providing guaranteed or unlimited resources. This means that general usage limits, as well as idle periods, maximum virtual machine lifespan, available GPU types, and other factors vary over time. Colab does not publish these limits, in part because they can (and sometimes do) change rapidly.

GPUs and TPUs are sometimes prioritized for users using Colab interactively over long-running computations, or for users who have recently used fewer resources in Colab. As a result, users using Colab for long-running computations, or users who have recently used more resources in Colab, are more likely to run into usage caps and have their access to GPUs and TPUs temporarily restricted. Users with heavy computational needs may be interested in using the Colab UI with a local runtime running on their own hardware. Users interested in having higher and stable usage limits may be interested in Colab Pro[r 10].

How can I upload an image from the internet directly?

If you have a poor connection sometimes the functionality of uploading an image fails or does not load. This code can be used to download the image from an internet URL.

!wget https://url.to.your/image.jpg

It allows downloading a URL from the Internet to the notebook. See Open new entries for commands. If the URL has strange symbols, you can use quotes instead:

!wget "https://url.to.your/image.jpg"

How can I run VQGAN+CLIP locally?

CONTENT MISSING WARNING: This section is a draft, which is (evidently) incomplete.
It will be completed soon

The problem is that running it locally requires a very powerful graphics card. You need approximately 15GB of GPU memory for it to run decently. With a minimum of 10GB it will run, but extremely slowly (this may be outdated).

There are people who have managed to make it usable with 6GB graphics cards (1060). A possible optimization is to modify the parameter cutn (for example to 32). If the parameter (which is related to quality) is lowered too much, it does not give good results, but if it is too high, it consumes a lot of memory.

In Local execution there are Docker and github versions.

How can I add new models?

The best way is to report the existence of new models to the DotHub discord and wait for people to implement them.

How can I disable the augmentations?

The augmentations are random variations in each step that improve the final quality of the image. That is, if they are deactivated, quality is reduced, but the execution will be faster (although not by much: 1.01 seconds per iteration vs 1.31 seconds per iteration on a Tesla T4). It can also vary the content substantially.

In the cell Carga de bibliotecas (Loading libraries) we double-click and the editable code will appear. We search (Control+F) for the code class MakeCutouts(nn.Module): and remove from self.augs = nn.Sequential() everything between the parentheses, leaving it this way:

class MakeCutouts(nn.Module):
    def __init__(self, cut_size, cutn, cut_pow=1.):
        super().__init__()
        self.cut_size = cut_size
        self.cutn = cutn
        self.cut_pow = cut_pow
        self.augs = nn.Sequential()
        self.noise_fac = 0.1


VQGAN+CLIP error fixing

If the cell has given an error, the play button is shown in red.

Here are some of the errors you may encounter and how to fix them.

NameError: name 'textos' is not defined

NameError                                 Traceback (most recent call last)
<ipython-input-6-b687f6112952> in <module>()
      2 device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
      3 print('Using device:', device)
----> 4 if textos:
      5     print('Using texts:', textos)
      6 if imagenes_objetivo:

NameError: name 'textos' is not defined

Solution: You have pressed Hacer la ejecución… ("Make the execution…") before loading the parameters. Stop the cell Hacer la ejecución… ("Make the execution…"), run the cell Parámetros (Parameters), and then run the cell Hacer la ejecución… ("Make the execution…") again.

NameError: name 'argparse' is not defined

NameError                                 Traceback (most recent call last)
<ipython-input-8-9ad04e66b81c> in <module>()
---> 31 args = argparse.Namespace(
     32     prompts=textos,
     33     image_prompts=imagenes_objetivo,

NameError: name 'argparse' is not defined

Solution: This means that you have not run the cell "Carga de bibliotecas y definiciones" (Loading libraries and definitions). (It can also be because the virtual machine has expired.)

ModuleNotFoundError: No module named 'transformers'

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-12-a7b339e6dfb6> in <module>()
     13 print('Using seed:', seed)
---> 15 model = load_vqgan_model(args.vqgan_config, args.vqgan_checkpoint).to(device)
     16 perceptor = clip.load(args.clip_model, jit=False)[0].eval().requires_grad_(False).to(device)

11 frames
/content/taming-transformers/taming/modules/transformer/mingpt.py in <module>()
     15 import torch.nn as nn
     16 from torch.nn import functional as F
---> 17 from transformers import top_k_top_p_filtering
     19 logger = logging.getLogger(__name__)

ModuleNotFoundError: No module named 'transformers'

Solution: It happens with coco, faceshq and sflickr. You have to open a cell before the cell Carga de bibliotecas y definiciones (Load of libraries and definitions) and write:

!pip install transformers

And run that cell.

ModuleNotFoundError: No module named 'taming'

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-10-c0ac0bf55e51> in <module>()
     11 from omegaconf import OmegaConf
     12 from PIL import Image
---> 13 from taming.models import cond_transformer, vqgan
     14 import torch
     15 from torch import nn, optim

ModuleNotFoundError: No module named 'taming'

Solution: Maybe restart the environment, or see the solution to ModuleNotFoundError: No module named 'transformers'.

ModuleNotFoundError: No module named 'taming.modules.misc'

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-11-b687f6112952> in <module>()
     13 print('Using seed:', seed)
---> 15 model = load_vqgan_model(args.vqgan_config, args.vqgan_checkpoint).to(device)
     16 perceptor = clip.load(args.clip_model, jit=False)[0].eval().requires_grad_(False).to(device)

12 frames
/usr/lib/python3.7/importlib/_bootstrap.py in _find_and_load_unlocked(name, import_)

ModuleNotFoundError: No module named 'taming.modules.misc'

Solution: One of the packages needed to run the program failed. Run the installation cell again and retry.

  • This error could appear if you chose the model "faceshq". It has since been corrected (2021-06-11).

FileNotFoundError: [Errno 2] No such file or directory

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-10-f0ccea6d731d> in <module>()
     13 print('Using seed:', seed)
---> 15 model = load_vqgan_model(args.vqgan_config, args.vqgan_checkpoint).to(device)
     16 perceptor = clip.load(args.clip_model, jit=False)[0].eval().requires_grad_(False).to(device)

1 frames
/usr/local/lib/python3.7/dist-packages/omegaconf/omegaconf.py in load(file_)
    182         if isinstance(file_, (str, pathlib.Path)):
--> 183             with io.open(os.path.abspath(file_), "r", encoding="utf-8") as f:
    184                 obj = yaml.load(f, Loader=get_yaml_loader())
    185         elif getattr(file_, "read", None):

FileNotFoundError: [Errno 2] No such file or directory: '/content/wikiart_16384.yaml' (alternatively: Orange.png)

Solution: This could mean two things:

  1. You have chosen a model that has not been downloaded. Check that the chosen model has been downloaded. It may also be that the machine has expired (that is, everything has been erased); in that case you would have to run everything from the beginning.
  2. You have entered a name in imagen_inicial ("initial_image") that does not match the image you uploaded. This program is case-sensitive, so Orange.png is different from orange.png. Enter the correct name in imagen_inicial ("initial_image").
  • If by accident you have uploaded an image with a very complicated name, you can change the name from the interface itself.
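The case-sensitivity pitfall is easy to verify. An illustrative sketch (the /content listing lines are commented out because they only make sense inside a live Colab session):

```python
import os

# On Colab's Linux filesystem, paths are case-sensitive, so these are two
# different files even though they differ only in capitalisation:
print('Orange.png' == 'orange.png')  # False

# Inside a session you can check the exact uploaded name before setting
# imagen_inicial, e.g.:
# print(os.listdir('/content'))
# print(os.path.exists('/content/orange.png'))
```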

RuntimeError: CUDA out of memory

RuntimeError                              Traceback (most recent call last)
<ipython-input-13-f0ccea6d731d> in <module>()
    131     with tqdm() as pbar:
    132         while True:
--> 133             train(i)
    134             if i == max_iteraciones:
    135                 break

8 frames
/usr/local/lib/python3.7/dist-packages/taming/modules/diffusionmodules/model.py in nonlinearity(x)
     29 def nonlinearity(x):
     30     # swish
---> 31     return x*torch.sigmoid(x)

RuntimeError: CUDA out of memory. Tried to allocate […] (GPU 0; […] total capacity; […] already allocated; […] free; […] reserved in total by PyTorch)

Solution: This can mean several things:

  1. You have chosen image dimensions that are too large. A size of 480x480px is enough (although in theory it could support up to 420,000 pixels in total, i.e. ~648x648). To enlarge the dimensions of the image use the tools linked in Image resizers.
  2. You have run out of memory from using it for a long time. You will have to start a new session.
  3. Google has assigned you a low memory GPU (<15109MiB). You will have to start a new session.
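The pixel budget from point 1 can be checked before starting a run. A hedged sketch (fits_budget is a hypothetical helper, and the ~420,000-pixel limit is the rough figure quoted above; the real ceiling depends on the GPU assigned):

```python
# Check a requested image size against the approximate pixel budget
# before launching a long run.
def fits_budget(width, height, max_pixels=420_000):
    return width * height <= max_pixels

print(fits_budget(480, 480))    # True  (230,400 px)
print(fits_budget(648, 648))    # True  (419,904 px, right at the limit)
print(fits_budget(1024, 1024))  # False (1,048,576 px)
```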

RuntimeError […] is too long for context length X

RuntimeError                              Traceback (most recent call last)
<ipython-input-10-f0ccea6d731d> in <module>()
     46 for prompt in args.prompts:
     47     txt, weight, stop = parse_prompt(prompt)
---> 48     embed = perceptor.encode_text(clip.tokenize(txt).to(device)).float()
     49     pMs.append(Prompt(embed, weight, stop).to(device))

/content/CLIP/clip/clip.py in tokenize(texts, context_length)
    188     for i, tokens in enumerate(all_tokens):
    189         if len(tokens) > context_length:
--> 190             raise RuntimeError(f"Input {texts[i]} is too long for context length {context_length}")
    191         result[i, :len(tokens)] = torch.tensor(tokens)

RuntimeError: Input […] is too long for context length 77

Solution: The input text is too long. Enter a shorter text; it has to be fewer than 350 characters[r 11]. See Lettercount.
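CLIP raises this RuntimeError instead of silently truncating, so it can pay to check the prompt length before running the cell. A hypothetical pre-check (prompt_too_long is not part of the notebook; the 350-character figure is the rule of thumb from note [r 11], since the real limit is in tokens, not characters):

```python
# Rough pre-check: prompts beyond ~350 characters tend to exceed
# CLIP's 77-token context length.
def prompt_too_long(text, max_chars=350):
    return len(text) > max_chars

print(prompt_too_long('a castle on a hill at sunset'))  # False
print(prompt_too_long('x' * 400))                       # True
```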

TypeError: randint() received an invalid combination of arguments

TypeError                                 Traceback (most recent call last)
<ipython-input-8-b8abd6a7071a> in <module>()
     43     z, *_ = model.encode(TF.to_tensor(pil_image).to(device).unsqueeze(0) * 2 - 1)
     44 else:
---> 45     one_hot = F.one_hot(torch.randint(n_toks, [toksY * toksX], device=device), n_toks).float()
     46     if is_gumbel:
     47         z = one_hot @ model.quantize.embed.weight

TypeError: randint() received an invalid combination of arguments - got (int, list, device=torch.device), but expected one of:
 * (int high, tuple of ints size, *, torch.Generator generator, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool requires_grad)
 * (int low, int high, tuple of ints size, *, torch.Generator generator, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool requires_grad)

Solution: There is no solution yet, since it only appears with certain configurations (not the default ones). For now, the workaround is to change the parameters until it works.

ValueError: could not convert string to float

ValueError                                Traceback (most recent call last)
<ipython-input-8-f0ccea6d731d> in <module>()
     46 for prompt in args.prompts:
---> 47     txt, weight, stop = parse_prompt(prompt)
     48     embed = perceptor.encode_text(clip.tokenize(txt).to(device)).float()
     49     pMs.append(Prompt(embed, weight, stop).to(device))

<ipython-input-5-32991545ebb9> in parse_prompt(prompt)
    129     vals = prompt.rsplit(':', 2)
    130     vals = vals + ['', '1', '-inf'][len(vals):]
--> 131     return vals[0], float(vals[1]), float(vals[2])

ValueError: could not convert string to float: ' los leones la levantan y[…]'

Solution: The text contained a colon :, which is an illegal character unless it is part of a weight expression (red:-1, for example).
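The notebook's parse_prompt function (shown in the traceback above) explains why: it splits each prompt on its last two colons and converts the tail pieces to floats, so a colon followed by ordinary text makes float() fail:

```python
# parse_prompt as it appears in the notebook's traceback above:
def parse_prompt(prompt):
    vals = prompt.rsplit(':', 2)                 # split off up to two trailing fields
    vals = vals + ['', '1', '-inf'][len(vals):]  # fill missing weight/stop defaults
    return vals[0], float(vals[1]), float(vals[2])

print(parse_prompt('red:-1'))  # ('red', -1.0, -inf) — a valid weight

# A colon followed by ordinary text cannot be parsed as a float:
try:
    parse_prompt('los leones: la levantan')
except ValueError as e:
    print('ValueError:', e)
```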

WARNING:root:kernel restarted

Even if you try to reconnect or re-run the cells, the machine seems to be dead, showing (for example) the error WARNING:root:kernel […] restarted.

Solution: Force the session to end by going to the tab at the top next to Conectar (Connect) → Gestionar sesiones (Manage sessions) → Finalizar (Terminate).



  • Without image input: (example image gallery)
  • With image input: (example image gallery)

Examples of video

See also


References allude to an article's relationship with "real life".

  1. Note: An alternative to XMP metadata is data embedded using steganography. This metadata can be viewed with the stegano Python library (Steganography Online).
  2. Note 2: Even though the seed is identical, the results will still vary a bit due to the "augmentations". Augmentations are pseudo-random variations introduced in each iteration. They can be disabled, which would make it a bit faster, but at the cost of losing quality. See How can I disable the augmentations?
  3. Hinge loss is a kind of specific and independent loss function for artificial intelligences. See hinge loss.
      In machine learning, the "hinge loss" is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs).
  4. With Goya and other artists it has been observed that the first iterations are better.
  5. I.e., the AI has been trained with many images, some of which were labeled "rendered in X" so it mimics those results.
  6. Via Aran Komatsuzaki.
  7. The section Asignar pesos (Assign weights) shows a way to try to remove the Unreal Engine logo that sometimes appears floating across the image.
  8. Although the word "trending" evokes something current, the dataset is not updated every day. That is, "trending" takes its data from the moment the dataset was made.
  9. That is, the number of elements the model uses to define a single image; see ¡Esta IA crea ARTE con tus TEXTOS! (y tú puedes usarla 👀) [Minute 7:30].
  10. Reference.
  11. As such there is no character limit, only tokens (which are groups of characters), but tokens vary depending on what you write. As I understand it, the maximum is 75 tokens, approximately 350 characters.
  1. Before mastering the program, it is recommended to wait for each cell to finish executing (in order not to skip steps), but it is not strictly necessary. Everything can be programmed from start to finish (even video) by filling in the correct parameters. Obviously, if there is an error or the machine expires, the video cannot be downloaded or the images will not be generated correctly, so you have to take a look from time to time.
  2. These shortcuts are activated by pressing the two keys joined by the + first and then the additional letter.
  3. The Fréchet inception distance score, or FID for short, is a metric that calculates the distance between feature vectors computed for real and generated images. Via: How to Implement the Frechet Inception Distance (FID) for Evaluating GANs.

External links

External links are not endorsed by this wiki. We are not responsible for broken or redirected links.



Technical information

Other guides

Local execution

Other tools

Resources for initial images

Image resizers

Metadata viewers

More tools

Text input generators


Social media


Special thanks to Eleiber#8347 for answering my questions and providing corrections. Also to Abulafia#3734 for explaining techniques and to elchampi#0893 for sharing his doubts. And to the many users of the DotHub Discord who have shared their techniques or doubts. Also to the Reddit users who have helped me.


   Article written by Jakeukalane
To propose any change or addition, consult the editors.
   Article written by Avengium
To propose any change or addition, consult the editors.