The AI world is still figuring out how to deal with DALL-E 2’s amazing ability to draw, paint, or imagine just about anything… but OpenAI isn’t the only one working on something like this. Google Research has rushed to publicize a similar model it has been working on, which it claims is even better.
Imagen (get it?) is a text-to-image diffusion-based generator built on large Transformer language models that… okay, let’s slow down and unpack that real quick.
Text-to-image models take text inputs like “a dog on a bike” and generate a corresponding image, something that has been done for years but has recently seen huge leaps in quality and accessibility.
Part of that is the use of diffusion techniques, which basically start with a pure noise image and slowly refine it bit by bit until the model thinks it can’t make it look any more like a dog on a bike than it already does. This was an improvement over top-to-bottom generators that could get it hilariously wrong on the first try, and others that could easily be led astray.
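To make the "start with noise, refine bit by bit" idea concrete, here is a deliberately tiny sketch. It is not how Imagen or any real diffusion model is implemented: the `toy_denoiser` function is a hypothetical stand-in for a trained neural network, and it cheats by looking at a known target image instead of inferring anything from a text prompt. All names and the six-pixel "image" are made up for illustration.

```python
import random


def mean_abs_error(a, b):
    """Average per-pixel distance between two equally sized images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def toy_denoiser(noisy, target):
    """Hypothetical stand-in for a trained network: returns its noise estimate.

    A real model would infer the noise from the noisy image and the text
    prompt; this toy cheats by peeking at a known target image.
    """
    return [n - t for n, t in zip(noisy, target)]


def generate(target, steps=50, seed=0):
    """Start from pure noise and strip away a little estimated noise per step."""
    rng = random.Random(seed)
    image = [rng.gauss(0.0, 1.0) for _ in target]  # pure noise to begin with
    errors = [mean_abs_error(image, target)]
    for step in range(steps):
        noise = toy_denoiser(image, target)
        fraction = 1.0 / (steps - step)  # small at first, 1.0 on the final step
        image = [x - fraction * e for x, e in zip(image, noise)]
        errors.append(mean_abs_error(image, target))
    return image, errors


dog_on_bike = [0.0, 0.25, 0.5, 0.75, 1.0, 0.5]  # a made-up six-pixel "image"
image, errors = generate(dog_on_bike, steps=20, seed=1)
print(f"error at start: {errors[0]:.3f}, at end: {errors[-1]:.3f}")
```

The point of the toy is only the shape of the loop: each pass removes part of the estimated noise, so the picture gets steadily closer to what the model "thinks" the prompt looks like.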
The other part is improved language understanding through large language models using the Transformer approach, whose technical aspects I won’t (and can’t) get into here, but it and a few other recent advances have led to convincing language models such as GPT-3 and others.
Imagen starts by generating a small image (64×64 pixels) and then does two “super-resolution” passes to bring it up to 1024×1024. This isn’t like normal upscaling, though: AI super-resolution uses the original as a basis and creates new details in harmony with the smaller image.
Suppose you have a dog on a bike and the dog’s eye is 3 pixels across in the first image. Not much room for expression! But in the second image it is 12 pixels across. Where does the detail needed for this come from? Well, the AI knows what a dog’s eye looks like, so it generates more detail as it draws. Then this happens again when the eye is drawn a third time, now 48 pixels across. But at no point did the AI have to pull a 48-pixel-wide dog’s eye out of its… let’s say magic bag. Like many artists, it started with the equivalent of a rough sketch, filled it out in a study, and then really went to town on the final canvas.
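The arithmetic behind that example can be sketched in a few lines. The 4x factor per pass is an assumption inferred from the numbers above (3 to 12 to 48 pixels, and 64 to 1024 in two passes); the function names are made up for illustration. The second function shows what ordinary upscaling by pixel repetition does, for contrast: the image gets bigger, but no new detail appears.

```python
def cascade_sizes(base=64, factor=4, passes=2):
    """Image side length at each stage: the base image, then each pass."""
    sizes = [base]
    for _ in range(passes):
        sizes.append(sizes[-1] * factor)
    return sizes


def naive_upscale(row, factor):
    """Ordinary upscaling by pixel repetition: bigger, but no new detail."""
    return [pixel for pixel in row for _ in range(factor)]


sizes = cascade_sizes()
print(sizes)  # [64, 256, 1024]

# Width of the dog's eye at each stage, starting from 3 pixels at 64x64.
eye_widths = [3 * size // sizes[0] for size in sizes]
print(eye_widths)  # [3, 12, 48]

# A made-up 3-pixel "eye": stretching it gives 12 pixels, but still only
# the same two brightness values. A super-resolution model would instead
# generate new, plausible detail at this size.
eye_row = [10, 200, 10]
stretched = naive_upscale(eye_row, 4)
print(len(stretched), sorted(set(stretched)))  # 12 [10, 200]
```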
This is not unprecedented, and in fact artists working with AI models already use this technique to create pieces much larger than what the AI can handle in one go. If you split a canvas into several pieces and super-resolve each of them separately, you end up with something much larger and more finely detailed; you can even do it repeatedly. An interesting example by an artist I know:
The advances that Google’s researchers claim with Imagen are several. They say that existing text models can be used for the text encoding part, and that the quality of that text encoder matters more than simply increasing visual fidelity. This makes intuitive sense, since a detailed picture of nonsense is definitely worse than a slightly less detailed picture of exactly what you asked for.
For example, in the paper describing Imagen, they compare its results with DALL-E 2’s for the prompt “a panda making latte art.” In all of the latter’s images, it is latte art of a panda; in most of Imagen’s, it’s a panda making the art. (Neither was able to render a horse riding an astronaut, showing the opposite in every attempt. It’s a work in progress.)
In Google’s tests, Imagen came out ahead in human evaluations of both accuracy and fidelity. This is of course quite subjective, but it’s quite impressive to even match the perceived quality of DALL-E 2, which until now has been considered a huge step up from everything else. I’d just add that while it’s pretty good, none of these images (from any generator) will withstand more than cursory scrutiny before people notice they were generated, or at least have serious suspicions.
However, OpenAI is a step or two ahead of Google in some respects. DALL-E 2 is more than a research paper; it’s a private beta with people using it, just as they used its predecessor and GPT-2 and 3. Ironically, the company with “open” in its name has focused on productizing its text-to-image research, while the fabulously profitable internet giant has yet to attempt it.
That’s clear from the DALL-E 2 researchers’ decision to pre-curate the training dataset and remove any content that might violate their own guidelines. The model couldn’t do anything NSFW if it tried. However, the Google team used some large datasets that are known to contain inappropriate material. In a revealing section on the Imagen website describing “Limitations and Societal Impacts,” the researchers write:
Downstream applications of text-to-image models are diverse and can impact society in complex ways. The potential risks of abuse raise concerns about responsible open-sourcing of code and demos. At this point we have decided not to release code or a public demo.
The data requirements of text-to-image models have led researchers to rely heavily on large, mostly uncurated, web-scraped datasets. While this approach has enabled rapid algorithmic advances in recent years, datasets of this type often reflect social stereotypes, oppressive viewpoints, and derogatory or otherwise harmful associations with marginalized identity groups. While some of our training data was filtered to remove noise and unwanted content such as pornographic images and toxic language, we also used the LAION-400M dataset, which is known to contain a wide range of inappropriate content, including pornographic images, racial slurs and harmful social stereotypes. Imagen relies on text encoders trained on uncurated web-scale data, thus inheriting the social biases and limitations of large language models. Therefore, there is a risk that Imagen has encoded harmful stereotypes and representations, which guides our decision not to release Imagen for public use without further safeguards.
While some might carp at this, saying Google is afraid its AI isn’t politically correct enough, that’s an uncharitable and short-sighted view. An AI model is only as good as the data it was trained on, and not every team can put in the time and effort it might take to remove the truly horrible stuff these scrapers pick up as they assemble datasets of many millions of images or several billion words.
Such biases are meant to surface during the research process, which exposes how the systems work and provides an unrestricted testing ground for identifying these and other limitations. How else would we know that an AI can’t draw common black hairstyles, hairstyles that any child could draw? Or that when the AI is asked to write stories about work environments, it inevitably makes the boss a man? In these cases, an AI model is working perfectly and as designed: it has successfully learned the biases that permeate the media it was trained on. Not unlike people!
But while unlearning systemic biases is a lifelong project for many people, an AI has it easier: its creators can remove the content that caused it to behave badly in the first place. Maybe someday there will be a need for an AI to write in the style of a racist, sexist pundit from the 1950s, but for now the benefits of including that data are small and the risks are great.
In any case, Imagen, like the others, is clearly still in the experimental phase and is not ready to be used in anything other than a closely human-supervised manner. When Google gets around to making its capabilities more accessible, we’re sure to learn more about how and why it works.