AI Image Generation in Practice

We all remember the distorted faces and hands with too many or too few fingers that AI generated just a few years ago. Most of the results were amusing, but they were of little use for real-world applications. This has changed dramatically, particularly since 2025. The limits of what is technically feasible, and with them the number of possible use cases, shift every month. Image generation is becoming faster and cheaper, can process more reference images, and renders text far more reliably. Even 4K resolutions and all common image formats are now possible. The pioneers of this revolution are the software giant Google and a comparatively small but powerful start-up from Freiburg. But more on that later.

Together with our client, we implemented a comprehensive project based on image generation technology: an application for creating storyboards, visual representations of stories, with the help of AI. In this blog post, we share our experiences from this project. We also discuss how the best current models work, give an overview of the model landscape, explain how to prompt image models effectively, and present further possible use cases.


📑 Table of Contents

  1. How it works
  2. Model landscape
  3. Prompting
  4. Image generation for the creation of visual storyboards
  5. Technologies used
  6. Limitations
  7. Outlook

How it works

State-of-the-art (SOTA) image generators commonly combine the transformer architecture, which also underpins large language models, with diffusion models. Diffusion models generate high-resolution images from random pixel noise, like the static on old TVs with poor reception. Roughly speaking, the process can be imagined as a kind of camera; let’s call it a ‘dream camera’ for the sake of this example.

Like a real camera, the dream camera starts with a closed lens. This means that no light signal falls on the individual pixel sensors, and only completely random thermal noise signals are generated.

Rather than opening the lens, the dream camera receives a prompt describing the desired image. The dream camera’s internal electronics have been trained using a large number of prompts and finished images, enabling them to gradually transform the random noise into a real image. At each stage, the pixel intensities and colours are adjusted so that the desired image gradually emerges, just as with a real camera where more and more light falls on the sensors during exposure, creating the finished image. The fascinating thing about the dream camera is that its lens was never opened; the image was created solely from the prompt and the training dataset, which shaped the circuits of the dream camera implicitly.

An important improvement on the first diffusion models is flow matching. Instead of removing noise in many small denoising steps, the model learns a direct path, a velocity field, from noise to image. This allows the finished image to be created in significantly fewer steps, greatly speeding up the process.
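As a minimal sketch of the sampling loop (assuming a trained network `velocity_model` that predicts, for a noisy image and a time t, the direction towards the finished image; the function name is a hypothetical placeholder):

```python
import numpy as np

def sample_flow_matching(velocity_model, prompt_embedding, shape=(64, 64, 3), num_steps=8):
    """Integrate a learned velocity field from pure noise (t=0)
    towards a finished image (t=1) in a handful of Euler steps."""
    rng = np.random.default_rng()
    x = rng.standard_normal(shape)                   # the 'closed lens': pure random noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_model(x, t, prompt_embedding)   # predicted direction towards the image
        x = x + dt * v                               # move one small step along the path
    return x
```

Because each step covers a larger, straighter stretch of the path, eight steps can suffice where classic diffusion samplers needed dozens.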

A major weakness of pure diffusion models is that, while they can generate high-resolution images, they have only a shallow understanding of the prompt text. To address this, innovations from large language models have been incorporated. The result is models that comprehend complex instructions and produce clean images, all in one step and controlled with natural language.


Model landscape

Judging by performance alone, the top spot is up for grabs again. Alongside Gemini 3 Pro Image (Nano Banana Pro) and Flux 2, OpenAI’s GPT Image 1.5 has returned to the ranks of the best image generators. Black Forest Labs, the Freiburg-based start-up, also offers a strong open-weights variant with Flux 2 Dev. Meanwhile, Chinese models such as Seedream 4.0 and Qwen Image Edit are gaining ground. OpenAI is thus no longer benefiting solely from its widespread use, but can finally impress again on performance. The timeline illustrates how dynamic model development is: after the hype surrounding GPT Image 1 in April (remember the Ghibli wave?) and a temporary lull, OpenAI has now caught up technologically with its latest update.


Prompting

Anyone who has ever tried to create a specific image using AI will be familiar with the problem of random results. Structured prompting is therefore necessary for professional use.

The basic structure is relatively simple: the more central an element is to the image, the earlier it should appear in the prompt and the more detailed its description should be. Less central elements, such as the image background, are described at the end of the prompt in less detail.
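An illustrative (invented) prompt following this structure:

An elderly fisherman with a weathered face and a yellow raincoat, holding a brass lantern, standing at the bow of a small wooden boat; calm grey sea and a foggy harbour in the background, muted morning light.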

Various terms from professional photography and the film industry can be used to specify the angle and distance of an image.

Infographic titled ‘Camera Angles & Shot Levels’, illustrating camera perspectives (high angle, low angle) on the left and five shot sizes from ‘Extreme Wide Shot’ to ‘Extreme Close Up’ on the right, using a male figure.

In addition to perspective, lighting is a particularly important factor in images. Even a small change to the lighting can make the whole image look different. The following three images differ only in their lighting.

Close-up of a futuristic, transparent smartphone with a holographic display in a sun-drenched café, next to an espresso cup and a smartwatch.

…large floor-to-ceiling windows overlooking a dense metropolitan street filled with electric vehicles and glass skyscrapers. Cinematic lighting, sharp focus on the hologram, shallow depth of field, 8k resolution.

A futuristic smartphone in a café, lit by dim evening light that creates a moody atmosphere.

First image + “transform the lighting into twilight lighting”

A futuristic smartphone in a café, lit by artificial neon light, giving the image a cyberpunk look.

First image + “transform the lighting into neon lighting”

Colour grading is closely related to lighting. The art lies in selecting colours that achieve the desired visual style, giving the image a specific look and feel. To illustrate the difference, we varied the colours of a prompt once again. We also had to change the time of day, as it would otherwise not have matched the colours. For colour grading, it usually helps to describe the colours in more detail.

Young woman in a wheat field at dusk, in cool blue and grey tones with a pale silver sky and a melancholic mood.

A young woman standing in a field of wheat at dusk, cool blue twilight atmosphere, desaturated teal and slate gray tones, pale silver sky, cold and melancholic mood, cinematic.

Young woman in a wheat field at sunset, surrounded by warm amber and honey tones and a deep orange sky.

A young woman standing in a field of wheat at sunset, bathed in warm golden hour light, rich amber and honey tones throughout, deep orange sky, everything glowing with warmth, cinematic.

Certain camera settings also make good control terms. For instance, the keyword ‘depth of field’ can be used to specify the desired depth of the camera’s focus. The focal length of the camera lens can also be specified in the prompt to achieve particular perspectives and focal behaviours.

Image with shallow depth of field, where the focus is on one object and the background is heavily blurred.

Shallow depth of field

Image with deep depth of field, where both the foreground and the background are in sharp focus.

Deep depth of field

Telephoto shot at 200mm focal length, zooming in tightly on a subject and compressing the background.

200mm telephoto lens

Wide-angle shot at 14mm focal length, showing a wide field of view and plenty of surroundings.

Ultra-wide 14mm lens

Beyond these specialised settings, it is now possible to dispense with technical vocabulary altogether and simply prompt in natural language, just as with large language models. Note, however, that image models can only handle a limited amount of detail. The individual parts of an image prompt must also fit together. This is particularly problematic when image generation prompts are themselves generated by LLMs, which can produce instructions that are geometrically impossible for the image model to render. The result is often an incorrect or jumbled image.

Faulty rendering of a woman in a gallery whose arm is stretched unnaturally far across the room to reach a coffee cup on a distant table.

A woman standing in the far left corner of a minimalist white gallery … her hand casually resting on a coffee cup placed on a small wooden table in the far right corner.

Furthermore, some keywords are more effective than others; for instance, many models find it easier to adjust the exposure than to accurately represent a camera perspective.


Image generation for the creation of visual storyboards

When converting an existing short story into a visual storyboard, additional context engineering techniques come into play alongside the prompting methods described above. Telling a story visually through a series of images across multiple scenes requires high-quality individual images and, above all, consistency across them. In a real photo series, the characters naturally all look the same, and the same objects and locations appear in the background. For AI-generated images, however, this is challenging: each image in the series is generated from scratch, and the probabilistic nature of the process means that every image initially looks different.

These difficulties can be overcome with the help of an important feature of the latest AI image models. These models can use not only textual prompts, but also reference images as input, and the generated images can reproduce many visual details of the reference images. Currently, this is reliable for around 5–10 images, depending on the use case. This means that the central characters and objects of a short story can always be used as a reference when generating individual scenes.
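As a sketch of what this looks like in code, the following assumes Google's google-genai Python SDK; the model name and file paths are placeholders, and other providers offer similar interfaces:

```python
from google import genai
from PIL import Image

client = genai.Client()  # reads GOOGLE_API_KEY from the environment

# Reference images pin down the appearance of recurring characters.
ben = Image.open("references/ben.png")        # placeholder file paths
santa = Image.open("references/santa.png")

response = client.models.generate_content(
    model="gemini-3-pro-image-preview",       # placeholder model name
    contents=[
        ben,
        santa,
        "The boy from the first reference image explains something to the "
        "seated Santa Claus from the second reference image; a modern blue "
        "sports car is parked outside the patio door in the background.",
    ],
)

# Generated images come back as inline binary data alongside any text parts.
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        with open("scene_04.png", "wb") as f:
            f.write(part.inline_data.data)
```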

For our use case, we have created a customised orchestration of various LLM and image model calls that meets our specific requirements. First, a reasoning model — a language model optimised for complex tasks — is used to extract detailed image generation prompts for the main characters in a story. Based on these prompts, an image is generated for each character, serving as a central reference for all scenes in which they appear. We then use the same techniques to generate detailed prompts for each individual scene in the storyboard. Additionally, we establish consistency between the textual descriptions, resulting in subsequent images that appear much more harmonious. In the interface we have developed, users can monitor each step and refine prompts or images individually. This enables minor errors or inconsistencies in the images to be swiftly rectified.
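In simplified form, the orchestration can be sketched as follows; the helper functions are placeholders for the LLM and image model calls described above, not our actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Scene:
    prompt: str
    characters: list[str]

# Placeholder stubs for the provider-specific LLM / image model calls.
def extract_character_prompts(story: str) -> dict[str, str]: ...
def extract_scene_prompts(story: str, character_prompts: dict[str, str]) -> list[Scene]: ...
def generate_image(prompt: str, references: list[bytes] | None = None) -> bytes: ...

def build_storyboard(story: str) -> list[bytes]:
    # 1. A reasoning model writes one detailed prompt per main character.
    character_prompts = extract_character_prompts(story)
    # 2. Each character gets a reference image reused across all scenes.
    refs = {name: generate_image(p) for name, p in character_prompts.items()}
    # 3. The reasoning model derives harmonised per-scene prompts.
    scenes = extract_scene_prompts(story, character_prompts)
    # 4. Each scene is generated with the relevant character references attached.
    return [
        generate_image(s.prompt, references=[refs[c] for c in s.characters])
        for s in scenes
    ]
```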

The following story is a modified example, in which little Ben lectures Santa Claus about modern means of transportation:

Rear view of Santa Claus standing at night with a reindeer and a wooden sleigh in front of an illuminated front door.
Santa Claus stands in the brightly lit living room; a family (father, mother and a little boy) looks at him in surprise and awe.
Santa Claus hands the little boy a present wrapped in green paper with a red bow while the parents watch, smiling.
The boy gestures as he explains something to the seated Santa Claus; in the background, a modern blue sports car is parked outside the patio door.
The boy happily holds his present in his arms while Santa Claus and the parents laugh and applaud in the blurred background.

Technologies used

The application has a modular structure and is deliberately vendor-independent. All calls to reasoning models run via LangChain, making it easy to swap models. For image models, we use the native APIs of the respective providers: our experience has shown that these interfaces currently change faster than libraries such as LangChain can keep up with, which creates unnecessary friction during integration.
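A minimal sketch of the vendor-independent setup for the reasoning models, assuming LangChain's init_chat_model helper; the model identifiers are interchangeable examples:

```python
from langchain.chat_models import init_chat_model

# One line decides which provider serves the reasoning model; everything
# downstream talks to the same interface.
reasoning_model = init_chat_model("openai:gpt-4o")
# reasoning_model = init_chat_model("anthropic:claude-3-5-sonnet-latest")  # swap with no other changes

reply = reasoning_model.invoke(
    "Extract a detailed image generation prompt for each main character "
    "in the following story: ..."
)
print(reply.content)
```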

To facilitate debugging and quality improvement, we trace all generations fully — from storyline parsing and prompt optimisation to the finished image. This makes it possible to trace exactly which prompts and references were used in the event of unexpected results.
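As an illustration, a trace record might capture fields like these (the names are illustrative, not our actual schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class GenerationTrace:
    """One record per image generation, persisted for later debugging."""
    scene_id: str
    prompt: str                 # final prompt sent to the image model
    reference_ids: list[str]    # cast/scene images attached as references
    model: str
    output_uri: str             # where the finished image was stored
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```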

Generated images are stored in the cloud in different versions, and references between cast and scene images are retained. Users can upload their own images, which are treated like generated assets and can be used as a reference for subsequent generations.


Limitations

Modern AI image models offer enormous technological potential. However, there are clear limitations and difficulties when it comes to their productive use in our application.

The core problem remains consistency: more complex changes in perspective, such as when a character is seen from the front in one scene and from the side in the next, push current models to their limits. Backgrounds are also difficult to transfer via reference images, although the strong consistency mechanisms in our workflow can keep spaces consistent across a few scenes. The longer the desired sequence, however, the more likely the image models are to produce noticeable artefacts.

Not every use case is a good fit. The system works well when no real, identifiable people need to be depicted and the visual relationships remain manageable.


Outlook

Together with our users, we are developing additional workflows and enhancing the existing system to make it more agent-based. The system should be able to respond independently to feedback and improve images iteratively. We are testing ‘LLM-as-a-judge’ approaches, which use the image comprehension capabilities of multimodal language models to automatically detect artefacts and inconsistencies.
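One way such a check might look, using a multimodal model through LangChain; the rubric and model choice are purely illustrative:

```python
import base64
from langchain.chat_models import init_chat_model
from langchain_core.messages import HumanMessage

judge = init_chat_model("openai:gpt-4o")  # any multimodal model works here

def judge_image(image_path: str, scene_prompt: str) -> str:
    """Ask a multimodal LLM to review a generated image against its prompt."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    message = HumanMessage(content=[
        {"type": "text", "text": (
            "You are reviewing an AI-generated storyboard image. "
            f"Scene prompt: {scene_prompt}\n"
            "List any artefacts (extra fingers, warped limbs, garbled text) "
            "and inconsistencies with the prompt. Answer PASS if none are found."
        )},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
    ])
    return judge.invoke([message]).content
```

The verdict can then drive an automatic regeneration loop instead of requiring a human to inspect every image.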

Another exciting development for our use case is AI-based video generation. The boundaries of what is possible are shifting rapidly in the field of video modelling. Current models still struggle with the typical challenges of image generation, and moving images require even greater consistency across many frames. We are continuously evaluating the latest models and are optimistic that automated video generation will soon be feasible.

Additionally, image models offer other interesting applications, such as the automated creation of infographics and charts with consistent illustrations, product visualisation for e-commerce — for instance, displaying furniture in various interior styles or clothing in different settings — and the generation of consistent social media content to ensure a uniform brand aesthetic across multiple posts.

Authors

Mats Faulborn, Data Scieneer at scieneers GmbH

mats.faulborn@scieneers.de

Richard Naab, Data Scieneer at scieneers GmbH

richard.naab@scieneers.de