Coding, prompting, “deprompting”: testing AI-generated images against the “broken telephone” game, based on images from the Ad Library Political Observatory

Team Members

Facilitator: João Guilherme Bastos dos Santos

Participants: Matilde Ficozzi, Carlo De Gaetano, Christian Bitar, Omer Rothenstein, Federica Bardelli, Carla D’Antonio, Quyang (July) Zhao, Piyush Aggarwal, Guillén Torres

1. Introduction

Prompt design can be understood similarly to query design: a prompt aimed at understanding cultural dimensions in visual research forces the AI to make sense of ambiguity, which in turn reveals something about its training dataset and biases (Niederer, S. and Colombo, G., ‘Visual Methods for Digital Research’). This moves away from prompt engineering and from efforts to craft “code-like” prompts that suppress ambiguity and prevent the AI from bringing biases to the surface. Our idea is to keep the ambiguity present in the image descriptions, as in natural language, and let it flow through the different stages (degrees) of the broken telephone dynamic. This way we have less control over the result, or over the selection of an ideal result, and more questions about the dynamics implicit in the biases present in the results obtained.

Unlike textual or mathematical results, where prompt chains or asking the AI to explain how it reached a result might be enough, images and AI-assisted visual methods demand new approaches. Exploring and developing such an approach is the main goal of this research project, which is particularly interested in possible biases and unexplored patterns in AI’s image affordances.

How can we detect small biases in the way AI describes images and creates images based on descriptions? What exactly do the words an AI writes when describing an image stand for? When it detects a ‘human’ or ‘science’, for example, what elements or archetypes remain invisible between the prompt and the image created or described?

Turning an AI’s image description into a new image could give us a glimpse behind the scenes. In the broken telephone game, small misperceptions between telling and hearing, coding and decoding, produce big divergences in the final result, and the cultural factors in between have been widely studied. To amplify and understand possible biases, we can check how this new image is described by the AI, starting a broken telephone cycle. This process could shed light not just on the gap between an AI’s image description and its capacity to reconstruct images using that description as part of a prompt, but also on biases and patterns in AI image description and in image creation based on descriptions.

This is in line with previous projects on image clustering and image prompt analysis (see reference links), and with questions such as the identification of AI image biases, cross-model analysis, reverse engineering through prompts, image clustering, and the analysis of large datasets of images from online image- and video-based platforms.

The experiment becomes even more relevant in light of results from recent studies (Shumailov et al., 2024) showing that AI models trained on AI-generated data will eventually collapse.

To frame this analysis, the proposal from Munn, Magee and Arora (2023) titled Unmaking AI Imagemaking introduces three methodological approaches for investigating AI image models: Unmaking the ecosystem, Unmaking the data and Unmaking the outputs.

First, the idea of an ecosystem is used by these authors to describe the socio-technical implications that surround AI models: the place where they were developed; their owners, partners, or supporters; and these actors’ interests, goals, and impositions. “Research has already identified how these image models internalize toxic stereotypes (Birnhane 2021) and reproduce forms of gendered and ethnic bias (Luccioni 2023), to name just two issues” (Munn et al., 2023, p. 2).

There are also differences between the models that currently dominate the market. Although Stable Diffusion seems to be the most open because of its origins, when working with images in this model, biases surface even more quickly than in other models. “In this framing, Stable Diffusion becomes an internet-based tool, which can be used and abused by “the people,” rather than a corporate product, where responsibility is clear, quality must be ensured, and toxicity must be mitigated” (Munn et al., 2023, p. 5).

To unmake the data, it is important to ask ourselves about the source of the data used and the interests behind its extraction. According to the description of the project “Creating an Ad Library Political Observatory”: “This project aims to explore diverse approaches to analyze and visualize the data from Meta’s ad library, which includes Instagram, Facebook, and other Meta products, using LLMs. The ultimate goal is to enhance the Ad Library Political Observatory, a tool we are developing to monitor Meta’s ad business.” That is to say, the images were taken from political advertising on Facebook, as part of an observation process that seeks to make evident the investments in advertising around politics. These are prepared images in terms of what is seen in the background, the position and posture of the characters, and the visible objects; in general, we could say we are dealing with staged images. This matters because the initial material that the AI describes is already a representation, a visual creation.

To unmake the outputs, we explore the findings of the process later in this document.

2. Research Questions

RQ1 - If we give a prompt to an AI model to “tell” (describe) what is in an image, could the same AI or another model then reproduce the image accurately based solely on that description (the “deprompt”)?

RQ2 - Could the successive degrees of the “broken telephone” game amplify biases in AI image description and creation? What about across AI models?

RQ3 - Could the previous results help us understand invisible dynamics in the gap between prompting and “deprompting”? Could they help us understand the potential and limitations of new digital methods to analyse images and videos on platforms such as Facebook? How could they complement computer vision and image clustering methods?

RQ4 - Would clustering the images at the end of the process group together images that share the same original source image? Or would images from the same AI model group together regardless of the source image?

3. Methodology and initial datasets

We collaborated with the Ad Library Political Observatory, facilitated by Guillén Torres and Piyush Aggarwal, who collected datasets for the following countries around their elections (or debates):

  • U.S.

  • India

  • Mexico

  • Netherlands (European Election)

We worked only with the U.S. and India for the broken telephone tests.

The datasets include the following information, among many other data points:

Ad creation time, Ad creatives, Ad link captions, Ad link descriptions, Ad link titles, Ad start/stop time, Ad URL, Ad gender and age breakdown, Ad payers, Byline, Ad region breakdown, Demographic distribution, Audience size, Reach, Impressions, Spend, etc.

We also collected the images included in the advertisements, since our goal is to explore visual methodologies to get insights from them.

The two thousand images with the largest audience in each country were registered, and those available for download were clustered and used as the initial dataset for the broken telephone process.
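As a rough illustration of this selection step, the sketch below assumes a tabular export with hypothetical column names (country, estimated_audience_size, image_url, ad_id); the actual Ad Library export format may differ.

```python
# Minimal sketch of the seed-image selection, assuming a CSV export of the
# Ad Library data with hypothetical column names; the real schema may differ.
import pandas as pd

ads = pd.read_csv("ad_library_export.csv")

# Keep the ~2,000 ads with the largest audience per country
top_per_country = (
    ads.sort_values("estimated_audience_size", ascending=False)
       .groupby("country")
       .head(2000)
)

# Only ads whose creative image is still available for download can be used
seed_candidates = top_per_country.dropna(subset=["image_url"])
seed_candidates[["ad_id", "country", "image_url"]].to_csv(
    "seed_candidates.csv", index=False
)
```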

ImageSorter was selected as the clustering tool, and the process of browsing through the images began in order to select representatives to be used in the prompting process. Criteria for the selection were not precisely stated, as this was an exploratory phase of the research; the only requirement was that the selected images contain “interesting” subjects that might trigger some bias in the AIs.

Two approaches were carried out in parallel. The first was to check the images manually by browsing through ImageSorter, looking for the most relevant or interesting ones.

The second approach was to use ChatGPT to cluster the images by subject. A zip folder with all the images was uploaded, but the AI could not carry out the request to cluster and describe the clusters because the folder contained too many images. To work around this, it suggested uploading a batch of just 20 images. It then managed to create clusters and describe the images accurately, but these clusters were not necessarily meaningful.

Another attempt was made by giving ChatGPT a single image and asking it to describe it: ChatGPT succeeded and gave an accurate description of the image. When asked to create another image based on that description, it also succeeded. NOTE: the original image contained a portion of a political party symbol. When asked to create a new image, the result was a much more exaggerated representation of the subject, spreading the symbol all over the image.

After trying these two approaches, the final decision was to use the hand-picked images, as they are much more representative and intentionally ambiguous. The objective was to find ambiguity and then force the AI to disambiguate.

The images were clustered, after which a few clusters were selected for further analysis, and then images were selected as representatives of each cluster. ChatGPT was asked to describe each of the representative images, with that description used as the source for the first round of Gen-AI images. The aim was to use the AI’s image description as a basis for generating images, both between platforms (to check interoperability and specificities, consistency between models) and within the same AI (bi-directionality, reliability). Image descriptions from a single AI were tested in many different generators.

Models used:

We used ChatGPT for all image descriptions, and DALL-E, Midjourney, and Dream Studio (Stable Diffusion) for image generation. With these models we created chains of images and descriptions, each starting from a seed image and its description given by ChatGPT.
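The idea of testing a single description across platforms can be sketched as a simple fan-out. The sketch below is a minimal illustration that assumes the OpenAI Python SDK for the DALL-E step; Midjourney and Dream Studio were operated through their own interfaces, so they appear only as hypothetical placeholder functions.

```python
# Sketch: send one ChatGPT-produced description to several image generators.
# Only the DALL-E call uses a real API; the other generators were used
# manually through their web interfaces and are placeholders here.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_with_dalle(description: str) -> str:
    """Generate an image from the description and return its (temporary) URL."""
    result = client.images.generate(
        model="dall-e-3",
        prompt=f"Create an image based on this description: {description}",
        n=1,
        size="1024x1024",
    )
    return result.data[0].url

def generate_with_midjourney(description: str) -> str:
    # Done by hand in the Midjourney interface; placeholder only.
    raise NotImplementedError("Run manually and save the resulting image.")

def generate_with_dreamstudio(description: str) -> str:
    # Done by hand in Dream Studio (Stable Diffusion); placeholder only.
    raise NotImplementedError("Run manually and save the resulting image.")

GENERATORS = {
    "dalle": generate_with_dalle,
    "midjourney": generate_with_midjourney,
    "dreamstudio": generate_with_dreamstudio,
}
```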

This is a mixed-methods approach, combining Facebook ads data, clustering to prioritize images, and AI resources to describe and re-create images (initially focused on DALL-E and Midjourney, later expanded to Stable Diffusion), based on the prompts “Describe this image” (always in ChatGPT) and “Create an image based on this description:” (using ChatGPT and other models).

The protocol was thus to create an image from a prompt, then to “deprompt” each selected image by asking ChatGPT to “describe this image”, and then to use that description as a new prompt. Starting from the descriptions of the selected images, we repeated the process five times for each chosen seed image. It was necessary to create a new chat for each step to prevent the AI from simply copying the image used as input. In some instances, ChatGPT blocked the process after a few tries because of “guidelines violations”; when asked to explain what constituted those violations, it explained that references to real people and to a real, active political party were not allowed. It is unclear (in the case of DALL-E) why it did not block the process from the beginning.
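For the fully automatable ChatGPT and DALL-E chain, this protocol can be sketched roughly as follows. This is only a sketch assuming the OpenAI Python SDK; each stage is an independent request, mirroring the “new chat for each step” rule, and chains involving Midjourney or Dream Studio were run by hand.

```python
# Minimal sketch of one broken telephone chain using only OpenAI models.
# Each stage is an independent API call, mirroring the "new chat per step"
# rule used in the manual protocol. Function names are illustrative.
from openai import OpenAI

client = OpenAI()

def describe_image(image_url: str) -> str:
    """Prompt 1: 'Describe this image' (always ChatGPT)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content

def create_image(description: str) -> str:
    """Prompt 2: 'Create an image based on this description:'."""
    result = client.images.generate(
        model="dall-e-3",
        prompt=f"Create an image based on this description: {description}",
        n=1,
        size="1024x1024",
    )
    return result.data[0].url

def broken_telephone(seed_image_url: str, stages: int = 5) -> list[dict]:
    """Alternate describe/create for a fixed number of stages."""
    chain = []
    image_url = seed_image_url
    for stage in range(1, stages + 1):
        description = describe_image(image_url)   # "deprompt"
        image_url = create_image(description)     # new prompt
        chain.append({"stage": stage,
                      "description": description,
                      "image_url": image_url})
    return chain
```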

4. Findings

RQ1 - If we give a prompt to an AI model to “tell” (describe) what is in an image, could the same AI or another model then reproduce the image accurately based solely on that description (the “deprompt”)?

A1 - It happened a few times (so it is possible), but most of the time it did not. OpenAI (ChatGPT and DALL-E) was not able to reproduce the images it was prompted with according to its own descriptions and was usually less stable across “deprompts” (over-amplifying situations) than the other models, though it was more accurate when generating images related to actual political campaigns, politicians, and symbols. The broken telephone appears to have caused a sort of feedback loop in which certain words used to describe an image led the image-generating AI to amplify certain aspects of the original image, which would, in turn, be featured more prominently in subsequent generated images, sometimes increasing the number of people and objects exponentially.

RQ2 - Could the successive degrees of the “broken telephone” game amplify biases in AI image description and creation? What about across AI models?

A2 - Yes, certain biases of both the generative text model (ChatGPT) and the generative image models emerged in different aspects of the generation process. These are described case by case in the Discussion section, touching on nationality, race, and gender in ways that were not evident in the first stage of the process.

RQ3 - Could the previous results help us understand invisible dynamics in the gap between prompting and “deprompting”? Could they help us understand the potential and limitations of new digital methods to analyse images and videos on platforms such as Facebook? How could they complement computer vision and image clustering methods?

A3 - The results gave us a glimpse into patterns unfolding progressively throughout the broken telephone, such as giving an ever more central role to specific figures in the image, making protests bigger and bigger, and increasing the amount of skin shown when depicting a female character, among others (again, presented case by case in the Discussion section). These tendencies highlight the need for more studies on biases before applying AI description or image generation as a source for understanding data about political events. Computer vision could help us understand which elements trigger feedback loops when present in the generated images, opening space for complementary use.

RQ4 - Would clustering the images at the end of the process group together images that share the same original source image? Or would images from the same AI model group together regardless of the source image?

A4 - Given the differences in style and colour in the way different AIs portray the same prompt (rendering a “notable figure” as a politician or as a religious figure brought different clothes and colour palettes, for example), images coming from the same original description do not necessarily group together. Images coming from the same generative AI at different stages of the broken telephone, however, do share similar palettes and tend to group together. This specific question needs further exploration to test different kinds of image embeddings (we used only ImageSorterV4 for these tests).
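For such further exploration, the clustering could also be reproduced programmatically with a different embedding, for example CLIP embeddings via the sentence-transformers library. The sketch below is only an illustration under that assumption (ImageSorterV4 remains the tool actually used here); the file layout and cluster count are hypothetical choices.

```python
# Sketch: cluster final-stage images with CLIP embeddings instead of
# ImageSorter, to test whether grouping follows the source image or the
# generating model. File layout and cluster count are assumptions.
from pathlib import Path

from PIL import Image
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("clip-ViT-B-32")  # CLIP image encoder

paths = sorted(Path("final_stage_images").glob("*.png"))
embeddings = model.encode([Image.open(p) for p in paths])

# One cluster per seed image (nine seeds in this study), to check whether
# chains from the same seed land together or split by generating model.
labels = KMeans(n_clusters=9, random_state=0).fit_predict(embeddings)

for path, label in zip(paths, labels):
    print(label, path.name)
```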

5. Discussion

We now explore the different results of the broken telephone process when the very same ChatGPT description of an image from the Meta Ad Library is used as the starting point for the broken telephone process in DALL-E, Midjourney, and Dream Studio, all going back to ChatGPT for new descriptions at each stage. We will go through nine images (five from India and four from the United States of America). The Indian ones are a street market trading scene (Meta Ad Library id 908752667667552), BJP political campaigns (3673711239611791 and 741964741427730), and problematic political cartoons (327977776661801 and 345472831844966); the US ones are pictures of people talking or reading about sensitive topics (1097954751485997 and 1175294460274614) and protests of people holding signs (793416362694934 and 1414979136072795). The analysis starts with the original description and image (equal for all models), referred to as stage 1; stage 2 consists of the description of the previous stage’s image and the new image resulting from using it as a prompt, this new image is described in stage 3, and so on for the following stages. To enable further exploration and reproducibility, mentions of images and descriptions will follow the pattern (MODEL, Meta Library Id, stage), noting that images result from the different models indicated, but descriptions are always from ChatGPT. The team gave provisional names to each broken telephone process, and these are the names before the Ad Id in each topic.
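To make this referencing pattern explicit, each step of a chain could be recorded along the following lines; the field names below are illustrative and not part of the original protocol.

```python
# Illustrative record for one step of a broken telephone chain, following
# the (MODEL, Meta Library Id, stage) referencing pattern used below.
from dataclasses import dataclass

@dataclass
class ChainStep:
    model: str        # image generator: "DALL-E", "Midjourney" or "Dream Studio"
    ad_id: str        # Meta Ad Library id of the seed image
    stage: int        # 1 = original description/image, 2+ = regenerated stages
    description: str  # ChatGPT description produced at this stage
    image_path: str   # local path of the image generated from that description

# e.g. the street market chain in DALL-E at its third stage:
example = ChainStep("DALL-E", "908752667667552", 3, "…",
                    "images/dalle_908752667667552_3.png")
```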

By creating an image, the AI gives meaning to what was previously ambiguous: a person acquires traits of gender, colour, ethnicity, face, clothing, and expression. Each iteration causes these initial assumptions to be confirmed in the images created and potentially to appear in the new ChatGPT descriptions, opening the way for the initial factors to be extrapolated, if there is a tendency to do so, and for other ambiguous factors to be filled in with information. This can happen because of a model’s tendency to portray groups or scenarios in a specific way, because of each model’s sensitivity to specific elements working as contextual markers when defining what is most important in the prompts, or because of incompatibilities between a model’s images and ChatGPT’s ability to describe them, among other reasons.

“Spices and Bread” (908752667667552)