IntroStyle: Training-Free Introspective Style Attribution using Diffusion Features

ICCV 2025

University of California, San Diego
IntroStyle teaser image.

IntroStyle is a metric for style measurement. The top two rows show retrieval results, with green marking correct and red marking incorrect retrievals. The bottom row shows a ranking in which images further in style from the reference in the first column receive lower scores.

Retrieval Results

The leftmost column is the query image; the remaining columns show retrieval results, with green marking correct and red marking incorrect retrievals.

Abstract

Text-to-image (T2I) models have recently gained widespread adoption. This has spurred concerns about safeguarding intellectual property rights and an increasing demand for mechanisms that prevent the generation of specific artistic styles. Existing methods for style extraction typically necessitate the collection of custom datasets and the training of specialized models. This, however, is resource-intensive, time-consuming, and often impractical for real-time applications.

We present a novel, training-free framework to solve the style attribution problem, using the features produced by a diffusion model alone, without any external modules or retraining. This is denoted as Introspective Style Attribution (IntroStyle) and is shown to have superior performance to state-of-the-art models for style attribution. We also introduce a synthetic dataset, Artistic Style Split (ArtSplit), to isolate artistic style and evaluate fine-grained style attribution performance. Our experimental results on WikiArt and DomainNet datasets show that IntroStyle is robust to the dynamic nature of artistic styles, outperforming existing methods by a wide margin.

Approach

What is IntroStyle?

IntroStyle is a method that helps AI understand and compare artistic styles of images. It does this by extracting special features (called "style features") from each image using a pre-trained diffusion model — a type of powerful image AI.

How does it work?

Here's a breakdown:

  • First, the input image is converted into a compressed form called a latent vector using the model's encoder.
  • A bit of random noise is added to this latent vector — like shaking up the image slightly — at a certain time step t.
  • This noisy latent is then passed through the model's denoising network, but with no text prompt attached (so it's only focused on the image itself).
  • From a specific part of the network (an “up-block” layer), we pull out a tensor of features: think of it as a 3D grid that holds important information about the image's style (see the sketch after this list).
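To make this concrete, here is a minimal sketch of the feature-extraction step using the Hugging Face diffusers library. The time step t=250 and the up-block index are illustrative placeholders rather than the paper's tuned settings, and extract_features is a name introduced here for exposition only.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1"
).to("cuda")

@torch.no_grad()
def extract_features(image, t=250, block_idx=1):
    # image: preprocessed tensor of shape (1, 3, H, W), scaled to [-1, 1].
    # 1. Encode the image into the VAE latent space.
    latents = pipe.vae.encode(image.to(pipe.device)).latent_dist.mean
    latents = latents * pipe.vae.config.scaling_factor

    # 2. Add Gaussian noise corresponding to time step t.
    timestep = torch.tensor([t], device=pipe.device)
    noisy = pipe.scheduler.add_noise(latents, torch.randn_like(latents), timestep)

    # 3. One denoising pass with an empty (null) text prompt, hooking the
    #    chosen up-block to capture its output features.
    feats = {}
    def hook(_module, _inputs, output):
        feats["f"] = output[0] if isinstance(output, tuple) else output
    handle = pipe.unet.up_blocks[block_idx].register_forward_hook(hook)

    tokens = pipe.tokenizer(
        "", padding="max_length",
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt"
    ).input_ids.to(pipe.device)
    null_embed = pipe.text_encoder(tokens)[0]

    pipe.unet(noisy, timestep, encoder_hidden_states=null_embed)
    handle.remove()

    # feats["f"] is a (1, C, H', W') tensor of intermediate features.
    return feats["f"]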

How is style represented?

From that tensor, we calculate two things for each channel (kind of like a color or feature layer):

  • Mean (μ_c): The average value
  • Variance (σ²_c): How much the values vary

These numbers together form the IntroStyle feature representation for that image. It's like a fingerprint of the image's style.
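As a sketch, this amounts to two reductions over the spatial dimensions of the feature tensor returned above (style_stats is a name used here for exposition):

def style_stats(features):
    # features: (B, C, H, W) tensor taken from the up-block.
    mu = features.mean(dim=(2, 3))   # (B, C): channel-wise means
    var = features.var(dim=(2, 3))   # (B, C): channel-wise variances
    return mu, var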

How are styles compared?

To see how similar two styles are, we use the 2-Wasserstein distance, a metric that measures how different the two style fingerprints are. The smaller the distance, the more similar the styles.
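Treating each fingerprint as a Gaussian with the per-channel means and variances above (a diagonal-covariance reading used here for illustration; the paper's exact formulation may differ), the 2-Wasserstein distance has a simple closed form:

def wasserstein2(mu1, var1, mu2, var2):
    # Closed form for diagonal Gaussians:
    # W2^2 = ||mu1 - mu2||^2 + ||sigma1 - sigma2||^2
    s1, s2 = var1.sqrt(), var2.sqrt()
    d2 = ((mu1 - mu2) ** 2).sum(-1) + ((s1 - s2) ** 2).sum(-1)
    return d2.sqrt()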

Why is this useful?

Despite being a simple method, IntroStyle works very well for tasks where the AI needs to identify or compare artistic styles between images.

IntroStyle architecture image.
IntroStyle computation of style similarity: channel-wise mean µ and variance σ² are computed for the identified style layers. Then a distance metric, the 2-Wasserstein distance, is used to measure style similarity between a pair of images.

Artistic Style Split (ArtSplit) Dataset

The Problem

Current datasets aren't good enough for testing how well an AI can recognize and match artistic styles (like the brushstroke or color technique of Van Gogh or Picasso). These datasets often mix up the content (what's shown) with the style (how it's shown), making it hard to evaluate style understanding properly.

Our Solution

We made a new dataset called ArtSplit. Here's how we built it:

  • Start with famous artists: We selected 50 well-known artists and chose 2 famous artworks from each, making a total of 100 paintings.
  • Describe each painting two ways: For every painting, we asked ChatGPT-4o to generate:
    • A style description (e.g., "thick brushstrokes, bright swirling colors")
    • A semantic description (e.g., "a starry night sky over a village"), with no style information
  • Generate new images: We used Stable Diffusion v2.1 (an image generator) with the style and semantic prompts combined to create new images (see the sketch after this list).
  • Do this a lot:
    • 50 artists x 100 prompt combinations = 5,000 pairs
    • For each pair, generate 12 images
    • Total = 60,000 images
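A rough sketch of this generation loop with diffusers follows; the prompt template, sampler defaults, and seeding are illustrative assumptions, not the exact ArtSplit recipe:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1"
).to("cuda")

def generate_images(style_desc, semantic_desc, n_images=12, seed=0):
    # Combine the semantic description with the style description
    # (the exact template used for ArtSplit may differ).
    prompt = f"{semantic_desc}, {style_desc}"
    gen = torch.Generator("cuda").manual_seed(seed)
    return pipe(prompt, num_images_per_prompt=n_images, generator=gen).images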

Why is this cool?

This new dataset helps researchers clearly test how well AI understands style (independent of content), because the style and content are deliberately kept separate in the prompts.

ArtSplit dataset image.
Artistic Style Split (ArtSplit) Dataset samples. Each row shows images generated with the same style, and each column with the same semantics.

Comparison with state-of-the-art models

We compare our retrieval method with state-of-the-art models for style attribution, and use the ArtSplit dataset to evaluate it against these baselines. The results show that our method outperforms the baselines on fine-grained style attribution tasks.
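For reference, here is a minimal sketch of how a gallery can be ranked by style similarity to a query image, reusing the illustrative helpers sketched above (extract_features, style_stats, and wasserstein2 are names introduced on this page, not a released API):

import torch

@torch.no_grad()
def rank_by_style(query_image, gallery_images):
    mu_q, var_q = style_stats(extract_features(query_image))
    scores = []
    for img in gallery_images:
        mu_g, var_g = style_stats(extract_features(img))
        scores.append(wasserstein2(mu_q, var_q, mu_g, var_g).item())
    # A smaller 2-Wasserstein distance means a more similar style.
    return sorted(range(len(gallery_images)), key=lambda i: scores[i])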
Retrieval on WikiArt.
WikiArt retrieval results.
Retrieval on ArtSplit.
Style-based Artistic Style Split (ArtSplit) retrieval results. We show images ranked for a fixed semantic prompt, isolating stylistic variations.
Retrieval on ArtSplit.
Semantic-based Artistic Style Split (ArtSplit) retrieval results. The results suggest that our retrieval emphasizes styles rather than semantic content.

BibTeX

@InProceedings{kumar2025introstyle,
  author    = {Kumar, Anand and Mu, Jiteng and Vasconcelos, Nuno},
  title     = {IntroStyle: Training-Free Introspective Style Attribution using Diffusion Features},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025},
}