IntroStyle

Training-Free Introspective Style Attribution using Diffusion Features

ICCV 2025

University of California, San Diego
IntroStyle teaser image.

IntroStyle is a metric for style measurement. The top two rows show retrieval results, with green marking correct and red marking incorrect retrievals. The bottom row shows ranking, with lower scores for images further away in style from the reference in the first column.

Retrieval Results

The leftmost column is the query image; the remaining columns show retrieval results, with green marking correct and red marking incorrect retrievals.

Abstract

Text-to-image (T2I) models have gained widespread adoption among content creators and the general public. There is an increasing demand for T2I models to incorporate mechanisms that prevent the generation of specific artistic styles, thereby safeguarding intellectual property rights. Existing methods for style extraction typically necessitate the collection of custom datasets and the training of specialized models. This, however, is resource-intensive, time-consuming, and often impractical for real-time applications. We present a novel, training-free framework to solve the style attribution problem, using the features produced by a diffusion model alone, without any external modules or retraining.

This is denoted as Introspective Style Attribution (IntroStyle) and is shown to have superior performance to state-of-the-art models for style attribution. We also introduce a synthetic Artistic Style Split (ArtSplit) dataset to isolate artistic style and evaluate fine-grained style attribution performance. Our experimental results show that our method adequately addresses the dynamic nature of artistic styles and the rapidly evolving landscape of digital art with no training overhead.

Approach

Our IntroStyle approach leverages a pre-trained diffusion model for extracting style features. We encode the input image into a latent vector using the diffusion model's encoder, noise this latent to a specific timestep t, and pass the noised latent through the denoising network with a null text embedding. We then extract a feature tensor from an intermediate layer of the network, specifically from an up-block. We compute the channel-wise mean μ_c and variance σ_c² for each channel c of this feature tensor. These statistics form our IntroStyle feature representation: f_{t,idx}(I) = (μ_1, ..., μ_C, σ_1², ..., σ_C²)^T. To compare styles between images, we use the 2-Wasserstein distance between their IntroStyle representations. This simple approach proves remarkably effective for style attribution tasks.
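The sketch below illustrates this feature-extraction step, assuming a Stable Diffusion 2.1 backbone accessed through the diffusers library; the up-block index and noise timestep shown are illustrative assumptions, not necessarily the exact values used in the paper.

import torch
from diffusers import StableDiffusionPipeline

# Illustrative choices: SD 2.1 backbone, features from up_blocks[1], timestep t = 200.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

@torch.no_grad()
def introstyle_features(image, t=200, block_idx=1):
    """Return channel-wise (mean, std) of an intermediate UNet up-block."""
    # 1. Encode the image into the VAE latent space.
    x = pipe.image_processor.preprocess(image).to("cuda", torch.float16)
    latents = pipe.vae.encode(x).latent_dist.sample() * pipe.vae.config.scaling_factor

    # 2. Noise the latent to timestep t with the pipeline's scheduler.
    timestep = torch.tensor([t], device="cuda")
    noise = torch.randn_like(latents)
    noisy = pipe.scheduler.add_noise(latents, noise, timestep)

    # 3. Null (empty) text embedding as the conditioning input.
    tokens = pipe.tokenizer(
        "", padding="max_length",
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
    ).input_ids.to("cuda")
    null_emb = pipe.text_encoder(tokens)[0]

    # 4. Capture the output of the chosen up-block with a forward hook.
    captured = {}
    def hook(module, inputs, output):
        captured["feat"] = output
    handle = pipe.unet.up_blocks[block_idx].register_forward_hook(hook)
    pipe.unet(noisy, timestep, encoder_hidden_states=null_emb)
    handle.remove()

    # 5. Channel-wise statistics over the spatial dimensions.
    feat = captured["feat"].float()           # shape (1, C, H, W)
    mu = feat.mean(dim=(2, 3)).squeeze(0)     # (C,)
    sigma = feat.std(dim=(2, 3)).squeeze(0)   # (C,)
    return mu, sigma

# Example (hypothetical): mu, sigma = introstyle_features(Image.open("query.jpg").convert("RGB"))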
IntroStyle architecture image.
IntroStyle computation of style similarity: channel-wise mean μ and variance σ² are computed for the identified style layers. A distance metric, the 2-Wasserstein distance, is then used to measure style similarity between a pair of images.
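Since each representation is the channel-wise mean and variance of a feature map, the 2-Wasserstein distance has a simple closed form when the channels are treated as a diagonal Gaussian. A minimal sketch, reusing the introstyle_features helper defined above:

import torch

def introstyle_distance(img_a, img_b):
    """Squared 2-Wasserstein distance between the channel-wise Gaussian
    statistics of two images (diagonal-covariance closed form)."""
    mu_a, sigma_a = introstyle_features(img_a)
    mu_b, sigma_b = introstyle_features(img_b)
    # W2^2 = ||mu_a - mu_b||^2 + ||sigma_a - sigma_b||^2 for diagonal Gaussians.
    return ((mu_a - mu_b) ** 2).sum() + ((sigma_a - sigma_b) ** 2).sum()

Lower distances indicate images that are closer in style; retrieval amounts to ranking a database by this distance to the query.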

Artistic Style Split (ArtSplit) Dataset

To address the limitations of existing datasets for fine-grained evaluation of style retrieval, we propose the Artistic Style Split (ArtSplit) dataset. It was created from the prompt-image pairs of the two most recognized works of 50 prominent artists in the LAION Aesthetic Dataset. For each painting, ChatGPT-4o was asked to generate a "style" specification and a "semantic" description, such that the semantic description contains no style information and vice-versa. Stable Diffusion v2.1 was then used with a combination of two prompts, "style" and "semantic," to synthesize a reference image dataset. With 50 artists and 100 paintings, this led to 50 × 100 = 5,000 prompt combinations. A set of 12 images was sampled per combination, yielding 60,000 images in total. The procedure is detailed in Supplemental Section 4.
ArtSplit dataset image.
Artistic Style Split (ArtSplit) Dataset samples. Each row shows images generated with the same style, and each column with the same semantics.
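A minimal sketch of this generation loop, assuming Stable Diffusion 2.1 via the diffusers library; the prompt texts, the way the two prompts are combined, and the file naming below are illustrative placeholders rather than the exact procedure used to build ArtSplit (see Supplemental Section 4).

import itertools
from pathlib import Path
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# Illustrative placeholders: ArtSplit uses 50 artist "style" specifications and
# 100 per-painting "semantic" descriptions generated by ChatGPT-4o.
style_prompts = {
    "artist_01": "post-impressionist oil painting with thick, swirling brushstrokes",
}
semantic_prompts = {
    "scene_001": "a quiet village beneath a star-filled night sky",
}

Path("artsplit").mkdir(exist_ok=True)
for (style_id, style), (sem_id, sem) in itertools.product(
    style_prompts.items(), semantic_prompts.items()
):
    # Hypothetical template for combining the semantic and style prompts.
    prompt = f"{sem}, {style}"
    # 12 samples per style/semantic combination, as in the dataset construction
    # (smaller batches may be needed depending on GPU memory).
    images = pipe(prompt, num_images_per_prompt=12).images
    for k, im in enumerate(images):
        im.save(f"artsplit/{style_id}_{sem_id}_{k:02d}.png")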

Comparison with state-of-the-art models

We compare our retrieval method with state-of-the-art models for style attribution. We also use the ArtSplit dataset to evaluate our method against the baseline models. The results show that our method outperforms the baselines on fine-grained style attribution tasks.
Retrieval on Wikiart.
Wikiart retrieval results.
Retrieval on ArtSplit.
Style-based Artistic Style Split (ArtSplit) retrieval results. We show the ranked images for a fixed semantic description, isolating stylistic variations.
Retrieval on ArtSplit.
Semantic-based Artistic Style Split (ArtSplit) retrieval results. The results suggest that our retrieval emphasizes styles rather than semantic content.

BibTeX

@article{kumar2024introstyle,
  author    = {Kumar, Anand and Mu, Jiteng and Vasconcelos, Nuno},
  title     = {IntroStyle: Training-Free Introspective Style Attribution using Diffusion Features},
  journal   = {arXiv preprint arXiv:2412.14432},
  year      = {2024},
}