Paper Summary: Method to Measure Benchmark Contamination
This is a summary and explanation of a paper that measures benchmark dataset contamination using kernel divergence scores: https://arxiv.org/pdf/2502.00678.
Benchmarks are a flawed estimate of a model’s real-world capabilities for a variety of reasons. One reason is benchmark dataset contamination: if part of a benchmark’s dataset overlaps with the training data, the model has effectively been trained to do well on those test items. The benchmark therefore fails to measure how well the model generalizes to examples it hasn’t seen.
This is an interesting paper that uses kernel divergence to estimate dataset contamination.
Aim
Given a dataset D and model M, the paper aims to create a dataset contamination score S(D,M). The contamination ratio is the proportion of the benchmark dataset that has been seen in training data. This score should be monotonic (datasets with higher contamination should have higher scores) and consistent (datasets with similar contamination should have similar scores).
Methodology
First, the researchers randomly sampled different subsets of the benchmark dataset. They mixed data that was known to be seen (contaminated) and unseen (uncontaminated) to create known contamination rates.
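As a rough sketch of this setup (the function and names here are my own illustration, not the paper’s code), building a subset with a known contamination rate could look like this:

import random

def make_subset(seen, unseen, size, contamination_rate, seed=0):
    # Mix examples known to be in the training data ("seen") with examples
    # that were not ("unseen") so the subset has a known contamination rate.
    rng = random.Random(seed)
    n_seen = round(size * contamination_rate)
    subset = rng.sample(seen, n_seen) + rng.sample(unseen, size - n_seen)
    rng.shuffle(subset)
    return subset

# Example: a 100-item subset in which 30% of items were seen during training.
seen = [f"seen example {i}" for i in range(500)]
unseen = [f"unseen example {i}" for i in range(500)]
subset = make_subset(seen, unseen, size=100, contamination_rate=0.3)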
The researchers pass the samples to the base model and compute each sample's embedding. They then construct a pair-wise matrix of similarities between sample embeddings, computed using a kernel. A kernel is a function that computes the similarity of two vectors x and y. For example, the commonly used cosine similarity normalizes x and y and then takes their dot product, giving the cosine of the angle between them. The researchers instead used the RBF kernel (though they show the finding is robust to the choice of kernel).
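For intuition, here is a small sketch (mine, not the paper’s code) of building the pair-wise RBF kernel matrix from a set of sample embeddings; the bandwidth sigma is an assumed hyperparameter:

import numpy as np

def rbf_kernel_matrix(embeddings, sigma=1.0):
    # Pair-wise RBF similarities: K[i, j] = exp(-||e_i - e_j||^2 / (2 * sigma^2)).
    sq_norms = np.sum(embeddings ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * embeddings @ embeddings.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative values from rounding
    return np.exp(-sq_dists / (2 * sigma ** 2))

# Stand-in embeddings: 5 samples, 8 dimensions (real ones would come from the model).
K_before = rbf_kernel_matrix(np.random.default_rng(0).normal(size=(5, 8)))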
[Figure: the pair-wise similarity matrix of sample embeddings before fine-tuning.]
Then, the researchers fine-tune the base model*, compute the embeddings again, and construct another pair-wise matrix of similarities between the sample embeddings.
They then compute a Kernel Divergence Score (KDS) for each benchmark-model combination. The KDS is defined as the negative of the KL divergence between the two pair-wise similarity matrices, one computed before fine-tuning and one after.
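As a rough illustration only (the normalization below is my own choice and may differ from the paper’s exact definition), one way to compute such a score is to turn each row of the two kernel matrices into a probability distribution, average the row-wise KL divergences, and negate:

import numpy as np

def kernel_divergence_score(K_before, K_after, eps=1e-12):
    # Illustrative KDS: negative KL divergence between the pair-wise similarity
    # matrices computed before and after fine-tuning.
    P = K_before / K_before.sum(axis=1, keepdims=True)  # each row becomes a distribution
    Q = K_after / K_after.sum(axis=1, keepdims=True)
    kl_per_row = np.sum(P * (np.log(P + eps) - np.log(Q + eps)), axis=1)
    return -np.mean(kl_per_row)  # bigger change after fine-tuning => lower KDS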
Their hypothesis was that if a benchmark is less contaminated, the base model won’t have been exposed to similar examples during training, so fine-tuning on the benchmark causes a greater update to the pair-wise similarity structure between sample embeddings (larger divergence = lower KDS). If a benchmark is more contaminated, fine-tuning changes the model less, so the pair-wise similarity structure stays more similar after fine-tuning (smaller divergence = higher KDS).
But why calculate the divergence in pair-wise relationships rather than how much each individual embedding moved? Remember that vector embeddings essentially store meaning in latent space. When a model learns new information (during fine-tuning or training), it changes the geometry of embeddings relative to one another, for example to encode that two examples are similar to each other.
Fine-tuning can also cause a global drift in the embedding space (e.g. rotating or translating the entire space) without the model learning anything new about the samples. This would still show up as a large movement of individual sample embeddings, even though no new information has been learned. Pair-wise relationships, however, are less sensitive to these kinds of global drift: a distance-based kernel like RBF is unchanged by rotations or translations of the whole space, and cosine similarity is unchanged by rotations or uniform scaling. Looking at the divergence between pair-wise similarity matrices therefore gets closer to measuring whether the model has actually internalized new information.
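Here is a quick numerical check of that intuition (again mine, not from the paper): rotating every embedding by the same orthogonal matrix moves each individual embedding a lot, but leaves the pair-wise RBF similarities essentially unchanged. The kernel function is repeated from the sketch above so the snippet runs on its own.

import numpy as np

def rbf_kernel_matrix(embeddings, sigma=1.0):
    sq_norms = np.sum(embeddings ** 2, axis=1)
    sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2 * embeddings @ embeddings.T, 0.0)
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 8))

# A random rotation (orthogonal matrix) applied to every embedding: a purely global drift.
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
rotated = embeddings @ Q

print(np.abs(rotated - embeddings).max())                                        # large: embeddings moved
print(np.abs(rbf_kernel_matrix(rotated) - rbf_kernel_matrix(embeddings)).max())  # ~0: similarities unchanged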
Specifically, the researchers hypothesized that unseen examples would be updated more by fine-tuning, so the pair-wise relationships among unseen examples would show greater divergence.
Results
The KDS was indeed highly correlated with benchmark contamination: subsets with a greater percentage of seen samples (higher contamination rates) had higher KDS. The scores were also consistent (subsets with similar percentages of contaminated data had similar scores). This is a neat result! The exact methodology would need to be modified to measure contamination on large benchmark datasets, because of the memory the pair-wise similarity matrices occupy (discussed more in the paper). And because the method requires fine-tuning, it can only be used on open-weights models. Finally, this method is a good rough proxy for how contaminated benchmarks are relative to each other, but it doesn't provide a reliable absolute measure of the contamination percentage.
*They do this using a process called Low-Rank Adaptation (LoRA). In normal supervised fine-tuning, you run the same training process you would normally run on a model, just on a narrower dataset: the model predicts the next token, you calculate the loss, and then you back-propagate through all the weights to update them. Because doing back-prop on all the weights is compute-intensive, researchers use LoRA, which freezes the original weights and only trains small low-rank matrices that are added to them.
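As a conceptual sketch (not the paper’s training code, and simplified relative to real LoRA implementations), a LoRA layer freezes the original weight matrix and learns a small low-rank correction on top of it:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # A frozen linear layer plus a trainable low-rank update: base(x) + (alpha/r) * B A x.
    def __init__(self, base_linear: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the original weights
        self.A = nn.Parameter(torch.randn(r, base_linear.in_features) * 0.01)  # low-rank factors:
        self.B = nn.Parameter(torch.zeros(base_linear.out_features, r))        # only these get trained
        self.scale = alpha / r

    def forward(self, x):
        # Original output plus the low-rank correction applied to x.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 512))  # only layer.A and layer.B would receive gradients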