
Paper Summary: Method to Measure Benchmark Contamination

This is a summary and explanation of a paper that measures benchmark dataset contamination using kernel divergence scores: https://arxiv.org/pdf/2502.00678. (Blog post still in progress; explanations need to be refined and expanded in places.)

Benchmarks are a flawed estimate of a model’s real-world capabilities for a variety of reasons. One reason is benchmark dataset contamination: if part of a benchmark’s dataset is included in the training data, the model will have been specifically trained to do well on those tests. The benchmark therefore fails to measure how well the model generalizes to examples it hasn’t seen. This interesting paper uses Kernel Divergence Scores to estimate the extent to which a benchmark dataset is contaminated.

This is the methodology the researchers used. First, the researchers randomly sampled different subsets of the benchmark dataset, mixing data that was known to be seen (contaminated) with data known to be unseen (uncontaminated) to create subsets with known contamination ratios.
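
As a rough illustration of that first step, here is a minimal sketch of how one might assemble subsets with a known contamination ratio. The function name, example pools, and parameters are my own for illustration, not from the paper.

```python
import random

def make_subset(seen_pool, unseen_pool, subset_size, contamination_ratio, seed=0):
    """Sample a benchmark subset with a known fraction of contaminated examples.

    seen_pool:   examples assumed to appear in the model's training data
    unseen_pool: examples assumed to be absent from the training data
    """
    rng = random.Random(seed)
    n_seen = round(subset_size * contamination_ratio)   # contaminated portion
    n_unseen = subset_size - n_seen                      # uncontaminated portion
    subset = rng.sample(seen_pool, n_seen) + rng.sample(unseen_pool, n_unseen)
    rng.shuffle(subset)  # interleave so order doesn't reveal contamination
    return subset

# Hypothetical pools; in practice these would be real benchmark examples.
seen = [f"seen_example_{i}" for i in range(100)]
unseen = [f"unseen_example_{i}" for i in range(100)]

# Build subsets at several known contamination levels.
for ratio in (0.0, 0.25, 0.5, 0.75, 1.0):
    subset = make_subset(seen, unseen, subset_size=40, contamination_ratio=ratio, seed=42)
    print(f"ratio={ratio:.2f}, subset size={len(subset)}")
```

Subsets constructed this way give ground-truth contamination levels against which a detection score can later be compared.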