Zheng Zhou1 Hongbo Zhao1 Guangliang Cheng2 Xiangtai Li3 Shuchang Lyu*1 Wenquan Feng1 Qi Zhao1
1Beihang University, 2University of Liverpool, 3Nanyang Technological University

* Corresponding Author

[Paper] [GitHub] [Distilled Dataset]

Abstract

Figure 1: Comparison of BACON and existing DD methods: (a) Traditional methods align gradients and distributions on original and synthetic datasets. (b) BACON transforms DD into a Bayesian optimization task, generating synthetic images using likelihood and prior probabilities.

Abstract: Dataset Distillation (DD) aims to distill the knowledge of an extensive dataset into a much smaller one while preserving test-set performance, thereby reducing storage costs and training expenses. However, existing methods are often computationally intensive and perform suboptimally on large datasets, in part because the field lacks a robust theoretical framework for analyzing the DD problem. To address these challenges, we propose the BAyesian optimal CONdensation framework (BACON), the first work to introduce a Bayesian theoretical framework to the DD literature. This framework provides theoretical support for improving DD performance. BACON formulates the DD problem as the minimization of an expected risk function over joint probability distributions within the Bayesian framework. By analyzing this expected risk function for optimal condensation, we derive a numerically feasible lower bound under specific assumptions, which yields an approximate solution for BACON. We validate BACON on several datasets and demonstrate superior performance over existing state-of-the-art methods. For instance, under the IPC-10 setting, BACON achieves a 3.46% accuracy gain over IDM on CIFAR-10 and a 3.10% gain on TinyImageNet. Extensive experiments further confirm the effectiveness of BACON and its seamless integration with existing methods, enhancing their performance on the DD task. Code and distilled datasets are available at BACON.



Motivation

Dataset Distillation (DD) aims to reduce dataset size while maintaining test-set performance, but existing methods are computationally intensive, often perform poorly on complex tasks and large-scale datasets, and lack a solid theoretical foundation.

To address these challenges, we introduce the Bayesian Optimal Condensation Framework (BACON). This is the first Bayesian approach to DD, providing a clear theoretical foundation and a numerically feasible solution. BACON formulates DD as a minimization problem within Bayesian joint distributions, significantly improving performance, especially on large datasets like CIFAR-10 and TinyImageNet.

Key Benefits of BACON:

- The first Bayesian theoretical framework for DD, giving the problem a principled formulation as the minimization of an expected risk over joint probability distributions.
- A numerically feasible lower bound on the expected risk, yielding an approximate and efficiently computable solution.
- Superior accuracy over state-of-the-art methods, e.g., a 3.46% gain over IDM on CIFAR-10 and a 3.10% gain on TinyImageNet under IPC-10.
- Seamless integration with existing DD methods, improving their performance.

Bayesian Optimal Condensation Framework

Figure 2: Illustration of BACON: A neural network produces output distributions for both the synthetic and real datasets. BACON formulates these distributions as a Bayesian optimal condensation risk function and derives its optimal solution using Bayesian principles.

As illustrated in Figure 2, BACON optimizes the dataset distillation task with a joint probability model grounded in Bayesian theory: it derives a condensation risk function whose minimizer defines the optimal synthetic dataset, and applies approximation techniques to compute the solution efficiently.


Theorem: Bayesian Optimal Condensation Risk Function

The optimal embedding feature of the synthetic image, $z_{\tilde{x}}^*$, can be computed as follows:

\[ z_{\tilde{x}}^* = \arg\max_{z_{\tilde{x}} \in \mathcal{D}_S} \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} \left[\log p(z_{\tilde{x}} \mid z_x) + \log p(z_x)\right] dz_x. \]

Proof: By applying Bayes' rule and Jensen's inequality, we derive the result as follows:

\[ z_{\tilde{x}}^* = argmax_{z_{\tilde{x}} \in \mathcal{D}_S} \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} p(z_x | z_{\tilde{x}}) dz_x \]

\[ = argmax_{z_{\tilde{x}} \in \mathcal{D}_S} \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} \frac{p(z_{\tilde{x}} | z_x) p(z_x)}{p(z_{\tilde{x}})} dz_x \]

\[ = argmax_{z_{\tilde{x}} \in \mathcal{D}_S} \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} \underbrace{p(z_{\tilde{x}} | z_x) p(z_x)}_{\text{Bayesian formula}} dz_x \]

\[ = argmax_{z_{\tilde{x}} \in \mathcal{D}_S} \left[ \log \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} p(z_{\tilde{x}} | z_x) p(z_x) dz_x \right] \]

\[ \geq argmax_{z_{\tilde{x}} \in \mathcal{D}_S} \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} \log \left[ p(z_{\tilde{x}} | z_x) p(z_x) \right] dz_x \quad \text{(by Jensen's inequality)} \]

\[ = argmax_{z_{\tilde{x}} \in \mathcal{D}_S} \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} \left[\log p(z_{\tilde{x}} | z_x) + \log p(z_x)\right] dz_x \]
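For readers who want the Jensen step spelled out (our own expansion, not taken verbatim from the paper): since \( \log \) is concave, Jensen's inequality gives \( \log \mathbb{E}[X] \geq \mathbb{E}[\log X] \). Writing the integral over the ball as an expectation under the uniform distribution \( u \) on \( \mathcal{B}(z_{\tilde{x}}, \epsilon) \) with volume \( |\mathcal{B}| \),

\[
\log \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} p(z_{\tilde{x}} \mid z_x)\, p(z_x)\, dz_x
= \log |\mathcal{B}| + \log \mathbb{E}_{z_x \sim u}\!\left[ p(z_{\tilde{x}} \mid z_x)\, p(z_x) \right]
\geq \log |\mathcal{B}| + \frac{1}{|\mathcal{B}|} \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} \log\left[ p(z_{\tilde{x}} \mid z_x)\, p(z_x) \right] dz_x,
\]

and since \( |\mathcal{B}| \) depends only on \( \epsilon \), the additive and multiplicative constants do not affect the maximizer, recovering the bound in the last line of the proof.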

Assumptions and Loss Function

Assumption 1: Likelihood Conforming Gaussian

To estimate the log-likelihood \( \log p(z_{\tilde{x}} | z_{x_i}) \), we assume that \( p(z_{\tilde{x}} | z_{x_i}) \) follows a Gaussian distribution with mean \( z_{x_i} \) and variance \( \sigma_{x_i}^2 \), denoted as:

\[ z_{\tilde{x}} \mid z_{x_i} \sim \mathcal{N}(z_{x_i}, \sigma_{x_i}^2 I) \]
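For a \( d \)-dimensional embedding, this assumption gives the standard log-density expansion (our restatement; the likelihood loss below follows from it up to constant and scaling factors):

\[
\log p(z_{\tilde{x}} \mid z_{x_i})
= -\frac{d}{2}\log\!\left(2\pi\sigma_{x_i}^2\right)
- \frac{1}{2\sigma_{x_i}^2}\,\Vert z_{\tilde{x}} - z_{x_i} \Vert_2^2 .
\]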

Assumption 2: Prior Distribution Approximation with TV Extension

Total Variation (TV) and a CLIP operation are incorporated as distribution priors to represent \( \log p(z_{x_i}) \). The CLIP operation constrains the probability to the range \( [0, 1] \). In contrast to the conventional pixel-wise formulation, we extend TV from a pixel-wise regularizer to a distribution-wise one, also referred to as the total variation of probability distribution measures.
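For reference (our restatement of a standard definition), the total variation distance between two probability measures \( P \) and \( Q \) with densities \( p \) and \( q \) is

\[
\mathrm{TV}(P, Q) = \sup_{A} \left| P(A) - Q(A) \right| = \frac{1}{2} \int \left| p(z) - q(z) \right| dz = \frac{1}{2}\,\Vert p - q \Vert_1,
\]

which is the distribution-wise analogue of the \( \tfrac{1}{2}\Vert \cdot \Vert_1 \) form appearing in \( \mathcal{L}_{\text{TV}} \) below.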

Loss Functions

Under Assumption 1 and Assumption 2, we mitigate computational cost by using mini-batch sampling instead of full Monte Carlo integration. Mini-batch sampling, which divides the dataset into smaller subsets, fits naturally into training and reduces computational overhead, whereas Monte Carlo integration requires considerably more computation. The resulting objective decomposes into the following three loss terms:

\[ \mathcal{L}_{\text{LH}} = -\frac{1}{k}\sum_{i=1}^{k}\left[\frac{1}{2}\log(2 \pi \sigma_{x_i}) +\frac{1}{2\sigma_{x_i}^2}\Vert z_{\tilde{x}}- z_{x_i}\Vert_2^2\right] \]

\[ \mathcal{L}_{\text{TV}} = \frac{1}{k}\sum_{i=1}^{k}\left(\frac{1}{2} \Vert z_{\tilde{x}} - z_{x_i}\Vert_1\right) \]

\[ \mathcal{L}_{\text{CLIP}} = \frac{1}{k}\sum_{i=1}^{k}\left[\frac{z_{\tilde{x}} - z_{x_i}}{\sigma_{x_i}} - \text{CLIP}\left(\frac{z_{\tilde{x}} - z_{x_i}}{\sigma_{x_i}}, 0, 1\right)\right]^2 \]
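The following is a minimal PyTorch-style sketch of the three terms, transcribed directly from the equations above. The names `z_tilde` (one synthetic embedding), `z_batch` (a mini-batch of \( k \) real embeddings), and `sigma` (per-sample standard deviations) are illustrative rather than identifiers from the official code, and the element-wise reduction in the CLIP term is our assumption.

```python
import math
import torch

def bacon_losses(z_tilde, z_batch, sigma):
    """Transcription of the three loss terms above.

    z_tilde: (d,)    embedding of one synthetic image
    z_batch: (k, d)  embeddings of k real images in the mini-batch
    sigma:   (k,)    per-sample standard deviations (Assumption 1)
    """
    diff = z_tilde.unsqueeze(0) - z_batch                      # (k, d)

    # Likelihood term (sign and sigma conventions follow the displayed equation).
    l_lh = -(0.5 * torch.log(2 * math.pi * sigma)
             + 0.5 / sigma**2 * diff.pow(2).sum(dim=1)).mean()

    # Distribution-wise TV term: half the L1 distance, averaged over the batch.
    l_tv = (0.5 * diff.abs().sum(dim=1)).mean()

    # CLIP term: deviation of the normalized residual from the [0, 1] range,
    # squared element-wise and summed, then averaged (reduction is our assumption).
    r = diff / sigma.unsqueeze(1)
    l_clip = (r - r.clamp(0.0, 1.0)).pow(2).sum(dim=1).mean()

    return l_lh, l_tv, l_clip
```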

Overall Loss Function

To summarize, the overall loss function of BACON integrates the three loss terms above and is defined as:

\[ \mathcal{L}_{\text{TOTAL}} = \mathcal{L}_{\text{LH}} + \lambda \mathcal{L}_{\text{TV}} + (1-\lambda)\mathcal{L}_{\text{CLIP}} \]

where the hyperparameter \( \lambda \) is an adjustable weighting factor that balances the TV and CLIP terms; tuning \( \lambda \) allows the loss function to be tailored for best performance.
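As an illustration of how these terms might be combined in a distillation loop, here is a sketch under our own assumptions rather than the official training code: `feature_net`, the optimizer settings, and the sampling of `real_images` and `sigma` are placeholders, and `bacon_losses` refers to the sketch above.

```python
import torch

# Hypothetical setup: the synthetic images are the learnable parameters.
syn_images = torch.randn(10, 3, 32, 32, requires_grad=True)   # e.g. IPC-10 on CIFAR-10
optimizer = torch.optim.SGD([syn_images], lr=0.1)
lam = 0.5                                                      # weighting factor lambda

def distill_step(feature_net, real_images, sigma):
    """One gradient step on the synthetic images for a single class."""
    z_syn = feature_net(syn_images)             # (n_syn, d) synthetic embeddings
    z_real = feature_net(real_images).detach()  # (k, d) real embeddings (targets)

    total = 0.0
    for z_tilde in z_syn:                       # accumulate the loss per synthetic image
        l_lh, l_tv, l_clip = bacon_losses(z_tilde, z_real, sigma)
        total = total + l_lh + lam * l_tv + (1 - lam) * l_clip

    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return float(total.detach())
```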


Results

Table 1: Comparison with previous coreset selection and dataset condensation methods: Like most state-of-the-art methods, we evaluate our method on six datasets (MNIST, Fashion-MNIST, SVHN, CIFAR-10/100, TinyImageNet) with different numbers of synthetic images per class (IPC). "Ratio (%)" denotes the ratio of condensed images to the entire training set. For reference, "Full Set" indicates the accuracy of the model trained on the complete training set. Note that DD and LD employ different architectures, namely LeNet for MNIST and AlexNet for CIFAR-10, while the remaining methods all use ConvNet.
Figure 3: Performance comparison of BACON, IDM, and DM across varying training steps on the CIFAR-10/100 datasets: The blue, orange, and green lines with white circles represent our proposed BACON, IDM, and DM, respectively. All synthetic images are generated on the CIFAR-10/100 datasets over training steps from 0 to 20,000 with IPC-1, IPC-10, and IPC-50, respectively.

Visualization



Paper

Zheng Zhou, Hongbo Zhao, Guangliang Cheng, Xiangtai Li, Shuchang Lyu, Wenquan Feng, and Qi Zhao
BACON: Bayesian Optimal Condensation Framework for Dataset Distillation.
In submission, 2024.
(hosted on arXiv)


[Bibtex]


Acknowledgements

We gratefully acknowledge the contributors of DC-bench and IDM, as our code builds upon their work. You can find their repositories here: DC-bench and IDM.