Zheng Zhou1 Hongbo Zhao1 Guangliang Cheng2 Xiangtai Li3 Shuchang Lyu*1 Wenquan Feng1 Qi Zhao1
1Beihang University, 2University of Liverpool, 3Nanyang Technological University

* Corresponding Author

[Paper] [GitHub] [Distilled Dataset]

Abstract

Figure 1: Comparison of BACON and existing DD methods: (a) Traditional methods align gradients and distributions on original and synthetic datasets. (b) BACON transforms DD into a Bayesian optimization task, generating synthetic images using likelihood and prior probabilities.

Abstract: Dataset Distillation (DD) aims to distill the knowledge of an extensive dataset into a much smaller one while preserving test-set performance, thereby reducing storage costs and training expenses. However, existing methods are often computationally intensive and perform suboptimally on large datasets, in part because the field lacks a robust theoretical framework for analyzing the DD problem. To address these challenges, we propose the BAyesian optimal CONdensation framework (BACON), the first work to introduce a Bayesian theoretical framework to the DD literature. This framework provides theoretical support for improving DD performance. BACON formulates the DD problem as the minimization of an expected risk function over joint probability distributions within the Bayesian framework. By analyzing this expected risk function for optimal condensation, we derive a numerically feasible lower bound under specific assumptions, which yields an approximate solution for BACON. We validate BACON on several datasets and demonstrate superior performance over existing state-of-the-art methods. For instance, under the IPC-10 setting, BACON achieves a 3.46% accuracy gain over IDM on CIFAR-10 and a 3.10% gain on TinyImageNet. Extensive experiments further confirm the effectiveness of BACON and its seamless integration with existing methods, enhancing their performance on the DD task. Code and distilled datasets are available at BACON.



Motivation

Dataset Distillation (DD) aims to reduce dataset size while maintaining test-set performance, but existing methods are computationally intensive, often perform poorly on complex tasks and large-scale datasets, and lack a solid theoretical foundation.

To address these challenges, we introduce the Bayesian Optimal Condensation Framework (BACON). This is the first Bayesian approach to DD, providing a clear theoretical foundation and a numerically feasible solution. BACON formulates DD as a minimization problem within Bayesian joint distributions, significantly improving performance, especially on large datasets like CIFAR-10 and TinyImageNet.

Key Benefits of BACON:

- The first Bayesian theoretical framework for DD, giving the problem a principled formulation as the minimization of an expected risk over joint probability distributions.
- A numerically feasible lower bound on the expected risk, yielding an approximate and efficiently computable solution.
- Superior accuracy over state-of-the-art methods, e.g., a 3.46% gain over IDM on CIFAR-10 and a 3.10% gain on TinyImageNet under IPC-10.
- Seamless integration with existing DD methods, improving their performance.

Bayesian Optimal Condensation Framework

Figure 2: Illustration of BACON: A neural network produces output distributions for both the synthetic and real datasets. BACON formulates these distributions as a Bayesian optimal condensation risk function and derives its optimal solution using Bayesian principles.

As illustrated in Figure 2, BACON optimizes the dataset distillation task with a joint probability model grounded in Bayesian theory: it derives a condensation risk function whose minimizer defines the optimal synthetic dataset, and applies approximation techniques to compute the solution efficiently.


Theorem: Bayesian Optimal Condensation Risk Function

The optimal embedding feature of the synthetic image, $z_{\tilde{x}}^*$, can be computed as follows:

\[ z_{\tilde{x}}^* = \arg\max_{z_{\tilde{x}} \in \mathcal{D}_S} \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} \left[\log p(z_{\tilde{x}} \mid z_x) + \log p(z_x)\right] dz_x. \]

Proof: By applying Bayes' rule and Jensen's inequality, we derive the result as follows:

\[ z_{\tilde{x}}^* = argmax_{z_{\tilde{x}} \in \mathcal{D}_S} \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} p(z_x | z_{\tilde{x}}) dz_x \]

\[ = argmax_{z_{\tilde{x}} \in \mathcal{D}_S} \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} \frac{p(z_{\tilde{x}} | z_x) p(z_x)}{p(z_{\tilde{x}})} dz_x \]

\[ = argmax_{z_{\tilde{x}} \in \mathcal{D}_S} \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} \underbrace{p(z_{\tilde{x}} | z_x) p(z_x)}_{\text{Bayesian formula}} dz_x \]

\[ = argmax_{z_{\tilde{x}} \in \mathcal{D}_S} \left[ \log \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} p(z_{\tilde{x}} | z_x) p(z_x) dz_x \right] \]

\[ \geq argmax_{z_{\tilde{x}} \in \mathcal{D}_S} \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} \log \left[ p(z_{\tilde{x}} | z_x) p(z_x) \right] dz_x \quad \text{(by Jensen's inequality)} \]

\[ = argmax_{z_{\tilde{x}} \in \mathcal{D}_S} \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} \left[\log p(z_{\tilde{x}} | z_x) + \log p(z_x)\right] dz_x \]
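For readers who want the Jensen step spelled out (our own expansion, not taken verbatim from the paper): since \( \log \) is concave, Jensen's inequality gives \( \log \mathbb{E}[X] \geq \mathbb{E}[\log X] \). Writing the integral over the ball as an expectation under the uniform distribution \( u \) on \( \mathcal{B}(z_{\tilde{x}}, \epsilon) \) with volume \( |\mathcal{B}| \),

\[
\log \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} p(z_{\tilde{x}} \mid z_x)\, p(z_x)\, dz_x
= \log |\mathcal{B}| + \log \mathbb{E}_{z_x \sim u}\!\left[ p(z_{\tilde{x}} \mid z_x)\, p(z_x) \right]
\geq \log |\mathcal{B}| + \frac{1}{|\mathcal{B}|} \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} \log\left[ p(z_{\tilde{x}} \mid z_x)\, p(z_x) \right] dz_x,
\]

and since \( |\mathcal{B}| \) depends only on \( \epsilon \), the additive and multiplicative constants do not affect the maximizer, recovering the bound in the last line of the proof.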

Assumptions and Loss Function

Assumption 1: Likelihood Conforming Gaussian

To estimate the log-likelihood \( \log p(z_{\tilde{x}} | z_{x_i}) \), we assume that \( p(z_{\tilde{x}} | z_{x_i}) \) follows a Gaussian distribution with mean \( z_{x_i} \) and variance \( \sigma_{x_i}^2 \), denoted as:

\[ z_{\tilde{x}} \mid z_{x_i} \sim \mathcal{N}(z_{x_i}, \sigma_{x_i}^2 I) \]
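For a \( d \)-dimensional embedding, this assumption gives the standard log-density expansion (our restatement; the likelihood loss below follows from it up to constant and scaling factors):

\[
\log p(z_{\tilde{x}} \mid z_{x_i})
= -\frac{d}{2}\log\!\left(2\pi\sigma_{x_i}^2\right)
- \frac{1}{2\sigma_{x_i}^2}\,\Vert z_{\tilde{x}} - z_{x_i} \Vert_2^2 .
\]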

Assumption 2: Prior Distribution Approximation with TV Extension

Total Variation (TV) and a CLIP operation are incorporated as distribution priors to represent \( \log p(z_{x_i}) \). The CLIP operation constrains the probability to the range \( [0, 1] \). In contrast to the conventional pixel-wise formulation, we extend TV from a pixel-wise regularizer to a distribution-wise one, also referred to as the total variation of probability distribution measures.
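For reference (our restatement of a standard definition), the total variation distance between two probability measures \( P \) and \( Q \) with densities \( p \) and \( q \) is

\[
\mathrm{TV}(P, Q) = \sup_{A} \left| P(A) - Q(A) \right| = \frac{1}{2} \int \left| p(z) - q(z) \right| dz = \frac{1}{2}\,\Vert p - q \Vert_1,
\]

which is the distribution-wise analogue of the \( \tfrac{1}{2}\Vert \cdot \Vert_1 \) form appearing in \( \mathcal{L}_{\text{TV}} \) below.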

Loss Functions

Under Assumption 1 and Assumption 2, we mitigate computational cost by using mini-batch sampling instead of full Monte Carlo integration. Mini-batch sampling, which divides the dataset into smaller subsets, fits naturally into training and reduces computational overhead, whereas Monte Carlo integration requires considerably more computation. The resulting objective decomposes into the following three loss terms:

\[ \mathcal{L}_{\text{LH}} = -\frac{1}{k}\sum_{i=1}^{k}\left[\frac{1}{2}\log(2 \pi \sigma_{x_i}) +\frac{1}{2\sigma_{x_i}^2}\Vert z_{\tilde{x}}- z_{x_i}\Vert_2^2\right] \]

\[ \mathcal{L}_{\text{TV}} = \frac{1}{k}\sum_{i=1}^{k}\left(\frac{1}{2} \Vert z_{\tilde{x}} - z_{x_i}\Vert_1\right) \]

\[ \mathcal{L}_{\text{CLIP}} = \frac{1}{k}\sum_{i=1}^{k}\left[\frac{z_{\tilde{x}} - z_{x_i}}{\sigma_{x_i}} - \text{CLIP}\left(\frac{z_{\tilde{x}} - z_{x_i}}{\sigma_{x_i}}, 0, 1\right)\right]^2 \]
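The following is a minimal PyTorch-style sketch of the three terms, transcribed directly from the equations above. The names `z_tilde` (one synthetic embedding), `z_batch` (a mini-batch of \( k \) real embeddings), and `sigma` (per-sample standard deviations) are illustrative rather than identifiers from the official code, and the element-wise reduction in the CLIP term is our assumption.

```python
import math
import torch

def bacon_losses(z_tilde, z_batch, sigma):
    """Transcription of the three loss terms above.

    z_tilde: (d,)    embedding of one synthetic image
    z_batch: (k, d)  embeddings of k real images in the mini-batch
    sigma:   (k,)    per-sample standard deviations (Assumption 1)
    """
    diff = z_tilde.unsqueeze(0) - z_batch                      # (k, d)

    # Likelihood term (sign and sigma conventions follow the displayed equation).
    l_lh = -(0.5 * torch.log(2 * math.pi * sigma)
             + 0.5 / sigma**2 * diff.pow(2).sum(dim=1)).mean()

    # Distribution-wise TV term: half the L1 distance, averaged over the batch.
    l_tv = (0.5 * diff.abs().sum(dim=1)).mean()

    # CLIP term: deviation of the normalized residual from the [0, 1] range,
    # squared element-wise and summed, then averaged (reduction is our assumption).
    r = diff / sigma.unsqueeze(1)
    l_clip = (r - r.clamp(0.0, 1.0)).pow(2).sum(dim=1).mean()

    return l_lh, l_tv, l_clip
```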

Overall Loss Function

To summarize, the overall loss function of BACON integrates the three loss terms above and is defined as:

\[ \mathcal{L}_{\text{TOTAL}} = \mathcal{L}_{\text{LH}} + \lambda \mathcal{L}_{\text{TV}} + (1-\lambda)\mathcal{L}_{\text{CLIP}} \]

where the hyperparameter \( \lambda \) is an adjustable weighting factor that balances the TV and CLIP terms; tuning \( \lambda \) allows the loss function to be tailored for best performance.
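As an illustration of how these terms might be combined in a distillation loop, here is a sketch under our own assumptions rather than the official training code: `feature_net`, the optimizer settings, and the sampling of `real_images` and `sigma` are placeholders, and `bacon_losses` refers to the sketch above.

```python
import torch

# Hypothetical setup: the synthetic images are the learnable parameters.
syn_images = torch.randn(10, 3, 32, 32, requires_grad=True)   # e.g. IPC-10 on CIFAR-10
optimizer = torch.optim.SGD([syn_images], lr=0.1)
lam = 0.5                                                      # weighting factor lambda

def distill_step(feature_net, real_images, sigma):
    """One gradient step on the synthetic images for a single class."""
    z_syn = feature_net(syn_images)             # (n_syn, d) synthetic embeddings
    z_real = feature_net(real_images).detach()  # (k, d) real embeddings (targets)

    total = 0.0
    for z_tilde in z_syn:                       # accumulate the loss per synthetic image
        l_lh, l_tv, l_clip = bacon_losses(z_tilde, z_real, sigma)
        total = total + l_lh + lam * l_tv + (1 - lam) * l_clip

    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return float(total.detach())
```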


Results

Table 1: Comparison with previous coreset selection and dataset condensation methods: Like most state-of-the-art methods, we evaluate our method on six datasets (MNIST, Fashion-MNIST, SVHN, CIFAR-10/100, TinyImageNet) with different numbers of synthetic images per class (IPC). "Ratio (%)" denotes the ratio of condensed images to the entire training set. For reference, "Full Set" indicates the accuracy of the model trained on the complete training set. Note that DD and LD employ different architectures, namely LeNet for MNIST and AlexNet for CIFAR-10, while the remaining methods all use ConvNet.
Figure 3: Performance comparison of BACON, IDM, and DM across varying training steps on the CIFAR-10/100 datasets: The blue, orange, and green lines with white circles represent our proposed BACON, IDM, and DM, respectively. All synthetic images are generated on the CIFAR-10/100 datasets over training steps from 0 to 20,000 with IPC-1, IPC-10, and IPC-50, respectively.

Visualization



Paper

Zheng Zhou, Hongbo Zhao, Guangliang Cheng, Xiangtai Li, Shuchang Lyu, Wenquan Feng, and Qi Zhao
BACON: Bayesian Optimal Condensation Framework for Dataset Distillation.
In submission, 2024.
(hosted on arXiv)


[Bibtex]


Acknowledgements

We gratefully acknowledge the contributors of DC-bench and IDM, as our code builds upon their work. You can find their repositories here: DC-bench and IDM.