Zheng Zhou1 | Hongbo Zhao1 | Guangliang Cheng2 | Xiangtai Li3 | Shuchang Lyu*1 | Wenquan Feng1 | Qi Zhao1
1Beihang University, 2University of Liverpool, 3Nanyang Technological University
* Corresponding Author
[Paper] | [GitHub] | [Distilled Dataset]
Abstract: Dataset Distillation (DD) aims to distill knowledge from extensive datasets into more compact ones while preserving test-set performance, thereby reducing storage costs and training expenses. However, existing methods are often computationally intensive and perform suboptimally on large datasets, in part because the field lacks a robust theoretical framework for analyzing the DD problem. To address these challenges, we propose the BAyesian optimal CONdensation framework (BACON), the first work to bring a Bayesian theoretical framework to the DD literature. This framework provides theoretical support for enhancing DD performance. BACON formulates the DD problem as the minimization of an expected risk function over a joint probability distribution. By analyzing the expected risk function for optimal condensation, we derive a numerically feasible lower bound under specific assumptions, yielding an approximate solution for BACON. We validate BACON across several datasets and demonstrate superior performance over existing state-of-the-art methods. For instance, under the IPC-10 setting, BACON achieves a 3.46% accuracy gain over the IDM method on CIFAR-10 and a 3.10% gain on TinyImageNet. Our extensive experiments confirm the effectiveness of BACON and its seamless integration with existing methods, enhancing their performance on the DD task. Code and distilled datasets are available at BACON.
Dataset Distillation (DD) aims to reduce dataset size while maintaining test set performance, but existing methods struggle with large datasets and lack a solid theoretical foundation. Current approaches are computationally intensive and often perform poorly on complex tasks, particularly with large-scale datasets.
To address these challenges, we introduce the Bayesian Optimal Condensation Framework (BACON). This is the first Bayesian approach to DD, providing a clear theoretical foundation and a numerically feasible solution. BACON formulates DD as a minimization problem within Bayesian joint distributions, significantly improving performance, especially on large datasets like CIFAR-10 and TinyImageNet.
Overview: As illustrated in Figure 2, BACON models the dataset distillation task with a Bayesian joint probability distribution. The framework derives a condensation risk function whose minimizer yields the optimal synthetic dataset, and applies approximation techniques to compute the solution efficiently.
The optimal embedding feature of the synthetic image, \( z_{\tilde{x}}^* \), is computed as follows:
\[ z_{\tilde{x}}^* = \mathop{\arg\max}_{z_{\tilde{x}} \in \mathcal{D}_S} \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} \left[\log p(z_{\tilde{x}} \mid z_x) + \log p(z_x)\right] dz_x. \]
Proof: By Bayes' rule and Jensen's inequality, we derive the objective as follows:
\[ z_{\tilde{x}}^* = \mathop{\arg\max}_{z_{\tilde{x}} \in \mathcal{D}_S} \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} p(z_x \mid z_{\tilde{x}})\, dz_x \]
\[ = \mathop{\arg\max}_{z_{\tilde{x}} \in \mathcal{D}_S} \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} \underbrace{\frac{p(z_{\tilde{x}} \mid z_x)\, p(z_x)}{p(z_{\tilde{x}})}}_{\text{Bayes' rule}}\, dz_x \]
\[ = \mathop{\arg\max}_{z_{\tilde{x}} \in \mathcal{D}_S} \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} p(z_{\tilde{x}} \mid z_x)\, p(z_x)\, dz_x \quad \text{(the evidence } p(z_{\tilde{x}}) \text{ is treated as approximately constant)} \]
\[ = \mathop{\arg\max}_{z_{\tilde{x}} \in \mathcal{D}_S} \log \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} p(z_{\tilde{x}} \mid z_x)\, p(z_x)\, dz_x \quad \text{(log is monotonic)} \]
\[ \geq \mathop{\arg\max}_{z_{\tilde{x}} \in \mathcal{D}_S} \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} \log \left[ p(z_{\tilde{x}} \mid z_x)\, p(z_x) \right] dz_x \quad \text{(by Jensen's inequality)} \]
\[ = \mathop{\arg\max}_{z_{\tilde{x}} \in \mathcal{D}_S} \int_{\mathcal{B}(z_{\tilde{x}}, \epsilon)} \left[\log p(z_{\tilde{x}} \mid z_x) + \log p(z_x)\right] dz_x. \]
Maximizing this lower bound gives the objective stated above.
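The Jensen step can be sanity-checked numerically: for the concave logarithm, the log of an average is at least the average of the logs. A minimal sketch in PyTorch, where uniform sampling stands in for a normalized measure on the ball (purely illustrative):

```python
import torch

# Jensen's inequality for the concave log: log(E[f]) >= E[log f]
# for any positive integrand f.
torch.manual_seed(0)
f = torch.rand(10_000) + 0.1       # positive values standing in for p(z_syn | z_x) p(z_x)
lhs = torch.log(f.mean()).item()   # log of the (Monte Carlo) integral
rhs = torch.log(f).mean().item()   # integral of the log
print(f"log E[f] = {lhs:.4f} >= E[log f] = {rhs:.4f}")
```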
To estimate the log-likelihood \( \log p(z_{\tilde{x}} \mid z_{x_i}) \), we assume that \( p(z_{\tilde{x}} \mid z_{x_i}) \) is Gaussian with mean \( z_{x_i} \) and variance \( \sigma_{x_i}^2 \):
\[ z_{\tilde{x}} \mid z_{x_i} \sim \mathcal{N}(z_{x_i}, \sigma_{x_i}^2 I) \]
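Under this assumption the log-likelihood has a closed form. A minimal sketch (the function name and shapes are our illustrative choices, not the released code):

```python
import torch

def gaussian_log_likelihood(z_syn: torch.Tensor, z_real: torch.Tensor, sigma2: float) -> torch.Tensor:
    """log N(z_syn; mean=z_real, cov=sigma2 * I) for a d-dimensional embedding."""
    d = z_syn.numel()
    sigma2 = torch.as_tensor(sigma2, dtype=z_syn.dtype)
    sq_dist = (z_syn - z_real).pow(2).sum()
    return -0.5 * d * torch.log(2 * torch.pi * sigma2) - sq_dist / (2 * sigma2)

# the likelihood peaks when the synthetic embedding matches the real one
z = torch.randn(64)
print(gaussian_log_likelihood(z, z, 1.0))        # highest
print(gaussian_log_likelihood(z + 1.0, z, 1.0))  # lower
```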
We incorporate Total Variation (TV) and a CLIP operation as distribution priors to represent \( \log p(z_{x_i}) \); the CLIP operation clamps values to the interval \( [0,1] \). In contrast to the standard pixel-wise formulation, we extend TV from a pixel-wise to a distribution-wise measure, also known as the total variation of probability distribution measures.
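For reference, the total variation between two probability vectors is half their L1 distance; a toy sketch (the random vectors are purely illustrative):

```python
import torch

# Distribution-wise total variation: TV(p, q) = 0.5 * ||p - q||_1.
# For probability vectors this lies in [0, 1]; BACON applies the same
# 0.5 * L1 form to embedding differences (see the L_TV term below).
p = torch.softmax(torch.randn(64), dim=0)
q = torch.softmax(torch.randn(64), dim=0)
tv = 0.5 * (p - q).abs().sum()
print(tv.item())
```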
Under Assumption 1 and Assumption 2, we reduce computational cost by using mini-batch sampling instead of Monte Carlo integration: dividing the dataset into small subsets is convenient for training and avoids the extensive sampling that Monte Carlo estimation requires, as sketched below.
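A sketch of the mini-batch estimate; `real_embeddings` and the subset size `k` are illustrative placeholders:

```python
import torch

# Mini-batch estimate: instead of a full Monte Carlo pass over the real
# dataset, draw a small random subset of real embeddings at each step.
torch.manual_seed(0)
real_embeddings = torch.randn(5000, 64)   # placeholder for embeddings of the real dataset

def sample_minibatch(z: torch.Tensor, k: int) -> torch.Tensor:
    """Uniformly sample k rows without replacement."""
    idx = torch.randperm(z.size(0))[:k]
    return z[idx]

z_batch = sample_minibatch(real_embeddings, k=32)   # (32, 64), fed to the loss terms below
```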
Accordingly, we decompose the objective into three loss terms:
\[ \mathcal{L}_{\text{LH}} = \frac{1}{k}\sum_{i=1}^{k}\left[\frac{1}{2}\log\left(2 \pi \sigma_{x_i}^2\right) + \frac{1}{2\sigma_{x_i}^2}\Vert z_{\tilde{x}} - z_{x_i}\Vert_2^2\right] \]
\[ \mathcal{L}_{\text{TV}} = \frac{1}{k}\sum_{i=1}^{k}\left(\frac{1}{2} \Vert z_{\tilde{x}} - z_{x_i}\Vert_1\right) \]
\[ \mathcal{L}_{\text{CLIP}} = \frac{1}{k}\sum_{i=1}^{k}\left\Vert \frac{z_{\tilde{x}} - z_{x_i}}{\sigma_{x_i}} - \text{CLIP}\left(\frac{z_{\tilde{x}} - z_{x_i}}{\sigma_{x_i}}, 0, 1\right)\right\Vert_2^2 \]
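A sketch of the three terms on embeddings, assuming a single synthetic embedding against a mini-batch of k real embeddings with per-sample variances; shapes and names are our assumptions, not the released implementation:

```python
import torch

def bacon_losses(z_syn, z_real, sigma2):
    """The three BACON loss terms on embeddings.

    z_syn: (d,) synthetic embedding; z_real: (k, d) real embeddings;
    sigma2: (k,) per-sample variances. Shapes are illustrative assumptions.
    """
    diff = z_syn.unsqueeze(0) - z_real                     # (k, d)
    sq = diff.pow(2).sum(dim=1)                            # ||z_syn - z_xi||_2^2
    l_lh = (0.5 * torch.log(2 * torch.pi * sigma2)
            + sq / (2 * sigma2)).mean()                    # Gaussian NLL term
    l_tv = (0.5 * diff.abs().sum(dim=1)).mean()            # distribution-wise TV
    r = diff / sigma2.sqrt().unsqueeze(1)                  # standardized residuals
    l_clip = (r - r.clamp(0, 1)).pow(2).sum(dim=1).mean()  # CLIP penalty
    return l_lh, l_tv, l_clip
```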
To summarize, the overall loss function of BACON integrates the three loss terms above. The combined loss function is defined as:
\[ \mathcal{L}_{\text{TOTAL}} = \mathcal{L}_{\text{LH}} + \lambda \mathcal{L}_{\text{TV}} + (1-\lambda)\mathcal{L}_{\text{CLIP}} \]
where the hyperparameter \( \lambda \in [0,1] \) balances the TV and CLIP prior terms. Tuning \( \lambda \) adapts the loss function to the task at hand for the best performance.
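A minimal end-to-end sketch of minimizing \( \mathcal{L}_{\text{TOTAL}} \), reusing `bacon_losses` from the sketch above; the variance estimate, optimizer, and \( \lambda \) value are illustrative choices:

```python
import torch

torch.manual_seed(0)
d, k = 64, 32
z_real = torch.randn(k, d)                   # embeddings of a real mini-batch
sigma2 = z_real.var(dim=1) + 1e-6            # crude per-sample variance estimate
z_syn = torch.randn(d, requires_grad=True)   # synthetic embedding to optimize
opt = torch.optim.SGD([z_syn], lr=0.1)
lam = 0.5                                    # lambda, a tunable weight

for step in range(100):
    l_lh, l_tv, l_clip = bacon_losses(z_syn, z_real, sigma2)
    loss = l_lh + lam * l_tv + (1 - lam) * l_clip
    opt.zero_grad()
    loss.backward()
    opt.step()
```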
Zheng Zhou, Hongbo Zhao, Guangliang Cheng, Xiangtai Li, Shuchang Lyu, Wenquan Feng, and Qi Zhao. BACON: Bayesian Optimal Condensation Framework for Dataset Distillation. In submission, 2024. (hosted on arXiv)
We gratefully acknowledge the contributors of DC-bench and IDM, as our code builds upon their work. You can find their repositories here: DC-bench and IDM.