Subset Selection from Pretrained Embeddings
Note: If you think I should correct something, let me know 😊.
This blog includes:
- Introduction
- CRAIG Method
- Probabilistic Bilevel Coreset (PBC)
- Stochastic Greedy Selection
- Comparison Table
- Subset Selection Code
Introduction
Subset selection methods aim to pick a small yet representative subset of a dataset, which is especially useful when training on large-scale embeddings. These techniques let us retain most of the model's performance while reducing compute and memory costs.
CRAIG Method
CRAIG selects points with high influence on the objective defined by the trace of the Gram matrix, \( \text{Tr}(G_S^T G_S) \). Since \( \text{Tr}(G_S^T G_S) = \sum_{x_i \in S} \|G_i\|^2 \), this simplifies to choosing the points with the highest L2 norms in the embedding space:
\[ \text{Select } S \subseteq D,\ \text{such that } \sum_{x_i \in S} \|G_i\|^2 \text{ is maximized} \]
```python
import torch

# CRAIG surrogate: score each point by its squared L2 norm, keep the top k
norms_sq = torch.sum(embeddings**2, dim=1)
selected_indices = torch.topk(norms_sq, k=subset_size).indices
```
Probabilistic Bilevel Coreset (PBC)
PBC frames subset selection as a bilevel optimization problem: it seeks a probability vector \( p \in \Delta_k \) such that training on the \( p \)-weighted dataset yields the best performance on a held-out validation set:
\[ \min_{p \in \Delta_k} L_{val}(\theta^*(p)),\ \text{where } \theta^*(p) = \arg\min_\theta \sum_i p_i \ell(\theta; x_i) \]
Because the inner training problem must be solved for every update of \( p \), this method is usually approximated in practice. In our implementation, a random-sampling placeholder stands in for the full bilevel solve:
```python
import numpy as np

# Placeholder: uniform sampling without replacement instead of the bilevel solve
selected_indices = np.random.choice(embeddings.shape[0], size=subset_size, replace=False)
```
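For intuition, the sketch below shows what a single evaluation of the true bilevel objective would look like for a hypothetical linear regression model; the function name and its arguments are illustrative, not part of the repo's API. A real PBC solver would additionally differentiate through (or approximate) the inner loop to update \( p \) on the simplex.

```python
import torch

def bilevel_objective(p, train_x, train_y, val_x, val_y, inner_steps=50, lr=0.1):
    # Inner problem: fit theta on the p-weighted training loss
    theta = torch.zeros(train_x.shape[1], requires_grad=True)
    opt = torch.optim.SGD([theta], lr=lr)
    for _ in range(inner_steps):
        opt.zero_grad()
        losses = (train_x @ theta - train_y) ** 2   # per-example squared error
        (p * losses).sum().backward()               # sum_i p_i * loss_i
        opt.step()
    # Outer objective: validation loss at the (approximate) inner optimum
    with torch.no_grad():
        return ((val_x @ theta - val_y) ** 2).mean()
```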
Stochastic Greedy Selection
This method iteratively builds the subset by sampling a candidate pool \( R_t \) and selecting the best candidate based on a marginal gain criterion. It supports two objective types:
- Facility Location: Maximize similarity coverage over the dataset.
- Sum of Squared Norms: Similar to CRAIG, but with greedy sampling.
The marginal gain for facility location is given by:
\[ \text{Gain}(x) = \sum_{j \in D} \max(0, \text{sim}(x, j) - \max_{s \in S} \text{sim}(s, j)) \]
By evaluating only a small random pool per step instead of the full dataset, stochastic greedy keeps most of the approximation quality of full greedy while being efficient and empirically effective for large-scale selection.
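A minimal sketch of the facility-location variant follows, assuming the embeddings are L2-normalized so that cosine similarity reduces to a dot product. The function name, the `pool_size` default, and the zero-initialized coverage (a common convention for the empty subset) are illustrative choices, not the repo's exact implementation.

```python
import torch

def stochastic_greedy_facility_location(embeddings, subset_size, pool_size=256, seed=0):
    gen = torch.Generator().manual_seed(seed)
    n = embeddings.shape[0]
    # best_cover[j] = max similarity between point j and the current subset S
    best_cover = torch.zeros(n)
    selected = []
    for _ in range(subset_size):
        pool = torch.randperm(n, generator=gen)[:pool_size]   # candidate pool R_t
        sims = embeddings[pool] @ embeddings.T                # (pool_size, n) similarities
        # marginal gain: total coverage improvement, max(0, sim - best_cover)
        gains = torch.clamp(sims - best_cover, min=0).sum(dim=1)
        best = pool[gains.argmax()].item()
        selected.append(best)
        best_cover = torch.maximum(best_cover, embeddings[best] @ embeddings.T)
    return torch.tensor(selected)
```

With the sum-of-squared-norms objective, a candidate's gain is simply its squared norm, so the same loop degenerates to top-k selection over the sampled pool.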
Comparison Table
| Method | Objective | Optimizes | Complexity | Strength |
|---|---|---|---|---|
| CRAIG | \( \text{Tr}(G^T G) \) | L2 norm | \( O(n \log k) \) | Fast & simple |
| PBC | Bilevel optimization | Expected validation loss | \( O(nm) \) + training loop | Theoretically grounded |
| Stochastic Greedy | Greedy marginal gain | Similarity / norm gain | \( O(kR) \) gain evaluations | Balanced & scalable |
Subset Selection Code
The full pipeline loads embeddings, selects indices using the chosen method, loads the dataset, and saves the selected data subset to a JSON file.
```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--method", choices=["craig", "pbc", "stochastic_greedy"])
args = parser.parse_args()

# Dispatch to the selection routine for the chosen method
if args.method == "craig":
    select_subset_craig(...)
elif args.method == "pbc":
    select_subset_pbc(...)
elif args.method == "stochastic_greedy":
    select_subset_stochastic_greedy(...)
```
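To illustrate the load/select/save flow described above, here is a hypothetical glue sketch. The flag names (`--embeddings_path`, `--dataset_path`, `--output_path`, `--subset_size`) and the `select_subset_craig` signature are assumptions for illustration, not the repo's actual interface.

```python
import json
import torch

# Hypothetical glue: load embeddings, select a subset, save the matching records.
embeddings = torch.load(args.embeddings_path)  # assumed flag; tensor of shape (n, d)
selected_indices = select_subset_craig(embeddings, subset_size=args.subset_size)

with open(args.dataset_path) as f:             # assumed: dataset stored as a JSON list
    data = json.load(f)
subset = [data[i] for i in selected_indices.tolist()]

with open(args.output_path, "w") as f:
    json.dump(subset, f, indent=2)
```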
For results, please refer to the repository: https://github.com/cv-chaitali/AlgoData.git