Subset Selection from Pretrained Embeddings
Note: If you think I should correct something, let me know 😊.
This blog includes:
- Introduction
- CRAIG Method
- Probabilistic Bilevel Coreset (PBC)
- Stochastic Greedy Selection
- Comparison Table
- Subset Selection Code
Introduction
Subset selection methods aim to pick a small yet representative subset of a dataset, which is especially useful when training on large-scale embeddings. These techniques let us retain most of the model's performance while reducing compute and memory costs.
CRAIG Method
CRAIG selects points with high influence on the objective defined by the trace of the Gram matrix, \( \text{Tr}(G_S^T G_S) \). Since \( \text{Tr}(G_S^T G_S) = \sum_{x_i \in S} \|G_i\|^2 \), this simplifies to choosing the points with the highest L2 norms in the embedding space:
\[ \text{Select } S \subseteq D,\ \text{such that } \sum_{x_i \in S} \|G_i\|^2 \text{ is maximized} \]
```python
import torch

# CRAIG surrogate: score each point by its squared L2 norm, keep the top k
norms_sq = torch.sum(embeddings**2, dim=1)
selected_indices = torch.topk(norms_sq, k=subset_size).indices
```
Probabilistic Bilevel Coreset (PBC)
PBC frames subset selection as a bilevel optimization problem: it seeks a probability vector \( p \in \Delta_k \) such that training on the \( p \)-weighted dataset yields the best performance on a held-out validation set:
\[ \min_{p \in \Delta_k} L_{val}(\theta^*(p)),\ \text{where } \theta^*(p) = \arg\min_\theta \sum_i p_i \ell(\theta; x_i) \]
Because the inner training problem must be solved for every update of \( p \), this method is usually approximated in practice. In our implementation, a random-sampling placeholder stands in for the full bilevel solve:
```python
import numpy as np

# Placeholder: uniform sampling without replacement instead of the bilevel solve
selected_indices = np.random.choice(embeddings.shape[0], size=subset_size, replace=False)
```
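For intuition, the sketch below shows what a single evaluation of the true bilevel objective would look like for a hypothetical linear regression model; the function name and its arguments are illustrative, not part of the repo's API. A real PBC solver would additionally differentiate through (or approximate) the inner loop to update \( p \) on the simplex.

```python
import torch

def bilevel_objective(p, train_x, train_y, val_x, val_y, inner_steps=50, lr=0.1):
    # Inner problem: fit theta on the p-weighted training loss
    theta = torch.zeros(train_x.shape[1], requires_grad=True)
    opt = torch.optim.SGD([theta], lr=lr)
    for _ in range(inner_steps):
        opt.zero_grad()
        losses = (train_x @ theta - train_y) ** 2   # per-example squared error
        (p * losses).sum().backward()               # sum_i p_i * loss_i
        opt.step()
    # Outer objective: validation loss at the (approximate) inner optimum
    with torch.no_grad():
        return ((val_x @ theta - val_y) ** 2).mean()
```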
Stochastic Greedy Selection
This method iteratively builds the subset by sampling a candidate pool \( R_t \) and selecting the best candidate based on a marginal gain criterion. It supports two objective types:
- Facility Location: Maximize similarity coverage over the dataset.
- Sum of Squared Norms: Similar to CRAIG, but with greedy sampling.
The marginal gain for facility location is given by:
\[ \text{Gain}(x) = \sum_{j \in D} \max(0, \text{sim}(x, j) - \max_{s \in S} \text{sim}(s, j)) \]
By evaluating only a small random pool per step instead of the full dataset, stochastic greedy keeps most of the approximation quality of full greedy while being efficient and empirically effective for large-scale selection.
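A minimal sketch of the facility-location variant follows, assuming the embeddings are L2-normalized so that cosine similarity reduces to a dot product. The function name, the `pool_size` default, and the zero-initialized coverage (a common convention for the empty subset) are illustrative choices, not the repo's exact implementation.

```python
import torch

def stochastic_greedy_facility_location(embeddings, subset_size, pool_size=256, seed=0):
    gen = torch.Generator().manual_seed(seed)
    n = embeddings.shape[0]
    # best_cover[j] = max similarity between point j and the current subset S
    best_cover = torch.zeros(n)
    selected = []
    for _ in range(subset_size):
        pool = torch.randperm(n, generator=gen)[:pool_size]   # candidate pool R_t
        sims = embeddings[pool] @ embeddings.T                # (pool_size, n) similarities
        # marginal gain: total coverage improvement, max(0, sim - best_cover)
        gains = torch.clamp(sims - best_cover, min=0).sum(dim=1)
        best = pool[gains.argmax()].item()
        selected.append(best)
        best_cover = torch.maximum(best_cover, embeddings[best] @ embeddings.T)
    return torch.tensor(selected)
```

With the sum-of-squared-norms objective, a candidate's gain is simply its squared norm, so the same loop degenerates to top-k selection over the sampled pool.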
Comparison Table
| Method | Objective | Optimizes | Complexity | Strength |
|---|---|---|---|---|
| CRAIG | \( \text{Tr}(G^T G) \) | L2 norm | \( O(n \log k) \) | Fast & simple |
| PBC | Bilevel optimization | Expected validation loss | \( O(nm) \) + training loop | Theoretically grounded |
| Stochastic Greedy | Greedy marginal gain | Similarity / norm gain | \( O(kR) \) gain evaluations | Balanced & scalable |
Subset Selection Code
The full pipeline loads embeddings, selects indices using the chosen method, loads the dataset, and saves the selected data subset to a JSON file.
```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--method", choices=["craig", "pbc", "stochastic_greedy"])
args = parser.parse_args()

# Dispatch to the selection routine for the chosen method
if args.method == "craig":
    select_subset_craig(...)
elif args.method == "pbc":
    select_subset_pbc(...)
elif args.method == "stochastic_greedy":
    select_subset_stochastic_greedy(...)
```
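To illustrate the load/select/save flow described above, here is a hypothetical glue sketch. The flag names (`--embeddings_path`, `--dataset_path`, `--output_path`, `--subset_size`) and the `select_subset_craig` signature are assumptions for illustration, not the repo's actual interface.

```python
import json
import torch

# Hypothetical glue: load embeddings, select a subset, save the matching records.
embeddings = torch.load(args.embeddings_path)  # assumed flag; tensor of shape (n, d)
selected_indices = select_subset_craig(embeddings, subset_size=args.subset_size)

with open(args.dataset_path) as f:             # assumed: dataset stored as a JSON list
    data = json.load(f)
subset = [data[i] for i in selected_indices.tolist()]

with open(args.output_path, "w") as f:
    json.dump(subset, f, indent=2)
```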
For results, please refer to the repository: https://github.com/cv-chaitali/AlgoData.git