Paper Summary
Note: Images and equations are from the original paper; I redrew them for PPT animation effects.
Paper Reference
@misc{liu2025selectitselectiveinstructiontuning,
  title={SelectIT: Selective Instruction Tuning for LLMs via Uncertainty-Aware Self-Reflection},
  author={Liangxin Liu and Xuebo Liu and Derek F. Wong and Dongfang Li and Ziyi Wang and Baotian Hu and Min Zhang},
  year={2025},
  eprint={2402.16705},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2402.16705},
}
Abstract
This paper proposes SelectIT, a framework that ranks instruction–response pairs using the base LLM's own uncertainty signals and keeps only the top ~20% for tuning, with no external evaluators or auxiliary resources required. Experiments on LLaMA-2-13B show consistent gains over vanilla Alpaca tuning across reasoning, coding, and multilingual benchmarks.
This Summary Includes
How they solved it
A three-stage self-reflection pipeline selects the highest-quality examples:
1. Token-Level Reflection

Prompt the base model to rate each pair on a 1–K scale, then pick the most probable score token:
\( S^{\mathrm{base}} = \arg\max_{k\in\{1,\dots,K\}} P'_k, \quad P'_k = \frac{P_k}{\sum_{j=1}^K P_j} \)
Isn't \( S^{\mathrm{base}} \) alone enough? No: the normalized probabilities can be very close, so we multiply by an uncertainty term that rewards confident ratings:
\( S_{\mathrm{token}} = S^{\mathrm{base}} \times \underbrace{ \frac{1}{K-1}\sum_{i=1}^K |P'_i - P'_{S^{\mathrm{base}}}| }_{\text{Uncertainty}} \)
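To make the token-level step concrete, here is a minimal NumPy sketch; the function name, 0-based indexing, and the toy probabilities are mine for illustration, not from the paper:

```python
import numpy as np

def token_level_score(rating_probs: np.ndarray) -> float:
    """Score one pair from the probabilities the base LLM assigns to the
    K rating tokens ("1".."K") at the next-token position."""
    p = rating_probs / rating_probs.sum()    # P'_k: renormalize over the K rating tokens
    base = int(np.argmax(p))                 # index of S_base (0-based here)
    # Uncertainty term: mean absolute gap between the winning rating's
    # probability and the others. A peaked distribution pushes it toward 1;
    # a flat one pushes it toward 0.
    uncertainty = np.abs(p - p[base]).sum() / (len(p) - 1)
    return (base + 1) * uncertainty          # S_token = S_base * uncertainty

# Toy example: the model clearly prefers rating "4".
print(token_level_score(np.array([0.02, 0.03, 0.10, 0.80, 0.05])))  # -> 3.0
```

Note how a confident, peaked distribution over the rating tokens keeps most of the base score, while a flat distribution drags the score toward zero.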
2. Sentence-Level Reflection

Re-rate the same pair with K differently phrased rating prompts, average the token-level scores, and penalize disagreement:
\( S^{\mathrm{sent}} = \frac{ \frac{1}{K}\sum_{i=1}^K S^{\mathrm{token}}_{i} }{ 1 + \alpha \times \underbrace{\mathrm{Std}\{S^{\mathrm{token}}_{i}\}_{i=1}^K}_{\text{Uncertainty}} } \)
The more the K prompt phrasings disagree, the more the score is shrunk.
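A similarly minimal sketch of the sentence-level step (the \( \alpha \) value below is illustrative; the paper treats it as a hyperparameter):

```python
import numpy as np

def sentence_level_score(token_scores: np.ndarray, alpha: float = 0.2) -> float:
    # Average the K token-level scores from K rephrased rating prompts and
    # shrink the result when the prompts disagree (high std = high uncertainty).
    return token_scores.mean() / (1.0 + alpha * token_scores.std())

# Same mean rating, different consistency: the consistent sample ranks higher.
print(sentence_level_score(np.array([3.0, 3.1, 2.9])))  # ~2.95
print(sentence_level_score(np.array([1.0, 5.0, 3.0])))  # ~2.26
```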
3. Model-Level Reflection
Combine the sentence-level scores from N base models into a global quality ranking, weighting model i by its parameter count \( \theta_i \):
\( \mathrm{Quality} \propto S^{\mathrm{model}} = \sum_{i=1}^{N}\Bigl( \tfrac{\theta_i}{\sum_{j=1}^{N}\theta_j} \times S^{\mathrm{sent}}_{i} \Bigr) \)
Larger models get proportionally more say in the final score.
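Finally, a sketch of the model-level aggregation plus the top-20% cut mentioned in the abstract; the model sizes, scores, and the `select_top_fraction` helper are made up for illustration:

```python
import numpy as np

def model_level_scores(sent_scores: np.ndarray, param_counts: np.ndarray) -> np.ndarray:
    # sent_scores: (num_samples, N) sentence-level scores from N base models.
    # Model i is weighted by theta_i / sum_j theta_j, so larger raters count more.
    weights = param_counts / param_counts.sum()
    return sent_scores @ weights

def select_top_fraction(scores: np.ndarray, fraction: float = 0.2) -> np.ndarray:
    # Indices of the highest-scoring `fraction` of the dataset.
    k = max(1, int(len(scores) * fraction))
    return np.argsort(scores)[::-1][:k]

# Toy run: 5 samples rated by a 7B and a 13B model (all numbers illustrative).
sent = np.array([[4.2, 3.9], [2.1, 2.5], [3.0, 3.3], [4.8, 4.6], [1.2, 1.0]])
quality = model_level_scores(sent, np.array([7.0, 13.0]))
print(select_top_fraction(quality))  # -> [3], i.e. keep the single best sample
```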
Key Results
SelectIT outperforms vanilla Alpaca tuning and other selection baselines, with strong gains on reasoning (BBH, GSM), coding (HumanEval), and multilingual (TyDiQA) tasks.
