Context Tuning for In-Context Optimization

New York University

teaser figure

Comparison of training-free methods, prompt adaptation techniques, and methods from our proposed In-Context Optimization framework (Test-Time Training, CT-Prompt, CT-KV) on solving tasks from a split of UnifiedQA and CrossFit.
The figure distinguishes baseline methods from our methods; bolded methods attain the best performance-efficiency trade-off.

Abstract

We introduce Context Tuning, a simple and effective method that significantly enhances few-shot adaptation of large language models (LLMs) without fine-tuning model parameters. While prompt-based adaptation techniques have proven effective for lightweight adaptation of LLMs, they typically initialize a trainable prompt or prefix with tokens that are irrelevant to the task at hand. In contrast, Context Tuning initializes the trainable prompt or prefix with task-specific demonstration examples, leveraging the model's inherent In-Context Learning (ICL) ability to extract relevant information for improved few-shot learning performance. Extensive evaluations on benchmarks such as CrossFit, UnifiedQA, MMLU, BIG-Bench Hard, and ARC demonstrate that Context Tuning outperforms traditional prompt-based adaptation methods and achieves accuracy competitive with Test-Time Training at significantly higher training efficiency.
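
To make the initialization difference concrete, below is a minimal PyTorch sketch, written by us as an illustration rather than taken from the authors' code, contrasting the task-agnostic initialization of conventional prompt adaptation with the demonstration-based initialization of Context Tuning (CT-Prompt flavor). The vocabulary size, prompt length, and embedding table are placeholder values.

import torch
import torch.nn as nn

# Stand-ins for the frozen LLM's input embedding table and the tokenized demonstrations;
# the sizes below are illustrative placeholders, not values from the paper.
vocab_size, d_model, prompt_len = 50257, 768, 32
embed = nn.Embedding(vocab_size, d_model)
demo_ids = torch.randint(vocab_size, (prompt_len,))   # token ids of the concatenated (x_i, y_i) pairs

# Conventional Prompt/Prefix Tuning: the trainable prompt starts from values
# that carry no information about the task.
random_prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

# Context Tuning (CT-Prompt flavor): the same trainable prompt, but initialized from
# the demonstrations' own embeddings, so task-relevant ICL signal is present at step 0.
ct_prompt = nn.Parameter(embed(demo_ids).detach().clone())

# In both cases the prompt is prepended to the input embeddings and optimized
# while the LM's own weights stay frozen.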

Our Method

We illustrate the CT-KV variant of Context Tuning. Our paper also contains details on the CT-Prompt variant.

- Context Tuning (left) first initializes a prefix \(\{K_i, V_i\}_{i=1}^k\) from the demonstration pairs \(\{(x_i, y_i)\}_{i=1}^k\), then trains it to solve each pair. To keep the model from simply retrieving the answer from the prefix, Leave-One-Out Masking blocks attention to \(K_i, V_i\) when solving pair \(i\) (see the sketch after this list). No model weight updates are required!
- Generation (right) conditions on all optimized prefixes \(\{K_i^*, V_i^*\}_{i=1}^k\) to solve the query \(x_q\).
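
To show how Leave-One-Out Masking can be realized, here is a self-contained PyTorch sketch written for illustration (a simplification under our own assumptions, not the authors' implementation): a boolean attention mask hides the cached segment \(K_i, V_i\) while demonstration pair \(i\) is being solved, so gradients reach only the trainable prefix. Segment lengths, head counts, and dimensions are made up for the example.

import torch
import torch.nn.functional as F

def leave_one_out_mask(seg_lens, query_len, holdout):
    """Boolean mask of shape (query_len, prefix_len + query_len); True = may attend."""
    prefix_len = sum(seg_lens)
    mask = torch.ones(query_len, prefix_len + query_len, dtype=torch.bool)
    # Hide the held-out demonstration's cached keys/values so its answer cannot simply be copied.
    start = sum(seg_lens[:holdout])
    mask[:, start:start + seg_lens[holdout]] = False
    # Standard causal mask over the tokens of the pair currently being solved.
    mask[:, prefix_len:] = torch.ones(query_len, query_len).tril().bool()
    return mask

# Toy setting: k = 3 cached demonstration segments (lengths 4, 5, 3); pair i = 2,
# tokenized to 6 positions, is solved while its own segment is masked out.
seg_lens, q_len, i = [4, 5, 3], 6, 2
heads, d_head = 8, 64
prefix_k = torch.randn(1, heads, sum(seg_lens), d_head, requires_grad=True)  # trainable K_1..K_k
prefix_v = torch.randn(1, heads, sum(seg_lens), d_head, requires_grad=True)  # trainable V_1..V_k
q = torch.randn(1, heads, q_len, d_head)                                     # queries for pair i
k = torch.cat([prefix_k, torch.randn(1, heads, q_len, d_head)], dim=2)
v = torch.cat([prefix_v, torch.randn(1, heads, q_len, d_head)], dim=2)

out = F.scaled_dot_product_attention(q, k, v, attn_mask=leave_one_out_mask(seg_lens, q_len, i))
out.sum().backward()  # only the cached prefix K/V receive gradients; the LM weights stay frozen

At generation time no segment is held out: the query \(x_q\) attends to all k optimized segments \(\{K_i^*, V_i^*\}_{i=1}^k\), as in the right half of the figure.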


method figure

Qualitative Samples

We select two sample tasks from ARC to illustrate how the generated answers gradually improve with CT-KV training. For each of the two ARC tasks at the top and bottom, we display four demonstration query-answer pairs, the test query, and model predictions at CT-KV training iterations 0, 50, 100, 150, and 200. We color-code correct predictions in green and incorrect predictions in red.

- Top task: the model's prediction at iteration 0 (equivalent to In-Context Learning) shows a strong bias toward filling orange-bordered squares with yellow. As CT-KV training progresses, the model gradually learns to fill each orange-bordered square with the correct color.

- Bottom task: the model first learns that only grey grid cells can turn red, and then correctly completes the cross shapes.


ARC samples

Quantitative Evaluation

We evaluate Context Tuning against training-free methods, prompt-based adaptation methods, and Test-Time Training on a diverse set of challenging datasets using GPT-2 and Llama 3 models.


Benchmarks

We show one test pair each from BBH, NLP-LR, and MMLU, and three demonstration pairs followed by a test pair from ARC.


benchmarks


Results

Based on our quantitative comparison of Context Tuning and baselines, we find that the CT-KV variant of Context Tuning significantly outperforms Zero-Shot Prompting, In-Context Learning, Prompt Tuning, and Prefix Tuning. CT-KV is also competitive with the more computationally intensive Test-Time Training approach. Finally, we show that CT-KV can serve as a post-hoc refinement step following Test-Time Training, leading to improved few-shot adaptation performance compared to either method used in isolation.


results table

BibTeX


@misc{lu2025contexttuning,
      title={Context Tuning for In-Context Optimization},
      author={Jack Lu and Ryan Teehan and Zhenbang Yang and Mengye Ren},
      year={2025},
      eprint={2507.04221},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}