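# EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents

Authors: Princeton University SJTU Team | Year: 2026 | arXiv: 2606.11182

## 2. Research Background

Test-time prompt learning aims to improve an LLM's performance on target tasks without updating any model parameters.

Core problem: when the target tasks come from multiple datasets (math, code, knowledge QA), prompt updates for different tasks interfere with one another, and a single prompt cannot serve them all. Updating the prompt for one task often hurts the others.

EEVEE's core insight: maintain a set of specialized prompts $\mathcal{P}=\{p_1,\ldots,p_K\}$ together with a router $R$. The router assigns each input to the most suitable prompt slot, which avoids interference between tasks.

| Stage | Goal | Procedure |
|---|---|---|
| Initialization | Build a diverse prompt set | Run prompt learning on the mixed training set, then greedily select the Top-K prompts by coverage gain |
| Exploration | Jointly search the router-prompt design | Alternate router and prompt evolution under a light budget; the objective weights anneal gradually from consistency/balance toward downstream accuracy |
| Convergence | Deeply optimize the prompts under a stable router | Fix $R^\star$, re-route the inputs, and spend a larger budget optimizing each slot's prompt |

| Model | Method | | | | | Average |
|---|---|---|---|---|---|---|
| Qwen3-4B | Baseline | 56.00 | 45.22 | 14.79 | 49.46 | 41.37 |
| Qwen3-4B | ACE | 48.93 | 39.67 | 15.84 | 35.23 | 34.92 |
| Qwen3-4B | GEPA | 50.84 | 49.83 | 19.62 | 30.62 | 37.73 |
| Qwen3-4B | EEVEE | 54.55 | 54.55 | 25.27 | 72.63 | 51.75 |
| DeepSeek-V3.2 | Baseline | 64.98 | 30.00 | 21.21 | 42.82 | 39.75 |
| DeepSeek-V3.2 | EEVEE | 63.08 | 60.55 | 39.84 | 92.82 | 64.07 |

### 4.2 Ablation Study

| Variant | Average score |
|---|---|
| Baseline | 41.37 |
| Default router (not learned) | 43.58 |
| Handwritten router (written once by GPT-5.4) | 37.18 |
| No co-evolution (router and prompts learned in separate stages) | 42.88 |
| EEVEE (full) | 51.75 |

The handwritten router scores even lower than the baseline, which supports the claim that the router must be learned from data rather than designed by hand.

### 4.3 Multi-Task Scaling Curve (Figure 3)

As the number of tasks increases, the cumulative retention rate of GEPA and ACE quickly drops to negative values, while EEVEE stays positive throughout and ends at 41.53.

### 4.4 Generalization

- Cross-model generalization: prompts learned with Qwen3 are applied directly to DeepSeek-V3.2 and reach an average of 54.10, above the DeepSeek baseline of 39.75.
- Cross-task generalization: gains remain positive on unseen tasks such as MBPP and MMLU-Pro.

Report generated: 2026-06-11 | Paper source: arXiv:2606.11182

Original abstract: In this paper, we propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, enabling test-time prompt learning under real-world task streams. Existing methods are largely designed for single-dataset settings, while real-world applications require models to handle heterogeneous input streams drawn from multiple datasets, domains, and task distributions, limiting their practical applicability. To mitigate cross-dataset interference, EEVEE introduces a router that partitions incoming inputs into task clusters and assigns them to suitable prompt configurations.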
This design is optimized via a router-prompt co-evolution strategy, which employs interleaved router and prompt learning phases to address their mutual dependency. Experiments across multiple datasets demonstrate that the framework improves robustness under heterogeneous data streams while maintaining single-benchmark learning capability and efficiency. Specifically, EEVEE improves average multi-benchmark scores by 10.38 and 24.32 points over Qwen3-4B-Instruct and DeepSeek-V3.2, surpassing SOTA methods GEPA and ACE by up to 37.2% and 48.2%.

PDF link: https://arxiv.org/pdf/2606.11182v1
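
To make the ideas summarized above more concrete, here is a minimal Python sketch of two of them: greedy Top-K prompt selection by coverage gain (the initialization stage) and dispatching an input to a prompt slot through a router. It is an illustration under stated assumptions, not the paper's implementation. The function names `greedy_top_k`, `route`, and `keyword_router`, the representation of coverage as a set of solved example indices, and the toy keyword rules are all hypothetical. The keyword router in particular only stands in for a learned router; the ablation reports that a handwritten router performs below the baseline.

```python
from typing import Callable, Dict, List, Set


def greedy_top_k(coverage: Dict[str, Set[int]], k: int) -> List[str]:
    """Greedily pick up to k prompts by marginal coverage gain.

    `coverage` maps a candidate prompt name to the set of training-example
    indices it solves (an assumed representation, not taken from the paper).
    """
    chosen: List[str] = []
    covered: Set[int] = set()
    candidates = dict(coverage)  # copy so the caller's dict is not mutated
    while candidates and len(chosen) < k:
        # Candidate that adds the most not-yet-covered examples.
        best = max(candidates, key=lambda name: len(candidates[name] - covered))
        if not candidates[best] - covered:
            break  # no remaining candidate adds new coverage
        chosen.append(best)
        covered |= candidates.pop(best)
    return chosen


def route(text: str, router: Callable[[str], int], prompts: List[str]) -> str:
    """Return the prompt in the slot that the router assigns to `text`."""
    return prompts[router(text)]


def keyword_router(text: str) -> int:
    """Toy stand-in for a learned router (hypothetical rules)."""
    lowered = text.lower()
    if "function" in lowered or "def " in lowered:
        return 1  # code slot
    if any(ch.isdigit() for ch in lowered):
        return 0  # math slot
    return 2  # knowledge-QA slot


if __name__ == "__main__":
    candidate_coverage = {
        "p_math": {0, 1, 2},
        "p_code": {3, 4},
        "p_generic": {0, 3},
        "p_qa": {5},
    }
    print(greedy_top_k(candidate_coverage, k=3))

    slots = ["math prompt", "code prompt", "qa prompt"]
    print(route("What is 12 * 7?", keyword_router, slots))
```

In this sketch a generic prompt that only re-solves already covered examples is never selected, which is the point of ranking by coverage gain rather than by raw accuracy.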