Princeton and SJTU Jointly Build EEVEE: A New Breakthrough in Multi-Dataset Learning!

张建站

2026/6/12 8:00:01

10 min read

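EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents

Authors: Princeton University & SJTU Team | Year: 2026 | arXiv: 2606.11182

2. Research Background

Test-time prompt learning aims to improve an LLM's performance on a target task without updating its parameters.

Core problem

When target tasks come from multiple datasets (math, code, knowledge QA), prompt updates for different tasks interfere with one another. A single prompt cannot serve them all: updating it for one task often hurts the others.

EEVEE's core insight

Maintain a set of specialized prompts $\mathcal{P} = \{p_1, \ldots, p_K\}$ and a router $R$. The router assigns each input to the most suitable prompt slot, which avoids cross-task interference.

| Stage | Goal | Procedure |
|---|---|---|
| Initialization | Build a diverse prompt set | Run prompt learning on the mixed training set, then greedily select the Top-K prompts by coverage gain |
| Exploration | Jointly search router-prompt designs | Alternate router and prompt evolution under a lightweight budget; the objective weighting is gradually annealed from consistency/balance toward downstream accuracy |
| Convergence | Deeply optimize prompts under a stable router | Fix $R^\star$, re-route the inputs, and spend a larger budget optimizing each slot's prompt |

| Model | Method | Benchmark 1 | Benchmark 2 | Benchmark 3 | Benchmark 4 | Average |
|---|---|---|---|---|---|---|
| Qwen3-4B | Baseline | 56.00 | 45.22 | 14.79 | 49.46 | 41.37 |
| Qwen3-4B | ACE | 48.93 | 39.67 | 15.84 | 35.23 | 34.92 |
| Qwen3-4B | GEPA | 50.84 | 49.83 | 19.62 | 30.62 | 37.73 |
| Qwen3-4B | EEVEE | 54.55 | 54.55 | 25.27 | 72.63 | 51.75 |
| DeepSeek-V3.2 | Baseline | 64.98 | 30.00 | 21.21 | 42.82 | 39.75 |
| DeepSeek-V3.2 | EEVEE | 63.08 | 60.55 | 39.84 | 92.82 | 64.07 |

4.2 Ablation Study

| Variant | Average score |
|---|---|
| Baseline | 41.37 |
| Default router (not learned) | 43.58 |
| Handwritten router (written once by GPT-5.4) | 37.18 |
| No co-evolution (router and prompts learned in separate stages) | 42.88 |
| EEVEE (full) | 51.75 |

The handwritten router scores even below the baseline (37.18 vs. 41.37), which supports the view that the router has to be learned from data rather than designed by hand.

4.3 Multi-task Scaling Curve (Figure 3)

As the number of tasks grows, the cumulative retention rate of GEPA and ACE quickly drops to negative values, while EEVEE stays positive throughout and ends at 41.53.

4.4 Generalization

Cross-model generalization: prompts learned with Qwen3 are applied directly to DeepSeek-V3.2 and reach an average of 54.10, above the DeepSeek baseline of 39.75.

Cross-task generalization: gains remain positive on unseen tasks such as MBPP and MMLU-Pro.

Report generated: 2026-06-11 | Paper source: arXiv:2606.11182

Original abstract:

In this paper, we propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, enabling test-time prompt learning under real-world task streams. Existing methods are largely designed for single-dataset settings, while real-world applications require models to handle heterogeneous input streams drawn from multiple datasets, domains, and task distributions, limiting their practical applicability. To mitigate cross-dataset interference, EEVEE introduces a router that partitions incoming inputs into task clusters and assigns them to suitable prompt configurations. This design is optimized via a router-prompt co-evolution strategy, which employs interleaved router and prompt learning phases to address their mutual dependency. Experiments across multiple datasets demonstrate that the framework improves robustness under heterogeneous data streams while maintaining single-benchmark learning capability and efficiency. Specifically, EEVEE improves average multi-benchmark scores by 10.38 and 24.32 points over Qwen3-4B-Instruct and DeepSeek-V3.2, surpassing SOTA methods GEPA and ACE by up to 37.2% and 48.2%.

PDF link: https://arxiv.org/pdf/2606.11182v1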
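
The initialization stage described above picks the Top-K prompts greedily by coverage gain. The post does not spell out how coverage is measured, so the following is only a minimal sketch under stated assumptions: each candidate prompt is represented by the set of training examples it answers correctly, its coverage gain is the number of those examples not yet covered by already chosen prompts, and selection stops early once no prompt adds anything. The function name `greedy_select_prompts`, the prompt names, and the toy data are all hypothetical and not taken from the paper.

```python
from typing import Dict, List, Set


def greedy_select_prompts(solved: Dict[str, Set[int]], k: int) -> List[str]:
    """Pick up to k prompts, each time taking the one with the largest coverage gain.

    Assumption: solved[name] holds the indices of training examples that the
    prompt `name` answers correctly. This is an illustrative stand-in for
    whatever coverage signal the actual method uses.
    """
    covered: Set[int] = set()
    chosen: List[str] = []
    remaining = dict(solved)
    while remaining and len(chosen) < k:
        # Coverage gain = examples this prompt solves that no chosen prompt solves yet.
        # Iterating in sorted order makes ties resolve to the alphabetically first name.
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - covered))
        if not remaining[best] - covered:
            break  # assumed stopping rule: no remaining prompt adds coverage
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


# Hypothetical toy data: prompt name -> indices of examples it answers correctly.
solved_by_prompt = {
    "math_prompt": {0, 1, 2, 3},
    "code_prompt": {4, 5, 6},
    "qa_prompt": {7, 8},
    "generic_prompt": {0, 4, 7},
}

selected = greedy_select_prompts(solved_by_prompt, k=3)
print(selected)  # ['math_prompt', 'code_prompt', 'qa_prompt']
```

In this toy run the generic prompt is never selected, because each specialized prompt adds more new examples at the step where it is chosen. That matches the stated goal of the stage, a diverse set of prompt slots rather than K near-duplicates, though the real selection criterion may differ.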
