HuggingFace 每日AI论文速递 (HuggingFace Daily AI Paper Digest)

40 Episodes

By: duan

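Get up to speed on the day's trending AI papers on HuggingFace in 10 minutes a day. Updated every weekday; subscriptions welcome. 📢 To find the podcast, search for 【HuggingFace 每日AI论文速递】 on Xiaoyuzhou or Apple Podcasts. 🖼 An illustrated text edition is also available: search for and follow 【AI速递】 on Xiaohongshu.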

2026.06.02 | Multi-agent framework generates editable figures; parameter-efficient fine-tuning supports a million personalized models
Yesterday at 11:00 PM

【Contents】
The 15 papers in this episode are:

[00:33] 🎨 Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs
[01:39] 🧩 On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters
[02:35] 🧪 A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks
[03:25] 🌐 K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts
[04:21] ⚡ Draft-OPD: On-Policy Distillation for Speculative Draft Models
[05:10] 🎓 VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization
[06:18] 📡 X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding
[07:13] 🎬 VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion
[07:59] 🤖 SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories
[08:54] 🧠 Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models
[09:51] 🧠 NITP: Next Implicit Token Prediction for LLM Pre-training
[10:50] 👀 Where to Look: Can Foundation Models Reach a Target Viewpoint


2026.06.01 | Knowledge distillation forges skills; representation forcing breaks the bottleneck
Last Monday at 11:00 PM

【Contents】
The 15 papers in this episode are:

[00:30] 🧠 COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation
[01:17] 🧠 Representation Forcing for Bottleneck-Free Unified Multimodal Models
[02:07] 🎙 SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue
[02:58] 🔍 LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards
[03:59] 🎧 Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer
[04:48] 🖼 GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration
[05:39] 🎤 Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios
[06:46] 🛋 Function2Scene: 3D Indoor Scene Layout from Functional Specifications
[07:36] 🎥 SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer
[08:29] 🧠 Task-Focused Memorization for Multimodal Agents
[09:30] 🤖 Exploring Autonomous Agentic Data Engineering for Model Specialization
[10:15] 🎓 Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation
[11:10] 🧩 dMoE: dL


【Month-End Special】May's hottest AI papers | Multi-agent world modeling; open-source robot VLA model
Last Monday at 8:13 AM

【Contents】
The 10 papers in this episode are:
[00:45] TOP1(🔥407) | 🌍 Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players
[03:09] TOP2(🔥347) | 🤖 MolmoAct2: Action Reasoning Models for Real-world Deployment
[05:30] TOP3(🔥269) | 🔍 CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
[07:51] TOP4(🔥231) | 🧠 Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers
[10:04] TOP5(🔥219) | 🏗 MinT: Managed Infrastructure for Training and Serving Millions of LLMs
[11:59] TOP6(🔥217) | 🧠 Heterogeneous Scientific Foundation Model Collaboration
[14:17] TOP7(🔥210) | 🤖 Code as Agent Harness
[16:26] TOP8(🔥210) | 🧠 SkillOpt: Executive Strategy for Self-Evolving Agent Skills
[18:39] TOP9(🔥204) | 🎯 DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards
[20:25] TOP10(🔥195) | 🧠 Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

【Follow Us】
You can also find us on the following platform for more beyond the podcast:
Xiaohongshu: AI速递


View this episode's transcript on Xiaoyuzhou


【Weekend Special】Hottest AI papers of week 5 of May | γ-World: a new framework for multi-agent world modeling; SkillOpt: a new strategy for optimizing agent skills
Last Sunday at 2:32 AM

【Contents】
The 5 papers in this episode are:
[00:41] TOP1(🔥357) | 🌍 Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players
[02:49] TOP2(🔥204) | 🧠 SkillOpt: Executive Strategy for Self-Evolving Agent Skills
[04:37] TOP3(🔥131) | 🎯 DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning
[06:17] TOP4(🔥124) | 🎯 LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding
[08:56] TOP5(🔥110) | 🛡 AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security



2026.05.29 | AgentDoG 1.5 delivers millisecond-level safety protection; Qwen-VLA unifies cross-task action modeling
Last Friday at 11:00 PM

【Contents】
The 14 papers in this episode are:
[00:25] 🛡 AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security
[01:06] 🤖 Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
[02:02] 🌐 OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources
[02:52] 🎨 CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation
[03:47] 🎬 minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models
[04:39] 🎥 YoCausal: How Far is Video Generation from World Model? A Causality Perspective
[05:42] 🎨 GenClaw: Code-Driven Agentic Image Generation
[06:40] ⚡ EarlyTom: Early Token Compression Completes Fast Video Understanding
[07:37] 🎯 UniSteer: Text-Guided Flow Matching in Activation Space for Versatile LLM Steering
[08:25] 🧠 How LoRA Remembers? A Parametric Memory Law for LLM Finetuning
[09:20] 🔗 LoMo: Local Modality Substitution for Deeper Vision-Language Fusion
[10:24] 🔍 LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in


2026.05.28 | ProRL proactively steers recommendation; γ-World achieves multi-agent zero-shot generalization
Last Thursday at 11:00 PM

【Contents】
The 15 papers in this episode are:
[00:24] 🎯 ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation
[01:27] 🌍 Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players
[02:28] 🤖 Agent Explorative Policy Optimization for Multimodal Agentic Reasoning
[03:24] 👁 From Pixels to Words -- Towards Native One-Vision Models at Scale
[04:19] 🔍 Self-Improving Language Models with Bidirectional Evolutionary Search
[05:01] 🧮 ResearchMath-14K: Scaling Research-Level Mathematics via Agents
[06:03] 🔍 MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems
[06:58] 🛠 DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes
[07:54] 🤖 GEM: Generative Supervision Helps Embodied Intelligence
[08:41] 🎯 Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents
[09:30] 🔗 ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence
[10:27] 🔬 AI Research Agents Narrow Scientific Exploration
[11:17] 🧠 Rethinking Memory as Continuously Evolving Connectivity
[12:15] 🎥 OSP-Next: Efficient High-Quality


2026.05.27 | Parallel box decoding gives a tenfold speedup; spatial benchmark exposes model weaknesses
05/27/2026

【Contents】
The 15 papers in this episode are:
[00:24] 🎯 LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding
[01:13] 🧩 SpatialBench: Is Your Spatial Foundation Model an All-Round Player?
[02:07] 🎬 EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation
[03:06] 📱 MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research
[04:05] 🏗 Geometry-Aware Representation Denoising for Robust Multi-view 3D Reconstruction
[05:00] 🎬 LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV
[05:59] 🛡 D²-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing
[06:54] 🤖 The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence
[07:51] 🤝 Share More, Search Less: Collaborative Parallel Thinking for Efficient Test-Time Scaling
[08:46] 🎬 Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration
[09:46] 👁 LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence
[10:37] 🤖 VitaBen


2026.05.26 | DVAO dynamically balances multiple objectives; WBench fills the gap in interactive evaluation
05/26/2026

【Contents】
The 15 papers in this episode are:
[00:25] 🎯 DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning
[01:15] 🎬 WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation
[02:13] 🖥 Macaron-A2UI: A Model for Generative UI in Personal Agents
[02:56] 🤝 Foundation Protocol: A Coordination Layer for Agentic Society
[04:02] 🔺 TriSplat: Simulation-Ready Feed-Forward 3D Scene Reconstruction
[05:05] 🎬 ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
[06:00] 🧠 Toward Native Multimodal Modeling: A Roadmap
[06:50] 🔍 QUEST: Training Frontier Deep Research Agents with Fully Synthetic Tasks
[07:43] 🎯 ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention
[08:50] 🔬 AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery
[09:46] 🧠 Your Embedding Model is SMARTer Than You Think
[10:25] 💡 ControlLight: Towards Controllable, Consistent, and Generalizable Low-Light Enhancement
[11:22] 🌐


2026.05.25 | SkillOpt enables self-evolving skills; Lens improves text-to-image training efficiency
05/26/2026

【Contents】
The 15 papers in this episode are:
[00:25] 🧠 SkillOpt: Executive Strategy for Self-Evolving Agent Skills
[01:16] 🔍 Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models
[02:20] 🔀 Rethinking Cross-Layer Information Routing in Diffusion Transformers
[03:01] 🧠 SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research
[03:56] 🎙 StepAudio 2.5 Technical Report
[04:51] 👁 See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
[05:50] 📸 PhotoFlow: Agentic 3D Virtual Photography Missions
[06:29] 🧠 From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills
[07:28] 🎥 VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis
[08:29] ⚡ PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion
[09:35] 🎨 RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution
[10:30] ✂ ETCHR: Editing To Clarify and Harness Reasoning
[11:26] 🎮 SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models


【Weekend Special】Hottest AI papers of week 4 of May | CiteVQA exposes attribution hallucination; code becomes the agent harness
05/24/2026

【Contents】
The 5 papers in this episode are:
[00:40] TOP1(🔥263) | 🔍 CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
[03:21] TOP2(🔥199) | 🤖 Code as Agent Harness
[05:25] TOP3(🔥191) | 🎯 DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards
[07:48] TOP4(🔥189) | 🧠 Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
[09:58] TOP5(🔥167) | 🚌 TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation



2026.05.22 | Large models internalize geospatial knowledge; discriminative tokens improve reasoning
05/22/2026

【Contents】
The 15 papers in this episode are:
[00:25] 🚌 TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation
[01:21] 🎯 DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards
[02:15] 🤖 π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows
[03:04] 🤔 Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?
[03:51] 🔥 Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps
[04:35] 🤖 ACC: Compiling Agent Trajectories for Long-Context Training
[05:35] 🧊 PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects
[06:37] 🧠 LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning
[07:37] 🌍 WorldKV: Efficient World Memory with World Retrieval and Compression
[08:22] 📊 Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning
[09:27] 🎥 FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching


2026.05.21 | Mega-ASR cuts noise and hallucination; Video2GUI data makes pretraining more effective
05/21/2026

【Contents】
The 15 papers in this episode are:
[00:23] 🎤 Mega-ASR: Towards In-the-wild² Speech Recognition via Scaling up Real-world Acoustic Simulation
[01:22] 🎬 Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining
[02:11] 🎬 Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos
[03:04] 🚀 You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories
[03:50] 🗜 OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
[04:39] 🔧 IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools
[05:36] 🔊 A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook
[06:35] 🤝 It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs
[07:26] 📈 Toto 2.0: Time Series Forecasting Enters the Scaling Era
[08:20] ⚡ Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs
[09:25] 🧠 Generative Recursive Reasoning
[10:29] 🎬 CutVerse: A Compositional GUI Agents Benchmark for Media Post-


2026.05.20 | Anti-self-distillation improves reasoning; verifiable environments evaluate agents
05/20/2026

【Contents】
The 15 papers in this episode are:
[00:24] 🧠 Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
[01:08] 🖥 OpenComputer: Verifiable Software Worlds for Computer-Use Agents
[01:53] 🧠 GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment
[02:49] 🔬 Process Rewards with Learned Reliability
[03:44] 🤖 AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration
[04:48] 🎭 When Vision Speaks for Sound
[05:50] 🏭 EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL
[06:45] 🎬 CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition
[07:40] 🎯 Active Learners as Efficient PRP Rerankers
[08:24] 🎥 Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos
[09:14] 🎬 Aurora: Unified Video Editing with a Tool-Using Agent
[10:12] 🎯 CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
[11:01] 📱 OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments
[11:51] 🎬 MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation
[12:44] 🎥 Video Models Can Reason wit


2026.05.19 | Long video generation gets faster with less GPU memory; lightweight multimodal model outperforms larger models
05/19/2026

【Contents】
The 15 papers in this episode are:
[00:23] 🎬 LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
[01:17] 🎨 Lance: Unified Multimodal Modeling by Multi-Task Synergy
[02:24] 🤖 AI for Auto-Research: Roadmap & User Guide
[03:26] 🛠 SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution
[04:20] 🎬 KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration
[05:18] 🏠 Code-as-Room: Generating 3D Rooms from Top-Down View Images via Agentic Code Synthesis
[06:15] 🤖 OProver: A Unified Framework for Agentic Formal Theorem Proving
[07:14] ⚡ Post-Trained MoE Can Skip Half Experts via Self-Distillation
[07:57] 🎥 LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs
[08:47] 🛑 Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models
[09:42] 🔀 Where Should Diffusion Enter a Language Model? Geometry-Guided Hidden-State Replacement
[10:39] 🧠 Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in


2026.05.18 | Human videos distill physical common sense; document QA must cite the source text
05/18/2026

【Contents】
The 15 papers in this episode are:
[00:23] 🧠 PhysBrain 1.0 Technical Report
[00:56] 🔍 CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
[01:45] 🤖 MMSkills: Towards Multimodal Skills for General Visual Agents
[02:35] 👗 FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization
[03:20] 🦾 DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo
[04:19] 🔮 Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
[04:54] 🖼 InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
[05:48] 🧠 Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding
[06:44] ⚡ Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization
[07:29] 🧭 Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR
[08:10] 🎮 ReactiveGWM: Steering NPC in Reactive Game World Models
[08:46] ⚖ Hölder Policy Optimisation
[09:36] 🧠 Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution
[10:22] 🌐


【Weekend Special】Hottest AI papers of week 3 of May | MinT attaches a million LoRAs to a base model in seconds; MV-Split tackles mean mode screaming in 1000-layer DiTs
05/18/2026

【Contents】
The 5 papers in this episode are:
[00:44] TOP1(🔥211) | 🏗 MinT: Managed Infrastructure for Training and Serving Millions of LLMs
[02:56] TOP2(🔥183) | 🧠 Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers
[04:44] TOP3(🔥171) | 🧠 SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
[06:36] TOP4(🔥145) | 🥇 Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling
[08:20] TOP5(🔥141) | 🔒 MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents



2026.05.15 | 30B model wins olympiad gold; self-distillation supercharges a small 3B model with no external add-ons
05/17/2026

【Contents】
The 15 papers in this episode are:
[00:23] 🥇 Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling
[01:00] 🤖 Self-Distilled Agentic Reinforcement Learning
[01:46] 🧠 MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models
[02:57] 👁 MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory
[04:00] 🎬 SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
[04:43] 🎬 Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation
[05:21] 🧬 Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning
[06:19] 🐾 WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
[07:11] 🧠 STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?
[08:03] 🧠 Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
[08:44] 🎥 Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video
[09:24] 🧠 PREPING: Building Agent Memory without Tasks


2026.05.14 | MinT tackles the large-model scaling problem with LoRA patches; MulTaBench: small models beat large ones on aligned text-and-image tasks
05/14/2026

【Contents】
The 15 papers in this episode are:
[00:25] 🏗 MinT: Managed Infrastructure for Training and Serving Millions of LLMs
[01:08] 📊 MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image
[02:14] 🎬 AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
[03:02] 📚 Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
[03:48] 🤖 Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling
[04:27] 🖼 Qwen-Image-VAE-2.0 Technical Report
[05:05] 🎨 Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
[06:01] 🎯 TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
[06:57] 🧠 Many-Shot CoT-ICL: Making In-Context Learning Truly Learn
[07:58] 🎯 FrameSkip: Learning from Fewer but More Informative Frames in VLA Training
[08:52] 🌅 The DAWN of World-Action Interactive Models
[09:43] 🌊 Asymmetric Flow Models
[10:24] 🤖 Learning Agentic Policy from Action Guidance
[11:23] 💻 Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning


2026.05.13 | Native unification of seeing and drawing; privacy-preserving memory management at the edge
05/13/2026

【Contents】
The 15 papers in this episode are:
[00:23] 🧠 SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
[01:10] 🔒 MemPrivacy: Privacy-Preserving Personalized Memory Management for Edge-Cloud Agents
[01:59] 🧠 δ-mem: Efficient Online Memory for Large Language Models
[02:43] 🤖 RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
[03:33] 🤖 World Action Models: The Next Frontier in Embodied AI
[04:22] 🤖 AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward
[05:09] 🧩 Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization
[06:12] 🛠 ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
[06:51] 🏭 Do Enterprise Systems Need Learned World Models? The Importance of Context to Infer Dynamics
[07:52] 🎨 L2P: Unlocking Latent Potential for Pixel Generation
[08:33] 🎬 CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
[09:18] 🔍 Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
[10:


2026.05.12 | Mathematicians' privately written problems stump large models; image generation model renders thousand-word prompts accurately
05/12/2026

【Contents】
The 15 papers in this episode are:
[00:25] 🧮 Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
[01:30] 🎨 Qwen-Image-2.0 Technical Report
[02:23] 🎥 CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models
[03:08] 🧠 TMAS: Scaling Test-Time Compute via Multi-Agent Synergy
[03:52] 📄 PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents
[04:34] 📈 Model Merging Scaling Laws in Large Language Models
[05:19] 🧩 Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training
[06:20] 🌍 WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
[07:12] 📊 Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
[08:03] 🤖 X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction
[08:51] 🧠 Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
[09:35] 🔄 SEIF: Self-Evolving Reinforcement Learning for Instruction Following
[10:19] 🔄 Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
[11:10] 🎨 Pixal3D:


2026.05.11 | Music-driven dance split across experts; flow-matching distillation yields an all-round top performer
05/11/2026

【Contents】
The 15 papers in this episode are:
[00:29] 💃 MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation
[01:07] 🎯 Flow-OPD: On-Policy Distillation for Flow Matching Models
[01:58] 🎯 Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
[02:52] 🔍 HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
[03:37] 🤖 LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
[04:20] 🎥 HumanNet: Scaling Human-centric Video Learning to One Million Hours
[05:09] 🧠 Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers
[06:06] 🔍 Beyond Retrieval: A Multitask Benchmark and Model for Code Search
[07:06] 🧩 Anisotropic Modality Align
[07:58] 🤖 AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning
[08:49] 📜 TextLDM: Language Modeling with Continuous Latent Diffusion
[09:41] 🧠 4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding
[10:25] 🎬 A²RD: Agentic Autoregressive Diffusion for Long Video Consistency
[11:04] 🛡 DecodingTrust-Agent Platform (DTap): A


【Weekend Special】Hottest AI papers of week 2 of May | MolmoAct2 open-sources a robot brain; long-context Werewolf self-play learns hidden rules
05/10/2026

【Contents】
The 5 papers in this episode are:
[00:33] TOP1(🔥266) | 🤖 MolmoAct2: Action Reasoning Models for Real-world Deployment
[03:10] TOP2(🔥145) | 🧠 From Context to Skills: Can Language Models Learn from Context Skillfully?
[05:03] TOP3(🔥117) | 🎥 Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
[07:22] TOP4(🔥101) | 🤖 RLDX-1 Technical Report
[09:45] TOP5(🔥99) | 🤖 ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration



2026.05.08 | Global sketches aid long-context understanding; skill libraries let agents evolve
05/08/2026

【Contents】
The 15 papers in this episode are:
[00:23] 🧠 MiA-Signature: Approximating Global Activation for Long-Context Understanding
[01:32] 🧬 Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
[02:14] 🎯 MARBLE: Multi-Aspect Reward Balance for Diffusion RL
[03:08] 🤖 When to Trust Imagination: Adaptive Action Execution for World Action Models
[04:06] 🧠 Continuous Latent Diffusion Language Model
[04:50] 🏆 RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation
[05:36] 🧠 Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration
[06:13] ⚡ Continuous-Time Distribution Matching for Few-Step Diffusion Distillation
[06:48] 🎬 Audio-Visual Intelligence in Large Foundation Models
[07:24] 🤖 Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes
[08:12] 🤖 A²TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping
[09:12] 🧩 UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
[09:58] 🧠 SkillOS: Learning Skill Curation for Self-Evolving Agents
[10:49] 🚗 ReflectDrive-2: Reinforceme


2026.05.07 | Reward distillation teaches pixels to "pick out what matters"; test-time scaling stabilizes long video chunk by chunk
05/07/2026

【Contents】
The 15 papers in this episode are:
[00:24] 🎥 Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
[01:27] 🎥 Stream-T1: Test-Time Scaling for Streaming Video Generation
[02:06] 🔍 OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents
[03:07] 🤖 RLDX-1 Technical Report
[04:06] 🚗 HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation
[04:50] ⚙ PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World
[05:40] 🎨 D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
[06:38] 🔍 Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems
[07:46] ⚡ Lightning Unified Video Editing via In-Context Sparse Attention
[08:38] 🧠 Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
[09:27] 🎯 Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback
[10:11] 🎵 APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music
[10:54] 🧠 ResRL: Boosting LLM Reas


2026.05.06 | ARIS argues with itself to write papers; PRISM cleans data in three stages before RL
05/07/2026

[Contents]
The 15 papers in this episode:
[00:25] 🤖 ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration
[00:59] 🎯 Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL
[01:54] 🔍 OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories
[02:42] 🎯 X2SAM: Any Segmentation in Images and Videos
[03:23] 🧠 HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness
[04:23] 🎬 Video Generation with Predictive Latents
[05:05] 📜 PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination
[05:45] 🎨 SVGS: Enhancing Gaussian Splatting Using Primitives with Spatially Varying Colors
[06:31] 📂 Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies
[07:28] 🤒 SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment
[08:11] 🤖 Reinforcement Learning for LLM-based Multi-Ag...


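2026.05.05 | Open-source MolmoAct2 reaches an 87% success rate in real-world tests; GPT's skill extraction from context gets another upgrade
05/05/2026

[Contents]
The 14 papers in this episode: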
[00:21] 🤖 MolmoAct2: Action Reasoning Models for Real-world Deployment
[01:02] 🧠 From Context to Skills: Can Language Models Learn from Context Skillfully?
[01:44] 🔁 Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling
[02:35] 👁 Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
[03:18] 🌊 OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models
[03:56] 🧩 ComboStoc: Combinatorial Stochasticity for Diffusion Generative Models
[04:44] 🎓 AcademiClaw: When Students Set Challenges for AI Agents
[05:25] 🏥 PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments
[06:06] 🤖 T²PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning
[07:04] 🌳 Hierarchical Abstract Tree for Cross-Document Retrieval-Augmented Generation
[07:54] 🌌 Generative Modeling with Orbit-Space Particle Flow Matching
[08:30] 🧠 Perceptual Flow Network for Visually Grounded Reasoning
[09:06] 🎬 Motion-Aware Caching for Efficient Autoregressive Video Generation
[09:54] 🤖 Code World Model Preparedness Report


[Month-End Special] April's Hottest AI Papers | GrandCode tops Codeforces; high-frequency prompts boost LLM performance
05/05/2026

[Contents]
The 10 papers in this episode:
[00:47] TOP1(🔥626) | 🏆 GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning
[02:41] TOP2(🔥501) | 📈 Adam's Law: Textual Frequency Law on Large Language Models
[04:54] TOP3(🔥364) | 🔄 DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models
[07:02] TOP4(🔥350) | 🧠 FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization
[08:57] TOP5(🔥341) | 🚁 CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence
[11:14] TOP6(🔥323) | 🧠 Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability
[13:08] TOP7(🔥289) | 🧬 SkillClaw: Let Skills Evolve Collectively with Agentic Evolver
[15:08] TOP8(🔥261) | 🤖 ClawBench: Can AI Agents Complete Everyday Online Tasks?
[16:40] TOP9(🔥252) | 🔄 Recursive Multi-Agent Systems
[18:31] TOP10(🔥249) | 👗 Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items


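2026.05.04 | A unified diffusion framework packs fifteen tasks into one; multi-agent search crushes solo agents
05/04/2026

[Contents]
The 15 papers in this episode: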
[00:23] 🎥 UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
[01:20] 🕸 Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction
[02:11] 🌍 Map2World: Segment Map Conditioned Text to 3D World Generation
[03:05] 🤖 Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
[03:46] 🧩 From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills
[04:39] 🎨 Let ViT Speak: Generative Language-Image Pre-training
[05:21] 🧩 When Do Diffusion Models learn to Generate Multiple Objects?
[06:14] 🌳 Trees to Flows and Back: Unifying Decision Trees and Diffusion Models
[07:14] 🖼 End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer
[08:09] 🔍 Online Self-Calibration Against Hallucination in Vision-Language Models
[08:49] 🤖 Learning to Act and Cooperate for Distributed Black-Box Consensus Optimization
[09:34] 🗣 LA...


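[Weekend Special] Hottest AI Papers of Week 1 of May | Nested recursion in latent space lifts scores fast; world models evolve level by level
05/03/2026

[Contents]
The 5 papers in this episode: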
[00:35] TOP1(🔥241) | 🔄 Recursive Multi-Agent Systems
[02:34] TOP2(🔥219) | 🌍 Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
[04:45] TOP3(🔥188) | 🧠 Heterogeneous Scientific Foundation Model Collaboration
[06:31] TOP4(🔥116) | 🏢 From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company
[08:19] TOP5(🔥115) | 🌍 World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

[Follow Us]
You can also find us on the following platforms for more beyond the podcast itself:
Xiaohongshu: AI速递


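2026.05.01 | Eywa pairs LLMs with domain models for a 30% efficiency gain; visual generation's five-level climb is still stuck at level three
05/01/2026

[Contents]
The 15 papers in this episode: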
[00:25] 🧠 Heterogeneous Scientific Foundation Model Collaboration
[01:24] 🌍 Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
[02:04] 🧬 Co-Evolving Policy Distillation
[02:47] 🤖 ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control
[03:38] 🚀 Efficient Training on Multiple Consumer GPUs with RoundPipe
[04:17] 🧠 Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
[05:08] 🎨 Leveraging Verifier-Based Reinforcement Learning in Image Editing
[06:18] 📏 Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling
[07:15] 🔬 Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists
[08:31] 🌐 InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?
[09:15] 🎨 Representation Fréchet Loss for Visual Generation
[10:05] 🖥 Synthetic Computers at Scale for Long-Horizon Productivity Simulation
[10:52] 🧠 Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models
[11:25] 🤖 The Last Hum


2026.04.30 | GLM-5V trains multimodal capabilities all in one pot; latent-distillation sampling saves samples
04/30/2026

[Contents]
The 11 papers in this episode:
[00:22] 🤖 GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
[01:26] 🔬 Large Language Models Explore by Latent Distilling
[02:16] 🌊 Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models
[03:02] 🦾 ClawGym: A Scalable Framework for Building Effective Claw Agents
[03:49] 🤖 RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments
[04:35] 🧩 Diffusion Templates: A Unified Plugin Framework for Controllable Diffusion
[05:20] 🚀 Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding
[06:08] 🌍 Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
[07:02] 💬 A Survey on LLM-based Conversational User Simulation
[07:55] 👗 FASH-iCNN: Making Editorial Fashion Identity Inspectable Through Multimodal CNN Probing
[08:43] 🧩 Probing Visual Planning in Image Editing Models


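2026.04.29 | Recursive multi-agent nesting speeds things up; programming with data enables Git-style self-improvement
04/29/2026

[Contents]
The 15 papers in this episode: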
[00:25] 🔄 Recursive Multi-Agent Systems
[01:01] 🔧 Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora
[01:55] 📊 DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios
[02:36] 🔬 AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
[03:23] 🖼 Meta-CoT: Enhancing Granularity and Generalization in Image Editing
[04:07] 🎨 Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models
[05:03] 🎥 Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
[05:46] 🎧 Step-Audio-R1.5 Technical Report
[06:26] 🎬 Co-Director: Agentic Generative Video Storytelling
[07:13] 🖥 Toward Scalable Terminal Task Synthesis via Skill Graphs
[07:57] 🎓 TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents
[08:53] 🛡 BARRED: Synthetic Training of Custom Policy Guardrails via Asymmetric Debate
[09:36] 🎓 MAIC-UI: Making Interactive Courseware with Generative UI
[10:35] 🎨 V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
[11:15] 🏃 IAM: Identity-Aware Human Motion and Shape


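2026.04.28 | Reinforcement learning forces geometrically consistent video; an AI company assembles teams LEGO-style to cut costs and boost efficiency
04/28/2026

[Contents]
The 15 papers in this episode: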
[00:24] 🌍 World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
[01:29] 🏢 From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company
[02:26] 🧠 ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning
[03:23] 🛡 Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms
[04:12] 🖼 Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
[05:02] 🤖 ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents
[06:20] ✍ SketchVLM: Vision language models can annotate images to explain thoughts and guide users
[07:17] 🔬 Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
[08:24] ⚖ Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment
[09:20] 🔀 Efficient Agent Evaluation via Diversity-Guided User Simulation
[10:02] ⚡ For-Value: Efficient Forward-Only Data Valuation for finetuning LLMs and VLMs
[11:04] 🎬 OmniShot


2026.04.27 | A coordinate framework unifies world models; diffusion reconstruction speeds up clinical CT
04/27/2026

[Contents]
The 11 papers in this episode:
[00:31] 🌍 Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
[01:24] 🩻 DiffNR: Diffusion-Enhanced Neural Representation Optimization for Sparse-View 3D Tomographic Reconstruction
[02:10] 🛡 LLM Safety From Within: Detecting Harmful Content with Internal Representations
[02:50] 🎬 FlowAnchor: Stabilizing the Editing Signal for Inversion-Free Video Editing
[03:34] 📚 Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets
[04:23] 🔍 AgentSearchBench: A Benchmark for AI Agent Search in the Wild
[05:03] 🎬 Building a Precise Video Language with Human-AI Oversight
[06:11] 🤖 dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
[06:52] 🔍 Sessa: Selective State Space Attention
[07:32] 🌾 AgriIR: A Scalable Framework for Domain-Specific Knowledge Retrieval
[08:19] 🔦 Learning Evidence Highlighting for Frozen LLMs


[Weekend Special] Hottest AI Papers of Week 4 of April | Tstars-Tryon tops virtual try-on; LLaDA2.0-Uni unifies multimodal generation
04/26/2026

[Contents]
The 5 papers in this episode:
[00:31] TOP1(🔥244) | 👗 Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items
[02:42] TOP2(🔥229) | 🔮 LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
[05:07] TOP3(🔥154) | 🤖 AgentSPEX: An Agent SPecification and EXecution Language
[07:06] TOP4(🔥96) | 🚀 Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
[08:48] TOP5(🔥84) | 🚗 OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation


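2026.04.24 | LLaTiSA's four-level challenge teaches models to read time series; WorldMark's unified benchmark tests video world models
04/24/2026

[Contents]
The 15 papers in this episode: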
[00:23] 📈 LLaTiSA: Towards Difficulty-Stratified Time Series Reasoning from Visual Perception to Semantics
[01:11] 🎮 WorldMark: A Unified Benchmark Suite for Interactive Video World Models
[01:54] 🤖 UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
[02:44] 🎨 StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition
[03:56] ⏩ Seeing Fast and Slow: Learning the Flow of Time in Videos
[04:39] ⚡ TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale
[05:16] 🧠 Hybrid Policy Distillation for LLMs
[05:48] 🧠 Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
[06:44] 🤖 VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
[07:43] 🧩 Context Unrolling in Omni Models
[08:31] 🎨 EditCrafter: Tuning-free High-Resolution Image Editing via Pretrained Diffusion Model
[09:34] 🔗 UniGenDet: A Unified Generative-Discriminative Framework for Co-Evoluti...


2026.04.23 | LLaDA2.0 unifies multimodality; future experience as a plug-in for RL
04/23/2026

[Contents]
The 15 papers in this episode:
[00:28] 🔮 LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
[01:17] 🔮 Near-Future Policy Optimization
[02:07] 🤖 DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data
[02:53] 🤖 DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
[03:42] 🎭 Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
[04:36] 🧠 Exploring Spatial Intelligence from a Generative Perspective
[05:21] 🤖 A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression
[06:18] 🎤 WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training
[07:06] 🤖 SWE-chat: Coding Agent Interactions From Real Users in the Wild
[07:53] 🤖 Cortex 2.0: Grounding World Models in Real-World Industrial Deployment
[08:36] 🧠 Convergent Evolution: How Different Language Models Learn Similar Number Representations
[09:21] 🤝 SAVOIR: Learning Social Savoir-Faire via Shapley-based Reward Attribution


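2026.04.22 | Virtual try-on generates HD results in 3.9 seconds; co-generation makes HOI video physically consistent
04/22/2026

[Contents]
The 15 papers in this episode: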
[00:23] 👗 Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items
[01:05] 🎬 CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
[01:58] 🤖 AgentSPEX: An Agent SPecification and EXecution Language
[02:51] 📐 AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model
[03:33] 🚀 TEMPO: Scaling Test-time Training for Large Reasoning Models
[04:26] 🎮 PlayCoder: Making LLM-Generated GUI Code Playable
[05:08] 🕶 ShadowPEFT: Shadow Network for Parameter-Efficient Fine-Tuning
[05:58] 🤖 Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language
[06:44] ⚖ AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation
[07:31] 🔄 Dual-View Training for Instruction-Following Information Retrieval
[08:41] 🔍 Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers
[09:20] 🔗 Understanding and Enforcing Weight Disentanglement in Task Arithmetic
[10:00] ⚡ Speculative Decoding for Autoregressive Video Generation
[11:01] 🧠 Target


2026.04.21 | One step from sentence to image; single-step latent codes handle driving reasoning
04/21/2026

[Contents]
The 15 papers in this episode:
[00:24] 🚀 Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
[01:08] 🚗 OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
[01:54] 🤖 Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
[02:41] 🎮 OpenGame: Open Agentic Coding for Games
[03:48] 🤖 MultiWorld: Scalable Multi-Agent Multi-View Video World Models
[04:44] 🎬 EasyVideoR1: Easier RL for Video Understanding
[05:42] 🧭 WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models
[06:46] 🧠 GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification
[07:34] 🧠 SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
[08:22] 🧩 Crowded in B-Space: Calibrating Shared Directions for LoRA Merging
[09:13] 🧠 When Can LLMs Learn to Reason with Weak Supervision?
[10:04] 🤖 ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
[10:52] 🎬 OmniScript: Towards Audio-Visual Script Generation for Long-Form Ci


2026.04.20 | A training-free image-quality boost for DPMs; two bit flips wreck a model
04/20/2026

[Contents]
The 15 papers in this episode:
[00:20] 🔍 Elucidating the SNR-t Bias of Diffusion Probabilistic Models
[01:00] 💥 Maximal Brain Damage Without Data or Optimization: Disrupting Neural Networks via Sign-Bit Flips
[01:45] 🧠 PersonaVLM: Long-Term Personalized Multimodal LLMs
[02:56] 🧩 Web Retrieval-Aware Chunking (W-RAC) for Efficient and Cost-Effective Retrieval-Augmented Generation Systems
[03:40] ✂ Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning
[04:32] 🚀 Qwen3.5-Omni Technical Report
[05:17] 🧱 Repurposing 3D Generative Model for Autoregressive Layout Generation
[06:02] 🔍 (1D) Ordered Tokens Enable Efficient Test-Time Search
[06:55] 📈 QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies
[07:36] 🧠 Learning Adaptive Reasoning Paths for Efficient Visual Reasoning
[08:29] 🔍 TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment
[09:33] 💡 Can Large Language Models Reinvent Foundational Algorithms?
[10:17] 📊 GTA-2: Ben