Daily Paper Cast

40 Episodes

By: Jingwen Liang, Gengyu Wang

We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and audio are generated by AI. The 10 papers are selected from the highest-voted ones on Hugging Face Daily Papers (https://huggingface.co/papers). Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com

Creators:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com

Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcasts: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236

Cover image by Kawen Kuang: https://kawen.art
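Below is a minimal Python sketch of how this daily selection could be automated: pull the Daily Papers feed and keep the ten highest-voted entries. The API endpoint and the response field names ("paper", "upvotes", "id") are assumptions about the Hugging Face API, not something this page documents.

```python
# Minimal sketch: fetch the current Hugging Face Daily Papers list and keep the
# ten highest-voted entries. Endpoint and field names are assumptions.
import requests

API_URL = "https://huggingface.co/api/daily_papers"  # assumed public endpoint

def top_daily_papers(n: int = 10):
    items = requests.get(API_URL, timeout=30).json()
    papers = [it.get("paper", {}) for it in items]
    papers.sort(key=lambda p: p.get("upvotes", 0), reverse=True)
    return [
        {
            "title": p.get("title", ""),
            "upvotes": p.get("upvotes", 0),
            "arxiv": f"http://arxiv.org/abs/{p.get('id', '')}",
        }
        for p in papers[:n]
    ]

if __name__ == "__main__":
    for paper in top_daily_papers():
        print(f"{paper['upvotes']:>4}  {paper['title']}  {paper['arxiv']}")
```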

A Survey of Context Engineering for Large Language Models
#987
Today at 3:56 AM

🤗 Upvotes: 96 | cs.CL

Authors:
Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, Shenghua Liu

Title:
A Survey of Context Engineering for Large Language Models

Arxiv:
http://arxiv.org/abs/2507.13334v1

Abstract:
The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engineering, a formal discipline that transcends simple prompt design to encompass the systematic optimization of inf...


VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
#986
Today at 3:56 AM

🤗 Upvotes: 52 | cs.CV, cs.AI, cs.CL, cs.LG

Authors:
Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia

Title:
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

Arxiv:
http://arxiv.org/abs/2507.13348v1

Abstract:
Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a s...


$\pi^3$: Scalable Permutation-Equivariant Visual Geometry Learning
#985
Today at 3:56 AM

🤗 Upvotes: 36 | cs.CV

Authors:
Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, Tong He

Title:
$\pi^3$: Scalable Permutation-Equivariant Visual Geometry Learning

Arxiv:
http://arxiv.org/abs/2507.13347v1

Abstract:
We introduce $\pi^3$, a feed-forward neural network that offers a novel approach to visual geometry reconstruction, breaking the reliance on a conventional fixed reference view. Previous methods often anchor their reconstructions to a designated viewpoint, an inductive bias that can lead to instability and failures if the reference is suboptimal. In c...


The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner
#984
Today at 3:55 AM

🤗 Upvotes: 33 | cs.CL

Authors:
Zhouqi Hua, Wenwei Zhang, Chengqi Lyu, Yuzhe Gu, Songyang Gao, Kuikun Liu, Kai Chen

Title:
The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner

Arxiv:
http://arxiv.org/abs/2507.13332v1

Abstract:
Length generalization, the ability to solve problems of longer sequences than those observed during training, poses a core challenge for Transformer-based large language models (LLMs). Although existing studies have predominantly focused on data-driven approaches for arithmetic operations and symbolic manipulation tasks, these approaches tend to be task-specific with limited overall performance. To...


AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning
#983
Today at 3:55 AM

🤗 Upvotes: 30 | cs.CV

Authors:
Yiming Ren, Zhiqiang Lin, Yu Li, Gao Meng, Weiyun Wang, Junjie Wang, Zicheng Lin, Jifeng Dai, Yujiu Yang, Wenhai Wang, Ruihang Chu

Title:
AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning

Arxiv:
http://arxiv.org/abs/2507.12841v1

Abstract:
Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce Any...


Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models
#982
Today at 3:55 AM

🤗 Upvotes: 29 | cs.CV

Authors:
Yudong Jin, Sida Peng, Xuan Wang, Tao Xie, Zhen Xu, Yifan Yang, Yujun Shen, Hujun Bao, Xiaowei Zhou

Title:
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models

Arxiv:
http://arxiv.org/abs/2507.13344v1

Abstract:
This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input. Previous methods solve the issue of insufficient observation by leveraging 4D diffusion models to generate videos at novel viewpoints. However, the generated videos from these models often lac...


RiemannLoRA: A Unified Riemannian Framework for Ambiguity-Free LoRA Optimization
#981
Today at 3:54 AM

🤗 Upvotes: 23 | cs.LG, cs.CL, cs.NA, math.DG, math.NA, 68T07, 65F55, 53Z50

Authors:
Vladimir Bogachev, Vladimir Aletov, Alexander Molozhavenko, Denis Bobkov, Vera Soboleva, Aibek Alanov, Maxim Rakhuba

Title:
RiemannLoRA: A Unified Riemannian Framework for Ambiguity-Free LoRA Optimization

Arxiv:
http://arxiv.org/abs/2507.12142v1

Abstract:
Low-Rank Adaptation (LoRA) has become a widely adopted standard for parameter-efficient fine-tuning of large language models (LLMs), significantly reducing memory and computational demands. However, challenges remain, including finding optimal initialization strategies or mitigating overparametrization in low-rank matrix factorization. In this work, we...


Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs
#980
Yesterday at 2:55 AM

🤗 Upvotes: 50 | cs.CL, cs.AI

Authors:
Yangning Li, Weizhi Zhang, Yuyao Yang, Wei-Chieh Huang, Yaozu Wu, Junyu Luo, Yuanchen Bei, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Chunkit Chan, Yankai Chen, Zhongfen Deng, Yinghui Li, Hai-Tao Zheng, Dongyuan Li, Renhe Jiang, Ming Zhang, Yangqiu Song, Philip S. Yu

Title:
Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs

Arxiv:
http://arxiv.org/abs/2507.09477v2

Abstract:
Retrieval-Augmented Generation (RAG) lifts the factuality of Large Language Models (LLMs) by injecting external knowledge, yet it falls sho...


Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models
#979
Last Thursday at 3:05 AM

🤗 Upvotes: 32 | cs.CV

Authors:
Tiezheng Zhang, Yitong Li, Yu-cheng Chou, Jieneng Chen, Alan Yuille, Chen Wei, Junfei Xiao

Title:
Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models

Arxiv:
http://arxiv.org/abs/2507.07104v2

Abstract:
Building state-of-the-art Vision-Language Models (VLMs) with strong captioning capabilities typically necessitates training on billions of high-quality image-text pairs, requiring millions of GPU hours. This paper introduces the Vision-Language-Vision (VLV) auto-encoder framework, which strategically leverages key pretrained components: a vision encoder, the decoder of a Text-to-Image (T2I) diffusion model, and subsequently, a Large Lan...


EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes
#978
Last Thursday at 3:04 AM

🤗 Upvotes: 24 | cs.CL, cs.AI

Authors:
LG AI Research: Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Kyubeen Han, Seokhee Hong, Junwon Hwang, Taewan Hwang, Joonwon Jang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Euisoon Kim, Hyosang Kim, Jihoon Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Gwangho Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Young Min Paik, Yongmin Park, Youngyong Park, Sanghyun Seo, Sihoon Yang, Heuiyeen Yeen, Sihyuk Yi, Hyeongu Yun

Title:
EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes


Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination
#977
Last Wednesday at 3:43 AM

🤗 Upvotes: 44 | cs.LG, cs.AI, cs.CL

Authors:
Mingqi Wu, Zhihao Zhang, Qiaole Dong, Zhiheng Xi, Jun Zhao, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Yanwei Fu, Qin Liu, Songyang Zhang, Qi Zhang

Title:
Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

Arxiv:
http://arxiv.org/abs/2507.10532v1

Abstract:
The reasoning capabilities of large language models (LLMs) have been a longstanding focus of research. Recent works have further enhanced these capabilities using reinforcement learning (RL), with many new methods claiming significant improvements with minimal or...


SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation
#976
Last Wednesday at 3:42 AM

🤗 Upvotes: 43 | cs.CV, eess.AS

Authors:
Youliang Zhang, Zhaoyang Li, Duomin Wang, Jiahe Zhang, Deyu Zhou, Zixin Yin, Xili Dai, Gang Yu, Xiu Li

Title:
SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

Arxiv:
http://arxiv.org/abs/2507.09862v1

Abstract:
The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual human. To facilitate research in...


Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
#975
Last Wednesday at 3:42 AM

🤗 Upvotes: 31 | cs.CL, cs.LG

Authors:
Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun

Title:
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

Arxiv:
http://arxiv.org/abs/2507.10524v1

Abstract:
Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mix...
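As a rough illustration of the combination the abstract points at, parameter sharing plus adaptive computation, here is a heavily simplified PyTorch sketch: one shared block applied recursively, with a per-token router deciding whether a token takes another recursion step. The thresholded-sigmoid routing rule is an illustrative assumption, not the paper's mechanism.

```python
# Heavily simplified sketch: a single shared block applied recursively, with a
# per-token router deciding which tokens keep recursing. Illustrative only.
import torch
import torch.nn as nn

class RecursiveMixer(nn.Module):
    def __init__(self, d_model=64, max_depth=4):
        super().__init__()
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=128, batch_first=True)
        self.router = nn.Linear(d_model, 1)   # per-token "keep recursing" score
        self.max_depth = max_depth

    def forward(self, x):
        # x: [batch, seq, d_model]
        active = torch.ones(x.shape[:2], dtype=torch.bool, device=x.device)
        for _ in range(self.max_depth):
            if not active.any():
                break
            y = self.shared_block(x)                      # shared parameters at every step
            x = torch.where(active.unsqueeze(-1), y, x)   # only active tokens update
            keep = torch.sigmoid(self.router(x)).squeeze(-1) > 0.5
            active = active & keep                        # tokens drop out of recursion
        return x

model = RecursiveMixer()
print(model(torch.randn(2, 8, 64)).shape)  # torch.Size([2, 8, 64])
```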


EmbRACE-3K: Embodied Reasoning and Action in Complex Environments
#974
Last Wednesday at 3:42 AM

🤗 Upvotes: 25 | cs.CV, cs.AI, cs.CL

Authors:
Mingxian Lin, Wei Huang, Yitang Li, Chengjie Jiang, Kui Wu, Fangwei Zhong, Shengju Qian, Xin Wang, Xiaojuan Qi

Title:
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments

Arxiv:
http://arxiv.org/abs/2507.10548v1

Abstract:
Recent advanced vision-language models (VLMs) have demonstrated strong performance on passive, offline image and video understanding tasks. However, their effectiveness in embodied settings, which require online interaction and active scene understanding, remains limited. In such scenarios, an agent perceives the environment from a first-person per...


REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once
#973
Last Wednesday at 3:41 AM

🤗 Upvotes: 22 | cs.CL

Authors:
Zhuoshi Pan, Qizhi Pei, Yu Li, Qiyao Sun, Zinan Tang, H. Vicky Zhao, Conghui He, Lijun Wu

Title:
REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once

Arxiv:
http://arxiv.org/abs/2507.10541v2

Abstract:
Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-question reasoning through sequential testing, resulting in critical limitations: (1) vulnerability to data contamination and less challenging (e.g., DeepSeek-R1 achieves 97.0% on MAT...
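A minimal sketch of the multi-problem evaluation protocol the abstract describes: pack several questions into a single prompt and grade all extracted answers together. `ask_model` is a hypothetical placeholder for whatever inference call is used.

```python
# Minimal sketch of the "multiple problems at once" stress-test protocol.
# `ask_model` is a hypothetical placeholder for an actual inference call.
import re

def build_stress_prompt(questions):
    header = ("Solve every problem below. "
              "Give each answer on its own line as 'Answer i: <value>'.\n\n")
    body = "\n\n".join(f"Problem {i + 1}: {q}" for i, q in enumerate(questions))
    return header + body

def grade(response: str, gold_answers):
    found = dict(re.findall(r"Answer\s*(\d+):\s*(.+)", response))
    correct = sum(
        1 for i, gold in enumerate(gold_answers, start=1)
        if found.get(str(i), "").strip() == str(gold)
    )
    return correct / len(gold_answers)

questions = ["What is 12 * 7?", "What is the square root of 144?"]
gold = ["84", "12"]
prompt = build_stress_prompt(questions)
# response = ask_model(prompt)           # hypothetical model call
response = "Answer 1: 84\nAnswer 2: 12"  # stand-in output for illustration
print(grade(response, gold))             # 1.0
```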


Test-Time Scaling with Reflective Generative Model
#972
Last Tuesday at 3:57 AM

🤗 Upvotes: 68 | cs.LG, cs.CL

Authors:
Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie

Title:
Test-Time Scaling with Reflective Generative Model

Arxiv:
http://arxiv.org/abs/2507.01951v2

Abstract:
We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3-mini's performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for policy and process reward model: we share the bac...


Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
#971
Last Tuesday at 3:57 AM

🤗 Upvotes: 47 | cs.CV, cs.CL

Authors:
Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, Jia Wang, Chunrui Han, Yuang Peng, Qi Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel

Title:
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

Arxiv:
http://arxiv.org/abs/2507.05255v1

Abstract:
The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement with verifiable rewards. This work investigates how to transfer thi...


NeuralOS: Towards Simulating Operating Systems via Neural Generative Models
#970
Last Tuesday at 3:57 AM

🤗 Upvotes: 45 | cs.CV, cs.AI, cs.CL, cs.HC, cs.LG

Authors:
Luke Rivard, Sun Sun, Hongyu Guo, Wenhu Chen, Yuntian Deng

Title:
NeuralOS: Towards Simulating Operating Systems via Neural Generative Models

Arxiv:
http://arxiv.org/abs/2507.08800v1

Abstract:
We introduce NeuralOS, a neural framework that simulates graphical user interfaces (GUIs) of operating systems by directly predicting screen frames in response to user inputs such as mouse movements, clicks, and keyboard events. NeuralOS combines a recurrent neural network (RNN), which tracks computer state, with a diffusion-based neural ren...
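A skeleton of the architecture the abstract describes: an RNN tracks latent "computer state" from a stream of user input events, and a separate module renders the next screen frame from that state. The renderer below is a plain ConvTranspose stack standing in for the diffusion-based renderer; sizes and event features are illustrative assumptions.

```python
# Skeleton of the described architecture: RNN state tracker + frame renderer.
# The ConvTranspose stack is a stand-in for the diffusion-based renderer.
import torch
import torch.nn as nn

class NeuralGUISimulator(nn.Module):
    def __init__(self, event_dim=16, state_dim=256):
        super().__init__()
        self.state_tracker = nn.GRU(event_dim, state_dim, batch_first=True)
        self.renderer = nn.Sequential(
            nn.Linear(state_dim, 64 * 8 * 8),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, events):
        # events: [batch, time, event_dim] (mouse / keyboard features)
        _, h = self.state_tracker(events)    # h: [1, batch, state_dim]
        return self.renderer(h.squeeze(0))   # predicted frame [batch, 3, 32, 32]

frames = NeuralGUISimulator()(torch.randn(2, 5, 16))
print(frames.shape)  # torch.Size([2, 3, 32, 32])
```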


CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering
#969
Last Tuesday at 3:56 AM

🤗 Upvotes: 43 | cs.CV

Authors:
Zhengqing Wang, Yuefan Wu, Jiacheng Chen, Fuyang Zhang, Yasutaka Furukawa

Title:
CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering

Arxiv:
http://arxiv.org/abs/2507.08776v2

Abstract:
This paper proposes a neural rendering approach that represents a scene as "compressed light-field tokens (CLiFTs)", retaining rich appearance and geometric information of a scene. CLiFT enables compute-efficient rendering by compressed tokens, while being capable of changing the number of tokens to represent a scene or render a novel view with one trained network. Concretely, giv...


KV Cache Steering for Inducing Reasoning in Small Language Models
#968
Last Tuesday at 3:56 AM

🤗 Upvotes: 26 | cs.CL, cs.AI

Authors:
Max Belitsky, Dawid J. Kopiczko, Michael Dorkenwald, M. Jehanzeb Mirza, Cees G. M. Snoek, Yuki M. Asano

Title:
KV Cache Steering for Inducing Reasoning in Small Language Models

Arxiv:
http://arxiv.org/abs/2507.08799v1

Abstract:
We propose cache steering, a lightweight method for implicit steering of language models via a one-shot intervention applied directly to the key-value cache. To validate its effectiveness, we apply cache steering to induce chain-of-thought reasoning in small language models. Our approach leverages GPT-4o-generated reasoning traces to...
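A conceptual sketch of what a one-shot key-value cache intervention can look like: add per-layer steering vectors to the cached keys and values before generation resumes. The cache layout (list of per-layer (key, value) tensors) and the function name are illustrative assumptions; extracting the steering vectors from GPT-4o reasoning traces is not shown.

```python
# Conceptual sketch of cache steering: a one-shot additive edit to an existing
# key-value cache. Cache layout [batch, heads, seq, head_dim] is assumed.
import torch

def steer_cache(past_key_values, key_vecs, value_vecs, scale=1.0, last_n=1):
    """Add per-layer steering vectors to the last `last_n` cached positions."""
    steered = []
    for (k, v), dk, dv in zip(past_key_values, key_vecs, value_vecs):
        k = k.clone()
        v = v.clone()
        # dk, dv: [head_dim], broadcast over batch, heads, and steered positions
        k[..., -last_n:, :] += scale * dk
        v[..., -last_n:, :] += scale * dv
        steered.append((k, v))
    return steered

# Toy usage with a fake 2-layer cache.
layers, batch, heads, seq, hd = 2, 1, 4, 10, 16
cache = [(torch.randn(batch, heads, seq, hd), torch.randn(batch, heads, seq, hd))
         for _ in range(layers)]
key_vecs = [torch.randn(hd) * 0.1 for _ in range(layers)]
value_vecs = [torch.randn(hd) * 0.1 for _ in range(layers)]
new_cache = steer_cache(cache, key_vecs, value_vecs, scale=2.0)
print(new_cache[0][0].shape)  # torch.Size([1, 4, 10, 16])
```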


Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
#967
Last Tuesday at 3:56 AM

🤗 Upvotes: 24 | cs.CL, cs.AI

Authors:
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-Weiss, Oliver Wang, Ya Xu, Ollie Purkiss, Uri Mendlovic, Ilaï Deutel, Nam Nguyen, Adam Langley, Flip Korn, Lucia Rossazza, Alexandre Ramé, Sagar Waghmare, Helen Miller, Vaishakh Keshava, Ying Jian...


Neural-Driven Image Editing
#966
Last Tuesday at 3:55 AM

🤗 Upvotes: 22 | cs.CV

Authors:
Pengfei Zhou, Jie Xia, Xiaopeng Peng, Wangbo Zhao, Zilong Ye, Zekai Li, Suorong Yang, Jiadong Pan, Yuanxiang Chen, Ziqiao Wang, Kai Wang, Qian Zheng, Xiaojun Chang, Gang Pan, Shurong Dong, Kaipeng Zhang, Yang You

Title:
Neural-Driven Image Editing

Arxiv:
http://arxiv.org/abs/2507.05397v1

Abstract:
Traditional image editing typically relies on manual prompting, making it labor-intensive and inaccessible to individuals with limited motor control or language abilities. Leveraging recent advances in brain-computer interfaces (BCIs) and generative models, we propose LoongX, a hands-free image edi...


Scaling RL to Long Videos
#965
07/12/2025

🤗 Upvotes: 95 | cs.CV, cs.AI, cs.CL

Authors:
Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han

Title:
Scaling RL to Long Videos

Arxiv:
http://arxiv.org/abs/2507.07966v1

Abstract:
We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 52K l...


T-LoRA: Single Image Diffusion Model Customization Without Overfitting
#964
07/12/2025

🤗 Upvotes: 83 | cs.CV

Authors:
Vera Soboleva, Aibek Alanov, Andrey Kuznetsov, Konstantin Sobolev

Title:
T-LoRA: Single Image Diffusion Model Customization Without Overfitting

Arxiv:
http://arxiv.org/abs/2507.05964v1

Abstract:
While diffusion model fine-tuning offers a powerful approach for customizing pre-trained models to generate specific objects, it frequently suffers from overfitting when training samples are limited, compromising both generalization capability and output diversity. This paper tackles the challenging yet most impactful task of adapting a diffusion model using just a single concept image, as single-image customization holds the greatest pra...


Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
#963
07/12/2025

🤗 Upvotes: 37 | cs.CV, cs.AI, cs.CL

Authors:
Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, Zhuochen Wang, Zhaoxiang Zhang

Title:
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology

Arxiv:
http://arxiv.org/abs/2507.07999v1

Abstract:
Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, just like human "thinking with images". However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a d...


OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding
#962
07/12/2025

🤗 Upvotes: 29 | cs.CV

Authors:
JingLi Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, Jiangmiao Pang

Title:
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding

Arxiv:
http://arxiv.org/abs/2507.07984v1

Abstract:
Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of...


Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
#961
07/12/2025

🤗 Upvotes: 24 | cs.CV, cs.AI

Authors:
Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim

Title:
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

Arxiv:
http://arxiv.org/abs/2507.07990v1

Abstract:
Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. Our key insight is to...
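In the spirit of training-free, similarity-based token merging, here is a simplified single-step sketch that greedily averages the most similar adjacent token pairs. It is a generic illustration of merging by feature similarity, not the paper's multi-granular spatio-temporal scheme.

```python
# Simplified, training-free token-merging step: greedily average the r most
# similar non-overlapping adjacent token pairs. Generic illustration only.
import torch
import torch.nn.functional as F

def merge_adjacent(x: torch.Tensor, r: int) -> torch.Tensor:
    """x: [num_tokens, dim]; returns [num_tokens - r, dim] (if r pairs exist)."""
    sim = F.cosine_similarity(x[:-1], x[1:], dim=-1)   # similarity of (i, i+1)
    order = sim.argsort(descending=True)
    merged_into_prev, used = set(), set()
    for i in order.tolist():
        if len(merged_into_prev) == r:
            break
        if i in used or (i + 1) in used:
            continue                                    # keep pairs non-overlapping
        merged_into_prev.add(i)
        used.update((i, i + 1))
    out, skip = [], False
    for i in range(x.shape[0]):
        if skip:
            skip = False
            continue
        if i in merged_into_prev:
            out.append((x[i] + x[i + 1]) / 2)           # merge the pair by averaging
            skip = True
        else:
            out.append(x[i])
    return torch.stack(out)

tokens = torch.randn(16, 64)
print(merge_adjacent(tokens, r=4).shape)  # torch.Size([12, 64])
```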


Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
#960
07/12/2025

🤗 Upvotes: 23 | cs.CV, cs.AI

Authors:
Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, Jiang Bian

Title:
Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

Arxiv:
http://arxiv.org/abs/2507.07982v1

Abstract:
Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometric-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nat...


PyVision: Agentic Vision with Dynamic Tooling
#959
07/12/2025

🤗 Upvotes: 22 | cs.CL, cs.AI, cs.CV

Authors:
Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, Chen Wei

Title:
PyVision: Agentic Vision with Dynamic Tooling

Arxiv:
http://arxiv.org/abs/2507.07998v1

Abstract:
LLMs are increasingly deployed as agents, systems capable of planning, reasoning, and dynamically calling external tools. However, in visual reasoning, prior approaches largely remain limited by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables MLLMs to autonomously generate, execute, and refine Python-based too...


4KAgent: Agentic Any Image to 4K Super-Resolution
#958
07/11/2025

🤗 Upvotes: 56 | cs.CV, eess.IV

Authors:
Yushen Zuo, Qi Zheng, Mingyang Wu, Xinrui Jiang, Renjie Li, Jian Wang, Yide Zhang, Gengchen Mai, Lihong V. Wang, James Zou, Xiaoyu Wang, Ming-Hsuan Yang, Zhengzhong Tu

Title:
4KAgent: Agentic Any Image to 4K Super-Resolution

Arxiv:
http://arxiv.org/abs/2507.07105v1

Abstract:
We present 4KAgent, a unified agentic super-resolution generalist system designed to universally upscale any image to 4K resolution (and even higher, if applied iteratively). Our system can transform images from extremely low resolutions with severe degradations, for example, highly dis...


Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data
#957
07/11/2025

🤗 Upvotes: 41 | cs.CV

Authors:
Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, Jingbo Wang

Title:
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data

Arxiv:
http://arxiv.org/abs/2507.07095v1

Abstract:
Generating diverse and natural human motion sequences based on textual descriptions constitutes a fundamental and challenging research area within the domains of computer vision, graphics, and robotics. Despite significant advancements in this field, current methodologies often face challenges regarding zero-shot generalization capabilities, largely attributable to the limited siz...


Perception-Aware Policy Optimization for Multimodal Reasoning
#956
07/11/2025

🤗 Upvotes: 34 | cs.CL

Authors:
Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji

Title:
Perception-Aware Policy Optimization for Multimodal Reasoning

Arxiv:
http://arxiv.org/abs/2507.06448v1

Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In par...


MIRIX: Multi-Agent Memory System for LLM-Based Agents
#955
07/11/2025

🤗 Upvotes: 33 | cs.CL, cs.AI

Authors:
Yu Wang, Xi Chen

Title:
MIRIX: Multi-Agent Memory System for LLM-Based Agents

Arxiv:
http://arxiv.org/abs/2507.07957v1

Abstract:
Although memory capabilities of AI agents are gaining increasing attention, existing solutions remain fundamentally limited. Most rely on flat, narrowly scoped memory components, constraining their ability to personalize, abstract, and reliably recall user-specific information over time. To this end, we introduce MIRIX, a modular, multi-agent memory system that redefines the future of AI memory by solving the field's most critical challenge: enabling lan...


Rethinking Verification for LLM Code Generation: From Generation to Testing
#954
07/11/2025

🤗 Upvotes: 23 | cs.CL

Authors:
Zihan Ma, Taolin Zhang, Maosong Cao, Junnan Liu, Wenwei Zhang, Minnan Luo, Songyang Zhang, Kai Chen

Title:
Rethinking Verification for LLM Code Generation: From Generation to Testing

Arxiv:
http://arxiv.org/abs/2507.06920v2

Abstract:
Large language models (LLMs) have recently achieved notable success in code-generation benchmarks such as HumanEval and LiveCodeBench. However, a detailed examination reveals that these evaluation suites often comprise only a limited number of homogeneous test cases, resulting in subtle faults going undetected. This not only artificially inflates measured performance but...
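The abstract's point, that small, homogeneous test suites let subtle faults slip through and inflate scores, can be made concrete with a toy example. The task and test cases below are invented for illustration; they are not taken from any benchmark mentioned above.

```python
# Illustration: a small, homogeneous test suite passes a subtly faulty solution
# that a more diverse suite rejects. Task and tests are invented examples.
def faulty_max(xs):
    """Buggy 'max': fails when all values are negative."""
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best

weak_tests = [([1, 2, 3], 3), ([5], 5), ([0, 9, 4], 9)]       # homogeneous cases
strong_tests = weak_tests + [([-3, -1, -7], -1), ([-5], -5)]  # adds negative-only cases

def passes(solution, tests):
    return all(solution(xs) == expected for xs, expected in tests)

print(passes(faulty_max, weak_tests))    # True  -> inflated measured performance
print(passes(faulty_max, strong_tests))  # False -> the fault is detected
```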


SingLoRA: Low Rank Adaptation Using a Single Matrix
#953
07/10/2025

🤗 Upvotes: 68 | cs.AI

Authors:
David Bensaïd, Noam Rotstein, Roy Velich, Daniel Bensaïd, Ron Kimmel

Title:
SingLoRA: Low Rank Adaptation Using a Single Matrix

Arxiv:
http://arxiv.org/abs/2507.05566v1

Abstract:
Low-Rank Adaptation (LoRA) has significantly advanced parameter-efficient fine-tuning of large pretrained models. LoRA augments the pre-trained weights of a model by adding the product of two smaller matrices that together form a low-rank matrix update. Recent research has shown that scale disparities between these two matrices often cause unstable training dynamics, leading to suboptimal performance. In th...
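The abstract's description of LoRA (a frozen weight plus the product of two smaller matrices) and the title's single-matrix idea can be sketched in a few lines of PyTorch. The single-matrix form below (A Aᵀ) is an assumed reading of "SingLoRA", not the paper's exact formulation.

```python
# Minimal sketch contrasting standard two-matrix LoRA with a single-matrix
# update. The A @ A.T form is an assumed reading of "SingLoRA".
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen weight W plus a low-rank update (alpha/r) * B @ A."""
    def __init__(self, d: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d, d), requires_grad=False)
        self.A = nn.Parameter(torch.randn(r, d) * 0.01)  # trainable
        self.B = nn.Parameter(torch.zeros(d, r))         # trainable, zero-init
        self.scale = alpha / r

    def forward(self, x):
        return x @ (self.weight + self.scale * self.B @ self.A).T

class SingleMatrixLoRALinear(nn.Module):
    """Same idea with one trainable factor: update = (alpha/r) * A @ A.T."""
    def __init__(self, d: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d, d), requires_grad=False)
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)  # single trainable matrix
        self.scale = alpha / r

    def forward(self, x):
        return x @ (self.weight + self.scale * self.A @ self.A.T).T

x = torch.randn(2, 64)
print(LoRALinear(64)(x).shape, SingleMatrixLoRALinear(64)(x).shape)
```

With a single factor there is no pair of matrices whose relative scales can drift apart, which is the instability the abstract attributes to the standard two-matrix parameterization.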


A Survey on Latent Reasoning
#952
07/10/2025

🤗 Upvotes: 60 | cs.CL

Authors:
Rui-Jie Zhu, Tianhao Peng, Tianhao Cheng, Xingwei Qu, Jinfa Huang, Dawei Zhu, Hao Wang, Kaiwen Xue, Xuanliang Zhang, Yong Shan, Tianle Cai, Taylor Kergan, Assel Kembay, Andrew Smith, Chenghua Lin, Binh Nguyen, Yuqi Pan, Yuhong Chou, Zefan Cai, Zhenhe Wu, Yongchi Zhao, Tianyu Liu, Jian Yang, Wangchunshu Zhou, Chujie Zheng, Chongxuan Li, Yuyin Zhou, Zhoujun Li, Zhaoxiang Zhang, Jiaheng Liu, Ge Zhang, Wenhao Huang, Jason Eshraghian

Title:
A Survey on Latent Reasoning

Arxiv:
http://arxiv.org/abs/2507.06203v1

Abstract:
Large Language Models (LLMs) hav...


OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion
#951
07/10/2025

🤗 Upvotes: 45 | cs.CV

Authors:
Yunhan Yang, Yufan Zhou, Yuan-Chen Guo, Zi-Xin Zou, Yukun Huang, Ying-Tian Liu, Hao Xu, Ding Liang, Yan-Pei Cao, Xihui Liu

Title:
OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion

Arxiv:
http://arxiv.org/abs/2507.06165v1

Abstract:
The creation of 3D assets with explicit, editable part structures is crucial for advancing interactive applications, yet most generative methods produce only monolithic shapes, limiting their utility. We introduce OmniPart, a novel framework for part-aware 3D object generation designed to achieve high semantic decoupling among com...


How to Train Your LLM Web Agent: A Statistical Diagnosis
#950
07/10/2025

🤗 Upvotes: 40 | cs.AI, cs.LG, stat.ML

Authors:
Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Penaloza, Hadi Nekoei, Megh Thakkar, Thibault Le Sellier de Chezelles, Nicolas Gontier, Miguel Muñoz-Mármol, Sahar Omidi Shayegan, Stefania Raimondo, Xue Liu, Alexandre Drouin, Laurent Charlin, Alexandre Piché, Alexandre Lacoste, Massimo Caccia

Title:
How to Train Your LLM Web Agent: A Statistical Diagnosis

Arxiv:
http://arxiv.org/abs/2507.04103v1

Abstract:
LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progre...


StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling
#949
07/10/2025

🤗 Upvotes: 35 | cs.RO, cs.CV

Authors:
Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, Xihui Liu, Jiangmiao Pang

Title:
StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling

Arxiv:
http://arxiv.org/abs/2507.05240v1

Abstract:
Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate actions with low latency grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current VLN methods based on Video-LLM often face tra...


CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization
#948
07/10/2025

🤗 Upvotes: 35 | cs.CL

Authors:
Zhongyuan Peng, Yifan Yao, Kaijing Ma, Shuyue Guo, Yizhe Li, Yichi Zhang, Chenchen Zhang, Yifan Zhang, Zhouliang Yu, Luming Li, Minghao Liu, Yihang Xia, Jiawei Shen, Yuchen Wu, Yixin Cao, Zhaoxiang Zhang, Wenhao Huang, Jiaheng Liu, Ge Zhang

Title:
CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization

Arxiv:
http://arxiv.org/abs/2507.06181v1

Abstract:
Translating natural language mathematical statements into formal, executable code is a fundamental challenge in automated theorem proving. While prior work has focused on generation and compilation success, little attention has bee...