Daily Paper Cast
We publish 10 episodes every day to discuss 10 AI research papers. Both the podcast scripts and audio are generated by AI. The 10 papers are selected from the highest-voted ones on Huggingface Daily Paper (https://huggingface.co/papers).
Feedback and suggestions are welcome! Email us: dailypapercast.ai@gmail.com
Creator:
Jingwen Liang, 3D ML, https://www.linkedin.com/in/jingwen-liang/
Gengyu Wang, NLP, http://wanggengyu.com
Listen on:
Spotify: https://open.spotify.com/show/21nrhmdaA8qoBiH8q03NXL
Apple Podcast: https://podcasts.apple.com/us/podcast/daily-paper-cast/id1777620236
Cover Image by Kawen Kuang https://kawen.art
A Survey of Context Engineering for Large Language Models
🤗 Upvotes: 96 | cs.CL
Authors:
Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, Shenghua Liu
Title:
A Survey of Context Engineering for Large Language Models
Arxiv:
http://arxiv.org/abs/2507.13334v1
Abstract:
The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engineering, a formal discipline that transcends simple prompt design to encompass the systematic optimization of inf...
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
🤗 Upvotes: 52 | cs.CV, cs.AI, cs.CL, cs.LG
Authors:
Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia
Title:
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Arxiv:
http://arxiv.org/abs/2507.13348v1
Abstract:
Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a s...
$\pi^3$: Scalable Permutation-Equivariant Visual Geometry Learning
🤗 Upvotes: 36 | cs.CV
Authors:
Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, Tong He
Title:
$\pi^3$: Scalable Permutation-Equivariant Visual Geometry Learning
Arxiv:
http://arxiv.org/abs/2507.13347v1
Abstract:
We introduce $\pi^3$, a feed-forward neural network that offers a novel approach to visual geometry reconstruction, breaking the reliance on a conventional fixed reference view. Previous methods often anchor their reconstructions to a designated viewpoint, an inductive bias that can lead to instability and failures if the reference is suboptimal. In c...
The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner
🤗 Upvotes: 33 | cs.CL
Authors:
Zhouqi Hua, Wenwei Zhang, Chengqi Lyu, Yuzhe Gu, Songyang Gao, Kuikun Liu, Kai Chen
Title:
The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner
Arxiv:
http://arxiv.org/abs/2507.13332v1
Abstract:
Length generalization, the ability to solve problems of longer sequences than those observed during training, poses a core challenge of Transformer-based large language models (LLM). Although existing studies have predominantly focused on data-driven approaches for arithmetic operations and symbolic manipulation tasks, these approaches tend to be task-specific with limited overall performance. To...
AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning
🤗 Upvotes: 30 | cs.CV
Authors:
Yiming Ren, Zhiqiang Lin, Yu Li, Gao Meng, Weiyun Wang, Junjie Wang, Zicheng Lin, Jifeng Dai, Yujiu Yang, Wenhai Wang, Ruihang Chu
Title:
AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning
Arxiv:
http://arxiv.org/abs/2507.12841v1
Abstract:
Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce Any...
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models
🤗 Upvotes: 29 | cs.CV
Authors:
Yudong Jin, Sida Peng, Xuan Wang, Tao Xie, Zhen Xu, Yifan Yang, Yujun Shen, Hujun Bao, Xiaowei Zhou
Title:
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models
Arxiv:
http://arxiv.org/abs/2507.13344v1
Abstract:
This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input. Previous methods solve the issue of insufficient observation by leveraging 4D diffusion models to generate videos at novel viewpoints. However, the generated videos from these models often lac...
RiemannLoRA: A Unified Riemannian Framework for Ambiguity-Free LoRA Optimization
🤗 Upvotes: 23 | cs.LG, cs.CL, cs.NA, math.DG, math.NA, 68T07, 65F55, 53Z50
Authors:
Vladimir Bogachev, Vladimir Aletov, Alexander Molozhavenko, Denis Bobkov, Vera Soboleva, Aibek Alanov, Maxim Rakhuba
Title:
RiemannLoRA: A Unified Riemannian Framework for Ambiguity-Free LoRA Optimization
Arxiv:
http://arxiv.org/abs/2507.12142v1
Abstract:
Low-Rank Adaptation (LoRA) has become a widely adopted standard for parameter-efficient fine-tuning of large language models (LLMs), significantly reducing memory and computational demands. However, challenges remain, including finding optimal initialization strategies or mitigating overparametrization in low-rank matrix factorization. In this work, we...
Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs
🤗 Upvotes: 50 | cs.CL, cs.AI
Authors:
Yangning Li, Weizhi Zhang, Yuyao Yang, Wei-Chieh Huang, Yaozu Wu, Junyu Luo, Yuanchen Bei, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Chunkit Chan, Yankai Chen, Zhongfen Deng, Yinghui Li, Hai-Tao Zheng, Dongyuan Li, Renhe Jiang, Ming Zhang, Yangqiu Song, Philip S. Yu
Title:
Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs
Arxiv:
http://arxiv.org/abs/2507.09477v2
Abstract:
Retrieval-Augmented Generation (RAG) lifts the factuality of Large Language Models (LLMs) by injecting external knowledge, yet it falls sho...
Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models
🤗 Upvotes: 32 | cs.CV
Authors:
Tiezheng Zhang, Yitong Li, Yu-cheng Chou, Jieneng Chen, Alan Yuille, Chen Wei, Junfei Xiao
Title:
Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models
Arxiv:
http://arxiv.org/abs/2507.07104v2
Abstract:
Building state-of-the-art Vision-Language Models (VLMs) with strong captioning capabilities typically necessitates training on billions of high-quality image-text pairs, requiring millions of GPU hours. This paper introduces the Vision-Language-Vision (VLV) auto-encoder framework, which strategically leverages key pretrained components: a vision encoder, the decoder of a Text-to-Image (T2I) diffusion model, and subsequently, a Large Lan...
EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes
🤗 Upvotes: 24 | cs.CL, cs.AI
Authors:
LG AI Research: Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Kyubeen Han, Seokhee Hong, Junwon Hwang, Taewan Hwang, Joonwon Jang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Euisoon Kim, Hyosang Kim, Jihoon Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Gwangho Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Young Min Paik, Yongmin Park, Youngyong Park, Sanghyun Seo, Sihoon Yang, Heuiyeen Yeen, Sihyuk Yi, Hyeongu Yun
Title:
...
Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination
🤗 Upvotes: 44 | cs.LG, cs.AI, cs.CL
Authors:
Mingqi Wu, Zhihao Zhang, Qiaole Dong, Zhiheng Xi, Jun Zhao, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Yanwei Fu, Qin Liu, Songyang Zhang, Qi Zhang
Title:
Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination
Arxiv:
http://arxiv.org/abs/2507.10532v1
Abstract:
The reasoning capabilities of large language models (LLMs) have been a longstanding focus of research. Recent works have further enhanced these capabilities using reinforcement learning (RL), with many new methods claiming significant improvements with minimal or...
SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation
🤗 Upvotes: 43 | cs.CV, eess.AS
Authors:
Youliang Zhang, Zhaoyang Li, Duomin Wang, Jiahe Zhang, Deyu Zhou, Zixin Yin, Xili Dai, Gang Yu, Xiu Li
Title:
SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation
Arxiv:
http://arxiv.org/abs/2507.09862v1
Abstract:
The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual human. To facilitate research in...
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
🤗 Upvotes: 31 | cs.CL, cs.LG
Authors:
Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun
Title:
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
Arxiv:
http://arxiv.org/abs/2507.10524v1
Abstract:
Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mix...
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments
🤗 Upvotes: 25 | cs.CV, cs.AI, cs.CL
Authors:
Mingxian Lin, Wei Huang, Yitang Li, Chengjie Jiang, Kui Wu, Fangwei Zhong, Shengju Qian, Xin Wang, Xiaojuan Qi
Title:
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments
Arxiv:
http://arxiv.org/abs/2507.10548v1
Abstract:
Recent advanced vision-language models (VLMs) have demonstrated strong performance on passive, offline image and video understanding tasks. However, their effectiveness in embodied settings, which require online interaction and active scene understanding, remains limited. In such scenarios, an agent perceives the environment from a first-person per...
REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once
🤗 Upvotes: 22 | cs.CL
Authors:
Zhuoshi Pan, Qizhi Pei, Yu Li, Qiyao Sun, Zinan Tang, H. Vicky Zhao, Conghui He, Lijun Wu
Title:
REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once
Arxiv:
http://arxiv.org/abs/2507.10541v2
Abstract:
Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-question reasoning through sequential testing, resulting in critical limitations: (1) vulnerability to data contamination and less challenging (e.g., DeepSeek-R1 achieves 97.0% on MAT...
Test-Time Scaling with Reflective Generative Model
🤗 Upvotes: 68 | cs.LG, cs.CL
Authors:
Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie
Title:
Test-Time Scaling with Reflective Generative Model
Arxiv:
http://arxiv.org/abs/2507.01951v2
Abstract:
We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3-mini's performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for policy and process reward model: we share the bac...
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
🤗 Upvotes: 47 | cs.CV, cs.CL
Authors:
Yana Wei, Liang Zhao, Jianjian Sun, Kangheng Lin, Jisheng Yin, Jingcheng Hu, Yinmin Zhang, En Yu, Haoran Lv, Zejia Weng, Jia Wang, Chunrui Han, Yuang Peng, Qi Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel
Title:
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
Arxiv:
http://arxiv.org/abs/2507.05255v1
Abstract:
The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement with verifiable rewards. This work investigates how to transfer thi...
NeuralOS: Towards Simulating Operating Systems via Neural Generative Models
🤗 Upvotes: 45 | cs.CV, cs.AI, cs.CL, cs.HC, cs.LG
Authors:
Luke Rivard, Sun Sun, Hongyu Guo, Wenhu Chen, Yuntian Deng
Title:
NeuralOS: Towards Simulating Operating Systems via Neural Generative Models
Arxiv:
http://arxiv.org/abs/2507.08800v1
Abstract:
We introduce NeuralOS, a neural framework that simulates graphical user interfaces (GUIs) of operating systems by directly predicting screen frames in response to user inputs such as mouse movements, clicks, and keyboard events. NeuralOS combines a recurrent neural network (RNN), which tracks computer state, with a diffusion-based neural ren...
CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering
🤗 Upvotes: 43 | cs.CV
Authors:
Zhengqing Wang, Yuefan Wu, Jiacheng Chen, Fuyang Zhang, Yasutaka Furukawa
Title:
CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering
Arxiv:
http://arxiv.org/abs/2507.08776v2
Abstract:
This paper proposes a neural rendering approach that represents a scene as "compressed light-field tokens (CLiFTs)", retaining rich appearance and geometric information of a scene. CLiFT enables compute-efficient rendering by compressed tokens, while being capable of changing the number of tokens to represent a scene or render a novel view with one trained network. Concretely, giv...
KV Cache Steering for Inducing Reasoning in Small Language Models
🤗 Upvotes: 26 | cs.CL, cs.AI
Authors:
Max Belitsky, Dawid J. Kopiczko, Michael Dorkenwald, M. Jehanzeb Mirza, Cees G. M. Snoek, Yuki M. Asano
Title:
KV Cache Steering for Inducing Reasoning in Small Language Models
Arxiv:
http://arxiv.org/abs/2507.08799v1
Abstract:
We propose cache steering, a lightweight method for implicit steering of language models via a one-shot intervention applied directly to the key-value cache. To validate its effectiveness, we apply cache steering to induce chain-of-thought reasoning in small language models. Our approach leverages GPT-4o-generated reasoning traces to...
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
🤗 Upvotes: 24 | cs.CL, cs.AI
Authors:
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-Weiss, Oliver Wang, Ya Xu, Ollie Purkiss, Uri Mendlovic, Ilaï Deutel, Nam Nguyen, Adam Langley, Flip Korn, Lucia Rossazza, Alexandre Ramé, Sagar Waghmare, Helen Miller, Vaishakh Keshava, Ying Jian...
Neural-Driven Image Editing
🤗 Upvotes: 22 | cs.CV
Authors:
Pengfei Zhou, Jie Xia, Xiaopeng Peng, Wangbo Zhao, Zilong Ye, Zekai Li, Suorong Yang, Jiadong Pan, Yuanxiang Chen, Ziqiao Wang, Kai Wang, Qian Zheng, Xiaojun Chang, Gang Pan, Shurong Dong, Kaipeng Zhang, Yang You
Title:
Neural-Driven Image Editing
Arxiv:
http://arxiv.org/abs/2507.05397v1
Abstract:
Traditional image editing typically relies on manual prompting, making it labor-intensive and inaccessible to individuals with limited motor control or language abilities. Leveraging recent advances in brain-computer interfaces (BCIs) and generative models, we propose LoongX, a hands-free image edi...
Scaling RL to Long Videos
🤗 Upvotes: 95 | cs.CV, cs.AI, cs.CL
Authors:
Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han
Title:
Scaling RL to Long Videos
Arxiv:
http://arxiv.org/abs/2507.07966v1
Abstract:
We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 52K l...
T-LoRA: Single Image Diffusion Model Customization Without Overfitting
🤗 Upvotes: 83 | cs.CV
Authors:
Vera Soboleva, Aibek Alanov, Andrey Kuznetsov, Konstantin Sobolev
Title:
T-LoRA: Single Image Diffusion Model Customization Without Overfitting
Arxiv:
http://arxiv.org/abs/2507.05964v1
Abstract:
While diffusion model fine-tuning offers a powerful approach for customizing pre-trained models to generate specific objects, it frequently suffers from overfitting when training samples are limited, compromising both generalization capability and output diversity. This paper tackles the challenging yet most impactful task of adapting a diffusion model using just a single concept image, as single-image customization holds the greatest pra...
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
🤗 Upvotes: 37 | cs.CV, cs.AI, cs.CL
Authors:
Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, Zhuochen Wang, Zhaoxiang Zhang
Title:
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
Arxiv:
http://arxiv.org/abs/2507.07999v1
Abstract:
Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, just like human "thinking with images". However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a d...
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding
🤗 Upvotes: 29 | cs.CV
Authors:
JingLi Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, Jiangmiao Pang
Title:
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding
Arxiv:
http://arxiv.org/abs/2507.07984v1
Abstract:
Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of...
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
🤗 Upvotes: 24 | cs.CV, cs.AI
Authors:
Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim
Title:
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
Arxiv:
http://arxiv.org/abs/2507.07990v1
Abstract:
Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. Our key insight is to...
Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
🤗 Upvotes: 23 | cs.CV, cs.AI
Authors:
Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, Jiang Bian
Title:
Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
Arxiv:
http://arxiv.org/abs/2507.07982v1
Abstract:
Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometric-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nat...
PyVision: Agentic Vision with Dynamic Tooling
🤗 Upvotes: 22 | cs.CL, cs.AI, cs.CV
Authors:
Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, Chen Wei
Title:
PyVision: Agentic Vision with Dynamic Tooling
Arxiv:
http://arxiv.org/abs/2507.07998v1
Abstract:
LLMs are increasingly deployed as agents, systems capable of planning, reasoning, and dynamically calling external tools. However, in visual reasoning, prior approaches largely remain limited by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables MLLMs to autonomously generate, execute, and refine Python-based too...
4KAgent: Agentic Any Image to 4K Super-Resolution
🤗 Upvotes: 56 | cs.CV, eess.IV
Authors:
Yushen Zuo, Qi Zheng, Mingyang Wu, Xinrui Jiang, Renjie Li, Jian Wang, Yide Zhang, Gengchen Mai, Lihong V. Wang, James Zou, Xiaoyu Wang, Ming-Hsuan Yang, Zhengzhong Tu
Title:
4KAgent: Agentic Any Image to 4K Super-Resolution
Arxiv:
http://arxiv.org/abs/2507.07105v1
Abstract:
We present 4KAgent, a unified agentic super-resolution generalist system designed to universally upscale any image to 4K resolution (and even higher, if applied iteratively). Our system can transform images from extremely low resolutions with severe degradations, for example, highly dis...
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data
🤗 Upvotes: 41 | cs.CV
Authors:
Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, Jingbo Wang
Title:
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data
Arxiv:
http://arxiv.org/abs/2507.07095v1
Abstract:
Generating diverse and natural human motion sequences based on textual descriptions constitutes a fundamental and challenging research area within the domains of computer vision, graphics, and robotics. Despite significant advancements in this field, current methodologies often face challenges regarding zero-shot generalization capabilities, largely attributable to the limited siz...
Perception-Aware Policy Optimization for Multimodal Reasoning
🤗 Upvotes: 34 | cs.CL
Authors:
Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji
Title:
Perception-Aware Policy Optimization for Multimodal Reasoning
Arxiv:
http://arxiv.org/abs/2507.06448v1
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In par...
MIRIX: Multi-Agent Memory System for LLM-Based Agents
🤗 Upvotes: 33 | cs.CL, cs.AI
Authors:
Yu Wang, Xi Chen
Title:
MIRIX: Multi-Agent Memory System for LLM-Based Agents
Arxiv:
http://arxiv.org/abs/2507.07957v1
Abstract:
Although memory capabilities of AI agents are gaining increasing attention, existing solutions remain fundamentally limited. Most rely on flat, narrowly scoped memory components, constraining their ability to personalize, abstract, and reliably recall user-specific information over time. To this end, we introduce MIRIX, a modular, multi-agent memory system that redefines the future of AI memory by solving the field's most critical challenge: enabling lan...
Rethinking Verification for LLM Code Generation: From Generation to Testing
🤗 Upvotes: 23 | cs.CL
Authors:
Zihan Ma, Taolin Zhang, Maosong Cao, Junnan Liu, Wenwei Zhang, Minnan Luo, Songyang Zhang, Kai Chen
Title:
Rethinking Verification for LLM Code Generation: From Generation to Testing
Arxiv:
http://arxiv.org/abs/2507.06920v2
Abstract:
Large language models (LLMs) have recently achieved notable success in code-generation benchmarks such as HumanEval and LiveCodeBench. However, a detailed examination reveals that these evaluation suites often comprise only a limited number of homogeneous test cases, resulting in subtle faults going undetected. This not only artificially inflates measured performance but...
SingLoRA: Low Rank Adaptation Using a Single Matrix
🤗 Upvotes: 68 | cs.AI
Authors:
David Bensaïd, Noam Rotstein, Roy Velich, Daniel Bensaïd, Ron Kimmel
Title:
SingLoRA: Low Rank Adaptation Using a Single Matrix
Arxiv:
http://arxiv.org/abs/2507.05566v1
Abstract:
Low-Rank Adaptation (LoRA) has significantly advanced parameter-efficient fine-tuning of large pretrained models. LoRA augments the pre-trained weights of a model by adding the product of two smaller matrices that together form a low-rank matrix update. Recent research has shown that scale disparities between these two matrices often cause unstable training dynamics, leading to suboptimal performance. In th...
A Survey on Latent Reasoning
🤗 Upvotes: 60 | cs.CL
Authors:
Rui-Jie Zhu, Tianhao Peng, Tianhao Cheng, Xingwei Qu, Jinfa Huang, Dawei Zhu, Hao Wang, Kaiwen Xue, Xuanliang Zhang, Yong Shan, Tianle Cai, Taylor Kergan, Assel Kembay, Andrew Smith, Chenghua Lin, Binh Nguyen, Yuqi Pan, Yuhong Chou, Zefan Cai, Zhenhe Wu, Yongchi Zhao, Tianyu Liu, Jian Yang, Wangchunshu Zhou, Chujie Zheng, Chongxuan Li, Yuyin Zhou, Zhoujun Li, Zhaoxiang Zhang, Jiaheng Liu, Ge Zhang, Wenhao Huang, Jason Eshraghian
Title:
A Survey on Latent Reasoning
Arxiv:
http://arxiv.org/abs/2507.06203v1
Abstract:
Large Language Models (LLMs) hav...
OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion
🤗 Upvotes: 45 | cs.CV
Authors:
Yunhan Yang, Yufan Zhou, Yuan-Chen Guo, Zi-Xin Zou, Yukun Huang, Ying-Tian Liu, Hao Xu, Ding Liang, Yan-Pei Cao, Xihui Liu
Title:
OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion
Arxiv:
http://arxiv.org/abs/2507.06165v1
Abstract:
The creation of 3D assets with explicit, editable part structures is crucial for advancing interactive applications, yet most generative methods produce only monolithic shapes, limiting their utility. We introduce OmniPart, a novel framework for part-aware 3D object generation designed to achieve high semantic decoupling among com...
How to Train Your LLM Web Agent: A Statistical Diagnosis
🤗 Upvotes: 40 | cs.AI, cs.LG, stat.ML
Authors:
Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Penaloza, Hadi Nekoei, Megh Thakkar, Thibault Le Sellier de Chezelles, Nicolas Gontier, Miguel Muñoz-Mármol, Sahar Omidi Shayegan, Stefania Raimondo, Xue Liu, Alexandre Drouin, Laurent Charlin, Alexandre Piché, Alexandre Lacoste, Massimo Caccia
Title:
How to Train Your LLM Web Agent: A Statistical Diagnosis
Arxiv:
http://arxiv.org/abs/2507.04103v1
Abstract:
LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progre...
StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling
🤗 Upvotes: 35 | cs.RO, cs.CV
Authors:
Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, Xihui Liu, Jiangmiao Pang
Title:
StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling
Arxiv:
http://arxiv.org/abs/2507.05240v1
Abstract:
Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate actions with low latency grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current VLN methods based on Video-LLM often face tra...
CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization
🤗 Upvotes: 35 | cs.CL
Authors:
Zhongyuan Peng, Yifan Yao, Kaijing Ma, Shuyue Guo, Yizhe Li, Yichi Zhang, Chenchen Zhang, Yifan Zhang, Zhouliang Yu, Luming Li, Minghao Liu, Yihang Xia, Jiawei Shen, Yuchen Wu, Yixin Cao, Zhaoxiang Zhang, Wenhao Huang, Jiaheng Liu, Ge Zhang
Title:
CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization
Arxiv:
http://arxiv.org/abs/2507.06181v1
Abstract:
Translating natural language mathematical statements into formal, executable code is a fundamental challenge in automated theorem proving. While prior work has focused on generation and compilation success, little attention has bee...