The study introduces Incompressible Knowledge Probes (IKPs) to intrinsically measure the factual knowledge capacity of large language models, providing a black-box method to estimate undisclosed parameter counts. The approach achieved an R² of 0.917 on open-weight models, enabling robust parameter estimation for proprietary models and demonstrating that factual knowledge scales log-linearly with total parameters, remaining distinct from compressible procedural capabilities.
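As a rough illustration of the scaling claim, a least-squares fit of probe score against log parameter count can be inverted to estimate an undisclosed model size. The sizes and scores below are invented for demonstration, and `estimate_params` is a hypothetical helper, not the paper's method:

```python
import numpy as np

# Hypothetical data: known parameter counts and invented probe scores.
params = np.array([1e9, 7e9, 13e9, 70e9])
capacity = np.array([3.1, 4.6, 5.1, 6.4])

# Fit capacity ~ slope * log10(params) + intercept.
slope, intercept = np.polyfit(np.log10(params), capacity, 1)

def estimate_params(probe_score: float) -> float:
    """Invert the fitted log-linear law to estimate a hidden model's size."""
    return 10 ** ((probe_score - intercept) / slope)

print(estimate_params(5.8))  # e.g. roughly tens of billions of parameters
```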
This research formalizes Maryna Viazovska's proof of the optimal sphere packing in eight dimensions within the Lean Theorem Prover, confirming the E8 lattice packing density of π⁴/384. The formalization involved human experts and an AI autoformalization model, resulting in a 60,000-line machine-checkable proof.
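For reference, the headline constant is easy to state in Lean 4 with mathlib; the definition below is a sketch with an invented name (`E8PackingDensity`), not the formalization's actual code:

```lean
import Mathlib

-- Invented name for illustration; the real formalization's definitions differ.
noncomputable def E8PackingDensity : ℝ := Real.pi ^ 4 / 384

example : E8PackingDensity = Real.pi ^ 4 / 384 := rfl
```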
RecursiveMAS introduces a framework that integrates recursive computation into multi-agent systems, enabling agents to refine collaborative reasoning through iterative latent-space interactions rather than explicit text. This approach leads to average accuracy improvements of up to 20.2% and inference speedups of up to 2.4x compared to text-based recursive multi-agent systems, significantly reducing token usage.
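A minimal sketch of the latent-space interaction pattern, under our own simplifying assumptions (each agent is a small network that refines its latent against the pooled latents of its peers; the paper's actual architecture is not reproduced here):

```python
import torch

class LatentAgent(torch.nn.Module):
    """Toy agent: refines its own latent given the mean of peer latents."""
    def __init__(self, dim: int):
        super().__init__()
        self.refine = torch.nn.Linear(2 * dim, dim)

    def forward(self, own: torch.Tensor, peers: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.refine(torch.cat([own, peers], dim=-1)))

def recursive_round(agents, latents, steps: int = 4):
    # Iteratively refine every agent's latent; no text is generated
    # until the final latents are decoded once at the end.
    for _ in range(steps):
        pooled = torch.stack(latents).mean(dim=0)
        latents = [agent(z, pooled) for agent, z in zip(agents, latents)]
    return latents

agents = [LatentAgent(64) for _ in range(3)]
refined = recursive_round(agents, [torch.randn(64) for _ in range(3)])
```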
Agentic Harness Engineering (AHE) is a closed-loop system for automatically evolving the external components (the harness) of coding agents, such as system prompts, tools, and middleware. AHE raises the `pass@1` score on Terminal-Bench 2 from 69.7% to 77.0%, outperforming human-designed and other automated baselines, and demonstrates transferability to new benchmarks and base models.
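The closed loop can be pictured as a simple evolutionary search over harness configurations. Here `mutate` and `evaluate_pass_at_1` are placeholders for components the paper would realize with an LLM, so this is a shape sketch rather than the actual AHE algorithm:

```python
def evolve_harness(seed_harness, mutate, evaluate_pass_at_1,
                   generations: int = 10, population: int = 8):
    """Greedy evolutionary loop: propose harness variants, keep the best."""
    best, best_score = seed_harness, evaluate_pass_at_1(seed_harness)
    for _ in range(generations):
        for candidate in (mutate(best) for _ in range(population)):
            score = evaluate_pass_at_1(candidate)  # e.g. pass@1 on held-out tasks
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score
```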
DeepSeek-AI researchers introduced "Thinking with Visual Primitives," a framework that integrates points and bounding boxes as fundamental units of thought into Multimodal Large Language Models (MLLMs) to address the "Reference Gap" in complex visual reasoning. This approach improves performance on tasks like counting, spatial deduction, and topological navigation while significantly enhancing visual token efficiency.
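One way to picture "points and boxes as units of thought" is a reasoning trace that interleaves text with grounded primitives; the types below are our own illustration, not the paper's interface:

```python
from dataclasses import dataclass

@dataclass
class Point:
    x: float  # normalized image coordinates in [0, 1]
    y: float

@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

# A hypothetical trace for a counting question.
trace = [
    "Locate each mug on the left shelf.",
    Box(0.05, 0.20, 0.45, 0.55),  # shelf region under consideration
    Point(0.21, 0.34),            # first mug
    Point(0.28, 0.35),            # second mug
    "Two mugs fall inside the shelf region, so the answer is 2.",
]
```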
Meta AI researchers introduced Tuna-2, a unified multimodal model that performs visual understanding and generation directly from pixel embeddings, eliminating the need for pretrained vision encoders. This encoder-free architecture achieved competitive performance across nine VQA benchmarks and state-of-the-art results among native unified multimodal models (UMMs) for image generation and editing tasks.
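A minimal sketch of the encoder-free idea, assuming simple non-overlapping patchification and a learned linear projection into the model's token space (patch size and width are illustrative):

```python
import torch

def pixels_to_tokens(image: torch.Tensor, proj: torch.nn.Linear, patch: int = 16):
    """Split a (3, H, W) image into patches and project them to token embeddings."""
    c, h, w = image.shape
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)
    return proj(patches)  # (num_patches, d_model); no pretrained encoder involved

proj = torch.nn.Linear(3 * 16 * 16, 1024)
tokens = pixels_to_tokens(torch.rand(3, 224, 224), proj)  # (196, 1024)
```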
This research introduces FD-loss, a method that directly optimizes Fréchet Distance as a training objective for generative models by decoupling population statistics from batch-level gradient computation. This approach enhances the visual quality of one-step generators and repurposes multi-step models for efficient single-step generation, while also proposing FDr_k, a new multi-representation metric for comprehensive evaluation.
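The decoupling can be sketched as follows: population statistics are estimated once and frozen, so gradients flow only through the batch statistics. The closed form below is the standard Fréchet distance between Gaussians; the function signature and the symmetric square-root trick are our own choices, not necessarily the paper's exact recipe:

```python
import torch

def frechet_distance_loss(feats: torch.Tensor,
                          mu_pop: torch.Tensor,
                          sigma_pop: torch.Tensor,
                          sqrt_sigma_pop: torch.Tensor) -> torch.Tensor:
    """Fréchet distance between frozen population stats and batch stats.

    mu_pop, sigma_pop, sqrt_sigma_pop are precomputed with no gradient;
    only the current batch's features `feats` (batch, d) carry gradients.
    """
    mu_b = feats.mean(dim=0)
    centered = feats - mu_b
    sigma_b = centered.T @ centered / (feats.shape[0] - 1)
    # Tr((S_pop S_b)^(1/2)) = Tr((S_pop^(1/2) S_b S_pop^(1/2))^(1/2));
    # the inner matrix is symmetric PSD, so eigvalsh stays differentiable.
    inner = sqrt_sigma_pop @ sigma_b @ sqrt_sigma_pop
    trace_sqrt = torch.linalg.eigvalsh(inner).clamp(min=0).sqrt().sum()
    return (((mu_b - mu_pop) ** 2).sum()
            + sigma_pop.trace() + sigma_b.trace() - 2 * trace_sqrt)
```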
GLM-5V-Turbo introduces a multimodal foundation model that deeply integrates visual perception into core agentic capabilities like reasoning, planning, and tool use, moving beyond auxiliary interfaces. The model achieves competitive performance across diverse multimodal agentic benchmarks, including an approximately eightfold improvement on MMSearch-Plus and outperforming Claude Opus 4.6 on Design2Code, while also maintaining strong text-only coding capabilities.
Researchers from Tsinghua University and Xiaomi Robotics developed X-WAM, a Unified 4D World Action Model, integrating real-time robotic action execution with high-fidelity 4D world synthesis (video and 3D reconstruction). The model achieved an average success rate of 79.2% on 24 RoboCasa manipulation tasks, outperforming prior VLA-based methods by 12.1 percentage points, and demonstrated efficient real-time operation on a physical robot.
This research introduces PSI-Bench, an automatic and clinically grounded framework for evaluating depression patient simulators by comparing simulated conversations against real patient data across multiple dimensions. The framework revealed that simulators prematurely resolve emotional narratives, lack natural conversational disfluencies, and often appear overly coherent, with these findings confirmed by strong alignment with expert clinical judgments.
Researchers at NVIDIA developed MotionBricks, a framework for real-time interactive motion control combining a modular latent generative model with "smart primitives" for intuitive, fine-grained control. It achieves 15,000 FPS with 2 ms latency and reports improved motion quality (MMD 0.1056, FID 1.054 on a 350k dataset) and high user satisfaction in animation and robotics applications.
Researchers from Zhejiang University and Microsoft Research developed World-R1, a reinforcement learning framework that imbues existing text-to-video foundation models with robust 3D geometric consistency without architectural modifications. This approach led to a PSNR improvement of up to 10.23 dB over baseline models and a 92% user preference for geometric consistency in generated videos.
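The reward signal can be sketched as a per-frame PSNR term against a geometry-consistent reference (e.g. a reprojection of earlier frames); this is our guess at the general shape, not the paper's exact reward:

```python
import torch

def psnr_reward(generated: torch.Tensor, reference: torch.Tensor,
                max_val: float = 1.0) -> torch.Tensor:
    """PSNR in dB between a generated frame and a geometry-consistent reference."""
    mse = torch.mean((generated - reference) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / (mse + 1e-12))
```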
A new paradigm, Skill Retrieval Augmentation (SRA), is proposed to address the scalability of external skill utilization in agentic Large Language Models. Researchers at Tsinghua University developed SRA-Bench, a benchmark featuring 26,262 skills, and demonstrated that while external skills enhance agent performance, current LLMs struggle to incorporate and apply these skills under noisy conditions, showing limited awareness of both which skills are relevant and when a skill is needed at all.
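A bare-bones retrieval sketch under stated assumptions: skills and the task are embedded in a shared space, cosine similarity ranks candidates, and a threshold serves as a crude need-awareness gate (return nothing when no skill is relevant enough). The names and the threshold value are illustrative:

```python
import numpy as np

def retrieve_skills(task_emb, skill_embs, skill_names, k: int = 3,
                    threshold: float = 0.35):
    """Rank skills by cosine similarity to the task; gate on a relevance floor."""
    sims = skill_embs @ task_emb / (
        np.linalg.norm(skill_embs, axis=1) * np.linalg.norm(task_emb) + 1e-9
    )
    top = np.argsort(sims)[::-1][:k]
    # Need-awareness gate: an empty result means "use no external skill".
    return [(skill_names[i], float(sims[i])) for i in top if sims[i] >= threshold]
```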
This survey systematically reviews Robot Learning from Human Videos (LfHV), categorizing skill transfer into task-oriented, observation-oriented, and action-oriented mechanisms. It analyzes how these pathways address the robot data bottleneck by leveraging abundant human video sources and highlights a growing preference for egocentric video data for precise action grounding.
The ClawGym framework unifies scalable task synthesis, agent training, and diagnostic evaluation for Claw-style personal agents. It provides a large dataset of 13.5K tasks, trains agents that outperform baselines (e.g., ClawGym-30A3B exceeding Qwen3-235B-A23B), and offers a 200-instance benchmark for robust evaluation, demonstrating improved agent performance and generalization.
MotuBrain presents a unified world action model that jointly predicts future visual dynamics and robot actions, learning from diverse multimodal data. The model achieved the highest average success rates on the RoboTwin 2.0 simulation benchmark and the top EWMScore on the WorldArena world model benchmark, while enabling rapid real-world deployment on humanoid robots with an 11 Hz control frequency.