The study introduces Incompressible Knowledge Probes (IKPs) to intrinsically measure the factual knowledge capacity of large language models, providing a black-box method to estimate undisclosed parameter counts. The approach achieved an R² of 0.917 on open-weight models, enabling robust parameter estimation for proprietary models and demonstrating that factual knowledge scales log-linearly with total parameters, remaining distinct from compressible procedural capabilities.
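As a rough illustration of the scaling claim, a least-squares fit of probe score against log parameter count can be inverted to estimate an undisclosed model size. The sizes and scores below are invented for demonstration, and `estimate_params` is a hypothetical helper, not the paper's method:

```python
import numpy as np

# Hypothetical data: known parameter counts and invented probe scores.
params = np.array([1e9, 7e9, 13e9, 70e9])
capacity = np.array([3.1, 4.6, 5.1, 6.4])

# Fit capacity ~ slope * log10(params) + intercept.
slope, intercept = np.polyfit(np.log10(params), capacity, 1)

def estimate_params(probe_score: float) -> float:
    """Invert the fitted log-linear law to estimate a hidden model's size."""
    return 10 ** ((probe_score - intercept) / slope)

print(estimate_params(5.8))  # e.g. roughly tens of billions of parameters
```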
This research formalizes Maryna Viazovska's proof of the optimal sphere packing in eight dimensions within the Lean Theorem Prover, confirming the E8 lattice packing density of π⁴/384. The formalization involved human experts and an AI autoformalization model, resulting in a 60,000-line machine-checkable proof.
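For reference, the headline constant is easy to state in Lean 4 with mathlib; the definition below is a sketch with an invented name (`E8PackingDensity`), not the formalization's actual code:

```lean
import Mathlib

-- Invented name for illustration; the real formalization's definitions differ.
noncomputable def E8PackingDensity : ℝ := Real.pi ^ 4 / 384

example : E8PackingDensity = Real.pi ^ 4 / 384 := rfl
```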
RecursiveMAS introduces a framework that integrates recursive computation into multi-agent systems, enabling agents to refine collaborative reasoning through iterative latent-space interactions rather than explicit text. This approach leads to average accuracy improvements of up to 20.2% and inference speedups of up to 2.4x compared to text-based recursive multi-agent systems, significantly reducing token usage.
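A minimal sketch of the latent-space interaction pattern, under our own simplifying assumptions (each agent is a small network that refines its latent against the pooled latents of its peers; the paper's actual architecture is not reproduced here):

```python
import torch

class LatentAgent(torch.nn.Module):
    """Toy agent: refines its own latent given the mean of peer latents."""
    def __init__(self, dim: int):
        super().__init__()
        self.refine = torch.nn.Linear(2 * dim, dim)

    def forward(self, own: torch.Tensor, peers: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.refine(torch.cat([own, peers], dim=-1)))

def recursive_round(agents, latents, steps: int = 4):
    # Iteratively refine every agent's latent; no text is generated
    # until the final latents are decoded once at the end.
    for _ in range(steps):
        pooled = torch.stack(latents).mean(dim=0)
        latents = [agent(z, pooled) for agent, z in zip(agents, latents)]
    return latents

agents = [LatentAgent(64) for _ in range(3)]
refined = recursive_round(agents, [torch.randn(64) for _ in range(3)])
```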
Agentic Harness Engineering (AHE) is a closed-loop system for automatically evolving the external components (the harness) of coding agents, such as system prompts, tools, and middleware. AHE raises the `pass@1` score on Terminal-Bench 2 from 69.7% to 77.0%, outperforming human-designed and other automated baselines, and demonstrates transferability to new benchmarks and base models.
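The closed loop can be pictured as a simple evolutionary search over harness configurations. Here `mutate` and `evaluate_pass_at_1` are placeholders for components the paper would realize with an LLM, so this is a shape sketch rather than the actual AHE algorithm:

```python
def evolve_harness(seed_harness, mutate, evaluate_pass_at_1,
                   generations: int = 10, population: int = 8):
    """Greedy evolutionary loop: propose harness variants, keep the best."""
    best, best_score = seed_harness, evaluate_pass_at_1(seed_harness)
    for _ in range(generations):
        for candidate in (mutate(best) for _ in range(population)):
            score = evaluate_pass_at_1(candidate)  # e.g. pass@1 on held-out tasks
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score
```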
DeepSeek-AI researchers introduced "Thinking with Visual Primitives," a framework that integrates points and bounding boxes as fundamental units of thought into Multimodal Large Language Models (MLLMs) to address the "Reference Gap" in complex visual reasoning. This approach improves performance on tasks like counting, spatial deduction, and topological navigation while significantly enhancing visual token efficiency.
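One way to picture "points and boxes as units of thought" is a reasoning trace that interleaves text with grounded primitives; the types below are our own illustration, not the paper's interface:

```python
from dataclasses import dataclass

@dataclass
class Point:
    x: float  # normalized image coordinates in [0, 1]
    y: float

@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

# A hypothetical trace for a counting question.
trace = [
    "Locate each mug on the left shelf.",
    Box(0.05, 0.20, 0.45, 0.55),  # shelf region under consideration
    Point(0.21, 0.34),            # first mug
    Point(0.28, 0.35),            # second mug
    "Two mugs fall inside the shelf region, so the answer is 2.",
]
```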
Meta AI researchers introduced Tuna-2, a unified multimodal model that performs visual understanding and generation directly from pixel embeddings, eliminating the need for pretrained vision encoders. This encoder-free architecture achieved competitive performance across nine VQA benchmarks and state-of-the-art results among native unified multimodal models (UMMs) for image generation and editing tasks.
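A minimal sketch of the encoder-free idea, assuming simple non-overlapping patchification and a learned linear projection into the model's token space (patch size and width are illustrative):

```python
import torch

def pixels_to_tokens(image: torch.Tensor, proj: torch.nn.Linear, patch: int = 16):
    """Split a (3, H, W) image into patches and project them to token embeddings."""
    c, h, w = image.shape
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)
    return proj(patches)  # (num_patches, d_model); no pretrained encoder involved

proj = torch.nn.Linear(3 * 16 * 16, 1024)
tokens = pixels_to_tokens(torch.rand(3, 224, 224), proj)  # (196, 1024)
```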
This research introduces FD-loss, a method that directly optimizes Fréchet Distance as a training objective for generative models by decoupling population statistics from batch-level gradient computation. This approach enhances the visual quality of one-step generators and repurposes multi-step models for efficient single-step generation, while also proposing FDr_k, a new multi-representation metric for comprehensive evaluation.
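The decoupling can be sketched as follows: population statistics are estimated once and frozen, so gradients flow only through the batch statistics. The closed form below is the standard Fréchet distance between Gaussians; the function signature and the symmetric square-root trick are our own choices, not necessarily the paper's exact recipe:

```python
import torch

def frechet_distance_loss(feats: torch.Tensor,
                          mu_pop: torch.Tensor,
                          sigma_pop: torch.Tensor,
                          sqrt_sigma_pop: torch.Tensor) -> torch.Tensor:
    """Fréchet distance between frozen population stats and batch stats.

    mu_pop, sigma_pop, sqrt_sigma_pop are precomputed with no gradient;
    only the current batch's features `feats` (batch, d) carry gradients.
    """
    mu_b = feats.mean(dim=0)
    centered = feats - mu_b
    sigma_b = centered.T @ centered / (feats.shape[0] - 1)
    # Tr((S_pop S_b)^(1/2)) = Tr((S_pop^(1/2) S_b S_pop^(1/2))^(1/2));
    # the inner matrix is symmetric PSD, so eigvalsh stays differentiable.
    inner = sqrt_sigma_pop @ sigma_b @ sqrt_sigma_pop
    trace_sqrt = torch.linalg.eigvalsh(inner).clamp(min=0).sqrt().sum()
    return (((mu_b - mu_pop) ** 2).sum()
            + sigma_pop.trace() + sigma_b.trace() - 2 * trace_sqrt)
```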
GLM-5V-Turbo introduces a multimodal foundation model that deeply integrates visual perception into core agentic capabilities like reasoning, planning, and tool use, moving beyond auxiliary interfaces. The model achieves competitive performance across diverse multimodal agentic benchmarks, including an approximately eightfold improvement on MMSearch-Plus and outperforming Claude Opus 4.6 on Design2Code, while also maintaining strong text-only coding capabilities.
Researchers from Tsinghua University and Xiaomi Robotics developed X-WAM, a Unified 4D World Action Model, integrating real-time robotic action execution with high-fidelity 4D world synthesis (video and 3D reconstruction). The model achieved an average success rate of 79.2% on 24 RoboCasa manipulation tasks, outperforming prior VLA-based methods by 12.1 percentage points, and demonstrated efficient real-time operation on a physical robot.
This research introduces PSI-Bench, an automatic and clinically grounded framework for evaluating depression patient simulators by comparing simulated conversations against real patient data across multiple dimensions. The framework revealed that simulators prematurely resolve emotional narratives, lack natural conversational disfluencies, and often appear overly coherent, with these findings confirmed by strong alignment with expert clinical judgments.
Researchers at NVIDIA developed MotionBricks, a framework for real-time interactive motion control combining a modular latent generative model with "smart primitives" for intuitive, fine-grained control. It achieves 15,000 FPS with 2 ms latency and reports improved motion quality (MMD 0.1056, FID 1.054 on a 350k dataset) and high user satisfaction in animation and robotics applications.
Researchers from Zhejiang University and Microsoft Research developed World-R1, a reinforcement learning framework that imbues existing text-to-video foundation models with robust 3D geometric consistency without architectural modifications. This approach led to a PSNR improvement of up to 10.23 dB over baseline models and a 92% user preference for geometric consistency in generated videos.
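The reward signal can be sketched as a per-frame PSNR term against a geometry-consistent reference (e.g. a reprojection of earlier frames); this is our guess at the general shape, not the paper's exact reward:

```python
import torch

def psnr_reward(generated: torch.Tensor, reference: torch.Tensor,
                max_val: float = 1.0) -> torch.Tensor:
    """PSNR in dB between a generated frame and a geometry-consistent reference."""
    mse = torch.mean((generated - reference) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / (mse + 1e-12))
```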
A new paradigm, Skill Retrieval Augmentation (SRA), is proposed to address the scalability of external skill utilization in agentic Large Language Models. Researchers at Tsinghua University developed SRA-Bench, a benchmark featuring 26,262 skills, and demonstrated that while external skills enhance agent performance, current LLMs struggle to incorporate and apply these skills under noisy conditions, showing limited awareness of both which skills are relevant and when a skill is needed at all.
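A bare-bones retrieval sketch under stated assumptions: skills and the task are embedded in a shared space, cosine similarity ranks candidates, and a threshold serves as a crude need-awareness gate (return nothing when no skill is relevant enough). The names and the threshold value are illustrative:

```python
import numpy as np

def retrieve_skills(task_emb, skill_embs, skill_names, k: int = 3,
                    threshold: float = 0.35):
    """Rank skills by cosine similarity to the task; gate on a relevance floor."""
    sims = skill_embs @ task_emb / (
        np.linalg.norm(skill_embs, axis=1) * np.linalg.norm(task_emb) + 1e-9
    )
    top = np.argsort(sims)[::-1][:k]
    # Need-awareness gate: an empty result means "use no external skill".
    return [(skill_names[i], float(sims[i])) for i in top if sims[i] >= threshold]
```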
This survey systematically reviews Robot Learning from Human Videos (LfHV), categorizing skill transfer into task-oriented, observation-oriented, and action-oriented mechanisms. It analyzes how these pathways address the robot data bottleneck by leveraging abundant human video sources and highlights a growing preference for egocentric video data for precise action grounding.
The ClawGym framework unifies scalable task synthesis, agent training, and diagnostic evaluation for Claw-style personal agents. It provides a large dataset of 13.5K tasks, trains agents that outperform baselines (e.g., ClawGym-30A3B exceeding Qwen3-235B-A23B), and offers a 200-instance benchmark for robust evaluation, demonstrating improved agent performance and generalization.
MotuBrain presents a unified world action model that jointly predicts future visual dynamics and robot actions, learning from diverse multimodal data. The model achieved the highest average success rates on the RoboTwin 2.0 simulation benchmark and the top EWMScore on the WorldArena world model benchmark, while enabling rapid real-world deployment on humanoid robots with an 11 Hz control frequency.