AI · Web3 · Tech trends and insights at a glance
Deep dives into the latest research papers
A new study questions the unexamined default that every layer in a language model deserves the same parameter budget. Under a fixed budget, giving more width to early layers and tapering it toward the end improves performance at zero extra cost. The effect holds across four architectures, exposing depth-aware allocation as a free lever in model design.
Language models collapse into hallucination when forced to simulate bitwise arithmetic in their heads. A team entering the NVIDIA Nemotron reasoning challenge abandoned arithmetic entirely, reframing the task around string similarity and backtracking search to reach over 96% accuracy. The deeper move is teaching an LLM to search and recover from errors rather than to calculate.
Almost every large language model is trained with AdamW, yet no one has proven it converges under the heavy-tailed gradient noise that real pretraining actually produces. Lion, Muon, and AdaGrad already cleared that bar, so why is AdamW still the blank entry? The answer may lie in how its denominator quietly remembers a past spike and buries the next important gradient.
Text-to-image models grow so faithful to the prompt that their outputs collapse into a single interpretation. A team at Tel Aviv University proposes inducing diversity at the text level rather than in pixels, letting a vision-language model lay out interpretable axes of variation that users can navigate like a gallery. The result reframes generation as controllable exploration instead of a slot machine.
Humanoid robots have long had to halt before they could manipulate anything. CoorDex compresses both whole-body and twenty-finger control into separate frozen latent priors and learns only a small residual on top, letting a Unitree G1 grasp bottles and open fridge doors while still in motion. The real lesson lies less in the algorithm than in the interface that finally makes high-dimensional contact control trainable.
Teaching a robot hand to grasp dexterously demands data that records what physically happens when it tries, and until now that data has meant either slow human teleoperation or simulations that cannot certify real contact. AutoDex closes the entire collection loop, from perception to execution to labeling to reset, with no human in between, gathering physically validated grasp data 4.8 times faster than teleoperation.
A new study finds that benign compliance demonstrations can sometimes increase — not decrease — harmful compliance in safety-aligned models, depending on training methodology. Preference optimization, not supervised fine-tuning alone, emerges as the critical stage that blocks this pathway. The work moves past showing that demonstration-based jailbreaks work to explaining how and why they do.
Whether AGENTS.md files actually help coding agents has been surprisingly contested. This paper identifies the decisive variable: not whether guidance exists, but how it is produced. Using synthetic bug-fix probes to iteratively refine guidance files, the method improves agent performance by expanding coverage — helping agents find the right file — not by improving code-editing quality.
LiveCodeBench, the de facto standard for LLM code evaluation, has always had a quiet blind spot: it tests only Python. Multi-LCB extends it to twelve languages and reveals an uncomfortable truth about where current models actually stand, and where they don't.
Rather than generating text about what should be done, Data Intelligence Agents execute code, observe outcomes, and repair failures in a tight loop — a structural shift that compresses the chronic handoff problem in enterprise data pipelines. Deployed in production and matching or surpassing state-of-the-art across seven SQL benchmarks, DIA offers a credible template for autonomous data intelligence at scale.
The two dominant paradigms for training language model reasoning, supervised distillation and reinforcement learning with scalar rewards, each carry structural weaknesses that have largely been treated as inevitable. A Yale research team's Rubric-Conditioned Self-Distillation framework addresses both at once, using structured evaluation criteria as a token-level training signal to enable fine-grained credit assignment over the reasoning process.
U.S. local ordinances — the rules governing zoning, noise, and business licenses — have long been the missing layer in legal AI research, locked inside vendor platforms not built for bulk access. LOCUS changes that, releasing machine-readable codes from 9,239 cities and counties alongside ModernBERT classifiers that can measure, for the first time at national scale, how opaque and paternalistic local law actually is.
A cyber attacker's moves are never directly visible to the defender — only their aftermath is. This paper proposes an imitation learning framework that reconstructs red agent policy purely from network observations and defender actions, without ever seeing the attacker act. Integrated into a neurosymbolic defense architecture, the approach achieves high prediction accuracy across diverse simulated attack scenarios.
Zero-shot object goal navigation has long been bottlenecked by a fundamental paradox: foundation models bring broad knowledge, but that knowledge is frozen at training time, leaving agents unable to learn from their own mistakes. EvolveNav breaks this cycle with a self-evolving memory that distills past trajectories into actionable rules, then uses those rules to forecast outcomes before each move — cutting unnecessary steps and pushing success rates 10.1 points ahead of prior baselines.
Deployed robots have always faced a hard ceiling: without fresh demonstrations, a policy simply stops learning. VERITAS changes that by pairing a generalist robot policy with a gradient-free visual verifier — steering actions at inference time and converting verified rollouts into autonomous self-improvement data. The results rival expert demonstrations, with no human involvement required.
The exploding and vanishing gradient problem has long been treated as an empirical nuisance in deep learning, managed through engineering workarounds rather than truly understood. A new paper by Vivek S. Borkar applies multiplicative ergodic theory to give the first rigorous mathematical account of why gradients misbehave in deep networks, and of why residual connections fix the problem.
Every technique that compresses an LLM agent's context window risks invalidating the KV cache, turning token savings into compute losses. TokenPilot resolves this structural tension through a two-layer framework that stabilizes prompt prefixes at ingestion and defers eviction until task relevance genuinely expires — cutting inference costs by up to 87% without sacrificing agent performance.
Every real-world robot episode ends with a single binary signal: success or failure. HABC argues that compressing this outcome into one scalar conflates two fundamentally distinct objectives — viability and efficiency — and that naively assigning labels across human intervention segments corrupts learning at its foundation. Two independent critic heads and a state-adaptive gate resolve both problems, tripling success rates on the hardest contact-rich tasks.
How much data does it take to learn well? A new theoretical result shows that the majority vote of just three independently trained classifiers achieves the provably optimal sample complexity in the PAC learning framework. The elegance of this finding lies as much in its proof as in its conclusion: a short, clean argument that subsumes previous complex analyses and reshapes how we think about ensemble methods.
A new framework combines a hybrid CNN-cellular automaton fire model with gradient-based optimization to automatically design aerial suppression strategies. Validated on the 2020 Bear Fire, the system unifies prediction and intervention planning within a single differentiable loop while supporting rigorous uncertainty analysis.
The Slovak Massive Text Embedding Benchmark exposes a counterintuitive finding: models built specifically for Slovak NLU tasks perform poorly on embedding benchmarks, while large multilingual models dominate despite no Slovak-specific training. The researchers' answer — vocabulary trimming applied to Multilingual E5 — cuts model size by 62% while matching proprietary API performance, and the full pipeline is released openly as a blueprint for other underserved languages.
Despite receiving dense, token-level teacher supervision, on-policy distillation updates only a sparse subset of model parameters — and those updates preferentially land where the source model's weights are near zero. A new study offers the first systematic account of the geometric and sparsity signatures that OPD leaves on model internals.
A new paper introduces 'cognitive colonization' — the idea that AI systems can embed external interests within the architecture of the self in ways users cannot easily perceive. Drawing on the concept of System 0, the authors argue that AI does not merely assist human thinking but shapes the cognitive landscape before thought begins, raising urgent philosophical and practical questions about autonomy in an AI-saturated world.
As language models grow more capable, EurekAgent argues the real constraint on autonomous scientific discovery has shifted from agent design to environment design. By engineering permissions, artifact management, budgets, and human oversight as first-class concerns, the system sets new records on mathematical optimization problems for less than $11 in API costs.
Most LLM-based research agents reduce scientific papers to abstracts and flat citation edges — missing the claims, evidence, and method lineages that make scientific reasoning possible. Agents-K1 is an end-to-end pipeline that converts raw papers into structured knowledge graphs, processing 2.46 million scientific documents to produce Scholar-KG. It is a fundamental rethinking of what it means for an AI agent to know a paper.
Researchers from UC Berkeley and Stanford have introduced Mana, a sim-to-real framework that reimagines robot dexterous manipulation as a computer animation problem. By combining procedural keyframe generation with motion planning and reinforcement learning, Mana achieves zero-shot sim-to-real transfer across four articulated tools with less than one minute of human annotation per tool.
Conventional RAG retrieves what looks similar, but for mathematical reasoning, what matters is what solves similarly. RA-RFT trains a retriever to rank candidates by expected reasoning benefit rather than surface overlap, then fine-tunes the policy model via reinforcement learning with those analogical demonstrations. The result: up to 7.1 points of gain on AIME 2025 over GRPO, with a framework that operates orthogonally to reward design and training curriculum.
A new study shows that LLMs can automate the laborious task of scientific reproducibility assessment, matching or exceeding human reanalysts across 76 published studies. The pipeline achieved 96% agreement on qualitative conclusions versus 74% for human reviewers, pointing toward scalable infrastructure for systematic auditing of empirical research.
Spectral and walk-based positional encodings are theoretically equivalent in full form — but the truncated variants practitioners actually deploy tell a very different story. A new theoretical analysis shows that truncated spectral PEs fall back to the 1-WL baseline while truncated walk-based PEs retain their expressive advantage, a divergence that reshapes how GNN architectures should be designed. Mixing truncated PE families, rather than relying on any single one, emerges as both theoretically motivated and empirically superior.
A new framework from NVIDIA Research proposes that the limiting factor in agentic spatial reasoning is not the quality of perception tools, but the design of the interface through which they are invoked. SpatialClaw uses a stateful Python kernel for step-by-step code execution, outperforming the prior state of the art by 11.2 points across twenty spatial reasoning benchmarks — with no training or fine-tuning required.
A new design principle for Mixture-of-Experts models argues that each router row should align with the principal singular direction of its associated expert matrix — the most mathematically expressive summary of that matrix. Manifold Power Iteration (MPI) enforces this alignment during training through a "Power-then-Retract" paradigm grounded in classical numerical methods. Pretraining experiments from 1B to 11B parameters confirm that principled router-expert alignment translates to measurably stronger model performance.
Most commodity robot arms lack dedicated force sensors, creating a fundamental barrier for contact-rich manipulation research. NEXT (Neural External Torque Estimation) learns a robot's internal dynamics from just ten minutes of free-motion data and achieves torque estimates competitive with dedicated hardware after only one minute of training. Paired with FIRST, a force-guided resampling strategy for behavior cloning, the system outperforms prior force-aware policies by over 17% across five long-horizon tasks — all without adding a single sensor.
Modern conversational AI systems re-encode the full dialogue history at every turn, so cost rises and quality degrades as conversations grow longer. C-DIC reframes context compression as an ongoing memory management problem, maintaining revisable per-thread states in a compact dialogue memory via a lightweight retrieve-revise-write-back loop. Experiments demonstrate stable inference latency and perplexity across hundreds of dialogue turns.
AI evaluation results flood the internet daily, but the numbers rarely come with enough context to mean anything. A score of 87 on MMLU tells you nothing if you don't know the prompt format, the shot count, or whether the training data overlapped with the test set. EvalCards proposes a unified reporting layer that makes these hidden variables visible — and applies it to over 100,000 real evaluation results to expose just how broken current practice is.
Standard diffusion models collapse all data into a single Gaussian terminal distribution before generation, forcing the reverse network to reconstruct manifold structure entirely unaided. PTL-Diffusion replaces this single endpoint with a periodic family of Gaussian terminal laws, embedding geometric structure directly into the forward noising dynamics. Experiments on torus, cylinder, and face datasets show measurable improvements in manifold-level distributional fidelity over matched DDPM baselines.
Reinforcement learning for large language models is structurally off-policy, and trust-region control is essential for stable optimization. Methods like PPO and GRPO rely on importance-ratio clipping, while DPPO improves on this with divergence-based masking — yet both ultimately discard gradients at the boundary, providing no corrective signal for errant updates. DRPO replaces the hard mask with a smooth, advantage-weighted quadratic regularizer that attenuates diverging updates and preserves directional correction beyond the trust-region boundary.
Not all parts of a robot's cognition need to tick at the same rate. AHA-WAM proposes a dual Diffusion Transformer architecture that separates a low-frequency world planner from a high-frequency action executor, letting each operate at its natural tempo. The result is state-of-the-art manipulation performance at 24.17 Hz with a 4.59× speedup over prior baselines — and no robot-specific pretraining required.
Training reinforcement learning policies from scratch is expensive, and often unnecessary when a functional but suboptimal baseline already exists. This paper proposes agency transfer, a method that builds the baseline into training as a progressive arbitrator, formally guaranteeing high goal-reaching rates throughout training and deriving explicit lower bounds for the final, baseline-free policy.
Most game benchmarks for VLM agents report a single first-attempt score and call it done. OmniGameArena challenges this with twelve new UE5 games and the Improvement Dynamics Curve — an autonomous reflection harness that measures not just where an agent begins, but how fast it learns and how well it generalizes.
Most multi-agent reinforcement learning systems train agents to maximize individual rewards, but offer no guarantee of convergence to a strategically stable equilibrium. DNQ embeds a game-theoretic solver directly inside the training loop, making Nash Equilibrium an explicit supervision target at every visited state. The framework's pairwise approximation scales to large agent populations, revealing a fundamental tradeoff between strategic fidelity and computational tractability.
As AI writing assistants become ubiquitous, the binary of human-written versus AI-generated text has collapsed into a messy continuum. The OpAI-Bench study finds that documents in intermediate mixed-authorship states — partially human, partially AI — are harder to detect than either purely human or heavily AI-edited text, exposing a non-monotonic detection paradox that current systems are ill-equipped to handle.
Code language models struggle with repository-specific conventions — imports, APIs, naming idioms — that neither RAG nor per-repo fine-tuning can handle cheaply at scale. Code2LoRA trains a hypernetwork to generate repository-specific LoRA adapters on demand, encoding repository knowledge directly into model weights with zero additional tokens at inference time. A new benchmark of 604 Python repositories tests both static snapshots and commit-by-commit evolution scenarios, showing the approach matches per-repository fine-tuning quality without the cost.
Post-training compression of large language models has long operated under two taken-for-granted assumptions: that removal must happen at the full-layer level, and that targeted components must be contiguous. SubFit, from researchers at the University of Trento, dismantles both simultaneously. By treating attention and feedforward submodules as independent, non-contiguous compression targets, it achieves perplexity degradation less than half that of the strongest baseline at 25% sparsity.
Autonomous robots sharing space with people must reason continuously about human intent — a task fraught with uncertainty that makes formal safety guarantees elusive. This paper cracks that problem by combining belief-space safety filtering with conformal prediction, focusing certification on regions where runtime inference is reliable to achieve provably safe yet meaningfully permissive robot behavior.
When multimodal large language models learn tasks sequentially, semantically similar tasks with incompatible output structures get routed to the same expert adapter — quietly corrupting specialized parameters over time. ProtoAda fixes this format-blind assignment with prototypes that encode both what a task is about and how it expects to answer, offering a cleaner path toward models that learn without forgetting.
As agentic AI systems grow more capable than the humans tasked with supervising them, the meaning of oversight becomes unclear. Calibrated Collective Oversight (CCO) from Stanford addresses this by aggregating diverse overseer signals into a collective conservatism penalty, calibrated online via Conformal Decision Theory to keep unsafe behavior below user-specified thresholds with finite-time guarantees. Experiments on SWE-bench and MACHIAVELLI show that weaker overseers can successfully constrain a misaligned stronger agent, with empirical violation rates closely matching theoretical predictions.
Catastrophic forgetting remains one of the most stubborn obstacles in continual learning systems. AREA, accepted at ICML 2026, reframes the problem by decomposing CLIP-based recognition into two distinct stages, attribute extraction and attribute aggregation, and stabilizes each independently using hyperspherical anchoring, variational bottlenecks, and optimal transport routing.
Parameter-efficient fine-tuning has been evaluated almost exclusively on downstream accuracy, leaving the erosion of pretrained capabilities unmeasured. PEFT-Arena reframes the problem through the stability-plasticity dilemma, revealing that orthogonal fine-tuning achieves the most favorable trade-off among competing methods. The paper also shows that final SFT checkpoints routinely overshoot the optimal operating point, and that path-wise rewinding can recover better-balanced models without additional training.
RAG and retrieval agent pipelines expose dozens of configuration choices — which LLM, how many documents, how many hops — yet most systems pick one setup and stick with it. BRANE shows that selecting configurations per query, guided by lightweight predictors trained on workload characteristics, can match the best static setup's accuracy at up to 89% lower cost.
Vision-language models have long serialized bounding boxes into independent coordinate tokens, a choice that quietly undermines geometric coherence and caps inference throughput. LocateAnything introduces Parallel Box Decoding, treating boxes as atomic units decoded in a single step, and pairs it with a 138-million-sample training corpus to push the speed-accuracy frontier outward on both axes.
Most LLM agent frameworks produce skills that are static from the moment of creation — useful once, but unable to improve with experience. MUSE-Autoskill proposes a full lifecycle for agent skills, from creation and memory to evaluation and refinement, treating each skill as a long-lived, testable asset. The result is an agent that compounds its capabilities over time rather than resetting with every new task.
Most high-resolution text-to-image systems generate content in a compact latent space and rely on a VAE decoder to convert latents back to pixels — a stage that has long been a bottleneck for both quality and speed. NVIDIA's PiD reformulates latent decoding as conditional pixel diffusion, merging decoding and upsampling into a single generative module. The result is a decoder that synthesizes fine detail from scratch, runs 6× faster than cascaded super-resolution pipelines, and produces 2048×2048 images in under one second on a consumer GPU.
Camera-controlled video re-rendering has long relied on synthetic datasets, leaving models brittle when confronted with real-world footage. Geo-Align introduces the first reinforcement learning framework for this task, correcting camera trajectory errors through a scale-aware geometric reward signal — no paired real-world data required. Its consistent gains over supervised baselines signal that RL alignment is beginning to reshape video generation just as it reshaped language models.
Agent skills have traditionally been hand-crafted, one-shot generated, or loosely evolved — none of which guarantees reliable improvement under feedback. SkillOpt proposes the first systematic text-space optimizer for agent skills, treating skill documents as trainable parameters with the same discipline applied to neural network weights. Across 52 evaluation cells spanning six benchmarks and three execution harnesses, SkillOpt matches or beats every competing approach.
Generating long videos is computationally prohibitive. LongLive-2.0 applies NVFP4 (4-bit floating point) throughout the full training and inference pipeline of a long video generation model, achieving a 2.15× training speedup and a 1.84× inference speedup. The 5B-parameter model reaches 45.7 FPS, a signal that the bottleneck in long video generation is shifting from capability to cost.
The combination of blockchain and AI in security systems has generated substantial literature and proportional hype. This review paper cuts through both by mapping what each technology actually contributes (blockchain provides provenance and auditability, AI provides detection and adaptation) and by acknowledging that the empirical evidence remains mostly at prototype level.
NVIDIA researchers identified a subtle but consequential flaw in existing linear attention models: they use a single gate to control both memory erasure and new information writing. Gated DeltaNet-2 separates these into two independent channel-wise gates, outperforming Mamba-2, Mamba-3, and KDA at 1.3B parameters — particularly on long-context retrieval tasks.
The instinctive response to LLM agent performance degradation on long tasks has been to expand the context window. GenericAgent argues this is the wrong optimization target. By maximizing information density within a fixed context budget — through hierarchical memory, minimal tool interfaces, and self-evolving execution traces — it outperforms leading agents while using fewer tokens.
AutoResearchClaw reframes failure in autonomous research: instead of discarding failed experiments, it uses them as strategic decision points — pivot or refine — while allowing selective human intervention at seven precision levels. On ARC-Bench, it outperforms AI Scientist v2 by 54.7%, with results that compound across research sessions.
AI agents that conduct research can produce outputs that sound convincing but lack actual evidentiary support. ARIS proposes a structural fix: adversarial multi-agent verification, where one agent challenges another's claims against an evidence ledger — making trustworthiness a system property rather than a model property.