REVEAL

Reasoning-Entropy Visual Re-Attention via Learned Displacement and Salience

provisional patent: filed — USPTO, Apr 2026
signal pipeline: validated — 499 clips
training corpus: 10.9k CoT examples
attention training: in preparation

Visual attention decay in vision-language model reasoning

Modern vision-language models exhibit a structural pathology during extended chain-of-thought reasoning: as the model generates longer reasoning sequences, attention to visual tokens decays. No specific model is to blame; the decay follows from how attention is computed across long contexts in current architectures. The visual tokens are a fixed prefix of the sequence, so every generated token dilutes their share of attention mass. For one open family of reasoning VLMs this has been proven formally: attention to visual tokens declines as reasoning progresses [1].
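The dilution argument can be made concrete with a toy calculation. The sketch below is not the formal result in [1]; it assumes every visual token carries one shared attention logit and every generated token another, and the function name and token counts are hypothetical, chosen only for illustration.

```python
import math


def visual_attention_share(n_visual, n_generated, visual_logit=0.0, text_logit=0.0):
    """Softmax attention mass on a fixed visual prefix.

    Toy assumption: all visual tokens share one logit and all generated
    tokens share another, so the softmax reduces to a ratio of counts
    weighted by exp(logit).
    """
    visual_mass = n_visual * math.exp(visual_logit)
    text_mass = n_generated * math.exp(text_logit)
    return visual_mass / (visual_mass + text_mass)


# A 256-token visual prefix while the reasoning trace grows (hypothetical sizes).
shares = [visual_attention_share(256, t) for t in (0, 256, 1024, 4096)]
```

With equal logits the share is simply n_visual / (n_visual + n_generated), so it falls monotonically as the reasoning trace lengthens even though nothing about the image has changed.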

The failure mode is invisible. The model continues to produce confident outputs that read as grounded in the visual input, while the actual visual signal has dropped out of the reasoning process. Standard evaluation metrics do not surface this: the outputs look correct because they are linguistically coherent, but the model is no longer looking at what it claims to be looking at.

Visual attention decay during extended chain-of-thought generation, schematically. Visual tokens are a fixed prefix of the sequence; every generated reasoning token dilutes their share of attention mass. For one open reasoning-VLM family the effect has been proven formally: attention to visual tokens declines as reasoning progresses [1]. The dashed line marks the regime REVEAL is built to hold; measured results come later.

The existing remedy is token re-injection: pasting copies of the visual tokens back into the sequence at reflection points. It works, but every re-injection lengthens the sequence and its memory footprint. REVEAL instead modulates the geometry of attention over the visual tokens that are already there, at zero additional sequence length.

Framework

REVEAL addresses the decay by modulating visual attention in response to reasoning-time signals, including model uncertainty and exogenous conditioning inputs, to recover visual grounding without retraining the base model.

The framework comprises three mechanisms:

Phase displacement of position encodings [3], which geometrically broadens the attention pattern over visual tokens
Patch salience bias, a per-patch content-aware modulation operating at the attention layer
Exogenous conditioning signals, allowing domain-specific predictive models to drive attention adjustments based on prediction-vs-observation residuals

Block-level structure. The generation path (video → encoder → decoder) is entirely frozen. Two reasoning-time signals, an exogenous prediction–observation residual ε and the model's own reasoning uncertainty H, condition two attention-level interventions: a phase displacement of visual position encodings [3] and a per-patch salience bias. Mechanism internals are covered by the April 2026 provisional filing and are not described here.

The three mechanisms can be applied independently or composed. The framework is architecture-agnostic and designed for fine-tuning rather than pretraining, making it deployable on existing models without the cost of training from scratch.
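As a generic illustration of how an attention-level bias can shift mass back onto visual tokens without adding any tokens, consider the sketch below. It is not the filed mechanism, whose internals are not described on this page. It shows only a plain additive bias on attention logits; the function names, the uniform logits, and the bias values are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def visual_mass(logits, n_visual, patch_bias=None):
    """Attention mass on the first n_visual positions.

    patch_bias is a hypothetical additive per-patch term applied to the
    logits of the visual prefix only. Returns (mass, sequence_length).
    """
    if patch_bias is None:
        patch_bias = [0.0] * n_visual
    if len(patch_bias) != n_visual:
        raise ValueError("patch_bias needs one entry per visual token")
    biased = [x + b for x, b in zip(logits[:n_visual], patch_bias)]
    biased += list(logits[n_visual:])
    weights = softmax(biased)
    return sum(weights[:n_visual]), len(biased)


# 4 visual tokens followed by 12 generated tokens, all with equal logits.
logits = [0.0] * 16
base_mass, base_len = visual_mass(logits, 4)
boosted_mass, boosted_len = visual_mass(logits, 4, patch_bias=[1.0, 0.0, 0.0, 0.0])
```

In this toy setting the visual share rises from 0.25 once one patch is biased upward, and the sequence length stays at 16 in both cases, in contrast to re-injection, which would lengthen the sequence.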

A provisional patent application covering the framework was filed with the USPTO on April 8, 2026. A nonprovisional application is planned for April 2027. Mechanism internals (the loss formulation, module architectures, and training procedure) are covered by the filing and not described here.

The conditioning signal, measured

The third mechanism rests on a premise that can be tested without touching a model: that a lightweight, interpretable predictor running alongside the video can flag the moments worth looking at. The pipeline fits classical predictive baselines over pose kinematics extracted from video, and scores every frame by how far the observed motion departs from prediction. When parallel channels of evidence disagree, widening perceptual search is the statistically sound response, a principle with deep roots in the human cue-integration literature [2].
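A minimal sketch of prediction-versus-observation scoring follows. The page does not say which classical baselines the pipeline fits, so this example assumes a constant-velocity extrapolation over a single kinematic feature as a stand-in; the function name and the sample trace are hypothetical.

```python
def constant_velocity_residuals(series):
    """Per-frame residual magnitude against a constant-velocity prediction.

    The prediction for frame t is 2 * x[t-1] - x[t-2]. The first two
    frames have no prediction and are scored 0.0.
    """
    residuals = [0.0, 0.0][:len(series)]
    for t in range(2, len(series)):
        predicted = 2 * series[t - 1] - series[t - 2]
        residuals.append(abs(series[t] - predicted))
    return residuals


# Hypothetical feature trace: steady motion, then an abrupt speed-up at frame 4.
trace = [0.0, 1.0, 2.0, 3.0, 6.0, 9.0, 12.0]
residuals = constant_velocity_residuals(trace)
peak_frame = max(range(len(residuals)), key=residuals.__getitem__)
```

The residual is zero while the motion matches the extrapolation and spikes at the frame where the speed changes, which is the kind of moment such a signal is meant to flag. Once the new speed is established the predictor adapts and the residual returns to zero.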

Per-frame residual magnitude from a classical predictive baseline fit over pose kinematics: a fixed-camera arena clip (126 frames), scored against the pooled walk-gait baseline. The horse breaks from a walk into a trot, surges into a canter, and settles back to a walk; the residual flags the whole transient and peaks at the canter→trot transition (frame 90, verified frame-by-frame against the source video). The late spikes near the right edge are keypoint-tracker degradation as the horse approaches the frame boundary — noise, not motion, and one reason the channels are reliability-weighted.

A close reading of this trace, including why the late spikes are tracker noise and why that distinction shapes the signal design, is in the first research note.
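The trace caption notes that the channels are reliability-weighted, but the page does not specify the scheme. One standard choice, and the one underlying the cue-integration result in [2], is inverse-variance weighting. The sketch below assumes that scheme; the function names and the sample numbers are hypothetical.

```python
def reliability_weights(variances):
    """Inverse-variance weights that sum to 1 (lower variance, higher weight)."""
    if not variances or any(v <= 0 for v in variances):
        raise ValueError("variances must be a non-empty list of positive numbers")
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    return [w / total for w in inverse]


def fuse(channel_values, variances):
    """Reliability-weighted combination of per-channel residuals."""
    weights = reliability_weights(variances)
    return sum(w * x for w, x in zip(weights, channel_values))


# Two hypothetical channels: a clean one reading 1.0 and a noisy one reading 5.0.
weights = reliability_weights([1.0, 4.0])
fused = fuse([1.0, 5.0], [1.0, 4.0])
```

Under this weighting a channel whose tracker output has degraded, and whose variance has therefore grown, contributes less to the fused signal, so a noise-driven spike is damped rather than treated as motion.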

Empirical base

REVEAL trains against a purpose-built corpus. Every clip passes through pose estimation, depth lifting, kinematic feature extraction, predictive baselining, and multi-pass chain-of-thought generation with cross-model review.

499: video clips instrumented end-to-end
30,167: frames with dual-skeleton pose — 39-keypoint quadruped + 17-keypoint rider
41: kinematic features extracted per frame
7: residual channels feeding the conditioning signal
10,931: validated chain-of-thought examples in the current corpus
2: validation domains — behavioral video analysis, equestrian biomechanics

All six numbers describe the training scaffolding. Attention-level results will be reported when the training runs complete; this page makes no claims about them yet.

Validation domains

REVEAL is being validated across two application domains, both selected because they require sustained visual attention during multi-step reasoning over temporally rich signals:

Deception detection in video. Behavioral cues for deception are subtle, distributed across modalities (facial expression, gaze, gesture, vocal patterns), and require integration over time. This is a stress test for visual grounding during extended reasoning: the kind of task where attention decay pushes a model onto linguistic priors and away from the visual evidence.

Equestrian biomechanics analysis. Rider posture and horse gait analysis involve fine-grained spatial reasoning over multiple subjects (rider plus horse) with temporally evolving signals. The domain has a well-developed quantitative biomechanics literature, providing reference data for empirical validation. The measured trace above comes from this domain.

Status

2026-04 Provisional patent application filed (USPTO).
2026-05 Signal-extraction pipeline validated across 499 clips; chain-of-thought corpus assembly complete.
2026-06 Training infrastructure validated on local dual-GPU hardware.
2026-07 Attention-level training runs in preparation.
2027-04 Nonprovisional filing planned.

Notes from the work in progress are published in /writing as they become available.

References

[1] Chu, X. et al. Qwen Look Again: Guiding Vision-Language Reasoning Models to Re-attention Visual Information (2025). arXiv:2505.23558.
[2] Ernst, M. O. & Banks, M. S. Humans integrate visual and haptic information in a statistically optimal fashion. Nature 415, 429–433 (2002).
[3] Su, J. et al. RoFormer: Enhanced Transformer with Rotary Position Embedding (2021). arXiv:2104.09864.

Contact

For research inquiries, dataset access discussions, or collaboration: Arjun.Joshi@Agni.works