REVEAL
Reasoning-Entropy Visual Re-Attention via Learned Displacement and Salience
Visual attention decay in vision-language model reasoning
Modern vision-language models exhibit a structural pathology during extended chain-of-thought reasoning: as the generated reasoning sequence grows, attention to visual tokens decays. This is not a bug in any specific model; it is a consequence of how current architectures compute attention across long contexts. The longer the reasoning chain, the less the model attends to the image it is reasoning about.
The failure mode is invisible. The model continues to produce confident outputs that read as grounded in the visual input, while the actual visual signal has dropped out of the reasoning process. Standard evaluation metrics do not surface this — the outputs look correct because they are linguistically coherent, but the model is no longer looking at what it claims to be looking at.
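One way to make the decay visible is to track, for each generated reasoning token, how much of its attention lands on visual tokens. The sketch below is a toy illustration only, not a measurement from any real model: the function names are hypothetical, and it assumes every visible token receives the same attention score, so the visual share falls purely because the text context keeps growing. Real models will differ, but the same per-step quantity can be computed from their attention weights.

```python
def visual_attention_mass(attn_rows, visual_mask):
    """Per-step fraction of attention that lands on visual tokens.

    attn_rows: list of attention distributions, one per generated token.
    visual_mask: list of booleans, True at visual-token positions.
    """
    return [
        sum(w for w, is_visual in zip(row, visual_mask) if is_visual)
        for row in attn_rows
    ]


def toy_uniform_attention(n_visual, n_steps):
    """Toy causal attention with equal weight on every visible token (assumption)."""
    context_len = n_visual + n_steps
    rows = []
    for t in range(n_steps):
        # Image tokens plus the text tokens generated up to and including step t.
        visible = n_visual + t + 1
        rows.append([1.0 / visible] * visible + [0.0] * (context_len - visible))
    return rows


n_visual, n_steps = 4, 12
visual_mask = [i < n_visual for i in range(n_visual + n_steps)]
mass = visual_attention_mass(toy_uniform_attention(n_visual, n_steps), visual_mask)

for step, share in enumerate(mass):
    print(f"step {step:2d}: visual attention share = {share:.3f}")
```

In this toy setup the share at step t is n_visual / (n_visual + t + 1), so it shrinks monotonically as reasoning continues even though nothing about the image has changed.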
Framework
REVEAL addresses this decay by modulating visual attention in response to reasoning-time signals — including model uncertainty and exogenous conditioning inputs — to recover visual grounding without requiring base model retraining.
The framework comprises three mechanisms:
- Phase displacement of position encodings, which geometrically broadens the attention pattern over visual tokens
- Patch salience bias, a per-patch content-aware modulation operating at the attention layer
- Exogenous conditioning signals, allowing domain-specific predictive models to drive attention adjustments based on prediction-vs-observation residuals
The three mechanisms can be applied independently or composed. The framework is architecture-agnostic and designed for fine-tuning rather than pretraining, making it deployable on existing models without the cost of training from scratch.
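As a rough illustration of how two of these mechanisms might compose, the sketch below adds a per-patch salience bias to the attention scores of visual tokens and scales it by a gate driven by model uncertainty and an exogenous residual. Everything here is an assumption made for illustration: the function names, the `strength` parameter, the use of normalized next-token entropy as the uncertainty signal, and the choice to combine the two signals with `max` are hypothetical and are not taken from the REVEAL specification. Phase displacement of position encodings is left out because its details are not described here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(n), so the result lies in [0, 1]."""
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def reattend(scores, visual_mask, salience, next_token_probs,
             residual=0.0, strength=2.0):
    """Add a gated salience bias to visual-token scores, then renormalize.

    scores: pre-softmax attention scores for one query over the context.
    visual_mask: True at visual-token positions.
    salience: per-position salience in [0, 1]; ignored at text positions.
    next_token_probs: the model's next-token distribution (uncertainty signal).
    residual: prediction-vs-observation residual, clipped to [0, 1].
    """
    # Hypothetical gate: open when either the model is uncertain or the
    # exogenous predictor disagrees with what is observed.
    clipped_residual = min(max(residual, 0.0), 1.0)
    gate = max(normalized_entropy(next_token_probs), clipped_residual)
    biased = [
        s + strength * gate * sal if is_visual else s
        for s, sal, is_visual in zip(scores, salience, visual_mask)
    ]
    return softmax(biased), gate


scores = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]           # two patches, four text tokens
visual_mask = [True, True, False, False, False, False]
salience = [1.0, 0.2, 0.0, 0.0, 0.0, 0.0]

base, gate_confident = reattend(scores, visual_mask, salience, [1.0, 0.0, 0.0, 0.0])
boosted, gate_uncertain = reattend(scores, visual_mask, salience, [0.25] * 4)

print("visual share, confident:", round(sum(base[:2]), 3))
print("visual share, uncertain:", round(sum(boosted[:2]), 3))
```

When the next-token distribution is fully confident and the residual is zero, the gate closes and the attention weights are unchanged. When the distribution is uniform, the gate opens fully and attention shifts toward the more salient patch. Because the bias is additive on the scores, a sketch like this could in principle sit on top of an existing attention layer without altering its weights.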
A provisional patent application covering the framework was filed with the USPTO on April 8, 2026. A nonprovisional application is planned for April 2027.
Validation domains
REVEAL is being validated across two application domains, both selected because they require sustained visual attention during multi-step reasoning over temporally rich signals:
Deception detection in video. Behavioral cues for deception are subtle, distributed across modalities (facial expression, gaze, gesture, vocal patterns), and require integration over time. This is a stress test for visual grounding during extended reasoning — the kind of task where attention decay would cause the model to fall back on linguistic priors rather than continuing to attend to the visual evidence.
Equestrian biomechanics analysis. Rider posture and horse gait analysis involve fine-grained spatial reasoning over multiple subjects (rider plus horse) with temporally evolving signals. The domain has a well-developed quantitative biomechanics literature, which provides reference data for empirical validation.
Both applications are in active development. Results will be published in /writing as they become available.
Contact
For research inquiries, dataset access discussions, or collaboration: Arjun.Joshi@Agni.works