Research

Active research threads at agni.works converge on a single question: when vision is the primary signal, how do you evaluate, condition, and deploy multimodal models against it? The threads below are independent investigations in their own right; collectively they form the empirical base that REVEAL is validated against.

Conversational sentiment analysis

Standard emotion-detection pipelines produce scalar outputs (a confidence score per emotion class) that strip away the context that makes affect legible. The research question here is whether a hybrid architecture can produce contextual narratives that survive scrutiny: interpretations grounded in observable behavior rather than confabulated from priors. In that architecture, continuous edge-side recognition produces a structured affect timeline, and a cloud-hosted vision-language model interprets that timeline against the surrounding media.

The architectural choice — signal extraction at the edge, contextual interpretation in the cloud — separates two computationally and epistemologically different problems. Real-time emotion classification is a well-defined supervised task; explaining why an emotion event occurred is open-ended and requires a model with sufficient context window and reasoning capacity to integrate across modalities.
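As a rough illustration of the edge-side half of that split, the sketch below collapses per-frame scalar scores into a structured affect timeline that could be serialized and handed to a cloud-hosted model. It is a minimal sketch under stated assumptions, not the project's implementation: the `AffectEvent` fields, the `build_timeline` and `timeline_to_json` helpers, and the 0.5 confidence threshold are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class AffectEvent:
    """One contiguous run of frames sharing the same dominant emotion (hypothetical schema)."""
    label: str
    start_frame: int
    end_frame: int  # exclusive
    peak_confidence: float


def build_timeline(frames: List[Dict[str, float]],
                   min_confidence: float = 0.5) -> List[AffectEvent]:
    """Merge consecutive frames with the same dominant label into events.

    Each frame is a non-empty mapping of emotion class to confidence score.
    Frames whose top score falls below min_confidence are skipped, which
    also breaks any event that was in progress.
    """
    events: List[AffectEvent] = []
    for i, scores in enumerate(frames):
        label, confidence = max(scores.items(), key=lambda kv: kv[1])
        if confidence < min_confidence:
            continue
        last = events[-1] if events else None
        if last is not None and last.label == label and last.end_frame == i:
            # Same emotion on the directly preceding frame: extend the event.
            last.end_frame = i + 1
            last.peak_confidence = max(last.peak_confidence, confidence)
        else:
            events.append(AffectEvent(label, i, i + 1, confidence))
    return events


def timeline_to_json(events: List[AffectEvent], fps: float) -> str:
    """Serialize events with timestamps in seconds, e.g. for inclusion in a prompt."""
    return json.dumps([
        {**asdict(e), "start_s": e.start_frame / fps, "end_s": e.end_frame / fps}
        for e in events
    ])


# Illustrative data only: four frames of made-up scores at 2 frames per second.
frames = [
    {"joy": 0.9, "anger": 0.1},
    {"joy": 0.8, "anger": 0.2},
    {"joy": 0.3, "anger": 0.4},  # below threshold, dropped
    {"joy": 0.3, "anger": 0.7},
]
timeline = build_timeline(frames)
payload = timeline_to_json(timeline, fps=2.0)
```

The point of the structure is that the cloud-side model receives events with extents and peaks rather than a stream of per-frame scalars, so its interpretation can be checked against specific, timestamped observations.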

Implementation lives in the engineering work. The research question is whether the resulting narratives are reliable enough for downstream use cases, a question that folds into the broader evaluation problem described under VLM evaluation below.

Equine biomechanics

Equestrian biomechanics is a domain where computer vision and biomechanical modeling converge: rider posture, horse gait, and their coupling are quantitative phenomena measurable from video, but the field's analysis tools have historically been built on hand-engineered features and 2D-derived approximations. The interesting research surface is the gap between what these tools measure and what's actually present in the signal.

Active work includes a multi-stream analysis pipeline on the PFERD dataset (3D motion capture at 240 Hz, 10 synchronized camera angles, parametric body model, multi-camera 2D keypoints, segmentation masks), a comparison against existing 2D-derived scoring systems, and an investigation of where deep-learning pose estimation loses information and where it preserves signal that classical methods miss. Specific subquestions include normalization strategy under asymmetric noise, cross-stream coupling between hoof kinematics and head movement, and frequency-domain analysis of gait stability.
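To make the frequency-domain subquestion concrete, the sketch below estimates the dominant stride frequency of a single marker trajectory and the share of spectral power concentrated around it. This is an illustrative assumption about what such an analysis might look like, not the pipeline's actual method: the function names, the use of spectral concentration as a stability proxy, and the synthetic 1.5 Hz signal are all hypothetical. Only the 240 Hz sample rate comes from the dataset description above. A plain discrete Fourier transform is written out with the standard library to keep the example self-contained; an FFT routine would be the usual choice on real data.

```python
import cmath
import math
from typing import List, Tuple


def power_spectrum(signal: List[float], sample_rate_hz: float) -> List[Tuple[float, float]]:
    """Return (frequency_hz, power) pairs for the positive, non-zero frequency bins.

    Uses a direct O(n^2) DFT on the mean-centered signal.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(centered))
        spectrum.append((k * sample_rate_hz / n, abs(coeff) ** 2))
    return spectrum


def dominant_frequency(spectrum: List[Tuple[float, float]]) -> float:
    """Frequency of the bin carrying the most power."""
    return max(spectrum, key=lambda fp: fp[1])[0]


def spectral_concentration(spectrum: List[Tuple[float, float]],
                           center_hz: float, half_width_hz: float) -> float:
    """Fraction of total power within center_hz +/- half_width_hz.

    Values near 1.0 indicate a strongly periodic signal; treating this as a
    gait-stability proxy is an assumption made for illustration.
    """
    total = sum(p for _, p in spectrum)
    if total == 0:
        return 0.0
    in_band = sum(p for f, p in spectrum if abs(f - center_hz) <= half_width_hz)
    return in_band / total


# Synthetic stand-in for a vertical hoof trajectory: 2 s at 240 Hz, 1.5 Hz stride.
SAMPLE_RATE_HZ = 240.0
trajectory = [math.sin(2 * math.pi * 1.5 * i / SAMPLE_RATE_HZ) for i in range(480)]

spectrum = power_spectrum(trajectory, SAMPLE_RATE_HZ)
stride_hz = dominant_frequency(spectrum)
concentration = spectral_concentration(spectrum, stride_hz, half_width_hz=0.25)
```

On a clean sinusoid the concentration is close to 1. Irregular strides or measurement noise would spread power across neighboring bins and pull the value down.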

The domain is also a validation context for REVEAL: fine-grained spatial reasoning over multiple subjects with temporally evolving signals is exactly the regime where attention decay during extended reasoning would matter. The biomechanics work nonetheless has its own research surface and stands independently of that validation role.

VLM evaluation in real-world applications

Benchmark performance on standard vision-language tasks is a poor predictor of behavior in deployment. Models that score competitively on captioning, VQA, or referring-expression benchmarks fail in characteristic ways when placed in long-running inference pipelines: hallucination under uncertainty, drift across multi-turn context, prior-driven outputs that ignore the visual evidence, and the visual attention decay that REVEAL was built to address.

The research here investigates evaluation patterns that surface these failure modes — multi-pass validation, cross-model agreement as a reliability signal, prompt protocols designed to elicit failure rather than success, and the use of structured-data sidecars to anchor model outputs against independently measurable ground truth.
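One of these patterns, cross-model agreement as a reliability signal, can be sketched in a few lines. The example below is a simplified illustration under stated assumptions, not the evaluation harness itself: the model names, the whitespace-and-case normalization, and the 0.75 review threshold are hypothetical, and real free-text outputs would need a more careful notion of answer equivalence than exact string match.

```python
from collections import Counter
from typing import Dict, Tuple


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings compare equal."""
    return " ".join(answer.lower().split())


def agreement(answers: Dict[str, str]) -> Tuple[str, float]:
    """Return the most common normalized answer and the fraction of models giving it.

    answers maps a model identifier to that model's answer for one item.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(normalize(a) for a in answers.values())
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


def needs_review(answers: Dict[str, str], threshold: float = 0.75) -> bool:
    """Flag an item when agreement falls below the (hypothetical) threshold."""
    _, score = agreement(answers)
    return score < threshold


# Illustrative answers from three hypothetical models to the same visual question.
answers = {
    "model_a": "Left hind",
    "model_b": "left  hind",
    "model_c": "right hind",
}
consensus, score = agreement(answers)
flagged = needs_review(answers)
```

Agreement alone does not establish correctness, since models can share the same prior-driven error. That is the reason to pair it with independently measured ground truth where a structured-data sidecar is available.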

Many of the patterns that motivated REVEAL were first observed during evaluation work on production VLM pipelines. The evaluation thread is upstream of REVEAL methodologically and continues independently.

Contact

For collaboration, dataset access, or correspondence: Arjun.Joshi@Agni.works