Engineering

Research that depends on running models against real signals depends, in turn, on infrastructure: pipelines that compose models reliably, edge inference that holds frame budget under thermal and memory constraints, and architectures that split compute across edge and cloud where each is appropriate. The engineering work below predates and supports the research surfaces; in several cases the research questions only became visible because the infrastructure existed.

VLM pipelines and prompt engineering

Building reliable vision-language model pipelines requires more than a model endpoint. Multi-turn context handling, prefix-cache-aware history, structured extraction of chain-of-thought traces, multi-pass validation, batch processing with checkpoint recovery, and prompt-preset management are all load-bearing pieces. agni.works has built several of these as research tooling.
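As an illustration of the checkpoint-recovery piece, here is a minimal sketch. It assumes a JSON Lines checkpoint file and a hypothetical `caption_fn` callable standing in for the model call; neither name comes from the agni.works tooling, and the real implementations may work differently.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def run_batch(
    items: Iterable[str],
    caption_fn: Callable[[str], str],  # hypothetical stand-in for a VLM call
    checkpoint: Path,
) -> dict[str, str]:
    """Process items, appending each result to a JSONL checkpoint.

    On a rerun, items already recorded in the checkpoint are skipped.
    """
    done: dict[str, str] = {}

    # Reload any results written by an earlier (possibly interrupted) run.
    if checkpoint.exists():
        with checkpoint.open() as f:
            for line in f:
                line = line.strip()
                if line:
                    record = json.loads(line)
                    done[record["item"]] = record["result"]

    # Append new results one at a time so progress survives an interruption.
    with checkpoint.open("a") as f:
        for item in items:
            if item in done:
                continue
            result = caption_fn(item)
            f.write(json.dumps({"item": item, "result": result}) + "\n")
            f.flush()
            done[item] = result

    return done
```

Writing one record per item, rather than one file at the end, is what makes the batch resumable: the cost of an interruption is at most the item in flight.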

vision_lab. A multimodal interface for vision-language model interaction: multi-turn chat with prefix-cache-aware history, batch captioning with second-pass refinement, thinking-text capture, agentic tool use (read/write files, list directories), prompt preset management, and inline caption review.

→ github.com/aiaxmaior/vision_lab
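How vision_lab implements prefix-cache-aware history is not described here. One common reading is that the conversation is kept append-only, so each request's serialized prompt begins with exactly the text of the previous request and an inference server with prefix caching can reuse earlier computation. The sketch below illustrates that idea only; the class name and the tag-based template are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AppendOnlyHistory:
    """Chat history that is only ever appended to (hypothetical helper)."""

    system_prompt: str
    turns: list[tuple[str, str]] = field(default_factory=list)

    def add(self, role: str, text: str) -> None:
        # Earlier turns are never edited or reordered, so the rendered
        # prefix stays identical from one request to the next.
        self.turns.append((role, text))

    def render(self) -> str:
        # Illustrative template; a real model would use its own chat format.
        parts = [f"<system>{self.system_prompt}</system>"]
        parts += [f"<{role}>{text}</{role}>" for role, text in self.turns]
        return "\n".join(parts)


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two rendered prompts have in common."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n
```

In this sketch, rendering the history before and after adding a turn yields two prompts where the first is a strict prefix of the second. Editing or summarizing an earlier turn would break that property and shorten the reusable prefix.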

trainer_corpus_suite. A pipeline for building cross-validated training corpora from multimodal signals. It composes emotion analysis, demographic estimation, action recognition, pose estimation, and VLM captioning into a single annotation graph, then checks each method's output against the others. The multi-pass validation pattern in this pipeline directly motivated REVEAL's per-patch salience conditioning.

→ github.com/aiaxmaior/trainer_corpus_suite
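The cross-validation step can be pictured as an agreement check over labels that different methods assign to the same sample. The following is a minimal sketch under that assumption. The function name, the method names in the usage note, and the majority-vote rule are all illustrative and are not taken from trainer_corpus_suite.

```python
from collections import Counter


def cross_validate(labels: dict[str, str], min_agreement: float = 0.5) -> dict:
    """Combine per-method labels for one sample by majority vote.

    `labels` maps a method name to the label it produced. The sample is
    accepted only if the winning label's share of votes exceeds
    `min_agreement`.
    """
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels.values())
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(labels)
    return {
        "label": label,
        "agreement": agreement,
        "accepted": agreement > min_agreement,
    }
```

For example, if a hypothetical emotion classifier and a VLM caption both say "happy" while a pose heuristic says "neutral", the sample is accepted as "happy" with two-thirds agreement. A two-way split is rejected.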

Edge inference

Deploying vision models at the edge is constrained by compute, memory, and power budgets that don't apply in datacenter settings. Practical edge inference is mostly about which corners can be cut without breaking the signal — which models can run at fractional rates, which can share intermediate representations, and which need every frame.

Work on Jetson Orin platforms includes TensorRT optimization (PyTorch to ONNX to engine), which reduced per-frame inference from hundreds of milliseconds to tens; DeepStream pipeline construction with custom buffer probes; multi-model orchestration with priority scheduling across critical (every frame), important (every second frame), and analytics (every fifth frame) tiers; NvDCF tracker integration; multi-camera CSI ingestion (IMX219 + IMX477); and OpenCV-based camera calibration for downstream geometric work such as monocular distance estimation.
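The three scheduling tiers can be expressed as per-tier frame strides. This is a minimal sketch of that idea, not the DeepStream implementation: the tier names and rates come from the description above, while `TIER_STRIDE`, `models_due`, and the model names in the usage note are hypothetical.

```python
# Run a model when frame_index is a multiple of its tier's stride.
TIER_STRIDE = {
    "critical": 1,   # every frame
    "important": 2,  # every second frame
    "analytics": 5,  # every fifth frame
}


def models_due(frame_index: int, model_tiers: dict[str, str]) -> list[str]:
    """Return the models that should run on this frame, in registration order."""
    return [
        name
        for name, tier in model_tiers.items()
        if frame_index % TIER_STRIDE[tier] == 0
    ]
```

With a face detector registered as critical, an emotion model as important, and a scene model as analytics, frame 1 runs only the detector, frame 2 adds the emotion model, and all three coincide on frames 0, 10, 20, and so on. A production scheduler might offset the tiers to avoid those coinciding frames; this sketch does not.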

The infrastructure has supported single-model real-time pipelines (face detection at 30 fps with multi-frame consistency validation) and multi-model pipelines (object detection, tracking, emotion recognition, and scene understanding running concurrently on a single device). The edge-deployment work is also where the cost curve of attention modulation becomes concrete: a research question with an immediate engineering consequence.
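The text does not say how multi-frame consistency validation is done. One common approach is a k-of-n vote over a sliding window, which suppresses single-frame false positives at the cost of a few frames of latency. The sketch below assumes that approach; the class name and default values are hypothetical.

```python
from collections import deque


class ConsistencyFilter:
    """Confirm a detection only if it appears in enough recent frames."""

    def __init__(self, window: int = 5, min_hits: int = 3) -> None:
        if not 1 <= min_hits <= window:
            raise ValueError("require 1 <= min_hits <= window")
        # deque with maxlen drops the oldest frame automatically.
        self.history: deque[bool] = deque(maxlen=window)
        self.min_hits = min_hits

    def update(self, detected: bool) -> bool:
        """Record this frame's raw detection and return the validated result."""
        self.history.append(detected)
        return sum(self.history) >= self.min_hits
```

At 30 fps, a five-frame window spans roughly 167 ms, so a three-of-five rule confirms a face within about a tenth of a second of it first appearing.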

Conversational sentiment infrastructure

Real-time emotion recognition at the edge produces a high-frequency stream of scalar signals. On its own, that's data, not understanding. The infrastructure here splits the problem along the line where it naturally divides: continuous signal extraction at the edge, event-triggered contextual interpretation in the cloud.

The edge device runs continuous emotion recognition and threshold-based event detection, producing structured timelines of affect signals (per-class probabilities plus valence over time, indexed against video frames). When an event is triggered, the timeline and the surrounding media frames are sent to a cloud-hosted vision-language model, which returns a contextual narrative that explains the event and is grounded in the observable behavior. The architecture separates the well-defined signal-extraction problem (where edge compute is sufficient) from the open-ended interpretation problem (where larger-model reasoning is required).
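The edge-side half of this split can be sketched as a scan over the affect timeline. This minimal example assumes valence lies in [-1, 1] and that an event is a run of consecutive samples at or below a threshold. The `AffectSample` type, the trigger rule, and the default values are illustrative assumptions, not the deployed logic.

```python
from dataclasses import dataclass


@dataclass
class AffectSample:
    frame: int               # video frame index
    valence: float           # assumed range [-1, 1]
    probs: dict[str, float]  # per-class emotion probabilities


def detect_events(
    timeline: list[AffectSample],
    threshold: float = -0.5,
    min_frames: int = 3,
) -> list[tuple[int, int]]:
    """Return (first_frame, last_frame) for each sustained low-valence run."""
    events: list[tuple[int, int]] = []
    run: list[int] = []  # frames in the current below-threshold run

    for sample in timeline:
        if sample.valence <= threshold:
            run.append(sample.frame)
            continue
        # The run just ended; keep it only if it lasted long enough.
        if len(run) >= min_frames:
            events.append((run[0], run[-1]))
        run = []

    # Handle a run that extends to the end of the timeline.
    if len(run) >= min_frames:
        events.append((run[0], run[-1]))
    return events
```

Each returned frame range identifies the slice of timeline and the media frames to forward for interpretation. The request sent to the cloud-hosted model is not specified here, so it is left out of the sketch.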

The research question this infrastructure makes answerable — whether VLM-generated narratives over multimodal affect signals are reliable enough for downstream use — appears under /research.

Contact

For technical inquiries, collaboration on infrastructure, or correspondence: Arjun.Joshi@Agni.works