NAVI-Orbital: First In-Orbit Demonstration of a Zero-Shot Vision-Language Model for Autonomous Earth Observation
Computer Science > Artificial Intelligence
arXiv:2606.18271 (cs)
Submitted on 5 Jun 2026
Authors: Juan Manuel Delfa Victoria, Taran Cyriac John, Andrew W. Herson
Abstract
Earth Observation data generation now outpaces both downlink bandwidth and manual processing capacity, widening the gap between onboard data collection and actionable intelligence on the ground. This paper presents NAVI-Orbital, a software system deployed on a Low Earth Orbit (LEO) spacecraft. On April 16, 2026, NAVI-Orbital achieved what is, to the authors' knowledge, the first in-orbit demonstration of a vision-language model performing autonomous multi-modal inference entirely onboard. The system uses a local vision-language model, Gemma 3, to classify each captured scene, generate a text description of its content and inter-feature relationships, and respond to follow-up operator queries via natural-language dialogue. Rather than being driven by conventional command sequences, NAVI-Orbital is re-tasked through plain-English prompts and orchestrated by a graph-based state machine (LangGraph) that coordinates dedicated agents for detection and dialogue. Results come from three stages: ground benchmarking (88.16% accuracy on the 7,960-image curated AID benchmark), Flatsat validation, and live on-orbit captures of newly acquired, previously unseen Earth imagery, including uncorrected YAM-9 images processed onboard with hardware-accelerated GPU inference and no fine-tuning for the flight instrument. Together, these results demonstrate the feasibility of deploying foundation models on satellite-class edge computers. This approach inverts the traditional "acquire-then-downlink-everything" bandwidth profile by performing semantic compression of Earth observations in orbit.
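The orchestration pattern described in the abstract (a graph-based state machine that routes each captured scene through a detection agent and, when an operator query is pending, a dialogue agent) can be illustrated with a minimal plain-Python sketch. This is not the authors' implementation and does not use the LangGraph API. The names `SceneState`, `stub_vlm`, `detection_agent`, `dialogue_agent`, and `run_graph` are hypothetical, and the model call is a placeholder string rather than Gemma 3 inference.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class SceneState:
    """Shared state passed between nodes (hypothetical structure)."""
    image_id: str
    prompt: str                      # plain-English tasking prompt
    query: Optional[str] = None      # optional follow-up operator question
    label: Optional[str] = None
    description: Optional[str] = None
    answer: Optional[str] = None
    trace: List[str] = field(default_factory=list)


def stub_vlm(prompt: str, image_id: str) -> str:
    # Placeholder standing in for the onboard vision-language model.
    return f"[stub reply to '{prompt}' for {image_id}]"


def detection_agent(state: SceneState) -> str:
    # Classify the scene and describe it, then choose the next node.
    state.label = stub_vlm("Classify: " + state.prompt, state.image_id)
    state.description = stub_vlm("Describe the scene", state.image_id)
    state.trace.append("detect")
    return "dialogue" if state.query else "end"


def dialogue_agent(state: SceneState) -> str:
    # Answer the operator's follow-up question about the same scene.
    state.answer = stub_vlm(state.query or "", state.image_id)
    state.trace.append("dialogue")
    return "end"


# Each node returns the name of the next node; "end" terminates the run.
NODES: Dict[str, Callable[[SceneState], str]] = {
    "detect": detection_agent,
    "dialogue": dialogue_agent,
}


def run_graph(state: SceneState, entry: str = "detect") -> SceneState:
    node = entry
    while node != "end":
        node = NODES[node](state)
    return state
```

In this sketch, re-tasking amounts to changing the `prompt` or `query` strings rather than uploading a new command sequence. Only the short text fields of the final state would need to be downlinked, which is the sense in which the abstract speaks of semantic compression.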
Additional Information
- Comments: 17 pages, 47 figures
- Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Cite as: arXiv:2606.18271 [cs.AI]
- DOI: https://doi.org/10.48550/arXiv.2606.18271
