Image: AI-generated illustration by Nano Banana. Photo by the Norwegian Computing Center.

Trends in Visual Intelligence 2026

The field of Visual Intelligence is continuously transforming. Chief Research Scientist Arnt-Børre Salberg dives deeper into the current trends in the field of visual intelligence as of early 2026.

By Arnt-Børre Salberg, Chief Research Scientist at Visual Intelligence and the Norwegian Computing Center

While previous years were influenced by the emergence of Large Language Models (LLMs), generative models, and multi-modal capabilities, the current landscape is increasingly expanding toward specialized foundation models for the physical world, efficient 3D representations, and sophisticated AI agents.

Foundation models for Earth and science

Foundation models (FMs) are expanding beyond text and general images into complex scientific domains. A major trend is the development of domain-specific multi-modal FMs, for example for Earth observation (EO) and weather forecasting.

Aurora [Bodnar et al., 2025] represents a leap for Earth system forecasting. Pretrained on over one million hours of diverse data like ERA5 and CMIP6 climate simulations, it uses a 3D Perceiver encoder to ingest heterogeneous variables at any resolution. This 1.3-billion-parameter FM employs a 3D Swin Transformer backbone to outperform operational systems in predicting air quality, tropical cyclone tracks, and 0.1° high-resolution weather at a fraction of the computational cost.

Aardvark Weather [Allen et al., 2025] focuses on a fully end-to-end approach, ingesting raw observations like satellite and station data directly to produce forecasts in roughly one second on GPUs, a massive speedup compared to the 1,000 node-hours required by traditional systems. Together, these models demonstrate that AI can deliver high-fidelity global and local predictions without the resource-heavy reliance on traditional numerical solvers at deployment time.

Within EO, the FM TerraMind [Jakubik et al., 2025] demonstrates superior benchmark performance as well as generative capabilities. THOR [Forgaard et al., 2026] (developed by NR and UiT) is a flexible, multi-sensor model for EO that unifies Sentinel-1 SAR, Sentinel-2 MSI, and Sentinel-3 OLCI & SLSTR data from 10 m to 1000 m ground sampling distance (GSD), built on a compute-adaptive vision transformer backbone. This enables the use of smaller patch sizes at inference, providing a denser token sequence that is more effective for fine-tuning on limited data.
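To see why smaller patches give denser token sequences, note that a plain vision transformer splits an H×W image into (H/p)·(W/p) non-overlapping patches, each becoming one token. A minimal sketch (the function name is illustrative, not from THOR's codebase):

```python
def vit_token_count(height: int, width: int, patch_size: int) -> int:
    """Number of tokens a plain ViT produces when an image is split into
    non-overlapping patch_size x patch_size patches."""
    assert height % patch_size == 0 and width % patch_size == 0
    return (height // patch_size) * (width // patch_size)

# Halving the patch size quadruples the sequence length:
print(vit_token_count(224, 224, 16))  # 196 tokens
print(vit_token_count(224, 224, 8))   # 784 tokens
```

The quadratic growth in tokens explains the trade-off a compute-adaptive backbone navigates: denser tokens help fine-grained tasks, but cost more compute.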

However, one of the largest breakthroughs within EO is Google’s AlphaEarth Foundations [Brown et al., 2025], which provides dense Earth embeddings of 64-element feature vectors on a 10 m grid with global coverage and monthly temporal resolution, accessible via Google Earth Engine. The embeddings are created by integrating data from various sources, including satellite images, radar, 3D laser mapping, and climate models.

Left: the global AlphaEarth embedding field. Right: each embedding vector has 64 components that map to a point on a 64-dimensional sphere. Illustration provided by Arnt-Børre Salberg.
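Because each embedding is a unit vector on a 64-dimensional sphere, comparing two locations reduces to a dot product. A minimal numpy sketch, with random vectors standing in for real AlphaEarth embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

def to_unit_sphere(v: np.ndarray) -> np.ndarray:
    """Project a 64-element feature vector onto the unit sphere."""
    return v / np.linalg.norm(v)

# Stand-ins for two 64-dimensional embedding vectors (random here;
# real ones would be sampled from the Earth Engine dataset).
a = to_unit_sphere(rng.normal(size=64))
b = to_unit_sphere(rng.normal(size=64))

# For unit vectors, cosine similarity is simply the dot product,
# so similarity search over the embedding field is a fast linear-algebra task.
similarity = float(a @ b)
print(f"cosine similarity: {similarity:.3f}")
```

This is why dense embedding fields are so practical for mapping: change detection, clustering, and similarity search all become simple vector operations.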

3D Vision

3D vision is experiencing a revolution fueled by AI, moving from iterative optimization to direct prediction. New architectures like the Visual Geometry Grounded Transformer (VGGT) [Wang et al., 2025] can directly infer key 3D attributes, such as camera parameters and depth maps, from just a few views. Another interesting direction is 3D Gaussian Splatting, a technique that creates photorealistic scenes by modeling them as a set of 3D anisotropic Gaussians. This method supports complex view-dependent effects and enables real-time rendering for VR and AR, surpassing traditional point clouds by better capturing fine-grained geometric details and surface textures.
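The core rendering idea behind Gaussian splatting can be sketched in a few lines: each splat contributes a Gaussian-weighted opacity at a pixel, and colors are composited front to back. This toy 2D version (names and toy dynamics are illustrative, not from any splatting library) omits the 3D-to-2D projection step:

```python
import numpy as np

def splat_pixel(pixel_xy, splats):
    """Front-to-back alpha compositing of 2D anisotropic Gaussian splats
    at a single pixel. Each splat is (mean, inv_cov, color, opacity),
    assumed already sorted by depth (nearest first)."""
    color = np.zeros(3)
    transmittance = 1.0
    for mean, inv_cov, c, opacity in splats:
        d = pixel_xy - mean
        # Anisotropic Gaussian falloff: exp(-0.5 * d^T Sigma^-1 d)
        alpha = opacity * np.exp(-0.5 * d @ inv_cov @ d)
        color += transmittance * alpha * np.asarray(c, dtype=float)
        transmittance *= 1.0 - alpha
    return color

# One red splat centered exactly on the pixel contributes its full opacity:
splats = [(np.zeros(2), np.eye(2), (1.0, 0.0, 0.0), 0.8)]
print(splat_pixel(np.zeros(2), splats))  # [0.8 0.  0. ]
```

Because every step is differentiable and the compositing is cheap, thousands of such splats can be optimized and rendered in real time, which is what sets the method apart from classical point clouds.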

Researchers are also integrating these techniques into multimodal pre-training. For example, UniGS (Unified Language-Image-3D Pretraining with Gaussian Splatting) [Li et al., 2025] enhances 3D representations by aligning them with text and image encoders via a Gaussian-aware guidance module, enabling tasks like zero-shot classification and text-driven retrieval in 3D scenes.

Generative AI and architecture shifts

Diffusion models remain a cornerstone of generative AI, but their architectures are evolving. We are seeing a shift from traditional U-Nets to Diffusion Transformers [e.g., Peebles & Xie, 2023], where self-attention mechanisms, often optimized for scale, capture the global context necessary for coherent high-resolution generation. Efficiency is also a priority, with techniques like distillation and rectified flows reducing the number of sampling steps required for content generation.

Beyond diffusion, Visual Autoregressive Modeling (VAR) [Tian et al., 2024] is emerging as a scalable alternative. Unlike traditional next-token prediction, VAR generates images via "next-scale prediction," building the image from low to high resolutions in a coarse-to-fine manner.
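The coarse-to-fine generation loop can be sketched as follows; `predict_residual` is a hypothetical stand-in for VAR's autoregressive transformer, which in the real model predicts quantized token maps conditioned on all coarser scales:

```python
import numpy as np

rng = np.random.default_rng(0)

def upsample(x: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbour upsampling of a square map to size x size."""
    reps = size // x.shape[0]
    return np.repeat(np.repeat(x, reps, axis=0), reps, axis=1)

def predict_residual(context: np.ndarray, size: int) -> np.ndarray:
    """Stand-in for the transformer: VAR would predict the next-scale
    token map here, conditioned on everything generated so far."""
    return rng.normal(scale=0.1, size=(size, size))

# "Next-scale prediction": start from a 1x1 map and refine up to 8x8,
# at each step upsampling the canvas and adding predicted detail.
scales = [1, 2, 4, 8]
canvas = np.zeros((1, 1))
for size in scales:
    canvas = upsample(canvas, size) + predict_residual(canvas, size)

print(canvas.shape)  # (8, 8)
```

Each scale is generated in one parallel step rather than token by token, which is where the method's speed advantage over raster-order autoregression comes from.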

World Models

In robotics, Navigation World Models  [Bar et al., 2025] represent a shift from fixed policies to predictive planning. These models allow robots to hallucinate future visual observations based on proposed actions, enabling them to simulate paths in novel environments from a single input image. Crucially, they support test-time conditioning, meaning users can dynamically impose new constraints (e.g., 'avoid the wet floor') without retraining.
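Planning with a world model amounts to imagining rollouts for candidate action sequences and keeping the best one. A toy sketch (the dynamics and scoring are illustrative placeholders, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def world_model(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Stand-in for a learned navigation world model: predicts the next
    (latent) observation from the current one and an action."""
    return state + action  # toy linear dynamics

def plan(state, goal, n_candidates=64, horizon=5):
    """Pick the action sequence whose imagined rollout ends nearest the goal."""
    best_score, best_actions = np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1, 1, size=(horizon, 2))
        s = state
        for a in actions:                 # "hallucinate" the trajectory
            s = world_model(s, a)
        score = float(np.linalg.norm(s - goal))
        if score < best_score:
            best_score, best_actions = score, actions
    return best_actions, best_score

actions, score = plan(np.zeros(2), np.array([2.0, 2.0]))
print(f"best imagined end-point error: {score:.2f}")
```

Test-time conditioning slots naturally into this loop: an extra constraint (e.g. penalizing trajectories that cross a forbidden region) is just another term in the score, with no retraining required.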

From video to 3D information: VGGT enables creation of 3D models from just a few views of an object.

AI Agents

One of the most promising frontiers is the development of AI Agents, autonomous systems capable of perceiving their environment, reasoning, and acting to achieve goals without constant human input. A prime example is Claude Code, an agentic command-line interface (CLI) that allows developers to interact with the model directly in their terminal to automate software development. It understands local file structures, can read and edit files, run terminal commands, manage git operations, and solve complex debugging tasks through natural-language prompts. Beyond the terminal, 2025 introduced "computer use" capabilities (e.g., OpenAI Operator), enabling agents to visually navigate desktop GUIs and web browsers like a human.
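The perceive-reason-act loop underlying such agents can be sketched in a few lines. Here `llm` and `tools` are hypothetical stand-ins for a real model API and tool registry, not Claude Code's actual interface:

```python
def run_agent(goal, llm, tools, max_steps=10):
    """Minimal agentic loop: the model observes the transcript, picks a
    tool, and acts, until it declares the goal done."""
    history = [f"GOAL: {goal}"]
    for _ in range(max_steps):
        decision = llm(history)             # reason over everything so far
        if decision["action"] == "finish":
            return decision["result"]
        tool = tools[decision["action"]]    # e.g. read_file, run_command
        observation = tool(**decision["args"])
        history.append(f"{decision['action']} -> {observation}")
    return None

# Toy example: an "llm" that reads one file, then finishes with what it saw.
def toy_llm(history):
    if len(history) == 1:
        return {"action": "read_file", "args": {"path": "notes.txt"}}
    return {"action": "finish", "result": history[-1]}

tools = {"read_file": lambda path: f"contents of {path}"}
print(run_agent("summarise notes", toy_llm, tools))
```

Real agents add planning, error recovery, and safety checks around this loop, but the structure (observe, decide, act, append the result to context) is the same.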

Fueled by AI agents, we have also seen the rise of "vibe coding". Natural-language interfaces are lowering the barrier to software creation, enabling non-developers to build tools by describing what they want. Frameworks like Lovable (and Claude Code) could fundamentally change who builds new software, and how fast.

Final comments

As the demand for data grows, synthetic data is becoming indispensable. It serves as a proxy for real-world data, addressing privacy concerns and the high cost of labeling. Gartner predicts that by 2030, synthetic data will completely overshadow real data in AI models.

Finally, as these technologies mature, Trustworthy AI remains critical. The focus is shifting from "black box" models to explainable frameworks and models that quantify their uncertainty, ensuring transparency in decision-making. Addressing algorithmic bias and ensuring data privacy through governance frameworks like GDPR and the EU AI Act are essential steps to maintain human oversight and trust in high-stakes applications.

References

Allen, A., Markou, S., Tebbutt, W., Requeima, J., Bruinsma, W. P., Andersson, T. R., ... & Turner, R. E. (2025). End-to-end data-driven weather prediction. Nature, 641(8065), 1172-1179.

Bar, A., Zhou, G., Tran, D., Darrell, T., & LeCun, Y. (2025). Navigation world models. In Proceedings of the Computer Vision and Pattern Recognition Conference (pp. 15791-15801).

Bodnar, C., Bruinsma, W. P., Lucic, A., Stanley, M., Allen, A., Brandstetter, J., ... & Perdikaris, P. (2025). A foundation model for the Earth system. Nature, 1-8.

Brown, C. F., Kazmierski, M. R., Pasquarella, V. J., Rucklidge, W. J., Samsikova, M., Zhang, C., ... & Kohli, P. (2025). AlphaEarth foundations: an embedding field model for accurate and efficient global mapping from sparse label data. arXiv preprint arXiv:2507.22291.

Forgaard, T., Reksten, J. H., Waldeland, A. U., Marsocci, V., Longépé, N., Kampffmeyer, M., & Salberg, A. B. (2026). THOR: A Versatile Foundation Model for Earth Observation Climate and Society Applications. arXiv preprint arXiv:2601.16011.

Jakubik, J., Yang, F., Blumenstiel, B., Scheurer, E., Sedona, R., Maurogiovanni, S., ... & Longépé, N. (2025). TerraMind: Large-scale generative multimodality for Earth observation. arXiv preprint arXiv:2504.11171.

Li, H., Zhou, Y., Tang, T., Song, J., Zeng, Y., Kampffmeyer, M., ... & Liang, X. (2025). UniGS: Unified language-image-3D pretraining with Gaussian splatting. arXiv preprint arXiv:2502.17860.

Peebles, W., & Xie, S. (2023). Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 4195-4205).

Tian, K., Jiang, Y., Yuan, Z., Peng, B., & Wang, L. (2024). Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in neural information processing systems, 37, 84839-84865.

Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., & Novotny, D. (2025). VGGT: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference (pp. 5294-5306).
