
Introducing Virgo Network, Google’s scale-out AI data center fabric

April 22, 2026
https://storage.googleapis.com/gweb-cloudblog-publish/images/GCN26_102_BlogHeader_2436x1200_Opt_2_Dark.max-2500x2500.jpg
Benny Siman-Tov

Senior Director Product Management

Arjun Singh

Engineering Fellow


The AI era requires a fundamental rethink of physical cloud architecture — networking, in particular. With foundational model parameters growing exponentially, traditional general-purpose networks are reaching their breaking points. To fuel the next decade of machine learning, Google designed Virgo Network, a new megascale AI data center fabric that embraces a "campus-as-a-computer" philosophy, and that underpins our AI Hypercomputer.

Legacy network designs simply cannot handle the constraints of modern AI:

  1. Massive scale: Training demands now exceed the power and space of a single data center, requiring unified, multi-data-center domains.

  2. Explosive bandwidth growth: Because foundational model training is heavily network-bound, the required bandwidth per accelerator has surged significantly over the last few years, creating throughput and congestion bottlenecks for older architectures.

  3. Synchronized bursts: Intense, millisecond-level traffic spikes (figure 1) put immense pressure on network buffers. The outcome is that even a single "straggler" node can throttle the entire cluster’s performance.

  4. Low latency: ML serving requires fast, consistent response times to deliver real-time inference, making strict latency control a critical architectural constraint.
https://storage.googleapis.com/gweb-cloudblog-publish/images/1_Sub-millisecond_line-rate_bursts_of_an_A.max-1700x1700.png

Figure 1: Sub-millisecond line-rate bursts of an AI training workload
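To build intuition for why these synchronized bursts stress the network, here is a minimal back-of-the-envelope sketch. All values (fan-in, link speed, burst duration) are illustrative assumptions, not Virgo Network specifications; it simply estimates how much buffering an egress port would need to absorb a sub-millisecond incast burst without loss.

```python
# Rough estimate of the buffering needed to absorb a synchronized,
# line-rate burst. All values are illustrative assumptions, not Virgo specs.

def burst_buffer_bytes(fan_in: int, link_gbps: float, burst_us: float) -> float:
    """Bytes queued at an egress port when `fan_in` senders transmit at line
    rate toward one receiver for `burst_us` microseconds.

    Traffic arrives at fan_in * link_gbps but drains at only link_gbps,
    so the excess accumulates in the switch buffer.
    """
    excess_gbps = (fan_in - 1) * link_gbps            # what the port cannot drain
    excess_bits = excess_gbps * 1e9 * burst_us * 1e-6
    return excess_bits / 8                            # bits -> bytes

if __name__ == "__main__":
    # Example: 8 accelerators bursting at 400 Gb/s toward one peer for 500 us.
    needed = burst_buffer_bytes(fan_in=8, link_gbps=400, burst_us=500)
    print(f"Buffer needed to avoid loss: {needed / 1e6:.1f} MB")   # ~175 MB
```

Even this toy model shows that switch buffers alone cannot absorb sustained line-rate incast, which is why burst behavior has to be addressed architecturally rather than buffered away.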

Reimagining the data center network

Meeting the demands of the AI era requires a fundamental shift away from general-purpose network design towards a specialized flat, low-latency network architecture. To address the unique scale and latency constraints, we leverage our proven Jupiter network for north-south traffic and are introducing a new fabric for east-west communication. The resulting architecture consists of three distinct and specialized layers that operate as one unified compute domain:

  1. Scale-up domain: A high-bandwidth, low-latency interconnect fabric designed for tightly coupled communication between accelerators within a single pod. 

  2. Scale-out accelerator fabric (east-west): A dedicated accelerator-to-accelerator remote direct memory access (RDMA) fabric optimized for massive horizontal scale across pods. This layer is engineered for deterministic latency and maximum resilience, to provide high “goodput” for the ML workload.

  3. Jupiter front-end network (north-south): A high-capacity fabric that provides fast, reliable access to distributed storage and general-purpose compute resources. It ensures that data access does not become a bottleneck for training and serving workloads, and is also used to scale across multiple sites for very large training runs.
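As a rough illustration of how these three domains divide up the traffic, the sketch below classifies a flow by its endpoints. The endpoint kinds and steering rules are illustrative assumptions for this post, not the production placement logic.

```python
# Illustrative mapping of a flow onto one of the three network domains.
# Endpoint kinds and routing rules are assumptions made for this sketch only.
from enum import Enum, auto

class Endpoint(Enum):
    ACCELERATOR = auto()   # TPU or other ML accelerator
    STORAGE = auto()       # distributed storage service
    CPU_HOST = auto()      # general-purpose compute

def pick_domain(src: Endpoint, dst: Endpoint, same_pod: bool) -> str:
    """Return which fabric a flow between `src` and `dst` would traverse."""
    if src is Endpoint.ACCELERATOR and dst is Endpoint.ACCELERATOR:
        if same_pod:
            return "scale-up interconnect"                # tightly coupled, in-pod
        return "scale-out accelerator fabric (east-west RDMA)"
    return "Jupiter front-end network (north-south)"

if __name__ == "__main__":
    print(pick_domain(Endpoint.ACCELERATOR, Endpoint.ACCELERATOR, same_pod=True))
    print(pick_domain(Endpoint.ACCELERATOR, Endpoint.ACCELERATOR, same_pod=False))
    print(pick_domain(Endpoint.ACCELERATOR, Endpoint.STORAGE, same_pod=False))
```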

This architectural decoupling provides key strategic advantages:

  1. Independent evolution: We can evolve and upgrade each network domain independently, preventing system-wide disruptions while accelerating the innovation cycle. 

  2. Dedicated scale-out bandwidth: A non-blocking network delivers massive bisection bandwidth to accelerators for critical training tasks.

  3. ML and network co-design: The network is built in lockstep with each new generation of ML accelerators, helping ensure the fabric is matched to the hardware it supports.
https://storage.googleapis.com/gweb-cloudblog-publish/images/2_Data_center_network_architecture.max-1500x1500.png

Figure 2: Data center network architecture

Introducing Virgo Network: Megascale data center fabric

Virgo Network is a scale-out fabric designed for the extreme requirements of modern AI workloads. Built on high-radix switches, whose high port counts reduce the number of switching tiers required, it employs a flat, two-layer non-blocking topology that significantly lowers latency compared with traditional data center networks. It features a multi-planar design with independent control domains to connect accelerators (figure 3). The accelerator racks also connect to the Jupiter north-south fabric to access compute and storage services. Together, this streamlined architecture delivers the massive bisection bandwidth and deterministic low latency necessary for both distributed training and serving workloads.

https://storage.googleapis.com/gweb-cloudblog-publish/images/3_Megascale_data_center_fabric_Virgo_Netwo.max-1600x1600.png

Figure 3: Megascale data center fabric (Virgo Network)
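For intuition on why high-radix switches make a flat, two-layer non-blocking fabric possible, here is the textbook leaf-spine (folded-Clos) arithmetic. The radix and port speed values are illustrative assumptions, not Virgo Network specifications.

```python
# Standard two-tier, non-blocking leaf-spine (Clos) arithmetic.
# Radix and port speed below are illustrative, not Virgo specifications.

def two_tier_capacity(radix: int, port_gbps: float) -> tuple[int, float]:
    """Max endpoints and bisection bandwidth (Pb/s) of a non-blocking,
    two-tier leaf-spine fabric built from radix-`radix` switches.

    Each leaf splits its ports: radix/2 down to endpoints and radix/2 up to
    spines. With radix/2 spines of radix ports each, up to `radix` leaves
    fit, for radix * radix / 2 endpoints served at full rate.
    """
    endpoints = radix * radix // 2
    bisection_pbps = endpoints * port_gbps / 2 / 1e6   # Gb/s -> Pb/s
    return endpoints, bisection_pbps

if __name__ == "__main__":
    for radix in (64, 128, 512):
        n, bisect = two_tier_capacity(radix, port_gbps=400)
        print(f"radix {radix:>3}: up to {n:,} endpoints, ~{bisect:.1f} Pb/s bisection")
```

Doubling the switch radix quadruples the number of endpoints a two-tier fabric can serve, which is what keeps the topology flat as the fleet grows.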

Virgo Network is the foundation of our next-generation accelerator designs and delivers the following advantages:

  • Massive fabric scale: Virgo Network can link 134,000 chips (TPU 8t) with up to 47 petabits/sec of non-blocking bisection bandwidth in a single fabric.
  • Generational performance leap: With up to 4x the bandwidth per accelerator (TPU 8t) over the previous generation, Virgo Network delivers the bandwidth you need to get the full power of every chip. 
  • Predictable low latency: Virgo Network delivers 40% lower unloaded fabric latency for TPUs compared to the previous generation, leading to more predictable performance for latency-sensitive AI workloads.
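As a rough sanity check on those headline figures (the derived number below is an estimate under a textbook definition, not a published specification), you can back out an implied per-accelerator bandwidth from the stated chip count and bisection bandwidth.

```python
# Back-of-the-envelope: implied per-chip bandwidth from the headline figures.
# Only the chip count and bisection bandwidth come from the post; the derived
# per-chip number is an estimate under a textbook model, not a stated spec.

chips = 134_000          # stated fabric scale
bisection_bps = 47e15    # stated: up to 47 petabits/sec, non-blocking

# For a non-blocking fabric, bisection ~= (chips * per_chip_bps) / 2.
per_chip_bps = 2 * bisection_bps / chips
print(f"Implied per-chip injection bandwidth: ~{per_chip_bps / 1e9:.0f} Gb/s")
# -> roughly 700 Gb/s per accelerator under this simplistic model.
```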

Improving reliability at scale

In a system supporting hundreds of thousands of chips, hardware failures are inevitable. Because a single faulty component can disrupt a synchronized training job, reliability at scale is a primary focus. To maximize workload goodput, we designed the Virgo Network architecture around fault isolation, deep observability, and the rapid mitigation of hangs and stragglers.

At this scale, system-wide resilience requires a solid network foundation. Virgo Network integrates independent switching planes that provide robust fault isolation, protecting cluster-wide goodput from being degraded by localized hardware failures.
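To make the benefit of independent planes concrete, here is a minimal sketch of graceful degradation when planes fail. The plane count and per-accelerator bandwidth are illustrative assumptions, and real recovery involves rerouting and repair rather than just this simple proportional loss.

```python
# Graceful degradation with independent switching planes (illustrative only).

def surviving_capacity(total_gbps: float, planes: int, failed: int) -> float:
    """Per-accelerator bandwidth remaining when `failed` of `planes`
    independent planes are down, assuming traffic is striped evenly."""
    if failed >= planes:
        return 0.0
    return total_gbps * (planes - failed) / planes

if __name__ == "__main__":
    # Example: 8 independent planes carrying 700 Gb/s total per accelerator.
    for failed in range(3):
        remaining = surviving_capacity(700, planes=8, failed=failed)
        print(f"{failed} plane(s) down -> ~{remaining:.0f} Gb/s still available")
```

The point is not the exact numbers but the failure mode: a plane failure shaves off a slice of bandwidth instead of severing connectivity for the accelerators behind it.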

https://storage.googleapis.com/gweb-cloudblog-publish/images/4_How_fail-stop_and_fail-slow_impact_MTTR.max-1600x1600.png

Figure 4: How fail-stop and fail-slow impact MTTR

Building on this foundation, we optimize the software and orchestration stack to maximize mean time between interruptions (MTBI) and minimize mean time to recovery (MTTR), focusing on two primary areas:

  • Observability: Reliability at scale requires high-fidelity visibility. We use sub-millisecond telemetry to monitor network systems. This deep visibility allows us to detect transient congestion, optimize buffer management, and pinpoint the root causes of slowdowns across the hardware and software stack.

  • Identifying stragglers and hangs: Proactive monitoring is critical for identifying nodes that are experiencing performance degradation (stragglers) or that have stopped responding completely (hangs). By rapidly localizing these bottlenecks with automated straggler detection and newly added hang detection, we keep the training job running at speed and protect it from localized slowdowns.
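As an illustration of the idea (not the production detector), the sketch below flags stragglers and hangs from per-node step-time telemetry: a node whose latest step time far exceeds the cluster median is a straggler candidate, and a node that has not reported within a timeout is a hang candidate. The thresholds and field names are assumptions.

```python
# Toy straggler/hang detection from per-node step-time telemetry.
# Thresholds, field names, and data are illustrative assumptions only.
import statistics
import time

STRAGGLER_FACTOR = 1.5   # step time > 1.5x cluster median -> straggler candidate
HANG_TIMEOUT_S = 30.0    # no report for 30 seconds -> hang candidate

def classify(step_times_s: dict[str, float],
             last_report_s: dict[str, float],
             now_s: float) -> dict[str, list[str]]:
    """Return {'stragglers': [...], 'hangs': [...]} given each node's most
    recent step duration and the wall-clock time of its last report."""
    median = statistics.median(step_times_s.values())
    stragglers = [n for n, t in step_times_s.items()
                  if t > STRAGGLER_FACTOR * median]
    hangs = [n for n, last in last_report_s.items()
             if now_s - last > HANG_TIMEOUT_S]
    return {"stragglers": stragglers, "hangs": hangs}

if __name__ == "__main__":
    now = time.time()
    steps = {"node-0": 1.02, "node-1": 0.98, "node-2": 1.01, "node-3": 1.90}
    last = {"node-0": now - 1, "node-1": now - 2,
            "node-2": now - 120, "node-3": now - 1}
    print(classify(steps, last, now))   # node-3 straggler, node-2 hang
```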

The foundation of the AI Hypercomputer

Virgo Network is a reimagined scale-out data center network custom-built for the stringent demands of modern AI workloads. This flat, multi-planar architecture unifies accelerators across pods into a single compute domain, addressing the bandwidth and scale limitations of traditional networks. By providing robust fault isolation directly at the hardware level, Virgo Network serves as the foundation for system-wide resilience, protecting synchronized workloads from localized hardware faults. 

Ultimately, Virgo Network delivers the scale, predictable latency, and reliability necessary to accelerate the agentic AI era. To learn more about how we are building infrastructure for the future of AI, visit our AI infrastructure solutions page, explore the technical documentation, or attend the dedicated breakout session at Google Cloud Next.
