At least $1 trillion in revenue! Jensen Huang's speech ignites GTC as NVIDIA reasserts its grip on AI's high-stakes game (20,000-word transcript included)

Article | “Silicon Valley Watch” by Zheng Jun

The SAP Center in San Jose, Silicon Valley, is nearly full.

This arena, usually home to the NHL San Jose Sharks, has today become the annual “AI Mecca.” Developers, engineers, corporate buyers, and investors from 190 countries fill every seat, all eyes fixed on a familiar figure: the middle-aged man in a leather jacket.

NVIDIA CEO Jensen Huang opened with the words: "It all starts here." Over the next two hours, he set out to prove the weight of that statement, smiling as he compared today's event to the Super Bowl.

He projects that Blackwell, NVIDIA's new-generation AI accelerator architecture, together with the next-generation Rubin products, will generate at least $1 trillion in revenue by the end of 2027. The figure far exceeds the $500 billion in sales Huang forecast in October 2025, once again underscoring how rapidly the wave of AI infrastructure investment is expanding.

A Trillion-Dollar Order Book: Recalibrating the Demand Narrative

The most immediate number in the speech concerns orders. Huang estimates that by the end of next year, procurement orders for NVIDIA's Blackwell and Vera Rubin architectures will surpass $1 trillion, double the $500 billion NVIDIA expected last year.

NVIDIA had already raised its outlook once before: last month, CFO Colette Kress hinted on the earnings call that chip sales would exceed prior expectations. Today, Huang put a concrete number on that confidence.

That confidence rests on a tension. NVIDIA's latest earnings report shows data center revenue of $62.3 billion, up 75% year-over-year, yet the stock has not risen in tandem, retreating about 11% from its October 2025 high of $207. Capital markets doubt whether NVIDIA can sustain growth into 2027, and growth prospects directly drive the share price. Huang's trillion-dollar figure answers those doubts head-on.

Core Product: Vera Rubin Full Stack Debut

Vera Rubin is the undisputed star of the speech, though Huang did not officially announce it until an hour and a half in. The system was first revealed at a Washington, D.C. event last year, further details were shown at CES 2026 earlier this year, and today it was fully launched. Key highlights:

Vera Rubin NVL72 is the flagship model, equipped with 72 GPUs interconnected via NVLink 6, all cooled by liquid cooling. Huang emphasized: “All cables are gone” — replaced by modular trays, reducing installation from two hours with Blackwell to just five minutes. The system runs with 45°C hot water cooling. Huang called it the “engine of the supercharged AI era.”

Rubin Ultra expands to 144 GPUs in a single rack, using the new Kyber vertical chassis, with compute at the front and NVLink interconnect at the rear. Compared to the Hopper generation, the Rubin platform's inference throughput can theoretically reach 7 million tokens/sec, versus 2 million for the x86-plus-Hopper combination. Huang called this "the most important chart for the future of AI factories," and divided inference compute into four tiers, Free, High, Premium, and Ultra, priced by tokens/sec: "Tokens are the new commodities."

Vera CPU will be sold as a standalone product, creating an independent revenue stream for NVIDIA in the CPU market; NVIDIA expects this to grow into a multi-billion-dollar business. The first Vera Rubin system is already running on Microsoft Azure, with sampling progressing smoothly, a contrast to the early yield issues of the Blackwell generation.

Groq Acquisition: LPU Officially Integrated

Last Christmas Eve, NVIDIA completed the acquisition of Groq’s core assets for about $20 billion, bringing in founder Jonathan Ross and key team members. Today, Huang announced the product of this acquisition: Groq 3 LPU (Language Processing Unit).

Groq 3 is positioned as an inference accelerator for Vera Rubin, not a GPU replacement. Architecturally, large language model (LLM) inference involves two stages: compute-intensive prefill (processing the input prompt) and bandwidth-intensive decode (generating output tokens). NVIDIA's GPUs excel at high-throughput prefill, while Groq's LPU, with 22 TB/s of HBM4 memory bandwidth, is optimized for decode, running about 7 times faster than comparable GPUs. The two operate together in a Disaggregated Inference architecture: GPUs handle prefill, LPUs handle decode, all orchestrated by NVIDIA's Dynamo system.
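To make the division of labor concrete, here is a minimal, hypothetical sketch of the prefill/decode split in Python. The class and function names (GpuPrefillWorker, LpuDecodeWorker, serve) are illustrative inventions, not NVIDIA's Dynamo API; only the routing pattern itself comes from the description above.

```python
# Hypothetical sketch of disaggregated inference; these names are
# illustrative only and do not reflect NVIDIA's actual Dynamo API.
from dataclasses import dataclass

@dataclass
class KVCache:
    """Attention key/value cache: produced by prefill, consumed by decode."""
    blocks: list

class GpuPrefillWorker:
    """Compute-bound stage: processes the whole prompt in parallel."""
    def prefill(self, prompt_tokens: list) -> KVCache:
        # A real worker would run the model's forward pass here and
        # return the populated KV cache.
        return KVCache(blocks=[prompt_tokens])

class LpuDecodeWorker:
    """Bandwidth-bound stage: generates tokens one step at a time."""
    def decode(self, cache: KVCache, max_new_tokens: int) -> list:
        out = []
        for _ in range(max_new_tokens):
            out.append(0)  # placeholder for the sampled next token
        return out

def serve(prompt_tokens, prefill_worker, decode_worker, max_new_tokens=256):
    # The orchestrator (Dynamo, in NVIDIA's stack) routes each request:
    # prefill on a GPU worker, then hand the KV cache to an LPU worker.
    cache = prefill_worker.prefill(prompt_tokens)
    return decode_worker.decode(cache, max_new_tokens)

tokens = serve([101, 2023, 2003], GpuPrefillWorker(), LpuDecodeWorker())
```

The design point is that the two stages stress different resources, so pairing a compute-rich device with a bandwidth-rich one, rather than running both stages on the same chip, can raise utilization of each.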

NVIDIA has launched a dedicated LPX rack housing 256 Groq 3 LPUs, designed to sit alongside the Vera Rubin NVL72 rack and interconnected via custom Spectrum-X. Each Groq 3 LPU carries 500MB of on-chip memory, is manufactured by Samsung, and is expected to ship in Q3. Official figures show that deploying Vera Rubin NVL72 with the Groq 3 LPX yields 35x more tokens/sec per megawatt than Blackwell.

NVIDIA executives indicated that this architecture enables “thousands of tokens per second” low-latency inference for large language models — a tier previously dominated by specialized inference chips from Cerebras and SambaNova.

NVIDIA Makes Its Official Move Into AI Agents

Beyond hardware, Huang spent considerable time on NVIDIA's software positioning, focused on the hottest wave in AI: agents, including the recently viral open-source platform OpenClaw, which he praised as the most successful open-source project ever.

Huang likened OpenClaw to an operating system: “It’s the OS for agent computing, just like Windows made personal computers possible.” He even claimed “every company worldwide needs an OpenClaw strategy,” comparing it to the adoption of Linux or HTTP/HTML in the past.

NVIDIA released NemoClaw — an open-source enterprise reference stack for OpenClaw. Its core function is enterprise security: helping companies protect sensitive internal data during AI agent deployment, preventing leaks during autonomous operation. Microsoft’s security team announced a partnership with NVIDIA to develop real-time adaptive defenses based on Nemotron and NemoClaw.

Additionally, NVIDIA positions DGX Spark and DGX Station as local development and deployment platforms for enterprise AI agents, bringing NemoClaw capabilities to edge environments.

Roadmap: From Feynman to Space Data Centers

On the hardware roadmap, Huang for the first time outlined the next-gen Feynman architecture, planned for 2028. Feynman will include a new GPU, a new LPU (LP40), and a new CPU named Rosa (in honor of Rosalind Franklin), along with BlueField-5 DPU, CX10 NIC, and support for copper cabling and Co-Packaged Optics (CPO) via Kyber interconnect.

More surprisingly, Huang announced NVIDIA is developing a space version of Vera Rubin — Space-1, aiming to deploy AI data centers in orbit. He acknowledged radiation protection in space as a key challenge but said NVIDIA has begun R&D. This aligns with strategies from SpaceX, Google, Amazon, and others.

NVIDIA also released the DSX AI Factory reference design, integrated with the Omniverse DSX Blueprint, to help enterprises plan, simulate, and manage the full lifecycle of large-scale AI data centers. AWS announced an expanded partnership, committing to over 1 million NVIDIA GPUs, including Blackwell, Rubin, and Groq 3 LPUs, with deployment across global regions within the year.

Autonomous Vehicles and Robotics: Large-Scale Partner Expansion

Autonomous driving is the third major theme. Huang announced that NVIDIA Drive AV software is entering deployment with Uber: by 2028, Uber will deploy NVIDIA-supported autonomous fleets in 28 cities across four continents, starting in Los Angeles and San Francisco in 2027.

Meanwhile, automakers including BYD, Geely, Nissan, and Hyundai are developing L4 autonomous passenger cars on the Drive Hyperion platform, while Isuzu and Japanese firm Tier IV are building autonomous buses on NVIDIA AGX Thor chips. Huang declared: "The ChatGPT moment for autonomous vehicles has arrived."

In robotics, Disney's Olaf robot (from Frozen) joined Huang on stage for an interactive demo. Trained in NVIDIA's simulation environment, it showcased embodied "Physical AI" applications in entertainment.

Perhaps Patrick Moorhead of Moor Insights & Strategy summed it up best: NVIDIA is no longer just a chip company — it’s a platform.

In the first hour and a half, Huang emphasized platforms and infrastructure. He repeatedly stressed NVIDIA is no longer just a chipmaker, but an ecosystem platform and infrastructure enterprise. Today’s presentation shows NVIDIA’s strategic scope extends to training, inference, orchestration, software security, Physical AI, autonomous vehicles, robotics, and even space data centers.

More specifically, NVIDIA is building a moat on three levels: full-stack hardware (GPU + LPU + CPU + DPU + Network), software ecosystem (CUDA, NemoClaw, Dynamo, Omniverse), and industry applications (automotive, healthcare, industrial, entertainment). Among these, software is becoming an increasingly distinctive competitive advantage — the part most difficult for competitors like AMD to replicate.

The large-scale expansion of autonomous vehicle partnerships and the integration of the OpenClaw agent platform indicate that NVIDIA's growth will extend from data center hardware to broader AI application infrastructure. Huang's vision: AI will evolve from today's text-generation tools into autonomous systems capable of reasoning, planning, and executing tasks, powered by AI data centers that operate as "token factories," and NVIDIA aims to be the full-stack provider for those factories.

Stock and Analyst Reactions: Confidence Confirmed, Divergences Remain

On the day of the event, NVIDIA's stock closed up about 1.65%, rising from around $181 to approximately $183 on volume of 217 million shares, above the 177 million daily average, for a market cap of $4.45 trillion. In the short term at least, this GTC boosted market confidence.

Wedbush analyst Dan Ives responded most positively, calling Huang the "AI godfather" and describing this GTC as "the confidence boost tech investors desperately need," asserting that NVIDIA "sits atop the AI mountain." Ives reaffirmed that the AI revolution is accelerating, not slowing, with trillion-dollar demand coming from enterprises, governments, and AI-native companies in tandem. He estimates that each dollar spent on NVIDIA chips creates an 8-10x multiplier across downstream sectors such as software, cybersecurity, energy, and data centers.

C.J. Muse of Cantor Fitzgerald set a target price of $300 before the event, maintaining a buy rating, stating “we are at a critical point of rebuilding confidence”; he believes Huang’s message will reinforce NVIDIA’s positioning as a “full-system AI infrastructure company,” with clear demand visibility into 2027.

Deepwater partner Gene Munster was more cautious ahead of the event, arguing that the real challenge lies in longer-term concerns about slowing growth after 2027, a question closely tied to the broader narrative of whether AI capital spending has peaked.

In the past year, amid AI bubble fears and infrastructure investment surges, Huang’s speech injected a strong dose of optimism, depicting a broader AI ecosystem vision. NVIDIA’s strategic layout now extends across training, inference, orchestration, software security, Physical AI, autonomous driving, robotics, and space data centers, firmly holding a foundational position.

AI bubble? The middle-aged man in a leather jacket thinks this is just the beginning.

[Full transcript of the speech follows]

Welcome to GTC! I want to remind everyone, this is a technology conference. So many people queued early this morning — it’s great to see you all here. At GTC, we explore technology and platforms. NVIDIA has three major platforms; many might think we mainly discuss CUDA X, but systems are another platform, and now we have a new one called AI Factories. We’ll cover all these, but most importantly, we focus on the ecosystem.

Before starting, I want to thank the pre-show hosts Sarah Go and Alfred Lin, as well as NVIDIA’s first venture capital partner, Sequoia Capital’s Gavin Baker. As the first major institutional investor, they have deep expertise in technology, industry insights, and a broad ecosystem. Of course, I also want to thank all the VIP guests I personally invited, and all the sponsors here today. NVIDIA is a platform company, with technology, platforms, and a rich ecosystem. Today, we gather representatives from the trillion-dollar industry, with 450 sponsoring companies, over a thousand technical sessions, and 2,000 speakers. This conference covers every layer of the five-tier AI architecture—from infrastructure like land, power, and buildings, to chips, platforms, and models, culminating in applications that will propel the entire industry.

It all begins here. This year marks the 20th anniversary of CUDA. For two decades, we have dedicated ourselves to developing this architecture. This revolutionary invention lets developers write scalar code that executes in a single-instruction, multiple-thread (SIMT) model, making massively multi-threaded applications far easier to build than with SIMD. Recently, we added Tiles to help developers program Tensor Cores and the fundamental math structures of today's AI. Thousands of tools, compilers, frameworks, and libraries, and hundreds of thousands of open-source projects, are deeply integrated with CUDA. The hardest thing to replicate is the massive installed base.
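As a rough illustration of the SIMT idea, the sketch below writes scalar, per-element code that the GPU then executes across thousands of threads. It uses Numba's CUDA bindings rather than native CUDA C, purely as an illustrative assumption, and is not the keynote's own example; it requires an NVIDIA GPU to run.

```python
# SIMT in miniature: the kernel body is scalar code for one element;
# the hardware runs one thread per element. Requires an NVIDIA GPU
# and the numba package.
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    i = cuda.grid(1)          # this thread's global index
    if i < x.shape[0]:        # guard threads past the end of the array
        out[i] = a * x[i] + y[i]

n = 1 << 20
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)
out = np.zeros_like(x)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
saxpy[blocks, threads_per_block](2.0, x, y, out)  # Numba copies arrays to/from the GPU
```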

We spent 20 years building hundreds of millions of CUDA-enabled GPUs and computing systems worldwide, covering every cloud platform and hardware vendor, serving nearly every industry. The installed base of CUDA is the core driver of the flywheel effect. It attracts developers, who then create breakthroughs like deep learning algorithms. These breakthroughs open new markets and build ecosystems, attracting more companies and generating even larger installed bases. This flywheel is accelerating now — NVIDIA’s library downloads are surging at an incredible rate. This effect not only supports countless applications and breakthroughs but also extends the lifespan of infrastructure.

With so many applications running on NVIDIA CUDA, we support every stage of the AI lifecycle and every data platform, accelerating scientific solvers. The broad scope means that once NVIDIA GPUs are installed, they have a very long lifecycle. That’s why architectures shipped six years ago, like Ampere, still see rising cloud prices. High installed base, the flywheel effect, extensive developer coverage, and continuous software updates all lower costs. Accelerated computing greatly boosts application speed, and ongoing software cultivation and updates ensure users benefit from performance gains and cost reductions over time. Because of the vast installed base, our new optimizations benefit millions of compatible GPUs worldwide. Dynamic combinations expand NVIDIA’s influence, driving growth while reducing costs — that’s the core value of CUDA.

But our journey actually began 25 years ago with GeForce. GeForce is NVIDIA's greatest marketing success; many of you grew up with it. Long before you could afford one yourself, your parents paid for your NVIDIA experience, and eventually you became computer scientists and developers. GeForce built the NVIDIA of today and nurtured CUDA. Twenty-five years ago, we invented the world's first programmable accelerator, the pixel shader. Five years later, CUDA was born. Our biggest investment, one we poured the company's profits into, was bringing CUDA to every PC through GeForce. After 20 years and 13 generations, CUDA is everywhere. A decade ago, we launched RTX, a complete redesign for modern computer graphics that combined programmable shading with hardware ray tracing, believing AI would revolutionize graphics. GeForce brought CUDA to the world and helped many pioneers realize GPUs are ideal for accelerating deep learning, sparking the AI boom. Just as GeForce brought AI to the world, AI will now transform computer graphics.

Today, I will showcase the next-generation graphics technology — Neural Rendering, combining 3D graphics with AI, i.e., DLSS 5.0. We fuse controllable 3D graphics, structured data of virtual worlds, and probabilistic AI. Structured data is perfectly controlled; combined with generative AI, it creates stunning, controllable content. This fusion of structured info and generative AI will continuously impact various industries, with structured data as the foundation of trustworthy AI.

Next, we’ll explore structured data in detail. Everyone knows SQL, Spark, Pandas, Velox, and large platforms like Snowflake, Databricks, Amazon EMR, Azure Fabric, Google Cloud BigQuery — all handling dataframes. These dataframes are huge spreadsheets, holding the single source of truth for enterprise computing and business. Historically, we’ve accelerated structured data processing to run companies more efficiently at lower costs and higher frequency. In the future, AI will rapidly utilize these structured databases. Beyond that, most of the world’s information resides in unstructured generative databases: vector databases, PDFs, videos, speeches. 90% of data generated annually is unstructured. Until now, lacking simple indexing and understanding, these data couldn’t be queried or searched efficiently.

Now, we let AI solve this problem. Using multimodal perception and understanding, AI can read PDFs, grasp their meaning, and embed them into searchable, queryable larger structures. To do this, NVIDIA created two foundational libraries: cuDF for structured dataframes, and cuVS for vector storage and unstructured AI data. These platforms will become the most important in the future, deeply integrated into complex global data processing networks.
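As a concrete taste of what cuDF looks like to a developer, the sketch below runs a pandas-style aggregation on the GPU. The file and column names are hypothetical; the cudf calls (read_csv, groupby, sort_values) mirror pandas and exist in the RAPIDS cuDF library, though this is an illustrative sketch rather than an official example.

```python
# cuDF mirrors the pandas API, so dataframe code can move to the GPU
# with minimal changes. Requires a RAPIDS cuDF install and an NVIDIA
# GPU; "orders.csv" and its columns are hypothetical.
import cudf

df = cudf.read_csv("orders.csv")        # parsed and held in GPU memory
top_regions = (
    df.groupby("region")["revenue"]     # GPU-accelerated aggregation
      .sum()
      .sort_values(ascending=False)
)
print(top_regions.head())
```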

Today, we announce several key collaborations. IBM, the inventor of the domain-specific language SQL, is using cuDF to accelerate watsonx data processing. Sixty years ago, IBM launched System/360, opening the computing era; SQL and data warehouses became the backbone of modern enterprise computing. Today, IBM and NVIDIA are using GPU compute libraries to accelerate watsonx.data's SQL engine, redefining data processing for the AI era. Because current CPU-based data systems can't meet AI's demand for rapid access to massive datasets, enterprises must transform. For example, Nestlé makes thousands of supply chain decisions daily but, on CPUs, can refresh its global order-to-cash data marts only a few times a day. Running accelerated watsonx on NVIDIA GPUs, speed improved fivefold and costs dropped 83%.

Accelerated computing for AI is here. We’re speeding up cloud data processing and on-prem deployments. Leading system and storage vendors like Dell are integrating cuDF and cuVS into their AI data platforms. We’re working with Google Cloud to accelerate Vertex AI and BigQuery. In partnership with Snapchat, we reduced their compute costs by nearly 80%. Accelerated computing and data processing deliver speed, scale, and most importantly, cost advantages. Moore’s Law, which predicted performance doubling every few years, is now waning. Accelerated computing enables us to leap forward.

As an algorithm company with broad market reach and a huge installed base, NVIDIA continuously optimizes algorithms to lower costs and scale. We've built an accelerated computing platform with libraries like RTX, cuDF, cuVS, and more, integrated into global cloud services and OEMs. This model repeats on platforms like Google Cloud and Snapchat. We're proud of our work on JAX, XLA, and PyTorch; ours is the only accelerator that performs excellently across all these frameworks. Customers like Baseten, CrowdStrike, Puma, and Salesforce are not just clients but also developers.

We integrate NVIDIA tech into their products and bring them to the cloud. Our relationships with cloud providers are fundamentally about bringing them customers. Most cloud providers are eager to partner because we continuously supply acceleration. This year, I’m especially excited that we’ll bring OpenAI to AWS, which will drive huge cloud consumption and expand OpenAI’s compute capacity.

At AWS, we've accelerated EMR, SageMaker, and Bedrock. NVIDIA and AWS have deep integration; they are our first cloud partner. For Microsoft Azure, we built and installed the first NVIDIA A100 supercomputer, laying the groundwork for our successful partnership with OpenAI. Our collaboration with Azure has long included accelerating their cloud services and Bing Search, plus deep cooperation on AI Foundry. As AI expands globally, Azure Regions' collaboration becomes crucial. A key feature we provide is Confidential Computing: ensuring operators cannot access or view data and models. NVIDIA's GPU is the world's first to enable this, supporting secure deployment of valuable models like OpenAI and Anthropic across clouds and regions. This is thanks to our vital Confidential Computing tech.

In customer collaborations, Synopsys is a key partner, accelerating all of its EDA and CAE workflows on Microsoft Azure. We are Oracle's first supplier and their first AI customer. I'm proud to have introduced Oracle to the AI cloud concept and to have been their first customer, helping them soar. We've onboarded partners like Quark, Cohere, Fireworks, and OpenAI. CoreWeave is the world's first AI-native cloud, built to host GPUs for accelerated computing and AI cloud services, with a rapidly growing customer base.

I'm also a big fan of the Palantir and Dell platforms. Together, we've created a new AI platform, Palantir Ontology, capable of fully local, on-site deployment in any country or isolated (air-gapped) region. AI can be deployed almost anywhere. Without our Confidential Computing, end-to-end system building, and full-stack acceleration, none of this would be possible. These examples showcase our unique partnerships with global cloud providers, all present here today. I thank you all for your hard work.

NVIDIA is a vertically integrated yet openly collaborative company — a recurring theme. The reason is simple: accelerated computing isn’t just about chips or systems; it’s about application acceleration. If it’s just faster computers, that’s CPU work, but CPUs are no longer enough. The only way to achieve huge performance gains and cost reductions is through application or domain-specific acceleration — application-accelerated compute. Therefore, NVIDIA must develop a library for each vertical and industry.

As a vertically integrated computing company, we must deeply understand applications, domains, and algorithms. We also need to figure out how to deploy algorithms across data centers, clouds, on-premises, edge, or robots. From chips to systems, we achieve vertical integration. NVIDIA’s strength lies in our openness across these layers. We aim to combine our software, libraries, and technologies with partners’ to bring acceleration to everyone. This GTC exemplifies that philosophy.

Today, we have domain-specific libraries for various industries, solving key problems. For example, in financial services (the largest GTC audience), algorithmic trading is shifting from traditional machine learning relying on feature engineering to supercomputers analyzing vast data and discovering insights automatically — the deep learning and Transformer moment for finance. Healthcare is also experiencing a ChatGPT moment. We’re applying AI physics and biology to drug discovery, developing AI Agents for customer service and diagnostics.

In industry, we’re launching the largest expansion in history, building AI factories across sectors. Many chip and computer manufacturers are here today. In media and entertainment, real-time AI platforms support translation, broadcasting, live gaming, and video — most content will be AI-enhanced. In quantum computing, 35 companies are building next-gen quantum-GPU hybrid systems on our Holoscan platform. Retail and CPG are using NVIDIA-optimized supply chains, building agent-based shopping and customer service AI — a $35 trillion market.

In the $50 trillion manufacturing robotics sector, NVIDIA has been deeply involved for a decade, building foundational computers for robot systems and partnering with all major robot manufacturers. We showcased 110 robots at this event. The telecom industry, worth about $2 trillion, is on the brink of a complete overhaul: future base stations will be AI infrastructure platforms, running AI at the edge. Our Aerial (AI-RAN) platform is collaborating with Nokia, T-Mobile, and others.

All of this centers on our self-invented CUDA-X libraries, the core of NVIDIA’s identity as an algorithm company and what sets us apart. Algorithms let us deeply understand industries, transforming top computer science solutions into libraries. At this GTC, we’ll release many new libraries and models — these continually updated assets are our treasure, activating computing platforms and solving real problems. Examples include cuDNN, which sparked the AI explosion, cuOPT for decision optimization, cuLitho for lithography, cuDSS for sparse solvers, and Parabricks for genomics — thousands of CUDA-X libraries helping scientists and engineers make breakthroughs. What you see isn’t animation but full simulation based on physics solvers, AI physics models, and physical AI robots. Combining algorithm understanding with computing platforms, NVIDIA’s vertical integration and openness unlock new opportunities.

Today, besides traditional giants, a wave of AI-native startups like OpenAI and Anthropic has emerged. With the reinvention of compute, venture capital has poured a record $150 billion into startups. For the first time, these companies need vast compute and trillions of tokens — either generating tokens themselves or adding value to existing ones. Just as PC, internet, and mobile cloud gave rise to Google, Amazon, and Meta, we are at the dawn of a new platform shift, with new influential companies emerging.

The past two years’ explosion stems from three milestones. First, ChatGPT ushered in the generative AI era, capable of perception, understanding, translation, and original content creation. Second, generative compute has transformed how we compute — shifting from retrieval-based to generative, profoundly changing architecture and design. Third, the rise of reasoning AI, with models like O1 and O3, enables AI to reflect, think independently, decompose problems, and verify itself, making generative AI more trustworthy and fact-based. This reasoning capability greatly increases token usage in context and output, boosting compute demand. Subsequently, ClaudeCode, the first intelligent agent model, can read files, write code, compile, test, and iterate automatically, revolutionizing software engineering.

All our employees now use AI coding tools like ClaudeCode, Codex, and Cursor to assist their work. You no longer ask AI what to do; you give it context, and it creates, executes, and builds. AI has evolved from perception to generation to reasoning, and is now capable of highly productive work. As AI can finally perform productive work, market demand for NVIDIA GPUs has skyrocketed. Despite large shipments, demand continues to grow.

AI must now think, act, and read, which requires reasoning and logical inference. Every part of AI, when thinking, acting, or generating tokens, must reason. We are past pure training; we are in the inference era, and the inflection point has arrived. Over the past two years, the compute needed per task has grown about 10,000-fold while usage has grown about 100-fold; multiply the two, and total compute demand has grown roughly a million-fold. That million-fold growth in two years is the shared experience of startups, OpenAI, and Anthropic alike. More compute means more tokens, higher revenue, and smarter AI.

We are now in this positive feedback loop — the reasoning inflection point is here. Last year, I said that by 2026, the combined high-confidence demand and orders for Blackwell and Rubin would reach $500 billion. While this may seem modest compared to annual revenue records, I now tell you that by 2027, this number will be at least $1 trillion. In fact, we will face a compute shortage, with demand far exceeding this.

Over the past year, we've done a lot. 2025 was NVIDIA's inference year. We want to excel in both training and inference, across all AI stages. Infrastructure investments can scale long-term; NVIDIA's systems have long lifespans and low costs. Undoubtedly, NVIDIA's infrastructure is the world's most cost-effective AI foundation. Last year was all about inference, driving the inflection point. Meanwhile, open platforms like Meta's Llama, representing a third of the world's open-source AI models, chose NVIDIA. Open models are near the cutting edge and ubiquitous. NVIDIA is the only platform that runs across all languages and AI domains: biology, graphics, vision, speech, proteins, chemistry, robotics. Our architecture is universal from edge to cloud, making it the most cost-efficient and trusted platform.

Facing a trillion-dollar infrastructure scale, we must ensure high performance, cost-effectiveness, and long lifespan. You can confidently choose NVIDIA — whether in cloud, on-prem, or anywhere else, we support you. We are now a full AI compute platform, reflected in our business. 60% of our revenue comes from the top five hyperscale cloud providers, some for internal AI use. Internal workloads like recommendation and search are shifting from traditional methods to deep learning and large language models, migrating to NVIDIA’s powerful GPUs. Through partnerships with AI labs and a vast native ecosystem, we bring compute to the cloud, which is rapidly consumed. The other 40% covers regional clouds, sovereign clouds, enterprise, industrial, robotics, edge, and supercomputers. NVIDIA’s broad reach and diversity make AI resilient — it’s now a foundational technology and a new computing platform revolution.

Our mission is to keep advancing technology. Last year, the inference year, we undertook a major overhaul of the Hopper architecture. We decided to elevate the architecture to a new level, disaggregating compute and creating NVLink-72. Its design, manufacturing, and programming changed fundamentally. Grace Blackwell and NVLink-72 were huge bets; thanks go to all our partners' efforts. NVFP4 isn't just about precision; it represents a completely different type of Tensor Core and compute unit. We proved that inference can be done without loss of accuracy, with significant performance and energy-efficiency gains, and that it can also be used for training. Combining NVLink-72, NVFP4, Dynamo, TensorRT-LLM, and new algorithms, we invested billions in DGX Cloud supercomputers to optimize kernels and software stacks. People used to think inference was simple, but it is the ultimate challenge and the core revenue driver. The most comprehensive inference performance data shows that watts per token are critical. Data centers are power-limited; physics dictates that a 1 GW factory can't become 2 GW. So we must produce the maximum tokens within a fixed power budget, aiming for peak efficiency.
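The power-limit argument reduces to simple arithmetic: at a fixed power budget, token throughput is just power divided by energy per token. The efficiency figures below are invented for illustration; only the 1 GW budget comes from the talk.

```python
# Tokens/sec at a fixed power budget, for a range of hypothetical
# energy-per-token efficiencies. Only the 1 GW figure is from the talk.
POWER_W = 1e9  # 1 GW factory

for joules_per_token in (10.0, 2.0, 0.5):  # invented efficiency levels
    tps = POWER_W / joules_per_token
    print(f"{joules_per_token:>4} J/token -> {tps:,.0f} tokens/sec")
```

This is why watts per token, not raw FLOPS, is the metric the talk keeps returning to: halving energy per token doubles the factory's output without touching its power contract.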

Inference speed determines response time — the interactivity of a single inference. Faster inference means handling more context and tokens, reflecting AI’s intelligence and throughput. Smarter AI takes longer to think, reducing throughput. From now on, CEOs worldwide will see their business as token factories directly linked to revenue. Better watts-per-performance means higher throughput and more tokens per unit of power. NVIDIA leads globally in performance; Moore’s Law predicted 1.5x performance every few years, but we achieved a 35x leap.

Last year, I said Grace Blackwell and NVLink-72 improved watts per performance by 35x — no one believed it, some analysts even thought I was conservative, estimating up to 50x. This makes our per-token cost the lowest globally. If architecture is wrong, even free isn’t cheap, because building and amortizing a gigawatt factory costs $40 billion. The best systems are needed for optimal cost efficiency. Through extreme co-design, we integrate vertically and open horizontally, packaging all software and tech for global inference services.
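The claim that "even free isn't cheap" follows from amortization. A back-of-the-envelope sketch, with an assumed five-year write-off and an assumed fleet-wide output (both invented; only the $40 billion gigawatt-factory cost is from the talk):

```python
# Amortizing a $40B factory sets a hard floor on per-token cost,
# regardless of chip price. Horizon and throughput are assumptions.
CAPEX = 40e9            # dollars, from the talk
YEARS = 5               # assumed depreciation horizon
TOKENS_PER_SEC = 1e9    # assumed fleet-wide output

daily_capex = CAPEX / (YEARS * 365)            # ~$21.9M per day
tokens_per_day = TOKENS_PER_SEC * 86_400       # 8.64e13 tokens
floor = daily_capex / tokens_per_day * 1e6     # $ per million tokens
print(f"${daily_capex/1e6:.1f}M/day -> ${floor:.2f} per 1M tokens, capex alone")
```

Under these assumptions, capital cost alone contributes roughly $0.25 per million tokens before power or operations, so a less efficient architecture raises the floor no matter what the chips cost.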

For example, platforms like Fireworks and Together grow rapidly; their efficiency is everything. After software updates, with hardware unchanged, average speed jumped from about 700 tokens/sec to nearly 5,000 — a sevenfold increase. Data centers that stored files are now power-limited token factories. Inference is the new workload; tokens are the new commodity, and compute equals revenue. Future cloud and AI companies will think about their token factory efficiency — this intelligence will be token-enhanced.

Looking back over ten years: in 2016, we launched DGX-1, the world’s first deep learning computer, with 8 Pascal GPUs connected via NVLink delivering 170 TFLOPS. Later, with Volta, we introduced NVLink switches, running 16 GPUs as a giant GPU. As models grew, data centers needed to become single compute units, so Mellanox joined NVIDIA. In 2020, DGX A100 SuperPOD combined vertical and horizontal scaling. Then, the Hopper architecture, with FP8, launched the generative AI era, and Blackwell with NVLINK-72 redefined AI supercomputing, achieving 130 TB/s full bandwidth.

Today, the compute demand for intelligent agents is exponential. Vera Rubin, designed for AI agents, provides 3.6 exaflops and 260 TB/sec full bandwidth. Paired with Vera CPU racks, BlueField-4 storage, Spectrum-X switches, and Groq-3 LPX accelerators, it achieves 35x throughput increase per megawatt. This new platform, with seven chips and five racks, has increased compute by 40 million times in ten years.

Previously, I could lift a chip like Hopper; now, Vera Rubin is a large, integrated system. The key for intelligent agents is the reasoning process of large language models, which increasingly strains memory and storage. We reinvented storage systems accordingly. AI needs tools to run as fast as possible, so we built Vera CPU, optimized for single-thread performance, the world's only LPDDR5 data center CPU, with unmatched energy efficiency. It's designed to work with other rack components for agent processing. Vera Rubin is fully liquid-cooled, cable-free, with installation reduced from two days to two hours. It uses 45°C hot water cooling, drastically lowering cooling costs and energy use. It's the only sixth-generation vertically scalable switch system, with revolutionary co-packaged optics (CPO). Vera CPU as a standalone product is now a multi-billion-dollar business.

This four-rack system, with structured cabling, is highly efficient. The Rubin Ultra node further improves this, installed in a new Kyber rack, connecting 144 GPUs within a single NVLINK domain. The compute node is inserted vertically, no longer limited by copper cable length, connected via NVLINK switches, forming a massive computer. Ultimately, under power constraints, the throughput and token generation speed of AI factories will determine next year’s revenue — the most critical metric for AI factory future.

The vertical axis is throughput; the horizontal is token rate. As token generation accelerates and models grow, demand for tokens and context length surges. Input and output tokens are shifting from hundreds of thousands to millions. These factors will deeply influence future token commercialization and pricing.

Tokens are becoming a new commodity. Like all commodities, once mature and at a turning point, markets will segment. High-throughput but low-generation versions suit free tiers; mid-tier offers larger models, faster generation, and longer context windows, with tiered pricing. Cloud services already show tiered models from free to $3 or $6 per million tokens.

The industry strives to push capability boundaries: larger models are smarter, longer context means higher relevance, and faster generation enables better reasoning and iteration, creating smarter AI with higher service premiums. Future high-end models might cost $45 or even $150 per million tokens, supporting critical R&D or long-term complex tasks. But realistically, if a research team consumes 50 million tokens daily at $150 per million, that’s unaffordable. We believe tiered, segmented AI is the future. AI must start with value and practicality, iterating continuously, adopting multi-level models.
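Checking the arithmetic in that example (a sketch; the consumption and price figures are the ones quoted above):

```python
# Daily and annual cost of the keynote's example: 50 million tokens a
# day at a premium price of $150 per million tokens.
tokens_per_day = 50_000_000
price_per_million = 150.0

daily_cost = tokens_per_day / 1_000_000 * price_per_million
print(f"${daily_cost:,.0f} per day")           # $7,500 per day
print(f"${daily_cost * 365:,.0f} per year")    # ~$2.7M per year
```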

Looking at Hopper, everyone expected incremental performance improvements; Grace Blackwell's leap exceeded all expectations. In the free tier, a core area for monetization, it boosted throughput 35-fold. As in any industry, higher service tiers mean better quality and performance but lower capacity. We improved the base tier by 35x and introduced new tiers above it, a huge leap for Grace Blackwell over Hopper.

Next, Vera Rubin. Across all tiers, we achieved throughput jumps. Especially in the most valuable top tier, we increased throughput tenfold. Such performance leaps in top segments are extremely challenging. That’s the advantage of NVLink-72 and low-latency architecture. Through extreme co-design, we pushed the industry’s limits.

From a customer’s perspective, suppose a data center has 1 GW power. We need precise resource allocation: e.g., 25% for free, 25% for mid, 25% for high, 25% for premium tiers. Free for customer acquisition, top tiers for high-value clients, converting to revenue. Under the same power, Blackwell can generate over five times the revenue; Vera Rubin can also achieve fivefold growth. Customers should migrate early to Vera Rubin — it boosts throughput and reduces per-token costs.
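The allocation argument can be made concrete with a toy revenue model. Everything numeric below except the 1 GW budget and the four-way 25% split is invented for illustration, including all prices and per-megawatt throughputs:

```python
# Toy model: split a 1 GW factory evenly across four service tiers and
# sum the revenue. Tier names and the 25% split are from the talk; the
# price and throughput figures are invented.
POWER_MW = 1000.0
SHARE = 0.25  # equal power share per tier

# tier: (price in $ per 1M tokens, tokens/sec per MW) -- illustrative
tiers = {
    "free":    (0.0,   6_000_000),
    "high":    (3.0,   3_000_000),
    "premium": (45.0,    600_000),
    "ultra":   (150.0,   150_000),
}

daily_revenue = 0.0
for name, (price, tps_per_mw) in tiers.items():
    tokens_per_day = tps_per_mw * POWER_MW * SHARE * 86_400
    daily_revenue += tokens_per_day / 1e6 * price

print(f"${daily_revenue:,.0f} per day")
```

An architecture with higher tokens/sec per megawatt raises every tier's output under the same power cap, which is the sense in which Blackwell or Vera Rubin "generates five times the revenue" from an identical 1 GW site.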

But we aim higher. Ultra-high throughput requires massive FLOPS; ultra-low latency and high-frequency interaction depend on huge memory bandwidth. Physical chip area limits system design, making it hard to optimize both simultaneously. High throughput and low latency are fundamentally conflicting.

To break this physical barrier, we acquired Groq’s chip team and licensed their tech. We’ve integrated system architectures. Now, in the most valuable high-end tiers, performance is again boosted 35x. NVIDIA’s dominance in most AI workloads stems from understanding the importance of throughput. NVLink-72’s architecture is disruptive; it remains the optimal path even after integrating Groq tech.

However, if we extend demand scenarios — e.g., providing 1,000 tokens/sec instead of 400 — NVLink-72’s bandwidth limits will be challenged. That’s where Groq shines. Its tech surpasses current limits and even NVLink-72’s performance ceiling. Converting tech into revenue, Vera Rubin’s revenue is 5x Blackwell’s. For high-throughput workloads, I recommend 100% Vera Rubin deployment; for code-heavy or high-value token tasks, adding Groq makes sense. A balanced approach: 25% Groq nodes, 75% Vera Rubin. Deep integration can further push system performance.

Groq’s appeal lies in its deterministic dataflow architecture, relying on static compilation and precise scheduling, ensuring compute and data arrive synchronized. It’s tailored for AI inference, with massive SRAM and no dynamic scheduling. As demand for ultra-smart, high-speed tokens explodes, this integrated system’s value grows.

Within this system, two extremes exist: a Vera Rubin chip with 288GB memory; or, for massive models and context, stacks of Groq chips. Large memory needs once limited Groq’s mainstream adoption — until we devised a solution: Disaggregated Inference via Dynamo software.

We restructured the inference pipeline: Vera Rubin handles prefill, Groq handles decode. Attention mechanisms are on Vera Rubin; feedforward and token generation on Groq. They connect via Ethernet, with optimized transfer modes reducing latency by nearly half. On this robust hardware base, we run Dynamo OS, achieving 35x performance leap and unprecedented token inference levels. This is the new Vera Rubin with Groq tech.

Special thanks to Samsung, which manufactures Groq LP30 chips, now in mass production. An upgraded LPX version is expected in Q3.

Looking back, NVLink-72 architecture was complex, and early prototypes faced challenges; but Vera Rubin’s testing went smoothly. As Satya announced, the first Vera Rubin rack is live on Microsoft Azure. Our global supply chain is highly capable, producing thousands of these large systems weekly, enough to deploy gigawatt-scale AI factories monthly. We’re also mass-producing Vera Rubin racks.

Vera CPU also achieved unprecedented success. AI still relies heavily on CPU instructions for complex tasks like tool use. Vera CPU’s design matches this need perfectly. It integrates deeply with BlueField processors and CX9 NICs, connecting to BlueField-4 network stack. All major storage companies are integrating into our ecosystem. In the future, AI agents will read and write at massive scale, seamlessly supporting cuDF acceleration, cuVS, and large KV caches.

In just two years, we've shattered Moore's Law's linear growth with innovative…
