DX Today | No-Hype Podcast & News About AI & DX
The DX Today Podcast: Real Insights About AI and Digital Transformation
Tired of AI hype and transformation snake oil? This isn't another sales pitch disguised as expertise. Join a 30+ year tech veteran and Chief AI Officer who's built $1.2 billion in real solutions—and has the battle scars to prove it.
No vendor agenda. No sponsored content. Just unfiltered insights about what actually works in AI and digital transformation, what spectacularly fails, and why most "expert" advice misses the mark.
If you're looking for honest perspectives from someone who's been in the trenches since before "digital transformation" was a buzzword, you've found your show. Real problems, real solutions, real talk.
For executives, practitioners, and anyone who wants the truth about technology without the sales pitch.
NVIDIA’s Inference Inflection Point: The New Stack for Agentic AI
Today’s episode is a deep dive into NVIDIA’s GTC 2026 message that AI is entering an “inference inflection point” — where running models at scale (not just training them) becomes the main economic and operational battleground.
We break down what inference means in 2026, why agentic AI can dramatically increase inference demand, and how NVIDIA is positioning a full-stack “AI factory” approach across hardware, software, and security. We cover new platform roadmaps discussed at GTC, real-world implications for cloud providers and enterprises, and why production AI shifts priorities toward cost-per-task, latency, reliability, and capacity planning.
We also dig into the biggest risks: runaway spend from agent loops, reliability challenges in real products and physical AI, and the security shift from prompt-based guardrails to enforceable runtime policy for tools, network access, and data handling. Finally, we close with practical takeaways for teams moving from pilots to production.
Welcome to the DX Today podcast, where you get facts and no hype. I'm Mike.
SPEAKER_01: And I'm Alex. Today we're doing a deep dive into NVIDIA's big GTC 2026 message, the inference inflection point, and the stack NVIDIA is rolling out to run agentic AI in production, from new inference-focused hardware all the way up to a security runtime called OpenShell.
SPEAKER_02: If you've been following AI for the last couple of years, you've probably heard the phrase "training is expensive." But the story NVIDIA is pushing right now is that training was chapter one. Chapter two is inference: actually running models at scale, in real products, in real businesses, with real latency and cost requirements.
SPEAKER_01: And in NVIDIA's framing, inference isn't just answering a question. Inference is what happens when AI agents are doing work: retrieving data, reasoning, calling tools, making API requests, generating code, and repeating that loop over and over. It's not a one-off response. It's continuous execution.
SPEAKER_02: So here's our plan for this episode. We'll start with what the inference inflection point actually means. Then we'll hit the recent developments at GTC 2026: specific products, claims, timelines. Then we'll talk impact, what enterprises and cloud providers are building, and why. Then risks: cost blowups, security, and why agent guardrails are moving from prompts to infrastructure policy. And we'll close with practical takeaways.
SPEAKER_01: Let's define terms. Training is the heavy compute phase where you take a foundation model and optimize its weights. Inference is when that trained model is used to produce outputs (tokens, classifications, embeddings, actions) in response to inputs. The business pain point is that once AI goes mainstream, inference volume can dwarf training.
SPEAKER_02: Exactly. NVIDIA CEO Jensen Huang put some huge numbers on this at GTC. In NVIDIA's official conference updates, Huang said he now sees at least $1 trillion in revenue from 2025 through 2027. That's not a forecast everyone agrees with, but it signals where NVIDIA thinks demand is heading: AI infrastructure, especially for running models, not just training them.
SPEAKER_01: And the inference inflection point language matters because it implies a shift in what buyers optimize for. Training was about peak throughput and time to train. Inference is about cost per token, latency, reliability, and the ability to run many distinct workloads simultaneously. In other words, operational characteristics. So what did NVIDIA actually announce or highlight at GTC 2026 that's relevant to inference? One, a platform roadmap. In NVIDIA's live updates, Huang described Vera Rubin as the next full-stack computing platform: seven chips, five rack-scale systems, and one supercomputer built to support agentic AI.
SPEAKER_02: Two, they talked beyond Rubin to a future architecture named Feynman, plus a new CPU called Rosa, again from NVIDIA's own GTC coverage. Whether you love or hate the naming, the important part is the direction: NVIDIA is presenting itself as a vertically integrated stack across compute, networking, storage, and security for what they call an AI factory. And three, the inference battleground. Reuters previewed Huang's keynote by noting NVIDIA was expected to discuss inference and reference Groq, an inference specialist, as a major theme, and that NVIDIA was likely to introduce a next-generation chip called Feynman and talk through the broader stack: CUDA, AI agents, and robotics.
SPEAKER_01: We also saw reporting that NVIDIA's inference push includes Groq-derived technology. Business Insider wrote that Huang debuted an inference system incorporating Groq tech, claimed up to 35 times faster inference for some workloads, and said it would ship in the second half of 2026. That 35-times number should be treated carefully, since it's probably benchmark dependent, but it shows the focus: acceleration and cost efficiency.
SPEAKER_02: Now, something I want to underline: inference isn't just hardware; it's also software orchestration and security. That's where NVIDIA's OpenShell and NeMo Claw come in.
SPEAKER_01: Right. NVIDIA published a technical blog post explaining OpenShell. The short version: OpenShell is a runtime that sits between an AI agent and the infrastructure it runs on. It's designed for long-running, self-evolving agents (they use the term "claws") that have memory, can install skills, spawn sub-agents, and keep executing.
SPEAKER_02: And the security angle is basically that prompt-based guardrails aren't enough once you give an agent tools and credentials. NVIDIA's OpenShell approach is out-of-process policy enforcement. In their words, it's like the browser tab model for agents: sessions are isolated, and permissions are verified by the runtime before any action executes.
SPEAKER_01: Let's make that concrete. In NVIDIA's OpenShell post, they describe three main components: a sandbox for isolated execution, a policy engine that enforces constraints across the file system, network, and process layers, and a privacy router that decides when to use local models versus route to frontier models based on policy.
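To make the privacy-router idea concrete in code, here's a minimal sketch of the routing decision. This is not NVIDIA's actual OpenShell API; the class name, sensitivity labels, and endpoint names are all illustrative assumptions.

```python
# Hypothetical privacy router: send a request to a local model unless
# policy explicitly allows routing its data to a hosted frontier model.
# Illustrative only; not NVIDIA's actual OpenShell interface.
from dataclasses import dataclass

@dataclass
class Request:
    text: str
    sensitivity: str  # e.g. "public", "internal", "restricted"

class PrivacyRouter:
    def __init__(self, frontier_ok=frozenset({"public"})):
        self.frontier_ok = frontier_ok

    def route(self, req: Request) -> str:
        # Policy picks the destination; the agent never chooses for itself.
        if req.sensitivity in self.frontier_ok:
            return "frontier-model-endpoint"  # hosted model
        return "local-model-endpoint"         # data stays on-prem

router = PrivacyRouter()
print(router.route(Request("patient notes", "restricted")))  # local-model-endpoint
```

The key design point is that the routing decision lives in the runtime, outside the agent, so a compromised agent can't opt its own traffic into a frontier model.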
SPEAKER_02: This is a big deal because it changes the trust model. Instead of trusting the agent to behave, you treat the agent as potentially compromised and enforce rules outside the agent's process. That's the same mental model we use with untrusted code: least privilege, isolation, auditing.
SPEAKER_01: And it lines up with broader security thinking. The Cloud Security Alliance wrote this week that conversational guardrails protect input and output, but once AI systems execute actions, enterprises need a control layer that sits between decisions and executions: policy enforcement, access validation, logging, auditability. That's governance, not just guardrails.
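As a minimal sketch of that control-layer pattern, here's one way to interpose validation and audit logging between an agent's decision and its execution. All names are ours, not from NVIDIA, Cisco, or the CSA.

```python
# Minimal control layer: every agent-proposed action is validated and
# logged before it executes. Illustrative sketch, not a vendor API.
import time

AUDIT_LOG = []  # in production this would be durable, append-only storage

def execute_with_governance(agent_id, action, allowed_actions, handler):
    decision = "allow" if action["type"] in allowed_actions else "deny"
    AUDIT_LOG.append({  # auditability: who, what, when, and the verdict
        "ts": time.time(), "agent": agent_id,
        "action": action, "decision": decision,
    })
    if decision == "deny":
        raise PermissionError(f"{action['type']} not permitted for {agent_id}")
    return handler(action)

result = execute_with_governance(
    "agent-42",
    {"type": "read_file", "path": "/tmp/report.txt"},
    allowed_actions={"read_file"},
    handler=lambda a: f"read {a['path']}",
)
print(result)  # read /tmp/report.txt
```

The point is that the allow/deny decision and the log entry happen outside the agent, so the audit trail exists even when the agent misbehaves.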
SPEAKER_02: Okay, so we've got the theme. Inference demand is ramping, and NVIDIA is positioning hardware and software for AI factories and agentic workloads. Now let's talk real-world impact: who's adopting what, and where does this show up outside a keynote?
SPEAKER_01: Start with cloud scale. In NVIDIA's GTC updates, they say AWS will deploy NVIDIA infrastructure, including more than one million NVIDIA GPUs, starting this year across AWS regions. Even allowing for marketing spin, that's a signal that hyperscalers are preparing for sustained inference demand, not a temporary spike.
SPEAKER_02: And beyond hyperscalers, NVIDIA is talking about AI factories as a standard enterprise pattern: pre-validated designs and reference architectures, an end-to-end stack where you can build once and scale everywhere. They're essentially trying to standardize how you buy and operate AI compute, the way companies standardized on virtualization and Kubernetes.
SPEAKER_01: A very practical example is the physical AI angle. NVIDIA announced the Physical AI Data Factory Blueprint, a reference architecture to generate, augment, and evaluate training data for robotics, vision AI agents, and autonomous vehicles. They named early users: Field AI, Hexagon Robotics, Linker Vision, Milestone Systems, Skild AI, Uber, and Teradyne Robotics.
SPEAKER_02: That's interesting because it's not only about inference in data centers. Physical AI systems have a dual loop: you need data generation and training, but at the edge you need inference that's low latency and reliable. Robots don't get to buffer a response for 12 seconds.
SPEAKER_01: And data is the bottleneck. In that blueprint release, NVIDIA quotes Rev Lebaredian saying, "Physical AI is the next frontier of the AI revolution, where success depends on the ability to generate massive amounts of data." And he adds that in this new era, compute is data. That's a strong claim, but it captures a real shift: synthetic and simulated data pipelines are becoming core infrastructure.
SPEAKER_02: Let's talk numbers and specifics that executives can use. NVIDIA's GTC post includes some crisp specs for their deskside systems: a DGX Station with 784 gigabytes of coherent memory, up to 20 petaflops of FP4 performance, and the ability to run open models up to 1 trillion parameters locally. That's about bringing inference and agent development back to the desk, not only the cloud.
SPEAKER_01: And they also highlight industry examples, like Caterpillar building an in-cabin conversational assistant, and Johnson & Johnson adopting IGX Thor to bring real-time AI inference into the operating room. Whether those are pilots or production varies, but it shows where inference is going: safety-critical environments.
SPEAKER_02: Now a reality check. When companies move from AI demo day to production inference, they often hit three walls: cost, latency, and security. Let's take those one at a time.
SPEAKER_01: On cost, with agentic workloads, the number of model calls can explode. An agent that writes code might make dozens or hundreds of tool calls and LLM calls per task. Multiply that across thousands of employees, and you get runaway spend. That's part of why the industry is fixated on cost per token and specialized inference acceleration.
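To see how fast that compounds, here's back-of-envelope math. Every number below is a made-up assumption; plug in your own measurements.

```python
# Back-of-envelope agent cost model; all inputs are assumptions.
calls_per_task = 60          # LLM + tool calls per agent task
tokens_per_call = 2_000      # prompt + completion tokens
cost_per_1k_tokens = 0.002   # dollars, blended rate
tasks_per_employee_day = 20
employees = 5_000
workdays_per_month = 21

cost_per_task = calls_per_task * tokens_per_call / 1_000 * cost_per_1k_tokens
monthly = cost_per_task * tasks_per_employee_day * employees * workdays_per_month
print(f"cost per task: ${cost_per_task:.2f}")  # $0.24
print(f"monthly spend: ${monthly:,.0f}")       # $504,000
```

Even at a modest-looking $0.24 per task, the org-wide number lands north of half a million dollars a month, which is why per-task measurement matters.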
SPEAKER_02: And cost isn't just the model provider; it's infrastructure utilization. Inference is bursty. You might need huge capacity at 9 a.m. and far less at 2 a.m. So you care about multi-tenancy, scheduling, and resource isolation, because you want to consolidate workloads without creating cross-tenant risk.
SPEAKER_01: On latency and reliability, user-facing copilots have one set of expectations, but autonomous agents and physical AI have stricter requirements. If your system is calling internal APIs, you need predictable response times, retries, and circuit breakers. If your robot is moving through a warehouse, you need low-latency perception and planning.
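Here's the classic circuit-breaker shape in miniature. It's the generic pattern, not any specific vendor's library.

```python
# Minimal circuit breaker: after repeated failures, fail fast for a
# cooldown period instead of hammering a broken dependency. Generic sketch.
import time

class CircuitBreaker:
    def __init__(self, max_failures=3, cooldown_s=30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.cooldown_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cooldown over, try again
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()  # trip the breaker
            raise
        self.failures = 0  # success resets the count
        return result
```

Wrapping every internal API call an agent makes in something like this keeps one flaky dependency from stalling an entire agent loop.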
SPEAKER_02: This is where NVIDIA's full-stack pitch is designed to be attractive: hardware plus networking plus software plus orchestration. But enterprises should remember that a vertically integrated stack can simplify operations, yet it also increases vendor concentration risk.
SPEAKER_01: And on security, the risk model is changing fast. NVIDIA's OpenShell post makes the point that a stateless chatbot has a smaller attack surface than an agent with persistent shell access, live credentials, and the ability to install skills and spawn sub-agents. In that world, every prompt injection can turn into a credential leak or unauthorized action.
SPEAKER_02: So the security question becomes: how do you enforce least privilege for AI agents? How do you keep an agent from exfiltrating data, installing an untrusted package, or hitting an unauthorized endpoint? OpenShell's answer is deny-by-default plus runtime enforcement: file system, network, and process constraints evaluated outside the agent.
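A deny-by-default check, sketched in the abstract: the policy shape and layer names below are our assumptions, not OpenShell's real schema.

```python
# Deny-by-default runtime policy across three layers; anything not on an
# explicit allow list is refused. Hypothetical schema, not OpenShell's.
POLICY = {
    "filesystem": {"/workspace", "/tmp"},        # writable path prefixes
    "network":    {"api.internal.example.com"},  # allowed egress hosts
    "process":    {"python", "git"},             # spawnable executables
}

def is_allowed(layer: str, target: str) -> bool:
    allowed = POLICY.get(layer, set())  # unknown layer -> nothing allowed
    if layer == "filesystem":
        return any(target.startswith(prefix) for prefix in allowed)
    return target in allowed

assert is_allowed("filesystem", "/workspace/out.txt")
assert not is_allowed("network", "exfil.example.net")  # denied by default
assert not is_allowed("process", "curl")               # not on the list
```

The crucial property is that these checks run in the enforcement layer, evaluated before the file write, network request, or process spawn happens, not in the agent's prompt.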
SPEAKER_01: There's also supply chain and ecosystem risk. If you're using third-party skills or plugins for agents, you're effectively importing executable code. NVIDIA explicitly warns that a third-party skill an agent installs can be an unreviewed binary with file system access. That is basically the plug-in problem from browsers and IDEs, now applied to autonomous systems.
SPEAKER_02: And this is why we're seeing security vendors partner with agent runtime stacks. Cisco, for example, wrote about integrating NVIDIA OpenShell with Cisco AI Defense, highlighting sandbox containment, deny-by-default access, per-endpoint network policy, and privacy routing. Again, you don't have to buy that exact combo, but the pattern is clear: enterprises want a control plane for agents.
SPEAKER_00: Let's also talk about governance and compliance. As agentic AI starts touching systems of record (HR, finance, customer data), auditors are going to ask: what permissions did the agent have? What actions did it take? Can you reproduce the sequence? Do you have an audit log? Traditional application security has answers. Agent security needs the same fundamentals.
SPEAKER_02: And inference infrastructure itself becomes a security boundary. Multi-tenant acceleration is efficient, but it raises questions about isolation. Even if you're not worried about side channels day to day, the operational reality is that you need strong tenant boundaries, monitoring, and incident response around your AI stack, because model inputs can contain sensitive data.
SPEAKER_01: So, what should listeners do with all this? Let's give practical takeaways. Takeaway one: if your AI roadmap stops at "we picked a model," you're behind. The winning teams will treat inference like a production platform: capacity planning, latency budgets, cost controls, and reliability engineering. Takeaway two: assume agentic workloads will increase call volume dramatically. Build measurement early: cost per task, tokens per task, tool call counts, and failure rates. If you can't measure, you can't optimize.
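Instrumenting those metrics can start very simply. Here's an illustrative per-task tracker; the field names are ours, and in practice you'd feed this into whatever observability stack you already run.

```python
# Tiny per-task metrics tracker for agent workloads. Illustrative sketch.
from collections import defaultdict

class TaskMetrics:
    def __init__(self):
        self.tasks = defaultdict(lambda: {
            "tokens": 0, "tool_calls": 0, "cost_usd": 0.0, "failed": False,
        })

    def record_call(self, task_id, tokens, cost_usd, is_tool_call=False):
        t = self.tasks[task_id]
        t["tokens"] += tokens
        t["cost_usd"] += cost_usd
        t["tool_calls"] += int(is_tool_call)

    def mark_failed(self, task_id):
        self.tasks[task_id]["failed"] = True

    def summary(self):
        n = max(len(self.tasks), 1)
        vals = self.tasks.values()
        return {
            "avg_tokens_per_task": sum(t["tokens"] for t in vals) / n,
            "avg_cost_per_task": sum(t["cost_usd"] for t in vals) / n,
            "failure_rate": sum(t["failed"] for t in vals) / n,
        }
```

Once you can see cost per task and failure rate per task, optimization stops being guesswork.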
SPEAKER_02: Takeaway three: invest in an agent control layer. Whether it's OpenShell or something else, you need an enforceable policy boundary outside the model: deny-by-default permissions, network allow lists, secrets management, and audit logs of every action.
SPEAKER_01: Takeaway four: treat skills and plugins as software supply chain risk. Require code review, signing, or vetted registries. Don't let agents install arbitrary packages in production environments.
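One lightweight version of that vetting gate, as a sketch: the registry of reviewed digests is hypothetical, and real setups might use signed packages or an internal artifact store instead.

```python
# Gate skill/plugin installs on a registry of reviewed content digests.
# Hypothetical registry; substitute your real artifact-vetting process.
import hashlib

VETTED_SKILLS = {
    # skill name -> SHA-256 digest recorded when the code was reviewed
    "summarize-pdf": "placeholder-digest-recorded-at-review-time",
}

def install_skill(name: str, artifact: bytes) -> None:
    digest = hashlib.sha256(artifact).hexdigest()
    if VETTED_SKILLS.get(name) != digest:
        raise PermissionError(f"skill {name!r} not vetted; refusing install")
    # ...proceed with installation only after the digest matches...
```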
SPEAKER_02: Takeaway five: decide where inference should run. Some workloads belong in the cloud for elasticity. Others may belong on-prem or at the edge for privacy, latency, or regulation. NVIDIA's push for deskside and local systems is a reminder that hybrid isn't going away.
SPEAKER_01: And if you're a technical leader listening, the meta trend is standardization. The AI ecosystem is moving toward reference designs, validated stacks, and repeatable patterns (AI factories, data factories, agent runtimes) so teams can get from pilot to production faster without reinventing everything.
SPEAKER_02: Bottom line: the hot topic this week isn't just new chips. It's the shift of AI into continuous execution, agents in real products, which turns inference into the core economic and security battleground.
SPEAKER_01: Thank you for joining us today for the DX Today podcast. Stay curious.