xAI's Voice API Might Kill My Three-Vendor Stack
I used to need Cartesia, Deepgram, and xAI together for a good voice agent. xAI's new Voice Agent API changes that architecture fast.
The first voice stack I really liked was good enough to keep and annoying enough to resent.
Deepgram handled listening. xAI handled the brain. Cartesia handled the speaking. It stayed on task, sounded good, and felt a lot more conversational than the robotic junk most people still call a voice agent.
The problem was not that the stack was bad. The problem was that it was three moving parts.
Every provider hop created another place for latency, streaming weirdness, interruption bugs, and prompt drift to sneak in. That was fine when xAI only sat in the middle as the LLM. It is a different conversation now that xAI has a real Voice Agent API, separate Text to Speech, and Speech to Text.
I do not think this means "always use one vendor now." I do think it means the default voice architecture deserves a reset.
The Old Stack Worked For A Reason
I still think the old split stack made sense.
When you are building realtime voice, you usually optimize three different jobs:
- speech recognition quality
- model behavior and tool use
- speech synthesis quality
That is why a lot of good voice agents ended up looking like a relay race between providers. One model listened better. Another reasoned better. Another sounded more human. If the end result felt good on a live call, you accepted the plumbing.
That is basically how I looked at it too. I was happy to stitch the pieces together because the experience justified it.
But the hidden tax is always in the seams. Interruptions get trickier. Turn-taking gets trickier. Logging gets messier. Tool timing gets messier. You can absolutely make it work, but you are spending real engineering time on glue code instead of on the part the caller actually feels.
What xAI Actually Shipped
The reason I am paying attention is not just "xAI does voice now." It is that the voice surface is pretty complete out of the gate.
As of April 19, 2026, the current xAI docs say the realtime Voice Agent API pricing is $0.05/min ($3.00/hr), billed by audio duration, with 100 concurrent sessions per team and a 30-minute max session duration. The realtime endpoint is wss://api.x.ai/v1/realtime.
The bigger deal is the feature set around it:
- built-in speech-to-speech over WebSocket
- server-side turn detection with server_vad
- built-in tools like web search, X search, collections search, and remote MCP tools
- custom function calling during live conversations
- ephemeral tokens for browser and mobile clients, so you do not have to leak your API key
- OpenAI Realtime API compatibility, which matters a lot if you already have clients or wrappers built around that shape
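To make that shape concrete, here is a minimal connection sketch. It targets the documented wss://api.x.ai/v1/realtime endpoint and assumes OpenAI Realtime-style events, which is what the compatibility point above implies; it also leans on the third-party websockets library, so treat it as a sketch rather than xAI's reference client.

```python
# Minimal connection sketch against the documented realtime endpoint.
# Assumes OpenAI Realtime-style events (per the compatibility point above)
# and uses the third-party `websockets` library; not xAI's reference client.
import asyncio
import json
import os

import websockets


async def main():
    headers = {"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"}
    async with websockets.connect(
        "wss://api.x.ai/v1/realtime",
        additional_headers=headers,  # `extra_headers` on older websockets versions
    ) as ws:
        # Configure the session: instructions plus server-side turn detection.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": "You are a concise, friendly voice agent.",
                "turn_detection": {"type": "server_vad"},
            },
        }))
        # A real client would stream microphone audio up and play the audio
        # deltas coming back; here we just log the event types the server sends.
        async for message in ws:
            print(json.loads(message).get("type"))


asyncio.run(main())
```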
The surrounding audio APIs matter too.
The current xAI pricing page lists Text to Speech at $4.20 / 1M characters, with five voices and support for mp3, wav, pcm, mulaw, and alaw. It lists Speech to Text at $0.10/hr for REST and $0.20/hr for streaming. The TTS docs also explicitly call out mulaw and alaw at 8000 Hz, which is exactly the kind of boring telephony detail that saves you pain later.
That last part matters more than the launch headline. Voice products usually stop feeling clean the moment the phone system gets involved.
Why The Architecture Changes
The old path looked something like this: caller audio streams into Deepgram for speech recognition, the transcript goes to xAI for the LLM turn, the reply goes to Cartesia for synthesis, and glue code in between handles interruptions, turn-taking, and timing at every hop.
The new path can be much simpler: one realtime WebSocket session against wss://api.x.ai/v1/realtime that listens, reasons, calls tools, and speaks in a single loop.
That does not automatically make it the best choice for every build. It does change where the complexity lives.
Before, a lot of the work sat in stitching providers together cleanly. Now, for some builds, the hard part moves back where it belongs:
- prompt and instruction quality
- tool design
- handoff rules
- business logic
- logging and evals
That is a much healthier place to spend time.
I also think this is why the new pricing is more interesting than it first looks. A flat realtime voice price is not just a billing detail. It changes how easy it is to reason about an early build. If I can estimate the live voice layer in minutes, then separately think about tool calls and whatever app logic I attach, I get to scope faster.
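To make that concrete, here is a back-of-envelope sketch using the prices above. The call volume and average duration are made-up assumptions, not benchmarks.

```python
# Back-of-envelope cost for the live voice layer only, using the $0.05/min
# price cited above. Call volume and duration are illustrative assumptions.
price_per_minute = 0.05     # $/min, xAI realtime Voice Agent API
avg_call_minutes = 6        # assumption
calls_per_month = 1_000     # assumption

per_call = price_per_minute * avg_call_minutes   # $0.30 per call
monthly = per_call * calls_per_month             # $300 per month
print(f"~${per_call:.2f} per call, ~${monthly:,.0f} per month for the voice layer")
```

Tool calls, app logic, and any standalone TTS or STT you bolt on are separate line items, but at least the live loop itself stays easy to reason about.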
If you want the longer pricing discussion, I already broke that part out in What Does a Voice AI Agent Actually Cost in 2026?.
Where The Agent Platforms Fit Now
I do not think xAI replaces agent platforms. I think it changes what you ask them to do.
xAI Native
If I want the shortest path to a browser or app-based voice agent, I would start here.
xAI already gives me the realtime voice session, browser-safe ephemeral tokens, built-in search tools, collections, MCP support, and custom functions. For a focused use case, that is enough to get a real voice agent live without assembling a small orchestra of vendors first.
Pipecat
I still really like Pipecat's Grok Realtime integration when I want more control over the conversation pipeline.
Pipecat makes sense when the interesting part is not just "talk naturally," but "follow a structured flow, hand off between roles, and keep the live voice loop predictable." That is especially true if I want to mix free-form conversation with more explicit step logic. In other words: voice agent on the surface, workflow engine underneath.
LiveKit
LiveKit Agents is the fit when transport is the real problem.
It already gives you a realtime agent framework, rooms, WebRTC, and telephony. LiveKit also now has an xAI Grok Voice Agent API plugin plus broader xAI integration docs. That is a strong setup if you want xAI for the model layer but still want LiveKit handling the realtime infrastructure and SIP side of the house.
Vapi
Vapi still makes sense when you care a lot about the phone system, assistant management, testing, and operational tooling.
The important nuance is that Vapi does not become irrelevant just because xAI can do speech-to-speech now. It becomes a different kind of layer. Vapi's docs say you can use any OpenAI-compatible endpoint, and xAI's REST API docs say the platform offers full compatibility with the OpenAI REST API. So the clean path there is usually: keep Vapi for telephony and orchestration, and route the model layer through xAI or through your own wrapper.
That is not the same as collapsing the whole voice loop into xAI. But it is still a useful way to keep your current agent platform while upgrading the reasoning layer.
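For illustration, this is the base-URL swap that OpenAI-compatible routing boils down to, whether it happens inside Vapi's custom endpoint settings or inside your own wrapper. It uses the official openai Python SDK; the model name is an assumption, so check xAI's current model list.

```python
# Sketch of the OpenAI-compatible routing described above: same SDK, same
# request shape, different base URL. The model name is an assumption.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.x.ai/v1",   # xAI's OpenAI-compatible REST surface
    api_key=os.environ["XAI_API_KEY"],
)

resp = client.chat.completions.create(
    model="grok-4",  # assumption: use whichever Grok model fits your agent
    messages=[{"role": "user", "content": "Summarize this caller's issue in one line."}],
)
print(resp.choices[0].message.content)
```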
How I'd Wire This Into OpenClaw
This part is the most interesting to me.
If I were wiring this into OpenClaw, I would not try to make OpenClaw "the voice model." I would treat xAI voice as the front door and OpenClaw as the company-specific action layer behind it.
That split feels right to me:
- xAI handles live listening, speaking, and turn timing
- the live voice session gets only the tools it needs right now
- OpenClaw handles the business context, task execution, and longer-running follow-up work
The cleanest pattern is probably either MCP or a thin function layer in front of OpenClaw. xAI's docs say Remote MCP Tools are supported in the Voice Agent API, which means you can keep the conversation fast while exposing only the business tools the caller should be allowed to trigger.
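As a sketch of that thin function layer: one narrow business tool exposed to the voice session, forwarding to OpenClaw. It assumes OpenAI Realtime-style function tool definitions, which the compatibility claim above suggests; the OpenClaw endpoint and the tool name are hypothetical.

```python
# Hypothetical sketch of one narrow business tool exposed to the live voice
# session. The tool schema follows the OpenAI Realtime-style function shape
# (suggested by xAI's compatibility claim); the OpenClaw URL and the tool
# name are made up for illustration.
import requests

CREATE_CALLBACK_TASK = {
    "type": "function",
    "name": "create_callback_task",
    "description": "Create a follow-up task when the work is bigger than the live call.",
    "parameters": {
        "type": "object",
        "properties": {
            "caller_name": {"type": "string"},
            "phone": {"type": "string"},
            "summary": {"type": "string", "description": "One-sentence problem summary"},
            "urgent": {"type": "boolean"},
        },
        "required": ["caller_name", "phone", "summary"],
    },
}


def handle_create_callback_task(args: dict) -> dict:
    """Runs when the voice session invokes the tool; hands the work to OpenClaw."""
    resp = requests.post(
        "https://openclaw.example.internal/api/tasks",  # hypothetical endpoint
        json=args,
        timeout=10,
    )
    resp.raise_for_status()
    return {"status": "created", "task_id": resp.json().get("id")}
```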
That matters because I do not want a live phone agent carrying the entire company brain in one giant prompt. I want it to stay fast, grounded, and narrow:
- answer the question if the answer is clearly available
- call a business tool if the tool is safe and relevant
- create a task for later if the work is bigger than the live conversation
- route to a human when the call stops being a good fit for automation
That is a much better OpenClaw story than "let's shove everything into the voice loop and hope it behaves."
One Example I'd Actually Ship
If I were building a first pass with this stack, I would not start with a fake all-knowing receptionist. I would start with one narrow job: after-hours intake that can actually move work forward.
For example:
- answer the phone or web call
- identify whether the issue is urgent
- answer one or two grounded questions from a collection or knowledge base
- collect name, phone number, address, and the short problem summary
- create a callback or dispatch task
- transfer or escalate if the issue crosses the urgent threshold
At a high level, the session config would look something like this (a minimal sketch assuming the OpenAI Realtime-style session shape implied by the compatibility point above; the voice name, instructions, and tool list are illustrative, not copied from xAI's docs):
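```python
# Minimal sketch of a session config for the after-hours intake agent.
# Assumes the OpenAI Realtime-style session shape implied by xAI's
# compatibility claim; the voice name, instructions, and tool names are
# illustrative assumptions, not copied from xAI's docs.
session_config = {
    "type": "session.update",
    "session": {
        "instructions": (
            "You are an after-hours intake agent. Decide whether the issue is urgent, "
            "answer at most one or two grounded questions, collect name, phone, address, "
            "and a short problem summary, then create a callback or dispatch task. "
            "Escalate to a human when the issue crosses the urgent threshold."
        ),
        "voice": "Ara",  # assumption: swap in one of the documented voices
        "turn_detection": {"type": "server_vad"},  # server-side turn detection
        "tools": [
            # create_callback_task (sketched earlier), plus hypothetical
            # transfer_to_human and knowledge-lookup tools, would be listed here
        ],
    },
}
```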
That is a real agent. Not because it sounds cool, but because it can complete one whole job.
It is also a good example of where the newer stack changes the work. The hard part is not "how do I make five providers stream nicely together?" The hard part is "what tools should this thing have, what should it never do, and what counts as a successful call?"
That is the right hard part.
I still think there will be cases where I keep the split stack. Best-of-breed audio is not going away. But xAI just made the all-in-one option much more serious than it was a few weeks ago. For browser voice agents, OpenClaw front doors, and narrower phone workflows, that is a big deal because the stack is finally small enough to let the real product questions show up.
And honestly, that is the kind of shift I want to keep writing about. Voice got more interesting. OpenClaw Employee got a more believable voice front end. The Voice Agent Pilot got easier to scope. And a lot of the new tools showing up right now, including browser-use style workflows, are getting better for the same reason: less glue code, more actual product work.
Written from home, still very happy to delete a provider hop when the product gets better because of it.
Work With Us
Want to build something like this?
We scope and ship practical AI for SMB teams — voice agents, custom assistants, and workflow automations that actually get used. Real starting prices, no bloated discovery phases.
Enjoyed this post?
Get more build logs and random thoughts delivered to your inbox. No spam, just builds.

