
Microsoft Just Shipped Its Own AI Models. That's the Point.
I keep watching Microsoft do something that doesn't get talked about enough: they're quietly building their own AI model lineup while still writing checks to OpenAI. The latest move is three new models under the MAI brand, all live in Microsoft Foundry right now. And honestly, the models themselves matter less than why Microsoft felt the need to build them.
The three models
MAI-Transcribe-1 does speech-to-text across 25 languages. Microsoft says it ranks first on the FLEURS benchmark in 11 core languages and beats Whisper-large-v3 in the remaining 14, outperforming Gemini 3.1 Flash in 11 of those 14 as well. Batch transcription runs at 2.5x the speed of Microsoft's existing Azure Fast offering. Pricing starts at $0.36 per hour.
MAI-Voice-1 generates speech from text. Microsoft claims it can produce 60 seconds of audio in one second, and custom voice creation in Foundry works from just a few seconds of sample audio. That's aggressive. Pricing starts at $22 per 1M characters.
MAI-Image-2 landed as a top-3 model family on the Arena.ai leaderboard. Microsoft says it generates images at least 2x faster than previous offerings on Foundry and Copilot, with equivalent quality based on production traffic data. It's designed for natural lighting, accurate skin tones, and readable in-image text. Pricing: $5 per 1M tokens for text input, $33 per 1M tokens for image output.
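Those list prices make a back-of-envelope estimate straightforward. The sketch below uses only the per-unit rates quoted above; the workload sizes (hours of audio, characters of script, token counts per image) are illustrative assumptions of mine, not Microsoft figures.

```python
# Back-of-envelope cost estimate from the list prices quoted above.
# Workload sizes below are illustrative assumptions, not Microsoft data.

TRANSCRIBE_PER_HOUR = 0.36   # MAI-Transcribe-1: $0.36 per hour of audio
VOICE_PER_MCHAR = 22.0       # MAI-Voice-1: $22 per 1M characters
IMAGE_TEXT_PER_MTOK = 5.0    # MAI-Image-2: $5 per 1M text-input tokens
IMAGE_OUT_PER_MTOK = 33.0    # MAI-Image-2: $33 per 1M image-output tokens

def transcription_cost(hours: float) -> float:
    return hours * TRANSCRIBE_PER_HOUR

def voice_cost(characters: int) -> float:
    return characters / 1_000_000 * VOICE_PER_MCHAR

def image_cost(prompt_tokens: int, output_tokens: int) -> float:
    return (prompt_tokens / 1_000_000 * IMAGE_TEXT_PER_MTOK
            + output_tokens / 1_000_000 * IMAGE_OUT_PER_MTOK)

# Hypothetical monthly workload: 500 hours of call audio, 10M characters
# of narration, and 1,000 images at assumed per-image token counts.
total = (transcription_cost(500)
         + voice_cost(10_000_000)
         + image_cost(1_000 * 100, 1_000 * 4_000))
print(f"${total:.2f}")  # prints $532.50
```

At those assumed volumes, the whole multimodal workload lands in the hundreds of dollars per month, which is the point of the pricing section below: these rates are set for volume, not margin.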
WPP is already building with MAI-Image-2 at scale. Rob Reilly, their Global Chief Creative Officer, called it "a genuine game-changer" and said it respects "the sheer craft involved in generating real-world, campaign-ready images." That's a meaningful endorsement from a company that actually has to ship creative work to paying clients.

Why this matters more than the benchmarks suggest
Here's what I think most coverage is missing. The interesting story is not whether MAI-Transcribe-1 beats Whisper on 14 languages. It's that Microsoft is systematically reducing how much of its AI stack depends on OpenAI.
Think about it. Microsoft poured billions into OpenAI. They got GPT integration across their products. That partnership isn't going away. But if you're Satya Nadella, you don't want your entire AI future controlled by a company you don't own. You want options.
These three models cover speech recognition, voice synthesis, and image generation. That's a multimodal stack built in-house, running on Microsoft's own infrastructure, priced to compete with Amazon and Google. The models come from Microsoft's MAI Superintelligence team, and they're integrated into products that enterprises already pay for: Azure, Copilot, Bing, PowerPoint.
That last part is where the real leverage is. Google can build a great model. So can Amazon. But Microsoft can put a new model into PowerPoint and have it running inside Fortune 500 companies by Tuesday. Distribution is the moat, and Microsoft has more distribution than anyone in enterprise software.
The pricing tells a story
Look at the numbers again. $0.36 per hour for transcription. $22 per million characters for voice. These are not premium prices. Microsoft is pricing these to get volume, fast.
That's a familiar Microsoft playbook. Price aggressively, get adoption, build lock-in through integration. They did it with Teams. They did it with Azure. Now they're doing it with their own AI models inside Foundry.
For enterprises already on Azure, the switching cost to try these models is basically zero. They're already authenticated, already paying, already inside the ecosystem. That's a distribution advantage that no standalone AI lab can match, no matter how good their model is.

What I'm watching next
Microsoft is framing this under a "Humanist AI" banner with emphasis on red-teaming, guardrails, governance, and enterprise-grade controls. That's partly marketing, but it's also smart positioning for regulated industries where "we built it ourselves and we control the safety stack" is a better pitch than "we're reselling someone else's model."
Phased rollouts are already underway in Bing and PowerPoint. If MAI-Image-2 becomes the default image generation engine in PowerPoint, that's tens of millions of users generating images through a Microsoft-owned model instead of DALL-E.
I don't think any one of these three models is going to rewrite the competitive landscape on its own. The transcription model is good but incremental. The voice model has interesting speed claims that need real-world validation. The image model's Arena.ai ranking is promising but early.
What matters is the pattern. Microsoft is building a full-stack, multimodal AI capability in-house. They're pricing it to move. And they can distribute it through channels that no competitor can replicate. If you're evaluating enterprise AI platforms right now, that combination of capability, price, and reach is getting harder to ignore.