
The setup first, because that is where this one went sideways. I took the exact prompt behind our landing-page brand-voice demo, the one that reads a few of your posts and writes a new one in your voice, and ran it through three models: Claude's Sonnet 4.6 (what we run in production), DeepSeek, and GLM. Six different voice samples, each one twice, every anonymized output scored by a blind judge. The question had real money behind it: could a cheaper model do this job as well as Sonnet, or better?
I ran it three ways, and the three runs disagreed so completely that "which model" turned out to be the wrong question. Round by round.
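For anyone wanting to reproduce the setup, here is a minimal sketch of how the blind batch could be assembled: three models, six voice samples, two runs each, shuffled and given opaque IDs so the judge never sees which model wrote what. The model labels, sample names, and the `build_blind_batch` helper are all assumptions for illustration, not the harness actually used.

```python
import random
from itertools import product

# Hypothetical labels; the real model strings and sample names are not given in the post.
MODELS = ["sonnet-4.6", "deepseek", "glm"]
SAMPLES = [f"voice_{i}" for i in range(1, 7)]  # six voice samples
RUNS = [1, 2]                                  # each sample is generated twice


def build_blind_batch(seed: int = 0):
    """Return (blind_ids, key): shuffled opaque IDs plus a private ID -> job mapping."""
    jobs = [
        {"model": m, "sample": s, "run": r}
        for m, s, r in product(MODELS, SAMPLES, RUNS)
    ]
    rng = random.Random(seed)  # seeded so the shuffle is reproducible
    rng.shuffle(jobs)
    key = {f"out_{i:02d}": job for i, job in enumerate(jobs)}
    return list(key), key


blind_ids, key = build_blind_batch()
print(len(blind_ids), "outputs to judge")
```

The judge only ever receives the `out_NN` ID and the generated text; the `key` mapping stays private until the scores are in.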
Round 1: swap the model string
This is the test most people actually run. Same settings as production, just a different model name.
| Model | Valid JSON | Latency | Cost/gen |
|---|---|---|---|
| Sonnet 4.6 | 12/12 | 7.9s | $0.0079 |
| DeepSeek | 4/12 | 22.7s | $0.0013 |
| GLM | 0/12 | 26.7s | $0.0041 |
GLM produced zero usable outputs. DeepSeek produced four. Not because they are bad models, but because both are reasoning-first: under the production token budget they think out loud ("1. Let me analyze the posts...") and run out of room before they emit the JSON the page needs. The naive swap doesn't hand you a worse demo. It hands you a broken one. Had I settled for a two-output spot check, I could easily have missed that and learned it from users instead.
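The "Valid JSON" column is the kind of check that is easy to automate. A minimal sketch, assuming the page expects a single JSON object and nothing else; the `post` key and the sample strings below are made up for illustration:

```python
import json


def is_valid_json_object(raw: str) -> bool:
    """True only if the whole string parses as one JSON object."""
    try:
        return isinstance(json.loads(raw), dict)
    except json.JSONDecodeError:
        return False


def count_valid(outputs: list[str]) -> int:
    """Count how many raw model outputs the page could actually use."""
    return sum(is_valid_json_object(o) for o in outputs)


# Hypothetical outputs: one clean, one pure reasoning, one cut off mid-JSON.
outputs = [
    '{"post": "Shipping day."}',
    "1. Let me analyze the posts...",
    '1. Let me analyze the posts... {"post": "Shipp',
]
print(f"{count_valid(outputs)}/{len(outputs)} valid")
```

Running this over all twelve outputs per model, rather than two, is what separates "4/12" from "looked fine when I tried it".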
Round 2: give them room
Same models, but with JSON mode on and a bigger token budget, so the reasoning models can finish their thought.
| Model | Valid JSON | Latency | Judge score |
|---|---|---|---|
| Sonnet 4.6 | 12/12 | 7.7s | 3.37 |
| DeepSeek | 11/12 | 43s | 4.33 |
| GLM | 12/12 | 44s | 4.73 |
The picture flips. Given room to finish, both challengers emit valid JSON (DeepSeek missed once), and the judge rates their voice match well above Sonnet's. So the models themselves were never the problem; the first test was. But look at the latency. Forty-plus seconds. Nobody waits 44 seconds for a landing-page demo. Great output, unusable feature.
Round 3: the version you'd actually ship
JSON mode on, reasoning turned off, DeepSeek on its fast endpoint.
| Model | Valid JSON | Latency | Cost/gen | Judge score |
|---|---|---|---|---|
| Sonnet 4.6 | 12/12 | 8.7s | $0.0079 | 3.17 |
| DeepSeek (fast) | 12/12 | 2.8s | $0.00019 | 4.10 |
| GLM (no-think) | 12/12 | 9.0s | $0.0017 | 4.77 |
With the right plumbing, both become genuinely viable. DeepSeek's fast endpoint came back in 2.8 seconds, about three times quicker than Sonnet, at roughly one-fortieth of the cost, with valid JSON every single time. GLM matched Sonnet's speed at about a fifth of the cost and took the top judge score. The samples backed the numbers. GLM nailed a creator's sign-off ("currently romanticizing my overpriced matcha, and I will not be taking questions"), and the bilingual samples came back correctly in Italian across all three models.
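The multiples are worth deriving from the Round 3 table rather than eyeballing. A small sketch that divides Sonnet's figures by each challenger's; the numbers are copied from the table, the dictionary layout and function name are my own:

```python
# Round 3 figures, copied from the table above.
ROUND3 = {
    "Sonnet 4.6": {"latency_s": 8.7, "cost_usd": 0.0079},
    "DeepSeek (fast)": {"latency_s": 2.8, "cost_usd": 0.00019},
    "GLM (no-think)": {"latency_s": 9.0, "cost_usd": 0.0017},
}
BASELINE = "Sonnet 4.6"


def multiples_vs_baseline(model: str) -> tuple[float, float]:
    """Return (times cheaper, times faster) relative to the baseline model."""
    base, row = ROUND3[BASELINE], ROUND3[model]
    return base["cost_usd"] / row["cost_usd"], base["latency_s"] / row["latency_s"]


for name in ROUND3:
    if name != BASELINE:
        cheaper, faster = multiples_vs_baseline(name)
        print(f"{name}: {cheaper:.1f}x cheaper, {faster:.2f}x faster")
```

On these figures DeepSeek comes out at about 42x cheaper and 3.1x faster, and GLM at about 4.6x cheaper and marginally slower.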
What I actually learned
The model was the least interesting variable. The two challengers went from broken to better-than-the-incumbent without changing the model at all, only the plumbing around it. Token budget, JSON mode, and whether reasoning was on or off mattered more than which lab trained the thing. The whole exercise was a reminder that "just switch to the cheaper model" is a sentence hiding all of the actual engineering.
One caveat, because the judge wasn't perfect either. It correctly caught that Sonnet wraps its JSON in markdown fences, which is real and the reason production has a fence-stripper. But it also hallucinated em-dash violations in outputs that contained zero em dashes. So I trust the eyeballed samples more than the exact decimal scores. The direction is solid; the second decimal place is not.
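Both of those judge findings can be checked deterministically. Here is a minimal sketch of a fence-stripper and an em-dash counter; it is not the production code, and the function names and the `post` key are assumptions:

```python
import json
import re

FENCE = "`" * 3  # built programmatically so this example nests cleanly in markdown

# Matches an output that is entirely one fenced block, with an optional language tag.
FENCED = re.compile(
    r"^\s*" + FENCE + r"[A-Za-z0-9_-]*[ \t]*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_fences(raw: str) -> str:
    """Return the body of a fenced block, or the trimmed input if there is no fence."""
    match = FENCED.match(raw)
    return match.group(1).strip() if match else raw.strip()


def em_dash_count(text: str) -> int:
    """Count literal em dashes (U+2014) so a judge's claim can be verified."""
    return text.count("\u2014")


fenced = FENCE + 'json\n{"post": "Shipping day."}\n' + FENCE
print(json.loads(strip_fences(fenced)))
print(em_dash_count("no dashes here - just a hyphen"))
```

A string count like this is a cheap cross-check on any rule the judge is asked to enforce: if the count is zero, the violation was hallucinated.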
The practical decision? For this one demo the spend is tiny, so Sonnet stays as the safe default and DeepSeek's fast endpoint is the upgrade I'd reach for first. The real savings aren't here anyway. They are in the high-volume features that run thousands of times a day, where roughly 40x cheaper stops being a rounding error. That's the test worth running next.
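To see where volume starts to matter, here is a back-of-envelope sketch using the Round 3 per-generation costs. The 5,000 runs per day is a hypothetical figure chosen only to illustrate "thousands of times a day"; it is not a measured number.

```python
# Per-generation costs from the Round 3 table.
COST_PER_GEN_USD = {
    "Sonnet 4.6": 0.0079,
    "DeepSeek (fast)": 0.00019,
    "GLM (no-think)": 0.0017,
}
RUNS_PER_DAY = 5_000  # hypothetical volume, for illustration only


def daily_cost(model: str, runs_per_day: int = RUNS_PER_DAY) -> float:
    """Daily spend in USD for one feature at the given volume."""
    return COST_PER_GEN_USD[model] * runs_per_day


def monthly_saving(baseline: str, challenger: str,
                   runs_per_day: int = RUNS_PER_DAY, days: int = 30) -> float:
    """USD saved per month by moving from baseline to challenger."""
    delta = daily_cost(baseline, runs_per_day) - daily_cost(challenger, runs_per_day)
    return delta * days


for name in COST_PER_GEN_USD:
    print(f"{name}: ${daily_cost(name):.2f}/day")
print(f"Saving vs Sonnet: ${monthly_saving('Sonnet 4.6', 'DeepSeek (fast)'):.2f}/month")
```

At that assumed volume the gap is about $39.50 versus $0.95 a day, which is the point where the per-generation difference becomes a line item.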
If you take one thing from this: before you swap a model to save money, run it through the exact plumbing it will live in. The model you think you're testing is not the one your users get.