
Here's the moment I knew I had to build this.
I was testing a popular AI content tool - I won't name it - and I gave it what I thought was a clear prompt: "Write an Instagram caption in my brand's casual, slightly sarcastic voice." It came back with something that sounded like a corporate intern who'd been told to "be relatable." Words like "unleash" and "game-changer." An emoji after every sentence. The classic "Ready to level up? Link in bio!" closer.
That's not my voice. That's nobody's voice. That's the default AI voice - the one that reads like it was generated by a machine, because it was.
I'd already been building Sydium for months at that point. Scheduling, publishing, analytics. The stuff every social media tool does. But this moment stuck with me because it exposed the real problem: every AI content tool generates the same voice. They don't learn how you write. They learn how to sound generically "engaging." And that generic output is exactly what makes AI-generated content feel hollow.
So I decided to build something different. A voice system that actually reads your existing content, extracts the patterns that make your writing yours, and then uses those patterns to generate new content that sounds like you wrote it. Not "professional tone" or "casual tone." Your tone.
This is the story of how I built it, what went wrong, and what I learned along the way.
Why "pick a tone" dropdowns don't work
Let's start with how most AI content tools handle voice.
You get a dropdown. Maybe 5-10 options: Professional, Casual, Friendly, Authoritative, Humorous. Some tools let you write a text description of your brand voice. Jasper calls theirs "Brand Voice" and lets you upload samples or a URL. Typeface requires 15,000 words for long-form and up to 15 examples for short-form content. These are legitimate approaches and way better than a dropdown.
But here's the problem I kept running into.
A "tone" is not a voice. Two writers can both be "casual" and sound nothing alike. One might use short fragments. The other might write long, winding sentences with three parenthetical asides. One opens with questions. The other opens with stats. One uses emojis ironically. The other never uses them at all.
Tone is maybe 20% of what makes someone's writing recognizable. The other 80% is structural - sentence length patterns, how they start and end posts, their vocabulary range, whether they use hashtags or hate them, their signature phrases, their hook style.
I found this confirmed in research too. Hashmeta's technical guide on brand voice training emphasizes that real voice replication needs to capture "linguistic fingerprints" from 50 to 100 high-performing content pieces. A one-line description of your tone captures none of that.
Mavik Labs wrote about this for 2026: voice should match the stakes of the communication, and defining traits with "do/don't" language patterns matters more than vague descriptors. I'd go further. Voice needs to be extracted from how someone actually writes, not described by how they think they write. Those are almost never the same thing.
The first approach: prompt engineering (and why it broke)
My first attempt was simple. Take a few of the user's posts, paste them into the prompt, and tell the LLM "write like this."
This is few-shot prompting, and it's the foundation of most brand voice tools. Research shows that 2-5 examples are usually sufficient for the model to pick up on patterns. DataCamp's tutorial on few-shot prompting confirms that well-chosen examples outperform larger sets of lower-quality ones.
So I built a quick prototype. Pull the user's latest 10 posts from their connected accounts, include them in the system prompt, generate new content.
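To make the prototype concrete, here is a minimal sketch of that kind of few-shot prompt assembly. The function name, the prompt wording, and the `max_examples` parameter are illustrative assumptions, not Sydium's actual code.

```python
def build_few_shot_prompt(posts: list[str], task: str, max_examples: int = 10) -> str:
    """Assemble a system prompt that shows the model recent posts as style examples."""
    # Drop empty posts and cap the number of examples (hypothetical default of 10).
    examples = [p.strip() for p in posts if p.strip()][:max_examples]
    numbered = "\n\n".join(
        f"Example {i}:\n{text}" for i, text in enumerate(examples, start=1)
    )
    return (
        "You write social media posts in the voice shown in the examples below.\n"
        "Match their sentence length, punctuation, and emoji habits.\n\n"
        f"{numbered}\n\n"
        f"Task: {task}"
    )
```

Notice that nothing here tells the model which traits matter most. It only sees raw posts.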
It worked... sort of.
The AI picked up surface-level patterns. If the user used emojis, the generated content used emojis. If they wrote short sentences, it wrote short sentences. But it felt like a photocopy - technically accurate but missing something essential. The generated posts were recognizably "in the style of" but never felt like the person actually wrote them.
The problem was that 10 posts weren't enough context, and just pasting them into a prompt doesn't give the AI enough signal about what to prioritize. Is the user's emoji usage intentional, or just something they do on Instagram but not LinkedIn? Is their sentence length a stylistic choice, or does it vary by platform? The raw posts don't answer those questions.
I needed something between "paste examples in a prompt" and "fine-tune a model on your data." Something that could extract the DNA of someone's voice without needing the 5,000 to 15,000 annotated content samples that enterprise solutions require.
The pipeline that actually works

After weeks of iteration, I landed on a multi-stage pipeline that combines statistical analysis with AI-powered pattern extraction. Here's how it works in Sydium.
Stage 1: Data collection
The system pulls content from every source it can find: social posts across up to 5 platforms (up to 50 posts per platform), scraped website content, uploaded documents, pasted examples, and manual configuration. The more data, the better the voice profile - but the system works with as few as a handful of posts.
This matters because most people's writing differs across platforms. Your LinkedIn posts are probably more formal than your Instagram captions. The system needs to see both to understand the range of your voice, not just one slice of it.
Stage 2: Statistical analysis
Before any AI touches the data, I run statistical analysis. This sounds boring but it's the foundation everything else builds on.
The system calculates concrete numbers: average sentence length, emoji frequency per 100 words, hashtag density, vocabulary level (using standard readability metrics), punctuation patterns, paragraph length distribution. These are objective measurements that don't require interpretation.
Why do this step at all? Because LLMs are notoriously bad at counting. If you ask Claude or GPT-4 to analyze a text and tell you the average sentence length, you'll get an approximation that's often wrong. But if you calculate it statistically and tell the AI "this person's average sentence length is 12 words with a standard deviation of 4," now the AI has a reliable anchor.
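A rough sketch of what this stage can look like, using only the standard library. The function name, the regexes, and the returned keys are assumptions for illustration. The emoji pattern is a deliberate approximation, and readability metrics are left out.

```python
import re
import statistics

# Rough emoji matcher covering common emoji blocks; an approximation,
# not a full Unicode emoji parser.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
WORD_RE = re.compile(r"[A-Za-z0-9']+")


def voice_stats(posts: list[str]) -> dict[str, float]:
    """Compute simple, deterministic style measurements over a list of posts."""
    sentence_lengths: list[int] = []
    total_words = emoji_count = hashtag_count = 0
    for post in posts:
        # Note: the word inside a hashtag ("#coffee") also counts as a word.
        total_words += len(WORD_RE.findall(post))
        emoji_count += len(EMOJI_RE.findall(post))
        hashtag_count += len(re.findall(r"#\w+", post))
        # Naive sentence split on terminal punctuation.
        for sentence in re.split(r"[.!?]+", post):
            words = WORD_RE.findall(sentence)
            if words:
                sentence_lengths.append(len(words))

    def per_100_words(count: int) -> float:
        return 100 * count / total_words if total_words else 0.0

    return {
        "avg_sentence_length": statistics.mean(sentence_lengths) if sentence_lengths else 0.0,
        "sentence_length_stdev": statistics.pstdev(sentence_lengths) if sentence_lengths else 0.0,
        "emoji_per_100_words": per_100_words(emoji_count),
        "hashtags_per_100_words": per_100_words(hashtag_count),
    }
```

These numbers are what get handed to the model as fixed anchors, so it never has to count anything itself.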
Stage 3: AI-powered pattern extraction
This is where it gets interesting. I send the collected content to Claude or GPT-4 (Sydium supports both) with a very specific instruction: identify the qualitative patterns that statistics can't capture.
The AI analyzes tone descriptors (from a set of 10 presets that I've tested extensively), signature phrases, hook patterns (how they open posts), closing styles (how they end posts), CTA preferences, and sentence structure tendencies. It identifies things like "this person almost always opens with a question" or "they tend to end posts with a one-sentence kicker" or "they never use the word 'leverage.'"
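One way to structure this step is to pass the measured statistics in as ground truth and ask for a JSON reply that can be validated. The sketch below assumes that approach. The preset names, the required keys, and both function names are hypothetical, since the actual 10 presets and prompt wording aren't listed here, and the call to the model itself is left out.

```python
import json

# Hypothetical subset of tone presets; the article mentions 10 but does not list them.
TONE_PRESETS = {"casual", "sarcastic", "formal", "playful", "direct"}
# Hypothetical reply schema for the qualitative analysis.
REQUIRED_KEYS = {"tone_descriptors", "signature_phrases", "hook_pattern", "closing_style"}


def build_analysis_prompt(posts: list[str], stats: dict[str, float]) -> str:
    """Ask for qualitative patterns only, with measured stats supplied as fixed anchors."""
    stat_lines = "\n".join(f"- {name}: {value:.1f}" for name, value in stats.items())
    samples = "\n---\n".join(posts)
    return (
        "Measured statistics (treat as ground truth, do not recompute):\n"
        f"{stat_lines}\n\n"
        f"Allowed tone descriptors: {', '.join(sorted(TONE_PRESETS))}\n"
        f"Reply with a JSON object containing the keys: {', '.join(sorted(REQUIRED_KEYS))}.\n\n"
        f"Posts:\n{samples}"
    )


def parse_analysis(raw: str) -> dict:
    """Validate the model's JSON reply and drop tone descriptors outside the preset list."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"analysis reply is missing keys: {sorted(missing)}")
    data["tone_descriptors"] = [t for t in data["tone_descriptors"] if t in TONE_PRESETS]
    return data
```

Restricting tone descriptors to a fixed list keeps profiles comparable across users and across providers.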
Stage 4: Few-shot example selection
The system picks the best examples from the collected content to use as few-shot demonstrations. Not random posts - the ones that best represent the user's voice based on the patterns extracted in stages 2 and 3. A post that's an outlier (maybe they were trying something different that day) gets filtered out. The most representative samples become the examples the generation model sees.
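As an illustration of "most representative," the sketch below ranks posts by how far their feature vectors sit from the typical post. Using the median as the center and a scaled Euclidean distance is my assumption for the example, not necessarily how Sydium scores representativeness. The feature vectors could be any per-post measurements, such as word count and emoji count.

```python
import math
import statistics


def select_representative(posts: list[str], features: list[list[float]], k: int = 3) -> list[str]:
    """Return the k posts whose feature vectors are closest to the typical post."""
    columns = list(zip(*features))
    # Median is used as the center so a single outlier post cannot drag it away.
    centers = [statistics.median(col) for col in columns]
    # Scale each feature by its spread; fall back to 1.0 when a feature never varies.
    spreads = [statistics.pstdev(col) or 1.0 for col in columns]

    def distance(row: list[float]) -> float:
        return math.sqrt(sum(((v - c) / s) ** 2 for v, c, s in zip(row, centers, spreads)))

    ranked = sorted(zip(posts, features), key=lambda pair: distance(pair[1]))
    return [post for post, _ in ranked[:k]]
```

A post that was an experiment that day lands at the bottom of the ranking and never reaches the generation prompt.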
Stage 5: Platform-specific adjustments
Here's something that tripped me up for weeks. A person's voice on LinkedIn is not their voice on TikTok. They're both authentically "that person," but the register shifts. Professional vocabulary on LinkedIn, slang on TikTok, somewhere in between on Instagram.
The system applies platform adjustments after establishing the base voice. It's like how you talk differently to your boss than to your friends - both are authentically you, but the context shapes the expression.
Stage 6: Quality scoring
Every generated piece gets a quality score from 0-100 based on how closely it matches the extracted voice profile. This isn't just a vibes check - it measures concrete alignment: does the sentence length match the user's pattern? Is the emoji frequency within their normal range? Are the hooks structured the way they usually structure hooks?
Content below a configurable threshold gets flagged or regenerated.
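A simplified sketch of a score like this: each feature of the draft is compared to the profile's mean and spread, and the per-feature results are averaged into a 0-100 number. The three-standard-deviation cutoff, the equal weighting, and the default threshold of 70 are assumptions chosen for the example.

```python
def quality_score(draft: dict[str, float], profile: dict[str, tuple[float, float]]) -> float:
    """Score a draft 0-100 by how close each measured feature is to the profile.

    `profile` maps a feature name to (mean, spread) taken from the user's own posts.
    """
    per_feature = []
    for name, (mean, spread) in profile.items():
        # Distance from the user's norm, in units of their own variability.
        z = abs(draft[name] - mean) / max(spread, 1e-9)
        # Full credit at z = 0, falling linearly to zero credit at z >= 3.
        per_feature.append(max(0.0, 1.0 - z / 3.0))
    return round(100 * sum(per_feature) / len(per_feature), 1)


def needs_review(score: float, threshold: float = 70.0) -> bool:
    """Flag drafts that fall below the configurable threshold."""
    return score < threshold
```

With a profile of `{"avg_sentence_length": (12, 4), "emoji_per_100_words": (2, 1)}`, a draft that matches both means scores 100.0, while one with 24-word sentences and the right emoji rate scores 50.0 and gets flagged.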
The Problem Nobody Warns You About: Voice Drift
Here's a problem I didn't anticipate. If you don't measure voice consistency, it will drift.
In the first version, the voice profile was static. Extract it once, use it forever. But people's voices evolve. They pick up new phrases. They shift platforms. They rebrand. A voice profile from January might be noticeably off by June.
Worse, the generated content itself can cause drift. This is actually a known problem in machine learning. Research from Rice University on "self-consuming AI" found that when AI systems train on their own generated content, quality degrades over time - they call it "model autophagy disorder." The output gets progressively more generic, reinforcing patterns that aren't actually characteristic of the user.
I had to build safeguards against this. The system periodically re-analyzes the user's actual organic content (not AI-generated posts) and recalibrates the voice profile. Generated content is tagged internally so the system knows not to learn from its own output. The quality score serves as a drift detector - if scores start trending down, the profile needs refreshing.
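Both safeguards can be sketched in a few lines. The `Post` shape, the window size, and the 5-point drop are hypothetical values for illustration. The real thresholds would need tuning against actual score histories.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Post:
    text: str
    ai_generated: bool  # internal tag set when the post came from the generator


def organic_only(posts: list[Post]) -> list[Post]:
    """Keep only human-written posts so the profile never learns from its own output."""
    return [p for p in posts if not p.ai_generated]


def is_drifting(scores: list[float], window: int = 5, drop: float = 5.0) -> bool:
    """Compare the latest window of quality scores with the window before it."""
    if len(scores) < 2 * window:
        return False  # not enough history to judge a trend
    earlier = statistics.mean(scores[-2 * window:-window])
    recent = statistics.mean(scores[-window:])
    return earlier - recent >= drop
```

When `is_drifting` returns true, that is the cue to re-run extraction on the output of `organic_only`.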
The Edit Feedback Loop: Where the Real Learning Happens
This is the feature I'm most proud of, and the one that took the longest to get right.
When a user generates content and then edits it before publishing, the system captures the before/after pair. It records what was generated, what the user changed, which platform it was for, and the magnitude of the change. Sydium stores up to 20 of these edit pairs per user.
These pairs are gold. They tell the system exactly where the voice model is wrong.
If a user consistently shortens opening sentences, the system learns that its hooks are too wordy. If they always remove certain phrases, those phrases get deprioritized. If they add emoji to Instagram captions but remove them from LinkedIn posts, the platform-specific adjustments get refined.
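Here is a minimal sketch of how edit pairs could be captured and mined. It assumes a bounded in-memory history and uses `difflib.SequenceMatcher` as a stand-in for whatever "magnitude of the change" means in the real system. The class and method names are hypothetical.

```python
from collections import Counter, deque
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class EditPair:
    generated: str
    published: str
    platform: str
    magnitude: float  # 0.0 = untouched, approaching 1.0 = fully rewritten


class EditHistory:
    def __init__(self, max_pairs: int = 20):
        # Oldest pairs fall off automatically once the cap is reached.
        self.pairs: deque[EditPair] = deque(maxlen=max_pairs)

    def record(self, generated: str, published: str, platform: str) -> EditPair:
        similarity = SequenceMatcher(None, generated, published).ratio()
        pair = EditPair(generated, published, platform, round(1.0 - similarity, 3))
        self.pairs.append(pair)
        return pair

    def frequently_removed_words(self, min_count: int = 2) -> list[str]:
        """Words the user deleted from generated drafts in at least `min_count` edits."""
        removed: Counter[str] = Counter()
        for pair in self.pairs:
            before = set(pair.generated.lower().split())
            after = set(pair.published.lower().split())
            removed.update(before - after)
        return [word for word, count in removed.most_common() if count >= min_count]
```

Words that keep showing up in `frequently_removed_words` are candidates for a "never use" list in the next generation prompt.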
This is inspired by RLHF (Reinforcement Learning from Human Feedback), one of the techniques used to train ChatGPT. The core idea is similar: the system generates output, a human corrects it, and the correction feeds back into future generation. The difference is that we're not training a reward model or fine-tuning anything - we're adjusting the prompt context and voice profile parameters. Think of it as a lightweight, RLHF-inspired loop without the infrastructure costs of actual model training.
IrisAgent wrote about the power of feedback loops in AI: systems that incorporate correction data "don't just learn from mistakes - they develop an intuition for avoiding them." That's exactly what I was going for. Not a static voice model, but one that sharpens itself every time you use it.
The result is that the more you use Sydium's content generation, the more it sounds like you. Not in a vague "it's getting better" sense. In a measurable, quality-scored, pattern-matched sense.
What I Got Wrong (Twice)
Wrong approach 1: Letting users describe their voice
My first version had a form where users could describe their brand voice. "I write in a casual but knowledgeable tone. I use humor sometimes. I'm direct."
This was useless.
People are terrible at describing how they write. They describe how they think they write, or how they want to write, or how their favorite writer writes. The gap between "how I describe my voice" and "how I actually write" is enormous. I found this across every user who tested the early version. Their self-descriptions were aspirational, not accurate.
I replaced the form with the automated extraction pipeline. Now the user connects their accounts, the system reads their actual content, and the voice profile is built from evidence rather than self-perception. Users can still tweak it manually, but the starting point is real data, not wishful thinking.
Wrong approach 2: One voice profile per user
The second version had one voice profile that applied everywhere. But as I mentioned above, people write differently on different platforms. They also write differently for different content types - a product announcement sounds different from a personal story.
The system now maintains a base voice profile with platform-specific overlays. The base captures the fundamental patterns (vocabulary, sentence structure, personality). The overlays adjust for platform norms (more formal on LinkedIn, shorter on Twitter, more visual language on Instagram). This was a pain to build but it's the difference between "this kinda sounds like me" and "this actually sounds like me."
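The base-plus-overlay idea maps naturally onto a recursive dictionary merge. The sketch below assumes profiles are nested dicts. The field names and values in the usage note are made up for illustration.

```python
def apply_overlay(base: dict, overlay: dict) -> dict:
    """Return a new profile where overlay values override the base, key by key."""
    merged = dict(base)  # shallow copy so the base profile is left untouched
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections instead of replacing them wholesale.
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = value
    return merged
```

For example, merging a base of `{"emoji": {"per_100_words": 3.0, "ironic": True}, "avg_sentence_length": 12}` with a LinkedIn overlay of `{"emoji": {"per_100_words": 0.0}}` zeroes out the emoji rate while keeping the sentence length and every other base trait.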
Technical choices I'd make differently
Using both Claude and GPT-4. I built the system to work with both AI providers, which sounded smart until I realized they interpret voice analysis prompts differently. Claude tends to produce more nuanced analysis but sometimes over-explains. GPT-4 is more consistent in format but occasionally misses subtlety. I now recommend Claude for the analysis stage and GPT-4 for generation, but letting users choose means the voice can subtly shift depending on their provider settings. If I started over, I'd pick one and optimize for it.
The quality score calibration. My initial quality scores were too generous. Everything scored 70-85, which told users nothing useful. The scores need to have real variance - a 50 should mean "this doesn't sound like you" and a 90 should mean "this is indistinguishable from your writing." I recalibrated three times before the scores became meaningful. The lesson: if your quality metric doesn't create uncomfortable results sometimes, it's not measuring anything.
Storing voice profiles. I stored voice profiles as flat JSON documents in Firestore. This works fine at current scale, but the profiles are getting complex enough that I'm already running into Firestore's 1 MiB per-document size limit for power users with lots of connected platforms and edit history. If I started fresh, I'd structure voice profiles as subcollections from day one.
What Other Tools Are Doing (And the Gap They All Share)
Jasper's Brand IQ is the most sophisticated system I've seen in the market. It functions as a "proprietary RAG system" that grounds AI outputs in company-specific data - brand voice, strategy documents, audience profiles. It's built for enterprise teams.
Typeface requires significant data volume - 15,000 words minimum for long-form voice training, with training taking several hours. They've gone deep on web scraping capabilities to pull content from URLs automatically.
Blaze.ai learns from existing content and applies it across channels. Search Engine Land published a guide on training in-house LLMs on brand voice that covers some of the same territory.
What most of these miss, in my opinion, is the feedback loop. They capture a snapshot of your voice and apply it. But they don't learn from your corrections. The voice profile is a photograph, not a video. It captures who you were, not who you're becoming.
The other thing most tools miss is the quality scoring transparency. They generate content and you either accept it or you don't. But you can't see why the system made the choices it made, or how confident it is that the output matches your voice. Sydium shows you the score and the factors that contributed to it. I think transparency is what separates "AI magic" from a tool you can actually trust.
Where this is going
The voice system is live in Sydium now, and the feedback loop means it improves with every user interaction. But there's a lot I still want to build.
Voice cloning across content types. Right now the system is optimized for social media posts. But your brand voice extends to emails, blog posts, ad copy. The pipeline should work for any text output, using the same voice profile with format-specific adjustments.
Collaborative voice profiles. For agencies managing multiple clients, the voice system needs to handle team-based workflows where different team members can generate content for the same brand. The voice profile becomes a shared asset, not a personal one.
Better outlier detection. The system should get smarter about which posts to ignore during voice extraction. A viral post might not be representative - it might have gone viral because it was different from the user's normal voice. Currently the statistical outlier detection is basic. I want to make it context-aware.

Lessons for Other Builders
If you're building anything with AI-powered personalization, here's what I'd pass along from this experience.
Start with data, not descriptions. Never ask users to describe what you can observe directly. Their self-knowledge is unreliable. Extract patterns from their actual behavior.
Statistical foundations beat pure AI. Let AI do the qualitative analysis. But anchor it with hard numbers. LLMs miscount when asked to measure text themselves; they are far more reliable when working from numbers you hand them.
Build the feedback loop from day one. I added the edit feedback loop late and regretted it. Every AI system should capture corrections from the moment it ships. The compounding improvement is the real competitive moat.
Your quality metric needs teeth. If every output scores "good," your metric is useless. Build a scoring system that produces uncomfortable results. A 45 out of 100 that tells the user "this doesn't match your voice" is more valuable than a 78 that tells them nothing.
Voice is a spectrum, not a setting. People don't have one voice. They have a voice range. Your system needs to capture the range and the contexts that trigger different parts of it.
I wrote about the reality of building in public before, and the brand voice system is a good example of what that actually looks like. Weeks of iteration. Dead ends. Three complete rewrites of the quality scoring. Features that sounded brilliant in my head and were useless in practice. But at the end of it, I have something that genuinely gets better the more you use it. That feels like progress.
If you're a creator who's sick of AI content that sounds like it was written by a marketing textbook, you can try Sydium for free and see what your actual voice profile looks like. The analysis alone is worth it, even if you never generate a post.
The future of AI content isn't about generating more. It's about generating content that's indistinguishable from what you'd write yourself - and getting measurably closer every time you use it. That's the system I built, and it's live right now.
Questions Builders and Creators Keep Asking
How does AI brand voice training actually work?
The technical approach combines statistical analysis of your existing content with AI-powered pattern extraction. The system measures concrete things like sentence length, emoji frequency, and vocabulary level, then uses Claude or GPT-4 to identify qualitative patterns like your hook style, closing preferences, and signature phrases. Hashmeta's guide on brand voice training suggests that 50-100 high-performing content pieces provide the best foundation for extracting reliable "linguistic fingerprints." The result is a voice profile that captures how you actually write, not how you describe your writing.
How is this different from Jasper or Typeface brand voice?
Jasper's Brand IQ uses a RAG-based system optimized for enterprise teams. Typeface requires 15,000+ words for long-form voice training. Sydium's approach works with fewer samples (even a handful of posts) and adds two key features most competitors lack: a self-improving feedback loop that learns from your edits, and a transparent quality score that shows how closely the output matches your voice profile. The system gets measurably better the more you use it.
Can AI really capture someone's unique writing voice?
Yes, but not through a tone dropdown. Well-chosen few-shot examples carry far more signal than a simple tone description. Sydium's pipeline goes further by combining statistical measurements (sentence length, emoji patterns, vocabulary level) with AI analysis (hook style, CTA preferences, signature phrases). By my rough estimate, the result captures about 80% of what makes someone's writing recognizable. The remaining 20% comes from the feedback loop as you correct and refine generated content.
What is a voice quality score and why does it matter?
Sydium assigns every generated piece a score from 0-100 based on how closely it matches your extracted voice profile. It measures concrete alignment: sentence length patterns, emoji frequency, hook structure, vocabulary choices. If the score is below your threshold, the content gets flagged for revision. This matters because without measurement, voice consistency drifts over time. The score is a safeguard against the AI gradually defaulting to its own generic voice.
Does the AI learn from my edits?
Yes. Every time you edit AI-generated content before publishing, Sydium captures the before/after pair. It records what changed, which platform it was for, and how significant the edit was. The system stores up to 20 of these pairs and uses them to improve future generation. This is inspired by RLHF (Reinforcement Learning from Human Feedback), one of the techniques used to train ChatGPT, although Sydium adjusts the prompt context and voice profile rather than retraining a model. The more you use and correct the system, the more accurately it reproduces your voice.
How many posts does the system need to build a voice profile?
The system works with as few as a handful of posts but improves significantly with more data. It can pull up to 50 posts per connected platform across 5 platforms, plus content from website scraping, uploaded documents, and pasted examples. Enterprise solutions typically require 5,000-15,000 annotated samples for comprehensive training. Sydium needs far less because the pipeline combines statistical analysis with AI-powered extraction rather than attempting to fine-tune a model directly.
Can I have different brand voices for different platforms?
Yes. The system supports multiple voice profiles and can detect platform-specific patterns automatically. Your LinkedIn posts probably sound more professional than your Twitter posts - that's intentional, and the AI recognizes it. When generating content, you can choose which voice profile to use, or let the system auto-select based on the target platform. This is useful for agencies managing multiple clients or creators who maintain distinct personas across platforms.
How do I improve my voice profile if the AI keeps missing my tone?
Start by reviewing your training data. If the system consistently misses your tone, it often means your input samples are inconsistent or don't represent your best work. Remove outliers - posts that performed poorly or were written when you were rushed. Add more examples of your strongest content. Then use the edit feedback loop aggressively: every correction you make teaches the system what you actually want. Most users see noticeable improvement within 15-20 edit cycles as the system learns your preferences.