Cloning a Voice from Chat Logs: A Markov-Chain Style Bot

This one is off my usual beat. It is not a firewall or a packet broker — it is a small side project that turned out to be a cleaner demonstration of a privacy problem than anything I could have written about on purpose. I built a Discord bot that writes like a specific person, trained on a scraped archive of that person's real messages. The bot is a joke. The pipeline behind it is not.

What it is

The bot exposes a slash command, /lifesux, that returns a sentence in a particular person's voice: self-deprecating, goofy, a little unhinged. A few variants sit on top of the same core — /mad, /rant, and /cursed — but they are all the same trick wearing different hats. The bot does not understand anything it produces. It is a Markov chain in a costume.

The Markov chain is not the interesting part. It is old, well-understood technology, and that is precisely why it works here. The interesting part is the pipeline around it: take a real person's chat history, harvest it into a corpus, and weight that corpus so the generated output leans toward the parts of their personality you actually want to imitate. That is the point at which this stops being a toy and becomes a small preview of something less comfortable, which I will come back to.

How the generator works

The core of the bot is markovify, a Python library that builds a Markov chain text model. A Markov chain predicts the next state based only on the current one. It has no long-term memory, no grammar, and no plan. For text, the state is the last N words, and the model only ever answers one question: given these last N words, what word tends to follow, and with what probability?

I set state_size=2, a second-order chain, so the model keys off the previous two words to choose the next. If the training data contains the phrase "i'm such a clown" often enough, then whenever the chain produces the pair ("i'm", "such") it is likely to follow with "a", and the pair ("such", "a") leans toward "clown". Chain enough of those transitions together and the result is a sentence nobody actually wrote but that reads like something the person would have said.
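To make the transition table concrete, here is a minimal hand-rolled sketch of a second-order chain in plain Python. It is not markovify and not the bot's code: the names build_chain, generate, BEGIN, and END are hypothetical, the three-line corpus is invented for illustration, and each line is assumed to be its own sentence.

```python
import random
from collections import defaultdict

BEGIN, END = "__BEGIN__", "__END__"   # hypothetical sentinel tokens

def build_chain(lines, state_size=2):
    """Map each state (tuple of the last N words) to the list of observed next words."""
    chain = defaultdict(list)
    for line in lines:
        words = line.split()
        if not words:
            continue
        # Pad so the first real word follows an all-BEGIN state.
        tokens = [BEGIN] * state_size + words + [END]
        for i in range(len(tokens) - state_size):
            state = tuple(tokens[i:i + state_size])
            chain[state].append(tokens[i + state_size])
    return chain

def generate(chain, state_size=2, max_words=30, rng=random):
    """Walk the chain from the BEGIN state until END or max_words."""
    state = (BEGIN,) * state_size
    out = []
    while len(out) < max_words:
        # Repeated entries make frequent followers proportionally more likely.
        nxt = rng.choice(chain[state])
        if nxt == END:
            break
        out.append(nxt)
        state = state[1:] + (nxt,)
    return " ".join(out)

corpus = [
    "i'm such a clown",
    "i'm such a mess today",
    "such a clown honestly",
]
chain = build_chain(corpus)
print(chain[("i'm", "such")])   # ['a', 'a']
print(chain[("such", "a")])     # ['clown', 'mess', 'clown']
print(generate(chain))          # random; may be a line nobody wrote
```

Because ("such", "a") is followed by "clown" twice and "mess" once, the walk picks "clown" about two times in three. That frequency count is all the "probability" in the model amounts to.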

State size is a trade-off, and it matters more than anything else in the model:

state_size=1 produces near-random word salad. Each word looks at only one predecessor, so the output is grammatically loose and rarely coherent, though it does generate unexpected combinations.
state_size=2 is the right setting for short, casual chat messages. It is coherent enough to read as a real sentence and loose enough to generate novel phrasing rather than reciting training lines verbatim. This is what I used.
state_size=3 and higher increasingly reproduce the original messages, because a small corpus does not contain enough distinct three-word sequences for the chain to branch. At that point the model effectively plagiarizes its source.
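One rough way to see the trade-off is to count how many states have more than one possible follower, since a state with a single follower can only replay its source. The sketch below does that for a tiny invented corpus. The helper branching_fraction is hypothetical, and the exact numbers mean nothing beyond this toy data.

```python
from collections import defaultdict

def branching_fraction(lines, state_size):
    """Fraction of states that have more than one distinct observed follower."""
    followers = defaultdict(set)
    for line in lines:
        words = line.split()
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            followers[state].add(words[i + state_size])
    if not followers:
        return 0.0
    branching = sum(1 for nxt in followers.values() if len(nxt) > 1)
    return branching / len(followers)

corpus = [
    "i am a clown",
    "you are a mess",
    "what a clown move",
    "i am so done",
]
for n in (1, 2, 3):
    print(n, round(branching_fraction(corpus, n), 2))
# 1 0.25
# 2 0.14
# 3 0.0
```

With three-word states, no state in this corpus has a choice of follower, so every generated line would be a copy of a training line.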

I also use markovify.NewlineText rather than the plain Text class. NewlineText treats each line of the corpus as a self-contained sentence, which is the correct choice for Discord data: each message is already a complete utterance, and there is no reason to let the end of one message bleed into the start of the next.

Weighting the corpus

A Markov chain trained on someone's entire message history sounds like their average self, and the average self is mostly logistics: "yeah ok," "what time," "on my way." The goal of the bot was to capture a particular flavor — the self-deprecating, goofy humor — so before training I bias the corpus toward the messages that already carry it.

The mechanism is straightforward. As each message loads, it passes through a scoring function, and the high-scoring messages are duplicated in the training data. The chain does not know it has been steered. It simply encounters those phrasings more often, so the transition probabilities tilt toward them.

def vibe_score(text):
    t = text.lower()
    score = 0
    for kw in SELF_DEPRECATION:   # "i'm such", "i suck", "my fault"...
        if kw in t:
            score += 2            # self-deprecation counts double
    for kw in GOOFY:              # "lmao", "bruh", "wait what"...
        if kw in t:
            score += 1
    return score

Self-deprecation keywords are worth two points each and goofy keywords one. The corpus builder then clones any message scoring above zero into the training set multiple times:

lines.append(text)
score = vibe_score(text)              # score once, reuse below
if score > 0:
    extra = min(score, 3) * WEIGHT    # WEIGHT = 4
    lines.extend([text] * extra)

A message with a vibe score of 1 therefore appears 1 + (1 × 4) = 5 times in total, and a message scoring 3 or higher caps at 1 + (3 × 4) = 13 copies. The min(..., 3) cap is deliberate. Without it, a single unusually on-brand message could dominate the model and the bot would quote that one line indefinitely. The cap keeps each message's influence strong without letting any one of them take over.
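The arithmetic can be checked with a small self-contained sketch. The keyword lists here contain only the examples named in the comments above, not the bot's full lists, and copies_for and build_corpus are hypothetical helper names introduced for illustration.

```python
WEIGHT = 4  # value given in the text

# Illustrative lists only: the examples from the snippet's comments.
SELF_DEPRECATION = ["i'm such", "i suck", "my fault"]
GOOFY = ["lmao", "bruh", "wait what"]

def vibe_score(text):
    t = text.lower()
    score = 0
    for kw in SELF_DEPRECATION:
        if kw in t:
            score += 2            # self-deprecation counts double
    for kw in GOOFY:
        if kw in t:
            score += 1
    return score

def copies_for(score, weight=WEIGHT, cap=3):
    """Total copies of a message: the original plus min(score, cap) * weight extras."""
    if score <= 0:
        return 1
    return 1 + min(score, cap) * weight

def build_corpus(messages, weight=WEIGHT):
    """Repeat each message according to its score, as the corpus builder does."""
    lines = []
    for text in messages:
        lines.extend([text] * copies_for(vibe_score(text), weight))
    return lines

messages = ["what time", "lmao ok", "i'm such a clown lmao", "my fault, i suck, bruh"]
for m in messages:
    s = vibe_score(m)
    print(s, copies_for(s), m)
# 0 1 what time
# 1 5 lmao ok
# 3 13 i'm such a clown lmao
# 5 13 my fault, i suck, bruh
```

The last message scores 5 but still contributes 13 copies, the same as a message scoring 3, which is the cap at work.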

The WEIGHT constant, set to 4, is the master volume control for the personality. Raise it and the bot becomes a caricature in which every line drips with self-deprecation. Lower it toward 1 and the output relaxes back into the person's ordinary voice. A value of 4 reads as recognizably them, only funnier.
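As a rough illustration of that volume control, the sketch below computes what share of the training lines come from flavored messages at different WEIGHT values. The 900/100 split and the assumption that every flavored message scores 1 are invented for the example, and flavored_share is a hypothetical helper. Share of lines is only a proxy for how the transition probabilities shift.

```python
def flavored_share(n_plain, n_flavored, weight, score=1, cap=3):
    """Share of training lines that come from flavored messages after duplication."""
    copies = 1 + min(score, cap) * weight
    flavored_lines = n_flavored * copies
    return flavored_lines / (n_plain + flavored_lines)

# Assumed corpus: 900 plain messages, 100 flavored messages scoring 1 each.
for w in (0, 1, 4, 10):
    print(w, round(flavored_share(900, 100, w), 3))
# 0 0.1
# 1 0.182
# 4 0.357
# 10 0.55
```

Under these assumptions, WEIGHT = 4 lifts a tenth of the messages to roughly a third of the training lines.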

The same idea, reused per command

That scoring approach is reused at generation time as well as during training. The generate_line_scored() function generates roughly twenty candidate sentences, filters them to a reasonable length, sorts them by a scoring function, and selects at random from the top five:

pool.sort(key=scorer, reverse=True)
return random.choice(pool[:5])

Choosing randomly from the top five, rather than always taking the single highest scorer, is what keeps the output from repeating itself. The results stay on theme while rarely coming out the same way twice. Each command passes in a different scoring function, so all of them share one engine:

/lifesux ranks by vibe_score (self-deprecating and goofy) and serves as the default personality.
/mad ranks by aggro_score, then uppercases the result and appends "!!!".
/cursed ranks by cursed_score, randomly capitalizes words, and scatters in emoji.
/rant is the most elaborate. It generates four separate aggressive lines and stitches them into a paragraph with an opener ("okay so,"), escalators ("and on top of that,"), a closer ("and i'm done"), and a few randomly injected emoji. It is a Markov chain imitating a structured argument.
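The shared engine and two of the wrappers can be sketched as follows. This is a reconstruction under assumptions, not the bot's actual code: the length bounds, the parameter names, and the helpers mad and rant are hypothetical, emoji injection is left out, and make_sentence stands in for whatever callable produces one candidate (it may return None, as a markovify model's make_sentence can).

```python
import itertools
import random

def generate_line_scored(make_sentence, scorer, tries=20, min_len=15,
                         max_len=200, top_k=5, rng=random):
    """Generate candidates, keep those of reasonable length, pick one of the best few."""
    pool = []
    for _ in range(tries):
        s = make_sentence()                      # may be None
        if s and min_len <= len(s) <= max_len:   # assumed length filter
            pool.append(s)
    if not pool:
        return None
    pool.sort(key=scorer, reverse=True)
    return rng.choice(pool[:top_k])              # random pick among the top scorers

def mad(line):
    """/mad-style post-processing: uppercase and append '!!!'."""
    return line.upper() + "!!!"

def rant(lines):
    """/rant-style stitching of already generated lines into one paragraph."""
    parts = ["okay so, " + lines[0]]
    parts += ["and on top of that, " + l for l in lines[1:]]
    parts.append("and i'm done")
    return " ".join(parts)

# Stub generator that cycles through fixed samples instead of a real model.
samples = ["i'm such a clown honestly", "what time is it", "my fault again lmao", "ok"]
stub = itertools.cycle(samples).__next__

line = generate_line_scored(stub, scorer=len)    # len is a stand-in scorer
print(mad(line))
print(rant(["this again", "nobody asked"]))
```

Swapping the scorer argument is the only thing that changes between commands; the candidate generation and top-five selection stay the same.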

Limitations

None of this is intelligence, and it is worth being clear about that. The bot has no idea what any word means. The self-deprecation it produces is simply a set of statistically likely transitions that happen to originate in self-deprecating source messages. The keyword lists are hand-written and brittle, catching the obvious markers and missing anything subtle or sarcastic. And the whole thing depends on corpus size. A few hundred messages produce a parrot; tens of thousands produce something uncomfortably convincing. That last point leads directly to the part of this project that actually matters.

Where the data came from

The JSON file that feeds this model was not generated. It was harvested — a scraped archive of one real person's messages, pulled out of a Discord server and saved so a program could study how they write. For a bot that exists to tease a friend, that is harmless. But it is worth stepping back to look at the shape of what happened, because the shape is the whole point.

I took a corpus of one identifiable person's unedited, off-the-cuff writing and built a system that produces new text in their voice on demand. The output is currently short, crude, and obviously machine-generated, but that is only a function of the engine being a toy. The pipeline itself — harvesting a person's public text, modeling their style, and generating utterances they never actually wrote — is the same pipeline behind every text-based impersonation system on the horizon. It is simply running on a lawnmower engine rather than a jet turbine.

The reframe

Every message you have posted publicly is a labeled writing sample, permanently attributed to you by name, sitting in someone's archive. You did not consent to it becoming a corpus, but consent is not technically required, because the text is already public, already attributed, and already stored. A Discord export, a decade of forum posts, an entire social-media reply history: each one is a ready-made personality dataset with your name on it.

With a Markov chain, the worst this yields is a bot that sounds vaguely like you. Scale the model up and the same input produces something quite different. A large language model fine-tuned on one person's writing can reproduce not only vocabulary but cadence, opinions, recurring jokes, and characteristic mistakes — often convincingly enough to fool people who know that person well. The harvesting step does not change. Only the engine does.

Why this is no longer only about reputation

The old advice to watch what you post was about reputation: do not say something today that embarrasses you later. The newer concern is different in kind. What you say does not merely reflect on you; it can teach a model to be you. And it is genuinely unclear what the engine at the far end of this pipeline will look like in five or ten years. A few directions are already visible:

Text impersonation. A model trained on your message history that drafts messages in your voice and is used to manipulate the people who trust you. "It's me, my phone died, can you send me that code" becomes far more effective when it actually writes like you, references your private jokes, and reproduces your typing habits.
Voice and video synthesis. Text is the cheap and abundant input, but the same harvest-and-model approach applies to recordings of a voice or face. A few minutes of clean audio already supports usable voice cloning, and the data requirements are falling, not rising.
Clones built without consent. Models assembled from a person's data without their knowledge, or after their death, and presented as though the person is still speaking. The material needed to build them is being archived continuously, by all of us, at no cost to anyone.

The key point is that none of this requires a breakthrough. The Markov bot demonstrates the full pipeline, end to end, using libraries a hobbyist can install in an afternoon. Harvesting is trivial. Modeling is available at every quality tier with off-the-shelf tools. The only remaining variable is how capable the engine becomes, and that variable is moving steadily in one direction.

What to do about it

There is not much that is reassuring to say. You cannot retract what is already archived, and going silent is not realistic for most people. The practical conclusions are modest. Assume that anything you post publicly is permanent and attributable. Treat "public" as a synonym for "in someone's training set." Be skeptical of text or audio that sounds like someone you know but arrives through an unverified channel. And, because individuals have little leverage here, support consent and provenance standards for using a person's data to model them.

The technology is neutral. A bot that teases a friend and a tool that impersonates them to empty their bank account are the same pipeline with a different engine and a different intent. I built the harmless version. The harmful version is the same afternoon's work for anyone who means it.

Takeaway

The bot is a joke, and a good one. But it also demonstrates how little stands between a pile of someone's old messages and a working imitation of how they talk — and that gap is only going to narrow. The same instinct that makes a security engineer minimize attack surface applies to your own writing: every public utterance is surface.