This one is off my usual beat. It is not a firewall or a packet broker — it is a small side project that turned out to be a cleaner demonstration of a privacy problem than anything I could have written about on purpose. I built a Discord bot that writes like a specific person, trained on a scraped archive of that person's real messages. The bot is a joke. The pipeline behind it is not.
What it is
The bot exposes a slash command, /lifesux, that returns a sentence in a
particular person's voice: self-deprecating, goofy, a little unhinged. A few variants sit
on top of the same core — /mad, /rant, and /cursed —
but they are all the same trick wearing different hats. The bot does not understand anything
it produces. It is a Markov chain in a costume.
The Markov chain is not the interesting part. It is old, well-understood technology, and that is precisely why it works here. The interesting part is the pipeline around it: take a real person's chat history, harvest it into a corpus, and weight that corpus so the generated output leans toward the parts of their personality you actually want to imitate. That is the point at which this stops being a toy and becomes a small preview of something less comfortable, which I will come back to.
How the generator works
The core of the bot is markovify, a Python library that builds a Markov chain
text model. A Markov chain predicts the next state based only on the current one. It has no
long-term memory, no grammar, and no plan. For text, the state is the last N words,
and the model only ever answers one question: given these last N words, what word tends to
follow, and with what probability?
I set state_size=2, a second-order chain, so the model keys off the previous
two words to choose the next. If the training data contains the phrase "i'm such a clown"
often enough, then whenever the chain produces the pair ("i'm", "such") it becomes likely to
follow with "a," and the pair ("such", "a") leans toward "clown." Chain enough of those
transitions together and the result is a sentence nobody actually wrote but that reads like
something the person would have said.
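The transition lookup described above can be sketched in a few lines of standard-library Python. This is an illustration of the idea, not markovify's implementation: it skips the sentence begin and end markers a real model tracks, and the example messages are hypothetical rather than drawn from the real corpus.

```python
from collections import Counter, defaultdict

def build_transitions(lines, state_size=2):
    """Map each tuple of `state_size` consecutive words to a Counter of next words."""
    transitions = defaultdict(Counter)
    for line in lines:
        words = line.split()
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            transitions[state][words[i + state_size]] += 1
    return transitions

def next_word_probabilities(transitions, state):
    """Turn the raw follow-up counts for one state into probabilities."""
    counts = transitions[state]
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Hypothetical messages, invented for this sketch.
corpus = [
    "i'm such a clown",
    "i'm such a clown honestly",
    "i'm such a mess today",
    "such a good day",
]

table = build_transitions(corpus)
print(next_word_probabilities(table, ("i'm", "such")))
print(next_word_probabilities(table, ("such", "a")))
```

In this toy corpus the pair ("i'm", "such") is always followed by "a", while ("such", "a") splits between "clown", "mess", and "good", with "clown" the most likely.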
State size is a trade-off, and it matters more than anything else in the model:
- state_size=1 produces near-random word salad. Each word looks at only one predecessor, so the output is grammatically loose and rarely coherent, though it does generate unexpected combinations.
- state_size=2 is the right setting for short, casual chat messages. It is coherent enough to read as a real sentence and loose enough to generate novel phrasing rather than reciting training lines verbatim. This is what I used.
- state_size=3 and higher increasingly reproduces the original messages, because a small corpus does not contain enough distinct three-word sequences for the chain to branch. At that point the model effectively plagiarizes its source.
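One rough way to see why higher orders recite their source is to count how many states have more than one possible next word. The sketch below is an assumption-laden illustration: branching_ratio is a hypothetical helper, not part of markovify, and the four messages are invented.

```python
from collections import defaultdict

def branching_ratio(lines, state_size):
    """Fraction of observed states that have more than one distinct next word."""
    followers = defaultdict(set)
    for line in lines:
        words = line.split()
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            followers[state].add(words[i + state_size])
    if not followers:
        return 0.0
    branching = sum(1 for nxt in followers.values() if len(nxt) > 1)
    return branching / len(followers)

# Hypothetical messages, invented for this sketch.
corpus = [
    "i am so bad at this game",
    "i am so tired of this game",
    "i am not good at this",
    "why am i so bad at math",
]

for n in (1, 2, 3):
    print(n, round(branching_ratio(corpus, n), 2))
```

On this corpus the ratio falls as the state size grows. A state with a single follower can only replay what the source said next, so a low ratio means the chain is mostly reciting.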
I also use markovify.NewlineText rather than the plain Text class.
NewlineText treats each line of the corpus as a self-contained sentence, which is the
correct choice for Discord data: each message is already a complete utterance, and
there is no reason to let the end of one message bleed into the start of the next.
Weighting the corpus
A Markov chain trained on someone's entire message history sounds like their average self, and the average self is mostly logistics: "yeah ok," "what time," "on my way." The goal of the bot was to capture a particular flavor — the self-deprecating, goofy humor — so before training I bias the corpus toward the messages that already carry it.
The mechanism is straightforward. As each message loads, it passes through a scoring function, and the high-scoring messages are duplicated in the training data. The chain does not know it has been steered. It simply encounters those phrasings more often, so the transition probabilities tilt toward them.
    def vibe_score(text):
        t = text.lower()
        score = 0
        for kw in SELF_DEPRECATION:   # "i'm such", "i suck", "my fault"...
            if kw in t:
                score += 2            # self-deprecation counts double
        for kw in GOOFY:              # "lmao", "bruh", "wait what"...
            if kw in t:
                score += 1
        return score
Self-deprecation keywords are worth two points each and goofy keywords one. The corpus builder then clones any message scoring above zero into the training set multiple times:
    lines.append(text)
    score = vibe_score(text)
    if score > 0:
        extra = min(score, 3) * WEIGHT   # WEIGHT = 4
        lines.extend([text] * extra)
A message with a vibe score of 1 therefore appears 1 + (1 × 4) = 5 times in
total, and a message scoring 3 or higher caps at 1 + (3 × 4) = 13 copies. The
min(..., 3) cap is deliberate. Without it, a single unusually on-brand message
could dominate the model and the bot would quote that one line indefinitely. The cap keeps
each message's influence strong without letting any one of them take over.
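The copy arithmetic can be checked with a small self-contained sketch. The keyword lists here are short hypothetical stand-ins for the real ones, and copies_for and build_corpus are names I introduce for illustration. Only WEIGHT = 4 and the cap of 3 come from the actual builder.

```python
WEIGHT = 4      # value stated above
SCORE_CAP = 3   # the min(..., 3) cap

# Hypothetical keyword lists; the real ones are longer.
SELF_DEPRECATION = ["i'm such", "i suck", "my fault"]
GOOFY = ["lmao", "bruh", "wait what"]

def vibe_score(text):
    """Two points per self-deprecation keyword, one per goofy keyword."""
    t = text.lower()
    score = sum(2 for kw in SELF_DEPRECATION if kw in t)
    score += sum(1 for kw in GOOFY if kw in t)
    return score

def copies_for(score, weight=WEIGHT, cap=SCORE_CAP):
    """Total appearances of one message: the original plus the capped extras."""
    return 1 + min(score, cap) * weight

def build_corpus(messages):
    """Repeat each message according to its vibe score."""
    lines = []
    for text in messages:
        lines.extend([text] * copies_for(vibe_score(text)))
    return lines

messages = ["on my way", "bruh", "i'm such a clown lmao bruh"]
for text in messages:
    score = vibe_score(text)
    print(score, copies_for(score), text)
```

The three sample messages score 0, 1, and 4, which gives 1, 5, and 13 copies. The last one hits the cap, so it counts the same as a message scoring exactly 3.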
The WEIGHT constant, set to 4, is the master volume control for the
personality. Raise it and the bot becomes a caricature in which every line drips with
self-deprecation. Lower it toward 1 and the output relaxes back into the person's ordinary
voice. A value of 4 reads as recognizably them, only funnier.
The same idea, reused per command
That scoring approach is reused at generation time as well as during training. The
generate_line_scored() function generates roughly twenty candidate sentences,
filters them to a reasonable length, sorts them by a scoring function, and selects at random
from the top five:
    pool.sort(key=scorer, reverse=True)
    return random.choice(pool[:5])
Choosing randomly from the top five, rather than always taking the single highest scorer, is what keeps the output from repeating itself. The results stay on theme while rarely coming out the same way twice. Each command passes in a different scoring function, so all of them share one engine:
- /lifesux ranks by vibe_score (self-deprecating and goofy) and serves as the default personality.
- /mad ranks by aggro_score, then uppercases the result and appends "!!!".
- /cursed ranks by cursed_score, randomly capitalizes words, and scatters in emoji.
- /rant is the most elaborate. It generates four separate aggressive lines and stitches them into a paragraph with an opener ("okay so,"), escalators ("and on top of that,"), a closer ("and i'm done"), and a few randomly injected emoji. It is a Markov chain imitating a structured argument.
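The shared engine might look roughly like the sketch below. It is a reconstruction under assumptions, not the bot's actual code: the length bounds, the keyword list inside aggro_score, and the make_candidate callable (a stand-in for the Markov model) are all hypothetical. The candidate count of twenty, the top-five choice, and the /mad post-processing follow the description above.

```python
import random

def generate_line_scored(make_candidate, scorer, n_candidates=20,
                         min_len=10, max_len=140, top_k=5):
    """Generate candidates, keep those of reasonable length, pick one of the best."""
    pool = []
    for _ in range(n_candidates):
        line = make_candidate()   # may return None, as a Markov model can
        if line and min_len <= len(line) <= max_len:
            pool.append(line)
    if not pool:
        return None
    pool.sort(key=scorer, reverse=True)
    return random.choice(pool[:top_k])

def aggro_score(text):
    """Hypothetical scorer: counts a few angry keywords."""
    t = text.lower()
    return sum(t.count(kw) for kw in ("hate", "worst", "sick of"))

def mad(make_candidate):
    """/mad: rank by aggro_score, uppercase the result, append '!!!'."""
    line = generate_line_scored(make_candidate, aggro_score)
    return None if line is None else line.upper() + "!!!"

# Stand-in for the Markov model so the sketch runs on its own.
SAMPLES = [
    "i hate this game so much",
    "what time is it",
    "worst day ever honestly",
    "ok",
    "i'm sick of losing every round",
]
print(mad(lambda: random.choice(SAMPLES)))
```

Passing the scorer in as an argument is what lets each command reuse the same function. Only the ranking and the post-processing differ between commands.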
Limitations
None of this is intelligence, and it is worth being clear about that. The bot has no idea what any word means. The self-deprecation it produces is simply a set of statistically likely transitions that happen to originate in self-deprecating source messages. The keyword lists are hand-written and brittle, catching the obvious markers and missing anything subtle or sarcastic. And the whole thing depends on corpus size. A few hundred messages produce a parrot; tens of thousands produce something uncomfortably convincing. That last point leads directly to the part of this project that actually matters.
Where the data came from
The JSON file that feeds this model was not generated. It was harvested — a scraped archive of one real person's messages, pulled out of a Discord server and saved so a program could study how they write. For a bot that exists to tease a friend, that is harmless. But it is worth stepping back to look at the shape of what happened, because the shape is the whole point.
I took a corpus of one identifiable person's unedited, off-the-cuff writing and built a system that produces new text in their voice on demand. The output is currently short, crude, and obviously machine-generated, but that is only a function of the engine being a toy. The pipeline itself — harvesting a person's public text, modeling their style, and generating utterances they never actually wrote — is the same pipeline behind every text-based impersonation system on the horizon. It is simply running on a lawnmower engine rather than a jet turbine.
With a Markov chain, the worst this yields is a bot that sounds vaguely like you. Scale the model up and the same input produces something quite different. A large language model fine-tuned on one person's writing can reproduce not only vocabulary but cadence, opinions, recurring jokes, and characteristic mistakes — often convincingly enough to fool people who know that person well. The harvesting step does not change. Only the engine does.
Why this is no longer only about reputation
The old advice to watch what you post was about reputation: do not say something today that embarrasses you later. The newer concern is different in kind. What you say does not merely reflect on you; it can teach a model to be you. And it is genuinely unclear what the engine at the far end of this pipeline will look like in five or ten years. A few directions are already visible:
- Text impersonation. A model trained on your message history that drafts messages in your voice and is used to manipulate the people who trust you. "It's me, my phone died, can you send me that code" becomes far more effective when it actually writes like you, references your private jokes, and reproduces your typing habits.
- Voice and video synthesis. Text is the cheap and abundant input, but the same harvest-and-model approach applies to recordings of a voice or face. A few minutes of clean audio already supports usable voice cloning, and the data requirements are falling, not rising.
- Clones built without consent. Models assembled from a person's data without their knowledge, or after their death, and presented as though the person is still speaking. The material needed to build them is being archived continuously, by all of us, at no cost to anyone.
The point worth landing is that none of this requires a breakthrough. The Markov bot demonstrates the full pipeline, end to end, using libraries a hobbyist can install in an afternoon. Harvesting is trivial. Modeling is a solved problem at every quality tier. The only remaining variable is how capable the engine becomes, and that is the one variable moving steadily in a single direction.
What to do about it
There is not much that is reassuring to say. You cannot retract what is already archived, and going silent is not realistic for most people. The practical conclusions are modest. Assume that anything you post publicly is permanent and attributable. Treat "public" as a synonym for "in someone's training set." Be skeptical of text or audio that sounds like someone you know but arrives through an unverified channel. And, because the individual level offers little leverage here, support consent and provenance standards around using a person's data to model them. The technology is neutral. A bot that teases a friend and a tool that impersonates them to empty their bank account are the same code with a different engine and a different intent. We built the harmless version. The harmful version is the same afternoon's work for anyone who means it.