Telegram has been my personal playground for experimenting with NLP and conversational AI. Over the past five years, I accumulated a private group chat with recurring inside jokes, emotional context, sarcastic replies, short one-liners, and messy multilingual threads — the exact opposite of ideal training data. This talk documents how I fine-tuned a Large Language Model on that chat history to create a chatbot that speaks in my style.
Instead of rehashing the already well-documented LoRA fine-tuning recipe, this session shows the real challenges behind turning unstructured human communication into a usable dataset. I’ll walk through how raw Telegram exports become prompt–response training pairs, why Instruct models refused to learn my dark humor, how tokenization mistakes destroyed training stability, and why a dataset of 30k private messages is worse than it looks.
Attendees will learn what works, what breaks, and what to avoid when training personality-driven LLMs. We’ll look at surprising insights: the model reproducing one-word time messages, the loss explosion caused by inappropriate humor threads, and how shorter context windows produced more coherent imitation of my tone.
This talk is an honest exploration of the technical, ethical, and psychological implications of building a digital version of yourself.
Outline:
Motivation: why attempt a digital self. From early Telegram bots and failed Seq2Seq models to attention-based LLMs; a project driven by curiosity, not necessity.
Data extraction: Telegram ≠ dataset. HTML exports converted to JSONL; group-chat chaos, multi-turn threads, and prompt–response formatting to teach conversational flow.
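The export-to-pairs step can be sketched as follows. The parsed message list and the `TARGET` speaker name are hypothetical placeholders; in practice the tuples would come from parsing the `messages.html` files Telegram Desktop produces.

```python
import json

# Hypothetical parsed messages: (sender, text) tuples in chat order.
# In a real pipeline these come from parsing Telegram's HTML export.
messages = [
    ("friend", "did you see the game?"),
    ("friend", "it was insane"),
    ("me", "obviously. 3am is prime football time"),
    ("me", "never sleeping again"),
    ("friend", "lol"),
]

TARGET = "me"  # the persona the model should imitate

def to_pairs(messages, target=TARGET):
    """Merge consecutive messages per sender into blocks, then pair each
    other-speaker block (prompt) with the following target block (response)."""
    blocks = []
    for sender, text in messages:
        if blocks and blocks[-1][0] == sender:
            blocks[-1][1].append(text)  # same sender: extend the block
        else:
            blocks.append([sender, [text]])
    pairs = []
    for prev, curr in zip(blocks, blocks[1:]):
        if prev[0] != target and curr[0] == target:
            pairs.append({"prompt": "\n".join(prev[1]),
                          "response": "\n".join(curr[1])})
    return pairs

# one JSON object per line = JSONL
jsonl_lines = [json.dumps(p, ensure_ascii=False) for p in to_pairs(messages)]
print("\n".join(jsonl_lines))
```

Merging consecutive messages first matters in group chats, where people send thoughts as bursts of short lines rather than single messages.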
Dataset engineering: real mistakes. Sarcasm, toxicity, and dark humor breaking models; overrepresented patterns (like one-word time replies); why “include everything” distorts personality.
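One way to tame an overrepresented pattern like the time replies is to cap how often any normalized response appears. This is a minimal sketch, not the exact curation pipeline from the talk; the digit-collapsing key is one assumed normalization among many possible ones.

```python
import re
from collections import Counter

def cap_duplicates(pairs, max_per_pattern=3):
    """Downsample over-represented responses (e.g. hundreds of one-word
    time replies) so no single pattern dominates training."""
    seen = Counter()
    kept = []
    for pair in pairs:
        # Normalize: lowercase and collapse digits so "14:20" and "9:05"
        # count as the same pattern.
        key = re.sub(r"\d", "#", pair["response"].strip().lower())
        seen[key] += 1
        if seen[key] <= max_per_pattern:
            kept.append(pair)
    return kept
```

A hard cap is blunt but transparent: it keeps a few examples of each habit so the style survives without letting any one tic define the persona.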
Model selection. TinyLlama’s hallucinations vs. Mistral’s nuance; why Instruct models reject certain humor; switching to the base Mistral-7B-v0.1 to unlock tone.
Training strategy with LoRA. QLoRA setup; tuning r, alpha, and dropout; small batches with gradient accumulation; avoiding catastrophic forgetting; real GPU costs (L4 vs. A100).
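The r/alpha knobs govern how much capacity the adapter has. A back-of-envelope sketch of why LoRA fits on a single L4: for each adapted weight matrix W of shape (d_out, d_in), LoRA trains factors A (r × d_in) and B (d_out × r), applied as W + (alpha / r) · B·A. The 4096 dimension below is illustrative, not an exact Mistral-7B internal.

```python
def lora_trainable_params(d_out, d_in, r):
    """Parameters added by one LoRA-adapted matrix:
    A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

# e.g. one 4096x4096 attention projection with r=16
full = 4096 * 4096                               # frozen base-weight params
lora = lora_trainable_params(4096, 4096, r=16)   # trainable adapter params
print(full, lora, f"{100 * lora / full:.2f}%")   # LoRA trains under 1% of it
```

Alpha does not change this count; it only scales the update, which is why r and alpha are usually tuned together.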
Tokenization: the easiest way to fail. Prompt+response concatenation with prompt masking; how token-length and masking errors ruined runs; respecting max_length.
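The prompt-masking idea can be sketched in a few lines. The token ids are made up for illustration; the -100 sentinel is the standard index that PyTorch cross-entropy ignores.

```python
IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss

def build_example(prompt_ids, response_ids, max_length=512):
    """Concatenate prompt+response token ids, masking the prompt positions
    in the labels so loss is computed only on the response tokens."""
    input_ids = (prompt_ids + response_ids)[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_length]
    return {"input_ids": input_ids, "labels": labels}

# hypothetical token ids from a tokenizer
ex = build_example([101, 7, 8], [42, 43, 102])
```

The truncation at `max_length` is where runs quietly die: if a long prompt pushes the response past the limit, almost every label in the example is -100 and the step contributes nearly nothing, or the loss becomes unstable across a batch of such examples.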
Inference and reality check. TinyLlama’s broken tone vs. Mistral capturing inside jokes; temperature shaping sarcasm; absurd, funhouse-mirror outputs.
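Temperature is just a rescaling of the logits before sampling, which is why it shapes tone so directly. A self-contained sketch of the standard mechanism (the logits are made up; real decoding would also combine this with top-p or top-k):

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Scale logits by 1/temperature, softmax, then sample a token index.
    Low temperature -> safe, repetitive replies; higher temperature ->
    the riskier word choices that sarcasm needs."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # inverse-CDF sampling over the categorical distribution
    u, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u <= acc:
            return i
    return len(probs) - 1
```

As temperature approaches zero the distribution collapses onto the argmax; past 1.0 it flattens, which is where the funhouse-mirror outputs come from.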
Unexpected lessons. Humor threads spike the loss, time replies dominate outputs, and shorter context windows capture tone better; the imitation is sometimes distorted but compelling.
Ethics and psychological complexity. Private data and consent, style vs. identity, when the model sounds “too close,” and why curation is a moral decision.