
AI Handwriting Generator Part 1 — LSTMs & Gaussian Attention
Hey everyone!
I've been hacking on an AI handwriting generator for the past few months and I'm finally at a place where I can write it up. The short version of the project: type in some text, pick a handwriting style, get back a PNG of that text rendered in believable-looking handwriting. This first post is about the LSTM-based approach; part 2 will cover the transformer rewrite that came later.
The core idea is to model handwriting as a sequence of pen-stroke offsets, not as an image. Each "token" is a 3-tuple — (dx, dy, end_of_stroke) — representing how far the pen moves from the previous point and whether this point ends a stroke. A character is 30–80 of these tokens. A word is a longer sequence. Generating a page becomes "generate one very long sequence."
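To make the token format concrete, here is a minimal numpy sketch of turning absolute pen positions into (dx, dy, end_of_stroke) tuples. The function name `points_to_tokens` and the toy two-stroke input are my own illustrative choices, not the project's actual preprocessing code:

```python
import numpy as np

def points_to_tokens(points, stroke_ends):
    """Convert absolute pen positions into (dx, dy, end_of_stroke) tokens.

    points: (N, 2) array of absolute (x, y) pen positions.
    stroke_ends: length-N boolean array, True where a stroke ends.
    """
    # offset of each point from the previous one; the first token moves (0, 0)
    offsets = np.diff(points, axis=0, prepend=points[:1])
    return np.concatenate([offsets, stroke_ends[:, None].astype(float)], axis=1)

# a tiny two-stroke example: pen draws right, lifts, then draws down
points = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [2.0, -1.0]])
ends = np.array([False, False, True, True])
tokens = points_to_tokens(points, ends)  # shape (4, 3)
```

Working in offsets rather than absolute coordinates is what makes the sequence roughly stationary, so one model can write anywhere on the page.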
Model: a 3-layer LSTM with hidden size 400, about 14MB of weights total. It predicts the parameters of a mixture of 20 bivariate Gaussians over (dx, dy) at each step, plus a Bernoulli probability for the end_of_stroke flag. Training loss is the negative log likelihood of the observed next offset under that predicted distribution. This is straight out of Graves 2013 — Generating Sequences With Recurrent Neural Networks — and it still works great in 2025.
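For readers who haven't seen mixture density networks, here is a sketch of that loss for a single timestep, following the bivariate Gaussian mixture formulation in Graves 2013. This is a numpy illustration with my own function signature, not the training code itself:

```python
import numpy as np

def mdn_nll(dx, dy, eos, pi, mu, sigma, rho, e):
    """Negative log-likelihood of one (dx, dy, eos) target under a mixture
    of K bivariate Gaussians plus a Bernoulli end-of-stroke probability.

    pi: (K,) mixture weights summing to 1; mu: (K, 2); sigma: (K, 2) > 0;
    rho: (K,) correlations in (-1, 1); e: scalar end-of-stroke probability.
    """
    zx = (dx - mu[:, 0]) / sigma[:, 0]
    zy = (dy - mu[:, 1]) / sigma[:, 1]
    z = zx**2 + zy**2 - 2.0 * rho * zx * zy
    one_m_r2 = 1.0 - rho**2
    norm = 2.0 * np.pi * sigma[:, 0] * sigma[:, 1] * np.sqrt(one_m_r2)
    density = np.exp(-z / (2.0 * one_m_r2)) / norm
    stroke_nll = -np.log(np.sum(pi * density) + 1e-12)
    eos_nll = -np.log(e + 1e-12) if eos else -np.log(1.0 - e + 1e-12)
    return stroke_nll + eos_nll

# sanity check: one unit Gaussian at the origin, 50/50 end-of-stroke
nll = mdn_nll(0.0, 0.0, 0,
              pi=np.array([1.0]), mu=np.array([[0.0, 0.0]]),
              sigma=np.array([[1.0, 1.0]]), rho=np.array([0.0]), e=0.5)
```

In the real model, pi comes from a softmax, sigma from an exp, rho from a tanh, and e from a sigmoid over the LSTM's output, which keeps all parameters in their valid ranges.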
The interesting part is conditioning. If you just train on stroke sequences, the model will happily generate pretty scribbles that don't spell anything. To actually make it write the text you want, I added a soft Gaussian attention window over the character embeddings. At every step, the model emits shift parameters that move the window's K=10 Gaussian components forward along the character sequence. The window softly picks out the current, next, and previous characters, and feeds them back into the LSTM.
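The window computation itself is only a few lines. Here is a numpy sketch of one step, in the style of the window equations from Graves 2013; `attention_window` and the toy one-hot embeddings are illustrative, and kappa is assumed to have already been accumulated across timesteps (kappa_t = kappa_{t-1} + shift):

```python
import numpy as np

def attention_window(alpha, beta, kappa, char_embs):
    """Soft Gaussian window over a character sequence.

    alpha, beta, kappa: (K,) window parameters for one timestep
    (importance, width, and accumulated position of each component).
    char_embs: (U, D) embeddings of the U characters being written.
    Returns (phi, w): per-character weights and the blended window vector.
    """
    u = np.arange(char_embs.shape[0])  # character positions 0..U-1
    phi = np.sum(
        alpha[:, None] * np.exp(-beta[:, None] * (kappa[:, None] - u[None, :])**2),
        axis=0)
    w = phi @ char_embs  # (D,) vector fed back into the LSTM
    return phi, w

# 5 characters with one-hot-ish embeddings; one sharp component centered on char 2
embs = np.eye(5, 4)
phi, w = attention_window(np.array([1.0]), np.array([100.0]), np.array([2.0]), embs)
```

Because kappa only ever accumulates positive shifts, the window can't move backwards, which is what forces the model to write the characters in order rather than skipping around.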
This is where things clicked. Once the attention mechanism is working, you can watch the training process and see the model learning to walk along the text in sync with the pen strokes — the window sweeps left-to-right roughly once per character. It's very cool.
Training details:
- Dataset: the IAM Online handwriting database plus my own captured samples.
- Optimizer: Adam with a cosine schedule, 1e-3 peak.
- Batch size: 32, sequence length: 700 tokens.
- Biggest gotcha: the gradient on the attention parameters is fragile. I had to clip aggressively (max norm 10) and warm up the LSTM for a few thousand steps before enabling the attention mechanism.
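The clipping I used is standard clip-by-global-norm; a minimal numpy version of the idea (the helper name and the max norm default are mine, though max norm 10 matches what I actually trained with):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=10.0):
    """Rescale a list of gradient arrays so their combined L2 norm is <= max_norm."""
    total = np.sqrt(sum(float(np.sum(g**2)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

# a gradient list with global norm 30 gets scaled down to norm 10
grads = [np.full(100, 3.0), np.zeros(10)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=10.0)
```

In practice any framework's built-in (e.g. PyTorch's `clip_grad_norm_`) does the same thing; the point is that it's the global norm across all parameters that gets clipped, not each tensor independently.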
Style transfer: I wanted the model to imitate a specific person's handwriting, not just produce a generic "handwriting style." The trick was to condition on a short sample of the target handwriting at generation time, priming the LSTM's hidden state. It works surprisingly well with as little as one sentence of input.
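Mechanically, priming just means running the style sample's stroke tokens through the recurrence before you start sampling, so generation begins from a state shaped by that handwriting instead of zeros. A toy sketch of the mechanics, using a single hand-rolled LSTM cell with random weights standing in for the trained model (sizes and the random "style sample" are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 8, 3  # hidden size, input size: (dx, dy, end_of_stroke)

# one toy LSTM cell with random weights, standing in for the trained model
W = rng.normal(0, 0.1, size=(4 * H, D + H))
b = np.zeros(4 * H)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c):
    gates = W @ np.concatenate([x, h]) + b
    i, f, g, o = np.split(gates, 4)  # input, forget, candidate, output gates
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

# priming: feed the style sample's stroke tokens through the cell first
h, c = np.zeros(H), np.zeros(H)
style_sample = rng.normal(size=(50, D))  # stand-in for real captured strokes
for x in style_sample:
    h, c = lstm_step(x, h, c)
# h and c now seed the generation loop instead of zeros
```

The real model also runs the attention window over the style sample's text during priming, so the window starts at the beginning of the new text when generation begins.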
What's rough:
- Text rendered by this model sometimes drifts vertically — the baseline isn't perfectly stable.
- Long words get visibly shaky near the end.
- Characters with multiple strokes (like lowercase x or t) have weird pen-lift behavior.
I have some ideas for fixing these, but I also started exploring a different architecture entirely — a cross-attention GPT decoder with a polar tokenizer — and it ended up dramatically outperforming the LSTM on the drift issue. That's the subject of part 2.
More soon!