
LSTM Handwriting Generator in PyTorch (Graves 2013 Redux)

February 10, 2025 · 5 min read · 894 words · Updated Apr 24, 2026

An LSTM handwriting generator in PyTorch, built on the Graves 2013 architecture (mixture-density outputs + Gaussian attention over characters), produces legible, styled handwriting from a short prompt. This is Part 1 of two posts documenting the build — a 3-layer LSTM with hidden size 400, 14MB of weights, trained on IAM Online plus my own captured samples. Part 2 is the transformer rewrite that beat it. The live version ships on hand-magic.com.

TL;DR

Architecture: 3-layer LSTM, hidden=400, 14MB weights, straight from Graves 2013.

Output: Mixture of 20 bivariate Gaussians over (dx, dy) plus a Bernoulli over end_of_stroke.

Conditioning: K=10 soft Gaussian attention windows over the character sequence.

Result: Believable handwriting with honest drift problems past long words — fixed in Part 2.

The core idea: handwriting is a sequence, not an image

Handwriting is modeled as a sequence of pen-stroke offsets, not pixels. Each "token" is a 3-tuple:

# Each training token
(dx, dy, end_of_stroke)
# dx, dy: real-valued pen movement from previous point
# end_of_stroke: 1 if this point ends a stroke (pen lifts), else 0

A character is 30–80 of these tokens. A word is a longer sequence. Generating a full page becomes "generate one very long sequence." This representation is why LSTMs do so well here — long-range rhythm across a word is exactly what recurrence is good at capturing.
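To make the representation concrete, here is a minimal sketch of turning absolute pen coordinates into offset tokens and back. The function names, the choice to start the pen at the origin, and the toy coordinates are assumptions for illustration, not the project's actual preprocessing.

```python
from typing import List, Tuple

Point = Tuple[float, float]       # absolute (x, y) pen position
Token = Tuple[float, float, int]  # (dx, dy, end_of_stroke)


def strokes_to_tokens(strokes: List[List[Point]]) -> List[Token]:
    """Flatten a list of strokes (each a list of absolute points) into offset tokens."""
    tokens: List[Token] = []
    prev_x, prev_y = 0.0, 0.0  # assumption: the pen starts at the origin
    for stroke in strokes:
        for i, (x, y) in enumerate(stroke):
            # The last point of a stroke is where the pen lifts.
            end_of_stroke = 1 if i == len(stroke) - 1 else 0
            tokens.append((x - prev_x, y - prev_y, end_of_stroke))
            prev_x, prev_y = x, y
    return tokens


def tokens_to_strokes(tokens: List[Token]) -> List[List[Point]]:
    """Inverse mapping: accumulate offsets and split wherever end_of_stroke is 1."""
    strokes: List[List[Point]] = []
    current: List[Point] = []
    x, y = 0.0, 0.0
    for dx, dy, end_of_stroke in tokens:
        x, y = x + dx, y + dy
        current.append((x, y))
        if end_of_stroke:
            strokes.append(current)
            current = []
    if current:  # trailing points with no pen lift yet
        strokes.append(current)
    return strokes


# Two toy strokes: a short diagonal, then a vertical bar drawn after a pen lift.
example = [[(0.0, 0.0), (1.0, 2.0), (3.0, 3.0)], [(5.0, 0.0), (5.0, 4.0)]]
example_tokens = strokes_to_tokens(example)
```

Note that the pen-up move between strokes is just another offset: the jump from the end of one stroke to the start of the next is encoded in the first token of the new stroke.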

The model

A 3-layer LSTM with hidden size 400, about 14MB of weights total. At every step it predicts the parameters of a mixture of 20 bivariate Gaussians over (dx, dy), plus a Bernoulli probability for the end_of_stroke flag. Training loss is the negative log-likelihood of the observed next offset under that predicted distribution.

# High-level forward pass sketch (PyTorch)
h, _ = self.lstm(strokes_in)              # (B, T, 400)
out = self.mdn_head(h)                    # (B, T, 6*M + 1) for M=20 mixtures
pi, mu, sigma, rho, eos = split_mdn(out)  # mixture weights, means, stds, corr, end-of-stroke
loss = -(mdn_log_prob(strokes_tgt, pi, mu, sigma, rho)
         + bernoulli_log_prob(eos_tgt, eos))

This is straight out of Graves 2013 (Generating Sequences With Recurrent Neural Networks), and it still works great over a decade later.
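For readers who want to see what the 6*M + 1 outputs turn into, here is a plain-Python sketch of the loss for a single time step. It follows the formulation in the Graves paper (softmax for the mixture weights, exp for the standard deviations, tanh for the correlations). The ordering of values inside the raw vector and the names `split_mdn_step` and `step_nll` are assumptions for illustration; the real model does this in batched PyTorch tensors and would normally use a log-sum-exp for numerical stability.

```python
import math
from typing import List, Sequence, Tuple


def split_mdn_step(raw: Sequence[float], m: int):
    """Map 6*m + 1 raw outputs to valid mixture parameters for one time step.

    Assumed layout: [pi (m), mu_x (m), mu_y (m), sigma_x (m), sigma_y (m), rho (m), eos (1)].
    """
    if len(raw) != 6 * m + 1:
        raise ValueError("expected 6*m + 1 raw outputs")
    # Softmax over mixture weights (subtract the max before exponentiating).
    top = max(raw[0:m])
    exps = [math.exp(v - top) for v in raw[0:m]]
    total = sum(exps)
    pi = [e / total for e in exps]
    mu: List[Tuple[float, float]] = list(zip(raw[m:2 * m], raw[2 * m:3 * m]))
    # exp keeps standard deviations positive; tanh keeps correlations in (-1, 1).
    sigma = [(math.exp(a), math.exp(b)) for a, b in zip(raw[3 * m:4 * m], raw[4 * m:5 * m])]
    rho = [math.tanh(v) for v in raw[5 * m:6 * m]]
    eos = 1.0 / (1.0 + math.exp(-raw[6 * m]))  # Bernoulli probability of a pen lift
    return pi, mu, sigma, rho, eos


def step_nll(raw: Sequence[float], m: int, dx: float, dy: float, eos_tgt: int) -> float:
    """Negative log-likelihood of one observed (dx, dy, end_of_stroke) token."""
    pi, mu, sigma, rho, eos = split_mdn_step(raw, m)
    density = 0.0
    for j in range(m):
        zx = (dx - mu[j][0]) / sigma[j][0]
        zy = (dy - mu[j][1]) / sigma[j][1]
        one_minus_rho2 = 1.0 - rho[j] ** 2
        z = zx * zx + zy * zy - 2.0 * rho[j] * zx * zy
        norm = 2.0 * math.pi * sigma[j][0] * sigma[j][1] * math.sqrt(one_minus_rho2)
        density += pi[j] * math.exp(-z / (2.0 * one_minus_rho2)) / norm
    p_eos = eos if eos_tgt == 1 else 1.0 - eos
    return -(math.log(density) + math.log(p_eos))
```

With M=20 the raw vector has 121 entries, which matches the `6*M + 1` head size in the forward-pass sketch.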

The interesting part: conditioning with Gaussian attention

If you just train on stroke sequences, the model will happily generate pretty scribbles that don't spell anything. To actually make it write the text you type in, the LSTM needs to look at a character sequence as it generates strokes. I added a soft Gaussian attention window over the character embeddings.

At every step, the LSTM emits a shift parameter that moves a set of K=10 Gaussian windows forward along the character sequence. Those windows softly pick out the current, next, and previous characters, and feed the weighted sum of their embeddings back into the LSTM.

This is where things clicked. Once attention works, you can watch the training process and see the model learn to walk the window along the text in sync with the pen strokes — sweeping left-to-right roughly once per character. It's one of the most satisfying visualizations in ML.
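The window mechanics can be sketched in a few lines of plain Python. This follows the equations in the Graves paper, where each of the K windows has an importance, a width, and a position that can only move forward. The function name `window_step`, the argument names, and the one-hot stand-ins for character embeddings are assumptions, and the project's own implementation may differ in detail.

```python
import math
from typing import List, Sequence, Tuple


def window_step(
    alpha_hat: Sequence[float],   # raw importance per window (K values)
    beta_hat: Sequence[float],    # raw width per window (K values)
    kappa_hat: Sequence[float],   # raw shift per window (K values)
    kappa_prev: Sequence[float],  # window positions from the previous step
    char_vecs: Sequence[Sequence[float]],  # one embedding per character
) -> Tuple[List[float], List[float], List[float]]:
    """One attention step: returns (window vector, new positions, per-character weights)."""
    k = len(kappa_prev)
    alpha = [math.exp(a) for a in alpha_hat]
    beta = [math.exp(b) for b in beta_hat]
    # exp(kappa_hat) is always positive, so every window can only slide forward.
    kappa = [prev + math.exp(shift) for prev, shift in zip(kappa_prev, kappa_hat)]

    # Weight of character u is a sum of K Gaussian bumps centred at the kappa positions.
    phi = [
        sum(alpha[i] * math.exp(-beta[i] * (kappa[i] - u) ** 2) for i in range(k))
        for u in range(len(char_vecs))
    ]

    # The window vector is the phi-weighted sum of character embeddings.
    dim = len(char_vecs[0])
    window = [
        sum(phi[u] * char_vecs[u][d] for u in range(len(char_vecs)))
        for d in range(dim)
    ]
    return window, kappa, phi


# Toy usage: three characters, one window (the post uses K=10).
chars = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w1, k1, phi1 = window_step([0.0], [0.0], [0.0], [0.0], chars)
w2, k2, phi2 = window_step([0.0], [0.0], [0.0], k1, chars)
```

Plotting `phi` for every time step gives the alignment picture described above: a bright band that moves left to right across the text as the pen advances.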

Training details

| Choice | Value | Why |
|---|---|---|
| Dataset | IAM Online + my captures | Canonical stroke-level corpus |
| Optimizer | Adam, 1e-3 peak, cosine schedule | Classic; stable here |
| Batch size | 32 | Fits on one GPU |
| Sequence length | 700 tokens | Long enough for a short sentence |
| Gradient clip | Max-norm 10 | Attention gradients are fragile |
| Warmup | LSTM-only for ~3k steps | Then enable attention |
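As a rough illustration of three of these settings, here is a dependency-free sketch of a cosine learning-rate schedule, max-norm gradient clipping, and the attention switch. The peak rate, clip norm, and ~3k-step threshold come from the table; `TOTAL_STEPS`, the zero learning-rate floor, the absence of a learning-rate ramp, and all function names are assumptions.

```python
import math
from typing import List, Sequence

PEAK_LR = 1e-3           # from the table
MAX_GRAD_NORM = 10.0     # from the table
ATTENTION_START = 3_000  # from the table (~3k LSTM-only steps)
TOTAL_STEPS = 100_000    # hypothetical: the post does not state the run length


def cosine_lr(step: int, total_steps: int = TOTAL_STEPS, peak: float = PEAK_LR) -> float:
    """Cosine decay from `peak` at step 0 down to 0 at `total_steps`."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))


def clip_by_max_norm(grads: Sequence[float], max_norm: float = MAX_GRAD_NORM) -> List[float]:
    """Rescale a flat gradient vector so its L2 norm does not exceed `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def attention_enabled(step: int) -> bool:
    """LSTM-only phase first, then switch the attention window on."""
    return step >= ATTENTION_START


clipped = clip_by_max_norm([30.0, 40.0])  # norm 50 is scaled down to norm 10
```

In PyTorch the same pieces would typically be `torch.optim.lr_scheduler.CosineAnnealingLR` and `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10)`, called between `loss.backward()` and `optimizer.step()`.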

Style transfer, the cheap way

I wanted the model to imitate a specific handwriting, not just generate some generic "handwriting style." The trick was to condition on a short sample of the target handwriting at generation time — priming the LSTM's hidden state with one sentence of that author's strokes before the first real stroke is emitted.

It works surprisingly well with as little as one sentence of input. Not perfect (consistency degrades after a long paragraph), but good enough that friends could pick their own handwriting out of a lineup of samples.
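The priming loop itself is simple enough to sketch. Below, `step` stands for one forward pass of the recurrent model (consume a token, update the state, return a sampled next token). The toy step function is a deliberately trivial stand-in for the LSTM so the example runs on its own; it and every name here are hypothetical, and a real version would also have to carry the attention state across the hand-off.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

State = TypeVar("State")
Token = Tuple[float, float, int]  # (dx, dy, end_of_stroke)


def prime_then_generate(
    step: Callable[[Token, State], Tuple[Token, State]],
    state: State,
    prime_tokens: Sequence[Token],
    n_new: int,
) -> List[Token]:
    """Feed the style sample through the model, then keep generating from that state."""
    if not prime_tokens:
        raise ValueError("need at least one priming token")
    if n_new <= 0:
        return []
    token = prime_tokens[0]
    # Phase 1: teacher-force the real strokes. Predictions are discarded,
    # only the recurrent state (which now reflects the writer's style) is kept.
    for real in prime_tokens:
        token, state = step(real, state)
    # Phase 2: the prediction after the last priming token is the first new token;
    # from here on the model consumes its own outputs.
    generated = [token]
    while len(generated) < n_new:
        token, state = step(token, state)
        generated.append(token)
    return generated


def toy_step(token: Token, state: Tuple[float, float]) -> Tuple[Token, Tuple[float, float]]:
    """Stand-in for the LSTM: the 'state' is a running average of recent offsets."""
    new_state = (0.5 * state[0] + 0.5 * token[0], 0.5 * state[1] + 0.5 * token[1])
    return (new_state[0], new_state[1], 0), new_state


wide = prime_then_generate(toy_step, (0.0, 0.0), [(2.0, 0.0, 0)] * 5, n_new=3)
narrow = prime_then_generate(toy_step, (0.0, 0.0), [(0.5, 0.0, 0)] * 5, n_new=3)
```

Even with this toy cell, the two priming samples lead to different continuations, which is the whole trick: no weights change, only the starting state does.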

What's rough (and why Part 2 exists)

Vertical drift. Text rendered by this model sometimes wobbles off-baseline. By the end of a long word, the drift is visible.
Shaky ends. Long words get visibly tremulous near their trailing characters.
Multi-stroke characters. Lowercase x and t have weird pen-lift behavior — the model doesn't perfectly learn which sub-strokes belong to which character.
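One way to see why drift shows up late rather than early: the pen's vertical position is the running sum of every dy generated so far, so a tiny systematic error is never corrected and only grows. The sketch below uses a made-up bias of 0.01 units per token purely for illustration; it is not a measurement from the model.

```python
from itertools import accumulate

BIAS_PER_TOKEN = 0.01    # hypothetical systematic dy error per generated token
TOKENS_PER_CHAR = 80     # upper end of the 30-80 tokens per character quoted above

# Vertical error after each token for an 8-character word.
errors = [BIAS_PER_TOKEN] * (8 * TOKENS_PER_CHAR)
drift = list(accumulate(errors))

drift_after_first_char = drift[TOKENS_PER_CHAR - 1]
drift_at_end_of_word = drift[-1]
ratio = drift_at_end_of_word / drift_after_first_char
```

The error at the end of the word is eight times the error after the first character, even though the per-token error never changed.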

I had ideas for fixing these, but I also started exploring a different architecture entirely — a cross-attention GPT decoder with a polar tokenizer — and it ended up dramatically outperforming the LSTM on the drift issue. That's the subject of Part 2: Polar Tokens & Cross-Attention GPT.

Key takeaways

Stroke sequences beat images for handwriting generation — pen dynamics carry most of the signal.
Soft Gaussian attention over characters is the right conditioning mechanism for LSTM-based handwriting, and watching it align is the most memorable moment of the project.
Style priming via a short sample is a very cheap style-transfer trick that works for personal handwriting.

References

Alex Graves, 2013 — Generating Sequences With Recurrent Neural Networks (arXiv:1308.0850)
IAM Online Handwriting Database — dataset page
Four Experiments in Handwriting with a Neural Network (Distill)
Project source — github.com/OmarMusayev/ai-handwriting-generator
Live demo — hand-magic.com
Next in series — Part 2: Polar Tokens & Cross-Attention GPT

Frequently Asked Questions

What is Alex Graves' 2013 handwriting paper?

It's the paper 'Generating Sequences With Recurrent Neural Networks' (arXiv:1308.0850) — the foundational work that showed an LSTM with a mixture-density output layer and soft Gaussian attention over characters can produce realistic, legible handwriting. Over a decade later, the architecture still works.

Why represent handwriting as pen strokes instead of images?

Stroke sequences carry the temporal information about how a pen moves (direction, speed, pen lifts) that an image throws away. A 30–80-token sequence per character is far more compact than pixels and lets a recurrent model learn long-range rhythm across a whole word or sentence.

Why 20 bivariate Gaussians and K=10 attention windows?

20 mixture components give the model enough capacity to represent the multi-modal distribution of possible next pen positions without exploding the parameter count. K=10 attention windows cover the current, next, and previous characters with comfortable margin, which is what you need for handwriting to look continuous rather than character-at-a-time.

Why does the baseline drift on long sequences?

Recurrence has finite memory, and each pen position is the running sum of every offset generated so far. Small vertical errors therefore compound, and by the end of a long word the drift is visible. Clipping the gradient aggressively helps training stability, but the fix that actually eliminated drift was swapping to a cross-attention transformer in Part 2.

Where can I try the handwriting generator?

The live demo is at hand-magic.com. Type text, pick a style, get back a PNG of the text rendered as believable handwriting. The code is on GitHub at github.com/OmarMusayev/ai-handwriting-generator.

lstm, handwriting-generation, alex-graves, mixture-density-network, gaussian-attention, pytorch, iam-online, deep-learning
