
AI Handwriting Generator Part 2 — Polar Tokens & Cross-Attention GPT
Hello everyone!
Picking up where part 1 left off. The LSTM handwriting generator worked, but it had two problems I couldn't shake: the baseline drifted over long sequences, and style transfer was fragile — prime it wrong and you'd get gibberish halfway through a paragraph. Both felt like symptoms of a recurrent architecture that just doesn't have great long-range memory.
So I did what any reasonable 2025 developer does — I tried a transformer.
The new setup:
- 6-layer GPT-style decoder, about 102MB of weights.
- d_model = 384, 6 attention heads.
- Cross-attention over the character sequence instead of the soft Gaussian windows from part 1.
- A brand new tokenizer I'll explain below.
Polar tokenization. The original model predicted (dx, dy) as a mixture of Gaussians. Cute, but mixtures of Gaussians in 2D get messy to tune, and 20 components means 20 × 6 parameters per step (a mixture weight, two means, two standard deviations, and a correlation each) just for the distribution.
I switched to a discrete tokenizer working in polar coordinates: each pen offset gets quantized into an (angle, radius, end_of_stroke) triple. I used 128 angle bins × 64 radius bins, giving me a vocabulary of 128 × 64 × 2 = 16,384 tokens. Training becomes standard cross-entropy over this vocab.
Why polar? Two reasons. One, angles are more meaningful than raw dy for handwriting — strokes have consistent directional patterns. Two, radius bins let me use a non-linear scale, so short strokes get fine-grained bins and long strokes get coarse ones. This matched the empirical distribution of stroke lengths much better than uniform dx/dy quantization would have.
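Here's a minimal sketch of what such a tokenizer could look like. The bin counts (128 angles × 64 radii × end-of-stroke flag = 16,384 tokens) are from the post; the log-spaced radius bins and the `R_MAX` cutoff are my assumptions standing in for the "non-linear scale" — short strokes get fine bins, long strokes coarse ones.

```python
import numpy as np

A_BINS, R_BINS = 128, 64   # bin counts from the post
R_MAX = 50.0               # assumed max offset length, in normalized pen units

def encode_offset(dx, dy, eos, r_max=R_MAX):
    """Quantize one pen offset into a single token id.

    Token id = angle_bin * (R_BINS * 2) + radius_bin * 2 + eos,
    so the vocabulary size is 128 * 64 * 2 = 16,384.
    """
    angle = np.arctan2(dy, dx)  # [-pi, pi)
    a_bin = int((angle + np.pi) / (2 * np.pi) * A_BINS) % A_BINS
    r = min(np.hypot(dx, dy), r_max)
    # Log-spaced radius bins (assumed): resolution is finest near r = 0.
    r_bin = min(int(np.log1p(r) / np.log1p(r_max) * R_BINS), R_BINS - 1)
    return a_bin * (R_BINS * 2) + r_bin * 2 + int(eos)

def decode_token(tok, r_max=R_MAX):
    """Invert the tokenizer, mapping a token back to its bin centers."""
    a_bin, rem = divmod(tok, R_BINS * 2)
    r_bin, eos = divmod(rem, 2)
    angle = (a_bin + 0.5) / A_BINS * 2 * np.pi - np.pi
    r = np.expm1((r_bin + 0.5) / R_BINS * np.log1p(r_max))
    return r * np.cos(angle), r * np.sin(angle), bool(eos)
```

A round trip through encode/decode loses only the quantization error, which by construction is small for short offsets and larger for long ones.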
Cross-attention conditioning. Each layer's cross-attention lets every stroke token attend to any position in the character sequence. The first few layers mostly learn "what character am I writing," and the later layers learn "where in that character am I." The baseline drift from part 1 just… disappeared. The model maintains horizontal alignment over multi-line text.
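For concreteness, here's roughly what one decoder block looks like in PyTorch. d_model = 384 and 6 heads are from the post; the pre-LN layout, the 4× FFN width, and the block ordering (causal self-attention, then cross-attention over the character encoding) are my assumptions about a standard GPT-style decoder with cross-attention.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder block: causal self-attention over stroke tokens,
    then cross-attention over the encoded character sequence.
    Dimensions follow the post; the internal layout is assumed."""

    def __init__(self, d_model=384, n_heads=6):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2, self.ln3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, char_memory):
        # Causal mask: each stroke token only sees earlier stroke tokens.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        x = x + self.self_attn(h, h, h, attn_mask=mask)[0]
        # Cross-attention: every stroke token can attend anywhere in the
        # character sequence, replacing part 1's soft Gaussian windows.
        h = self.ln2(x)
        x = x + self.cross_attn(h, char_memory, char_memory)[0]
        return x + self.ff(self.ln3(x))
```

Stacking six of these, with the character sequence encoded once and reused as `char_memory` at every layer, gives the architecture described above.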
Style transfer — now with embeddings. Instead of priming the hidden state, I added a small style-embedding table. The user picks a style index (I support up to 10 custom styles per deployment) and the model conditions on the embedding via a learned projection into every layer. Training uses a contrastive objective: two samples from the same author should produce similar embeddings; two samples from different authors should be far apart. At inference, you can pick an existing style slot or provide a few lines of reference handwriting, and the model finds the closest style embedding.
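The contrastive objective could be sketched like this. The post doesn't specify the exact loss, so this pairwise cosine-margin form (and the margin value) is an assumption; it just captures "same author → similar embeddings, different authors → far apart."

```python
import torch
import torch.nn.functional as F

def style_contrastive_loss(emb_a, emb_b, same_author, margin=0.5):
    """Pairwise contrastive loss on style embeddings (a sketch; the exact
    objective and margin are assumptions).

    emb_a, emb_b: (B, D) embeddings from two handwriting samples.
    same_author:  (B,) bool tensor marking which pairs share an author.
    """
    sim = F.cosine_similarity(emb_a, emb_b, dim=-1)        # (B,)
    pos = (1.0 - sim) * same_author.float()                # pull same-author pairs together
    neg = F.relu(sim - margin) * (~same_author).float()    # push others below the margin
    return (pos + neg).mean()
```

At inference, matching a few reference lines to the closest of the 10 style slots would then just be a nearest-neighbor lookup in this embedding space.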
Numbers:
- Training: ~40 hours on a single A100.
- Inference: ~1.2 seconds per line on a laptop CPU, ~150ms on GPU.
- Character accuracy (measured by OCR'ing the output and comparing to the input): ~96%, up from ~82% on the LSTM.
Where it's deployed: hand-magic.com. Try it. Please actually try it, and tell me when it fails — some of the weirdest failure modes only show up on specific uncommon character combinations, and I genuinely can't predict which.
What's next: I want to push the decoder to handle longer context (currently I truncate at ~500 strokes) and experiment with emitting the stroke thickness as a separate head for pressure-style handwriting. If you have thoughts, reach out.
More soon!