Cursive Transformer Handwriting: Polar Tokens + Cross-Attention GPT

5 min read, 832 words, updated Apr 24, 2026

A 6-layer cross-attention GPT decoder with a polar-coordinate stroke tokenizer produces cursive handwriting at 96% character accuracy — up from 82% with the LSTM in Part 1 — and the baseline-drift problem disappears entirely. This is Part 2 of the handwriting generator write-up. The model powers hand-magic.com. It's also the architecture that recently showed up in The Cursive Transformer (arXiv:2504.00051) — two projects converging on the same polar-GPT recipe independently.

TL;DR

  • 6-layer GPT decoder, d_model=384, 6 heads, ~102MB of weights.
  • Polar tokenizer: each pen offset → (angle_bin, radius_bin, end_of_stroke) — 128 × 64 × 2 = 16,384 tokens.
  • Cross-attention over the character sequence replaces the LSTM's Gaussian windows.
  • Style embedding with contrastive training beats hidden-state priming.
  • Result: 96% character accuracy, zero baseline drift, 150ms/line on GPU.

What was wrong with the LSTM version

The LSTM handwriting generator worked, but it had two problems I couldn't shake:

  1. Baseline drift. Vertical error compounded over long words. The last character of a paragraph would float half a line above or below the first.
  2. Fragile style transfer. Priming the LSTM's hidden state with a sample of target handwriting worked — until it didn't. Prime it wrong and you'd get recognizable gibberish halfway through.

Both felt like symptoms of a recurrence without strong long-range memory. So I did what any reasonable 2025 developer does — I tried a transformer.

The polar tokenizer

The original model predicted (dx, dy) as a mixture of Gaussians. Cute, but 20 Gaussian components means 6 × 20 = 120 output parameters per step (a mixture weight, two means, two standard deviations, and a correlation for each component) just to parameterize the distribution, and mixtures of Gaussians in 2D get messy to tune.

I switched to a discrete tokenizer working in polar coordinates. Each pen offset is quantized into an (angle, radius, end_of_stroke) triple:

# pseudocode: per-step pen offset → three discrete bins, which together pick one token
angle_bin  = quantize_angle(atan2(dy, dx), n_bins=128)   # 128 bins, 2π/128 radians per bin
radius_bin = quantize_radius(sqrt(dx*dx + dy*dy), n_bins=64, scale="log")
eos_bin    = 0 or 1                                      # 1 marks the end of a stroke
vocab_size = 128 * 64 * 2  # = 16,384 (angle, radius, eos) combinations

Training is now plain cross-entropy over this vocab. No more MDN log-likelihood.
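
To make the pseudocode above concrete, here is a minimal runnable sketch of one way to pack the three bins into a single token id and unpack it again. The radius range (`R_MIN`, `R_MAX`), the bin-center decoding, and the packing order are assumptions for illustration; the post does not specify them.

```python
import math

N_ANGLE, N_RADIUS = 128, 64      # bin counts from the post
R_MIN, R_MAX = 0.01, 100.0       # ASSUMED radius range, in arbitrary pen units

def encode(dx, dy, eos):
    """Map one pen offset to a single token id in [0, 16384)."""
    angle = math.atan2(dy, dx) % (2 * math.pi)             # wrap into [0, 2*pi)
    a = min(int(angle / (2 * math.pi) * N_ANGLE), N_ANGLE - 1)
    r = min(max(math.hypot(dx, dy), R_MIN), R_MAX)         # clamp to the range
    frac = math.log(r / R_MIN) / math.log(R_MAX / R_MIN)   # position in log space
    b = min(int(frac * N_RADIUS), N_RADIUS - 1)
    return (a * N_RADIUS + b) * 2 + int(eos)               # ASSUMED packing order

def decode(token):
    """Invert encode() up to quantization error, using bin centers."""
    eos = token % 2
    a, b = divmod(token // 2, N_RADIUS)
    angle = (a + 0.5) * 2 * math.pi / N_ANGLE
    r = R_MIN * (R_MAX / R_MIN) ** ((b + 0.5) / N_RADIUS)
    return r * math.cos(angle), r * math.sin(angle), eos
```

Round-tripping an offset through `decode(encode(dx, dy, eos))` returns a point within one bin of the original, which is the quantization error the model has to live with.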

Why polar?

  • Angles are more meaningful than raw dy for handwriting. Strokes have consistent directional patterns — upstrokes, crossbars, loops — that map cleanly onto angle bins.
  • Log-scaled radius bins match the empirical distribution of stroke lengths. Short strokes get fine bins; long strokes get coarse ones.
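
The effect of log scaling is easy to see by printing the bin edges. This is a small sketch, assuming a radius range of 0.01 to 100 (a hypothetical choice, not a value from the post): every bin covers the same ratio of radii, so absolute bin width grows with stroke length.

```python
def log_bin_edges(r_min, r_max, n_bins):
    """Return n_bins + 1 edges spaced evenly in log(radius)."""
    ratio = (r_max / r_min) ** (1.0 / n_bins)
    return [r_min * ratio ** i for i in range(n_bins + 1)]

edges = log_bin_edges(0.01, 100.0, 64)   # ASSUMED range, 64 bins as in the post
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
print(f"narrowest bin: {widths[0]:.5f}, widest bin: {widths[-1]:.2f}")
```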

The Cursive Transformer (arXiv:2504.00051), published March 2025, arrives at essentially the same tokenization independently. Two projects, same polar-GPT recipe, same reasoning.

Cross-attention conditioning

Each decoder layer has a cross-attention block that attends to the character sequence. Unlike the LSTM's shifting Gaussian windows, any output token can attend anywhere in the prompt. In practice, the lower layers learn "what character am I currently writing" and the upper layers learn "what stroke within that character am I on."
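
The core operation can be sketched in a few lines of plain Python. This is a didactic single-head version without the learned query/key/value projections, multiple heads, or residual connections a real decoder layer would have; the toy vectors are made up. The point it illustrates is that there is no window or mask on the text side, so each stroke token gets a nonzero weight on every character.

```python
import math

def softmax(xs):
    m = max(xs)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    queries: one vector per stroke token (decoder side)
    keys, values: one vector per character (text side)
    Returns (outputs, weights); weights[i][j] is how much
    stroke token i attends to character j.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        w = softmax(scores)
        out = [sum(wj * v[d] for wj, v in zip(w, values))
               for d in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy example: 3 characters, 2 stroke tokens, 2-dimensional vectors.
char_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
char_values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
stroke_queries = [[4.0, 0.0], [0.0, 4.0]]
outputs, weights = cross_attention(stroke_queries, char_keys, char_values)
```

In a framework implementation the same computation would typically come from a library attention module with learned projections; the plain-Python version is only meant to show the data flow.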

The baseline drift from Part 1 just... disappeared. The model maintains perfectly horizontal alignment over multi-line text.

Style embedding with contrastive training

Instead of priming the hidden state (the LSTM trick), the transformer has a small style-embedding table — up to 10 style slots per deployment. A learned projection injects the style embedding into every decoder layer.

Training uses a contrastive objective:

| Pair type | Objective |
|---|---|
| Two samples from the same author | Pull embeddings together |
| Two samples from different authors | Push embeddings apart |
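
The post does not say which contrastive loss is used, so the following is only a sketch of one common choice: a margin-based pairwise loss on Euclidean distance. The function names and the `margin` value are illustrative assumptions.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_pair_loss(u, v, same_author, margin=1.0):
    """Margin-based pairwise loss (an ASSUMED formulation).

    same author:      loss = d^2               (pull together)
    different author: loss = max(0, m - d)^2   (push apart, up to margin m)
    """
    d = euclidean(u, v)
    if same_author:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

Different-author pairs that are already farther apart than the margin contribute zero loss, so training effort goes to the pairs that are still confusable.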

At inference you either pick an existing style slot or provide a few lines of reference handwriting — the model projects the reference to the closest existing embedding.
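
Snapping a reference to the closest slot might look like the sketch below. It assumes the reference handwriting has already been encoded into a vector by some encoder (not shown, and not described in the post) and that "closest" means highest cosine similarity; both are assumptions.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_style_slot(reference, style_table):
    """Return the index of the style slot most similar to the reference."""
    return max(range(len(style_table)),
               key=lambda i: cosine(reference, style_table[i]))

# Hypothetical 3-slot table with 2-dimensional embeddings.
style_table = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
slot = nearest_style_slot([0.2, 0.9], style_table)
```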

LSTM vs. Transformer, side by side

| Metric | LSTM (Part 1) | Transformer (Part 2) |
|---|---|---|
| Weights | 14 MB | 102 MB |
| Character accuracy (OCR round-trip) | ~82% | ~96% |
| Baseline drift | Visible after long words | None |
| Multi-line stability | Poor | Excellent |
| Training time | ~12h on one GPU | ~40h on one A100 |
| Inference (GPU) | ~90ms/line | ~150ms/line |
| Inference (CPU, laptop) | ~0.8s/line | ~1.2s/line |

Where it's deployed

hand-magic.com. Type text, pick a style, get back a PNG. Actually try it, and tell me where it breaks — some of the weirdest failure modes only show up on specific uncommon character combinations and I genuinely can't predict which.

Key takeaways

  • Discrete polar tokens beat continuous MDN outputs for handwriting — standard cross-entropy, cleaner training, better generation.
  • Cross-attention kills baseline drift because every output token has direct access to the prompt.
  • Contrastive style embeddings scale better than hidden-state priming and are more interpretable.

What's next

  • Longer context — currently truncated at ~500 strokes.
  • A separate head for stroke thickness so the model can emit pressure-style handwriting.
  • Combining the style embedding with reference-image conditioning for offline (photographed) handwriting.

If you have thoughts, reach out — and check out my other projects.

Frequently Asked Questions

What is a polar tokenizer for handwriting?

Instead of predicting raw (dx, dy) pen offsets, a polar tokenizer quantizes each offset into (angle, radius, end_of_stroke). Handwriting has strong directional structure — strokes aren't uniformly distributed in Cartesian space — so angles are more meaningful than raw dy. Radius bins can be non-linearly scaled so short strokes get fine-grained bins and long strokes get coarse ones.

Why does cross-attention fix the baseline-drift problem?

The LSTM's attention is recurrent and has finite memory — vertical-drift errors compound over hundreds of tokens. Cross-attention lets every token attend directly to any character in the prompt, so there's no hidden-state bottleneck to carry baseline information through. In practice, multi-line output stayed perfectly horizontal after the switch.

How does the style embedding work?

There's a small style-embedding table (up to 10 entries per deployment). A contrastive loss pulls same-author samples together and pushes different-author samples apart. At inference, the user picks a slot or provides a few lines of reference handwriting, and the model projects them to the closest existing embedding.

How big is the transformer model?

6 transformer decoder layers, d_model=384, 6 attention heads, about 102MB of weights. It trains in roughly 40 hours on a single A100, and ships at 1.2 seconds per line on laptop CPU or 150ms per line on GPU.

What's the accuracy?

Character accuracy — measured by OCR'ing the generated handwriting back to text and comparing — is about 96%, up from 82% on the LSTM version. That gain comes mostly from eliminating drift and giving the model direct attention to every character in the prompt.
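
One plausible way to compute such a score is via edit distance between the prompt text and the OCR output. The sketch below assumes accuracy is defined as 1 minus the Levenshtein distance divided by the reference length; the post does not state the exact formula or which OCR engine is used.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]

def character_accuracy(reference, ocr_output):
    """1 - (edit distance / reference length), floored at 0 (ASSUMED definition)."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))
```

Under this definition, a single wrong character in an 11-character prompt such as "hello world" scores 10/11.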

How does this relate to The Cursive Transformer paper?

The Cursive Transformer (arXiv:2504.00051) independently lands on the same core insight: tokenize pen offsets in polar coordinates and train a plain GPT decoder with cross-attention over ASCII text. The architecture converges because it works.
