
LSTM vs Transformer for Climate Forecasting (GISTEMP, PyTorch)
A 2-layer LSTM with 128 hidden units consistently beat a transformer with a matched parameter count at forecasting global temperature anomalies on NASA GISTEMP, cutting RMSE by ~35% relative to a persistence baseline. With ~1,700 training windows, the LSTM's recurrence is a useful inductive bias; the transformer has to learn that temporal structure from scratch and overfits before it gets there. This is a small but honest data point in the ongoing question of when recurrent models still deserve a seat at the table in 2025.
TL;DR
- Data: NASA GISTEMP monthly anomalies, 1880–present, ~1,700 windowed samples.
- Model: 2-layer LSTM, hidden=128, 48-month lookback, 12-month horizon.
- Baseline: Persistence ("next month = this month"), which is hard to beat.
- Result: LSTM beats persistence by ~35% RMSE on held-out 2015–2023; transformer (same param count) loses.
- Lesson: On small datasets, recurrent priors beat "learn everything from scratch."
The motivation
Can a relatively small deep-learning model learn enough of the structure in temperature records to give useful year-ahead forecasts? And while I was at it — can I convince myself that LSTMs still deserve a seat at the table now that everyone uses transformers for everything?
Short answer: yes, on this dataset, for this horizon, at this size.
The data
The data are historical temperature records going back to 1880 from the public NASA GISTEMP dataset, plus some complementary monthly anomaly data. After normalization, each training example pairs a sliding window of the previous 48 months with the next 12 months as the target. Windowing yields roughly 1,700 training samples, a small dataset by modern standards.
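Here's a minimal sketch of the windowing step in plain Python. It is not the project's actual preprocessing code: the function name `make_windows`, the one-month stride, and the synthetic series are all assumptions made for illustration.

```python
def make_windows(series, lookback=48, horizon=12):
    """Slice a 1-D sequence into (input, target) pairs.

    Each input holds `lookback` consecutive values; its target holds the
    `horizon` values that immediately follow. Stride is one step (assumed).
    """
    n_windows = len(series) - lookback - horizon + 1
    if n_windows <= 0:
        raise ValueError("series is shorter than lookback + horizon")
    inputs, targets = [], []
    for start in range(n_windows):
        split = start + lookback
        inputs.append(series[start:split])
        targets.append(series[split:split + horizon])
    return inputs, targets


# Synthetic stand-in for the anomaly series (NOT real GISTEMP values).
series = [0.01 * i for i in range(144)]  # 12 years of fake monthly data
X, y = make_windows(series)
print(len(X), len(X[0]), len(y[0]))  # 85 48 12
```

With a stride of one month, a series of N months gives N − 59 windows for a 48-month lookback and a 12-month horizon.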
The persistence baseline — always check it
The simplest possible baseline is "tomorrow will be like today"; for monthly data, that means "next month will look like this month." For monthly temperature anomalies on a global scale, that's a surprisingly strong baseline because month-over-month autocorrelation is high. Any model you train has to clearly improve on persistence; otherwise you may just be fitting noise.
Publishing the persistence number next to your model's is one of those professional habits that distinguish a research result from a toy.
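To make the baseline concrete, here's a rough sketch of a persistence forecast scored with RMSE. The post doesn't spell out how persistence is extended over a 12-month horizon, so this assumes the last observed value is simply repeated for every forecast step. The function names and the toy numbers are illustrative only.

```python
import math


def persistence_forecast(window, horizon=12):
    """Repeat the last observed value for every step of the horizon (assumed)."""
    return [window[-1]] * horizon


def rmse(predictions, targets):
    """Root-mean-square error over all forecast steps of all windows."""
    squared = [
        (p - t) ** 2
        for pred_row, target_row in zip(predictions, targets)
        for p, t in zip(pred_row, target_row)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Toy numbers, not GISTEMP data: two windows, horizon of 2.
inputs = [[0.1, 0.2, 0.3], [0.2, 0.3, 0.5]]
targets = [[0.4, 0.2], [0.5, 0.7]]
preds = [persistence_forecast(w, horizon=2) for w in inputs]
baseline_rmse = rmse(preds, targets)
```

Scoring the model with the same `rmse` function on the same held-out windows keeps the comparison fair.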
The model
A small 2-layer LSTM, hidden size 128, followed by a linear projection to the forecast horizon. Training target: MSE on the anomaly values. Lookback: 48 months. Forecast horizon: 12 months.
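```python
from torch import nn


class TempLSTM(nn.Module):
    """2-layer LSTM that maps a lookback window to a multi-step forecast."""

    def __init__(self, in_dim=1, hidden=128, horizon=12, dropout=0.2):
        super().__init__()
        # dropout is applied between the two LSTM layers, not after the last one
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2,
                            batch_first=True, dropout=dropout)
        self.head = nn.Linear(hidden, horizon)

    def forward(self, x):              # x: (B, T, 1)
        h, _ = self.lstm(x)            # h: (B, T, hidden)
        return self.head(h[:, -1, :])  # last time step -> (B, horizon)
```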
Head-to-head vs. a transformer baseline
I also trained a small encoder-only transformer with learned positional embeddings and a matched parameter count. On this dataset, the LSTM beat it consistently, which surprised me at first.
| Model | Params | Val RMSE | Val RMSE vs. persistence |
|---|---|---|---|
| Persistence | 0 | 0.175 °C | baseline |
| 2-layer LSTM, h=128 | ~140k | 0.113 °C | −35% |
| Transformer (same param count) | ~140k | 0.151 °C | −14% |
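The last column is just the relative change in RMSE against persistence. A quick check of the arithmetic, using the RMSE values from the table (the helper name `relative_change` is mine):

```python
def relative_change(model_rmse, baseline_rmse):
    """Signed percentage change in RMSE relative to the baseline."""
    return 100.0 * (model_rmse - baseline_rmse) / baseline_rmse


persistence = 0.175  # °C, from the table above
for name, value in [("LSTM", 0.113), ("Transformer", 0.151)]:
    print(f"{name}: {relative_change(value, persistence):+.1f}%")
# LSTM: -35.4%
# Transformer: -13.7%
```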
My working theory on why: when your training set is ~1,700 samples, the LSTM's strong recurrence prior acts as useful inductive bias. The transformer wants more data to figure out the temporal structure from scratch, and with this little training data it doesn't get there before it starts overfitting.
This tracks with the broader literature — hybrid CNN-LSTM models are still state-of-the-art for some climate tasks, and the "everything is a transformer" default doesn't always hold on small time-series problems.
What actually worked
| Trick | Impact |
|---|---|
| Differencing the input: predict month-over-month change instead of absolute value | ~20% lower RMSE |
| Z-score normalization per month, not globally (summer months have different variance than winter months) | Noticeable |
| Dropout 0.2 between LSTM layers; without it, the model overfit in under 5 epochs | Critical |
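As a sketch of the differencing trick: the model sees month-over-month changes, and predicted changes are turned back into absolute anomalies afterwards. The post doesn't say how forecasts were mapped back, so the cumulative-sum reconstruction here is an assumption, as are the function names and the toy values.

```python
from itertools import accumulate


def difference(series):
    """Month-over-month changes; one element shorter than the input."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


def undifference(last_value, predicted_changes):
    """Turn predicted changes back into absolute values via a running sum."""
    return list(accumulate(predicted_changes, initial=last_value))[1:]


series = [0.50, 0.55, 0.53, 0.60]           # toy anomalies in °C, not real data
changes = difference(series)                # approximately [0.05, -0.02, 0.07]
rebuilt = undifference(series[0], changes)  # recovers series[1:] up to float error
```

If you difference the inputs, it's probably worth converting forecasts back to absolute anomalies before comparing RMSE against persistence, so both numbers are on the same scale.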
What didn't work
- Seasonal features (sin/cos of month-of-year) — the LSTM already learned them.
- Scaling hidden size past 128 — more overfitting, no better validation loss.
- Attention on top of the LSTM — marginal gains, lots of extra complexity. Not worth it here.
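For reference, the sin/cos seasonal encoding from the first bullet usually looks something like the sketch below. This is the standard cyclical encoding rather than the exact feature code from the project, and `month_features` is a hypothetical name.

```python
import math


def month_features(month):
    """Encode month-of-year (1..12) as a point on the unit circle."""
    angle = 2.0 * math.pi * (month - 1) / 12.0
    return math.sin(angle), math.cos(angle)


# One (sin, cos) pair per calendar month; December sits next to January.
features = [month_features(m) for m in range(1, 13)]
```

The point of the encoding is that December and January end up as neighbors, which a raw 1..12 integer would not capture.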
Result
The model beat persistence by about 35% on RMSE over a held-out 2015–2023 window. Not groundbreaking, but real — and it reinforced for me that "small recurrent model" is still a sensible starting point for time-series problems where you don't have millions of examples.
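One way to carve out a held-out window like 2015–2023 without leaking targets across the boundary is sketched below. The post doesn't describe the split mechanics, so everything here is an assumption: the function name, the month indexing (0 = January 1880), the series ending in December 2023, and the choice to drop windows whose targets straddle the cutoff.

```python
def chronological_split(n_months, cutoff, lookback=48, horizon=12):
    """Split window start indices into train and test by target dates.

    A window goes to train if all of its targets fall before `cutoff`,
    and to test if all of its targets fall at or after `cutoff`.
    """
    train, test = [], []
    for start in range(n_months - lookback - horizon + 1):
        target_start = start + lookback
        target_end = target_start + horizon  # exclusive
        if target_end <= cutoff:
            train.append(start)
        elif target_start >= cutoff:
            test.append(start)
        # windows whose targets straddle the cutoff are dropped
    return train, test


n_months = (2023 - 1880 + 1) * 12  # Jan 1880 .. Dec 2023 (assumed end date)
cutoff = (2015 - 1880) * 12        # month index of Jan 2015
train_idx, test_idx = chronological_split(n_months, cutoff)
```

Test inputs still reach back into pre-2015 months, which is fine: those are observed history at forecast time, and only the targets need to stay on the held-out side.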
Full paper PDF has the ablations.
Key takeaways
- Always publish your persistence baseline. If you don't, your model could be fitting noise.
- On small datasets, inductive bias beats capacity. A 140k-parameter LSTM beat a 140k-parameter transformer here.
- Difference, then normalize per-month. Two of the three biggest wins weren't architectural.
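A minimal sketch of per-month z-scoring, in plain Python. It assumes the statistics are fitted on the training period only and then reused for held-out data; the post doesn't state this, and the function names and toy values are illustrative.

```python
from statistics import mean, stdev


def fit_monthly_stats(values, months):
    """Mean and sample standard deviation for each calendar month (1..12).

    Needs at least two values per month, with non-zero spread.
    """
    by_month = {}
    for value, month in zip(values, months):
        by_month.setdefault(month, []).append(value)
    return {m: (mean(v), stdev(v)) for m, v in by_month.items()}


def normalize(values, months, stats):
    """Z-score each value with the statistics of its own calendar month."""
    return [(v - stats[m][0]) / stats[m][1] for v, m in zip(values, months)]


# Toy values, not GISTEMP data: three Januaries and three Februaries.
months = [1, 2, 1, 2, 1, 2]
values = [0.1, 0.5, 0.2, 0.7, 0.3, 0.9]
stats = fit_monthly_stats(values, months)  # fit on training data only (assumed)
z = normalize(values, months, stats)       # approximately [-1, -1, 0, 0, 1, 1]
```

To follow the "difference, then normalize" order, `values` would be the differenced series, with `months` giving the calendar month of each change.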
References
- NASA — GISTEMP Surface Temperature Analysis
- Nature Scientific Reports — Monthly climate prediction using CNN + LSTM
- MDPI Applied Sciences — Transformer–LSTM temperature prediction
- Full project PDF
Frequently Asked Questions
Why did the LSTM beat the transformer on this dataset?
With ~1,700 training samples after windowing, the transformer has to learn temporal structure from scratch and tends to overfit. The LSTM's recurrence is a strong inductive bias — it assumes the next value depends on a gradually changing hidden state — which is basically the right prior for monthly climate anomalies. Given millions of examples the transformer would likely win.
What's a persistence baseline and why does it matter?
Persistence is the simplest possible baseline: "tomorrow will be like today," or for monthly data, "next month will look like this month." For monthly global temperature anomalies, it's hard to beat because their autocorrelation is high. A model that doesn't clearly beat persistence may just be fitting noise, so you should always publish the persistence number alongside yours.
What dataset did you use?
NASA GISTEMP, the Goddard Institute for Space Studies Surface Temperature Analysis. Monthly anomaly data from 1880 onwards: roughly 140 years of record, which yields ~1,700 samples after windowing with a 48-month lookback.
Which tricks actually helped?
Three things materially moved the needle: differencing the input so the model predicts month-over-month change instead of absolute value (RMSE −20%), per-month z-score normalization instead of global (because summer and winter months have different variances), and a small 0.2 dropout between LSTM layers to prevent rapid overfitting.
What didn't help?
Adding sin/cos seasonal features (the LSTM learned them anyway), scaling hidden size past 128 (more overfitting, no better validation), and bolting attention onto the LSTM (marginal gains, lots of complexity).
What was the final accuracy?
About a 35% RMSE improvement over persistence on a held-out 2015–2023 window. Not groundbreaking, but real — and it reinforced that 'small recurrent model' is still a sensible starting point for time-series problems where you don't have millions of examples.