Introduction
Prediction markets are one of the most efficient price discovery mechanisms available to retail participants. On Polymarket, weather temperature brackets trade as binary options — each bracket represents a temperature range for a specific city on a specific date, and the market prices reflect the crowd's implied probability of that range being correct.
The NOX Weather Alpha Engine exploits a structural advantage: weather prediction models, when properly calibrated, can outperform the crowd on temperature brackets. The crowd exhibits recency bias, anchors to recent temperatures, and reacts sluggishly to updated forecasts. A quantitative model that processes three independent weather models, calibrates its output against historical Polymarket pricing, and sizes positions using the Kelly criterion can extract a consistent edge.
This article walks through the complete pipeline — from raw weather data to executed trades — explaining exactly how the NOX Weather Alpha Engine works at every stage.
Data Ingestion — Three Weather Models
The foundation of the engine is multi-model ensemble weather data sourced from the Open-Meteo API, a free, authentication-free weather forecast service that aggregates output from major numerical weather prediction (NWP) systems. We query three independent models every six hours.
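As a rough sketch of what such a six-hourly fetch might look like, the snippet below builds an Open-Meteo forecast URL for one station. The specific model identifiers, coordinates, and the helper name are illustrative assumptions, not the engine's actual configuration:

```python
from urllib.parse import urlencode

def build_forecast_url(lat: float, lon: float, models: list[str]) -> str:
    """Construct an Open-Meteo forecast request for one station.

    The model identifiers passed in are assumptions for illustration;
    consult the Open-Meteo documentation for the exact names of the
    models you want to query.
    """
    base = "https://api.open-meteo.com/v1/forecast"
    params = {
        "latitude": lat,
        "longitude": lon,
        "daily": "temperature_2m_max,temperature_2m_min",
        "temperature_unit": "fahrenheit",
        "timezone": "auto",
        "models": ",".join(models),  # request several NWP models at once
    }
    return f"{base}?{urlencode(params)}"

# Example: LaGuardia (LGA), with three hypothetical model identifiers.
url = build_forecast_url(40.78, -73.87,
                         ["gfs_seamless", "ecmwf_ifs025", "icon_seamless"])
```

Requesting all models in a single call keeps the ingestion loop simple: one HTTP request per station per cycle, with the per-model forecasts separated out downstream.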
In parallel, the pipeline fetches current Polymarket bracket prices via the Gamma API every 30 minutes for all ten active weather stations: New York (LGA), Chicago (ORD), Miami (MIA), Seattle (SEA), London (EGLC), Paris (LFPG), Ankara (LTAC), Tel Aviv (LLBG), Seoul (RKSI), and Buenos Aires (SAEZ). Both data streams are stored in SQLite databases for reproducibility, drift detection, and backtesting.
For ground truth validation, the engine pulls actual observed temperatures from the Iowa Environmental Mesonet (IEM) METAR archive — the same airport weather stations that Polymarket uses for settlement. This creates a fully auditable loop: forecast → prediction → trade → settlement → performance measurement.
Feature Engineering — 38 Signals per Station
Raw forecasts are not fed directly into the model. Instead, the compute_features function transforms the three model outputs into 38 engineered features per station per date. These fall into four categories.
The feature vector is standardized and shared identically between the mean and variance models, ensuring both models see the same representation of the world. Feature names are tracked via a shared FEATURE_NAMES constant to prevent train/serve skew.
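The shared-constant pattern described above can be sketched as follows. The feature names and the standardization helper are hypothetical stand-ins for the real 38-feature list:

```python
import numpy as np

# Hypothetical subset of the 38 feature names. In the real pipeline a
# single shared FEATURE_NAMES constant is imported by both training and
# serving code so the two always build identical vectors.
FEATURE_NAMES = ("model_a_tmax", "model_b_tmax", "model_c_tmax",
                 "ensemble_spread", "doy_sin", "doy_cos")

def standardize(X: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Apply the training-set mean/std at serve time so both the mean and
    variance models see the same scaled representation."""
    assert X.shape[1] == len(FEATURE_NAMES), "train/serve feature skew"
    return (X - mean) / std
```

The assert is the cheap insurance policy: if serving code ever builds a vector with a different width than training saw, the pipeline fails loudly instead of silently mis-scoring.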
Prediction — The LightGBM Dual Model
The core prediction engine runs two separate LightGBM gradient boosting models that together define a full probability distribution over possible temperatures.
The mean model predicts the expected temperature using L2 (squared error) loss. The variance model predicts the squared residual of the mean model, effectively learning how uncertain the forecast is for each station/date combination. Together they produce a Gaussian distribution N(μ, σ²) per station per date.
Both models are tuned end-to-end using Optuna with a Bayesian TPE (Tree-structured Parzen Estimator) sampler over 100 trials. Critically, the optimization objective is not raw temperature RMSE but the Brier score on calibrated bracket probabilities — the actual metric that matters for trading. This means hyperparameters are tuned for profitability rather than forecast accuracy in isolation.
Calibration — CDF Mapping & Auto-Selected Scaling
A model that outputs N(μ, σ²) is not yet useful for trading. Polymarket temperature brackets use 2°F steps — for example 60-61°F and 62-63°F — with one-sided edge brackets at the extremes, like ≤59°F and ≥78°F. What we need is the probability that the actual temperature falls within each bracket. This is where the Normal CDF mapping comes in.
For an interior bracket [a, b], the probability is Φ((b − μ) / σ) − Φ((a − μ) / σ). Edge brackets are one-sided: the lower edge bracket ≤59°F uses Φ((59 − μ) / σ) (the CDF evaluated at the upper bound only), and the upper edge bracket ≥78°F uses 1 − Φ((78 − μ) / σ) (one minus the CDF at the lower bound). Everything is computed via scipy.stats.norm.cdf. Because the brackets together tile the full temperature range, the resulting probabilities sum to exactly 1.0 — a mathematical guarantee that our probability mass is properly distributed.
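The mapping is compact enough to show in full. One assumption in the sketch below: since settlement temperatures are reported as integers, a bracket labeled "60-61°F" is treated as spanning the half-step boundaries 59.5 to 61.5, which is what makes the brackets contiguous:

```python
from scipy.stats import norm

def bracket_probs(mu: float, sigma: float, edges: list[float]) -> list[float]:
    """Map N(mu, sigma^2) onto contiguous temperature brackets.

    `edges` are the boundaries between brackets; the first and last
    brackets are the one-sided edge brackets. Because the brackets tile
    the whole real line, the probabilities sum to exactly 1.
    """
    cdf = [norm.cdf((e - mu) / sigma) for e in edges]
    probs = [cdf[0]]                                # lower edge bracket (<=)
    probs += [b - a for a, b in zip(cdf, cdf[1:])]  # interior brackets
    probs.append(1.0 - cdf[-1])                     # upper edge bracket (>=)
    return probs

# Boundaries for <=59, 60-61, ..., 76-77, >=78, assuming half-step edges
# around rounded integer settlement temperatures.
edges = [59.5 + 2 * k for k in range(10)]           # 59.5, 61.5, ..., 77.5
probs = bracket_probs(mu=68.0, sigma=3.5, edges=edges)
```

With μ = 68 and σ = 3.5, most of the mass lands in the brackets around 66-69°F, and the tails decay symmetrically into the edge brackets.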
But raw CDF probabilities still carry systematic biases from the underlying model. This is where the auto-selected calibration layer (Platt scaling, isotonic regression, or temperature scaling) adds a critical correction. The best method is chosen per station on a held-out temporal validation set, transforming each raw probability into a calibrated probability. The result: when the calibrated model says 30%, the actual outcome lands in that bracket approximately 30% of the time. This brings calibrated Brier scores below 0.025 — a meaningful improvement that compounds across thousands of trades.
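A minimal version of the per-station auto-selection might look like the sketch below, which fits two of the three candidate methods on a held-out split and keeps whichever scores better on Brier (temperature scaling is omitted for brevity; the function names are illustrative):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

def brier(p, y):
    return float(np.mean((p - y) ** 2))

def select_calibrator(p_val, y_val):
    """Fit Platt scaling and isotonic regression on held-out probabilities
    and keep whichever achieves the lower Brier score."""
    # Platt scaling: a logistic fit on the log-odds of the raw probability.
    logit = np.log(p_val / (1 - p_val)).reshape(-1, 1)
    platt = LogisticRegression().fit(logit, y_val)
    p_platt = platt.predict_proba(logit)[:, 1]

    # Isotonic regression: a monotone, piecewise-constant remapping.
    iso = IsotonicRegression(out_of_bounds="clip").fit(p_val, y_val)
    p_iso = iso.predict(p_val)

    if brier(p_platt, y_val) <= brier(p_iso, y_val):
        return "platt", lambda p: platt.predict_proba(
            np.log(p / (1 - p)).reshape(-1, 1))[:, 1]
    return "isotonic", lambda p: iso.predict(p)

# Synthetic held-out split for illustration.
rng = np.random.default_rng(1)
p_val = rng.uniform(0.05, 0.95, 300)
y_val = (rng.uniform(size=300) < p_val).astype(int)
name, calibrate = select_calibrator(p_val, y_val)
```

Isotonic regression tends to win when the miscalibration is non-monotone or lumpy; Platt scaling tends to win on smaller validation sets where isotonic would overfit, which is why selecting per station makes sense.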
Proper calibration is non-negotiable for Kelly sizing. The Kelly criterion assumes your probabilities are accurate. Feed it overconfident probabilities and it will systematically over-bet. Feed it underconfident probabilities and it will leave money on the table. Calibration closes this gap.
Edge Detection & Kelly Sizing
With calibrated bracket probabilities in hand, the engine compares each bracket's model probability against the normalized Polymarket market price. The difference is the edge:
edge = model_prob − market_price

The edge threshold is adaptive per station in the V8 engine, typically ranging from 4–7%. Stations like Tel Aviv and Seattle use a tighter 4% threshold because their temperature distributions are narrower and the model has higher accuracy there. The default is 5% with a configurable YAML override.
Position sizing follows the Kelly criterion, which maximizes the geometric growth rate of capital. Additional guards cap the maximum position at 25% of bankroll and require a minimum model probability of 10% to filter unreliable tail predictions.
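For a binary share bought at price p that pays 1 if correct, the net odds are b = (1 − p) / p, and the Kelly formula f* = (q·b − (1 − q)) / b simplifies to (q − p) / (1 − p). A sketch with the two guards described above (the function name and defaults are illustrative):

```python
def kelly_fraction(q: float, p: float,
                   cap: float = 0.25, min_prob: float = 0.10) -> float:
    """Kelly stake for buying a binary share at price p with calibrated
    model win probability q.

    Net odds are b = (1 - p) / p, so f* = (q*b - (1 - q)) / b, which
    simplifies to (q - p) / (1 - p).
    """
    if q < min_prob:            # filter unreliable tail predictions
        return 0.0
    f = (q - p) / (1.0 - p)     # negative whenever there is no edge
    return max(0.0, min(f, cap))  # never exceed the bankroll cap

# A 35% model probability against a 25-cent market price:
stake = kelly_fraction(q=0.35, p=0.25)   # (0.35 - 0.25) / 0.75 ~ 0.133
```

Note how the formula scales with both edge and price: the same 10-point edge commands a larger fraction at a cheap price than at an expensive one, because the cheap share offers better odds per dollar risked.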
The V8 engine also supports dynamic Kelly via an optimization layer that adjusts the multiplier based on recent win rate and drawdown metrics. When the engine is running hot, it tightens position sizes to protect gains. When it has been running cold, it maintains discipline rather than increasing exposure.
Backtesting Methodology
All published performance numbers are generated using a strict flat-bet methodology designed to eliminate compounding artifacts and isolate pure signal quality.
This methodology is deliberately conservative. Flat betting means that performance numbers are not inflated by compound reinvestment. Normalized prices prevent the illusion of edge from raw price discrepancies. The 2% fee is deducted from every winning trade. The daily trade cap prevents overfitting to high-frequency noise.
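The flat-bet accounting reduces to a short loop. The exact fee mechanics below (2% deducted from the winning payout) are an assumption for illustration based on the description above:

```python
def flat_bet_pnl(trades, stake: float = 1.0, fee: float = 0.02):
    """Flat-bet accounting: every trade risks the same stake. A winning
    share bought at `price` pays out 1, so gross profit is
    stake * (1/price - 1); the fee is deducted from winning trades only,
    and losers forfeit the full stake. Returns (total PnL, ROI on the
    total capital staked)."""
    pnl = 0.0
    for price, won in trades:
        if won:
            gross = stake * (1.0 / price - 1.0)
            pnl += gross * (1.0 - fee)
        else:
            pnl -= stake
    roi = pnl / (stake * len(trades))
    return pnl, roi

# Two 25-cent entries (one win, one loss) and a 60-cent winner.
pnl, roi = flat_bet_pnl([(0.25, True), (0.25, False), (0.60, True)])
```

Because every trade risks exactly one unit, ROI here is a pure per-trade signal-quality measure: no compounding, no position-size effects.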
The 60-day backtest window produced a +335.76% flat ROI across 2,025 trades with a 64.3% overall win rate. 53 of 60 days were profitable. The BUY YES strategy captured +266% ROI at a 27% win rate (high payout per win), while BUY NO captured +71% ROI at an 88% win rate (consistent small gains). This asymmetry is by design — the engine exploits different market inefficiencies on each side of the book.
Importantly, the engine uses strict temporal cross-validation during model training. No future data leaks into any training fold. The backtest uses out-of-sample predictions only. Station-level performance attribution tracks whether edge is concentrated or diversified — and the data shows Tel Aviv, Chicago, and NYC as the strongest contributors.
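The temporal discipline described above amounts to walk-forward splitting: every fold trains only on data strictly before its validation window. A minimal sketch of such a splitter (the helper is illustrative, not the engine's actual code):

```python
def temporal_folds(n_samples: int, n_folds: int = 5):
    """Walk-forward splits over time-ordered samples. Fold k trains on
    everything before its validation window, so no future observation
    ever leaks into a training fold."""
    fold_size = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_idx = list(range(0, k * fold_size))
        val_idx = list(range(k * fold_size, (k + 1) * fold_size))
        yield train_idx, val_idx

# 60 days of station data split into 5 walk-forward folds.
folds = list(temporal_folds(60, n_folds=5))
```

Contrast this with ordinary k-fold cross-validation, which would happily train on tomorrow to score yesterday and thereby inflate every backtest number.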
Execution on Polymarket
When a signal passes all filters — edge above threshold, model probability above 10%, Kelly fraction within bounds — the engine submits a limit order to the Polymarket CLOB (Central Limit Order Book) via the py-clob-client Python library. Orders are placed on the Polygon blockchain with deterministic slug construction to match the correct market.
The execution layer handles the full complexity of on-chain trading: EIP-712 typed data signatures via a delegated proxy wallet, automatic nonce management, gas estimation, retry logic for transient failures, and position tracking via the portfolio endpoint. Both BUY YES and BUY NO orders are supported with the neg_risk=True parameter required for weather markets.
The complete cycle — data collection, prediction, calibration, edge detection, sizing, and execution — runs every six hours as an autonomous pipeline with fault tolerance and automatic recovery. No human intervention is required once the engine is armed. This is the core value proposition of the NOX Weather Alpha Engine: a fully automated Polymarket weather trading bot that operates 24/7 across 10 global weather stations.
Want to see this engine in action?