Proof and methodology
Each of the four sections below is collapsible. The intent is to give a technical reader enough evidence to either trust or reject the verdict on the Conclusion tab. Click any heading to expand.
1. The data we used
Everything in this report comes from a single local SQLite mirror of Polymarket's public data API (the same endpoints that power their website). The mirror is read-only inside this report; the upstream sync runs separately. Having the local mirror is what let us re-run the same strategy thousands of times deterministically without hitting rate limits.
- Mirror size: 6,359.2 MB SQLite
- Snapshot date: 2026-04-23
- Scope: resolved binary markets, non-voided
- Sources: Polymarket public data + gamma APIs
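For orientation, a minimal sketch of opening such a mirror read-only (the polymarket.db path is hypothetical; mode=ro is what enforces the read-only guarantee described above):

```python
import sqlite3

# Open the local mirror read-only (path is hypothetical for this sketch).
# With mode=ro, any accidental write raises sqlite3.OperationalError.
conn = sqlite3.connect("file:polymarket.db?mode=ro", uri=True)
conn.row_factory = sqlite3.Row  # pipeline code reads rows by column name

n_markets = conn.execute(
    "SELECT COUNT(*) FROM markets WHERE resolved = 1 AND voided = 0"
).fetchone()[0]
print(f"resolved, non-voided markets: {n_markets}")
```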
Tables in the mirror (live counts queried at request time)
| SQLite table | Row count | What it stores |
|---|---|---|
| markets | 49,553 | One row per Polymarket market: question, end date, raw payload from the gamma API, resolution outcome, voided flag. |
| tokens | 99,129 | Yes/No tokens per market. Each token resolves to 1.00 or 0.00. |
| trades | 2,981,448 | Raw BUY/SELL fills: proxy_wallet, token_id, size, price, timestamp. Source: Polymarket data-api trades endpoint. |
| wallets | 15,000 | Per-wallet aggregates: trade_count, total_volume, first/last seen, category distribution. Used to label whale tier and screen capped wallets. |
Concretely: the trades table is the source of truth for every BUY in the audit. The wallets table aggregates trades into per-wallet metadata so we can label a wallet as whale-tier, track-qualified, or both at episode time without re-walking 3 million rows.
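A sketch of the kind of aggregation that populates wallets (the exact upstream query is not reproduced in this report; whether total_volume sums size or size times price is an assumption here):

```python
# Illustrative rebuild of per-wallet metadata from raw fills.
# Assumes the sqlite3.Row factory from the connection sketch above.
cur = conn.execute("""
    SELECT proxy_wallet,
           COUNT(*)       AS trade_count,
           SUM(size)      AS total_volume,
           MIN(timestamp) AS first_seen,
           MAX(timestamp) AS last_seen
    FROM trades
    GROUP BY proxy_wallet
""")
for row in cur.fetchmany(3):
    print(dict(row))
```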
Sample large BUYs from the trades table
Live read from the local DB. These are real $50K-$500K BUY rows on resolved binary markets.
| Wallet | Question | Size $ | Price | Bought | Resolved | Timestamp |
|---|---|---|---|---|---|---|
| 0xcd9dd293… | US strikes Iran by February 25, 2026? | 500,000 | 0.0010 | Yes | No | 1772086494 |
| 0x4c4e2c68… | Will Zohran Mamdani win the 2025 NYC mayoral election? | 500,000 | 0.0020 | No | Yes | 1762310769 |
| 0x5ddd01dd… | Trump strikes another drug boat by Sep 30? | 500,000 | 0.0020 | Yes | No | 1759470870 |
| 0xbcb6ebb4… | Fordow nuclear facility destroyed before July? | 500,000 | 0.9990 | Yes | Yes | 1751305016 |
| 0x1d379e32… | Will Justin Trudeau be the next Canadian Prime Minister? | 500,000 | 0.0010 | Yes | No | 1743137278 |
Concretely: several of these are very large Yes buys at prices of 0.001-0.002 that resolved No. The bettor paid for a long-tail outcome that did not materialize. That is the type of row that breaks the 'big size means informed' intuition: a $50K bet at price 0.02 is a $50K wager that an event with a 2% implied probability will happen. It usually does not.
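The arithmetic behind that sentence, using the report's own return convention (a worked sketch; 0.02 is the illustrative price from the paragraph above):

```python
# Per-share accounting: a share bought at price p pays $1.00 if correct.
price = 0.02
win_return = 1.0 - price   # +0.98 per share when the long shot hits
lose_return = -price       # -0.02 per share when it does not
# If the market price equals the true probability, expected return is zero:
ev = price * win_return + (1 - price) * lose_return
print(ev)  # 0.0: size alone buys no edge; only mispricing does
```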
2. Are we sure the data is correct?
Two independent integrity passes were run before any verdict was finalized. Both surfaced real bugs in how the upstream sync was talking to Polymarket. Both are now fixed, and every verdict in this report is computed on data fetched after the fixes.
Source-truth bug #1: wallet history was using the wrong query parameter
The wallet-history fetcher was passing proxyWallet= to Polymarket's data API. The accepted parameter is user=. The wrong parameter returned a partial response without raising an error, so the bug was silent. Concretely, this meant some wallets had visibly incomplete trade histories, which made the prior-skill labels (whale tier, track-qualified) systematically too low for those wallets.
The corrected loop. The `user=` parameter is set in `params_base` upstream of this excerpt; note the pagination bounds.
```python
for _ in range(max_pages):
    if offset > DATA_API_HARD_OFFSET_LIMIT:
        log_info(
            f"Reached Data API hard offset limit ({DATA_API_HARD_OFFSET_LIMIT}) "
            f"for {entity_type} {entity_id[:20]}..."
        )
        break

    params = {**params_base, "limit": page_size, "offset": offset}
    raw_page = _fetch_trades_page(params, rate_limiter)

    if raw_page is None:
        result.error = f"fetch failed at offset {offset}"
        break

    if len(raw_page) == 0:
        # Empty page = end of available data
        break

    result.pages_fetched += 1
    result.trades_fetched += len(raw_page)

    # Parse all trades in this page
    parsed_page: list[dict] = []
    for raw_trade in raw_page:
        parsed = _parse_trade(raw_trade)
        if parsed is not None:
            parsed_page.append(parsed)

    if not parsed_page:
        offset += page_size
        continue

    page_timestamps = [t["timestamp"] for t in parsed_page]
    page_min_ts = min(page_timestamps)
    page_max_ts = max(page_timestamps)

    # Timestamp early-stop: if the MINIMUM timestamp on this page is
    # already below our high-water mark, we still insert (some trades
    # at the top of the page may be new), then stop — all subsequent
    # pages will be strictly older. Using page_min_ts (not max) avoids
    # the BUG where same-second trades at a page boundary get dropped.
    should_stop_after_this_page = (
        known_max_ts is not None and page_min_ts < known_max_ts
    )

    # Insert this page
    inserted = insert_trades_batch(parsed_page)
    result.trades_inserted += inserted
    result.trades_skipped += len(parsed_page) - inserted

    # Only count wallets from newly inserted trades (not already-known ones)
    # We can't know per-trade which inserts succeeded after batch insert,
    # so we track wallets from all parsed trades in this page; the strategy
    # layer re-checks wallet existence before flagging "first observed".
    result.new_wallets.update(
        t["proxy_wallet"] for t in parsed_page if t["proxy_wallet"]
    )

    all_min_ts.append(page_min_ts)
    all_max_ts.append(page_max_ts)

    if should_stop_after_this_page:
        result.stopped_early = True
        break

    # Per-page dedup ratio stop (checked INSIDE loop, after insert)
    if len(parsed_page) > 0:
```

Source-truth bug #2: pagination skipped the offset=3000 page
The pagination guard was too strict and refused to fetch the page starting at offset 3000, even though the API returns it. Concretely, this silently chopped off up to 1,000 trades from any wallet with more than 3,000 recorded entries - heavy wallets, in other words. Fixed.
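A minimal illustration of that class of off-by-one (the original broken guard is not shown in this report, so the `>=` variant below is an assumed reconstruction; the fixed `>` form matches the corrected loop above):

```python
DATA_API_HARD_OFFSET_LIMIT = 3000
page_size = 1000

for offset in range(0, 5000, page_size):
    # Broken guard (reconstruction): `>=` refuses the offset=3000 page
    # even though the API still serves it, dropping up to 1,000 trades.
    # if offset >= DATA_API_HARD_OFFSET_LIMIT:
    #     break
    # Fixed guard, as in the corrected loop: fetch offset=3000, stop after.
    if offset > DATA_API_HARD_OFFSET_LIMIT:
        break
    print(f"fetching page at offset={offset}")
```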
Residual issue: the API itself caps very heavy wallets
Even after the two fixes, the live data API returns at most ~4,000 history entries per wallet. There are 24 wallets globally at or above this cap. Any test that depends on prior-bet history is therefore re-run on the strict-completeness subset (Test 3 in the previous tab), where this cap is provably not biting. The heaviest wallets in the local DB:
| Wallet | Stored trade count |
|---|---|
| 0xcb3143ee858e… | 35,144 |
| 0xd218e4747764… | 34,821 |
| 0x59ee6c6a56d7… | 24,563 |
| 0xe8dd7741ccb1… | 22,581 |
| 0x4ce73141dbfc… | 18,894 |
Concretely: the top wallet alone has more recorded trades than the API would now return in a fresh fetch. Any whale-flavoured strategy that depends on knowing the wallet's full history is therefore mechanically biased toward the wallets where coverage happens to be best, unless the strict bound is applied. That is the bias Test 3 quantifies and removes.
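Both numbers above are one aggregate away from the mirror (a sketch; WALLET_HISTORY_STRICT_LIMIT is the cap constant used by the strict-subset code in section 3, and 4,000 is the approximate cap quoted above):

```python
WALLET_HISTORY_STRICT_LIMIT = 4000  # ~ the per-wallet API history cap

# Wallets whose stored history is at or above the cap (24 per the text).
capped = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT proxy_wallet
        FROM trades
        GROUP BY proxy_wallet
        HAVING COUNT(*) >= ?
    )
""", (WALLET_HISTORY_STRICT_LIMIT,)).fetchone()[0]
print(f"capped wallets: {capped}")
```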
What still has a bounded caveat
The market-level resolution timestamp is built from the last-trade proxy rather than the gamma API's closedTime field. We checked a sample and the two agree where both exist, but a full closedTime-vs-proxy comparison was deferred. Concretely: if a market's true close time is materially earlier than the last trade in our data, the deferred-resolution scorecard in Tests 2 and 3 would shift slightly. We do not believe this would flip any verdict because the surviving lane (track record) already fails for an unrelated reason (the Yes/No split).
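For completeness, the deferred comparison would look roughly like this (a sketch; the raw_payload column name and an ISO-formatted closedTime field inside it are assumptions about the stored gamma payload):

```python
import json
from datetime import datetime

# Flag markets whose last trade lands after gamma's closedTime.
for row in conn.execute("""
    SELECT m.market_id, m.raw_payload, MAX(tr.timestamp) AS last_trade_ts
    FROM markets m
    JOIN trades tr ON tr.market_id = m.market_id
    WHERE m.resolved = 1 AND m.voided = 0
    GROUP BY m.market_id
"""):
    closed_raw = json.loads(row["raw_payload"]).get("closedTime")
    if not closed_raw:
        continue  # field absent: the last-trade proxy is the only signal
    drift = row["last_trade_ts"] - datetime.fromisoformat(closed_raw).timestamp()
    if drift > 0:
        print(row["market_id"], f"last trade {drift:.0f}s after closedTime")
```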
3. How was the data processed (steps + code)
Step A - define excess return per dollar
For a BUY at price p: if the bought outcome resolves true, the wallet earns (1 - p) per $1; if it resolves false, the wallet loses p per $1. Token-adjusted, this puts Yes-side and No-side BUYs on the same scale, which is what makes the verdict tables comparable across lanes.
The bootstrap resamples market-level means, not raw rows, so a few crowded markets cannot mechanically dominate the confidence interval.
```python
def _return_per_dollar(side: str, price: float, correct: bool) -> float:
    """Return per $1 bet. BUY: pay price, win 1.0 if correct. SELL: receive price, owe 1.0 if wrong."""
    if side == "BUY":
        return (1.0 - price) if correct else -price
    else:  # SELL
        return price if correct else -(1.0 - price)


def _bootstrap_excess_return(market_excess: dict[str, list[float]], n_iter: int = 10_000) -> tuple[float, float, float, float]:
    """Bootstrap market-level mean excess return. Returns (mean, ci_lo, ci_hi, p_value)."""
    market_means = np.array([np.mean(v) for v in market_excess.values()])
    n = len(market_means)
    if n == 0:
        return 0.0, 0.0, 0.0, 1.0

    rng = np.random.default_rng(42)
    boot_means = np.array([
        rng.choice(market_means, size=n, replace=True).mean()
        for _ in range(n_iter)
    ])
    mean_val = float(market_means.mean())
    ci_lo = float(np.percentile(boot_means, 2.5))
    ci_hi = float(np.percentile(boot_means, 97.5))
    p_val = float(np.mean(boot_means <= 0))
    return mean_val, ci_lo, ci_hi, p_val


def _binomial_p(hits: int, n: int, base_rate: float) -> float:
    """One-sided binomial test p-value (probability of seeing >= hits by chance)."""
    from scipy.stats import binom
    if n == 0:
        return 1.0
    return float(1 - binom.cdf(hits - 1, n, base_rate))


# ─────────────────────────────────────────────────────────────────────────────
# Step 1: Base rates
# ─────────────────────────────────────────────────────────────────────────────

def compute_base_rates(conn) -> dict:
    """Compute market-level and trade-level base rates for both BUY and SELL."""
    cur = conn.cursor()

    # Market-level: what fraction resolved Yes vs No?
    cur.execute("""
        SELECT outcome, COUNT(*) FROM markets
        WHERE resolved=1 AND voided=0
```

Concretely: a BUY at price 0.30 that resolves True earns +$0.70 per dollar; the same BUY resolving False loses $0.30. We average those numbers per market, then resample 1,000 times to get a 95% CI. The market-clustered resampling is what protects us against a single highly-traded market dragging the result.
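A quick sanity check of the sign convention (a usage example of the helper above):

```python
import math

assert math.isclose(_return_per_dollar("BUY", 0.30, correct=True), 0.70)
assert math.isclose(_return_per_dollar("BUY", 0.30, correct=False), -0.30)
# SELL mirrors it: collect the price when right, owe the complement when wrong.
assert math.isclose(_return_per_dollar("SELL", 0.30, correct=True), 0.30)
assert math.isclose(_return_per_dollar("SELL", 0.30, correct=False), -0.70)
```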
Step B - collapse split fills into episodes
The raw feed contains many split fills: one logical $200K bet often shows up as 30 raw rows of $5K-$10K. An episode is defined as: same wallet + same token + same side, all rows within a 60-second window, collapsed to one volume-weighted price. Without this step, large bets are silently double-counted in the row-level analysis.
Order: time-sort, then walk forward grouping by (wallet, token, side) within the time window.
```python
def build_episodes(conn) -> list[dict]:
    """Deduplicate trades into episodes. Group by (wallet, token, side) within 60s window.
    Returns list of episode dicts sorted by (proxy_wallet, timestamp)."""
    cur = conn.cursor()
    cur.execute("""
        SELECT tr.proxy_wallet, tr.token_id, tr.market_id, tr.side, tr.size, tr.price, tr.timestamp,
               tk.outcome_label, m.outcome as market_outcome, m.resolved, m.voided
        FROM trades tr
        JOIN tokens tk ON tk.token_id = tr.token_id
        JOIN markets m ON m.market_id = tr.market_id
        WHERE tk.outcome_label IN ('Yes', 'No')
        ORDER BY tr.proxy_wallet, tr.timestamp
    """)

    # Group into episodes: same (wallet, token, side) within 60s
    episodes = []
    pending = {}  # key: (wallet, token, side) -> list of (size, price, timestamp, ...)

    def flush(key, rows):
        total_size = sum(r[0] for r in rows)
        vwap = sum(r[0] * r[1] for r in rows) / total_size if total_size > 0 else rows[0][1]
        first_ts = rows[0][2]
        episodes.append({
            "proxy_wallet": key[0],
            "token_id": key[1],
            "side": key[2],
            "market_id": rows[0][3],
            "bet_size": total_size,
            "price": vwap,
            "timestamp": first_ts,
            "outcome_label": rows[0][4],
            "market_outcome": rows[0][5],
            "resolved": rows[0][6],
            "voided": rows[0][7],
        })

    for row in cur.fetchall():
        key = (row["proxy_wallet"], row["token_id"], row["side"])
        entry = (row["size"], row["price"], row["timestamp"], row["market_id"],
                 row["outcome_label"], row["market_outcome"], row["resolved"], row["voided"])

        if key in pending:
            last_ts = pending[key][-1][2]
            if row["timestamp"] - last_ts <= 60:
                pending[key].append(entry)
            else:
                flush(key, pending[key])
                pending[key] = [entry]
        else:
            pending[key] = [entry]

    for key, rows in pending.items():
        flush(key, rows)

    episodes.sort(key=lambda e: (e["proxy_wallet"], e["timestamp"]))
    return episodes
```

Concretely: we run the same Test 1 lanes twice - once on raw rows, once on episodes. If the episode-level result disagreed with the raw-row result, that would be a hint that the negative is a double-counting artifact. They agree, so the negative is real.
Step C - build wallet scorecards with deferred resolution
For prior-skill labels, a BUY only counts toward a wallet's track record once we believe the underlying market had already finished trading before the current episode. Without this guard, the wallet's 'prior' bets would include bets on the same market the wallet is actively in - a leakage that would inflate the apparent skill of any wallet.
The point-in-time scorecard used to label whale tier and track-qualified state at episode time.
```python
def build_scorecards(episodes: list[dict], base_rates: dict) -> dict:
    """Build as-of-time wallet scorecards with deferred resolution.

    Critical: a prior bet on market X only counts as "resolved" in the scorecard
    if market X's last trade timestamp < current episode's timestamp. This is a
    conservative proxy for "market X had resolved by the time of the current trade."

    Without this, the scorecard uses future information (whether a market eventually
    resolved), creating look-ahead bias. GPT-5.4 review identified this as a critical
    flaw: 57% of scorecard entries had at least one look-ahead instance.

    For each wallet, iterate chronologically. At each episode, compute the scorecard
    from only those prior bets whose markets had finished trading (last_trade < now):
    - n_resolved: resolved bets known at this time
    - hit_rate: fraction correct among known-resolved
    - excess_return: mean excess return per $1 vs token-adjusted base
    - max_trade_so_far: max bet_size in all prior episodes

    Returns dict: wallet -> list of (episode, scorecard_at_time) tuples.
    """
    # Build resolution proxy: last trade per market
    market_last_trade = _build_market_last_trade(episodes)

    wallet_episodes: dict[str, list[dict]] = defaultdict(list)
    for ep in episodes:
        wallet_episodes[ep["proxy_wallet"]].append(ep)

    # wallet -> list of (episode_dict, scorecard_dict)
    wallet_cards: dict[str, list[tuple[dict, dict]]] = {}

    for wallet, eps in wallet_episodes.items():
        eps.sort(key=lambda e: e["timestamp"])

        max_trade_so_far = 0.0
        # Deferred resolution: store pending bets, resolve them when their market's
        # last trade is in the past. Each entry: (market_last_ts, correct, ret, excess)
        pending_bets: list[tuple[int, bool, float, float]] = []
        # Resolved history (only bets whose markets are known-resolved at current time)
        resolved_history: list[tuple[bool, float, float]] = []

        cards: list[tuple[dict, dict]] = []

        for ep in eps:
            current_ts = ep["timestamp"]

            # Resolve any pending bets whose market's last trade is now in the past
            still_pending = []
            for market_last_ts, correct, ret, excess in pending_bets:
                if market_last_ts < current_ts:
                    resolved_history.append((correct, ret, excess))
                else:
                    still_pending.append((market_last_ts, correct, ret, excess))
            pending_bets = still_pending

            # Scorecard BEFORE this episode (as-of-time, deferred resolution)
            n_resolved = len(resolved_history)
            n_correct = sum(1 for c, _, _ in resolved_history if c)
            cum_excess = sum(e for _, _, e in resolved_history)
            cum_return = sum(r for _, r, _ in resolved_history)

            scorecard = {
                "n_resolved": n_resolved,
                "hit_rate": n_correct / n_resolved if n_resolved > 0 else 0.0,
                "mean_excess_return": cum_excess / n_resolved if n_resolved > 0 else 0.0,
                "mean_return": cum_return / n_resolved if n_resolved > 0 else 0.0,
                "max_trade_so_far": max_trade_so_far,
                "tier_so_far": _classify_tier(max_trade_so_far),
                "resolved_history": list(resolved_history),  # copy for rolling window
            }
            cards.append((ep, scorecard))

            # Update running stats after recording scorecard
            max_trade_so_far = max(max_trade_so_far, ep["bet_size"])

            # Add this episode to pending if it's on a resolved market
            if ep["resolved"] == 1 and ep["voided"] == 0:
                correct = _is_correct(ep["side"], ep["outcome_label"], ep["market_outcome"])
                ret = _return_per_dollar(ep["side"], ep["price"], correct)
                base_ret = _token_base_rate(ep["outcome_label"], ep["side"], base_rates)
                excess = ret - base_ret
                market_last_ts = market_last_trade.get(ep["market_id"], current_ts)
                pending_bets.append((market_last_ts, correct, ret, excess))

        wallet_cards[wallet] = cards

    return wallet_cards
```

Concretely: for an episode at 2026-03-15 09:30, the scorecard counts only bets the wallet placed on markets that had stopped trading before 2026-03-15 09:30. The whale and track-record labels for that episode are computed from that historical view, not from the wallet's full lifetime stats.
Step D - apply the strict completeness bound
For every wallet at the API cap (about 4,000 trades), mark its lane observations as 'capped'. Recompute every whale / track-dependent metric on the strict-complete subset only. Re-bootstrap. Then split each strict result by Yes-token vs No-token to test whether the signal is robust or a token-mix artifact.
Drives the right-hand columns of the Test 3 table. The 4,000-trade ceiling is the API cap; wallets at or above it are flagged as 'capped' and excluded from the strict-complete subset.
```python
def _annotate_wallet_counts(observations: list[dict], wallet_trade_counts: dict[str, int]) -> list[dict]:
    annotated: list[dict] = []
    for obs in observations:
        wallet_trade_count = wallet_trade_counts.get(obs["wallet"], 0)
        annotated.append(
            {
                **obs,
                "wallet_trade_count": wallet_trade_count,
                "strict_complete_wallet": wallet_trade_count < WALLET_HISTORY_STRICT_LIMIT,
            }
        )
    return annotated


def _track_10_60(obs: dict) -> bool:
    return obs["n_resolved"] >= 10 and obs["hit_rate"] >= 0.60


def _track_20_65(obs: dict) -> bool:
    return obs["n_resolved"] >= 20 and obs["hit_rate"] >= 0.65


def _whale(obs: dict) -> bool:
    return obs["tier_so_far"] in ("whale", "mega_whale")


def _lane_observations(observations: list[dict]) -> dict[str, list[dict]]:
    whale_plus = [obs for obs in observations if _whale(obs)]
    track_10_60 = [obs for obs in observations if _track_10_60(obs)]
    track_primary = [obs for obs in observations if _track_20_65(obs)]
    combined = [obs for obs in observations if _whale(obs) and _track_10_60(obs)]
    whale_only = [obs for obs in whale_plus if not _track_10_60(obs)]
    track_only = [obs for obs in track_10_60 if not _whale(obs)]
    return {
        "aot_whale_plus": whale_plus,
        "whale_only_nontrack_10_60": whale_only,
        "track_primary_20_65": track_primary,
        "track_10_60_all": track_10_60,
        "track_only_nonwhale_10_60": track_only,
        "combined_whale_track_10_60": combined,
    }


def _lane_wallets(lane_sets: dict[str, list[dict]]) -> dict[str, list[str]]:
    return {
        lane: sorted({obs["wallet"] for obs in values})
        for lane, values in lane_sets.items()
    }


def _refresh_wallets(wallets: list[str]) -> dict:
    endpoints = [WALLET_CHECKPOINT_PREFIX + wallet for wallet in wallets]
    deleted = delete_checkpoints(endpoints)
    results = ingest_wallet_histories(proxy_wallets=wallets)
    inserted = sum(result.trades_inserted for result in results.values())
    errors = {wallet: result.error for wallet, result in results.items() if result.error}
    four_page_wallets = sorted(
        wallet for wallet, result in results.items() if result.pages_fetched >= 4
    )
    return {
        "wallets_requested": len(wallets),
        "wallet_checkpoints_deleted": deleted,
        "wallets_returned": len(results),
        "trades_inserted": inserted,
        "error_count": len(errors),
        "errors": errors,
        "wallets_with_4_pages": len(four_page_wallets),
        "sample_wallets_with_4_pages": four_page_wallets[:25],
    }


def _summarize_lane(name: str, observations: list[dict], n_iter: int) -> LaneSummary:
    strict = [obs for obs in observations if obs["strict_complete_wallet"]]
    yes_all = [obs for obs in observations if obs["outcome_label"] == "Yes"]
    no_all = [obs for obs in observations if obs["outcome_label"] == "No"]
    yes_strict = [obs for obs in strict if obs["outcome_label"] == "Yes"]
    no_strict = [obs for obs in strict if obs["outcome_label"] == "No"]

    capped_wallets = {
        obs["wallet"] for obs in observations if not obs["strict_complete_wallet"]
    }
    strict_wallets = {
        obs["wallet"] for obs in observations if obs["strict_complete_wallet"]
    }
    capped_obs = sum(1 for obs in observations if not obs["strict_complete_wallet"])
    share = capped_obs / len(observations) if observations else 0.0

    return LaneSummary(
        overall=summarize_subset(f"{name}_overall", observations, n_iter=n_iter),
        strict_complete=summarize_subset(f"{name}_strict_complete", strict, n_iter=n_iter),
        yes_overall=summarize_subset(f"{name}_yes_overall", yes_all, n_iter=n_iter),
        no_overall=summarize_subset(f"{name}_no_overall", no_all, n_iter=n_iter),
        yes_strict_complete=summarize_subset(f"{name}_yes_strict_complete", yes_strict, n_iter=n_iter),
        no_strict_complete=summarize_subset(f"{name}_no_strict_complete", no_strict, n_iter=n_iter),
        observation_share_from_capped_wallets=round(float(share), 4),
        capped_wallet_count_in_lane=len(capped_wallets),
        strict_complete_wallet_count_in_lane=len(strict_wallets),
    )
```

Concretely: this is the step that flips the whale headline. The pooled lane is positive. The strict-complete subset is not. The difference is the wallets we cannot fully verify.
3b. Cohort methodology - how Tests 6 and 7 were built
Tests 6 and 7 ride on top of the same point-in-time scorecards used by Test 3. The added pieces:
Cohort construction
- For each wallet (strict-complete subset only) sort episodes by timestamp.
- Find the first episode where the wallet has at least 10 known-resolved priors. Call this the 'qualifying episode'.
- Read the wallet's hit rate at the qualifying episode (no look-ahead - this only counts priors whose markets had stopped trading by then).
- If the hit rate is at or above 0.70, mark the wallet as early-skilled.
- Future bets = every episode the wallet places at or after the qualifying episode.
Permutation control
To test whether the cohort is genuinely different from a random selection of similar-sized wallets, we run a label-shuffle test:
- Bucket every qualifying wallet by its n_resolved count at qualification (buckets at 10-15, 15-25, 25-50, 50-100, 100-200, 200+).
- Within each bucket, randomly relabel the same number of wallets as 'early-skilled', preserving the trade-count distribution of the real cohort.
- Recompute the future-bet excess return for the shuffled cohort. Repeat 1,000 times.
- Compare the observed cohort mean to the empirical null distribution.
Concretely: the observed cohort mean of +0.0051 sits at the p=0.008 one-sided tail of the null (null median -0.0002, 95th-percentile null +0.0035). So the cohort really is different from 'any 1,045 random wallets that happened to graduate'. That is the basic-null gate and it passes. What does not pass is the Yes/No robustness split (see Test 6 in the previous tab).
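The final comparison against the null is one line of numpy (a sketch; null_means would hold the 1,000 shuffled-cohort means produced above):

```python
import numpy as np

def empirical_p(observed: float, null_means: np.ndarray) -> float:
    """One-sided tail: fraction of shuffled cohorts at least as good."""
    return float(np.mean(null_means >= observed))

# With the report's figures: empirical_p(+0.0051, null) ≈ 0.008,
# against a null median of -0.0002 and a 95th-percentile null of +0.0035.
```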
Test 7: above-average vs at-or-below-average bet size
For each early-skilled wallet, we track a running mean of bet sizes over priors at episode time (updated after each episode). Each future bet is classified as above-average if its size is strictly greater than the running mean at that moment, otherwise at-or-below-average. Bootstrap each lane separately. The lanes are computed on the same cohort, so any difference is the size effect within a fixed wallet population.
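A sketch of that running-mean rule (field names mirror the episode dicts from Step B; the real implementation lives with the cohort helpers below):

```python
from collections import defaultdict

def label_bet_sizes(episodes: list[dict]) -> list[tuple[dict, str]]:
    """Tag each bet 'above_average' if its size strictly exceeds the wallet's
    running mean of earlier bet sizes; update the mean only AFTER labelling,
    so the current bet never influences its own label."""
    sums: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    labeled: list[tuple[dict, str]] = []
    for ep in sorted(episodes, key=lambda e: (e["proxy_wallet"], e["timestamp"])):
        w = ep["proxy_wallet"]
        # A wallet's first bet compares against a zero mean (sketch choice).
        mean_so_far = sums[w] / counts[w] if counts[w] else 0.0
        label = "above_average" if ep["bet_size"] > mean_so_far else "at_or_below"
        labeled.append((ep, label))
        sums[w] += ep["bet_size"]
        counts[w] += 1
    return labeled
```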
Identify-cohort and future-observation helpers; the permutation null shuffles labels within trade-count buckets.
```python
def _identify_cohort(
    observations: list[dict],
    training_bets: int,
    threshold: float,
) -> tuple[set[str], dict[str, dict]]:
    """For each wallet, find the first episode where n_resolved >= training_bets.

    Returns:
    - set of early-skilled wallets (hit_rate >= threshold at qualifying episode)
    - dict[wallet] -> {'qualifying_timestamp', 'hit_rate_at_qualification',
      'n_resolved_at_qualification', 'is_early_skilled'}
    """
    by_wallet: dict[str, list[dict]] = defaultdict(list)
    for obs in observations:
        by_wallet[obs["wallet"]].append(obs)

    cohort: set[str] = set()
    wallet_meta: dict[str, dict] = {}
    for wallet, obs_list in by_wallet.items():
        obs_list.sort(key=lambda o: o["timestamp"])
        qualifying = next(
            (obs for obs in obs_list if obs["n_resolved"] >= training_bets),
            None,
        )
        if qualifying is None:
            continue
        is_skilled = qualifying["hit_rate"] >= threshold
        if is_skilled:
            cohort.add(wallet)
        wallet_meta[wallet] = {
            "qualifying_timestamp": int(qualifying["timestamp"]),
            "hit_rate_at_qualification": float(qualifying["hit_rate"]),
            "n_resolved_at_qualification": int(qualifying["n_resolved"]),
            "is_early_skilled": bool(is_skilled),
        }
    return cohort, wallet_meta


def _future_observations(
    observations: list[dict],
    wallet_meta: dict[str, dict],
    cohort_wallets: set[str] | None = None,
) -> list[dict]:
    """Return episodes that occur at or after each wallet's qualifying timestamp.

    If cohort_wallets is given, restrict to those wallets. Otherwise return
    future bets for every wallet that has a qualifying timestamp.
    """
    out: list[dict] = []
    for obs in observations:
        meta = wallet_meta.get(obs["wallet"])
        if meta is None:
            continue
        if cohort_wallets is not None and obs["wallet"] not in cohort_wallets:
            continue
        if obs["timestamp"] < meta["qualifying_timestamp"]:
            continue
        out.append(obs)
    return out


def _market_clustered_mean(observations: list[dict]) -> float:
    if not observations:
        return 0.0
    by_market: dict[str, list[float]] = defaultdict(list)
    for obs in observations:
        by_market[obs["market_id"]].append(float(obs["excess_return"]))
    market_means = [sum(vals) / len(vals) for vals in by_market.values()]
    return sum(market_means) / len(market_means)


def _permutation_control(
    qualifying_obs: list[dict],
    wallet_meta: dict[str, dict],
    *,
    n_permutations: int,
    seed: int = 17,
) -> dict:
    """Trade-count-matched permutation: shuffle early-skilled labels within
    n_resolved-at-qualification buckets, recompute the lane mean.

    Buckets are integer ranges of n_resolved-at-qualification: [10,15), [15,25),
    [25,50), [50,100), [100,200), [200, +inf). This keeps wallets compared
    against peers who took roughly as long to graduate.
    """

    def _bucket(n_resolved: int) -> int:
        if n_resolved < 15:
            return 0
```

4. Results and how we interpreted them
The verdict tables in Tests & Results are the persisted output of the pipeline above. The interpretation rules below define what each cell means.
Reading the metrics
- Excess / $: token-adjusted return per dollar staked, averaged at the market level then resampled. Negative means losing money on average. Magnitudes in the 0.01-0.10 range are typical because resolved binary markets rarely close exactly at the BUY price.
- 95% CI: market-clustered bootstrap, 1,000 resamples. If the CI excludes 0 the lane has statistical support.
- p-value: bootstrap p-value against zero excess. We treat p<0.05 as 'has signal', p>0.5 as 'no signal'.
- Top-10 share: how concentrated the lane is. Above 30% means a few wallets are driving the result.
Decision rules used to land the final verdict
- If the lane is negative on both raw rows and episodes, it fails the literal test. (Test 1)
- If the lane is positive in pooled form but the strict-complete subset is non-positive, the pooled result is data-completeness-driven, not real. (Test 3)
- If the pooled lane is positive but the Yes-only or No-only strict subset is non-positive, the pooled result is a token-mix effect, not skill. (Track record case in Test 3)
- If the lane is positive on a non-overlap partition, the effect is real for that partition. (Track-only-non-whale in Test 2.) A sketch encoding this precedence follows below.
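The four rules compose into a simple gate (a sketch with hypothetical boolean inputs; each flag stands for the sign or robustness outcome of the named test, and the real pipeline works from the LaneSummary fields rather than pre-digested booleans):

```python
def lane_verdict(raw_negative: bool, episode_negative: bool,
                 pooled_positive: bool, strict_positive: bool,
                 yes_strict_positive: bool, no_strict_positive: bool) -> str:
    if raw_negative and episode_negative:
        return "fails the literal test (Test 1)"
    if pooled_positive and not strict_positive:
        return "data-completeness-driven, not real (Test 3)"
    if pooled_positive and not (yes_strict_positive and no_strict_positive):
        return "token-mix effect, not skill (Test 3)"
    if pooled_positive:
        return "real for this partition (Test 2)"
    return "no signal"
```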
What we did not conclude
- That whales are bad traders. The audit only rules out a deployable wallet-following strategy. Individual whales may still be skilled.
- That track record is useless. The pooled positive result on track-qualified wallets is real; it is just not robust enough to ship.
- That the data is broken. After the two source-truth fixes, the residual issue is a documented API cap with a defensible bound, not a bug.