Proof and methodology

Each of the four sections below is collapsible. The intent is to give a technical reader enough evidence to either trust or reject the verdict on the Conclusion tab. Click any heading to expand.

1. The data we used

Everything in this report comes from a single local SQLite mirror of Polymarket's public data API (the same endpoints that power their website). The mirror is read-only inside this report; the upstream sync that populates it runs separately. Working from a local mirror is what let us re-run the same strategy thousands of times, deterministically and without hitting rate limits.

Mirror size     6359.2 MB SQLite
Snapshot date   2026-04-23
Universe        resolved binary markets, non-voided
Source          Polymarket public data + gamma APIs

Tables in the mirror (live counts queried at request time)

SQLite table   Row count    What it stores
markets        49,553       One row per Polymarket market: question, end date, raw payload from the gamma API, resolution outcome, voided flag.
tokens         99,129       Yes/No tokens per market. Each token resolves to 1.00 or 0.00.
trades         2,981,448    Raw BUY/SELL fills: proxy_wallet, token_id, size, price, timestamp. Source: Polymarket data-api trades endpoint.
wallets        15,000       Per-wallet aggregates: trade_count, total_volume, first/last seen, category distribution. Used to label whale tier and screen capped wallets.

Concretely: the trades table is the source of truth for every BUY in the audit. The wallets table aggregates trades into per-wallet metadata so we can label a wallet as whale-tier, track-qualified, or both at episode time without re-walking 3 million rows.
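
To make the trades-to-wallets relationship concrete, the per-wallet aggregates are the kind of thing a single GROUP BY produces. A sketch (the DB path is hypothetical, and the real sync may compute the category distribution elsewhere):

import sqlite3

conn = sqlite3.connect("polymarket_mirror.db")  # hypothetical path
cur = conn.execute("""
    SELECT proxy_wallet,
           COUNT(*)       AS trade_count,
           SUM(size)      AS total_volume,
           MIN(timestamp) AS first_seen,
           MAX(timestamp) AS last_seen
    FROM trades
    GROUP BY proxy_wallet
""")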

Sample large BUYs from the trades table

Live read from the local DB. These are real $50K-$500K BUY rows on resolved binary markets.

Wallet        Question                                                  Size $    Price   Bought  Resolved  Timestamp
0xcd9dd293…   US strikes Iran by February 25, 2026?                     500,000   0.0010  Yes     No        1772086494
0x4c4e2c68…   Will Zohran Mamdani win the 2025 NYC mayoral election?    500,000   0.0020  No      Yes       1762310769
0x5ddd01dd…   Trump strikes another drug boat by Sep 30?                500,000   0.0020  Yes     No        1759470870
0xbcb6ebb4…   Fordow nuclear facility destroyed before July?            500,000   0.9990  Yes     Yes       1751305016
0x1d379e32…   Will Justin Trudeau be the next Canadian Prime Minister?  500,000   0.0010  Yes     No        1743137278

Concretely: several of these are very large Yes buys at prices of 0.001-0.002 that resolved No. The bettor paid for a long-tail outcome that did not materialize. That is the kind of row that breaks the 'big size means informed' intuition: a $50K bet at price 0.02 is a $50K wager that an event with a 2% implied probability will happen. It usually does not.
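
To make the arithmetic explicit, here is that intuition as a two-line expected-value check (plain Python, using the return definition from Step A below, not pipeline code):

p = 0.02                                        # implied probability of the long-tail outcome
ev_per_dollar = p * (1 - p) + (1 - p) * (-p)    # win (1 - p) with prob p, lose p with prob (1 - p)
print(ev_per_dollar)                            # 0.0: at the market's own odds the bet has zero edge,
                                                # and it loses 98% of the time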

2. Are we sure the data is correct?

Two independent integrity passes were run before any verdict was finalised. Both surfaced real bugs in how the upstream sync queried Polymarket. Both are now fixed, and every verdict in this report is computed on data fetched after the fixes.

Source-truth bug #1: wallet history was using the wrong query parameter

The wallet-history fetcher was passing proxyWallet= to Polymarket's data API; the accepted parameter is user=. The wrong parameter returned a partial response without raising an error, so the bug was silent. Concretely, affected wallets had incomplete trade histories, which made the prior-skill labels (whale tier, track-qualified) systematically too low for those wallets.
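
The fix itself is a one-key change in the query parameters. A sketch of its shape (params_base matches the loop below; everything else about the builder is assumed):

# Before (silent partial responses):
#   params_base = {"proxyWallet": wallet}
# After (the parameter the data API actually accepts):
params_base = {"user": wallet}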

Wallet-history fetch (post-fix)

The corrected loop. The user= parameter is carried in params_base; note the explicit pagination bounds.

    for _ in range(max_pages):
        if offset > DATA_API_HARD_OFFSET_LIMIT:
            log_info(
                f"Reached Data API hard offset limit ({DATA_API_HARD_OFFSET_LIMIT}) "
                f"for {entity_type} {entity_id[:20]}..."
            )
            break

        params = {**params_base, "limit": page_size, "offset": offset}
        raw_page = _fetch_trades_page(params, rate_limiter)

        if raw_page is None:
            result.error = f"fetch failed at offset {offset}"
            break

        if len(raw_page) == 0:
            # Empty page = end of available data
            break

        result.pages_fetched += 1
        result.trades_fetched += len(raw_page)

        # Parse all trades in this page
        parsed_page: list[dict] = []
        for raw_trade in raw_page:
            parsed = _parse_trade(raw_trade)
            if parsed is not None:
                parsed_page.append(parsed)

        if not parsed_page:
            offset += page_size
            continue

        page_timestamps = [t["timestamp"] for t in parsed_page]
        page_min_ts = min(page_timestamps)
        page_max_ts = max(page_timestamps)

        # Timestamp early-stop: if the MINIMUM timestamp on this page is
        # already below our high-water mark, we still insert (some trades
        # at the top of the page may be new), then stop — all subsequent
        # pages will be strictly older. Using page_min_ts (not max) avoids
        # the BUG where same-second trades at a page boundary get dropped.
        should_stop_after_this_page = (
            known_max_ts is not None and page_min_ts < known_max_ts
        )

        # Insert this page
        inserted = insert_trades_batch(parsed_page)
        result.trades_inserted += inserted
        result.trades_skipped += len(parsed_page) - inserted

        # Only count wallets from newly inserted trades (not already-known ones)
        # We can't know per-trade which inserts succeeded after batch insert,
        # so we track wallets from all parsed trades in this page; the strategy
        # layer re-checks wallet existence before flagging "first observed".
        result.new_wallets.update(
            t["proxy_wallet"] for t in parsed_page if t["proxy_wallet"]
        )

        all_min_ts.append(page_min_ts)
        all_max_ts.append(page_max_ts)

        if should_stop_after_this_page:
            result.stopped_early = True
            break

        # Per-page dedup ratio stop (checked INSIDE loop, after insert)
        if len(parsed_page) > 0:

Source-truth bug #2: pagination skipped the offset=3000 page

The pagination guard was too strict and refused to fetch the page starting at offset 3000, even though the API returns it. Concretely, this silently chopped off up to 1,000 trades from any wallet with more than 3,000 recorded entries - heavy wallets, in other words. Fixed.
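
Given the post-fix guard shown above (offset > DATA_API_HARD_OFFSET_LIMIT), the likely shape of the bug was a boundary comparison. A reconstruction, not the original diff:

# Buggy guard: refused the page that starts exactly at the limit,
# dropping the offset == 3000 page the API happily serves
#   if offset >= DATA_API_HARD_OFFSET_LIMIT:
#       break

# Fixed guard (as in the loop above): only stop past the limit
if offset > DATA_API_HARD_OFFSET_LIMIT:
    break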

Residual issue: the API itself caps very heavy wallets

Even after the two fixes, the live data API returns at most ~4,000 history entries per wallet. There are 24 wallets globally at or above this cap. Any test that depends on prior-bet history is therefore re-run on the strict-completeness subset (Test 3 in the previous tab), where this cap is provably not biting. The heaviest wallets in the local DB:

Wallet            Stored trade count
0xcb3143ee858e…   35,144
0xd218e4747764…   34,821
0x59ee6c6a56d7…   24,563
0xe8dd7741ccb1…   22,581
0x4ce73141dbfc…   18,894

Concretely: the top wallet alone has more recorded trades than the API would now return in a fresh fetch. Any whale-flavoured strategy that depends on knowing the wallet's full history is therefore mechanically biased toward the wallets where coverage happens to be best, unless the strict bound is applied. That is the bias Test 3 quantifies and removes.

What still has a bounded caveat

The market-level resolution timestamp is built from the last-trade proxy rather than the gamma API's closedTime field. We checked a sample and the two agree where both exist, but a full closedTime-vs-proxy comparison was deferred. Concretely: if a market's true close time is materially earlier than the last trade in our data, the deferred-resolution scorecard in Tests 2 and 3 would shift slightly. We do not believe this would flip any verdict because the surviving lane (track record) already fails for an unrelated reason (the Yes/No split).
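
For completeness, the deferred comparison is a single query. A sketch, assuming the gamma payload's closedTime is stored as an epoch-seconds column named closed_time (the column name, encoding, and DB path are assumptions about the schema):

import sqlite3

conn = sqlite3.connect("polymarket_mirror.db")  # hypothetical path
rows = conn.execute("""
    SELECT m.market_id,
           m.closed_time      AS gamma_close,       -- assumed column
           MAX(tr.timestamp)  AS last_trade_proxy
    FROM markets m
    JOIN trades tr ON tr.market_id = m.market_id
    WHERE m.resolved = 1 AND m.voided = 0 AND m.closed_time IS NOT NULL
    GROUP BY m.market_id
""").fetchall()

# Markets where the proxy runs materially past the recorded close
suspect = [r for r in rows if r[2] - r[1] > 3600]  # 1-hour threshold, arbitrary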

3. How the data was processed (steps + code)

Step A - define excess return per dollar

For a BUY at price p: if the bought outcome resolves true, the wallet earns (1 - p) per $1; if it resolves false, the wallet loses p per $1. Token-adjusted, this puts Yes-side and No-side BUYs on the same scale, which is what makes the verdict tables comparable across lanes.

Return definition + market-clustered bootstrap

The bootstrap resamples market-level means, not raw rows, so a few crowded markets cannot mechanically dominate the confidence interval.

def _return_per_dollar(side: str, price: float, correct: bool) -> float:
    """Return per $1 bet. BUY: pay price, win 1.0 if correct. SELL: receive price, owe 1.0 if wrong."""
    if side == "BUY":
        return (1.0 - price) if correct else -price
    else:  # SELL
        return price if correct else -(1.0 - price)


def _bootstrap_excess_return(market_excess: dict[str, list[float]], n_iter: int = 10_000) -> tuple[float, float, float, float]:
    """Bootstrap market-level mean excess return. Returns (mean, ci_lo, ci_hi, p_value)."""
    market_ids = list(market_excess.keys())
    market_means = np.array([np.mean(v) for v in market_excess.values()])
    n = len(market_means)
    if n == 0:
        return 0.0, 0.0, 0.0, 1.0

    rng = np.random.default_rng(42)
    boot_means = np.array([
        rng.choice(market_means, size=n, replace=True).mean()
        for _ in range(n_iter)
    ])
    mean_val = float(market_means.mean())
    ci_lo = float(np.percentile(boot_means, 2.5))
    ci_hi = float(np.percentile(boot_means, 97.5))
    p_val = float(np.mean(boot_means <= 0))
    return mean_val, ci_lo, ci_hi, p_val


def _binomial_p(hits: int, n: int, base_rate: float) -> float:
    """One-sided binomial test p-value (probability of seeing >= hits by chance)."""
    from scipy.stats import binom
    if n == 0:
        return 1.0
    return float(1 - binom.cdf(hits - 1, n, base_rate))


# ─────────────────────────────────────────────────────────────────────────────
# Step 1: Base rates
# ─────────────────────────────────────────────────────────────────────────────

def compute_base_rates(conn) -> dict:
    """Compute market-level and trade-level base rates for both BUY and SELL."""
    cur = conn.cursor()

    # Market-level: what fraction resolved Yes vs No?
    cur.execute("""
        SELECT outcome, COUNT(*) FROM markets
        WHERE resolved=1 AND voided=0

Concretely: a BUY at price 0.30 that resolves True earns +$0.70 per dollar; the same BUY resolving False loses $0.30. We average those numbers per market, then resample 1,000 times to get a 95% CI. The market-clustered resampling is what protects us against a single highly-traded market dragging the result.
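
Plugging the worked numbers into the functions above (a usage check, not pipeline code):

from math import isclose

assert isclose(_return_per_dollar("BUY", 0.30, correct=True), 0.70)    # resolves True
assert isclose(_return_per_dollar("BUY", 0.30, correct=False), -0.30)  # resolves False

# Market-clustered input: market_id -> list of per-episode excess returns
market_excess = {
    "mkt_a": [0.70, -0.30],        # averaged within the market first
    "mkt_b": [-0.30],
    "mkt_c": [0.10, 0.05, -0.30],
}
mean, ci_lo, ci_hi, p_value = _bootstrap_excess_return(market_excess)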

Step B - collapse split fills into episodes

The raw feed contains many split fills: one logical $200K bet often shows up as 30 raw rows of $5K-$10K. An episode is defined as: same wallet + same token + same side, with each fill within 60 seconds of the previous one, collapsed to one volume-weighted price. Without this step, large bets are silently double-counted in the row-level analysis.

Episode construction

Order: time-sort, then walk forward grouping by (wallet, token, side) within the time window.

def build_episodes(conn) -> list[dict]:
    """Deduplicate trades into episodes. Group by (wallet, token, side) within 60s window.
    Returns list of episode dicts sorted by (proxy_wallet, timestamp)."""
    cur = conn.cursor()
    cur.execute("""
        SELECT tr.proxy_wallet, tr.token_id, tr.market_id, tr.side, tr.size, tr.price, tr.timestamp,
               tk.outcome_label, m.outcome as market_outcome, m.resolved, m.voided
        FROM trades tr
        JOIN tokens tk ON tk.token_id = tr.token_id
        JOIN markets m ON m.market_id = tr.market_id
        WHERE tk.outcome_label IN ('Yes', 'No')
        ORDER BY tr.proxy_wallet, tr.timestamp
    """)

    # Group into episodes: same (wallet, token, side) within 60s
    episodes = []
    pending = {}  # key: (wallet, token, side) -> list of (size, price, timestamp, ...)

    def flush(key, rows):
        total_size = sum(r[0] for r in rows)
        vwap = sum(r[0] * r[1] for r in rows) / total_size if total_size > 0 else rows[0][1]
        first_ts = rows[0][2]
        episodes.append({
            "proxy_wallet": key[0],
            "token_id": key[1],
            "side": key[2],
            "market_id": rows[0][3],
            "bet_size": total_size,
            "price": vwap,
            "timestamp": first_ts,
            "outcome_label": rows[0][4],
            "market_outcome": rows[0][5],
            "resolved": rows[0][6],
            "voided": rows[0][7],
        })

    for row in cur.fetchall():
        key = (row["proxy_wallet"], row["token_id"], row["side"])
        entry = (row["size"], row["price"], row["timestamp"], row["market_id"],
                 row["outcome_label"], row["market_outcome"], row["resolved"], row["voided"])

        if key in pending:
            last_ts = pending[key][-1][2]
            if row["timestamp"] - last_ts <= 60:
                pending[key].append(entry)
            else:
                flush(key, pending[key])
                pending[key] = [entry]
        else:
            pending[key] = [entry]

    for key, rows in pending.items():
        flush(key, rows)

    episodes.sort(key=lambda e: (e["proxy_wallet"], e["timestamp"]))
    return episodes

Concretely: we run the same Test 1 lanes twice - once on raw rows, once on episodes. If the episode-level result disagreed with the raw-row result, that would be a hint that the negative is a double-counting artifact. They agree, so the negative is real.

Step C - build wallet scorecards with deferred resolution

For prior-skill labels, a BUY only counts toward a wallet's track record once we believe the underlying market had already finished trading before the current episode. Without this guard, the wallet's 'prior' bets would include bets on the same market the wallet is actively in - a leakage that would inflate the apparent skill of any wallet.

Deferred-resolution scorecard

The point-in-time scorecard used to label whale tier and track-qualified state at episode time.

def build_scorecards(episodes: list[dict], base_rates: dict) -> dict:
    """Build as-of-time wallet scorecards with deferred resolution.

    Critical: a prior bet on market X only counts as "resolved" in the scorecard
    if market X's last trade timestamp < current episode's timestamp. This is a
    conservative proxy for "market X had resolved by the time of the current trade."

    Without this, the scorecard uses future information (whether a market eventually
    resolved), creating look-ahead bias. GPT-5.4 review identified this as a critical
    flaw: 57% of scorecard entries had at least one look-ahead instance.

    For each wallet, iterate chronologically. At each episode, compute the scorecard
    from only those prior bets whose markets had finished trading (last_trade < now):
    - n_resolved: resolved bets known at this time
    - hit_rate: fraction correct among known-resolved
    - excess_return: mean excess return per $1 vs token-adjusted base
    - max_trade_so_far: max bet_size in all prior episodes

    Returns dict: wallet -> list of (episode, scorecard_at_time) tuples.
    """
    # Build resolution proxy: last trade per market
    market_last_trade = _build_market_last_trade(episodes)

    wallet_episodes: dict[str, list[dict]] = defaultdict(list)
    for ep in episodes:
        wallet_episodes[ep["proxy_wallet"]].append(ep)

    # wallet -> list of (episode_dict, scorecard_dict)
    wallet_cards: dict[str, list[tuple[dict, dict]]] = {}

    for wallet, eps in wallet_episodes.items():
        eps.sort(key=lambda e: e["timestamp"])

        max_trade_so_far = 0.0
        # Deferred resolution: store pending bets, resolve them when their market's
        # last trade is in the past. Each entry: (market_last_ts, correct, ret, excess)
        pending_bets: list[tuple[int, bool, float, float]] = []
        # Resolved history (only bets whose markets are known-resolved at current time)
        resolved_history: list[tuple[bool, float, float]] = []

        cards: list[tuple[dict, dict]] = []

        for ep in eps:
            current_ts = ep["timestamp"]

            # Resolve any pending bets whose market's last trade is now in the past
            still_pending = []
            for market_last_ts, correct, ret, excess in pending_bets:
                if market_last_ts < current_ts:
                    resolved_history.append((correct, ret, excess))
                else:
                    still_pending.append((market_last_ts, correct, ret, excess))
            pending_bets = still_pending

            # Scorecard BEFORE this episode (as-of-time, deferred resolution)
            n_resolved = len(resolved_history)
            n_correct = sum(1 for c, _, _ in resolved_history if c)
            cum_excess = sum(e for _, _, e in resolved_history)
            cum_return = sum(r for _, r, _ in resolved_history)

            scorecard = {
                "n_resolved": n_resolved,
                "hit_rate": n_correct / n_resolved if n_resolved > 0 else 0.0,
                "mean_excess_return": cum_excess / n_resolved if n_resolved > 0 else 0.0,
                "mean_return": cum_return / n_resolved if n_resolved > 0 else 0.0,
                "max_trade_so_far": max_trade_so_far,
                "tier_so_far": _classify_tier(max_trade_so_far),
                "resolved_history": list(resolved_history),  # copy for rolling window
            }
            cards.append((ep, scorecard))

            # Update running stats after recording scorecard
            max_trade_so_far = max(max_trade_so_far, ep["bet_size"])

            # Add this episode to pending if it's on a resolved market
            if ep["resolved"] == 1 and ep["voided"] == 0:
                correct = _is_correct(ep["side"], ep["outcome_label"], ep["market_outcome"])
                ret = _return_per_dollar(ep["side"], ep["price"], correct)
                base_ret = _token_base_rate(ep["outcome_label"], ep["side"], base_rates)
                excess = ret - base_ret
                market_last_ts = market_last_trade.get(ep["market_id"], current_ts)
                pending_bets.append((market_last_ts, correct, ret, excess))

        wallet_cards[wallet] = cards

    return wallet_cards

Concretely: for an episode at 2026-03-15 09:30, the scorecard counts only bets the wallet placed on markets that had stopped trading before 2026-03-15 09:30. The whale and track-record labels for that episode are computed from that historical view, not from the wallet's full lifetime stats.

Step D - apply the strict completeness bound

For every wallet at the API cap (about 4,000 trades), mark its lane observations as 'capped'. Recompute every whale / track-dependent metric on the strict-complete subset only. Re-bootstrap. Then split each strict result by Yes-token vs No-token to test whether the signal is robust or a token-mix artifact.

Strict completeness bound

Drives the right-hand columns of the Test 3 table. The 4,000-trade ceiling is the API cap; wallets at or above it are flagged as 'capped' and excluded from the strict-complete subset.

def _annotate_wallet_counts(observations: list[dict], wallet_trade_counts: dict[str, int]) -> list[dict]:
    annotated: list[dict] = []
    for obs in observations:
        wallet_trade_count = wallet_trade_counts.get(obs["wallet"], 0)
        annotated.append(
            {
                **obs,
                "wallet_trade_count": wallet_trade_count,
                "strict_complete_wallet": wallet_trade_count < WALLET_HISTORY_STRICT_LIMIT,
            }
        )
    return annotated


def _track_10_60(obs: dict) -> bool:
    return obs["n_resolved"] >= 10 and obs["hit_rate"] >= 0.60


def _track_20_65(obs: dict) -> bool:
    return obs["n_resolved"] >= 20 and obs["hit_rate"] >= 0.65


def _whale(obs: dict) -> bool:
    return obs["tier_so_far"] in ("whale", "mega_whale")


def _lane_observations(observations: list[dict]) -> dict[str, list[dict]]:
    whale_plus = [obs for obs in observations if _whale(obs)]
    track_10_60 = [obs for obs in observations if _track_10_60(obs)]
    track_primary = [obs for obs in observations if _track_20_65(obs)]
    combined = [obs for obs in observations if _whale(obs) and _track_10_60(obs)]
    whale_only = [obs for obs in whale_plus if not _track_10_60(obs)]
    track_only = [obs for obs in track_10_60 if not _whale(obs)]
    return {
        "aot_whale_plus": whale_plus,
        "whale_only_nontrack_10_60": whale_only,
        "track_primary_20_65": track_primary,
        "track_10_60_all": track_10_60,
        "track_only_nonwhale_10_60": track_only,
        "combined_whale_track_10_60": combined,
    }


def _lane_wallets(lane_sets: dict[str, list[dict]]) -> dict[str, list[str]]:
    return {
        lane: sorted({obs["wallet"] for obs in values})
        for lane, values in lane_sets.items()
    }


def _refresh_wallets(wallets: list[str]) -> dict:
    endpoints = [WALLET_CHECKPOINT_PREFIX + wallet for wallet in wallets]
    deleted = delete_checkpoints(endpoints)
    results = ingest_wallet_histories(proxy_wallets=wallets)
    inserted = sum(result.trades_inserted for result in results.values())
    errors = {wallet: result.error for wallet, result in results.items() if result.error}
    four_page_wallets = sorted(
        wallet for wallet, result in results.items() if result.pages_fetched >= 4
    )
    return {
        "wallets_requested": len(wallets),
        "wallet_checkpoints_deleted": deleted,
        "wallets_returned": len(results),
        "trades_inserted": inserted,
        "error_count": len(errors),
        "errors": errors,
        "wallets_with_4_pages": len(four_page_wallets),
        "sample_wallets_with_4_pages": four_page_wallets[:25],
    }


def _summarize_lane(name: str, observations: list[dict], n_iter: int) -> LaneSummary:
    strict = [obs for obs in observations if obs["strict_complete_wallet"]]
    yes_all = [obs for obs in observations if obs["outcome_label"] == "Yes"]
    no_all = [obs for obs in observations if obs["outcome_label"] == "No"]
    yes_strict = [obs for obs in strict if obs["outcome_label"] == "Yes"]
    no_strict = [obs for obs in strict if obs["outcome_label"] == "No"]

    capped_wallets = {
        obs["wallet"] for obs in observations if not obs["strict_complete_wallet"]
    }
    strict_wallets = {
        obs["wallet"] for obs in observations if obs["strict_complete_wallet"]
    }
    capped_obs = sum(1 for obs in observations if not obs["strict_complete_wallet"])
    share = capped_obs / len(observations) if observations else 0.0

    return LaneSummary(
        overall=summarize_subset(f"{name}_overall", observations, n_iter=n_iter),
        strict_complete=summarize_subset(f"{name}_strict_complete", strict, n_iter=n_iter),
        yes_overall=summarize_subset(f"{name}_yes_overall", yes_all, n_iter=n_iter),
        no_overall=summarize_subset(f"{name}_no_overall", no_all, n_iter=n_iter),
        yes_strict_complete=summarize_subset(f"{name}_yes_strict_complete", yes_strict, n_iter=n_iter),
        no_strict_complete=summarize_subset(f"{name}_no_strict_complete", no_strict, n_iter=n_iter),
        observation_share_from_capped_wallets=round(float(share), 4),
        capped_wallet_count_in_lane=len(capped_wallets),
        strict_complete_wallet_count_in_lane=len(strict_wallets),
    )

Concretely: this is the step that flips the whale headline. The pooled lane is positive. The strict-complete subset is not. The difference is the wallets we cannot fully verify.

3b. Cohort methodology - how Tests 6 and 7 were built

Tests 6 and 7 ride on top of the same point-in-time scorecards used by Test 3. The added pieces:

Cohort construction

  • For each wallet (strict-complete subset only) sort episodes by timestamp.
  • Find the first episode where the wallet has at least 10 known-resolved priors. Call this the 'qualifying episode'.
  • Read the wallet's hit rate at the qualifying episode (no look-ahead - this only counts priors whose markets had stopped trading by then).
  • If the hit rate is at or above 0.70, mark the wallet as early-skilled.
  • Future bets = every episode the wallet places at or after the qualifying episode.

Permutation control

To test whether the cohort is genuinely different from a random selection of similar-sized wallets, we run a label-shuffle test:

  • Bucket every qualifying wallet by its n_resolved count at qualification (buckets at 10-15, 15-25, 25-50, 50-100, 100-200, 200+).
  • Within each bucket, randomly relabel the same number of wallets as 'early-skilled', preserving the trade-count distribution of the real cohort.
  • Recompute the future-bet excess return for the shuffled cohort. Repeat 1,000 times.
  • Compare the observed cohort mean to the empirical null distribution.

Concretely: the observed cohort mean of +0.0051 sits at the p=0.008 one-sided tail of the null (null median -0.0002, 95th-percentile null +0.0035). So the cohort really is different from 'any 1,045 random wallets that happened to graduate'. That is the basic-null gate and it passes. What does not pass is the Yes/No robustness split (see Test 6 in the previous tab).
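
The quoted p=0.008 is simply the rank of the observed cohort mean inside the shuffled null. A sketch of that final comparison (the normal draw here stands in for the 1,000 real shuffled-cohort means):

import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the 1,000 means produced by the label-shuffle above
null_means = rng.normal(loc=-0.0002, scale=0.002, size=1_000)
observed = 0.0051

# One-sided: how often does a shuffled cohort do at least as well?
p_one_sided = float(np.mean(null_means >= observed))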

Test 7: above-average vs at-or-below-average bet size

For each early-skilled wallet, we track a running mean of bet sizes over priors at episode time (updated after each episode). Each future bet is classified as above-average if its size is strictly greater than the running mean at that moment, otherwise at-or-below-average. Bootstrap each lane separately. The lanes are computed on the same cohort, so any difference is the size effect within a fixed wallet population.
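
A sketch of that running-mean split (the helper name is ours, and the bet_size field is assumed to be carried on each observation; the convention that a wallet's first bet lands in at-or-below is also an assumption):

from collections import defaultdict

def split_by_running_mean(future_obs: list[dict]) -> tuple[list[dict], list[dict]]:
    """Classify each future bet as above vs at-or-below the wallet's running
    mean bet size; the mean is updated only after the bet is classified."""
    above: list[dict] = []
    at_or_below: list[dict] = []
    running = defaultdict(lambda: (0.0, 0))  # wallet -> (size_sum, count)
    for obs in sorted(future_obs, key=lambda o: (o["wallet"], o["timestamp"])):
        size_sum, n = running[obs["wallet"]]
        mean_so_far = size_sum / n if n else 0.0
        if n and obs["bet_size"] > mean_so_far:
            above.append(obs)
        else:
            at_or_below.append(obs)  # includes the first bet, which has no prior mean
        running[obs["wallet"]] = (size_sum + obs["bet_size"], n + 1)
    return above, at_or_below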

Cohort + permutation logic

Identify-cohort and future-observation helpers; the permutation null shuffles labels within trade-count buckets.

def _identify_cohort(
    observations: list[dict],
    training_bets: int,
    threshold: float,
) -> tuple[set[str], dict[str, dict]]:
    """For each wallet, find the first episode where n_resolved >= training_bets.

    Returns:
        - set of early-skilled wallets (hit_rate >= threshold at qualifying episode)
        - dict[wallet] -> {'qualifying_timestamp', 'hit_rate_at_qualification',
                          'n_resolved_at_qualification', 'is_early_skilled'}
    """
    by_wallet: dict[str, list[dict]] = defaultdict(list)
    for obs in observations:
        by_wallet[obs["wallet"]].append(obs)

    cohort: set[str] = set()
    wallet_meta: dict[str, dict] = {}
    for wallet, obs_list in by_wallet.items():
        obs_list.sort(key=lambda o: o["timestamp"])
        qualifying = next(
            (obs for obs in obs_list if obs["n_resolved"] >= training_bets),
            None,
        )
        if qualifying is None:
            continue
        is_skilled = qualifying["hit_rate"] >= threshold
        if is_skilled:
            cohort.add(wallet)
        wallet_meta[wallet] = {
            "qualifying_timestamp": int(qualifying["timestamp"]),
            "hit_rate_at_qualification": float(qualifying["hit_rate"]),
            "n_resolved_at_qualification": int(qualifying["n_resolved"]),
            "is_early_skilled": bool(is_skilled),
        }
    return cohort, wallet_meta


def _future_observations(
    observations: list[dict],
    wallet_meta: dict[str, dict],
    cohort_wallets: set[str] | None = None,
) -> list[dict]:
    """Return episodes that occur at or after each wallet's qualifying timestamp.

    If cohort_wallets is given, restrict to those wallets. Otherwise return
    future bets for every wallet that has a qualifying timestamp.
    """
    out: list[dict] = []
    for obs in observations:
        meta = wallet_meta.get(obs["wallet"])
        if meta is None:
            continue
        if cohort_wallets is not None and obs["wallet"] not in cohort_wallets:
            continue
        if obs["timestamp"] < meta["qualifying_timestamp"]:
            continue
        out.append(obs)
    return out


def _market_clustered_mean(observations: list[dict]) -> float:
    if not observations:
        return 0.0
    by_market: dict[str, list[float]] = defaultdict(list)
    for obs in observations:
        by_market[obs["market_id"]].append(float(obs["excess_return"]))
    market_means = [sum(vals) / len(vals) for vals in by_market.values()]
    return sum(market_means) / len(market_means)


def _permutation_control(
    qualifying_obs: list[dict],
    wallet_meta: dict[str, dict],
    *,
    n_permutations: int,
    seed: int = 17,
) -> dict:
    """Trade-count-matched permutation: shuffle early-skilled labels within
    n_resolved-at-qualification buckets, recompute the lane mean.

    Buckets are integer ranges of n_resolved-at-qualification: [10,15), [15,25),
    [25,50), [50,100), [100,200), [200, +inf). This keeps wallets compared
    against peers who took roughly as long to graduate.
    """

    def _bucket(n_resolved: int) -> int:
        if n_resolved < 15:
            return 0
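
The listing above cuts off inside _bucket. For completeness, a self-contained sketch of the remaining shuffle logic, using the bucket edges from the docstring (a reconstruction under stated assumptions, not the original code):

import numpy as np

BUCKET_EDGES = [10, 15, 25, 50, 100, 200]  # last bucket is [200, inf)

def bucket_of(n_resolved: int) -> int:
    """Index of the n_resolved-at-qualification bucket."""
    for i, edge in enumerate(BUCKET_EDGES[1:]):
        if n_resolved < edge:
            return i
    return len(BUCKET_EDGES) - 1

def shuffled_cohorts(wallet_meta: dict[str, dict], n_permutations: int, seed: int = 17):
    """Yield fake 'early-skilled' wallet sets, preserving the real cohort's
    size within every bucket (the trade-count-matched null)."""
    rng = np.random.default_rng(seed)
    by_bucket: dict[int, list[str]] = {}
    skilled_in_bucket: dict[int, int] = {}
    for wallet, meta in wallet_meta.items():
        b = bucket_of(meta["n_resolved_at_qualification"])
        by_bucket.setdefault(b, []).append(wallet)
        skilled_in_bucket[b] = skilled_in_bucket.get(b, 0) + int(meta["is_early_skilled"])
    for _ in range(n_permutations):
        fake: set[str] = set()
        for b, wallets in by_bucket.items():
            k = skilled_in_bucket[b]
            if k:
                fake.update(rng.choice(wallets, size=k, replace=False).tolist())
        yield fake

# Each fake cohort is then scored with _future_observations + _market_clustered_mean
# and compared against the observed cohort mean.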
4. Results and how we interpreted them

The verdict tables in Tests & Results are the persisted output of the pipeline above. The interpretation rules below define what each cell means.

Reading the metrics

  • Excess / $: token-adjusted return per dollar staked, averaged at the market level then resampled. Negative means losing money on average. Magnitudes in the 0.01-0.10 range are typical because resolved binary markets rarely close exactly at the BUY price.
  • 95% CI: market-clustered bootstrap, 1,000 resamples. If the CI excludes 0 the lane has statistical support.
  • p-value: bootstrap p-value against zero excess. We treat p<0.05 as 'has signal', p>0.5 as 'no signal'.
  • Top-10 share: how concentrated the lane is (see the sketch below for the computation). Above 30% means a few wallets are driving the result.
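
A sketch of the concentration metric (we assume the share is of lane stake attributed to the ten biggest wallets; the exact weighting behind the persisted tables may differ):

from collections import defaultdict

def top10_share(observations: list[dict]) -> float:
    """Fraction of total lane stake held by the ten biggest wallets."""
    per_wallet: dict[str, float] = defaultdict(float)
    for obs in observations:
        per_wallet[obs["wallet"]] += obs["bet_size"]  # bet_size assumed present
    total = sum(per_wallet.values())
    top10 = sum(sorted(per_wallet.values(), reverse=True)[:10])
    return top10 / total if total else 0.0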

Decision rules used to land the final verdict

  • If the lane is negative on both raw rows and episodes, it fails the literal test. (Test 1)
  • If the lane is positive in pooled form but the strict-complete subset is non-positive, the pooled result is data-completeness-driven, not real. (Test 3)
  • If the pooled lane is positive but the Yes-only or No-only strict subset is non-positive, the pooled result is a token-mix effect, not skill. (Track record case in Test 3)
  • If the lane is positive on a non-overlap partition, the effect is real for that partition. (Track-only-non-whale in Test 2)

What we did not conclude

  • That whales are bad traders. The audit only rules out a deployable wallet-following strategy. Individual whales may still be skilled.
  • That track record is useless. The pooled positive result on track-qualified wallets is real; it is just not robust enough to ship.
  • That the data is broken. After the two source-truth fixes, the residual issue is a documented API cap with a defensible bound, not a bug.