Polymarket: should you follow the whales? Verdict: no.
Audit period: 2026-03 to 2026-04 · Data snapshot: 2026-04-23 · Local mirror: 6.7 GB SQLite, 2.98M trades, 49K markets, 15K wallets.
Following large bets on Polymarket does not work.
The literal version of the question fails on every cut we tried. Aggregate whale wallets look positive when pooled, but that headline collapses once we control for prior skill, wallet concentration, and incomplete trade history.
Recommendation: do not deploy a wallet-following or large-bet strategy. Prior wallet track record is the only surviving research direction, and even that is not clean enough to ship.
Headline numbers
| Question | Result | Verdict |
|---|---|---|
| Do large bets ($50K-$500K) outperform? Deduplicated BUY episodes, market-clustered bootstrap. | -0.0092 per $ (p=0.9693) | FAIL |
| Do very large bets ($500K+) outperform? Same lane, top size bucket. | -0.0015 per $ (N=298, p=0.7264) | FAIL |
| Do whale wallets outperform (pooled)? Aggregate whale lane, no completeness bound. | +0.0073 per $ | MISLEADING |
| Do whale wallets outperform after fixing data? Strict completeness bound: drop wallets where API truncates history. | -0.0025 per $ | FAIL |
| Does prior track record outperform? Wallets with 20+ resolved priors, hit rate ≥ 65%. | +0.0075 per $ (fails Yes/No robustness split) | INCONCLUSIVE |
| Pre-filter by first 10 resolved bets, then follow? Cohort: wallets with ≥70% hit rate on their first 10 resolved priors. | +0.0051 per $ (passes permutation null at p=0.008, but fails Yes/No split) | INCONCLUSIVE |
| Within that cohort, do bigger bets work better? Above-avg vs at-or-below-avg bet size for cohort wallets. | -0.0006 per $ delta (bigger bets very slightly worse, not better) | FAIL |
How to read this: Excess / $ (the "per $" figures in the Result column) is the average token-adjusted return per dollar staked, bootstrapped with markets, not rows, as the resampling unit. A value of -0.0092 means that for every $1 you put through this lane, you ended up on average about 0.9 cents worse off. Every lane marked FAIL is negative.
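As a rough illustration of the metric, the sketch below applies the excess definition used in this audit (1 - price for a correct BUY, -price for a wrong one) to four made-up episodes. The episode values and the unweighted mean are assumptions for illustration only; the exact dollar weighting used in the audit is not spelled out here.

```python
def excess_return(price: float, won: bool) -> float:
    """Excess for a BUY: 1 - price if the bought token won, -price otherwise."""
    return (1.0 - price) if won else -price


def expected_excess(price: float, win_prob: float) -> float:
    """Expected excess of a BUY at `price` when the token wins with `win_prob`."""
    return win_prob * excess_return(price, True) + (1.0 - win_prob) * excess_return(price, False)


# Hypothetical episodes (not audit data): (entry price, did the bought token win?)
episodes = [(0.60, True), (0.60, False), (0.25, False), (0.90, True)]

excess = [excess_return(price, won) for price, won in episodes]
mean_excess = sum(excess) / len(excess)  # unweighted mean, an assumption of this sketch

# A bet priced exactly at its true win probability has zero expected excess.
fair_bet = expected_excess(price=0.30, win_prob=0.30)
```

Because a fairly priced bet nets zero in expectation on this measure, a persistent non-zero mean reflects mispricing captured or paid, regardless of whether the wallet bought the Yes or the No token.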
The most important row is row 4 (whale wallets after fixing data). The aggregate-whale lane swings by about 1.0 cent per dollar, from +0.0073 to -0.0025, when we drop the wallets whose history we cannot fully verify. That is the whole story in one number: the positive whale headline lives entirely inside the wallets where we did not have clean data.
What this means in plain English
- Big bets are not informed bets. A wallet writing a $200K ticket is no more likely to be right than the average BUY. The largest tickets in the data lose money in aggregate.
- The whale story was a data artifact. When we re-ran whales on the wallets where we trust the trade history end-to-end, the positive sign disappeared.
- Prior skill has a faint pulse, but no edge that would survive trading frictions. Pre-filtering wallets by their first 10 resolved bets does identify a group that genuinely differs from random selection (permutation control p=0.008), but the magnitude is about half a cent per dollar and the result fails when you split it into Yes-only and No-only future bets.
- Bet size adds nothing inside the cohort. Among the early-skilled wallets, their above-average future bets are not more profitable than their below-average future bets. Whatever weak signal exists is at the wallet level, not the bet level.
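The permutation control mentioned in the bullets above can be sketched roughly as follows: compare the cohort's mean excess with the mean excess of randomly drawn wallet groups of the same size. This is a minimal sketch under assumptions, not the audit's actual test; the function name, the data layout (a dict from wallet to a list of per-bet excess values), and the toy numbers are all hypothetical.

```python
import random


def permutation_p_value(excess_by_wallet, cohort, n_perm=2000, seed=0):
    """One-sided p-value: share of random same-size wallet groups whose
    mean excess is at least as high as the cohort's."""
    rng = random.Random(seed)
    wallets = list(excess_by_wallet)

    def mean_excess(group):
        values = [x for w in group for x in excess_by_wallet[w]]
        return sum(values) / len(values)

    observed = mean_excess(cohort)
    hits = 0
    for _ in range(n_perm):
        # Draw a random group of wallets, same size as the cohort, without replacement.
        if mean_excess(rng.sample(wallets, len(cohort))) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): 50 wallets, the first 5 clearly better than the rest.
toy = {f"w{i}": [-0.1, -0.1] for i in range(50)}
for i in range(5):
    toy[f"w{i}"] = [0.4, 0.4]

p_skilled = permutation_p_value(toy, [f"w{i}" for i in range(5)])
p_typical = permutation_p_value(toy, [f"w{i}" for i in range(10, 15)])
```

A small p-value only says the cohort differs from random selection; it says nothing about whether the gap is large enough to trade, which is the distinction the bullets draw.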
How we got here, in six steps
| Step | What we did | Why it matters |
|---|---|---|
| 1 | Pulled ~6.7 GB of trades, markets, tokens, and per-wallet history into a local SQLite mirror. | Removes API rate limits and lets us run the same strategy thousands of times deterministically. |
| 2 | Defined excess return as 1 - price for correct BUYs and -price for wrong BUYs. | Token-adjusted return. A naive hit-rate confuses Yes-side bets and No-side bets. |
| 3 | Collapsed split fills into episodes (same wallet, same token, same side, ≤60s window). | The raw feed has many split orders. Without dedupe, one logical bet looks like 30 wins or losses. |
| 4 | Bootstrapped uncertainty by resampling at the market level, not the row level. | A few crowded markets dominate the row count. Market-clustered CIs are the honest measure. |
| 5 | Tested the wallet-history fetch path against Polymarket's source API and fixed two bugs. | Without these fixes, prior-bet counts were under-counted by up to a full page (3000 entries). |
| 6 | Bounded each whale / track lane to wallets where the local history is provably complete. | The live API caps the heaviest wallets. The pooled positive whale signal sits inside that cap. |
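Step 3 in the table above can be sketched as a grouping pass over raw fills. This is a minimal sketch, not the audit's code: the field names (`wallet`, `token`, `side`, `ts`, `price`, `size`) are hypothetical, and it assumes the 60-second window is measured from the first fill of each episode, which the table does not specify.

```python
from collections import defaultdict

WINDOW_S = 60  # window length from step 3; anchoring it on the first fill is an assumption


def collapse_fills(fills):
    """Collapse split fills into episodes keyed by (wallet, token, side).

    Each fill is a dict with hypothetical keys: wallet, token, side,
    ts (seconds), price, size (shares, > 0).
    """
    groups = defaultdict(list)
    for fill in fills:
        groups[(fill["wallet"], fill["token"], fill["side"])].append(fill)

    episodes = []
    for (wallet, token, side), rows in groups.items():
        rows.sort(key=lambda r: r["ts"])
        current = None
        for r in rows:
            # Start a new episode when the fill falls outside the current window.
            if current is None or r["ts"] - current["start_ts"] > WINDOW_S:
                current = {"wallet": wallet, "token": token, "side": side,
                           "start_ts": r["ts"], "size": 0.0, "notional": 0.0, "n_fills": 0}
                episodes.append(current)
            current["size"] += r["size"]
            current["notional"] += r["price"] * r["size"]
            current["n_fills"] += 1

    for e in episodes:
        e["price"] = e["notional"] / e["size"]  # size-weighted average entry price
    return episodes


# Toy fills (hypothetical): three BUY fills inside one window, one later BUY, one SELL.
fills = [
    {"wallet": "a", "token": "t1", "side": "BUY", "ts": 0, "price": 0.50, "size": 100},
    {"wallet": "a", "token": "t1", "side": "BUY", "ts": 10, "price": 0.52, "size": 100},
    {"wallet": "a", "token": "t1", "side": "BUY", "ts": 60, "price": 0.54, "size": 100},
    {"wallet": "a", "token": "t1", "side": "BUY", "ts": 200, "price": 0.55, "size": 50},
    {"wallet": "a", "token": "t1", "side": "SELL", "ts": 5, "price": 0.51, "size": 20},
]
episodes = collapse_fills(fills)
```

Here five raw fills become three episodes, so a single logical bet is scored once rather than once per fill.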
Concretely: if you skip steps 3-6, the same data set gives you a cleaner-looking but misleading whale story. Steps 3 and 4 take care of inflated row counts. Steps 5 and 6 take care of the silent data-coverage cap that is doing the heavy lifting in the pooled positive numbers.
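The market-level resampling in step 4 can be illustrated with a percentile bootstrap that draws whole markets with replacement, so a crowded market enters or leaves each resample as a block. This is a sketch under assumptions: the function name, the `(market_id, excess)` row layout, the 95% percentile interval, and the toy rows are hypothetical choices, not the audit's implementation.

```python
import random
from collections import defaultdict


def market_clustered_bootstrap(rows, n_boot=2000, seed=0):
    """rows: iterable of (market_id, excess). Returns (point, ci_low, ci_high)."""
    by_market = defaultdict(list)
    for market_id, excess in rows:
        by_market[market_id].append(excess)
    markets = list(by_market)

    all_values = [x for values in by_market.values() for x in values]
    point = sum(all_values) / len(all_values)

    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        # Resample markets, not rows: every row of a drawn market comes along.
        drawn = rng.choices(markets, k=len(markets))
        total = sum(sum(by_market[m]) for m in drawn)
        count = sum(len(by_market[m]) for m in drawn)
        means.append(total / count)

    means.sort()
    ci_low = means[int(0.025 * n_boot)]
    ci_high = means[int(0.975 * n_boot) - 1]
    return point, ci_low, ci_high


# Toy rows (hypothetical): one crowded market with 30 rows and four thin markets.
rows = [("crowded", 0.05)] * 30 + [("m1", -0.20), ("m2", -0.10), ("m3", 0.10), ("m4", -0.30)]
point, ci_low, ci_high = market_clustered_bootstrap(rows)
```

In the toy data the crowded market supplies 30 of 34 rows, so a row-level bootstrap would treat its rows as 30 independent observations; resampling by market treats them as one cluster and lets the interval reflect that.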
What is final vs still bounded
| Status | Statement |
|---|---|
| FINAL | Following large current bets is not a supported strategy. |
| FINAL | The original whale headline did not survive proper data treatment. |
| FINAL | Bigger-than-usual bets within the early-skilled cohort do not outperform smaller bets. The 'skilled and confident' rescue fails. |
| BOUNDED | Pre-filtering by early track record produces a cohort that genuinely differs from random selection (permutation p=0.008), but the magnitude is small and the result fails the Yes/No robustness split. |
| BOUNDED | One optional follow-up: compare market closedTime against the last-trade resolution proxy. |
Concretely: "FINAL" means we tried hard to break the negative result and could not. "BOUNDED" means the result is positive in pooled form but does not pass at least one of our robustness checks, so we will not deploy it as is. The new cohort tests (Tests 6 and 7 in the next tab) ruled out the strongest version of the wallet-following idea.
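One way to picture the Yes/No robustness split is to compute the mean excess separately for Yes-token and No-token bets and require both halves to be positive. The pass criterion, the function name, and the toy numbers below are assumptions made for illustration; the audit's exact rule is not stated in this summary.

```python
def yes_no_split(episodes):
    """episodes: iterable of (outcome_side, excess), outcome_side in {"Yes", "No"}.

    Returns (mean excess per side, passes), where `passes` requires a positive
    mean on both sides (an assumed criterion for this sketch).
    """
    totals = {"Yes": [0.0, 0], "No": [0.0, 0]}
    for side, excess in episodes:
        totals[side][0] += excess
        totals[side][1] += 1
    means = {side: (total / n if n else None) for side, (total, n) in totals.items()}
    passes = all(m is not None and m > 0 for m in means.values())
    return means, passes


# Toy episodes (hypothetical): positive when pooled, but negative on the No side.
toy_episodes = [("Yes", 0.30), ("Yes", 0.20), ("No", -0.10), ("No", -0.05)]
pooled_mean = sum(x for _, x in toy_episodes) / len(toy_episodes)
side_means, passes = yes_no_split(toy_episodes)
```

The toy case mirrors the BOUNDED pattern described above: the pooled mean is positive, yet one side carries all of it, so the split fails.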