What we tested, and what we found
The original question - "can we follow large bets on Polymarket?" - hides several different hypotheses (large bet vs whale identity vs prior skill). We tested each one independently. Below: the question, the data cut, the result, and a plain-English reading of what the numbers mean.
Test 1 - Direct large-bet effect
Question: if you copy a wallet that just put down a large ticket, do you make money?
Cut: all resolved binary markets, BUYs in three size buckets ($10K-$50K, $50K-$500K, $500K+). Two units of analysis: every raw row, then deduplicated logical episodes (because one logical bet often shows up as 30 raw fills).
| Lane | Unit | N | Wallets | Markets | Excess / $ | 95% CI | p-value |
|---|---|---|---|---|---|---|---|
| raw_trade_10k_50k | Raw BUYs | 39,739 | 20,297 | 458 | -0.0083 | [-0.0134, -0.0031] | p=0.9989 |
| raw_trade_50k_500k | Raw BUYs | 8,977 | 4,377 | 399 | -0.0092 | [-0.0191, +0.0003] | p=0.9718 |
| raw_trade_500k_plus | Raw BUYs | 359 | 131 | 83 | +0.0028 | [-0.0048, +0.0165] | p=0.3669 |
| episode_10k_50k | Episodes | 38,091 | 20,535 | 460 | -0.0074 | [-0.0121, -0.0027] | p=0.9990 |
| episode_50k_500k | Episodes | 8,859 | 4,581 | 411 | -0.0092 | [-0.0192, +0.0005] | p=0.9693 |
| episode_500k_plus | Episodes | 298 | 159 | 94 | -0.0015 | [-0.0066, +0.0063] | p=0.7264 |
Concretely: the $50K-$500K episode lane lands at -0.0092 per dollar. On a $10,000 ticket, that is roughly $92 of expected loss versus the market's own implied price. The $10K-$50K lane is nearly as bad at -0.0074, and its confidence interval sits entirely below zero, so this is not a small-sample fluke.
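The raw-fill deduplication described in the cut above can be sketched as a time-gap merge. This is a minimal illustration, not the report's exact rule: the field layout and the one-hour gap threshold are assumptions.

```python
from itertools import groupby

# Hypothetical fill record: (wallet, market, side, timestamp_s, usd_size).
# The field layout and the one-hour gap threshold are illustrative
# assumptions; the report does not specify the exact dedup rule.
GAP_S = 3600  # fills by the same wallet/market/side within this gap merge

def dedupe_episodes(fills):
    """Collapse raw fills into logical episodes: consecutive fills by the
    same wallet on the same market/side, separated by less than GAP_S,
    become one episode whose size is the summed notional."""
    fills = sorted(fills, key=lambda f: (f[0], f[1], f[2], f[3]))
    episodes = []
    for _, group in groupby(fills, key=lambda f: (f[0], f[1], f[2])):
        group = list(group)
        start, size, last_t = group[0], group[0][4], group[0][3]
        for f in group[1:]:
            if f[3] - last_t < GAP_S:
                size += f[4]           # same episode: accumulate notional
            else:
                episodes.append((start[0], start[1], start[2], size))
                start, size = f, f[4]  # gap too large: open a new episode
            last_t = f[3]
        episodes.append((start[0], start[1], start[2], size))
    return episodes
```

This is why one logical bet that shows up as 30 raw fills counts once in the episode lanes.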
Test 2 - Are whales positive on their own, or only via track-record overlap?
Question: do wallets that have ever crossed the whale threshold beat the base rate, or is the positive headline being driven by the small subset that also happens to have a strong prior track record?
Cut: partition the whale-positive sample into three non-overlapping groups: pure whales (no track), pure track-qualified (no whale), and the overlap of both.
| Lane | What it tests | N | Wallets | Excess / $ | p-value |
|---|---|---|---|---|---|
| aot_whale_plus | All wallets that hit whale tier at any point. | 54,354 | 3,146 | +0.0073 | p=0.0038 |
| whale_only_nontrack_10_60 | Whale wallets that do not also pass the track-record filter. | 16,683 | 3,033 | -0.0027 | p=0.8351 |
| track_only_nonwhale_10_60 | Track-qualified wallets that are not whales. | 55,193 | 1,627 | +0.0113 | p=0.0019 |
| combined_whale_track_10_60 | Both whale and track-qualified. | 37,671 | 298 | +0.0130 | p=0.0002 |
Concretely: when we strip out the wallets that pass both filters, the whale-only group lands at
-0.0027 per dollar, while the track-only group lands at
+0.0113. The pooled aggregate looks positive at +0.0073,
but that is a weighted average. The signal lives in the prior-skill column, not the wallet-size column.
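The three-way partition behind this table is plain set arithmetic. A minimal sketch, assuming the whale and track-record filters have already produced two wallet sets (the thresholds themselves are not shown here):

```python
def partition(whales: set, track_qualified: set) -> dict:
    """Split wallets into the three non-overlapping groups used in Test 2.
    `whales` and `track_qualified` are assumed to come from upstream
    filters (whale tier, prior track record)."""
    return {
        "whale_only": whales - track_qualified,   # whale, no track record
        "track_only": track_qualified - whales,   # track record, not whale
        "combined": whales & track_qualified,     # passes both filters
    }
```

Because the groups are disjoint, their per-dollar excesses cannot silently double-count the overlap wallets the way the pooled aggregate does.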
Test 3 - Strict completeness bound (the most important test)
Question: Polymarket's public data API caps each wallet's history at about 4,000 entries. If we drop every wallet whose history could be hitting that cap, do the positive lanes survive?
Cut: for every wallet at or above the 4,000-trade ceiling, exclude its observations from the lane. Recompute on the strictly-complete subset. Then split by Yes-token / No-token to test whether the result is a real signal or just a token-mix effect.
| Lane | Pooled excess | Strict-bound excess | Yes-only strict | No-only strict | Obs from capped wallets |
|---|---|---|---|---|---|
| Aggregate whales | +0.0073 | -0.0025 | -0.0085 | +0.0002 | 55.6% |
| Whale-only (no track overlap) | -0.0027 | -0.0070 | -0.0037 | -0.0001 | 7.8% |
| Track record (primary) | +0.0142 | +0.0075 | -0.0190 | -0.0097 | 59.5% |
| Whale & track overlap | +0.0130 | +0.0037 | -0.0155 | -0.0103 | 76.7% |
Concretely: the aggregate-whale lane has 55.6% of its observations coming from wallets whose full history the API cannot give us. When we drop those wallets, the lane goes from +0.0073 per dollar to -0.0025, a swing of about one cent per dollar driven entirely by the wallets whose history we cannot fully verify. The track-record lane survives in pooled form, but its Yes-only strict excess is -0.0190 and its No-only is -0.0097. A real skill signal would look similar on both sides; since both sides are negative on their own, the pooled positive comes from the lane's weighting toward Yes-tokens, not from skill.
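The strict-bound filter itself can be sketched as follows. The record layout and the dollar-weighted averaging are illustrative assumptions; `API_HISTORY_CAP` mirrors the roughly 4,000-entry ceiling described above.

```python
API_HISTORY_CAP = 4000  # approximate per-wallet history ceiling of the public API

def strict_bound(observations, history_len):
    """Drop every observation from a wallet whose recorded history sits at
    or above the API cap (its true history may be truncated), then return
    the dollar-weighted mean excess on the strictly-complete remainder.
    `observations` is a list of (wallet, usd_size, excess_per_dollar);
    `history_len` maps wallet -> number of history entries we could fetch.
    Both shapes are illustrative assumptions."""
    kept = [(w, s, e) for w, s, e in observations
            if history_len[w] < API_HISTORY_CAP]
    total = sum(s for _, s, _ in kept)
    return sum(s * e for _, s, e in kept) / total if total else 0.0
```

A wallet sitting exactly at the cap is excluded, since we cannot tell a genuinely 4,000-trade history from a truncated longer one.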
Test 4 - The literal interpretation: large bets from whales
If "follow large bets" were rescued by the bettor's identity, restricting to large bets from already-whale-tier wallets should help. It does not: lane large50kplus_aot_whale reports -0.0079 per dollar (p=0.9364, N=4,677 across 347 markets).
Concretely: the most charitable version of the original question is "follow large bets, but only
when the bettor is already a whale". Even that filtered version loses money:
-0.0079 per dollar across 4,677
qualifying bets. There is no clever filter inside the data that turns the large-bet idea positive.
Test 5 - Base-rate sanity check
Question: what does a random BUY make in this dataset, broken down by which side of the market the bettor took?
| Baseline | Hits | Total | Hit rate | Mean return |
|---|---|---|---|---|
| Random BUY | 514,301 | 876,400 | 58.7% | -0.0011 |
| Yes-token BUY | 116,601 | 393,028 | 29.7% | +0.0071 |
| No-token BUY | 397,700 | 483,372 | 82.3% | -0.0077 |
Concretely: No-token BUYs hit 82.3% of the time and Yes-token BUYs hit only 29.7%. That is a 52.6-percentage-point gap baked into the structure of binary markets. Any strategy that happens to skew toward No-token bets
will look "right" more often without being skilled. This is why every verdict on the other tabs uses
token-adjusted excess return, not hit rate.
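One plausible construction of a token-adjusted excess, assuming a winning binary token pays $1 and the per-token baselines come from the table above; the report's exact normalisation may differ.

```python
def excess_per_dollar(win: bool, price: float) -> float:
    """Realized return per dollar on a binary BUY at `price`: a winning
    token pays $1, so per dollar staked the payout is 1/price. Under the
    market's implied probability the expected payout is exactly $1, so
    the realized return is already an excess versus the implied price."""
    return (1.0 / price - 1.0) if win else -1.0

def token_adjusted(excesses, baseline_by_token, tokens):
    """Subtract the same-token baseline mean return from each bet's
    realized excess, so a strategy cannot look good merely by skewing
    toward No-tokens (which hit 82.3% of the time in Test 5)."""
    return [e - baseline_by_token[t] for e, t in zip(excesses, tokens)]
```

Under this adjustment a No-token BUY only scores above zero if it beats the typical No-token BUY, not merely because No-tokens usually resolve in the money.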
Test 6 - Early-skilled cohort (the strongest version of the wallet-following idea)
Question: if a wallet's first 10 resolved bets show a hit rate at or above 70%, are its future bets profitable?
Cut: strict-complete subset only (so we know "first 10" is the actual first 10). For each wallet, find the first episode where it has at least 10 known-resolved priors. Read the hit rate at that episode. If it is at or above 0.70, mark the wallet as early-skilled. Compute excess return on every future bet from that point onward.
| Cohort | Wallets | Future bets | Markets | Hit rate | Excess / $ | 95% CI | p-value |
|---|---|---|---|---|---|---|---|
| All wallets that crossed the 10-bet threshold | 5,140 | 81,385 | 463 | 67.4% | +0.0000 | [-0.0042, +0.0040] | p=0.4840 |
| Early-skilled (≥70% on first 10) | 1,045 | 32,830 | 432 | 82.6% | +0.0051 | [-0.0016, +0.0113] | p=0.0662 |
| Yes-token bets only | 741 | 8,974 | 368 | 68.9% | -0.0142 | [-0.0290, +0.0003] | p=0.9736 |
| No-token bets only | 863 | 23,856 | 404 | 87.7% | -0.0100 | [-0.0271, +0.0067] | p=0.8812 |
Against the permutation null, the pooled +0.0051 sits at the p=0.008 one-sided tail (null median: -0.0002, 95th-percentile null: +0.0035). So the cohort really is different from "any 1,045 random wallets that happened to graduate". That is the basic-null gate.
But the pooled excess is only +0.0051 per dollar, and the token splits undercut it: the Yes-only split is -0.0142 and the No-only split is -0.0100. Both individual sides are non-positive; the pooled positive comes from how the cohort's bets are spread across markets, not from skill on either side. Same failure mode as the track-record lane in Test 3.
Concretely: screening wallets by their first 10 priors does identify a group that is genuinely different from random selection. That part works. But the difference is small (~0.5 cents per dollar in pooled form), it does not survive being looked at side-by-side, and it would almost certainly be eaten by Polymarket's bid-ask spread before any of it reached your account.
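The cohort screen and the permutation null it is tested against can be sketched together. The data shapes, the hardcoded 10-prior window in `permutation_p`, and the sampling scheme are illustrative assumptions.

```python
import random

def early_skilled(wallet_bets, threshold=0.70, n_priors=10):
    """Flag a wallet as early-skilled if its first `n_priors` resolved bets
    hit at or above `threshold`. `wallet_bets` maps wallet -> chronological
    list of (won: bool, excess_per_dollar: float) -- an assumed shape.
    Returns the cohort and the pooled mean excess over all later bets."""
    cohort, future = set(), []
    for wallet, bets in wallet_bets.items():
        if len(bets) <= n_priors:
            continue  # never graduates: not enough resolved priors
        wins = sum(won for won, _ in bets[:n_priors])
        if wins / n_priors >= threshold:
            cohort.add(wallet)
            future.extend(e for _, e in bets[n_priors:])
    mean = sum(future) / len(future) if future else 0.0
    return cohort, mean

def permutation_p(wallet_bets, cohort_size, observed, n_perm=1000, seed=0):
    """One-sided permutation p-value: the fraction of random same-size
    cohorts whose pooled future excess (bets after the first 10) meets or
    beats the observed cohort's pooled excess."""
    rng = random.Random(seed)
    wallets = list(wallet_bets)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(wallets, cohort_size)
        ex = [e for w in sample for _, e in wallet_bets[w][10:]]
        if ex and sum(ex) / len(ex) >= observed:
            hits += 1
    return hits / n_perm
```

Crucially, the screen only makes sense on the strict-complete subset from Test 3, where "first 10" is known to be the actual first 10.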
Test 7 - Within the early-skilled cohort, do bigger bets work better?
Question: the original intuition was "skilled wallets that size up are even more skilled". Inside the cohort from Test 6, do future bets that are larger than the wallet's running average outperform their smaller bets?
Cut: for each cohort wallet, track its running mean bet size over priors at episode time. Each future bet is classified as above-average (size strictly greater than the running mean) or at-or-below-average.
| Sub-lane | Wallets | Future bets | Markets | Hit rate | Excess / $ | 95% CI | p-value |
|---|---|---|---|---|---|---|---|
| Above-average bet size | 847 | 11,241 | 394 | 80.6% | +0.0051 | [-0.0031, +0.0130] | p=0.1068 |
| At-or-below average bet size | 887 | 21,575 | 415 | 83.6% | +0.0057 | [-0.0016, +0.0122] | p=0.0654 |
The gap between the two sub-lanes is -0.0006 per dollar, with the larger-bet group sitting slightly worse, not better. This kills the "skilled and confident" version of the whale-following idea.
Concretely: if wallet identity were a real signal AND that signal got stronger when the wallet sized up, we would expect the above-average bet excess to be meaningfully higher than the below-average bet excess. Instead the two are statistically indistinguishable, with the bigger group nominally a hair behind. Whatever weak cohort signal exists in Test 6 is wallet-level, not bet-level.
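The running-mean size classification from the cut above can be sketched as follows; the record shape and the choice to leave the first (seed) bet unclassified are illustrative assumptions.

```python
def split_by_size(bets):
    """Classify each future bet as above or at-or-below the wallet's
    running mean prior bet size. `bets` is a chronological list of
    (usd_size, excess_per_dollar); the first bet has no prior mean, so
    it seeds the running average and is left unclassified."""
    above, at_or_below = [], []
    total, n = bets[0][0], 1
    for size, excess in bets[1:]:
        running_mean = total / n      # mean over priors only, at bet time
        (above if size > running_mean else at_or_below).append(excess)
        total += size
        n += 1
    return above, at_or_below
```

Using the mean over priors only (never including the current bet) avoids a look-ahead leak in the classification.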
Interpretation across all seven tests
- Tests 1 and 4 are the cleanest answer to the literal question, and both come out negative across every cut we tried. Following large bets, in any flavour, loses money.
- Tests 2 and 3 explain why earlier work seemed more positive: the whale lane was leaning on track-record overlap, and on wallets where the public API cannot give us a full history.
- Test 5 is the audit gate that justifies why we never report raw hit rates. The structure of Yes vs No tokens makes hit rates almost meaningless without normalisation.
- Test 6 is the strongest version of the wallet-following idea: pre-filter wallets by their first 10 resolved bets, then ride the survivors. It survives the basic permutation null but fails the Yes/No robustness split, same way the track-record lane fails. Magnitude is also too small to survive trading frictions.
- Test 7 rules out the "skilled and confident" rescue: bigger-than-usual bets within the cohort do not outperform the cohort's smaller bets. The signal does not amplify with size.