What we tested, and what we found

The original question - "can we follow large bets on Polymarket?" - hides several different hypotheses (large bet vs whale identity vs prior skill). We tested each one independently. Below: the question, the data cut, the result, and a plain-English reading of what the numbers mean.

Test 1 - Direct large-bet effect

Question: if you copy a wallet that just put down a large ticket, do you make money?

Cut: all resolved binary markets, BUYs in three size buckets ($10K-$50K, $50K-$500K, $500K+). Two units of analysis: every raw row, then deduplicated logical episodes (because one logical bet often shows up as 30 raw fills).

| Lane | Unit | N | Wallets | Markets | Excess / $ | 95% CI | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| raw_trade_10k_50k | Raw BUYs | 39,739 | 20,297 | 458 | -0.0083 | [-0.0134, -0.0031] | 0.9989 |
| raw_trade_50k_500k | Raw BUYs | 8,977 | 4,377 | 399 | -0.0092 | [-0.0191, +0.0003] | 0.9718 |
| raw_trade_500k_plus | Raw BUYs | 359 | 131 | 83 | +0.0028 | [-0.0048, +0.0165] | 0.3669 |
| episode_10k_50k | Episodes | 38,091 | 20,535 | 460 | -0.0074 | [-0.0121, -0.0027] | 0.9990 |
| episode_50k_500k | Episodes | 8,859 | 4,581 | 411 | -0.0092 | [-0.0192, +0.0005] | 0.9693 |
| episode_500k_plus | Episodes | 298 | 159 | 94 | -0.0015 | [-0.0066, +0.0063] | 0.7264 |
Result: every direct large-bet bucket loses money. The episode-level numbers and the raw-row numbers tell the same story, so the negative is not a row-count artifact. The very-large $500K+ bucket is too sparse to support a claim in either direction.

Concretely: the $50K-$500K episode lane lands at -0.0092 per dollar. On a $10,000 ticket, that is roughly $92 of expected loss versus the market's own implied price. The $10K-$50K lane also loses, at -0.0074, and its confidence interval sits entirely below zero, so the negative result is not a small-sample fluke.
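The raw-fill to episode deduplication from the cut can be sketched as below. The record field names and the one-hour gap rule (for deciding when consecutive fills belong to one logical bet) are illustrative assumptions, not the report's exact pipeline:

```python
from itertools import groupby

# Consecutive fills by the same wallet, in the same market and token,
# separated by no more than GAP_SECONDS are merged into one episode.
# (The one-hour gap is an assumed threshold for illustration.)
GAP_SECONDS = 3600

def dedupe_episodes(fills):
    """fills: dicts with wallet, market, token, ts, usd. Returns episodes."""
    keyed = sorted(fills, key=lambda f: (f["wallet"], f["market"], f["token"], f["ts"]))
    episodes = []
    for _, group in groupby(keyed, key=lambda f: (f["wallet"], f["market"], f["token"])):
        current = None
        for f in group:
            if current is not None and f["ts"] - current["last_ts"] <= GAP_SECONDS:
                current["usd"] += f["usd"]           # extend the open episode
                current["last_ts"] = f["ts"]
            else:
                current = {**f, "last_ts": f["ts"]}  # start a new episode
                episodes.append(current)
    return episodes
```

This is how one logical bet that shows up as 30 raw fills collapses back to a single row before the episode lanes are computed.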

Test 2 - Are whales positive on their own, or only via track-record overlap?

Question: do wallets that have ever crossed the whale threshold beat the base rate, or is the positive headline being driven by the small subset that also happens to have a strong prior track record?

Cut: partition the whale-positive sample into three non-overlapping groups: pure whales (no track), pure track-qualified (no whale), and the overlap of both.

| Lane | What it tests | N | Wallets | Excess / $ | p-value |
| --- | --- | --- | --- | --- | --- |
| aot_whale_plus | All wallets that hit whale tier at any point | 54,354 | 3,146 | +0.0073 | 0.0038 |
| whale_only_nontrack_10_60 | Whale wallets that do not also pass the track-record filter | 16,683 | 3,033 | -0.0027 | 0.8351 |
| track_only_nonwhale_10_60 | Track-qualified wallets that are not whales | 55,193 | 1,627 | +0.0113 | 0.0019 |
| combined_whale_track_10_60 | Both whale and track-qualified | 37,671 | 298 | +0.0130 | 0.0002 |
Result: the aggregate whale headline is partly borrowed from track-record overlap. Pure-whale (no track) is not positive. Pure-track (no whale) is. Whale identity by itself does not stand up.

Concretely: when we strip out the wallets that pass both filters, the whale-only group lands at -0.0027 per dollar, while the track-only group lands at +0.0113. The pooled aggregate looks positive at +0.0073, but that is a weighted average. The signal lives in the prior-skill column, not the wallet-size column.
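The three-way cut behind this test is plain set arithmetic over wallet IDs; a minimal sketch (the wallet identifiers in the example are illustrative):

```python
def partition_lanes(whale_wallets, track_wallets):
    """Split the sample into the three non-overlapping groups from Test 2:
    pure whale (no track), pure track-qualified (no whale), and both."""
    whales, track = set(whale_wallets), set(track_wallets)
    return {
        "whale_only": whales - track,   # whale tier, fails track filter
        "track_only": track - whales,   # track-qualified, never whale tier
        "overlap": whales & track,      # passes both filters
    }
```

Keeping the groups disjoint is what lets the aggregate headline be decomposed into its whale and prior-skill components.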

Test 3 - Strict completeness bound (the most important test)

Question: Polymarket's public data API caps each wallet's history at about 4,000 entries. If we drop every wallet whose history could be hitting that cap, do the positive lanes survive?

Cut: for every wallet at or above the 4,000-trade ceiling, exclude its observations from the lane. Recompute on the strictly-complete subset. Then split by Yes-token / No-token to test whether the result is a real signal or just a token-mix effect.

| Lane | Pooled excess | Strict-bound excess | Yes-only strict | No-only strict | Obs from capped wallets |
| --- | --- | --- | --- | --- | --- |
| Aggregate whales | +0.0073 | -0.0025 | -0.0085 | +0.0002 | 55.6% |
| Whale-only (no track overlap) | -0.0027 | -0.0070 | -0.0037 | -0.0001 | 7.8% |
| Track record (primary) | +0.0142 | +0.0075 | -0.0190 | -0.0097 | 59.5% |
| Whale & track overlap | +0.0130 | +0.0037 | -0.0155 | -0.0103 | 76.7% |
Result: aggregate whales flip from positive to non-positive on the strict subset. The whale-track overlap loses its statistical support. Track record stays positive in pooled form but fails the Yes-only / No-only split, so the pooled positive is partly a token-mix story rather than skill.

Concretely: the aggregate-whale lane has 55.6% of its observations coming from wallets for which the API cannot give us a full history. When we drop those wallets, the lane goes from +0.0073 per dollar to -0.0025. That is a swing of about 1.0 cents per dollar driven entirely by the wallets whose history we cannot fully verify. The track-record lane survives in pooled form, but its Yes-only strict excess is -0.0190 and its No-only is -0.0097. A real skill signal would be positive on both sides; with both sides negative, the pooled positive comes from how the lane's bets are distributed across token sides and markets, not from skill.
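A minimal sketch of the strict-completeness filter follows. The ~4,000-entry cap is from the report, but approximating "capped" by in-sample row count is an assumption made for brevity; the real pipeline would test the wallet's full fetched-history length:

```python
from collections import Counter

API_CAP = 4000  # approximate per-wallet history ceiling in the public data API

def strict_complete(trades, cap=API_CAP):
    """Recompute mean excess on the strictly-complete subset.

    trades: list of (wallet, excess_per_dollar) pairs. Any wallet at or
    above the cap could be missing early history, so all of its rows
    are excluded before recomputing the lane mean.
    """
    counts = Counter(w for w, _ in trades)
    kept = [(w, x) for w, x in trades if counts[w] < cap]
    dropped_share = 1.0 - len(kept) / len(trades) if trades else 0.0
    mean_excess = sum(x for _, x in kept) / len(kept) if kept else 0.0
    return mean_excess, dropped_share
```

The returned `dropped_share` is the "Obs from capped wallets" column in the table above.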

Test 4 - The literal interpretation: large bets from known whale wallets

If "follow large bets" were rescued by the bettor's identity, restricting to large bets from already-whale-tier wallets should help. It does not. Lane large50kplus_aot_whale reports -0.0079 per $ (p=0.9364, N=4,677 on 347 markets).

Concretely: the most charitable version of the original question is "follow large bets, but only when the bettor is already a whale". Even that filtered version loses money: -0.0079 per dollar across 4,677 qualifying bets. There is no clever filter inside the data that turns the large-bet idea positive.

Test 5 - Base-rate sanity check

Question: what does a random BUY make in this dataset, broken down by which side of the market the bettor took?

| Baseline | Hits | Total | Hit rate | Mean return |
| --- | --- | --- | --- | --- |
| Random BUY | 514,301 | 876,400 | 58.7% | -0.0011 |
| Yes-token BUY | 116,601 | 393,028 | 29.7% | +0.0071 |
| No-token BUY | 397,700 | 483,372 | 82.3% | -0.0077 |
Why this matters: a naive hit-rate headline is mostly a token-mix effect. We use token-adjusted excess return everywhere instead of raw hit rates.

Concretely: No-token BUYs hit 82.3% of the time and Yes-token BUYs hit only 29.7%. That is a 53-percentage-point gap baked into the structure of binary markets. Any strategy that happens to skew toward No-token bets will look "right" more often without being skilled. This is why every verdict on the other tabs uses token-adjusted excess return, not hit rate.
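One way to implement the token adjustment described above is to subtract each side's own base-rate mean return from every bet. This is a sketch of the idea, not the report's exact baseline construction:

```python
from collections import defaultdict

def strategy_excess(all_bets, strategy_bets):
    """Token-adjusted excess return per dollar.

    Bets are (token_side, realized_return_per_dollar) pairs. Each strategy
    bet is scored against the mean return of ALL BUYs on the same token
    side, so a strategy that merely skews toward No-tokens gets no credit
    for the No side's high base hit rate.
    """
    by_side = defaultdict(list)
    for side, r in all_bets:
        by_side[side].append(r)
    baseline = {s: sum(v) / len(v) for s, v in by_side.items()}
    excess = [r - baseline[s] for s, r in strategy_bets]
    return sum(excess) / len(excess)
```

Under this scoring, "right 82.3% of the time on No-tokens" is worth exactly nothing unless the returns beat the No-side baseline.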

Test 6 - Early-skilled cohort (the strongest version of the wallet-following idea)

Question: if a wallet's first 10 resolved bets show a hit rate at or above 70%, are its future bets profitable?

Cut: strict-complete subset only (so we know "first 10" is the actual first 10). For each wallet, find the first episode where it has at least 10 known-resolved priors. Read the hit rate at that episode. If it is at or above 0.70, mark the wallet as early-skilled. Compute excess return on every future bet from that point onward.
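The per-wallet screen can be sketched as follows; the (won, excess) record layout is an assumption made for illustration:

```python
def early_skilled(history, n_priors=10, threshold=0.70):
    """history: one wallet's chronological resolved bets as (won, excess) pairs.

    Screens on the hit rate of the first n_priors resolved bets and, if the
    wallet qualifies, returns the mean excess of every bet after them.
    Assumes the history is strictly complete, so "first 10" really is the
    wallet's first 10.
    """
    if len(history) <= n_priors:
        return False, None          # never crossed the threshold
    priors, future = history[:n_priors], history[n_priors:]
    hit_rate = sum(1 for won, _ in priors if won) / n_priors
    if hit_rate < threshold:
        return False, None          # crossed the threshold but not early-skilled
    return True, sum(x for _, x in future) / len(future)
```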

| Cohort | Wallets | Future bets | Markets | Hit rate | Excess / $ | 95% CI | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| All wallets that crossed the 10-bet threshold | 5,140 | 81,385 | 463 | 67.4% | +0.0000 | [-0.0042, +0.0040] | 0.4840 |
| Early-skilled (≥70% on first 10) | 1,045 | 32,830 | 432 | 82.6% | +0.0051 | [-0.0016, +0.0113] | 0.0662 |
| Early-skilled, Yes-token bets only | 741 | 8,974 | 368 | 68.9% | -0.0142 | [-0.0290, +0.0003] | 0.9736 |
| Early-skilled, No-token bets only | 863 | 23,856 | 404 | 87.7% | -0.0100 | [-0.0271, +0.0067] | 0.8812 |
Permutation control: we re-ran the cohort with random labels (shuffled within trade-count buckets, 1,000 iterations). The observed cohort mean of +0.0051 sits at the p=0.008 one-sided tail of the null (null median: -0.0002, 95th-percentile null: +0.0035). So the cohort really is different from "any 1,045 random wallets that happened to graduate". That is the basic-null gate.
But it fails the Yes/No robustness split. The cohort's pooled future excess is +0.0051 per dollar. The Yes-only split is -0.0142, the No-only split is -0.0100. Both individual sides are non-positive; the pooled positive comes from how the cohort's bets are spread across markets, not from skill on either side. Same failure mode as the track-record lane in Test 3.

Concretely: screening wallets by their first 10 priors does identify a group that is genuinely different from random selection. That part works. But the difference is small (~0.5 cents per dollar in pooled form), it does not survive the Yes-only / No-only split, and it would almost certainly be eaten by Polymarket's bid-ask spread before any of it reached your account.
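A simplified version of the permutation control can be sketched as below. This draws unstratified random cohorts, whereas the report shuffles labels within trade-count buckets, so it is a coarser null:

```python
import random
import statistics

def permutation_pvalue(all_wallet_means, n_cohort, observed, iters=1000, seed=0):
    """One-sided permutation check: how often does a random cohort of
    n_cohort wallets match or beat the observed cohort mean excess?

    all_wallet_means: per-wallet mean excess for every eligible wallet.
    Returns (one-sided p-value, median of the null distribution).
    """
    rng = random.Random(seed)
    null = [statistics.fmean(rng.sample(all_wallet_means, n_cohort))
            for _ in range(iters)]
    p_one_sided = sum(m >= observed for m in null) / iters
    return p_one_sided, statistics.median(null)
```

A small p-value here says the cohort beats "any random set of wallets of the same size"; it says nothing about whether the edge survives the Yes/No split, which is a separate gate.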

Test 7 - Within the early-skilled cohort, do bigger bets work better?

Question: the original intuition was "skilled wallets that size up are even more skilled". Inside the cohort from Test 6, do future bets that are larger than the wallet's running average outperform their smaller bets?

Cut: for each cohort wallet, track its running mean bet size over priors at episode time. Each future bet is classified as above-average (size strictly greater than the running mean) or at-or-below-average.

| Sub-lane | Wallets | Future bets | Markets | Hit rate | Excess / $ | 95% CI | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Above-average bet size | 847 | 11,241 | 394 | 80.6% | +0.0051 | [-0.0031, +0.0130] | 0.1068 |
| At-or-below-average bet size | 887 | 21,575 | 415 | 83.6% | +0.0057 | [-0.0016, +0.0122] | 0.0654 |
Result: bigger-than-usual bets do not outperform smaller-than-usual bets within the cohort. The delta is -0.0006 per dollar, with the larger group sitting slightly worse, not better. This kills the "skilled and confident" version of the whale-following idea.

Concretely: if wallet identity were a real signal AND that signal got stronger when the wallet sized up, we would expect the above-average bet excess to be meaningfully higher than the below-average bet excess. Instead the two are statistically indistinguishable, with the bigger group nominally a hair behind. Whatever weak cohort signal exists in Test 6 is wallet-level, not bet-level.
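The above / at-or-below classification from the cut is a one-pass running mean over each wallet's prior bet sizes; a minimal sketch:

```python
def classify_vs_running_mean(bet_sizes):
    """Label each bet after a wallet's first as 'above' (strictly greater
    than the running mean of all earlier bets) or 'at_or_below'.

    bet_sizes: one wallet's bet sizes in chronological order. The first
    bet has no priors, so it gets no label.
    """
    labels, total = [], 0.0
    for i, size in enumerate(bet_sizes):
        if i > 0:
            labels.append("above" if size > total / i else "at_or_below")
        total += size
    return labels
```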

Interpretation across all seven tests