What we tested, and what we found

The original question - "can we follow large bets on Polymarket?" - hides several different hypotheses (large bet vs whale identity vs prior skill). We tested each one independently. Below: the question, the data cut, the result, and a plain-English reading of what the numbers mean.

Test 1 - Direct large-bet effect

Question: if you copy a wallet that just put down a large ticket, do you make money?

Cut: all resolved binary markets, BUYs in three size buckets ($10K-$50K, $50K-$500K, $500K+). Two units of analysis: every raw row, then deduplicated logical episodes (because one logical bet often shows up as 30 raw fills).

| Lane | Unit | N | Wallets | Markets | Excess / $ | 95% CI | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| raw_trade_10k_50k | Raw BUYs | 39,739 | 20,297 | 458 | -0.0083 | [-0.0134, -0.0031] | 0.9989 |
| raw_trade_50k_500k | Raw BUYs | 8,977 | 4,377 | 399 | -0.0092 | [-0.0191, +0.0003] | 0.9718 |
| raw_trade_500k_plus | Raw BUYs | 359 | 131 | 83 | +0.0028 | [-0.0048, +0.0165] | 0.3669 |
| episode_10k_50k | Episodes | 38,091 | 20,535 | 460 | -0.0074 | [-0.0121, -0.0027] | 0.9990 |
| episode_50k_500k | Episodes | 8,859 | 4,581 | 411 | -0.0092 | [-0.0192, +0.0005] | 0.9693 |
| episode_500k_plus | Episodes | 298 | 159 | 94 | -0.0015 | [-0.0066, +0.0063] | 0.7264 |
Result: every direct large-bet bucket loses money. The episode-level numbers and the raw-row numbers tell the same story, so the negative is not a row-count artifact. The very-large $500K+ bucket is too sparse to support a claim in either direction.

Concretely: the $50K-$500K episode lane lands at -0.0092 per dollar. On a $10,000 ticket, that is roughly $92 of expected loss versus the market's own implied price. The $10K-$50K lane also loses, at -0.0074, and its confidence interval sits entirely below zero, so the negative result is not a small-sample fluke.
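The raw-fill to episode deduplication from the cut can be sketched as below. The record field names and the one-hour gap rule (for deciding when consecutive fills belong to one logical bet) are illustrative assumptions, not the report's exact pipeline:

```python
from itertools import groupby

# Consecutive fills by the same wallet, in the same market and token,
# separated by no more than GAP_SECONDS are merged into one episode.
# (The one-hour gap is an assumed threshold for illustration.)
GAP_SECONDS = 3600

def dedupe_episodes(fills):
    """fills: dicts with wallet, market, token, ts, usd. Returns episodes."""
    keyed = sorted(fills, key=lambda f: (f["wallet"], f["market"], f["token"], f["ts"]))
    episodes = []
    for _, group in groupby(keyed, key=lambda f: (f["wallet"], f["market"], f["token"])):
        current = None
        for f in group:
            if current is not None and f["ts"] - current["last_ts"] <= GAP_SECONDS:
                current["usd"] += f["usd"]           # extend the open episode
                current["last_ts"] = f["ts"]
            else:
                current = {**f, "last_ts": f["ts"]}  # start a new episode
                episodes.append(current)
    return episodes
```

This is how one logical bet that shows up as 30 raw fills collapses back to a single row before the episode lanes are computed.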

Test 2 - Are whales positive on their own, or only via track-record overlap?

Question: do wallets that have ever crossed the whale threshold beat the base rate, or is the positive headline being driven by the small subset that also happens to have a strong prior track record?

Cut: partition the whale-positive sample into three non-overlapping groups: pure whales (no track), pure track-qualified (no whale), and the overlap of both.

| Lane | What it tests | N | Wallets | Excess / $ | p-value |
| --- | --- | --- | --- | --- | --- |
| aot_whale_plus | All wallets that hit whale tier at any point | 54,354 | 3,146 | +0.0073 | 0.0038 |
| whale_only_nontrack_10_60 | Whale wallets that do not also pass the track-record filter | 16,683 | 3,033 | -0.0027 | 0.8351 |
| track_only_nonwhale_10_60 | Track-qualified wallets that are not whales | 55,193 | 1,627 | +0.0113 | 0.0019 |
| combined_whale_track_10_60 | Both whale and track-qualified | 37,671 | 298 | +0.0130 | 0.0002 |
Result: the aggregate whale headline is partly borrowed from track-record overlap. Pure-whale (no track) is not positive. Pure-track (no whale) is. Whale identity by itself does not stand up.

Concretely: when we strip out the wallets that pass both filters, the whale-only group lands at -0.0027 per dollar, while the track-only group lands at +0.0113. The pooled aggregate looks positive at +0.0073, but that is a weighted average. The signal lives in the prior-skill column, not the wallet-size column.
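The three-way cut behind this test is plain set arithmetic over wallet IDs; a minimal sketch (the wallet identifiers in the example are illustrative):

```python
def partition_lanes(whale_wallets, track_wallets):
    """Split the sample into the three non-overlapping groups from Test 2:
    pure whale (no track), pure track-qualified (no whale), and both."""
    whales, track = set(whale_wallets), set(track_wallets)
    return {
        "whale_only": whales - track,   # whale tier, fails track filter
        "track_only": track - whales,   # track-qualified, never whale tier
        "overlap": whales & track,      # passes both filters
    }
```

Keeping the groups disjoint is what lets the aggregate headline be decomposed into its whale and prior-skill components.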

Test 3 - Strict completeness bound (the most important test)

Question: Polymarket's public data API caps each wallet's history at about 4,000 entries. If we drop every wallet whose history could be hitting that cap, do the positive lanes survive?

Cut: for every wallet at or above the 4,000-trade ceiling, exclude its observations from the lane. Recompute on the strictly-complete subset. Then split by Yes-token / No-token to test whether the result is a real signal or just a token-mix effect.

| Lane | Pooled excess | Strict-bound excess | Yes-only strict | No-only strict | Obs from capped wallets |
| --- | --- | --- | --- | --- | --- |
| Aggregate whales | +0.0073 | -0.0025 | -0.0085 | +0.0002 | 55.6% |
| Whale-only (no track overlap) | -0.0027 | -0.0070 | -0.0037 | -0.0001 | 7.8% |
| Track record (primary) | +0.0142 | +0.0075 | -0.0190 | -0.0097 | 59.5% |
| Whale & track overlap | +0.0130 | +0.0037 | -0.0155 | -0.0103 | 76.7% |
Result: aggregate whales flip from positive to non-positive on the strict subset. The whale-track overlap loses its statistical support. Track record stays positive in pooled form but fails the Yes-only / No-only split, so the pooled positive is partly a token-mix story rather than skill.

Concretely: the aggregate-whale lane has 55.6% of its observations coming from wallets for which the API cannot give us a full history. When we drop those wallets, the lane goes from +0.0073 per dollar to -0.0025. That is a swing of about 1.0 cents per dollar driven entirely by the wallets whose history we cannot fully verify. The track-record lane survives in pooled form, but its Yes-only strict excess is -0.0190 and its No-only is -0.0097. A real skill signal would be positive on both sides; with both sides negative, the pooled positive comes from how the lane's bets are distributed across token sides and markets, not from skill.
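A minimal sketch of the strict-completeness filter follows. The ~4,000-entry cap is from the report, but approximating "capped" by in-sample row count is an assumption made for brevity; the real pipeline would test the wallet's full fetched-history length:

```python
from collections import Counter

API_CAP = 4000  # approximate per-wallet history ceiling in the public data API

def strict_complete(trades, cap=API_CAP):
    """Recompute mean excess on the strictly-complete subset.

    trades: list of (wallet, excess_per_dollar) pairs. Any wallet at or
    above the cap could be missing early history, so all of its rows
    are excluded before recomputing the lane mean.
    """
    counts = Counter(w for w, _ in trades)
    kept = [(w, x) for w, x in trades if counts[w] < cap]
    dropped_share = 1.0 - len(kept) / len(trades) if trades else 0.0
    mean_excess = sum(x for _, x in kept) / len(kept) if kept else 0.0
    return mean_excess, dropped_share
```

The returned `dropped_share` is the "Obs from capped wallets" column in the table above.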

Test 4 - The literal interpretation: large bets from known whale wallets

If "follow large bets" were rescued by the bettor's identity, restricting to large bets from already-whale-tier wallets should help. It does not. Lane large50kplus_aot_whale reports -0.0079 per $ (p=0.9364, N=4,677 on 347 markets).

Concretely: the most charitable version of the original question is "follow large bets, but only when the bettor is already a whale". Even that filtered version loses money: -0.0079 per dollar across 4,677 qualifying bets. There is no clever filter inside the data that turns the large-bet idea positive.

Test 5 - Base-rate sanity check

Question: what does a random BUY make in this dataset, broken down by which side of the market the bettor took?

| Baseline | Hits | Total | Hit rate | Mean return |
| --- | --- | --- | --- | --- |
| Random BUY | 514,301 | 876,400 | 58.7% | -0.0011 |
| Yes-token BUY | 116,601 | 393,028 | 29.7% | +0.0071 |
| No-token BUY | 397,700 | 483,372 | 82.3% | -0.0077 |
Why this matters: a naive hit-rate headline is mostly a token-mix effect. We use token-adjusted excess return everywhere instead of raw hit rates.

Concretely: No-token BUYs hit 82.3% of the time and Yes-token BUYs hit only 29.7%. That is a 53-percentage-point gap baked into the structure of binary markets. Any strategy that happens to skew toward No-token bets will look "right" more often without being skilled. This is why every verdict on the other tabs uses token-adjusted excess return, not hit rate.
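One way to implement the token adjustment described above is to subtract each side's own base-rate mean return from every bet. This is a sketch of the idea, not the report's exact baseline construction:

```python
from collections import defaultdict

def strategy_excess(all_bets, strategy_bets):
    """Token-adjusted excess return per dollar.

    Bets are (token_side, realized_return_per_dollar) pairs. Each strategy
    bet is scored against the mean return of ALL BUYs on the same token
    side, so a strategy that merely skews toward No-tokens gets no credit
    for the No side's high base hit rate.
    """
    by_side = defaultdict(list)
    for side, r in all_bets:
        by_side[side].append(r)
    baseline = {s: sum(v) / len(v) for s, v in by_side.items()}
    excess = [r - baseline[s] for s, r in strategy_bets]
    return sum(excess) / len(excess)
```

Under this scoring, "right 82.3% of the time on No-tokens" is worth exactly nothing unless the returns beat the No-side baseline.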

Test 6 - Early-skilled cohort (the strongest version of the wallet-following idea)

Question: if a wallet's first 10 resolved bets show a hit rate at or above 70%, are its future bets profitable?

Cut: strict-complete subset only (so we know "first 10" is the actual first 10). For each wallet, find the first episode where it has at least 10 known-resolved priors. Read the hit rate at that episode. If it is at or above 0.70, mark the wallet as early-skilled. Compute excess return on every future bet from that point onward.
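The per-wallet screen can be sketched as follows; the (won, excess) record layout is an assumption made for illustration:

```python
def early_skilled(history, n_priors=10, threshold=0.70):
    """history: one wallet's chronological resolved bets as (won, excess) pairs.

    Screens on the hit rate of the first n_priors resolved bets and, if the
    wallet qualifies, returns the mean excess of every bet after them.
    Assumes the history is strictly complete, so "first 10" really is the
    wallet's first 10.
    """
    if len(history) <= n_priors:
        return False, None          # never crossed the threshold
    priors, future = history[:n_priors], history[n_priors:]
    hit_rate = sum(1 for won, _ in priors if won) / n_priors
    if hit_rate < threshold:
        return False, None          # crossed the threshold but not early-skilled
    return True, sum(x for _, x in future) / len(future)
```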

| Cohort | Wallets | Future bets | Markets | Hit rate | Excess / $ | 95% CI | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| All wallets that crossed the 10-bet threshold | 5,140 | 81,385 | 463 | 67.4% | +0.0000 | [-0.0042, +0.0040] | 0.4840 |
| Early-skilled (≥70% on first 10) | 1,045 | 32,830 | 432 | 82.6% | +0.0051 | [-0.0016, +0.0113] | 0.0662 |
| Early-skilled, Yes-token bets only | 741 | 8,974 | 368 | 68.9% | -0.0142 | [-0.0290, +0.0003] | 0.9736 |
| Early-skilled, No-token bets only | 863 | 23,856 | 404 | 87.7% | -0.0100 | [-0.0271, +0.0067] | 0.8812 |
Permutation control: we re-ran the cohort with random labels (shuffled within trade-count buckets, 1,000 iterations). The observed cohort mean of +0.0051 sits at the p=0.008 one-sided tail of the null (null median: -0.0002, 95th-percentile null: +0.0035). So the cohort really is different from "any 1,045 random wallets that happened to graduate". That is the basic-null gate.
But it fails the Yes/No robustness split. The cohort's pooled future excess is +0.0051 per dollar. The Yes-only split is -0.0142, the No-only split is -0.0100. Both individual sides are non-positive; the pooled positive comes from how the cohort's bets are spread across markets, not from skill on either side. Same failure mode as the track-record lane in Test 3.

Concretely: screening wallets by their first 10 priors does identify a group that is genuinely different from random selection. That part works. But the difference is small (~0.5 cents per dollar in pooled form), it does not survive the Yes-only / No-only split, and it would almost certainly be eaten by Polymarket's bid-ask spread before any of it reached your account.
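A simplified version of the permutation control can be sketched as below. This draws unstratified random cohorts, whereas the report shuffles labels within trade-count buckets, so it is a coarser null:

```python
import random
import statistics

def permutation_pvalue(all_wallet_means, n_cohort, observed, iters=1000, seed=0):
    """One-sided permutation check: how often does a random cohort of
    n_cohort wallets match or beat the observed cohort mean excess?

    all_wallet_means: per-wallet mean excess for every eligible wallet.
    Returns (one-sided p-value, median of the null distribution).
    """
    rng = random.Random(seed)
    null = [statistics.fmean(rng.sample(all_wallet_means, n_cohort))
            for _ in range(iters)]
    p_one_sided = sum(m >= observed for m in null) / iters
    return p_one_sided, statistics.median(null)
```

A small p-value here says the cohort beats "any random set of wallets of the same size"; it says nothing about whether the edge survives the Yes/No split, which is a separate gate.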

Test 7 - Within the early-skilled cohort, do bigger bets work better?

Question: the original intuition was "skilled wallets that size up are even more skilled". Inside the cohort from Test 6, do future bets that are larger than the wallet's running average outperform their smaller bets?

Cut: for each cohort wallet, track its running mean bet size over priors at episode time. Each future bet is classified as above-average (size strictly greater than the running mean) or at-or-below-average.

| Sub-lane | Wallets | Future bets | Markets | Hit rate | Excess / $ | 95% CI | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Above-average bet size | 847 | 11,241 | 394 | 80.6% | +0.0051 | [-0.0031, +0.0130] | 0.1068 |
| At-or-below-average bet size | 887 | 21,575 | 415 | 83.6% | +0.0057 | [-0.0016, +0.0122] | 0.0654 |
Result: bigger-than-usual bets do not outperform smaller-than-usual bets within the cohort. The delta is -0.0006 per dollar, with the larger group sitting slightly worse, not better. This kills the "skilled and confident" version of the whale-following idea.

Concretely: if wallet identity were a real signal AND that signal got stronger when the wallet sized up, we would expect the above-average bet excess to be meaningfully higher than the below-average bet excess. Instead the two are statistically indistinguishable, with the bigger group nominally a hair behind. Whatever weak cohort signal exists in Test 6 is wallet-level, not bet-level.
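The above / at-or-below classification from the cut is a one-pass running mean over each wallet's prior bet sizes; a minimal sketch:

```python
def classify_vs_running_mean(bet_sizes):
    """Label each bet after a wallet's first as 'above' (strictly greater
    than the running mean of all earlier bets) or 'at_or_below'.

    bet_sizes: one wallet's bet sizes in chronological order. The first
    bet has no priors, so it gets no label.
    """
    labels, total = [], 0.0
    for i, size in enumerate(bet_sizes):
        if i > 0:
            labels.append("above" if size > total / i else "at_or_below")
        total += size
    return labels
```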

Interpretation across all seven tests