The simplest demo that big data breaks p-value stats

> # perfectly independent matrix of 161 observations; standard "small-n statistics"
> # (rows have different sums but are all in 4:2:1 ratio)
> tbl <- matrix(c(4, 2, 1,
...               48, 24, 12,
...               40, 20, 10), ncol=3)
> chisq.test(tbl)$p.value
[1] 1
Warning message:
In chisq.test(tbl) : Chi-squared approximation may be incorrect
> # one more observation, still independent
> tbl[3,3] <- tbl[3,3] + 1
> print(tbl)
     [,1] [,2] [,3]
[1,]    4   48   40
[2,]    2   24   20
[3,]    1   12   11
> chisq.test(tbl)$p.value
[1] 0.99974
Warning message:
In chisq.test(tbl) : Chi-squared approximation may be incorrect
> # Ten times more data in the same ratio is still independent
> chisq.test(tbl*10)$p.value
[1] 0.97722
> # A hundred times more data in the same ratio is less independent
> chisq.test(tbl*100)$p.value
[1] 0.33017
> # A thousand times more data fails independence (and way below p<0.05)
> chisq.test(tbl*1000)$p.value
[1] 0.0000000023942
> print(tbl*1000) #(still basically all 4:2:1)
     [,1]  [,2]  [,3]
[1,] 4000 48000 40000
[2,] 2000 24000 20000
[3,] 1000 12000 11000

All the matrices keep a near-perfect 4:2:1 ratio across the rows. But when the data grow from 162 to 162,000 observations, p falls from 0.9997 (indistinguishable from theoretical independence) to below 0.00000001. This problem with chi-squared tests in particular is actually old: Berkson (1938). The first solution came right after: Hotelling's (1939) volume test. It amounts to an endorsement of what we do today: for big data, use data-driven statistics, not small-n statistics. Small-n statistics were developed for small n.

Berkson (1938): https://www.tandfonline.com/doi/pdf/10.1080/01621459.1938.10502329
Hotelling (1939): https://www.jstor.org/stable/2371512

Here's the code:

# perfectly independent matrix of 161 observations; standard "small-n statistics"
# (rows have different sums but are all in 4:2:1 ratio)
tbl <- matrix(c(4, 2, 1, 48, 24, 12, 40, 20, 10), ncol=3)
chisq.test(tbl)$p.value

# one more observation, still independent
tbl[3,3] <- tbl[3,3] + 1
print(tbl)
chisq.test(tbl)$p.value

# Ten times more data in the same ratio is still independent
chisq.test(tbl*10)$p.value

# A hundred times more data in the same ratio is less independent
chisq.test(tbl*100)$p.value

# A thousand times more data fails independence
chisq.test(tbl*1000)$p.value
print(tbl*1000)
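Why the p-values above collapse can be checked directly. Pearson's statistic sums (observed - expected)^2 / expected over the cells, so multiplying every cell by k multiplies the statistic by k while the degrees of freedom stay at 4. The sketch below is an illustration added to the demo, not part of the original code: it tabulates the statistic next to Cramér's V, an effect size that divides out the sample size. The helper name `cramers_v` is hypothetical (base R has no such function), and the choice of Cramér's V as the size-free yardstick is an assumption made here for illustration.

```r
# The 162-observation table from the demo: 4:2:1 across the rows,
# plus the one extra count in cell [3,3].
tbl <- matrix(c(4, 2, 1, 48, 24, 12, 40, 20, 11), ncol = 3)

# Cramér's V divides the chi-squared statistic by n, so it does not grow
# just because the sample does. (Hypothetical helper, not part of base R.)
cramers_v <- function(m, stat) {
  sqrt(stat / (sum(m) * (min(dim(m)) - 1)))
}

multipliers <- c(1, 10, 100, 1000)

results <- do.call(rbind, lapply(multipliers, function(k) {
  m <- tbl * k
  # At k = 1 some expected counts are small and chisq.test() warns;
  # the warning is suppressed here only to keep the output readable.
  test <- suppressWarnings(chisq.test(m))
  stat <- unname(test$statistic)
  data.frame(k = k,
             n = sum(m),
             statistic = stat,
             p.value = test$p.value,
             cramers_v = cramers_v(m, stat))
}))

print(results)
```

The `statistic` column should grow in exact proportion to `k` and the `p.value` column should fall accordingly, while `cramers_v` stays the same in every row (roughly 0.012): the departure from independence never changes, only the test's ability to detect it.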
