The simplest demo that big data breaks p-value stats


> # perfectly independent matrix of 161 observations; standard "small-n statistics"
> # (rows have different sums but are all in 4:2:1 ratio)
> tbl <- matrix(c(4, 2, 1, 48, 24, 12, 40, 20, 10), ncol=3)
> chisq.test(tbl)$p.value
[1] 1
Warning message:
In chisq.test(tbl) : Chi-squared approximation may be incorrect
> # one more observation, still independent
> tbl[3,3] <- tbl[3,3] + 1
> print(tbl)
     [,1] [,2] [,3]
[1,]    4   48   40
[2,]    2   24   20
[3,]    1   12   11
> chisq.test(tbl)$p.value
[1] 0.99974
Warning message:
In chisq.test(tbl) : Chi-squared approximation may be incorrect
> # Ten times more data in the same ratio is still independent
> chisq.test(tbl*10)$p.value
[1] 0.97722
> # A hundred times more data in the same ratio is less independent
> chisq.test(tbl*100)$p.value
[1] 0.33017
> # A thousand times more data fails independence (and way below p<0.05)
> chisq.test(tbl*1000)$p.value
[1] 0.0000000023942
> print(tbl*1000) #(still basically all 4:2:1)
     [,1]  [,2]  [,3]
[1,] 4000 48000 40000
[2,] 2000 24000 20000
[3,] 1000 12000 11000

All the matrices maintain a near-perfect 4:2:1 ratio in the rows. But when the data grow from 162 to 162,000 observations, p falls from above 0.99 (indistinguishable from theoretical independence) to below 0.00000001. The reason is that, with the proportions held fixed, the chi-squared statistic grows in proportion to the sample size, so a deviation that is negligible at n = 162 becomes overwhelming evidence at n = 162,000.

This problem with chi-squared tests in particular is actually old: Berkson (1938). The first solution came right after: Hotelling's (1939) volume test. It amounts to an endorsement of what we do today: for big data, use data-driven statistics, not small-n statistics. Small-n statistics were developed for small n.

Berkson (1938): https://www.tandfonline.com/doi/pdf/10.1080/01621459.1938.10502329
Hotelling (1939): https://www.jstor.org/stable/2371512

Here's the code:
# perfectly independent matrix of 161 observations; standard “small-n statistics”
# (rows have different sums but are all in 4:2:1 ratio)
tbl <- matrix(c(4, 2, 1, 48, 24, 12, 40, 20, 10), ncol=3)
chisq.test(tbl)$p.value
# one more observation, still independent
tbl[3,3] <- tbl[3,3] + 1
print(tbl)
chisq.test(tbl)$p.value
# Ten times more data in the same ratio is still independent
chisq.test(tbl*10)$p.value
# A hundred times more data in the same ratio is less independent
chisq.test(tbl*100)$p.value
# A thousand times more data fails independence
chisq.test(tbl*1000)$p.value
print(tbl*1000)
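
To see why p collapses while the table's shape never changes, here is a rough sketch (not part of the original demo) that records the chi-squared statistic next to each p-value. The names `scales`, `res`, and `stat_per_obs` are made up for illustration. The idea, assuming the usual Pearson statistic that `chisq.test` reports for a 3x3 table, is that multiplying every cell by k multiplies the statistic by k, so the statistic divided by the number of observations should come out the same at every scale.

```r
# The 162-observation table from the transcript above
tbl <- matrix(c(4, 2, 1, 48, 24, 12, 40, 20, 11), ncol = 3)

# Multipliers used in the demo (illustrative variable name)
scales <- c(1, 10, 100, 1000)

# For each multiplier k, record N, the chi-squared statistic, and the p-value
res <- t(sapply(scales, function(k) {
  # suppressWarnings() hides the small-expected-count warning at k = 1
  fit <- suppressWarnings(chisq.test(tbl * k))
  c(k = k,
    n = sum(tbl * k),
    statistic = unname(fit$statistic),
    p.value = fit$p.value)
}))
res <- as.data.frame(res)

# Statistic per observation: a size-free measure of the departure from
# independence, expected to be identical in every row
res$stat_per_obs <- res$statistic / res$n

print(res, digits = 5)
```

If the sketch behaves as expected, the `stat_per_obs` column is constant while `p.value` drops row by row: the test is reacting to the amount of data, not to any change in how far the table sits from independence.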

About

This entry was posted on Sunday, October 10th, 2021 and is filed under Uncategorized.