{"id":2433,"date":"2021-10-10T08:26:34","date_gmt":"2021-10-10T16:26:34","guid":{"rendered":"https:\/\/enfascination.com\/weblog\/?p=2433"},"modified":"2021-10-10T08:37:37","modified_gmt":"2021-10-10T16:37:37","slug":"the-simplest-demo-that-big-data-breaks-p-value-stats","status":"publish","type":"post","link":"https:\/\/enfascination.com\/weblog\/post\/2433","title":{"rendered":"The simplest demo that big data breaks p-value stats"},"content":{"rendered":"<p><code><br \/>\n> # perfectly independent matrix of 161 observations; standard \"small-n statistics\"<br \/>\n> #   (rows have different sums but are all in 4:2:1 ratio)<br \/>\n> tbl <- matrix(c(4, 2, 1,\n...                 48, 24, 12,\n...                 40, 20, 10), ncol=3)\n> chisq.test(tbl)$p.value<br \/>\n[1] 1<br \/>\nWarning message:In chisq.test(tbl) : Chi-squared approximation may be incorrect<br \/>\n> # one more observation, still independent<br \/>\n> tbl[3,3] <- tbl[3,3] + 1\n> print(tbl)<br \/>\n     [,1] [,2] [,3]<br \/>\n[1,]    4   48   40<br \/>\n[2,]    2   24   20<br \/>\n[3,]    1   12   11<br \/>\n> chisq.test(tbl)$p.value<br \/>\n[1] 0.99974<br \/>\nWarning message:In chisq.test(tbl) : Chi-squared approximation may be incorrect<br \/>\n> # Ten times more data in the same ratio is still independent<br \/>\n> chisq.test(tbl*10)$p.value<br \/>\n[1] 0.97722<br \/>\n> # A hundred times more data in the same ratio is less independent<br \/>\n> chisq.test(tbl*100)$p.value<br \/>\n[1] 0.33017<br \/>\n> # A thousand times more data fails independence (and way below p<0.05)\n> chisq.test(tbl*1000)$p.value<br \/>\n[1] 0.0000000023942<br \/>\n> print(tbl*1000) #(still basically all 4:2:1)<br \/>\n     [,1]  [,2]  [,3]<br \/>\n[1,] 4000 48000 40000<br \/>\n[2,] 2000 24000 20000<br \/>\n[3,] 1000 12000 11000<br \/>\n<\/code><\/p>\n<p>All the matrices maintain a near-perfect 4:2:1 ratio in the rows. 
But when the data grow from 162 to 162,000 observations, p falls from 0.9997 (indistinguishable from theoretical independence) to <0.00000001. The proportions never change; only n does, and for fixed proportions the chi-squared statistic grows in proportion to n.\n\nThis problem with chi-squared tests in particular is actually old: Berkson (1938) pointed it out. The first solution came right after: Hotelling's (1939) volume test. It amounts to an endorsement of what we do today: for big data, use data-driven statistics, not small-n statistics. Small-n statistics were developed for small n.\n\nhttps:\/\/www.tandfonline.com\/doi\/pdf\/10.1080\/01621459.1938.10502329\nhttps:\/\/www.jstor.org\/stable\/2371512\n\nHere's the code:\n<code><br \/>\n# perfectly independent matrix of 161 observations; standard \"small-n statistics\"<br \/>\n#   (rows have different sums but are all in 4:2:1 ratio)<br \/>\ntbl <- matrix(c(4, 2, 1,\n                48, 24, 12,\n                40, 20, 10), ncol=3)\nchisq.test(tbl)$p.value\n# one more observation, still independent\ntbl[3,3] <- tbl[3,3] + 1\nprint(tbl)\nchisq.test(tbl)$p.value\n# Ten times more data in the same ratio is still independent\nchisq.test(tbl*10)$p.value\n# A hundred times more data in the same ratio is less independent\nchisq.test(tbl*100)$p.value\n# A thousand times more data fails independence\nchisq.test(tbl*1000)$p.value\nprint(tbl*1000)\n<\/code><\/p>\n<!-- AddThis Advanced Settings generic via filter on the_content --><!-- AddThis Share Buttons generic via filter on the_content --><!-- AddThis Related Posts generic via filter on the_content -->","protected":false},"excerpt":{"rendered":"<p>> # perfectly independent matrix of 161 observations; standard &#8220;small-n statistics&#8221; > # (rows have different sums but are all in 4:2:1 ratio) > tbl chisq.test(tbl)$p.value [1] 1 Warning message:In chisq.test(tbl) : Chi-squared approximation may be incorrect > # one more observation, still independent > tbl[3,3] print(tbl) [,1] [,2] &hellip; <a href=\"https:\/\/enfascination.com\/weblog\/post\/2433\" 
class=\"more-link\">Continue reading <span class=\"screen-reader-text\">The simplest demo that big data breaks p-value stats<\/span><\/a><!-- AddThis Advanced Settings generic via filter on get_the_excerpt --><!-- AddThis Share Buttons generic via filter on get_the_excerpt --><!-- AddThis Related Posts generic via filter on get_the_excerpt --><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"wpupg_custom_link":[],"wpupg_custom_link_behaviour":["default"],"wpupg_custom_image":[],"wpupg_custom_image_id":[],"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/enfascination.com\/weblog\/wp-json\/wp\/v2\/posts\/2433"}],"collection":[{"href":"https:\/\/enfascination.com\/weblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/enfascination.com\/weblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/enfascination.com\/weblog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/enfascination.com\/weblog\/wp-json\/wp\/v2\/comments?post=2433"}],"version-history":[{"count":3,"href":"https:\/\/enfascination.com\/weblog\/wp-json\/wp\/v2\/posts\/2433\/revisions"}],"predecessor-version":[{"id":2436,"href":"https:\/\/enfascination.com\/weblog\/wp-json\/wp\/v2\/posts\/2433\/revisions\/2436"}],"wp:attachment":[{"href":"https:\/\/enfascination.com\/weblog\/wp-json\/wp\/v2\/media?parent=2433"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/enfascination.com\/weblog\/wp-json\/wp\/v2\/categories?post=2433"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/enfascination.com\/weblog\/wp-json\/wp\/v2\/tags?post=2433"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}