Pandas in 2018

I’m late to the game on data science in Python because I still do my data analysis overwhelmingly in R (thank god for data.table, the tidyverse, and all the amazing stats packages; to hell with data.frame and factors). But I’m finally picking up Python’s approach as well, mainly because I want my students, if they’re going to learn only one language, to learn Python. So I’m teaching the numpy, pandas, matplotlib, seaborn combination. I was lucky to discover two things about pandas very quickly, and only because I’ve been through the same thing in R: 1) the way you learn to use a package is different in subtle ways from how it is documented and taught, and 2) the way a young data science package is used now is different from how it was first used (and documented) before it was tidied up. That means that StackExchange and other references are going to be out of date a lot of the time, in ways that are hard to spot until someone holds your hand.

I just got the hand-holding (the straight-to-pandas-in-2018 fast-forward) and I’m sharing it. The pitfalls all come down to Python’s poor distinctions between copying objects and editing them in place. In a nutshell, use .query() and .assign() as much as possible, as well as .loc[], .iloc[], and .copy(). (Note the square brackets: .loc and .iloc are indexers, not methods.) Use [], [[]], and plain df.column attribute access as little as possible, and, if you do use them, only when reading and never when writing or munging. In more detail, the resources below are up to date as of the beginning of 2018. They will spare your ontogeny from having to recapitulate pandas’ phylogeny:

https://tomaugspurger.github.io/modern-1-intro

http://nbviewer.jupyter.org/urls/dl.dropbox.com/s/sp3flbe708brblz/Pandas_Views_vs_Copies.ipynb
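
To make the nutshell advice concrete, here is a minimal sketch of what that style looks like. The DataFrame and its column names (species, weight_kg, and so on) are made up for illustration and don't come from either resource above; treat it as one reasonable reading of the advice rather than the canonical way to do it.

```python
import pandas as pd

# Toy data; the column names are hypothetical.
df = pd.DataFrame({
    "species": ["cat", "dog", "cat", "bird"],
    "weight_kg": [4.0, 20.0, 5.5, 0.1],
})

# Reading and munging: .query() filters rows and .assign() adds a column.
# Each step returns a new DataFrame, so df itself is left alone.
cats = (
    df
    .query("species == 'cat'")
    .assign(weight_g=lambda d: d["weight_kg"] * 1000)
)

# Writing: select rows and column in a single .loc[] call.
# The chained form df[df["weight_kg"] > 10]["size"] = "large" is the kind
# of write to avoid, because it may land on a temporary copy.
sized = df.assign(size="small")
sized.loc[sized["weight_kg"] > 10, "size"] = "large"

# When you want an independent subset to edit, say so with .copy().
subset = df.loc[df["species"] == "cat"].copy()
subset["weight_kg"] = 0.0  # changes subset only, not df
```

In every step the original df is untouched, which is the point: you always know whether you are holding a new object or editing an existing one.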

Thanks, Eshin.

About

This entry was posted on Tuesday, January 9th, 2018 and is filed under Uncategorized.