I’m late to the game on data science in Python because I still do my data analysis overwhelmingly in R (thank god for data.table, the tidyverse, and all the amazing stats packages; to hell with data.frame and factors). But I’m finally picking up Python’s approach as well, mainly because I want my students, if they’re going to learn only one language, to learn Python. So I’m teaching the numpy, pandas, matplotlib, seaborn combination. I got lucky to discover two things about pandas very quickly, and only because I’ve been through the same thing in R: 1) the way you learn to use a package is different in subtle ways from how it is documented and taught, and 2) the way a young data science package is used now is different from how it was first used (and documented) before it was tidied up. That means that StackExchange and other references are going to be irrelevant a lot of the time, in ways that are hard to spot until someone holds your hand.
I just got the hand-holding, the straight-to-pandas-in-2018 fast-forward, and I’m sharing it. The pitfalls all come down to Python’s poor distinctions between copying objects and editing them in place. In a nutshell, use .query() and .assign() as much as possible, as well as .loc[], .iloc[], and .copy(). Use [], [[]], and simple df.
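Here is a minimal sketch of what that advice can look like in practice. The DataFrame, its column names, and the thresholds are all made up for illustration; the point is only that .query() and .assign() hand back new objects, .loc[] makes an in-place edit explicit, and .copy() makes it unambiguous that you are working on a separate object.

```python
import pandas as pd

# Toy data, invented for this example
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "score": [10, 25, 40, 55],
})

# .query() filters rows and .assign() adds a column;
# both return a new DataFrame and leave df untouched
high = (
    df.query("score > 20")
      .assign(score_pct=lambda d: d["score"] / 100)
)

# .loc[] with a row mask and a column label edits df itself, in one step
df.loc[df["score"] > 50, "score"] = 50

# .copy() gives an independent object, so editing it cannot touch df
subset = df.loc[df["score"] >= 25].copy()
subset["flag"] = True

# .iloc[] selects by integer position rather than by label
first_row = df.iloc[0]
```

After this runs, df has the capped score but no score_pct or flag column, while high and subset carry their own changes. The second link below goes into when pandas gives you a view versus a copy.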
https://tomaugspurger.github.io/modern-1-intro
http://nbviewer.jupyter.org/urls/dl.dropbox.com/s/sp3flbe708brblz/Pandas_Views_vs_Copies.ipynb
Thanks, Eshin.