Pandering to the Data Science Student Masses
There are a variety of ways to do things in Pandas. Translation: this is both a blessing and a curse. The novice user of Pandas (i.e., me) can easily get confused. So while there is probably more than one way to do any of the following, these are the methods I found the least intimidating.
Create a Random DataFrame (Quickly!):
- Say you want to give an example of how to do something in Pandas, but you don’t want to bother scrolling through endless folders to find an existing dataset to work with. Solution: Create one with Pandas!
- Pass the output of np.random.rand() to pd.DataFrame() and you can quickly generate an example DataFrame of random numbers.
By default, Pandas labels both the rows and the columns with integers, but you can pass a list of strings as the columns argument to make the DataFrame easier to work with.
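Here is a minimal sketch of that idea. The shape (4 rows by 8 columns) and the letter column names are my own picks for illustration, not anything special.

```python
import numpy as np
import pandas as pd

# 4 rows x 8 columns of random floats drawn uniformly from [0, 1)
df = pd.DataFrame(np.random.rand(4, 8))
print(df.columns.tolist())  # default integer labels, 0 through 7

# Same shape, but with a list of strings as the column names
df = pd.DataFrame(np.random.rand(4, 8), columns=list("abcdefgh"))
print(df)
```

The values change on every run because nothing seeds the random generator here.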
Let’s say you decide you hate the column names, whether you chose them yourself or inherited them from another dataset whose names make as much sense as Carole Baskin’s alibi…
You can quickly rename any Pandas DataFrame columns by calling df.rename() and passing in a dictionary that maps old names to new ones. Don’t forget to specify which axis you are working on, since rename() targets the row index by default!
And just like that, we have renamed (almost) all our columns.
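A quick sketch of the renaming step, assuming the random DataFrame with letter column names from earlier. The new names are made up for the example, and I left one column out of the dictionary on purpose.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 8), columns=list("abcdefgh"))

# Map old names to new names; any column missing from the dictionary keeps its old name
new_names = {
    "a": "alpha", "b": "bravo", "c": "charlie", "d": "delta",
    "e": "echo", "f": "foxtrot", "g": "golf",
}

# axis="columns" tells rename() to look at column labels, not row labels
# (df.rename(columns=new_names) does the same thing)
df = df.rename(new_names, axis="columns")
print(df.columns.tolist())
```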
We can also add a prefix with df.add_prefix(), or get rid of a prefix if need be.
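Adding a prefix might look something like this. The prefix "X_" and the three-column DataFrame are assumptions for the sake of a small example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 3), columns=["a", "b", "c"])

# add_prefix() returns a new DataFrame with the string glued to the front of every column label
df = df.add_prefix("X_")
print(df.columns.tolist())
```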
Ahhh this is terrible. CHANGE IT BACK!!
Not to worry: this mistake can easily be remedied by using the df.columns.str.replace() method.
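A sketch of the undo, assuming the columns currently carry a hypothetical "X_" prefix. Note that str.replace() swaps out every occurrence of the text, not just the one at the front, so pick a prefix that does not show up elsewhere in your column names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 3), columns=["X_a", "X_b", "X_c"])

# Replace the literal text "X_" with nothing in every column label;
# regex=False treats the pattern as plain text rather than a regular expression
df.columns = df.columns.str.replace("X_", "", regex=False)
print(df.columns.tolist())
```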
Next, we examine dtypes and how to interact with them in order to get more out of our data exploration. For this we will be using our movies dataset.
In any dataset, we usually have a mixture of data types, or dtypes, which can be frustrating to work with and infuriating to discover after your code has already errored out. Believe me, I share your rage… As such, here are some quick and dirty examples of how to filter by dtype.
Our movie dataset contains a mix of object, integer, and float columns.
Let’s say you ONLY wanted to look at the numerical values in this dataset. To do this, simply call df.select_dtypes() with the include parameter, for example include='number'.
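Since the real movies file is not included here, this sketch uses a tiny stand-in DataFrame. The column names and values are invented, and I force the title column to the object dtype so it behaves like the text columns described above.

```python
import pandas as pd

# Hypothetical stand-in for the movies dataset: one object, one integer, one float column
movies = pd.DataFrame({
    "movie_title": pd.Series(["Movie A", "Movie B", "Movie C"], dtype="object"),
    "duration": [120, 95, 143],
    "imdb_score": [7.9, 6.1, 8.3],
})

# Check which dtype each column has
print(movies.dtypes)

# Keep only the numeric columns (integers and floats)
numeric = movies.select_dtypes(include="number")
print(numeric.columns.tolist())
```

You can also pass a list, such as include=["int64", "float64"], if you want to be pickier about which number types to keep.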
Luckily, this feature also works in reverse! Let’s use the ‘exclude’ parameter to drop the object columns, so that only the numbers remain.
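Here is roughly what the reverse looks like, again on an invented stand-in for the movies data with the title column forced to the object dtype.

```python
import pandas as pd

# Hypothetical stand-in for the movies dataset
movies = pd.DataFrame({
    "movie_title": pd.Series(["Movie A", "Movie B", "Movie C"], dtype="object"),
    "duration": [120, 95, 143],
    "imdb_score": [7.9, 6.1, 8.3],
})

# Throw out the object columns; whatever is left gets returned
no_objects = movies.select_dtypes(exclude="object")
print(no_objects.columns.tolist())
```

This only leaves numbers because the stand-in has nothing but object and numeric columns. A dataset with dates or booleans would keep those too.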
3) Subsetting DataFrames
Lastly, let’s look at Pandas DataFrames themselves, namely how to create subsets of a DataFrame.
If for any reason you want to split up your DataFrame, Pandas has an answer to that. It will allow you to create two subsets of data according to the proportions you set, whether that be 75:25 or 60:40. You get the idea.
Let’s go back to our movie dataset. It has 5043 rows. For whatever reason we don’t need this many and have decided that we would like, ehh, roughly 4000 movies instead. To subset our ‘movies’ DataFrame we call the following method: df.sample(frac= , random_state= ).
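A sketch of the sampling call, using a made-up DataFrame with the same 5043 rows in place of the real movies file. The column names and the random_state value of 42 are my assumptions. The frac of 0.8 matches the 80/20 split used in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with the same number of rows as the movies dataset
movies = pd.DataFrame({
    "movie_id": range(5043),
    "imdb_score": np.random.rand(5043) * 10,
})

# Randomly pick 80% of the rows; random_state makes the pick repeatable
subset1 = movies.sample(frac=0.8, random_state=42)
print(len(subset1))  # 4034 rows, i.e. 80% of 5043
```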
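The ‘random_state’ argument seeds the random number generator, so the same random rows are drawn every time the code runs, while ‘frac’ is the fraction of the rows you want in the subset (0.8 for 80%). Now we can create a second subset with the remaining 20% of our movies data, for example by dropping the sampled rows from the original DataFrame.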
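One way to grab the leftover 20% is to drop the sampled rows by their index labels. This is a sketch on a made-up stand-in for the movies data, and the random_state value is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with the same number of rows as the movies dataset
movies = pd.DataFrame({
    "movie_id": range(5043),
    "imdb_score": np.random.rand(5043) * 10,
})

# First subset: a random 80% of the rows
subset1 = movies.sample(frac=0.8, random_state=42)

# Second subset: everything whose index label was NOT picked for subset1
subset2 = movies.drop(subset1.index)

print(len(subset1), len(subset2))
```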
Looking at the index numbers of our two movie subsets, we can confirm that concatenating them adds back up to the original DataFrame.
Note: This method only works if every row has a unique index label. If labels repeat, dropping by label removes every row that shares a sampled label.
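To double check the split, you can peek at both indexes and glue the pieces back together. This sketch rebuilds the two subsets on a made-up stand-in for the movies data, then compares the result with the original.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with the same number of rows as the movies dataset
movies = pd.DataFrame({
    "movie_id": range(5043),
    "imdb_score": np.random.rand(5043) * 10,
})
subset1 = movies.sample(frac=0.8, random_state=42)
subset2 = movies.drop(subset1.index)

# Index labels of each subset, sorted so they are easier to eyeball
print(subset1.index.sort_values())
print(subset2.index.sort_values())

# Stack the two subsets and restore the original row order
rebuilt = pd.concat([subset1, subset2]).sort_index()
print(len(rebuilt) == len(movies))
print(rebuilt.equals(movies))
```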
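Here is a tiny made-up example of what goes wrong when index labels repeat. To keep it predictable, I pick the first row by position instead of sampling at random. One possible fix, assuming you do not need the old labels, is to reset the index first.

```python
import pandas as pd

# Hypothetical DataFrame where the label 0 appears twice
movies = pd.DataFrame({"title": ["A", "B", "C", "D"]}, index=[0, 0, 1, 2])

subset1 = movies.iloc[[0]]             # one row, labelled 0
subset2 = movies.drop(subset1.index)   # removes BOTH rows labelled 0
print(len(subset1) + len(subset2))     # 3, so one row went missing

# Give every row its own label, then split again
fixed = movies.reset_index(drop=True)
subset1 = fixed.iloc[[0]]
subset2 = fixed.drop(subset1.index)
print(len(subset1) + len(subset2))     # 4, nothing lost
```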
Hopefully, the above methods have been insightful or at the very least offer some help to the struggling data science students out there. I feel your pain.