Pandas functions. Lots of them.
Pandas is your best friend for your data needs. It is the king of data manipulation in the Python empire.
Any data scientist intending to use Python as their tool of choice must master Pandas. It is compulsory, like learning to walk before you run.
So here is a quick reference list of functions, just for you.
But reading written material is no substitute for repeated practice. And hence, you should not expect to remember the functions below. The list is a cheatsheet, not an oracle.
Creating a dataframe
The bread and butter of Pandas. Let’s start with some numpy foreplay.
import numpy as np
import pandas as pd
dates = pd.date_range('2017-06-21', '2017-06-27')
pd.DataFrame(np.random.randint(0, 10, 7), index=dates, columns=['freq'])
You can also create a dataframe in other ways. Here’s a multi-column version from a dictionary.
x = {'a' : np.random.randint(0,10,7),
'b' : np.random.randint(0,10,7)}
pd.DataFrame(x)
Creating a series
Series are the loyal servants in the Pandas empire.
To create one, use pd.Series(x, index).
Here, x is a lowly array, dict, scalar, or something else. It will be paired with index for eternity, or until death taketh them. Or a memory leak.
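A minimal sketch of the three kinds of x mentioned above. The values and index labels are invented for illustration.

```python
import pandas as pd

# From a list, paired with an explicit index
s_list = pd.Series([3, 1, 4], index=['a', 'b', 'c'])

# From a dict: the keys become the index
s_dict = pd.Series({'x': 10, 'y': 20})

# From a scalar: the value is repeated for every index label
s_scalar = pd.Series(7, index=['p', 'q', 'r'])

print(s_list['b'])         # look up a value by its index label
print(list(s_dict.index))  # the dict keys, now index labels
print(s_scalar.tolist())   # the scalar, once per label
```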
Dataframe functions
Thirty of the finest functions, arranged for your convenience. Master this list, and mastery of self follows.
pd.DataFrame.head() | return the first five rows of a dataframe.
pd.DataFrame.tail() | return the last five rows of a dataframe.
pd.DataFrame.index | display the index of a dataframe.
pd.DataFrame.columns | list the columns of a dataframe.
pd.DataFrame.dtypes | show the data type of each column of a dataframe.
pd.DataFrame.values | return the values of a dataframe as a numpy array.
pd.DataFrame.describe() | summarise a dataframe: return summary statistics including the number of observations per column, the mean of each column and the standard deviation of each column.
pd.DataFrame.info() | brief summary of a dataframe.
pd.DataFrame.T | transpose a dataframe.
pd.DataFrame.sort_index() | sort a dataframe by its index values. Can specify the axis (row labels or column names) and the order of sorting.
pd.DataFrame.sort_values('col') | sort a dataframe by the column named col.
pd.DataFrame.iloc[i] | slice and subset your data by integer position.
pd.DataFrame.loc[] | slice and subset your data by index label (often a string).
pd.DataFrame.isin(l) | return True or False for each element, depending on whether its value is in the list l.
pd.DataFrame.set_index(s) | set the index of a dataframe to column name s, where s can be a list of column names to create a MultiIndex.
pd.DataFrame.swaplevel(i, j) | swap the levels i and j in a MultiIndex.
pd.DataFrame.drop('c1', axis=1, inplace=True) | drop a column c1 from a dataframe, modifying it in place.
pd.DataFrame.iterrows() | a generator for iterating over the rows of a dataframe as (index, Series) pairs.
pd.DataFrame.apply(f, axis) | apply a function f vectorwise to a dataframe over a given axis.
pd.DataFrame.applymap(f) | apply a function f elementwise to a dataframe. Deprecated since pandas 2.1 in favour of pd.DataFrame.map(f).
pd.DataFrame.drop(s, axis=1) | delete column s from a dataframe, returning a new dataframe.
pd.DataFrame.resample('offsetString') | convenient way to group timeseries into bins. The offset string (for example 'D' for daily or 'W' for weekly) sets the width of each bin.
pd.DataFrame.merge(df) | join a dataframe df to another dataframe. Can specify the type of join.
pd.DataFrame.append(df) | append the dataframe df to a dataframe, like rbind() in R. Removed in pandas 2.0; use pd.concat() instead.
pd.DataFrame.reset_index() | reset the index back to the default numeric row counter.
pd.DataFrame.idxmax() | dataframe equivalent of the numpy argmax method, except that it returns the index label of the maximum rather than its position.
pd.DataFrame.isnull() | indicates if values are null or not.
pd.DataFrame.from_dict(d) | create a dataframe from a dictionary d.
pd.DataFrame.stack() | turn column names into index labels.
pd.DataFrame.unstack() | turn index values into column names.
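To see a handful of these in action, here is a rough sketch on a toy table. The column names, cities and numbers are made up for illustration, and pd.concat stands in for append, which no longer exists in recent Pandas.

```python
import pandas as pd

# Toy data, invented for illustration
df = pd.DataFrame(
    {'city': ['Oslo', 'Lima', 'Pune', 'Oslo'],
     'year': [2016, 2016, 2017, 2017],
     'sales': [5, 3, 8, 2]},
    index=['w', 'x', 'y', 'z'],
)

first_row = df.iloc[0]                    # select by integer position
row_y = df.loc['y']                       # select by index label
in_oslo = df[df['city'].isin(['Oslo'])]   # boolean mask built with isin

by_sales = df.sort_values('sales', ascending=False)

# apply runs the function once per column by default (axis=0)
col_max = df[['year', 'sales']].apply(lambda col: col.max())

# set_index with a list of columns builds a MultiIndex;
# unstack then turns the inner level ('year') into columns
wide = df.set_index(['city', 'year'])['sales'].unstack()

# merge joins on the shared column name ('city') when no key is given
regions = pd.DataFrame({'city': ['Oslo', 'Lima'], 'region': ['EU', 'SA']})
merged = df.merge(regions, how='left')

# pd.concat stacks frames row-wise, the job append used to do
doubled = pd.concat([df, df], ignore_index=True)

# resample groups a datetime-indexed frame into bins, here two days wide
ts = pd.DataFrame({'freq': range(6)},
                  index=pd.date_range('2017-06-21', periods=6))
two_day = ts.resample('2D').sum()
```

Note that merge returns a fresh default index, so the labels 'w' to 'z' do not survive the join. Call reset_index() first if you need to keep them as a column.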
Groupby methods
We turn next to the Groupby methods. A useful family, these ones.
To group a dataframe by a column (or columns), use pd.DataFrame.groupby('colname'). This returns a DataFrameGroupBy object, on which you can call a certain set of methods.
So! Say gb is a DataFrameGroupBy object, obtained faithfully from pd.DataFrame.groupby().
There are some very useful functions you can use: sum, min, max, mean, median and std. Hardworking citizens of the data science empire, those guys.
More useful methods:
gb.agg(arr) | apply each of the functions you specify in the array arr to every group.
gb.size() | return the number of elements in each group.
gb.describe() | return summary statistics for each group.
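A small sketch of gb at work. The team and points columns are invented for illustration, and the example selects a single column before aggregating to keep the output flat.

```python
import pandas as pd

# Toy data, invented for illustration
df = pd.DataFrame({'team': ['a', 'a', 'b', 'b', 'b'],
                   'points': [1, 3, 2, 4, 6]})

gb = df.groupby('team')

totals = gb['points'].sum()    # one total per team
sizes = gb.size()              # number of rows in each group
# agg takes a list of function names and returns one column per function
summary = gb['points'].agg(['min', 'max', 'mean'])

print(summary)
```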
String methods
The Pandas library has an accessor for string manipulation and string handling. It operates on Series objects and is located at pd.Series.str. Don’t confuse it with Python’s native str. A false friend, that one.
Again! Let s be a pd.Series of strings. Then you could do
s.str[0] – return the first letter of each element in s.
s.str.lower() – change each element of s to lowercase.
s.str.upper() – change each element of s to uppercase.
s.str.len() – return the number of characters in each element of s.
s.str.strip() – remove whitespace around the elements of s.
s.str.replace('s1', 's2') – replace a substring s1 with a substring s2 for each element of s.
s.str.split('s1') – split up the elements of s using s1 as a separator.
s.str.get(i) – extract the i th element of each array of s.
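Here is a short sketch chaining a few of these together. The names in s are made up for illustration, and regex=False is passed to replace so the pattern is treated as a literal string.

```python
import pandas as pd

# Toy data, invented for illustration
s = pd.Series(['  Ada Lovelace ', 'alan turing', 'Grace HOPPER'])

clean = s.str.strip()             # drop the surrounding whitespace
lower = clean.str.lower()
initials = clean.str[0]           # first character of each element
lengths = clean.str.len()         # characters per element
parts = clean.str.split(' ')      # each element becomes a list of words
surnames = parts.str.get(1)       # second item of each list
snake = clean.str.replace(' ', '_', regex=False)

print(surnames.tolist())
```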
Miscellaneous functions
You want more! Okay then.
pd.__version__ | return the version of Pandas.
pd.date_range() | create a sequence of dates in a DatetimeIndex. Some options include: a start date and an end date (e.g. pd.date_range('2015-01-05', '2015-01-10')); a start date, end date and a frequency (e.g. pd.date_range('2016-01', '2016-10', freq='M'), where pandas 2.2 and later prefer 'ME' for month end); or a start date and the number of periods (e.g. pd.date_range('2016-01', periods=10)).
pd.read_csv(filepath, sep, index_col) | read in a CSV file, from a local path or a web address. Specify the separator with the sep parameter, and the column to use as the row labels of the table with the index_col parameter.
pd.value_counts() | count how many times each value appears in a column. Deprecated as a top-level function since pandas 2.1; call pd.Series.value_counts() on the column instead.
pd.crosstab() | create a frequency table of two or more factors.
pd.Series.map(f) | the Series version of applymap.
pd.to_datetime() | convert something to a numpy datetime64 format.
pd.to_numeric() | convert something to a numeric (integer or float) format.
pd.concat(objs) | put together dataframes in the array objs along a given axis, similar to rbind() or cbind() in R.
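A rough sketch of a few of these. The survey table is invented for illustration, and value_counts is called on the column itself, not through the top-level function.

```python
import pandas as pd

days = pd.date_range('2016-01-01', periods=10)   # daily frequency by default

# Strings in, proper dates and numbers out
stamps = pd.to_datetime(pd.Series(['2017-06-21', '2017-06-27']))
whole = pd.to_numeric(pd.Series(['1', '2', '3']))    # integers stay integers
mixed = pd.to_numeric(pd.Series(['1.5', '2']))       # a decimal forces floats

# Toy data, invented for illustration
survey = pd.DataFrame({'pet': ['cat', 'dog', 'cat', 'cat'],
                       'home': ['flat', 'house', 'house', 'flat']})

counts = survey['pet'].value_counts()                # occurrences of each value
table = pd.crosstab(survey['pet'], survey['home'])   # pet by home frequencies

stacked = pd.concat([survey, survey], ignore_index=True)  # like rbind()
side = pd.concat([survey, survey], axis=1)                # like cbind()

print(table)
```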
Final words
This is by no means complete; nor does it pretend to be complete.
It’s just a list of functions. No more, no less.