locate
df.loc[row_label, col_label] select by label (also accepts boolean masks)
df.iloc[row_pos, col_pos] select by integer position; input must be integers
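A minimal sketch of the difference (the DataFrame df and its columns are made up for illustration):

    import pandas as pd

    df = pd.DataFrame({"name": ["a", "b", "c"], "age": [25, 30, 35]})  # hypothetical data
    df.loc[0, "age"]                 # label-based: row label 0, column "age" -> 25
    df.iloc[0, 1]                    # position-based: first row, second column -> 25
    df.loc[df["age"] > 26, "name"]   # loc also accepts a boolean mask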
plot
df.plot() line plot of all numeric columns
df.plot.scatter(x=col1, y=col2) scatter plot (x and y are required)
df.plot.box() box plot
create one subplot per column -> df.plot.area(figsize=(w, h), subplots=True)
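A quick sketch of the plot variants, assuming matplotlib is installed and a hypothetical two-column DataFrame:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 4, 8]})  # hypothetical data
    df.plot()                                    # line plot, one line per column
    df.plot.scatter(x="a", y="b")                # scatter plot of column a vs b
    df.plot.box()                                # one box per column
    df.plot.area(figsize=(8, 4), subplots=True)  # one subplot per column
    plt.show()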
apply function
df[col_name].agg_func()
calculate for more columns -> df[[col1, col2, …]].agg_func()
apply several aggregate functions at once
df.agg({col1: ['agg_func1', 'agg_func2', …]})
aggregate function:
- describe()
- min()
- max()
- median()
- skew()
- mean()
- value_counts() shortcut for a groupby + count per unique value
- count() excludes null values
- size() includes null values and returns the number of rows (a method on groupby results; plain Series/DataFrame expose it as the size attribute)
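A sketch of the aggregation patterns above on made-up data; note the count/size difference caused by the null:

    import pandas as pd

    df = pd.DataFrame({"age": [25.0, 30.0, None], "fare": [7.0, 71.0, 8.0]})  # hypothetical
    df["age"].mean()                   # single column, single statistic
    df[["age", "fare"]].median()       # several columns at once
    df.agg({"age": ["min", "max"], "fare": ["mean", "skew"]})  # several functions per column
    df["fare"].value_counts()          # counts per unique value (groupby shortcut)
    df["age"].count()                  # 2: the null is excluded
    df["age"].size                     # 3: nulls included (attribute on a Series)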
reshape
df.assign(new_col_name=values) append a new column
df.rename(columns={old_name: new_name})
df.pivot(index=idx, columns=col, values=val) reshape data; requires a single value for each index/column combination
df.pivot_table(index=idx, columns=col, values=val, aggfunc=func) like pivot, but aggregates duplicate combinations
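A sketch of the reshape operations on made-up long-format data; all column names are hypothetical:

    import pandas as pd

    df = pd.DataFrame({
        "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
        "city": ["NY", "LA", "NY", "LA"],
        "temp_f": [30, 65, 28, 70],
    })  # hypothetical data
    df = df.assign(temp_c=(df["temp_f"] - 32) * 5 / 9)   # append a derived column
    df = df.rename(columns={"temp_f": "fahrenheit"})
    wide = df.pivot(index="date", columns="city", values="fahrenheit")  # unique pairs only
    avg = df.pivot_table(index="date", columns="city",
                         values="fahrenheit", aggfunc="mean")  # aggregates duplicates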
combine data from two tables
pd.concat([a, b], axis=0) tables with the same columns, appends rows
pd.merge() performs an inner join by default; how='outer' -> full join
join tables by a common identifier; supports inner, outer (full), left, and right joins, i.e. database-style merging of tables
pd.merge(tab1, tab2, how='left'/'right', on='common_col')
if there is no common column name, but the two tables share a column with a common format
pd.merge(tab1, tab2, how='left', left_on='com_tab1_col', right_on='com_tab2_col')
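A sketch of concat and the merge variants on two made-up tables:

    import pandas as pd

    jan = pd.DataFrame({"id": [1, 2], "sales": [10, 20]})  # hypothetical tables
    feb = pd.DataFrame({"id": [2, 3], "sales": [30, 40]})
    stacked = pd.concat([jan, feb], axis=0)        # same columns, rows appended

    pd.merge(jan, feb, how="outer", on="id")       # full join keeps all ids;
                                                   # overlapping names get _x/_y suffixes
    names = pd.DataFrame({"customer_id": [1, 2], "name": ["Ann", "Bob"]})
    pd.merge(jan, names, how="left",               # no shared column name, so
             left_on="id", right_on="customer_id") # map the key explicitly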
time series data
convert to a datetime object with pd.to_datetime(df[col]) so datetime operations become available via the dt accessor
parse dates while reading -> pd.read_csv('', parse_dates=['datetime'])
df[col].dt.year works on a datetime column; with a DatetimeIndex the same properties are available directly on the index (df.index.year), no dt accessor needed
resample() is a groupby on a time-series frequency; an aggregate function is required along with resample()
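A sketch of the datetime workflow on made-up data (parse_dates in read_csv would replace the to_datetime step):

    import pandas as pd

    df = pd.DataFrame({
        "datetime": ["2024-01-01 00:00", "2024-01-01 12:00", "2024-01-02 06:00"],
        "value": [1.0, 2.0, 3.0],
    })  # hypothetical data
    df["datetime"] = pd.to_datetime(df["datetime"])
    df["datetime"].dt.year             # dt accessor on a datetime column
    df = df.set_index("datetime")
    df.index.year                      # same property, directly on the DatetimeIndex
    daily = df.resample("D").mean()    # resample must be followed by an aggregation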
textual data
use the str accessor to get specialized string methods
series.str.get(i) extract the relevant part (the element at position i)
regular expressions:
series.str.contains() check whether a pattern/string is contained
series.str.extract() extract the capture groups of a regex pattern
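A sketch of the str accessor on a made-up Series:

    import pandas as pd

    s = pd.Series(["Braund, Mr. Owen", "Cumings, Mrs. Florence"])  # hypothetical data
    s.str.lower()                  # any string method via the str accessor
    s.str.split(",").str.get(0)    # element 0 of each split list -> the surname
    s.str.contains("Mrs")          # boolean mask; the pattern is a regex by default
    s.str.extract(r"(Mr|Mrs)\.")   # capture group becomes a DataFrame column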