locate
df.loc[row_label, col_label] select by label (also accepts boolean masks)
df.iloc[row_pos, col_pos] select by integer position; input must be integers
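A minimal sketch of the difference (the DataFrame df and its columns are made up for illustration):

    import pandas as pd

    df = pd.DataFrame({"name": ["a", "b", "c"], "age": [25, 30, 35]})  # hypothetical data
    df.loc[0, "age"]                 # label-based: row label 0, column "age" -> 25
    df.iloc[0, 1]                    # position-based: first row, second column -> 25
    df.loc[df["age"] > 26, "name"]   # loc also accepts a boolean mask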
plot
df.plot() line plot of all numeric columns
df.plot.scatter(x=col1, y=col2) scatter plot (x and y are required)
df.plot.box() box plot
create one subplot per column -> df.plot.area(figsize=(w, h), subplots=True)
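A quick sketch of the plot variants, assuming matplotlib is installed and a hypothetical two-column DataFrame:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 4, 8]})  # hypothetical data
    df.plot()                                    # line plot, one line per column
    df.plot.scatter(x="a", y="b")                # scatter plot of column a vs b
    df.plot.box()                                # one box per column
    df.plot.area(figsize=(8, 4), subplots=True)  # one subplot per column
    plt.show()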
apply function
df[col_name].agg_func()
calculate for more columns -> df[[col1, col2, …]].agg_func()
apply several aggregate functions at once
df.agg({col1: ['agg_func1', 'agg_func2', …]})
aggregate function:
- describe()
- min()
- max()
- median()
- skew()
- mean()
- value_counts() shortcut for a groupby + count per unique value
- count() excludes null values
- size() includes null values and returns the number of rows (a method on groupby results; plain Series/DataFrame expose it as the size attribute)
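A sketch of the aggregation patterns above on made-up data; note the count/size difference caused by the null:

    import pandas as pd

    df = pd.DataFrame({"age": [25.0, 30.0, None], "fare": [7.0, 71.0, 8.0]})  # hypothetical
    df["age"].mean()                   # single column, single statistic
    df[["age", "fare"]].median()       # several columns at once
    df.agg({"age": ["min", "max"], "fare": ["mean", "skew"]})  # several functions per column
    df["fare"].value_counts()          # counts per unique value (groupby shortcut)
    df["age"].count()                  # 2: the null is excluded
    df["age"].size                     # 3: nulls included (attribute on a Series)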
reshape
df.assign(new_col_name=values) append a new column
df.rename(columns={old_name: new_name})
df.pivot(index=idx, columns=col, values=val) reshape data; requires a single value for each index/column combination
df.pivot_table(index=idx, columns=col, values=val, aggfunc=func) like pivot, but aggregates duplicate combinations
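A sketch of the reshape operations on made-up long-format data; all column names are hypothetical:

    import pandas as pd

    df = pd.DataFrame({
        "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
        "city": ["NY", "LA", "NY", "LA"],
        "temp_f": [30, 65, 28, 70],
    })  # hypothetical data
    df = df.assign(temp_c=(df["temp_f"] - 32) * 5 / 9)   # append a derived column
    df = df.rename(columns={"temp_f": "fahrenheit"})
    wide = df.pivot(index="date", columns="city", values="fahrenheit")  # unique pairs only
    avg = df.pivot_table(index="date", columns="city",
                         values="fahrenheit", aggfunc="mean")  # aggregates duplicates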
combine data from two tables
pd.concat([a, b], axis=0) tables with the same columns, appends rows
pd.merge() performs an inner join by default; how='outer' -> full join
join tables by a common identifier; supports inner, outer (full), left, and right joins, i.e. database-style merging of tables
pd.merge(tab1, tab2, how='left'/'right', on='common_col')
if there is no common column name, but the two tables share a column with a common format
pd.merge(tab1, tab2, how='left', left_on='com_tab1_col', right_on='com_tab2_col')
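A sketch of concat and the merge variants on two made-up tables:

    import pandas as pd

    jan = pd.DataFrame({"id": [1, 2], "sales": [10, 20]})  # hypothetical tables
    feb = pd.DataFrame({"id": [2, 3], "sales": [30, 40]})
    stacked = pd.concat([jan, feb], axis=0)        # same columns, rows appended

    pd.merge(jan, feb, how="outer", on="id")       # full join keeps all ids;
                                                   # overlapping names get _x/_y suffixes
    names = pd.DataFrame({"customer_id": [1, 2], "name": ["Ann", "Bob"]})
    pd.merge(jan, names, how="left",               # no shared column name, so
             left_on="id", right_on="customer_id") # map the key explicitly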
time series data
convert to a datetime object with pd.to_datetime(df[col]) so datetime operations become available via the dt accessor
parse dates while reading -> pd.read_csv('', parse_dates=['datetime'])
df[col].dt.year works on a datetime column; with a DatetimeIndex the same properties are available directly on the index (df.index.year), no dt accessor needed
resample() is a groupby on a time-series frequency; an aggregate function is required along with resample()
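A sketch of the datetime workflow on made-up data (parse_dates in read_csv would replace the to_datetime step):

    import pandas as pd

    df = pd.DataFrame({
        "datetime": ["2024-01-01 00:00", "2024-01-01 12:00", "2024-01-02 06:00"],
        "value": [1.0, 2.0, 3.0],
    })  # hypothetical data
    df["datetime"] = pd.to_datetime(df["datetime"])
    df["datetime"].dt.year             # dt accessor on a datetime column
    df = df.set_index("datetime")
    df.index.year                      # same property, directly on the DatetimeIndex
    daily = df.resample("D").mean()    # resample must be followed by an aggregation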
textual data
use the str accessor to get specialized string methods
series.str.get(i) extract the relevant part (the element at position i)
regular expressions:
series.str.contains() check whether a pattern/string is contained
series.str.extract() extract the capture groups of a regex pattern
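A sketch of the str accessor on a made-up Series:

    import pandas as pd

    s = pd.Series(["Braund, Mr. Owen", "Cumings, Mrs. Florence"])  # hypothetical data
    s.str.lower()                  # any string method via the str accessor
    s.str.split(",").str.get(0)    # element 0 of each split list -> the surname
    s.str.contains("Mrs")          # boolean mask; the pattern is a regex by default
    s.str.extract(r"(Mr|Mrs)\.")   # capture group becomes a DataFrame column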