mlcourse.ai week1

main pandas methods

pandas is used to load, process, and analyze data
data structures: the Series and DataFrame classes

df[col].value_counts() // returns the number of rows in each category
df[col].value_counts(normalize=True) // returns the fraction of rows in each category
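A minimal sketch of both calls on a toy column. The column name follows these notes; the four values are made up for illustration.

```python
import pandas as pd

# Toy data, not the course dataset.
df = pd.DataFrame({"International plan": ["No", "No", "Yes", "No"]})

# Absolute number of rows per category.
counts = df["International plan"].value_counts()

# Same categories, expressed as fractions that sum to 1.
shares = df["International plan"].value_counts(normalize=True)

print(counts)
print(shares)
```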

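df.assign(new_col=expression) // returns a new DataFrame with new_col added, or overwritten if a column with that name already exists
df.eval(expr) // evaluates an expression string; an assignment expression returns a new DataFrame, or modifies the existing one with inplace=True; refer to Python variables with @variable or an f-string
df.query(expr) // filters rows by a boolean expression; returns a new DataFrame, or modifies the existing one with inplace=True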
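A small sketch of the three methods side by side. The column names (day_calls, eve_calls) and the threshold variable are hypothetical, chosen without spaces so they can appear directly in expression strings.

```python
import pandas as pd

# Toy data with made-up column names.
df = pd.DataFrame({"day_calls": [100, 80, 120], "eve_calls": [90, 110, 70]})

# assign returns a new DataFrame; df itself keeps its two columns.
with_total = df.assign(total_calls=df["day_calls"] + df["eve_calls"])

# eval parses the string; @threshold refers to the local Python variable.
threshold = 95
flagged = df.eval("busy_day = day_calls > @threshold")

# query filters rows using the same expression syntax.
busy = df.query("day_calls > @threshold")

print(with_total)
print(flagged)
print(busy)
```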

apply functions to cells, rows, columns

df.apply(np.max) // applies np.max to every column
df['State'].apply(lambda state: state[0] == 'W') // applies the lambda to every value in the State column; pass axis=1 to df.apply to work row by row
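A sketch of column-wise and element-wise apply, assuming a toy frame with a State column and one numeric column. The values are invented.

```python
import numpy as np
import pandas as pd

# Toy data, not the course dataset.
df = pd.DataFrame({"State": ["WA", "NY", "WV"], "Total day calls": [110, 95, 130]})

# Column-wise: one result per column (restricted here to the numeric column).
col_max = df[["Total day calls"]].apply(np.max)

# Element-wise on a Series: one boolean per value.
starts_with_w = df["State"].apply(lambda state: state[0] == "W")

# The boolean Series can then be used as a row filter.
w_states = df[starts_with_w]

print(col_max)
print(w_states)
```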

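map replaces the values in a column when given a dict of the form {old_val: new_val}; replace does the same from the DataFrame level
df[col_name].map({old_val: new_val}) // values missing from the dict become NaN
df.replace({col_name: {old_val: new_val}})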
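Both routes to the same result, sketched on a toy column. The Yes/No to True/False mapping is an assumption made for illustration.

```python
import pandas as pd

# Toy data, not the course dataset.
df = pd.DataFrame({"International plan": ["Yes", "No", "No"]})

# Dict of the form {old_val: new_val}.
d = {"No": False, "Yes": True}

# map works on a single column (a Series).
mapped = df["International plan"].map(d)

# replace works on the DataFrame: {column_name: {old_val: new_val}}.
replaced = df.replace({"International plan": d})

print(mapped)
print(replaced)
```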

grouping

df.groupby(by=grouping_columns)[columns_to_show].function() // if [columns_to_show] is omitted, all non-grouping columns are included
steps: split the rows by the values of grouping_columns, select columns_to_show, apply the function to each group
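The grouping pattern can be sketched as follows, assuming a toy frame with a Churn flag and two numeric columns. The numbers are made up.

```python
import pandas as pd

# Toy data, not the course dataset.
df = pd.DataFrame({
    "Churn": [0, 0, 1, 1],
    "Total day minutes": [100.0, 200.0, 300.0, 500.0],
    "Total eve minutes": [10.0, 30.0, 50.0, 70.0],
})

columns_to_show = ["Total day minutes", "Total eve minutes"]

# Split rows by Churn, keep two columns, take the mean within each group.
means = df.groupby(by="Churn")[columns_to_show].mean()

print(means)
```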

summary table

pd.crosstab(df['Churn'], df['International plan']) // contingency table: number of rows for each combination of the two variables
pd.crosstab(df['Churn'], df['Voice mail plan'], normalize=True) // same table as fractions of all rows
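A sketch of a contingency table in counts and in fractions, on four invented rows.

```python
import pandas as pd

# Toy data, not the course dataset.
df = pd.DataFrame({
    "Churn": [0, 0, 0, 1],
    "International plan": ["No", "No", "Yes", "Yes"],
})

# Rows are Churn values, columns are International plan values.
counts = pd.crosstab(df["Churn"], df["International plan"])

# normalize=True divides every cell by the total number of rows.
shares = pd.crosstab(df["Churn"], df["International plan"], normalize=True)

print(counts)
print(shares)
```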

df.pivot_table(values=..., index=..., aggfunc=...)
values – a list of variables to calculate statistics for,
index – a list of variables to group data by,
aggfunc – the statistic to calculate for each group, e.g. sum, mean, maximum, minimum or something else.
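A sketch of the three parameters in use. The Area code column and the call counts are made-up illustration data.

```python
import pandas as pd

# Toy data, not the course dataset.
df = pd.DataFrame({
    "Area code": [408, 408, 415, 415],
    "Total day calls": [100, 120, 90, 110],
    "Total eve calls": [80, 100, 70, 90],
})

# Mean of each listed value column within each Area code.
table = df.pivot_table(
    values=["Total day calls", "Total eve calls"],
    index=["Area code"],
    aggfunc="mean",
)

print(table)
```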

dataframe transformations

add columns

  • df.insert(loc=len(df.columns), column='Total calls', value=total_calls) // loc is the 0-based position at which the new column is inserted; len(df.columns) places it after the last column
  • df['Total charge'] = df['Total day charge'] + df['Total eve charge'] + \
    df['Total night charge'] + df['Total intl charge']
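Both ways of adding a column, sketched on a toy frame with only day and evening columns. The values are invented, and the shortened column set is an assumption to keep the example small.

```python
import pandas as pd

# Toy data, not the course dataset.
df = pd.DataFrame({
    "Total day calls": [100, 80],
    "Total eve calls": [90, 110],
    "Total day charge": [10.0, 20.0],
    "Total eve charge": [5.0, 2.5],
})

# insert modifies df in place; loc=len(df.columns) puts the column last.
total_calls = df["Total day calls"] + df["Total eve calls"]
df.insert(loc=len(df.columns), column="Total calls", value=total_calls)

# Plain assignment also appends the new column at the end.
df["Total charge"] = df["Total day charge"] + df["Total eve charge"]

print(df)
```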
    

delete columns/rows - axis=1/0

  • df.drop() // returns a new DataFrame; the existing one is changed only when inplace=True is set
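A sketch of dropping columns and rows, with and without inplace. The frame and its column names a and b are hypothetical.

```python
import pandas as pd

# Toy data with made-up column names.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# axis=1 drops columns; df is left untouched.
no_b = df.drop(["b"], axis=1)

# axis=0 (the default) drops rows by index label.
no_rows = df.drop([0, 1])

# Only inplace=True changes df itself.
df.drop(["b"], axis=1, inplace=True)

print(no_b)
print(no_rows)
print(df)
```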

first attempt at predicting churn

  • look for a relationship with pd.crosstab(df['Churn'], df['International plan'], margins=True) // returns the joint counts of the two variables plus an "All" total for each row and column
  • visual analysis with seaborn: sns.countplot(x='International plan', hue='Churn', data=df)
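A sketch of the margins=True table on four invented rows. It covers only the crosstab step, so it needs nothing beyond pandas.

```python
import pandas as pd

# Toy data, not the course dataset.
df = pd.DataFrame({
    "Churn": [0, 0, 0, 1],
    "International plan": ["No", "No", "Yes", "Yes"],
})

# margins=True appends an "All" row and an "All" column holding the totals.
table = pd.crosstab(df["Churn"], df["International plan"], margins=True)

print(table)
```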

lecture note & assignment
