mlcourse.ai week2

visual data analysis in Python

univariate visualization

numerical features

distribution of a numerical variable = histogram, df.hist()
density plot (kernel density estimate) = smoothed version of the histogram that does not depend on bin size, df.plot(kind='density', …)

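seaborn.distplot() - histogram + density (KDE); deprecated since seaborn 0.11 in favor of seaborn.histplot(kde=True) and seaborn.displot()
seaborn.boxplot() - box (quartiles and median) + lines (whiskers) + individual points (outliers, possible anomalies); passing the column as x= draws the box horizontally, y= draws it vertically
seaborn.violinplot() - kernel density estimate drawn on both sides of an inner box plot, so it shows the smoothed shape of the distribution along with its summary statistics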

categorical and binary features

frequency table

  • df[col].value_counts()
  • bar plot: df[col].value_counts().plot(kind='bar', …) or seaborn.countplot() [note: seaborn.barplot() is different; it shows a basic statistic (the mean by default) of a numerical variable grouped by a categorical feature]
  • seaborn.catplot(x=, y=, hue=, data=df, kind='bar') (named factorplot before seaborn 0.9) draws grouped bars; use kind='count' without y= to plot value counts
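
A small sketch of the difference between countplot and barplot noted above. The "plan" and "calls" columns are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: one categorical and one numerical column.
df = pd.DataFrame({
    "plan": ["basic", "pro", "basic", "free", "basic", "pro"],
    "calls": [3, 5, 1, 0, 2, 7],
})

# Frequency table, sorted from most to least frequent.
counts = df["plan"].value_counts()
print(counts)

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
sns.countplot(data=df, x="plan", ax=axes[0])           # bar height = number of rows per category
sns.barplot(data=df, x="plan", y="calls", ax=axes[1])  # bar height = mean of "calls" per category
fig.tight_layout()
```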

difference between histogram and bar chart

  • a histogram is for a numerical feature, a bar plot is for a categorical feature
  • values on the x-axis of a histogram are numbers; values on the x-axis of a bar plot can be numbers, strings, or booleans
  • the x-axis of a histogram is a Cartesian coordinate axis, so the position of each bin is fixed; the order of bars in a bar plot is not predefined, and bars are often sorted by height
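
The same distinction shows up in the underlying numbers, before anything is drawn. A minimal sketch with made-up values: np.histogram counts numbers falling into fixed, ordered bins, while value_counts counts occurrences of each label and sorts by frequency.

```python
import numpy as np
import pandas as pd

# Numerical feature: counts per bin; bin positions are fixed on the number line.
values = pd.Series([1.2, 1.9, 2.5, 2.7, 3.1, 4.8])
bin_counts, bin_edges = np.histogram(values, bins=4, range=(1, 5))
print(bin_edges)   # edges 1, 2, 3, 4, 5
print(bin_counts)  # counts 2, 2, 1, 1

# Categorical feature: counts per label; the order is a choice (here: by frequency).
labels = pd.Series(["b", "a", "b", "c", "b", "a"])
freq = labels.value_counts()
print(freq)        # b: 3, a: 2, c: 1
```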

multivariate visualization

numerical vs numerical features

correlation matrix

  • df.corr()
  • seaborn.heatmap() color-coded matrix to show correlation
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    # df is assumed to hold only numerical columns
    corr = df.corr()
    # Create a mask to hide the upper triangle of the correlation matrix (which is symmetric)
    mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from NumPy; use the built-in bool
    mask[np.triu_indices_from(mask)] = True

    f, ax = plt.subplots(figsize=(12, 9))
    sns.heatmap(corr, mask=mask, vmax=1, center=0, annot=True, fmt='.1f',
                square=True, linewidths=.5, cbar_kws={"shrink": .5})

scatter plot

  • plt.scatter(df[v1], df[v2]) displays two numerical variables as Cartesian coordinates in 2D space
  • seaborn.jointplot(x=v1, y=v2, data=df, kind='scatter') displays the scatter plot plus the histograms of the two individual variables
  • seaborn.jointplot(x=v1, y=v2, data=df, kind='kde') displays the bivariate version of the kernel density estimate

scatter plot matrix

  • seaborn.pairplot(df[numerical]) displays a scatter plot for each pair of variables; the diagonal shows the distribution of each variable

numerical vs categorical features

  • seaborn.lmplot(x=v1, y=v2, data=df, hue=cat_var) colors the points of a numerical vs numerical scatter plot by a categorical feature
  • seaborn.boxplot(x=cat_var, y=num_var, data=df) shows the distribution statistics of a numerical variable for each group of a categorical feature
  • seaborn.catplot(x=cat1, y=num, col=cat2, data=df, kind='box', col_wrap=4, height=3, aspect=.8) analyzes a quantitative variable in two categorical dimensions at once
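
The catplot call above can be tried on toy data as follows. This is a sketch under assumptions: the "plan", "churn", and "minutes" columns are invented, and col_wrap is reduced to 2 because there are only three panels.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: two categorical features and one numerical feature.
rng = np.random.default_rng(2)
n = 120
df = pd.DataFrame({
    "plan": np.tile(["free", "basic", "pro"], n // 3),
    "churn": np.tile([False, True], n // 2),
    "minutes": rng.normal(loc=180, scale=50, size=n),
})

# Numeric summary that the box plots visualize: median of "minutes" per churn group.
medians = df.groupby("churn")["minutes"].median()
print(medians)

# One panel per plan, one box per churn value inside each panel.
g = sns.catplot(data=df, x="churn", y="minutes", col="plan",
                kind="box", col_wrap=2, height=3, aspect=.8)
```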

categorical vs categorical features

seaborn.countplot(x=cat1, hue=cat2, data=df)

contingency table (cross tabulation) = multivariate frequency table

  • pd.crosstab(df[cat1], df[cat2]).T
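
A minimal crosstab example on made-up data; the "intl_plan" and "churn" columns are hypothetical. The normalize="index" argument turns counts into row shares, which is often easier to compare across groups.

```python
import pandas as pd

# Hypothetical data: two categorical features.
df = pd.DataFrame({
    "intl_plan": ["no", "no", "yes", "no", "yes", "no"],
    "churn": [False, False, True, False, False, True],
})

# Rows: intl_plan, columns: churn, cells: number of observations.
table = pd.crosstab(df["intl_plan"], df["churn"])
print(table)

# .T swaps rows and columns.
print(table.T)

# Share of each churn value within every intl_plan group (rows sum to 1).
shares = pd.crosstab(df["intl_plan"], df["churn"], normalize="index")
print(shares)
```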

whole dataset visualization

dimensionality reduction

  • Principal Component Analysis (PCA): linear algorithm
  • t-SNE: non-linear algorithm (manifold learning)
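
Both methods are available in scikit-learn. The sketch below assumes scikit-learn is installed and uses random data in place of a real dataset; scaling the features first is a common choice, not something these notes prescribe.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 observations with 5 numerical features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Put all features on the same scale before projecting.
X_scaled = StandardScaler().fit_transform(X)

# Linear projection onto the two directions of largest variance.
X_pca = PCA(n_components=2).fit_transform(X_scaled)

# Non-linear 2D embedding; perplexity must be smaller than the number of samples.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_scaled)

print(X_pca.shape, X_tsne.shape)
```

Either 2D result can then be passed to plt.scatter, optionally colored by a categorical feature.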

plotly library

  • builds interactive plots for detailed data exploration
  • Figure is the main class; it consists of data (a list of traces) and style (the layout object). Call the iplot function to render the figure in the notebook
    from plotly.offline import init_notebook_mode, iplot
    import plotly.graph_objs as go

    init_notebook_mode(connected=True)

    # years_df is assumed to be a DataFrame indexed by year with the
    # columns 'Global_Sales' and 'Number_of_Games'

    # Create a line (trace) for the global sales
    # (go.Bar and go.Box can be used in place of go.Scatter for other chart types)
    trace0 = go.Scatter(
        x=years_df.index,
        y=years_df['Global_Sales'],
        name='Global Sales'
    )

    # Create a line (trace) for the number of games released
    trace1 = go.Scatter(
        x=years_df.index,
        y=years_df['Number_of_Games'],
        name='Number of games released'
    )

    # Define the data array
    data = [trace0, trace1]

    # Set the title
    layout = {'title': 'Statistics for video games'}

    # Create a Figure and plot it
    fig = go.Figure(data=data, layout=layout)
    iplot(fig, show_link=False)

lecture note & assignment

reference
