visual data analysis in Python
univariate visualization
numerical features
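distribution of a numerical variable = histogram, df.hist()
density plot (kernel density estimate, KDE) = smoothed version of the histogram that does not depend on bin size, df.plot(kind='density', …)
seaborn.distplot() - histogram + density (KDE); deprecated since seaborn 0.11 in favor of seaborn.histplot(kde=True) and seaborn.displot()
seaborn.boxplot() - displays a vertical box by default, plus whiskers (lines) and outliers (anomalies) as individual points; setting x=[col] draws it horizontally
seaborn.violinplot() - kernel density estimate drawn on both sides of a box plot, so it shows a smoothed distribution along with the box plot statistics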
categorical and binary features
frequency table
- df[col].value_counts()
- bar plot df[col].value_counts().plot(kind='bar', …) or seaborn.countplot() [** seaborn.barplot() is used to represent basic statistics of a numerical variable grouped by a categorical feature]
- display value_counts of categorical features using seaborn.factorplot(x=, y=, hue=, data=df, kind='bar'); factorplot() was renamed seaborn.catplot() in seaborn 0.9
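A frequency table and the two matching bar plots might look like this. The sketch uses a small made-up DataFrame; the column names `plan` and `churn` are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical categorical and binary features
df = pd.DataFrame({
    "plan": ["basic", "pro", "basic", "free", "basic", "pro"],
    "churn": [False, True, False, False, True, False],
})

# Frequency table: one count per category, sorted by count (descending)
counts = df["plan"].value_counts()
print(counts)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
counts.plot(kind="bar", ax=axes[0], title="pandas bar plot")  # plots the table
sns.countplot(x="plan", data=df, ax=axes[1])                  # counts internally
fig.tight_layout()
```

`countplot()` works on the raw column, while the pandas bar plot needs the counts computed first.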
difference between histogram and bar chart
- a histogram is for a numerical feature, a bar plot is for a categorical feature
- values on the x-axis of a histogram are numbers; values on the x-axis of a bar plot can be numbers, strings, or booleans
- the x-axis of a histogram is a Cartesian coordinate axis, so the position of each bin is fixed by its values; the order of bars in a bar plot is not predefined, and bars are often sorted by height
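The difference in ordering shows up in the underlying numbers, before anything is drawn. A small sketch with made-up values (both arrays are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Histogram: bins are numeric intervals, so their order is fixed by the axis
values = np.array([1.0, 2.5, 2.7, 3.1, 4.8, 4.9, 5.0])
bin_counts, bin_edges = np.histogram(values, bins=4)
print(bin_edges)   # increasing bin boundaries
print(bin_counts)  # number of values falling in each bin

# Bar chart: categories have no inherent order, so any ordering is a choice
categories = pd.Series(["b", "a", "c", "b", "b", "c"])
by_height = categories.value_counts()   # sorted by count, the common default
by_label = by_height.sort_index()       # or sorted alphabetically
print(by_height.index.tolist())
print(by_label.index.tolist())
```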
multivariate visualization
numerical vs numerical features
correlation matrix
- df.corr()
- seaborn.heatmap() color-coded matrix to show correlation
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# df is assumed to be a DataFrame with numerical columns
corr = df.corr()
# Create a mask to hide the upper triangle of the correlation matrix (which is symmetric)
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed in NumPy 1.24; use the builtin bool
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corr, mask=mask, vmax=1, center=0, annot=True, fmt='.1f',
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)
scatter plot
- plt.scatter(v1, v2) displays two numerical variables as Cartesian coordinates in 2D space
- seaborn.jointplot(x=v1, y=v2, data=df, kind='scatter') displays the scatter plot together with the histograms of the two individual variables
- seaborn.jointplot(x=v1, y=v2, data=df, kind='kde') displays the bivariate version of the KDE
scatter plot matrix - seaborn.pairplot(df[numerical]) displays a scatter plot for each pair of variables; the diagonal shows the distribution of each variable
numerical vs categorical features
- seaborn.lmplot(x=v1, y=v2, data=df, hue=cat_var) colors the points of a numerical vs numerical scatter plot by a categorical feature (pass fit_reg=False to drop the regression lines)
- seaborn.boxplot(x=cat_var, y=num_var, data=df) shows distribution statistics of a numerical variable for each group of a categorical feature
analyze a quantitative variable in two categorical dimensions at once = seaborn.catplot(x=cat1, y=num, col=cat2, data=df, kind='box', col_wrap=4, height=3, aspect=.8)
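A rough sketch of the grouped box plot and the faceted `catplot()` on made-up data; the columns `plan`, `region`, and `minutes` are hypothetical, and `col_wrap=3` is chosen here only because the example has three regions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: one numerical feature and two categorical features
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "plan": rng.choice(["basic", "pro"], size=120),
    "region": rng.choice(["north", "south", "west"], size=120),
    "minutes": rng.normal(loc=200, scale=40, size=120),
})

# The statistic the box plot draws as its center line, computed directly
medians = df.groupby("plan")["minutes"].median()
print(medians)

# One box per category of plan
ax = sns.boxplot(x="plan", y="minutes", data=df)

# Same box plot, repeated in one panel per region
grid = sns.catplot(x="plan", y="minutes", col="region", data=df,
                   kind="box", col_wrap=3, height=3, aspect=.8)
```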
categorical vs categorical features
seaborn.countplot(x=cat1, hue=cat2, data=df)
contingency table (cross tabulation) = multivariate frequency table: counts for each combination of categories
- pd.crosstab(df[cat1], df[cat2]).T
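A small cross tabulation, assuming a made-up DataFrame with two categorical columns (`plan` and `churn` are hypothetical names):

```python
import pandas as pd

# Hypothetical data with two categorical features
df = pd.DataFrame({
    "plan":  ["basic", "basic", "pro", "pro", "pro", "free"],
    "churn": ["yes", "no", "no", "no", "yes", "no"],
})

# Rows are plan values, columns are churn values, cells are counts
table = pd.crosstab(df["plan"], df["churn"])
print(table)

# .T swaps rows and columns; margins=True adds row and column totals
print(table.T)
print(pd.crosstab(df["plan"], df["churn"], margins=True))
```

`seaborn.countplot(x=..., hue=...)` draws the same counts as grouped bars.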
whole dataset visualization
dimensionality reduction
- Principal Component Analysis (PCA): linear algorithm
- t-SNE: non-linear algorithm, a manifold learning method
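Both projections are available in scikit-learn. The sketch below is an assumption-laden illustration, not the course's code: the data is random noise, and the scaling step and `perplexity=20` are choices made for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset: 100 samples with 10 numerical features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Put features on a common scale so no single feature dominates
X_scaled = StandardScaler().fit_transform(X)

# Linear projection onto the two directions of largest variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)

# Non-linear embedding; perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=20, random_state=0)
X_tsne = tsne.fit_transform(X_scaled)
print(X_pca.shape, X_tsne.shape)
```

Either 2D result can then be passed to `plt.scatter()`, optionally colored by a target or categorical feature.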
plotly library
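- draws interactive plots for detailed data exploration
- Figure is the main class; it consists of the data (a list of traces) and the style (a layout object). Call the iplot function to render the figure in the notebook; in the simplest case iplot can be given only the traces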
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go

# Load plotly.js into the notebook
init_notebook_mode(connected=True)
# years_df is assumed to be a DataFrame indexed by year with the columns used below
# Create a line (trace) for the global sales; go.Bar or go.Box can be used in place of go.Scatter
trace0 = go.Scatter(
x=years_df.index,
y=years_df['Global_Sales'],
name='Global Sales'
)
# Create a line (trace) for the number of games released
trace1 = go.Scatter(
x=years_df.index,
y=years_df['Number_of_Games'],
name='Number of games released'
)
# Define the data array
data = [trace0, trace1]
# Set the title
layout = {'title': 'Statistics for video games'}
# Create a Figure and plot it
fig = go.Figure(data=data, layout=layout)
iplot(fig, show_link=False)
lecture note & assignment
- https://www.kaggle.com/adaxing/mlcourse-ai-wk2-p1-visual-data-analysis-in-python
- https://www.kaggle.com/adaxing/mlcourse-ai-wk2-p2-seaborn-and-plotly
- https://www.kaggle.com/adaxing/mlcourse-ai-assign2-analyzing-cardiovascular-data