Exploratory Data Analysis

Steven Bergner

October 3rd, 2018

Overview

  • Exploratory Data Analysis (EDA)
  • Techniques for 1-, 2-, multi-, high-dimensional data

Exploratory Data Analysis

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” – John Tukey

Iterative cycle:

  1. Generate questions about your data
  2. Search for answers by visualising, transforming, and modelling your data
  3. Use what you learn to refine your questions and/or generate new questions

[Source: Hadley Wickham’s book chapter on EDA (2017)]

Aspects of EDA

  • Variation: tendency of the values of a variable to change from measurement to measurement
    • Study: distribution, typical values, unusual values, missing values
  • Covariation: tendency for values of two or more variables to vary together in a related way
    • Compare: a categorical and a continuous variable, two categorical variables, or two continuous variables
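
As a rough illustration of the two aspects above, the following sketch summarises one variable on its own (variation) and then checks whether it shifts between groups (covariation). The records are hypothetical toy values invented for this example, not data from the slides, and only the Python standard library is assumed.

```python
import statistics
from collections import defaultdict

# Hypothetical toy records: (group label, measurement); None marks a missing value.
records = [
    ("a", 2.1), ("a", 2.4), ("a", None), ("a", 2.2),
    ("b", 4.0), ("b", 4.3), ("b", 3.9), ("b", 9.5),
]

# Variation: describe one variable on its own.
values = [v for _, v in records if v is not None]
summary = {
    "n": len(values),
    "missing": sum(1 for _, v in records if v is None),
    "min": min(values),
    "median": statistics.median(values),
    "mean": statistics.mean(values),
    "max": max(values),
}

# Covariation: does the continuous variable shift with the categorical one?
by_group = defaultdict(list)
for label, v in records:
    if v is not None:
        by_group[label].append(v)
group_medians = {label: statistics.median(vs) for label, vs in by_group.items()}

print(summary)
print(group_medians)
```

A large gap between mean and median, or between min/max and the bulk of the data, is a hint to look for unusual values before moving on to comparisons.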

One vs Two variables

Data: 272 eruptions of the Old Faithful Geyser in Yellowstone National Park

Boxplots

Boxplot comparison
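
The numbers behind a boxplot can be computed directly, which may help when comparing groups. The sketch below assumes the common Tukey convention (whiskers reach the most extreme points within 1.5 IQR of the quartiles); the slides do not state which convention their plots use. `box_stats` and the `groups` data are hypothetical names and values made up for illustration.

```python
import statistics

def box_stats(values):
    """Return the numbers a Tukey-style boxplot would draw for one group."""
    xs = sorted(values)
    # Quartiles with linear interpolation that treats xs as the full data set.
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in xs if lo_fence <= x <= hi_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [x for x in xs if x < lo_fence or x > hi_fence],
    }

# Hypothetical groups to compare side by side.
groups = {
    "control": [1, 2, 3, 4, 5, 6, 7, 8, 9, 30],
    "treated": [4, 5, 5, 6, 6, 7, 7, 8, 8, 9],
}
for name, vals in groups.items():
    print(name, box_stats(vals))
```

Comparing the medians and box heights across groups is the numeric counterpart of reading a boxplot comparison by eye.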

Two categorical variables
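
For two categorical variables, a natural first step is to count how often each combination occurs. This is a minimal sketch of such a contingency table using only the standard library; the `pairs` data (vehicle class and drive type) are invented for the example.

```python
from collections import Counter

# Hypothetical observations: (vehicle class, drive type).
pairs = [
    ("suv", "4wd"), ("suv", "4wd"), ("suv", "fwd"),
    ("compact", "fwd"), ("compact", "fwd"), ("compact", "fwd"),
]

counts = Counter(pairs)
rows = sorted({r for r, _ in pairs})
cols = sorted({c for _, c in pairs})

# Counter returns 0 for combinations that never occur.
table = [[counts[(r, c)] for c in cols] for r in rows]

print("class".ljust(10) + "".join(c.rjust(6) for c in cols))
for r, row in zip(rows, table):
    print(r.ljust(10) + "".join(str(n).rjust(6) for n in row))
```

The same counts are what a heatmap or a count plot of two categorical variables would encode as colour or point size.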

Two continuous variables
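
Alongside a scatterplot, a single number such as the Pearson correlation coefficient can summarise how strongly two continuous variables move together linearly. The helper `pearson_r` below is a hypothetical name written for this sketch, and the data are made up; the coefficient only captures linear association, so it complements a plot rather than replacing it.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired measurements.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
print(round(pearson_r(x, y), 3))
```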

EDA in Python

High-dimensional data

Dimensionality reduction
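
The slide does not name a specific method, so the following is only one possible illustration: principal component analysis (PCA), sketched here with power iteration on the covariance matrix to find the direction of greatest variance. The function name `first_principal_component`, the iteration count, and the data are assumptions made for this example; in practice a library routine would normally be used instead.

```python
import math

def first_principal_component(rows, iters=200):
    """Approximate the top PCA direction by power iteration.

    Returns the unit direction and the 1-D scores of each row along it.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Arbitrary start vector; assumed not orthogonal to the top direction.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical 3-D points that really vary along a single line.
points = [(1, 2, 0), (2, 4, 0), (3, 6, 0), (4, 8, 0)]
direction, scores = first_principal_component(points)
print(direction)
print(scores)
```

Each row is reduced from three coordinates to one score, which is the basic idea behind projecting high-dimensional data onto a few informative axes.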