3.3. Data visualization

Jim Schwoebel edited this page Aug 3, 2020 · 5 revisions

Visualization

This is a universal visualizer for all types of data.

Note that visualization is controlled by a setting in the Allie Framework (e.g. "visualize_data": true).

Getting started

To get started, you first need to featurize some data using the featurization scripts. The featurized data must be in the train_dir folder, organized as directories. To read more about featurization, see this page.

After you have featurized your data, go to the visualization folder (./allie/visualize) and run the visualize.py script:

python3 visualize.py [problemtype] [folder A] [folder B] ... [folder N]

Note that you need to pass the problem type (one of 'audio', 'text', 'image', 'video', or 'csv') followed by every folder that contains featurizations. In the example below, we look at audio files separating males from females, where featurizations for both folders already exist in the train_dir folder.

python3 visualize.py audio males females 
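
If you prefer to assemble this invocation from Python, the sketch below shows one way to do it. It is an illustration only and not part of Allie: `build_visualize_command` is a hypothetical helper that checks the problem type against the five values listed above and returns an argument list.

```python
# Hypothetical helper (not part of Allie): builds the visualize.py command line.
PROBLEM_TYPES = ("audio", "text", "image", "video", "csv")


def build_visualize_command(problem_type, folders):
    """Return the argument list for `python3 visualize.py ...`."""
    if problem_type not in PROBLEM_TYPES:
        raise ValueError(f"unknown problem type: {problem_type!r}")
    if not folders:
        raise ValueError("pass at least one featurized folder")
    return ["python3", "visualize.py", problem_type, *folders]


if __name__ == "__main__":
    print(build_visualize_command("audio", ["males", "females"]))
```

The returned list can be handed to `subprocess.run(..., check=True)` from inside the visualization folder.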

This generates a directory tree of graphs, for example:

├── classes.png
├── clustering
│   ├── isomap.png
│   ├── lle.png
│   ├── mds.png
│   ├── modified.png
│   ├── pca.png
│   ├── spectral.png
│   ├── tsne.png
│   └── umap.png
├── feature_ranking
│   ├── feature_importance.png
│   ├── feature_plots
│   │   └── 128_mfcc_10_std.png
│   │       ... [all feature plots (many files)]
│   ├── heatmap.png
│   ├── heatmap_clean.png
│   ├── lasso.png
│   ├── pearson.png
│   └── shapiro.png
└── model_selection
    ├── calibration.png
    ├── cluster_distance.png
    ├── elbow.png
    ├── ks.png
    ├── learning_curve.png
    ├── logr_percentile_plot.png
    ├── outliers.png
    ├── pca_explained_variance.png
    ├── precision-recall.png
    ├── prediction_error.png
    ├── residuals.png
    ├── roc_curve.png
    ├── roc_curve_train.png
    ├── siloutte.png
    └── thresholds.png

Output graphs

Running this script outputs many visualizations. They can be customized with some simple modifications to the script itself. See below for some of the visualizations you can make.

Note that this script honors the dataset-balancing setting ("balance_data": true in settings.json), so decide whether you'd like to balance the data and adjust your settings before running the script above. These were the settings used to create the visualizations below:

{
  "augment_data": false,
  "balance_data": true,
  "clean_data": false,
  "create_YAML": true,
  "default_audio_features": [ "librosa_features", "pyworld_features" ],
  "default_audio_transcriber": "pocketsphinx",
  "default_csv_features": [ "csv_features" ],
  "default_csv_transcriber": "raw text",
  "default_dimensionality_reducer": [ "pca" ],
  "default_feature_selector": [ "lasso" ],
  "default_image_features": [ "image_features" ],
  "default_image_transcriber": "tesseract",
  "default_scaler": [ "standard_scaler" ],
  "default_text_features": [ "nltk_features" ],
  "default_text_transcriber": "raw text",
  "default_training_script": [ "tpot" ],
  "default_video_features": [ "video_features" ],
  "default_video_transcriber": "tesseract (averaged over frames)",
  "model_compress": false,
  "reduce_dimensions": true,
  "scale_features": true,
  "select_features": true,
  "test_size": 0.25,
  "transcribe_audio": false,
  "transcribe_csv": true,
  "transcribe_image": true,
  "transcribe_text": true,
  "transcribe_videos": true,
  "visualize_data": true
}
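
Before running the script, it can help to confirm the two flags this page depends on. The following is a minimal sketch, assuming the settings are stored as JSON with the keys shown above. `visualization_flags` is a hypothetical helper, and treating a missing key as `False` is an assumption rather than documented Allie behavior.

```python
import json


def visualization_flags(settings_text):
    """Return the (visualize_data, balance_data) flags from settings JSON text."""
    settings = json.loads(settings_text)
    # Assumption: a missing key is treated as False.
    return (bool(settings.get("visualize_data", False)),
            bool(settings.get("balance_data", False)))


if __name__ == "__main__":
    example = '{"balance_data": true, "visualize_data": true, "test_size": 0.25}'
    print(visualization_flags(example))
```

To check a real file, read its contents first and pass the text in; where settings.json lives depends on your checkout.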

Numbers in each class
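
As a rough illustration of why the "balance_data" setting changes the per-class counts, the sketch below counts labels and then downsamples every class to the size of the smallest one. Downsampling is an assumption made for this example; this page does not say how Allie balances classes, and `balance_by_downsampling` is a hypothetical name.

```python
import random
from collections import Counter


def balance_by_downsampling(labels, seed=0):
    """Return sorted indices that keep an equal number of samples per class."""
    rng = random.Random(seed)
    smallest = min(Counter(labels).values())
    kept = []
    for cls in sorted(set(labels)):
        indices = [i for i, label in enumerate(labels) if label == cls]
        # Randomly keep `smallest` samples from this class.
        kept.extend(rng.sample(indices, smallest))
    return sorted(kept)


if __name__ == "__main__":
    labels = ["males"] * 5 + ["females"] * 3
    print(Counter(labels))                       # counts before balancing
    kept = balance_by_downsampling(labels)
    print(Counter(labels[i] for i in kept))      # counts after balancing
```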

Clustering

Quickly iterate and see which clustering method works best with your dataset.

├── clustering
│   ├── isomap.png
│   ├── lle.png
│   ├── mds.png
│   ├── modified.png
│   ├── pca.png
│   ├── spectral.png
│   ├── tsne.png
│   └── umap.png

Isomap embedding

LLE embedding

MDS embedding

Modified embedding

PCA embedding

Spectral embedding

t-SNE embedding

UMAP embedding

Feature ranking

├── feature_ranking
│   ├── feature_importance.png
│   ├── feature_plots
│   │   └── 128_mfcc_10_std.png
│   │       ... [all feature plots (many files)]
│   ├── heatmap.png
│   ├── heatmap_clean.png
│   ├── lasso.png
│   ├── pearson.png
│   └── shapiro.png

Feature importances (top 20 features)

Feature plots

Plots every feature as a violin plot so you can inspect its distribution.

Lasso plot

Heatmaps

Heatmap with correlated variables

Heatmap with removed correlated variables
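
The two heatmaps differ in that the second one has correlated variables removed. A minimal plain-Python sketch of that idea follows. The 0.9 cutoff, the greedy keep-first rule, the feature names, and the functions `pearson` and `drop_correlated` are all assumptions made for illustration; the criterion Allie applies is not stated on this page.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # Assumption: constant columns are treated as uncorrelated.
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Greedily keep a column only if |r| <= threshold against every kept column."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Made-up feature columns, one value per sample.
    features = {
        "mfcc_1_mean": [1.0, 2.0, 3.0, 4.0],
        "mfcc_1_scaled": [2.0, 4.0, 6.0, 8.0],   # a multiple of the first column
        "mfcc_2_mean": [1.0, -1.0, 1.0, -1.0],
    }
    print(drop_correlated(features))
```

Because the rule is greedy, the column order decides which member of a correlated pair survives.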

Pearson ranking plot

Shapiro plot

Modeling graphs

└── model_selection
    ├── calibration.png
    ├── cluster_distance.png
    ├── elbow.png
    ├── ks.png
    ├── learning_curve.png
    ├── logr_percentile_plot.png
    ├── outliers.png
    ├── pca_explained_variance.png
    ├── precision-recall.png
    ├── prediction_error.png
    ├── residuals.png
    ├── roc_curve.png
    ├── roc_curve_train.png
    ├── siloutte.png
    └── thresholds.png

Calibration plot

Cluster distance

Elbow plot

KS stat plot

Learning curve

logr percentile plot

Outlier detection

PCA explained variance plot

Precision/recall graphs

Prediction error graphs

Residuals

ROC curve (train)

ROC curve (test)
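
For readers who want to see what a ROC plot encodes, here is a minimal pure-Python sketch. It lowers a threshold over classifier scores, collects (false positive rate, true positive rate) points, and integrates them with the trapezoid rule. The function names and toy scores are made up for illustration and nothing here is taken from visualize.py. The sketch assumes labels are 0 or 1 and that both classes are present.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) points, starting at (0, 0), as the threshold is lowered."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a tied score group.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


if __name__ == "__main__":
    labels = [1, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.1]
    curve = roc_points(labels, scores)
    print(curve)
    print(auc(curve))
```

Running the same two functions on training and held-out scores gives the pair of curves to compare.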

Silhouette graph

Threshold graph

References