3.3. Data visualization
This is a universal visualizer for all data types, run as part of model training.
Note that this is a setting in the Allie Framework (e.g. "visualize_data": true).
To get started, you first need to featurize some data using the featurization scripts. This data must be in the train_dir folder, organized as one directory per class. To read more about featurization, see this page.
After you have featurized your data, go to this folder (./allie/visualize) and run the visualize.py script:
python3 visualize.py [problemtype] [folder A] [folder B] ... [folder N]
Note that you need to pass the problem type (one of 'audio', 'text', 'image', 'video', or 'csv') followed by every folder that contains featurizations. In this example, we are separating males from females using audio files, and featurizations already exist for both folders in the train_dir folder:
python3 visualize.py audio males females
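When the script is launched from another Python program, the call above can be assembled and validated first. The following is a minimal sketch, not part of Allie: the helper name `build_visualize_command` and its optional `train_dir` check are hypothetical, and it assumes the class folders sit directly under train_dir.

```python
import sys
from pathlib import Path

# Problem types listed in the text above.
PROBLEM_TYPES = {"audio", "text", "image", "video", "csv"}


def build_visualize_command(problem_type, folders, train_dir=None):
    """Return the argv list for visualize.py (hypothetical helper, not part of Allie)."""
    if problem_type not in PROBLEM_TYPES:
        raise ValueError(f"unknown problem type: {problem_type!r}")
    if not folders:
        raise ValueError("pass at least one class folder")
    if train_dir is not None:
        # Optional check that every class folder exists under train_dir.
        missing = [f for f in folders if not (Path(train_dir) / f).is_dir()]
        if missing:
            raise FileNotFoundError(f"not found in {train_dir}: {missing}")
    return [sys.executable, "visualize.py", problem_type, *folders]


if __name__ == "__main__":
    cmd = build_visualize_command("audio", ["males", "females"])
    print(" ".join(cmd[1:]))  # visualize.py audio males females
```

The returned list can be handed to `subprocess.run(cmd, cwd=...)`, with `cwd` pointing at the visualize folder, which has the same effect as typing the command by hand.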
Running the command generates a tree of graphs such as the one below:
├── classes.png
├── clustering
│   ├── isomap.png
│   ├── lle.png
│   ├── mds.png
│   ├── modified.png
│   ├── pca.png
│   ├── spectral.png
│   ├── tsne.png
│   └── umap.png
├── feature_ranking
│   ├── feature_importance.png
│   ├── feature_plots
│   │   ├── 128_mfcc_10_std.png
... [all feature plots (many files)]
│   ├── heatmap.png
│   ├── heatmap_clean.png
│   ├── lasso.png
│   ├── pearson.png
│   └── shapiro.png
└── model_selection
    ├── calibration.png
    ├── cluster_distance.png
    ├── elbow.png
    ├── ks.png
    ├── learning_curve.png
    ├── logr_percentile_plot.png
    ├── outliers.png
    ├── pca_explained_variance.png
    ├── precision-recall.png
    ├── prediction_error.png
    ├── residuals.png
    ├── roc_curve.png
    ├── roc_curve_train.png
    ├── siloutte.png
    └── thresholds.png
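To check which graphs a run actually produced, the output tree can be walked and summarized. This is a small sketch using only the standard library; `summarize_plots` is a hypothetical helper, and the output directory is left as a parameter because its location is not stated here.

```python
from collections import defaultdict
from pathlib import Path


def summarize_plots(output_dir):
    """Group generated .png files by the folder they sit in, relative to output_dir."""
    groups = defaultdict(list)
    root = Path(output_dir)
    for png in sorted(root.rglob("*.png")):
        rel = png.relative_to(root)
        # Top-level files such as classes.png are grouped under ".".
        groups[rel.parent.as_posix()].append(rel.name)
    return dict(groups)
```

Iterating over the returned dictionary and printing each key with `len(files)` gives a quick count of the clustering, feature_ranking, and model_selection graphs.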
Running this script outputs many visualizations. You can customize them with simple modifications to the script itself. Some of the visualizations you can make are shown below.
Note that this script respects the dataset-balancing setting (e.g. "balance_data": true in settings.json), so decide whether you want the data balanced and adjust your settings before running the script above. These were the settings used to create the visualizations below:
{
    "version": "1.0.0",
    "augment_data": false,
    "balance_data": true,
    "clean_data": false,
    "create_YAML": true,
    "create_csv": true,
    "default_audio_features": [ "pspeech_features", "praat_features", "sox_features" ],
    "default_audio_transcriber": ["deepspeech_dict"],
    "default_csv_features": [ "csv_features" ],
    "default_csv_transcriber": ["raw text"],
    "default_dimensionality_reducer": [ "pca" ],
    "default_feature_selector": [ "lasso" ],
    "default_image_features": [ "image_features" ],
    "default_image_transcriber": ["tesseract"],
    "default_scaler": [ "standard_scaler" ],
    "default_text_features": [ "nltk_features" ],
    "default_text_transcriber": "raw text",
    "default_training_script": [ "tpot" ],
    "default_video_features": [ "video_features" ],
    "default_video_transcriber": [ "tesseract (averaged over frames)" ],
    "feature_number": 20,
    "model_compress": false,
    "reduce_dimensions": false,
    "scale_features": true,
    "select_features": false,
    "test_size": 0.10,
    "transcribe_audio": false,
    "transcribe_csv": true,
    "transcribe_image": true,
    "transcribe_text": true,
    "transcribe_video": true,
    "visualize_data": true
}
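Flags such as "balance_data" and "visualize_data" can also be switched from Python instead of editing the file by hand. The sketch below is illustrative only: `update_settings` is a hypothetical helper, and the path to settings.json is an assumption that depends on where your copy of the file lives.

```python
import json
from pathlib import Path


def update_settings(path, **flags):
    """Set top-level keys in a settings JSON file and return the updated dict."""
    settings_path = Path(path)
    settings = json.loads(settings_path.read_text())
    unknown = [key for key in flags if key not in settings]
    if unknown:
        # Refuse to add keys the file does not already define (guards against typos).
        raise KeyError(f"unknown setting(s): {unknown}")
    settings.update(flags)
    settings_path.write_text(json.dumps(settings, indent=4, sort_keys=True) + "\n")
    return settings
```

For example, `update_settings("settings.json", balance_data=False)` would turn balancing off before the next run. Rejecting unknown keys is a design choice here, so that a misspelled flag fails loudly rather than being silently written to the file.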
Quickly iterate to see which clustering method works best with your dataset.
├── clustering
│   ├── isomap.png
│   ├── lle.png
│   ├── mds.png
│   ├── modified.png
│   ├── pca.png
│   ├── spectral.png
│   ├── tsne.png
│   └── umap.png
├── feature_ranking
│   ├── feature_importance.png
│   ├── feature_plots
│   │   ├── 128_mfcc_10_std.png
... [all feature plots (many files)]
│   ├── heatmap.png
│   ├── heatmap_clean.png
│   ├── lasso.png
│   ├── pearson.png
│   └── shapiro.png
Easily plot every feature as a violin plot to inspect its distribution.
Heatmap with correlated variables
Heatmap with correlated variables removed
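The idea behind the second heatmap, dropping features that are strongly correlated with one another, can be sketched in plain Python. This is not the code visualize.py uses: the helper names, the greedy keep-first strategy, and the 0.9 threshold are all assumptions chosen for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant column; treat as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.9):
    """Keep a column unless it correlates above threshold with an already-kept column."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Toy data: "b" is an exact multiple of "a", so it is dropped.
    toy = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, 0, 1, 0]}
    print(drop_correlated(toy))  # ['a', 'c']
```

Because the loop keeps the first column of each correlated group, the result depends on column order. A different tie-breaking rule would keep a different but equally valid subset.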
└── model_selection
    ├── calibration.png
    ├── cluster_distance.png
    ├── elbow.png
    ├── ks.png
    ├── learning_curve.png
    ├── logr_percentile_plot.png
    ├── outliers.png
    ├── pca_explained_variance.png
    ├── precision-recall.png
    ├── prediction_error.png
    ├── residuals.png
    ├── roc_curve.png
    ├── roc_curve_train.png
    ├── siloutte.png
    └── thresholds.png