3.2. Data preprocessing

This is a folder for manipulating and pre-processing features extracted from audio, text, image, video, or .CSV files as part of the machine learning modeling process.

This is done via a convention for transformers, which are in the proper folders (e.g. audio files --> audio_transformers). There are three main feature transformation techniques: feature scaling, feature selection, and dimensionality reduction.

In this way, we can appropriately create transformers for various sample data types.

Building transformers (for classification problems)

To transform two folders of files with a classification modeling goal, type this into the terminal:

cd /Users/jim/desktop/allie
cd preprocessing
python3 transform.py audio c gender males females

What results are a few files, with a summary .JSON file (c_gender_standard_scaler_rfe.json) that elaborates upon the transformer in the ./preprocessing/audio_transformer directory:

{"estimators": "[('standard_scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('rfe', RFE(estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,\n                  gamma='scale', kernel='linear', max_iter=-1, shrinking=True,\n                  tol=0.001, verbose=False),\n    n_features_to_select=20, step=1, verbose=0))]", "settings": {"version": "1.0.0", "augment_data": false, "balance_data": true, "clean_data": false, "create_csv": true, "default_audio_augmenters": ["augment_tsaug"], "default_audio_cleaners": ["clean_mono16hz"], "default_audio_features": ["librosa_features"], "default_audio_transcriber": ["deepspeech_dict"], "default_csv_augmenters": ["augment_ctgan_regression"], "default_csv_cleaners": ["clean_csv"], "default_csv_features": ["csv_features"], "default_csv_transcriber": ["raw text"], "default_dimensionality_reducer": ["pca"], "default_feature_selector": ["rfe"], "default_image_augmenters": ["augment_imaug"], "default_image_cleaners": ["clean_greyscale"], "default_image_features": ["image_features"], "default_image_transcriber": ["tesseract"], "default_outlier_detector": ["isolationforest"], "default_scaler": ["standard_scaler"], "default_text_augmenters": ["augment_textacy"], "default_text_cleaners": ["remove_duplicates"], "default_text_features": ["nltk_features"], "default_text_transcriber": ["raw text"], "default_training_script": ["tpot"], "default_video_augmenters": ["augment_vidaug"], "default_video_cleaners": ["remove_duplicates"], "default_video_features": ["video_features"], "default_video_transcriber": ["tesseract (averaged over frames)"], "dimension_number": 2, "feature_number": 20, "model_compress": false, "reduce_dimensions": false, "remove_outliers": true, "scale_features": true, "select_features": true, "test_size": 0.1, "transcribe_audio": true, "transcribe_csv": true, "transcribe_image": true, "transcribe_text": true, "transcribe_video": true, "visualize_data": false, "transcribe_videos": true}, "classes": [0, 1], "sample input X": [7.0, 23.428571428571427, 13.275725891987362, 40.0, 3.0, 29.0, 143.5546875, 0.9958894161283733, 0.548284548384031, 2.9561698164853145, 0.0, 0.9823586435889371, 1.0, 0.0, 1.0, 1.0, 1.0, 0.9563696862586734, 0.004556745090440225, 0.96393999788138, 0.9484953491224131, 0.956450500708107, 0.9159480905407004, 0.0083874210259623, 0.9299425651546245, 0.9015187648369452, 0.9160637212676059, 0.8944452328439548, 0.010436308983995553, 0.9118600181268972, 0.876491746933975, 0.8945884359248597, 0.8770112296385062, 0.012213153676505805, 0.897368483560061, 0.8559758923011408, 0.8771914323071244, 0.8905940594153822, 0.010861457873386392, 0.9086706269489068, 0.871855264076524, 0.8907699627836017, 0.8986618946665619, 0.010095245635220757, 0.9154303336418514, 0.8812084302562171, 0.8988437925782, 0.8940183738781617, 0.01039025341119034, 0.9112831285473728, 0.8760607413483972, 0.8942023162840362, 0.8980899592148434, 0.010095670257151154, 0.9148198685317376, 0.8805923053101389, 0.8982937440953425, 0.888882132228931, 0.01127171631616655, 0.9075145588083179, 0.8692983803153955, 0.8891346978561219, 0.8803505690032647, 0.012156972724147449, 0.9004327206206673, 0.8592151658555779, 0.8806302144175665, 0.8783421648443998, 0.012494808554290946, 0.8989303042965345, 0.8565635990020571, 0.8786581988736766, 0.8633477274070439, 0.013675039594980561, 0.8859283087034104, 0.8395630641775639, 0.8636673866109066, -313.23363896251993, 18.946764320029068, -265.4153881359699, -352.35669009191434, -314.88964335810385, 136.239475379525, 13.46457033057532, 155.28229790095634, 96.67729067600845, 138.31307847807975, -60.109940589659594, 8.90651546650511, -43.00250224341745, -77.22644879310883, -61.59614027580888, 59.959525194997426, 11.49266912690683, 79.92823038661382, 22.593262641790204, 60.14384367187341, -54.39960148805922, 12.978670142489454, -16.69391321594054, -78.18044376664089, -54.04351001558572, 30.023862498118685, 8.714431771268103, 45.984861607171624, -7.969418151448695, 30.779899533210106, -48.79826737490987, 9.404798307829793, -17.32746858770041, -67.85565811008664, -48.67558954047166, 16.438960903373093, 6.676108733267705, 24.75100641115554, -1.8759098025429237, 18.27300445180957, -24.239093865617573, 6.8313516276284245, -9.56759656295116, -40.92277771655667, -24.18878158134608, 3.2516761928923215, 4.2430222382933085, 10.37732827872848, -6.461490621772226, 3.393567465008272, -4.1570109920127685, 5.605424304597271, 5.78957218995748, -18.10767695295411, -3.8190369770110664, -9.46159588572396, 5.81772077466229, 2.7763746636679323, -20.054279810217025, -10.268401482915364, 9.197482271105386, 5.755721680320874, 18.46922506683798, -6.706210697210241, 10.044558505805792, -4.748126927006937e-05, 1.0575334143974938e-05, -2.4722594252240076e-05, -6.952111317908028e-05, -4.773507820227446e-05, 0.40672100602206257, 0.08467855992898438, 0.5757090803234001, 0.22579515457526012, 0.4087367660373401, 2210.951610581014, 212.91019021101542, 2791.2529330926723, 1845.3115106685345, 2223.07457835522, 2063.550185470081, 111.14828141425747, 2287.23164419421, 1816.298268701022, 2073.585928819859, 12.485818860644423, 4.014563343823625, 25.591622605877692, 7.328069768561837, 11.33830589622713, 0.0021384726278483868, 0.004675689153373241, 0.020496303215622902, 0.00027283065719529986, 0.0006564159411936998, 4814.383766867898, 483.0045387722584, 5857.03125, 3682.177734375, 4888.037109375, 0.12172629616477272, 0.0227615872259035, 0.1875, 0.0732421875, 0.1181640625, 0.011859399266541004, 0.0020985447335988283, 0.015743320807814598, 0.006857675965875387, 0.012092416174709797], "sample input Y": 1, "sample transformed X": [1.403390013815334, -0.46826787078753823, -1.1449558304944494, 0.16206820805906216, 0.29026766476533494, 1.121884315543298, 0.6068005729105246, 0.8938752562067679, 1.321370886418425, 0.003118933840524784, 0.9096755549417692, 1.6455655890558802, 0.29214439046692964, 0.47100284419336996, -0.3969793022778641, 0.3178025570312262, -0.5484053895612412, -0.34677929105059246, -0.6946185268998197, -0.6088183224560646], "sample transformed y": 1}

Click the .GIF below to follow along this example in a video format:

The code above will transform all the featurized text files (in .JSON files, folder MALES and folder FEMALES) via a classification script (c) with a common name GENDER. For clarity, the command line arguments are further elaborated upon below along with all possible options to help you use the transformers API. Note that folder ONE and folder TWO are assumed to be in the train_dir folder.

CLI argument	sample	description	all options
sys.argv[1]	'audio'	the sample type of file preprocessed by the transformer	['audio', 'text', 'image', 'video', 'csv']
sys.argv[2]	'c'	classification or regression problems	['c', 'r']
sys.argv[3]	'gender'	the common name for the transformer	can be any string
sys.argv[4], sys.argv[5], sys.argv[n]	'males'	classes that you seek to model in the train_dir folder	any string folder name

Building transformers (for regression problems)

To transform an entire folder of a featurized files (for a regression problem - target being between [0,1], you can run:

cd ~ 
cd allie/preprocessing
python3 transform.py text c age test.csv /Users/jim/desktop/allie/train_dir age

The code above will transform all the features in the test.csv spreadsheet in the /Users/jim/desktop/allie/train_dir around the target variable age according to the specified preprocessing settings. In other words, all other variables from the target variable are represented as numberical features that will be transformed.

CLI argument	sample	description	all options
sys.argv[1]	'text'	the sample type of file preprocessed by the transformer	['audio', 'text', 'image', 'video', 'csv']
sys.argv[2]	'c'	classification or regression problems	['c', 'r']
sys.argv[3]	'age'	target variable in a spreadsheet	any string variable as a pandas dataframe
sys.argv[4]	'test.csv'	csv spreadsheet for the regression problem	any string that represents a spreadsheet name
sys.argv[5]	'/Users/jim/desktop/allie/train_dir'	directory of the spreadsheet	any string directory file (can get with os.getcwd())
sys.argv[6]	'age'	common_name for the modeling problem	any string common name that makes sense for the problem

Settings

Here are the relevant settings in Allie related to preprocessing that you can change in the settings.json file (along with the default settings).

setting	description	default setting	all options
reduce_dimensions	if True, reduce dimensions via the default_dimensionality_reducer (or set of dimensionality reducers)	False	True, False
default_dimensionality_reducer	the default dimensionality reducer or set of dimensionality reducers	["pca"]	["pca", "lda", "tsne", "plda","autoencoder"]
select_features	if True, select features via the default_feature_selector (or set of feature selectors)	False	True, False
default_feature_selector	the default feature selector or set of reature selectors	["lasso"]	["lasso", "rfe", "chi", "kbest", "variance"]
scale_features	if True, scales features via the default_scaler (or set of scalers)	False	True, False
default_scaler	the default scaler (e.g. StandardScalar) to pre-process data	["standard_scaler"]	["binarizer", "one_hot_encoder", "normalize", "power_transformer", "poly", "quantile_transformer", "standard_scaler"]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

3.2. Data preprocessing

Building transformers (for classification problems)

Building transformers (for regression problems)

Settings

Clone this wiki locally