1. Getting started
First, clone the repository:
git clone --recurse-submodules -j8 git@github.com:jim-schwoebel/allie.git
cd allie
Set up a virtual environment (to keep the setup consistent across operating systems):
python3 -m pip install --user virtualenv
python3 -m venv env
source env/bin/activate
Now install required dependencies:
python3 setup.py
Now run the unit tests to make sure everything works:
cd tests
python3 test.py
Note that the test above takes roughly 5-10 minutes to complete. It checks that you can featurize, model, and load model files (to make predictions) with your default featurizers and modeling techniques.
Here is a table that describes the folder structure for this repository. These descriptions could help guide how you can quickly get started with featurizing and modeling data samples.
folder name | description of folder |
---|---|
datasets | an elaborate list of open source datasets that can be used for curating datasets and augmenting datasets. |
features | a list of audio, text, image, video, and csv featurization scripts (these can be specified in the settings.json files). |
load_dir | a directory where you can put audio, text, image, video, or .CSV files and make model predictions from the ./models directory. |
models | for loading/storing machine learning models and making model predictions for files put in the load_dir. |
production | a folder for outputting production-ready repositories via the YAML.py script. |
tests | for running local tests and making sure everything works as expected. |
train_dir | a directory where you can put in audio, text, image, video, or .CSV files in folders and train machine learning models from the model.py script in the ./training/ directory. |
training | for training machine learning models via specified model training scripts. |
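As a quick sanity check before training, the sketch below summarizes what is sitting in train_dir. It assumes one subfolder per class and uses only the file extensions this README associates with each sample type; the helper names (`sample_type_for`, `summarize_train_dir`) are hypothetical and are not part of the repository.

```python
from collections import Counter
from pathlib import Path

# File extensions this README associates with each sample type.
EXTENSION_TO_SAMPLE_TYPE = {
    '.wav': 'audio', '.mp3': 'audio',
    '.txt': 'text', '.ppt': 'text', '.docx': 'text',
    '.png': 'image', '.jpg': 'image',
    '.mp4': 'video',
    '.csv': 'csv',
}


def sample_type_for(filename):
    """Return the sample type for a file name, or None if unrecognized."""
    return EXTENSION_TO_SAMPLE_TYPE.get(Path(filename).suffix.lower())


def summarize_train_dir(train_dir):
    """Map each subfolder of train_dir to a count of files per sample type.

    Assumption: each subfolder holds the samples for one class.
    Files directly inside train_dir and unrecognized extensions are skipped.
    """
    summary = {}
    for folder in sorted(Path(train_dir).iterdir()):
        if not folder.is_dir():
            continue
        counts = Counter()
        for path in folder.iterdir():
            sample_type = sample_type_for(path.name)
            if path.is_file() and sample_type is not None:
                counts[sample_type] += 1
        summary[folder.name] = dict(counts)
    return summary
```

Calling `summarize_train_dir('train_dir')` from the repository root would return a dictionary keyed by folder name, which makes it easy to spot an empty class folder or a folder that mixes sample types.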
After much trial and error, this standard feature array schema seemed the most appropriate for defining data samples (audio, text, image, video, or CSV samples):
def make_features(sampletype):
    # only add labels when we have actual labels.
    features={'audio':dict(),
              'text': dict(),
              'image':dict(),
              'video':dict(),
              'csv': dict()}
    transcripts={'audio': dict(),
                 'text': dict(),
                 'image': dict(),
                 'video': dict(),
                 'csv': dict()}
    models={'audio': dict(),
            'text': dict(),
            'image': dict(),
            'video': dict(),
            'csv': dict()}
    data={'sampletype': sampletype,
          'transcripts': transcripts,
          'features': features,
          'models': models,
          'labels': []}
    return data
This schema has several advantages:
- sampletype definition flexibility - supports 'audio' (.WAV / .MP3), 'text' (.TXT / .PPT / .DOCX), 'image' (.PNG / .JPG), 'video' (.MP4), and 'csv' (.CSV). The format can also adapt to new sample types in the future, which can in turn tie to new featurization scripts. Defining a sample type helps guide how data flows through model training and prediction scripts.
- transcript definition flexibility - transcripts can be audio, text, image, video, and csv transcripts. The image and video transcripts use OCR to characterize text in the image, whereas audio transcripts are produced by traditional speech-to-text systems (e.g. PocketSphinx). You can also add multiple transcripts (e.g. Google and PocketSphinx) for the same sample type.
- featurization flexibility - many types of features can be put into this array of the same data type. For example, an audio file can be featurized with 'standard_features' and 'praat_features' without really affecting anything. This eliminates the need to re-featurize and reduces time to sort through multiple types of featurizations during the data cleaning process.
- label annotation flexibility - labels can take the form ['classname_1', 'classname_2', 'classname_N...'] for classification problems and [{classname1: 'value'}, {classname2: 'value'}, ... {classnameN: 'valueN'}], where values are between [0,1], for regression problems.
- model predictions - one survey schema can be used for making model predictions and updating the schema with these predictions. Note that any model that is used for training can be used to make predictions in the load_dir.
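To make the points above concrete, here is a minimal sketch of filling in the schema for one audio sample. It redefines `make_features` compactly so the block runs on its own. The inner layout of each feature entry (`'features'` / `'labels'` keys), the numeric values, and the class names are assumptions made for illustration, not the repository's actual output.

```python
import json

SAMPLE_TYPES = ('audio', 'text', 'image', 'video', 'csv')


def make_features(sampletype):
    """Build the same schema as above, using comprehensions."""
    return {
        'sampletype': sampletype,
        'transcripts': {t: {} for t in SAMPLE_TYPES},
        'features': {t: {} for t in SAMPLE_TYPES},
        'models': {t: {} for t in SAMPLE_TYPES},
        'labels': [],
    }


sample = make_features('audio')

# Two feature sets stored side by side under the same sample type.
# The values and the inner 'features'/'labels' layout are made up.
sample['features']['audio']['standard_features'] = {
    'features': [0.12, 0.48],
    'labels': ['feature_1', 'feature_2'],
}
sample['features']['audio']['praat_features'] = {
    'features': [101.5],
    'labels': ['feature_a'],
}

# Two transcripts of the same audio sample from different transcribers.
sample['transcripts']['audio']['pocketsphinx'] = 'hello world'
sample['transcripts']['audio']['google'] = 'hello world'

# Classification label: a plain class name (hypothetical class).
sample['labels'].append('classname_1')

# Regression labels would instead be a list of {name: value} dicts
# with values in [0, 1] (hypothetical name and value).
regression_sample = make_features('audio')
regression_sample['labels'].append({'classname1': 0.7})

# The schema is plain dicts and lists, so it serializes directly to JSON.
serialized = json.dumps(sample, indent=2)
```

Because each feature set lives under its own key, adding `praat_features` later does not disturb `standard_features`, which is the re-featurization saving described in the list above.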
We are currently implementing this schema in the SurveyLex architecture.
Settings can be modified in the settings.json file. If no settings.json file is found, the setup.py script creates one automatically with default settings.
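The load-or-create behavior can be sketched as follows. This is an illustration under stated assumptions, not the code in setup.py: the `load_settings` helper is hypothetical, `DEFAULT_SETTINGS` holds only a subset of the defaults documented in this section, and filling in missing keys from the defaults is a design choice of this sketch.

```python
import copy
import json
from pathlib import Path

# A subset of the documented defaults; the real file has more keys.
DEFAULT_SETTINGS = {
    'default_audio_features': ['standard_features'],
    'default_text_features': ['nltk_features'],
    'transcribe_audio': True,
    'default_audio_transcriber': 'pocketsphinx',
    'default_training_script': ['tpot', 'devol'],
    'clean_data': True,
    'augment_data': False,
}


def load_settings(path='settings.json'):
    """Load settings from path, creating the file with defaults if missing.

    Keys absent from an existing file fall back to DEFAULT_SETTINGS
    (an assumption of this sketch, not documented behavior).
    """
    path = Path(path)
    settings = copy.deepcopy(DEFAULT_SETTINGS)
    if path.exists():
        settings.update(json.loads(path.read_text()))
    else:
        path.write_text(json.dumps(settings, indent=2))
    return settings
```

Note that in the JSON file itself booleans are written as `true` / `false`; they become Python `True` / `False` once loaded.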
Here are some settings that you can modify in this settings.json file and the various options for these settings:
setting | description | default setting | all options |
---|---|---|---|
default_audio_features | default set of audio features used for featurization (list). | ["standard_features"] | ["audioset_features", "audiotext_features", "librosa_features", "meta_features", "mixed_features", "praat_features", "pspeech_features", "pyaudio_features", "sa_features", "sox_features", "specimage_features", "specimage2_features", "spectrogram_features", "standard_features"] |
default_text_features | default set of text features used for featurization (list). | ["nltk_features"] | ["fast_features", "glove_features", "nltk_features", "spacy_features", "w2v_features"] |
default_image_features | default set of image features used for featurization (list). | ["image_features"] | ["image_features", "inception_features", "resnet_features", "tesseract_features", "vgg16_features", "vgg19_features", "xception_features"] |
default_video_features | default set of video features used for featurization (list). | ["video_features"] | ["video_features", "y8m_features"] |
default_csv_features | default set of csv features used for featurization (list). | ["csv_features"] | ["csv_features"] |
transcribe_audio | determines whether or not to transcribe an audio file via default_audio_transcriber (boolean). | True | True, False |
default_audio_transcriber | the default audio transcriber if transcribe_audio == True (string). | 'pocketsphinx' | 'pocketsphinx' |
transcribe_text | determines whether or not to transcribe a text file via default_text_transcriber (boolean). | True | True, False |
default_text_transcriber | the default text transcriber if transcribe_text == True (string). | 'raw text' | 'raw text' |
transcribe_image | determines whether or not to transcribe an image file via default_image_transcriber (boolean). | True | True, False |
default_image_transcriber | the default image transcriber if transcribe_image == True (string). | 'tesseract' | 'tesseract' |
transcribe_video | determines whether or not to transcribe a video file via default_video_transcriber (boolean). | True | True, False |
default_video_transcriber | the default video transcriber if transcribe_video == True (string). | 'tesseract_connected_over_frames' | 'tesseract_connected_over_frames' |
transcribe_csv | determines whether or not to transcribe a csv file via default_csv_transcriber (boolean). | True | True, False |
default_csv_transcriber | the default csv transcriber if transcribe_csv == True (string). | 'raw text' | 'raw text' |
default_training_script | the specified training script(s) used to train machine learning models. Note that if you specify multiple training scripts here, they will be executed serially (list). | ['tpot','devol'] | ['scsr', 'tpot', 'hyperopt', 'keras', 'devol', 'ludwig'] |
clean_data | specifies whether or not you'd like to clean / pre-process data in folders before model training (boolean). | True | True, False |
augment_data | specifies whether or not you'd like to augment data during training (boolean). | False | True, False |
create_YAML | specifies whether or not you'd like to output a production-ready repository for model deployment (boolean). | False | True, False |
model_compress | if True, compresses the model for production purposes to reduce memory consumption. Note that this is currently supported only for Keras or scikit-learn / TPOT models (boolean). | False | True, False |
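Since a misspelled option in settings.json is easy to miss, a small check against the table above can catch it early. The sketch below covers only a few of the settings, with the allowed values copied from the "all options" column; the `validate_settings` helper is hypothetical and is not part of the repository.

```python
# Allowed values for some list-valued settings, copied from the table above.
ALLOWED_OPTIONS = {
    'default_text_features': {
        'fast_features', 'glove_features', 'nltk_features',
        'spacy_features', 'w2v_features',
    },
    'default_video_features': {'video_features', 'y8m_features'},
    'default_csv_features': {'csv_features'},
    'default_training_script': {
        'scsr', 'tpot', 'hyperopt', 'keras', 'devol', 'ludwig',
    },
}

# Settings documented as booleans.
BOOLEAN_SETTINGS = (
    'transcribe_audio', 'transcribe_text', 'transcribe_image',
    'transcribe_video', 'transcribe_csv', 'clean_data',
    'augment_data', 'create_YAML', 'model_compress',
)


def validate_settings(settings):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    for key, allowed in ALLOWED_OPTIONS.items():
        values = settings.get(key, [])
        if not isinstance(values, list):
            problems.append(f'{key} should be a list')
            continue
        for value in values:
            if value not in allowed:
                problems.append(f'{key}: unknown option {value!r}')
    for key in BOOLEAN_SETTINGS:
        if key in settings and not isinstance(settings[key], bool):
            problems.append(f'{key} should be a boolean')
    return problems
```

For example, `validate_settings({'default_text_features': ['nltk_feature']})` reports the misspelled featurizer, while a dictionary that uses only documented options returns an empty list.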