This repository is for training models related to disease prediction. It summarizes our lab's work in this area and makes it accessible to the public. If you'd like to contribute to this repo, please let us know!
Take a second to watch the video below on why we started our company and why we think this research is impactful.
We created the TRIBE model to work with outstanding individuals to help accomplish our mission to build a universal voice test to refer patients to specialists faster. Fellows come from many different backgrounds - undergraduates, graduate students, faculty members, physicians, engineers, computer scientists, and other professionals.
Fellows contribute to this repo by pursuing either a data science project to model existing datasets or a research-related project to collect more data around an existing or new use case. Research demos are important to us because many of our datasets have very few samples; we are focused on curating a larger dataset and open-sourcing it to advance this work.
If you're interested, definitely apply to the next TRIBE here. You can read this FAQ and watch a previous demo day below to get a better feel for the program. If you have any additional questions, please reach out to Jim Schwoebel @ [email protected].
We have found that YouTube is a reliable place to get labeled speech data if you know what to search for. For example, to find data on Parkinson’s disease, you could search for something like “Parkinson’s disease: my story”. You’ll quickly find many people who have shared their stories of living with Parkinson’s disease. You can then manually annotate these videos, cutting them around the patients’ voice segments, and use this data to train machine learning models without any formal IRB.
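The search pattern described above can be turned into a link with a few lines of Python. This is only a sketch: the helper name `story_search_url` is hypothetical and not part of this repository, and it assumes the public YouTube results-page URL with its `search_query` parameter.

```python
from urllib.parse import urlencode


def story_search_url(disease: str) -> str:
    """Return a YouTube results-page URL for '<disease>: my story'."""
    # The "<disease>: my story" pattern comes from the example in the text.
    query = f"{disease}: my story"
    # urlencode escapes spaces, colons, and apostrophes for use in a URL.
    return "https://www.youtube.com/results?" + urlencode({"search_query": query})


print(story_search_url("Parkinson's disease"))
```

Opening the printed link in a browser shows the same results as typing the query by hand; the videos still need to be reviewed and annotated manually.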
This repository makes it seamless to build custom voice-based disease datasets using YouTube.
To get started, clone this repository:
git clone [email protected]:NeuroLexDiagnostics/train-diseases.git
cd train-diseases
open template.xlsx
Now fill out the spreadsheet (template.xlsx) in the current directory. The template lets you quickly label 20-second segments of voice data with age (e.g. twenties), gender (e.g. male), accent (e.g. British), audio quality (e.g. good/bad), and location (indoor vs. outdoor). Note that you can either make a new spreadsheet or expand an existing one in the spreadsheets directory of this repository:
- addiction.xlsx
- adhd.xlsx
- als.xlsx
- anxiety.xlsx
- autism.xlsx
- cold.xlsx
- controls.xlsx
- depression.xlsx
- depression_labels.xlsx
- dyslexia.xlsx
- glioblastoma.xlsx
- gravesdisease.xlsx
- multiple_sclerosis.xlsx
- parkinsons.xlsx
- postpartum_depression.xlsx
- schizophrenia.xlsx
- sleep_apnea.xlsx
- stressed.xlsx
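As a rough illustration of the labels described above, the sketch below checks one labeled segment before it is entered into a spreadsheet. The field names (`start`, `end`, `quality`, and so on), the allowed values, and the 20-second upper bound are assumptions made for this example; they are not read from template.xlsx, so compare them with the actual column headers before relying on them.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical allowed values, taken from the examples in the text.
QUALITY = {"good", "bad"}
LOCATION = {"indoor", "outdoor"}
SEGMENT_SECONDS = 20  # assumption: segments are at most 20 seconds long


@dataclass
class Segment:
    """One labeled clip; field names are illustrative, not the template's."""
    url: str
    start: float   # seconds into the video
    end: float     # seconds into the video
    age: str       # e.g. "twenties"
    gender: str    # e.g. "male"
    accent: str    # e.g. "British"
    quality: str   # "good" or "bad"
    location: str  # "indoor" or "outdoor"


def problems(seg: Segment) -> List[str]:
    """Return human-readable issues; an empty list means the row looks fine."""
    issues = []
    if seg.end <= seg.start:
        issues.append("end must be after start")
    elif seg.end - seg.start > SEGMENT_SECONDS:
        issues.append(f"segment longer than {SEGMENT_SECONDS} s")
    if seg.quality.lower() not in QUALITY:
        issues.append(f"unknown audio quality: {seg.quality!r}")
    if seg.location.lower() not in LOCATION:
        issues.append(f"unknown location: {seg.location!r}")
    return issues
```

Running a check like this on each row before saving can catch typos in the labels early, when they are still easy to fix.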
Once you fill out this spreadsheet and save it to the cloned directory's spreadsheets folder, type this into the terminal:
python3 setup.py
python3 yscrape.py
After this, the videos should start downloading and will be converted to audio files in a folder named after the Excel sheet you entered (e.g. audio from glioblastoma.xlsx will be put into the folder glioblastoma).
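The folder-naming rule above, together with the trimming of a single segment, can be sketched as follows. This is an illustration, not the code in yscrape.py: the helper names are hypothetical, and it assumes ffmpeg is used for the audio conversion, which the text does not state. The sketch only builds the command; nothing is downloaded or executed.

```python
from pathlib import Path
from typing import List


def output_folder(spreadsheet: str) -> str:
    """Folder name derived from the spreadsheet, e.g. glioblastoma.xlsx -> glioblastoma."""
    return Path(spreadsheet).stem


def trim_command(video: str, start: float, duration: float, out_wav: str) -> List[str]:
    """Build an ffmpeg command that extracts one audio segment.

    Assumption: ffmpeg is installed. -ss sets the start time, -t the
    duration, and -vn drops the video stream so only audio is written.
    """
    return ["ffmpeg", "-i", video, "-ss", str(start), "-t", str(duration), "-vn", out_wav]


folder = output_folder("glioblastoma.xlsx")
print(trim_command("video.mp4", 10, 20, f"{folder}/segment_1.wav"))
```

To actually run the command, the returned list could be passed to `subprocess.run(..., check=True)`.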
If you get stuck, you can watch a quick tutorial on how this process works in the video below.
We also have options to collect datasets by drafting an IRB-approved study. If you are interested in doing this, please contact Reza Hosseini Ghomi (MD/MSE), our Chief Medical Officer, @ [email protected] to see if your project idea is feasible.