Welcome to the "NOTSOFAR-1: Distant Meeting Transcription with a Single Device" Challenge.
This repo contains the baseline system code and datasets for the NOTSOFAR-1 Challenge.
- For details on the datasets, baseline system, and tasks, please see our NOTSOFAR-1 paper or visit CHiME's official challenge website.
- Contact us: join the chime-8-notsofar channel on the CHiME Slack, or open a GitHub issue.
Values are presented in the format tcpWER / tcORC-WER (session count).
As mentioned on the official website, systems are ranked based on the speaker-attributed tcpWER, while the speaker-agnostic tcORC-WER serves as a supplementary metric for analysis.
We include analysis based on a selection of hashtags from our metadata, providing insights into how different conditions affect system performance.
Session subset | Single-Channel | Multi-Channel |
---|---|---|
All Sessions | 46.8 / 38.5 (177) | 32.4 / 26.7 (106) |
#NaturalMeeting | 47.6 / 40.2 (30) | 32.3 / 26.2 (18) |
#DebateOverlaps | 54.9 / 44.7 (39) | 38.0 / 31.4 (24) |
#TurnsNoOverlap | 32.4 / 29.7 (10) | 21.2 / 18.8 (6) |
#TransientNoise=high | 51.0 / 43.7 (10) | 33.6 / 29.1 (5) |
#TalkNearWhiteboard | 55.4 / 43.9 (40) | 39.9 / 31.2 (22) |
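Each table cell packs three values into one string. The sketch below shows one way to split a cell into its tcpWER, tcORC-WER, and session count so the numbers can be compared programmatically. The names `Cell` and `parse_cell` are illustrative and are not part of the repository.

```python
import re
from typing import NamedTuple


class Cell(NamedTuple):
    """One table cell: 'tcpWER / tcORC-WER (session count)'."""
    tcp_wer: float
    tcorc_wer: float
    sessions: int


# Matches strings such as "46.8 / 38.5 (177)".
_CELL_RE = re.compile(r"^\s*([\d.]+)\s*/\s*([\d.]+)\s*\((\d+)\)\s*$")


def parse_cell(text: str) -> Cell:
    """Parse a results-table cell into numeric fields (hypothetical helper)."""
    match = _CELL_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized cell format: {text!r}")
    return Cell(float(match.group(1)), float(match.group(2)), int(match.group(3)))


all_sc = parse_cell("46.8 / 38.5 (177)")   # All Sessions, single-channel
all_mc = parse_cell("32.4 / 26.7 (106)")   # All Sessions, multi-channel

# Relative tcpWER difference between the two tracks on "All Sessions".
relative_gap = (all_sc.tcp_wer - all_mc.tcp_wer) / all_sc.tcp_wer
print(f"{relative_gap:.1%}")  # prints 30.8%
```

The two tracks cover different session counts (177 vs. 106), so the gap is indicative rather than a like-for-like comparison.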
The following steps will guide you through setting up the project on your machine.
This project is compatible with Linux environments. Windows users can refer to the Docker or Devcontainer sections.
Alternatively, install WSL2 by following the WSL2 Installation Guide, then install Ubuntu 20.04 from the Microsoft Store.
Clone the NOTSOFAR1-Challenge repository from GitHub. Open your terminal and run the following commands:
sudo apt-get install git
cd path/to/your/projects/directory
git clone https://github.com/microsoft/NOTSOFAR1-Challenge.git
Conda is a package manager that is used to install Python and other dependencies.
To install Miniconda, which is a minimal version of Conda, run the following commands:
miniconda_dir="$HOME/miniconda3"
script="Miniconda3-latest-Linux-$(uname -m).sh"
wget --tries=3 "https://repo.anaconda.com/miniconda/${script}"
bash "${script}" -b -p "${miniconda_dir}"
export PATH="${miniconda_dir}/bin:$PATH"
Note: you may change the miniconda_dir variable to install Miniconda in a different directory.
Conda environments are used to isolate Python dependencies.
To set one up, run the following commands:
source "/path/to/conda/dir/etc/profile.d/conda.sh"
conda create --name notsofar python=3.10 -y
conda activate notsofar
cd /path/to/NOTSOFAR1-Challenge
python -m pip install --upgrade pip
pip install --upgrade setuptools wheel Cython fasttext-wheel
pip install -r requirements.txt
conda install ffmpeg -c conda-forge -y
Python 3.10 is required to run the project. To install it, run the following commands:
sudo apt update && sudo apt upgrade
sudo add-apt-repository ppa:deadsnakes/ppa -y
sudo apt update
sudo apt install python3.10
Python virtual environments are used to isolate Python dependencies.
To set one up, run the following commands:
sudo apt-get install python3.10-venv
python3.10 -m venv /path/to/virtualenvs/NOTSOFAR
source /path/to/virtualenvs/NOTSOFAR/bin/activate
Navigate to the cloned repository and install the required Python dependencies:
cd /path/to/NOTSOFAR1-Challenge
python -m pip install --upgrade pip
pip install --upgrade setuptools wheel Cython fasttext-wheel
sudo apt-get install python3.10-dev ffmpeg build-essential
pip install -r requirements.txt
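After either setup path (Conda or venv), a quick sanity check can confirm that the active interpreter is Python 3.10 and that ffmpeg is on the PATH. This is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper, not something the repository provides.

```python
import shutil
import sys


def check_environment(required=(3, 10)) -> dict:
    """Report whether the interpreter version and ffmpeg match the setup steps.

    `required` defaults to Python 3.10, the version the setup steps install.
    """
    return {
        "python_ok": tuple(sys.version_info[:2]) == tuple(required),
        "ffmpeg_found": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'yes' if ok else 'NO'}")
```

Run it inside the activated environment. A "NO" for python_ok usually means the wrong environment is active.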
Refer to the Dockerfile in the project's root for dependency setup. To use Docker, ensure you have Docker installed on your system and configured to use Linux containers.
With the provided devcontainer.json, you can run and work on the project in a devcontainer using, for example, the Dev Containers VSCode Extension.
The following command will download the entire dev-set-1 of the recorded meeting dataset and run the inference pipeline according to the selected configuration. The default is --config-name dev_set_mc_debug, intended for quick debugging: it runs on a single session with the Whisper 'tiny' model.
cd /path/to/NOTSOFAR1-Challenge
python run_inference.py
To run on all multi-channel or single-channel dev-set sessions, use the following commands respectively:
python run_inference.py --config-name full_dev_set_mc
python run_inference.py --config-name full_dev_set_sc
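If you prefer to launch these runs from a script, the commands above can be wrapped as follows. This is a sketch only: `build_inference_command` and `run_full_dev_set` are hypothetical helpers, and the only flag used is --config-name with the configuration names listed above.

```python
import subprocess
import sys
from typing import List, Optional


def build_inference_command(config_name: Optional[str] = None) -> List[str]:
    """Build the argv for run_inference.py; no config name means the default."""
    cmd = [sys.executable, "run_inference.py"]
    if config_name:
        cmd += ["--config-name", config_name]
    return cmd


def run_full_dev_set(repo_dir: str) -> None:
    """Run the multi-channel and then the single-channel full dev-set configs."""
    for config_name in ("full_dev_set_mc", "full_dev_set_sc"):
        # check=True stops at the first failing run instead of continuing.
        subprocess.run(build_inference_command(config_name), cwd=repo_dir, check=True)
```

Calling `run_full_dev_set("/path/to/NOTSOFAR1-Challenge")` runs the two configurations back to back.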
The first time run_inference.py runs, it will automatically download these required models and datasets from blob storage:
- The development set of the meeting dataset (dev-set) will be stored in the artifacts/meeting_data directory.
- The CSS models required to run the inference pipeline will be stored in the artifacts/css_models directory.
Outputs will be written to the artifacts/outputs directory.
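To see which of these directories already exist after a run, a small check such as the one below may help. It assumes the artifacts directory sits directly under the repository root, as the paths above suggest; `artifact_status` is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict

# Sub-directories of artifacts/ named in the text above.
ARTIFACT_SUBDIRS = ("meeting_data", "css_models", "outputs")


def artifact_status(repo_root: str) -> Dict[str, bool]:
    """Map each expected artifacts/ sub-directory to whether it exists."""
    root = Path(repo_root) / "artifacts"
    return {name: (root / name).is_dir() for name in ARTIFACT_SUBDIRS}


if __name__ == "__main__":
    for name, present in artifact_status(".").items():
        print(f"artifacts/{name}: {'present' if present else 'missing'}")
```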
The session_query argument found in the yaml config file (e.g. configs/inference/inference_v1.yaml) offers more control over filtering meetings.
Note that to submit results on the dev-set, you must evaluate on the full set (full_dev_set_mc or full_dev_set_sc) and no filtering must be performed.
The inference pipeline is modular, designed for easy research and extension. Begin by exploring the following components:
- Continuous Speech Separation (CSS): See css_inference in css.py. We provide a model pre-trained on NOTSOFAR's simulated training dataset, as well as inference and training code. For more information, refer to the CSS section.
- Automatic Speech Recognition (ASR): See asr_inference in asr.py. The baseline implementation relies on Whisper.
- Speaker Diarization: See diarization_inference in diarization.py. The baseline implementation relies on the NeMo toolkit.
For training and fine-tuning your models, NOTSOFAR offers the simulated training set and the training portion of the recorded meeting dataset. Refer to the download_simulated_subset and download_meeting_subset functions in utils/azure_storage.py, or the NOTSOFAR-1 Datasets section.
The following command will run CSS training on the 10-second simulated training data sample in sample_data/css_train_set.
cd /path/to/NOTSOFAR1-Challenge
python run_training_css_local.py
You can use the download_simulated_subset function in utils/azure_storage.py to download the training dataset from blob storage.
You have the option to download either the complete dataset, comprising almost 1000 hours, or a smaller, 200-hour subset.
Examples:
import os

# Assumes the script runs from the repository root so that utils/ is importable.
from utils.azure_storage import download_simulated_subset

my_dir = '/path/to/your/datasets'  # placeholder: set this to your own destination root
ver = 'v1.5'  # this should point to the latest version of the dataset.

# Option 1: Download the training and validation sets of the entire 1000-hour dataset.
train_set_path = download_simulated_subset(
    version=ver, volume='1000hrs', subset_name='train', destination_dir=os.path.join(my_dir, 'train'))
val_set_path = download_simulated_subset(
    version=ver, volume='1000hrs', subset_name='val', destination_dir=os.path.join(my_dir, 'val'))

# Option 2: Download the training and validation sets of the smaller 200-hour dataset.
train_set_path = download_simulated_subset(
    version=ver, volume='200hrs', subset_name='train', destination_dir=os.path.join(my_dir, 'train'))
val_set_path = download_simulated_subset(
    version=ver, volume='200hrs', subset_name='val', destination_dir=os.path.join(my_dir, 'val'))
Once you have downloaded the training dataset, you can run CSS training on it using the run_training_css function in css/training/train.py.
The main function in run_training_css.py provides an entry point with conf, data_root_in, and data_root_out arguments that you can use to configure the run.
Note that setting up and provisioning a cloud compute environment for this training process is the user's responsibility. Our code is designed to support PyTorch's Distributed Data Parallel (DDP) framework, so you can leverage multiple GPUs across several nodes efficiently.
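For orientation when provisioning multi-GPU jobs: PyTorch launchers such as torchrun pass each worker its identity through the RANK, LOCAL_RANK, and WORLD_SIZE environment variables. The sketch below reads those variables with single-process defaults. Whether the training code in this repository is launched this way is an assumption, and `DistInfo` and `read_dist_info` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DistInfo:
    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process on its node (often the GPU index)
    world_size: int  # total number of processes in the job

    @property
    def is_main(self) -> bool:
        """True for the single process that should write logs and checkpoints."""
        return self.rank == 0


def read_dist_info(env: Optional[Mapping[str, str]] = None) -> DistInfo:
    """Read launcher-provided variables, defaulting to a single-process run."""
    env = os.environ if env is None else env
    return DistInfo(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )
```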
To add a new CSS model, you need to do the following:
- Have your model implement the same interface as our baseline CSS model class ConformerCssWrapper, which is located in css/training/conformer_wrapper.py. Note that in addition to the forward method, it must also implement the separate, stft, and istft methods. The latter three methods will be used in the inference pipeline and to calculate the loss when training.
- Create a configuration dataclass for your model. Add it as a member of the TrainCfg dataclass in css/training/train.py.
- Add your model to the get_model function in css/training/train.py.
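Before wiring a new model into training, it can be useful to confirm that its class exposes the four methods named above. The sketch below checks method names only. The argument lists in `MyCssModel` are placeholders, since the real signatures are defined by ConformerCssWrapper and are not reproduced here; `missing_css_methods` and `MyCssModel` are hypothetical names.

```python
from typing import List

# Method names the steps above say a CSS model must provide.
REQUIRED_METHODS = ("forward", "separate", "stft", "istft")


def missing_css_methods(model_cls: type) -> List[str]:
    """Return the required method names that model_cls does not define."""
    return [name for name in REQUIRED_METHODS
            if not callable(getattr(model_cls, name, None))]


class MyCssModel:
    """Hypothetical skeleton; the signatures here are placeholders."""

    def forward(self, *args, **kwargs):
        raise NotImplementedError

    def separate(self, *args, **kwargs):
        raise NotImplementedError

    def stft(self, *args, **kwargs):
        raise NotImplementedError

    def istft(self, *args, **kwargs):
        raise NotImplementedError


print(missing_css_methods(MyCssModel))  # prints []
```

A name-only check will not catch signature mismatches, so compare your method signatures with those in css/training/conformer_wrapper.py as well.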
This section is for those specifically interested in downloading the NOTSOFAR datasets.
The NOTSOFAR-1 Challenge provides two datasets: a recorded meeting dataset and a simulated training dataset.
The datasets are provided for open research. See the Data License section.
They are hosted in Azure Blob Storage. See download instructions below.
Visit the Data section on CHiME's website to explore the data further.
The NOTSOFAR-1 Recorded Meeting Dataset is a collection of 237 meetings, each averaging 6 minutes, recorded across 30 conference rooms with 4-8 attendees, featuring a total of 35 unique speakers. This dataset captures a broad spectrum of real-world acoustic conditions and conversational dynamics.
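The meeting counts and average duration translate into total recording hours as follows. The sketch applies the 6-minute average uniformly to every meeting and every device, which is an assumption; under it, the 80-meeting, 2-device evaluation set described later in this document works out to the 16 hours per track stated there. `total_hours` is a hypothetical helper.

```python
def total_hours(meetings: int, avg_minutes: float = 6, devices: int = 1) -> float:
    """Estimate hours of audio: meetings x devices x average meeting length."""
    return meetings * devices * avg_minutes / 60


# Whole recorded dataset, counting each meeting once (one device stream).
print(total_hours(237))            # prints 23.7

# eval-small: 80 meetings recorded by 2 devices per track.
print(total_hours(80, devices=2))  # prints 16.0
```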
To download the dataset, you can call the Python function download_meeting_subset within utils/azure_storage.py.
Alternatively, using AzCopy CLI, set these arguments and run the following command:
- subset_name: name of the split to download (dev_set/eval_set/train_set).
- version: version to download.
- datasets_path: path to the directory where you want to download the benchmarking dataset (destination directory must exist).
azcopy copy https://notsofarsa.blob.core.windows.net/benchmark-datasets/<subset_name>/<version>/MTG <datasets_path>/benchmark --recursive
- 240825.1_train: Corresponds to Train-set-1 and Train-set-2 from the challenge datasets, except that faulty white-noise recordings of sc_rockfall_1 have been removed from 3 meetings.
azcopy copy https://notsofarsa.blob.core.windows.net/benchmark-datasets/train_set/240825.1_train/MTG . --recursive
- 240825.1_dev1: Same as Dev-set-1 from the challenge. Users should be mindful of speaker overlap: there are 12 speakers, 10 of which are in the training set.
azcopy copy https://notsofarsa.blob.core.windows.net/benchmark-datasets/dev_set/240825.1_dev1/MTG . --recursive
- 240629.1_eval_small_with_GT: Identical to the challenge evaluation set, hence enabling direct comparison to challenge results. This relatively smaller evaluation set is designed for resource-constrained research and includes 80 meetings with 2 devices per track (single-channel/multi-channel), totaling 16 hours for each. Ground truth is available.
azcopy copy https://notsofarsa.blob.core.windows.net/benchmark-datasets/eval_set/240629.1_eval_small_with_GT/MTG . --recursive
- 240825.1_eval_full_with_GT: A larger evaluation set to facilitate further research and increase the statistical significance of performance evaluations. It includes 129 meetings and a greater variety of devices: 3 multi-channel, and 6-7 single-channel.
azcopy copy https://notsofarsa.blob.core.windows.net/benchmark-datasets/eval_set/240825.1_eval_full_with_GT/MTG . --recursive
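The four commands above follow one URL pattern, so they can be generated from the version name alone. The sketch below encodes the version-to-subset mapping listed above and builds the corresponding azcopy argument list. `VERSION_TO_SUBSET`, `meeting_data_url`, and `azcopy_command` are hypothetical names, and running the result requires azcopy on your PATH.

```python
from typing import List

BASE_URL = "https://notsofarsa.blob.core.windows.net/benchmark-datasets"

# Version name -> subset directory, taken from the list above.
VERSION_TO_SUBSET = {
    "240825.1_train": "train_set",
    "240825.1_dev1": "dev_set",
    "240629.1_eval_small_with_GT": "eval_set",
    "240825.1_eval_full_with_GT": "eval_set",
}


def meeting_data_url(version: str) -> str:
    """Return the blob URL of the MTG directory for a listed version."""
    if version not in VERSION_TO_SUBSET:
        raise ValueError(f"unknown version: {version!r}")
    return f"{BASE_URL}/{VERSION_TO_SUBSET[version]}/{version}/MTG"


def azcopy_command(version: str, destination: str = ".") -> List[str]:
    """Build the azcopy argv; pass it to subprocess.run to start the download."""
    return ["azcopy", "copy", meeting_data_url(version), destination, "--recursive"]


print(" ".join(azcopy_command("240825.1_dev1")))
```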
The dataset currently available for open research has been modified, with the major differences being:
- In addition to the evaluation set used in the challenge (eval-small), a larger evaluation set (eval-full) is made available.
- Due to legal and quality constraints we unfortunately had to remove Dev-set-2. Instead, Dev-set-1 will serve as the development set for the open dataset.
Dev-set-2 is approved for use exclusively as part of the NOTSOFAR-1 Challenge, but not beyond it. Challenge participants are only allowed to use Dev-set-2 for publications related to systems developed during the Challenge.
- Upcoming upgrades: To ensure maximal annotation quality, the training and development subsets are undergoing transcription upgrades. Please stay tuned for updates.
- The Rockfall1 multi-channel device (mc_rockfall_1), suspected of being faulty due to subjective sound quality, is excluded from eval-small and eval-full but remains in the training and development sets. Users may choose whether to use it.
The simulated training dataset consists of almost 1000 hours simulated with the same microphone-array geometry as the multi-channel devices in the NOTSOFAR meeting dataset. It was synthesized with enhanced authenticity for real-world generalization, incorporating 15,000 real acoustic transfer functions.
To download the dataset, you can call the Python function download_simulated_subset within utils/azure_storage.py.
Alternatively, using AzCopy CLI, set these arguments and run the following command:
- version: version of the train data to download (v1.5 is the latest). See the documentation in the download_simulated_subset function in utils/azure_storage.py for available versions.
- volume: volume of the train data to download (200hrs/1000hrs).
- subset_name: train data type to download (train/val).
- datasets_path: path to the directory where you want to download the simulated dataset (destination directory must exist).
azcopy copy https://notsofarsa.blob.core.windows.net/css-datasets/<version>/<volume>/<subset_name> <datasets_path>/benchmark --recursive
Examples:
azcopy copy https://notsofarsa.blob.core.windows.net/css-datasets/v1.5/200hrs/train . --recursive
azcopy copy https://notsofarsa.blob.core.windows.net/css-datasets/v1.5/1000hrs/train . --recursive
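Because a typo in volume or subset_name only surfaces once azcopy fails, it may be worth validating the arguments before building the command. The sketch below accepts only the values listed above and leaves version unchecked, since the available versions are documented in utils/azure_storage.py rather than here. `simulated_data_command` is a hypothetical helper.

```python
from typing import List

CSS_BASE_URL = "https://notsofarsa.blob.core.windows.net/css-datasets"
VOLUMES = ("200hrs", "1000hrs")
SUBSETS = ("train", "val")


def simulated_data_command(version: str, volume: str, subset_name: str,
                           destination: str = ".") -> List[str]:
    """Validate volume and subset_name, then build the azcopy argv."""
    if volume not in VOLUMES:
        raise ValueError(f"volume must be one of {VOLUMES}, got {volume!r}")
    if subset_name not in SUBSETS:
        raise ValueError(f"subset_name must be one of {SUBSETS}, got {subset_name!r}")
    url = f"{CSS_BASE_URL}/{version}/{volume}/{subset_name}"
    return ["azcopy", "copy", url, destination, "--recursive"]


print(" ".join(simulated_data_command("v1.5", "200hrs", "val")))
```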
If you use the NOTSOFAR datasets or code in your research, please cite the following paper:
@inproceedings{vinnikov24_interspeech,
title = {NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription},
author = {Alon Vinnikov and Amir Ivry and Aviv Hurvitz and Igor Abramovski and Sharon Koubi and Ilya Gurvich and Shai Peer and Xiong Xiao and Benjamin Martinez Elizalde and Naoyuki Kanda and Xiaofei Wang and Shalev Shaer and Stav Yagev and Yossi Asher and Sunit Sivasankaran and Yifan Gong and Min Tang and Huaming Wang and Eyal Krupka},
year = {2024},
booktitle = {Interspeech 2024},
pages = {5003--5007},
doi = {10.21437/Interspeech.2024-1788},
issn = {2958-1796},
}
The data provided in this repository is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). You are free to use, share, and adapt the data as long as appropriate credit is given.
Please refer to our contributing guide for more information on how to contribute!