Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ASR] AudioToAudio datasets and related test #5196

Merged
merged 6 commits into from
Nov 30, 2022

Conversation

anteju
Copy link
Collaborator

@anteju anteju commented Oct 18, 2022

What does this PR do ?

This is a draft PR of datasets for different audio-to-audio tasks.

Datasets

  • BaseAudioDataset: Abstract base class.
  • AudioToTargetDataset: A dataset for audio-to-audio tasks where the goal is to use an input signal to recover the corresponding target signal.
  • AudioToTargetWithReferenceDataset: A dataset for audio-to-audio tasks where the goal is to use an input signal to recover the corresponding target signal and an additional reference signal is available.
  • AudioToTargetWithEmbeddingDataset: A dataset for audio-to-audio tasks where the goal is to use an input signal to recover the corresponding target signal and an additional embedding signal. It is assumed that the embedding is in a form of a vector.

Tests

Multiple tests are implemented in test_asr_datasets.py in class TestAudioDatasets. These tests include

  • test_audio_to_target_dataset: multiple tests for AudioToTargetDataset
  • test_audio_to_target_dataset_with_target_list: tests specifically a scenario where target is provides as a list of files
  • test_audio_to_target_with_reference_dataset: tests for AudioToTargetWithReferenceDataset
  • test_audio_to_target_with_embedding_dataset: tests for AudioToTargetWithEmbeddingDataset

Tests can be started using the following command

pytest tests/collections/asr/test_asr_datasets.py::TestAudioDatasets

Collection: ASR

Changelog

  • Added datasets in audio_to_audio.py
  • Added unit tests in test_asr_datasets.py and test_audio_utils.py
  • Added an option in AudioSegment. segment_from_file to use a fixed offset

Usage

Usage is illustrated in unit tests, which can be executed using

pytest tests/collections/asr/test_asr_datasets.py::TestAudioDatasets

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@lgtm-com
Copy link

lgtm-com bot commented Oct 18, 2022

This pull request introduces 1 alert when merging da6a76d into acb5073 - view on LGTM.com

new alerts:

  • 1 for Unused local variable

@anteju anteju force-pushed the dev/audio-datasets branch from da6a76d to 16a363e Compare October 18, 2022 23:43
@lgtm-com
Copy link

lgtm-com bot commented Oct 18, 2022

This pull request introduces 1 alert when merging 16a363e into acb5073 - view on LGTM.com

new alerts:

  • 1 for Unused local variable

@anteju anteju force-pushed the dev/audio-datasets branch 2 times, most recently from 88561ac to 0b127ad Compare October 19, 2022 03:39
nemo/collections/asr/data/audio_to_audio.py Outdated Show resolved Hide resolved
sample_rate: desired sample rate for output samples
duration: Optional desired duration of output samples.
If `None`, complete file will be loaded.
If set, a random segment of `duration` seconds will be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it make sense to load random segment of duration, or is there a way to load a file with fixed_offset and duration?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually audio-to-audio is trained with a fixed duration and selecting a random segment from audio. Test is usually performed on whole audio file. That's why I added support for these two essential use cases (random fixed length or whole audio).
I will and a fixed duration, fixed offset (non-random).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great of the function docsting explains more about optional multichannel aspect.
I think diarization/buffer-ASR codes should use this function for removing duplications

Copy link
Collaborator Author

@anteju anteju Nov 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added loading fixed offset & fixed duration + more docstrings.

@anteju anteju force-pushed the dev/audio-datasets branch 2 times, most recently from c73994d to 47ded05 Compare October 19, 2022 17:28
@yzhang123
Copy link
Contributor

yzhang123 commented Oct 19, 2022

Thanks for this draft @anteju ! This is very helpful.
I wonder how the pipeline will look like if you add all audio based augmentations/perturbations. could you add an example?
Also, i am not sure if this is out of scope, but it would be great to also see an example where we have (multiple) audios and (multiple) text.

@anteju
Copy link
Collaborator Author

anteju commented Oct 26, 2022

Thanks for this draft @anteju ! This is very helpful.
I wonder how the pipeline will look like if you add all audio based augmentations/perturbations. could you add an example?

The most straightforward way would be to add augmentor object to datasets, similarly to existing datasets.

Also, i am not sure if this is out of scope, but it would be great to also see an example where we have (multiple) audios and (multiple) text.

Could you please clarify multiple audios and multiple text?

@anteju anteju force-pushed the dev/audio-datasets branch 2 times, most recently from ca69595 to 2dd5343 Compare October 27, 2022 00:31
@yzhang123
Copy link
Contributor

yzhang123 commented Oct 27, 2022

The most straightforward way would be to add augmentor object to datasets, similarly to existing datasets.
Yes, good idea!

Also, i am not sure if this is out of scope, but it would be great to also see an example where we have (multiple) audios and (multiple) text.

Could you please clarify multiple audios and multiple text?

Maybe have a example dataset class, which in get_item can return two or more audio files, and two or more transcripts? This is a most general class that comes to mind, and is POC for the general design you worked on. This would allow us to then easily adapt the class for TSASR, which returns two audios and 1 transcript

@titu1994
Copy link
Collaborator

Augnentor should be a detached operation, even if it's part of the config. Ie don't use the way we do it right now where we have no control over augmentation and it's all random inside of Waveform Featurizer. Let's seperate and make the call to augmentation more controllable inside of the new data loaders

@anteju
Copy link
Collaborator Author

anteju commented Oct 27, 2022

The most straightforward way would be to add augmentor object to datasets, similarly to existing datasets.
Yes, good idea!

Also, i am not sure if this is out of scope, but it would be great to also see an example where we have (multiple) audios and (multiple) text.

Could you please clarify multiple audios and multiple text?

Maybe have a example dataset class, which in get_item can return two or more audio files, and two or more transcripts? This is a most general class that comes to mind, and is POC for the general design you worked on. This would allow us to then easily adapt the class for TSASR, which returns two audios and 1 transcript

Re: audio signals
AudioDatasetWithReference is returning multiple audio signals: input and target. Target can be, for example, be provided as a list of signals which are concatenated along the channel dimension. This is also tested in test_audio_to_target_dataset_with_target_list.
AudioDatasetWithReference is returning multiple audio signals: input, target, and reference. This one should be exactly what's necessary for TSASR (reference signal = enrollment audio for TS).

Re: text
As discussed earlier, I'll take a look at that.

@anteju
Copy link
Collaborator Author

anteju commented Oct 27, 2022

Augnentor should be a detached operation, even if it's part of the config. Ie don't use the way we do it right now where we have no control over augmentation and it's all random inside of Waveform Featurizer. Let's seperate and make the call to augmentation more controllable inside of the new data loaders

That is exactly the plan: to have augmentation inside the data loader (and not when loading audio, as in WaveformFeaturizer).

@anteju anteju force-pushed the dev/audio-datasets branch from 2dd5343 to c42d9f1 Compare October 27, 2022 19:47
@anteju anteju force-pushed the dev/audio-datasets branch from d989568 to 5c05957 Compare November 10, 2022 04:35
@anteju anteju requested review from tango4j and yzhang123 November 11, 2022 00:51
@anteju anteju force-pushed the dev/audio-datasets branch from 5c05957 to 7ea9b7b Compare November 15, 2022 22:54
@anteju anteju marked this pull request as ready for review November 16, 2022 01:52
@anteju anteju force-pushed the dev/audio-datasets branch from 7ea9b7b to 1bff667 Compare November 17, 2022 00:23
@@ -257,12 +257,22 @@ def from_file(

@classmethod
def segment_from_file(
cls, audio_file, target_sr=None, n_segments=0, trim=False, orig_sr=None, channel_selector=None,
cls, audio_file, target_sr=None, n_segments=0, trim=False, orig_sr=None, channel_selector=None, offset=None
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@XuesongYang, I've added an option to specify a fixed (non-randomized) offset.
The new parameter offset defaults to None and as before results in a random offset.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: offset can be enforced as a float type. All negative values mean no offsets. So we could make it default to -1.0 to specify no offsets. Then benefit is that we can have a cleaner type hint.

Copy link
Collaborator

@XuesongYang XuesongYang Nov 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems an easy add-on with type hint for this function. Do you mind if enforcing that? Thanks.

@anteju anteju requested a review from XuesongYang November 17, 2022 00:28
@lgtm-com
Copy link

lgtm-com bot commented Nov 17, 2022

This pull request introduces 1 alert when merging 1bff667 into 563cc2f - view on LGTM.com

new alerts:

  • 1 for Unused import

Heads-up: LGTM.com's PR analysis will be disabled on the 5th of December, and LGTM.com will be shut down ⏻ completely on the 16th of December 2022. Please enable GitHub code scanning, which uses the same CodeQL engine ⚙️ that powers LGTM.com. For more information, please check out our post on the GitHub blog.

@anteju anteju requested a review from titu1994 November 21, 2022 23:36
Copy link
Collaborator

@XuesongYang XuesongYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for segment_from_file() func. Added some comments accordingly. Thanks.

@@ -257,12 +257,22 @@ def from_file(

@classmethod
def segment_from_file(
cls, audio_file, target_sr=None, n_segments=0, trim=False, orig_sr=None, channel_selector=None,
cls, audio_file, target_sr=None, n_segments=0, trim=False, orig_sr=None, channel_selector=None, offset=None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: offset can be enforced as a float type. All negative values mean no offsets. So we could make it default to -1.0 to specify no offsets. Then benefit is that we can have a cleaner type hint.

@@ -257,12 +257,22 @@ def from_file(

@classmethod
def segment_from_file(
cls, audio_file, target_sr=None, n_segments=0, trim=False, orig_sr=None, channel_selector=None,
cls, audio_file, target_sr=None, n_segments=0, trim=False, orig_sr=None, channel_selector=None, offset=None
Copy link
Collaborator

@XuesongYang XuesongYang Nov 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems an easy add-on with type hint for this function. Do you mind if enforcing that? Thanks.

@XuesongYang
Copy link
Collaborator

discussed offline. Approved for my parts and please hold off merging until other folks approve.

@anteju anteju requested a review from XuesongYang November 22, 2022 00:43
XuesongYang
XuesongYang previously approved these changes Nov 22, 2022
nemo/collections/common/parts/preprocessing/manifest.py Outdated Show resolved Hide resolved
]


def load_samples_synchronized(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 adding one vote for this comment. this function feels like doing more than one thing.

nemo/collections/asr/data/audio_to_audio.py Show resolved Hide resolved
nemo/collections/asr/data/audio_to_audio.py Outdated Show resolved Hide resolved
nemo/collections/asr/data/audio_to_audio.py Outdated Show resolved Hide resolved
]


def load_samples_synchronized(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was to make the function itself split into seperate components, each with private methods inside of the class, which can be overriden. Having a class which calls a monolithic function defeats the purpose of extensible code

titu1994
titu1994 previously approved these changes Nov 29, 2022
nemo/collections/asr/data/audio_to_audio.py Show resolved Hide resolved
tango4j
tango4j previously approved these changes Nov 30, 2022
Signed-off-by: Ante Jukić <[email protected]>
@anteju anteju dismissed stale reviews from tango4j and titu1994 via 0567aa5 November 30, 2022 01:26
@anteju anteju requested review from tango4j and titu1994 November 30, 2022 01:31
Copy link
Collaborator

@tango4j tango4j left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good to me

@titu1994 titu1994 merged commit 5c1d59e into NVIDIA:main Nov 30, 2022
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Dec 5, 2022
* AudioToAudio datasets and related test

Signed-off-by: Ante Jukić <[email protected]>

* Updated doc, created utility function in manifest to avoide code duplication

Signed-off-by: Ante Jukić <[email protected]>

* Remove unused import

Signed-off-by: Ante Jukić <[email protected]>

* Moved functionality to ASRAudioProcessor

Signed-off-by: Ante Jukić <[email protected]>

* Addressed review comments

Signed-off-by: Ante Jukić <[email protected]>

* Removed unused local variable

Signed-off-by: Ante Jukić <[email protected]>

Signed-off-by: Ante Jukić <[email protected]>
Signed-off-by: Hainan Xu <[email protected]>
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Dec 5, 2022
* AudioToAudio datasets and related test

Signed-off-by: Ante Jukić <[email protected]>

* Updated doc, created utility function in manifest to avoide code duplication

Signed-off-by: Ante Jukić <[email protected]>

* Remove unused import

Signed-off-by: Ante Jukić <[email protected]>

* Moved functionality to ASRAudioProcessor

Signed-off-by: Ante Jukić <[email protected]>

* Addressed review comments

Signed-off-by: Ante Jukić <[email protected]>

* Removed unused local variable

Signed-off-by: Ante Jukić <[email protected]>

Signed-off-by: Ante Jukić <[email protected]>
Signed-off-by: Hainan Xu <[email protected]>
andrusenkoau pushed a commit to andrusenkoau/NeMo that referenced this pull request Jan 5, 2023
* AudioToAudio datasets and related test

Signed-off-by: Ante Jukić <[email protected]>

* Updated doc, created utility function in manifest to avoide code duplication

Signed-off-by: Ante Jukić <[email protected]>

* Remove unused import

Signed-off-by: Ante Jukić <[email protected]>

* Moved functionality to ASRAudioProcessor

Signed-off-by: Ante Jukić <[email protected]>

* Addressed review comments

Signed-off-by: Ante Jukić <[email protected]>

* Removed unused local variable

Signed-off-by: Ante Jukić <[email protected]>

Signed-off-by: Ante Jukić <[email protected]>
Signed-off-by: andrusenkoau <[email protected]>
titu1994 pushed a commit to titu1994/NeMo that referenced this pull request Mar 24, 2023
* AudioToAudio datasets and related test

Signed-off-by: Ante Jukić <[email protected]>

* Updated doc, created utility function in manifest to avoide code duplication

Signed-off-by: Ante Jukić <[email protected]>

* Remove unused import

Signed-off-by: Ante Jukić <[email protected]>

* Moved functionality to ASRAudioProcessor

Signed-off-by: Ante Jukić <[email protected]>

* Addressed review comments

Signed-off-by: Ante Jukić <[email protected]>

* Removed unused local variable

Signed-off-by: Ante Jukić <[email protected]>

Signed-off-by: Ante Jukić <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ASR common feature request/PR for a new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants