[ASR] AudioToAudio datasets and related test #5196
This pull request introduces 1 alert when merging da6a76d into acb5073 - view on LGTM.com new alerts:
This pull request introduces 1 alert when merging 16a363e into acb5073 - view on LGTM.com new alerts:
sample_rate: desired sample rate for output samples
duration: Optional desired duration of output samples.
If `None`, complete file will be loaded.
If set, a random segment of `duration` seconds will be
Does it make sense to load a random segment of `duration`, or is there a way to load a file with fixed_offset and duration?
Usually, audio-to-audio models are trained on fixed-duration segments selected at random from the audio, while testing is usually performed on the whole audio file. That's why I added support for these two essential use cases (random fixed length or whole audio).
I will add a fixed-duration, fixed-offset (non-random) option.
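The three loading modes discussed in this thread (whole file, random fixed-length segment, fixed offset with fixed duration) can be sketched as a small index-selection helper. This is only an illustration under stated assumptions: `select_segment` and its parameters are hypothetical names, not the code in this PR, and clamping the offset so the segment fits in the file is a choice made for this sketch.

```python
import random
from typing import Optional, Tuple


def select_segment(
    num_samples: int,
    sample_rate: int,
    duration: Optional[float] = None,
    offset: Optional[float] = None,
    rng: Optional[random.Random] = None,
) -> Tuple[int, int]:
    """Return (start, end) sample indices of the segment to load.

    Hypothetical helper; it only illustrates the three modes.
    """
    if offset is not None and offset < 0:
        raise ValueError(f"offset must be non-negative, got {offset}")

    if duration is None:
        # Whole file: the usual choice for evaluation.
        return 0, num_samples

    segment_len = int(round(duration * sample_rate))
    if segment_len >= num_samples:
        # File is shorter than the requested duration: return all of it.
        return 0, num_samples

    max_start = num_samples - segment_len
    if offset is None:
        # Random fixed-length segment: the usual choice for training.
        rng = rng or random.Random()
        start = rng.randint(0, max_start)
    else:
        # Fixed (non-random) offset, clamped so the segment fits in the file.
        start = min(int(round(offset * sample_rate)), max_start)
    return start, start + segment_len
```

Passing a seeded `random.Random` instance as `rng` makes the random mode reproducible, which is convenient in unit tests.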
It would be great if the function docstring explained more about the optional multichannel aspect.
I think the diarization/buffer-ASR code should use this function to remove duplication.
Added loading fixed offset & fixed duration + more docstrings.
Thanks for this draft @anteju! This is very helpful.
The most straightforward way would be to add
Could you please clarify
Maybe have an example dataset class which, in get_item, can return two or more audio files and two or more transcripts? This is the most general class that comes to mind, and it would serve as a POC for the general design you worked on. This would then allow us to easily adapt the class for TSASR, which returns two audios and one transcript.
The augmentor should be a detached operation, even if it's part of the config. I.e., don't do it the way we do right now, where we have no control over augmentation and it's all random inside the Waveform Featurizer. Let's separate it and make the call to augmentation more controllable inside the new data loaders.
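The two suggestions above (a general dataset that returns several audio signals and transcripts, and augmentation as a detached, explicitly called step) could look roughly like the following. This is a minimal sketch, not the design adopted in the PR: `Example`, `MultiSignalDataset` and `halve_gain` are hypothetical names, and plain Python lists stand in for audio tensors.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence

Signal = List[float]
AudioDict = Dict[str, Signal]


@dataclass
class Example:
    """One example with any number of named audio signals and transcripts."""
    audio: AudioDict
    text: Dict[str, str]


class MultiSignalDataset:
    """Hypothetical dataset returning several signals and transcripts per item."""

    def __init__(
        self,
        examples: Sequence[Example],
        augment: Optional[Callable[[AudioDict], AudioDict]] = None,
    ):
        self.examples = list(examples)
        # Augmentation is injected and called explicitly in __getitem__,
        # instead of being hidden inside audio loading.
        self.augment = augment

    def __len__(self) -> int:
        return len(self.examples)

    def __getitem__(self, index: int) -> dict:
        example = self.examples[index]
        # Copy so that augmentation never modifies the stored signals.
        audio = {name: list(signal) for name, signal in example.audio.items()}
        if self.augment is not None:
            audio = self.augment(audio)
        return {"audio": audio, "text": dict(example.text)}


def halve_gain(audio: AudioDict) -> AudioDict:
    """Toy deterministic augmentation: scale every signal by 0.5."""
    return {name: [0.5 * x for x in signal] for name, signal in audio.items()}
```

A TSASR-style item with two audios and one transcript would then be an `Example` such as `Example(audio={"mix": [...], "enrollment": [...]}, text={"target": "..."})`, and swapping or disabling augmentation only requires changing the `augment` argument.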
Re: audio signals Re: text
That is exactly the plan: to have augmentation inside the data loader (and not when loading audio, as in
@@ -257,12 +257,22 @@ def from_file(

    @classmethod
    def segment_from_file(
-        cls, audio_file, target_sr=None, n_segments=0, trim=False, orig_sr=None, channel_selector=None,
+        cls, audio_file, target_sr=None, n_segments=0, trim=False, orig_sr=None, channel_selector=None, offset=None
@XuesongYang, I've added an option to specify a fixed (non-randomized) offset.
The new parameter `offset` defaults to `None`, which, as before, results in a random offset.
nit: `offset` can be enforced as a float type. All negative values mean no offsets, so we could make it default to `-1.0` to specify no offsets. The benefit is that we can have a cleaner type hint.
It seems like an easy add-on to include type hints for this function. Do you mind enforcing that? Thanks.
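The two conventions discussed in this thread can be compared side by side. The sketch below uses hypothetical helper names (neither is part of `AudioSegment`), and it reads "no offsets" in the review comment as "no fixed offset requested", which is an interpretation rather than something the thread states outright.

```python
from typing import Optional


def resolve_offset_optional(offset: Optional[float] = None) -> Optional[float]:
    """Convention in the PR: `None` means no fixed offset (a random one is drawn)."""
    if offset is None:
        return None
    if offset < 0:
        raise ValueError(f"offset must be non-negative, got {offset}")
    return float(offset)


def resolve_offset_sentinel(offset: float = -1.0) -> Optional[float]:
    """Convention suggested in review: any negative value means no fixed offset."""
    return None if offset < 0 else float(offset)
```

The sentinel version keeps the parameter type a plain `float`, at the cost of silently accepting any negative value; the `Optional[float]` version needs a wider type hint but can reject negative offsets as errors.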
This pull request introduces 1 alert when merging 1bff667 into 563cc2f - view on LGTM.com new alerts:
LGTM for the `segment_from_file()` func. Added some comments accordingly. Thanks.
Discussed offline. Approved for my parts; please hold off merging until other folks approve.
]


def load_samples_synchronized(
+1, adding one vote for this comment. This function feels like it is doing more than one thing.
The idea was to split the function itself into separate components, each a private method inside the class, which can be overridden. Having a class that calls a monolithic function defeats the purpose of extensible code.
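The structure suggested here, a class whose public method delegates to small overridable private methods, might be sketched as follows. This is an illustration only: `SynchronizedLoader`, its method names, and the in-memory subclasses are hypothetical and do not reproduce `load_samples_synchronized` from this PR. Segment length is given in samples to keep the example short.

```python
from typing import Dict, List


class SynchronizedLoader:
    """Hypothetical loader split into steps that subclasses can override."""

    def load(self, audio_files: Dict[str, str], segment_len: int) -> Dict[str, List[float]]:
        # Step 1: inspect all files so one start index is valid for every signal.
        lengths = {key: self._get_num_samples(path) for key, path in audio_files.items()}
        # Step 2: choose a single start index shared by all signals.
        start = self._select_start(min(lengths.values()), segment_len)
        # Step 3: read the same segment from every file.
        return {key: self._read_segment(path, start, segment_len) for key, path in audio_files.items()}

    def _get_num_samples(self, path: str) -> int:
        raise NotImplementedError

    def _select_start(self, num_samples: int, segment_len: int) -> int:
        # Default policy: start at the beginning of the file.
        return 0

    def _read_segment(self, path: str, start: int, segment_len: int) -> List[float]:
        raise NotImplementedError


class InMemoryLoader(SynchronizedLoader):
    """Reads "files" from a dict, standing in for real audio I/O."""

    def __init__(self, signals: Dict[str, List[float]]):
        self.signals = signals

    def _get_num_samples(self, path: str) -> int:
        return len(self.signals[path])

    def _read_segment(self, path: str, start: int, segment_len: int) -> List[float]:
        return self.signals[path][start:start + segment_len]


class TailLoader(InMemoryLoader):
    """Overrides only the start policy: take the last `segment_len` samples."""

    def _select_start(self, num_samples: int, segment_len: int) -> int:
        return max(0, num_samples - segment_len)
```

Changing the segment-selection policy then means overriding one method, as `TailLoader` does, without touching file inspection or reading.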
Signed-off-by: Ante Jukić <[email protected]>
The changes look good to me
* AudioToAudio datasets and related test
* Updated doc, created utility function in manifest to avoid code duplication
* Remove unused import
* Moved functionality to ASRAudioProcessor
* Addressed review comments
* Removed unused local variable
Signed-off-by: Ante Jukić <[email protected]>
Signed-off-by: Hainan Xu <[email protected]>
What does this PR do ?
This is a draft PR of datasets for different audio-to-audio tasks.
Datasets
* `BaseAudioDataset`: Abstract base class.
* `AudioToTargetDataset`: A dataset for audio-to-audio tasks where the goal is to use an input signal to recover the corresponding target signal.
* `AudioToTargetWithReferenceDataset`: A dataset for audio-to-audio tasks where the goal is to use an input signal to recover the corresponding target signal, and an additional reference signal is available.
* `AudioToTargetWithEmbeddingDataset`: A dataset for audio-to-audio tasks where the goal is to use an input signal to recover the corresponding target signal, and an additional embedding signal is available. The embedding is assumed to be a vector.
Tests
Multiple tests are implemented in `test_asr_datasets.py` in class `TestAudioDatasets`. These tests include
* `test_audio_to_target_dataset`: multiple tests for `AudioToTargetDataset`
* `test_audio_to_target_dataset_with_target_list`: tests specifically a scenario where the target is provided as a list of files
* `test_audio_to_target_with_reference_dataset`: tests for `AudioToTargetWithReferenceDataset`
* `test_audio_to_target_with_embedding_dataset`: tests for `AudioToTargetWithEmbeddingDataset`
Tests can be started using the following command
Collection: ASR
Changelog
* `audio_to_audio.py`
* `test_asr_datasets.py` and `test_audio_utils.py`
* `AudioSegment.segment_from_file` to use a fixed offset
Usage
Usage is illustrated in unit tests, which can be executed using