
Commit

Pcla tutorial fixes (NVIDIA#5271) (NVIDIA#5273)
* Fixed typos

Signed-off-by: Matvei Novikov <[email protected]>

* Fixed cell type and tatoeba reference

Signed-off-by: Matvei Novikov <[email protected]>

* Fixed typo

Signed-off-by: Matvei Novikov <[email protected]>

* Fixed branch variable

Signed-off-by: Matvei Novikov <[email protected]>

Signed-off-by: Matvei Novikov <[email protected]>

Signed-off-by: Matvei Novikov <[email protected]>
Co-authored-by: Matvei Novikov <[email protected]>
Signed-off-by: Hainan Xu <[email protected]>
2 people authored and Hainan Xu committed Nov 29, 2022
1 parent 37f36da commit 7ea98b0
Showing 1 changed file with 12 additions and 11 deletions: tutorials/nlp/Punctuation_and_Capitalization_Lexical_Audio.ipynb
@@ -99,7 +99,7 @@
"- whether the word should be capitalized\n",
"\n",
"\n",
-    "In some cases lexical only model can't predict punctutation correctly without audio. It is especially hard for conversational speech.\n",
+    "In some cases lexical only model can't predict punctuation correctly without audio. It is especially hard for conversational speech.\n",
"\n",
"For example:\n",
"\n",
@@ -119,7 +119,7 @@
"## Architecture\n",
"Punctuation and capitalization lexical audio model is based on [Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech](https://arxiv.org/pdf/2008.00702.pdf). Model consists of lexical encoder (BERT-like model), acoustic encoder (i.e. Conformer's audio encoder), fusion of lexical and audio features (attention based fusion) and prediction layers.\n",
"\n",
-    "Fusion is needed because encoded text and audio might have different length therfore can't be alligned one-to-one. As model predicts punctuation and capitalization per text token we use cross-attention between encoded lexical and encoded audio input."
+    "Fusion is needed because encoded text and audio might have different length therefore can't be aligned one-to-one. As model predicts punctuation and capitalization per text token we use cross-attention between encoded lexical and encoded audio input."
]
},
{
@@ -279,22 +279,23 @@
]
},
{
-      "cell_type": "markdown",
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "pycharm": {
+          "name": "#%% md\n"
+        }
+      },
+      "outputs": [],
       "source": [
-        "## download get_tatoeba_data.py script to download and preprocess the Tatoeba data\n",
+        "## download get_libritts_data.py script to download and preprocess the LibriTTS data\n",
         "os.makedirs(WORK_DIR, exist_ok=True)\n",
         "if not os.path.exists(WORK_DIR + '/get_libritts_data.py'):\n",
         "    print('Downloading get_libritts_data.py...')\n",
         "    wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/token_classification/data/get_libritts_data.py', WORK_DIR)\n",
         "else:\n",
         "    print ('get_libritts_data.py already exists')"
-      ],
-      "metadata": {
-        "collapsed": false,
-        "pycharm": {
-          "name": "#%% md\n"
-        }
-      }
+      ]
},
{
"cell_type": "code",
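The attention-based fusion described in the tutorial's architecture hunk above (text tokens attending to encoded audio frames, so the model still predicts punctuation and capitalization per text token) can be sketched roughly as follows. This is an illustrative single-head NumPy sketch, not NeMo's actual implementation; all function names, shapes, and dimensions here are made up for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fusion(text_feats, audio_feats):
    """Fuse audio features into each text token via cross-attention.

    text_feats:  (T_text, d)  encoded subword tokens (queries)
    audio_feats: (T_audio, d) encoded audio frames (keys/values)
    Returns:     (T_text, 2*d) per-token concatenation of lexical
                 features and attention-pooled audio features.
    """
    d = text_feats.shape[-1]
    # Scaled dot-product scores: one row of audio weights per text token.
    scores = text_feats @ audio_feats.T / np.sqrt(d)   # (T_text, T_audio)
    weights = softmax(scores, axis=-1)
    # Audio summary aligned one-to-one with text tokens, despite T_text != T_audio.
    fused_audio = weights @ audio_feats                # (T_text, d)
    return np.concatenate([text_feats, fused_audio], axis=-1)

rng = np.random.default_rng(0)
text = rng.normal(size=(7, 16))    # 7 subword tokens, hidden size 16
audio = rng.normal(size=(50, 16))  # 50 audio frames
fused = cross_attention_fusion(text, audio)
print(fused.shape)  # (7, 32)
```

The key point the tutorial makes is visible in the shapes: however many audio frames there are, the fused output has one row per text token, so token-level punctuation and capitalization heads can be applied directly on top of it.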
