Add initial doc for text_normalization
Tuan Lai committed Jun 28, 2021 (1 parent 4c53304, commit 86c8cb4)
Showing 1 changed file with 15 additions and 0 deletions.
.. _text_normalization:

Text Normalization Models
==========================
Text normalization is the task of converting written text into its spoken form. For example,
``$123`` should be verbalized as ``one hundred twenty three dollars``, while ``123 King Ave``
should be verbalized as ``one twenty three King Avenue``. Text normalization is typically used as
a pre-processing step for a range of speech applications such as text-to-speech synthesis (TTS).
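The two examples above show that the same digits can be read differently depending on context. The following is a toy sketch of that idea using hand-written rules; it is not how the models described here work, and every name in it (``cardinal``, ``street_number``, and the "oh" reading for a zero tens digit) is an assumption made for illustration.

```python
# Toy illustration only: hand-written rules for numbers below 1000.
# All function names here are hypothetical and not part of any library.
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Verbalize an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def cardinal(n):
    """Verbalize an integer in the range 0..999 as a cardinal number."""
    hundreds, rest = divmod(n, 100)
    if hundreds == 0:
        return two_digits(rest)
    words = f"{ONES[hundreds]} hundred"
    return words if rest == 0 else f"{words} {two_digits(rest)}"


def street_number(n):
    """Verbalize an integer in the range 0..999 the way a street number is often read."""
    hundreds, rest = divmod(n, 100)
    if hundreds == 0 or rest == 0:
        return cardinal(n)
    if rest < 10:
        # Assumed convention: a zero tens digit is read as "oh".
        return f"{ONES[hundreds]} oh {ONES[rest]}"
    return f"{ONES[hundreds]} {two_digits(rest)}"


money = f"{cardinal(123)} dollars"
address = f"{street_number(123)} King Avenue"
print(money)
print(address)
```

Rules like these quickly become unwieldy as more contexts are added, which is one reason the task is commonly treated as a modeling problem.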

Data format
------------------

The data must be stored in tab-separated files (``.tsv``) with three columns: the first is the
"semiotic class", the second is the input token, and the third is the output. One example is
the dataset used in the `Google Text Normalization Challenge <https://www.kaggle.com/google-nlu/text-normalization>`_.
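As a minimal sketch of the three-column layout described above, the snippet below parses tab-separated rows with Python's standard ``csv`` module. The sample rows, the class labels in them, and the helper name ``read_rows`` are assumptions for illustration; they are not taken from the challenge dataset or from any loader in this repository.

```python
import csv
import io

# Hypothetical sample data: semiotic class, input token, output.
SAMPLE_TSV = (
    "MONEY\t$123\tone hundred twenty three dollars\n"
    "PLAIN\tKing\tKing\n"
)


def read_rows(handle):
    """Read (semiotic_class, input_token, output) triples from a TSV stream."""
    # QUOTE_NONE keeps quote characters inside tokens as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"line {line_no}: expected 3 columns, got {len(row)}")
        rows.append(tuple(row))
    return rows


rows = read_rows(io.StringIO(SAMPLE_TSV))
for semiotic_class, token, output in rows:
    print(semiotic_class, token, output, sep=" | ")
```

To read from disk instead, the same function can be given a file opened with ``open(path, encoding="utf-8", newline="")``, where ``path`` points to a ``.tsv`` file.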