tags | ||
---|---|---|
|
Wavenet is an autoregressive CNN-based Text-to-Speech (TTS) model introduced by van den Oord et al. (2016).
The paper is very cryptic and hides a lot of details, mainly how are the individual layers stacked and combined. What we can deduce from the paper is the following:
This means that layer
Wavenet's CNN layer
For prediction Wavenet quantizes amplitudes to 256 levels using uniform ranges in logarithmic scale (as humans perceive loudness). It then uses softmax to classify particular timestamp to 256 different levels of loudness.
Though Wavenet's overall architecture is unclear, there are number of follow-up articles and implementations:
One could fairly simply just infer the architecture from the open source implementation.