---
tags:
  - ml
  - text_to_speech
---

# Wavenet

Wavenet is an autoregressive CNN-based Text-to-Speech (TTS) model introduced by van den Oord et al. (2016).

The paper is quite cryptic and omits many details, mainly how the individual layers are stacked and combined. What we can deduce from the paper is the following:

## Wavenet is a CNN model with 1D causal convolutions

This means that layer $l$ combines information from the current and the preceding (in terms of the sequence) timesteps of the previous layer $l - 1$, never from future timesteps.
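
As an illustration, here is a minimal sketch of how causality is usually enforced in a 1D convolution; PyTorch is my own choice here, not something the paper specifies. The input is padded only on the left, so the output at time $t$ never sees inputs after $t$.

```python
# Minimal sketch of a 1D causal convolution (PyTorch is an assumption, not from the paper).
import torch
import torch.nn.functional as F

kernel_size, dilation = 2, 1
x = torch.randn(1, 1, 16)                 # (batch, channels, time)
weight = torch.randn(1, 1, kernel_size)   # (out_channels, in_channels, kernel)

# Pad only on the left, so output[t] depends on input[..t] but never on the future.
pad = (kernel_size - 1) * dilation
y = F.conv1d(F.pad(x, (pad, 0)), weight, dilation=dilation)
assert y.shape[-1] == x.shape[-1]         # output has the same length as the input
```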

## Wavenet uses dilated convolutions to increase receptive field

Wavenet's CNN layer $l$ uses dilation $2^l$ up to $l = 9$ (dilation $2^9 = 512$), after which the schedule restarts from dilation $2^0$. The purpose of the dilation is to increase the receptive field: because the dilation grows exponentially, after $l$ layers the model sees exponentially many past inputs.
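
For concreteness, here is a small sketch of that dilation schedule and the receptive field it produces. The kernel size of 2 and the number of repeated stacks are assumptions of mine; the paper does not spell them out.

```python
# Dilation schedule sketch (kernel size 2 and 3 stacks are assumptions, not from the paper).
kernel_size = 2
num_stacks = 3
dilations = [2 ** l for l in range(10)] * num_stacks  # 1, 2, 4, ..., 512, then restart

# Each layer adds (kernel_size - 1) * dilation samples to the receptive field.
receptive_field = 1 + sum((kernel_size - 1) * d for d in dilations)
print(dilations[:10])    # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
print(receptive_field)   # 3070 samples with 3 stacks (one stack alone covers 1024)
```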

## Wavenet quantizes amplitudes and uses softmax

For prediction, Wavenet quantizes amplitudes to 256 levels using ranges that are uniform on a logarithmic scale (the $\mu$-law companding transform, matching how humans perceive loudness). It then uses a softmax to classify each timestep into one of the 256 amplitude levels.
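
Here is a minimal sketch of that quantization step, using the $\mu$-law formula from the paper; NumPy and the function/variable names are my own choices for illustration.

```python
# Mu-law companding and 256-level quantization (NumPy and names are assumptions).
import numpy as np

def mu_law_encode(audio, quantization_channels=256):
    """Map a waveform in [-1, 1] to integer class labels in [0, 255]."""
    mu = quantization_channels - 1
    # Logarithmic compression: small amplitudes get finer resolution.
    compressed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    # Shift from [-1, 1] to [0, mu] and round to integer bins.
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)

labels = mu_law_encode(np.array([-1.0, -0.1, 0.0, 0.1, 1.0]))
print(labels)   # one integer class per timestep, in [0, 255]; these are the softmax targets
```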

## Wavenet's overall architecture is unclear

Though Wavenet's overall architecture is unclear, there are a number of follow-up articles and implementations:

One could fairly easily infer the architecture from an open-source implementation.