---
tags:
  - ml
  - text_to_speech
---

# Wavenet

Wavenet is an autoregressive CNN-based Text-to-Speech (TTS) model introduced by van den Oord et al. (2016).

The paper is quite cryptic and omits many details, mainly how the individual layers are stacked and combined. What we can deduce from the paper is the following:

## Wavenet is a CNN model with 1D causal convolutions

This means that layer $l$ combines information from the current and the preceding (in terms of the sequence) timesteps of the previous layer $l - 1$, never from future timesteps.
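
As an illustration, here is a minimal sketch of how causality is usually enforced in a 1D convolution; PyTorch is my own choice here, not something the paper specifies. The input is padded only on the left, so the output at time $t$ never sees inputs after $t$.

```python
# Minimal sketch of a 1D causal convolution (PyTorch is an assumption, not from the paper).
import torch
import torch.nn.functional as F

kernel_size, dilation = 2, 1
x = torch.randn(1, 1, 16)                 # (batch, channels, time)
weight = torch.randn(1, 1, kernel_size)   # (out_channels, in_channels, kernel)

# Pad only on the left, so output[t] depends on input[..t] but never on the future.
pad = (kernel_size - 1) * dilation
y = F.conv1d(F.pad(x, (pad, 0)), weight, dilation=dilation)
assert y.shape[-1] == x.shape[-1]         # output has the same length as the input
```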

## Wavenet uses dilated convolutions to increase receptive field

Wavenet's CNN layer $l$ uses dilation $2^l$ up to $l = 9$ (dilation $2^9 = 512$), after which the schedule restarts from dilation $2^0$. The purpose of the dilation is to increase the receptive field: because the dilation grows exponentially, after $l$ layers the model sees exponentially many past inputs.
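
For concreteness, here is a small sketch of that dilation schedule and the receptive field it produces. The kernel size of 2 and the number of repeated stacks are assumptions of mine; the paper does not spell them out.

```python
# Dilation schedule sketch (kernel size 2 and 3 stacks are assumptions, not from the paper).
kernel_size = 2
num_stacks = 3
dilations = [2 ** l for l in range(10)] * num_stacks  # 1, 2, 4, ..., 512, then restart

# Each layer adds (kernel_size - 1) * dilation samples to the receptive field.
receptive_field = 1 + sum((kernel_size - 1) * d for d in dilations)
print(dilations[:10])    # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
print(receptive_field)   # 3070 samples with 3 stacks (one stack alone covers 1024)
```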

## Wavenet quantizes amplitudes and uses softmax

For prediction, Wavenet quantizes amplitudes to 256 levels using ranges that are uniform on a logarithmic scale (the $\mu$-law companding transform, matching how humans perceive loudness). It then uses a softmax to classify each timestep into one of the 256 amplitude levels.
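
Here is a minimal sketch of that quantization step, using the $\mu$-law formula from the paper; NumPy and the function/variable names are my own choices for illustration.

```python
# Mu-law companding and 256-level quantization (NumPy and names are assumptions).
import numpy as np

def mu_law_encode(audio, quantization_channels=256):
    """Map a waveform in [-1, 1] to integer class labels in [0, 255]."""
    mu = quantization_channels - 1
    # Logarithmic compression: small amplitudes get finer resolution.
    compressed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    # Shift from [-1, 1] to [0, mu] and round to integer bins.
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)

labels = mu_law_encode(np.array([-1.0, -0.1, 0.0, 0.1, 1.0]))
print(labels)   # one integer class per timestep, in [0, 255]; these are the softmax targets
```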

## Wavenet's overall architecture is unclear

Though Wavenet's overall architecture is unclear, there are a number of follow-up articles and implementations:

One could fairly easily infer the architecture from an open-source implementation.