- Evaluate the model on the validation set; save each resulting embedding together with the corresponding speaker;
- Group the embeddings by speaker;
- Compute the distribution of each embedding (softmax normalization; plot with a seaborn histogram);
- Compute the distances between all possible pairs of distributions, using a distribution distance (e.g. relative entropy / KL divergence), and plot them (seaborn histogram?); see the sketch below.
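A minimal sketch of the steps above, assuming the model exposes an `encode()` call returning codebook indices and the validation loader yields `(audio, speaker)` pairs (neither is the confirmed API). The per-speaker distribution is taken here as the normalized histogram of codebook usage, and Jensen-Shannon distance stands in for the distribution distance:

```python
# Minimal sketch: per-speaker codebook-usage distributions and their pairwise distances.
# `model.encode`, `val_loader` and the (audio, speaker) format are assumptions,
# not the actual API of this repo.
from collections import defaultdict

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.spatial.distance import jensenshannon

NUM_CODEBOOK_VECTORS = 29  # as in the note below

# 1) Collect the codebook indices produced for each speaker.
indices_by_speaker = defaultdict(list)
for wav, speaker in val_loader:                # assumed: (audio, speaker_id) pairs
    codes = model.encode(wav)                  # assumed: returns codebook indices
    indices_by_speaker[speaker].extend(np.asarray(codes).ravel().tolist())

# 2) One usage distribution per speaker (normalized histogram over codebook indices).
dist_by_speaker = {}
for speaker, idxs in indices_by_speaker.items():
    counts = np.bincount(idxs, minlength=NUM_CODEBOOK_VECTORS).astype(float)
    dist_by_speaker[speaker] = counts / counts.sum()

# 3) Pairwise distribution distances (Jensen-Shannon here; swap in KL if preferred).
speakers = list(dist_by_speaker)
distances = [
    jensenshannon(dist_by_speaker[a], dist_by_speaker[b])
    for i, a in enumerate(speakers)
    for b in speakers[i + 1:]
]

# 4) Plot the distribution of pairwise distances.
sns.histplot(distances)
plt.xlabel("Jensen-Shannon distance between per-speaker distributions")
plt.show()
```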
-> Check if the resulting distributions are close (in entropy) to the uniform distribution, with and without speaker embedding (i.e. redo all the previous steps using the speaker embedding as well); a sketch of this check follows below.
-> Because 29 codebook vectors compress the data very heavily, the representation is likely already speaker independent. We can check that by increasing the number of embedding vectors (the higher the count, the finer the representation).
Increase the number of embedding vectors exponentially and check whether the mapping accuracy increases at least linearly (it should).
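A short sketch of the uniformity check, reusing `dist_by_speaker` and `NUM_CODEBOOK_VECTORS` from the sketch above: the entropy of each per-speaker distribution is compared against log(K), the entropy of the uniform distribution over K codes.

```python
# Sketch: how close is each per-speaker usage distribution to uniform?
import numpy as np
from scipy.stats import entropy

max_entropy = np.log(NUM_CODEBOOK_VECTORS)   # entropy of the uniform distribution over K codes
for speaker, p in dist_by_speaker.items():
    # A ratio close to 1.0 means near-uniform codebook usage for that speaker.
    print(f"{speaker}: normalized entropy = {entropy(p) / max_entropy:.3f}")
```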
Goal: Check if each vector in the codebook corresponds to a specific phoneme.
- Make a 2D projection of the embedding vectors using umap;
- Plot the projected vectors as points, labelling each point with the most probable phoneme for its codebook index, obtained with the many-to-one mapping strategy;
- Map each ground-truth phoneme to its own plot marker (how?); see the sketch below.
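A sketch of the projection plot, assuming `codebook` is the (K, dim) array of embedding vectors and `index_to_phoneme` is the many-to-one mapping from codebook index to its most probable phoneme (both names are placeholders):

```python
# Sketch: 2D UMAP projection of the codebook, coloured and marked by most probable phoneme.
# `codebook` (K x dim array) and `index_to_phoneme` (dict: index -> phoneme) are assumed.
import umap
import seaborn as sns
import matplotlib.pyplot as plt

reducer = umap.UMAP(n_components=2, random_state=42)
projection = reducer.fit_transform(codebook)           # shape (K, 2)

phoneme_labels = [index_to_phoneme[i] for i in range(len(codebook))]

# hue gives one colour per phoneme; style assigns one marker per phoneme,
# which is one possible answer to the "(how?)" above.
sns.scatterplot(x=projection[:, 0], y=projection[:, 1],
                hue=phoneme_labels, style=phoneme_labels)
plt.title("UMAP projection of codebook vectors (most probable phoneme per index)")
plt.show()
```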