Hello, I've seen that using PCA is one of the biggest advantages. I'm just wondering where exactly PCA is applied. I initially thought you were applying PCA over the output dimension of a decently large dataset, but that seems not to be the case (since you can apply this to arbitrary models). The only other place I can think of is the token embeddings themselves, but then the model dimensions would mismatch. TIA
Replies: 1 comment
Hi @sachinruk, we apply PCA directly to the static token embeddings you get by forward-passing the vocabulary. The relevant line can be found here. So essentially, we first forward pass every token, which gives you a (vocab_size, dim_size) embedding matrix (where dim_size is the dimensionality of the model you are distilling), and then we apply PCA to those embeddings, which gives you a (vocab_size, pca_dims) output embedding matrix. Hope that answers your question!
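For reference, here's a minimal sketch of that pipeline in plain PyTorch + scikit-learn. It's illustrative only, not the library's actual implementation: the model name, batching, length-1 token sequences, and the pca_dims value are placeholder assumptions.

```python
# Sketch of the step described above, using a generic Hugging Face encoder.
# Illustrative only; model name, batch size, and pca_dims are placeholder choices.
import torch
from sklearn.decomposition import PCA
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # any encoder you want to distill
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

# 1. Forward pass every token in the vocabulary to get static embeddings.
#    Each token is fed as a length-1 sequence here for simplicity; the result
#    is a (vocab_size, dim_size) matrix.
token_ids = torch.arange(tokenizer.vocab_size)
chunks = []
with torch.no_grad():
    for batch in torch.split(token_ids, 256):
        out = model(input_ids=batch.unsqueeze(1))         # (batch, 1, dim_size)
        chunks.append(out.last_hidden_state.squeeze(1))   # (batch, dim_size)
embeddings = torch.cat(chunks).numpy()                    # (vocab_size, dim_size)

# 2. Apply PCA to reduce dim_size down to pca_dims.
pca = PCA(n_components=256)                               # pca_dims = 256, chosen arbitrarily
output_embeddings = pca.fit_transform(embeddings)         # (vocab_size, pca_dims)
print(output_embeddings.shape)
```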