A more featureful API #16
Here are a couple of my initial thoughts. I think one feature that any reasonable API will require is a fast way to go from word (to index) to vector. Currently this mapping is implicit, but walking through the word list for every vector lookup would be far too slow. The simplest way to do this would be to just add a
More generally, I think that it would be best to keep the interface minimal and "unopinionated." Embeddings.jl already seems to embrace this. I think that a
julia> w2v = load_embeddings(Word2Vec{:en}, args...)
julia> word_vector(w2v, "the")
300-element Array{Float64,1}:
0.1
0.2
...
julia> word_vector(w2v, "notfound")
Error: "notfound" not in vocabulary
julia> oov_vector = zeros(size(w2v.embedding_matrix, 1))
300-element Array{Float64,1}:
0.0
0.0
julia> word_vector(w2v, "notfound", oov_vector)
300-element Array{Float64,1}:
0.0
0.0
...
julia> word_vector(() -> randn(300), w2v, "notfound")
300-element Array{Float64,1}:
-1.784924003687886
1.3056642179917306
# implementing that method for each type would also permit this syntax:
julia> vec = word_vector(w2v, "notfound") do
           # do some fancy calculation
           return vector
       end
I believe an approach like this would work well alongside system-specific needs of things like OOV interpolation as well. Maybe the default behavior for |
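A minimal sketch of how the lookup and the three `word_vector` methods proposed above might fit together. The `IndexedEmbeddings` type and its field names are hypothetical and are not part of Embeddings.jl. The sketch assumes one embedding column per vocabulary word and builds the word-to-column `Dict` once, so each lookup avoids a scan of the word list.

```julia
# Hypothetical wrapper type, for illustration only (not Embeddings.jl API).
struct IndexedEmbeddings{T<:AbstractFloat}
    embeddings::Matrix{T}      # one column per word
    vocab::Vector{String}
    index::Dict{String,Int}    # word => column number
end

# Build the word => column lookup once, at construction time.
function IndexedEmbeddings(embeddings::Matrix{T}, vocab::Vector{String}) where {T<:AbstractFloat}
    size(embeddings, 2) == length(vocab) ||
        throw(DimensionMismatch("need one embedding column per vocabulary word"))
    index = Dict{String,Int}(word => i for (i, word) in enumerate(vocab))
    return IndexedEmbeddings{T}(embeddings, vocab, index)
end

# Strict lookup: throws a KeyError for out-of-vocabulary words.
word_vector(emb::IndexedEmbeddings, word::AbstractString) =
    emb.embeddings[:, emb.index[word]]

# Lookup with a caller-supplied fallback vector for out-of-vocabulary words.
word_vector(emb::IndexedEmbeddings, word::AbstractString, default::AbstractVector) =
    haskey(emb.index, word) ? word_vector(emb, word) : default

# Lookup with a lazily computed fallback; taking the function as the first
# argument is what makes `do` block syntax work.
function word_vector(f::Function, emb::IndexedEmbeddings, word::AbstractString)
    i = get(emb.index, word, 0)
    return i == 0 ? f() : emb.embeddings[:, i]
end

# Usage with a toy 2-dimensional, 2-word vocabulary.
emb = IndexedEmbeddings([0.1 0.3; 0.2 0.4], ["the", "cat"])
in_vocab = word_vector(emb, "cat")                    # second column
fallback = word_vector(emb, "notfound", zeros(2))     # the supplied default
lazy = word_vector(emb, "notfound") do
    randn(2)                                          # only runs when the word is missing
end
```

Keeping the fallback as an argument, rather than a setting stored in the type, leaves the OOV policy with the caller, which matches the "unopinionated" goal described above.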
Some very scattered thoughts: What I currently do, when using this in the wild, is to use MLLabelUtils
See in When working with this in the something, Two advantages: point 1) is the indexing with Ints can be performed trivially in all systems. E.g. on point 1) But in, e.g. Flux, On point 2. More thoughts We can do dispatches for both And it is probably fine to, if given a string, automatically convert it to a vocab index. We also have the ability to overload both call syntax and indexing syntax. |
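A rough sketch of the dispatch idea in this comment, assuming "both" refers to integer indices and strings. The `VocabEmbeddings` type is hypothetical and not part of Embeddings.jl. Integer keys index columns directly, string keys are first converted to a vocabulary index, and call syntax simply forwards to indexing.

```julia
# Hypothetical type, for illustration only (not Embeddings.jl API).
struct VocabEmbeddings
    embeddings::Matrix{Float32}   # one column per word
    index::Dict{String,Int}       # word => column number
end

# Integer indices select columns directly; a vector of indices selects several.
Base.getindex(e::VocabEmbeddings, i::Integer) = e.embeddings[:, i]
Base.getindex(e::VocabEmbeddings, is::AbstractVector{<:Integer}) = e.embeddings[:, is]

# Strings are converted to a vocabulary index first, then indexed as above.
Base.getindex(e::VocabEmbeddings, word::AbstractString) = e[e.index[word]]
Base.getindex(e::VocabEmbeddings, words::AbstractVector{<:AbstractString}) =
    e[[e.index[w] for w in words]]

# Call syntax forwards to indexing, so e("cat") and e["cat"] are equivalent.
(e::VocabEmbeddings)(key) = e[key]

# Usage with a toy 2-dimensional, 2-word vocabulary.
e = VocabEmbeddings(Float32[1 3; 2 4], Dict("the" => 1, "cat" => 2))
by_word = e["cat"]              # column 2
by_index = e(2)                 # same column, via call syntax
batch = e[["cat", "the"]]       # 2x2 matrix, columns in the requested order
```

Because the integer path never touches the `Dict`, a model can convert its text to indices once and then look up vectors by index only.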
Any thoughts on OOV interpolation? I was thinking of StringDistances.jl but it may be too slow. Regarding the embeddings format, the simplest data structure possible is preferable in my case (i.e. |
When I say OOV interpolation, I don't mean just finding the nearest word according to Levenshtein distance. OOV interpolation is a problem that needs to be solved in the numerical embedding space. Preprocessing to correct misspellings is beyond the scope of this package. |
Makes sense, thank you. Any plans for including that in Embeddings.jl? |
FastText interpolation? |
Yes, completely agreed on all points. MLDataUtils is definitely great, and does this well. My feeling is that because this word -> int conversion is so fundamental to using word embeddings, it probably makes sense for Embeddings.jl to provide a lightweight way to do it. If this functionality does become part of Embeddings.jl, I'd expect it to work like that. Regarding using Embeddings.jl alongside TensorFlow.jl or Flux, yes, I think it makes sense to have any API be generic and flexible enough to work well with both.
Maybe I miss your point, but I think Flux's I guess to put it another way, I'd like to try to come up with a minimal, flexible set of methods that can work easily with e.g. TensorFlow.jl, Flux.jl, and/or other packages like Distances.jl without actually needing to know anything at all about their implementation details. Probably a good way to figure this out would be to actually write a few models that need to do these things, see which parts are easy and which parts are hard, and then look at how Embeddings.jl could make the hard parts easier. |
I've actually had these ideas too! I'm also not totally sure about it, but I do think I like it. |
I didn't have much of a point, just putting it out there. I think for flexibility, Then if we have different types, they can internally use different representations |
Shifting discussion from #14