PyTorch feature parity #1431
Comments
Do you mind if I try to implement the support in Flux corresponding to |
yes please, essentially this is all up for grabs |
Note that we shouldn't add all these layers here, for eg pixel shuffle has an implementation, so does Transformers, up sampling and embedding are direct Julia operations etc |
Maybe @chengchingwen could provide some suggestions on the last two items |
@CarloLucibello About ONNX: This exists which I guess (hope?) is better than nothing: https://github.com/DrChainsaw/ONNXmutable.jl I haven't registered it because 1) the name sucks and I can't think of anything better and 2) I'm thinking of splitting the import and export into two separate packages. 1 is the main blocker though :) I'd be happy to donate it to FluxML, or parts of it (e.g. import/export primitives). |
Yeah, upsampling is non-trivial to get right and be performant on the GPU as well (last time I tried it, I had to ask in

For ONNX, is it possible to hand control of ONNX.jl to @DrChainsaw? It seems like ONNXmutable.jl should really supersede that package.

For vision models, there is this Metalhead PR which I think will bring us much closer to PyTorch parity. I am planning on training some of the simpler ones this weekend, but I would appreciate help adding pre-trained weights from anyone with a GPU.

Lastly, for hyperparameter/learning rate schedules, I just started ParameterSchedulers.jl to break the functionality out of FluxTraining.jl. This is quite a simple package, and I want to finish it this weekend for a project. I am happy to transfer ownership to FluxML. |
I tried implementing WeightNorm before, but it's harder than I thought without doing a per-layer implementation. See #1005 |
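To illustrate what a per-layer approach looks like, here is a minimal, hypothetical sketch of a weight-normalised dense layer (w = g · v/‖v‖, as in Salimans & Kingma); the name `WeightNormDense` and its fields are illustrative only, not an existing Flux API, and a layer-agnostic wrapper is exactly what makes #1005 hard:

```julia
using Flux

# Hypothetical weight-normalised dense layer: w = g .* v ./ ‖v‖ (row-wise).
struct WeightNormDense{M, V, B, F}
    v::M   # direction parameters (out × in)
    g::V   # per-output-row magnitude (length out)
    b::B   # bias
    σ::F   # activation
end

Flux.@functor WeightNormDense

function WeightNormDense(in::Integer, out::Integer, σ = identity)
    v = Flux.glorot_uniform(out, in)
    g = vec(sqrt.(sum(abs2, v; dims = 2)))   # initialise so that w == v at first
    WeightNormDense(v, g, zeros(Float32, out), σ)
end

function (l::WeightNormDense)(x)
    w = l.g .* l.v ./ sqrt.(sum(abs2, l.v; dims = 2))   # re-normalise rows on each call
    l.σ.(w * x .+ l.b)
end
```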
@DrChainsaw what are the limitations of ONNXmutable? |
@CarloLucibello From the ML-Coordination issue it seems like there are a lot of ways to look at ONNX import/export, so what counts as a limitation appears to be a bit more subjective than I thought. Here are some things I can think of:

- Only a subset of OPs is supported. This is IMO not a big deal, as I have made an effort for it to be easy to add more, and even easy for users to just hack in their own versions locally. Most OPs are trivial to add, but I have intentionally not added more than what I happen to need, in hopes that it would encourage contribution.
- It has capabilities which perhaps only a small subset of users have use for w.r.t. model manipulation. This translates to dependencies like JuMP and Cbc (used to solve the problem of keeping all parameter shapes aligned when changing the model structure) as well as metadata used to formulate the shape constraints. This may appear as bloat to users who only want to import a model and use it.
- RNNs are currently a bit limited, although this is more on Flux than on ONNXmutable, since ONNX wants RNNs to have 3D input while Flux wants 2D (in a loop). I have worked around this to some extent by just changing the model shape to 3D if a recurrent layer is found and then folding the time dimension into the batch dimension if a Dense layer is encountered. This only works for a few model architecture types though.
- Exporting functionality can't handle 1) non-primitive functions with type constraints (e.g.
- Ecosystem-wise it would be better to refactor at least the export primitives to use NNlib, as that would make them usable from other libraries which use NNlib (KNet, Avalon etc.). Perhaps not so much a limitation in itself, though, and it can always be broken out later down the road. For export there is no limit on how many ways one can choose to translate a Julia function to an ONNX node.

Btw, I think it would be better to try to remove ONNX.jl from the general registry and use a name like OnnxFlux.jl to clearly state that it translates between ONNX and Flux. |
Unfortunately we can't remove packages from the registry. But if ONNXFlux.jl makes more sense, then we can just archive the ONNX.jl repo. |
I don't think it's unreasonable to expect anyone looking to use transformer layers to use Transformers.jl. One potential reason for Torch to add them is because there is no canonical library for transformers in that ecosystem (or really for any other domain...). RE ONNX, why not give that repo name over to ONNXMutable and then consider how best to refactor/reorganize? I highly doubt anyone is using the existing functionality, given that it's broken on most recent versions of Julia that Flux supports. RE XLA, I presume this is covered by the work Keno and Tim are doing? Not sure if there's a link to any details there. |
Regarding embeddings, although I haven't dealt with the potential caveats from weight norm and such, are there challenges I'm overlooking compared to doing fairly trivial matrix indexing? Example:

```julia
using Flux: @functor, glorot_uniform

struct Embed{T}
    w::T
end

@functor Embed

Embed(in::Integer, out::Integer; initW = glorot_uniform) = Embed(initW(out, in))

(m::Embed)(x::AbstractVector) = m.w[:, x]   # column lookup: one embedding per index
```
|
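For what it's worth, a quick usage sketch of the above (the sizes are purely illustrative):

```julia
emb = Embed(10_000, 128)   # 10k-entry vocabulary, 128-dimensional embeddings
emb([1, 5, 42])            # 128×3 matrix: one embedding column per index
```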
Something else I'd like to submit for consideration is an equivalent to the upcoming LazyModuleMixin. Not a 1-1 port, but some mechanism to avoid specifying intermediate sizes during model construction. |
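For reference, the OP's checklist now tracks this under `@autosize` / #2078; roughly, the idea is to infer intermediate sizes from a sample input size, e.g. (a sketch assuming the `@autosize` macro from that PR):

```julia
using Flux

# Intermediate layer sizes are inferred from the input size (28×28×1 images, batch of 32);
# each `_` is filled in by the macro.
model = Flux.@autosize (28, 28, 1, 32) Chain(
    Conv((3, 3), _ => 16, relu),
    Flux.flatten,
    Dense(_ => 10),
)
```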
Are Embeddings something of general utility besides Transformers, worth moving to Flux.jl? |
That |
Flux is lacking attention modules. That would be good to have (and PyTorch does have it). |
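As a reference point, scaled dot-product attention itself is only a few lines in Julia; here is a minimal single-head, unbatched sketch (just an illustration of the core operation, not the MultiHeadAttention API that later landed via #2146):

```julia
using NNlib: softmax

# q, k, v are (features × sequence-length) matrices.
function dot_product_attention(q, k, v)
    d = size(q, 1)
    α = softmax((k' * q) ./ sqrt(d); dims = 1)   # attention weights over the keys
    v * α                                         # (features × query-length) output
end
```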
Note that there's also a very similar implementation in ScatterNNlib (gather, scatter, their gradients). It would be great to have them in NNlib and CUDA so other packages (like Avalon of my own) could use them. |
I recently used this approach for embedding and can confirm good performance on GPU; maybe there's been a recent improvement in CUDA.jl explaining why it doesn't resort to scalar operations. A benchmark against Transformers.jl would be interesting though. |
I would want to help on that if possible, I'm not really sure of the process of it though. I do have access to a GPU (GTX 1080), let me know if I can be of any help on that. I'll try to figure out the procedure for that. |
I think something that needs to be mentioned together with Embedding is the one-hot encoding implementation. The problem with Embedding/OneHotEncoding is to maintain semantics and composability without hurting performance on GPU. Currently the implementation of

I do think they are worth moving to Flux/NNlib, but there are some questions that need to be discussed. The semantics of |
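To make the composability point concrete, a small sketch using Flux's existing one-hot types (the sizes are illustrative):

```julia
using Flux: onehotbatch

W = rand(Float32, 128, 10_000)       # embedding table (features × vocab)
idx = [1, 5, 42]
oh = onehotbatch(idx, 1:10_000)      # lazy one-hot matrix, no dense 10_000×3 allocation
W * oh == W[:, idx]                  # true: the product specializes to column indexing
```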
@CarloLucibello I would like to add Einstein summation and tensor product to the discussion list. They are quite useful in some novel model design. |
I added them as covered by Tullio.jl. Possibly we just have to add references and examples in Flux. |
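For example, the kind of reference/example that could go in the Flux docs (assuming Tullio.jl as the einsum provider):

```julia
using Tullio

A, B = rand(3, 4), rand(4, 5)
@tullio C[i, j] := A[i, k] * B[k, j]   # Einstein summation: matrix multiplication
x, y = rand(4), rand(5)
@tullio T[i, j] := x[i] * y[j]         # tensor (outer) product
```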
@chengchingwen could you open an issue here about OneHotVector's limitations? |
I think the issue with ONNX implementations in general isn't writing the package initially, but the additional ops that need to be added regularly. We need a solution to that problem, which is more pressing IMO. I agree we need more attention modules. I would want to gather the related issues with upsampling @CarloLucibello https://github.com/FluxML/NNlib.jl/pull/112/files |
That makes sense. Did we reexport those functions in Flux? |
yes, we have |
I'm not sure if the top post can be made into a wiki or something, but barring that, here are some updates to keep this going:
|
Updated the OP with @ToucheSir's comments.
Since there has been some request, and it is what we do with basically everything else, I think we should do it. |
Apologies if I am slightly late. I went through the discussion and the tracker and want to implement the FractionalMaxPooling layer. Can someone please let me know if it is already implemented? Otherwise, I would love to work on it. Thanks! |
If it's not in the source, it's not implemented ;) Feel free to file a PR, but note that there may be some discussion required on the exact interface for this once you do. |
Thank You!
|
The pooling layer should ideally reuse as much code from NNlib as possible, and the layer can be in Flux. We would expect the layer to be a generalization of the maxpool layer in NNlib to accept real input in addition to integers, so the new layer would serve to generate the fractional sections, use pooling in those sections, and combine the resulting array. Ideally it would be done with minimal changes to NNlib. |
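A purely illustrative sketch of that idea (the name `FractionalMaxPool` and its interface are hypothetical; this loop-based version uses deterministic section boundaries and ignores AD/GPU friendliness, which a real PR would get by reusing NNlib's pooling kernels):

```julia
# Assumes 4-D input (W, H, C, N) and a pooling ratio > 1.
struct FractionalMaxPool{T<:Real}
    ratio::T   # e.g. 1.5 means the output is ≈ input_size / 1.5
end

# Split 1:n into m contiguous sections whose widths differ by at most one,
# so the average section width is the fractional value n/m.
function fractional_sections(n::Integer, m::Integer)
    edges = round.(Int, range(0, n; length = m + 1))
    [edges[i]+1:edges[i+1] for i in 1:m]
end

function (p::FractionalMaxPool)(x::AbstractArray{<:Real,4})
    W, H, C, N = size(x)
    Wo, Ho = floor(Int, W / p.ratio), floor(Int, H / p.ratio)
    ws, hs = fractional_sections(W, Wo), fractional_sections(H, Ho)
    y = similar(x, Wo, Ho, C, N)
    for (j, hr) in enumerate(hs), (i, wr) in enumerate(ws)
        # max over each fractional spatial section, keeping channel and batch dims
        y[i, j, :, :] .= dropdims(maximum(view(x, wr, hr, :, :); dims = (1, 2)); dims = (1, 2))
    end
    y
end
```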
As far as I know there is no negative log likelihood loss in https://fluxml.ai/Flux.jl/stable/models/losses/ |
Pytorch et al need these constructs because they require users to use custom intrinsics that their own codebases can understand (eg |
(BTW that naming convention in PyTorch makes no sense, since it refers to a specific distributional assumption) |
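For reference, given log-probabilities (e.g. from `logsoftmax`) and integer class labels, the PyTorch-style NLL reduction is just indexing plus a mean; a minimal sketch, not an existing Flux API (Flux's `crossentropy`/`logitcrossentropy` cover the usual classification losses):

```julia
using Statistics: mean

# logp: (classes × batch) log-probabilities; y: vector of integer class indices.
nll_loss(logp::AbstractMatrix, y::AbstractVector{<:Integer}) =
    -mean(logp[CartesianIndex.(y, axes(logp, 2))])
```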
For quantization, maybe https://github.com/google/qkeras is a good reference? Has there been any progress since this issue was opened? |
PyTorch has weight normalization; this would be good to add to the normalization section. |
|
not much progress on that front. In any case, this issue just tracks pytorch's features, not those exposed by specialized libraries (although having other references is good). |
Aren't those already covered by |
@ToucheSir Aah yes, my bad, switching back and forth between Tf and pytorch hasn't been going well :) |
A short question: are there any plans to implement sparse convolutional layers? I found the following pytorch implementation referencing this article |
A list of PyTorch 1.7 features.
Items are checked if we have something more or less equivalent in Flux or in the julia ecosystem and supported by Flux.
This list is not complete, it comes from a rough scan of pytorch's documentation. Please feel free to add anything I missed in the comments, and whoever has write access to modify the list.
Related issue https://github.com/FluxML/ML-Coordination-Tracker/issues/16, and more generally anything in https://github.com/FluxML/ML-Coordination-Tracker/issues
Pytorch Features

Conv Layers
- Conv1d, Conv2d, Conv3d
- ConvTranspose1d, ConvTranspose2d, ConvTranspose3d
- Fold, Unfold. In progress: Add fold and unfold NNlib.jl#444

Pooling Layers
- MaxPool1d, MaxPool2d, MaxPool3d
- MaxUnPool1d, MaxUnPool2d, MaxUnPool3d
- AvgPool1d, AvgPool2d, AvgPool3d
- FractionalMaxPool2d
- LPPool1d, LPPool2d
- AdaptiveAvgPool1d, AdaptiveAvgPool2d, AdaptiveAvgPool3d
- AdaptiveMaxPool1d, AdaptiveMaxPool2d, AdaptiveMaxPool3d

Padding Layers
- Add corresponding layers for all of the above wrapping the NNlib functions / keep as functions. Need to add them to Flux's docs.

Activations

Normalization Layers
- BatchNorm1d, BatchNorm2d, BatchNorm3d
- LayerNorm
- GroupNorm
- InstanceNorm1d, InstanceNorm2d, InstanceNorm3d
- SyncBatchNorm
- LocalResponseNorm. Very old unfinished PR Local Response Normalization [W.I.P] #312. It is an outdated technique; probably we can live without it.

Recurrent Layers
- RNN
- GRU
- LSTM

Attention Layers
- Transformer. Well-maintained implementations in Transformers.jl. Should be moved from Transformers.jl to Flux.jl (ensure hitting cudnn kernels). PR MultiHeadAttention implementation #2146

Linear Layers
- Identity
- Linear
- Bilinear

Dropout Layers
- Dropout
- Dropout2d, Dropout3d (Make Dropout docstring clear w.r.t. N-D dropout #1490)
- AlphaDropout

Sparse Layers
- Embedding. PR add Embedding layer #1516
- EmbeddingBag. PR Add EmbeddingBag #2031

Distance Functions
- CosineSimilarity. We have this in Distances.jl. Also easy to handcode. TODO: check if AD and GPU friendly.
- PairwiseDistance. We have this in Distances.jl. TODO: check if AD and GPU friendly (could use Tullio.jl to achieve both).

Loss Functions

Vision Layers
- PixelShuffle. add Upsample and PixelShuffle layers #1468
- Upsample (for 1d, 2d, and 3d). (partially done in add Upsample and PixelShuffle layers #1468)

Initialization
- xavier_uniform, xavier_normal. Called glorot here.
- kaiming_normal
- kaiming_uniform
- sparse
- orthogonal (Add Orthogonal initialization feature. #1496)

Parallelism and Distributed
- DataParallel
- DistributedDataParallel (solved by https://github.com/DhairyaLGandhi/DaggerFlux.jl)
- set_num_threads, set_num_interop_threads. Not sure which operations are parallelized in pytorch. Here we have parallelization only in BLAS operations.

Distributions
- logpdf offered by DistributionsAD.jl
- rsample. Differentiability of params through sampling supported by many distributions: gradient(mu -> rand(Normal(mu, 1)), 0) == (1,)

ONNX

FFT
- AbstractFFTs

Quantization

Pruning

Optim
- [ ] Reexport in Flux (see Add basic scheduling policies and a scheduler #1506) (TBD)
- LambdaLR (handled in ParameterSchedulers.jl)
- MultiplicativeLR (handled in ParameterSchedulers.jl)

LinAlg
- det
- norm

Tensorboard

XLA

Misc
- einsum. AD and CUDA compatible Einstein summation given by Tullio.jl and other packages
- @autosize #2078
- weight_norm. Attempt in Added WeightNorm #1005, PR Add WeightNorm layer #2053
- spectral_norm. Old attempt in Fixed the spectral normalization #115

Pytorch Extras

Torchvision
- unreleased work in DataAugmentation.jl

Torchaudio
- ...

Torchtext
- ...