Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Base structure for Native Operators #16

Closed
wants to merge 32 commits into from

Conversation

vchuravy
Copy link
Collaborator

@vchuravy vchuravy commented Nov 5, 2015

Thanks to the help in #15 I managed to get past the initialization stage of _Native.

With the test_enty functions defined in src/nativeops.jl

info = mx.NativeOpInfo(fb_entry, fb_entry, infer_entry, list_entry, list_entry)
pstring = bytestring("0x", hex(reinterpret(UInt, pointer_from_objref(info))))
mx._Native(name = :test, info = pstring)

Works without causing a seqfault 🎉

Todo:

  • Defining API exposed to Users
  • Defining (gc-safe) entry points (maybe also thread safe?)

API

I was thinking of something along the lines of

immutable Softmax <: mx.NativeOp end

function forward(::Type{Softmax}, in_data, out_data)
  ...
end

function forward(::Type{Softmax}, out_grad, in_data, out_data, in_grad)
  ...
end

function infer_shape(::Type{Softmax},  in_shape)
  ...
end

function list_output(::Type{Softmax})
  ...
end

function list_output(::Type{Softmax})
  ...
end

@pluskid
Copy link
Member

pluskid commented Nov 5, 2015

Great! Thanks! You might want to use an actual object instead of mere type in the API. e.g.

function list_output(op::Softmax)
  ...
end

As my impression is that an operator could actually have some properties, like name, etc.

@codecov-io
Copy link

Current coverage is 67.44%

Merging #16 into master will decrease coverage by -6.30% as of 42dd45c

@@            master     #16   diff @@
======================================
  Files           20      21     +1
  Stmts         1531    1674   +143
  Branches         0       0       
  Methods          0       0       
======================================
  Hit           1129    1129       
  Partial          0       0       
- Missed         402     545   +143

Review entire Coverage Diff as of 42dd45c

Powered by Codecov. Updated on successful CI builds.

@vchuravy
Copy link
Collaborator Author

vchuravy commented Nov 5, 2015

I want to formulate it threadsafe from the get go and I don't know how to do that (yet) -> https://groups.google.com/forum/#!topic/julia-dev/9J1GYfCyVpE

@pluskid
Copy link
Member

pluskid commented Nov 8, 2015

I dig a bit into the discussions on mailing list. Just to summarize based on my understanding:

  1. For the kvstore callback (rewrite kvstore callback to be threadsafe #18), since @tqchen mentioned, that callback is guaranteed to be called in the main thread, there is no issue of the current implementation. We will probably go with the current way since it is somehow more efficient for some unknown reason.
  2. For the operator case, since the callback is executed from a different thread, and Julia GC is not thread safe (yet), the only safe thing to do in the callback is ccall(:uv_async_send, ...). So we will have to make this a shallow wrapper that does nothing but notify an actual async worker in the main thread to do the job. This shallow callback can not do waiting as context switching is definitely going to cause issue. So we will need to modify libmxnet to add ability to get notified when the Julia async worker finished the actual computation of the operator.

@vchuravy
Copy link
Collaborator Author

vchuravy commented Nov 8, 2015

@pluskid Exactly. also relevant #18 (comment)

I am currently trying to implement async safe callbacks in OpenCL.jl with the ability to pass data along, but I am running into issues. I will keep you updated as it goes. JuliaGPU/OpenCL.jl#87

@tqchen
Copy link
Member

tqchen commented Nov 21, 2015

Sorry to keep you wait for a while, this PR apache/mxnet#667 enables asynchronize OP.

I would recommend make an update to NativeOp to enable async mode, that exposes a callback to C API, which does some finalization stuffs(e.g. copy data back to TBlob), and call async_on_compelete.

@pluskid
Copy link
Member

pluskid commented Nov 21, 2015

@tqchen Thanks for adding this! I had a closer look at the nativeop API. It's a bit strange to expose the user some raw pointer to operate with. I think it might be better to use NDArray directly since it has nice async execution scheduling. The user could choose to use NDArray operations to implement the operator (which is potentially faster and runs on GPU), or if he want to work with numpy array or Julia array, copying the data from and to an NDArray is super easy in the script side (we could also provide a simple default wrapper for this case). What do you think?

@piiswrong
Copy link
Member

@pluskid That was what I was going to do originally, but due to some reason I went this way instead.
You are welcome to refactor it.

@vchuravy
Copy link
Collaborator Author

@pluskid @piiswrong I will look into the refactoring today, for both the python side and the julia side.
@tqchen Thanks for loonking into async ops.

@piiswrong
Copy link
Member

@vchuravy remember to set the ctx of the ndarraies to OpContext.ctx

@pluskid
Copy link
Member

pluskid commented Nov 21, 2015

@piiswrong Thanks! @vchuravy Great! I'll leave it to you now then.

@tqchen
Copy link
Member

tqchen commented Nov 21, 2015

There was a bit interfacing issue that we did not yet expose ndarray as operator.

  • The callback Julia operations are meant to be ran by another worker thread, while itself execute in kinda of synchronized manner.
  • NDArray operations are usually executed by main thread, and itself runs in async manner.

So if synchronize copy is introduced in the operator, the operator itself blocks in the memcpy call (of course this could also be ran by another worker thread), which is OK. Current operator was kinda of simplified to assume it actually does the heavy lifting works, copy to CPU and calls kinda of synchronize callback in native language.

To support NDArray tracking, we will need to introduce operator interface with NDArray, while most wrapper should still go with the synchronized real action. This complicates a bit with c++11 issues, as c++11 support by nvcc was introduced, but seems not yet perfect, and current cuda related code are not yet c++11. But I am sure we will find a way.

For short term, making native julia and numpy op work would be a good first step

@piiswrong
Copy link
Member

@vchuravy BTW, you probably will be interested in apache/mxnet#668
This is a runtime compile engine for CUDA. You can write kernels in python or julia

@piiswrong
Copy link
Member

@vchuravy when do you expect to finish C++ side refactor? I also want that feature, maybe we can setup an API standard and divide labor

@vchuravy
Copy link
Collaborator Author

@piiswrong I was thinking of later tonight, but if you want go ahead and I focus on the Julia side.

@vchuravy
Copy link
Collaborator Author

@pluskid I made some progress, but this is primarily focused on the conceptual side of how things should work. I am now working on getting it to a state where it actually works.

@pluskid
Copy link
Member

pluskid commented Dec 13, 2015

@vchuravy Thanks! With the end of the semester settled and coming back from conference, I think I will probably finally be able to resume a relatively stable commitment to the project soon. What is the current status and current blocking item? Let me know if you need any help on this side.

@vchuravy
Copy link
Collaborator Author

I still have to write the entire points and the rtc interface. After that
it is mostly going to be weeding out bugs and making sure everything is
thread safe.

On Sun, 13 Dec 2015, 23:59 Chiyuan Zhang [email protected] wrote:

@vchuravy https://github.com/vchuravy Thanks! With the end of the
semester settled and coming back from conference, I think I will probably
finally be able to resume a relatively stable commitment to the project
soon. What is the current status and current blocking item? Let me know if
you need any help on this side.


Reply to this email directly or view it on GitHub
#16 (comment).

@vchuravy vchuravy force-pushed the vc/nativeops branch 2 times, most recently from e492b57 to f223c96 Compare January 7, 2016 07:02
@vchuravy
Copy link
Collaborator Author

vchuravy commented Jan 7, 2016

@pluskid Conceptually everything is in place, but currently the following still seqfaults.

using MXNet
immutable Softmax <: mx.Native.Operator end
op = Softmax()
info = mx.Native.NDArrayOpInfo(op)
pstring = bytestring("0x", hex(reinterpret(UInt, pointer_from_objref(info))))
mx._NDArray(name = :test, info = pstring)

So I will have to do some more testing and debugging. But if you can find time and look over the basic design?

@vchuravy
Copy link
Collaborator Author

vchuravy commented Jan 7, 2016

Initializing a NativeOperator now works without seqfaulting.One major issue where I am unsure how to best handle that is the lifetime of Julia objects that we return via pointers to mxnet. One of those examples would be list_arguments. The Julia function returns an array of strings, that then gets converted to an array of pointers and we then store the pointer of that array to mxnet. When the GC kicks in the strings will get collected and the memory addresses will be void.

We will either keep track of all the pointers we hand off to mxnet or get MXNet to allocate memory for us.
http://docs.julialang.org/en/release-0.4/manual/calling-c-and-fortran-code/?highlight=cconvert#garbage-collection-safety

@vchuravy
Copy link
Collaborator Author

vchuravy commented Jan 7, 2016

@pluskid

I would like to define

function Base.call(op :: Operator; kwargs...)
  info = NDArrayOpInfo(op)
  pstring = bytestring("0x", hex(reinterpret(UInt, pointer_from_objref(info))))
  mx._NDArray(info = pstring, kwargs...)
end

So that the following would work

immutable SoftmaxOp <: mx.Native.Operator end; 
Softmax = SoftmaxOP(); 
Softmax(name = :Softmax)

But the function gets precompiled and it doesn't find _Nativeany ideas how circumvent that, short of using @generated function?

@vchuravy
Copy link
Collaborator Author

vchuravy commented Aug 17, 2016

So I think this is actually getting into the realm of the possible. The trick is to use put a lock around :uv_async_send and potentially making the entire call blocking.

Right now I am running into the issue that if I make the entire call blocking, the execution engine seems to stall and because operations on NDArray are blocking.

@piiswrong do I see it right that I should build this on top of the Custom operator and that NativeOp and NDArrayOP are deprecated? And am I also correct in assuming that https://github.com/dmlc/mxnet/blob/916dc9670abe2b64666c076bfff9f8feac23b8d6/src/operator/custom.cc#L75 is supposed to be blocking?

TODO:

  • Use condition and not barrier
  • Move to Custom
  • Decide between Julia arrays or NDArrays for the API
  • Allow for push! and GPU codegen

@piiswrong
Copy link
Member

@vchuravy Yes use custom op. The others are deprecated.

It's subtle. It can be either blocking or non blocking. However, if you do anything blocking in forward, you need to set the env variables such that there are more than 2 cpu/gpu worker threads

@vchuravy
Copy link
Collaborator Author

Non blocking I would have to push a new sync point? Similar to how NVRTC is
implemented in python?

On Wed, 17 Aug 2016, 20:21 Eric Junyuan Xie, [email protected]
wrote:

@vchuravy https://github.com/vchuravy Yes use custom op. The others are
deprecated.

It's subtle. It can be either blocking or non blocking. However, if you do
anything blocking in forward, you need to set the env variables such that
there are more than 2 cpu/gpu worker threads


You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub
#16 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAI3anJWHixnCoKgdRU_w9A6NpUbhSomks5qg6VygaJpZM4GcVIK
.

@piiswrong
Copy link
Member

No syncing is needed from the frontend side

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants