
feat/native Implement Sigmoid, ReLU, and tanh for Native backend #7

Merged
merged 3 commits into from
Jan 12, 2016

Conversation

ehiggs
Contributor

@ehiggs ehiggs commented Dec 30, 2015

This implements Sigmoid, ReLU, and tanh activation functions for the native backend. This includes the grad functions too.

Notes:

  • The tests only use the same numbers as the CUDA backend tests. They are only a first level of testing (does it work at all?) and don't cover robustness (does it behave nicely in the face of failure?).
  • write_to_memory should be moved to a single common function somewhere (Collenchyma?). Currently it is copy-pasted into the tests (twice in each) and into the native backend.
  • The unwrapping of SharedTensors is pretty nasty (nested if lets). Perhaps this could be abstracted into an iterator or a deeper function that does all the unwrapping we want; see the sketch after this list. I think this is a side issue from the implementations as they stand now. A side effect of the nested if lets is that I had to make up throwaway variable names which don't add anything to the readability.
  • I noticed that write_to_memory doesn't work nicely when different features are enabled. If only one feature is enabled, Rust doesn't like that we use an if let, because there is only one possible backend. If multiple backends are enabled, it doesn't like it when we don't use an if let, since we can't be sure about the result of the backend resolution.
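
To make the nesting issue concrete, here is a minimal sketch with toy stand-in types (not the real collenchyma SharedTensor API): a small hypothetical helper that resolves a tensor to its native slice flattens the nested if lets into plain try! calls.

    // Toy stand-in types, *not* the real collenchyma API, only to illustrate
    // how a small helper could flatten the nested `if let` unwrapping.
    enum MemoryType {
        Native(Vec<f32>),
        // other backends elided
    }

    struct SharedTensor {
        mem: Option<MemoryType>,
    }

    #[derive(Debug)]
    struct PluginError(&'static str);

    // Hypothetical helper: resolve a tensor to its native slice or fail early.
    fn native_slice(t: &SharedTensor) -> Result<&[f32], PluginError> {
        match t.mem {
            Some(MemoryType::Native(ref data)) => Ok(&data[..]),
            _ => Err(PluginError("no native memory copy for this tensor")),
        }
    }

    fn sigmoid_like(x: &SharedTensor, result: &SharedTensor) -> Result<(), PluginError> {
        // Without the helper this is two levels of `if let` with throwaway names;
        // with it, the happy path stays flat and the error path is explicit.
        let input = try!(native_slice(x));
        let output = try!(native_slice(result));
        println!("would compute over {} input / {} output elements", input.len(), output.len());
        Ok(())
    }

    fn main() {
        let x = SharedTensor { mem: Some(MemoryType::Native(vec![1.0, 2.0])) };
        let r = SharedTensor { mem: Some(MemoryType::Native(vec![0.0, 0.0])) };
        sigmoid_like(&x, &r).unwrap();
    }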

I don't know whether you want to merge this and turn the notes above into separate issues, or have them fixed in this PR, so I'll leave that up to you.

If you want to make some changes, please fork my branch and pass a PR back to me.

Thanks!

r? @MichaelHirn

@MichaelHirn
Member

First, big thanks for the PR and the issues you raised. I think it is best to create separate issues/PRs for them, as they need to be fixed in Collenchyma.

Here is the review, with three points that need to be resolved before we can merge this PR.

  • Memory Management:

    We need the memory management calls for the non-plain functions ;) You would always want to add at least the following to them:

    // make sure a memory copy exists on the device the operation runs on,
    // then sync the latest data to it (read / read-write arguments)
    match read_or_read_write_arg.add_device(self.device()) { _ => try!(read_or_read_write_arg.sync(self.device())) }
    // write-only arguments only need a memory copy allocated; no sync required
    match write_only_arg.add_device(self.device()) { _ => () }

    This does two things. First, it checks that a memory copy exists on the device on which the operation will be executed, which saves a lot of trouble. Second, read arguments also sync from the latest data location to the memory copy on which the operation will happen. This is important, because undesired things happen if we don't sync explicitly. For example:
    A SharedTensor is used in an operation on a CUDA device, and next it is the argument to an operation that runs on the native CPU. We expect the SharedTensor to carry the latest result (from the CUDA operation) into the Native operation, but the memory is not synced yet. That is precisely what .sync does - it makes sure the expected data is in the right memory for the operation. Anyone who would like to take care of all this themselves, or doesn't need those guarantees, can use the _plain operation instead.

  • Performance:

    The .clone statements add quite a bit to the computation time of an operation. I am guessing here, but I'd say we can use input.as_mut_slice instead of input.as_slice to get rid of those clones (see the sketch after this list). There might then be no need for the write_to_memory function inside the operations at all, which would reduce the overhead even further.

  • Squashing:

    GitCop was not activated for this repo yet. I added it to keep the commit style consistent for future automatic history generation. You can squash the commits with git rebase -i {initial commit id}, setting s (squash) as the mode for all but the initial commit, and then changing the message to something like: feat/native: implement Sigmoid, ReLU, tanh for Native backend.
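
Regarding the Performance point above, here is a minimal sketch (not the actual collenchyma-nn code) of writing an activation directly into a mutable result slice, so nothing is collected into an intermediate vector and no clone of the input data is needed:

    // Sketch only: compute sigmoid straight into the result slice, avoiding
    // both the intermediate collect() and the clone of the input data.
    fn sigmoid_in_place(input: &[f32], result: &mut [f32]) {
        for (out, &x) in result.iter_mut().zip(input.iter()) {
            *out = 1.0 / (1.0 + (-x).exp());
        }
    }

    fn main() {
        let input = vec![-1.0f32, 0.0, 1.0, 2.0];
        let mut result = vec![0.0f32; input.len()];
        sigmoid_in_place(&input, &mut result);
        println!("{:?}", result);
    }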

@GitCop

GitCop commented Jan 1, 2016

There were the following issues with your Pull Request

  • Commit: 6931f20
    • Commits must be in the following format: %{type}/%{scope}: %{description}

Guidelines are available at https://github.com/autumnai/collenchyma-nn/blob/master/CONTRIBUTING.md#git-commit-guidelines


This message was auto-generated by https://gitcop.com

@ehiggs
Contributor Author

ehiggs commented Jan 1, 2016

@MichaelHirn I updated the non-plain functions to perform the sync.

Regarding the clone (collect, really), I removed the bit where the results are collected into a vector before being copied into the result tensor. Now the iterator is streamed into the results. For further performance refinement, I think a few bench tests should be put in place to track progress.

@ehiggs
Contributor Author

ehiggs commented Jan 1, 2016

Regarding the issues raised:

write_to_memory is copied in lots of places. This should be handled by: autumnai/collenchyma#32

The if let unwrapping issues should be handled by: autumnai/collenchyma#33

@MichaelHirn
Member

Awesome ✨

Thanks for squashing the commits, adding the synchronization, and referencing the issues - that improves the code and future work a lot 👍

I made a PR to your native-sigmoid branch, which introduces working benches for native sigmoid, tanh and relu. It also includes the performance tooling we already used at collenchyma-blas. The benches are all working for native - quite interesting to see the performance, actually. Even more interesting would be to have the CUDA benches, but I am on a non-CUDA machine at the moment.

The bench PR should help to refactor the native activation functions. My gut tells me that we can significantly increase the performance if we rewrite them so that we can drop the .clone. Your latest commit removing the collect statement was, I bet, already a big performance win, and if the clone statements are gone, I think we can push it down to something near the optimum of that native solution. I am excited to see if the benches help to improve the performance of the native activation functions.
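
For reference, this is roughly what such a bench looks like on Rust's nightly test::Bencher harness; the kernel, function name, and tensor size below are illustrative placeholders, not the actual benches from that PR:

    // Sketch of a native activation bench on the nightly test harness; the
    // kernel and sizes are placeholders, not the actual collenchyma-nn benches.
    #![feature(test)]
    extern crate test;

    use test::{black_box, Bencher};

    fn sigmoid_in_place(input: &[f32], result: &mut [f32]) {
        for (out, &x) in result.iter_mut().zip(input.iter()) {
            *out = 1.0 / (1.0 + (-x).exp());
        }
    }

    #[bench]
    fn bench_sigmoid_1000_native(b: &mut Bencher) {
        let input = vec![0.5f32; 1000];
        let mut result = vec![0.0f32; 1000];
        b.iter(|| {
            sigmoid_in_place(&input, &mut result);
            black_box(&result); // keep the optimizer from discarding the work
        });
    }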

For later comparison, these are the current benches on my machine, with the clones still in place.

     Running target/release/relu-99bf132886a561b3

running 4 tests
test bench_1000_relu_100_native      ... bench:     462,655 ns/iter (+/- 50,939)
test bench_1000_relu_grad_100_native ... bench:     232,341 ns/iter (+/- 124,687)
test bench_10_relu_10000_native      ... bench:     445,418 ns/iter (+/- 5,762)
test bench_10_relu_grad_10000_native ... bench:     206,564 ns/iter (+/- 44,633)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured

     Running target/release/sigmoid-053da06452689dd5

running 4 tests
test bench_1000_sigmoid_100_native      ... bench:   2,851,015 ns/iter (+/- 290,278)
test bench_1000_sigmoid_grad_100_native ... bench:     224,172 ns/iter (+/- 2,164)
test bench_10_sigmoid_10000_native      ... bench:   2,848,230 ns/iter (+/- 145,276)
test bench_10_sigmoid_grad_10000_native ... bench:     205,792 ns/iter (+/- 17,093)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured

     Running target/release/tanh-fe117839e5c4c282

running 4 tests
test bench_1000_tanh_100_native      ... bench:   3,088,289 ns/iter (+/- 229,586)
test bench_1000_tanh_grad_100_native ... bench:     568,862 ns/iter (+/- 7,811)
test bench_10_tanh_10000_native      ... bench:   3,078,578 ns/iter (+/- 171,473)
test bench_10_tanh_grad_10000_native ... bench:     551,521 ns/iter (+/- 300,994)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured

feat/bench: add bench and perf utilities
@hobofan
Member

hobofan commented Jan 2, 2016

Looks very nice 👍

if the clone statements are gone, I think we can push it down to something near the optimum of that native solution

I think we are still quite far away from that. The speediest solution will most likely involve things like SIMD optimizations, which, due to the current lack of support in Rust, can only be achieved by using external libraries like BLAS or similar.

But I think these optimizations are something for the future since that would introduce a dependency on the collenchyma-blas plugin and I'd rather keep that simple for the time being. The solution as it is already provides great value.

One thing we should think about is traits for only a subset of the plugin's features, e.g. Tanh, Convolution. I am sorry that I was absent for the discussion of #4 / #5 / #6, but I really dislike the use of unimplemented! in this case, since it takes away a lot of power from the compiler and instead produces unexpected runtime crashes for new users of the plugin. More fine-grained traits could help solve that issue.
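
To make that concrete, a rough sketch (not a concrete proposal for the collenchyma-nn trait layout) of one small trait per operation: a backend implements only what it supports, and calling an unsupported operation becomes a compile-time error instead of an unimplemented! panic at runtime.

    // Sketch only: one small trait per operation instead of a single monolithic
    // NN trait. A backend implements just the operations it supports.
    trait Sigmoid {
        fn sigmoid(&self, x: &[f32], result: &mut [f32]);
    }

    #[allow(dead_code)] // present only to show a second, unimplemented capability
    trait Tanh {
        fn tanh(&self, x: &[f32], result: &mut [f32]);
    }

    struct NativeBackend;

    impl Sigmoid for NativeBackend {
        fn sigmoid(&self, x: &[f32], result: &mut [f32]) {
            for (out, &v) in result.iter_mut().zip(x.iter()) {
                *out = 1.0 / (1.0 + (-v).exp());
            }
        }
    }

    // NativeBackend deliberately does not implement Tanh: `backend.tanh(..)`
    // would fail to compile, rather than crash at runtime for the user.

    fn main() {
        let backend = NativeBackend;
        let x = [0.0f32, 1.0];
        let mut r = [0.0f32; 2];
        backend.sigmoid(&x, &mut r);
        println!("{:?}", r);
    }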

@MichaelHirn
Member

Yes, definitely a good point. A plugin could have multiple plugin traits instead of one. Exposing all of the traits at the root level of the crate would then yield the same result on the application side as we currently have with plugin::*. I think that is a reasonable approach. But I would merge this PR first, then refactor into multiple plugin traits, and finally publish a new version.

Regarding my previous comment, I feel I was quite ambiguous there. My intention was that "that" refers to the particular solution at hand, using common Rust code/math operations. BLAS and similar optimized libraries could of course bring a significant performance increase, but I wasn't using them for comparison at that point. So if we can get rid of the clones, I think we have moved quite close to the optimal performance we can achieve with common Rust.

@ehiggs
Contributor Author

ehiggs commented Jan 3, 2016

FWIW, clone on a reference to a primitive just copies the primitive into a register. It's not a malloc like it is with String or Vec.

When I'm next on a suitable machine I'll try profiling with the bench to see where the time goes. Also, I think it's worth making a bench for the underlying sigmoid functions, but that requires that they be public. Maybe wrapping them in publicly exposed functions marked #[bench] (so they don't exist in release builds) could dance around that.

@ehiggs
Contributor Author

ehiggs commented Jan 10, 2016

I profiled this and found that most of the time is spent in expf. The discussion above wasn't clear to me: do you expect this PR, implementing the collenchyma-nn native backend functions, to also refactor the nn traits? Or do you want to roll these in and then refactor so we don't need to use unimplemented! anymore?

@hobofan
Member

hobofan commented Jan 10, 2016

@ehiggs: I think the PR is mergeable as it is. The refactoring of the traits is something that needs a bit more discussion.

@MichaelHirn
Member

@ehiggs @hobofan yes, I totally agree. Let's move this PR in and then we refactor the traits. @ehiggs do you have any benches (code changes) that you think might be worth publishing for future work on the native performance? If so I will wait for the commit before I merge.

@ehiggs
Contributor Author

ehiggs commented Jan 11, 2016

@MichaelHirn I ran the benches just like you did and my numbers were in line with yours, except I ran them on a 2012 MacBook Air, so each iteration took a good deal more nanoseconds than on your box. I also began looking at comparing against the CUDA version, but my desktop has a GeForce 460, which is too old for cuDNN (which requires compute capability 3.0). So the comparison was a non-starter and underlined the importance of implementing the native backend. :)

@MichaelHirn
Member

@ehiggs Thank you so much for that PR.

I'll merge it now into master. Before publishing this to cargo, we should do the trait refactor to avoid unimplemented!. I will then update the implementation table in the README with your native activation function support. I'll try to get to this as soon as possible.

MichaelHirn added a commit that referenced this pull request Jan 12, 2016
feat/native Implement Sigmoid, ReLU, and tanh for Native backend
@MichaelHirn MichaelHirn merged commit c64f23e into autumnai:master Jan 12, 2016
@ehiggs ehiggs deleted the native-sigmoid branch January 12, 2016 09:58