Failing unit tests for SGDSolver #39

Open
samalone opened this issue Jan 13, 2017 · 9 comments

Comments

@samalone

I realize that your SGDSolver class in version 0.4.0 is probably a work in progress since it is not documented, but I decided to try it anyway. In my own code using that class, I got hundreds of identical console errors during training:

error 10:02:43.602113 -0500 xctest Execution of the command buffer was aborted due to an error during execution. Internal Error (IOAF code 1)

I decided to run the BrainCore unit tests, and discovered that they are getting the same console errors. The unit test failures are:

file:///Users/samalone/Projects/Gorillias/Carthage/Checkouts/BrainCore/Tests/SGDSolverTests.swift: test failure: -[SGDSolverTests testTrainXOR()] failed: Asynchronous wait failed: Exceeded timeout of 20 seconds, with unfulfilled expectations: "Net train".

file:///Users/samalone/Projects/Gorillias/Carthage/Checkouts/BrainCore/Tests/SGDSolverTests.swift: test failure: -[SGDSolverTests testTrainXOR()] failed: XCTAssertLessThan failed: ("0.0292995") is not less than ("-0.000584897") -

file:///Users/samalone/Projects/Gorillias/Carthage/Checkouts/BrainCore/Tests/SGDSolverTests.swift: test failure: -[SGDSolverTests testTrainXOR()] failed: XCTAssertGreaterThan failed: ("-0.553643") is not greater than ("0.0646476") -

file:///Users/samalone/Projects/Gorillias/Carthage/Checkouts/BrainCore/Tests/SGDSolverTests.swift: test failure: -[SGDSolverTests testTrainXOR()] failed: XCTAssertGreaterThan failed: ("-0.697093") is not greater than ("0.603576") -

I'm running on a Retina MacBook Pro (Late 2013). The graphics card that Metal is presumably using is an NVIDIA GeForce GT 750M 2048 MB. Let me know if I can do any debugging or provide more information.

@alejandro-isaza
Owner

Yes, the solver is a work in progress. @aidangomez may know more but I'll have a look when I get a chance.

@aidangomez
Collaborator

@samalone @Aleph7 This is likely caused by Swift 3; we'll need to upgrade the code. It definitely wouldn't have been merged if those tests were failing.

@alejandro-isaza
Owner

I don't get any errors running on master. 😕

@samalone
Author

@Aleph7 OK, I'll take some time to try and narrow down the failure.

The only reference I've found to this error on the internet is a bug report for nVidia's Unreal Engine.

Let me know if you have any tips on how to isolate problems with Metal or BrainCore. Meanwhile I'll flail around as best I can.

@samalone
Author

Here's what I can say so far:

  • The only unit test that gets errors is testTrainXOR
  • The failure is always preceded by the console message: Internal Error (IOAF code 1)
  • The failure can be detected earlier by adding the code assert(commandBuffer.status == .completed) to the top of each addCompletedHandler callback. I suggest that you add such assertions inside every addCompletedHandler callback and after every call to waitUntilCompleted.
  • The number of training steps that are executed successfully before getting the error varies widely from 1 to 75.
  • The error can first occur when executing the command buffer in any of the following functions:
    • Trainer.processForwardNodes
    • Trainer.processBackwardNodes
    • SGDSolver.updateParameters

Since testTrainXOR uses ReLULayer but testSimpleTrain does not, perhaps the problem is related to ReLULayer?
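
For reference, the status check suggested in the list above could look roughly like the sketch below. It is written against the current Metal Swift API, where makeCommandQueue() and makeCommandBuffer() return optionals, and it uses an empty command buffer rather than any BrainCore code. The helper name verifyCompleted and the stage labels are made up for illustration and are not part of BrainCore.

```swift
import Metal

/// Hypothetical helper (not part of BrainCore): reports a command buffer that
/// did not finish successfully and returns whether it completed.
func verifyCompleted(_ commandBuffer: MTLCommandBuffer, stage: String) -> Bool {
    guard commandBuffer.status == .completed else {
        // `error` is only populated when the buffer ended in the `.error` state.
        let reason = commandBuffer.error.map { String(describing: $0) } ?? "no error object"
        print("Command buffer failed in \(stage): status \(commandBuffer.status.rawValue), \(reason)")
        return false
    }
    return true
}

// Usage sketch with an empty command buffer; real code would encode work first.
if let device = MTLCreateSystemDefaultDevice(),
   let queue = device.makeCommandQueue(),
   let commandBuffer = queue.makeCommandBuffer() {
    // Check inside the completion handler...
    commandBuffer.addCompletedHandler { buffer in
        _ = verifyCompleted(buffer, stage: "completion handler")
    }
    commandBuffer.commit()
    // ...and again after a blocking wait.
    commandBuffer.waitUntilCompleted()
    _ = verifyCompleted(commandBuffer, stage: "after waitUntilCompleted")
}
```

Logging the stage instead of asserting keeps the run alive, which may help when counting how many training steps succeed before the first failure.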

@samalone
Author

ReLULayer is not the problem. I was able to remove that layer from the unit test, but the error still occurs.

@alejandro-isaza
Owner

I added the assert you suggested and still don't get any errors. It looks like it may be a GPU problem. At one point we discovered a bug in the Metal implementation of tanh and ended up implementing our own (see Utilities.h). It may be something like that.

@samalone
Author

The unit tests pass on a Mac mini (Late 2012) with an Intel HD Graphics 4000 1536 MB GPU. I'd be interested in knowing what GPU you are using.

I've posted a question about how to debug this on the Apple Developer Forums. If I don't make any progress in the next few days, I'll open a support incident with Apple.
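
To compare GPUs across machines, a small sketch like the following can list the devices Metal sees and which one it picks by default. It assumes macOS (MTLCopyAllDevices is not available on iOS), and the helper name describeMetalDevices is invented for this example. On a laptop with both integrated and discrete graphics, the default device is not necessarily the one you expect.

```swift
import Metal

/// Hypothetical helper: one summary line per Metal device visible to this process (macOS only).
func describeMetalDevices() -> [String] {
    return MTLCopyAllDevices().map { device in
        "\(device.name) (lowPower: \(device.isLowPower), headless: \(device.isHeadless))"
    }
}

describeMetalDevices().forEach { print($0) }

// The device returned here is the one used by code that does not choose explicitly.
if let defaultDevice = MTLCreateSystemDefaultDevice() {
    print("Default device: \(defaultDevice.name)")
}
```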

@alejandro-isaza
Owner

I have a MacBook Pro 2016 with Radeon Pro 460 4096 MB
