
disabling parallel evaluation on darwin #438

Merged 22 commits into testqibojit on Jul 8, 2021

Conversation
Conversation

@scarrazza (Member):

With qibojit installed, the multiprocessing start method cannot be changed to "fork" mode on macOS. This PR disables the parallel feature on macOS. I believe we should schedule a revision of this module in the future, in particular considering better parallel algorithms rather than forcing circuit copies and async evaluation. @stavros11 please let me know your opinion.
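The platform gate described above can be sketched roughly as follows. This is a minimal illustration, not Qibo's actual code; the helper name `parallel_supported` is hypothetical:

```python
import multiprocessing as mp
import sys

def parallel_supported():
    # Hypothetical helper: the parallel module relies on the "fork"
    # start method, which is not usable on macOS once other packages
    # (e.g. numba pulled in by qibojit) have touched multiprocessing,
    # and is unavailable on Windows altogether.
    return sys.platform == "linux" and "fork" in mp.get_all_start_methods()

if not parallel_supported():
    # Fall back to sequential circuit evaluation.
    pass
```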

@scarrazza scarrazza requested a review from stavros11 July 6, 2021 09:53
@scarrazza scarrazza mentioned this pull request Jul 6, 2021
@stavros11 (Member) left a comment:

Thanks for the update. I cannot test on mac, but the code looks okay to me. If I remember correctly, though, parallel evaluation was not working with qibojit on linux either; that's why the related tests are disabled. What exactly is the new issue? Does installing qibojit on mac block parallel evaluation for all backends? For qibotf it used to work on mac, right?

Regarding revising the implementation, perhaps we should merge this and open an issue for now. Is there an alternative approach based on Python?

@scarrazza (Member, Author):

> If I remember correctly though parallel evaluation was not working with qibojit also on linux, that's why the related tests are disabled.

Ok, I believe this PR already disables it if get_backend() == "qibojit", so we can skip the tests or catch the runtime error.
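The two options (skipping up front versus catching the error at runtime) can be sketched like this; the backend check and the error type are assumptions for illustration:

```python
def run_parallel_or_skip(get_backend, run):
    # Option 1: skip up front when the active backend cannot fork.
    if get_backend() == "qibojit":
        return "skipped"
    # Option 2: attempt the run and catch the runtime error raised
    # when parallel evaluation turns out to be unavailable.
    try:
        return run()
    except RuntimeError:
        return "parallel unavailable"

print(run_parallel_or_skip(lambda: "qibojit", lambda: "ok"))  # skipped
print(run_parallel_or_skip(lambda: "numpy", lambda: "ok"))    # ok
```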

> What exactly is the new issue?

https://github.com/qiboteam/qibo/runs/2996557250?check_suite_focus=true#step:8:56
Presumably numba (or some other package imported by qibojit) already loads multiprocessing, so we cannot change the start method from "spawn" to "fork" (our previous default, used only on macOS).
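This is standard multiprocessing behavior: once the start method has been set (explicitly, or implicitly by whichever package touches multiprocessing first), a later attempt to change it raises a RuntimeError unless force=True is passed. A minimal reproduction:

```python
import multiprocessing as mp

# Simulate another package having already fixed the start method.
mp.set_start_method("spawn", force=True)

try:
    # A later attempt to switch to "fork" now fails.
    mp.set_start_method("fork")
except RuntimeError as err:
    print(f"RuntimeError: {err}")
```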

> Installing qibojit on mac blocks the parallel evaluation for all backends? For qibotf it used to work on mac, right?

I think so, in particular if we import qibojit before loading qibotf.

> Regarding revising the implementation, perhaps we should merge this and open an issue for now. Is there an alternative approach based on Python?

I agree. I don't think the parallel implementation we have is robust; we can keep it as experimental for the time being and then propose a better approach.

@stavros11 (Member):

> ok, I believe this PR already disables if get_backend() == "qibojit", so we can skip the tests or catch the runtime error.

Currently the tests are skipped. We could change to catching the error; however, I am not sure that would be very useful.

> I think so, in particular if we import qibojit, before loading qibotf.

Oh, I see now. I am not sure if there is a way to avoid this while keeping qibojit as the default backend. So we either follow this PR and disable parallel completely on mac, or we make a different backend the default.

@scarrazza (Member, Author):

Yeah, let's keep it disabled and revisit the implementation in the future, ref. #440.

@scarrazza (Member, Author):

@stavros11 it seems like the examples are taking much longer to run, so we hit the max time of 6h.
Maybe this is related to the number of threads? Did you observe this before?

@stavros11 (Member):

> @stavros11 seems like examples are taking much longer to run, so we hit the max time of 6h.
> Maybe this is something related to the number of threads? Did you observe this before?

Usually the example tests take long but do not hit the max time. I am running them locally now and I actually get some failures, so I am guessing the examples don't work well with the qibojit backend, which is the new default.

@stavros11 (Member):

The failing tests were related to numpy conversion using Tensorflow's .numpy(), which I fixed by changing to our backend-generic K.to_numpy. There is still an issue with some tests that hang, which may be the reason the CI timed out. I skipped these tests temporarily, so let's see if everything works now. The tests that hang are test_adiabatic3sat and test_ef_qae.
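A backend-generic conversion in the spirit of K.to_numpy can be sketched as follows. This is illustrative only; the duck-typed checks stand in for the real backend dispatch:

```python
import numpy as np

def to_numpy(x):
    # cupy arrays require an explicit device-to-host copy via .get();
    # tensorflow tensors expose .numpy(); plain arrays and array-likes
    # pass straight through np.asarray.
    if hasattr(x, "get") and type(x).__module__.startswith("cupy"):
        return np.asarray(x.get())
    if hasattr(x, "numpy") and not isinstance(x, np.ndarray):
        return np.asarray(x.numpy())
    return np.asarray(x)
```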

@codecov bot commented Jul 6, 2021:

Codecov Report

❗ No coverage uploaded for pull request base (testqibojit@5712247).
The diff coverage is n/a.


@@               Coverage Diff               @@
##             testqibojit      #438   +/-   ##
===============================================
  Coverage               ?   100.00%           
===============================================
  Files                  ?        83           
  Lines                  ?     11865           
  Branches               ?         0           
===============================================
  Hits                   ?     11865           
  Misses                 ?         0           
  Partials               ?         0           
Flag: unittests, coverage 100.00%



Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5712247...f73b07b.

@stavros11 (Member):

I re-ran the example tests using this branch (with the two problematic tests skipped) and they pass within a few minutes both on my laptop and on the qibo machine. @scarrazza could you please confirm that this works for you too? I am not sure why the CI times out.

@scarrazza (Member, Author):

Thanks for checking, I am doing that right now.

@scarrazza (Member, Author):

@stavros11, yes, I get:

94 passed, 12 skipped, 28 warnings in 703.09s (0:11:43)

Let me check the memory and try reducing the threads.

@scarrazza (Member, Author):

Ok, the problem comes from these limitations: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources

Try running the examples with:

ulimit -v 7000000
OMP_NUM_THREADS=2 pytest -s examples/

After a few minutes the shor test fails with:

RuntimeError: State does not fit in /CPU:0 memory. Please switch the execution device to a different one using ``qibo.set_device``.
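For reference, a dense simulator stores 2^n complex128 amplitudes, so the base memory per state vector can be estimated as below; temporaries and copies made during execution come on top of this:

```python
def statevector_mib(nqubits, bytes_per_amplitude=16):
    # complex128 = 16 bytes per amplitude; a dense state vector
    # holds 2**nqubits amplitudes.
    return (2 ** nqubits) * bytes_per_amplitude / 2 ** 20

print(statevector_mib(22))  # 64.0 MiB for the 22-qubit Shor circuit
```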

@scarrazza (Member, Author):

Given that qibotf was not crashing, maybe there are examples importing tensorflow...

@scarrazza (Member, Author):

Let's see what happens. In fact it is quite odd that qibotf was taking more than 3h to complete something that qibojit can do in less than 1h, so this sounds like memory limitations...

@stavros11 (Member):

With the latest version the tests pass on both my laptop and the qibo machine. When using the ulimit and OMP_NUM_THREADS commands the tests still pass on my laptop but give various errors instantly on the qibo machine. So I'm not sure if the CI will pass, but it likely won't.

@scarrazza (Member, Author):

For me it fails with:

------------------------------------------------------------

Using Shor's algorithm to factorize N = 21 with a = 16.

  - Performing algorithm using a fully quantum iQFT.

  - Total number of qubits used: 22.

[Qibo|WARNING|2021-07-07 15:12:54]: Falling back to CPU because the GPU is out-of-memory.

INTERNALERROR> Traceback (most recent call last):
INTERNALERROR>   File "/home/carrazza/anaconda3/lib/python3.8/site-packages/_pytest/main.py", line 269, in wrap_session
INTERNALERROR>     session.exitstatus = doit(config, session) or 0
INTERNALERROR>   File "/home/carrazza/anaconda3/lib/python3.8/site-packages/_pytest/main.py", line 323, in _main
INTERNALERROR>     config.hook.pytest_runtestloop(session=session)
...

Do you understand why there is this log message about falling back to CPU because the GPU is out of memory? It sounds like shor raises an OOM error on CPU, maybe during shots, and then this function catches it.

@stavros11 (Member) commented Jul 7, 2021:

> Do you understand why there is this log message about falling back to CPU because the GPU is out-of-memory? Sounds like shor raises an OOM in CPU, maybe during shots, and then this function catches the error.

That's strange, I never got anything similar. Are you sure the tests are running on the CPU and not on the GPU?

By the way, when I run on GPU I get

cupy_backends.cuda.libs.cusolver.CUSOLVERError: CUSOLVER_STATUS_INTERNAL_ERROR

@scarrazza (Member, Author):

> Are you sure the tests are running on CPU and not on GPU?

On CPU for sure, I double-checked with nvidia-smi. I cannot run on GPU with ulimit and 2 threads; the code crashes silently.
If I run on GPU without ulimit, everything passes except this one:

_______________________________________________ test_entropy_large_circuit[qibojit-None] ________________________________________________

backend = None, accelerators = None

    def test_entropy_large_circuit(backend, accelerators):
        """Check that entropy calculation works for variational like circuit."""
        thetas = np.pi * np.random.random((3, 8))
        target_entropy = callbacks.EntanglementEntropy([0, 2, 4, 5])
        c1 = Circuit(8)
        c1.add((gates.RY(i, thetas[0, i]) for i in range(8)))
        c1.add((gates.CZ(i, i + 1) for i in range(0, 7, 2)))
        state1 = c1()
        e1 = target_entropy(state1)
    
        c2 = Circuit(8)
        c2.add((gates.RY(i, thetas[1, i]) for i in range(8)))
        c2.add((gates.CZ(i, i + 1) for i in range(1, 7, 2)))
        c2.add(gates.CZ(0, 7))
        state2 = (c1 + c2)()
        e2 = target_entropy(state2)
    
        c3 = Circuit(8)
        c3.add((gates.RY(i, thetas[2, i]) for i in range(8)))
        c3.add((gates.CZ(i, i + 1) for i in range(0, 7, 2)))
        state3 = (c1 + c2 + c3)()
        e3 = target_entropy(state3)
    
        entropy = callbacks.EntanglementEntropy([0, 2, 4, 5])
        c = Circuit(8, accelerators)
        c.add(gates.CallbackGate(entropy))
        c.add((gates.RY(i, thetas[0, i]) for i in range(8)))
        c.add((gates.CZ(i, i + 1) for i in range(0, 7, 2)))
        c.add(gates.CallbackGate(entropy))
        c.add((gates.RY(i, thetas[1, i]) for i in range(8)))
        c.add((gates.CZ(i, i + 1) for i in range(1, 7, 2)))
        c.add(gates.CZ(0, 7))
        c.add(gates.CallbackGate(entropy))
        c.add((gates.RY(i, thetas[2, i]) for i in range(8)))
        c.add((gates.CZ(i, i + 1) for i in range(0, 7, 2)))
        c.add(gates.CallbackGate(entropy))
        state = c()
    
        K.assert_allclose(state3, state)
>       K.assert_allclose(entropy[:], K.cast([0, e1, e2, e3]))

src/qibo/tests/test_core_callbacks.py:238: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/qibo/backends/numpy.py:460: in cast
    return self.op.cast(x, dtype=dtype)
../qibojit/src/qibojit/custom_operators/__init__.py:47: in cast
    return backend.cast(x, dtype)
../qibojit/src/qibojit/custom_operators/backends.py:170: in cast
    return self.cp.asarray(x, dtype=dtype)
../../../anaconda3/lib/python3.8/site-packages/cupy/_creation/from_data.py:66: in asarray
    return _core.array(a, dtype, False, order)
cupy/_core/core.pyx:2120: in cupy._core.core.array
    ???
cupy/_core/core.pyx:2199: in cupy._core.core.array
    ???
cupy/_core/core.pyx:2267: in cupy._core.core._send_object_to_gpu
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.

cupy_backends.cuda.libs.cusolver.CUSOLVERError: CUSOLVER_STATUS_INTERNAL_ERROR

This sounds like limited GPU memory; try rebooting or using an empty GPU.

Going back to our issue here: I believe the github action was swapping memory when using qibotf (that's why it took 3h), while now with qibojit it crashes directly. If the amount of memory required by shor is really correct, then the only solution is to reduce the number of qubits in the tests.

@stavros11 (Member):

> On CPU for sure, I double checked nvidia-smi. I cannot run on GPU with ulimit and 2 threads, the code crashes silently.
> If I run on GPU without ulimit, everything passes except this one:

Thanks for the update. Actually the qibo tests were passing for me on GPU even before the last push. The example tests still fail on GPU due to some cupy -> numpy conversions, though.

> Going back to our issue here. I believe that the github action was going to swap memory when using qibotf (that's why 3h), while now with qibojit it crashes directly. If the amount of memory required by shor is really correct, then the only solution is to reduce the number of qubits for the tests.

If that's the case, perhaps the simplest thing to try would be to remove the N=21 value from the Shor test and leave N=15. I am not sure if it would work, but it would definitely make the test lighter.

@scarrazza (Member, Author) commented Jul 7, 2021:

> If that's the case perhaps the simplest to try would be to remove the N=21 value from the Shor test and leave N=15. I am not sure if it would work but it would definitely make the test lighter.

Indeed, I am playing with different configurations here: https://github.com/qiboteam/qibo/actions?query=branch%3Atryfixci
I have fixed some extra problems, including the adiabatic3sat one; however, as you can see, the jobs still take a long time, while on my laptop, if we skip ef_qae, these should take at most 6 minutes with ulimit and 2 cores.

If these tests hang, then we should try to isolate which test is breaking the CI.

@scarrazza (Member, Author):

With these last changes, I get a 5-7 minute runtime for all tests (no skips) on 2 different machines, using 2 threads and ulimit.

@scarrazza (Member, Author):

@stavros11 just to let you know that I am debugging one by one here: https://github.com/qiboteam/qibo/actions?query=branch%3Apruning

@scarrazza (Member, Author):

It looks like the scipy minimize is slowing down all tests dramatically on the CI machines; if I set a tiny maxiter, the tests pass.
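In scipy, the iteration cap is passed through the options dict of scipy.optimize.minimize. A toy example of the kind of cap that bounded the CI runtime (the objective and values here are illustrative, not the actual test workload):

```python
import numpy as np
from scipy.optimize import minimize

# Cap the optimizer at a handful of iterations so runtime stays bounded,
# at the cost of a possibly unconverged result.
result = minimize(lambda x: np.sum((x - 1.0) ** 2), x0=np.zeros(4),
                  method="Powell", options={"maxiter": 3})
print(result.nit, result.fun)
```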

@stavros11 (Member) commented Jul 8, 2021:

What is the current status of this? Can I help in any way? I see that some tests are passing fast in the pruning branch. Also, the example tests are still failing on GPU for me with the current version of this PR.

@scarrazza (Member, Author) commented Jul 8, 2021:

Some of them pass if we reduce the minimization to a few iterations; however, it still takes much longer than on my laptop.
I will try to replace the exec call with something else (e.g. this); I suspect that pytest is accumulating data.

If that doesn't work, then I assume we can disable these tests in the CI for the time being.

@scarrazza
Copy link
Member Author

@stavros11 believe it or not, the CI seems to work well only with a single thread. Let me polish this PR, then we can merge.

@scarrazza scarrazza merged commit 1d72d91 into testqibojit Jul 8, 2021
@scarrazza scarrazza deleted the fixparalleljit branch July 10, 2021 17:01