
SymbolicHamiltonian refactor #1548

Merged: 34 commits into master, Feb 12, 2025

Conversation

@BrunoLiegiBastonLiegi (Contributor) commented Dec 18, 2024

This should address #1494

EDIT: this turned out to be, more generally, a refactor of SymbolicHamiltonian and partly of hamiltonians/models.py. Namely, SymbolicHamiltonian now only has a form attribute; dense and terms are cached properties computed starting from form and thus can no longer be manually set. Concerning hamiltonians/models.py, the models are now built starting from the form rather than from the terms, as was done before. I also made the dense representation of the models more backend aware, replacing the numpy operations with the backend ones.
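For illustration, a minimal sketch of how the refactored interface reads (assuming the usual qibo.symbols API; the exact names and signatures are my recollection, not guaranteed by this PR):

from qibo import hamiltonians
from qibo.symbols import X, Z

# After the refactor a SymbolicHamiltonian is defined by its `form` only;
# `dense` and `terms` are cached properties derived from it.
nqubits = 3
form = -sum(Z(i) * Z(i + 1) for i in range(nqubits - 1)) - sum(X(i) for i in range(nqubits))
ham = hamiltonians.SymbolicHamiltonian(form)

dense = ham.dense  # dense representation, computed from `form` on first access and cached
terms = ham.terms  # likewise derived from `form`; manually setting it is no longer possible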

Note: tests on cupy are still failing due to this qiboteam/qibojit#196 .

EDIT 2:
I went ahead and provided both Symbol and SymbolicTerm with a backend attribute, which is useful whenever you move to the dense form representation. Initially, I thought about making backend an additional input argument to the functions that need it (similarly to what we do with gates), but this would have meant breaking the API (e.g. symbol.matrix -> symbol.matrix(backend)).
What it means now, though, is that there are cases where the backends of the different pieces may mismatch, thus causing problems. To mitigate this, I decided to forcibly set the backend of the symbols here https://github.com/qiboteam/qibo/blob/4d8c9c1211c850a5e740607de943da75fc31f8b2/src/qibo/hamiltonians/terms.py#L176C1-L179C50
which allows the end user not to care about which backend a symbol is constructed with, maintaining the usual API in practice.
As part of the refactor, I removed the symbol_map argument of the hamiltonians, the from_symbolic method and the TrotterHamiltonian.
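To illustrate the backend handling described above, a hedged sketch (the backend keyword argument and the forced backend propagation are assumptions based on the linked code, not a verbatim excerpt):

from qibo import hamiltonians
from qibo.backends import NumpyBackend
from qibo.symbols import Z

# Symbols are created without caring about any backend...
form = Z(0) * Z(1)

# ...and the Hamiltonian forcibly sets its own backend on the underlying
# symbols/terms when the dense representation is built, so the pieces
# cannot end up on mismatching backends.
ham = hamiltonians.SymbolicHamiltonian(form, backend=NumpyBackend())
dense = ham.dense  # computed with the Hamiltonian's backend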

Checklist:

  • Reviewers confirm new code works as expected.
  • Tests are passing.
  • Coverage does not decrease.
  • Documentation is updated.

@BrunoLiegiBastonLiegi BrunoLiegiBastonLiegi changed the title Remove SymbolicHamiltonian terms and dense setters SymbolicHamiltonian refactor Dec 19, 2024

codecov bot commented Dec 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.58%. Comparing base (312efb9) to head (3544ebe).
Report is 36 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1548      +/-   ##
==========================================
- Coverage   99.61%   99.58%   -0.04%     
==========================================
  Files          76       76              
  Lines       11449    11325     -124     
==========================================
- Hits        11405    11278     -127     
- Misses         44       47       +3     
Flag        Coverage Δ
unittests   99.58% <100.00%> (-0.04%) ⬇️



@BrunoLiegiBastonLiegi (Contributor, Author) commented Dec 20, 2024

I am still unsure whether it is worth keeping the two different ways of computing the dense form of the SymbolicHamiltonian, namely from_terms and from_form.
Therefore I've run a small benchmark on 4 different hamiltonians ("TFIM", "XXZ", "TFIM**2", "XXZ**2") with the numpy backend, for 2 to 10 qubits. In particular, the number of terms (after expansion) for a given number of qubits for TFIM**2 (the biggest) is:

{
    "2q": 36,
    "3q": 81,
    "4q": 144,
    "5q": 225,
    "6q": 324,
    "7q": 441,
    "8q": 576,
    "9q": 729,
    "10q": 900,
    "11q": 1089
}

The average runtimes across 10 runs are as follows:

time vs nqubits: time_vs_nqubits.pdf
time vs nterms: time_vs_nterms.pdf

from_terms has an advantage when the number of terms is small (fewer than ~40, say); after that from_form seems to prevail, but in the long run, for a large number of qubits and terms, from_terms appears to become comparable if not better.

Long story short, if I had to pick only one of the two I'd probably lean towards from_terms, but I could also switch between the two automatically, internally, given a pre-defined nterms threshold.
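For concreteness, a hedged sketch of what such an automatic switch could look like (the callables and the ~40-term threshold are illustrative placeholders taken from the benchmark above, not actual qibo internals):

NTERMS_THRESHOLD = 40  # rough cross-over suggested by the benchmark above


def calculate_dense(hamiltonian, from_terms, from_form, threshold=NTERMS_THRESHOLD):
    # Dispatch between the two dense-construction strategies (passed in as
    # callables) based on the number of expanded terms.
    if len(hamiltonian.terms) < threshold:
        return from_terms(hamiltonian)
    return from_form(hamiltonian)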

EDIT: a small update on this one: I've done some minor optimization of the _get_symbol_matrix function and cached it, which made the picture clearer. Now from_form appears to consistently outperform from_terms:

[image: updated runtime benchmark, from_form vs from_terms]

At this point it is maybe not worth keeping from_terms anymore.


github-actions bot commented Feb 7, 2025

Run on QPU sim completed! :atom:

You can download the coverage report as an artifact, from the workflow summary page:
https://github.com/qiboteam/qibo/actions/runs/13201313490

@BrunoLiegiBastonLiegi (Contributor, Author) commented Feb 7, 2025

There are still some parts of the code that I left commented out because they are up for discussion, and I would like to run some more tests on a GPU.

Just a small update on the GPU scaling: I tested the multikron function with two approaches, reduce(kron) and a pure einsum:

from functools import reduce

import numpy as np


def multikron_reduce(matrix_list, kron):
    # Sequentially Kronecker-multiply the matrices with the backend's `kron`.
    return reduce(kron, matrix_list)


def multikron_einsum(matrix_list, einsum):
    # Assign a (row, column) index pair to each matrix and take a single
    # outer product with the backend's `einsum`: row indices first, then columns.
    indices = list(range(2 * len(matrix_list)))
    even, odd = indices[::2], indices[1::2]
    lhs = zip(even, odd)
    rhs = even + odd
    einsum_args = [item for pair in zip(matrix_list, lhs) for item in pair]
    N = 2 ** len(matrix_list)
    h = einsum(*einsum_args, rhs)
    return np.sum(np.reshape(h, (-1, N, N)), axis=0)
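For reference, a minimal sketch of how the numpy timings below could be reproduced (this setup is an assumption, not the exact script used; the cupy case would pass cupy.kron/cupy.einsum instead and should synchronize the device before reading the clock):

import timeit

import numpy as np

matrices = [np.random.rand(2, 2) for _ in range(10)]  # 10 random 2x2 matrices

t_reduce = timeit.timeit(lambda: multikron_reduce(matrices, np.kron), number=100)
t_einsum = timeit.timeit(lambda: multikron_einsum(matrices, np.einsum), number=100)
print(f"reduce --> {t_reduce}")
print(f"einsum --> {t_einsum}")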

The results with numpy and cupy for 10 2x2 matrices are (100 repetitions):

## numpy:
reduce --> 0.8061706170010439
einsum --> 4.193189857000107
## cupy:
reduce --> 0.7108887559988943
einsum --> 0.056594685000163736

The GPU scales wonderfully, as expected, and I suspect this would transfer to the other similar functions I encountered here. However, the penalty on the CPU seems too big to consider einsum a viable option, and differentiating among the backends would mean moving these functions into the backends directly. So I am not sure what to do.

@alecandido (Member)

Just as an attempt, could you try to use einsum(..., optimize=True) or even einsum(..., optimize="optimal")?

@BrunoLiegiBastonLiegi (Contributor, Author)

I already tested optimize=True with no difference. I am now testing optimize="optimal", which I was not aware of, by the way: no difference for numpy either, while cupy mysteriously becomes utterly slow, in the sense that it is still running while I am typing.

@alecandido (Member) commented Feb 7, 2025

Well, if NumPy is getting no benefit, then it's not worth it in any case.

However, this would be interesting to investigate: "optimal" attempts (at least to a certain extent) to find the best contraction, possibly wasting a lot of time computing it.
Now, I expect this not to be that hard for the problem you're laying out. And once you know the contraction path, it's just a matter of executing it.
So, I'm now wondering: how is it possible that reduce() ends up with a better contraction than the optimal one?

Just to exclude that computing the contraction path is actually what takes a huge amount of time, I'd propose to compute the contraction path executed by reduce() and pass it explicitly to optimize=... (which also accepts a contraction path as argument).

Controls if intermediate optimization should occur. No optimization will occur if False and True will default to the ‘greedy’ algorithm. Also accepts an explicit contraction list from the np.einsum_path function. See np.einsum_path for more details. Defaults to False.

https://numpy.org/doc/stable/reference/generated/numpy.einsum.html

EDIT: you may also compute the contraction path with a separate invocation of np.einsum_path (with optimize="optimal") and only time the contraction part. While that should be as conclusive as the manual computation of the other path, since it is documented to brute-force the optimal one, it would not be as explanatory: the manual path computation following the reduce approach may even give you an alternative implementation, since you could reuse it to skip the path-optimization part while already achieving better-than-naive performance, which should be at least comparable with reduce().

@BrunoLiegiBastonLiegi (Contributor, Author)

I tried calculating the path separately, but there is no difference at all; it looks like the contraction itself is taking that time and that's it.

import numpy as np


def get_path(matrix_list):
    # Precompute the contraction path for the Kronecker-product einsum.
    indices = list(range(2 * len(matrix_list)))
    even, odd = indices[::2], indices[1::2]
    lhs = zip(even, odd)
    rhs = even + odd
    einsum_args = [item for pair in zip(matrix_list, lhs) for item in pair]
    return np.einsum_path(*einsum_args, rhs, optimize="optimal")


def multikron_einsum(matrix_list, einsum, path):
    # Same outer-product einsum as before, but reusing the precomputed path.
    indices = list(range(2 * len(matrix_list)))
    even, odd = indices[::2], indices[1::2]
    lhs = zip(even, odd)
    rhs = even + odd
    einsum_args = [item for pair in zip(matrix_list, lhs) for item in pair]
    N = 2 ** len(matrix_list)
    h = einsum(*einsum_args, rhs, optimize=path[0])
    return np.sum(np.reshape(h, (-1, N, N)), axis=0)
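A minimal usage sketch (assumed setup, matching the benchmark above): np.einsum_path returns a (path, report) tuple, hence the path[0] above.

matrices = [np.random.rand(2, 2) for _ in range(10)]
path = get_path(matrices)  # (contraction path, human-readable report)
result = multikron_einsum(matrices, np.einsum, path)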

@alecandido (Member) commented Feb 10, 2025

Ok, sorry, I was a bit ignorant about the Kronecker product (I knew what it was, but I am definitely not familiar with it, so I had no intuition).

In any case, the einsum_path is trivial, and thus completely irrelevant. Indeed, in the end you are contracting all matrices at the same time; there is no inner and outer summation, thus no summation to be sorted.
This is because the Kronecker product is actually not a contraction at all (no index disappears), but rather closer to an outer product (that's exactly what you're doing with the einsum implementation).
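As a quick illustration of this point (a standalone sketch, not code from the PR): the Kronecker product of two matrices is just an outer product with the indices regrouped, so no index is summed over.

import numpy as np

a, b = np.random.rand(2, 2), np.random.rand(2, 2)

# Pure outer product: every index of `a` and `b` survives in the output...
outer = np.einsum("ij,kl->ikjl", a, b)
# ...and regrouping the (row, row) and (column, column) pairs gives the Kronecker product.
assert np.allclose(outer.reshape(4, 4), np.kron(a, b))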

Most likely, in the reduce() loop you're just dominated by the last operation (you can see the scaling of the time by incrementing the number of matrices), so the presence of a Python loop is completely irrelevant: there will be a single huge operation, and everything else is just a "brief" preparation.

I tried measuring the increment (ratios of subsequent timings), and even the scaling itself is not clear:

reduce [40.13  1.9   1.51  1.39  1.52  1.89  2.66  3.46  3.87  5.53]
einsum [0.67 1.2  1.3  1.64 2.5  5.57 3.71 3.56 4.66 4.47]

Although they may fluctuate a bit, the mutual relations are overall pretty stable.

So, all in all, it is not clear to me that there is either a theoretical or a practical advantage in doing one thing or the other. I'd suggest just keeping the simplest.


In case of further doubt, I actually timed the np.einsum() operation on its own, and it is the single operation taking significant time in the multikron_einsum function.

I suspected the final summation, but I had not realized that it is a summation over a dimension of length 1 (in fact, it is not even needed; you could directly .reshape(N, N)).
The reshape itself only changes the view and does not require any operation on the elements (and so, no iteration at all).

@BrunoLiegiBastonLiegi (Contributor, Author)

Thanks to all of you for the comments, I should have addressed everything already.
There are just two last things that I would like to discuss with you:

@alecandido (Member)

  • What to do with the whole einsum vs reduce argument we discussed above together with @alecandido. At the moment, it looks like it would be better to just stick with reduce. Another option would be to make the multikron and similar functions part of the backends, if it's worth it.

I would go for the plain reduce. It is as good as anything else in performance, and the simplest.
Even making it part of the backends would gain at most marginal performance for some backends (for the reason mentioned above). So, I'd not worry that much :)

github-actions bot commented:

Run on QPU sim completed! :atom:

You can download the coverage report as an artifact, from the workflow summary page:
https://github.com/qiboteam/qibo/actions/runs/13284275308

@stavros11 (Member) left a comment


Thanks @BrunoLiegiBastonLiegi.

  • What to do with the whole einsum vs reduce argument we discussed above together with @alecandido. At the moment, it looks like it would be better to just stick with reduce. Another option would be to make the multikron and similar functions part of the backends, if it's worth it.

From my side, I don't think I ever tried to benchmark these methods, so I do not have any feedback in terms of performance. I would also go with reduce, as it looks simpler in terms of code. If there is any chance that a different implementation would provide a significant performance advantage, we could do it in a separate PR after verifying such an advantage (you could even open an issue if you think it is worth exploring).

I am also not a big fan of providing multiple, equally simple ways to do the same thing, so I would agree with you that it is not very useful. However, one thing to point out is that removing the dense option would be breaking for anyone using that interface and, worse, even for people not using it, as it would change the current default (dense=True). It could be fine, as there are a few other (potentially less) breaking changes implemented here, and I am not sure whether we follow a strict versioning convention for such changes in qibo.

Run on QPU sim completed! :atom:

This is an interesting QPU partition...

@BrunoLiegiBastonLiegi (Contributor, Author)

Thanks for the feedback, I can then maybe open two issues:

  • one where I report the different implementations of reduce vs einsum to be tested and, if worthwhile, introduce the fastest in a future PR
  • one reminding us about the option of dropping one of the two hamiltonian implementations

After that, I think this can be safely merged.

This is an interesting QPU partition...

That's the best ever: it works perfectly every time, as long as you don't ask for more than 30 qubits.

@BrunoLiegiBastonLiegi (Contributor, Author)

Ok, once the tests pass this can be merged.

github-actions bot commented:

Run on QPU sim completed! :atom:

You can download the coverage report as an artifact, from the workflow summary page:
https://github.com/qiboteam/qibo/actions/runs/13286534503

@BrunoLiegiBastonLiegi BrunoLiegiBastonLiegi added this pull request to the merge queue Feb 12, 2025
Merged via the queue into master with commit 725e646 Feb 12, 2025
30 checks passed
Labels: refactor (Code is working but could be simplified), run-on-sim
7 participants