Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regression caused by #7087 #7528

Closed
mtreinish opened this issue Jan 13, 2022 · 11 comments · Fixed by #7840
Closed

Performance regression caused by #7087 #7528

mtreinish opened this issue Jan 13, 2022 · 11 comments · Fixed by #7840
Assignees
Labels
bug Something isn't working performance
Milestone

Comments

@mtreinish
Copy link
Member

mtreinish commented Jan 13, 2022

Environment

  • Qiskit Terra version: main (after 05b60a5 )
  • Python version: 3.8
  • Operating system: linux

What is happening?

The nightly benchmark runs have flagged several run time performance regressions after #7087 merged:

none are huge in absolute time (on the order of ms) and likely won't be noticeable in a larger transpile() call or application but we should try to fix these because #7087 really shouldn't have had any performance impact.

How can we reproduce the issue?

Run any of the transpiler passes identified in the regressions linked

What should happen?

The addition of a new abstract class defining the interface for a circuit operation shouldn't cause a noticeable performance regression

Any suggestions?

Identify where the passes are spending more time after the addition of the Operation class and fix the bottleneck.

@mtreinish mtreinish added the bug Something isn't working label Jan 13, 2022
@mtreinish mtreinish added this to the 0.20 milestone Jan 13, 2022
@alexanderivrii
Copy link
Contributor

After a bit of experimenting with the first problematic testcase (passes.MultiQBlockPassBenchmarks.time_collect_multiq_block), we see the following (we = Shelly, Eli and myself):

  • The slowdown seems to be due to inheriting Operation from ABC: class Operation(ABC): ...; changing this to class Operation: ... removes the slowdown.
  • The above is due to many __instancecheck__ calls within ABC (and for some reason these are not free), which is in turn due to many isinstance calls in Qiskit.
  • In the current testcase most of the isinstance calls come from the function collect_key in qiskit/transpiler/passes/optimization/collect_multiqubit_blocks.py, where the function above is used in dag.topological_op_nodes(key=collect_key). We saw on average 2 calls to isinstance per node in the dag.

What would be the best way to proceed? Should we get rid of dependence on ABC? Should we try to optimize the number of times isinstance is called in this transpiler pass? Or maybe the slowdown is acceptable?

@jakelishman
Copy link
Member

jakelishman commented Jan 17, 2022

The __instancecheck__ mechanism of ABCMeta is to allow a registration method for classes that don't actually inherit from the defined type, like the generic types in typing (dict appears to be an subclass of typing.MutableMapping, for example, even though it doesn't actually inherit from it). This is why isinstance is slowed on ABCs than other things.

We can't give up isinstance; it's the correct way to test the types of objects when they may be inherited. It's also the reason we put in this class at all, to ensure that QuantumCircuit has an easy way to check that an object is of the right type - it's just got to inherit from Operation.

Perhaps we could define a lightweight metaclass that inherits from ABCMeta but rebinds __instancecheck__ and __subclasscheck__ to the relevant unbound method from type. That should remove the overhead back down to pretty much exactly what it was before (technically it'll be marginally slower because there will be an extra MRO entry in most classes, but it shouldn't be noticeable).

@eliarbel
Copy link
Contributor

For topological ordering, can we eventually get rid of using isinstance to derive the ordering labels by relying on proper polymorphism instead ? i.e. if all circuit elements derive from Operation, we can think of a polymorphic method that returns some ID info which can be leveraged by many algorithms. Though run-time type resolution is very common in Python for functional purposes (e.g. not just input validation), I think it should be avoided whenever possible

@mtreinish
Copy link
Member Author

For topological sort we shouldn't be relying on isinstance anywhere. For the lexicographical topological sort we do the tie breaking string sort key is str(qargs) and the rest of the sort should be isolated to retworkx internals. The place where we use isinstance more is filtering over the sorted nodes to get only gates, which we need to do to find the ordered subset of operations in the circuit which are gates. We could add a slotted type or gate boolean attribute to do the same which probably would be faster than isinstance with an abcmeta.

@jakelishman
Copy link
Member

jakelishman commented Jan 17, 2022

The sort key here is specific to CollectMultiQBlocks; it's not the default sort key for DAGCircuit.topological_sort, which does use object polymorphism (sort of) by accessing x.sort_key.

edit: Matthew just said something pretty similar haha.

I don't see why we should have a gate Boolean instead of a Gate interface. That's kind of the point of #7087 in the first place - having those features be a defined interface means you can attach extra functionality, and you get static type-checking that the operations you want to perform are well defined. It's also something we explicitly removed from DAGNode.

@eliarbel
Copy link
Contributor

So it seems we should live ABC as a base class for Operation and treat the performance degradation either in collect_key. @jakelishman what do you mean having a Gate interface instead of a gate Boolean, without using isinstance ?

@jakelishman
Copy link
Member

I think you've misunderstood me here - I think the best way to treat the performance is to change the base class of Operation to use a metaclass that has the original versions of __instancecheck__ and __subclasscheck__ that avoids the ABC.register overhead checking. I also think the usage of isinstance(x, Gate) is appropriate in this key function; it needs to find out if a particular circuit Operation also implements the more specialised interface Gate, and that's the exact reason for #7087 in the first place.

I'd agree that using run-time type information is bad if you're trying switch on all possible subclass instances to do something different - that's absolutely something polymorphism should be used for - but in this case there is specialist behaviour that is necessary only for one specific subinterface to Operation. If we wanted to use polymorphism over all implementors of Operation here, we'd have to add a dummy method to Operation, simply so Gate can override it, and everything else can ignore it. Anything that wants to be treated as a gate should inherit from Gate to be properly polymorphic, not try to use gate-specific hooks without implementing the rest of the interface.

Gate is actually a concrete class that can be instantiated, but really we treat it as the "unitary circuit operation" interface throughout Qiskit, and we require that all such operations inherit from Gate, so isinstance is the appropriate check to me.

@kdk
Copy link
Member

kdk commented Jan 21, 2022

I think this needs some more investigation before picking a direction here, especially to determine if we would expect this slow-down to grow as we increase either the number of operations in a circuit, the number of operation-types or something else.

>>> ABCMeta._dump_registry(Gate)
Class: qiskit.circuit.gate.Gate
Inv. counter: 60
_abc_registry: set()
_abc_cache: {<weakref at 0x13f23fdd0; to 'ABCMeta' at 0x7feff5cf5be0 (XGate)>}
_abc_negative_cache: set()
_abc_negative_cache_version: 60
  • It sounds like there was some investigation into the origin of the regression for CollectMultiQBlocks, but it's less clear to me where the regressions around the BasisTranslator are originating. That pass doesn't contain any obvious isinstance calls, and none of the DAGCircuit methods it uses (.op_nodes, and .substitute_ndoe_with_dag) do either. My best guess would be that this is coming from its use of the circuit_to_dag and dag_to_circuit converters and their use of QuantumCircuit._append, but if that were the case, I would've expected this commit to show up as a regression in some of our circuit building benchmarks, but those are all noticeably flat, even with 100k+ gates:

https://qiskit.github.io/qiskit/#circuit_construction.CircuitConstructionBench.time_circuit_construction?commits=05b60a5e
https://qiskit.github.io/qiskit/#circuit_construction.CircuitConstructionBench.time_circuit_copy?commits=05b60a5e
https://qiskit.github.io/qiskit/#converters.ConverterBenchmarks.time_dag_to_circuit?commits=05b60a5e
https://qiskit.github.io/qiskit/#converters.ConverterBenchmarks.time_circuit_to_dag?commits=05b60a5e
https://qiskit.github.io/qiskit/#ripple_adder.RippleAdderConstruction.time_build_ripple_adder?commits=05b60a5e
https://qiskit.github.io/qiskit/#ripple_adder.RippleAdderTranspile.time_transpile_square_grid_ripple_adder?commits=05b60a5e

  • @jakelishman 's suggestion seems like a reasonable way to solve the regression, but I am a bit hesitant to immediately jump to making an Operation a non-standard ABC without a better understanding of the origin of the problem, and it's likelihood to increase in magnitude . If we do go forward with that approach, we should make sure that when someone someday tries Operation.register(MyCustomOp) they see something like raise NotImplemenetedError("see GH-7528").

Possibly helpful reference: https://stackoverflow.com/questions/42378726/why-is-checking-isinstancesomething-mapping-so-slow

@alexanderivrii
Copy link
Contributor

alexanderivrii commented Feb 4, 2022

After investigating this a bit more (using cProfile):

  • As discussed previously, isinstance slows things down when Operation inherits from ABC
  • Additionally, it seems that returning fields directly, e.g.
class Instruction:
    def __init__(self):
        ...
        self.name = name

is faster than calling (polymorphic) functions:

    @property
    def name(self):
        """Return the name."""
        return self._name

and in the PR we have indeed changed name, num_qubits, and num_clbits to be properties. I believe this is what's responsible for basis_translator and assembler benchmarks.
(Note: if it would help, I could upload my profiling results somewhere).


Here are some numbers on a toy example, with classes

class Base(ABC): ...
class B(Base): ...
class C(B): ...
  • isinstance(elem, C) with Base not inheriting from ABC: 0.49
  • isinstance(elem, C) with Base inheriting from ABC: 1.17
  • inheriting from ABC, but using a function to return the "type": 0.57 (when returning int) and 0.51 (when returning bool)
  • inheriting from ABC, but using a field self._type: 0.33 when int, and 0.26 when bool

So, working with fields directly (especially bool ones) seems fastest (and does not depend on whether the base class inherits from ABC or not). Specifically, for CollectMultiQBlocks this would mean keeping a field in DagNode representing whether it's a DagInNode or DagOutNode, but (if I understand correctly) this is very closely related to what we are trying to deprecate.


I have tried to implement @jakelishman 's suggestion as follows (and probably did not do it fully correctly):

class ABCMeta_Hack(ABCMeta):

    def __instancecheck__(self, instance):
        return type.__instancecheck__(self, instance)

    def __subclasscheck__(self, subclass):
        return type.__subclasscheck__(self, subclass)


class ABC_Hack(metaclass=ABCMeta_Hack):
    __slots__ = ()

class Base(ABC_Hack): ...

but on the toy example it did not improve things compared to inheriting from ABC. If this wold help, here is the profiling data for Base not inheriting from anything, vs. Base inheriting from ABC_Hack vs Base inheriting from ABC:

 NO INHERITANCE:

  5000002 function calls in 1.363 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.830    0.830    1.363    1.363 test_code\test_isinstance.py:126(experiment_isinst)
  5000000    0.533    0.000    0.533    0.000 {built-in method builtins.isinstance}

ABC_HACK:

  11669922 function calls in 3.176 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.800    0.800    3.176    3.176 test_isinstance.py:126(experiment_isinst)
  5000000    1.094    0.000    2.376    0.000 {built-in method builtins.isinstance}
  3334960    0.864    0.000    1.282    0.000 test_code\test_isinstance.py:12(__instancecheck__)
  3334960    0.418    0.000    0.418    0.000 {function ABCMeta_Hack.__instancecheck__ at 0x0199BB68}

ABC:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.776    0.776    3.156    3.156 test_code\test_isinstance.py:126(experiment_isinst)
  5000000    1.132    0.000    2.380    0.000 {built-in method builtins.isinstance}
  3334960    0.663    0.000    1.248    0.000 Python\Python39-32\lib\abc.py:96(__instancecheck__)
  3334960    0.585    0.000    0.585    0.000 {built-in method _abc._abc_instancecheck}
        2    0.000    0.000    0.000    0.000 Python39-32\lib\abc.py:100(__subclasscheck__)
        2    0.000    0.000    0.000    0.000 {built-in method _abc._abc_subclasscheck}

@kdk
Copy link
Member

kdk commented Feb 15, 2022

Thanks for looking into this @alexanderivrii .

PR we have indeed changed name, num_qubits, and num_clbits to be properties. I believe this is what's responsible for basis_translator and assembler benchmarks.

Interesting, this seems like a likely culprit. It might be interesting to see if this can be mitigated by using class-level attributes, e.g.

class Clifford(Operation):
    name = 'clifford'

    def __init__(...):
        ....

This should satisfy the ABC requirements and be faster, and be more common than the case where these properties need to be dynamic (and thus @properties). (See also: #6883 )

(Note: if it would help, I could upload my profiling results somewhere).

I believe you should be able to attach files in the Github UI (by clicking the lower bar or drag and drop), but failing that, flamegraphs or cProfile outputs are still helpful.

@kdk
Copy link
Member

kdk commented Mar 29, 2022

For 0.20, since the circuit and transpiler are still using isinstance(..., Instruction), we should remove Operation as a base class for any of the instructions in the circuit library and defer the release note until we can investigate further.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants