Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize Pants's data structures by not sorting as much (FrozenOrderedSet and FrozenDict) #14719

Open
Eric-Arellano opened this issue Mar 7, 2022 · 4 comments

Comments

@Eric-Arellano
Copy link
Contributor

As the author of both FrozenDict and FrozenOrderedSet, I think I was misguided in thinking that deterministic ordering actually matters for the engine. Per #14715 (comment), I believe that all we need is a deterministic __eq__ and __hash__ - the underlying ordering of the elements should be irrelevant and abstracted from the engine.

the underlying ordering of the elements should be irrelevant and abstracted from the engine.

The only place I think ordering should matter is things like error messages, if we want to alphabetize for example. But we can call sorted() at the callsites.

--

Because both FrozenDict and FrozenOrderedSet consider the order of elements in their __eq__, we have to eagerly call sort in lots of places, which is wasted computation. For example:

return FirstPartyPythonModuleMapping(
(k, tuple(sorted(v))) for k, v in sorted(modules_to_providers.items())
)

We must still use FrozenDict rather than dict so that it's hashable, but I don't think we need to care anymore about ordering.

I wonder if we even need OrderedSet and FrozenOrderedSet...can we use set and frozenset? Even if we keep those classes, can we stop using them use much and do the simpler + faster thing of stdlib?

--

To implement this change, I recommend:

  1. Fix the __eq__ implementations for FrozenDict and {Frozen,}OrderedSet
  2. Audit calls to sorted() and the class property sort_input, remove when possible.
@jyggen
Copy link
Member

jyggen commented Mar 15, 2022

I've started to look into this one.

Am I wrong thinking that __eq__ for FrozenDict and {Frozen,}OrderedSet can simply be self._data == other._data and self._items == other._items respectively since Python supports dict comparisons natively?

FrozenDict also implements __lt__, and < on dicts isn't supported - so if that needs to be optimized as well we need another solution for that. Counter(self.items()) < Counter(other.items()) seems consistent regardless of key order, so maybe that's an option.

@Eric-Arellano
Copy link
Contributor Author

I've started to look into this one.

🎉

Am I wrong thinking that eq for FrozenDict and {Frozen,}OrderedSet can simply be self._data == other._data and self._items == other._items respectively since Python supports dict comparisons natively?

Not at all wrong. That's I think what we should do.

FrozenDict also implements lt

Hm I don't remember why we did this. I think so that we can sort dataclasses that compose FrozenDicts. But maybe that's not necessary? I'd recommend:

  1. do whatever will keep semantics identical for now for __lt__, even if not optimal
  2. fix __eq__
  3. then, in a new PR, try changing the __lt__ semantics and see how CI does. Possibly get rid of __lt__

Eric-Arellano added a commit that referenced this issue Apr 13, 2022
…lementation (#15121)

Closes #15111. This is a particular edge case when the same python module has multiple entries, and those entries come from the same target.

The better fix is #14719.

[ci skip-rust]
[ci skip-build-wheels]
Eric-Arellano added a commit to Eric-Arellano/pants that referenced this issue Apr 13, 2022
…lementation (pantsbuild#15121)

Closes pantsbuild#15111. This is a particular edge case when the same python module has multiple entries, and those entries come from the same target.

The better fix is pantsbuild#14719.

[ci skip-rust]
[ci skip-build-wheels]
Eric-Arellano added a commit to Eric-Arellano/pants that referenced this issue Apr 13, 2022
…lementation (pantsbuild#15121)

Closes pantsbuild#15111. This is a particular edge case when the same python module has multiple entries, and those entries come from the same target.

The better fix is pantsbuild#14719.

[ci skip-rust]
[ci skip-build-wheels]
Eric-Arellano added a commit that referenced this issue Apr 13, 2022
…lementation (Cherry-pick of #15121) (#15127)

Closes #15111. This is a particular edge case when the same python module has multiple entries, and those entries come from the same target.

The better fix is #14719.

[ci skip-rust]
[ci skip-build-wheels]
Eric-Arellano added a commit that referenced this issue Apr 13, 2022
…lementation (Cherry-pick of #15121) (#15128)

Closes #15111. This is a particular edge case when the same python module has multiple entries, and those entries come from the same target.

The better fix is #14719.

[ci skip-rust]
[ci skip-build-wheels]
@Eric-Arellano
Copy link
Contributor Author

Thinking more about this...order should probably matter for FrozenOrderedSet! It's in the name! Instead, we should be using frozenset rather than FrozenOrderedSet most of the time.

Ordering should not matter for FrozenDict.

@stuhood
Copy link
Member

stuhood commented May 10, 2022

Instead, we should be using frozenset rather than FrozenOrderedSet most of the time.

Maybe? See #14195 (comment) re: sets.

If something needs to iterate over a set, it should use FrozenOrderedSet. And assuming that the inputs to that FrozenOrderedSet are deterministic (because they themselves were not produced by iterating over non-Ordered sets), then you're good to not sort the inputs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants