feat: patches uses a map in some cases #1626

Merged: 4 commits, Dec 9, 2024

Conversation

@danking (Member) commented Dec 9, 2024

See [this sheet for the data from take_patches.rs](https://docs.google.com/spreadsheets/d/1D9vBZ1QJ6mwcIvV5wIL0hjGgVchcEnAyhvitqWu2ugU). I'm on an M3 Max with 96 GiB of RAM running macOS 14.4. This threshold likely depends on the ISA.

Intuitively, repeated searching is `O(N_INDICES * lg N_PATCHES)` and repeated map lookups are `O(N_INDICES + N_PATCHES)`. It seems to me that the compiler and CPU would have trouble parallelizing the search (via SIMD or ILP) because of the branching, whereas map lookups are more obviously parallelizable (e.g. SIMD hash computation). I'm not entirely sure why the crossover point seems to be around `N_PATCHES / N_INDICES = 5`. I believe the M3 Max has 128-bit vector registers, so if the indices are 32 bits wide then index arithmetic could be 4-way parallel.
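
To make the two cost models concrete, here is a minimal sketch of both lookup strategies and a ratio-based dispatch. It is not the code from this PR: the `Patches` struct, the function names, the `RATIO_THRESHOLD` constant, and the direction of the comparison (prefer the map when there are relatively few patches per take index) are all assumptions, the last one inferred only from the complexity argument above.

```rust
use std::collections::HashMap;

/// Hypothetical patch set: sorted, unique positions with one value per position.
struct Patches {
    indices: Vec<u32>,
    values: Vec<i64>,
}

/// Assumed crossover ratio N_PATCHES / N_INDICES, taken from the benchmark note above.
const RATIO_THRESHOLD: f64 = 5.0;

/// Strategy 1: binary-search the sorted patch indices once per take index.
/// Cost is roughly O(N_INDICES * lg N_PATCHES), with a data-dependent branch per probe.
fn take_search(patches: &Patches, take: &[u32]) -> Vec<Option<i64>> {
    take.iter()
        .map(|i| {
            patches
                .indices
                .binary_search(i)
                .ok()
                .map(|pos| patches.values[pos])
        })
        .collect()
}

/// Strategy 2: build a hash map once, then do one lookup per take index.
/// Cost is roughly O(N_PATCHES) to build plus O(N_INDICES) to probe.
fn take_map(patches: &Patches, take: &[u32]) -> Vec<Option<i64>> {
    let map: HashMap<u32, i64> = patches
        .indices
        .iter()
        .copied()
        .zip(patches.values.iter().copied())
        .collect();
    take.iter().map(|i| map.get(i).copied()).collect()
}

/// Dispatch on the patches-to-indices ratio. Building the map only pays off
/// when it is amortized over enough lookups, so this sketch uses the map
/// below the threshold and falls back to searching otherwise (an assumption).
fn take_patches(patches: &Patches, take: &[u32]) -> Vec<Option<i64>> {
    let ratio = patches.indices.len() as f64 / take.len() as f64;
    if ratio < RATIO_THRESHOLD {
        take_map(patches, take)
    } else {
        take_search(patches, take)
    }
}
```

Both strategies return the same result for any input; only the running time differs, which is why the choice can be made purely on the sizes of the two inputs.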

@danking danking requested a review from gatesn December 9, 2024 20:37
@danking danking marked this pull request as ready for review December 9, 2024 20:37
@danking danking merged commit 0b93fe0 into develop Dec 9, 2024
16 checks passed
@danking danking deleted the dk/restore-euro-2016-speed branch December 9, 2024 22:40
lwwmanning added a commit that referenced this pull request Dec 10, 2024
lwwmanning added a commit that referenced this pull request Dec 10, 2024
danking added a commit that referenced this pull request Dec 10, 2024
A second attempt at #1626 with fixes from #1628 as well as the transition of ALPRD and SparseArray
to use Patches.

---

See [this sheet for the data from
take_patches.rs](https://docs.google.com/spreadsheets/d/1D9vBZ1QJ6mwcIvV5wIL0hjGgVchcEnAyhvitqWu2ugU).
I'm on an M3 Max with 96 GiB of RAM with macOS 14.4. This threshold likely depends on the ISA.

Intuitively, repeated searching is `O(N_INDICES * lg N_PATCHES)` and
repeated map lookups are `O(N_INDICES + N_PATCHES)`. It seems to me that
the compiler and CPU would have trouble parallelizing the search (via SIMD
or ILP) because of the branching, whereas map lookups are more obviously
parallelizable (e.g. SIMD hash computation). I'm not entirely sure why the
crossover point seems to be around `N_PATCHES / N_INDICES = 5`. I believe
the M3 Max has 128-bit vector registers, so if the indices are 32 bits wide
then index arithmetic could be 4-way parallel.
danking added a commit that referenced this pull request Dec 10, 2024
danking added a commit that referenced this pull request Dec 11, 2024