Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework annotation ordering/optimisations #54289

Merged
merged 4 commits into from
Apr 30, 2024

Conversation

tecosaur
Copy link
Contributor

There are some annoyances/discrepancies with annotation handling that I originally tried to fix in #53794, #53800 and #53801.

@LilithHafner was kind enough to have a look at these, and after a long chat we came to the conclusion that this isn't the right approach to take, and that the order with which annotations are applied should be given primacy over annotation ranges.

This PR supersedes the aforementioned PRs, and should make the way annotations are ordered more sensible.

See the commit messages for more details.

@tecosaur tecosaur added strings "Strings!" backport 1.11 Change should be backported to release-1.11 labels Apr 28, 2024
@tecosaur
Copy link
Contributor Author

Ah, tests are failing because I missed a use of Base.annotatedstring_optimize! in StyledStrings' tests. I'll fix that and re-bump the stdlib shortly.

@tecosaur tecosaur force-pushed the rework-annotation-optimisation branch from a1539b8 to b4a4dec Compare April 28, 2024 16:34
base/strings/annotated.jl Outdated Show resolved Hide resolved
@LilithHafner
Copy link
Member

LilithHafner commented Apr 29, 2024

What part of annotations are semantically meaningful/visible?

IIUC the annotations

[(1:1, :a => 1), (2:2, :a => 1)] and [(1:2, :a => 1)] are semantically equivalent
[(1:1, :a => 1), (2:2, :a => 1)] and [ (2:2, :a => 1), (1:1, :a => 1)] are semantically equivalent
[(1:1, :a => 1), (1:1, :a => 2)] and [ (1:1, :a => 2), (1:1, :a => 1)] are semantically different
[(1:1, :a => 1), (1:1, :b => 2)] and [ (1:1, :a => 2), (1:1, :b => 1)] are semantically different

This should be documented. Specifically, a simple specification that describes what is and is not semantically visible from which one can derive the above (or a different result).

Ideally everything returned by annotations(::AnnotatedString) would be semantically meaningful, that's an easy semantic to document. (and https://www.hyrumslaw.com/)

(I can't review the correctness of an optimization without knowing what "correct" means)

@tecosaur
Copy link
Contributor Author

Ah yep, we did also discuss a need for more info in the docstrings. I'll see about adding some of the elaboration needed.

@tecosaur tecosaur force-pushed the rework-annotation-optimisation branch from b4a4dec to f869c2f Compare April 29, 2024 15:35
Copy link
Member

@LilithHafner LilithHafner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    for (region, annot) in str.annotations
        if region == fullregion
            push!(annotations, (firstindex(unannot):lastindex(unannot), annot))
        end
    end
    for offset in 0:len:(r-1)*len
        for (region, annot) in str.annotations
            if region != fullregion
                push!(annotations, (region .+ offset, annot))
            end
        end
    end

This is buggy in the new semantic because it re-orders annotations. Not necessarily blocking this PR, but an issue.

end
# Insert any extra entries in the appropriate position
for (offset, (i, entry)) in enumerate(extras)
insert!(annotations, i + offset, entry)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is O(n^2), but we can revise this function for perf later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, and I don't think this should be in the "hot path" of any common use-cases 🤞.

first(first(annot)) == 1 || continue
if last(annot) == last(last(io.annotations))
valid_run = true
for runlen in 1:i
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also O(n^2) when it could be O(n), but perf can wait.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it will be O(n) in non-pathological cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, this is written with the non-pathological case in mind, and if you consider the case where few-annotation insertions are repeatedly made, individual insertions should be pretty much O(1) with a constant factor based on the average "checked annotation depth" / average number of annotations each insertion.

Copy link
Member

@LilithHafner LilithHafner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be merged and followup prs should address the performance of this PR's implementations and address the now-invalid-ness of the repeat function.

@tecosaur
Copy link
Contributor Author

I've just adjusted repeat to be consistent with the new semantic,

fullregion = firstindex(str):lastindex(str)
if allequal(first, str.annotations) && first(first(str.annotations)) == fullregion
    newfullregion = firstindex(unannot):lastindex(unannot)
    for (_, annot) in str.annotations
        push!(annotations, (newfullregion, annot))
    end
else
    for offset in 0:len:(r-1)*len
        for (region, annot) in str.annotations
            push!(annotations, (region .+ offset, annot))
        end
    end
end

this change has been folded into 2656884453 * Remove strong ordering of annotation ranges

@tecosaur tecosaur force-pushed the rework-annotation-optimisation branch from f869c2f to 1bed84f Compare April 30, 2024 02:11
@tecosaur
Copy link
Contributor Author

Now that #54308 has been merged, we need to resolve the merge conflict and adjust the test introduced there.

This is in preparation for changes to the way annotation ordering is
handled in Base.
After a long chat with Lilith Halfner, we've come to the conclusion that
the range-based ordering applied to annotations, while nice for making
some otherwise O(n) code O(log n) and O(n^2) code O(n), is actually
assuming too much about how annotations are used and interact with each
other. Removing all assumptions about ordering, and giving annotation
order primacy seems like the most sensible thing to do, even if it makes
a few bits of the code less algorithmically "nice".

As a consequence, we also get rid of annotatedstring_optimize!. Specific
producers/consumers of annotated text will know what assumptions can be
made to compress/optimise the annotations used, and are thus best suited
to do so themselves. The one exception to this is probably when writing
to an AnnotatedIOBuffer, here adding a specific optimisation to
_insert_annotations! probably makes sense, and will be explored soon.
This is a rather important optimisation, since it prevents the
annotation blow-up that can result from say writing to an
AnnotatedIOBuffer char-by-char. Originally I was just going to pass the
AnnotatedString produced when reading the AnnotatedIOBuffer through
annotatedstring_optimize!, but now that's been removed, this seems like
the best past forwards (it's also actually a better approach than
applying annotatedstring_optimize!, just hard to justify when that code
already existed).
It's important to specify the way that annotations relate to the
characters of the underlying string and each other.

Along the way, it's also worth explaining the behaviour of the internal
functions _clear_annotations_in_region! and _insert_annotations!.
@tecosaur tecosaur force-pushed the rework-annotation-optimisation branch from 1bed84f to 4c18472 Compare April 30, 2024 11:15
@tecosaur
Copy link
Contributor Author

After CI, I think this should be good to merge if that also sounds good by @LilithHafner 🙂.

@LilithHafner
Copy link
Member

1bed84f LGTM. I have not reviewed https://github.com/JuliaLang/julia/compare/1bed84f4ae6fe6fc9bb39612e274fd1630a51349..4c18472b3e56998f2784ecd5d9deea4cf5371850, but am open to suggestions to how I could do that without re-reviewing this PR from scratch.

@tecosaur
Copy link
Contributor Author

The only change made here in the diff you link is

    @test chopprefix(sprint(show, str), "Base.") ==
-        "AnnotatedString{String}(\"some string\", [(1:4, :thing => 0x01), (1:11, :all => 0x03), (6:11, :other => 0x02)])"
+        "AnnotatedString{String}(\"some string\", [(1:4, :thing => 0x01), (6:11, :other => 0x02), (1:11, :all => 0x03)])"

i.e. correctly resolving the change introduced by #54308.

HTH.

@LilithHafner
Copy link
Member

Okay, the diff in your comment also LGTM.

@tecosaur tecosaur merged commit fe554b7 into JuliaLang:master Apr 30, 2024
7 checks passed
@tecosaur tecosaur deleted the rework-annotation-optimisation branch April 30, 2024 15:14
@tecosaur tecosaur mentioned this pull request Apr 30, 2024
59 tasks
@IanButterworth
Copy link
Member

@tecosaur I know you have a preference to not squash but for things that will get backported, it introduces unnecessary stumbling blocks in that automated process that would be better to avoid.

@tecosaur
Copy link
Contributor Author

tecosaur commented May 1, 2024

Thanks for mentioning that Ian, I do like to preserve more granular information when it seems sensible, but I wasn't aware that complicated the backporting process. I'll keep that in mind with the few remaining PRs I'm hoping to get backported 🙂.

(As an aside, is the automated backporting doing something other than cherry picking? Or is the hassle having to supply -m?)

@IanButterworth
Copy link
Member

Yeah the automatic script doesn't handle adding -m. If you wanted to help with adding that it's here https://github.com/KristofferC/Backporter/blob/master/backporter.jl

tecosaur added a commit that referenced this pull request May 6, 2024
Improvements to the consistency and semantics of AnnotatedStrings, mainly around the ordering of annotations.
KristofferC added a commit that referenced this pull request May 28, 2024
Backported PRs:
- [x] #53665 <!-- use afoldl instead of tail recursion for tuples -->
- [x] #53976 <!-- LinearAlgebra: LazyString in interpolated error
messages -->
- [x] #54005 <!-- make `view(::Memory, ::Colon)` produce a Vector -->
- [x] #54010 <!-- Overload `Base.literal_pow` for `AbstractQ` -->
- [x] #54069 <!-- Allow PrecompileTools to see MI's inferred by foreign
abstract interpreters -->
- [x] #53750 <!-- inference correctness: fields and globals can revert
to undef -->
- [x] #53984 <!-- Profile: fix heap snapshot is valid char check -->
- [x] #54102 <!-- Explicitly compute stride in unaliascopy for SubArray
-->
- [x] #54070 <!-- Fix integer overflow in `skip(s::IOBuffer,
typemax(Int64))` -->
- [x] #54013 <!-- Support case-changes to Annotated{String,Char}s -->
- [x] #53941 <!-- Fix writing of AnnotatedChars to AnnotatedIOBuffer -->
- [x] #54137 <!-- Fix typo in docs for `partialsortperm` -->
- [x] #54129 <!-- use correct size when creating output data from an
IOBuffer -->
- [x] #54153 <!-- Fixup IdSet docstring -->
- [x] #54143 <!-- Fix `make install` from tarballs -->
- [x] #54151 <!-- LinearAlgebra: Correct zero element in
`_generic_matvecmul!` for block adj/trans -->
- [x] #54213 <!-- Add `public` statement to `Base.GC` -->
- [x] #54222 <!-- Utilize correct tbaa when emitting stores of unions.
-->
- [x] #54233 <!-- set MAX_OS_WRITE on unix -->
- [x] #54255 <!-- fix `_checked_mul_dims` in the presence of 0s and
overflow. -->
- [x] #54259 <!-- Fix typo in `readuntil` -->
- [x] #54251 <!-- fix typo in gc_mark_memory8 when chunking a large
array -->
- [x] #54276 <!-- Fix solve for complex `Hermitian` with non-vanishing
imaginary part on diagonal -->
- [x] #54248 <!-- ensure package callbacks are invoked when no valid
precompile file exists for an "auto loaded" stdlib -->
- [x] #54308 <!-- Implement eval-able AnnotatedString 2-arg show -->
- [x] #54302 <!-- Specialised substring equality for annotated strs -->
- [x] #54243 <!-- prevent `package_callbacks` to run multiple time for a
single package -->
- [x] #54350 <!-- add a precompile signature to Artifacts code that is
used by JLLs -->
- [x] #54331 <!-- correctly track freed bytes in
jl_genericmemory_to_string -->
- [x] #53509 <!-- revert moving "creating packages" from Pkg.jl -->
- [x] #54335 <!-- When accessing the data pointer for an array, first
decay it to a Derived Pointer -->
- [x] #54239 <!-- Make sure `fieldcount` constant-folds for `Tuple{...}`
-->
- [x] #54288
- [x] #54067
- [x] #53715 <!-- Add read/write specialisation for IOContext{AnnIO} -->
- [x] #54289 <!-- Rework annotation ordering/optimisations -->
- [x] #53815 <!-- create phantom task for GC threads -->
- [x] #54130 <!-- inference: handle `LimitedAccuracy` in
`handle_global_assignment!` -->
- [x] #54428 <!-- Move ConsoleLogging.jl into Base -->
- [x] #54332 <!-- Revert "add unsetindex support to more copyto methods
(#51760)" -->
- [x] #53826 <!-- Make all command-line options documented in all
related files -->
- [x] #54465 <!-- typeintersect: conservative typevar subtitution during
`finish_unionall` -->
- [x] #54514 <!-- typeintersect: followup cleanup for the nothrow path
of type instantiation -->
- [x] #54499 <!-- make `@doc x` work without REPL loaded -->
- [x] #54210 <!-- attach finalizer in `mmap` to the correct object -->
- [x] #54359 <!-- Pkg REPL: cache `pkg_mode` lookup -->

Non-merged PRs with backport label:
- [ ] #54471 <!-- Actually setup jit targets when compiling
packageimages instead of targeting only one -->
- [ ] #54457 <!-- Make `String(::Memory)` copy -->
- [ ] #54323 <!-- inference: fix too conservative effects for recursive
cycles -->
- [ ] #54322 <!-- effects: add new `@consistent_overlay` macro -->
- [ ] #54191 <!-- make `AbstractPipe` public -->
- [ ] #53957 <!-- tweak how filtering is done for what packages should
be precompiled -->
- [ ] #53882 <!-- Warn about cycles in extension precompilation -->
- [ ] #53707 <!-- Make ScopedValue public -->
- [ ] #53452 <!-- RFC: allow Tuple{Union{}}, returning Union{} -->
- [ ] #53402 <!-- Add `jl_getaffinity` and `jl_setaffinity` -->
- [ ] #53286 <!-- Raise an error when using `include_dependency` with
non-existent file or directory -->
- [ ] #52694 <!-- Reinstate similar for AbstractQ for backward
compatibility -->
- [ ] #51479 <!-- prevent code loading from lookin in the versioned
environment when building Julia -->
@KristofferC KristofferC removed the backport 1.11 Change should be backported to release-1.11 label May 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
don't squash Don't squash merge strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants