refactor: Add helper to factor out newline + padding pattern #98

ConnorGray · 2025-02-17T03:39:43Z

Hey 👋, big fan and first-time contributor to this project🙂

I'm building a Markdown editor that incorporates pulldown-cmark-to-cmark, and thought it might be time that I try to contribute some effort back to this project (and perhaps improve on a few issues I've run into). I've found that a great way to learn a new codebase is to start with making some small improvements that don't try to change behavior, so that's what I'm starting with here 🙂

Please let me know if you'd like anything changed with this patch, or if you'd prefer I start with some other kind of change.

If you happen to be interested in how I'm putting this project to use, click here to read a bit about that.

I use pulldown-cmark-to-cmark as part of markdown-ast, which supports converting Markdown → Events → an AST (and back again). I've developing that project as part of a quest to build a "WYSISYG"-style editor for Markdown within Wolfram notebooks.

Overall pulldown-cmark-to-cmark works great for that use case, but since my flow is Foo.md → <edit file> → Foo.md, I care especially that the Markdown "round-trips" as well as possible.

I understand that round-tripping of Markdown is likely quite difficult in general, but I'm hoping that, over time, a few adjustments and additional configuration knobs on pulldown-cmark-to-cmark can make it tractable for certain use cases.

For example, if the Foo.md being round-tripped is restricted by Markdown linting that runs in CI (enforcing things like consistent use of '*' vs '-' for bulletted lists, exact numbers of blank lines between paragraphs, etc.), then it is much easier to imagine pulldown-cmark-to-cmark generating output that conforms to particular, fixed rules.

I'm also hoping to eventually refactor my clap-markdown crate to use pulldown-cmark-to-cmark instead of bespoke (error prone), manual generation of a Markdown string.

After writing a newline into the generated Markdown output to start a new line of output, it is (almost) always necessary to output the "padding" characters used to indent the content at the current location in the document.

Since writing a newline and writing the padding are always paired, factoring them out into a function should help with readability and consistency in performing this minor two-step dance correctly.

Byron

Thanks a lot for this PR with one of the most considerate and mindful descriptions I have seen in a while!

Hey 👋, big fan and first-time contributor to this project🙂

🙏☺️

I understand that round-tripping of Markdown is likely quite difficult in general, but I'm hoping that, over time, a few adjustments and additional configuration knobs on pulldown-cmark-to-cmark can make it tractable for certain use cases.

You have probably seen the markdown spec tests already which also use roundtripping as indicator of success (I think). Getting to 100% there would probably be meaningful to you.

Besides that, I noticed that pulldown-cmark tends to degenerate information, so true round-tripping would only be possible if the whole input buffer is used along with the version of the function here that also provides indices into the input buffer (so skipped characters can be taken into consideration).

I'm also hoping to eventually refactor my clap-markdown crate to use pulldown-cmark-to-cmark instead of bespoke (error prone), manual generation of a Markdown string.

It will be interesting to see if the typed version of this will be better, in the sense that it prevents certain bugs that would have been invisible before.
The reason I am cautious here is that Events are assumed to be valid because pulldown-cmark generated them, which is not true anymore if they are generated 'by hand'. Probably I am missing something though 😁.

src/text_modifications.rs

Byron · 2025-02-17T06:56:14Z

src/text_modifications.rs

+///
+/// Concretely, a call to [`write_padded_newline()`] after the first line in the
+/// paragraph of the list item would write `">···"`.
+pub fn write_padded_newline<F>(formatter: &mut F, state: &State<'_>) -> Result<(), fmt::Error>


Despite the nice docs, I'd lean towards making this private, unless you want to be its first user.

In any case, I think it's easier to read when going with formatter: &mut impl fmt::Write.

Sounds good; I erred on the side of matching the pub of the other functions in this file, but I'll change this to pub(crate) since this is merely a helper function :)

I've updated to use impl trait for the argument 👍

If you'd like, I can make a small follow-up PR that changes the rest of the functions in text_modifications.rs to use pub(crate) as well, since the text_modifications module itself is private, and these functions are never re-exported publicly (that I can tell). I also found it a bit unexpected that they were all pub but not actually externally public.

Ah, I see: It wouldn't have been exported publicly despite the pub. Something that isn't obvious in my armchair review, and that is nicely fixed with pub(crate) - thanks!

When writing a newline into the generated Markdown content to start a new line of output, it is (almost) always necessary to output the "padding" characters used to indent the content at the current location in the document, based on the hierarchy of block-level elements the output cursor is "inside" of. Since writing a newline and writing the current padding are always paired, factoring them out into a function should help with readability and consistency in performing this minor two-step dance correctly.

ConnorGray · 2025-02-17T07:47:51Z

Thanks a lot for this PR with one of the most considerate and mindful descriptions I have seen in a while!

Thank you for the speedy and engaged maintenance of this project, and keeping it working nicely all these years! 🙂

You have probably seen the markdown spec tests already which also use roundtripping as indicator of success (I think). Getting to 100% there would probably be meaningful to you.

I'd noticed them, but not run them myself yet. For discussions sake, it might help to clarify that I'm interested in "byte-for-byte" round-tripping, whereas I understand many tests to only require "semantic" round tripping. To be concrete, these two Markdown programs are "semantically" the same (parse to the same Event sequence), but pulldown-cmark-to-cmark will (as of today) always produce the later:

> Foo
> Bar

 >
 > Foo
 > Bar

You could correct for this perhaps by keeping track of offsets, but I'm aiming at a simpler solution: make pulldown-cmark-to-cmark a little bit more configurable around "quirks" like the above, so that users can choose to always get Markdown back out in their preferred format (and matching any Markdown linting tool they / their company might be using).

Besides that, I noticed that pulldown-cmark tends to degenerate information, so true round-tripping would only be possible if the whole input buffer is used along with the version of the function here that also provides indices into the input buffer (so skipped characters can be taken into consideration).

This matches my experience as well. pulldown-cmark effectively produces a "flattened" sequence of AST events, which I unflatten as part of markdown-ast. But AFAIK there is no "concrete syntax tree" (CST) parser for CommonMark within the Rust ecosystem, which is what you'd probably need if you were really determined to have byte-for-byte round-tripping working.

As mentioned above, for my use case, I fortunately don't quite need to go that far :)

It will be interesting to see if the typed version of this will be better, in the sense that it prevents certain bugs that would have been invisible before.
The reason I am cautious here is that Events are assumed to be valid because pulldown-cmark generated them, which is not true anymore if they are generated 'by hand'. Probably I am missing something though 😁.

Ah I was a bit unclear: I won't be directly calling pulldown-cmark-to-cmark APIs and constructing the Events manually; I'll refactor clap-markdown to use markdown-ast, which will look something like:

let block = Block::Paragraph(Inlines(vec![
        Inline::Text("Hello! This is a paragraph ".to_owned()),
        Inline::Strong(Inlines(vec![
            Inline::Text("with bold text".to_owned()),
        ])),
        Inline::Text(".".to_owned())
    ]));

println!("Markdown:\n\n {}", ast_to_markdown(&[block]));

Internally, ast_to_markdown() calls ast_to_events(), which converts the markdown_ast::Block "tree" into a flat sequence of Event, and then in turn calls events_to_markdown(), which is just a very thin wrapper around pulldown-cmark-to-cmark.

So the Event stream that pulldown-cmark-to-cmark will be "correct by construction" because markdown-ast enforces that.

Byron

You could correct for this perhaps by keeping track of offsets, but I'm aiming at a simpler solution: make pulldown-cmark-to-cmark a little bit more configurable around "quirks" like the above, so that users can choose to always get Markdown back out in their preferred format (and matching any Markdown linting tool they / their company might be using).

I see, that makes sense and be much easier to implement.

Ah I was a bit unclear: I won't be directly calling pulldown-cmark-to-cmark APIs and constructing the Events manually; I'll refactor clap-markdown to use markdown-ast, which will look something like:

That's more like it - nice!

So the Event stream that pulldown-cmark-to-cmark will be "correct by construction" because markdown-ast enforces that

Thanks for sharing, it's nice to see how this crate seems to fit in nicely at the end of that conversion pipeline!

If you'd like, I can make a small follow-up PR that changes the rest of the functions [..]

Thanks for offering. If you happen to keep contributing something else you could U-Boot that, but no need for a PR just for that.

src/text_modifications.rs

ConnorGray force-pushed the connorgray/refactor-1 branch 2 times, most recently from 3e18243 to 4b6b9a3 Compare February 17, 2025 03:45

Byron requested changes Feb 17, 2025

View reviewed changes

ConnorGray force-pushed the connorgray/refactor-1 branch from 4b6b9a3 to 2252ba1 Compare February 17, 2025 07:24

fixup: Adjust wording + typos

05e247e

Byron reviewed Feb 18, 2025

View reviewed changes

src/text_modifications.rs Outdated Show resolved Hide resolved

Update doc-string so State::padding is shown conventionally

e10010f

Byron merged commit b47d6c9 into Byron:main Feb 18, 2025
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: Add helper to factor out newline + padding pattern #98

refactor: Add helper to factor out newline + padding pattern #98

ConnorGray commented Feb 17, 2025

Byron left a comment

Byron Feb 17, 2025

ConnorGray Feb 17, 2025

Byron Feb 18, 2025

ConnorGray commented Feb 17, 2025

Byron left a comment

refactor: Add helper to factor out newline + padding pattern #98

refactor: Add helper to factor out newline + padding pattern #98

Conversation

ConnorGray commented Feb 17, 2025

Byron left a comment

Choose a reason for hiding this comment

Byron Feb 17, 2025

Choose a reason for hiding this comment

ConnorGray Feb 17, 2025

Choose a reason for hiding this comment

Byron Feb 18, 2025

Choose a reason for hiding this comment

ConnorGray commented Feb 17, 2025

Byron left a comment

Choose a reason for hiding this comment