Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Add helper to factor out newline + padding pattern #98

Merged
merged 3 commits into from
Feb 18, 2025

Conversation

ConnorGray
Copy link
Contributor

Hey 👋, big fan and first-time contributor to this project🙂

I'm building a Markdown editor that incorporates pulldown-cmark-to-cmark, and thought it might be time that I try to contribute some effort back to this project (and perhaps improve on a few issues I've run into). I've found that a great way to learn a new codebase is to start with making some small improvements that don't try to change behavior, so that's what I'm starting with here 🙂

Please let me know if you'd like anything changed with this patch, or if you'd prefer I start with some other kind of change.

If you happen to be interested in how I'm putting this project to use, click here to read a bit about that.

I use pulldown-cmark-to-cmark as part of markdown-ast, which supports converting Markdown → Events → an AST (and back again). I've developing that project as part of a quest to build a "WYSISYG"-style editor for Markdown within Wolfram notebooks.

Overall pulldown-cmark-to-cmark works great for that use case, but since my flow is Foo.md → <edit file> → Foo.md, I care especially that the Markdown "round-trips" as well as possible.

I understand that round-tripping of Markdown is likely quite difficult in general, but I'm hoping that, over time, a few adjustments and additional configuration knobs on pulldown-cmark-to-cmark can make it tractable for certain use cases.

For example, if the Foo.md being round-tripped is restricted by Markdown linting that runs in CI (enforcing things like consistent use of '*' vs '-' for bulletted lists, exact numbers of blank lines between paragraphs, etc.), then it is much easier to imagine pulldown-cmark-to-cmark generating output that conforms to particular, fixed rules.

I'm also hoping to eventually refactor my clap-markdown crate to use pulldown-cmark-to-cmark instead of bespoke (error prone), manual generation of a Markdown string.


After writing a newline into the generated Markdown output to start a new line of output, it is (almost) always necessary to output the "padding" characters used to indent the content at the current location in the document.

Since writing a newline and writing the padding are always paired, factoring them out into a function should help with readability and consistency in performing this minor two-step dance correctly.

@ConnorGray ConnorGray force-pushed the connorgray/refactor-1 branch 2 times, most recently from 3e18243 to 4b6b9a3 Compare February 17, 2025 03:45
Copy link
Owner

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this PR with one of the most considerate and mindful descriptions I have seen in a while!

Hey 👋, big fan and first-time contributor to this project🙂

🙏☺️

I understand that round-tripping of Markdown is likely quite difficult in general, but I'm hoping that, over time, a few adjustments and additional configuration knobs on pulldown-cmark-to-cmark can make it tractable for certain use cases.

You have probably seen the markdown spec tests already which also use roundtripping as indicator of success (I think). Getting to 100% there would probably be meaningful to you.

Besides that, I noticed that pulldown-cmark tends to degenerate information, so true round-tripping would only be possible if the whole input buffer is used along with the version of the function here that also provides indices into the input buffer (so skipped characters can be taken into consideration).

I'm also hoping to eventually refactor my clap-markdown crate to use pulldown-cmark-to-cmark instead of bespoke (error prone), manual generation of a Markdown string.

It will be interesting to see if the typed version of this will be better, in the sense that it prevents certain bugs that would have been invisible before.
The reason I am cautious here is that Events are assumed to be valid because pulldown-cmark generated them, which is not true anymore if they are generated 'by hand'. Probably I am missing something though 😁.

///
/// Concretely, a call to [`write_padded_newline()`] after the first line in the
/// paragraph of the list item would write `">···"`.
pub fn write_padded_newline<F>(formatter: &mut F, state: &State<'_>) -> Result<(), fmt::Error>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Despite the nice docs, I'd lean towards making this private, unless you want to be its first user.

In any case, I think it's easier to read when going with formatter: &mut impl fmt::Write.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good; I erred on the side of matching the pub of the other functions in this file, but I'll change this to pub(crate) since this is merely a helper function :)

I've updated to use impl trait for the argument 👍

If you'd like, I can make a small follow-up PR that changes the rest of the functions in text_modifications.rs to use pub(crate) as well, since the text_modifications module itself is private, and these functions are never re-exported publicly (that I can tell). I also found it a bit unexpected that they were all pub but not actually externally public.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see: It wouldn't have been exported publicly despite the pub. Something that isn't obvious in my armchair review, and that is nicely fixed with pub(crate) - thanks!

When writing a newline into the generated Markdown content
to start a new line of output, it is (almost) always necessary
to output the "padding" characters used to indent the content
at the current location in the document, based on the hierarchy
of block-level elements the output cursor is "inside" of.

Since writing a newline and writing the current padding are
always paired, factoring them out into a function should help
with readability and consistency in performing this minor
two-step dance correctly.
@ConnorGray ConnorGray force-pushed the connorgray/refactor-1 branch from 4b6b9a3 to 2252ba1 Compare February 17, 2025 07:24
@ConnorGray
Copy link
Contributor Author

Thanks a lot for this PR with one of the most considerate and mindful descriptions I have seen in a while!

Thank you for the speedy and engaged maintenance of this project, and keeping it working nicely all these years! 🙂

You have probably seen the markdown spec tests already which also use roundtripping as indicator of success (I think). Getting to 100% there would probably be meaningful to you.

I'd noticed them, but not run them myself yet. For discussions sake, it might help to clarify that I'm interested in "byte-for-byte" round-tripping, whereas I understand many tests to only require "semantic" round tripping. To be concrete, these two Markdown programs are "semantically" the same (parse to the same Event sequence), but pulldown-cmark-to-cmark will (as of today) always produce the later:

> Foo
> Bar
 >
 > Foo
 > Bar

You could correct for this perhaps by keeping track of offsets, but I'm aiming at a simpler solution: make pulldown-cmark-to-cmark a little bit more configurable around "quirks" like the above, so that users can choose to always get Markdown back out in their preferred format (and matching any Markdown linting tool they / their company might be using).

Besides that, I noticed that pulldown-cmark tends to degenerate information, so true round-tripping would only be possible if the whole input buffer is used along with the version of the function here that also provides indices into the input buffer (so skipped characters can be taken into consideration).

This matches my experience as well. pulldown-cmark effectively produces a "flattened" sequence of AST events, which I unflatten as part of markdown-ast. But AFAIK there is no "concrete syntax tree" (CST) parser for CommonMark within the Rust ecosystem, which is what you'd probably need if you were really determined to have byte-for-byte round-tripping working.

As mentioned above, for my use case, I fortunately don't quite need to go that far :)

It will be interesting to see if the typed version of this will be better, in the sense that it prevents certain bugs that would have been invisible before.
The reason I am cautious here is that Events are assumed to be valid because pulldown-cmark generated them, which is not true anymore if they are generated 'by hand'. Probably I am missing something though 😁.

Ah I was a bit unclear: I won't be directly calling pulldown-cmark-to-cmark APIs and constructing the Events manually; I'll refactor clap-markdown to use markdown-ast, which will look something like:

let block = Block::Paragraph(Inlines(vec![
        Inline::Text("Hello! This is a paragraph ".to_owned()),
        Inline::Strong(Inlines(vec![
            Inline::Text("with bold text".to_owned()),
        ])),
        Inline::Text(".".to_owned())
    ]));

println!("Markdown:\n\n {}", ast_to_markdown(&[block]));

Internally, ast_to_markdown() calls ast_to_events(), which converts the markdown_ast::Block "tree" into a flat sequence of Event, and then in turn calls events_to_markdown(), which is just a very thin wrapper around pulldown-cmark-to-cmark.

So the Event stream that pulldown-cmark-to-cmark will be "correct by construction" because markdown-ast enforces that.

Copy link
Owner

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could correct for this perhaps by keeping track of offsets, but I'm aiming at a simpler solution: make pulldown-cmark-to-cmark a little bit more configurable around "quirks" like the above, so that users can choose to always get Markdown back out in their preferred format (and matching any Markdown linting tool they / their company might be using).

I see, that makes sense and be much easier to implement.

Ah I was a bit unclear: I won't be directly calling pulldown-cmark-to-cmark APIs and constructing the Events manually; I'll refactor clap-markdown to use markdown-ast, which will look something like:

That's more like it - nice!

So the Event stream that pulldown-cmark-to-cmark will be "correct by construction" because markdown-ast enforces that

Thanks for sharing, it's nice to see how this crate seems to fit in nicely at the end of that conversion pipeline!

If you'd like, I can make a small follow-up PR that changes the rest of the functions [..]

Thanks for offering. If you happen to keep contributing something else you could U-Boot that, but no need for a PR just for that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants