Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Range formatting: Fix invalid syntax after parenthesizing expression #9751

Merged
merged 3 commits into from
Feb 2, 2024

Conversation

MichaReiser
Copy link
Member

@MichaReiser MichaReiser commented Feb 1, 2024

Summary

This PR fixes an issue where range formatting omitted the closing parentheses when formatting a range that end's at an expression that gets parenthesized.

The issue is there's no unambiguous mapping for the position after "automatic" in the following example:

def needs_parentheses( ) -> bool:
    return item.sizing_mode is None and item.width_policy == "auto" and item.height_policy == "automatic"

This gets reformatted to

def needs_parentheses() -> bool:
    return (
        item.sizing_mode is None
        and item.width_policy == "auto"
        and item.height_policy == "automatic"
    )

The generated IR ending after "automatic" roughly is (where 139 is the position after the closing quote of "automatic")

source_position(139), // end position from the expression statement
soft_line_break(),
if_group_breaks(&text(")"))
hard_line_break(), 
source_position(139)

The printer will generate multiple source map entries for this

139 => 164 // Right after "automatic"
139 => 169 // The closing parentheses
139 => 170 // After the closing parentheses
139 => 171 // After the hard line break

Generating multiple mappings for the same position is correct because, depending on the use case, you either want to know where position 139 starts (you pick the first) or where it ends (you pick the last) in the formatted code. That's why the proper fix to this problem would be to include whether we look for the end or start of a node for both offsets passed to Printed::slice_range. This way, the source map lookup logic could pick the correct entry if multiple entries point to the same source location... However, this doesn't work for us because the suite ranges in the AST are off.

Looking at the example from above again: Both the expression and the expression statement end at position 139. If the statement is the last in the body (which is the case in the example above), then the body, and with it, potentially the enclosing node, also ends at position 139 in our AST because the body's end position is equal to the last statement's end position. However, this is incorrect. The end offset of the body should be the position of the Dedent token (after the hard line break). We might fix this with the new parser, but it is as it is now.

The problem with the body ending at offset 139 is that we incorrectly include the hard_line_break that ends the body even when formatting the expression statement alone because it's impossible to disambiguate the end of the expression statement from the end of the body. Including the hard_line_break results in inserting a new line every time we format the expression statement.

I'm not too fond of the solution taken by this PR but it works 🤷

First, we reduce the granularity of the source map to only generate positions we care about. We no longer generate source map entries for:

  • Text elements
  • or anything below logical line level (expression, with items, etc)

This solves the problem that we can't disambiguate between the end of the expression and the expression statement.
This also solves the problem where we can't disambiguate between the expression statement and the body because we'll now always take the first source mapping position. This can produce incorrect* results if we start inserting tokens around the statement, similar to the problem with parenthesizing expressions. However, I'm unaware of a case where we do this today or why this should be needed in the future. If it's needed in the future, than let's hope that we have our new parser by then and can fix the ranges of compound statements.

The first commit removes the automatic source map generation for text elements from the Printer. It now always requires explicit source_position element.

The second commit changes the python formatter to only emit source positions for logical-line like nodes. It also fixes an issue where we didn't fix a comment's indent.

Test Plan

Added snapshot tests.

Copy link
Contributor

github-actions bot commented Feb 1, 2024

ruff-ecosystem results

Linter (stable)

ℹ️ ecosystem check encountered linter errors. (no lint changes; 1 project error)

sphinx-doc/sphinx (error)

ruff failed
  Cause: Selection of unstable rules without the `--preview` flag is not allowed. Enable preview or remove selection of:
	- FURB113
	- FURB131
	- FURB132

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

ℹ️ ecosystem check encountered format errors. (no format changes; 2 project errors)

sphinx-doc/sphinx (error)

ruff format --no-preview --exclude tests/roots/test-pycode/cp_1251_coded.py

ruff failed
  Cause: Selection of unstable rules without the `--preview` flag is not allowed. Enable preview or remove selection of:
	- FURB113
	- FURB131
	- FURB132

openai/openai-cookbook (error)

warning: Detected debug build without --no-cache.
error: Failed to parse examples/dalle/Image_generations_edits_and_variations_with_DALL-E.ipynb:3:7:8: Unexpected token 'prompt'

Formatter (preview)

ℹ️ ecosystem check encountered format errors. (no format changes; 1 project error)

openai/openai-cookbook (error)

ruff format --preview

warning: Detected debug build without --no-cache.
error: Failed to parse examples/dalle/Image_generations_edits_and_variations_with_DALL-E.ipynb:3:7:8: Unexpected token 'prompt'

Require explicit `FormatElement::SourcePosition` elements over implicitly adding a source mapping for every token. This has the downside that it requires more `FormatElement`s to achieve a source mapping at the same detail level as today. However, we don't have a use case for token by token source maps today. We only want to map node positions where this is more than sufficient.
@MichaReiser MichaReiser force-pushed the range-formatting-parentheses branch from f6ac41a to 8502cad Compare February 2, 2024 08:05
@MichaReiser MichaReiser force-pushed the range-formatting-parentheses branch from 8502cad to 22ba608 Compare February 2, 2024 08:09
@MichaReiser MichaReiser added internal An internal refactor or improvement formatter Related to the formatter labels Feb 2, 2024
@MichaReiser MichaReiser requested review from charliermarsh and BurntSushi and removed request for charliermarsh February 2, 2024 08:19
@MichaReiser MichaReiser marked this pull request as ready for review February 2, 2024 08:31
@MichaReiser MichaReiser force-pushed the range-formatting-parentheses branch from 765c6a7 to 59a8f29 Compare February 2, 2024 08:44
@charliermarsh
Copy link
Member

This can produce incorrect* results if we start inserting tokens around the statement, similar to the problem with parenthesizing expressions. However, I'm unaware of a case where we do this today or why this should be needed in the future. If it's needed in the future, than let's hope that we have our new parser by then and can fix the ranges of compound statements.

Are there any other downsides to removing these source map entries? Would it mean that we lose granularity in some cases (that we don't have a use for anyway right now)?

Copy link
Member

@BurntSushi BurntSushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@@ -340,29 +337,25 @@ impl<Context> Format<Context> for SourcePosition {
}
}

/// Creates a text from a dynamic string with its optional start-position in the source document.
/// Creates a text from a dynamic string.
///
/// This is done by allocating a new string internally.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this function doesn't allocate any more?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function doesn't, but it does when calling Format. I can improve the documentation.

@MichaReiser
Copy link
Member Author

@charliermarsh Yeah, it's mainly if we want to have a full fidelity source maps where we want to map exact positions from source to dest and the other way round. However, that would still have the problem I'm facing today. So that needs fixing before it can become useful anyway.

@MichaReiser MichaReiser merged commit 4f7fb56 into main Feb 2, 2024
17 checks passed
@MichaReiser MichaReiser deleted the range-formatting-parentheses branch February 2, 2024 16:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
formatter Related to the formatter internal An internal refactor or improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants