Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First string token is always considered as docstring #7595

Closed
dhruvmanila opened this issue Sep 22, 2023 · 11 comments · Fixed by #10923
Closed

First string token is always considered as docstring #7595

dhruvmanila opened this issue Sep 22, 2023 · 11 comments · Fixed by #10923
Assignees
Labels
bug Something isn't working docstring Related to docstring linting or formatting

Comments

@dhruvmanila
Copy link
Member

dhruvmanila commented Sep 22, 2023

In the docstring detection state machine, the first string token is always considered as docstring without considering the context. For example, the first strings in the following example are considered as docstrings:

"abcba".strip("ab")

On the opposite end, if multiple strings are implicitly concatenated then only the first string is considered as docstring:

"first" "second"

Python considers the entire string ("firststring") as the module docstring but the state machine will only say that "first" is docstring while "second" is a normal string.

@dhruvmanila dhruvmanila added bug Something isn't working docstring Related to docstring linting or formatting labels Sep 22, 2023
@charliermarsh
Copy link
Member

Yeah, we need something more reliable here. I wish this could use our AST-based docstring detection -- that is, I wish we could just move these rules to the AST checker.

We could move them, but we'd then have to re-lex the strings in the AST for any implicit concatenations, which seems expensive.

@MichaReiser
Copy link
Member

We could move them, but we'd then have to re-lex the strings in the AST for any implicit concatenations, which seems expensive.

Why is this necessary, considering that python detects the whole string as docsttring?

Let's see if @dhruvmanila ends up refactoring our string representation anyway to represent string parts in the AST.

@charliermarsh
Copy link
Member

@MichaReiser -- Because the string could contain comments in-between the segments:

x = (
    "foo" # comment
    "bar"
)

This isn't a problem in the token-based model because we iterate over string segments and comment tokens. But in the AST, we'd need to delineate the string contents from the comments.

@MichaReiser
Copy link
Member

This isn't a problem in the token-based model because we iterate over string segments and comment tokens. But in the AST, we'd need to delineate the string contents from the comments.

I see, but we would only need to do this for strings where implicit concatenated is true. So it shouldn't be as many? But I guess it is now more complicated with the fstrings support.

@dhruvmanila
Copy link
Member Author

dhruvmanila commented Nov 6, 2023

I haven't explored it fully but I don't think it'll be an issue once the new AST node for implicit string concatenation is complete.

@charliermarsh
Copy link
Member

I think we can now solve this.

@dhruvmanila
Copy link
Member Author

Yeah, I can look at it tonight, looks interesting :)

@dhruvmanila dhruvmanila self-assigned this Dec 7, 2023
@dhruvmanila
Copy link
Member Author

I think we would need to move the existing rules which uses the state machine to be based on AST instead of tokens. There aren't many -- RUF001-003 and flake8-quotes rules. I think the only reason they're token based is because of implicit string concatenation.

The implementation should probably be in the semantic model.

@charliermarsh
Copy link
Member

Yeah ideally we would get rid of that state machine entirely.

dhruvmanila added a commit that referenced this issue Feb 13, 2024
## Summary

This PR introduces a new semantic model flag `DOCSTRING` which suggests
that the model is currently in a module / class / function docstring.
This is the first step in eliminating the docstring detection state
machine which is prone to bugs as stated in #7595.

## Test Plan

~TODO: Is there a way to add a test case for this?~

I tested this using the following code snippet and adding a print
statement in the `string_like` analyzer to print if we're currently in a
docstring or not.

<details><summary>Test code snippet:</summary>
<p>

```python
"Docstring" ", still a docstring"
"Not a docstring"


def foo():
    "Docstring"
    "Not a docstring"
    if foo:
        "Not a docstring"
        pass


class Foo:
    "Docstring"
    "Not a docstring"

    foo: int
    "Unofficial variable docstring"

    def method():
        "Docstring"
        "Not a docstring"
        pass


def bar():
    "Not a docstring".strip()


def baz():
    _something_else = 1
    """Not a docstring"""
```

</p>
</details>
dhruvmanila added a commit that referenced this issue Feb 17, 2024
## Summary

Part of #7595 

This PR moves the `RUF001` and `RUF002` rules to the AST checker. This
removes the use of docstring detection from these rules.

## Test Plan

As this is just a refactor, make sure existing test cases pass.
nkxxll pushed a commit to nkxxll/ruff that referenced this issue Mar 10, 2024
## Summary

This PR introduces a new semantic model flag `DOCSTRING` which suggests
that the model is currently in a module / class / function docstring.
This is the first step in eliminating the docstring detection state
machine which is prone to bugs as stated in astral-sh#7595.

## Test Plan

~TODO: Is there a way to add a test case for this?~

I tested this using the following code snippet and adding a print
statement in the `string_like` analyzer to print if we're currently in a
docstring or not.

<details><summary>Test code snippet:</summary>
<p>

```python
"Docstring" ", still a docstring"
"Not a docstring"


def foo():
    "Docstring"
    "Not a docstring"
    if foo:
        "Not a docstring"
        pass


class Foo:
    "Docstring"
    "Not a docstring"

    foo: int
    "Unofficial variable docstring"

    def method():
        "Docstring"
        "Not a docstring"
        pass


def bar():
    "Not a docstring".strip()


def baz():
    _something_else = 1
    """Not a docstring"""
```

</p>
</details>
nkxxll pushed a commit to nkxxll/ruff that referenced this issue Mar 10, 2024
## Summary

Part of astral-sh#7595 

This PR moves the `RUF001` and `RUF002` rules to the AST checker. This
removes the use of docstring detection from these rules.

## Test Plan

As this is just a refactor, make sure existing test cases pass.
dhruvmanila added a commit that referenced this issue Mar 23, 2024
## Summary

Continuing with #7595, this PR
moves the `Q001`, `Q002`, `Q003` rules to the AST based checker.

## Test Plan

Make sure all of the existing test cases pass and verify there are no
ecosystem changes.
dhruvmanila added a commit that referenced this issue Mar 25, 2024
## Summary

Continuing with #7595, this PR moves the `Q004` rule to the AST checker.

## Test Plan

- [x] Existing test cases should pass
- [x] No ecosystem updates
@charliermarsh
Copy link
Member

@dhruvmanila - I can't remember, can we close this?

@dhruvmanila
Copy link
Member Author

No, the last rule (Q003) is remaining, maybe I'll pick that up this weekend. I remember that was a bit difficult because it supports PEP 701 syntax.

dhruvmanila added a commit that referenced this issue Apr 14, 2024
## Summary

This PR moves the `Q003` rule to AST checker.

This is the final rule that used the docstring detection state machine
and thus this PR removes it as well.

resolves: #7595 
resolves: #7808 

## Test Plan

- [x] `cargo test`
- [x] Make sure there are no changes in the ecosystem
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working docstring Related to docstring linting or formatting
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants