Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only Python cells should be validated in Jupyter notebooks #12281

Closed
stewartadam opened this issue Jul 10, 2024 · 14 comments · Fixed by #12875
Closed

Only Python cells should be validated in Jupyter notebooks #12281

stewartadam opened this issue Jul 10, 2024 · 14 comments · Fixed by #12875
Assignees
Labels
notebook Related to (Jupyter) notebooks

Comments

@stewartadam
Copy link

stewartadam commented Jul 10, 2024

Description

When using creating a multi-language notebook (e.g. using the Polyglot Notebooks extension), Ruff continues to validate all cells producing many linting errors on non-Python code.

Context

In our case we were creating a C# notebook with .NET Interactive kernel and Ruff produced many linting errors. As a workaround, we have a top-level code cell with the contents # ruff: noqa to disable this, but it would be great if Ruff examined the cell language and only validated Python code.

#10504 mentions notebooks language is validated in the metadata, however Ruff is still linting the C# code even with this metadata present in the notebook:

"metadata": {
  "kernelspec": {
   "display_name": ".NET (C#)",
   "language": "C#",
   "name": ".net-csharp"
  },
  "polyglot_notebook": {
   "kernelInfo": {
    "defaultKernelName": "csharp",
    "items": [
     {
      "aliases": [],
      "name": "csharp"
     }
    ]
   }
  }

One thing to note with polyglot notebooks is each cell does have a language associated to it - it would be great if Ruff could parse this per-cell metadata and only validate Python code cells.

Additional Details

ruff 0.4.10

configuration (pyproejct.toml):

[tool.ruff]
extend-include = ["*.ipynb"]
lint.extend-select = ["W"]
line-length = 140

Example output of linting errors produced on C# code:

$ pdm run ruff check
error: Failed to parse docs/getting-started-notebooks/api-example.ipynb:4:4:7: Simple statements must be separated by newlines or semicolons
docs/getting-started-notebooks/api-example.ipynb:cell 4:4:7: E999 SyntaxError: Simple statements must be separated by newlines or semicolons
docs/getting-started-notebooks/api-example.ipynb:cell 4:4:22: E703 [*] Statement ends with an unnecessary semicolon
docs/getting-started-notebooks/api-example.ipynb:cell 4:5:7: E999 SyntaxError: Simple statements must be separated by newlines or semicolons
docs/getting-started-notebooks/api-example.ipynb:cell 4:5:27: E703 [*] Statement ends with an unnecessary semicolon
docs/getting-started-notebooks/api-example.ipynb:cell 4:6:7: E999 SyntaxError: Simple statements must be separated by newlines or semicolons
docs/getting-started-notebooks/api-example.ipynb:cell 4:6:30: E703 [*] Statement ends with an unnecessary semicolon
docs/getting-started-notebooks/api-example.ipynb:cell 4:7:7: E999 SyntaxError: Simple statements must be separated by newlines or semicolons
docs/getting-started-notebooks/api-example.ipynb:cell 4:7:37: E703 [*] Statement ends with an unnecessary semicolon
docs/getting-started-notebooks/api-example.ipynb:cell 4:8:7: E999 SyntaxError: Simple statements must be separated by newline
...
@charliermarsh
Copy link
Member

I thought we filtered based on this already but can't find it. \cc @dhruvmanila

@MichaReiser
Copy link
Member

I'm not sure if we filter based on the file's metadata. I thought we filter based on the cell metdata.

@stewartadam
Copy link
Author

If I'm understanding right it looks like it is validated here:

pub fn from_path(path: &Path, source_type: PySourceType) -> Result<Option<Self>, SourceError> {

However the metadata format I see in my polyglot or Python notebooks looks to be in a slightly different structure, so it's just returning the default true.

@dhruvmanila
Copy link
Member

@stewartadam Can you provide a full Notebook? I can look at where that metadata is coming from. Is it coming from the cell or from the Notebook itself? We verify that it's a Python notebook by looking at the notebook-level metadata which is what the function that you've linked does:

/// Return `true` if the notebook is a Python notebook, `false` otherwise.
pub fn is_python_notebook(&self) -> bool {
self.raw
.metadata
.language_info
.as_ref()
.map_or(true, |language| language.name == "python")
}

@dhruvmanila dhruvmanila added the notebook Related to (Jupyter) notebooks label Jul 11, 2024
@stewartadam
Copy link
Author

stewartadam commented Jul 11, 2024

Here are two notebooks: https://gist.github.com/stewartadam/528689d9bb917715a4c16a6ff9282de3

It looks like the extensions have a habit of making a mess of the metadata though. Creating a new notebook defaults to Python and after switching it to .NET Interactive, it gets the .NET Interactive metadata layered on top of the original Python language metadata.

Similarly, saving a copy of the .NET notebook as python.ipynb and then changing the kernel+cells to Python left all the .NET metadata in place (look at the first revision in the Gist).

I suspect we'll want to introspect a mix of the notebook language, cell type, and kernel.

@dhruvmanila
Copy link
Member

This is confusing.

The dotnet-polyglot.ipynb has a cell with the following metadata which doesn't mention any language:

   "metadata": {
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },

While, the language_info is still "python":

  "language_info": {
   "name": "python"
  },

And, the python.ipynb notebook has a JavaScript cell with the following metadata:

   "metadata": {
    "vscode": {
     "languageId": "javascript"
    }
   },

But, the language_info is still "python".

Which extensions are all involved here? Is it just https://marketplace.visualstudio.com/items?itemName=ms-dotnettools.dotnet-interactive-vscode or are there any other extension? I'll probably need to look at the metadata schema these extensions have and encode it accordingly. I'm not sure how much effort it is to support this, it's not a priority right now with other important things going on.

@MichaReiser
Copy link
Member

Related, the open AI notebooks fail parsing because they contain "code" cells where only the vscode.languageId is set. I think we should start respecting vscode.langaugeId

https://github.com/openai/openai-cookbook/blob/main/examples/chatgpt/gpt_actions_library/gpt_action_salesforce.ipynb?short_path=65a7845

@dhruvmanila
Copy link
Member

Related, the open AI notebooks fail parsing because they contain "code" cells where only the vscode.languageId is set. I think we should start respecting vscode.langaugeId

openai/openai-cookbook@main/examples/chatgpt/gpt_actions_library/gpt_action_salesforce.ipynb?short_path=65a7845

Yeah, this seems reasonable. I'll want to check for any official documentation on this field and might require to look at the source code if it doesn't exists.

dhruvmanila added a commit that referenced this issue Aug 13, 2024
## Summary

This PR adds support for VS Code specific cell metadata to consider when
collecting valid code cells.

For context, Ruff only runs on valid code cells. These are the code
cells that doesn't contain cell magics. Previously, Ruff only used the
notebook's metadata to determine whether it's a Python notebook. But, in
VS Code, a notebook's preferred language might be Python but it could
still contain code cells for other languages. This can be determined
with the `metadata.vscode.languageId` field.

### References:
* https://code.visualstudio.com/docs/languages/identifiers
* https://github.com/microsoft/vscode/blob/e6c009a3d4ee60f352212b978934f52c4689fbd9/extensions/ipynb/src/serializers.ts#L104-L107
*
https://github.com/microsoft/vscode/blob/e6c009a3d4ee60f352212b978934f52c4689fbd9/extensions/ipynb/src/serializers.ts#L117-L122

This brings us one step closer to fixing #12281.

## Test Plan

Add test cases for `is_valid_python_code_cell` and an integration test
case which showcase running it end to end. The test notebook contains a
JavaScript code cell and a Python code cell.
@dhruvmanila
Copy link
Member

@stewartadam So, #12864 should actually also fix the Polyglot notebooks case as well, at least it does for the linked notebooks:

./target/debug/ruff check ~/Downloads/notebooks/dotnet-polyglot.ipynb --no-cache
All checks passed!./target/debug/ruff check ~/Downloads/notebooks/python.ipynb --no-cache 
All checks passed!

One thing to note with polyglot notebooks is each cell does have a language associated to it - it would be great if Ruff could parse this per-cell metadata and only validate Python code cells.

If this metadata is of the form vscode.languageId, then we'll start using it from next version.

Do you have an example Polyglot notebook which resembles the one in the PR description? I could test it out.

@dhruvmanila
Copy link
Member

Looking at the metadata you've provided, I think it might be useful to use the kernelspec.language as a fallback if there's no language_info. VS Code does the same: https://github.com/microsoft/vscode/blob/1c31e758985efe11bc0453a45ea0bb6887e670a4/extensions/ipynb/src/deserializers.ts#L20-L22

@stewartadam
Copy link
Author

This great to hear!

Agreed kernelspec.language as a fallback given the metadata captured from the notebooks generaetd with VSCode extension.

@tigerhawkvok
Copy link

I'll chime in and say that magic commands for Databricks environments should have those cells disabled:

https://docs.databricks.com/en/notebooks/notebooks-code.html#mix-languages

eg, cells that start with the pattern /^%[a-z]{2,}\s*$/ should be ignored (or, ignored unless it's %python)

@dhruvmanila
Copy link
Member

This seems like a special case for Databricks environment specifically, I'm not sure how to detect that. My main hesitation is that a single percent sign magic command like %python is a line magic which only considers the text on the same line where the magic command is present while a double percent sign magic command like %%python is a cell magic which considers all the content of the cell where the command is defined in. Ruff considers certain cells that contains cell magic commands because the following lines will have Python code:

!matches!(
command,
"capture"
| "debug"
| "ipytest"
| "prun"
| "pypy"
| "python"
| "python3"
| "time"
| "timeit"
)

Ruff will only ignore the line where a line magic is defined but will ignore the entire cell for cell magics except for the above cell magic commands.

@tigerhawkvok
Copy link

That's fair enough. Databricks is an enormous platform used by a ton of big shops, so it seems like a decent carve-out, but it is nevertheless at least somewhat niche.

The language support is limited, so potentially a match fail on %sql, %scala, %sh, %fs, %md, %r is easier to support, but I was hesitant to suggest an exclude list rather than an include list initially.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
notebook Related to (Jupyter) notebooks
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants