Only Python cells should be validated in Jupyter notebooks #12281

stewartadam · 2024-07-10T19:07:57Z

Description

When creating a multi-language notebook (e.g. using the Polyglot Notebooks extension), Ruff continues to validate all cells, producing many linting errors on non-Python code.

Context

In our case we were creating a C# notebook with .NET Interactive kernel and Ruff produced many linting errors. As a workaround, we have a top-level code cell with the contents # ruff: noqa to disable this, but it would be great if Ruff examined the cell language and only validated Python code.

#10504 mentions that the notebook's language is validated using the metadata; however, Ruff is still linting the C# code even with this metadata present in the notebook:

"metadata": {
  "kernelspec": {
   "display_name": ".NET (C#)",
   "language": "C#",
   "name": ".net-csharp"
  },
  "polyglot_notebook": {
   "kernelInfo": {
    "defaultKernelName": "csharp",
    "items": [
     {
      "aliases": [],
      "name": "csharp"
     }
    ]
   }
  }

One thing to note with polyglot notebooks is each cell does have a language associated to it - it would be great if Ruff could parse this per-cell metadata and only validate Python code cells.

Additional Details

ruff 0.4.10

configuration (pyproject.toml):

[tool.ruff]
extend-include = ["*.ipynb"]
lint.extend-select = ["W"]
line-length = 140

Example output of linting errors produced on C# code:

$ pdm run ruff check
error: Failed to parse docs/getting-started-notebooks/api-example.ipynb:4:4:7: Simple statements must be separated by newlines or semicolons
docs/getting-started-notebooks/api-example.ipynb:cell 4:4:7: E999 SyntaxError: Simple statements must be separated by newlines or semicolons
docs/getting-started-notebooks/api-example.ipynb:cell 4:4:22: E703 [*] Statement ends with an unnecessary semicolon
docs/getting-started-notebooks/api-example.ipynb:cell 4:5:7: E999 SyntaxError: Simple statements must be separated by newlines or semicolons
docs/getting-started-notebooks/api-example.ipynb:cell 4:5:27: E703 [*] Statement ends with an unnecessary semicolon
docs/getting-started-notebooks/api-example.ipynb:cell 4:6:7: E999 SyntaxError: Simple statements must be separated by newlines or semicolons
docs/getting-started-notebooks/api-example.ipynb:cell 4:6:30: E703 [*] Statement ends with an unnecessary semicolon
docs/getting-started-notebooks/api-example.ipynb:cell 4:7:7: E999 SyntaxError: Simple statements must be separated by newlines or semicolons
docs/getting-started-notebooks/api-example.ipynb:cell 4:7:37: E703 [*] Statement ends with an unnecessary semicolon
docs/getting-started-notebooks/api-example.ipynb:cell 4:8:7: E999 SyntaxError: Simple statements must be separated by newline
...


charliermarsh · 2024-07-10T19:43:47Z

I thought we filtered based on this already but can't find it. \cc @dhruvmanila

MichaReiser · 2024-07-10T19:57:37Z

I'm not sure if we filter based on the file's metadata. I thought we filter based on the cell metadata.

stewartadam · 2024-07-10T21:52:16Z

If I'm understanding right it looks like it is validated here:

ruff/crates/ruff_linter/src/source_kind.rs

Line 50 in bbb9fe1

    
    pub fn from_path(path: &Path, source_type: PySourceType) -> Result<Option<Self>, SourceError> {

However the metadata format I see in my polyglot or Python notebooks looks to be in a slightly different structure, so it's just returning the default true.

dhruvmanila · 2024-07-11T02:59:44Z

@stewartadam Can you provide a full Notebook? I can look at where that metadata is coming from. Is it coming from the cell or from the Notebook itself? We verify that it's a Python notebook by looking at the notebook-level metadata which is what the function that you've linked does:

ruff/crates/ruff_notebook/src/notebook.rs

Lines 391 to 398 in 880c31d

    
    /// Return `true` if the notebook is a Python notebook, `false` otherwise.
    pub fn is_python_notebook(&self) -> bool {
        self.raw
            .metadata
            .language_info
            .as_ref()
            .map_or(true, |language| language.name == "python")
    }
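For readers less familiar with Rust, the notebook-level check above can be mirrored in Python as a rough sketch. The function name is borrowed from the snippet, but this standalone version operating on a parsed notebook dict is only an illustration and not Ruff's code. It shows why a notebook without a `language_info` block, such as the C# metadata excerpt in the issue description, ends up being treated as Python by default.

```python
import json


def is_python_notebook(notebook: dict) -> bool:
    """Illustrative mirror of the Rust check: True when language_info is absent."""
    language_info = notebook.get("metadata", {}).get("language_info")
    if language_info is None:
        # Same default as `map_or(true, ...)` in the Rust snippet.
        return True
    return language_info.get("name") == "python"


# Metadata shaped like the excerpt from the issue description:
# there is a kernelspec, but no language_info block.
csharp_notebook = json.loads("""
{
  "metadata": {
    "kernelspec": {
      "display_name": ".NET (C#)",
      "language": "C#",
      "name": ".net-csharp"
    }
  },
  "cells": []
}
""")

# Hypothetical notebooks that do carry language_info.
python_notebook = {"metadata": {"language_info": {"name": "python"}}}
other_notebook = {"metadata": {"language_info": {"name": "javascript"}}}

print(is_python_notebook(csharp_notebook))  # True: falls through to the default
print(is_python_notebook(python_notebook))  # True
print(is_python_notebook(other_notebook))   # False
```

Because only `language_info` is consulted, the `kernelspec.language` value of "C#" has no effect on the result in this sketch.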

stewartadam · 2024-07-11T18:21:03Z

Here are two notebooks: https://gist.github.com/stewartadam/528689d9bb917715a4c16a6ff9282de3

It looks like the extensions have a habit of making a mess of the metadata though. Creating a new notebook defaults to Python and after switching it to .NET Interactive, it gets the .NET Interactive metadata layered on top of the original Python language metadata.

Similarly, saving a copy of the .NET notebook as python.ipynb and then changing the kernel+cells to Python left all the .NET metadata in place (look at the first revision in the Gist).

I suspect we'll want to introspect a mix of the notebook language, cell type, and kernel.

dhruvmanila · 2024-07-12T03:52:43Z

This is confusing.

The dotnet-polyglot.ipynb has a cell with the following metadata which doesn't mention any language:

   "metadata": {
    "vscode": {
     "languageId": "polyglot-notebook"
    }
   },

While, the language_info is still "python":

  "language_info": {
   "name": "python"
  },

And, the python.ipynb notebook has a JavaScript cell with the following metadata:

   "metadata": {
    "vscode": {
     "languageId": "javascript"
    }
   },

But, the language_info is still "python".

Which extensions are involved here? Is it just https://marketplace.visualstudio.com/items?itemName=ms-dotnettools.dotnet-interactive-vscode or are there any other extensions? I'll probably need to look at the metadata schema these extensions use and encode it accordingly. I'm not sure how much effort it is to support this, and it's not a priority right now with other important things going on.

MichaReiser · 2024-07-22T07:58:18Z

Related, the OpenAI notebooks fail parsing because they contain "code" cells where only the vscode.languageId is set. I think we should start respecting vscode.languageId.

https://github.com/openai/openai-cookbook/blob/main/examples/chatgpt/gpt_actions_library/gpt_action_salesforce.ipynb?short_path=65a7845

dhruvmanila · 2024-07-22T08:57:34Z

> Related, the OpenAI notebooks fail parsing because they contain "code" cells where only the vscode.languageId is set. I think we should start respecting vscode.languageId.
>
> openai/openai-cookbook@main/examples/chatgpt/gpt_actions_library/gpt_action_salesforce.ipynb?short_path=65a7845

Yeah, this seems reasonable. I'll want to check for any official documentation on this field and might need to look at the source code if none exists.

## Summary

This PR adds support for VS Code specific cell metadata to consider when collecting valid code cells.

For context, Ruff only runs on valid code cells. These are the code cells that don't contain cell magics. Previously, Ruff only used the notebook's metadata to determine whether it's a Python notebook. But, in VS Code, a notebook's preferred language might be Python but it could still contain code cells for other languages. This can be determined with the `metadata.vscode.languageId` field.

### References:

* https://code.visualstudio.com/docs/languages/identifiers
* https://github.com/microsoft/vscode/blob/e6c009a3d4ee60f352212b978934f52c4689fbd9/extensions/ipynb/src/serializers.ts#L104-L107
* https://github.com/microsoft/vscode/blob/e6c009a3d4ee60f352212b978934f52c4689fbd9/extensions/ipynb/src/serializers.ts#L117-L122

This brings us one step closer to fixing #12281.

## Test Plan

Add test cases for `is_valid_python_code_cell` and an integration test case which showcases running it end to end. The test notebook contains a JavaScript code cell and a Python code cell.
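The cell-level filter described in the summary above could look roughly like this in Python. The name `is_valid_python_code_cell` comes from the test plan, but the body is a sketch under assumptions: it treats a missing `languageId` as Python, and it leaves out the cell magic handling that Ruff also applies. The sample cells are made up for illustration, with metadata shaped like the snippets quoted earlier in the thread.

```python
def is_valid_python_code_cell(cell: dict) -> bool:
    """Sketch: keep code cells whose VS Code languageId is absent or 'python'."""
    if cell.get("cell_type") != "code":
        return False
    vscode = cell.get("metadata", {}).get("vscode", {})
    language_id = vscode.get("languageId")
    # Assumption: no languageId means "use the notebook's language".
    return language_id is None or language_id == "python"


# Hypothetical cells for illustration.
cells = [
    {"cell_type": "code", "metadata": {}, "source": "print('hi')"},
    {"cell_type": "code",
     "metadata": {"vscode": {"languageId": "javascript"}},
     "source": "console.log('hi');"},
    {"cell_type": "code",
     "metadata": {"vscode": {"languageId": "polyglot-notebook"}},
     "source": "Console.WriteLine(1);"},
    {"cell_type": "markdown", "metadata": {}, "source": "# Title"},
]

valid = [cell["source"] for cell in cells if is_valid_python_code_cell(cell)]
print(valid)  # ["print('hi')"]
```

Only the first cell survives the filter, which matches the behavior the thread asks for: the JavaScript and polyglot cells are skipped even though the notebook-level language may still say "python".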

dhruvmanila · 2024-08-13T16:48:24Z

@stewartadam So, #12864 should actually also fix the Polyglot notebooks case as well, at least it does for the linked notebooks:

❯ ./target/debug/ruff check ~/Downloads/notebooks/dotnet-polyglot.ipynb --no-cache
All checks passed!

❯ ./target/debug/ruff check ~/Downloads/notebooks/python.ipynb --no-cache 
All checks passed!

> One thing to note with polyglot notebooks is each cell does have a language associated to it - it would be great if Ruff could parse this per-cell metadata and only validate Python code cells.

If this metadata is of the form vscode.languageId, then we'll start using it from the next version.

Do you have an example Polyglot notebook which resembles the one in the PR description? I could test it out.

dhruvmanila · 2024-08-13T16:53:15Z

Looking at the metadata you've provided, I think it might be useful to use the kernelspec.language as a fallback if there's no language_info. VS Code does the same: https://github.com/microsoft/vscode/blob/1c31e758985efe11bc0453a45ea0bb6887e670a4/extensions/ipynb/src/deserializers.ts#L20-L22
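A minimal Python sketch of the fallback proposed here, assuming the notebook has already been parsed into a dict. The helper names `notebook_language` and `is_python_notebook` are hypothetical and the logic is an illustration of the idea, not the change that was eventually merged.

```python
def notebook_language(notebook: dict):
    """Prefer language_info.name; fall back to kernelspec.language."""
    metadata = notebook.get("metadata", {})
    language_info = metadata.get("language_info")
    if language_info is not None:
        return language_info.get("name")
    kernelspec = metadata.get("kernelspec")
    if kernelspec is not None:
        return kernelspec.get("language")
    return None


def is_python_notebook(notebook: dict) -> bool:
    """Treat a notebook as Python when no language is declared at all."""
    language = notebook_language(notebook)
    return language is None or language == "python"


# Metadata shaped like the excerpt in the issue description (no language_info).
csharp_only = {
    "metadata": {
        "kernelspec": {
            "display_name": ".NET (C#)",
            "language": "C#",
            "name": ".net-csharp",
        }
    }
}

# Layered metadata like the one discussed for dotnet-polyglot.ipynb:
# a .NET kernelspec with a leftover Python language_info.
layered = {
    "metadata": {
        "kernelspec": {"language": "C#", "name": ".net-csharp"},
        "language_info": {"name": "python"},
    }
}

print(is_python_notebook(csharp_only))  # False: kernelspec fallback kicks in
print(is_python_notebook(layered))      # True: language_info still wins
```

The second case shows why the fallback alone is not enough for notebooks with layered metadata; those still depend on per-cell information such as vscode.languageId.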

stewartadam · 2024-08-13T17:04:32Z

This is great to hear!

Agreed on using kernelspec.language as a fallback, given the metadata captured from the notebooks generated with the VS Code extension.

tigerhawkvok · 2024-10-16T18:37:22Z

I'll chime in and say that magic commands for Databricks environments should have those cells disabled:

https://docs.databricks.com/en/notebooks/notebooks-code.html#mix-languages

e.g., cells that start with the pattern /^%[a-z]{2,}\s*$/ should be ignored (or ignored unless it's %python)
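To make the suggested pattern concrete, here is a small Python sketch that applies it to the first line of a cell. The function name `is_non_python_magic_cell` is hypothetical, and applying the pattern only to the first line is an assumption about what "start with" means here.

```python
import re

# The pattern from the comment above, applied to a cell's first line.
DATABRICKS_MAGIC = re.compile(r"^%[a-z]{2,}\s*$")


def is_non_python_magic_cell(source: str) -> bool:
    """Return True if the cell opens with a language magic other than %python."""
    first_line = source.split("\n", 1)[0]
    if not DATABRICKS_MAGIC.match(first_line):
        return False
    return first_line.strip() != "%python"


print(is_non_python_magic_cell("%sql\nSELECT 1"))       # True
print(is_non_python_magic_cell("%python\nprint(1)"))    # False
print(is_non_python_magic_cell("print(1)"))             # False
# The {2,} quantifier needs at least two letters, so %r does not match.
print(is_non_python_magic_cell("%r\nx <- 1"))           # False
# Trailing arguments on the magic line also prevent a match.
print(is_non_python_magic_cell("%fs ls /mnt"))          # False
```

As written, the pattern misses single-letter magics and magic lines that carry arguments, which is worth keeping in mind when weighing it against an explicit list of magic names.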

dhruvmanila · 2024-10-17T05:12:02Z

This seems like a special case for the Databricks environment specifically, and I'm not sure how to detect that. My main hesitation is that a single percent sign magic command like %python is a line magic, which only considers the text on the same line as the magic command, while a double percent sign magic command like %%python is a cell magic, which considers all the content of the cell in which the command is defined. Ruff still considers certain cells that contain cell magic commands to be valid, because the lines that follow will be Python code:

ruff/crates/ruff_notebook/src/cell.rs

Lines 243 to 254 in c6b311c

    
    !matches!(
        command,
        "capture"
            | "debug"
            | "ipytest"
            | "prun"
            | "pypy"
            | "python"
            | "python3"
            | "time"
            | "timeit"
    )

Ruff will only ignore the line where a line magic is defined but will ignore the entire cell for cell magics except for the above cell magic commands.
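The distinction described above can be sketched in Python as follows. The set of names is copied from the Rust snippet, while `lintable_lines` is a hypothetical helper and the line-based filtering is a deliberate simplification: Ruff handles magics in its own parser, not by string matching like this.

```python
# Cell magics whose bodies are still Python, copied from the snippet above.
PYTHON_CELL_MAGICS = {
    "capture", "debug", "ipytest", "prun", "pypy",
    "python", "python3", "time", "timeit",
}


def lintable_lines(source: str) -> list:
    """Sketch: return the lines of a cell that would still be checked."""
    lines = source.splitlines()
    first = next((line for line in lines if line.strip()), "")
    if first.lstrip().startswith("%%"):
        parts = first.lstrip()[2:].split()
        command = parts[0] if parts else ""
        if command not in PYTHON_CELL_MAGICS:
            # Unknown cell magic: the entire cell is ignored.
            return []
    # Line magics (and the cell magic line itself) are dropped one by one.
    # Naive: a continuation line that begins with % would be dropped too.
    return [line for line in lines if not line.lstrip().startswith("%")]


print(lintable_lines("%%timeit\nsum(range(10))"))
# ['sum(range(10))']
print(lintable_lines("%%bash\nls -la"))
# []
print(lintable_lines("%matplotlib inline\nimport numpy as np"))
# ['import numpy as np']
```

Under this model, a Databricks cell starting with `%sql` looks like a line magic, so only its first line would be skipped and the SQL below it would still be parsed as Python, which is the behavior the comment above is pointing at.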

tigerhawkvok · 2024-10-17T17:22:10Z

That's fair enough. Databricks is an enormous platform used by a ton of big shops, so it seems like a decent carve-out, but it is nevertheless at least somewhat niche.

The language support is limited, so potentially a match fail on %sql, %scala, %sh, %fs, %md, %r is easier to support, but I was hesitant to suggest an exclude list rather than an include list initially.

dhruvmanila added the notebook Related to (Jupyter) notebooks label Jul 11, 2024

stewartadam mentioned this issue Jul 11, 2024

Language metadata is not updated in .NET notebook dotnet/interactive#3602

Closed

17 tasks

MichaReiser mentioned this issue Jul 22, 2024

Ignore more open ai notebooks for now #12448

Merged

dhruvmanila mentioned this issue Aug 13, 2024

Consider VS Code cell metadata to determine valid code cells #12864

Merged

dhruvmanila self-assigned this Aug 13, 2024

dhruvmanila mentioned this issue Aug 14, 2024

Fallback to kernelspec to check if it's a Python notebook #12875

Merged

dhruvmanila closed this as completed in #12875 Aug 14, 2024

dhruvmanila closed this as completed in 2520ebb Aug 14, 2024
