Added more value inference for `dbutils.notebook.run(...)` #1860

ericvergnaud · 2024-06-07T11:00:12Z

add tests for value inference scenarios
support multiple values inference when dbutils.notebook.run is called in a loop

Linked issues

Progress #1205

Functionality

added relevant user documentation
added new CLI command
modified existing command: databricks labs ucx ...
added a new workflow
modified existing workflow: ...
added a new table
modified existing table: ...

Tests

manually tested
added unit tests
added integration tests
verified on staging environment (screenshot attached)

codecov · 2024-06-07T11:04:09Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.33%. Comparing base (b19c848) to head (5048314).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1860      +/-   ##
==========================================
+ Coverage   89.26%   89.33%   +0.06%     
==========================================
  Files          95       95              
  Lines       12022    12043      +21     
  Branches     2110     2113       +3     
==========================================
+ Hits        10732    10759      +27     
+ Misses        879      876       -3     
+ Partials      411      408       -3

JCZuurmond

Added some comments

src/databricks/labs/ucx/source_code/graph.py

src/databricks/labs/ucx/source_code/linters/imports.py

src/databricks/labs/ucx/source_code/graph.py

github-actions · 2024-06-07T11:30:12Z

✅ 189/189 passed, 23 skipped, 3h58m50s total

Running from acceptance #3828

nfx · 2024-06-07T15:56:50Z

src/databricks/labs/ucx/source_code/linters/imports.py

-        if isinstance(inferred, Const):
-            return inferred.value.strip().strip("'").strip('"')
-        return None
+    def get_notebook_paths(self) -> list[str | None]:


Why do we need to change the signature? `dbutils.notebook.run()` takes a single path argument alongside its other parameters; it can't have multiple paths.

We need this because astroid is clever enough to return multiple inferred nodes in a scenario such as:

paths = ["p1", "p2"]
for path in paths:
    dbutils.notebook.run(path)

Oh, interesting. Please add it as a code comment, so that the next person reading this code isn't caught by surprise.
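
The loop scenario in this thread can be illustrated without astroid. The sketch below is a deliberately simplified stand-in built on the standard-library `ast` module; it is not astroid's inference engine and not the code in this PR. The names `NotebookRunPaths` and `infer_notebook_paths` are hypothetical. It only shows why one call site can yield several candidate paths, with `None` marking an argument that could not be resolved.

```python
from __future__ import annotations

import ast


class NotebookRunPaths(ast.NodeVisitor):
    """Collect candidate paths passed to dbutils.notebook.run(...) (hypothetical helper)."""

    def __init__(self) -> None:
        self.scalars: dict[str, list[str]] = {}    # name -> possible string values
        self.sequences: dict[str, list[str]] = {}  # name -> string elements of a list/tuple
        self.paths: list[str | None] = []

    def _strings(self, node: ast.expr) -> list[str] | None:
        # Possible string values of an expression, or None when unknown.
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            return [node.value]
        if isinstance(node, ast.Name):
            return self.scalars.get(node.id)
        return None

    def _elements(self, node: ast.expr) -> list[str] | None:
        # String elements produced when iterating over an expression, or None when unknown.
        if isinstance(node, ast.Name):
            return self.sequences.get(node.id)
        if isinstance(node, (ast.List, ast.Tuple)):
            collected: list[str] = []
            for element in node.elts:
                values = self._strings(element)
                if values is None:
                    return None
                collected.extend(values)
            return collected
        return None

    def _bind(self, name: str, strings: list[str] | None, elements: list[str] | None) -> None:
        # Record what is known about a name and forget anything stale.
        for table, values in ((self.scalars, strings), (self.sequences, elements)):
            if values is None:
                table.pop(name, None)
            else:
                table[name] = values

    def visit_Assign(self, node: ast.Assign) -> None:
        if len(node.targets) == 1 and isinstance(node.targets[0], ast.Name):
            self._bind(node.targets[0].id, self._strings(node.value), self._elements(node.value))
        self.generic_visit(node)

    def visit_For(self, node: ast.For) -> None:
        if isinstance(node.target, ast.Name):
            # The loop variable may take any element of the iterated sequence.
            self._bind(node.target.id, self._elements(node.iter), None)
        self.generic_visit(node)

    def visit_Call(self, node: ast.Call) -> None:
        if ast.unparse(node.func) == "dbutils.notebook.run" and node.args:
            values = self._strings(node.args[0])
            self.paths.extend(values if values is not None else [None])
        self.generic_visit(node)


def infer_notebook_paths(source: str) -> list[str | None]:
    visitor = NotebookRunPaths()
    visitor.visit(ast.parse(source))
    return visitor.paths


LOOP = 'paths = ["p1", "p2"]\nfor path in paths:\n    dbutils.notebook.run(path)\n'
print(infer_notebook_paths(LOOP))  # ['p1', 'p2']
```

A single-path return type cannot represent this result, which is the motivation given above for returning a list.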

nfx · 2024-06-07T15:58:56Z

src/databricks/labs/ucx/source_code/linters/imports.py

-            'dbutils-notebook-run-dynamic',
-            "Path for 'dbutils.notebook.run' is not a constant and requires adjusting the notebook path",
+            'dbutils-notebook-run-literal',
+            "Call to 'dbutils.notebook.run' will be migrated automatically",


We won't be migrating `notebook.run()` logic.

So what should the message be? (The one above already existed.)

Should we drop this advice altogether?

nfx · 2024-06-07T16:00:08Z

src/databricks/labs/ucx/source_code/notebooks/loaders.py

-        for language in CellLanguage:
-            if content.startswith(language.file_magic_header):
-                return language.language
+    def detect_language(path: Path, content: str):


Why does this method need to be public?

For testing.

It's a slippery slope to make methods public just for testing without a significant need. This case doesn't justify it, and the method could be tested through other public methods.

There's nothing dangerous about this method, so I'm not sure why this one is slippery. Or maybe we should allow access to private methods in unit tests, so that we can write actual unit tests rather than slower and more complex end-to-end tests?

The slippery slope is that if we allow it for trivial cases, inexperienced devs will expose the inner workings of classes as public API, resulting in a more fragile system. This codebase was in that state 6 months ago and we're not going back there.
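
One compromise, which the later commit "test private API using test class" appears to take, is to keep the helper private and expose it only through a subclass that lives in the test suite. The sketch below illustrates that pattern under stated assumptions: `NotebookLoaderSketch`, its header and extension tables, and the order in which they are checked are all hypothetical, not the UCX `NotebookLoader`.

```python
from __future__ import annotations

from pathlib import Path


class NotebookLoaderSketch:
    """Hypothetical stand-in for a loader; not the UCX NotebookLoader."""

    # Assumed mappings, for illustration only.
    _HEADERS = {
        "# Databricks notebook source": "python",
        "-- Databricks notebook source": "sql",
        "// Databricks notebook source": "scala",
    }
    _EXTENSIONS = {".py": "python", ".sql": "sql", ".scala": "scala", ".r": "r"}

    def load(self, path: Path, content: str) -> tuple[Path, str | None]:
        """Public entry point: the only method production callers use."""
        return path, self._detect_language(path, content)

    @classmethod
    def _detect_language(cls, path: Path, content: str) -> str | None:
        # Assumed order: content header first, then file extension.
        for header, language in cls._HEADERS.items():
            if content.startswith(header):
                return language
        return cls._EXTENSIONS.get(path.suffix.lower())


class NotebookLoaderForTesting(NotebookLoaderSketch):
    """Lives in the test suite; exposes the private helper without widening the public API."""

    @classmethod
    def detect_language(cls, path: Path, content: str) -> str | None:
        return cls._detect_language(path, content)


def test_detects_language_from_extension() -> None:
    assert NotebookLoaderForTesting.detect_language(Path("query.sql"), "select 1") == "sql"
```

The production class keeps a single public method, while unit tests still reach the helper directly instead of going through a slower end-to-end path.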

tests/unit/source_code/linters/test_python_ast.py

tests/unit/source_code/notebooks/test_loader.py

tests/unit/source_code/samples/run_notebooks_with_fstring.py

src/databricks/labs/ucx/source_code/linters/imports.py

nfx · 2024-06-07T20:22:35Z

src/databricks/labs/ucx/source_code/linters/imports.py

+            return [None]
+
+    @classmethod
+    def _get_notebook_paths(cls, nodes: Iterable[NodeNG]) -> list[str | None]:


Suggested change

-    def _get_notebook_paths(cls, nodes: Iterable[NodeNG]) -> list[str | None]:
+    def _get_notebook_paths(cls, nodes: Iterable[NodeNG]) -> list[str]:

let's avoid nullability
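
One way to drop `None` from the element type without losing the "something was unresolved" signal is to return a flag next to the list, which matches how the release notes in this thread describe `get_notebook_paths`. The function below is a minimal sketch of that shape only; the name `partition_paths` is hypothetical and the real method works on astroid nodes rather than plain strings.

```python
from __future__ import annotations

from collections.abc import Iterable


def partition_paths(candidates: Iterable[str | None]) -> tuple[bool, list[str]]:
    """Return (has_unresolved, paths); the list never contains None."""
    has_unresolved = False
    paths: list[str] = []
    for candidate in candidates:
        if candidate is None:
            # Remember that at least one value could not be inferred.
            has_unresolved = True
        else:
            paths.append(candidate)
    return has_unresolved, paths


has_unresolved, paths = partition_paths(["p1", None, "p2"])
print(has_unresolved, paths)  # True ['p1', 'p2']
```

Callers can then iterate over `paths` without `None` checks and emit a single advice when the flag is set.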

nfx · 2024-06-07T20:24:38Z

tests/unit/source_code/linters/test_python_imports.py

+        ),
+        (
+            """
+names = ["abc", "xyz"]


can we already do

def foo():
    return "bar"

name = foo()
dbutils.notebook.run(name)

or does it require building a small type-aware interpreter?

we can! added corresponding test in test_infers_dbutils_notebook_run_dynamic_value
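
To give a feel for what resolving a function's return value involves, here is a much narrower stand-in than astroid's inference, written with the standard-library `ast` module. It only understands module-level functions whose whole body is `return "<literal>"` and zero-argument calls to them; the function name `infer_run_arguments` is hypothetical and this is not how the PR implements it.

```python
from __future__ import annotations

import ast


def infer_run_arguments(source: str) -> list[str | None]:
    """Resolve the first argument of each top-level dbutils.notebook.run(...) call."""
    tree = ast.parse(source)

    def literal(node: ast.expr) -> str | None:
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            return node.value
        return None

    # Module-level functions whose whole body is `return "<literal>"`.
    returns: dict[str, str] = {}
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and len(node.body) == 1:
            statement = node.body[0]
            if isinstance(statement, ast.Return) and statement.value is not None:
                value = literal(statement.value)
                if value is not None:
                    returns[node.name] = value

    variables: dict[str, str] = {}

    def resolve(node: ast.expr) -> str | None:
        if isinstance(node, ast.Name):
            return variables.get(node.id)
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and not node.args
            and not node.keywords
        ):
            return returns.get(node.func.id)
        return literal(node)

    results: list[str | None] = []
    for node in tree.body:
        if (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
        ):
            name = node.targets[0].id
            value = resolve(node.value)
            if value is None:
                variables.pop(name, None)  # forget stale knowledge
            else:
                variables[name] = value
        elif isinstance(node, ast.Expr) and isinstance(node.value, ast.Call):
            call = node.value
            if ast.unparse(call.func) == "dbutils.notebook.run" and call.args:
                results.append(resolve(call.args[0]))
    return results


SAMPLE = 'def foo():\n    return "bar"\n\nname = foo()\ndbutils.notebook.run(name)\n'
print(infer_run_arguments(SAMPLE))  # ['bar']
```

Anything beyond this narrow shape (branches, arguments, nested calls) is where a real inference engine earns its keep.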

nfx

Lgtm

nfx · 2024-06-08T11:13:52Z

@ericvergnaud `make fmt` fails

changes already applied

ericvergnaud · 2024-06-08T19:17:29Z

@nfx This is blocked by code review requests, and I'm not sure why I can't dismiss them. Are you not a code owner?

* Added `mlflow` to known packages ([#1895](#1895)). The `mlflow` package has been incorporated into the project and is now recognized as a known package. This integration includes modifications to the use of `mlflow` in the context of UC Shared Clusters, providing recommendations to modify or rewrite certain functionalities related to `sparkContext`, `_conf`, and `RDD` APIs. Additionally, the artifact storage system of `mlflow` in Databricks and DBFS has undergone changes. The `known.json` file has also been updated with several new packages, such as `alembic`, `aniso8601`, `cloudpickle`, `docker`, `entrypoints`, `flask`, `graphene`, `graphql-core`, `graphql-relay`, `gunicorn`, `html5lib`, `isort`, `jinja2`, `markdown`, `markupsafe`, `mccabe`, `opentelemetry-api`, `opentelemetry-sdk`, `opentelemetry-semantic-conventions`, `packaging`, `pyarrow`, `pyasn1`, `pygments`, `pyrsistent`, `python-dateutil`, `pytz`, `pyyaml`, `regex`, `requests`, and more. These packages are now acknowledged and incorporated into the project's functionality.
* Added `tensorflow` to known packages ([#1897](#1897)). In this release, we are excited to announce the addition of the `tensorflow` package to our known packages list. Tensorflow is a popular open-source library for machine learning and artificial intelligence applications. This package includes several components such as `tensorflow`, `tensorboard`, `tensorboard-data-server`, and `tensorflow-io-gcs-filesystem`, which enable training, evaluation, and deployment of machine learning models, visualization of machine learning model metrics and logs, and access to Google Cloud Storage filesystems. Additionally, we have included other packages such as `gast`, `grpcio`, `h5py`, `keras`, `libclang`, `mdurl`, `namex`, `opt-einsum`, `optree`, `pygments`, `rich`, `rsa`, `termcolor`, `pyasn1_modules`, `sympy`, and `threadpoolctl`. These packages provide various functionalities required for different use cases, such as parsing Abstract Syntax Trees, efficient serial communication, handling HDF5 files, and managing threads. This release aims to enhance the functionality and capabilities of our platform by incorporating these powerful libraries and tools.
* Added `torch` to known packages ([#1896](#1896)). In this release, the "known.json" file has been updated to include several new packages and their respective modules for a specific project or environment. These packages include "torch", "functorch", "mpmath", "networkx", "sympy", "isympy". The addition of these packages and modules ensures that they are recognized and available for use, preventing issues with missing dependencies or version conflicts. Furthermore, the `_analyze_dist_info` method in the `known.py` file has been improved to handle recursion errors during package analysis. A try-except block has been added to the loop that analyzes the distribution info folder, which logs the error and moves on to the next file if a `RecursionError` occurs. This enhancement increases the robustness of the package analysis process.
* Added more known libraries ([#1894](#1894)). In this release, the `known` library has been enhanced with the addition of several new packages, bringing improved functionality and versatility to the software. Key additions include contourpy for drawing contours on 2D grids, cycler for creating cyclic iterators, docker-pycreds for managing Docker credentials, filelock for platform-independent file locking, fonttools for manipulating fonts, and frozendict for providing immutable dictionaries. Additional libraries like fsspec for accessing various file systems, gitdb and gitpython for working with git repositories, google-auth for Google authentication, html5lib for parsing and rendering HTML documents, and huggingface-hub for working with the Hugging Face model hub have been incorporated. Furthermore, the release includes idna, kiwisolver, lxml, matplotlib, mypy, peewee, protobuf, psutil, pyparsing, regex, requests, safetensors, sniffio, smmap, tokenizers, tomli, tqdm, transformers, types-pyyaml, types-requests, typing_extensions, tzdata, umap, unicorn, unidecode, urllib3, wandb, waterbear, wordcloud, xgboost, and yfinance for expanded capabilities. The zipp and zingg libraries have also been included for module name transformations and data mastering, respectively. Overall, these additions are expected to significantly enhance the software's functionality.
* Added more value inference for `dbutils.notebook.run(...)` ([#1860](#1860)). In this release, the `dbutils.notebook.run(...)` functionality in `graph.py` has been significantly updated to enhance value inference. The change includes the introduction of new methods for handling `NotebookRunCall` and `SysPathChange` objects, as well as the refactoring of the `get_notebook_path` method into `get_notebook_paths`. This new method now returns a tuple of a boolean and a list of strings, indicating whether any nodes could not be resolved and providing a list of inferred paths. A new private method, `_get_notebook_paths`, has also been added to retrieve notebook paths from a list of nodes. Furthermore, the `load_dependency` method in `loaders.py` has been updated to detect the language of a notebook based on the file path, in addition to its content. The `Notebook` class now includes a new parameter, `SUPPORTED_EXTENSION_LANGUAGES`, which maps file extensions to their corresponding languages. In the `databricks.labs.ucx` project, more value inference has been added to the linter, including new methods and enhanced functionality for `dbutils.notebook.run(...)`. Several tests have been added or updated to demonstrate various scenarios and ensure the linter handles dynamic values appropriately. A new test file for the `NotebookLoader` class in the `databricks.labs.ucx.source_code.notebooks.loaders` module has been added, with a new class, `NotebookLoaderForTesting`, that overrides the `detect_language` method to make it a class method. This allows for more robust testing of the `NotebookLoader` class. Overall, these changes improve the accuracy and reliability of value inference for `dbutils.notebook.run(...)` and enhance the testing and usability of the related classes and methods.
* Added nightly workflow to use industry solution accelerators for parser validation ([#1883](#1883)). A nightly workflow has been added to validate the parser using industry solution accelerators, which can be triggered locally with the `make solacc` command. This workflow involves a new Makefile target, 'solacc', which runs a Python script located at 'tests/integration/source_code/solacc.py'. The workflow is designed to run on the latest Ubuntu, installing Python 3.10 and hatch 1.9.4 using pip, and checking out the code with a fetch depth of 0. It runs on a daily basis at 7am using a cron schedule, and can also be triggered locally. The purpose of this workflow is to ensure parser compatibility with various industry solutions, improving overall software quality and robustness.
* Complete support for pip install command ([#1853](#1853)). In this release, we've made significant enhancements to support the `pip install` command in our open-source library. The `register_library` method in the `DependencyResolver`, `NotebookResolver`, and `LocalFileResolver` classes has been modified to accept variable numbers of libraries instead of just one, allowing for more efficient dependency management. Additionally, the `resolve_import` method has been introduced in the `NotebookResolver` and `LocalFileResolver` classes for improved import resolution. Moreover, the `_split` static method has been implemented for better handling of pip command code and egg packages. The library now also supports the resolution of imports in notebooks and local files. These changes provide a solid foundation for full `pip install` command support, improving overall robustness and functionality. Furthermore, extensive updates to tests, including workflow linter and job dlt task linter modifications, ensure the reliability of the library when working with Jupyter notebooks and pip-installable libraries.
* Infer simple f-string values when computing values during linting ([#1876](#1876)). This commit enhances the open-source library by adding support for inferring simple f-string values during linting, addressing issue [#1871](#1871) and progressing [#1205](#1205). The new functionality works for simple f-strings but currently does not support nested f-strings. It introduces the InferredValue class and updates the visit_call, visit_const, and _check_str_constant methods for better linter feedback. Additionally, it includes modifications to a unit test file and adjustments to error location in code. The commit also presents an example of simple f-string handling, emphasizing the limitations yet providing a solid foundation for future development. Co-authored by Eric Vergnaud.
* Propagate widget parameters and data security mode to `CurrentSessionState` ([#1872](#1872)). In this release, the `spark_version_compatibility` function in `crawlers.py` has been refactored to `runtime_version_tuple`, returning a tuple of integers instead of a string. The function now handles custom runtimes and DLT, and raises a ValueError if the version components cannot be converted to integers. Additionally, the `CurrentSessionState` class has been updated to propagate named parameters from jobs and check for DBFS paths as both named and positional parameters. New attributes, including `spark_conf`, `named_parameters`, and `data_security_mode`, have been added to the class, all with default values of `None`. The `WorkflowTaskContainer` class has also been modified to include an additional `job` parameter in its constructor and new attributes for `named_parameters`, `spark_conf`, `runtime_version`, and `data_security_mode`. The `_register_cluster_info` method and `_lint_task` method in `WorkflowLinter` have also been updated to use the new `CurrentSessionState` attributes when linting a task. A new method `Job()` has been added to the `WorkflowTaskContainer` class, used in multiple unit tests to create a `Job` object and pass it as an argument to the `WorkflowTaskContainer` constructor. The tests cover various scenarios for library types, such as jar files, PyPI libraries, Python wheels, and requirements files, and ensure that the `WorkflowTaskContainer` object can extract the relevant information from a `Job` object and store it for later use.
* Support inferred values when linting DBFS mounts ([#1868](#1868)). This commit adds value inference and enhances the consistency of advice messages in the context of linting Databricks File System (DBFS) mounts, addressing issue [#1205](#1205). It improves the precision of deprecated file system path calls and updates the handling of default DBFS references, making the code more robust and future-proof. The linter's behavior has been enhanced to detect DBFS paths in various formats, including string constants and variables. The test suite has been updated to include new cases and provide clearer deprecation warnings. This commit also refines the way advice is generated for deprecated file system path calls and renames `Advisory` to `Deprecation` in some places, providing more accurate and helpful feedback to developers.
* Support inferred values when linting spark.sql ([#1870](#1870)). In this release, we have added support for inferring the values of table names when linting PySpark code, improving the accuracy and usefulness of the PySpark linter. This feature includes the ability to handle inferred values in Spark SQL code and updates to the test suite to reflect the updated linting behavior. The `QueryMatcher` class in `pyspark.py` has been updated to infer the value of the table name argument in a `Call` node, and an advisory message is generated if the value cannot be inferred. Additionally, the use of direct filesystem references, such as "s3://bucket/path", will be deprecated in favor of more dynamic and flexible querying. For example, the table "old.things" has been migrated to "brand.new.stuff" in the Unity Catalog. Furthermore, a loop has been introduced to demonstrate the ability to compute table names programmatically within SQL queries, enhancing the system's flexibility and adaptability.
* Support inferred values when linting sys path ([#1866](#1866)). In this release, the library's linting system has been enhanced with added support for inferring values in the system path. The `DependencyGraph` class in `graph.py` has been updated to handle new node types, including `SysPathChange`, `NotebookRunCall`, `ImportSource`, and `UnresolvedPath`. The `UnresolvedPath` node is added for unresolved paths during linting, and new methods have been introduced in `conftest.py` for testing, such as `DependencyResolver`, `Whitelist`, `PythonLibraryResolver`, `NotebookResolver`, and `ImportFileResolver`. Additionally, the library now recognizes inferred values, including absolute paths added to the system path via `sys.path.append`. New tests have been added to ensure the correct behavior of the `DependencyResolver` class. This release also introduces a new file, `sys-path-with-fstring.py`, which demonstrates the use of Python's f-string syntax to append values to the system path, and a new method, `BaseImportResolver`, has been added to the `DependencyResolver` class to resolve imports more flexibly and robustly.

ericvergnaud added 3 commits June 7, 2024 11:15

move tests to dedicated file

f7e4bbc

formatting

3caa312

support multiple inferred values

7b1e8f5

ericvergnaud requested review from a team and mwojtyczka June 7, 2024 11:00

ericvergnaud temporarily deployed to account-admin June 7, 2024 11:00 — with GitHub Actions Inactive

JCZuurmond previously requested changes Jun 7, 2024

ericvergnaud added 3 commits June 7, 2024 16:34

fix typo

b245148

register loadable dependencies

91bce51

more tests

f3323ca

ericvergnaud had a problem deploying to account-admin June 7, 2024 14:35 — with GitHub Actions Error

formatting

757d5d8

ericvergnaud had a problem deploying to account-admin June 7, 2024 14:47 — with GitHub Actions Error

Merge branch 'main' into more-inference-tests

3f5e621

ericvergnaud had a problem deploying to account-admin June 7, 2024 14:52 — with GitHub Actions Error

more test coverage

d4d7389

ericvergnaud had a problem deploying to account-admin June 7, 2024 15:10 — with GitHub Actions Error

ericvergnaud requested a review from JCZuurmond June 7, 2024 15:13

ericvergnaud enabled auto-merge June 7, 2024 15:13

factorize code

01c213f

ericvergnaud had a problem deploying to account-admin June 7, 2024 15:22 — with GitHub Actions Failure

nfx requested changes Jun 7, 2024

Merge branch 'main' into more-inference-tests

37fee3e

ericvergnaud had a problem deploying to account-admin June 7, 2024 17:25 — with GitHub Actions Failure

nfx requested changes Jun 7, 2024

nfx disabled auto-merge June 7, 2024 17:34

ericvergnaud added 2 commits June 7, 2024 19:38

fix merge issues

04d1a21

address comments

bbaca0f

ericvergnaud added 2 commits June 7, 2024 20:01

test private API using test class

b6f27b0

add functional test

394be93

ericvergnaud had a problem deploying to account-admin June 7, 2024 18:17 — with GitHub Actions Error

add comment

6880e7c

ericvergnaud had a problem deploying to account-admin June 7, 2024 18:22 — with GitHub Actions Failure

nfx mentioned this pull request Jun 7, 2024

Add attribute resolution logic #1864

Closed

nfx requested changes Jun 7, 2024

nfx changed the title ~~More value inference for dbutils.notebook.run~~ Added more value inference for dbutils.notebook.run Jun 7, 2024

nfx changed the title ~~Added more value inference for dbutils.notebook.run~~ Added more value inference for dbutils.notebook.run(...) Jun 7, 2024

address comments

525cc75

ericvergnaud had a problem deploying to account-admin June 8, 2024 10:18 — with GitHub Actions Error

Merge branch 'main' into more-inference-tests

469de9b

ericvergnaud had a problem deploying to account-admin June 8, 2024 10:20 — with GitHub Actions Failure

ericvergnaud requested a review from nfx June 8, 2024 10:22

ericvergnaud mentioned this pull request Jun 8, 2024

Support inferred values when linting sys path - saved #1863

Closed

11 tasks

nfx approved these changes Jun 8, 2024

formatting

5048314

ericvergnaud temporarily deployed to account-admin June 8, 2024 18:41 — with GitHub Actions Inactive

ericvergnaud removed request for JCZuurmond and mwojtyczka June 8, 2024 19:18

ericvergnaud enabled auto-merge June 8, 2024 20:39

nfx merged commit 879a5b4 into main Jun 8, 2024
8 checks passed

nfx deleted the more-inference-tests branch June 8, 2024 21:08

ericvergnaud mentioned this pull request Jun 10, 2024

[FEATURE]: If a code computes a value dynamically, do value inference, at least at the state of linting #1205

Closed

1 task

nfx mentioned this pull request Jun 12, 2024

Release v0.27.0 #1898

Merged
