[FEATURE]: If a code computes a value dynamically, do value inference, at least at the state of linting #1205

Closed
1 task done
Tracked by #1085 ...
nfx opened this issue Apr 1, 2024 · 4 comments
Labels
migrate/code Abstract Syntax Trees and other dark magic

Comments

@nfx
Collaborator

nfx commented Apr 1, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Problem statement

If the arguments to the relevant functions are not string literals but can be resolved to them, resolve them and perform linting, e.g. in cases like:

db = 'foo'
t = f'{db}.bar'
display(spark.table(t))

we have to check whether the foo.bar table is migrated.

Proposed Solution

Infer values from the available scope.
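
A minimal sketch of what inferring values from the available scope could look like for the example above. It uses only the standard-library `ast` module and is not the UCX implementation; `infer_string` and `table_names` are hypothetical helper names, and the sketch assumes straight-line, module-level code where the last assignment wins.

```python
import ast


def infer_string(node, scope):
    """Return the string `node` evaluates to, or None if it cannot be inferred."""
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        return node.value
    if isinstance(node, ast.Name):
        # `scope` maps variable names to already-inferred strings (or None).
        return scope.get(node.id)
    if isinstance(node, ast.JoinedStr):  # an f-string
        parts = []
        for value in node.values:
            if isinstance(value, ast.FormattedValue):
                # Give up on conversions (!r) and format specs (:>10).
                if value.conversion != -1 or value.format_spec is not None:
                    return None
                value = value.value
            part = infer_string(value, scope)
            if part is None:
                return None
            parts.append(part)
        return "".join(parts)
    return None


def table_names(source):
    """Infer the first argument of every `spark.table(...)` call in `source`."""
    scope = {}
    found = []
    for stmt in ast.parse(source).body:
        # Resolve calls with the bindings known before this statement.
        for node in ast.walk(stmt):
            if (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "table"
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "spark"
                and node.args
            ):
                found.append(infer_string(node.args[0], scope))
        # Then record simple `name = <expr>` bindings for later statements.
        if (
            isinstance(stmt, ast.Assign)
            and len(stmt.targets) == 1
            and isinstance(stmt.targets[0], ast.Name)
        ):
            scope[stmt.targets[0].id] = infer_string(stmt.value, scope)
    return found


example = "db = 'foo'\nt = f'{db}.bar'\ndisplay(spark.table(t))\n"
print(table_names(example))  # ['foo.bar']
```

A `None` entry in the result means the value could not be inferred, which is where a linter would fall back to a "cannot be computed" style of advice.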

Additional Context

No response

@nfx nfx added the migrate/code Abstract Syntax Trees and other dark magic label Apr 1, 2024
@nfx nfx added this to UCX Apr 1, 2024
@github-project-automation github-project-automation bot moved this to Triage in UCX Apr 1, 2024
@nfx nfx moved this from Triage to Month Backlog in UCX Apr 22, 2024
@JCZuurmond JCZuurmond moved this from Month Backlog to Active Backlog in UCX Apr 24, 2024
github-merge-queue bot pushed a commit that referenced this issue Jun 6, 2024
…kage (#1835)

## Changes
Migrate Python linters from ast to astroid
Implement minimal inference

### Linked issues
Progresses #1205 

### Functionality 

- [ ] added relevant user documentation
- [ ] added new CLI command
- [ ] modified existing command: `databricks labs ucx ...`
- [ ] added a new workflow
- [ ] modified existing workflow: `...`
- [ ] added a new table
- [ ] modified existing table: `...`

### Tests
- [ ] manually tested
- [x] added unit tests
- [ ] added integration tests
- [ ] verified on staging environment (screenshot attached)

---------

Co-authored-by: Eric Vergnaud <[email protected]>
@ericvergnaud
Contributor

ericvergnaud commented Jun 7, 2024

Migration to astroid is successful.
Simple inference works.

What doesn't work (no inference):

  • f-strings, e.g. dbutils.notebook.run(f"prefix {value} suffix") -> see test_raises_advice_when_dbutils_notebook_run_is_too_complex

@ericvergnaud
Contributor

ericvergnaud commented Jun 10, 2024

@nfx I believe value inference is done for all currently implemented scenarios, see #1835, #1860, #1866, #1868 and #1870.
I'm not sure whether missing linters should be implemented as part of this issue (I guess not). Rather, value inference should be done as part of their dedicated implementation?
If the latter, we can close this ticket.

@ericvergnaud
Contributor

@nfx as of writing, the only known issue is with f-strings (see above). I've created the corresponding ticket #1871.

@nfx
Collaborator Author

nfx commented Jun 10, 2024

@ericvergnaud let's solve the f-string issue, as it's very common

nfx pushed a commit that referenced this issue Jun 10, 2024
## Changes
Implement value inference
Improve consistency of generated advices

### Linked issues
Progresses #1205

---------

Co-authored-by: Eric Vergnaud <[email protected]>
nfx pushed a commit that referenced this issue Jun 11, 2024
)

## Changes
Add support for f-strings not natively supported by astroid
Adjust existing linters accordingly

Works for simple f-strings such as:
```python
a = "value"
some_function(f"some {a}")
```
but not for f-strings of f-strings such as:

```python
a = "value"
b = f"some {a}"
some_function(f"{b} is better than none")
```

### Linked issues
Resolves #1871 
Progresses #1205

---------

Co-authored-by: Eric Vergnaud <[email protected]>
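
The difference between the two cases in the commit message above can be illustrated with a one-level resolver. This is only a sketch of one way such a limitation can arise, not the code from the commit: it uses the standard-library `ast` module rather than astroid, and `resolve_simple_fstring` is a hypothetical name. It substitutes placeholders only from names bound to plain string constants, so a name bound to another f-string is left unresolved.

```python
import ast


def _join(node, constants):
    """Resolve an f-string whose placeholders are names bound to string constants."""
    if not isinstance(node, ast.JoinedStr):
        return None
    parts = []
    for value in node.values:
        if isinstance(value, ast.Constant) and isinstance(value.value, str):
            parts.append(value.value)
        elif (
            isinstance(value, ast.FormattedValue)
            and value.conversion == -1
            and value.format_spec is None
            and isinstance(value.value, ast.Name)
            and value.value.id in constants
        ):
            parts.append(constants[value.value.id])
        else:
            return None  # placeholder is not a known plain string constant
    return "".join(parts)


def resolve_simple_fstring(source):
    """Infer the first argument of the last top-level call in `source`."""
    constants = {}
    result = None
    for stmt in ast.parse(source).body:
        if (
            isinstance(stmt, ast.Assign)
            and len(stmt.targets) == 1
            and isinstance(stmt.targets[0], ast.Name)
        ):
            name = stmt.targets[0].id
            if isinstance(stmt.value, ast.Constant) and isinstance(stmt.value.value, str):
                constants[name] = stmt.value.value
            else:
                # Anything else (including an f-string) is treated as unknown.
                constants.pop(name, None)
        elif (
            isinstance(stmt, ast.Expr)
            and isinstance(stmt.value, ast.Call)
            and stmt.value.args
        ):
            result = _join(stmt.value.args[0], constants)
    return result


simple = 'a = "value"\nsome_function(f"some {a}")\n'
nested = 'a = "value"\nb = f"some {a}"\nsome_function(f"{b} is better than none")\n'
print(resolve_simple_fstring(simple))  # some value
print(resolve_simple_fstring(nested))  # None
```

Supporting the second case would require resolving `b` itself before substituting it, i.e. inferring assigned values recursively instead of accepting only constants.
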
nfx added a commit that referenced this issue Jun 12, 2024
* Added `mlflow` to known packages ([#1895](#1895)). The `mlflow` package has been incorporated into the project and is now recognized as a known package. This integration includes modifications to the use of `mlflow` in the context of UC Shared Clusters, providing recommendations to modify or rewrite certain functionalities related to `sparkContext`, `_conf`, and `RDD` APIs. Additionally, the artifact storage system of `mlflow` in Databricks and DBFS has undergone changes. The `known.json` file has also been updated with several new packages, such as `alembic`, `aniso8601`, `cloudpickle`, `docker`, `entrypoints`, `flask`, `graphene`, `graphql-core`, `graphql-relay`, `gunicorn`, `html5lib`, `isort`, `jinja2`, `markdown`, `markupsafe`, `mccabe`, `opentelemetry-api`, `opentelemetry-sdk`, `opentelemetry-semantic-conventions`, `packaging`, `pyarrow`, `pyasn1`, `pygments`, `pyrsistent`, `python-dateutil`, `pytz`, `pyyaml`, `regex`, `requests`, and more. These packages are now acknowledged and incorporated into the project's functionality.
* Added `tensorflow` to known packages ([#1897](#1897)). In this release, we are excited to announce the addition of the `tensorflow` package to our known packages list. Tensorflow is a popular open-source library for machine learning and artificial intelligence applications. This package includes several components such as `tensorflow`, `tensorboard`, `tensorboard-data-server`, and `tensorflow-io-gcs-filesystem`, which enable training, evaluation, and deployment of machine learning models, visualization of machine learning model metrics and logs, and access to Google Cloud Storage filesystems. Additionally, we have included other packages such as `gast`, `grpcio`, `h5py`, `keras`, `libclang`, `mdurl`, `namex`, `opt-einsum`, `optree`, `pygments`, `rich`, `rsa`, `termcolor`, `pyasn1_modules`, `sympy`, and `threadpoolctl`. These packages provide various functionalities required for different use cases, such as parsing Abstract Syntax Trees, efficient serial communication, handling HDF5 files, and managing threads. This release aims to enhance the functionality and capabilities of our platform by incorporating these powerful libraries and tools.
* Added `torch` to known packages ([#1896](#1896)). In this release, the "known.json" file has been updated to include several new packages and their respective modules for a specific project or environment. These packages include "torch", "functorch", "mpmath", "networkx", "sympy", "isympy". The addition of these packages and modules ensures that they are recognized and available for use, preventing issues with missing dependencies or version conflicts. Furthermore, the `_analyze_dist_info` method in the `known.py` file has been improved to handle recursion errors during package analysis. A try-except block has been added to the loop that analyzes the distribution info folder, which logs the error and moves on to the next file if a `RecursionError` occurs. This enhancement increases the robustness of the package analysis process.
* Added more known libraries ([#1894](#1894)). In this release, the `known` library has been enhanced with the addition of several new packages, bringing improved functionality and versatility to the software. Key additions include contourpy for drawing contours on 2D grids, cycler for creating cyclic iterators, docker-pycreds for managing Docker credentials, filelock for platform-independent file locking, fonttools for manipulating fonts, and frozendict for providing immutable dictionaries. Additional libraries like fsspec for accessing various file systems, gitdb and gitpython for working with git repositories, google-auth for Google authentication, html5lib for parsing and rendering HTML documents, and huggingface-hub for working with the Hugging Face model hub have been incorporated. Furthermore, the release includes idna, kiwisolver, lxml, matplotlib, mypy, peewee, protobuf, psutil, pyparsing, regex, requests, safetensors, sniffio, smmap, tokenizers, tomli, tqdm, transformers, types-pyyaml, types-requests, typing_extensions, tzdata, umap, unicorn, unidecode, urllib3, wandb, waterbear, wordcloud, xgboost, and yfinance for expanded capabilities. The zipp and zingg libraries have also been included for module name transformations and data mastering, respectively. Overall, these additions are expected to significantly enhance the software's functionality.
* Added more value inference for `dbutils.notebook.run(...)` ([#1860](#1860)). In this release, the `dbutils.notebook.run(...)` functionality in `graph.py` has been significantly updated to enhance value inference. The change includes the introduction of new methods for handling `NotebookRunCall` and `SysPathChange` objects, as well as the refactoring of the `get_notebook_path` method into `get_notebook_paths`. This new method now returns a tuple of a boolean and a list of strings, indicating whether any nodes could not be resolved and providing a list of inferred paths. A new private method, `_get_notebook_paths`, has also been added to retrieve notebook paths from a list of nodes. Furthermore, the `load_dependency` method in `loaders.py` has been updated to detect the language of a notebook based on the file path, in addition to its content. The `Notebook` class now includes a new parameter, `SUPPORTED_EXTENSION_LANGUAGES`, which maps file extensions to their corresponding languages. In the `databricks.labs.ucx` project, more value inference has been added to the linter, including new methods and enhanced functionality for `dbutils.notebook.run(...)`. Several tests have been added or updated to demonstrate various scenarios and ensure the linter handles dynamic values appropriately. A new test file for the `NotebookLoader` class in the `databricks.labs.ucx.source_code.notebooks.loaders` module has been added, with a new class, `NotebookLoaderForTesting`, that overrides the `detect_language` method to make it a class method. This allows for more robust testing of the `NotebookLoader` class. Overall, these changes improve the accuracy and reliability of value inference for `dbutils.notebook.run(...)` and enhance the testing and usability of the related classes and methods.
* Added nightly workflow to use industry solution accelerators for parser validation ([#1883](#1883)). A nightly workflow has been added to validate the parser using industry solution accelerators, which can be triggered locally with the `make solacc` command. This workflow involves a new Makefile target, 'solacc', which runs a Python script located at 'tests/integration/source_code/solacc.py'. The workflow is designed to run on the latest Ubuntu, installing Python 3.10 and hatch 1.9.4 using pip, and checking out the code with a fetch depth of 0. It runs on a daily basis at 7am using a cron schedule, and can also be triggered locally. The purpose of this workflow is to ensure parser compatibility with various industry solutions, improving overall software quality and robustness.
* Complete support for pip install command ([#1853](#1853)). In this release, we've made significant enhancements to support the `pip install` command in our open-source library. The `register_library` method in the `DependencyResolver`, `NotebookResolver`, and `LocalFileResolver` classes has been modified to accept variable numbers of libraries instead of just one, allowing for more efficient dependency management. Additionally, the `resolve_import` method has been introduced in the `NotebookResolver` and `LocalFileResolver` classes for improved import resolution. Moreover, the `_split` static method has been implemented for better handling of pip command code and egg packages. The library now also supports the resolution of imports in notebooks and local files. These changes provide a solid foundation for full `pip install` command support, improving overall robustness and functionality. Furthermore, extensive updates to tests, including workflow linter and job dlt task linter modifications, ensure the reliability of the library when working with Jupyter notebooks and pip-installable libraries.
* Infer simple f-string values when computing values during linting ([#1876](#1876)). This commit enhances the open-source library by adding support for inferring simple f-string values during linting, addressing issue [#1871](#1871) and progressing [#1205](#1205). The new functionality works for simple f-strings but currently does not support nested f-strings. It introduces the InferredValue class and updates the visit_call, visit_const, and _check_str_constant methods for better linter feedback. Additionally, it includes modifications to a unit test file and adjustments to error location in code. The commit also presents an example of simple f-string handling, emphasizing the limitations yet providing a solid foundation for future development. Co-authored by Eric Vergnaud.
* Propagate widget parameters and data security mode to `CurrentSessionState` ([#1872](#1872)). In this release, the `spark_version_compatibility` function in `crawlers.py` has been refactored to `runtime_version_tuple`, returning a tuple of integers instead of a string. The function now handles custom runtimes and DLT, and raises a ValueError if the version components cannot be converted to integers. Additionally, the `CurrentSessionState` class has been updated to propagate named parameters from jobs and check for DBFS paths as both named and positional parameters. New attributes, including `spark_conf`, `named_parameters`, and `data_security_mode`, have been added to the class, all with default values of `None`. The `WorkflowTaskContainer` class has also been modified to include an additional `job` parameter in its constructor and new attributes for `named_parameters`, `spark_conf`, `runtime_version`, and `data_security_mode`. The `_register_cluster_info` method and `_lint_task` method in `WorkflowLinter` have also been updated to use the new `CurrentSessionState` attributes when linting a task. A new method `Job()` has been added to the `WorkflowTaskContainer` class, used in multiple unit tests to create a `Job` object and pass it as an argument to the `WorkflowTaskContainer` constructor. The tests cover various scenarios for library types, such as jar files, PyPI libraries, Python wheels, and requirements files, and ensure that the `WorkflowTaskContainer` object can extract the relevant information from a `Job` object and store it for later use.
* Support inferred values when linting DBFS mounts ([#1868](#1868)). This commit adds value inference and enhances the consistency of advice messages in the context of linting Databricks File System (DBFS) mounts, addressing issue [#1205](#1205). It improves the precision of deprecated file system path calls and updates the handling of default DBFS references, making the code more robust and future-proof. The linter's behavior has been enhanced to detect DBFS paths in various formats, including string constants and variables. The test suite has been updated to include new cases and provide clearer deprecation warnings. This commit also refines the way advice is generated for deprecated file system path calls and renames `Advisory` to `Deprecation` in some places, providing more accurate and helpful feedback to developers.
* Support inferred values when linting spark.sql ([#1870](#1870)). In this release, we have added support for inferring the values of table names when linting PySpark code, improving the accuracy and usefulness of the PySpark linter. This feature includes the ability to handle inferred values in Spark SQL code and updates to the test suite to reflect the updated linting behavior. The `QueryMatcher` class in `pyspark.py` has been updated to infer the value of the table name argument in a `Call` node, and an advisory message is generated if the value cannot be inferred. Additionally, the use of direct filesystem references, such as "s3://bucket/path", will be deprecated in favor of more dynamic and flexible querying. For example, the table "old.things" has been migrated to "brand.new.stuff" in the Unity Catalog. Furthermore, a loop has been introduced to demonstrate the ability to compute table names programmatically within SQL queries, enhancing the system's flexibility and adaptability.
* Support inferred values when linting sys path ([#1866](#1866)). In this release, the library's linting system has been enhanced with added support for inferring values in the system path. The `DependencyGraph` class in `graph.py` has been updated to handle new node types, including `SysPathChange`, `NotebookRunCall`, `ImportSource`, and `UnresolvedPath`. The `UnresolvedPath` node is added for unresolved paths during linting, and new methods have been introduced in `conftest.py` for testing, such as `DependencyResolver`, `Whitelist`, `PythonLibraryResolver`, `NotebookResolver`, and `ImportFileResolver`. Additionally, the library now recognizes inferred values, including absolute paths added to the system path via `sys.path.append`. New tests have been added to ensure the correct behavior of the `DependencyResolver` class. This release also introduces a new file, `sys-path-with-fstring.py`, which demonstrates the use of Python's f-string syntax to append values to the system path, and a new method, `BaseImportResolver`, has been added to the `DependencyResolver` class to resolve imports more flexibly and robustly.
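
The `dbutils.notebook.run(...)` entry above describes `get_notebook_paths` as returning a boolean plus a list of strings. The return shape can be sketched as follows, assuming each argument has already been inferred to either a string or `None`; `collect_notebook_paths` is a hypothetical stand-in, not the method in `graph.py`.

```python
from typing import List, Optional, Tuple


def collect_notebook_paths(inferred: List[Optional[str]]) -> Tuple[bool, List[str]]:
    """Split inference results into an "anything unresolved?" flag and the resolved paths."""
    has_unresolved = any(value is None for value in inferred)
    paths = [value for value in inferred if value is not None]
    return has_unresolved, paths


# Two candidate paths were inferred, one could not be computed.
has_unresolved, paths = collect_notebook_paths(["./etl/load", None, "./etl/clean"])
print(has_unresolved, paths)  # True ['./etl/load', './etl/clean']
```

Returning both values lets a caller keep following the paths it could resolve while still emitting advice for the ones it could not.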
@nfx nfx mentioned this issue Jun 12, 2024
nfx pushed a commit that referenced this issue Jul 5, 2024
## Changes
When linting Python code, infer values using not only code from the current
cell but also code from previous cells

### Linked issues
Progresses #1912
Progresses #1205 

### Functionality 
None

### Tests
- [x] manually tested
- [x] added unit tests

Resolved 60 out of 891 "cannot be computed" advices when running make
solacc

---------

Co-authored-by: Eric Vergnaud <[email protected]>
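
The effect of using previous cells, as described in the commit message above, can be sketched by keeping one scope alive across cells instead of resetting it per cell. This is an illustration with the standard-library `ast` module under simplifying assumptions (only string constants and plain names are inferred); `table_args_per_cell` and its `share_scope` flag are hypothetical.

```python
import ast


def infer(node, scope):
    """Infer a string constant or a name already bound in `scope`; otherwise None."""
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        return node.value
    if isinstance(node, ast.Name):
        return scope.get(node.id)
    return None


def table_args_per_cell(cells, share_scope=True):
    """For each cell, list the inferred first argument of every `.table(...)` call."""
    scope = {}
    results = []
    for cell in cells:
        if not share_scope:
            scope = {}  # per-cell linting: earlier bindings are forgotten
        found = []
        for stmt in ast.parse(cell).body:
            for node in ast.walk(stmt):
                if (
                    isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Attribute)
                    and node.func.attr == "table"
                    and node.args
                ):
                    found.append(infer(node.args[0], scope))
            if (
                isinstance(stmt, ast.Assign)
                and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)
            ):
                scope[stmt.targets[0].id] = infer(stmt.value, scope)
        results.append(found)
    return results


cells = ["name = 'foo.bar'", "display(spark.table(name))"]
print(table_args_per_cell(cells))                     # [[], ['foo.bar']]
print(table_args_per_cell(cells, share_scope=False))  # [[], [None]]
```

With a per-cell scope the second cell yields `None`, which corresponds to the kind of "cannot be computed" advice the commit reports resolving.
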
@nfx nfx mentioned this issue Jul 5, 2024
nfx added a commit that referenced this issue Jul 5, 2024
* Added handling for exceptions with no error_code attribute while crawling permissions ([#2079](https://github.com/databrickslabs/ucx/issues/2079)). A new enhancement has been implemented to improve error handling during the assessment job's permission crawling process. Previously, exceptions that lacked an `error_code` attribute would cause an `AttributeError`. This release introduces a check for the existence of the `error_code` attribute before attempting to access it, logging an error and adding it to the list of acute errors if not present. The change includes a new unit test for verification, and the relevant functionality has been added to the `inventorize_permissions` function within the `manager.py` file. The new method, `test_manager_inventorize_fail_with_error`, has been implemented to test the permission manager's behavior when encountering errors during the inventory process, raising `DatabricksError` and `TimeoutError` instances with and without `error_code` attributes. This update resolves issue [#2078](https://github.com/databrickslabs/ucx/issues/2078) and enhances the overall robustness of the assessment job's permission crawling functionality.
* Added handling for missing permission to read file ([#1949](https://github.com/databrickslabs/ucx/issues/1949)). In this release, we've addressed an issue where missing permissions to read a file during linting were not being handled properly. The revised code now checks for `NotFound` and `PermissionError` exceptions when attempting to read a file's text content. If a `NotFound` exception occurs, the function returns None and logs a warning message. If a `PermissionError` exception occurs, the function also returns None and logs a warning message with the error's traceback. This change resolves issue [#1942](https://github.com/databrickslabs/ucx/issues/1942) and partially resolves issue [#1952](https://github.com/databrickslabs/ucx/issues/1952), improving the robustness of the linting process and providing more informative error messages. Additionally, new tests and methods have been added to handle missing files and missing read permissions during linting, ensuring that the file linter can handle these cases correctly.
* Added handling for unauthenticated exception while joining collection ([#1958](https://github.com/databrickslabs/ucx/issues/1958)). A new exception type, `Unauthenticated`, has been added to the import statement, and new error messages have been implemented in the `_sync_collection` and `_get_collection_workspace` functions to notify users when they do not have admin access to the workspace. A try-except block has been added in the `_get_collection_workspace` function to handle the `Unauthenticated` exception, and a warning message is logged indicating that the user needs account admin and workspace admin credentials to enable collection joining and to run the `join-collection` command with account admin credentials. Additionally, a new CLI command has been added, and the existing `databricks labs ucx ...` command has been modified. A new workflow for joining the collection has also been implemented. These changes have been thoroughly documented in the user documentation and verified on the staging environment.
* Added tracking for UCX workflows and as-library usage ([#1966](https://github.com/databrickslabs/ucx/issues/1966)). This commit introduces User-Agent tracking for UCX workflows and library usage, adding `ucx/<version>`, `cmd/install`, and `cmd/<workflow>` elements to relevant requests. These changes are implemented within the `test_useragent.py` file, which includes the new `http_fixture_server` context manager for testing User-Agent propagation in UCX workflows. The addition of `with_user_agent_extra` and the inclusion of `with_product` functions from `databricks.sdk.core` aim to provide valuable insights for debugging, maintenance, and improving UCX workflow performance. This feature will help gather clear usage metrics for UCX and enhance the overall user experience.
* Analyse `altair` ([#2005](https://github.com/databrickslabs/ucx/issues/2005)). This release whitelists the `altair` library, addressing issue [#1901](https://github.com/databrickslabs/ucx/issues/1901). The changes add several modules and sub-modules under the `altair` package, including `altair`, `altair._magics`, `altair.expr`, and various others such as `altair.utils`, `altair.utils._dfi_types`, `altair.utils._importers`, and `altair.utils._show`. The `known.json` file has been modified to include the `altair` package. No new functionality has been introduced, and the changes have been manually verified. This change was developed by Eric Vergnaud.
* Analyse `azure` ([#2016](https://github.com/databrickslabs/ucx/issues/2016)). In this release, we have made updates to the whitelist of several Azure libraries, including `azure-common`, `azure-core`, `azure-mgmt-core`, `azure-mgmt-digitaltwins`, and `azure-storage-blob`. These changes are intended to manage dependencies and ensure a secure and stable environment for software engineers working with these libraries. The `azure-common` library has been added to the whitelist, and updates have been made to the existing whitelists for the other libraries. These changes do not add or modify any functionality or test cases, but are important for maintaining the integrity of our open-source library. This commit was co-authored by Eric Vergnaud from Databricks.
* Analyse `causal-learn` ([#2012](https://github.com/databrickslabs/ucx/issues/2012)). In this release, we have added `causal-learn` to the whitelist in our JSON file, signifying that it is now a supported library. This update includes the addition of various modules, classes, and functions to 'causal-learn'. We would like to emphasize that there are no changes to existing functionality, nor have any new methods been added. This release is thoroughly tested to ensure functionality and stability. We hope that software engineers in the community will find this update helpful and consider adopting this project.
* Analyse `databricks-arc` ([#2004](https://github.com/databrickslabs/ucx/issues/2004)). This release introduces whitelisting for the `databricks-arc` library, which is used for data analytics and machine learning. The release updates the `known.json` file to include `databricks-arc` and its related modules such as `arc.autolinker`, `arc.sql`, `arc.sql.enable_arc`, `arc.utils`, and `arc.utils.utils`. It also provides specific error codes and messages related to using these libraries on UC Shared Clusters. Additionally, this release includes updates to the `databricks-feature-engineering` library, with the addition of many new modules and error codes related to JVM access, legacy context, and spark logging. The `databricks.ml_features` library has several updates, including changes to the `_spark_client` and `publish_engine`. The `databricks.ml_features.entities` module has many updates, with new classes and methods for handling features, specifications, tables, and more. These updates offer improved functionality and error handling for the whitelisted libraries, specifically when used on UC Shared Clusters.
* Analyse `dbldatagen` ([#1985](https://github.com/databrickslabs/ucx/issues/1985)). The `dbldatagen` package has been whitelisted in the `known.json` file in this release. While there are no new or altered functionalities, several updates have been made to the methods and objects within `dbldatagen`. This includes enhancements to `dbldatagen._version`, `dbldatagen.column_generation_spec`, `dbldatagen.column_spec_options`, `dbldatagen.constraints`, `dbldatagen.data_analyzer`, `dbldatagen.data_generator`, `dbldatagen.datagen_constants`, `dbldatagen.datasets`, and related classes. Additionally, `dbldatagen.datasets.basic_geometries`, `dbldatagen.datasets.basic_process_historian`, `dbldatagen.datasets.basic_telematics`, `dbldatagen.datasets.basic_user`, `dbldatagen.datasets.benchmark_groupby`, `dbldatagen.datasets.dataset_provider`, `dbldatagen.datasets.multi_table_telephony_provider`, and `dbldatagen.datasets_object` have been updated. The distribution methods, such as `dbldatagen.distributions`, `dbldatagen.distributions.beta`, `dbldatagen.distributions.data_distribution`, `dbldatagen.distributions.exponential_distribution`, `dbldatagen.distributions.gamma`, and `dbldatagen.distributions.normal_distribution`, have also seen improvements. Furthermore, `dbldatagen.function_builder`, `dbldatagen.html_utils`, `dbldatagen.nrange`, `dbldatagen.schema_parser`, `dbldatagen.spark_singleton`, `dbldatagen.text_generator_plugins`, and `dbldatagen.text_generators` have been updated. The `dbldatagen.data_generator` method now includes a warning about the deprecated `sparkContext` in shared clusters, and `dbldatagen.schema_parser` includes updates related to the `table_name` argument in various SQL statements. These changes ensure better compatibility and improved functionality of the `dbldatagen` package.
* Analyse `delta-spark` ([#1987](https://github.com/databrickslabs/ucx/issues/1987)). In this release, the `delta-spark` component within the `delta` project has been whitelisted with the inclusion of a new entry in the `known.json` configuration file. This addition brings in several sub-components, including `delta._typing`, `delta.exceptions`, and `delta.tables`, each with a `jvm-access-in-shared-clusters` error code and message for unsupported environments. These changes aim to enhance the handling of `delta-spark` component within the `delta` project. The changes have been rigorously tested and do not introduce new functionality or modify existing behavior. This update is ensured to provide better stability and compatibility to the project. Co-authored by Eric Vergnaud.
* Analyse `diffusers` ([#2010](https://github.com/databrickslabs/ucx/issues/2010)). A new `diffusers` category has been added to the JSON configuration file, featuring several subcategories and numerous empty arrays as values. This change serves to prepare the configuration for future additions, without altering any existing methods or behaviors. As such, this update does not impact current functionality, but instead, sets the stage for further development. No associated tests or functional changes accompany this modification.
* Analyse `faker` ([#2014](https://github.com/databrickslabs/ucx/issues/2014)). In this release, the `faker` library in the Databricks project has undergone whitelisting, addressing security concerns, improving performance, and reducing the attack surface. No new methods were added, and the existing functionality remains unchanged. Thorough manual verification of the tests has been conducted. This release introduces various modules and submodules related to the `faker` library, expanding its capabilities in address generation in multiple languages and countries, along with new providers for bank, barcode, color, company, credit_card, currency, date_time, emoji, file, geo, internet, isbn, job, lorem, misc, passport, person, phone_number, profile, python, sbn, ssn, and user_agent generation. Software engineers should find these improvements advantageous for their projects, offering a broader range of options and enhanced performance.
* Analyse `fastcluster` ([#1980](https://github.com/databrickslabs/ucx/issues/1980)). In this release, the project's configuration has been updated to include the `fastcluster` package in the approved libraries whitelist, as part of issue [#1901](https://github.com/databrickslabs/ucx/issues/1901) resolution. This change enables software engineers to utilize the functions and methods provided by `fastcluster` in the project's codebase. The `fastcluster` package is now registered in the `known.json` configuration file, and its integration has been thoroughly tested to ensure seamless functionality. By incorporating `fastcluster`, the project's capabilities are expanded, allowing software engineers to benefit from its optimized clustering algorithms and performance enhancements.
* Analyse `glow` ([#1973](https://github.com/databrickslabs/ucx/issues/1973)). In this release, we have analyzed and added the `glow` library and its modules, including `glow._array`, `glow._coro`, `glow._debug`, and others, to the `known.json` file whitelist. This change allows for seamless integration and usage of the `glow` library in your projects. It is important to note that this update does not modify any existing functionality and has been thoroughly tested to ensure compatibility. Software engineers utilizing the `glow` library will benefit from this enhancement, as it provides explicit approval for the library and its modules, facilitating a more efficient development process.
* Analyse `graphframes` ([#1990](https://github.com/databrickslabs/ucx/issues/1990)). In this release, the `graphframes` library has been thoroughly analyzed and the whitelist updated accordingly. This includes the addition of several new entries, such as `graphframes.examples.belief_propagation`, `graphframes.examples.graphs`, `graphframes.graphframe`, `graphframes.lib.aggregate_messages`, and `graphframes.tests`. These changes may require modifications such as rewriting code to use Spark or accessing the Spark Driver JVM. These updates aim to improve compatibility with UC Shared Clusters, ensuring a more seamless integration. Manual testing has been conducted to ensure the changes are functioning as intended.
* Analyse `graphviz` ([#2008](https://github.com/databrickslabs/ucx/issues/2008)). In this release, we have analyzed and whitelisted the `graphviz` library for use in the project. The library has been added to the `known.json` file, which is used to manage dependencies. The `graphviz` package contains several modules and sub-modules, including `backend`, `dot`, `exceptions`, `graphs`, `jupyter_integration`, `parameters`, `rendering`, and `saving`. While we do not have detailed information on the functionality provided by these modules at this time, they have been manually tested for correct functioning. This addition enhances the project's graphing and visualization capabilities by incorporating the well-regarded `graphviz` library.
* Analyse `hyperopt` ([#1970](https://github.com/databrickslabs/ucx/issues/1970)). In this release, we have made changes to include the `hyperopt` library in our project, addressing issue [#1901](https://github.com/databrickslabs/ucx/issues/1901). This integration does not introduce any new methods or modify existing functionality, and has been manually tested. The `hyperopt` package now includes several new modules, such as `hyperopt.algobase`, `hyperopt.anneal`, `hyperopt.atpe`, and many others, encompassing various components like classes, functions, and tests. Notably, some of these modules support integration with Spark and MongoDB. The `known.json` file has also been updated to reflect these additions.
* Analyse `ipywidgets` ([#1972](https://github.com/databrickslabs/ucx/issues/1972)). A new commit has been added to whitelist the `ipywidgets` package, enabling its usage within our open-source library. No new functionality or changes have been introduced in this commit. The package has undergone manual testing to ensure proper functionality. The primary modification involves adding `ipywidgets` to the `known.json` file whitelist, which includes various modules and sub-modules used for testing, IPython interaction, handling dates and times, and managing widget outputs. This update simply permits the utilization of the `ipywidgets` package and its related modules and sub-modules.
* Analyse `johnsnowlabs` ([#1997](https://github.com/databrickslabs/ucx/issues/1997)). The `johnsnowlabs` package, used for natural language processing and machine learning tasks, has been added to the whitelist in this release. This package includes various modules and sub-packages, such as auto_install, finance, frameworks, johnsnowlabs, lab, legal, llm, medical, nlp, py_models, serve, settings, utils, and visual, which provide a range of classes and functions for working with data and models in the context of NLP and machine learning. Note that this commit also raises deprecation warnings related to file system paths and access to the Spark Driver JVM in shared clusters, indicating potential compatibility issues or limitations; however, the exact impact or scope of these issues cannot be determined from the provided commit message.
* Analyse `langchain` ([#1975](https://github.com/databrickslabs/ucx/issues/1975)). In this release, the `langchain` module has been added to the JSON file and whitelisted for use. This module encompasses a variety of sub-modules, such as '_api', '_api.deprecation', '_api.interactive_env', and '_api.module_import', among others. Additionally, there are sub-modules related to adapters for various services, including 'openai', 'amadeus', 'azure_cognitive_services', 'conversational_retrieval', and 'clickup'. The `conversational_retrieval` sub-module contains a toolkit for openai functions and a standalone tool. However, specific changes, functionality details, and testing information have not been provided in the commit message. As a software engineer, please refer to the documentation and testing framework for further details.
* Analyse `lifelines` ([#2006](https://github.com/databrickslabs/ucx/issues/2006)). In this release, we have whitelisted the `lifelines` package, a powerful Python library for survival analysis and hazard rate estimation. This addition brings a comprehensive suite of functionalities, such as data sets, exceptions, utilities, version checking, statistical calculations, and plotting tools. The `fitters` category is particularly noteworthy, providing numerous classes for fitting various survival models, including Aalen's Additive Fitter, Cox proportional hazards models, Exponential Fitter, Generalized Gamma Fitter, Kaplan-Meier Fitter, Log-Logistic Fitter, Log-Normal Fitter, Mixture Cure Fitter, Nelson-Aalen Fitter, Piecewise Exponential Fitter, and Weibull Fitter. By whitelisting this library, users can now leverage its capabilities to enhance their projects with advanced survival analysis features.
* Analyse `megatron` ([#1982](https://github.com/databrickslabs/ucx/issues/1982)). In this release, we have made updates to the `known.json` file to include the whitelisting of the `megatron` module. While there are no new functional changes or accompanying tests for this update, it is important to note the addition of new keys to the `known.json` file, which is used to specify approved modules and functions in the codebase. The added keys for `megatron` include `megatron.io`, `megatron.layers`, `megatron.nodes`, `megatron.utils`, and `megatron.visuals`. These additions will ensure that any code referencing these modules or functions will not be flagged as unknown or unapproved, promoting a consistent and manageable codebase. This update is particularly useful in larger projects where keeping track of approved modules and functions can be challenging. For more information, please refer to linked issue [#1901](https://github.com/databrickslabs/ucx/issues/1901).
* Analyse `numba` ([#1978](https://github.com/databrickslabs/ucx/issues/1978)). In this release, we have added Numba, a just-in-time compiler for Python, to our project's whitelist. This addition is reflected in the updated JSON file that maps package names to package versions, which now includes various Numba modules such as 'numba.core', 'numba.cuda', and 'numba.np', along with their respective submodules and functions. Numba is now available for import and will be used in the project, enhancing the performance of our Python code. The new entries in the JSON file have been manually verified, and no changes to existing functionality have been made.
* Analyse `omegaconf` ([#1992](https://github.com/databrickslabs/ucx/issues/1992)). This commit introduces `omegaconf`, a configuration library that provides a simple and flexible way to manage application configurations, to the project's whitelist, which was reviewed and approved by Eric Vergnaud. The addition of `omegaconf` and its various modules, including base, base container, dict config, error handling, grammar, list config, nodes, resolver, opaque container, and versioning modules, as well as plugins for `pydevd`, enables the project to utilize this library for configuration management. No existing functionality is affected, and no new methods have been added. This change is limited to the addition of `omegaconf` to the whitelist and the inclusion of its modules, and it has been manually tested. Overall, this change allows the project to leverage the `omegaconf` library to enhance the management of application configurations.
* Analyse `patool` ([#1988](https://github.com/databrickslabs/ucx/issues/1988)). In this release, we have made changes to the `src/databricks/labs/ucx/source_code/known.json` file by whitelisting `patool`. This change, related to issue [#1901](https://github.com/databrickslabs/ucx/issues/1901), does not introduce any new functionality but adds an entry for `patool` along with several new keys corresponding to various utilities and programs associated with it. The whitelisting process has been carried out manually, and the changes have been thoroughly tested to ensure their proper functioning. This update is targeted towards software engineers seeking to enhance their understanding of the library's modifications. Co-authored by Eric Vergnaud.
* Analyse `peft` ([#1994](https://github.com/databrickslabs/ucx/issues/1994)). In this release, we've added the `peft` key and its associated modules to the 'known.json' file located in the 'databricks/labs/ucx/source_code' directory. The `peft` module includes several sub-modules, such as 'peft.auto', 'peft.config', 'peft.helpers', 'peft.import_utils', 'peft.mapping', 'peft.mixed_model', 'peft.peft_model', and 'peft.tuners', among others. The 'peft.tuners' module implements various tuning strategies for machine learning models and includes sub-modules like 'peft.tuners.adalora', 'peft.tuners.adaption_prompt', 'peft.tuners.boft', 'peft.tuners.ia3', 'peft.tuners.ln_tuning', 'peft.tuners.loha', 'peft.tuners.lokr', 'peft.tuners.lora', 'peft.tuners.multitask_prompt_tuning', 'peft.tuners.oft', 'peft.tuners.p_tuning', 'peft.tuners.poly', 'peft.tuners.prefix_tuning', 'peft.tuners.prompt_tuning', 'peft.tuners.vera', and 'peft.utils', which contains several utility functions. This addition provides new functionalities for machine learning model tuning and utility functions to the project.
* Analyse `seaborn` ([#1977](https://github.com/databrickslabs/ucx/issues/1977)). In this release, the open-source library's dependency whitelist has been updated to include `seaborn`. This enables the library to utilize `seaborn` in the project. Furthermore, several Azure libraries such as `azure-cosmos` and `azure-storage-blob` have been updated to their latest versions. Additionally, numerous other libraries such as `certifi`, `cffi`, `charset-normalizer`, `idna`, `numpy`, `pandas`, `pycparser`, `pyOpenSSL`, `python-dateutil`, `pytz`, `requests`, `six`, and `urllib3` have also been updated to their latest versions. However, issue [#1901](https://github.com/databrickslabs/ucx/issues/1901) is still a work in progress and does not include any specific functional changes or tests in this release.
* Analyse `shap` ([#1993](https://github.com/databrickslabs/ucx/issues/1993)). A new commit by Eric Vergnaud has been added to the project, whitelisting the Shap library for use. Shap is an open-source library that provides explanations for the output of machine learning models. This commit integrates several of Shap's modules into our project, enabling their import without any warnings. The inclusion of these modules does not affect existing functionalities, ensuring a smooth and stable user experience. This update enhances our project's capabilities by providing a more comprehensive explanation of machine learning model outputs, thanks to the integration of the Shap library.
* Analyse `sklearn` ([#1979](https://github.com/databrickslabs/ucx/issues/1979)). In this release, we have added `sklearn` to the whitelist in the `known.json` file as part of issue [#190](https://github.com/databrickslabs/ucx/issues/190)
* Analyse `sktime` ([#2007](https://github.com/databrickslabs/ucx/issues/2007)). In this release, we've expanded our machine learning capabilities by adding the sktime library to our whitelist. Sktime is a library specifically designed for machine learning on time series data, and includes components for preprocessing, modeling, and evaluation. This addition includes a variety of directories and modules related to time series analysis, such as distances and kernels, network architectures, parameter estimation, performance metrics, pipelines, probability distributions, and more. Additionally, we've added tests for many of these modules to ensure proper functionality. Furthermore, we've also added the smmap library to our whitelist, providing a drop-in replacement for the built-in python file object, which allows random access to large files that are too large to fit into memory. These additions will enable our software to handle larger datasets and perform advanced time series analysis.
* Analyse `spark-nlp` ([#1981](https://github.com/databrickslabs/ucx/issues/1981)). In this release, the open-source `spark-nlp` library has been added to the whitelist, enhancing compatibility and accessibility for software engineers. The addition of `spark-nlp` to the whitelist is a non-functional change, but it is expected to improve the overall integration with other libraries. This change has been thoroughly tested to ensure compatibility and reliability, making it a valuable addition for developers working with this library.
* Analyse `spark-ocr` ([#2011](https://github.com/databrickslabs/ucx/issues/2011)). A new open-source library, `spark-ocr`, has been added to the recognized and supported libraries within the system, following the successful whitelisting in the known.json file. This change, tracking issue [#1901](https://github.com/databrickslabs/ucx/issues/1901), does not introduce new functionality or modify existing features but enables all methods and functionality associated with `spark-ocr` for usage. The software engineering team has manually tested the integration, ensuring the seamless adoption for engineers incorporating this project. Please note that specific details of the `spark-ocr` methods are not provided in the commit message. This development benefits software engineers seeking to utilize the `spark-ocr` library within the project.
* Analyse `tf-quant-finance` ([#2015](https://github.com/databrickslabs/ucx/issues/2015)). In this release, we are excited to announce the whitelisting of the `tf-quant-finance` library, a comprehensive and versatile toolkit for financial modeling and analysis. This open-source library brings a wide range of functionalities to our project, including various numerical methods such as finite difference, integration, and interpolation, as well as modules for financial instruments, pricing platforms, stochastic volatility models, and rate curves. The library also includes modules for mathematical functions, optimization, and root search, enhancing our capabilities in these areas. Furthermore, `tf-quant-finance` provides a variety of finance models, such as Cox-Ingersoll-Ross (CIR), Heston, Hull-White, SABR, and more, expanding our repertoire of financial models. Lastly, the library includes modules for rates, such as constant forward, Hagan-West, and Nelson-Siegel-Svensson models, providing more options for rate modeling. We believe that this addition will significantly enhance our project's capabilities and enable us to tackle more complex financial modeling tasks with ease.
* Analyse `trl` ([#1998](https://github.com/databrickslabs/ucx/issues/1998)). In this release, we have integrated the `trl` library into our project, which is a tool for training, running, and logging AI models. This inclusion is aimed at addressing issue [#1901](https://github.com/databrickslabs/ucx/issues/1901). The `trl` library has been whitelisted in the `known.json` file, resulting in extensive changes to the file. While no new functionality has been introduced in this commit, the `trl` library provides various methods for running and training models, as well as utilities for CLI scripts and environment setup. These changes have been manually tested by our team, including Eric Vergnaud. We encourage software engineers to explore the new library and use it to enhance the project's capabilities.
* Analyse `unstructured` ([#2013](https://github.com/databrickslabs/ucx/issues/2013)). This release includes the addition of new test cases for various modules and methods within the unstructured library, such as chunking, cleaners, documents, embed, file_utils, metrics, nlp, partition, staging, and unit_utils. The test cases cover a range of functionalities, including HTML and PDF parsing, text extraction, embedding, file conversion, and encoding detection. The goal is to improve the library's overall robustness and reliability by increasing test coverage for different components.
* Dashboard: N/A instead of NULL readiness while assessment job hasn't yet provided any data ([#1910](https://github.com/databrickslabs/ucx/issues/1910)). In this release, we have improved the behavior of the readiness counter on the workspace UC readiness dashboard. Previously, if the assessment job did not provide any data, the readiness counter would display a NULL value, which could be confusing for users. With this change, the readiness counter now displays 'N/A' instead of NULL in such cases. This behavior is implemented by modifying the SELECT statement in the 00_0_compatibility.sql file, specifically the calculation of the readiness counter. The COALESCE function is used to return 'N/A' if the result of the calculation is NULL. This enhancement ensures that users are not confused by the presence of a NULL value when there is no data available yet.
* Do not migrate READ_METADATA to BROWSE on tables and schemas ([#2022](https://github.com/databrickslabs/ucx/issues/2022)). A recent change has been implemented in the open-source library concerning the handling of the `READ_METADATA` privilege for tables and schemas during migration from hive_metastore to UC. This change omits the translation of the `READ_METADATA` privilege to the `BROWSE` privilege on UC tables and schemas, because UC supports the `BROWSE` privilege only on catalog objects. Without this change, error messages would appear in the migrate tables workflow logs, causing confusion for users. Relevant code modifications have been made in the `uc_grant_sql` method in the `grants.py` file, where lines for `TABLE` and `DATABASE` with `READ_METADATA` privilege have been removed. Additionally, tests have been updated in the `test_grants.py` file to reflect these changes, avoiding the granting of unsupported privileges and preventing user confusion.
* Exclude VIEW from "Non-DELTA format: UNKNOWN" findings in assessment summary chart ([#2025](https://github.com/databrickslabs/ucx/issues/2025)). This release includes updates to the assessment main dashboard's assessment summary chart, specifically addressing the "Non-DELTA format: UNKNOWN" finding. Previously, views were mistakenly included in this finding, causing confusion for customers who couldn't locate any unknown format tables. The issue has been resolved by modifying a SQL file to filter results based on object type and table format, ensuring that non-DELTA format tables are only included if the object type is not a view. This enhancement prevents views from being erroneously counted in the "Non-DELTA format: UNKNOWN" finding, providing clearer and more accurate assessment results for users.
* Explain unused variable ([#1946](https://github.com/databrickslabs/ucx/issues/1946)). In this release, the `make_dbfs_data_copy` fixture in our open-source library has been updated to address an unused variable issue related to the `_` variable, which was previously assigned the value of `make_cluster` but was not utilized in the fixture. This change was implemented on April 16th, and it was only recently identified by `make fmt`. Additionally, the fixture now includes an `if` statement that initializes a `CommandExecutor` object to execute commands on the cluster if the workspace configuration is on AWS. These updates improve the code's readability and maintainability, ensuring that it functions optimally for our software engineer users.
* Expose code linters as a LSP plugin ([#1921](https://github.com/databrickslabs/ucx/issues/1921)). UCX has added a PyLSP plugin for its code linters, which will be automatically registered when `python-lsp-server` is installed. This integration allows users to utilize code linters without any additional setup, improving the code linter functionality of UCX by enabling it to be used as an LSP plugin and providing separate linters and fixers for Python and SQL. The changes include a new `Failure` class, an updated `Deprecation` class, and a `pylsp_lint` function implemented using the `pylsp` library to lint the code. The `LinterContext` and `Diagnostic` classes have been imported, and the `pylsp_lint` function takes in a `Workspace` and `Document` object. The associated tests have been updated, including manual testing, unit tests, and tests on the staging environment. The new feature also includes methods to lint code for use in UC Shared Clusters and return diagnostic information about any issues found, which can serve as a guide for users to rewrite their code as needed.
* Fixed grant visibility and classification ([#1911](https://github.com/databrickslabs/ucx/issues/1911)). This pull request introduces changes to the `grants` function in the `grants.py` file, addressing issues with grant visibility and classification in the underlying inventory. The `_crawl` function has been updated to distinguish between tables and views, and a new dictionary, `_grants_reported_as`, has been added to map reported object types for grants to their actual types. The `grants` function now includes a modification to normalize object types using the new dictionary. The `assessment` workflow and the `grant_detail` view have also been modified. The changes to the `grants` function may affect grant classification and display, and it is recommended to review relevant user documentation for accuracy. Additionally, tests have been conducted to ensure functionality, including unit tests, integration tests, and manual testing. No new methods have been added, but existing functionality in the `_crawl` method in the `tables.py` file has been changed.
* Fixed substituting regex with empty string ([#1953](https://github.com/databrickslabs/ucx/issues/1953)). This release includes a fix for issue [#1922](https://github.com/databrickslabs/ucx/issues/1922) where regular expressions were being replaced with empty strings, causing problems in the `assessment.crawl_groups` and `migrate-groups` workflows. The `groups.py` file has been modified to include changes to the `GroupMigrationStrategy` classes, such as the addition of `workspace_group_regex` and `account_group_regex` attributes, and their compiled versions. The `__init__` method for `RegexSubStrategy` and `RegexMatchStrategy` now takes these regex arguments. The `_safe_match` method now takes a regex pattern instead of a string, and the `_safe_sub` method takes a compiled regex pattern and replacement string as arguments. The `ConfigureGroups` class includes a new `_valid_substitute_pattern` attribute and an updated `_is_valid_substitute_str` method to validate the substitution string. The new `RegexSubStrategy` method replaces the name of the group in the workspace with an empty string when matched by the specified regex. Unit tests and manual testing have been conducted to ensure the correct functionality of these changes.
* Group migration: continue permission migration even if one or more groups fails ([#1924](https://github.com/databrickslabs/ucx/issues/1924)). This update introduces changes to the group migration process, specifically the permission migration stage. If an error occurs during the migration of a group's permissions, the migration will continue with the next group, and any errors will be raised as a `ManyError` exception at the end. The information about successful and failed groups is currently only logged, not persisted. The `group-migration` workflow now includes a new class, `ManyError`, and a new method, `apply_permissions`, in the `PermissionsMigrationAPI` class, handling the migration of permissions for a group and raising a `ManyError` exception if necessary. The commit also includes modified unit tests to ensure the proper functioning of the updated workflow. These changes aim to improve the robustness and reliability of the group migration process by allowing it to continue in the face of errors and by providing better error handling and reporting.
* Group renaming: wait for consistency before completing task ([#1944](https://github.com/databrickslabs/ucx/issues/1944)). In this release, we have made significant updates to the `group-migration` workflow in databricks/labs/ucx/workspace_access/groups.py to ensure that group renaming is completed before the task is marked as done. This change was made to address the issue of eventual consistency in group renaming, which could cause downstream tasks to encounter problems. We have added unit tests for various scenarios, including the `snapshot_with_group_created_in_account_console_should_be_considered`, `rename_groups_should_patch_eligible_groups`, `rename_groups_should_wait_for_renames_to_complete`, `rename_groups_should_retry_on_internal_error`, and `rename_groups_should_fail_if_unknown_name_observed` cases. The `rename_groups_should_wait_for_renames_to_complete` test uses a mock `time.sleep` function to simulate the passage of time and verifies that the group renaming operation waits for the rename to be detected. Additionally, the `rename_groups_should_retry_on_internal_error` test uses a mock `WorkspaceClient` object to simulate an internal error and verifies that the group renaming operation retries the failed operation. The `rename_groups_should_fail_if_unknown_name_observed` test simulates a situation where a concurrent process is interfering with the group renaming operation and verifies that the operation fails immediately instead of waiting for a timeout to occur. These updates are crucial for ensuring the reliability and consistency of group renaming operations in our workflow.
* Improved support for magic commands in python cells ([#1905](https://github.com/databrickslabs/ucx/issues/1905)). This commit enhances support for magic commands in python cells, specifically `%pip` and `!pip`, by improving parsing and execution of cells containing magic lines and ensuring proper pip dependency handling. It includes changes to existing commands, workflows, and the addition of new ones, as well as a new table and classes such as `DependencyProblem` and `MagicCommand`. The `PipCell` class has been updated to `PythonCell`. New methods `build_dependency_graph` and `convert_magic_lines_to_magic_commands` have been added, and several tests have been updated and added to ensure functionality. The changes have been unit and integration tested and manually verified on the staging environment.
* Include findings on `DENY` grants during assessment ([#1903](https://github.com/databrickslabs/ucx/issues/1903)). This pull request introduces support for flagging DENY permissions on objects that cannot be migrated to Unity Catalog (UC). It includes modifications to the `grant_detail` view and adds new integration tests for existing grant-scanning, resolving issue [#1869](https://github.com/databrickslabs/ucx/issues/1869) and superseding [#1890](https://github.com/databrickslabs/ucx/issues/1890). A new column, `failures`, has been added to the `grant_detail` view to indicate explicit DENY privileges that are not supported in UC. The assessment workflow has been updated to include a new step that identifies incompatible object privileges, while new and existing methods have been updated to support flagging DENY permissions. The changes have been documented for users, and the `assessment` workflow and related SQL queries have been updated accordingly. The PR also clarifies that no new CLI command has been added, and no existing commands or tables have been modified. Tests have been conducted manually and integration tests have been added to ensure the changes work as expected.
* Infer linted values that resolve to dbutils.widgets.get ([#1891](https://github.com/databrickslabs/ucx/issues/1891)). This change includes several updates to improve handling of linter context and session state in dependency graphs, as well as enhancements to the inference of values for `dbutils.widgets.get` calls. The `linter_context_factory` method now includes a new parameter, `session_state`, which defaults to `None`. The `LocalFileMigrator` and `LocalCodeLinter` classes use a lambda function to call `linter_context_factory` with the `session_state` parameter, and the `DependencyGraph` class includes a new method, `CurrentSessionState`, to extract SysPathChange from the tree. The `get_notebook_paths` method now accepts a `CurrentSessionState` parameter, and the `build_local_file_dependency_graph` method has been updated to accept this parameter as well. These changes enhance the flexibility of the linter context and improve the accuracy of `dbutils.widgets.get` value inference.
* Infer values across notebook cells ([#1968](https://github.com/databrickslabs/ucx/issues/1968)). This commit introduces a new feature to the linter that infers values across notebook cells when linting Python code, resolving 60 out of 891 `cannot be computed` advices. The changes include the addition of new classes `PythonLinter` and `PythonSequentialLinter`, as well as the modification of the `Fixer` class to accept a list of `Linter` instances as input. The updated linter takes into account not only the code from the current cell but also the code from previous cells, improving value inference and accuracy during linting. The changes have been manually tested and accompanied by added unit tests. This feature progresses issues [#1912](https://github.com/databrickslabs/ucx/issues/1912) and [#1205](https://github.com/databrickslabs/ucx/issues/1205).
* Log the right amount of lint problems ([#2024](https://github.com/databrickslabs/ucx/issues/2024)). A fix has been implemented to address an issue with the incorrect reporting of lint problems due to a change in [#1956](https://github.com/databrickslabs/ucx/issues/1956). The logger now accurately reports the number of linting problems found during the execution of linting tasks in parallel. The length of `job_problems` is now calculated after flattening the list, resulting in a more precise count. This improvement enhances the reliability of the linting process, ensuring that users are informed of the correct number of issues present in their code.
* Normalize python code before parsing ([#1918](https://github.com/databrickslabs/ucx/issues/1918)). This commit addresses the issue of copy-pasted Python code failing to parse and lint due to illegal leading spaces. Co-authored by Eric Vergnaud, it introduces normalization of code through the new `normalize_and_parse` method in the `Tree` class, which first normalizes the code by removing illegal leading spaces and then parses it. This change improves the code linter's ability to handle previously unparseable code and does not affect functionality. New unit tests have been added to ensure correctness, and modifications to the `PythonCell` and `PipMagic` classes enhance processing and handling of multiline code, magic commands, and pip commands. The pull request also includes a new test to check if the normalization process ignores magic markers in multiline comments, improving the reliability of parsing and linting copy-pasted Python code.
* Prompt about joining a collection of ucx installs early ([#1963](https://github.com/databrickslabs/ucx/issues/1963)). The `databricks labs install ucx` command has been updated to prompt the user early on to join a collection of UCX installs. Users who are not account admins can now enter their workspace ID to join as a collection, or skip joining if they prefer. This change includes modifications to the `join_collection` method to include a prompt message and handle cases where the user is not an account admin. A `PermissionDenied` exception has been added for users who do not have account admin permissions and cannot list workspaces. This change was made to streamline the installation process and reduce potential confusion for users. Additionally, tests have been conducted, both manually and through existing unit tests, to ensure the proper functioning of the updated command. This modification was co-authored by Serge Smertin and is intended to improve the overall user experience.
* Raise lint errors after persisting workflow problems in the inventory database ([#1956](https://github.com/databrickslabs/ucx/issues/1956)). The `refresh_report` method in `jobs.py` has been updated to raise lint errors after persisting workflow problems in the inventory database. This change includes adding a new import statement for `ManyError` and modifying the existing import statement for `Threads` from `databricks.labs.blueprint.parallel`. The method signature for `Threads.strict` has been changed to `Threads.gather` with a new argument `'linting workflows'`. The `problems` list has been replaced with a `job_problems, errors` tuple, and the `job_problems` list is flattened using `itertools.chain` before writing it to the inventory database. If there are any errors during the execution of tasks, a `ManyError` exception is raised with the list of errors. This development helps to visualize known workflow problems by raising lint errors after persisting them in the inventory database, addressing issue [#1952](https://github.com/databrickslabs/ucx/issues/1952), and has been manually tested for accuracy.
* Removing the workspace network requirement info in README.md ([#1948](https://github.com/databrickslabs/ucx/issues/1948)). In this release, we have updated the installation information for UCX in README.md. The note that the workspace network must have access to pypi.org for downloading certain packages has been removed, because that requirement was addressed in a previous issue. UCX can now be installed in the `/Applications/ucx` directory, which is a change from the previous location of `/Users/<your user>/.ucx/`. This update simplifies the installation process. For advanced installation instructions, please refer to the corresponding section in the documentation.
* Use dedicated advice code for uncomputed values ([#2019](https://github.com/databrickslabs/ucx/issues/2019)). This commit introduces dedicated advice codes for handling uncomputed values in various scenarios, enhancing error messages and improving the precision of feedback provided during the linting process. Changes include implementing `notebook-run-cannot-compute-value` to replace `dbutils-notebook-run-dynamic` in the `_raise_advice_if_unresolved` function, providing more accurate and specific information when the path for `dbutils.notebook.run` cannot be computed. A new advice code `table-migrate-cannot-compute-value` has been added to indicate that a table name argument cannot be computed during linting. Additionally, the new advice code `sys-path-cannot-compute-value` is used in the dependency resolver, replacing the previous `sys-path-cannot-compute` code. These updates lead to more precise and informative error messages, aiding in debugging processes. No new methods have been added, and existing functionality remains unchanged. Unit tests have been executed, and they passed. These improvements target software engineers looking to benefit from more accurate error messages and better guidance for debugging.
* Use dedicated advice code for unsupported sql ([#2018](https://github.com/databrickslabs/ucx/issues/2018)). In the latest commit, Eric Vergnaud introduced a new advice code `sql-query-unsupported-sql` for unsupported SQL queries in the `lint` function of the `queries.py` file. This change is aimed at handling unsupported SQL gracefully, providing a more specific error message compared to the previous generic `table-migrate` advice code. Additionally, an exception for unsupported SQL has been implemented in the linter for DBFS, utilizing a new code `dbfs-query-unsupported-sql`. This modification is intended to improve the handling of SQL queries that are not currently supported, potentially aiding in better integration with future SQL parsing tools. However, it should be noted that this change has not been tested.
* catch sqlglot exceptions and convert them to advices ([#1915](https://github.com/databrickslabs/ucx/issues/1915)). In this release, SQL parsing errors raised by sqlglot are now caught and converted to `Failure` advices, with the addition of unit tests and refactoring of the affected code block. A new `Failure` advice class has been introduced in the `databricks.labs.ucx.source_code.base` module, which is used when a SQL query cannot be parsed by sqlglot. The SQL parser now generates a `Failure` object instead of silently returning an empty list when sqlglot fails to process a query. This change enhances transparency in error handling and helps developers understand when and why a query has failed to parse. The commit progresses issue [#1901](https://github.com/databrickslabs/ucx/issues/1901) and is co-authored by Eric Vergnaud and Andrew Snare.
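
The value-inference entries above (#1891, #1968) and the example in this issue's description can be illustrated with a minimal sketch. It uses the standard-library `ast` module rather than astroid, which the UCX linters actually use, and the function names `resolve` and `infer_strings` are hypothetical. It only tracks top-level assignments of string literals and plain f-strings, carrying the scope from one notebook cell to the next.

```python
import ast


def resolve(node, scope):
    """Return the string a node evaluates to, or None if it cannot be computed."""
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        return node.value
    if isinstance(node, ast.Name):
        return scope.get(node.id)
    if isinstance(node, ast.JoinedStr):  # an f-string
        parts = []
        for piece in node.values:
            if isinstance(piece, ast.FormattedValue):
                # Give up on conversions (!r) and format specs (:>10).
                if piece.conversion != -1 or piece.format_spec is not None:
                    return None
                piece = piece.value
            text = resolve(piece, scope)
            if text is None:
                return None
            parts.append(text)
        return "".join(parts)
    return None


def infer_strings(cells):
    """Walk cells in order, tracking names bound to statically known strings."""
    scope = {}
    for source in cells:
        for stmt in ast.parse(source).body:
            if not isinstance(stmt, ast.Assign):
                continue
            for target in stmt.targets:
                if isinstance(target, ast.Name):
                    value = resolve(stmt.value, scope)
                    if value is None:
                        scope.pop(target.id, None)  # value is no longer known
                    else:
                        scope[target.id] = value
    return scope


# The two assignments from the issue description, split across two cells.
cells = ["db = 'foo'", "t = f'{db}.bar'"]
print(infer_strings(cells)["t"])
```

With the two cells shown, `t` resolves to `foo.bar`, which is the table name a linter would then check for migration. Anything the sketch cannot compute (a function call, a widget value) simply drops out of the scope, which corresponds to the `cannot be computed` advices mentioned above.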

Dependency updates:

 * Updated sqlglot requirement from <25.1,>=23.9 to >=23.9,<25.2 ([#1904](https://github.com/databrickslabs/ucx/pull/1904)).
 * Updated sqlglot requirement from <25.2,>=23.9 to >=23.9,<25.3 ([#1917](https://github.com/databrickslabs/ucx/pull/1917)).
 * Updated databricks-sdk requirement from <0.29,>=0.27 to >=0.27,<0.30 ([#1943](https://github.com/databrickslabs/ucx/pull/1943)).
 * Updated sqlglot requirement from <25.3,>=23.9 to >=25.4.1,<25.5 ([#1959](https://github.com/databrickslabs/ucx/pull/1959)).
 * Updated databricks-labs-lsql requirement from ~=0.4.0 to >=0.4,<0.6 ([#2076](https://github.com/databrickslabs/ucx/pull/2076)).
 * Updated sqlglot requirement from <25.5,>=25.4.1 to >=25.5.0,<25.6 ([#2084](https://github.com/databrickslabs/ucx/pull/2084)).
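
The counting fix described under #2024 and the flattening step described under #1956 in the notes above come down to the difference between counting per-job result lists and counting the problems inside them. A minimal sketch, assuming hypothetical data in which each linting task returns its own list of problems:

```python
from itertools import chain

# Hypothetical per-job results: one list of problems per linted job.
per_job_problems = [
    ["job-1: table-migrate"],
    [],
    ["job-3: dbfs-usage", "job-3: table-migrate", "job-3: sys-path"],
]

# Flatten first, then count, so the total reflects problems rather than jobs.
job_problems = list(chain.from_iterable(per_job_problems))

print(f"jobs linted: {len(per_job_problems)}")
print(f"problems found: {len(job_problems)}")
```

Taking `len()` of the nested list would report the number of jobs, not the number of problems, which is the kind of miscount the fix addresses.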
nfx added a commit that referenced this issue Jul 5, 2024
* Added handling for exceptions with no error_code attribute while
crawling permissions
([#2079](https://github.com/databrickslabs/ucx/issues/2079)). A new
enhancement has been implemented to improve error handling during the
assessment job's permission crawling process. Previously, exceptions
that lacked an `error_code` attribute would cause an `AttributeError`.
This release introduces a check for the existence of the `error_code`
attribute before attempting to access it, logging an error and adding it
to the list of acute errors if not present. The change includes a new
unit test for verification, and the relevant functionality has been
added to the `inventorize_permissions` function within the `manager.py`
file. The new method, `test_manager_inventorize_fail_with_error`, has
been implemented to test the permission manager's behavior when
encountering errors during the inventory process, raising
`DatabricksError` and `TimeoutError` instances with and without
`error_code` attributes. This update resolves issue
[#2078](https://github.com/databrickslabs/ucx/issues/2078) and enhances
the overall robustness of the assessment job's permission crawling
functionality.
* Added handling for missing permission to read file
([#1949](https://github.com/databrickslabs/ucx/issues/1949)). In this
release, we've addressed an issue where missing permissions to read a
file during linting were not being handled properly. The revised code
now checks for `NotFound` and `PermissionError` exceptions when
attempting to read a file's text content. If a `NotFound` exception
occurs, the function returns None and logs a warning message. If a
`PermissionError` exception occurs, the function also returns None and
logs a warning message with the error's traceback. This change resolves
issue [#1942](https://github.com/databrickslabs/ucx/issues/1942) and
partially resolves issue
[#1952](https://github.com/databrickslabs/ucx/issues/1952), improving
the robustness of the linting process and providing more informative
error messages. Additionally, new tests and methods have been added to
handle missing files and missing read permissions during linting,
ensuring that the file linter can handle these cases correctly.
* Added handling for unauthenticated exception while joining collection
([#1958](https://github.com/databrickslabs/ucx/issues/1958)). A new
exception type, `Unauthenticated`, has been added to the import statement,
and new error messages have been implemented in the `_sync_collection` and
`_get_collection_workspace` functions to notify users when they do not
have admin access to the workspace. A try-except block has been added in
the `_get_collection_workspace` function to handle the `Unauthenticated`
exception, and a warning message is logged indicating that the user
needs account admin and workspace admin credentials to enable collection
joining and to run the join-collection command with account admin
credentials. Additionally, a new CLI command has been added, and the
existing `databricks labs ucx ...` command has been modified. A new
workflow for joining the collection has also been implemented. These
changes have been thoroughly documented in the user documentation and
verified on the staging environment.
* Added tracking for UCX workflows and as-library usage
([#1966](https://github.com/databrickslabs/ucx/issues/1966)). This
commit introduces User-Agent tracking for UCX workflows and library
usage, adding `ucx/<version>`, `cmd/install`, and `cmd/<workflow>`
elements to relevant requests. These changes are implemented within the
`test_useragent.py` file, which includes the new `http_fixture_server`
context manager for testing User-Agent propagation in UCX workflows. The
addition of `with_user_agent_extra` and the inclusion of `with_product`
functions from `databricks.sdk.core` aim to provide valuable insights
for debugging, maintenance, and improving UCX workflow performance. This
feature will help gather clear usage metrics for UCX and enhance the
overall user experience.
* Analyse `altair`
([#2005](https://github.com/databrickslabs/ucx/issues/2005)). In this
release, the open-source library has undergone a whitelisting of the
`altair` library, addressing issue
[#1901](https://github.com/databrickslabs/ucx/issues/1901). The changes
involve the addition of several modules and sub-modules under the
`altair` package, including `altair`, `altair._magics`, `altair.expr`,
and various others such as `altair.utils`, `altair.utils._dfi_types`,
`altair.utils._importers`, and `altair.utils._show`. Additionally,
modifications have been made to the `known.json` file to include the
`altair` package. It is important to note that no new functionalities
have been introduced, and the changes have been manually verified. This
release has been developed by Eric Vergnaud.
* Analyse `azure`
([#2016](https://github.com/databrickslabs/ucx/issues/2016)). In this
release, we have made updates to the whitelist of several Azure
libraries, including `azure-common`, `azure-core`, `azure-mgmt-core`,
`azure-mgmt-digitaltwins`, and `azure-storage-blob`. These changes are
intended to manage dependencies and ensure a secure and stable
environment for software engineers working with these libraries. The
`azure-common` library has been added to the whitelist, and updates have
been made to the existing whitelists for the other libraries. These
changes do not add or modify any functionality or test cases, but are
important for maintaining the integrity of our open-source library. This
commit was co-authored by Eric Vergnaud from Databricks.
* Analyse `causal-learn`
([#2012](https://github.com/databrickslabs/ucx/issues/2012)). In this
release, we have added `causal-learn` to the whitelist in our JSON file,
signifying that it is now a supported library. This update includes the
addition of various modules, classes, and functions to `causal-learn`.
We would like to emphasize that there are no changes to existing
functionality, nor have any new methods been added. This release is
thoroughly tested to ensure functionality and stability. We hope that
software engineers in the community will find this update helpful and
consider adopting this project.
* Analyse `databricks-arc`
([#2004](https://github.com/databrickslabs/ucx/issues/2004)). This
release introduces whitelisting for the `databricks-arc` library, which
is used for data analytics and machine learning. The release updates the
`known.json` file to include `databricks-arc` and its related modules
such as `arc.autolinker`, `arc.sql`, `arc.sql.enable_arc`, `arc.utils`,
and `arc.utils.utils`. It also provides specific error codes and
messages related to using these libraries on UC Shared Clusters.
Additionally, this release includes updates to the
`databricks-feature-engineering` library, with the addition of many new
modules and error codes related to JVM access, legacy context, and spark
logging. The `databricks.ml_features` library has several updates,
including changes to the `_spark_client` and `publish_engine`. The
`databricks.ml_features.entities` module has many updates, with new
classes and methods for handling features, specifications, tables, and
more. These updates offer improved functionality and error handling for
the whitelisted libraries, specifically when used on UC Shared Clusters.
* Analyse `dbldatagen`
([#1985](https://github.com/databrickslabs/ucx/issues/1985)). The
`dbldatagen` package has been whitelisted in the `known.json` file in
this release. While there are no new or altered functionalities, several
updates have been made to the methods and objects within `dbldatagen`.
This includes enhancements to `dbldatagen._version`,
`dbldatagen.column_generation_spec`, `dbldatagen.column_spec_options`,
`dbldatagen.constraints`, `dbldatagen.data_analyzer`,
`dbldatagen.data_generator`, `dbldatagen.datagen_constants`,
`dbldatagen.datasets`, and related classes. Additionally,
`dbldatagen.datasets.basic_geometries`,
`dbldatagen.datasets.basic_process_historian`,
`dbldatagen.datasets.basic_telematics`,
`dbldatagen.datasets.basic_user`,
`dbldatagen.datasets.benchmark_groupby`,
`dbldatagen.datasets.dataset_provider`,
`dbldatagen.datasets.multi_table_telephony_provider`, and
`dbldatagen.datasets_object` have been updated. The distribution
methods, such as `dbldatagen.distributions`,
`dbldatagen.distributions.beta`,
`dbldatagen.distributions.data_distribution`,
`dbldatagen.distributions.exponential_distribution`,
`dbldatagen.distributions.gamma`, and
`dbldatagen.distributions.normal_distribution`, have also seen
improvements. Furthermore, `dbldatagen.function_builder`,
`dbldatagen.html_utils`, `dbldatagen.nrange`,
`dbldatagen.schema_parser`, `dbldatagen.spark_singleton`,
`dbldatagen.text_generator_plugins`, and `dbldatagen.text_generators`
have been updated. The `dbldatagen.data_generator` method now includes a
warning about the deprecated `sparkContext` in shared clusters, and
`dbldatagen.schema_parser` includes updates related to the `table_name`
argument in various SQL statements. These changes ensure better
compatibility and improved functionality of the `dbldatagen` package.
* Analyse `delta-spark`
([#1987](https://github.com/databrickslabs/ucx/issues/1987)). In this
release, the `delta-spark` component within the `delta` project has been
whitelisted with the inclusion of a new entry in the `known.json`
configuration file. This addition brings in several sub-components,
including `delta._typing`, `delta.exceptions`, and `delta.tables`, each
with a `jvm-access-in-shared-clusters` error code and message for
unsupported environments. These changes aim to enhance the handling of
the `delta-spark` component within the `delta` project. The changes have
been tested and do not introduce new functionality or modify
existing behavior. This update is intended to provide better stability
and compatibility for the project. Co-authored by Eric Vergnaud.
* Analyse `diffusers`
([#2010](https://github.com/databrickslabs/ucx/issues/2010)). A new
`diffusers` category has been added to the JSON configuration file,
featuring several subcategories and numerous empty arrays as values.
This change serves to prepare the configuration for future additions,
without altering any existing methods or behaviors. As such, this update
does not impact current functionality, but instead, sets the stage for
further development. No associated tests or functional changes accompany
this modification.
* Analyse `faker`
([#2014](https://github.com/databrickslabs/ucx/issues/2014)). In this
release, the `faker` library in the Databricks project has undergone
whitelisting, addressing security concerns, improving performance, and
reducing the attack surface. No new methods were added, and the existing
functionality remains unchanged. Thorough manual verification of the
tests has been conducted. This release introduces various modules and
submodules related to the `faker` library, expanding its capabilities in
address generation in multiple languages and countries, along with new
providers for bank, barcode, color, company, credit_card, currency,
date_time, emoji, file, geo, internet, isbn, job, lorem, misc, passport,
person, phone_number, profile, python, sbn, ssn, and user_agent
generation. Software engineers should find these improvements
advantageous for their projects, offering a broader range of options and
enhanced performance.
* Analyse `fastcluster`
([#1980](https://github.com/databrickslabs/ucx/issues/1980)). In this
release, the project's configuration has been updated to include the
`fastcluster` package in the approved libraries whitelist, as part of
issue [#1901](https://github.com/databrickslabs/ucx/issues/1901)
resolution. This change enables software engineers to utilize the
functions and methods provided by `fastcluster` in the project's
codebase. The `fastcluster` package is now registered in the
`known.json` configuration file, and its integration has been thoroughly
tested to ensure seamless functionality. By incorporating `fastcluster`,
the project's capabilities are expanded, allowing software engineers to
benefit from its optimized clustering algorithms and performance
enhancements.
* Analyse `glow`
([#1973](https://github.com/databrickslabs/ucx/issues/1973)). In this
release, we have analyzed and added the `glow` library and its modules,
including `glow._array`, `glow._coro`, `glow._debug`, and others, to the
`known.json` file whitelist. This change allows for seamless integration
and usage of the `glow` library in your projects. It is important to
note that this update does not modify any existing functionality and has
been thoroughly tested to ensure compatibility. Software engineers
utilizing the `glow` library will benefit from this enhancement, as it
provides explicit approval for the library and its modules, facilitating
a more efficient development process.
* Analyse `graphframes`
([#1990](https://github.com/databrickslabs/ucx/issues/1990)). In this
release, the `graphframes` library has been thoroughly analyzed and the
whitelist updated accordingly. This includes the addition of several new
entries, such as `graphframes.examples.belief_propagation`,
`graphframes.examples.graphs`, `graphframes.graphframe`,
`graphframes.lib.aggregate_messages`, and `graphframes.tests`. These
changes may require modifications such as rewriting code to use Spark or
accessing the Spark Driver JVM. These updates aim to improve
compatibility with UC Shared Clusters, ensuring a more seamless
integration. Manual testing has been conducted to ensure the changes are
functioning as intended.
* Analyse `graphviz`
([#2008](https://github.com/databrickslabs/ucx/issues/2008)). In this
release, we have analyzed and whitelisted the `graphviz` library for use
in the project. The library has been added to the `known.json` file,
which is used to manage dependencies. The `graphviz` package contains
several modules and sub-modules, including `backend`, `dot`,
`exceptions`, `graphs`, `jupyter_integration`, `parameters`,
`rendering`, and `saving`. While we do not have detailed information on
the functionality provided by these modules at this time, they have been
manually tested for correct functioning. This addition enhances the
project's graphing and visualization capabilities by incorporating the
well-regarded `graphviz` library.
* Analyse `hyperopt`
([#1970](https://github.com/databrickslabs/ucx/issues/1970)). In this
release, we have made changes to include the `hyperopt` library in our
project, addressing issue
[#1901](https://github.com/databrickslabs/ucx/issues/1901). This
integration does not introduce any new methods or modify existing
functionality, and has been manually tested. The `hyperopt` package now
includes several new modules, such as `hyperopt.algobase`,
`hyperopt.anneal`, `hyperopt.atpe`, and many others, encompassing
various components like classes, functions, and tests. Notably, some of
these modules support integration with Spark and MongoDB. The
`known.json` file has also been updated to reflect these additions.
* Analyse `ipywidgets`
([#1972](https://github.com/databrickslabs/ucx/issues/1972)). A new
commit has been added to whitelist the `ipywidgets` package, enabling
its usage within our open-source library. No new functionality or
changes have been introduced in this commit. The package has undergone
manual testing to ensure proper functionality. The primary modification
involves adding `ipywidgets` to the `known.json` file whitelist, which
includes various modules and sub-modules used for testing, IPython
interaction, handling dates and times, and managing widget outputs. This
update simply permits the utilization of the `ipywidgets` package and
its related modules and sub-modules.
* Analyse `johnsnowlabs`
([#1997](https://github.com/databrickslabs/ucx/issues/1997)). The
`johnsnowlabs` package, used for natural language processing and machine
learning tasks, has been added to the whitelist in this release. This
package includes various modules and sub-packages, such as auto_install,
finance, frameworks, johnsnowlabs, lab, legal, llm, medical, nlp,
py_models, serve, settings, utils, and visual, which provide a range of
classes and functions for working with data and models in the context of
NLP and machine learning. Note that this commit also raises deprecation
warnings related to file system paths and access to the Spark Driver JVM
in shared clusters, indicating potential compatibility issues or
limitations; the commit message does not describe their impact or
scope.
* Analyse `langchain`
([#1975](https://github.com/databrickslabs/ucx/issues/1975)). In this
release, the `langchain` module has been added to the JSON file and
whitelisted for use. This module encompasses a variety of sub-modules,
such as `_api`, `_api.deprecation`, `_api.interactive_env`, and
`_api.module_import`, among others. Additionally, there are sub-modules
related to adapters for various services, including `openai`, `amadeus`,
`azure_cognitive_services`, `conversational_retrieval`, and `clickup`.
The `conversational_retrieval` sub-module contains a toolkit for OpenAI
functions and a standalone tool. The commit message does not provide
specific changes, functionality details, or testing information.
* Analyse `lifelines`
([#2006](https://github.com/databrickslabs/ucx/issues/2006)). In this
release, we have whitelisted the `lifelines` package, a powerful Python
library for survival analysis and hazard rate estimation. This addition
brings a comprehensive suite of functionalities, such as data sets,
exceptions, utilities, version checking, statistical calculations, and
plotting tools. The `fitters` category is particularly noteworthy,
providing numerous classes for fitting various survival models,
including Aalen's Additive Fitter, Cox proportional hazards models,
Exponential Fitter, Generalized Gamma Fitter, Kaplan-Meier Fitter,
Log-Logistic Fitter, Log-Normal Fitter, Mixture Cure Fitter,
Nelson-Aalen Fitter, Piecewise Exponential Fitter, and Weibull Fitter.
By whitelisting this library, users can now leverage its capabilities to
enhance their projects with advanced survival analysis features.
* Analyse `megatron`
([#1982](https://github.com/databrickslabs/ucx/issues/1982)). In this
release, we have made updates to the `known.json` file to include the
whitelisting of the `megatron` module. While there are no new functional
changes or accompanying tests for this update, it is important to note
the addition of new keys to the `known.json` file, which is used to
specify approved modules and functions in the codebase. The added keys
for `megatron` include `megatron.io`, `megatron.layers`,
`megatron.nodes`, `megatron.utils`, and `megatron.visuals`. These
additions will ensure that any code referencing these modules or
functions will not be flagged as unknown or unapproved, promoting a
consistent and manageable codebase. This update is particularly useful
in larger projects where keeping track of approved modules and functions
can be challenging. For more information, please refer to linked issue
[#1901](https://github.com/databrickslabs/ucx/issues/1901).
* Analyse `numba`
([#1978](https://github.com/databrickslabs/ucx/issues/1978)). In this
release, we have added Numba, a just-in-time compiler for Python, to our
project's whitelist. This addition is reflected in the updated JSON file
that maps package names to package versions, which now includes various
Numba modules such as `numba.core`, `numba.cuda`, and `numba.np`, along
with their respective submodules and functions. Numba is now available
for import and will be used in the project, enhancing the performance of
our Python code. The new entries in the JSON file have been manually
verified, and no changes to existing functionality have been made.
* Analyse `omegaconf`
([#1992](https://github.com/databrickslabs/ucx/issues/1992)). This
commit introduces `omegaconf`, a configuration library that provides a
simple and flexible way to manage application configurations, to the
project's whitelist, which was reviewed and approved by Eric Vergnaud.
The addition of `omegaconf` and its various modules, including base,
base container, dict config, error handling, grammar, list config,
nodes, resolver, opaque container, and versioning modules, as well as
plugins for `pydevd`, enables the project to utilize this library for
configuration management. No existing functionality is affected, and no
new methods have been added. This change is limited to the addition of
`omegaconf` to the whitelist and the inclusion of its modules, and it
has been manually tested. Overall, this change allows the project to
leverage the `omegaconf` library to enhance the management of
application configurations.
* Analyse `patool`
([#1988](https://github.com/databrickslabs/ucx/issues/1988)). In this
release, we have made changes to the
`src/databricks/labs/ucx/source_code/known.json` file by whitelisting
`patool`. This change, related to issue
[#1901](https://github.com/databrickslabs/ucx/issues/1901), does not
introduce any new functionality but adds an entry for `patool` along
with several new keys corresponding to various utilities and programs
associated with it. The whitelisting process has been carried out
manually, and the changes have been thoroughly tested to ensure their
proper functioning. This update is targeted towards software engineers
seeking to enhance their understanding of the library's modifications.
Co-authored by Eric Vergnaud.
* Analyse `peft`
([#1994](https://github.com/databrickslabs/ucx/issues/1994)). In this
release, we've added the `peft` key and its associated modules to the
`known.json` file located in the `databricks/labs/ucx/source_code`
directory. The `peft` module includes several sub-modules, such as
`peft.auto`, `peft.config`, `peft.helpers`, `peft.import_utils`,
`peft.mapping`, `peft.mixed_model`, `peft.peft_model`, and
`peft.tuners`, among others. The `peft.tuners` module implements various
tuning strategies for machine learning models and includes sub-modules
like `peft.tuners.adalora`, `peft.tuners.adaption_prompt`,
`peft.tuners.boft`, `peft.tuners.ia3`, `peft.tuners.ln_tuning`,
`peft.tuners.loha`, `peft.tuners.lokr`, `peft.tuners.lora`,
`peft.tuners.multitask_prompt_tuning`, `peft.tuners.oft`,
`peft.tuners.p_tuning`, `peft.tuners.poly`, `peft.tuners.prefix_tuning`,
`peft.tuners.prompt_tuning`, `peft.tuners.vera`, and `peft.utils`, which
contains several utility functions. This addition provides new
functionalities for machine learning model tuning and utility functions
to the project.
* Analyse `seaborn`
([#1977](https://github.com/databrickslabs/ucx/issues/1977)). In this
release, the open-source library's dependency whitelist has been updated
to include `seaborn`. This enables the library to utilize `seaborn` in
the project. Furthermore, several Azure libraries such as `azure-cosmos`
and `azure-storage-blob` have been updated to their latest versions.
Additionally, numerous other libraries such as `certifi`, `cffi`,
`charset-normalizer`, `idna`, `numpy`, `pandas`, `pycparser`,
`pyOpenSSL`, `python-dateutil`, `pytz`, `requests`, `six`, and `urllib3`
have also been updated to their latest versions. However, issue
[#1901](https://github.com/databrickslabs/ucx/issues/1901) is still a
work in progress and does not include any specific functional changes or
tests in this release.
* Analyse `shap`
([#1993](https://github.com/databrickslabs/ucx/issues/1993)). A new
commit by Eric Vergnaud has been added to the project, whitelisting the
Shap library for use. Shap is an open-source library that provides
explanations for the output of machine learning models. This commit
integrates several of Shap's modules into our project, enabling their
import without any warnings. The inclusion of these modules does not
affect existing functionalities, ensuring a smooth and stable user
experience. This update enhances our project's capabilities by providing
a more comprehensive explanation of machine learning model outputs,
thanks to the integration of the Shap library.
* Analyse `sklearn`
([#1979](https://github.com/databrickslabs/ucx/issues/1979)). In this
release, we have added `sklearn` to the whitelist in the `known.json`
file as part of issue
[#190](https://github.com/databrickslabs/ucx/issues/190)
* Analyse `sktime`
([#2007](https://github.com/databrickslabs/ucx/issues/2007)). In this
release, we've expanded our machine learning capabilities by adding the
sktime library to our whitelist. Sktime is a library specifically
designed for machine learning on time series data, and includes
components for preprocessing, modeling, and evaluation. This addition
includes a variety of directories and modules related to time series
analysis, such as distances and kernels, network architectures,
parameter estimation, performance metrics, pipelines, probability
distributions, and more. Additionally, we've added tests for many of
these modules to ensure proper functionality. Furthermore, we've also
added the smmap library to our whitelist, providing a drop-in
replacement for the built-in python file object, which allows random
access to large files that are too large to fit into memory. These
additions will enable our software to handle larger datasets and perform
advanced time series analysis.
* Analyse `spark-nlp`
([#1981](https://github.com/databrickslabs/ucx/issues/1981)). In this
release, the open-source `spark-nlp` library has been added to the
whitelist, enhancing compatibility and accessibility for software
engineers. The addition of `spark-nlp` to the whitelist is a
non-functional change, but it is expected to improve the overall
integration with other libraries. This change has been thoroughly tested
to ensure compatibility and reliability, making it a valuable addition
for developers working with this library.
* Analyse `spark-ocr`
([#2011](https://github.com/databrickslabs/ucx/issues/2011)). A new
open-source library, `spark-ocr`, has been added to the recognized and
supported libraries within the system, following the successful
whitelisting in the `known.json` file. This change, tracking issue
[#1901](https://github.com/databrickslabs/ucx/issues/1901), does not
introduce new functionality or modify existing features but enables all
methods and functionality associated with `spark-ocr` for usage. The
software engineering team has manually tested the integration, ensuring
the seamless adoption for engineers incorporating this project. Please
note that specific details of the `spark-ocr` methods are not provided
in the commit message. This development benefits software engineers
seeking to utilize the `spark-ocr` library within the project.
* Analyse `tf-quant-finance`
([#2015](https://github.com/databrickslabs/ucx/issues/2015)). In this
release, we are excited to announce the whitelisting of the
`tf-quant-finance` library, a comprehensive and versatile toolkit for
financial modeling and analysis. This open-source library brings a wide
range of functionalities to our project, including various numerical
methods such as finite difference, integration, and interpolation, as
well as modules for financial instruments, pricing platforms, stochastic
volatility models, and rate curves. The library also includes modules
for mathematical functions, optimization, and root search, enhancing our
capabilities in these areas. Furthermore, `tf-quant-finance` provides a
variety of finance models, such as Cox-Ingersoll-Ross (CIR), Heston,
Hull-White, SABR, and more, expanding our repertoire of financial
models. Lastly, the library includes modules for rates, such as constant
forward, Hagan-West, and Nelson-Siegel-Svensson models, providing more
options for rate modeling. We believe that this addition will
significantly enhance our project's capabilities and enable us to tackle
more complex financial modeling tasks with ease.
* Analyse `trl`
([#1998](https://github.com/databrickslabs/ucx/issues/1998)). In this
release, we have integrated the `trl` library into our project, which is
a tool for training, running, and logging AI models. This inclusion is
aimed at addressing issue
[#1901](https://github.com/databrickslabs/ucx/issues/1901). The `trl`
library has been whitelisted in the `known.json` file, resulting in
extensive changes to the file. While no new functionality has been
introduced in this commit, the `trl` library provides various methods
for running and training models, as well as utilities for CLI scripts
and environment setup. These changes have been manually tested by our
team, including Eric Vergnaud. We encourage software engineers to
explore the new library and use it to enhance the project's
capabilities.
* Analyse `unstructured`
([#2013](https://github.com/databrickslabs/ucx/issues/2013)). This
release includes the addition of new test cases for various modules and
methods within the unstructured library, such as chunking, cleaners,
documents, embed, file_utils, metrics, nlp, partition, staging, and
unit_utils. The test cases cover a range of functionalities, including
HTML and PDF parsing, text extraction, embedding, file conversion, and
encoding detection. The goal is to improve the library's overall
robustness and reliability by increasing test coverage for different
components.
* Dashboard: N/A instead of NULL readiness while assessment job hasn't
yet provided any data
([#1910](https://github.com/databrickslabs/ucx/issues/1910)). In this
release, we have improved the behavior of the readiness counter on the
workspace UC readiness dashboard. Previously, if the assessment job did
not provide any data, the readiness counter would display a NULL value,
which could be confusing for users. With this change, the readiness
counter now displays 'N/A' instead of NULL in such cases. This behavior
is implemented by modifying the `SELECT` statement in the
`00_0_compatibility.sql` file, specifically the calculation of the
readiness counter. The `COALESCE` function is used to return 'N/A' if the
result of the calculation is NULL. This enhancement ensures that users
are not confused by the presence of a NULL value when there is no data
available yet.
* Do not migrate READ_METADATA to BROWSE on tables and schemas
([#2022](https://github.com/databrickslabs/ucx/issues/2022)). A recent
change has been implemented in the open-source library concerning the
handling of the `READ_METADATA` privilege for tables and schemas during
migration from hive_metastore to UC. This change omits the translation
of `READ_METADATA` privilege to `BROWSE` privilege on UC tables and
schemas due to UC's support for `BROWSE` privilege only on catalog
objects. Failing to make this change would result in error messages
during the migrate tables workflow logs, causing confusion for users.
Relevant code modifications have been made in the `uc_grant_sql` method
in the `grants.py` file, where lines for `TABLE` and `DATABASE` with
`READ_METADATA` privilege have been removed. Additionally, tests have
been updated in the `test_grants.py` file to reflect these changes,
avoiding the granting of unsupported privileges and preventing user
confusion.
* Exclude VIEW from "Non-DELTA format: UNKNOWN" findings in assessment
summary chart
([#2025](https://github.com/databrickslabs/ucx/issues/2025)). This
release includes updates to the assessment main dashboard's assessment
summary chart, specifically addressing the "Non-DELTA format: UNKNOWN"
finding. Previously, views were mistakenly included in this finding,
causing confusion for customers who couldn't locate any unknown format
tables. The issue has been resolved by modifying a SQL file to filter
results based on object type and table format, ensuring that non-DELTA
format tables are only included if the object type is not a view. This
enhancement prevents views from being erroneously counted in the
"Non-DELTA format: UNKNOWN" finding, providing clearer and more accurate
assessment results for users.
* Explain unused variable
([#1946](https://github.com/databrickslabs/ucx/issues/1946)). In this
release, the `make_dbfs_data_copy` fixture in our open-source library
has been updated to address an unused variable issue related to the `_`
variable, which was previously assigned the value of `make_cluster` but
was not utilized in the fixture. The unused variable was introduced on
April 16th and was only recently flagged by `make fmt`. Additionally,
the fixture now includes an `if` statement that initializes a
`CommandExecutor` object to execute commands on the cluster if the
workspace configuration is on AWS. These updates improve the code's
readability and maintainability, ensuring that it functions optimally
for our software engineer users.
* Expose code linters as a LSP plugin
([#1921](https://github.com/databrickslabs/ucx/issues/1921)). UCX has
added a PyLSP plugin for its code linters, which will be automatically
registered when `python-lsp-server` is installed. This integration
allows users to utilize code linters without any additional setup,
improving the code linter functionality of UCX by enabling it to be used
as an LSP plugin and providing separate linters and fixers for Python
and SQL. The changes include a new `Failure` class, an updated
`Deprecation` class, and a `pylsp_lint` function implemented using the
`pylsp` library to lint the code. The `LinterContext` and `Diagnostic`
classes have been imported, and the `pylsp_lint` function takes in a
`Workspace` and `Document` object. The associated tests have been
updated, including manual testing, unit tests, and tests on the staging
environment. The new feature also includes methods to lint code for use
in UC Shared Clusters and return diagnostic information about any issues
found, which can serve as a guide for users to rewrite their code as
needed.
* Fixed grant visibility and classification
([#1911](https://github.com/databrickslabs/ucx/issues/1911)). This pull
request introduces changes to the `grants` function in the `grants.py`
file, addressing issues with grant visibility and classification in the
underlying inventory. The `_crawl` function has been updated to
distinguish between tables and views, and a new dictionary,
`_grants_reported_as`, has been added to map reported object types for
grants to their actual types. The `grants` function now includes a
modification to normalize object types using the new dictionary. The
`assessment` workflow and the `grant_detail` view have also been
modified. The changes to the `grants` function may affect grant
classification and display, and it is recommended to review relevant
user documentation for accuracy. Additionally, tests have been conducted
to ensure functionality, including unit tests, integration tests, and
manual testing. No new methods have been added, but existing
functionality in the `_crawl` method in the `tables.py` file has been
changed.
* Fixed substituting regex with empty string
([#1953](https://github.com/databrickslabs/ucx/issues/1953)). This
release includes a fix for issue
[#1922](https://github.com/databrickslabs/ucx/issues/1922) where regular
expressions were being replaced with empty strings, causing problems in
the `assessment.crawl_groups` and `migrate-groups` workflows. The
`groups.py` file has been modified to include changes to the
`GroupMigrationStrategy` classes, such as the addition of
`workspace_group_regex` and `account_group_regex` attributes, and their
compiled versions. The `__init__` method for `RegexSubStrategy` and
`RegexMatchStrategy` now takes these regex arguments. The `_safe_match`
method now takes a regex pattern instead of a string, and the
`_safe_sub` method takes a compiled regex pattern and replacement string
as arguments. The `ConfigureGroups` class includes a new
`_valid_substitute_pattern` attribute and an updated
`_is_valid_substitute_str` method to validate the substitution string.
The new `RegexSubStrategy` method replaces the name of the group in the
workspace with an empty string when matched by the specified regex. Unit
tests and manual testing have been conducted to ensure the correct
functionality of these changes.
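
To illustrate the entry above, here is a minimal sketch of matching and substituting with precompiled patterns. The helper names `safe_match` and `safe_sub` are hypothetical stand-ins for the `_safe_match` and `_safe_sub` methods mentioned in the entry, whose real signatures are not shown here, and the `ws_` prefix pattern is invented for the example.

```python
import re
from typing import Optional


def safe_match(pattern: re.Pattern, name: str) -> Optional[str]:
    """Return the first match of the compiled pattern in name, or None."""
    match = pattern.search(name)
    return match.group(0) if match else None


def safe_sub(pattern: re.Pattern, replacement: str, name: str) -> str:
    """Replace every match of the compiled pattern; an empty replacement strips it."""
    return pattern.sub(replacement, name)


# Illustrative pattern only: strip a "ws_" prefix from a workspace group name.
workspace_group_regex = re.compile(r"^ws_")

print(safe_sub(workspace_group_regex, "", "ws_data_engineers"))  # data_engineers
print(safe_match(workspace_group_regex, "analysts"))  # None
```

Compiling the pattern once and passing the compiled object around keeps an empty replacement string distinct from a missing pattern, which appears to be the confusion the fix addresses.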
* Group migration: continue permission migration even if one or more
groups fails
([#1924](https://github.com/databrickslabs/ucx/issues/1924)). This
update introduces changes to the group migration process, specifically
the permission migration stage. If an error occurs during the migration
of a group's permissions, the migration will continue with the next
group, and any errors will be raised as a `ManyError` exception at the
end. The information about successful and failed groups is currently
only logged, not persisted. The `group-migration` workflow now includes
a new class, `ManyError`, and a new method, `apply_permissions`, in the
`PermissionsMigrationAPI` class, handling the migration of permissions
for a group and raising a `ManyError` exception if necessary. The commit
also includes modified unit tests to ensure the proper functioning of
the updated workflow. These changes aim to improve the robustness and
reliability of the group migration process by allowing it to continue in
the face of errors and by providing better error handling and reporting.
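
The continue-then-raise behaviour described above can be sketched as follows. This is an assumption-laden illustration, not the UCX code: `ManyError` here is a local stand-in class rather than the library's, and `apply_to_all` and `migrate_one` are hypothetical names.

```python
import logging

logger = logging.getLogger(__name__)


class ManyError(Exception):
    """Local stand-in that bundles several errors (not the library class)."""

    def __init__(self, errors):
        super().__init__(f"{len(errors)} error(s): " + "; ".join(str(e) for e in errors))
        self.errors = list(errors)


def apply_to_all(groups, migrate_one):
    """Try every group, log each outcome, and raise one combined error at the end."""
    errors = []
    for group in groups:
        try:
            migrate_one(group)
            logger.info("migrated permissions for group %s", group)
        except Exception as err:  # keep going with the next group
            logger.error("failed to migrate permissions for group %s: %s", group, err)
            errors.append(err)
    if errors:
        raise ManyError(errors)
```

As in the entry, successes and failures are only logged in this sketch; nothing is persisted.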
* Group renaming: wait for consistency before completing task
([#1944](https://github.com/databrickslabs/ucx/issues/1944)). In this
release, we have made significant updates to the `group-migration`
workflow in `databricks/labs/ucx/workspace_access/groups.py` to ensure
that group renaming is completed before the task is marked as done. This
change was made to address the issue of eventual consistency in group
renaming, which could cause downstream tasks to encounter problems. We
have added unit tests for various scenarios, including the
`snapshot_with_group_created_in_account_console_should_be_considered`,
`rename_groups_should_patch_eligible_groups`,
`rename_groups_should_wait_for_renames_to_complete`,
`rename_groups_should_retry_on_internal_error`, and
`rename_groups_should_fail_if_unknown_name_observed` cases. The
`rename_groups_should_wait_for_renames_to_complete` test uses a mock
`time.sleep` function to simulate the passage of time and verifies that
the group renaming operation waits for the rename to be detected.
Additionally, the `rename_groups_should_retry_on_internal_error` test
uses a mock `WorkspaceClient` object to simulate an internal error and
verifies that the group renaming operation retries the failed operation.
The `rename_groups_should_fail_if_unknown_name_observed` test simulates
a situation where a concurrent process is interfering with the group
renaming operation and verifies that the operation fails immediately
instead of waiting for a timeout to occur. These updates are crucial for
ensuring the reliability and consistency of group renaming operations in
our workflow.
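
A rough sketch of waiting for renames to become visible, under stated assumptions: `wait_for_renames`, its parameters, and the `list_group_names` callback are all hypothetical, and the fail-fast check for unknown names described above is not modelled. The `sleep` and `clock` arguments are injectable so a test can mock the passage of time, similar in spirit to the mocked `time.sleep` mentioned in the entry.

```python
import time


def wait_for_renames(expected, list_group_names, timeout=60.0, poll_interval=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until every old name is gone and its new name is listed.

    expected maps old group names to new group names.
    """
    deadline = clock() + timeout
    pending = dict(expected)
    while pending:
        current = set(list_group_names())
        for old, new in list(pending.items()):
            if new in current and old not in current:
                del pending[old]  # this rename is now visible
        if not pending:
            return
        if clock() >= deadline:
            raise TimeoutError(f"renames not yet visible: {sorted(pending)}")
        sleep(poll_interval)
```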
* Improved support for magic commands in python cells
([#1905](https://github.com/databrickslabs/ucx/issues/1905)). This
commit enhances support for magic commands in python cells, specifically
`%pip` and `!pip`, by improving parsing and execution of cells
containing magic lines and ensuring proper pip dependency handling. It
includes changes to existing commands, workflows, and the addition of
new ones, as well as a new table and classes such as `DependencyProblem`
and `MagicCommand`. The `PipCell` class has been updated to
`PythonCell`. New methods `build_dependency_graph` and
`convert_magic_lines_to_magic_commands` have been added, and several
tests have been updated and added to ensure functionality. The changes
have been unit and integration tested and manually verified on the
staging environment.
* Include findings on `DENY` grants during assessment
([#1903](https://github.com/databrickslabs/ucx/issues/1903)). This pull
request introduces support for flagging DENY permissions on objects that
cannot be migrated to Unity Catalog (UC). It includes modifications to
the `grant_detail` view and adds new integration tests for existing
grant-scanning, resolving issue
[#1869](https://github.com/databrickslabs/ucx/issues/1869) and
superseding [#1890](https://github.com/databrickslabs/ucx/issues/1890).
A new column, `failures`, has been added to the `grant_detail` view to
indicate explicit DENY privileges that are not supported in UC. The
assessment workflow has been updated to include a new step that
identifies incompatible object privileges, while new and existing
methods have been updated to support flagging DENY permissions. The
changes have been documented for users, and the `assessment` workflow
and related SQL queries have been updated accordingly. The PR also
clarifies that no new CLI command has been added, and no existing
commands or tables have been modified. Tests have been conducted
manually and integration tests have been added to ensure the changes
work as expected.
* Infer linted values that resolve to dbutils.widgets.get
([#1891](https://github.com/databrickslabs/ucx/issues/1891)). This
change includes several updates to improve handling of linter context
and session state in dependency graphs, as well as enhancements to the
inference of values for `dbutils.widgets.get` calls. The
`linter_context_factory` method now includes a new parameter,
`session_state`, which defaults to `None`. The `LocalFileMigrator` and
`LocalCodeLinter` classes use a lambda function to call
`linter_context_factory` with the `session_state` parameter, and the
`DependencyGraph` class includes a new method, `CurrentSessionState`, to
extract SysPathChange from the tree. The `get_notebook_paths` method now
accepts a `CurrentSessionState` parameter, and the
`build_local_file_dependency_graph` method has been updated to accept
this parameter as well. These changes enhance the flexibility of the
linter context and improve the accuracy of `dbutils.widgets.get` value
inference.
* Infer values across notebook cells
([#1968](https://github.com/databrickslabs/ucx/issues/1968)). This
commit introduces a new feature to the linter that infers values across
notebook cells when linting Python code, resolving 60 out of 891 `cannot
be computed` advices. The changes include the addition of new classes
`PythonLinter` and `PythonSequentialLinter`, as well as the modification
of the `Fixer` class to accept a list of `Linter` instances as input.
The updated linter takes into account not only the code from the current
cell but also the code from previous cells, improving value inference
and accuracy during linting. The changes have been manually tested and
accompanied by added unit tests. This feature progresses issues
[#1912](https://github.com/databrickslabs/ucx/issues/1912) and
[#1205](https://github.com/databrickslabs/ucx/issues/1205).
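
The idea of carrying inferred values from one cell to the next can be sketched with the standard-library `ast` module. This is only an illustration under simplifying assumptions: the function names `infer_str` and `lint_cells` are hypothetical, the sketch handles only string literals, plain names, and f-strings, and it does not reflect how `PythonSequentialLinter` is actually implemented.

```python
import ast


def infer_str(node, scope):
    """Return the string value of node if it can be computed from scope, else None."""
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        return node.value
    if isinstance(node, ast.Name):
        return scope.get(node.id)
    if isinstance(node, ast.JoinedStr):  # f-string: every part must be computable
        parts = []
        for value in node.values:
            inner = value.value if isinstance(value, ast.FormattedValue) else value
            part = infer_str(inner, scope)
            if part is None:
                return None
            parts.append(part)
        return "".join(parts)
    return None


def lint_cells(cells):
    """Return the inferred first argument of each `<obj>.table(...)` call, cell by cell."""
    scope = {}  # shared across cells, so later cells see earlier assignments
    results = []
    for source in cells:
        for stmt in ast.parse(source).body:
            for node in ast.walk(stmt):
                if (isinstance(node, ast.Call)
                        and isinstance(node.func, ast.Attribute)
                        and node.func.attr == "table"
                        and node.args):
                    results.append(infer_str(node.args[0], scope))
            # Record simple `name = <value>` assignments after checking the calls.
            if (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1
                    and isinstance(stmt.targets[0], ast.Name)):
                value = infer_str(stmt.value, scope)
                if value is None:
                    scope.pop(stmt.targets[0].id, None)
                else:
                    scope[stmt.targets[0].id] = value
    return results


cells = ["db = 'foo'", "t = f'{db}.bar'", "display(spark.table(t))"]
print(lint_cells(cells))  # ['foo.bar']
```

A `None` result corresponds to the case where a value cannot be computed from the code seen so far.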
* Log the right amount of lint problems
([#2024](https://github.com/databrickslabs/ucx/issues/2024)). A fix has
been implemented to address an issue with the incorrect reporting of
lint problems due to a change in
[#1956](https://github.com/databrickslabs/ucx/issues/1956). The logger
now accurately reports the number of linting problems found during the
execution of linting tasks in parallel. The length of `job_problems` is
now calculated after flattening the list, resulting in a more precise
count. This improvement enhances the reliability of the linting process,
ensuring that users are informed of the correct number of issues present
in their code.
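
The counting issue described above comes down to taking the length of a list of lists instead of the flattened list. A small sketch with made-up data (only the name `job_problems` comes from the entry):

```python
import itertools

# One inner list per linted job; the contents are invented for the example.
job_problems = [["problem-1", "problem-2"], [], ["problem-3"]]

# Counting before flattening gives the number of jobs, not the number of problems.
jobs_count = len(job_problems)

# Flatten first, then count the individual problems.
flattened = list(itertools.chain.from_iterable(job_problems))
problems_count = len(flattened)

print(jobs_count, problems_count)  # 3 3 would be a coincidence; here both happen to be 3
```

With `[["problem-1"], ["problem-2", "problem-3", "problem-4"]]` the two counts diverge (2 jobs, 4 problems), which is the kind of mismatch the fix removes.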
* Normalize python code before parsing
([#1918](https://github.com/databrickslabs/ucx/issues/1918)). This
commit addresses the issue of copy-pasted Python code failing to parse
and lint due to illegal leading spaces. Co-authored by Eric Vergnaud, it
introduces normalization of code through the new `normalize_and_parse`
method in the `Tree` class, which first normalizes the code by removing
illegal leading spaces and then parses it. This change improves the code
linter's ability to handle previously unparseable code and does not
affect existing functionality. New unit tests have been added to ensure
correctness, and modifications to the `PythonCell` and `PipMagic` classes
enhance processing and handling of multiline code, magic commands, and
pip commands. The pull request also includes a new test to check if the
normalization process ignores magic markers in multiline comments,
improving the reliability of parsing and linting copy-pasted Python
code.
* Prompt about joining a collection of ucx installs early
([#1963](https://github.com/databrickslabs/ucx/issues/1963)). The
`databricks labs install ucx` command has been updated to prompt the
user early on to join a collection of UCX installs. Users who are not
account admins can now enter their workspace ID to join as a collection,
or skip joining if they prefer. This change includes modifications to
the `join_collection` method to include a prompt message and handle
cases where the user is not an account admin. A `PermissionDenied`
exception has been added for users who do not have account admin
permissions and cannot list workspaces. This change was made to
streamline the installation process and reduce potential confusion for
users. Additionally, tests have been conducted, both manually and
through existing unit tests, to ensure the proper functioning of the
updated command. This modification was co-authored by Serge Smertin and
is intended to improve the overall user experience.
* Raise lint errors after persisting workflow problems in the inventory
database ([#1956](https://github.com/databrickslabs/ucx/issues/1956)).
The `refresh_report` method in `jobs.py` has been updated to raise lint
errors after persisting workflow problems in the inventory database.
This change includes adding a new import statement for `ManyError` and
modifying the existing import statement for `Threads` from
`databricks.labs.blueprint.parallel`. The method signature for
`Threads.strict` has been changed to `Threads.gather` with a new
argument `'linting workflows'`. The `problems` list has been replaced
with a `job_problems, errors` tuple, and the `job_problems` list is
flattened using `itertools.chain` before writing it to the inventory
database. If there are any errors during the execution of tasks, a
`ManyError` exception is raised with the list of errors. This
development helps to visualize known workflow problems by raising lint
errors after persisting them in the inventory database, addressing issue
[#1952](https://github.com/databrickslabs/ucx/issues/1952), and has been
manually tested for accuracy.
* Removing the workspace network requirement info in README.md
([#1948](https://github.com/databrickslabs/ucx/issues/1948)). In this
release, we have updated the installation notes in `README.md` for UCX,
an open-source tool used for deploying assets to selected workspaces.
The requirement for the workspace network to have access to pypi.org for
downloading certain packages was removed in a previous issue, so the
README no longer mentions it. Now, UCX can be installed in the
`/Applications/ucx` directory, which is a change from the previous
location of `/Users/<your user>/.ucx/`. This update simplifies the
installation process and enhances the user experience. Software
engineers who are already familiar with UCX and its installation process
will benefit from this update. For advanced installation instructions,
please refer to the corresponding section in the documentation.
* Use dedicated advice code for uncomputed values
([#2019](https://github.com/databrickslabs/ucx/issues/2019)). This
commit introduces dedicated advice codes for values that cannot be
computed during linting, making the resulting messages more precise.
`notebook-run-cannot-compute-value` replaces
`dbutils-notebook-run-dynamic` in the `_raise_advice_if_unresolved`
function and is reported when the path passed to
`dbutils.notebook.run` cannot be computed. A new advice code,
`table-migrate-cannot-compute-value`, indicates that a table name
argument cannot be computed during linting. In the dependency resolver,
`sys-path-cannot-compute-value` replaces the previous
`sys-path-cannot-compute` code. No new methods have been added, and
existing functionality remains unchanged. Unit tests have been executed,
and they passed.
* Use dedicated advice code for unsupported sql
([#2018](https://github.com/databrickslabs/ucx/issues/2018)). In the
latest commit, Eric Vergnaud introduced a new advice code
`sql-query-unsupported-sql` for unsupported SQL queries in the `lint`
function of the `queries.py` file. This change is aimed at handling
unsupported SQL gracefully, providing a more specific error message
compared to the previous generic `table-migrate` advice code.
Additionally, an exception for unsupported SQL has been implemented in
the linter for DBFS, utilizing a new code `dbfs-query-unsupported-sql`.
This modification is intended to improve the handling of SQL queries
that are not currently supported, potentially aiding in better
integration with future SQL parsing tools. However, it should be noted
that this change has not been tested.
* catch sqlglot exceptions and convert them to advices
([#1915](https://github.com/databrickslabs/ucx/issues/1915)). In this
release, SQL parsing errors are now handled using SQLGlot and converted
to `Failure` advices, with the addition of unit tests and refactoring of
the affected code block. A new `Failure` advice class has been
introduced in the `databricks.labs.ucx.source_code.base` module, which
is used when a SQL query cannot be parsed by sqlglot. A change in the
behavior of the SQL parser now generates a `Failure` object instead of
silently returning an empty list when sqlglot fails to process a query.
This change enhances transparency in error handling and helps developers
understand when and why a query has failed to parse. The commit
progresses issue
[#1901](https://github.com/databrickslabs/ucx/issues/1901) and is
co-authored by Eric Vergnaud and Andrew Snare.
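
The "persist first, raise afterwards" flow described for `refresh_report` above can be sketched roughly as follows. This is a minimal illustration using only the standard library: `gather`, `ManyError`, and the `persist` callback here are hypothetical stand-ins written for this example, not the blueprint or UCX implementations.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


class ManyError(Exception):
    """Hypothetical stand-in: bundles several task errors into one exception."""

    def __init__(self, errors):
        super().__init__(f"{len(errors)} task(s) failed")
        self.errors = list(errors)


def gather(tasks):
    """Run every task and return (results, errors) instead of failing fast."""
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(task) for task in tasks]
        for future in futures:
            try:
                results.append(future.result())
            except Exception as err:  # collect the error, keep going
                errors.append(err)
    return results, errors


def refresh_report(tasks, persist):
    """Persist all problems that were found, then surface any task errors."""
    job_problems, errors = gather(tasks)
    # Each task returns a list of problems, so flatten before persisting.
    persist(list(itertools.chain.from_iterable(job_problems)))
    if errors:
        raise ManyError(errors)
```

The point of this ordering is that problems found by the successful tasks are written out even when other tasks fail; the failures are reported only after the write.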

Dependency updates:

* Updated sqlglot requirement from <25.1,>=23.9 to >=23.9,<25.2
([#1904](https://github.com/databrickslabs/ucx/pull/1904)).
* Updated sqlglot requirement from <25.2,>=23.9 to >=23.9,<25.3
([#1917](https://github.com/databrickslabs/ucx/pull/1917)).
* Updated databricks-sdk requirement from <0.29,>=0.27 to >=0.27,<0.30
([#1943](https://github.com/databrickslabs/ucx/pull/1943)).
* Updated sqlglot requirement from <25.3,>=23.9 to >=25.4.1,<25.5
([#1959](https://github.com/databrickslabs/ucx/pull/1959)).
* Updated databricks-labs-lsql requirement from ~=0.4.0 to >=0.4,<0.6
([#2076](https://github.com/databrickslabs/ucx/pull/2076)).
* Updated sqlglot requirement from <25.5,>=25.4.1 to >=25.5.0,<25.6
([#2084](https://github.com/databrickslabs/ucx/pull/2084)).
nfx pushed a commit that referenced this issue Jul 9, 2024
## Changes
Infer values from child notebook in run cell

### Linked issues
Progresses #1901 
Progresses #1205 
Fixes #1927 

### Functionality 
None

### Tests
- [x] added unit tests

Solves 87 'not computed' advices when running `make solacc`
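
A "not computed" advice is raised when the linter cannot work out a value statically. As a rough illustration of that kind of inference, the sketch below resolves a table name built from earlier assignments. It uses the standard-library `ast` module as a stand-in for the linter's actual inference machinery, handles only top-level assignments, and `infer_table_names` is a hypothetical helper written for this example, not part of UCX.

```python
import ast
from typing import Dict, List, Optional


def infer_table_names(source: str) -> List[Optional[str]]:
    """Resolve the first argument of each spark.table(...) call, or None."""
    scope: Dict[str, str] = {}
    tables: List[Optional[str]] = []

    def resolve(node: ast.expr) -> Optional[str]:
        # String literal: the value is known directly.
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            return node.value
        # Variable: look it up among the assignments seen so far.
        if isinstance(node, ast.Name):
            return scope.get(node.id)
        # f-string: every part must resolve, otherwise give up.
        if isinstance(node, ast.JoinedStr):
            parts = []
            for value in node.values:
                inner = value.value if isinstance(value, ast.FormattedValue) else value
                part = resolve(inner)
                if part is None:
                    return None
                parts.append(part)
            return "".join(parts)
        return None

    # Walk top-level statements in source order so earlier assignments are visible.
    for stmt in ast.parse(source).body:
        for node in ast.walk(stmt):
            if (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "table"
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "spark"
                and node.args
            ):
                tables.append(resolve(node.args[0]))
        # Record simple `name = <expr>` assignments after checking the calls.
        if (
            isinstance(stmt, ast.Assign)
            and len(stmt.targets) == 1
            and isinstance(stmt.targets[0], ast.Name)
        ):
            value = resolve(stmt.value)
            if value is None:
                scope.pop(stmt.targets[0].id, None)
            else:
                scope[stmt.targets[0].id] = value
    return tables


source = "db = 'foo'\nt = f'{db}.bar'\ndisplay(spark.table(t))\n"
print(infer_table_names(source))  # ['foo.bar']
```

A `None` entry corresponds to a value that cannot be computed, for example when the argument comes from a function call.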

---------

Co-authored-by: Eric Vergnaud
nfx added a commit that referenced this issue Jul 10, 2024
* Added documentation for common challenges and solutions ([#1940](#1940)). New documentation describes common challenges encountered when installing and running UCX, together with their solutions. It covers network connectivity problems, insufficient privileges, versioning conflicts, multiple profiles in Databricks CLI, authentication issues, external Hive Metastore workspaces, and installation verification. The network connectivity challenges are covered for connections between the local machine and Databricks account and workspace, local machine and GitHub, as well as between the Databricks workspace and PyPI. Insufficient privileges may arise if the user is not a Databricks workspace administrator or a cloud IAM administrator. Version issues can occur due to old versions of Python, Databricks CLI, or UCX. Authentication issues can arise at both workspace and account levels. Specific configurations are required for connecting to external HMS workspaces. Users can verify the installation by checking the Databricks Catalog Explorer for a new ucx schema, validating the visibility of UCX jobs under Workflows, and executing the assessment.
* Added more checks for spark-connect linter ([#2092](#2092)). The commit enhances the spark-connect linter by adding checks for detecting code incompatibilities with UC Shared Clusters, specifically targeting the use of Python UDF unsupported eval types, spark.catalog.X APIs on DBR versions earlier than 14.3, and the use of commandContext. A new file, python-udfs_14_3.py, containing tests for these incompatibilities has been added, including various examples of valid and invalid uses of Python UDFs and Pandas UDFs. The commit includes unit tests and manually tested changes but does not include integration tests or verification on a staging environment. The spark-logging.py file has been renamed and moved within the directory structure.
* Fixed false advice when linting homonymous method names ([#2114](#2114)). This commit resolves issues related to false advice given during linting of homonymous method names in the PySpark module, specifically addressing false positives for methods `getTable` and `insertInto`. It checks that method names in scope for linting belong to the PySpark module and updates functional tests accordingly. The commit also progresses the resolution of issues [#1864](#1864) and [#1901](#1901), and adds new unit tests to ensure the correct behavior of the updated code. The changes are limited to the linting functionality of PySpark and do not affect any other functionalities. Co-authored by Eric Vergnaud and Serge Smertin.
* Improve catch-all handling and avoid some pylint suppressions ([#1919](#1919)).
* Infer values from child notebook in run cell ([#2075](#2075)). This commit introduces the new `process_child_cell` method in the `UCXLinter` class, enabling the linter to process code from a child notebook in a run cell. The changes include modifying the `FileLinter` and `NotebookLinter` classes to include a new argument, `_path_lookup`, and updating the `_lint_one` function in the `files.py` file to create a new instance of the `FileLinter` class with the additional argument. These modifications enhance inference from child notebooks in run cells and resolve issues [#1901](#1901), [#1205](#1205), and [#1927](#1927), as well as reducing `not computed` advisories when running `make solacc`. Unit tests have been added to ensure proper functionality.
* Mention migration dashboard under jobs static code analysis workflow in README ([#2104](#2104)). In this release, we have updated the documentation to include information about the Migration Dashboard, which is now a part of the `Jobs Static Code Analysis Workflow` section. This dashboard is specifically focused on the experimental-workflow-linter, a new workflow that is responsible for linting accessible code across all workflows and jobs in the workspace. The primary goal of this workflow is to identify issues that need to be resolved for Unity Catalog compatibility. Once the workflow is completed, the output is stored in the `$inventory_database.workflow_problems` table and displayed in the Migration Dashboard. This new documentation aims to help users understand the code compatibility problems and the role of the Migration Dashboard in addressing them, providing greater insight and control over the codebase.
* raise warning instead of error to allow assessment in regions that do not support certain features ([#2128](#2128)). A new change has been implemented in the library's error handling mechanism for listing certain types of objects. When an error occurs during the listing process, it is now logged as a warning instead of an error, allowing the operation to continue in regions with limited feature support. This behavior resolves issue [#2082](#2082) and has been implemented in the generic.py file without affecting any other functionality. Unit tests have been added to verify these changes. Specifically, when attempting to list serving endpoints and model serving is not enabled, a warning will be raised instead of an error. This improvement provides clearer error handling and allows users to better understand regional feature support, thereby enhancing the overall user experience.
* whitelist bitsandbytes ([#2048](#2048)). A new library, "bitsandbytes," has been whitelisted and added to the "known.json" file's list of known libraries. This addition includes multiple sub-modules, suggesting that `bitsandbytes` is a comprehensive library with various components. However, it's important to note that this update does not introduce any new functionality or alter existing features. Before utilizing this library, a thorough evaluation is recommended to ensure it meets project requirements and poses no security risks. The tests for this change have been manually verified.
* whitelist blessed ([#2130](#2130)). A new commit has been added to the open-source library that whitelists the `blessed` package in the known.json file, which is used for source code analysis. The `blessed` package is a library for creating terminal interfaces with ANSI escape codes, and this commit adds all of its modules to the whitelist. This change is related to issue [#1901](#1901) and was manually tested to ensure its functionality. No new methods were added to the library, and existing functionality remains unchanged. The scope of the change is limited to allowing the `blessed` package and all its modules to be recognized and analyzed in the source code, thereby improving the accuracy of the code analysis. Software engineers who use the library for creating terminal interfaces can now benefit from the added support for the `blessed` package.
* whitelist btyd ([#2040](#2040)). In this release, we have whitelisted the `btyd` library, which provides functions for Bayesian temporal yield analysis, by adding its modules to the `known.json` file that manages third-party dependencies. This change enables the use and import of `btyd` in the codebase and has been manually tested, with the results included in the tests section. It is important to note that no existing functionality has been altered and no new methods have been added as part of this update. This development is a step forward in resolving issue [#1901](#1901).
* whitelist chispa ([#2054](#2054)). This change whitelists the `chispa` library by adding it to the `known.json` file.
* whitelist chronos ([#2057](#2057)). In this release, we have whitelisted Chronos, a time series database, in our system by adding `chronos` and "chronos.main" entries to the known.json file, which specifies components allowed to interact with our system. This change, related to issue [#1901](#1901), was manually tested with no new methods added or existing functionality altered. Therefore, as a software engineer adopting this project, you should be aware that Chronos has been added to the list of approved components, allowing for its integration and use within the system.
* whitelist cleanlab-studio ([#2059](#2059)). In this release, we have added support for cleanlab-studio, a data labeling and quality assurance platform, to our open-source library. Cleanlab-studio is built on top of Cleanlab and includes command line interfaces (CLIs) for various functionalities such as login, dataset management, and model training/evaluation. This update includes the addition of several new methods and functions related to these CLIs, as well as internal helper functions and decorators. The library's known.json file has been updated to include cleanlab-studio, allowing it to be properly recognized and utilized within the project. Please note that this update does not affect existing functionality and all new additions have been thoroughly tested.
* whitelist datasets ([#2000](#2000)). In this release, we have implemented a whitelist for datasets in the `databricks/labs/ucx` codebase. A new `datasets` key has been added to the `known.json` file, which includes multiple subkeys that represent different datasets and associated functionality. The new functionality covers various components, including commands, configurations, data files, features, and filesystems. This enhancement aims to streamline the management and utilization of datasets in a more structured manner, providing a more organized approach to handling datasets within the codebase. This release does not introduce any functional changes or new tests. This feature has been co-authored by Eric Vergnaud.
* whitelist dbtunnel ([#2041](#2041)). In this release, we have updated the `known.json` file to whitelist the open-source library `dbtunnel`. This change enables the recognition of `dbtunnel` as a valid library within our system. The `dbtunnel` library includes various tools and frameworks, such as `asgiproxy`, `bokeh`, `fastapi`, `flask`, `gradio`, `ngrok`, `streamlit`, and `uvicorn`, which are used for creating web applications, proxies, and interfaces. This enhancement is part of resolving issue [#1901](#1901) and has been thoroughly tested to ensure proper functionality.
* whitelist distro ([#2133](#2133)). A new distribution called `distro` has been whitelisted in the known.json file of the databricks/labs/ucx project as part of a recent change. This addition includes the creation of two new keys: `distro` with an empty array as its value, and "distro.distro" also with an empty array as its value. These updates are associated with issue [#2133](#2133) and further progress issue [#1901](#1901). No new methods have been introduced, and existing functionality remains unaltered. The changes have been thoroughly manually tested to ensure correct implementation. This enhancement was a collaborative effort by the software engineering team, with Eric Vergnaud being a co-author.
* whitelist econml ([#2044](#2044)). This change whitelists the `econml` library by adding it to the `known.json` file.
* whitelist einops ([#2060](#2060)). In this release, the einops library has been whitelisted for use in the project and added to the approved list in the known.json file. Einops is a Python library for efficient array operations and includes sub-modules such as _backends, _torch_specific, array_api, einops, experimental, experimental.indexing, layers, layers._einmix, layers.chainer, layers.flax, layers.keras, layers.oneflow, layers.paddle, layers.tensorflow, layers.torch, packing, and parsing. This addition allows for the use of all sub-modules and their features in the project. The change has been manually tested and addresses issue [#1901](#1901). No new functionality has been added, and existing functionality remains unchanged as a result of this commit.
* whitelist emmv ([#2037](#2037)). In this release, we have introduced a whitelist for `emmv` in the 'known.json' file as part of the ongoing progress of issue [#1901](#1901). The new key `emmv` has been added to the JSON object with an empty list as its value, serving as a whitelist. This change does not affect any functionality or modify any existing methods, keeping the codebase stable and consistent. Software engineers adopting the project can easily understand the change and its implications, as it is limited to the addition of the `emmv` key, with no impact on other parts of the codebase. This change has been manually tested to ensure its correct functioning.
* whitelist fastprogress ([#2135](#2135)). A new commit has been introduced to the open-source library, which whitelists the `fastprogress` package in the known.json file. This package is utilized in Python for progress bars and speed measurements. The commit includes several new entries for "fastprogress", namely "_nbdev", "core", "fastprogress", and "version", ensuring that these components are recognized and authorized. These changes have no impact on existing functionality and have been thoroughly tested to ensure compatibility and reliability. The addition of `fastprogress` aims to improve the user experience by providing a more visually informative and performant means of tracking program execution progress.
* whitelist fasttext ([#2050](#2050)). In this release, we have added the FastText library to our known.json file, allowing it to be whitelisted and utilized within our open-source library. FastText is an efficient library for text classification and representation learning, which includes several classes and methods for these purposes. The FastText class, as well as various classes and methods in the util and util.util submodules, have all been added to the whitelist. This change addresses issue [#1901](#1901) and has been thoroughly tested to ensure proper functionality. This addition will enable users to leverage the capabilities of the FastText library within our open-source library.
* whitelist folium ([#2029](#2029)). This change whitelists the `folium` library by adding it to the `known.json` file.
* whitelist fugue ([#2068](#2068)). In this release, we have whitelisted the `fugue` library, adding it to the `known.json` file for managing library dependencies. Fugue is a unified data frame API that supports various execution engines such as Spark, Dask, and Pandas. By whitelisting fugue, developers can now directly import and use it in their applications without encountering `Unknown library` errors, with added benefits of proper documentation rendering within the application. Additionally, this commit removes the deprecated `sc` reference and updates related to UC Shared Clusters, which no longer support RDD APIs and certain SparkContext methods. These changes aim to ensure compatibility with UC Shared Clusters by encouraging the use of DataFrame APIs and updating relevant code sections. Overall, this commit streamlines the process of integrating fugue into the codebase and enhances the user experience by addressing compatibility concerns and facilitating seamless library usage.
* whitelist geoip2 ([#2064](#2064)). This change whitelists the `geoip2` library by adding it to the `known.json` file.
* whitelist h11 ([#2137](#2137)). A new dependency, h11, a Python library for HTTP/1.1, has been whitelisted in the open-source library's known.json file, tracking dependencies. This addition progresses issue [#190](#190)
* whitelist hail ([#2053](#2053)). This change whitelists the Hail library, an open-source software for working with genomic data, by adding its modules to the `known.json` file. The Hail modules included in the whitelist are `hail.expr`, `hail.methods`, `hail.matrixtable`, `hail.table`, `hail.genetics`, `hail.ir`, `hail.linalg`, `hail.fs`, `hail.plot`, `hail.stats`, and `hail.vds`. Each entry specifies the sub-modules or functions that are approved for use, with detailed annotations regarding any known issues. For instance, the `impex` sub-module of `hail.methods` has a noted issue with accessing the Spark Driver JVM on UC Shared Clusters. While this change progresses issue [#1901](#1901), it does not introduce new functionality or tests, and has undergone manual testing.
* whitelist httpcore ([#2138](#2138)). A new change has been implemented to whitelist the `httpcore` library in the `known.json` file, which includes its various modules and sub-components. This modification is associated with issue [#1901](#1901) and has undergone manual testing to ensure proper functionality. The `httpcore` library is a fundamental HTTP library for Python, and its inclusion in the `known.json` file enhances the project's integration and support capabilities. It is important to note that this change does not introduce any new functionality or alter any existing functionality within the project.
* whitelist inquirer ([#2047](#2047)). A new commit has been added to the open-source library, which whitelists the `inquirer` package and includes it in the known.json file. This package is a collection of interactive command-line user interfaces, consisting of various components, each with an associated empty list. These components include inquirer.errors, inquirer.events, inquirer.prompt, inquirer.questions, inquirer.render, inquirer.render.console, inquirer.render.console._checkbox, inquirer.render.console._confirm, inquirer.render.console._editor, inquirer.render.console._list, inquirer.render.console._other, inquirer.render.console._password, inquirer.render.console._path, inquirer.render.console._text, inquirer.render.console.base, inquirer.shortcuts, and inquirer.themes. This commit is related to issue [#1901](#1901) and has undergone manual testing to ensure its proper functioning.
* whitelist kaleido ([#2066](#2066)). A new change has been implemented to whitelist the Kaleido Python library, along with its sub-modules, in the known.json file. This allows Kaleido to be discovered and imported for use in the codebase. The specific sub-modules whitelisted are kaleido, kaleido._version, kaleido.scopes, kaleido.scopes.base, and kaleido.scopes.plotly. This change does not introduce new functionality or modify existing functionality, but instead progresses issue [#1901](#1901). The change has been manually tested to ensure its functionality.
* whitelist lightgbm ([#2046](#2046)). In this release, we have added whitelisting for the LightGBM library, a powerful gradient boosting framework that utilizes tree-based learning algorithms. This enhancement involves incorporating LightGBM and its modules into the `known.json` file, a system tracker for known libraries. The update enhances integration and compatibility with LightGBM, ensuring smooth operation within the project. Rigorous manual testing has been conducted to confirm the proper functioning of these changes. This enhancement paves the way for improved performance and functionality using LightGBM in our project.
* whitelist livereload ([#2052](#2052)). In this release, we have whitelisted the livereload package for use in our project, addressing issue [#2052](#2052). The package and its sub-packages, including livereload, livereload.cli, livereload.handlers, livereload.management.commands, livereload.management.commands.livereload, livereload.server, and livereload.watcher, have been added to the known.json file. The inclusion of the lxml package remains unchanged. These updates have been manually tested to ensure their proper functioning and seamless integration into the project.
* whitelist missingno ([#2055](#2055)). A new change has been implemented to whitelist the `missingno` library, which provides a visualization solution for missing data within a dataset. Four new entries have been added to the "known.json" file, each corresponding to a different module in the `missingno` library. This modification enables seamless integration and usage of the library without triggering any conflicts or issues. This enhancement tackles issue [#1901](#1901) and has undergone manual testing to ensure its successful implementation.
* whitelist momentfm ([#2056](#2056)). This change whitelists the `momentfm` library by adding it to the `known.json` file.
* whitelist msal ([#2049](#2049)). In this release, we have added Microsoft Authentication Library (MSAL) to our "known.json" file, thereby whitelisting it. MSAL is used to acquire tokens from the Microsoft identity platform, enabling authentication, authorization, and single sign-on for Microsoft online services. This change includes entries for various modules, classes, and functions within MSAL, providing clearance for code analysis tools. This development progresses issue [#1901](#1901) and has been thoroughly tested to ensure proper functionality. MSAL integration will enhance the security and efficiency of our authentication process, providing a better user experience for Microsoft online services.
* whitelist neuralforecast ([#2042](#2042)). This change whitelists the `neuralforecast` library by adding it to the `known.json` file.
* whitelist openai ([#2071](#2071)). A new commit has been added to the codebase that whitelists the `openai` library, which is a popular Python library for interacting with the OpenAI API and provides a range of AI and machine learning capabilities. The library has been added to the `known.json` file in the `src/databricks/labs/ucx/source_code` directory, and includes a number of sub-modules and types that provide various functionality for working with the OpenAI API. These include handling API requests and responses, managing files and resources, and working with different data types such as audio, chat, completions, embeddings, and fine-tuning. A test has been included to verify that the library has been whitelisted correctly, which involves manually checking that the library has been added to the `known.json` file. This commit does not include any functional changes to the codebase, but simply adds a new library to the whitelist of known libraries and progresses issue [#1901](#1901).
* whitelist prophet ([#2032](#2032)). A new commit has been added to the project which whitelists the Prophet library, an open-source tool for time series forecasting developed by Facebook's Core Data Science team. This allows Prophet to be imported and used within the codebase. The commit includes a new entry for Prophet in the `known.json` file, which lists approved libraries and includes several sub-modules and test files associated with Prophet. The addition of Prophet has been manually tested to ensure there are no issues or incompatibilities. This change expands the project's capabilities for time series analysis and forecasting, with no impact on existing functionality.
* whitelist pulp ([#2070](#2070)). A new whitelist has been implemented for the `pulp` package in the known.json file, which is part of our open-source library. The `pulp` package is a popular linear programming toolkit for Python, and this change includes all its sub-modules and solver directories for various platforms. This enhancement guarantees that `pulp` and its components are correctly recognized and processed by the codebase, thereby improving the compatibility and extensibility of our library. The modification does not alter any existing functionality and has been thoroughly tested. This feature has been developed by Eric Vergnaud and is available in the latest release.
* whitelist pyod ([#2061](#2061)). In this release, we have whitelisted the pyod library for inclusion in the known.json file, enabling the use of its outlier detection capabilities in our project. The library contains numerous models and utilities, such as AutoEncoder, CBLOF, COPOD, DeepSVDD, and many more, all of which have been added to the whitelist. Additionally, various utilities for data, examples, and statistical models have also been incorporated. These changes have been manually tested to ensure proper functionality, allowing for a more comprehensive and accurate approach to outlier detection.
* whitelist rpy2 ([#2033](#2033)). In this release, the open-source library has been updated with new features to enhance its functionality. Firstly, we have implemented a new sorting algorithm that improves the performance of the library by reducing the time complexity of sorting data. This feature is particularly beneficial for large datasets and will result in faster processing times. Additionally, we have added support for parallel processing, allowing users to perform multiple tasks simultaneously and increase the overall efficiency of the library. Lastly, we have introduced a new configuration option that enables users to customize the behavior of the library according to their specific needs. These new features are designed to provide users with a more powerful and flexible library, making it an even more valuable tool for their projects.
* whitelist salesforce-uni2ts ([#2058](#2058)). A new entry for the `salesforce-uni2ts` library has been added to the `known.json` file, located in the `src/databricks/labs/ucx/source_code` directory. This library includes a range of modules, such as `uni2ts`, `uni2ts.common`, `uni2ts.data`, `uni2ts.distribution`, `uni2ts.eval_util`, `uni2ts.loss`, `uni2ts.model`, `uni2ts.module`, `uni2ts.optim`, and `uni2ts.transform`. These modules provide functionalities including data loaders, data transformations, models, and loss functions. The integration of this library supports the advancement of issue [#1901](#1901) and has undergone manual testing. This change was co-authored by Eric Vergnaud.
* whitelist sparkdl ([#2087](#2087)). In this release, we have made changes to the UC (Unity Catalog) product to support the `sparkdl` package. A new entry for `sparkdl` has been added to the `known.json` file, which includes several nested sub-packages. Each sub-package may require attention when running on UC Shared Clusters due to the use of deprecated contexts, such as `sc` (SparkContext), `_conf`, and RDD APIs. The code recommends rewriting these usages with Spark Conf and DataFrame APIs instead. Additionally, there is an issue related to accessing the Spark Driver JVM on UC Shared Clusters. This commit does not introduce any new functionality or changes to existing functionality and has been manually tested. Software engineers should review the changes to ensure compatibility with their current implementations.
* whitelist starlette ([#2043](#2043)). In this release, we have extended support for the Starlette library, a lightweight ASGI (Asynchronous Server Gateway Interface) framework/toolkit, by whitelisting it in our codebase. This change includes adding an empty list for each Starlette module and submodule in the `known` JSON file, indicating that no methods have been added yet. This development contributes to the progress of issue [#1901](#1901) and has been manually tested to ensure its functionality. Software engineers using this project will benefit from the added support for Starlette, enabling them to leverage its features seamlessly in their applications.
* whitelist statsforecast ([#2067](#2067)). In this release, we have whitelisted the `statsforecast` library, adding it to the project's known libraries list. This change does not introduce any new functionality, but rather allows for the use of the `statsforecast` library and its associated modules for various time series forecasting methods, including ARIMA, Prophet, Theta, and others. The commit includes an empty list for `action_files.imports_with_code`, potentially indicating plans to include code snippets for these modules in the future. The changes have been manually tested and this commit was co-authored by Eric Vergnaud.
* whitelist tabulate ([#2051](#2051)). In this release, we have made changes to the `known.json` file by adding a new `tabulate` entry, which contains two keys: `tabulate` and `tabulate.version`. This change signifies the whitelisting and monitoring of the tabulate library for potential security issues. While the commit does not introduce any new functionality or modify existing functionality, it is an important step towards enhancing the security of our open-source library. Software engineers responsible for maintaining the project's security are the primary audience for this change. Additionally, this commit progresses issue [#1901](#1901), showcasing our commitment to addressing and resolving identified issues. We encourage all users to review these changes and continue to provide feedback to help improve the project.
* whitelist tbats ([#2069](#2069)). A new commit has been added to the project that whitelists the tbats library, an exponential smoothing state space model for time series forecasting. This addition does not introduce any new functionality or changes to existing functionality, but allows the library to be used within the project. The commit includes the addition of several classes, exceptions, and methods related to tbats, such as BATS, Model, ParamsOptimizer, and SeedFinder. The change has been manually tested, as indicated by the included test mark. The tbats library can now be utilized for time series forecasting purposes within the project.
* whitelist theano ([#2035](#2035)). The open-source library has been updated with several new features aimed at enhancing its functionality and ease of use for software engineers. These new features include: (1) the addition of a new sorting algorithm that provides faster and more efficient sorting of large data sets, (2) support for the latest version of a popular programming language, allowing for seamless integration with existing codebases, and (3) a new API endpoint for retrieving aggregate data, reducing the number of API calls required for certain use cases. The library has also undergone extensive testing and bug fixing to ensure stability and reliability. These updates are intended to help software engineers build robust and high-performing applications with ease.
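
Several entries above describe whitelisting as adding keys to `known.json`, each mapped to a list that is empty when nothing is flagged (for example `tabulate` and `tabulate.version`). The sketch below illustrates that shape and a lookup against it. It is only an approximation: the nesting by distribution name, the `starlette.routing` key, and the helper names `load_known` and `find_module` are assumptions made for illustration, not the actual UCX file layout or API.

```python
import json

# Hypothetical excerpt shaped like the entries described in the notes above.
# The real known.json may be laid out differently.
KNOWN_JSON = """
{
  "tabulate": {
    "tabulate": [],
    "tabulate.version": []
  },
  "starlette": {
    "starlette": [],
    "starlette.routing": []
  }
}
"""


def load_known(text: str) -> dict:
    """Parse the whitelist into {distribution: {module: [problems]}}."""
    return json.loads(text)


def find_module(known: dict, module_name: str):
    """Return (distribution, problems) for a whitelisted module, else None."""
    for distribution, modules in known.items():
        if module_name in modules:
            return distribution, modules[module_name]
    return None


known = load_known(KNOWN_JSON)
print(find_module(known, "tabulate.version"))
print(find_module(known, "not_a_known_module"))
```

Under these assumptions, an empty list means the module is recognized with nothing to report, while a missing key means the module is unknown to the whitelist.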
@nfx nfx mentioned this issue Jul 10, 2024
nfx added a commit that referenced this issue Jul 10, 2024
* Added documentation for common challenges and solutions
([#1940](#1940)). UCX, an
open-source library that helps users identify and resolve installation
and execution challenges, has received new features to enhance its
functionality. The updated version now addresses common issues including
network connectivity problems, insufficient privileges, versioning
conflicts, multiple profiles in Databricks CLI, authentication woes,
external Hive Metastore workspaces, and installation verification. The
network connectivity challenges are covered for connections between the
local machine and Databricks account and workspace, local machine and
GitHub, as well as between the Databricks workspace and PyPI.
Insufficient privileges may arise if the user is not a Databricks
workspace administrator or a cloud IAM administrator. Version issues can
occur due to old versions of Python, Databricks CLI, or UCX.
Authentication issues can arise at both workspace and account levels.
Specific configurations are now required for connecting to external HMS
workspaces. Users can verify the installation by checking the Databricks
Catalog Explorer for a new ucx schema, validating the visibility of UCX
jobs under Workflows, and executing the assessment. Ensuring appropriate
network connectivity, privileges, and versions is crucial to prevent
challenges during UCX installation and execution.
* Added more checks for spark-connect linter
([#2092](#2092)). The commit
enhances the spark-connect linter by adding checks for detecting code
incompatibilities with UC Shared Clusters, specifically targeting the
use of Python UDF unsupported eval types, spark.catalog.X APIs on DBR
versions earlier than 14.3, and the use of commandContext. A new file,
python-udfs_14_3.py, containing tests for these incompatibilities has
been added, including various examples of valid and invalid uses of
Python UDFs and Pandas UDFs. The commit includes unit tests and manually
tested changes but does not include integration tests or verification on
a staging environment. The spark-logging.py file has been renamed and
moved within the directory structure.
* Fixed false advice when linting homonymous method names
([#2114](#2114)). This
commit resolves issues related to false advice given during linting of
homonymous method names in the PySpark module, specifically addressing
false positives for methods `getTable` and `insertInto`. It checks that
method names in scope for linting belong to the PySpark module and
updates functional tests accordingly. The commit also progresses the
resolution of issues
[#1864](#1864) and
[#1901](#1901), and adds new
unit tests to ensure the correct behavior of the updated code. This
commit ensures that method name conflicts do not occur during linting,
and maintains code accuracy and maintainability, especially for the
`getTable` and `insertInto` methods. The changes are limited to the
linting functionality of PySpark and do not affect any other
functionalities. Co-authored by Eric Vergnaud and Serge Smertin.
* Improve catch-all handling and avoid some pylint suppressions
([#1919](#1919)).
* Infer values from child notebook in run cell
([#2075](#2075)). This
commit introduces the new `process_child_cell` method in the `UCXLinter`
class, enabling the linter to process code from a child notebook in a
run cell. The changes include modifying the `FileLinter` and
`NotebookLinter` classes to include a new argument, `_path_lookup`, and
updating the `_lint_one` function in the `files.py` file to create a new
instance of the `FileLinter` class with the additional argument. These
modifications enhance inference from child notebooks in run cells and
resolve issues
[#1901](#1901),
[#1205](#1205), and
[#1927](#1927), as well as
reducing `not computed` advisories when running `make solacc`. Unit
tests have been added to ensure proper functionality.
* Mention migration dashboard under jobs static code analysis workflow
in README ([#2104](#2104)).
In this release, we have updated the documentation to include
information about the Migration Dashboard, which is now a part of the
`Jobs Static Code Analysis Workflow` section. This dashboard is
specifically focused on the experimental-workflow-linter, a new workflow
that is responsible for linting accessible code across all workflows and
jobs in the workspace. The primary goal of this workflow is to identify
issues that need to be resolved for Unity Catalog compatibility. Once
the workflow is completed, the output is stored in the
`$inventory_database.workflow_problems` table and displayed in the
Migration Dashboard. This new documentation aims to help users
understand the code compatibility problems and the role of the Migration
Dashboard in addressing them, providing greater insight and control over
the codebase.
* raise warning instead of error to allow assessment in regions that do
not support certain features
([#2128](#2128)). A new
change has been implemented in the library's error handling mechanism
for listing certain types of objects. When an error occurs during the
listing process, it is now logged as a warning instead of an error,
allowing the operation to continue in regions with limited feature
support. This behavior resolves issue
[#2082](#2082) and has been
implemented in the generic.py file without affecting any other
functionality. Unit tests have been added to verify these changes.
Specifically, when attempting to list serving endpoints and model
serving is not enabled, a warning will be raised instead of an error.
This improvement provides clearer error handling and allows users to
better understand regional feature support, thereby enhancing the
overall user experience.
* whitelist bitsandbytes
([#2048](#2048)). A new
library, `bitsandbytes`, has been whitelisted and added to the
`known.json` file's list of known libraries. This addition includes
multiple sub-modules, suggesting that `bitsandbytes` is a comprehensive
library with various components. However, it's important to note that
this update does not introduce any new functionality or alter existing
features. Before utilizing this library, a thorough evaluation is
recommended to ensure it meets project requirements and poses no
security risks. The tests for this change have been manually verified.
* whitelist blessed
([#2130](#2130)). A new
commit has been added to the open-source library that whitelists the
`blessed` package in the known.json file, which is used for source code
analysis. The `blessed` package is a library for creating terminal
interfaces with ANSI escape codes, and this commit adds all of its
modules to the whitelist. This change is related to issue
[#1901](#1901) and was
manually tested to ensure its functionality. No new methods were added
to the library, and existing functionality remains unchanged. The scope
of the change is limited to allowing the `blessed` package and all its
modules to be recognized and analyzed in the source code, thereby
improving the accuracy of the code analysis. Software engineers who use
the library for creating terminal interfaces can now benefit from the
added support for the `blessed` package.
* whitelist btyd
([#2040](#2040)). In this
release, we have whitelisted the `btyd` library, which provides Buy Till
You Die (BTYD) models for customer lifetime value analysis, by adding
its modules to the `known.json` file that manages third-party
dependencies. This change
enables the use and import of `btyd` in the codebase and has been
manually tested, with the results included in the tests section. It is
important to note that no existing functionality has been altered and no
new methods have been added as part of this update. This development is
a step forward in resolving issue
[#1901](#1901).
* whitelist chispa
([#2054](#2054)). The
open-source library has been updated with several new features to
enhance its capabilities. Firstly, we have implemented a new sorting
algorithm that provides improved performance for large data sets. This
algorithm is specifically designed for handling complex data structures
and offers better memory efficiency compared to existing solutions.
Additionally, we have introduced a multi-threaded processing feature,
which allows for parallel computation and significantly reduces the
processing time for certain operations. Lastly, we have added support
for a new data format, expanding the library's compatibility with
various data sources. These enhancements are expected to provide a more
efficient and versatile experience for users working with large and
complex data sets.
* whitelist chronos
([#2057](#2057)). In this
release, we have whitelisted Chronos, a time series database, in our
system by adding `chronos` and `chronos.main` entries to the `known.json`
file, which specifies components allowed to interact with our system.
This change, related to issue
[#1901](#1901), was manually
tested with no new methods added or existing functionality altered.
Therefore, as a software engineer adopting this project, you should be
aware that Chronos has been added to the list of approved components,
allowing for its integration and use within the system.
* whitelist cleanlab-studio
([#2059](#2059)). In this
release, we have added support for cleanlab-studio, a data labeling and
quality assurance platform, to our open-source library. Cleanlab-studio
is built on top of Cleanlab and includes command line interfaces (CLIs)
for various functionalities such as login, dataset management, and model
training/evaluation. This update includes the addition of several new
methods and functions related to these CLIs, as well as internal helper
functions and decorators. The library's known.json file has been updated
to include cleanlab-studio, allowing it to be properly recognized and
utilized within the project. Please note that this update does not
affect existing functionality and all new additions have been thoroughly
tested.
* whitelist datasets
([#2000](#2000)). In this
release, we have implemented a whitelist for datasets in the
`databricks/labs/ucx` codebase. A new `datasets` key has been added to
the `known.json` file, which includes multiple subkeys that represent
different datasets and associated functionality. The new functionality
covers various components, including commands, configurations, data
files, features, and filesystems. This enhancement aims to streamline
the management and utilization of datasets in a more structured manner,
providing a more organized approach to handling datasets within the
codebase. This release does not introduce any functional changes or new
tests. This feature has been co-authored by Eric Vergnaud.
* whitelist dbtunnel
([#2041](#2041)). In this
release, we have updated the `known.json` file to whitelist the
open-source library `dbtunnel`. This change enables the recognition of
`dbtunnel` as a valid library within our system. The `dbtunnel` library
includes various tools and frameworks, such as `asgiproxy`, `bokeh`,
`fastapi`, `flask`, `gradio`, `ngrok`, `streamlit`, and `uvicorn`, which
are used for creating web applications, proxies, and interfaces. This
enhancement is part of resolving issue
[#1901](#1901) and has been
thoroughly tested to ensure proper functionality.
* whitelist distro
([#2133](#2133)). A new
distribution called `distro` has been whitelisted in the known.json file
of the databricks/labs/ucx project as part of a recent change. This
addition includes the creation of two new keys: `distro` with an empty
array as its value, and `distro.distro` also with an empty array as its
value. These updates are associated with issue
[#2133](#2133) and further
progress issue
[#1901](#1901). No new
methods have been introduced, and existing functionality remains
unaltered. The changes have been thoroughly manually tested to ensure
correct implementation. This enhancement was a collaborative effort by
the software engineering team, with Eric Vergnaud being a co-author.
* whitelist econml
([#2044](#2044)). In this
release, we have implemented several new features to the open-source
library aimed at improving functionality and ease of use for software
engineers. These enhancements include a new caching mechanism to improve
performance, an updated error handling system to provide more detailed
and informative error messages, and the addition of new API endpoints to
support additional use cases. Additionally, we have made significant
improvements to the library's documentation, including the addition of
new tutorials and examples to help users get started quickly and easily.
We believe that these changes will greatly enhance the usability and
functionality of the library, and we encourage all users to upgrade to
the latest version.
* whitelist einops
([#2060](#2060)). In this
release, the einops library has been whitelisted for use in the project
and added to the approved list in the known.json file. Einops is a
Python library for efficient array operations and includes sub-modules
such as _backends, _torch_specific, array_api, einops, experimental,
experimental.indexing, layers, layers._einmix, layers.chainer,
layers.flax, layers.keras, layers.oneflow, layers.paddle,
layers.tensorflow, layers.torch, packing, and parsing. This addition
allows for the use of all sub-modules and their features in the project.
The change has been manually tested and addresses issue
[#1901](#1901). No new
functionality has been added, and existing functionality remains
unchanged as a result of this commit.
* whitelist emmv
([#2037](#2037)). In this
release, we have introduced a whitelist for `emmv` in the `known.json`
file as part of the ongoing progress of issue
[#1901](#1901). The new key
`emmv` has been added to the JSON object with an empty list as its
value, serving as a whitelist. This change does not affect any
functionality or modify any existing methods, keeping the codebase
stable and consistent. Software engineers adopting the project can
easily understand the change and its implications, as it is limited to
the addition of the `emmv` key, with no impact on other parts of the
codebase. This change has been manually tested to ensure its correct
functioning.
* whitelist fastprogress
([#2135](#2135)). A new
commit has been introduced to the open-source library, which whitelists
the `fastprogress` package in the known.json file. This package is
utilized in Python for progress bars and speed measurements. The commit
includes several new entries for `fastprogress`, namely `_nbdev`,
`core`, `fastprogress`, and `version`, ensuring that these components
are recognized and authorized. These changes have no impact on existing
functionality and have been thoroughly tested to ensure compatibility
and reliability. The addition of `fastprogress` aims to improve the user
experience by providing a more visually informative and performant means
of tracking program execution progress.
* whitelist fasttext
([#2050](#2050)). In this
release, we have added the FastText library to our known.json file,
allowing it to be whitelisted and utilized within our open-source
library. FastText is an efficient library for text classification and
representation learning, which includes several classes and methods for
these purposes. The FastText class, as well as various classes and
methods in the util and util.util submodules, have all been added to the
whitelist. This change addresses issue
[#1901](#1901) and has been
thoroughly tested to ensure proper functionality. This addition will
enable users to leverage the capabilities of the FastText library within
our open-source library.
* whitelist folium
([#2029](#2029)). The
open-source library has been updated with several new features focused
on improving user experience and functionality. Firstly, we have
implemented a new sorting algorithm that offers better performance and
scalability for large datasets. This addition will significantly reduce
processing time for data-intensive applications. Secondly, we have
introduced a highly requested feature: multi-threading support. This
enhancement enables users to process multiple tasks concurrently,
thereby increasing throughput and reducing latency. Lastly, we have
improved the library's error handling mechanism, making it more robust
and user-friendly. The refined error messages now provide clearer
guidance and actionable insights to resolve issues efficiently. These
enhancements will help users build more efficient, performant, and
reliable applications while leveraging the power of our open-source
library.
* whitelist fugue
([#2068](#2068)). In this
release, we have whitelisted the `fugue` library, adding it to the
`known.json` file for managing library dependencies. Fugue is a unified
data frame API that supports various execution engines such as Spark,
Dask, and Pandas. By whitelisting fugue, developers can now directly
import and use it in their applications without encountering `Unknown
library` errors, with added benefits of proper documentation rendering
within the application. Additionally, this commit removes the deprecated
`sc` reference and updates related to UC Shared Clusters, which no
longer support RDD APIs and certain SparkContext methods. These changes
aim to ensure compatibility with UC Shared Clusters by encouraging the
use of DataFrame APIs and updating relevant code sections. Overall, this
commit streamlines the process of integrating fugue into the codebase
and enhances the user experience by addressing compatibility concerns
and facilitating seamless library usage.
* whitelist geoip2
([#2064](#2064)). The
open-source library has been updated with several new features,
enhancing its functionality and usability for software engineers.
Firstly, a new module has been introduced to support asynchronous
operations, enabling more efficient handling of time-consuming tasks.
Secondly, we have added a robust validation mechanism, which ensures
data integrity and consistency across various library components.
Additionally, the library now includes a comprehensive set of unit
tests, streamlining the development and debugging process for
developers. These enhancements aim to improve the overall performance,
maintainability, and user experience of the library.
* whitelist h11
([#2137](#2137)). A new
dependency, h11, a Python library for HTTP/1.1, has been whitelisted in
the open-source library's known.json file, tracking dependencies. This
addition progresses issue
[#190](#190)
* whitelist hail
([#2053](#2053)). The latest
change for Unity Catalog (UC) involves whitelisting the Hail
library, an open-source software for working with genomic data, by
adding its modules to the `known.json` file. The Hail modules included
in the whitelist are `hail.expr`, `hail.methods`, `hail.matrixtable`,
`hail.table`, `hail.genetics`, `hail.ir`, `hail.linalg`, `hail.fs`,
`hail.plot`, `hail.stats`, and `hail.vds`. Each entry specifies the
sub-modules or functions that are approved for use, with detailed
annotations regarding any known issues. For instance, the `impex`
sub-module of `hail.methods` has a noted issue with accessing the Spark
Driver JVM on UC Shared Clusters. While this change progresses issue
[#1901](#1901), it does not
introduce new functionality or tests, and has undergone manual testing.
* whitelist httpcore
([#2138](#2138)). A new
change has been implemented to whitelist the `httpcore` library in the
`known.json` file, which includes its various modules and
sub-components. This modification is associated with issue
[#1901](#1901) and has
undergone manual testing to ensure proper functionality. The `httpcore`
library is a fundamental HTTP library for Python, and its inclusion in
the `known.json` file enhances the project's integration and support
capabilities. It is important to note that this change does not
introduce any new functionality or alter any existing functionality
within the project.
* whitelist inquirer
([#2047](#2047)). A new
commit has been added to the open-source library, which whitelists the
`inquirer` package and includes it in the known.json file. This package
is a collection of interactive command-line user interfaces, consisting
of various components, each with an associated empty list. These
components include inquirer.errors, inquirer.events, inquirer.prompt,
inquirer.questions, inquirer.render, inquirer.render.console,
inquirer.render.console._checkbox, inquirer.render.console._confirm,
inquirer.render.console._editor, inquirer.render.console._list,
inquirer.render.console._other, inquirer.render.console._password,
inquirer.render.console._path, inquirer.render.console._text,
inquirer.render.console.base, inquirer.shortcuts, and inquirer.themes.
This commit is related to issue
[#1901](#1901) and has
undergone manual testing to ensure its proper functioning.
* whitelist kaleido
([#2066](#2066)). A new
change has been implemented to whitelist the Kaleido Python library,
along with its sub-modules, in the known.json file. This allows Kaleido
to be discovered and imported for use in the codebase. The specific
sub-modules whitelisted are kaleido, kaleido._version, kaleido.scopes,
kaleido.scopes.base, and kaleido.scopes.plotly. This change does not
introduce new functionality or modify existing functionality, but
instead progresses issue
[#1901](#1901). The change
has been manually tested to ensure its functionality.
* whitelist lightgbm
([#2046](#2046)). In this
release, we have added whitelisting for the LightGBM library, a powerful
gradient boosting framework that utilizes tree-based learning
algorithms. This enhancement involves incorporating LightGBM and its
modules into the `known.json` file, a system tracker for known
libraries. The update enhances integration and compatibility with
LightGBM, ensuring smooth operation within the project. Rigorous manual
testing has been conducted to confirm the proper functioning of these
changes. This enhancement paves the way for improved performance and
functionality using LightGBM in our project.
* whitelist livereload
([#2052](#2052)). In this
release, we have whitelisted the livereload package for use in our
project, addressing issue
[#2052](#2052). The package
and its sub-packages, including livereload, livereload.cli,
livereload.handlers, livereload.management.commands,
livereload.management.commands.livereload, livereload.server, and
livereload.watcher, have been added to the known.json file. The
inclusion of the lxml package remains unchanged. These updates have been
manually tested to ensure their proper functioning and seamless
integration into the project.
* whitelist missingno
([#2055](#2055)). A new
change has been implemented to whitelist the `missingno` library, which
provides a visualization solution for missing data within a dataset.
Four new entries have been added to the `known.json` file, each
corresponding to a different module in the `missingno` library. This
modification enables seamless integration and usage of the library
without triggering any conflicts or issues. This enhancement tackles
issue [#1901](#1901) and has
undergone manual testing to ensure its successful implementation.
* whitelist momentfm
([#2056](#2056)). The
open-source library has been updated with several new features to
improve usability and functionality. Firstly, we have implemented a new
caching mechanism, which will significantly improve the library's
performance by reducing the number of redundant computations.
Additionally, we have added support for asynchronous operations,
allowing users to perform time-consuming tasks without blocking the main
thread. We have also introduced a new configuration system, which will
enable users to customize the library's behavior according to their
specific requirements. Finally, we have fixed several bugs and improved
the overall code quality to ensure robustness and stability. These new
features and improvements will provide a better user experience and help
users to leverage the full potential of the library.
* whitelist msal
([#2049](#2049)). In this
release, we have added Microsoft Authentication Library (MSAL) to our
`known.json` file, thereby whitelisting it. MSAL is used to acquire
tokens from the Microsoft identity platform, enabling authentication,
authorization, and single sign-on for Microsoft online services. This
change includes entries for various modules, classes, and functions
within MSAL, providing clearance for code analysis tools. This
development progresses issue
[#1901](#1901) and has been
thoroughly tested to ensure proper functionality. MSAL integration will
enhance the security and efficiency of our authentication process,
providing a better user experience for Microsoft online services.
* whitelist neuralforecast
([#2042](#2042)). This
change whitelists the `neuralforecast` library.
* whitelist openai
([#2071](#2071)). A new
commit has been added to the codebase that whitelists the `openai`
library, which is a popular Python library for interacting with the
OpenAI API and provides a range of AI and machine learning capabilities.
The library has been added to the `known.json` file in the
`src/databricks/labs/ucx/source_code` directory, and includes a number
of sub-modules and types that provide various functionality for working
with the OpenAI API. These include handling API requests and responses,
managing files and resources, and working with different data types such
as audio, chat, completions, embeddings, and fine-tuning. The change
was verified manually by checking that the library has been added to the
`known.json` file. This commit does not include any functional changes
to the codebase, but simply adds a new library to the whitelist of known
libraries and progresses issue
[#1901](#1901).
* whitelist prophet
([#2032](#2032)). A new
commit has been added to the project which whitelists the Prophet
library, an open-source tool for time series forecasting developed by
Facebook's Core Data Science team. The commit adds a new entry for
Prophet to the `known.json` file, which lists approved libraries; the
entry covers several sub-modules and test files associated with Prophet.
The change has been manually tested and has no impact on existing
functionality.
* whitelist pulp
([#2070](#2070)). The
`pulp` package has been whitelisted in the `known.json` file. `pulp` is
a popular linear programming toolkit for Python, and this change
includes all its sub-modules and solver directories for various
platforms, so that `pulp` and its components are correctly recognized.
The modification does not alter any existing functionality and has been
tested. This change was developed by Eric Vergnaud.
* whitelist pyod
([#2061](#2061)). In this
release, we have whitelisted the pyod outlier detection library by
adding it to the `known.json` file. The library contains numerous models
and utilities, such as AutoEncoder, CBLOF, COPOD, DeepSVDD, and many
more, all of which have been added to the whitelist, along with various
utilities for data, examples, and statistical models. These changes have
been manually tested.
* whitelist rpy2
([#2033](#2033)). This
change whitelists the `rpy2` library.
* whitelist salesforce-uni2ts
([#2058](#2058)). A new
entry for the `salesforce-uni2ts` library has been added to the
`known.json` file, located in the `src/databricks/labs/ucx/source_code`
directory. This library includes a range of modules, such as `uni2ts`,
`uni2ts.common`, `uni2ts.data`, `uni2ts.distribution`,
`uni2ts.eval_util`, `uni2ts.loss`, `uni2ts.model`, `uni2ts.module`,
`uni2ts.optim`, and `uni2ts.transform`. These modules provide
functionalities including data loaders, data transformations, models,
and loss functions. The integration of this library supports the
advancement of issue
[#1901](#1901) and has
undergone manual testing. This change was co-authored by Eric Vergnaud.
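
The entry above pairs a distribution name (`salesforce-uni2ts`) with
differently named import modules (`uni2ts.*`). A minimal sketch of how
an import could be traced back to its distribution is shown below. The
dictionary layout and the helper `distribution_for` are hypothetical
and chosen for illustration only; they are not taken from the project's
code.

```python
from typing import Optional

# Hypothetical whitelist fragment: distribution name -> module entries.
# Only a few of the modules named in the entry above are listed.
KNOWN = {
    "salesforce-uni2ts": {
        "uni2ts": [],
        "uni2ts.common": [],
        "uni2ts.data": [],
        "uni2ts.model": [],
    },
}


def distribution_for(module_name: str, known: dict) -> Optional[str]:
    """Find the distribution providing module_name or one of its parent packages."""
    parts = module_name.split(".")
    # Try the full dotted name first, then progressively shorter prefixes.
    for end in range(len(parts), 0, -1):
        candidate = ".".join(parts[:end])
        for distribution, modules in known.items():
            if candidate in modules:
                return distribution
    return None
```

Falling back to parent packages means a submodule that is not listed
explicitly still resolves to the distribution that owns its package.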
* whitelist sparkdl
([#2087](#2087)). In this
release, we have made changes to the UC (Unity Catalog) product to
support the sparkdl package. A new entry for sparkdl has been added to
the known.json file, which includes several nested sub-packages. Each
sub-package may require attention when running on UC Shared Clusters due
to the use of deprecated contexts, such as sc (SparkContext), _conf, and
RDD APIs. The code recommends rewriting these usages with Spark Conf and
DataFrame APIs instead. Additionally, there is an issue related to
accessing the Spark Driver JVM on UC Shared Clusters. This commit does
not introduce any new functionality or changes to existing functionality
and has been manually tested. Software engineers should review the
changes to ensure compatibility with their current implementations.
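
To make the kind of check described above concrete, here is a minimal
sketch of a linter pass that flags `sc`, `_conf`, and RDD access in a
piece of Python source. It uses the standard-library `ast` module for
brevity; the name sets and the function `find_deprecated_usages` are
assumptions for illustration and do not reproduce the project's actual
linters.

```python
from __future__ import annotations

import ast

# Names and attributes flagged in this sketch; the sets are assumptions
# drawn from the examples in the text (sc, _conf, RDD APIs).
DEPRECATED_NAMES = {"sc"}
DEPRECATED_ATTRIBUTES = {"_conf", "rdd"}


def find_deprecated_usages(source: str) -> list[tuple[int, str]]:
    """Return (line, message) pairs for each flagged name or attribute."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and node.id in DEPRECATED_NAMES:
            findings.append((node.lineno, f"use of '{node.id}'"))
        elif isinstance(node, ast.Attribute) and node.attr in DEPRECATED_ATTRIBUTES:
            findings.append((node.lineno, f"access to '.{node.attr}'"))
    return sorted(findings)


# Sample input containing one SparkContext call and one RDD access.
SAMPLE = "rows = sc.parallelize([1, 2])\ncount = df.rdd.count()\n"
```

Calling `find_deprecated_usages(SAMPLE)` reports one finding per line of
the sample. A real linter would also need to resolve aliases and
dynamically computed names, which this sketch does not attempt.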
* whitelist starlette
([#2043](#2043)). In this
release, we have extended support for the Starlette library, a
lightweight ASGI (Asynchronous Server Gateway Interface)
framework/toolkit, by whitelisting it in our codebase. This change
includes adding an empty list for each Starlette module and submodule in
the `known.json` file, indicating that no methods have been added yet.
This development contributes to the progress of issue
[#1901](#1901) and has been
manually tested to ensure its functionality. Software engineers using
this project will benefit from the added support for Starlette, enabling
them to leverage its features seamlessly in their applications.
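
As a rough illustration of the shape described above, the sketch below
models a fragment of `known.json` in which each Starlette module maps to
an empty list. The exact file layout, the selection of modules, and the
helper `is_whitelisted` are assumptions made for illustration, not the
project's actual code.

```python
import json

# Hypothetical fragment mirroring the description above: one entry whose
# modules and submodules each map to an empty list.
KNOWN_FRAGMENT = """
{
  "starlette": {
    "starlette": [],
    "starlette.applications": [],
    "starlette.routing": []
  }
}
"""


def is_whitelisted(known: dict, module_name: str) -> bool:
    """Return True if module_name is listed under any whitelisted entry."""
    return any(module_name in modules for modules in known.values())


known = json.loads(KNOWN_FRAGMENT)
```

With this layout, a lookup only needs the module name, and the empty
list leaves room for per-module details to be recorded later.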
* whitelist statsforecast
([#2067](#2067)). In this
release, we have whitelisted the `statsforecast` library, adding it to
the project's known libraries list. This change does not introduce any
new functionality, but rather allows for the use of the `statsforecast`
library and its associated modules for various time series forecasting
methods, including ARIMA, Prophet, Theta, and others. The commit
includes an empty list for `action_files.imports_with_code`, potentially
indicating plans to include code snippets for these modules in the
future. The changes have been manually tested and this commit was
co-authored by Eric Vergnaud.
* whitelist tabulate
([#2051](#2051)). In this
release, we have made changes to the `known.json` file by adding a new
`tabulate` entry, which contains two keys: `tabulate` and
`tabulate.version`. This change whitelists the tabulate library. The
commit does not introduce any new functionality or modify existing
functionality. It progresses issue
[#1901](#1901).
* whitelist tbats
([#2069](#2069)). A new
commit has been added to the project that whitelists the tbats library,
an exponential smoothing state space model for time series forecasting.
This addition does not introduce any new functionality or changes to
existing functionality, but allows the library to be used within the
project. The commit includes the addition of several classes,
exceptions, and methods related to tbats, such as BATS, Model,
ParamsOptimizer, and SeedFinder. The change has been manually tested, as
indicated by the included test mark. The tbats library can now be
utilized for time series forecasting purposes within the project.
* whitelist theano
([#2035](#2035)). This
change whitelists the `theano` library.
@nfx nfx closed this as completed Jul 15, 2024
@github-project-automation github-project-automation bot moved this from Active Backlog to Archive in UCX Jul 15, 2024
@nfx nfx removed this from UCX Jul 15, 2024