Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Classify imports correctly with isort #93

Merged
merged 3 commits into from
Feb 2, 2023

Conversation

jherland
Copy link
Member

It almost feels like I am cheating, leaning so heavily on isort to correctly classify imports 😅. That said, correctly differentiating between first- and third-party imports (as well as the other easier-to-discern categories) is certainly part of isort's core logic, so we're better off reusing their solution rather than rolling our own from scratch.

It does come at the price of having to configure isort correctly, which is not trivial: I took a look at the complete isort.Config definition, and it is suprisingly large and complex. For now, I just hope that we can get away by setting only the minimal stuff we need, and that the defaults are otherwise working in our favor. Another - smaller - cost is that this brings us further away from removing isort as a core dependency, something we were contemplating before this PR (issue #26).

This PR should directly solve issue #82, as well as tasks/checkboxes 2 and 3 on issue #89, and the first two tasks/checkboxed on issue #90.

Commits:

  • extract_imports: Use isort to filter out more non-external imports
  • extract_imports: Configure isort to consider ALL versions of the stdlib
  • extract_imports: Configure isort to correctly find first-party imports

We are only interested in finding imports that refer to _external_ code
(that is, modules/packages for which a corresponding dependency should
be declared).

Until now, we've used isort.place_module() to identify (and ignore)
stdlib modules, but there are more module types we need to ignore:

- from __future__ import ...: This is a special statement handled by the
  Python language itself (used to backport newer language features into
  older language versions).
- from fawltydeps.utils import ...: I.e. self-imports, a module that is
  part of a package will often import other modules from the same
  package.
- from utils import ...: Very similar to the above, a module will often
  import another module found in the same directory (or less commonly,
  found anywhere else in the configured $PYTHONPATH).

We haven't tackled these kinds of imports until now, but as it happens,
we have been sitting on the solution this whole time:

isort.place_module() returns a string that classifies the kind of import
passed to it. AFAICS[1] the possible return values are:

- "FUTURE": from __future import ...
- "STDLIB": Imports of stdlib modules, the one category we have used
            until now for ignoring modules.
- "THIRDPARTY": AFAICS, this corresponds _exactly_ to the external
                imports that we are ultimately interested in.
- "FIRSTPARTY": Corresponds to the last two categories above: self-
                imports, as well as imports of local modules/packages.
- "LOCALFOLDER": AFAICS, this corresponds only to _relative_ imports
                 (from .utils import ...) which we already ignore in
                 our parser (see the latter half of parse_code()).

So by flipping our test from:

    def is_stdlib_import(name: str) -> bool:
        return isort.place_module(name) == "STDLIB"

to:

    def is_external_import(name: str) -> bool:
        return isort.place_module(name) == "THIRDPARTY"

we can - in principle - begin to ignore both __future__ and first-party
imports. However, there are some complications (to be solved in the
following commits):

- For STDLIB imports, isort must be told which _versions_ of the Python
  stdlib should be considered for valid stdlib import names.
- For correctly differentiating between FIRSTPARTY and THIRDPARTY
  imports, isort must be told about the location of project itself, i.e.
  where to look for potential first-party imports.

[1]: https://github.com/PyCQA/isort/blob/5.12.0/isort/sections.py and
https://github.com/PyCQA/isort/blob/5.12.0/isort/place.py contain the
declaration of these strings, as well as the code that implements
isort.place_module(), respectively.
In the `requests` project, we find the following code:

    try:  # Python 3
        from http.server import HTTPServer, SimpleHTTPRequestHandler
    except ImportError:  # Python 2
        from BaseHTTPServer import HTTPServer
        from SimpleHTTPServer import SimpleHTTPRequestHandler

where the try-block attempts to import things from the Python 3 stdlib,
and the except clause falls back to attempting the same imports from
their corresponding locations in the Python 2 stdlib.

FawltyDeps, at this point in time, finds _all_ the above import
statements, and has no logic to determine that some of them are dead
code on Python 3. It merely asks isort whether each of them are STDLIB
modules (by passing "http", "BaseHTTPServer", and "SimpleHTTPServer" to
isort.place_module() in sequence).

By default, isort only considers the superset of Python 3 versions when
building its set of stdlib modules, which means that the latter two
modules above are NOT considered part of the stdlib. As a result FD
then reports them as undeclared dependencies.

Fix this by configuring isort to consider _all_ versions (i.e. Python 2
_and_ Python 3) of the Python stdlib.

There is a (small) risk associated with this change: Say that a project
uses an external dependency whose import name overlaps with a stdlib
module from an old version of Python (a version on which this particular
project will never run). In this case FD will mistakenly assume that
this external dependency is an stdlib import:

- If the external dependency is not declared, FD will incorrectly omit
  it from the list of undeclared deps.
- If the external dependency is declared, FD will incorrectly report it
  as an unused dependency.

At this point in time we assume that the number of projects that
contain fallback imports for Python 2 far outweighs the number of
projects that include external dependencies that happen to overlap with
outdated stdlib modules.
Recall that isort.place_module() returns one of these strings: "FUTURE"
(for `__future__` imports), "STDLIB" (for stdlib imports, see parent
commit for more details), "THIRDPARTY", "FIRSTPARTY", and "LOCALFOLDER"
(relative imports).

These categories are fairly clearly defined at this point, except for
the exact difference between FIRSTPARTY and THIRDPARTY: In the Python
syntax itself, there is nothing to differentiate these: `import foo`
could refer to a foo.py in the same directory (a FIRSTPARTY import), or
it could refer to a "foo" package from PyPI (a THIRDPARTY import).
Likewise, `from foo_package import bar` could refer to the _current_
package (or a sub-package of the current package), or it could refer to
an external dependency.

It all comes down to the context in which the import is made.
isort.place_module() is able to discern this context, as long as its
`directory` parameter is configured correctly, i.e. to a directory where
FIRSTPARTY imports can be found/resolved.

Anything that isort is unable to resolve/find inside this directory
defaults to being classified as THIRDPARTY.

Since we're specifically looking for THIRDPARTY imports, it is therefore
important that this `directory` is correctly configured, otherwise we
will report first-party imports as undeclared external dependencies!

For now, we fix this by setting `directory` to the path that is passed
to parse_dir(), and otherwise defaulting back to the current directory.

TODO/FIXME: There are some scenarios where this is probably not the
correct solution, for example:

- Passing --code as a single file or a stdin stream will bypass
  parse_dir() altogether, and the current directory will be used to find
  FIRSTPARTY imports. This may or may not be what the user intended.
- The project may do trickery with $PYTHONPATH or other mechanisms to
  alter where it is appropriate to look for FIRSTPARTY imports.

We may have to consider adding command-line (and configuration) options
to FawltyDeps for directly specifying this `directory`, and/or for
including/excluding specific names as FIRSTPARTY imports.
Copy link
Collaborator

@Nour-Mws Nour-Mws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is unsettling how many issues this PR manages to solve in a concise and elegant way 🤯
Very helpful commits as usual, and it was thrilling to see the dependencies of requests disappear one by one.

Just a note that we should not forget to open an issue to hold the following:

TODO/FIXME: There are some scenarios where this is probably not the
correct solution, for example:

  • Passing --code as a single file or a stdin stream will bypass
    parse_dir() altogether, and the current directory will be used to find
    FIRSTPARTY imports. This may or may not be what the user intended.
  • The project may do trickery with $PYTHONPATH or other mechanisms to
    alter where it is appropriate to look for FIRSTPARTY imports.

@jherland
Copy link
Member Author

jherland commented Feb 1, 2023

Just a note that we should not forget to open an issue to hold the following:

Created #95 now.

@jherland jherland merged commit 564e02d into main Feb 2, 2023
@jherland jherland deleted the jherland/classify-imports-correctly-with-isort branch February 2, 2023 09:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Packages may refer to themselves in imports
2 participants