Update CI, in line with gitdb #53

Merged on Sep 18, 2023

EliahKagan
Contributor

This updates smmap's CI configuration in ways that are in line with recent updates to gitdb's. In most cases there is no difference in the changes, and the reason for the updates is more to avoid confusing differences than from the value of the changes themselves. In one case, there is a major difference (fetch-depth), and I think the change there is clarifying, at least when the repositories are compared.

gitdb PRs these changes correspond to:

As noted in gitpython-developers/gitdb#91 (comment), I don't intend to spend a lot of time proposing new features for gitdb and smmap. Having just done #52, which corresponds to gitpython-developers/gitdb#94, it was bugging me a bit that I had arbitrarily proposed the other changes in gitdb but not here.

This updates smmap's CI configuration in ways that are in line with
recent updates to gitdb's. In most cases there is no difference in
the changes, and the reason for the updates is more to avoid
confusing differences than from the value of the changes
themselves. In one case, there is a major difference (fetch-depth).

- gitpython-developers/gitdb#89 (same)

- gitpython-developers/gitdb#90 (same)
  It's just the project, not dependencies, but otherwise the same.

- gitpython-developers/gitdb#92 (opposite)
  This is the major difference. We don't need more than the tip of
  the branch in these tests. Keeping the default fetch-depth of 1
  by not setting it explicitly avoids giving the impression that
  the tests here are doing something they are not (and also serves
  as a speed optimization).

- gitpython-developers/gitdb#93 (same)
@Byron
Member

Byron commented Sep 18, 2023

Thanks a lot, it's much appreciated.

I hope that while working with the code in these projects, an idea might form on how to merge them into GitPython to avoid having this duplication in the first place.

@Byron Byron merged commit 0257382 into gitpython-developers:master Sep 18, 2023
@EliahKagan EliahKagan deleted the ci branch September 18, 2023 07:12
@EliahKagan
Contributor Author

I hope that while working with the code in these projects, an idea might form on how to merge them into GitPython to avoid having this duplication in the first place.

Unfortunately, not really, unless I have misunderstood what you're looking to do. I think you're looking for a way to include just the parts of them that GitPython really needs. This is something I might have insights about when I have greater knowledge of how GitPython works, but not at this time.

However, if you just want to include them in full in GitPython, of course that could be done. I suspect it would not even be a breaking change to GitPython. (If people are using GitPython and also separately using gitdb or smmap while relying on getting them as indirect dependencies through GitPython, I'd say that is already a bug in how their dependencies are declared.)
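As a rough illustration of the point about dependency declarations, here is a minimal sketch that checks whether an installed distribution itself declares a given dependency, rather than merely receiving it indirectly. The helper names `declared_requirements` and `declares` are hypothetical; only the standard-library `importlib.metadata` API is assumed.

```python
import re
from importlib import metadata


def declared_requirements(dist_name):
    """Return the requirement strings a distribution declares, or [] if it is not installed."""
    try:
        requires = metadata.requires(dist_name)
    except metadata.PackageNotFoundError:
        return []
    return list(requires or [])


def declares(dist_name, dependency):
    """Report whether `dependency` is among the requirements declared by `dist_name`."""
    wanted = dependency.lower().replace("_", "-")
    for req in declared_requirements(dist_name):
        # A requirement string starts with the project name, followed by
        # optional extras, version specifiers, or environment markers.
        name = re.split(r"[\s<>=!~;\[(]", req, maxsplit=1)[0]
        if name.lower().replace("_", "-") == wanted:
            return True
    return False
```

A project that imports gitdb directly could run `declares("its-own-distribution-name", "gitdb")` (the distribution name here is a placeholder) to see whether that use is backed by an explicit declaration.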

I'm not advocating this, though. It seems to me that it would not be worthwhile unless it brings further improvements, reductions in code, etc.

If this were to be done, then I think much of the discussion of it in gitpython-developers/GitPython#511 remains applicable. The best (or: least bad) way to do it would, I think, be to make gitdb and smmap Python submodules/subpackages (not to be confused with Git submodules) of the git module/package GitPython already provides. That should be straightforward, with minimal code changes, but the tests would also need to be merged in, which might require greater care (this would be easier, but I think still not trivial, after some other improvements to the tests in GitPython, and after its CI is more thorough). GitHub supports transferring issues between repositories, so that could be done, and then although the gitdb and smmap repositories should never go away, they could be marked as archived after a fairly short while.

To reiterate, I do not advocate this in this form. I think something like this is only worthwhile if it carries other benefits. Right now, there is considerable code duplication across the two CI test workflows in GitPython, which strikes me as a more serious issue because, with GitPython being much more active, its workflows are, and should be, read more often than here, and probably changed more often too. (That can itself be solved, and maybe should, but I'm inclined to put that off until native Windows CI jobs are added, because otherwise the way the deduplication is done might turn out to be totally unsuitable.)

If the goal is just to stop having three GitHub repositories, then the GitPython repository could be reorganized into a monorepo that hosts all three packages while having them remain separate on PyPI. It seems to me that this is not really worthwhile by itself either, though, given that all these repositories already exist and have been in use for a while.

@Byron
Member

Byron commented Sep 19, 2023

Thanks for the analysis!

Unfortunately, not really, unless I have misunderstood what you're looking to do. I think you're looking for a way to include just the parts of them that GitPython really needs. This is something I might have insights about when I have greater knowledge of how GitPython works, but not at this time.

This is correct - the idea is to only have to deal with a single repository, the one of GitPython, and retire all other repositories and packages. Given how much time it would take to trim down GitDB (which would probably allow to remove smmap entirely) that idea can probably simply be dropped. To me it would already be a huge boon if I wouldn't have to deal with interdependencies and versions of 3 separate packages anymore. Doing so would most certainly be a breaking change as people wouldn't be able to use gitdb directly anymore, but that shouldn't be anyone, and it's fair to mark such a change as breaking via semver.

From the options provided, maybe there is one that incurs the smallest cost? Maybe tests could also be kept separate enough to mostly work as is without trying to deduplicate too much of the testing boilerplate? These are stable, so don't cost anything by now.

@EliahKagan
Contributor Author

This is correct

From context, I think you mean that what was correct was that I had misunderstood you before. The rest of this comment assumes that. Please let me know if I have misunderstood (again?).

the idea is to only have to deal with a single repository, the one of GitPython, and retire all other repositories and packages.

Oh! :)

I think (a) splicing smmap and gitdb into the GitPython repository and making them subpackages (in the Python sense) of the top-level git module, and (b) transferring the GitHub issues and closing the few PRs on the gitdb and smmap repos, and (c) moving their unit tests in, and (d) moving or otherwise covering their CI logic... would actually be the easy parts, both in the sense that they should not be too hard and in the sense that there's another much harder part.

I recently found out that a number of GitPython's tests rely on having those Git submodules in the repository. Also, it is not limited to tests of submodule handling. It's probably more than the 10 tests shown there; that CI output is misleading (if one doesn't notice this line) because of the --maxfail=10 option configured for pytest in pyproject.toml.

So (e) modifying the GitPython test suite so tests no longer expect those to exist as submodules in the GitPython repository would probably be the bulk of the work. Fortunately, that could be done first, separately, and I think should, since no longer having them as Git submodules is its own benefit, and since this might help in avoiding too much at once (on the same feature branch).

To me it would already be a huge boon if I wouldn't have to deal with interdependencies and versions of 3 separate packages anymore.

The gitdb readme still suggests using gitdb-speedups. Is that still relevant in any way (to this or otherwise)?

Doing so would most certainly be a breaking change as people wouldn't be able to use gitdb directly anymore, but that shouldn't be anyone, and it's fair to mark such a change as breaking via semver.

I want to say that, unless the ability to declare only GitPython as a dependency and directly import gitdb and smmap is documented somewhere, removing that ability (while retaining the ability to obtain those packages, albeit old unmaintained versions) is not a breaking change. However, that impulse of mine may very well be wrong: GitPython is a very popular library, and it may be that many people are expecting to be able to do that.

@Byron
Member

Byron commented Sep 19, 2023

the idea is to only have to deal with a single repository, the one of GitPython, and retire all other repositories and packages.

Oh! :)

For context, if I were to lay out a project like GitPython today, I'd definitely keep everything related to it in a single repository, while using subpackages.

To me it would already be a huge boon if I wouldn't have to deal with interdependencies and versions of 3 separate packages anymore.

The gitdb readme still suggests using gitdb-speedups. Is that still relevant in any way (to this or otherwise)?

I don't even know if these are tested and built as part of CI, and I'd be hesitant to endorse any C-code written by me. Further, the speedup it promises is ridiculous in comparison to how slow all of it is. GitDB really shouldn't be used, neither the python version nor, and particularly so, any C-extensions of it.


❤️!

The above reaction is due to the fact that you are saying that (a), (b), (c) and (d) would actually be simple, because it's my belief that (e) can even be postponed indefinitely if needed. After all, there is no issue in keeping submodules to archive repositories.

(e) also touches on a much larger issue, representing a possible first step: the lack of isolation of the test suite. Many tests require the parent repository to be available and in a certain state, including the availability of a reflog, which is where a lot of the issues in running the tests come from. However, I think it's a major amount of work to fix this and maybe it's not worth it unless it's rewarding for the person doing it in other ways.

So in a way, I'd avoid the bulk of the work as I don't see a benefit in removing these submodules unless one is willing to tackle the isolation problem of the entire test-suite, or at least, see it as part of that. It's entirely possible to alter only the submodule tests to work in isolation, and maybe from there start tackling the isolation problem of the test-suite as a whole. But that, to me, is entirely optional (despite being valued, and valuable), in favor of (a), (b), (c) and (d) to reduce the maintenance burden.

@EliahKagan
Contributor Author

EliahKagan commented Sep 22, 2023

Given how much time it would take to trim down GitDB (which would probably allow to remove smmap entirely) that idea can probably simply be dropped.

Or perhaps consolidating the repositories as you are proposing would enable someone to come along later and do that project more easily.

If/when gitdb and smmap are moved into GitPython and made into Python subpackages of git, should they be conceptually public, so that people using GitPython can rely on their presence and on them providing particular things? Or should they be conceptually private to GitPython (perhaps as _gitdb and _smmap)?

Also, this is a premature question that you definitely don't have to answer at this time, but do copyright notices from gitdb and smmap, other than those in Python modules, need to be copied over to GitPython when the gitdb and smmap code are moved into the GitPython source tree? That is, does the LICENSE file in GitPython need to be expanded, or something, or can this be avoided? I ask because the 3-clause BSD license used in all three repositories requires the copyright notice found in the license ("the above copyright notice") to be preserved. The copyright notice in GitPython's license file is:

Copyright (C) 2008, 2009 Michael Trier and contributors

In contrast, the copyright notices in gitdb's and smmap's license files are:

Copyright (C) 2010, 2011 Sebastian Thiel and contributors

Intuitively it feels like this is simply a matter of asking you if you are okay with the GitPython one applying also to your work in gitdb and smmap. However, perhaps contributors also only license their work in gitdb and smmap to be redistributed if they are credited with "Sebastian Thiel and contributors" rather than "Michael Trier and contributors". I'm not a lawyer and I'm not sure what the best thing is to do for that, though my guess is that it would be sufficient to state somewhere that GitPython includes code from gitdb and smmap, and also have both lines, separately, at the top of GitPython's LICENSE file, like:

Copyright (C) 2008, 2009 Michael Trier and contributors
Copyright (C) 2010, 2011 Sebastian Thiel and contributors

Alternatively, there could be a separate file for such other notices, as many large projects have.

Related, though probably independent of license requirements and thus not in any way a substitute for figuring the above out: All the contributors and their contributions will be shown by Git if, instead of (0) doing the reasonable thing and copying the code from smmap and gitdb into GitPython straightforwardly, I were to do weird surgery, either (1) merging them in, though then GitPython would have three separate initial commits due to the merge of unrelated histories, or (2) doing some kind of rewrite of the paths, so they don't clash with the history of existing files in GitPython, and rebasing them onto GitPython, though then hundreds of commits from ancient times would appear at the tip of GitPython's main branch.

The above reaction is due to the fact that you are saying that (a), (b), (c) and (d) would actually be simple

Yes, I think so. Well, or at least not too hard. :)

After all, there is no issue in keeping submodules to archive repositories.

Yes, the archived status would not be an issue. That's arguably better than having Git submodules to non-archived repositories, because the archived status would make the situation clearer. I do think there is potential for confusion with multiple copies of gitdb and smmap under a local GitPython source tree, though, or even with one copy of them that isn't used. It's easy to open the wrong file, and easy for editors and IDEs to helpfully suggest the wrong file to open.

From time to time I find myself embarrassingly editing the copy of a file in build/ when I should be editing it in the actual source tree. Old versions in submodules could be like that but worse, because they would be different from the current source code, so moving one's changes from them to the file one should be editing would be harder, and because commonly used editors and shell prompt customizations would show them in a way that is easy to confuse with an indication of changes to the outer repository one imagines one is editing.

When I was recently looking at the source code of LockedFD while working on gitpython-developers/GitPython#1669, I used my editor to navigate to it automatically, and I didn't realize initially that it was outside GitPython, in gitdb. This was no problem, because once I saw it was in gitdb, I knew where it was, because there is currently only one gitdb. Removing gitdb as a submodule and having it either as part of GitPython or used from the PyPI package would be fine for the same reason. Having an unused submodule as well as one of those other things would, in contrast, make it very easy to examine, draw conclusions about, and even attempt to modify the wrong one.

I wonder if a weakened version of (e) might be possible, where git/ext/gitdb and git/ext/gitdb/gitdb/ext/smmap could be replaced with submodules that exist solely for testing, e.g., test/data/outer-submodule and test/data/outer-submodule/subdir/inner-submodule, where outer-submodule and inner-submodule are small repositories with minimal test data and no .py files. But I suspect that might require new remote repositories to be available.

@Byron
Member

Byron commented Sep 22, 2023

If/when gitdb and smmap are moved into GitPython and made into Python subpackages of git, should they be conceptually public, so that people using GitPython can rely on their presence and on them providing particular things? Or should they be conceptually private to GitPython (perhaps as _gitdb and _smmap)?

I think they should be, in order to make it most similar to what is already there.

Also, this is a premature question that you definitely don't have to answer at this time, but do copyright notices from gitdb and smmap, other than those in Python modules, need to be copied over to GitPython when the gitdb and smmap code are moved into the GitPython source tree? That is, does the LICENSE file in GitPython need to be expanded, or something, or can this be avoided? I ask because the 3-clause BSD license used in all three repositories requires the copyright notice found in the license ("the above copyright notice") to be preserved. The copyright notice in GitPython's license file is:

Copyright (C) 2008, 2009 Michael Trier and contributors

In contrast, the copyright notices in gitdb's and smmap's license files are:

Copyright (C) 2010, 2011 Sebastian Thiel and contributors

Intuitively it feels like this is simply a matter of asking you if you are okay with the GitPython one applying also to your work in gitdb and smmap. However, perhaps contributors also only license their work in gitdb and smmap to be redistributed if they are credited with "Sebastian Thiel and contributors" rather than "Michael Trier and contributors". I'm not a lawyer and I'm not sure what the best thing is to do for that, though my guess is that it would be sufficient to state somewhere that GitPython includes code from gitdb and smmap, and also have both lines, separately, at the top of GitPython's LICENSE file, like:

Copyright (C) 2008, 2009 Michael Trier and contributors
Copyright (C) 2010, 2011 Sebastian Thiel and contributors

Alternatively, there could be a separate file for such other notices, as many large projects have.

To me it seems it would be easiest if the sub-packages, which I presume will live in their own directory, keep their own license files. That way one can't go wrong as one would not modify an existing license file. The feasibility of this certainly depends on the structural needs of python submodules though. Also CC @empty.

Related, though probably independent of license requirements and thus not in any way a substitute for figuring the above out: All the contributors and their contributions will be shown by Git if, instead of (0) doing the reasonable thing and copying the code from smmap and gitdb into GitPython straightforwardly, I were to do weird surgery, either (1) merging them in, though then GitPython would have three separate initial commits due to the merge of unrelated histories, or (2) doing some kind of rewrite of the paths, so they don't clash with the history of existing files in GitPython, and rebasing them onto GitPython, though then hundreds of commits from ancient times would appear at the tip of GitPython's main branch.

That's a great point! I'd definitely prefer to take the 'merge-into-their-own-subtree' route. This keeps the git history of both projects pristine (so no rewrites should be done). For this to work, they would be merged into their own subtree. From there, I think it's fine to move them once more, into place, if this is needed to turn them into submodules. Of course, I'd love it if they could just stay in place so people using blame don't see "moved into place" as their only commit message - it's so hard to poke through that wall with git blame unless there is a trick I don't know.

Yes, the archived status would not be an issue. That's arguably better than having Git submodules to non-archived repositories, because the archived status would make the situation clearer. I do think there is potential for confusion with multiple copies of gitdb and smmap under a local GitPython source tree, though, or even with one copy of them that isn't used. It's easy to open the wrong file, and easy for editors and IDEs to helpfully suggest the wrong file to open.

A very valid point, I didn't see that. Maybe that will be a great incentive to at least isolate the submodule tests to not require its parent repository anymore, which should allow to remove the submodules in a future step.

I wonder if a weakened version of (e) might be possible, where git/ext/gitdb and git/ext/gitdb/gitdb/ext/smmap could be replaced with submodules that exist solely for testing, e.g., test/data/outer-submodule and test/data/outer-submodule/subdir/inner-submodule, where outer-submodule and inner-submodule are small repositories with minimal test data and no .py files. But I suspect that might require new remote repositories to be available.

I think that's it! This is easier to accomplish than isolating the submodule tests, and is thus a more reasonable first cleanup step. Nothing would prevent one from taking what I said earlier as a step after that though - I guess I like the idea as it would lead towards learning how to achieve test-isolation. gitoxide naturally has that and it's very easy to setup fixtures for use in tests. One just creates shell scripts that use git to do what's needed - these are cached and if read-only, only created once.

@EliahKagan
Contributor Author

EliahKagan commented Sep 22, 2023

I think they should be, in order to make it most similar to what is already there.

I'm actually not sure if you mean they should be conceptually public or that they should be conceptually private, based on this. gitdb and smmap are not currently part of GitPython, and having them be conceptually private would maintain the current situation that code using GitPython cannot import git.gitdb or git.smmap. But maybe you mean they should be conceptually public so that if for some reason someone wanted to use them in code outside GitPython then they could do so (without risking unannounced breakage).

As you have pointed out, making GitPython no longer have them as external dependencies would be considered a breaking change, due, if I understand your reasoning correctly, to a long-standing expectation that installing GitPython installs those packages and makes them available for use. (Furthermore, it occurs to me that this is an opportunity to make other breaking changes: having only what people can reasonably use included in each module's __all__, removing things that are already marked as deprecated, making non-compatible type hinting changes, and so forth.) After that, however, if the subpackages start out private and become public, that would not be a breaking change, whereas if they start out public then making them private would be a breaking change. There might be reasons to prefer starting them out public anyway, though.

To me it seems it would be easiest if the sub-packages, which I presume will live in their own directory

If they are to be accessed as git.gitdb and git.smmap then they would become the git/gitdb and git/smmap directories, yes, though what would go there would be the gitdb and smmap subdirectories from the gitdb and smmap repositories, not the top-level repository directories that are also named gitdb and smmap and that contain license files. (The imports they contain, of their own modules, would be modified accordingly, as would the imports in Python modules that had already been in GitPython that make use of them.)
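The import changes mentioned in the parenthetical could be sketched roughly as follows. This is a hedged illustration, assuming the packages end up importable as git.gitdb and git.smmap; the function name `rewrite_imports` is hypothetical, and the sketch only handles `from X ...` and bare `import X` lines, leaving forms such as `import gitdb.util` for manual review.

```python
import re

# Matches "from gitdb..." / "from smmap..." at the start of a (possibly indented) line.
_FROM = re.compile(r"^([ \t]*from[ \t]+)(gitdb|smmap)\b", re.MULTILINE)
# Matches a bare "import gitdb" / "import smmap" line with nothing after the name.
_BARE = re.compile(r"^([ \t]*)import[ \t]+(gitdb|smmap)[ \t]*$", re.MULTILINE)


def rewrite_imports(source):
    """Point imports of the old top-level packages at subpackages of git."""
    # "from gitdb.util import X" becomes "from git.gitdb.util import X".
    source = _FROM.sub(lambda m: f"{m.group(1)}git.{m.group(2)}", source)
    # "import gitdb" becomes "from git import gitdb", so the bound name is unchanged.
    source = _BARE.sub(lambda m: f"{m.group(1)}from git import {m.group(2)}", source)
    return source
```

Rewriting a bare `import gitdb` as `from git import gitdb` rather than `import git.gitdb` keeps the local name `gitdb` bound, so the rest of the module needs no further edits.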

@Byron
Member

Byron commented Sep 23, 2023

I'm actually not sure if you mean they should be conceptually public or that they should be conceptually private, based on this.

I thought they should be conceptually public, but with the intent that the status quo doesn't change. Probably I have a wrong understanding of the status quo though.

gitdb and smmap are not currently part of GitPython, and having them be conceptually private would maintain the current situation that code using GitPython cannot import git.gitdb or git.smmap. But maybe you mean they should be conceptually public so that if for some reason someone wanted to use them in code outside GitPython then they could do so (without risking unannounced breakage).

I thought it's possible to do import gitdb because it's installed as dependency, and that is the same as installing it explicitly. Making these sub-packages would either make them available as git.gitdb, which wasn't the case before, or not make them available at all by setting them private.

As you have pointed out, making GitPython no longer have them as external dependencies would be considered a breaking change, due, if I understand your reasoning correctly, to a long-standing expectation that installing GitPython installs those packages and makes them available for use.

Yes, we are definitely aligned here. However, I'd refrain from making it a breaking change because I dare to say that most people won't use gitdb types directly. And I'd try to phrase it as 'fix' which means no breaking semver indication is needed. Further, to make porting code that does break easier, I think we should make the sub-packages public to allow accessing gitdb through git.gitdb.
This also means that 'actual' breaking changes shouldn't be done.

If they are to be accessed as git.gitdb and git.smmap then they would become the git/gitdb and git/smmap directories, yes, though what would go there would be the gitdb and smmap subdirectories from the gitdb and smmap repositories, not the top-level repository directories that are also named gitdb and smmap and that contain license files. (The imports they contain, of their own modules, would be modified accordingly, as would the imports in Python modules that had already been in GitPython that make use of them.)

I see, python is dependent on the actual directory structure, and adding git/gitdb would be the repository root, which isn't a valid python package (yet). So I wonder if one could just add __init__.py into the repository root as follow-up commits after merging the history, which then imports the contents of gitdb into itself to effectively 'forward' to the underlying gitdb package. Effectively, it would turn git.gitdb.gitdb into git.gitdb. The same would be done for smmap I presume.

Does this make any sense?

@EliahKagan
Contributor Author

EliahKagan commented Sep 24, 2023

And I'd try to phrase it as 'fix' which means no breaking semver indication is needed.

In that case, I would regard it as a new feature, one that is not inherently breaking but affects the public API: GitPython including gitdb and smmap, and thus not having to have them as external dependencies. So if it is not considered a breaking change then the minor version number could be bumped, bringing GitPython to 3.2.0 (unless some other update bumps this minor version number first).
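For reference, the version arithmetic behind that statement can be sketched as below. The function name `bump` is hypothetical, and the input "3.1.0" in the usage note is only an illustrative 3.1.x version, not a claim about GitPython's actual release at the time.

```python
def bump(version, part):
    """Return `version` ("MAJOR.MINOR.PATCH") with the given part incremented."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        # Breaking change: lower parts reset to zero.
        return f"{major + 1}.0.0"
    if part == "minor":
        # Backward-compatible feature: patch resets to zero.
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")
```

With this, `bump("3.1.0", "minor")` gives `"3.2.0"`, whereas treating the change as breaking would correspond to `bump("3.1.0", "major")`.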

I see, python is dependent on the actual directory structure, and adding git/gitdb would be the repository root, which isn't a valid python package (yet). So I wonder if one could just add __init__.py into the repository root as follow-up commits after merging the history, which then imports the contents of gitdb into itself to effectively 'forward' to the underlying gitdb package. Effectively, it would turn git.gitdb.gitdb into git.gitdb. The same would be done for smmap I presume.

If you mean the original gitdb and smmap repository trees could be moved directly inside GitPython's git directory, and that then git/gitdb/__init__.py could be added and contain something like from .gitdb import *, and git/smmap/__init__.py could be added and contain something like from .smmap import *, then that would have this effect, yes.
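To make the forwarding idea concrete, here is a small self-contained sketch that builds a throwaway package tree and imports it. The names `demo_git`, `demo_gitdb`, `ExampleDB`, and `_helper` are made up for the demonstration (chosen so as not to clash with a real installed git package); the layout mirrors the git/gitdb/gitdb nesting described above.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_tree(root):
    """Create demo_git/demo_gitdb/demo_gitdb on disk, mimicking git/gitdb/gitdb."""
    outer = Path(root) / "demo_git"
    forwarding = outer / "demo_gitdb"
    inner = forwarding / "demo_gitdb"
    inner.mkdir(parents=True)
    (outer / "__init__.py").write_text("")
    # The forwarding package re-exports the nested package's public names.
    (forwarding / "__init__.py").write_text("from .demo_gitdb import *\n")
    # The nested package; __all__ decides what the star import picks up.
    (inner / "__init__.py").write_text(
        "__all__ = ['ExampleDB']\n"
        "class ExampleDB:\n"
        "    pass\n"
        "def _helper():\n"
        "    pass\n"
    )


def load_forwarded():
    """Build the demo tree in a temporary directory and import the forwarding package."""
    with tempfile.TemporaryDirectory() as root:
        build_demo_tree(root)
        sys.path.insert(0, root)
        try:
            importlib.invalidate_caches()
            return importlib.import_module("demo_git.demo_gitdb")
        finally:
            sys.path.remove(root)
```

After `pkg = load_forwarded()`, the class is reachable as `pkg.ExampleDB` as well as through the nested `pkg.demo_gitdb.ExampleDB`, while `_helper` is not forwarded because it is absent from `__all__`.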

However, it seems to me that there is little reason to do it that way, especially in view of other changes that would be being made anyway. When moving the directories into place to make them Python subpackages of git, I think they may as well be moved to the simplest, most intuitive place: as immediate subdirectories of the git directory. Consider the other changes:

  • Other files--at minimum, the tests--should be moved out of the gitdb and smmap directories anyway, to become part of GitPython's test suite. To know that things are still working, the tests have to be runnable on CI, and to ensure this change is a thoroughgoing improvement (rather than trading one overcomplication for another), tests must be no harder or more complicated to run than before (and thus no harder for the test runner to find).
  • Probably the documentation should likewise be moved and integrated into the GitPython documentation (especially if the gitdb and smmap Python submodules of git are to be public).
  • As mentioned earlier, numerous imports would have to be changed to refer to them as submodules.
  • To prevent confusion, any file like setup.py that has come to reside in a directory with an __init__.py file should be deleted, moved, or renamed to be conceptually private. (But with the approach I'm suggesting, there would be no such files, because gitdb and smmap files from outside any package would never be moved anywhere under git/ in the first place.)
  • Probably there are a number of other little changes like those that I am not thinking of.

Most of these changes would, I believe, be fairly fast and easy. (Integrating the documentation is actually the one I feel most apprehensive about.) But with all these kinds of integration changes being done, I think there is no reason to try to force all the old gitdb and smmap files to remain in the same locations relative to each other. If people want to know what the gitdb and smmap repositories were like when they were separate repositories, they will have to look at the archived repositories (or the history) anyway.

Furthermore, if it is considered valuable to preserve the files other than Python packages/modules and documentation (besides in the repository history), then that can still be done: the repository could contain separate top-level subdirectories gitdb and smmap with their original README.md files (except that those files would be modified to add a note about what was done), LICENSE, and various other original non-integrated files (including the doc subdirectories unless or until they are integrated) from the top-level directories of the old gitdb and smmap repositories, as desired.

@Byron
Member

Byron commented Sep 24, 2023

In that case, I would regard it as a new feature, one that is not inherently breaking but affects the public API: GitPython including gitdb and smmap, and thus not having to have them as external dependencies. So if it is not considered a breaking change then the minor version number could be bumped, bringing GitPython to 3.2.0 (unless some other update bumps this minor version number first).

Yes, a valid point and I can go along with that.

I understand that due to all changes needed to properly integrate tests and documentation, one should prefer to not use a 'virtual' root directory as initially proposed by me. The reason I proposed it in the first place was that it could solve the licensing question (i.e. just leave it as is). From the considerations made in the above comment, I also see now that there seems to be no standard or established way of handling submodules as their own, independent projects with all the files that come with it, at least not without further complications.

With all that said, it seems that a viable course of action could be the following:

  • subtree-merge the history of both packages, gitdb and smmap, into GitPython/<subpackage> or GitPython/<packages>/<subpackage>, or whatever is the most common in the Python world.
    • this is to preserve the entire history and have a pristine, well-known initial state of the pending transformation
  • integrate tests and docs (and maybe make other changes that nobody can think of yet) of smmap and gitdb (probably that order as GitPython uses gitdb uses smmap).

This should make it possible to trace back all files and changes to their original history later on, probably not so much with git blame due to the rename, but with more powerful tools that we might have one day (or the right arguments for git blame if it can do that already) as the information, technically, is present.

Maybe there are other ways to achieve the same - for me the only point of importance is the initial state, which should be history preserving (i.e. start out with a subtree-merge); everything else really is in your most capable hands.

@EliahKagan
Contributor Author

EliahKagan commented Sep 24, 2023

The reason I proposed it in the first place was that it could solve the licensing question (i.e. just leave it as is).

gitdb and smmap's current license files could still be put in the subpackage subdirectories (even in addition to elsewhere), i.e., one level lower than they started but still directly associated with the code they came with, if that's something that would help.

On the other hand, if the goal is to make gitdb and smmap really a part of GitPython completely, then isolating them in separate directories for license-related reasons would not be the best approach. With gitdb and smmap becoming submodules of git, they could depend on code in other, new submodules of git, and so some code--LockedFD maybe?--might end up getting moved out of gitdb and put somewhere else. The end result of this over time might end up that very little still in the gitdb submodule would be needed by anything outside it, at which point it could be eliminated (assuming doing so would be an acceptable breaking change at that future point). Even if dramatic benefits like that are not to be seen, I think being able to fully treat the three projects as one, move code between them, extract similar code into a shared utility function, etc., would be a huge benefit of this reorganization.

The licenses are compatible--more than compatible, the same but just with different copyright lines. I think using a single license file with both notices is enough (or more than enough--intuitively one could combine them into a single line, I just don't want to assume that's really okay because it's arguably not preserving the notice). But if for some reason even that is not enough, then the full text of both could be listed, one after the other, in a single file, or otherwise duplicated in a prominent way at or near the top level of the repository.

From the considerations made in the above comment, I also see now that there seems to be no standard or established way of handling submodules as their own, independent projects with all the files that come with it, at least not without further complications.

Well, should they be independent?

If they should be kept as separate projects but just all hosted in the GitPython repository, that's another option, and an alternative to making them Python subpackages. Actually this covers two alternatives:

  • They can be kept as separate PyPI projects, just hosted here (i.e., this would become a "monorepo" of all three separate projects). That is fully compatible, not a breaking change. But they would remain separately versioned (or they could be forced to be versioned together in an artificial way that produces lots of identical releases for gitdb and smmap which are less active, but that would be worse). Because they would be separately versioned, I'm not sure how much easier this makes things.
  • The GitPython distribution package obtainable from PyPI could provide the git, gitdb, and smmap Python modules/packages. This would preserve the exact usage we have now, allowing top-level import gitdb and import smmap. For the same reason, I think this unfortunately would actually be a breaking change, because it would make GitPython incompatible with the gitdb and smmap PyPI packages--due to the clash, any project that had separately listed gitdb or smmap as a dependency would be subject to breakage. That might be okay, though, since there are probably very few such projects.

for me the only point of importance is the initial state that should be history preserving (i.e. start out with a subtree-merge)

This is actually something that, once done, I think you could review: when it comes time to do this, I could open a draft PR with just that part of the change, and then do everything else (the integration and so forth) on the PR branch afterwards. It would stick around as a draft PR for as long as those changes took, but that seems okay to me. The PR could then be marked "ready for review" when it's ready for the rest to be reviewed.

I emphasize that this is not something I would be ready to do immediately, though, so even if you like this idea, please don't expect such a draft PR around the corner. At minimum and even aside from the issue that I cannot make assurances or commitments about my own availability, I think native Windows CI, and possibly other test cleanup, so we know what tests are currently expected to fail and on what platforms, should come first. (I also want to either try to fix or, at minimum, open an issue for gitpython-developers/GitPython#1650 (comment) / gitpython-developers/GitPython#1650 (comment) so it's not forgotten about, and to present the downsides I'm aware of in all the possible approaches--they all have downsides, though some would be better than the current race-condition situation.) But this is actually good rather than bad, I think, because there is no need to rush this while the plan is still forming and possible alternatives are still being considered.

@Byron
Member

Byron commented Sep 24, 2023

The licenses are compatible--more than compatible, the same but just with different copyright lines. I think using a single license file with both notices is enough (or more than enough--intuitively one could combine them into a single line, I just don't want to assume that's really okay because it's arguably not preserving the notice). But if for some reason even that is not enough, then the full text of both could be listed, one after the other, in a single file, or otherwise duplicated in a prominent way at or near the top level of the repository.

Thank you for refreshing my memory - for some reason I completely discarded this initial idea in favour of keeping the files verbatim, but I also see how just combining both will have great advantages when considering the amount of refactoring the code is likely to undergo.

If they should be kept as separate projects but just all hosted in the GitPython repository, that's another option, and an alternative to making them Python subpackages. Actually this covers two alternatives: […]

This reveals another assumption I made implicitly: Somehow I assumed that this would be a multi-step process, with the first one bringing in gitdb and smmap as subtrees, while allowing CI to run tests and docs to build. From what I could gather, this would place them in such a way that they will import as git.gitdb and git.smmap. This would effectively remove them as explicit dependencies, their code would now be part of the GitPython package.

This effectively skips over the 'separate PyPI projects, just hosted in a monorepo' step, even though it probably doesn't have to. However, maybe it's ultimately easier to make the integration work if there is no need to deal with different CI configurations.

This would allow people who depend on gitdb directly to make an easy switch to git.gitdb, and from there one can probably find ways to deprecate and, eventually, remove gitdb entirely. Please note that I am not saying that you should even take that on; I merely want to say that the reorganization is a step in this direction.

This is actually something that, once done, I think you could review: when it comes time to do this, I could open a draft PR with just that part of the change, and then do everything else (the integration and so forth) on the PR branch afterwards. It would stick around as a draft PR for as long as those changes took, but that seems okay to me. The PR could then be marked "ready for review" when it's ready for the rest to be reviewed.

I agree, having a longer-running draft that can receive intermediate feedback is a good way of handling this work.

I emphasize that this is not something I would be ready to do immediately […]

I understand - if nothing else this thread can serve as reference if this gets tackled in the future. There are many lower-hanging but just as valuable fruit left to be picked :).

@EliahKagan
Contributor Author

EliahKagan commented Oct 9, 2023

This would allow people who depend on gitdb directly to make an easy switch to git.gitdb

For code outside GitPython that was modified to depend on GitPython instead of gitdb and/or smmap, there would be another impediment to just using git.gitdb in place of gitdb and using git.smmap in place of smmap. Importing these submodules will cause git to be imported, and the current default behavior when GitPython's git module is imported (modifiable by the GIT_PYTHON_REFRESH environment variable) is to raise an exception if no git command is found. In contrast, gitdb and smmap do not currently do this.

I think it's possible that code using the GitPython library is depending on this behavior, such as by wrapping import git in try-except. (For example, it's not hard to imagine an application that uses GitPython to provide optional Git support relying on this to decide whether it should surface Git features to its users in its UI.) So I'm not sure what, if any, change should be made related to it, when moving gitdb and smmap into git.
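
The optional-support pattern described above might look roughly like the following sketch. The helper name `load_optional` and the flag `GIT_FEATURES_ENABLED` are hypothetical, and the sketch assumes that a failed import of `git` surfaces as `ImportError`, whether GitPython is missing or its import-time check cannot find a git command.

```python
import importlib


def load_optional(module_name):
    """Return the imported module, or None if importing it fails."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Assumption: this covers both "package not installed" and an
        # import-time failure that the package reports as ImportError.
        return None


# Hypothetical application code: surface Git features only when usable.
git = load_optional("git")
GIT_FEATURES_ENABLED = git is not None
```

An application written this way relies on the import of `git` failing when no git command is available. If `git.gitdb` replaced `gitdb`, code that only wanted the object database would go through the same import-time check.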

@Byron
Member

Byron commented Oct 11, 2023

So I'm not sure what, if any, change should be made related to it, when moving gitdb and smmap into git.

That sounds like no change is needed. Although it is a breaking change unless we keep the gitdb dependency alive for a while and mark it as deprecated, I think people who for whatever reason rely on gitdb can easily re-add it as their own direct dependency for an immediate fix. I don't think dependencies of GitPython should be considered public interface in the first place, as these can change at any time.

So I think it's fair to actually remove the gitdb dependency and move these around to make them submodules as the code moves into the GitPython repository.

@EliahKagan
Contributor Author

In that case, would it be okay for gitdb and smmap to become nonpublic _gitdb and _smmap submodules of git? Or would you prefer they still be public submodules? (I ask because it's non-breaking for _gitdb and _smmap to later become public submodules gitdb and smmap, but breaking if done the other way around. But then, if you know you want the gitdb and smmap submodules to be public, then they may as well start out that way.)

@Byron
Member

Byron commented Oct 12, 2023

You are right, it makes perfect sense to start them out as non-public then, let's do that.

@EliahKagan
Contributor Author

EliahKagan commented Oct 12, 2023

Some bad news: Now that I'm revisiting this topic with fresh eyes, a greater awareness from GitPython #1656 #1659 & gitdb #98 of how the git and gitdb top-level modules share exception types, and some more insight into how code that uses GitPython often handles imports... unfortunately it seems to me that moving gitdb and smmap into the git module would be a serious breaking change (no matter how it were done).

The issue is that the GitPython git module republishes classes and functions defined in gitdb, including exception classes, and users can import them from either. Searching on GitHub (originally for an unrelated reason) reveals that it appears common for people using GitPython--and not just in forks or repos that vendor it--to import things from gitdb explicitly. In the case of exceptions, attempting to catch an exception imported from gitdb.exc would silently fail to catch the actual same-named exceptions from GitPython, if GitPython were to begin using its own gitdb or _gitdb submodule. This is independent of submodule naming or whether the submodules were public: GitPython would still have to make those exceptions accessible directly in git and git.exc to support current usage.

Even advancing the major version number in GitPython would arguably be insufficient to mitigate this, because a significant amount of code would probably break silently. (I say this because not all projects have robust tests, which is especially so for the legacy code that is especially likely to use GitPython, and because an exception that fails to be caught by an intended except clause may still be caught by a more general except clause rather than showing an immediately discernible error whose traceback contains information related to the bug.) Please note that this is much more serious than not being able to import gitdb for purposes other than using GitPython, an issue I had previously put too much attention on relative to this more serious issue. The breaking change I'm describing here is to how GitPython can be used.
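
A minimal, self-contained sketch of that silent failure follows. It does not import GitPython or gitdb; instead, two stand-in modules each define their own `BadObject` class, mimicking the standalone `gitdb.exc` and a hypothetical vendored copy (the name `git._gitdb.exc` is an assumption for illustration only).

```python
import types


def make_exc_module(name):
    """Build a stand-in module that defines its own BadObject class."""
    module = types.ModuleType(name)

    class BadObject(Exception):
        pass

    module.BadObject = BadObject
    return module


# Stand-ins: the standalone package and a hypothetical vendored copy.
gitdb_exc = make_exc_module("gitdb.exc")
vendored_exc = make_exc_module("git._gitdb.exc")


def lookup():
    # The library now raises the class defined in the vendored copy.
    raise vendored_exc.BadObject("object not found")


def caller():
    try:
        lookup()
    except gitdb_exc.BadObject:
        # Existing user code expects to land here.
        return "specific handler"
    except Exception:
        # Same class name, different class object: it lands here instead.
        return "generic handler"


print(caller())  # prints "generic handler"
```

The two classes share a name but are distinct objects, so the specific `except` clause never matches and no error is raised to reveal the mismatch.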

Possible solutions:

  1. Somehow make newer versions of the GitPython package have the gitdb (and possibly smmap) package as a "negative" dependency, so they cannot be installed together. I don't think Python dependencies support this (regardless of build backend) so it would have to be done with some kludge if it could even be done at all. Furthermore, this would definitely be a breaking change.
  2. Move gitdb and smmap into the GitPython repository and make them part of the GitPython PyPI package, but have that package provide three separate top-level modules. This would completely preserve interface compatibility. They would need to be versioned in such a way as to express the core dependency incompatibility, though: the new GitPython package would not be able to be installed at the same time as the smmap or gitdb packages. CI would still be unified. Maybe this is the way to go.
  3. Move gitdb and smmap into the GitPython repository but keep them as separate packages, including on PyPI. This would make the GitPython repository a monorepo, publishing all three separate packages, and the gitdb and smmap repositories could be archived. CI could still be unified. This would still work fine, though without moving files currently in GitPython, the organization of the repository could be confusing.
  4. Try to reverse the dependencies: Maybe there is a way for GitPython to not require that separate gitdb and smmap packages be installed, but require that if they are installed then they be at a particular major version or higher. If so, that major version could (a) require GitPython and provide nothing of its own, (b) require GitPython and import and republish from new GitPython submodules what they had formerly provided themselves, or (c) somehow be totally uninstallable, in which case this is 1.
  5. Keep everything separate, including the repositories, but in some other way handle the cumbersome nature of maintaining these closely tied separate repositories. A bot could automate merging multiple pull requests, so that a change that affects both GitPython and gitdb or smmap could signify in some way in its PR description on GitPython what other PRs should be approved or rejected along with it, and bot commands could perform the approvals or rejections. (Opening the PRs on the other repositories could possibly even be automated in some way.) My concern with this approach is that it might end up moving the complexity around rather than decreasing it.
  6. Do nothing soon, but gradually document a new design for GitPython that would be a major API change but no longer have (or no longer have separate) gitdb and smmap, would no longer have anything currently deprecated, and possibly would have other changes. Eventually this could be actually developed on its own branch, and become a new major version of GitPython. The problem with this, of course, is that GitPython is important mainly as legacy software, and it's unclear what the impact would be of stopping support for the old major version even after an extended period.
  7. This is the option whose ramifications I've thought through the least, but I wanted to include it: It may be possible for a gitdb (and smmap) package to conditionally republish exceptions and other classes and functions from GitPython when GitPython is installed, and otherwise provide its own. There would still be weird dependency wrangling that I think might make this a breaking change. My bigger concern with this approach is that Jupyter/IPython are very popular, and packages might be installed with %pip or !pip even after others' modules have already been imported.
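
The "import and republish" idea in option 4(b) can be illustrated with a small sketch, again using stand-in modules rather than the real packages. The module names and the single `BadObject` class are assumptions for illustration; the point is only that a shim which rebinds the same class object, rather than defining a new one, keeps existing `except` clauses working.

```python
import types

# Stand-in for a hypothetical new GitPython submodule that owns the class.
new_home = types.ModuleType("git._gitdb.exc")


class BadObject(Exception):
    pass


new_home.BadObject = BadObject

# Stand-in for a shim release of gitdb.exc: it republishes the existing
# class objects under the old names instead of defining fresh ones.
shim = types.ModuleType("gitdb.exc")
for name in ("BadObject",):
    setattr(shim, name, getattr(new_home, name))


def caller():
    try:
        raise new_home.BadObject("object not found")
    except shim.BadObject:
        # Matches, because both names refer to the same class object.
        return "specific handler"
    except Exception:
        return "generic handler"
```

This only shows the class-identity side of the option; it says nothing about whether the dependency constraints needed to force such a shim release to be installed are expressible.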

Sorry about the bad news! 😿

@Byron
Member

Byron commented Oct 14, 2023

Thanks so much for the analysis! I consider this good news, as it allows us to make more educated decisions and prevent major accidental breakage.

Without wanting to push or force a decision, my intuition here lies with option number 3. That way, it seems that gitdb and smmap could live in their own directories and be fully imported (with history) via subtree-merge. Then it would be possible to deduplicate the CI setup and start benefiting from the 'monorepo' approach.

A vision of what would constitute a worthy breaking change, one that automatically does away with gitdb, would be to redesign GitPython to use gitoxide as its underlying engine and so lose its git dependency. Getting there is a major effort on all sides. But I also think it would be well worth it if GitPython were easier to use afterwards, faster, and, for the first time, correct.
