Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix mkdir race condition in LooseObjectDB.store #91

Merged
merged 1 commit into from
Sep 11, 2023

Conversation

EliahKagan
Copy link
Contributor

Fixes #85

This replaces the conditional call to os.mkdir that raises an unintended FileExistsError if the directory is created between the check and the os.mkdir call, using a single os.makedirs call instead, with exist_ok=True.

This way, we attempt creation in a way that produces no error if the directory is already present, while still raising FileExistsError if a non-directory filesystem entry (such as a regular file) is present where we want the directory to be. This is the advantage of this approach over the approach of swallowing FileExistError as suggested in #85.

Note, however, that os.makedirs behaves like mkdir -p: it attempts to create parent directories (and their parents, etc.) if they do not already exist. So it should only be used if that is acceptable in this case. I am not aware of a reason it wouldn't be, but I am not very familiar with gitdb.

So that aspect of the situation deserves special consideration in reviewing this PR. I'd be pleased to change the approach if os.makdirs is judged not suitable here. I think the approach suggested in #85 is reasonable, and it can be made more robust by checking that the directory exists after the creation attempt (or in other ways).

The code was under test: that line is exercised in TestExamples.test_base, TestGitDB.test_writing, TestLooseDB.test_basics, and TestObjDBPerformance.test_large_data_streaming. However, no test catches the race condition this fixes, and I have not added one.

Testing that the race condition does not occur in the specific way as before by accessing and calling the same functions as before in the same order would be easy, but it would be more of an illusion of a regression test than a useful test. Testing by trying to brute-force a race condition, without modifying the operation of the code for the test, would work but the tests would take a very long time to run. Testing it in a way that is fairly robust against new ways of reintroducing the race condition and that is not too slow should be possible, but I don't know of a good way to do it; everything I've thought of would be complicated, and possibly make running the test in a debugger like pdb infeasible. So I have not added a regression test for this bug. However, if it is considered important to have one, then I can consider the matter further.

This replaces the conditional call to os.mkdir that raises an
unintended FileExistsError if the directory is created between the
check and the os.mkdir call, using a single os.makedirs call
instead, with exist_ok=True.

This way, we attempt creation in a way that produces no error if
the directory is already present, while still raising
FileExistsError if a non-directory filesystem entry (such as a
regular file) is present where we want the directory to be.
@EliahKagan EliahKagan marked this pull request as ready for review September 11, 2023 08:11
Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot!

Please do that I would prefer if GitDB could be removed from GitPython entirely, which is even worth a major version change to me. I dare to say that not too many people use it, and if they do, they should rather use an implementation that is more trustworthy (like the native git one). In hindsight, I do consider it a mistake to have tried to write an ODB in python.

With that said, it might be that parts of GitPython rely on GitDB, maybe for writing, and I'd hope these could be ported over to GitPython ease maintenance - I firmly believe that most of the code in GitDB isn't actually used by GitPython if the default git ODB backend is used.

@Byron Byron merged commit 295a55b into gitpython-developers:master Sep 11, 2023
@EliahKagan
Copy link
Contributor Author

Thanks--if I contemplate significant changes to, or related to, gitdb, I'll being by looking into whether I can remove GitPython's dependence on it or vendor the specific parts that GitPython needs.

In this case, I just noticed that I was aware of a way to fix the race condition that might be acceptable. so I figured I'd open a PR. Although there are some other changes I may want to propose to the CI workflow in this repository, I'm not sure I really know enough to fix the most important gitdb-related issues, in GitPython or in gitdb itself. As my familiarity with the GitPython codebase increases, that might change.

@EliahKagan EliahKagan deleted the toctou branch September 11, 2023 08:55
@Byron
Copy link
Member

Byron commented Sep 11, 2023

Although there are some other changes I may want to propose to the CI workflow in this repository, I'm not sure I really know enough to fix the most important gitdb-related issues, in GitPython or in gitdb itself. As my familiarity with the GitPython codebase increases, that might change.

I would absolutely love if I could direct your incredible skill and energy away from the GitDB CI and towards making it unnecessary entirely :). It's probably more of a refactoring task, albeit a complex one.

As my familiarity with the GitPython codebase increases, that might change.

It's a great gift to have you contribute to this project, and I love the idea to have more of that. The project, to my mind, has significant issues with quality and some limitations stem from these. The greatest problems come from incorrect handling of encodings, both for paths and for data, along with many naive implementations of some git data-structures (see Index).
With over half a billion downloads per year GitPython is more influential than I feel comfortable with given its current state, and your work will be majorly impactful.
I also hope your contribution experience is rewarding and pleasant so you can continue this awesome work.

@EliahKagan
Copy link
Contributor Author

Thanks! Based on this, the only further gitdb CI pull request I will open now is one to re-add Python 3.7 support, as you indicated should be done in a few of the comments in gitpython-developers/GitPython#1654. (That will include CI, but also changing the lower bound back to 3.7 in setup.py. It's very simple, because it's just a revert of one commit.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Race condition writing objects in LooseObjectDB
2 participants