Support internal dependencies that are resolved on-disk via another tool #16380
Comments
To manage expectations here, it would take very good arguments for this bug to be fixed (and the associated design doc to be approved). The current Bazel way is to put the source tree on a FUSE file system and implement laziness that way; this makes it possible for Bazel to be completely agnostic as to how the source tree is put together, and this would be both a pretty big change from that approach and a pretty large chunk of functionality for us to support indefinitely that is, in light of that approach, redundant. @meteorcloudy WDYT? I could imagine considering merging this if there was a widespread need, but for one use case, I'd be reluctant to take on the increased mental burden and support load. |
I agree with your judgment, but I think this is more of a decision to make for the core team. |
Then @haxorz WDYT? I thought @meteorcloudy is appropriate because he has all the state about external dependencies in his head, which is (partially) relevant. |
Hello! I'd like to add a bit of context around why I submitted this, what I'm looking for, etc. w.r.t your comments. For some background: while I'm new to Bazel, I contribute pretty actively on Git (especially I spent some time over the past year interviewing monorepo maintainers to learn what their biggest pain points were in Git, and Bazel came up in nearly every conversation. I learned that 1) everyone seems to build their own custom, internal tool to integrate Bazel and sparse-checkout with no standardized approach, and 2) even with those tools, sparse-checkout performance in those repos was substantially worse1 than it should be because of the need to have files on disk before running
I did see some projects related to this approach (namely sandboxfs) but ultimately leaned away from that sort of integration. Historically, FUSE hasn't been a feasible option for interacting with Git on the scale of repos that need the performance gains of
That's an understandable concern, and it's frankly the main reason I spent months evaluating alternative approaches before submitting this issue. But the overhead (both computational and mental) from forcing the use of a VFS layer makes it impractical as a solution in my eyes.
Multiple large projects that use Bazel have built entire engineering systems around coordinating it with Git (there were three talks at Git Merge this year about them! 1, 2, 3). While it might seem like a limited use-case, it's an increasingly common - and critical - one. Ideally, solving this problem by extending Bazel would make it easier for projects to standardize on a common open-source approach to integrating with Git. A well-designed implementation should minimize the support load added to Bazel, while also removing the distributed mental/support burden for all of those other projects' custom tools. Plus, it would lower the barrier to entry for using a performance-optimized Git + Bazel setup for the projects that don't have a dedicated engineering tooling team. Anyway, thanks for the feedback and I hope this context helps! Footnotes
|
Wow, that's an impressive amount of research! Does the sentence "Bazel came up in nearly every conversation" imply that a majority of the folks who have huge git repositories use Bazel? My heart is saying that we should have some sort of functionality like this in Bazel, but my head is very reluctant for the reasons already elaborated above. I was about to ask who exactly this would benefit, but your YouTube links more than adequately answer this. What I will do now is to ask around at Google:
The two alternatives that I know of are a FUSE-based approach we discussed above and a recursive For the latter, I am somewhat skeptical of the claim that |
+1. @vdye, thank you for the incredibly thorough and thoughtful FR! @michaeledgar has a lot of prior experience with sparse checkout in both Git and Mercurial and he's coincidentally on
+1 to this. Ideally we can come up with something that doesn't involve intrusive changes to Skyframe (Bazel's core execution and incrementality engine). I fear this might be challenging though. But my fear is partially based on ignorance of the details of sparse checkouts, so maybe Mike can think of a concrete ~API that is both tasteful and feasible to support. I'd also like to emphasize the inverse statement: If this FR is to be reified as something very tightly coupled to Git, I don't think |
Without taking sides, @vdye 's design doc is not coupled to git, so on that side, we're good. I thought a bit about what this could look like in Skyframe and I think I could come up with a way to make this as unintrusive as possible to Skyframe (essentially: make The issues I immediately see are:
|
👋 One of those monorepo maintainers that wrote their own sparse checkout on bazel chiming in. I am vaguely aware of the google virtual file system (srcfs?) and the way you can implement extended file attributes for hashes. All that being said, you still need some service (Content Addressable Storage?) network accessible somewhere and this is easier said than done. Securing and scaling these services takes resources and we already have to consider remote caching and remote execution with bazel in our ecosystem - adding a FUSE file system backend to maintain would be another challenge at less-than-google scale. |
Hi @vdye, Big fan of your work over the Git Dev mailing list! I think bazelbuild/proposals#277 is quite solid and I am in favor of it personally. However, I would like to note that FUSE usage with Bazel does provide a bit more than just "lazy file loading", which I believe to be the core thesis of https://www.youtube.com/watch?v=rQd9Zd1ONOw In short, Bazel relies heavily on the hash of input files in the source tree. There are 2 problems trying to re-construct this using
Edit: I wrote this last night and forgot to hit send. I see that @maxious has raised similar concerns in the post above mine 🤝 |
I'm not (currently) a bazel user, and am coming more from the Git side (I'm one of the other Git developers working on sparse-checkout capabilities), but I'd like to throw out another idea for folks to consider:
Idea: Would there be an opportunity to just provide Bazel with the contents of the files it needs for computing dependencies instead of vivifying the files? Reasons: One thing I was disappointed in with the Git Merge monorepo talks this year was that folks were vivifying these files (or using an additional checkout) rather than allowing the build system to be provided the file contents directly. Granted, vivifying files is what I did for our first cut at using sparse checkouts as well, but we found it rather suboptimal. Providing contents directly to the build system has multiple advantages: (1) The working directory is kept nice and tidy both for the user and for IDEs -- either of which can get confused by the extra files and directories appearing (perhaps transiently). I will note that you can work around at least the first issue by e.g. keeping a parallel copy of the checkout maintained; this requires a little cleverness to avoid the performance issues of a full second copy (which is doable), but much more importantly you're introducing the risk of the parallel copy being wrong due to not having local changes made by the user in their working copy (or users going and messing with that copy). Providing the appropriate file contents directly to the build system avoids this problem as well as the above. I've implemented the contents-directly-to-build-system idea I suggested above. (To get the contents of files just-in-time, for SKIP_WORKTREE files I make use of |
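A minimal sketch of the "provide contents directly" idea, for readers who want something concrete. It assumes the staged copy of a file can be read with `git show :<path>`; that is an assumption of this sketch and not necessarily the mechanism the comment above goes on to name. The function names are hypothetical, and nothing here is Bazel code.

```python
import os
import subprocess
from typing import Callable


def read_from_git_index(repo: str, rel_path: str) -> bytes:
    """Return the staged contents of rel_path using `git show :<path>`."""
    result = subprocess.run(
        ["git", "-C", repo, "show", f":{rel_path}"],
        check=True,
        capture_output=True,
    )
    return result.stdout


def get_contents(
    repo: str,
    rel_path: str,
    index_reader: Callable[[str, str], bytes] = read_from_git_index,
) -> bytes:
    """Prefer the worktree copy (it may hold local edits); otherwise read
    the contents from the index without writing anything to disk."""
    full_path = os.path.join(repo, rel_path)
    if os.path.isfile(full_path):
        with open(full_path, "rb") as f:
            return f.read()
    # File is not in the worktree (e.g. SKIP_WORKTREE): nothing is vivified.
    return index_reader(repo, rel_path)
```

Because the fallback never writes to the worktree, the checkout stays as sparse as the user left it, which is the property the comment argues for.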
Thanks everyone for the commentary/discussion! @newren: I noted in the full proposal that most Bazel rules (like As far as I can tell, source files aren't needed on-disk until the "Execution" phase of the build, and external rules don't directly access the build files on disk. With those two assumptions in mind, only core Bazel (not individual custom rules) would need to be able to access build file content from somewhere other than the filesystem (e.g. using It's probably still more appropriate to add this functionality via an extension API vs. putting @lberki @michaeledgar (or anyone, really) - what do you think? I'd like to start working on an implementation, but don't want to get too far with it in case there's a much easier approach that hasn't come up yet. |
I think there's a problem when globbing source files, since then the rules depend on the presence of source files on disk (even if not their contents)? |
Yes, absolutely. It only made sense to handle the "build files" that special way; the "source files" were absolutely required to be on disk for the other build system too. One thing to note, though, is that our build system was okay with just providing {module/package}-level dependencies, which was what we needed to get the directories to specify for the sparse-checkout configuration. If we weren't using cone mode and were attempting to exclude all unnecessary files, or if the system had been hardcoded to only determine file-level dependencies (which is often determined by globbing all source files under a directory), then our changes to the build system would have needed to be more invasive. I think it's a bad idea to try to force sparse-checkouts into the minimum list of needed files and much prefer the idea of the minimum set of packages, but the question of package-level vs file-level dependencies might come up so I thought I'd note it.
👍 |
I think this goes back to the question I also noted above: is it possible to determine package-level dependencies without determining all file-level dependencies? We only need the former for sparse checkouts. |
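To illustrate why package-level dependencies are enough for cone mode, here is a rough sketch that turns a list of package labels into the directory list one could hand to `git sparse-checkout set`. How the labels are obtained is out of scope, and the helper names are hypothetical.

```python
from typing import Iterable, List


def package_dir(label: str) -> str:
    """Map a label such as '//foo/bar:baz' to its package directory 'foo/bar'."""
    if not label.startswith("//"):
        raise ValueError(f"expected a main-repository label, got {label!r}")
    return label[2:].split(":", 1)[0]


def cone_directories(labels: Iterable[str]) -> List[str]:
    """Collapse package-level dependencies into a minimal directory list."""
    dirs = sorted({package_dir(label) for label in labels})
    kept: List[str] = []
    for d in dirs:
        if d == "":
            # Cone mode always includes files in the repository root.
            continue
        # Skip directories already covered by a kept ancestor, since cone
        # mode includes everything beneath a listed directory.
        if any(d == k or d.startswith(k + "/") for k in kept):
            continue
        kept.append(d)
    return kept
```

No per-file information is needed at any point: the result depends only on which packages appear, not on which files inside them are used.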
Update: we talked with @linzhp in person at BazelCon and I offered him the option of building a
I don't think this is doable with This system could be used to do sparse checkouts with Bazel without deep integration like what @vdye proposed. The user experience would not be as nice, but, it would widen the interface of Bazel much less than that proposal and it would be a much smaller ongoing maintenance burden. Would this be an option? |
This approach is almost identical to one of the ones I considered before opening this issue ("create a mode/option allowing In the meantime, I've been working on a draft implementation based on the alternative approach suggested by Elijah earlier. With respect to maintenance burden, I haven't had to do much in the way of refactoring or changing core infrastructure; the virtual file resolution is limited to a new |
@vdye : yeah, it's not a great idea I had, it's a preference of mine between two of your ideas :) I do realize that it would not be as nice a user experience, but it would be much easier to support indefinitely. (Thus spake the Master Programmer: I'll resist the temptation to pass judgement on the approach you have taken because it's much easier to have an informed opinion when I see the code. From what I can glean from your above comment, it seems to work by teaching Bazel has a pretty well-developed virtual file system layer, which sounds like another possible abstraction layer to plug this in. I don't know if it's easier, but it's certainly a possibility and we have a number of alternative On the git side, what is the future of non-cone mode? Will it be supported? From a brief perusal of the documentation, it looks like that cone mode is now preferred, which is unfortunate because then AFAIU if one wants to checkout e.g. package Only checking out
|
To add to my previous comment: I can see how piggybacking on
|
I wasn't looking to take credit, just clarify that I'd deeply considered the pros/cons of what you suggested and note the conclusion I had reached.
I detailed earlier why third-party tools are insufficient from a usability perspective. Tools that do what you’re suggesting already exist (most similarly, You’ve mentioned maintainability a number of times now, but it would be extremely helpful for me to know what you would quantify as “too complex” or “too invasive.” Are there specific things I need to avoid doing in my implementation that have burned the Bazel team in the past? I’d be much more comfortable working within clear guidelines than leaving the “go/no go” call up to (from my perspective) chance.
Per Elijah's comment:
So it's not vivifying the files, it's reading the files directly from the Git index into memory for Bazel to compile (via Note that this approach has two limitations: it can only run successfully in commands that don't touch the non- Like I said, though, all of this is still WIP, so nothing is set in stone. For example, one thing I'd like to try is to skip the request to I currently have a functioning end-to-end MVP, but it almost certainly breaks outside of the “happy path” example I’m working with. I’m planning on spending a week (not next, but the one after) cleaning that up, then will open a draft pull request so people can look it over.
This approach would likely require hardcoding Git logic into Bazel (which I don't think anyone here wants) and the FileSystem layer is responsible for all filesystem accesses, so any integration would need to distinguish between in-repo file reads & everything else. I think that'd be far more invasive and complicated than anything I'm proposing here.
Repositories that need the scalability of cone mode are going to tend towards rearchitecting their package structure to best take advantage of it. Using your example, a cone mode repository like that would either intend for Note that the dependency adapter proposal doesn’t mandate one of cone mode or non-cone mode (although it is designed with cone mode use cases in mind). A Git adapter would pull
Some of these (EDIT: and a few of the additional ones you mentioned in your most recent comment) aren't supported by my prototype in its current state, but 1) I intend to at least evaluate how much work they’d be and adjust my design accordingly, and 2) if they’re non-trivial to support, document why. Ultimately, I don't think it's unreasonable to say "the I personally find it most beneficial to introduce and incrementally expand a capability over time, rather than to wait just to make sure it's compatible with every possible workflow. That's especially true here, as the feature is opt-in and, if a user tries something and it doesn’t work, that can inform where incremental improvements could be made. In the meantime, users could either adjust their build infrastructure to fit the restrictions, or continue working without the feature at all. |
I'd echo this sentiment. We already enforce conventions against patterns in Bazel (and other tools) that are incompatible with sparse checkouts. |
Neither did I so I thought it's better to clarify. It's all good!
I can't think of an objective measure, unfortunately. If I had, I would have already informed you about it; what I'm looking for is some combination of:
Does this help?
Ah, got it; I misunderstood that, but now I am confused: you do have a point that the Even limiting ourselves to
I agree, but if we don't want the vivifying approach, I'm not sure if the "plumb the bits directly from git to
I understand that; what I wanted to know is whether I can expect non-cone mode to be supported in git in the long term because if not, we'd have to just accept that
I promise I won't hold you to impossible standards :) What I'm looking for here is not a perfect implementation, only an implementation that has a reasonable chance of eventually supporting every use case. As long as that is true, it can always be put behind a Comparing various alternatives against the above simple interface / simple implementation / no surprising interactions standard:
|
@vdye this thread is now long enough that I think a video call would be more productive. Which time zone do you live in? I'm at BazelCon this week and my calendar is completely full, but we could arrange a call next week, which would hopefully let us come to an agreement earlier. Until then, I guess a centithread it is? |
With all the talks in BazelCon around wrappers built on top of bazelisk and the plugin model they provide, couldn't we just use the magic method approach and do all the git dancing in there? That way we can all benefit from it while keeping the internals clean. |
Given that EDIT: actually, I have no idea where to start with @lberki I'm not around next week, but I am the week after (time zone UTC−08:00). That said, I'd like to try this |
@vdye they're built on top, they mentioned 2 in the conference https://github.com/aspect-build/aspect-cli and https://github.com/buddy-works/buddy-cli |
Hmmm, I must have misinterpreted then. I assumed the "plugin model they support" was referring to One thing I wanted to mention, since the |
@manuelnaranjo From user experience's perspective, the
3 could quickly eat up the time saved from 2. It's still nice to have, though. Currently, there is another issue with |
Yeah, I am aware of the limitations here and that it currently cannot be done with re: @vdye 's comment that this would then require a third-party tool outside of the Bazel and git ecosystem, I thought that there was no way around that: I don't want to embed knowledge about git into Bazel, you presumably don't want to embed knowledge about Bazel into git, that knowledge must live somewhere so a third place is the only option, isn't it? @newren : what is suboptimal about the vivifying approach? If |
I've caught up with the entire thread! I want to try to get to everything --
WORKSPACE can't load because not all transitive .bzl files are present
I don't think we can make do with an invalid workspace - I would treat the WORKSPACE contents and everything it references as "toolchain" source that simply must be present for Bazel to behave in a predictable manner. I can followup and check if we produce BEP that could be used to determine which .bzl file was missing. Then you could maybe write a wrapper script that fetches the required repo based on the BEP output (the
Build/test target can't load because not all transitive .bzl files are present
We currently need all the transitive It's not true today, but it's conceivable that we could be tolerant to the failure to load rule definitions not in the transitive deps of the requested build targets. Typically it's easier to just partition unrelated rule definitions into parallel directory trees.
How to download fewer bytes
Google's monorepo tooling is famously based on virtual filesystems where tools may assume all files are available, and the filesystem lazily-loads the source content of unmodified files as needed. This is the way we download the fewest .bzl files during loading at scale, so that's the path of least resistance. As has been discussed, to run There's two broad categories of solutions here: FUSE and non-FUSE.
FUSE Solution
I'm not familiar with any non-Googley FUSE filesystems. But. If there is a read-only FUSE filesystem for your VCS, I think the following approach is essentially what Git users at Google used for many years. You use the VCS to manage the files you want checked out, and wrapper scripts maintain a symlink to a read-only FUSE filesystem that provides a consistent baseline snapshot of all other files. This solution looks like this:
Any BUILD/bzl files not found in the workspace will be retried under If you're patching together a bunch of repos, you can do this with multiple symlinks to different mount points, though they will be checked sequentially, so loading performance may degrade if the additional failed FUSE filesystem reads are slow. I'm not 100% sure if we're incrementally correct with changes to the REPO symlink - it would likely depend also on details of the FUSE filesystem, eg. mtimes. Custom
|
@lberki : I'll try to answer your questions as best I can, and provide my experience and perspective on why I find certain things important or unimportant...
non-cone mode came first, but it is deprecated: https://git.kernel.org/pub/scm/git/git.git/commit/?id=a8defed07c It is true that some people still use non-cone mode. However, it has lots of problems (which is why we deprecated it), we think it cannot pragmatically be supported beyond its original basic feature set, and we've pointed out that it has a bunch of gotchas even on the original basic feature set that cannot be fixed. We've already added new features (such as the sparse-index) which are incompatible with non-cone mode. If we as Git developers say that non-cone mode is broken and isn't worth supporting (or maybe isn't possible to support) for new features, why should a project like bazel try in cases where it's non-trivial for them to do so? But, even more importantly: I think it's actively a bad idea to try to worry about file-level partial checkouts even if you could easily implement it; doing so is likely so computationally expensive that, given how often it must be done, it would defeat the whole point of the exercise: speed. The point of letting users work on a subset of the repository is so that common operations are faster. Computing fine-grained full tree file-level dependencies is likely not fast.
You seem to be presuming that full file-level dependencies must be implemented for this feature, or this feature should not be implemented at all. I'm a bit confused by that. We don't want full file-level dependencies; we only want package-level dependencies.
The point of the feature being asked is to allow vivifying (or unsparsifying) the necessary packages, but only those. If you vivify other things (even if only a few files from other packages), then you run the risk that users, IDEs, and/or scripts can get confused. (As a couple examples, users may ask why they have unnecessary directories, or why "git grep" is turning up hits in places they weren't working on and "didn't want checked out"; IDEs may attempt to auto-build stuff that isn't part of the user's focus.) However, if it's only a few files from otherwise unwanted packages, you may still achieve the overall goal of allowing users to work with a subset of their repo quickly, so you might have a workable even if suboptimal solution. In contrast, if determining package dependencies requires first checking out everything (i.e. to unsparsify completely), then sparse-checkouts end up costing more than they save. Unsparsifying and resparsifying is a very expensive operation. And we need to potentially re-determine dependencies with every merge/rebase/reset/switch-branch/etc. (If any of the build control files -- *.bzl files I think in bazel's case, *.metaconf files in the case I was dealing with -- are newer than $GIT_DIR/info/sparse-checkout, then the dependencies could have changed and we need to determine the new package-level dependencies and update the sparse-checkout patterns appropriately). In fact, we did have some scripts that only a few users had to use, which required the unsparsify-then-run-script-logic-then-resparsify dance. Those users all ended up abandoning sparse-checkouts unless we found a way to remove the need to unsparsify -- usually by either rewriting the logic in it, or moving it to CI systems so they didn't need to run it locally. But it's not just unsparsifying and resparsifying that can be expensive. I'm also a bit worried about the dependency calculation step used to update the sparse checkout patterns. 
Since we have to perform that check often (with potentially every merge/rebase/reset/switch-branch/etc), it was important in our case to make sure we only computed package dependencies and not full file dependencies. The latter would have been so excruciatingly slow that I think it would have been a deal breaker. And from what I saw at the Git Merge conference, those using bazel are working on even bigger monorepos than we were, so I would suspect the problem would be even bigger for them. |
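The staleness check mentioned above (build control files newer than $GIT_DIR/info/sparse-checkout) amounts to an mtime comparison. A small sketch, assuming the caller already knows which build control files to look at; the function name is hypothetical:

```python
import os
from typing import Iterable


def dependencies_may_be_stale(sparse_file: str, build_files: Iterable[str]) -> bool:
    """Return True if any existing build control file is newer than the
    sparse-checkout pattern file, or if that file does not exist yet."""
    if not os.path.exists(sparse_file):
        return True
    baseline = os.path.getmtime(sparse_file)
    return any(
        os.path.exists(path) and os.path.getmtime(path) > baseline
        for path in build_files
    )
```

Only when this returns True would the (comparatively expensive) package-level dependency calculation need to run again, for example from a hook after a merge, rebase, reset or branch switch.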
While downloading fewer bytes is important for many folks, there are monorepo folks who have a performance profile where a dense clone is a reasonable cost but they really want sparse-checkouts. Not everyone is Google/Microsoft/Facebook-sized.
I get the feeling that the following two ideas are foreign to bazel:
Is that correct? Would introducing those two ideas be too high an impedance mis-match with how bazel works? All that said, I'm not working on this feature, Victoria is. I'm only giving context from my role as a Git developer, and as someone on a team who implemented this idea for a different (internal) build system. |
I've been asked to assign a Priority label for this issue. Given that the design discussion is still ongoing, I think |
After a lot of consideration, I’m going to pause my efforts towards getting upstream acceptance of this functionality. I’d love it if the Bazel team could keep this issue open for people to find and lend their perspectives but, at the moment, we seem to have reached an impasse on the necessity and approach to this feature. For anyone still interested, I plan/hope to keep working on an adapter API in my fork, and I’ll keep an eye on this thread in case there’s something valuable I can add to the discussion. Thanks! |
@newren : you are wrong on both counts :) It's quite natural for Bazel to build a subset of the code base; it's just that currently, determining the particular subset one needs to build anything particular in the code base is more complicated than it should be. That's something that would be much easier to implement than this issue either with @vdye 's design or something else. As for determining what part of the code base is needed to build "X and Y", it's called @vdye : understood and acknowledged, with some sadness. It feels like there is a nice design wanting to come into existence, but between the simplicity of implementing easier dependency discovery, FUSE (or FUSE-by-NFS, as seems to be in fashion on Mac OS these days) which limit the potential benefits and the difficulty of coming up with an interface that lets you do what you want without excessive collateral damage to Bazel, I think this is the right decision, at least for the time being. I'll then close this issue and move the design doc to "dropped" once I'm back to my usual terminal. |
bazelbuild/proposals#281 sent out. |
It looks like Facebook is also going the FUSE route: https://github.com/facebook/sapling/blob/main/eden/fs/docs/Overview.md |
Here are my takes on this:
Recap
I think there is a spectrum of repository sizes, and thus there is a spectrum of solutions for Bazel to consider enabling those use cases. In my mind, the red line above is the distribution of repositories that use, or are going to use, Bazel today or in the near future.
We know that for (3), FUSE is much needed. But behind FUSE is a colossal infrastructure setup: client, server, cloud workspace, custom vcs integration. The cost of ownership for a FUSE solution is huge. It will come down in the future as more companies reach that scale and if there are known open source solutions. But for a non-big-tech company, the cost of running such a solution is way too high today. For (2), what we can observe from Canva, Twitter and Uber in Git Merge 2022 conference is that
What to do from here
I think at the recent Git contributor summit, @jrn from Google's git team (who supports AOSP) has expressed interest in building a "git-aware filesystem" VFS of sorts: http://public-inbox.org/git/[email protected]/. If there is a mature FUSE solution equipped with Bazel + Git integration to come out soon, I would suggest we go ahead and close this issue in hope that we could drive the FUSE adoption among all users. If there isn't, I think git-aware Bazel improvements might be much needed, specifically toward supporting git-sparse-checkout use cases. Victoria's proposal (and Newren's suggestions) would be a great first step toward this direction. |
I want to be clear, I absolutely plan to prototype something here in the coming weeks! I agree that the user experience of "also install a read-only FUSE filesystem" just isn't adequate. I'm sorry for the long period of silence on this issue: I neglected to mention that I'd be honeymooning for most of December, a twice-postponed trip originally booked before COVID. The solutions I offered in my previous post were simply meant as stopgaps that work today, if there is any filesystem (or combined set of filesystems) that provides read-only access to the monolithic source. I'll spend some time trying out existing open-source git filesystems with Bazel and observing what happens in practice, and doing some performance testing. I don't expect the eventual write-up to satisfy many: installing a whole filesystem as step 1.5 to use Bazel is simply asking a lot, even if it's a recommended filesystem. For everybody else who does not have a complete read-only filesystem for their monorepo, there are a few improvements that can be made. If Bazel were aware that the workspace were incomplete and knows the git repo and sparse-checkout configuration, it can read content on-demand from git ("read-vcs-on-demand"). This is true, but there is a limiting principle: it's implementing another userspace filesystem. Any code written for Bazel's that looks at all like a gitfs will become a gitfs, and will be worse than all the other gitfs out there because it will be the newest. So it should punch above its weight in benefit when time can also be spent integrating with existing filesystems.
|
Based on the conversation so far, it sounds like the easiest integration point in Bazel is using Since the FileSystem class in Bazel is extensible and Google already has other implementations, how hard would it be to create a small FileSystem implementation that speaks a simple protocol over stdin/stdout with a subprocess, then extend the
In this case, the Unlike a disk-backed |
Quick note: git in recent versions has the ability to spawn daemons with interprocess communication for file system watching functionality https://git-scm.com/docs/api-simple-ipc https://git-scm.com/docs/git-fsmonitor--daemon |
While the proposed command is conceptually similar to
Rather than put all that git-specific logic in Bazel, I'd prefer if the protocol were something close to what Bazel wants. I don't have strong opinions about whether the git side is implemented as another flag on |
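For concreteness, here is one shape such a stdin/stdout protocol could take, written from the helper-process side. Everything in it is hypothetical: the line-delimited JSON framing, the `op`/`path`/`data` field names, and the `lookup` callback are illustrative choices, not anything Bazel or Git defines.

```python
import base64
import json
from typing import Callable, Optional


def handle_request(line: str, lookup: Callable[[str], Optional[bytes]]) -> str:
    """Answer one request line with one response line (both JSON)."""
    request = json.loads(line)
    if request.get("op") != "read":
        return json.dumps({"ok": False, "error": "unsupported op"})
    content = lookup(request["path"])
    if content is None:
        return json.dumps({"ok": False, "error": "not found"})
    # Base64 keeps arbitrary file bytes safe inside a JSON string.
    return json.dumps({"ok": True, "data": base64.b64encode(content).decode("ascii")})


def serve(stdin, stdout, lookup) -> None:
    """Helper-process loop: one request per input line, one reply per line."""
    for line in stdin:
        if line.strip():
            stdout.write(handle_request(line, lookup) + "\n")
            stdout.flush()
```

The build tool would only ever see paths and bytes; all VCS-specific logic would live behind `lookup`, which is the separation the comment above asks for.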
I don't know why the existing git / virtual file system bridges are slow, in particular, if they are slow due to the inherent limitations of git, any particular operating system or the implementation of the bridges themselves. I would much like not to rely on I think the most feasible approach is some variant of @vdye 's "Bazel pokes at some external tool that materializes packages as it needs them" idea, modulo the issues outlined in #16380 (comment) . |
If multi-element
Those both have some performance impact since every read needs to hit both places, instead of just locating a package, but it would probably still be less than the win from being able to get hashes directly from the VFS implementation. More fundamentally, I challenge the idea that users actually want to include all the transitive dependencies of their code in the sparse checkout. There are some dependency patterns where your code might technically depend on a large number of other packages at analysis time that in practice you're not actually interested in modifying, and may not want to pay the cost of having on-disk. Using a FileSystem backed by git directly means Bazel can work with whatever sparse checkout the user wants to use (as long as they aren't running local execution actions), potentially including the null checkout for CI workers. |
I unfortunately can't state that I would like to avoid entangling Bazel with source control; sure, we could do a The current answer to "conveniently narrow interface" for reading sources is FUSE despite all its problems. I do realize that that is getting less and less tenable, but there isn't an easy answer :( The best I can think of is in my previous comment and that's not very satisfying, either. |
https://lore.kernel.org/git/CAJoAoZ=Cig_kLocxKGax31sU7Xe4==BGzC__Bg2_pr7krNq6MA@mail.gmail.com/ should be interesting to the folks here. Google's Git team wrote about their experience trying to make a VFS on top of git. |
Thanks for the link, @sluongng :) |
Description of the feature request:
Add "filesystem adapter" API(s) so that when a source file isn't found on-disk during the loading phase, Bazel can fall back on a user-specified function to vivify the file (if it exists) rather than throwing an error.
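As a rough illustration of the fallback behavior requested here, the sketch below reads a source file and, only when it is missing, calls a user-supplied hook before retrying. The names `read_source`, `dict_vivifier`, and the hook signature are hypothetical and not Bazel APIs; the in-memory backing store stands in for whatever tool would actually materialize the file.

```python
import os


def read_source(root, rel_path, vivify=None):
    """Read root/rel_path, giving the hook one chance to create it if absent."""
    full_path = os.path.join(root, rel_path)
    if not os.path.exists(full_path) and vivify is not None:
        # Hypothetical hook: may materialize the file on disk, or do nothing.
        vivify(root, rel_path)
    # If the hook could not produce the file, this raises FileNotFoundError,
    # which matches the behavior without any hook configured.
    with open(full_path, "rb") as f:
        return f.read()


def dict_vivifier(backing):
    """Build a toy hook that materializes files from an in-memory mapping."""

    def vivify(root, rel_path):
        data = backing.get(rel_path)
        if data is None:
            return  # The file does not exist in the backing store either.
        full_path = os.path.join(root, rel_path)
        os.makedirs(os.path.dirname(full_path), exist_ok=True)
        with open(full_path, "wb") as f:
            f.write(data)

    return vivify
```

In a Git sparse-checkout, the hook might instead shell out to `git sparse-checkout add` for the containing directory, though whether that is fast or thread-safe enough during loading is exactly the kind of detail a design doc would need to settle.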
This request comes from a desire to have Bazel work in a Git sparse-checkout environment without pre-specifying patterns (see below), but the feature probably shouldn't be scoped to just Git; for example, Mercurial has a similar sparse checkout feature. I'd be happy to see a feature like this be even more general purpose if there are other cases it could support!
What underlying problem are you trying to solve with this feature?
Internal dependencies in Bazel are implicitly required to be on-disk when performing Bazel operations (`query`, `build`, etc.) that involve loading the associated source files. However, in a Git sparse-checkout environment, source files are not present in the worktree (i.e., where Bazel is looking for them) unless vivified with `git sparse-checkout add`. Even when not on-disk, though, these source files are more consistent with internal dependencies (fixed to the same "version" as the rest of the repo, not a separate pre-built package, checked out to source directories) than external ones.
Right now, the only way to work in a sparse-checkout with Bazel is to specify the files that a particular build will need before running `bazel build`. A user has to maintain two independent sources of truth for which files are needed in a build: the Bazel dependency graph, and the sparse-checkout patterns. If the not-on-disk sparse-checkout files can reasonably be interpreted as "internal dependencies" without running contrary to Bazel's underlying principles, it would be extremely helpful to users if they could resolve the files "just-in-time" while Bazel builds its internal dependency graph.
Which operating system are you running Bazel on?
No response
What is the output of `bazel info release`?
No response
If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.
No response
What's the output of `git remote get-url origin; git rev-parse master; git rev-parse HEAD`?
No response
Have you found anything relevant by searching the web?
Within Bazel itself, it looks like the only other reference to `sparse-checkout` is a feature request to include a static pattern list in the specification of a `git_repository()`. That request doesn't quite match this one, since 1) that deals with an external repository, rather than the local one, and 2) it still requires pre-determined patterns, rather than resolving files just-in-time.
Outside of Bazel, third-party Bazel/sparse-checkout interoperability tools are pretty common, especially among large monorepo projects (1, 2). However, these tools still rely on a fully on-disk copy of a Git repository to generate the Bazel dependency graph, severely inhibiting the potential performance gains of `git sparse-checkout`. And, because these integrations use a tool separate from either `git` or `bazel`, they are potentially more difficult to adopt than an integration contained in a Bazel rule library.
Any other information, logs, or outputs that you want to share?
I know this is a pretty large feature request (with lots of details to consider: UX, performance, thread safety, etc.), so if this request doesn't seem completely infeasible, I'm happy to write & submit a more detailed design doc!