-
-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added support for MinIO and B2 buckets #620
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 12 of 12 files at r1, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ddaspit and @TaperChipmunk32)
bucket_setup.md
line 7 at r1 (raw file):
### Note For MinIO setup In order to access the MinIO bucket locally, you must have a VPN connected to its network.
It might be helpful to add "If you need VPN access, please reach out to an SILNLP dev team member."
bucket_setup.md
line 13 at r1 (raw file):
**Windows** The following will mount /silnlp on your B drive or /nlp-research on your M drive and allow you to explore, read and write.
What's the motivation behind the B drive folder and M drive folder having different names?
silnlp/common/environment.py
line 33 at r1 (raw file):
self.bucket_service = os.getenv("BUCKET_SERVICE", "").lower() self.set_s3_bucket()
I'm not sure that we want to jump to setting up the s3 bucket immediately. The user, especially non-SIL users, may want to specify a local data dir on their computer rather than use a bucket. Under the current approach, set_data_dir immediately calls resolve_data_dir if data_dir is None, and from there it checks if the SIL_NLP_DATA_PATH is a local directory first, before checking to see if it's an S3 path. We may want to also keep the SIL_NLP_DATA_PATH environment variable to maintain this feature.
silnlp/common/environment.py
line 183 at r1 (raw file):
# TEMPORARY: This allows users to still connect to AWS S3 if they have not set up MinIO or B2 yet. This will be removed in the future. if self.bucket_service == "aws" or (os.getenv("MINIO_ACCESS_KEY") is None and os.getenv("B2_KEY_ID") is None): LOGGER.warning("Support for AWS S3 will soon be removed. Please set up MinIO and/or B2 credentials.")
This is a great idea to include!
silnlp/common/environment.py
line 212 at r1 (raw file):
self.bucket_service = "minio" except Exception as e: LOGGER.info(e)
It's probably best to upgrade this to a warning level, since it does involve a failure to connect.
silnlp/common/environment.py
line 226 at r1 (raw file):
self.bucket_service = "b2" except Exception as e: LOGGER.info(e)
Same as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 8 of 12 files reviewed, 5 unresolved discussions (waiting on @ddaspit and @mshannon-sil)
silnlp/common/environment.py
line 33 at r1 (raw file):
Previously, mshannon-sil wrote…
I'm not sure that we want to jump to setting up the s3 bucket immediately. The user, especially non-SIL users, may want to specify a local data dir on their computer rather than use a bucket. Under the current approach, set_data_dir immediately calls resolve_data_dir if data_dir is None, and from there it checks if the SIL_NLP_DATA_PATH is a local directory first, before checking to see if it's an S3 path. We may want to also keep the SIL_NLP_DATA_PATH environment variable to maintain this feature.
Done, the original functionality should be restored. If SIL_NLP_DATA_PATH is included and BUCKET_SERVICE is not, then it will try the local file system.
silnlp/common/environment.py
line 212 at r1 (raw file):
Previously, mshannon-sil wrote…
It's probably best to upgrade this to a warning level, since it does involve a failure to connect.
Done.
silnlp/common/environment.py
line 226 at r1 (raw file):
Previously, mshannon-sil wrote…
Same as above.
Done.
bucket_setup.md
line 7 at r1 (raw file):
Previously, mshannon-sil wrote…
It might be helpful to add "If you need VPN access, please reach out to an SILNLP dev team member."
Done.
bucket_setup.md
line 13 at r1 (raw file):
Previously, mshannon-sil wrote…
What's the motivation behind the B drive folder and M drive folder having different names?
This allows both drives to be mounted at the same time, if anyone ever wants to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 4 files at r2, all commit messages.
Reviewable status: 10 of 12 files reviewed, 4 unresolved discussions (waiting on @ddaspit and @TaperChipmunk32)
silnlp/common/environment.py
line 33 at r1 (raw file):
Previously, TaperChipmunk32 (Matthew Beech) wrote…
Done, the original functionality should be restored. If SIL_NLP_DATA_PATH is included and BUCKET_SERVICE is not, then it will try the local file system.
Great. The only thing I'm noticing now is that, since SIL_NLP_DATA_PATH would only look at local filesystems, it would not be possible for a non-SIL user who doesn't have access to our S3 bucket to make their own S3 bucket with their own data and point to it. Maybe we should keep the SIL_NLP_DATA_PATH variable for all folder names, local or bucket, alongside the new BUCKET_SERVICE variable. And not deprecate the AWS bucket feature if it doesn't add much overhead. Any thoughts @ddaspit ?
silnlp/common/environment.py
line 145 at r1 (raw file):
else: raise Exception( f"The path defined by environment variable data_path ({data_path}) is not a "
Same as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 8 of 12 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @mshannon-sil and @TaperChipmunk32)
silnlp/common/environment.py
line 33 at r1 (raw file):
Previously, mshannon-sil wrote…
Great. The only thing I'm noticing now is that, since SIL_NLP_DATA_PATH would only look at local filesystems, it would not be possible for a non-SIL user who doesn't have access to our S3 bucket to make their own S3 bucket with their own data and point to it. Maybe we should keep the SIL_NLP_DATA_PATH variable for all folder names, local or bucket, alongside the new BUCKET_SERVICE variable. And not deprecate the AWS bucket feature if it doesn't add much overhead. Any thoughts @ddaspit ?
I don't think anyone is currently using silnlp with another S3 bucket, but I don't have a problem with leaving in the AWS support, so that we don't lose that functionality. We can always remove it in the future if we decide it is not worth continuing to maintain.
README.md
line 143 at r2 (raw file):
B2_KEY_ID=xxxxxxxx B2_APPLICATION_KEY=xxxxxxxx MINIO_ENDPOINT_URL=https://truenas.psonet.languagetechnology.org:9000
Will most users only need to setup B2 and not MinIO?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @ddaspit and @mshannon-sil)
README.md
line 143 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Will most users only need to setup B2 and not MinIO?
Syncing between the two buckets using rclone takes 8 minutes, even when there are no transfers needed, so we will likely need to have users connect to MinIO.
silnlp/common/environment.py
line 33 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I don't think anyone is currently using silnlp with another S3 bucket, but I don't have a problem with leaving in the AWS support, so that we don't lose that functionality. We can always remove it in the future if we decide it is not worth continuing to maintain.
Done.
silnlp/common/environment.py
line 145 at r1 (raw file):
Previously, mshannon-sil wrote…
Same as above.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but let's hold on on merging this until we know what we're doing with the VPN/syncing issue.
Reviewed 2 of 4 files at r2.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit)
aa231b4
to
976d203
Compare
ec5b1f6
to
c4b78c7
Compare
-Refactored SilNlpEnv in silnlp/common/environment.py to support connection to either MinIO or B2 -Kept in support for AWS temporarily -Updated readme and other documentation to show instructions on MinIO and B2 bucket setup
c4b78c7
to
833e27b
Compare
-Refactored SilNlpEnv in silnlp/common/environment.py to support connection to either MinIO or B2
-Kept in support for AWS temporarily
-Updated readme and other documentation to show instructions on MinIO and B2 bucket setup
With this PR, everyone can start using Backblaze B2 or MinIO once they have their environment set up.
This change is![Reviewable](https://camo.githubusercontent.com/1541c4039185914e83657d3683ec25920c672c6c5c7ab4240ee7bff601adec0b/68747470733a2f2f72657669657761626c652e696f2f7265766965775f627574746f6e2e737667)