Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for MinIO and B2 buckets #620

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Conversation

TaperChipmunk32
Copy link
Collaborator

@TaperChipmunk32 TaperChipmunk32 commented Jan 7, 2025

-Refactored SilNlpEnv in silnlp/common/environment.py to support connection to either MinIO or B2

-Kept in support for AWS temporarily

-Updated readme and other documentation to show instructions on MinIO and B2 bucket setup

With this PR, everyone can start using Backblaze B2 or MinIO once they have their environment set up.


This change is Reviewable

Copy link
Collaborator

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 12 of 12 files at r1, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ddaspit and @TaperChipmunk32)


bucket_setup.md line 7 at r1 (raw file):

### Note For MinIO setup

In order to access the MinIO bucket locally, you must have a VPN connected to its network.

It might be helpful to add "If you need VPN access, please reach out to an SILNLP dev team member."


bucket_setup.md line 13 at r1 (raw file):

**Windows**

The following will mount /silnlp on your B drive or /nlp-research on your M drive and allow you to explore, read and write.

What's the motivation behind the B drive folder and M drive folder having different names?


silnlp/common/environment.py line 33 at r1 (raw file):

        self.bucket_service = os.getenv("BUCKET_SERVICE", "").lower()

        self.set_s3_bucket()

I'm not sure that we want to jump to setting up the s3 bucket immediately. The user, especially non-SIL users, may want to specify a local data dir on their computer rather than use a bucket. Under the current approach, set_data_dir immediately calls resolve_data_dir if data_dir is None, and from there it checks if the SIL_NLP_DATA_PATH is a local directory first, before checking to see if it's an S3 path. We may want to also keep the SIL_NLP_DATA_PATH environment variable to maintain this feature.


silnlp/common/environment.py line 183 at r1 (raw file):

        # TEMPORARY: This allows users to still connect to AWS S3 if they have not set up MinIO or B2 yet. This will be removed in the future.
        if self.bucket_service == "aws" or (os.getenv("MINIO_ACCESS_KEY") is None and os.getenv("B2_KEY_ID") is None):
            LOGGER.warning("Support for AWS S3 will soon be removed. Please set up MinIO and/or B2 credentials.")

This is a great idea to include!


silnlp/common/environment.py line 212 at r1 (raw file):

                self.bucket_service = "minio"
            except Exception as e:
                LOGGER.info(e)

It's probably best to upgrade this to a warning level, since it does involve a failure to connect.


silnlp/common/environment.py line 226 at r1 (raw file):

                self.bucket_service = "b2"
            except Exception as e:
                LOGGER.info(e)

Same as above.

Copy link
Collaborator Author

@TaperChipmunk32 TaperChipmunk32 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 8 of 12 files reviewed, 5 unresolved discussions (waiting on @ddaspit and @mshannon-sil)


silnlp/common/environment.py line 33 at r1 (raw file):

Previously, mshannon-sil wrote…

I'm not sure that we want to jump to setting up the s3 bucket immediately. The user, especially non-SIL users, may want to specify a local data dir on their computer rather than use a bucket. Under the current approach, set_data_dir immediately calls resolve_data_dir if data_dir is None, and from there it checks if the SIL_NLP_DATA_PATH is a local directory first, before checking to see if it's an S3 path. We may want to also keep the SIL_NLP_DATA_PATH environment variable to maintain this feature.

Done, the original functionality should be restored. If SIL_NLP_DATA_PATH is included and BUCKET_SERVICE is not, then it will try the local file system.


silnlp/common/environment.py line 212 at r1 (raw file):

Previously, mshannon-sil wrote…

It's probably best to upgrade this to a warning level, since it does involve a failure to connect.

Done.


silnlp/common/environment.py line 226 at r1 (raw file):

Previously, mshannon-sil wrote…

Same as above.

Done.


bucket_setup.md line 7 at r1 (raw file):

Previously, mshannon-sil wrote…

It might be helpful to add "If you need VPN access, please reach out to an SILNLP dev team member."

Done.


bucket_setup.md line 13 at r1 (raw file):

Previously, mshannon-sil wrote…

What's the motivation behind the B drive folder and M drive folder having different names?

This allows both drives to be mounted at the same time, if anyone ever wants to.

@TaperChipmunk32 TaperChipmunk32 linked an issue Jan 8, 2025 that may be closed by this pull request
Copy link
Collaborator

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 4 files at r2, all commit messages.
Reviewable status: 10 of 12 files reviewed, 4 unresolved discussions (waiting on @ddaspit and @TaperChipmunk32)


silnlp/common/environment.py line 33 at r1 (raw file):

Previously, TaperChipmunk32 (Matthew Beech) wrote…

Done, the original functionality should be restored. If SIL_NLP_DATA_PATH is included and BUCKET_SERVICE is not, then it will try the local file system.

Great. The only thing I'm noticing now is that, since SIL_NLP_DATA_PATH would only look at local filesystems, it would not be possible for a non-SIL user who doesn't have access to our S3 bucket to make their own S3 bucket with their own data and point to it. Maybe we should keep the SIL_NLP_DATA_PATH variable for all folder names, local or bucket, alongside the new BUCKET_SERVICE variable. And not deprecate the AWS bucket feature if it doesn't add much overhead. Any thoughts @ddaspit ?


silnlp/common/environment.py line 145 at r1 (raw file):

                else:
                    raise Exception(
                        f"The path defined by environment variable data_path ({data_path}) is not a "

Same as above.

Copy link
Collaborator

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 8 of 12 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @mshannon-sil and @TaperChipmunk32)


silnlp/common/environment.py line 33 at r1 (raw file):

Previously, mshannon-sil wrote…

Great. The only thing I'm noticing now is that, since SIL_NLP_DATA_PATH would only look at local filesystems, it would not be possible for a non-SIL user who doesn't have access to our S3 bucket to make their own S3 bucket with their own data and point to it. Maybe we should keep the SIL_NLP_DATA_PATH variable for all folder names, local or bucket, alongside the new BUCKET_SERVICE variable. And not deprecate the AWS bucket feature if it doesn't add much overhead. Any thoughts @ddaspit ?

I don't think anyone is currently using silnlp with another S3 bucket, but I don't have a problem with leaving in the AWS support, so that we don't lose that functionality. We can always remove it in the future if we decide it is not worth continuing to maintain.


README.md line 143 at r2 (raw file):

   B2_KEY_ID=xxxxxxxx
   B2_APPLICATION_KEY=xxxxxxxx
   MINIO_ENDPOINT_URL=https://truenas.psonet.languagetechnology.org:9000

Will most users only need to setup B2 and not MinIO?

Copy link
Collaborator Author

@TaperChipmunk32 TaperChipmunk32 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @ddaspit and @mshannon-sil)


README.md line 143 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Will most users only need to setup B2 and not MinIO?

Syncing between the two buckets using rclone takes 8 minutes, even when there are no transfers needed, so we will likely need to have users connect to MinIO.


silnlp/common/environment.py line 33 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I don't think anyone is currently using silnlp with another S3 bucket, but I don't have a problem with leaving in the AWS support, so that we don't lose that functionality. We can always remove it in the future if we decide it is not worth continuing to maintain.

Done.


silnlp/common/environment.py line 145 at r1 (raw file):

Previously, mshannon-sil wrote…

Same as above.

Done.

Copy link
Collaborator

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but let's hold on on merging this until we know what we're doing with the VPN/syncing issue.

Reviewed 2 of 4 files at r2.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit)

-Refactored SilNlpEnv in silnlp/common/environment.py to support connection to either MinIO or B2

-Kept in support for AWS temporarily

-Updated readme and other documentation to show instructions on MinIO and B2 bucket setup
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants