-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(ingest): clarify s3/s3a requirements and platform defaults #4263
Conversation
def read_file_spark(self, file: str, is_aws: bool) -> Optional[DataFrame]: | ||
|
||
if is_aws: | ||
file = f"s3a://{file}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: s3://{file}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or is the "a" intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a typo—Spark uses the s3a protocol rather than s3 (see https://spark.apache.org/docs/latest/cloud-integration.html)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay got it.. Thanks !
@@ -40,6 +41,15 @@ def ensure_profiling_pattern_is_passed_to_profiling( | |||
profiling.allow_deny_patterns = values["profile_patterns"] | |||
return values | |||
|
|||
@pydantic.validator("platform", always=True) | |||
def validate_platform(cls, value: str, values: Dict[str, Any]) -> Optional[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is "values" coming from here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the rest of the config values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values
are the previously-validated fields here since this is a Pydantic validator — see https://pydantic-docs.helpmanual.io/usage/validators/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comment, otherwise LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Adds a few sentences to the docs clarifying the need to grant S3A permissions when running profiling on files in AWS S3. Also sets the
platform
tos3
orfile
depending on thebase_path
ifplatform
is unspecified.Checklist