Core: Consolidate write.folder-storage.path and write.object-storage.path to write.data.path #3094
…o write.data.path.
Thanks for handling this @flyrain!
A couple nits and open-ended questions and one more question (for my own understanding).
Clarifying for my own understanding: if a user enables the object-storage location provider and doesn't set any of these paths, what happens?
For example, in the case below, does the data file still get the hash? And if so, where is it placed? At table.location() + "/data" + <hash>? This might be something to address in another PR (or maybe I just need more coffee).
table.updateProperties()
.set(TableProperties.OBJECT_STORE_ENABLED, "true")
.commit();
Assert.assertTrue("default data location should be used when object storage path not set",
table.locationProvider().newDataLocation("file").contains(table.location() + "/data"));
// This only applies to files written after this property is set. Files previously written aren't
// relocated to reflect this parameter.
// If not set, defaults to a "data" folder underneath the root path of the table.
public static final String WRITE_DATA_LOCATION = "write.data.path";
Nit: Do we want to update the comment for the default if not set to reflect anything about the possibility of object storage location provider?
Up to you. The more I think about it, the more I think it just complicates things and that we should just properly document the behavior on the website. But up to you.
nit: we can move L161 to after L144 to avoid deleting the comments and add again here.
@@ -141,4 +133,18 @@ private static String stripTrailingSlash(String path) { | |||
} | |||
return result; | |||
} | |||
|
|||
public static String dataLocation(Map<String, String> properties, String tableLocation) { |
Nit: Does this need to be public or can it be private? If it needs to be visible for testing or something, that's ok.
Visible for testing should use protected or package-private.
I made it public because it is used in IcebergSourceBenchmark.java. Maybe we can change WRITE_FOLDER_STORAGE_LOCATION to WRITE_DATA_PATH in IcebergSourceBenchmark.java?
public static String dataLocation(Map<String, String> properties, String tableLocation) {
  String dataLocation = properties.get(TableProperties.WRITE_DATA_LOCATION);
  if (dataLocation == null) {
    dataLocation = properties.get(TableProperties.OBJECT_STORE_PATH);
    if (dataLocation == null) {
      dataLocation = properties.get(TableProperties.WRITE_FOLDER_STORAGE_LOCATION);
      if (dataLocation == null) {
        dataLocation = String.format("%s/data", tableLocation);
      }
    }
  }
  return dataLocation;
}
While we're falling back through several deprecated options, we might want to list out the intended fallback order / behavior (in a comment up top for example) so we can more easily verify that it happens.
Or better yet, add tests setting different combinations.
We should potentially also address whether or not it's possible for users to have set too many flags and then error out.
Lastly, might want to drop a warning level deprecation log if one of the deprecated flags is non-null (or maybe just info). :)
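One possible way to approach the "too many flags" question is sketched below. This is only an illustration, not something the PR does: the class name LocationFlagCheck and both helper methods are hypothetical, and the three property keys are written as string literals so the snippet runs without Iceberg on the classpath. Whether setting several of these properties should be an error at all is left open in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Iceberg: reports which data location
// properties are set and optionally rejects ambiguous configurations.
public class LocationFlagCheck {
  static final List<String> LOCATION_KEYS = Arrays.asList(
      "write.data.path",
      "write.object-storage.path",
      "write.folder-storage.path");

  // Returns the location keys that have a non-null value, in priority order.
  static List<String> configuredKeys(Map<String, String> properties) {
    List<String> configured = new ArrayList<>();
    for (String key : LOCATION_KEYS) {
      if (properties.get(key) != null) {
        configured.add(key);
      }
    }
    return configured;
  }

  // Throws if more than one location key is set (an assumed policy).
  static void failIfAmbiguous(Map<String, String> properties) {
    List<String> configured = configuredKeys(properties);
    if (configured.size() > 1) {
      throw new IllegalArgumentException(
          "Multiple data location properties set: " + configured);
    }
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    props.put("write.data.path", "s3://bucket/new");
    props.put("write.folder-storage.path", "s3://bucket/old");
    System.out.println(configuredKeys(props));
  }
}
```

A stricter policy like failIfAmbiguous would break tables that already carry a deprecated key next to the new one, which is a reason to prefer silent precedence.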
testObjectStorageLocationProviderPathResolution is for that. Will add warn log.
Added the log.
public static String dataLocation(Map<String, String> properties, String tableLocation) {
  String dataLocation = properties.get(TableProperties.WRITE_DATA_LOCATION);
  if (dataLocation == null) {
    dataLocation = properties.get(TableProperties.OBJECT_STORE_PATH);
This is only correct if the table is using object storage. I think you need to update this logic to select whether to return OBJECT_STORE_PATH or WRITE_FOLDER_STORAGE_LOCATION depending on the location provider selected. Both should fall back to the table location. And the new property should take precedence.
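The selection described in this comment could look roughly like the sketch below. It is not the code that was merged: the class name ProviderAwareDataLocation is hypothetical, the isObjectStore flag stands in for "which location provider is selected", and the property keys are inlined as string literals (taken from the PR title and the WRITE_DATA_LOCATION constant) so the snippet compiles on its own. Later comments in this thread extend the object-storage case with a further fallback to write.folder-storage.path.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch of provider-dependent path resolution.
public class ProviderAwareDataLocation {
  static final String WRITE_DATA_LOCATION = "write.data.path";
  static final String OBJECT_STORE_PATH = "write.object-storage.path";
  static final String WRITE_FOLDER_STORAGE_LOCATION = "write.folder-storage.path";

  static String dataLocation(
      Map<String, String> properties, String tableLocation, boolean isObjectStore) {
    // The new property always takes precedence.
    String dataLocation = properties.get(WRITE_DATA_LOCATION);
    if (dataLocation == null) {
      // Only the deprecated key that matches the selected provider is consulted.
      String deprecatedProperty =
          isObjectStore ? OBJECT_STORE_PATH : WRITE_FOLDER_STORAGE_LOCATION;
      dataLocation = properties.get(deprecatedProperty);
    }
    if (dataLocation == null) {
      // Both providers fall back to <tableLocation>/data.
      dataLocation = String.format("%s/data", tableLocation);
    }
    return dataLocation;
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    props.put(OBJECT_STORE_PATH, "s3://bucket/objects");
    System.out.println(dataLocation(props, "s3://bucket/table", true));
    System.out.println(dataLocation(props, "s3://bucket/table", false));
  }
}
```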
Agreed. Made the change. This is going to be a little different from #2965. Modified the test case testObjectStorageLocationProviderPathResolution and added testDefaultStorageLocationProviderPathResolution.
@@ -135,21 +135,31 @@ private TableProperties() { | |||
public static final String OBJECT_STORE_ENABLED = "write.object-storage.enabled"; | |||
public static final boolean OBJECT_STORE_ENABLED_DEFAULT = false; | |||
|
|||
/** | |||
* @deprecated will be removed in 0.14.0, use {@link #WRITE_DATA_LOCATION} instead |
I doubt this will ever be removed. It isn't worth breaking on older tables.
Makes sense. Removed the comments.
This PR doesn't change the behavior of object storage location provider. It will fall back to |
private static String dataLocation(Map<String, String> properties, String tableLocation, String deprecatedProperty) {
  String dataLocation = properties.get(TableProperties.WRITE_DATA_LOCATION);
  if (dataLocation == null) {
    dataLocation = properties.get(deprecatedProperty);
We don't have to check if deprecatedProperty is null, the map can take a null.
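A small illustration of that point, with a caveat: a null key lookup is safe for java.util.HashMap, but not for every Map implementation. The class NullKeyLookup and its helper are hypothetical, and which Map type actually backs the table properties at this call site is not shown in this thread, so treat the HashMap case as an assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo of how different Map implementations treat get(null).
public class NullKeyLookup {
  // Returns true if looking up a null key throws NullPointerException.
  static boolean throwsOnNullKey(Map<String, String> properties) {
    try {
      properties.get(null);
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    Map<String, String> hashMap = new HashMap<>();
    hashMap.put("write.data.path", "s3://bucket/data");

    // HashMap permits null keys, so get(null) simply finds nothing.
    System.out.println(hashMap.get(null) == null);

    // TreeMap with natural ordering rejects null keys.
    System.out.println(throwsOnNullKey(new TreeMap<>(hashMap)));
  }
}
```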
core/src/test/java/org/apache/iceberg/TestLocationProvider.java
@@ -167,7 +167,7 @@ public SnapshotTable tableProperty(String property, String value) { | |||
// remove any possible location properties from origin properties | |||
properties.remove(LOCATION); | |||
properties.remove(TableProperties.WRITE_METADATA_LOCATION); | |||
properties.remove(TableProperties.WRITE_FOLDER_STORAGE_LOCATION); | |||
properties.remove(TableProperties.WRITE_DATA_LOCATION); |
When snapshotting a table that has folder storage location set, should we also remove it in addition to write data location?
Also should we remove object storage location?
It'd be safer to remove both. Added in the new commit.
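Removing every location-related key when snapshotting might be sketched as follows. This is an assumption-laden illustration rather than the SnapshotTable code: the class SnapshotPropertyCleanup is hypothetical, and the literal values "location" and "write.metadata.path" for LOCATION and WRITE_METADATA_LOCATION are assumed, since the thread only shows the constant names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: drop all location properties copied from the source table.
public class SnapshotPropertyCleanup {
  static final List<String> LOCATION_KEYS = Arrays.asList(
      "location",                   // assumed value of LOCATION
      "write.metadata.path",        // assumed value of WRITE_METADATA_LOCATION
      "write.data.path",
      "write.folder-storage.path",  // deprecated
      "write.object-storage.path"); // deprecated

  // Returns a copy of the source properties without any location keys.
  static Map<String, String> withoutLocationProperties(Map<String, String> source) {
    Map<String, String> copy = new HashMap<>(source);
    for (String key : LOCATION_KEYS) {
      copy.remove(key);
    }
    return copy;
  }

  public static void main(String[] args) {
    Map<String, String> source = new HashMap<>();
    source.put("write.object-storage.path", "s3://bucket/objects");
    source.put("write.format.default", "parquet");
    System.out.println(withoutLocationProperties(source));
  }
}
```

Leaving a deprecated key in place would let the snapshot table write data files into the source table's location through the fallback chain, which is why removing both deprecated keys is the safer choice.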
I have a PR #2966 for this change, you can either also port my tests here, or remove this and I will update that PR once this is merged.
Hi @jackye1995, thanks for the information. I've added both WRITE_FOLDER_STORAGE_LOCATION and OBJECT_STORE_PATH. Can you rebase PR #2966 once this is merged?
Sure, I can do that
Hi @jun-he, do we need a similar change in python? table_properties.py contains WRITE_NEW_DATA_LOCATION and WRITE_METADATA_LOCATION as well. I'm glad to add it if the change is needed. Another option is to handle it in a follow-up PR. |
Sorry for the late review, was a bit busy last few days, thanks for working on this!
@@ -56,10 +57,12 @@ public static LocationProvider locationsFor(String location, Map<String, String> | |||
return ctor.newInstance(location, properties); | |||
} catch (ClassCastException e) { | |||
throw new IllegalArgumentException( | |||
String.format("Provided implementation for dynamic instantiation should implement %s.", | |||
String.format( |
nit: no need for the newline
yeah, I was using the formatting tool for the whole file. Will remove it.
LocationProvider.class), e); | ||
} | ||
} else if (PropertyUtil.propertyAsBoolean(properties, | ||
} else if (PropertyUtil.propertyAsBoolean( |
nit: no need for the newline
Will remove it.
this.storageLocation = stripTrailingSlash(properties.getOrDefault(OBJECT_STORE_PATH, | ||
defaultDataLocation(tableLocation, properties))); | ||
this.storageLocation = | ||
stripTrailingSlash(dataLocation(properties, tableLocation, TableProperties.OBJECT_STORE_PATH)); |
My thought is that we should fallback configs in the order of:
- write.data.path
- write.object-storage.path
- write.folder-storage.path
- default table location/data
The current implementation seems to skip 3
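The four-step order listed above can be expressed as a loop over an ordered key list. The following is a minimal sketch under stated assumptions: ObjectStoreFallbackChain and resolve are hypothetical names, and the keys are plain string literals rather than TableProperties constants.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the proposed fallback order for the object store provider.
public class ObjectStoreFallbackChain {
  // Steps 1 to 3: property keys in priority order.
  static final List<String> PRIORITY = Arrays.asList(
      "write.data.path",
      "write.object-storage.path",
      "write.folder-storage.path");

  static String resolve(Map<String, String> properties, String tableLocation) {
    for (String key : PRIORITY) {
      String value = properties.get(key);
      if (value != null) {
        return value;
      }
    }
    // Step 4: default to <tableLocation>/data.
    return tableLocation + "/data";
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    props.put("write.folder-storage.path", "s3://bucket/folder");
    // Only step 3 is configured, so the folder-storage path is used.
    System.out.println(resolve(props, "s3://bucket/table"));
  }
}
```

Skipping step 3 would send such a table's data to s3://bucket/table/data instead, which is the behavior change this comment is concerned about.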
That's my original logic. I made the change per Ryan's comments #3094 (comment), which makes sense to me.
yeah I would agree with that if the change is not in 0.12.0, but unfortunately that PR is a part of the 0.12.0 release. apache-iceberg-0.12.0...master
Hi @jackye1995, I made the logic the same as in #2845. The implementation is not as clean as before, but that's probably the best we can do.
I am trying to put these two guys together.
|
private static String dataLocation(Map<String, String> properties, String tableLocation, boolean isObjectStore) { | ||
String dataLocation = properties.get(TableProperties.WRITE_DATA_LOCATION); | ||
if (dataLocation == null) { | ||
String deprecatedProperty = null; |
I think we can use String deprecatedProperty = isObjectStorage ? TableProperties.OBJECT_STORE_PATH : TableProperties.WRITE_FOLDER_STORAGE_LOCATION. Then deprecatedProperty is never null, you don't need the null check for the warning, and we can avoid having that warning helper method.
Made the change. Thanks for the suggestion.
Looks good to me! I have a small comment around documentation; apart from that I don't have additional feedback. Let's see if anyone else has additional concerns. If not, I will merge it after some time.
site/docs/aws.md
@@ -373,8 +373,7 @@ s3://my-table-data-bucket/2d3905f8/my_ns.db/my_table/category=orders/00000-0-5af | |||
``` | |||
|
|||
Note, the path resolution logic for `ObjectStoreLocationProvider` is as follows: | |||
- if `write.object-storage.path` is set, use it | |||
- if not found, fallback to `write.folder-storage.path` | |||
- if `write.data.path` is set, use it | |||
- if not found, use `<tableLocation>/data` |
Given we have a simple path resolution strategy, I think we can just put this listing in a single sentence. Also add 2 warning blocks describing the legacy behaviors:
- before 0.12.0, write.object-storage.path must be set
- at 0.12.0, write.object-storage.path, then write.folder-storage.path, then <tableLocation>/data
Thanks Jack! Made the change for the doc.
Thanks all for the review. Is it ready for merging? |
Overall this looks good to me. Thank you for doing this @flyrain!
* For the {@link ObjectStoreLocationProvider}, the priority level are | ||
* "write.data.path" -> "write.object-storage.path" -> "write.folder-storage.path" -> "table-location/data". | ||
*/ | ||
private static String dataLocation(Map<String, String> properties, String tableLocation, boolean isObjectStore) { |
Nit / non-blocking: Consider naming this isObjectStorageLocationProvider or something similar, to indicate that it's not just object storage but specifically the usage of the object storage location provider.
However, since I just realized this is already in LocationProviders.java, I think this is probably fine and there is no need to change it (especially as it's documented in the Javadoc).
@flyrain thanks for pinging. Yep, we need to have the similar changes. But I think it is fine not to add it in the python_legacy. Instead, we can add it to the new python library. I am currently working on its layout with a high level design. Once that is done, we will start the implementation and then add it to python. I will create an issue for us to track it. |
} | ||
} | ||
|
||
return dataLocation; |
This looks way too complicated to be worth it. Let's just duplicate the logic and minimize the changes in each location provider.
Made the change.
} else if (deprecatedProperty.equals(TableProperties.OBJECT_STORE_PATH)) { | ||
dataLocation = properties.get(TableProperties.WRITE_FOLDER_STORAGE_LOCATION); | ||
if (dataLocation != null) { | ||
LOG.warn(warnMsg, TableProperties.WRITE_FOLDER_STORAGE_LOCATION); |
No need for warnings. This is not important enough to nag users about.
Removed
Looks like the change that defaulted object storage tables to the folder-storage path has not been released yet: aef12c0 @flyrain and @jackye1995, how about simplifying this by only using |
Hi @rdblue, I like your idea to simplify the logic, but as @jackye1995 mentioned, the logic to use write.folder-storage.path in the object store location provider was introduced by #2845, which has been released in 0.12.0. #2965 is a PR to just change the constant name. |
I think if 1.0.0 is the next release, what @rdblue proposes is definitely the right approach to go, and we have also a way out for users of 0.12.0, which is great. If we are still thinking about 0.13.0, we need to just be more cautious, because these config keys will be there all the time, I think it's better to keep the resolution strategy. We decided to merge #2845 mostly because services try to hide the table's root location, but the object storage mode forces user to configure a location that the user might not know. Chaining the fallback config key seems to be the only way out for the 2 use cases described here: #2845 (comment) |
I don't think that we should make a breaking behavior change at 1.0. Cleaning this up isn't important enough to warrant a breaking change. Let's just continue with the released behavior as a fallback.
Thanks, @flyrain! |
Thanks all for the review and commit. @rdblue @jackye1995 @kbendick @karuppayya @jun-he. |
Per discussion in #2965, we deprecate both write.folder-storage.path and write.object-storage.path, and use write.data.path instead.
Created a new method LocationProviders::dataLocation to get the data location and to be compatible with deprecated configs.
cc @rdblue @jackye1995 @aokolnychyi @RussellSpitzer @kbendick @karuppayya