Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Consolidate write.folder-storage.path and write.object-storage.path to write.data.path #3094

Merged
merged 11 commits into from
Sep 22, 2021

Conversation

flyrain
Copy link
Contributor

@flyrain flyrain commented Sep 9, 2021

Per discussion in #2965, we deprecate both write.folder-storage.path and write.object-storage.path, and use write.data.path instead.

Created a new method LocationProviders::dataLocation to get the data location and to be compatible with deprecated configs.

cc @rdblue @jackye1995 @aokolnychyi @RussellSpitzer @kbendick @karuppayya

Copy link
Contributor

@kbendick kbendick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for handling this @flyrain!

A couple nits and open-ended questions and one more question (for my own understanding).

Clarifying for my own understanding, but if user enables object-storage location provider and doesnt set any of these paths, what happens?

Like in this case, does the data file still get the hash? And if so, where is it placed? table.location() + "/data" + <hash>? This might be something to address in another PR (or maybe I just need more coffee).

    table.updateProperties()
        .set(TableProperties.OBJECT_STORE_ENABLED, "true")
        .commit();

    Assert.assertTrue("default data location should be used when object storage path not set",
        table.locationProvider().newDataLocation("file").contains(table.location() + "/data"));

Comment on lines +158 to +161
// This only applies to files written after this property is set. Files previously written aren't
// relocated to reflect this parameter.
// If not set, defaults to a "data" folder underneath the root path of the table.
public static final String WRITE_DATA_LOCATION = "write.data.path";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Do we want to update the comment for the default if not set to reflect anything about the possibility of object storage location provider?

Up to you. The more I think about it, the more I think it just complicates things and that we should just properly document the behavior on the website. But up to you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can move L161 to after L144 to avoid deleting the comments and add again here.

@@ -141,4 +133,18 @@ private static String stripTrailingSlash(String path) {
}
return result;
}

public static String dataLocation(Map<String, String> properties, String tableLocation) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Does this need to be public or can it be private? If it needs to be visible for testing or something, that's ok.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Visible for testing should use protected or package-private.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I make it public because it is used in IcebergSourceBenchmark.java. Maybe we can change WRITE_FOLDER_STORAGE_LOCATION to the WRITE_DATA_PATH in IcebergSourceBenchmark.java?

Comment on lines 137 to 149
public static String dataLocation(Map<String, String> properties, String tableLocation) {
String dataLocation = properties.get(TableProperties.WRITE_DATA_LOCATION);
if (dataLocation == null) {
dataLocation = properties.get(TableProperties.OBJECT_STORE_PATH);
if (dataLocation == null) {
dataLocation = properties.get(TableProperties.WRITE_FOLDER_STORAGE_LOCATION);
if (dataLocation == null) {
dataLocation = String.format("%s/data", tableLocation);
}
}
}
return dataLocation;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we're falling back through several deprecated options, we might want to list out the intended fallback order / behavior (in a comment up top for example) so we can more easily verify that it happens.

Or better yet, add tests setting different combinations.

We should potentially also address whether or not it's possible for users to have set too many flags and then error out.

Lastly, might want to drop a warning level deprecation log if one of the deprecated flags is non-null (or maybe just info). :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testObjectStorageLocationProviderPathResolution is for that. Will add warn log.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the log.

public static String dataLocation(Map<String, String> properties, String tableLocation) {
String dataLocation = properties.get(TableProperties.WRITE_DATA_LOCATION);
if (dataLocation == null) {
dataLocation = properties.get(TableProperties.OBJECT_STORE_PATH);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only correct if the table is using object storage. I think you need to update this logic to select whether to return OBJECT_STORE_PATH or WRITE_FOLDER_STORAGE_LOCATION depending on the location provider selected. Both should fall back to the table location. And the new property should take precedence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Made the change. This is going to be a little different from #2965. Modified the test case testObjectStorageLocationProviderPathResolution and added testDefaultStorageLocationProviderPathResolution.

@@ -135,21 +135,31 @@ private TableProperties() {
public static final String OBJECT_STORE_ENABLED = "write.object-storage.enabled";
public static final boolean OBJECT_STORE_ENABLED_DEFAULT = false;

/**
* @deprecated will be removed in 0.14.0, use {@link #WRITE_DATA_LOCATION} instead
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt this will ever be removed. It isn't worth breaking on older tables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense, remove the comments.

@flyrain
Copy link
Contributor Author

flyrain commented Sep 10, 2021

Clarifying for my own understanding, but if user enables object-storage location provider and doesnt set any of these paths, what happens?

Like in this case, does the data file still get the hash? And if so, where is it placed? table.location() + "/data" + <hash>?

This PR doesn't change the behavior of object storage location provider. It will fall back to table.location() + "/data" if no config is set.

private static String dataLocation(Map<String, String> properties, String tableLocation, String deprecatedProperty) {
String dataLocation = properties.get(TableProperties.WRITE_DATA_LOCATION);
if (dataLocation == null) {
dataLocation = properties.get(deprecatedProperty);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have to check if deprecatedProperty is null, the map can take a null.

@@ -167,7 +167,7 @@ public SnapshotTable tableProperty(String property, String value) {
// remove any possible location properties from origin properties
properties.remove(LOCATION);
properties.remove(TableProperties.WRITE_METADATA_LOCATION);
properties.remove(TableProperties.WRITE_FOLDER_STORAGE_LOCATION);
properties.remove(TableProperties.WRITE_DATA_LOCATION);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When snapshotting a table that has folder storage location set, should we also remove it in addition to write data location?
Also should we remove object storage location?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be safer to remove both. Added in the new commit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a PR #2966 for this change, you can either also port my tests here, or remove this and I will update that PR once this is merged.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jackye1995 , thanks for the information. I've added both WRITE_FOLDER_STORAGE_LOCATION and OBJECT_STORE_PATH. Can you rebase it in PR #2966 once this got merged?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can do that

@flyrain
Copy link
Contributor Author

flyrain commented Sep 13, 2021

Hi @jun-he, do we need a similar change in python? table_properties.py contains WRITE_NEW_DATA_LOCATION and WRITE_METADATA_LOCATION as well. I'm glad to add it if the change is needed. Another option is to handle it in a follow-up PR.

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review, was a bit busy last few days, thanks for working on this!

@@ -56,10 +57,12 @@ public static LocationProvider locationsFor(String location, Map<String, String>
return ctor.newInstance(location, properties);
} catch (ClassCastException e) {
throw new IllegalArgumentException(
String.format("Provided implementation for dynamic instantiation should implement %s.",
String.format(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: no need for the newline

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I was using the formatting tool for the whole file. Will remove it.

LocationProvider.class), e);
}
} else if (PropertyUtil.propertyAsBoolean(properties,
} else if (PropertyUtil.propertyAsBoolean(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: no need for the newline

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove it.

this.storageLocation = stripTrailingSlash(properties.getOrDefault(OBJECT_STORE_PATH,
defaultDataLocation(tableLocation, properties)));
this.storageLocation =
stripTrailingSlash(dataLocation(properties, tableLocation, TableProperties.OBJECT_STORE_PATH));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought is that we should fallback configs in the order of:

  1. write.data.path
  2. write.object-storage.path
  3. write.folder-storage.path
  4. default table location/data

The current implementation seems to skip 3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's my original logic. I made the change per Ryan's comments #3094 (comment), which makes sense to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I would agree with that if the change is not in 0.12.0, but unfortunately that PR is a part of the 0.12.0 release. apache-iceberg-0.12.0...master

Copy link
Contributor Author

@flyrain flyrain Sep 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the logic is in #2845, and has been in the release 0.12.0. We'd better to stick to it. I will make the change. cc @rdblue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jackye1995, made the logic to be the same with #2845. The implementation looks not as clean as before, but that's probably the best thing we can do.

Comment on lines +158 to +161
// This only applies to files written after this property is set. Files previously written aren't
// relocated to reflect this parameter.
// If not set, defaults to a "data" folder underneath the root path of the table.
public static final String WRITE_DATA_LOCATION = "write.data.path";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can move L161 to after L144 to avoid deleting the comments and add again here.

@@ -167,7 +167,7 @@ public SnapshotTable tableProperty(String property, String value) {
// remove any possible location properties from origin properties
properties.remove(LOCATION);
properties.remove(TableProperties.WRITE_METADATA_LOCATION);
properties.remove(TableProperties.WRITE_FOLDER_STORAGE_LOCATION);
properties.remove(TableProperties.WRITE_DATA_LOCATION);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a PR #2966 for this change, you can either also port my tests here, or remove this and I will update that PR once this is merged.

@flyrain
Copy link
Contributor Author

flyrain commented Sep 13, 2021

nit: we can move L161 to after L144 to avoid deleting the comments and add again here.

I am trying to put these two guys together.

  // This only applies to files written after this property is set. Files previously written aren't
  // relocated to reflect this parameter.
  // If not set, defaults to a "data" folder underneath the root path of the table.
  public static final String WRITE_DATA_LOCATION = "write.data.path";

  // This only applies to files written after this property is set. Files previously written aren't
  // relocated to reflect this parameter.
  // If not set, defaults to a "metadata" folder underneath the root path of the table.
  public static final String WRITE_METADATA_LOCATION = "write.metadata.path";

private static String dataLocation(Map<String, String> properties, String tableLocation, boolean isObjectStore) {
String dataLocation = properties.get(TableProperties.WRITE_DATA_LOCATION);
if (dataLocation == null) {
String deprecatedProperty = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can use String deprecatedProperty = isObjectStorage ? TableProperties.OBJECT_STORE_PATH : TableProperties.WRITE_FOLDER_STORAGE_LOCATION, then deprecatedProperty is always not null, and you don't need to do null check for the warning, and we can avoid having that warning helper method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the change. Thanks for the suggestion.

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! I have a small comment around documentation, apart from that I don't have additional feedbacks. Let's see if anyone else has additional concern, if not I will merge it after some time.

site/docs/aws.md Outdated
@@ -373,8 +373,7 @@ s3://my-table-data-bucket/2d3905f8/my_ns.db/my_table/category=orders/00000-0-5af
```

Note, the path resolution logic for `ObjectStoreLocationProvider` is as follows:
- if `write.object-storage.path` is set, use it
- if not found, fallback to `write.folder-storage.path`
- if `write.data.path` is set, use it
- if not found, use `<tableLocation>/data`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given we have a simple path resolution strategy, I think we can just put these listing in a single sentence. Also add 2 warning blocks describing the legacy behaviors:

  • before 0.12.0, write.object-storage.path must be set
  • at 0.12.0, write.object-storage.path then write.folder-storage.path then <tableLocation>/data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Jack! Made the change for the doc.

@flyrain
Copy link
Contributor Author

flyrain commented Sep 17, 2021

Thanks all for the review. Is it ready for merging?

Copy link
Contributor

@kbendick kbendick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good to me. Thank you for doing this @flyrain!

* For the {@link ObjectStoreLocationProvider}, the priority level are
* "write.data.path" -> "write.object-storage.path" -> "write.folder-storage.path" -> "table-location/data".
*/
private static String dataLocation(Map<String, String> properties, String tableLocation, boolean isObjectStore) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit / non-blocking: Naming this isObjectStorageLocationProvider or something to indicate that it's not just object storage, but specifically the usage of object storage location provider.

However, since I just realized this is already in LocationProviders.java, I think this is probably fine and no need to change (especially as it's documented in the java doc).

@jun-he
Copy link
Collaborator

jun-he commented Sep 17, 2021

Hi @jun-he, do we need a similar change in python? table_properties.py contains WRITE_NEW_DATA_LOCATION and WRITE_METADATA_LOCATION as well. I'm glad to add it if the change is needed. Another option is to handle it in a follow-up PR.

@flyrain thanks for pinging. Yep, we need to have the similar changes. But I think it is fine not to add it in the python_legacy. Instead, we can add it to the new python library. I am currently working on its layout with a high level design. Once that is done, we will start the implementation and then add it to python. I will create an issue for us to track it.
Additionally, wondering if we still need to support those deprecated fields for backward compatibility?

}
}

return dataLocation;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks way too complicated to be worth it. Let's just duplicate the logic and minimize the changes in each location provider.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the change.

} else if (deprecatedProperty.equals(TableProperties.OBJECT_STORE_PATH)) {
dataLocation = properties.get(TableProperties.WRITE_FOLDER_STORAGE_LOCATION);
if (dataLocation != null) {
LOG.warn(warnMsg, TableProperties.WRITE_FOLDER_STORAGE_LOCATION);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for warnings. This is not important enough to nag users about.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

@rdblue
Copy link
Contributor

rdblue commented Sep 20, 2021

Looks like the change that defaulted object storage tables to the folder-storage path has not been released yet: aef12c0

@flyrain and @jackye1995, how about simplifying this by only using write.data.path and removing the use of write.folder-storage.path from the object store location provider? That simplifies the documentation and makes it easier to understand.

@flyrain
Copy link
Contributor Author

flyrain commented Sep 20, 2021

Hi @rdblue, I like your idea to simplify the logic, but as @jackye1995 mentioned, the logic to use write.folder-storage.path in the object store location provider was introduced by #2845, which has been released in 0.12.0. #2965 is a PR to just change the constant name.

@jackye1995
Copy link
Contributor

@rdblue @flyrain

I think if 1.0.0 is the next release, what @rdblue proposes is definitely the right approach to go, and we have also a way out for users of 0.12.0, which is great.

If we are still thinking about 0.13.0, we need to just be more cautious, because these config keys will be there all the time, I think it's better to keep the resolution strategy. We decided to merge #2845 mostly because services try to hide the table's root location, but the object storage mode forces user to configure a location that the user might not know. Chaining the fallback config key seems to be the only way out for the 2 use cases described here: #2845 (comment)

@rdblue
Copy link
Contributor

rdblue commented Sep 22, 2021

I don't think that we shouldn't make a breaking behavior change at 1.0. It isn't important enough to clean this up to warrant a breaking change. Let's just continue with the released behavior as a fallback.

@rdblue rdblue merged commit 12e30eb into apache:master Sep 22, 2021
@rdblue
Copy link
Contributor

rdblue commented Sep 22, 2021

Thanks, @flyrain!

@flyrain
Copy link
Contributor Author

flyrain commented Sep 22, 2021

Thanks all for the review and commit. @rdblue @jackye1995 @kbendick @karuppayya @jun-he.
Hi @anuragmantri, fyi, this will affect the relative path design a bit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants