Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Consolidate write.folder-storage.path and write.object-storage.path to write.data.path #3094

Merged
merged 11 commits into from
Sep 22, 2021
52 changes: 42 additions & 10 deletions core/src/main/java/org/apache/iceberg/LocationProviders.java
Original file line number Diff line number Diff line change
Expand Up @@ -28,10 +28,11 @@
import org.apache.iceberg.transforms.Transforms;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.util.PropertyUtil;

import static org.apache.iceberg.TableProperties.OBJECT_STORE_PATH;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LocationProviders {
private static final Logger LOG = LoggerFactory.getLogger(LocationProviders.class);

private LocationProviders() {
}
Expand Down Expand Up @@ -68,16 +69,11 @@ public static LocationProvider locationsFor(String location, Map<String, String>
}
}

private static String defaultDataLocation(String tableLocation, Map<String, String> properties) {
return properties.getOrDefault(TableProperties.WRITE_FOLDER_STORAGE_LOCATION,
String.format("%s/data", tableLocation));
}

static class DefaultLocationProvider implements LocationProvider {
private final String dataLocation;

DefaultLocationProvider(String tableLocation, Map<String, String> properties) {
this.dataLocation = stripTrailingSlash(defaultDataLocation(tableLocation, properties));
this.dataLocation = stripTrailingSlash(dataLocation(properties, tableLocation, false));
}

@Override
Expand All @@ -99,8 +95,7 @@ static class ObjectStoreLocationProvider implements LocationProvider {
private final String context;

ObjectStoreLocationProvider(String tableLocation, Map<String, String> properties) {
this.storageLocation = stripTrailingSlash(properties.getOrDefault(OBJECT_STORE_PATH,
defaultDataLocation(tableLocation, properties)));
this.storageLocation = stripTrailingSlash(dataLocation(properties, tableLocation, true));
this.context = pathContext(tableLocation);
}

Expand Down Expand Up @@ -141,4 +136,41 @@ private static String stripTrailingSlash(String path) {
}
return result;
}

/**
* Get the data file location. For the {@link DefaultLocationProvider}, the priority level are
* "write.data.path" -> "write.folder-storage.path" -> "table-location/data".
* For the {@link ObjectStoreLocationProvider}, the priority level are
* "write.data.path" -> "write.object-storage.path" -> "write.folder-storage.path" -> "table-location/data".
*/
private static String dataLocation(Map<String, String> properties, String tableLocation, boolean isObjectStore) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit / non-blocking: Naming this isObjectStorageLocationProvider or something to indicate that it's not just object storage, but specifically the usage of object storage location provider.

However, since I just realized this is already in LocationProviders.java, I think this is probably fine and no need to change (especially as it's documented in the java doc).

String dataLocation = properties.get(TableProperties.WRITE_DATA_LOCATION);
if (dataLocation == null) {
dataLocation = deprecatedDataLocation(properties, isObjectStore);
if (dataLocation == null) {
dataLocation = String.format("%s/data", tableLocation);
}
}
return dataLocation;
}

private static String deprecatedDataLocation(Map<String, String> properties, boolean isObjectStore) {
String deprecatedProperty = isObjectStore ?
TableProperties.OBJECT_STORE_PATH : TableProperties.WRITE_FOLDER_STORAGE_LOCATION;

String dataLocation = properties.get(deprecatedProperty);

final String warnMsg = "Table property {} is deprecated, please use " + TableProperties.WRITE_DATA_LOCATION +
" instead.";
if (dataLocation != null) {
LOG.warn(warnMsg, deprecatedProperty);
} else if (deprecatedProperty.equals(TableProperties.OBJECT_STORE_PATH)) {
dataLocation = properties.get(TableProperties.WRITE_FOLDER_STORAGE_LOCATION);
if (dataLocation != null) {
LOG.warn(warnMsg, TableProperties.WRITE_FOLDER_STORAGE_LOCATION);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for warnings. This is not important enough to nag users about.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

}
}

return dataLocation;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks way too complicated to be worth it. Let's just duplicate the logic and minimize the changes in each location provider.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the change.

}
}
18 changes: 14 additions & 4 deletions core/src/main/java/org/apache/iceberg/TableProperties.java
Original file line number Diff line number Diff line change
Expand Up @@ -135,21 +135,31 @@ private TableProperties() {
public static final String OBJECT_STORE_ENABLED = "write.object-storage.enabled";
public static final boolean OBJECT_STORE_ENABLED_DEFAULT = false;

/**
* @deprecated Use {@link #WRITE_DATA_LOCATION} instead.
*/
@Deprecated
public static final String OBJECT_STORE_PATH = "write.object-storage.path";

public static final String WRITE_LOCATION_PROVIDER_IMPL = "write.location-provider.impl";

// This only applies to files written after this property is set. Files previously written aren't
// relocated to reflect this parameter.
// If not set, defaults to a "data" folder underneath the root path of the table.
/**
* @deprecated Use {@link #WRITE_DATA_LOCATION} instead.
*/
@Deprecated
public static final String WRITE_FOLDER_STORAGE_LOCATION = "write.folder-storage.path";

/**
* @deprecated will be removed in 0.14.0, use {@link #WRITE_FOLDER_STORAGE_LOCATION} instead
* @deprecated will be removed in 0.14.0, use {@link #WRITE_DATA_LOCATION} instead
*/
@Deprecated
public static final String WRITE_NEW_DATA_LOCATION = "write.folder-storage.path";

// This only applies to files written after this property is set. Files previously written aren't
// relocated to reflect this parameter.
// If not set, defaults to a "data" folder underneath the root path of the table.
public static final String WRITE_DATA_LOCATION = "write.data.path";
Comment on lines +158 to +161
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Do we want to update the comment for the default if not set to reflect anything about the possibility of object storage location provider?

Up to you. The more I think about it, the more I think it just complicates things and that we should just properly document the behavior on the website. But up to you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can move L161 to after L144 to avoid deleting the comments and add again here.


// This only applies to files written after this property is set. Files previously written aren't
// relocated to reflect this parameter.
// If not set, defaults to a "metadata" folder underneath the root path of the table.
Expand Down
36 changes: 35 additions & 1 deletion core/src/test/java/org/apache/iceberg/TestLocationProvider.java
Original file line number Diff line number Diff line change
Expand Up @@ -114,7 +114,7 @@ public void testDefaultLocationProvider() {
@Test
public void testDefaultLocationProviderWithCustomDataLocation() {
this.table.updateProperties()
.set(TableProperties.WRITE_FOLDER_STORAGE_LOCATION, "new_location")
.set(TableProperties.WRITE_DATA_LOCATION, "new_location")
.commit();

this.table.locationProvider().newDataLocation("my_file");
Expand Down Expand Up @@ -237,5 +237,39 @@ public void testObjectStorageLocationProviderPathResolution() {

Assert.assertTrue("object storage path should be used when set",
table.locationProvider().newDataLocation("file").contains(objectPath));

String dataPath = "s3://random/data/location";
table.updateProperties()
.set(TableProperties.WRITE_DATA_LOCATION, dataPath)
.commit();

Assert.assertTrue("write data path should be used when set",
table.locationProvider().newDataLocation("file").contains(dataPath));
}

@Test
public void testDefaultStorageLocationProviderPathResolution() {
table.updateProperties()
.set(TableProperties.OBJECT_STORE_ENABLED, "false")
.commit();

Assert.assertTrue("default data location should be used when object storage path not set",
table.locationProvider().newDataLocation("file").contains(table.location() + "/data"));

String folderPath = "s3://random/folder/location";
table.updateProperties()
.set(TableProperties.WRITE_FOLDER_STORAGE_LOCATION, folderPath)
.commit();

Assert.assertTrue("folder storage path should be used when set",
table.locationProvider().newDataLocation("file").contains(folderPath));

String dataPath = "s3://random/data/location";
table.updateProperties()
.set(TableProperties.WRITE_DATA_LOCATION, dataPath)
.commit();

Assert.assertTrue("write data path should be used when set",
table.locationProvider().newDataLocation("file").contains(dataPath));
}
}
12 changes: 6 additions & 6 deletions site/docs/aws.md
Original file line number Diff line number Diff line change
Expand Up @@ -344,7 +344,7 @@ Data stored in S3 with a traditional Hive storage layout can face S3 request thr

Iceberg by default uses the Hive storage layout, but can be switched to use the `ObjectStoreLocationProvider`.
With `ObjectStoreLocationProvider`, a determenistic hash is generated for each stored file, with the hash appended
directly after the `write.object-storage.path`. This ensures files written to s3 are equally distributed across multiple [prefixes](https://aws.amazon.com/premiumsupport/knowledge-center/s3-object-key-naming-pattern/) in the S3 bucket. Resulting in minimized throttling and maximized throughput for S3-related IO operations. When using `ObjectStoreLocationProvider` having a shared and short `write.object-storage.path` across your Iceberg tables will improve performance.
directly after the `write.data.path`. This ensures files written to s3 are equally distributed across multiple [prefixes](https://aws.amazon.com/premiumsupport/knowledge-center/s3-object-key-naming-pattern/) in the S3 bucket. Resulting in minimized throttling and maximized throughput for S3-related IO operations. When using `ObjectStoreLocationProvider` having a shared and short `write.data.path` across your Iceberg tables will improve performance.

For more information on how S3 scales API QPS, checkout the 2018 re:Invent session on [Best Practices for Amazon S3 and Amazon S3 Glacier]( https://youtu.be/rHeTn9pHNKo?t=3219). At [53:39](https://youtu.be/rHeTn9pHNKo?t=3219) it covers how S3 scales/partitions & at [54:50](https://youtu.be/rHeTn9pHNKo?t=3290) it discusses the 30-60 minute wait time before new partitions are created.

Expand All @@ -358,7 +358,7 @@ CREATE TABLE my_catalog.my_ns.my_table (
USING iceberg
OPTIONS (
'write.object-storage.enabled'=true,
'write.object-storage.path'='s3://my-table-data-bucket')
'write.data.path'='s3://my-table-data-bucket')
PARTITIONED BY (category);
```

Expand All @@ -372,10 +372,10 @@ Which will write the data to S3 with a hash (`2d3905f8`) appended directly after
s3://my-table-data-bucket/2d3905f8/my_ns.db/my_table/category=orders/00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
```

Note, the path resolution logic for `ObjectStoreLocationProvider` is as follows:
- if `write.object-storage.path` is set, use it
- if not found, fallback to `write.folder-storage.path`
- if not found, use `<tableLocation>/data`
Note, the path resolution logic for `ObjectStoreLocationProvider` is `write.data.path` then `<tableLocation>/data`.
However, for the older versions up to 0.12.0, the logic is as follows:
- before 0.12.0, `write.object-storage.path` must be set.
- at 0.12.0, `write.object-storage.path` then `write.folder-storage.path` then `<tableLocation>/data`.

For more details, please refer to the [LocationProvider Configuration](../custom-catalog/#custom-location-provider-implementation) section.

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -79,8 +79,7 @@ protected String newTableLocation() {

protected String dataLocation() {
Map<String, String> properties = table.properties();
return properties.getOrDefault(TableProperties.WRITE_FOLDER_STORAGE_LOCATION,
String.format("%s/data", table.location()));
return properties.getOrDefault(TableProperties.WRITE_DATA_LOCATION, String.format("%s/data", table.location()));
}

protected void cleanupFiles() throws IOException {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -285,7 +285,7 @@ public void testWapFilesAreKept() throws InterruptedException {
public void testMetadataFolderIsIntact() throws InterruptedException {
// write data directly to the table location
Map<String, String> props = Maps.newHashMap();
props.put(TableProperties.WRITE_FOLDER_STORAGE_LOCATION, tableLocation);
props.put(TableProperties.WRITE_DATA_LOCATION, tableLocation);
Table table = TABLES.create(SCHEMA, SPEC, props, tableLocation);

List<ThreeColumnRecord> records = Lists.newArrayList(
Expand Down Expand Up @@ -357,7 +357,7 @@ public void testOlderThanTimestamp() throws InterruptedException {
@Test
public void testRemoveUnreachableMetadataVersionFiles() throws InterruptedException {
Map<String, String> props = Maps.newHashMap();
props.put(TableProperties.WRITE_FOLDER_STORAGE_LOCATION, tableLocation);
props.put(TableProperties.WRITE_DATA_LOCATION, tableLocation);
props.put(TableProperties.METADATA_PREVIOUS_VERSIONS_MAX, "1");
Table table = TABLES.create(SCHEMA, SPEC, props, tableLocation);

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -145,7 +145,7 @@ public void testWriteWithCustomDataLocation() throws IOException {
File tablePropertyDataLocation = temp.newFolder("test-table-property-data-dir");
Table table = createTable(new Schema(SUPPORTED_PRIMITIVES.fields()), location);
table.updateProperties().set(
TableProperties.WRITE_FOLDER_STORAGE_LOCATION, tablePropertyDataLocation.getAbsolutePath()).commit();
TableProperties.WRITE_DATA_LOCATION, tablePropertyDataLocation.getAbsolutePath()).commit();
writeAndValidateWithLocations(table, location, tablePropertyDataLocation);
}

Expand Down Expand Up @@ -271,7 +271,7 @@ public void testNullableWithWriteOption() throws IOException {
String sourcePath = String.format("%s/nullable_poc/sourceFolder/", location.toString());
String targetPath = String.format("%s/nullable_poc/targetFolder/", location.toString());

tableProperties = ImmutableMap.of(TableProperties.WRITE_FOLDER_STORAGE_LOCATION, targetPath);
tableProperties = ImmutableMap.of(TableProperties.WRITE_DATA_LOCATION, targetPath);

// read this and append to iceberg dataset
spark
Expand Down Expand Up @@ -312,7 +312,7 @@ public void testNullableWithSparkSqlOption() throws IOException {
String sourcePath = String.format("%s/nullable_poc/sourceFolder/", location.toString());
String targetPath = String.format("%s/nullable_poc/targetFolder/", location.toString());

tableProperties = ImmutableMap.of(TableProperties.WRITE_FOLDER_STORAGE_LOCATION, targetPath);
tableProperties = ImmutableMap.of(TableProperties.WRITE_DATA_LOCATION, targetPath);

// read this and append to iceberg dataset
spark
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -168,6 +168,8 @@ protected Map<String, String> destTableProps() {
properties.remove(LOCATION);
properties.remove(TableProperties.WRITE_METADATA_LOCATION);
properties.remove(TableProperties.WRITE_FOLDER_STORAGE_LOCATION);
properties.remove(TableProperties.OBJECT_STORE_PATH);
properties.remove(TableProperties.WRITE_DATA_LOCATION);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When snapshotting a table that has folder storage location set, should we also remove it in addition to write data location?
Also should we remove object storage location?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be safer to remove both. Added in the new commit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a PR #2966 for this change, you can either also port my tests here, or remove this and I will update that PR once this is merged.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jackye1995 , thanks for the information. I've added both WRITE_FOLDER_STORAGE_LOCATION and OBJECT_STORE_PATH. Can you rebase it in PR #2966 once this got merged?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can do that


// set default and user-provided props
properties.put(TableCatalog.PROP_PROVIDER, "iceberg");
Expand Down