Document how to configure dynamodb lock client #1091

Open

wjones127 opened this issue Jan 23, 2023 · 17 comments
Labels: documentation (Improvements or additions to documentation)

Comments

@wjones127
Collaborator

Description

Although we have an error message telling users to configure the Lock client if they want concurrent writes with S3, we don't have any documentation on how to do that. We should also provide general advice on concurrency, like not mixing different connectors in concurrent writers.

See conversation: https://delta-users.slack.com/archives/C013LCAEB98/p1674435354811639

Use Case

Related Issue(s)

We probably shouldn't do this until we improve the conflict resolution, though. #593

wjones127 added the documentation label Jan 23, 2023
@wjones127
Collaborator Author

@MrPowers this would probably be a good thing to blog about once the conflict resolution is improved. Concurrent writes are definitely something you can't do with plain Parquet tables. 😉

@LucaSoato

Let me know if I can help with this; we'll need this feature. 🙂

@MrPowers
Collaborator

@wjones127 - feel free to assign me to this issue. I will be happy to create the docs when #593 is finished.

@hongbo-miao

hongbo-miao commented May 12, 2023

Hi folks, is it possible to have a draft document first, so that everyone can start trying it and providing feedback?
Or is there already a guide somewhere else? Thanks! 😃

@yuhanz

yuhanz commented Dec 4, 2023

I'm looking for the documentation on how to set up the LockClient in Python as well.

@yuhanz

yuhanz commented Dec 4, 2023

In crates/deltalake-core/src/test_utils.rs, it seems you just need to set environment variables pointing to a DynamoDB table via DYNAMO_LOCK_TABLE_NAME:

set_env_if_not_set(s3_storage_options::AWS_ACCESS_KEY_ID, "deltalake");
set_env_if_not_set(s3_storage_options::AWS_SECRET_ACCESS_KEY, "weloverust");
set_env_if_not_set("AWS_DEFAULT_REGION", "us-east-1");
set_env_if_not_set(s3_storage_options::AWS_REGION, "us-east-1");
set_env_if_not_set(s3_storage_options::AWS_S3_LOCKING_PROVIDER, "dynamodb");
set_env_if_not_set("DYNAMO_LOCK_TABLE_NAME", "test_table");
set_env_if_not_set("DYNAMO_LOCK_REFRESH_PERIOD_MILLIS", "100");
set_env_if_not_set("DYNAMO_LOCK_ADDITIONAL_TIME_TO_WAIT_MILLIS", "100");

A different project, kafka-delta-ingest, documents the table schema of the DynamoDB table: https://github.com/delta-io/kafka-delta-ingest#writing-to-s3

aws dynamodb create-table --table-name delta_rs_lock_table \
    --attribute-definitions \
        AttributeName=key,AttributeType=S \
    --key-schema \
        AttributeName=key,KeyType=HASH \
    --provisioned-throughput \
        ReadCapacityUnits=10,WriteCapacityUnits=10

(The same schema is documented in python/deltalake/writer.py as well)

  • Key Schema: AttributeName=key, KeyType=HASH
  • Attribute Definitions: AttributeName=key, AttributeType=S

However, the Python documentation (python/docs/source/usage.rst) explicitly says to specify the options in storage_options, so the environment variables may not be required. I am going to give this a try.

    >>> import pandas as pd
    >>> from deltalake import write_deltalake
    >>> df = pd.DataFrame({'x': [1, 2, 3]})
    >>> storage_options = {'AWS_S3_LOCKING_PROVIDER': 'dynamodb', 'DYNAMO_LOCK_TABLE_NAME': 'custom_table_name'}
    >>> write_deltalake('s3://path/to/table', df, storage_options=storage_options)

@danielgafni

danielgafni commented Jan 17, 2024

@yuhanz hey, did you find the correct solution for Python?

Edit: this worked with deltalake 0.15.1

@yuhanz

yuhanz commented Jan 20, 2024

@danielgafni: I went with storage_options, and it worked well with deltalake 0.13.0.

storage_options = {
    "AWS_DEFAULT_REGION": "us-east-1",
    "AWS_ACCESS_KEY_ID": AWS_ACCESS_KEY_ID,
    "AWS_SECRET_ACCESS_KEY": AWS_SECRET_ACCESS_KEY,
    # "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
    "AWS_S3_LOCKING_PROVIDER": "dynamodb",
    "DYNAMO_LOCK_TABLE_NAME": "MyLockTable",
}
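
For completeness, a dict like this is then passed straight to the writer. A minimal usage sketch, assuming the storage_options dict above and a placeholder table path:

import pandas as pd
from deltalake import write_deltalake

df = pd.DataFrame({"x": [1, 2, 3]})
# storage_options is the dict defined above; the S3 path is hypothetical.
write_deltalake("s3://my-bucket/my-table", df, storage_options=storage_options)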

@danielgafni

danielgafni commented Jan 20, 2024

Thanks.
I'm on 0.15.1.
Just setting the environment variable "AWS_S3_LOCKING_PROVIDER" worked for me (with the default "delta_log" table name).
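
A minimal sketch of that approach, assuming the default delta_log lock table already exists and AWS credentials come from the environment; the path and data are placeholders:

import os

import pandas as pd
from deltalake import write_deltalake

# Only the locking provider is set; the lock table name falls back to "delta_log".
os.environ["AWS_S3_LOCKING_PROVIDER"] = "dynamodb"

df = pd.DataFrame({"x": [1, 2, 3]})
write_deltalake("s3://my-bucket/my-table", df)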

@ale-rinaldi
Contributor

I think it's also worth documenting the permissions required to work on a Delta table stored on AWS S3.

In my case, I needed:

  • On the bucket storing the Delta table: s3:GetObject, s3:PutObject, s3:DeleteObject. The delete permission is needed for temporary files in the log folder, even if you're just appending.
  • On the DynamoDB table: dynamodb:GetItem, dynamodb:Query, dynamodb:PutItem, dynamodb:UpdateItem. I've seen some code that also calls create_table; I don't know whether it's used, but I created the table manually and omitting this permission caused no problems for me.
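
For illustration, a sketch of an IAM policy granting exactly those permissions, created via boto3; the bucket name, table ARN, account ID, and policy name are all hypothetical:

import json

import boto3

# Grants the S3 and DynamoDB permissions listed above.
# Resource ARNs are placeholders; substitute your bucket, region, and account ID.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::my-delta-bucket/*",
        },
        {
            "Effect": "Allow",
            "Action": [
                "dynamodb:GetItem",
                "dynamodb:Query",
                "dynamodb:PutItem",
                "dynamodb:UpdateItem",
            ],
            "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/delta_log",
        },
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="delta-rs-writer",
    PolicyDocument=json.dumps(policy_document),
)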

@MusKaya

MusKaya commented Mar 13, 2024

@wjones127 when using S3-compatible storage (other than AWS S3), one might have one pair of access and secret keys for the storage and another pair for DynamoDB. In this case, how can these two pairs be provided separately, so one is used for storage and the other for DynamoDB?

@ion-elgreco
Collaborator

@ale-rinaldi would you mind adding this info to our docs?

@MusKaya

MusKaya commented Apr 6, 2024

@ale-rinaldi would you mind adding this info to our docs?

@ion-elgreco you are not referring to this, right? Right now we have a real use case for what I described above (using different credentials for S3 and DynamoDB), and I created #2287 for it. If it is already supported, it would be great to have the documentation clarify it; otherwise, we need to accommodate a separate set of credentials for DynamoDB to unblock decoupling DynamoDB from S3.

@ale-rinaldi
Contributor

@ion-elgreco of course! I opened #2393

ion-elgreco pushed a commit that referenced this issue Apr 6, 2024
# Description
This documents the required AWS permissions on S3 and DynamoDB to
interact with deltalakes.

# Related Issue(s)
- mentions #1091
@kwodzicki

Experiencing some issues that may be related to this.

I set up a DynamoDB table using the following command:

aws dynamodb create-table \
    --table-name delta_rs_lock_table \
    --attribute-definitions AttributeName=key,AttributeType=S \
    --key-schema AttributeName=key,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST

And running the following example:

import boto3
import pandas as pd

from deltalake import DeltaTable
from deltalake import writer

credentials = boto3.Session().get_credentials().get_frozen_credentials()

storage_options = {
    "AWS_ACCESS_KEY_ID": credentials.access_key,
    "AWS_SECRET_ACCESS_KEY": credentials.secret_key,
    "AWS_SESSION_TOKEN": credentials.token,
    "AWS_REGION": "us-east-1",
    "AWS_S3_LOCKING_PROVIDER": "dynamodb",
    "DYNAMO_LOCK_PARTITION_KEY_VALUE": "key",
    "DYNAMO_LOCK_TABLE_NAME": "delta_rs_lock_table",
}

df = pd.DataFrame(
    {"x": [1, 2, 3]},
)

bucket = "my-bucket"  # placeholder; substitute your bucket name
output = f"s3://{bucket}/some_delta_lake"
writer.write_deltalake(output, df, storage_options=storage_options)

I receive the following error when running:

[2024-06-03T16:02:48Z ERROR deltalake_aws::logstore] dynamodb client failed to write log entry: GenericDynamoDb { source: Unhandled(Unhandled { source: ErrorMetadata { code: Some("ValidationException"), message: Some("One or more parameter values were invalid: Missing the key key in the item"), extras: Some({"aws_request_id": "******"}) }, meta: ErrorMetadata { code: Some("ValidationException"), message: Some("One or more parameter values were invalid: Missing the key key in the item"), extras: Some({"aws_request_id": "******"}) } }) }

Looking at the policies assigned to my AWS account, it seems that I have all the permissions/policies that have been discussed above.

Not sure what I am missing.

@dhirschfeld
Contributor

In the published documentation they specify the create-table command as:

aws dynamodb create-table \
    --table-name delta_log \
    --attribute-definitions AttributeName=tablePath,AttributeType=S AttributeName=fileName,AttributeType=S \
    --key-schema AttributeName=tablePath,KeyType=HASH AttributeName=fileName,KeyType=RANGE \
    --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5
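
For anyone preferring Python, a boto3 sketch equivalent to that command, using the same schema and table name as the published docs:

import boto3

dynamodb = boto3.client("dynamodb")

# Composite key: tablePath (partition key) plus fileName (sort key),
# matching the schema from the published documentation above.
dynamodb.create_table(
    TableName="delta_log",
    AttributeDefinitions=[
        {"AttributeName": "tablePath", "AttributeType": "S"},
        {"AttributeName": "fileName", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "tablePath", "KeyType": "HASH"},
        {"AttributeName": "fileName", "KeyType": "RANGE"},
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)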

@kwodzicki

Thank you @dhirschfeld, this solved my issue.
