DynamoDB: Add table loader for full-load operations #226

Merged 1 commit on Aug 19, 2024

Conversation

@amotl amotl commented Aug 19, 2024

About

Bring DynamoDB full-load to Toolkit's `ctk load table` interface.

Documentation

https://cratedb-toolkit--226.org.readthedocs.build/io/dynamodb/loader.html

Status

Alpha. For now, the implementation uses the same simple strategy of storing each source record in a single `data` (OBJECT) column in CrateDB. The 1:1 strategy may follow.

Backlog

/cc @hammerhead, @zolbatar

@amotl amotl force-pushed the amo/dynamodb-next branch from c85e125 to a86ec49 Compare August 19, 2024 02:16
records_target = self.cratedb_adapter.count_records(self.cratedb_table)
logger.info(f"Target: CrateDB table={self.cratedb_table} count={records_target}")
progress_bar = tqdm(total=records_in)
result = self.dynamodb_adapter.scan(table_name=self.dynamodb_table)

@amotl amotl Aug 19, 2024

Another variant to scan the table, maybe for resuming on errors?

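key = None
while True:
    if key is None:
        response = table.scan()
    else:
        response = table.scan(ExclusiveStartKey=key)
    # Process response["Items"] here.
    key = response.get("LastEvaluatedKey")
    if key is None:
        # No LastEvaluatedKey means the last page was reached.
        break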

/cc @wierdvanderhaar
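
To make the resuming idea more concrete, here is a minimal sketch, not the code in this patch. It assumes `scan` is a callable with the keyword interface of boto3's `Table.scan`; the names `scan_pages` and `start_key` are made up for illustration. Each page is yielded together with the key to continue from, so a caller could persist that key and restart from it after an error.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional, Tuple

Key = Dict[str, Any]
Page = Tuple[List[Dict[str, Any]], Optional[Key]]


def scan_pages(
    scan: Callable[..., Dict[str, Any]],
    start_key: Optional[Key] = None,
) -> Iterator[Page]:
    """Yield (items, next_key) per page; next_key is None on the last page.

    `scan` is assumed to behave like boto3's `Table.scan` (hypothetical wiring).
    """
    key = start_key
    while True:
        # Only pass ExclusiveStartKey when there is a key to continue from.
        kwargs = {} if key is None else {"ExclusiveStartKey": key}
        response = scan(**kwargs)
        key = response.get("LastEvaluatedKey")
        yield response.get("Items", []), key
        if key is None:
            break
```

With a real table this might be called as `scan_pages(table.scan, start_key=saved_key)`, where `saved_key` is whatever was stored after the last successfully processed page.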

Convert data for record items to INSERT statements.
"""
for item in items:
    yield self.translator.to_sql(item)

Member Author

That's another item transformation idea picked up from an example program. Please advise if this is sensible in all situations, or if it's just a special case.

if 'id' in item and not isinstance(item['id'], str):
    item['id'] = str(item['id'])

/cc @wierdvanderhaar
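
For discussion, the same transformation as a standalone helper; this is a sketch under assumptions, not part of the patch. `stringify_id` is a hypothetical name, and items are assumed to be plain dictionaries. Note that boto3's resource API returns DynamoDB numbers as `decimal.Decimal`, so a numeric `id` would be converted to its decimal string form.

```python
from decimal import Decimal
from typing import Any, Dict


def stringify_id(item: Dict[str, Any]) -> Dict[str, Any]:
    """Return the item with a non-string `id` converted to `str`.

    Hypothetical helper: it returns a copy instead of mutating the input.
    """
    if "id" in item and not isinstance(item["id"], str):
        return {**item, "id": str(item["id"])}
    return item


# Example with a numeric key, as boto3's resource API would deliver it.
print(stringify_id({"id": Decimal("42"), "name": "foo"}))
```

Whether this is sensible everywhere depends on the key schema: it only makes a difference when `id` is not already a string.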

Comment on lines 1 to 10
# DynamoDB Backlog

- Pagination / Batch Getting.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/programming-with-python.html#programming-with-python-pagination

- Use `batch_get_item`.
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb/client/batch_get_item.html

- Scan by query instead of full.

Member Author

@wierdvanderhaar: With respect to alternative implementations, using batched reading from DynamoDB is probably the way to go when processing large amounts of data?

Yes, we should use the batched method. Let's use a default batch size of 100 but give the option to use different batch sizes.
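
A rough sketch of what a configurable batch size could look like on the processing side, assuming items arrive as any Python iterable. The names `batched` and `DEFAULT_BATCH_SIZE` are illustrative and not taken from the patch.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

# Default suggested in this thread; callers may override it.
DEFAULT_BATCH_SIZE = 100


def batched(items: Iterable[T], batch_size: int = DEFAULT_BATCH_SIZE) -> Iterator[List[T]]:
    """Yield lists of up to `batch_size` items until the input is exhausted."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Each batch could then be translated and submitted to CrateDB in one round trip, instead of record by record.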

@amotl amotl Aug 19, 2024

Thank you. I added the remaining items to the backlog.

Do you agree to merge and release it first, following the "first make it work, then make it fast|beautiful|robust" paradigm, in order to get it out as quickly as possible?

@amotl amotl force-pushed the amo/dynamodb-next branch 2 times, most recently from 4a746ff to 882676e Compare August 19, 2024 12:47
@amotl amotl requested review from surister and hlcianfagna August 19, 2024 12:48
@amotl amotl force-pushed the amo/dynamodb-next branch from 882676e to 736a603 Compare August 19, 2024 12:53
@amotl amotl mentioned this pull request Aug 19, 2024
11 tasks
@amotl amotl requested a review from wierdvanderhaar August 19, 2024 13:05
@amotl amotl marked this pull request as ready for review August 19, 2024 13:05
@amotl amotl merged commit 12e7313 into main Aug 19, 2024
29 checks passed
@amotl amotl deleted the amo/dynamodb-next branch August 19, 2024 14:23