
move rest_api, sql_database and filesystem sources to dlt core #1728

Merged

Conversation

@willi-mueller (Collaborator) commented Aug 22, 2024

Description

This PR:

  • includes the rest_api source as of 7c913bfad9033029a21371cf6c8d90f2b5f2e142
  • corrects type annotations in the test suite
  • modularizes the test suite
  • moves tests/sources/helpers/rest_client/conftest.py into tests/sources/rest_api/conftest.py so that the rest_client tests import it from rest_api
  • formats imports
  • refactors types
  • executes the demo pipelines as part of the test suite, using a secret from Google Secrets Manager
  • corrects the docs on how to configure secrets for the demo pipeline
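
For context, the rest_api source being moved is driven by a declarative configuration dict. A minimal sketch of that shape, with the endpoint and incremental field names taken from the validation messages quoted later in this thread (the base URL and values are illustrative, not from the PR):

```python
# Illustrative shape of a rest_api source configuration.
# Field names mirror those in the PR's validation messages;
# the base_url and concrete values here are made up.
config = {
    "client": {"base_url": "https://api.example.com"},
    "resources": [
        {
            "name": "posts",
            "endpoint": {
                "path": "posts",
                "incremental": {
                    "start_param": "since",
                    "end_param": "until",
                    "cursor_path": "updated_at",
                    "initial_value": "1",
                },
            },
        }
    ],
}

# the incremental section is the part the typing discussion below is about
incremental = config["resources"][0]["endpoint"]["incremental"]
```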

Related Issues

@willi-mueller willi-mueller linked an issue Aug 22, 2024 that may be closed by this pull request
7 tasks
@willi-mueller willi-mueller self-assigned this Aug 22, 2024
@willi-mueller willi-mueller requested a review from burnash August 22, 2024 11:18
netlify bot commented Aug 22, 2024

Deploy Preview for dlt-hub-docs ready!

Name: dlt-hub-docs
🔨 Latest commit: ea1ce2c
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/66ddb7369c24280008683a52
😎 Deploy Preview: https://deploy-preview-1728--dlt-hub-docs.netlify.app

@willi-mueller willi-mueller force-pushed the feat/1484-integrate-rest-api-generic-source-into-dlt-core branch 2 times, most recently from 1150cc1 to 4de267f Compare August 22, 2024 11:50
@rudolfix (Collaborator) left a comment

please see my input re. test organization

you also didn't move the "demo pipeline". this is OK for now; we will do a cleanup when all sources are moved

tests/utils.py (resolved)
tests/utils.py (resolved)
@willi-mueller willi-mueller requested a review from rudolfix August 23, 2024 12:29
@willi-mueller willi-mueller force-pushed the feat/1484-integrate-rest-api-generic-source-into-dlt-core branch from 7588b51 to 449625c Compare August 23, 2024 12:37
@rudolfix rudolfix added the "ci full" (run the full load tests on pr) and "sprint" (Marks group of tasks with core team focus at this moment) labels Aug 25, 2024
@rudolfix (Collaborator) left a comment

we still need to work on test structure:
1. it is good you moved all tests that run pipelines to tests/load/sources
2. but all other tests should be in tests/sources - see my filesystem PR. they should be able to run in common tests!
3. also look at the filesystem PR and see how I run all examples from the pipeline file as a test. easy!

paginator: Optional[PaginatorConfig]


class IncrementalArgs(TypedDict, total=False):
Collaborator

please move this to the extract.incremental.typing module. leave only convert here, which is not part of the dlt interface.

Collaborator Author

done

dlt/sources/rest_api/typing.py (outdated, resolved)

class IncrementalArgs(TypedDict, total=False):
cursor_path: str
initial_value: Optional[str]
Collaborator

this TypedDict should be generic using TCursorValue. all values that now have "str" and also LastValueFunc should remain generic in the extract...typing module.

then make a concrete TypedDict here with Any and add convert

let's clean this up :)

Collaborator Author

Using the generic TCursorValue instead of str we get the following errors:

E                       dlt.common.exceptions.DictValidationException: In path .: field 'resources[0]' expects the following types: str, EndpointResource. Provided value {'name': 'posts', 'endpoint': {'path': 'posts', 'incremental': {'start_param': 'since', 'end_param': 'until', 'cursor_path': 'updated_at', 'initial_value': '1', 'end_value': '86401', 'convert': <function test_posts_with_inremental_date_conversion.<locals>.<lambda> at 0x15abca980>}}} with type 'dict' is invalid with the following errors:
E                       For EndpointResource: In path ./resources[0]: field 'endpoint' expects the following types: str, Endpoint. Provided value {'path': 'posts', 'incremental': {'start_param': 'since', 'end_param': 'until', 'cursor_path': 'updated_at', 'initial_value': '1', 'end_value': '86401', 'convert': <function test_posts_with_inremental_date_conversion.<locals>.<lambda> at 0x15abca980>}} with type 'dict' is invalid with the following errors:
E                       For Endpoint: In path ./resources[0]/endpoint/incremental: field 'cursor_path' has expected type 'TCursorValue' which lacks validator
E                       For str: In path ./resources[0]: field 'endpoint' with value {'path': 'posts', 'incremental': {'start_param': 'since', 'end_param': 'until', 'cursor_path': 'updated_at', 'initial_value': '1', 'end_value': '86401', 'convert': <function test_posts_with_inremental_date_conversion.<locals>.<lambda> at 0x15abca980>}} has invalid type 'dict' while 'str' is expected

Collaborator Author

We decided to leave out support for non-string configs in this PR and I created a follow-up issue: #1757

It seems that our implementation already supports int etc. but we'd need to implement the dict validation.
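
The layout the reviewer asked for can be sketched roughly as follows. This is a hypothetical illustration, not the final code: a generic TypedDict that would live in the extract.incremental.typing module, plus a concrete str/Any-based one with `convert` kept in rest_api. Note that generic TypedDicts need Python 3.11+ (or `typing_extensions` on older versions):

```python
import sys
from typing import Any, Callable, Generic, Optional, TypeVar

# Generic TypedDict syntax requires Python 3.11+; fall back to
# typing_extensions on older interpreters.
if sys.version_info >= (3, 11):
    from typing import TypedDict
else:
    from typing_extensions import TypedDict

TCursorValue = TypeVar("TCursorValue")

# Hypothetical generic base, as would live in dlt.extract.incremental.typing:
class IncrementalTyping(TypedDict, Generic[TCursorValue], total=False):
    cursor_path: str
    initial_value: Optional[TCursorValue]
    end_value: Optional[TCursorValue]

# Concrete version for rest_api: parametrized with Any, and convert
# added here since it is not part of the dlt interface.
class IncrementalArgs(IncrementalTyping[Any], total=False):
    convert: Optional[Callable[[Any], Any]]

args: IncrementalArgs = {"cursor_path": "updated_at", "convert": int}
```

As the thread notes, the blocker was not the typing itself but that dlt's dict validator has no validator for a bare `TCursorValue` field, which is what #1757 tracks.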

map: Optional[Callable[[Any], Any]] # noqa: A003


class ResourceBase(TypedDict, total=False):
Collaborator

Merge with TResourceHints: extract common base to TResourceHintsBase and use it here

Collaborator Author

When we inherit from TResourceHintsBase here, mypy complains. Is there another way to make all fields optional in the subtype that are required in the sibling or parent type?

dlt/sources/rest_api/typing.py:243: error: Overwriting TypedDict field "write_disposition" while extending  [misc]
dlt/sources/rest_api/typing.py:244: error: Overwriting TypedDict field "parent" while extending  [misc]
dlt/sources/rest_api/typing.py:246: error: Overwriting TypedDict field "primary_key" while extending  [misc]
dlt/sources/rest_api/typing.py:248: error: Overwriting TypedDict field "schema_contract" while extending  [misc]
dlt/sources/rest_api/typing.py:249: error: Overwriting TypedDict field "table_format" while extending  [misc]

Collaborator Author

Done by making the fields optional in the supertype.
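
A minimal sketch of that fix, with a hypothetical field subset: mypy rejects re-declaring a TypedDict field while extending, so the shared base declares the fields once with `total=False` and both subtypes inherit them unchanged.

```python
from typing import Optional, TypedDict

class TResourceHintsBase(TypedDict, total=False):
    # fields declared once, all optional, so subtypes never redeclare them
    # (redeclaring triggers mypy's "Overwriting TypedDict field ... while
    # extending" error)
    write_disposition: Optional[str]
    parent: Optional[str]
    primary_key: Optional[str]

class ResourceBase(TResourceHintsBase, total=False):
    # rest_api-specific keys live only in the subtype
    name: str

hints: ResourceBase = {"name": "posts", "write_disposition": "append"}
```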

Collaborator

@rudolfix you'll have to verify that this is ok, I think it is


@pytest.mark.parametrize(
"destination_config",
destinations_configs(default_sql_configs=True),
Collaborator

also add local filesystem configs

Collaborator Author

Done in d47cf64

@willi-mueller (Collaborator Author)

> we still need to work on test structure: 1. it is good you moved all tests that run pipelines to tests/load/sources 2. but all other tests should be in tests/sources - see my filesystem PR. they should be able to run in common tests! 3. also look for filesystem PR and see how I run all examples from pipeline file as a test. easy!

Thank you, I applied your technique – nice! Waiting for CI to verify that the github secret is present in this repo too.

@willi-mueller willi-mueller force-pushed the feat/1484-integrate-rest-api-generic-source-into-dlt-core branch 9 times, most recently from e7a23c2 to 0328be6 Compare September 2, 2024 14:03
@AstrakhantsevaAA (Contributor) left a comment

@sh-rp these init templates are good, especially the default one. Users would actually use it just to have a structure for their future pipelines, so they don't have to write all the @dlt.resource/@dlt.source/dlt.pipeline/dlt.run boilerplate that is common to every dlt pipeline; it's already quite useful even for me. Do not overcomplicate this default example, keep it like this: simple.

Notes:

  • Add a pandas example similar to pyarrow (maybe the same script?) and call it dataframe.
  • Add a RestAPI Client example, because we don't have templates for it.
  • I would add all examples from our Introduction page, maybe in one script, and call it intro or getting_started so people could play with these examples before starting to build their own pipelines.

[Screenshot 2024-09-05 at 17:18]

def resource():
    # here we create an arrow table from a list of python objects for demonstration
    # in the real world you will have a source that already has arrow tables
    yield pa.Table.from_pylist([{"name": "tom", "age": 25}, {"name": "angela", "age": 23}])
Contributor

can you move this toy data outside of the resource function, something like:

def get_data():
    return pa.Table.from_pylist([{"name": "tom", "age": 25}, {"name": "angela", "age": 23}])

@dlt.resource(write_disposition="append", name="people")
def resource():
    # here we create an arrow table from a list of python objects for demonstration
    # in the real world you will have a source that already has arrow tables
    yield get_data()

so users can more easily adapt this template to their use case

Contributor

we should also do the same for the pandas dataframe template
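
The pandas counterpart could follow the same pattern. A sketch under the same assumptions as the arrow suggestion above (function and resource names are illustrative; the dlt decorator is shown as a comment so the data helper stands alone):

```python
import pandas as pd

def get_data():
    # toy data kept outside the resource so users can swap in their own frames
    return pd.DataFrame([{"name": "tom", "age": 25}, {"name": "angela", "age": 23}])

# in the template this would then be wrapped as:
# @dlt.resource(write_disposition="append", name="people")
# def resource():
#     yield get_data()

df = get_data()
```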

Collaborator

@AstrakhantsevaAA I agree on all points. I also like the very simple starting-point templates and would like to keep them this way. Maybe you could add a rest_client example when you have time; I'll try to do all the other stuff tomorrow before I leave.

@@ -195,7 +201,7 @@ def _copy(item: FileItemDict):
"parquet_example": 1034,
"listing": 11,
"csv_example": 1279,
"csv_duckdb_example": 1280,
"csv_duckdb_example": 1281, # TODO: i changed this from 1280, what is going on? :)
Collaborator

this needs to be investigated

@@ -155,6 +156,9 @@ def test_load_sql_table_incremental(
"""
os.environ["SOURCES__SQL_DATABASE__CHAT_MESSAGE__INCREMENTAL__CURSOR_PATH"] = "updated_at"

if not IS_SQL_ALCHEMY_20 and backend == "connectorx":
Collaborator

we can probably get these to work again on sqlalchemy 1.4 if we use an int column instead of a date column for the incremental cursor

@@ -1022,12 +1025,17 @@ def assert_no_precision_columns(
# no precision, no nullability, all hints inferred
# pandas destroys decimals
expected = convert_non_pandas_types(expected)
# on one of the timestamps somehow there is timezone info...
actual = remove_timezone_info(actual)
Collaborator

the timezone-related behavior in this test should be better understood

@@ -96,6 +96,7 @@ def assert_csv_file(item: FileItem):
assert len(list(nested_file | assert_csv_file)) == 1


@pytest.mark.skip("Needs secrets toml to work..")
Collaborator

maybe we should have a place for source tests where secrets are present. we can probably fix this with the right ENV var though

date_col=mimesis.Datetime().date(),
time_col=mimesis.Datetime().time(),
float_col=random.random(),
json_col='{"data": [1, 2, 3]}', # NOTE: can we do this?
Collaborator

fyi: this is a change from an actual object to serialized JSON, and it fixes a number of problems in the tests. I think this is fine.

Collaborator

a string is a valid JSON value, so now we test whether we can store a string. not sure those tests are meaningful but I think it is good enough
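
The point about strings and JSON can be illustrated directly (a plain-Python sketch, independent of any database driver):

```python
import json

# what the test column used to hold: a Python object the driver had to adapt
obj_value = {"data": [1, 2, 3]}

# what it holds now: the same content, pre-serialized to a JSON string
str_value = json.dumps(obj_value)

# the string round-trips to the original object, so no information is lost;
# the database's JSON type parses it on insert instead of the driver
assert json.loads(str_value) == obj_value
```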

Column("time_col", Time, nullable=nullable),
Column("float_col", Float, nullable=nullable),
Column("json_col", JSONB, nullable=nullable),
Column("bool_col", Boolean, nullable=nullable),
Collaborator

TODO: Uuid column removed here, we can put it back with a conditional on sqlalchemy 2.0

@sh-rp sh-rp force-pushed the feat/1484-integrate-rest-api-generic-source-into-dlt-core branch from cb47aea to f9177dc Compare September 5, 2024 19:40
@sh-rp sh-rp force-pushed the feat/1484-integrate-rest-api-generic-source-into-dlt-core branch from f9177dc to 5e32407 Compare September 5, 2024 19:48
@sh-rp sh-rp force-pushed the feat/1484-integrate-rest-api-generic-source-into-dlt-core branch from 322e9d7 to caaa8e5 Compare September 5, 2024 23:02
@rudolfix (Collaborator) left a comment

LGTM for an alpha! thanks everyone for working on this!

@rudolfix rudolfix merged commit 51516b1 into devel Sep 8, 2024
46 of 57 checks passed
@rudolfix rudolfix deleted the feat/1484-integrate-rest-api-generic-source-into-dlt-core branch September 8, 2024 14:48
Labels
ci full (run the full load tests on pr), sprint (Marks group of tasks with core team focus at this moment)
Development

Successfully merging this pull request may close these issues.

Integrate REST API generic source into dlt core
4 participants