
move rest_api, sql_database and filesystem sources to dlt core #1728

Merged

Conversation

@willi-mueller (Collaborator) commented Aug 22, 2024

Description

This PR:

  • includes the rest_api source as of 7c913bfad9033029a21371cf6c8d90f2b5f2e142
  • corrects type annotations in the test suite
  • modularizes the test suite
  • moves tests/sources/helpers/rest_client/conftest.py into tests/sources/rest_api/conftest.py so that the rest_client tests import it from rest_api
  • formats imports
  • refactors types
  • executes the demo pipelines as part of the test suite, using a secret from Google Secrets Manager
  • corrects the docs on how to configure secrets for the demo pipeline
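
For context, the rest_api source being moved is driven by a declarative configuration dict. A minimal sketch of that shape, with the endpoint and incremental field names taken from the validation messages quoted later in this thread (the base URL and values are illustrative, not from the PR):

```python
# Illustrative shape of a rest_api source configuration.
# Field names mirror those in the PR's validation messages;
# the base_url and concrete values here are made up.
config = {
    "client": {"base_url": "https://api.example.com"},
    "resources": [
        {
            "name": "posts",
            "endpoint": {
                "path": "posts",
                "incremental": {
                    "start_param": "since",
                    "end_param": "until",
                    "cursor_path": "updated_at",
                    "initial_value": "1",
                },
            },
        }
    ],
}

# the incremental section is the part the typing discussion below is about
incremental = config["resources"][0]["endpoint"]["incremental"]
```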

Related Issues

@willi-mueller willi-mueller linked an issue Aug 22, 2024 that may be closed by this pull request
7 tasks
@willi-mueller willi-mueller self-assigned this Aug 22, 2024
@willi-mueller willi-mueller requested a review from burnash August 22, 2024 11:18
netlify bot commented Aug 22, 2024

Deploy Preview for dlt-hub-docs ready!

Name: dlt-hub-docs
🔨 Latest commit: ea1ce2c
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/66ddb7369c24280008683a52
😎 Deploy Preview: https://deploy-preview-1728--dlt-hub-docs.netlify.app

@willi-mueller willi-mueller force-pushed the feat/1484-integrate-rest-api-generic-source-into-dlt-core branch 2 times, most recently from 1150cc1 to 4de267f Compare August 22, 2024 11:50
@rudolfix (Collaborator) left a comment

please see my input re. test organization

you also didn't move the "demo pipeline". this is OK for now; we will do a cleanup when all sources are moved

tests/utils.py (resolved)
tests/utils.py (resolved)
@willi-mueller willi-mueller requested a review from rudolfix August 23, 2024 12:29
@willi-mueller willi-mueller force-pushed the feat/1484-integrate-rest-api-generic-source-into-dlt-core branch from 7588b51 to 449625c Compare August 23, 2024 12:37
@rudolfix rudolfix added the "ci full" (run the full load tests on pr) and "sprint" (Marks group of tasks with core team focus at this moment) labels Aug 25, 2024
@rudolfix (Collaborator) left a comment

we still need to work on test structure:
1. it is good you moved all tests that run pipelines to tests/load/sources
2. but all other tests should be in tests/sources - see my filesystem PR. they should be able to run in common tests!
3. also look at the filesystem PR and see how I run all examples from the pipeline file as a test. easy!

paginator: Optional[PaginatorConfig]


class IncrementalArgs(TypedDict, total=False):
Collaborator

please move this to the extract.incremental.typing module. leave only convert here, which is not part of the dlt interface.

Collaborator Author

done

dlt/sources/rest_api/typing.py (outdated, resolved)

class IncrementalArgs(TypedDict, total=False):
cursor_path: str
initial_value: Optional[str]
Collaborator

this TypedDict should be generic using TCursorValue. all values that now have "str" and also LastValueFunc should remain generic in the extract...typing module.

then make a concrete TypedDict here with Any and add convert

let's clean this up :)

Collaborator Author

Using the generic TCursorValue instead of str we get the following errors:

E                       dlt.common.exceptions.DictValidationException: In path .: field 'resources[0]' expects the following types: str, EndpointResource. Provided value {'name': 'posts', 'endpoint': {'path': 'posts', 'incremental': {'start_param': 'since', 'end_param': 'until', 'cursor_path': 'updated_at', 'initial_value': '1', 'end_value': '86401', 'convert': <function test_posts_with_inremental_date_conversion.<locals>.<lambda> at 0x15abca980>}}} with type 'dict' is invalid with the following errors:
E                       For EndpointResource: In path ./resources[0]: field 'endpoint' expects the following types: str, Endpoint. Provided value {'path': 'posts', 'incremental': {'start_param': 'since', 'end_param': 'until', 'cursor_path': 'updated_at', 'initial_value': '1', 'end_value': '86401', 'convert': <function test_posts_with_inremental_date_conversion.<locals>.<lambda> at 0x15abca980>}} with type 'dict' is invalid with the following errors:
E                       For Endpoint: In path ./resources[0]/endpoint/incremental: field 'cursor_path' has expected type 'TCursorValue' which lacks validator
E                       For str: In path ./resources[0]: field 'endpoint' with value {'path': 'posts', 'incremental': {'start_param': 'since', 'end_param': 'until', 'cursor_path': 'updated_at', 'initial_value': '1', 'end_value': '86401', 'convert': <function test_posts_with_inremental_date_conversion.<locals>.<lambda> at 0x15abca980>}} has invalid type 'dict' while 'str' is expected

Collaborator Author

We decided to leave out support for non-string configs in this PR and I created a follow-up issue: #1757

It seems that our implementation already supports int etc. but we'd need to implement the dict validation.
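
The layout the reviewer asked for can be sketched roughly as follows. This is a hypothetical illustration, not the final code: a generic TypedDict that would live in the extract.incremental.typing module, plus a concrete str/Any-based one with `convert` kept in rest_api. Note that generic TypedDicts need Python 3.11+ (or `typing_extensions` on older versions):

```python
import sys
from typing import Any, Callable, Generic, Optional, TypeVar

# Generic TypedDict syntax requires Python 3.11+; fall back to
# typing_extensions on older interpreters.
if sys.version_info >= (3, 11):
    from typing import TypedDict
else:
    from typing_extensions import TypedDict

TCursorValue = TypeVar("TCursorValue")

# Hypothetical generic base, as would live in dlt.extract.incremental.typing:
class IncrementalTyping(TypedDict, Generic[TCursorValue], total=False):
    cursor_path: str
    initial_value: Optional[TCursorValue]
    end_value: Optional[TCursorValue]

# Concrete version for rest_api: parametrized with Any, and convert
# added here since it is not part of the dlt interface.
class IncrementalArgs(IncrementalTyping[Any], total=False):
    convert: Optional[Callable[[Any], Any]]

args: IncrementalArgs = {"cursor_path": "updated_at", "convert": int}
```

As the thread notes, the blocker was not the typing itself but that dlt's dict validator has no validator for a bare `TCursorValue` field, which is what #1757 tracks.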

map: Optional[Callable[[Any], Any]] # noqa: A003


class ResourceBase(TypedDict, total=False):
Collaborator

Merge with TResourceHints: extract common base to TResourceHintsBase and use it here

Collaborator Author

When we inherit from TResourceHintsBase here, mypy complains. Is there another way to make all fields optional in the subtype that are required in the sibling or parent type?

dlt/sources/rest_api/typing.py:243: error: Overwriting TypedDict field "write_disposition" while extending  [misc]
dlt/sources/rest_api/typing.py:244: error: Overwriting TypedDict field "parent" while extending  [misc]
dlt/sources/rest_api/typing.py:246: error: Overwriting TypedDict field "primary_key" while extending  [misc]
dlt/sources/rest_api/typing.py:248: error: Overwriting TypedDict field "schema_contract" while extending  [misc]
dlt/sources/rest_api/typing.py:249: error: Overwriting TypedDict field "table_format" while extending  [misc]

Collaborator Author

Done by making the fields optional in the supertype.
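
A minimal sketch of that fix, with a hypothetical field subset: mypy rejects re-declaring a TypedDict field while extending, so the shared base declares the fields once with `total=False` and both subtypes inherit them unchanged.

```python
from typing import Optional, TypedDict

class TResourceHintsBase(TypedDict, total=False):
    # fields declared once, all optional, so subtypes never redeclare them
    # (redeclaring triggers mypy's "Overwriting TypedDict field ... while
    # extending" error)
    write_disposition: Optional[str]
    parent: Optional[str]
    primary_key: Optional[str]

class ResourceBase(TResourceHintsBase, total=False):
    # rest_api-specific keys live only in the subtype
    name: str

hints: ResourceBase = {"name": "posts", "write_disposition": "append"}
```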

Collaborator

@rudolfix you'll have to verify that this is ok, I think it is


@pytest.mark.parametrize(
"destination_config",
destinations_configs(default_sql_configs=True),
Collaborator

also add local filesystem configs

Collaborator Author

Done in d47cf64

@willi-mueller (Collaborator Author)

> we still need to work on test structure: 1. it is good you moved all tests that run pipelines to tests/load/sources 2. but all other tests should be in tests/sources - see my filesystem PR. they should be able to run in common tests! 3. also look for filesystem PR and see how I run all examples from pipeline file as a test. easy!

Thank you, I applied your technique – nice! Waiting for CI to verify that the github secret is present in this repo too.

@willi-mueller willi-mueller force-pushed the feat/1484-integrate-rest-api-generic-source-into-dlt-core branch 9 times, most recently from e7a23c2 to 0328be6 Compare September 2, 2024 14:03
@AstrakhantsevaAA (Contributor) left a comment

@sh-rp these init templates are good, especially the default one. Users would actually use it just to have a structure for their future pipelines, so they don't have to write all the @dlt.resource/@dlt.source/dlt.pipeline/dlt.run boilerplate that is common to every dlt pipeline; it's already quite useful even for me. Do not overcomplicate this default example, keep it like this: simple.

Notes:

  • Add a pandas example similar to pyarrow (maybe the same script?) and call it dataframe.
  • Add a RestAPI Client example, because we don't have templates for it.
  • I would add all examples from our Introduction page, maybe in one script, and call it intro or getting_started so people could play with these examples before starting to build their own pipelines.

[Screenshot 2024-09-05 at 17:18]

def resource():
    # here we create an arrow table from a list of python objects for demonstration
    # in the real world you will have a source that already has arrow tables
    yield pa.Table.from_pylist([{"name": "tom", "age": 25}, {"name": "angela", "age": 23}])
Contributor

can you move this toy data outside of the resource function, something like:

def get_data():
    return pa.Table.from_pylist([{"name": "tom", "age": 25}, {"name": "angela", "age": 23}])

@dlt.resource(write_disposition="append", name="people")
def resource():
    # here we create an arrow table from a list of python objects for demonstration
    # in the real world you will have a source that already has arrow tables
    yield get_data()

so users can more easily adapt this template to their use case

Contributor

we should also do the same for the pandas dataframe template
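
The pandas counterpart could follow the same pattern. A sketch under the same assumptions as the arrow suggestion above (function and resource names are illustrative; the dlt decorator is shown as a comment so the data helper stands alone):

```python
import pandas as pd

def get_data():
    # toy data kept outside the resource so users can swap in their own frames
    return pd.DataFrame([{"name": "tom", "age": 25}, {"name": "angela", "age": 23}])

# in the template this would then be wrapped as:
# @dlt.resource(write_disposition="append", name="people")
# def resource():
#     yield get_data()

df = get_data()
```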

Collaborator

@AstrakhantsevaAA I agree on all points. I also like the very simple starting-point templates and would like to keep them this way. Maybe you could add a rest_client example when you have time; I'll try to do all the other stuff tomorrow before I leave.

@@ -195,7 +201,7 @@ def _copy(item: FileItemDict):
"parquet_example": 1034,
"listing": 11,
"csv_example": 1279,
"csv_duckdb_example": 1280,
"csv_duckdb_example": 1281, # TODO: i changed this from 1280, what is going on? :)
Collaborator

this needs to be investigated

@@ -155,6 +156,9 @@ def test_load_sql_table_incremental(
"""
os.environ["SOURCES__SQL_DATABASE__CHAT_MESSAGE__INCREMENTAL__CURSOR_PATH"] = "updated_at"

if not IS_SQL_ALCHEMY_20 and backend == "connectorx":
Collaborator

we can probably get these to work again on sqlalchemy 1.4 if we use an int column instead of a date column for the incremental cursor

@@ -1022,12 +1025,17 @@ def assert_no_precision_columns(
# no precision, no nullability, all hints inferred
# pandas destroys decimals
expected = convert_non_pandas_types(expected)
# on one of the timestamps somehow there is timezone info...
actual = remove_timezone_info(actual)
Collaborator

the timezone-related behavior in this test should be better understood

@@ -96,6 +96,7 @@ def assert_csv_file(item: FileItem):
assert len(list(nested_file | assert_csv_file)) == 1


@pytest.mark.skip("Needs secrets toml to work..")
Collaborator

maybe we should have a place for source tests where secrets are present. we can probably fix this with the right ENV var though

date_col=mimesis.Datetime().date(),
time_col=mimesis.Datetime().time(),
float_col=random.random(),
json_col='{"data": [1, 2, 3]}', # NOTE: can we do this?
Collaborator

fyi: this is a change from an actual object to serialized JSON, and it fixes a number of problems in the tests. I think this is fine.

Collaborator

a string is a valid JSON value, so now we test whether we can store a string. not sure those tests are meaningful but I think it is good enough
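
The point about strings and JSON can be illustrated directly (a plain-Python sketch, independent of any database driver):

```python
import json

# what the test column used to hold: a Python object the driver had to adapt
obj_value = {"data": [1, 2, 3]}

# what it holds now: the same content, pre-serialized to a JSON string
str_value = json.dumps(obj_value)

# the string round-trips to the original object, so no information is lost;
# the database's JSON type parses it on insert instead of the driver
assert json.loads(str_value) == obj_value
```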

Column("time_col", Time, nullable=nullable),
Column("float_col", Float, nullable=nullable),
Column("json_col", JSONB, nullable=nullable),
Column("bool_col", Boolean, nullable=nullable),
Collaborator

TODO: Uuid column removed here, we can put it back with a conditional on sqlalchemy 2.0

@sh-rp sh-rp force-pushed the feat/1484-integrate-rest-api-generic-source-into-dlt-core branch from cb47aea to f9177dc Compare September 5, 2024 19:40
@sh-rp sh-rp force-pushed the feat/1484-integrate-rest-api-generic-source-into-dlt-core branch from f9177dc to 5e32407 Compare September 5, 2024 19:48
@sh-rp sh-rp force-pushed the feat/1484-integrate-rest-api-generic-source-into-dlt-core branch from 322e9d7 to caaa8e5 Compare September 5, 2024 23:02
@rudolfix (Collaborator) left a comment

LGTM for an alpha! thanks everyone for working on this!

@rudolfix rudolfix merged commit 51516b1 into devel Sep 8, 2024
46 of 57 checks passed
@rudolfix rudolfix deleted the feat/1484-integrate-rest-api-generic-source-into-dlt-core branch September 8, 2024 14:48
Labels
ci full (run the full load tests on pr), sprint (Marks group of tasks with core team focus at this moment)
Development

Successfully merging this pull request may close these issues.

Integrate REST API generic source into dlt core
4 participants