move rest_api, sql_database and filesystem sources to dlt core #1728
Conversation
✅ Deploy Preview for dlt-hub-docs ready!
Force-pushed from 1150cc1 to 4de267f
Please see my input re: test organization.
You also didn't move the "demo pipeline"; this is OK for now, we will do a cleanup when all sources are moved.
Force-pushed from 7588b51 to 449625c
We still need to work on the test structure:
1. It is good that you moved all tests that run pipelines to tests/load/sources.
2. But all other tests should be in tests/sources; see my filesystem PR, they should be able to run as common tests!
3. Also look at the filesystem PR to see how I run all the examples from the pipeline file as a test (see the sketch below). Easy!
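For illustration, a minimal sketch of that technique, assuming a demo module named rest_api_pipeline whose example functions start with load_ (module path and naming convention are hypothetical here):

import inspect

import pytest

import rest_api_pipeline as demo  # hypothetical demo pipeline module

# collect every example function from the demo file and run each one as a test case
examples = [
    fn
    for name, fn in inspect.getmembers(demo, inspect.isfunction)
    if name.startswith("load_")  # hypothetical naming convention for examples
]

@pytest.mark.parametrize("example", examples, ids=lambda fn: fn.__name__)
def test_pipeline_example(example) -> None:
    example()  # each example builds and runs its pipeline end to end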
dlt/sources/rest_api/typing.py
Outdated
paginator: Optional[PaginatorConfig]

class IncrementalArgs(TypedDict, total=False):
Please move this to the extract.incremental.typing module. Leave only convert here, which is not a part of the dlt interface.
done
dlt/sources/rest_api/typing.py
Outdated
class IncrementalArgs(TypedDict, total=False):
    cursor_path: str
    initial_value: Optional[str]
This TypedDict should be generic using TCursorValue. All values that now have str, and also LastValueFunc, should remain generic in the extract...typing module. Then make a concrete TypedDict here with Any and add convert. Let's clean this up :)
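A minimal sketch of what that could look like, with the field set abbreviated and assuming generic TypedDict support from typing_extensions (the concrete class name IncrementalRESTArgs is hypothetical):

from typing import Any, Callable, Generic, Optional, TypeVar

from typing_extensions import TypedDict

TCursorValue = TypeVar("TCursorValue")

# generic version, living in the extract incremental typing module
class IncrementalArgs(TypedDict, Generic[TCursorValue], total=False):
    cursor_path: str
    initial_value: Optional[TCursorValue]
    end_value: Optional[TCursorValue]

# concrete version for rest_api: specialize with Any and add the
# REST-API-only "convert" field
class IncrementalRESTArgs(IncrementalArgs[Any], total=False):
    convert: Optional[Callable[[Any], Any]]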
Using the generic TCursorValue instead of str, we get the following errors:
E dlt.common.exceptions.DictValidationException: In path .: field 'resources[0]' expects the following types: str, EndpointResource. Provided value {'name': 'posts', 'endpoint': {'path': 'posts', 'incremental': {'start_param': 'since', 'end_param': 'until', 'cursor_path': 'updated_at', 'initial_value': '1', 'end_value': '86401', 'convert': <function test_posts_with_inremental_date_conversion.<locals>.<lambda> at 0x15abca980>}}} with type 'dict' is invalid with the following errors:
E For EndpointResource: In path ./resources[0]: field 'endpoint' expects the following types: str, Endpoint. Provided value {'path': 'posts', 'incremental': {'start_param': 'since', 'end_param': 'until', 'cursor_path': 'updated_at', 'initial_value': '1', 'end_value': '86401', 'convert': <function test_posts_with_inremental_date_conversion.<locals>.<lambda> at 0x15abca980>}} with type 'dict' is invalid with the following errors:
E For Endpoint: In path ./resources[0]/endpoint/incremental: field 'cursor_path' has expected type 'TCursorValue' which lacks validator
E For str: In path ./resources[0]: field 'endpoint' with value {'path': 'posts', 'incremental': {'start_param': 'since', 'end_param': 'until', 'cursor_path': 'updated_at', 'initial_value': '1', 'end_value': '86401', 'convert': <function test_posts_with_inremental_date_conversion.<locals>.<lambda> at 0x15abca980>}} has invalid type 'dict' while 'str' is expected
We decided to leave out support for non-string configs in this PR, and I created a follow-up issue: #1757.
It seems that our implementation already supports int etc., but we'd need to implement the dict validation.
dlt/sources/rest_api/typing.py
Outdated
map: Optional[Callable[[Any], Any]]  # noqa: A003

class ResourceBase(TypedDict, total=False):
Merge with TResourceHints: extract a common base to TResourceHintsBase and use it here.
When we inherit from TResourceHintsBase here, mypy complains. Is there another way to make all fields optional in the subtype that are required in the sibling or parent type?
dlt/sources/rest_api/typing.py:243: error: Overwriting TypedDict field "write_disposition" while extending [misc]
dlt/sources/rest_api/typing.py:244: error: Overwriting TypedDict field "parent" while extending [misc]
dlt/sources/rest_api/typing.py:246: error: Overwriting TypedDict field "primary_key" while extending [misc]
dlt/sources/rest_api/typing.py:248: error: Overwriting TypedDict field "schema_contract" while extending [misc]
dlt/sources/rest_api/typing.py:249: error: Overwriting TypedDict field "table_format" while extending [misc]
Done by making the fields optional in the supertype.
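For reference, a minimal sketch of the shape of that fix (field set and types simplified): mypy forbids overriding a TypedDict field while extending, so the shared base declares the overlapping fields once, as optional, and both subtypes inherit them unchanged.

from typing import Optional

from typing_extensions import TypedDict

class TResourceHintsBase(TypedDict, total=False):
    # optional in the base, so subtypes don't need to re-declare them
    write_disposition: Optional[str]  # simplified; dlt uses richer hint types
    parent: Optional[str]

class TResourceHints(TResourceHintsBase, total=False):
    # extra hints used by dlt core would go here
    table_name: Optional[str]

class ResourceBase(TResourceHintsBase, total=False):
    # the rest_api resource config reuses the same optional fields
    name: Optional[str]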
@rudolfix you'll have to verify that this is ok, I think it is
@pytest.mark.parametrize(
    "destination_config",
    destinations_configs(default_sql_configs=True),
Also add local filesystem configs.
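A sketch of the suggested parametrization, assuming the destinations_configs helper in tests/load/utils accepts a local_filesystem_configs flag as it does in other dlt load tests:

import pytest

from tests.load.utils import DestinationTestConfiguration, destinations_configs

@pytest.mark.parametrize(
    "destination_config",
    destinations_configs(default_sql_configs=True, local_filesystem_configs=True),
    ids=lambda x: x.name,
)
def test_with_filesystem(destination_config: DestinationTestConfiguration) -> None:
    ...  # test body unchanged; it now also runs against the local filesystem destination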
Done in d47cf64.
Thank you, I applied your technique – nice! Waiting for CI to verify that the GitHub secret is present in this repo too.
Force-pushed from e7a23c2 to 0328be6
…ncompatibilities except for POST request (/search/posts)
@sh-rp these init templates are good, especially the default one. Users would use it even just to have a structure for their future pipelines, so they don't have to write all the @dlt.resource/@dlt.source/dlt.pipeline/dlt.run boilerplate that is common to every dlt pipeline; it's already quite useful even for me. Do not overcomplicate this default example, keep it like this: simple.
Notes:
- Add a pandas example similar to the pyarrow one (maybe the same script?) and call it dataframe.
- Add a RestAPI Client example, because we don't have templates for it.
- I would add all the examples from our Introduction page, maybe in one script, and call it intro or getting_started, so people could play with these examples before starting to build their own pipelines.
def resource():
    # here we create an arrow table from a list of python objects for demonstration
    # in the real world you will have a source that already has arrow tables
    yield pa.Table.from_pylist([{"name": "tom", "age": 25}, {"name": "angela", "age": 23}])
Can you move this toy data outside of the resource function? Something like:

def get_data():
    return pa.Table.from_pylist([{"name": "tom", "age": 25}, {"name": "angela", "age": 23}])

@dlt.resource(write_disposition="append", name="people")
def resource():
    # here we create an arrow table from a list of python objects for demonstration
    # in the real world you will have a source that already has arrow tables
    yield get_data()

so users can more easily adapt this template to their use case.
Also, we should do the same for the pandas dataframe template.
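A hypothetical pandas counterpart of the arrow template above, following the same get_data pattern (dlt resources can yield dataframes directly):

import dlt
import pandas as pd

def get_data():
    # toy data for demonstration; a real source would already produce dataframes
    return pd.DataFrame([{"name": "tom", "age": 25}, {"name": "angela", "age": 23}])

@dlt.resource(write_disposition="append", name="people")
def resource():
    yield get_data()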
@AstrakhantsevaAA I agree on all points. I also like the very simple starting-point templates and would like to keep them this way. Maybe you could add a rest_client example when you have time; I'll try to do all the other stuff tomorrow before I leave.
@@ -195,7 +201,7 @@ def _copy(item: FileItemDict):
     "parquet_example": 1034,
     "listing": 11,
     "csv_example": 1279,
-    "csv_duckdb_example": 1280,
+    "csv_duckdb_example": 1281,  # TODO: i changed this from 1280, what is going on? :)
This needs to be investigated.
@@ -155,6 +156,9 @@ def test_load_sql_table_incremental(
    """
    os.environ["SOURCES__SQL_DATABASE__CHAT_MESSAGE__INCREMENTAL__CURSOR_PATH"] = "updated_at"

    if not IS_SQL_ALCHEMY_20 and backend == "connectorx":
We can probably get these to work again on SQLAlchemy 1.4 if we use an int column instead of a date column for the incremental.
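For illustration, a hedged sketch of that idea, using an int "id" column as the cursor instead of updated_at; it assumes the core sql_table source accepts an incremental argument, as the verified source does:

import dlt
from dlt.sources.sql_database import sql_table  # core location after this PR

# hypothetical: an int cursor sidesteps the datetime issues seen on
# SQLAlchemy 1.4 with the connectorx backend
chat_message = sql_table(
    table="chat_message",
    incremental=dlt.sources.incremental("id", initial_value=0),
)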
@@ -1022,12 +1025,17 @@ def assert_no_precision_columns(
    # no precision, no nullability, all hints inferred
    # pandas destroys decimals
    expected = convert_non_pandas_types(expected)
    # on one of the timestamps somehow there is timezone info...
    actual = remove_timezone_info(actual)
The timezone-related behavior in this test should be better understood.
@@ -96,6 +96,7 @@ def assert_csv_file(item: FileItem):
    assert len(list(nested_file | assert_csv_file)) == 1


@pytest.mark.skip("Needs secrets toml to work..")
Maybe we should have a place for source tests where secrets are present. We can probably fix this with the right ENV var, though.
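E.g., following the same pattern as the cursor-path override earlier in this diff, a secret can be injected through dlt's double-underscore env var resolution instead of secrets.toml (the exact key below is hypothetical):

import os

# hypothetical key; dlt resolves SOURCES__<SOURCE>__CREDENTIALS__... from the environment
os.environ["SOURCES__FILESYSTEM__CREDENTIALS__AWS_ACCESS_KEY_ID"] = "test-key"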
date_col=mimesis.Datetime().date(),
time_col=mimesis.Datetime().time(),
float_col=random.random(),
json_col='{"data": [1, 2, 3]}',  # NOTE: can we do this?
FYI: this changes the value from an actual object to serialized JSON, and it fixes a number of problems in the tests. I think this is fine.
A string is a valid JSON value, so now we test whether we can store a string. Not sure those tests are meaningful, but I think it is good enough.
Column("time_col", Time, nullable=nullable), | ||
Column("float_col", Float, nullable=nullable), | ||
Column("json_col", JSONB, nullable=nullable), | ||
Column("bool_col", Boolean, nullable=nullable), |
TODO: the Uuid column was removed here; we can put it back with a conditional on SQLAlchemy 2.0.
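A sketch of that conditional, assuming the IS_SQL_ALCHEMY_20 flag used elsewhere in this PR and that the column definitions above are collected in a list named cols (sqlalchemy.Uuid only exists in SQLAlchemy 2.0):

if IS_SQL_ALCHEMY_20:
    from sqlalchemy import Uuid  # new in SQLAlchemy 2.0

    cols.append(Column("uuid_col", Uuid, nullable=nullable))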
This reverts commit 47e1933.
Force-pushed from cb47aea to f9177dc
Force-pushed from f9177dc to 5e32407
Force-pushed from 322e9d7 to caaa8e5
clean up other examples a bit
LGTM for an alpha! thanks everyone for working on this!
Description
This PR:
Related Issues