-
Notifications
You must be signed in to change notification settings - Fork 482
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix clickhouse and optimize compatibility when using SQL link #1496
Fix clickhouse and optimize compatibility when using SQL link #1496
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1496 +/- ##
==========================================
- Coverage 80.33% 80.32% -0.02%
==========================================
Files 95 107 +12
Lines 6602 7467 +865
==========================================
+ Hits 5304 5998 +694
- Misses 1298 1469 +171 ☔ View full report in Codecov by Sentry. |
45c9431
to
b564dd7
Compare
Don't forget to update the |
There is a query that doesn't work when using Clickhouse, but works fine using SQLite and MySQL Test case :
|
All update operations will report the following error, but work when using SQLite and MySQL Test case:
https://clickhouse-sqlalchemy.readthedocs.io/en/latest/features.html#update-and-delete |
fca2339
to
4acb5b8
Compare
import pandas as pd | ||
|
||
|
||
def _default_insert_processor(table_name, datas): | ||
"""Default insert processor for SQL dialects.""" | ||
return table_name, datas | ||
|
||
|
||
def _clickhouse_insert_processor(table_name, datas): | ||
"""Insert processor for ClickHouse.""" | ||
return f'`{table_name}`', pd.DataFrame(datas) | ||
|
||
|
||
def get_insert_processor(dialect): | ||
"""Get the insert processor for the given dialect.""" | ||
funcs = { | ||
'clickhouse': _clickhouse_insert_processor, | ||
} | ||
return funcs.get(dialect, _default_insert_processor) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leave a place for future compatibility with other databases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
def get_output_table_name(model_identifier, version): | ||
"""Get the output table name for the given model.""" | ||
# use `_` to connect the model_identifier and version | ||
return f'_outputs_{model_identifier}_{version}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some databases do not support using "/" in table name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good spot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do __
("_" * 2
)? This may be needed for some kind of name analysis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can, but it doesn’t seem to make much sense.
Because we only need to mark the model text between the first _
and the last _
, unless it is changed to a more general symbol to be compatible with more data sets.
A more elegant way is to provide a method in the future to handle the table name processing function of a specific database, so we support a lot of databases, and different databases may have different restrictions.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@blythed @jieguangzhou
should we have a helper function in databackend to create table name?
so for mongodb
def build_table_name(key, model):
return f'{model}/{key}/{model}'
for ibis
def build_table_name:
return compatible name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need a db helper to adapt the different databases in the future.
But now we have to change a lot of stuff about the query because some methods will build table_name, but the methods in the query
do not get the db parameter, WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
def get_db_config(dialect): | ||
if dialect == 'clickhouse': | ||
return create_clickhouse_config() | ||
else: | ||
return DefaultConfig |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If other databases have other table creation behaviors on the meatastore in the future, you can add them here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great
|
||
self._lock = threading.Lock() | ||
|
||
def _init_tables(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using this way of creating tables has better compatibility than using class objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A driver of clickhouse uses the class object method to build a table and does not support updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please can you explain this MetaData
in more detail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MetaData is used for storing and managing the definitions of tables, and this is a common practice for this way to create tables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.metadata is assigned here so that he can create dynamically within _init_tables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, move the metadata into _init_tables
# Connect to metadata store. | ||
# ------------------------------ | ||
# 1. try to connect to the metadata store specified in the configuration. | ||
# 2. if that fails, try to connect to the data backend engine. | ||
# 3. if that fails, try to connect to the data backend uri. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the engines of ibis and sqlalchemy are incompatible, directly use the URL to build the metadatastore
|
||
self._lock = threading.Lock() | ||
|
||
def _init_tables(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please can you explain this MetaData
in more detail?
|
||
def _clickhouse_insert_processor(table_name, datas): | ||
"""Insert processor for ClickHouse.""" | ||
return f'`{table_name}`', pd.DataFrame(datas) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this do something? f'\
{table_name}`'`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some table names we defined can not insert data sometimes, for example, table name with -
_outputs _a-b-c_0
.
After adding this, the generated sql statement will be "INSERT `_outputs _a-b-c_0` xxxxx" in the query of the table name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok
superduperdb/backends/ibis/query.py
Outdated
for k in self.renamings.values(): | ||
if ( | ||
re.match(f'^_outputs/{model}/[0-9]+$', tab.identifier) | ||
re.match(f'^_outputs_{model}_[0-9]+$', tab.identifier) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should use the function above right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
test/unittest/base/test_datalayer.py
Outdated
@pytest.mark.parametrize( | ||
"db", [DBConfig.mongodb_empty, DBConfig.sqldb_empty], indirect=True | ||
) | ||
@pytest.mark.parametrize("db", [DBConfig.sqldb_empty], indirect=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A miss, have fixed
4acb5b8
to
3a7247f
Compare
3a7247f
to
bf8bc7f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great pr just small change need !
if not self.in_memory: | ||
self.conn.insert(table_name, raw_documents) | ||
else: | ||
self.conn.create_table(table_name, pandas.DataFrame(raw_documents)) | ||
|
||
@staticmethod | ||
def convert_data_format(data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jieguangzhou
so we override bytes data to base64 string for all ibis backends?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can also make different databases behave differently. But base64 is more versatile
@@ -55,7 +83,7 @@ def create_model_table_or_collection(self, model: t.Union[Model, APIModel]): | |||
'key': dtype('string'), | |||
} | |||
return Table( | |||
identifier=f'_outputs/{model.identifier}/{model.version}', | |||
identifier=get_output_table_name(model.identifier, model.version), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Databackend should have
def get_table_name(..., ouput=True/False):
...
@jieguangzhou
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import pandas as pd | ||
|
||
|
||
def _default_insert_processor(table_name, datas): | ||
"""Default insert processor for SQL dialects.""" | ||
return table_name, datas | ||
|
||
|
||
def _clickhouse_insert_processor(table_name, datas): | ||
"""Insert processor for ClickHouse.""" | ||
return f'`{table_name}`', pd.DataFrame(datas) | ||
|
||
|
||
def get_insert_processor(dialect): | ||
"""Get the insert processor for the given dialect.""" | ||
funcs = { | ||
'clickhouse': _clickhouse_insert_processor, | ||
} | ||
return funcs.get(dialect, _default_insert_processor) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
@@ -301,7 +298,7 @@ def model_update( # type: ignore[override] | |||
for r in table_records: | |||
if isinstance(r['output'], dict) and '_content' in r['output']: | |||
r['output'] = r['output']['_content']['bytes'] | |||
db.databackend.insert(f'_outputs/{model}/{version}', table_records) | |||
db.databackend.insert(get_output_table_name(model, version), table_records) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
db.databackend.get_table_name(model, version, output=True)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def get_db_config(dialect): | ||
if dialect == 'clickhouse': | ||
return create_clickhouse_config() | ||
else: | ||
return DefaultConfig |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great
|
||
self._lock = threading.Lock() | ||
|
||
def _init_tables(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.metadata is assigned here so that he can create dynamically within _init_tables
@@ -121,7 +138,7 @@ def drop(self, force: bool = False): | |||
default=False, | |||
): | |||
logging.warn('Aborting...') | |||
Base.metadata.drop_all(self.conn) | |||
self.metadata.drop_all(self.conn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jieguangzhou since we assign metadata other than Base
im not sure if we will have weird bug because of this ...wdyt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, change to
self.query_id_table.drop(self.conn)
self.job_table.drop(self.conn)
self.parent_child_association_table.drop(self.conn)
self.component_table.drop(self.conn)
self.meta_table.drop(self.conn)
) | ||
parents = [a.parent_id for a in assocations] | ||
res = session.execute(stmt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a nitpick:
self.execute(stmt)
which returns parsed result
and make it one-liner since we are using this two three times below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Done
Component.version == version, | ||
).update({key: value}) | ||
stmt = ( | ||
self.component_table.update() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The updated value is defined after the statement, But now, I change to
stmt = (
update(self.component_table)
.where(
self.component_table.c.type_id == type_id,
self.component_table.c.identifier == identifier,
self.component_table.c.version == version,
)
.values({key: value})
)
superduperdb/components/schema.py
Outdated
@@ -40,7 +40,7 @@ def pre_create(self, db) -> None: | |||
@property | |||
def raw(self): | |||
return { | |||
k: (v.identifier if not isinstance(v, Encoder) else 'Bytes') | |||
k: (v.identifier if not isinstance(v, Encoder) else 'String') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok so we are moving away from bytes to string for Encoders type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that efficient? Should this not be case-by-case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that efficient?
It will be a little less efficient than directly storing bytes
Should this not be case-by-case?
Hey, @kartik4949 @blythed . I added a DBHelper to handle this.
Not converted by default, which will be converted to String unless otherwise defined (such as clickhouse).
…bility across various databases.
2d918fa
to
253dd9c
Compare
#1471
#1490
Description
Related Issues
Checklist
make test
successfully?Additional Notes or Comments