-
Notifications
You must be signed in to change notification settings - Fork 477
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add sql notebook and a few fixes #1281
Conversation
@@ -17,10 +19,10 @@ def __init__(self, conn: BaseBackend, name: str, in_memory: bool = False): | |||
self.in_memory = in_memory | |||
|
|||
def build_artifact_store(self): | |||
raise NotImplementedError | |||
return FileSystemArtifactStore(conn='.superduperdb/artifacts/', name='ibis') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defaults for SQL databases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this '.superduperdb/artifacts/'
is used a lot of place may be we. should make it global constant or in config please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently only 1 place in main code-base.
@@ -157,8 +157,52 @@ def _get_all_fields(self, db): | |||
def select_table(self): | |||
return self.table_or_collection | |||
|
|||
def _execute_with_pre_like(self, db): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separated this logic from base, since requirements different from MongoDB.
@@ -273,16 +374,26 @@ def select_using_ids(self, ids): | |||
).isin(ids) | |||
) | |||
|
|||
def select_ids_of_missing_outputs(self, key: str, model: str): | |||
return self.select(self.table_or_collection.primary_id) | |||
def _select_ids_of_missing_outputs(self, key: str, model: str, query_id: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic was missing - anti_join
selects diff of 2 tables.
|
||
def __getitem__(self, item): | ||
return IbisQueryTable( | ||
identifier=self.identifier, primary_id=self.primary_id | ||
).__getitem__(item) | ||
|
||
def to_query(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A Table
is not a CompoundSelect
- convert it to one with this. It's possible this entire logic can be simplified.
filtered = output_table.filter( | ||
output_table.key == key and output_table.query_id == query_id | ||
) | ||
return self.anti_join(filtered, filtered.input_id == self[self.primary_id]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a way to unify this logic?
out = super().repr_() | ||
match = re.match('.*__([a-z]{2,2})__\(([a-z0-9_\.\']+)\)', out) | ||
match = re.match('.*__([a-z]+)__\(([a-z0-9_\.\']+)\)', out) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was broken for, e.g., my_table['id']
.
@@ -27,10 +27,9 @@ def as_dict(self): | |||
class QueryID(Base): # type: ignore[valid-type, misc] | |||
__tablename__ = 'query_id_table' | |||
|
|||
query_id = Column(Integer, primary_key=True, autoincrement=False) | |||
query_id = Column(String, primary_key=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This breaks due to the order of creation. Maybe a better way to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
query_id should be auto incremented integer, 0, 1, 2, ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was addressed elsewhere - by having query_id
as string prevents book-keeping necessary on the side of the developer.
f765932
to
890f3f2
Compare
'query': query_serialized, | ||
'model': model, | ||
'hash': query_hash, | ||
'query_id': query_hash, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
query_id
was incremented in memory - would only work within session. Replacing this integer-id with
the hash as the id. This means that queries automatically carry their id around with them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
@@ -332,6 +333,8 @@ def execute(self, query: ExecuteQuery, *args, **kwargs) -> ExecuteResult: | |||
return self.insert(query, *args, **kwargs) | |||
if isinstance(query, Select): | |||
return self.select(query, *args, **kwargs) | |||
if isinstance(query, Table): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A query can now also be a Table
.
|
||
if db is not None: | ||
if isinstance(select, IbisCompoundSelect): | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simplifying the logic of "adding queries" to the system.
@@ -210,7 +202,7 @@ def test_nested_query(ibis_sqllite_db): | |||
}, | |||
) | |||
|
|||
t = IbisTable(identifier='my_table', schema=schema) | |||
t = Table(identifier='my_table', schema=schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed this to Table
to bring in line with the way we use MongoDB Collection
.
@@ -239,20 +239,6 @@ def test_pm_predict(predict_mixin): | |||
predict_mixin.predict('x', db, select, distributed=True) | |||
predict_func.assert_called_once() | |||
|
|||
with patch.object(db.metadata, 'add_query') as add_query, patch.object( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer relevant.
) | ||
raise e | ||
|
||
if out is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SQLAlchemy
doesn't raise an error if the item doesn't exist.
@@ -53,8 +52,8 @@ class Job(Base, DictMixin): # type: ignore[valid-type, misc] | |||
class ParentChildAssociation(Base): # type: ignore[valid-type, misc] | |||
__tablename__ = 'parent_child_association' | |||
|
|||
parent_id = Column(String, ForeignKey('component.id'), primary_key=True) | |||
child_id = Column(String, ForeignKey('component.id'), primary_key=True) | |||
parent_id = Column(String, primary_key=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was causing an issue in duckdb
if the parent hadn't been created yet.
@@ -86,7 +86,8 @@ def __post_init__(self): | |||
else: | |||
self.identifier = f'{self.model.identifier}/{self.id_key}' | |||
self.features = {} | |||
if hasattr(self.select, 'features'): | |||
|
|||
if 'features' in dir(self.select): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ducktyping -- will be refactored here: #1282
@@ -33,7 +34,9 @@ def _auto_identify_connection_string(item: str, **kwargs) -> t.Any: | |||
CFG.vector_search = item | |||
|
|||
else: | |||
raise NotImplementedError(f'Can\'t auto-identify connection string {item}') | |||
if re.match(r'^[a-zA-Z0-9]+://', item) is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying to be more flexible now with ibis
.
890f3f2
to
71a9d82
Compare
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #1281 +/- ##
===========================================
- Coverage 80.33% 63.76% -16.58%
===========================================
Files 95 103 +8
Lines 6602 7010 +408
===========================================
- Hits 5304 4470 -834
- Misses 1298 2540 +1242
☔ View full report in Codecov by Sentry. |
pyproject.toml
Outdated
@@ -119,7 +117,7 @@ testing = [ | |||
"pytest-asyncio", | |||
"urllib3<2", # see https://github.com/urllib3/urllib3/issues/3053 | |||
] | |||
dev = ["superduperdb[torch,apis,docs,quality,testing]"] | |||
dev = ["superduperdb[torch,apis,docs,quality,testing,server]"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
, server
also can we add it as new commit if possible
@@ -254,7 +249,7 @@ def select_ids_of_missing_outputs(self, key: str, model: str): | |||
|
|||
return self._query_from_parts( | |||
table_or_collection=self.table_or_collection, | |||
query_linker=self.query_linker.select_ids_of_missing_outputs( | |||
query_linker=self.query_linker._select_ids_of_missing_outputs( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_select_ids_of_missing_outputs is a private method we should not use it outside this module
def wrap_document(self, r, encoders): | ||
""" | ||
Wrap a document in a ``Document``. | ||
""" | ||
return IbisDocument(Document.decode(r, encoders)) | ||
|
||
def as_pandas(self): | ||
return pandas.DataFrame([Document(r).unpack() for r in self.raw_cursor]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.raw_cursor is already a pandas object by default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a list of dictionaries currently.
@@ -17,10 +19,10 @@ def __init__(self, conn: BaseBackend, name: str, in_memory: bool = False): | |||
self.in_memory = in_memory | |||
|
|||
def build_artifact_store(self): | |||
raise NotImplementedError | |||
return FileSystemArtifactStore(conn='.superduperdb/artifacts/', name='ibis') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this '.superduperdb/artifacts/'
is used a lot of place may be we. should make it global constant or in config please
@@ -27,10 +27,9 @@ def as_dict(self): | |||
class QueryID(Base): # type: ignore[valid-type, misc] | |||
__tablename__ = 'query_id_table' | |||
|
|||
query_id = Column(Integer, primary_key=True, autoincrement=False) | |||
query_id = Column(String, primary_key=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
query_id should be auto incremented integer, 0, 1, 2, ...
'query': query_serialized, | ||
'model': model, | ||
'hash': query_hash, | ||
'query_id': query_hash, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
if self.query_linker is not None: | ||
new_query_linker = IbisQueryLinker( | ||
table_or_collection=self.table_or_collection, | ||
members=[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: [query_linker_stub.members + self.query_linker.members]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct.
superduperdb/backends/ibis/query.py
Outdated
assert self.pre_like is None | ||
assert self.post_like is None | ||
assert self.query_linker is not None | ||
query_id = str(hash(json.dumps(self.serialize()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
query_id = self.hash()
assert self.query_linker is not None | ||
query_id = str(hash(json.dumps(self.serialize()))) | ||
|
||
out = self._query_from_parts( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return self._query_from_parts
filtered = output_table.filter( | ||
output_table.key == key and output_table.query_id == query_id | ||
) | ||
out = self.anti_join( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return self.anti_join(
4bac706
to
da930d8
Compare
da930d8
to
e1fe714
Compare
examples/sql-use-case.ipynb
IbisCompoundSelect.select_ids_of_missing_outputs
IbisCompoundSelect.execute
SQLAlchemyMetadata.create_parent_child_association