Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sql notebook and a few fixes #1281

Merged
merged 1 commit into from
Nov 13, 2023

Conversation

blythed
Copy link
Collaborator

@blythed blythed commented Nov 13, 2023

  • Add use-case examples/sql-use-case.ipynb
  • Added missing logic around IbisCompoundSelect.select_ids_of_missing_outputs
  • Refactored IbisCompoundSelect.execute
  • Fixed bug in SQLAlchemyMetadata.create_parent_child_association

pyproject.toml Outdated Show resolved Hide resolved
@@ -17,10 +19,10 @@ def __init__(self, conn: BaseBackend, name: str, in_memory: bool = False):
self.in_memory = in_memory

def build_artifact_store(self):
raise NotImplementedError
return FileSystemArtifactStore(conn='.superduperdb/artifacts/', name='ibis')
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defaults for SQL databases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this '.superduperdb/artifacts/'
is used a lot of place may be we. should make it global constant or in config please

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently only 1 place in main code-base.

@@ -157,8 +157,52 @@ def _get_all_fields(self, db):
def select_table(self):
return self.table_or_collection

def _execute_with_pre_like(self, db):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separated this logic from base, since requirements different from MongoDB.

@@ -273,16 +374,26 @@ def select_using_ids(self, ids):
).isin(ids)
)

def select_ids_of_missing_outputs(self, key: str, model: str):
return self.select(self.table_or_collection.primary_id)
def _select_ids_of_missing_outputs(self, key: str, model: str, query_id: str):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic was missing - anti_join selects diff of 2 tables.


def __getitem__(self, item):
return IbisQueryTable(
identifier=self.identifier, primary_id=self.primary_id
).__getitem__(item)

def to_query(self):
Copy link
Collaborator Author

@blythed blythed Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A Table is not a CompoundSelect - convert it to one with this. It's possible this entire logic can be simplified.

filtered = output_table.filter(
output_table.key == key and output_table.query_id == query_id
)
return self.anti_join(filtered, filtered.input_id == self[self.primary_id])
Copy link
Collaborator Author

@blythed blythed Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a way to unify this logic?

out = super().repr_()
match = re.match('.*__([a-z]{2,2})__\(([a-z0-9_\.\']+)\)', out)
match = re.match('.*__([a-z]+)__\(([a-z0-9_\.\']+)\)', out)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was broken for, e.g., my_table['id'].

@@ -27,10 +27,9 @@ def as_dict(self):
class QueryID(Base): # type: ignore[valid-type, misc]
__tablename__ = 'query_id_table'

query_id = Column(Integer, primary_key=True, autoincrement=False)
query_id = Column(String, primary_key=True)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks due to the order of creation. Maybe a better way to do this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

query_id should be auto incremented integer, 0, 1, 2, ...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was addressed elsewhere - by having query_id as string prevents book-keeping necessary on the side of the developer.

@blythed blythed force-pushed the notebook/sql-example branch from f765932 to 890f3f2 Compare November 13, 2023 12:11
'query': query_serialized,
'model': model,
'hash': query_hash,
'query_id': query_hash,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

query_id was incremented in memory - would only work within session. Replacing this integer-id with
the hash as the id. This means that queries automatically carry their id around with them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -332,6 +333,8 @@ def execute(self, query: ExecuteQuery, *args, **kwargs) -> ExecuteResult:
return self.insert(query, *args, **kwargs)
if isinstance(query, Select):
return self.select(query, *args, **kwargs)
if isinstance(query, Table):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A query can now also be a Table.


if db is not None:
if isinstance(select, IbisCompoundSelect):
try:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplifying the logic of "adding queries" to the system.

@@ -210,7 +202,7 @@ def test_nested_query(ibis_sqllite_db):
},
)

t = IbisTable(identifier='my_table', schema=schema)
t = Table(identifier='my_table', schema=schema)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed this to Table to bring in line with the way we use MongoDB Collection.

@@ -239,20 +239,6 @@ def test_pm_predict(predict_mixin):
predict_mixin.predict('x', db, select, distributed=True)
predict_func.assert_called_once()

with patch.object(db.metadata, 'add_query') as add_query, patch.object(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer relevant.

)
raise e

if out is None:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQLAlchemy doesn't raise an error if the item doesn't exist.

@@ -53,8 +52,8 @@ class Job(Base, DictMixin): # type: ignore[valid-type, misc]
class ParentChildAssociation(Base): # type: ignore[valid-type, misc]
__tablename__ = 'parent_child_association'

parent_id = Column(String, ForeignKey('component.id'), primary_key=True)
child_id = Column(String, ForeignKey('component.id'), primary_key=True)
parent_id = Column(String, primary_key=True)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was causing an issue in duckdb if the parent hadn't been created yet.

@@ -86,7 +86,8 @@ def __post_init__(self):
else:
self.identifier = f'{self.model.identifier}/{self.id_key}'
self.features = {}
if hasattr(self.select, 'features'):

if 'features' in dir(self.select):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ducktyping -- will be refactored here: #1282

@@ -33,7 +34,9 @@ def _auto_identify_connection_string(item: str, **kwargs) -> t.Any:
CFG.vector_search = item

else:
raise NotImplementedError(f'Can\'t auto-identify connection string {item}')
if re.match(r'^[a-zA-Z0-9]+://', item) is None:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to be more flexible now with ibis.

@blythed blythed force-pushed the notebook/sql-example branch from 890f3f2 to 71a9d82 Compare November 13, 2023 12:59
@codecov-commenter
Copy link

codecov-commenter commented Nov 13, 2023

Codecov Report

Attention: 234 lines in your changes are missing coverage. Please review.

Comparison is base (34830a7) 80.33% compared to head (71a9d82) 63.76%.
Report is 41 commits behind head on main.

❗ Current head 71a9d82 differs from pull request most recent head e1fe714. Consider uploading reports for the commit e1fe714 to get more accurate results

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1281       +/-   ##
===========================================
- Coverage   80.33%   63.76%   -16.58%     
===========================================
  Files          95      103        +8     
  Lines        6602     7010      +408     
===========================================
- Hits         5304     4470      -834     
- Misses       1298     2540     +1242     
Files Coverage Δ
superduperdb/__init__.py 100.00% <100.00%> (ø)
superduperdb/backends/mongodb/cdc/base.py 77.77% <100.00%> (ø)
superduperdb/backends/mongodb/cdc/listener.py 82.78% <100.00%> (ø)
superduperdb/base/config.py 100.00% <100.00%> (ø)
superduperdb/ext/torch/model.py 39.75% <100.00%> (-37.05%) ⬇️
superduperdb/misc/download.py 37.44% <100.00%> (-37.92%) ⬇️
superduperdb/misc/serialization.py 82.60% <100.00%> (-2.34%) ⬇️
superduperdb/vector_search/in_memory.py 84.21% <100.00%> (-1.76%) ⬇️
superduperdb/vector_search/lance.py 77.77% <100.00%> (+0.41%) ⬆️
superduperdb/backends/base/metadata.py 51.28% <66.66%> (-2.64%) ⬇️
... and 25 more

... and 34 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

examples/data Outdated Show resolved Hide resolved
pyproject.toml Outdated
@@ -119,7 +117,7 @@ testing = [
"pytest-asyncio",
"urllib3<2", # see https://github.com/urllib3/urllib3/issues/3053
]
dev = ["superduperdb[torch,apis,docs,quality,testing]"]
dev = ["superduperdb[torch,apis,docs,quality,testing,server]"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

, server
also can we add it as new commit if possible

superduperdb/backends/base/metadata.py Show resolved Hide resolved
@@ -254,7 +249,7 @@ def select_ids_of_missing_outputs(self, key: str, model: str):

return self._query_from_parts(
table_or_collection=self.table_or_collection,
query_linker=self.query_linker.select_ids_of_missing_outputs(
query_linker=self.query_linker._select_ids_of_missing_outputs(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_select_ids_of_missing_outputs is a private method we should not use it outside this module

def wrap_document(self, r, encoders):
"""
Wrap a document in a ``Document``.
"""
return IbisDocument(Document.decode(r, encoders))

def as_pandas(self):
return pandas.DataFrame([Document(r).unpack() for r in self.raw_cursor])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.raw_cursor is already a pandas object by default

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a list of dictionaries currently.

@@ -17,10 +19,10 @@ def __init__(self, conn: BaseBackend, name: str, in_memory: bool = False):
self.in_memory = in_memory

def build_artifact_store(self):
raise NotImplementedError
return FileSystemArtifactStore(conn='.superduperdb/artifacts/', name='ibis')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this '.superduperdb/artifacts/'
is used a lot of place may be we. should make it global constant or in config please

@@ -27,10 +27,9 @@ def as_dict(self):
class QueryID(Base): # type: ignore[valid-type, misc]
__tablename__ = 'query_id_table'

query_id = Column(Integer, primary_key=True, autoincrement=False)
query_id = Column(String, primary_key=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

query_id should be auto incremented integer, 0, 1, 2, ...

'query': query_serialized,
'model': model,
'hash': query_hash,
'query_id': query_hash,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

superduperdb/components/model.py Outdated Show resolved Hide resolved
superduperdb/backends/base/query.py Show resolved Hide resolved
if self.query_linker is not None:
new_query_linker = IbisQueryLinker(
table_or_collection=self.table_or_collection,
members=[
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: [query_linker_stub.members + self.query_linker.members]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct.

assert self.pre_like is None
assert self.post_like is None
assert self.query_linker is not None
query_id = str(hash(json.dumps(self.serialize())))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

query_id = self.hash()

assert self.query_linker is not None
query_id = str(hash(json.dumps(self.serialize())))

out = self._query_from_parts(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return self._query_from_parts

filtered = output_table.filter(
output_table.key == key and output_table.query_id == query_id
)
out = self.anti_join(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return self.anti_join(

@blythed blythed force-pushed the notebook/sql-example branch 4 times, most recently from 4bac706 to da930d8 Compare November 13, 2023 14:39
@blythed blythed force-pushed the notebook/sql-example branch from da930d8 to e1fe714 Compare November 13, 2023 14:47
@blythed blythed merged commit 66ea8ce into superduper-io:main Nov 13, 2023
2 checks passed
@blythed blythed deleted the notebook/sql-example branch June 1, 2024 10:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants