Expose spill configuration to users #2119

Open
Tracked by #1111
wjones127 opened this issue Mar 27, 2024 · 4 comments
Assignees
Labels
enhancement New feature or request

Comments

@wjones127
Contributor

You can configure the memory limit for spilling when making vector indices (#1702) but not for scalar indices (#2043). We should make them both configurable from the same place.

@wjones127 wjones127 added the enhancement New feature or request label Mar 27, 2024
@wjones127 wjones127 self-assigned this Mar 27, 2024
@westonpace
Contributor

We shouldn't actually need to configure this for scalar indices. It is a bug that we are getting a Resources exhausted error. We are using a spilling sort and so this parameter should only be a minor tuning parameter to control how often you need to spill to disk vs. how much RAM to use. For scalar indices the workload is pretty well bounded (scalar columns don't get all that large) and 100MiB should give good enough performance in all cases. I'd rather avoid a configuration parameter unless we can show there are use cases that need to adjust it and get a much better sense of the tradeoff.

The bug is because I had incorrectly assumed that SortExec was the only spillable operator in the training plan. As a result I used the greedy pool which states:

This pool works well for queries that do not need to spill or have a single spillable operator

Unfortunately, it turns out that SortPreservingMergeExec is also spillable. The result is a resources exhausted bug. I will create a PR to switch to the fair pool.
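To make the failure mode concrete, here is a toy model (in Python, not the real DataFusion API) of the two pool strategies described above. The class and method names are illustrative only; the point is that a greedy pool lets one spillable operator consume the entire budget, so a second spillable operator fails outright, while a fair pool caps each consumer at its share so hitting the cap becomes a signal to spill rather than a hard error.

```python
class GreedyPool:
    """Grants memory first-come-first-served up to a global limit."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def try_grow(self, nbytes):
        if self.used + nbytes > self.limit:
            return False  # caller sees "resources exhausted"
        self.used += nbytes
        return True


class FairPool:
    """Caps each spillable consumer at limit / num_spillable, so a
    consumer that hits its cap knows it must spill and retry."""

    def __init__(self, limit, num_spillable):
        self.per_consumer = limit // num_spillable
        self.used = {}

    def try_grow(self, consumer, nbytes):
        used = self.used.get(consumer, 0)
        if used + nbytes > self.per_consumer:
            return False  # this consumer should spill to disk
        self.used[consumer] = used + nbytes
        return True


# Greedy pool: the first spillable operator ("sort") takes the whole
# budget, so the second one ("merge") fails its very first allocation.
greedy = GreedyPool(limit=100)
assert greedy.try_grow(100)      # sort takes everything
assert not greedy.try_grow(1)    # merge fails: resources exhausted

# Fair pool: each of the two operators is bounded to 50 bytes; hitting
# the cap tells that operator to spill, without failing the whole plan.
fair = FairPool(limit=100, num_spillable=2)
assert fair.try_grow("sort", 50)
assert not fair.try_grow("sort", 1)   # sort must spill now
assert fair.try_grow("merge", 50)     # merge still gets its share
```

This mirrors the distinction between DataFusion's greedy pool (works when at most one operator spills) and its fair pool (divides the budget across spillable consumers), which is why switching pools fixes the plan with both SortExec and SortPreservingMergeExec.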

@sandias42

I'm getting a Resource Exhausted error when trying to construct a scalar index on a large dataset (>1TB) despite having more than enough RAM and ample disk space. Is this the same issue?

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[15], line 2
      1 # I'll need a bigger node to run this it seems
----> 2 tbl.create_scalar_index("resume_id", replace=True)

File /opt/conda/envs/raycluster/lib/python3.11/site-packages/lancedb/table.py:1167, in LanceTable.create_scalar_index(self, column, replace)
   1166 def create_scalar_index(self, column: str, *, replace: bool = True):
-> 1167     self._dataset_mut.create_scalar_index(
   1168         column, index_type="BTREE", replace=replace
   1169     )

File /opt/conda/envs/raycluster/lib/python3.11/site-packages/lance/dataset.py:1217, in LanceDataset.create_scalar_index(self, column, index_type, name, replace)
   1209 if index_type != "BTREE":
   1210     raise NotImplementedError(
   1211         (
   1212             'Only "BTREE" is supported for ',
   1213             f"index_type.  Received {index_type}",
   1214         )
   1215     )
-> 1217 self._ds.create_index([column], index_type, name, replace)

OSError: LanceError(IO): Execution error: LanceError(IO): Execution error: External error: Resources exhausted: Failed to allocate additional 1874928 bytes for ExternalSorter[0] with 66245352 bytes already allocated - maximum available is 67055264, /home/runner/work/lance/lance/rust/lance-datafusion/src/chunker.rs:47:46, /home/runner/work/lance/lance/rust/lance-index/src/scalar/btree.rs:1057:29

@westonpace
Contributor

westonpace commented May 29, 2024

@sandias42

Yes, that looks like the same issue. The root cause is unfortunately an upstream issue: apache/datafusion#10073

It will give you this error no matter what size the spill pool has been configured to.

The only workaround right now is to bypass spilling completely by setting LANCE_BYPASS_SPILLING=true
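For anyone applying the workaround from Python, a minimal sketch: the environment variable has to be set before the indexing code path runs, e.g. at the top of the script before any Lance index is built. The commented-out `create_scalar_index` call is the one from the traceback above and assumes a `tbl` handle already exists.

```python
import os

# Bypass spilling entirely; must be set before the spilling code
# path in lance is reached (safest: before importing/using lance).
os.environ["LANCE_BYPASS_SPILLING"] = "true"

# tbl.create_scalar_index("resume_id", replace=True)  # as in the traceback above
```

Setting it in the shell (`export LANCE_BYPASS_SPILLING=true`) before launching the process works equally well.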

@sandias42

Thanks @westonpace! I just found this workaround independently in the lance docs, and it works for me.

FYI, as a new user to lancedb, it's a little tricky to figure out which parts of the lance documentation carry over to lancedb tables, so I've had to do a fair bit of guess-and-check. (Another example of this is Table.merge, which is not in the lancedb Python API docs AFAICT, even though Table.merge_insert is, but is in the lance docs.)

Thanks again for the workaround!
