
update: Rewrite update script #372

Open
fstachura wants to merge 5 commits into master

Conversation

fstachura
Collaborator

No description provided.

@fstachura fstachura marked this pull request as draft December 29, 2024 22:43
@fstachura
Collaborator Author

fstachura commented Dec 29, 2024

I noticed that deduplicating definitions from references doesn't work properly.

@fstachura fstachura marked this pull request as ready for review December 30, 2024 00:20
@Daniil159x

Hi, I think these changes are good and work better than update.py on master.

But I see idle CPU time while the futures are being processed in the main thread.

[image]

Maybe async would utilize the CPU better?

@tleb
Member

tleb commented Feb 1, 2025

I cannot reproduce the good performance. I compared the original update.py versus my own PoC (called update-ng.py below) versus yours (called update-franek.py below). Everything is in a single branch to simplify testing (sorry for the crappy commit messages).

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| update.py | 40.472 ± 0.196 | 40.250 | 40.617 | 4.72 ± 0.04 |
| update-ng.py | 8.578 ± 0.055 | 8.531 | 8.639 | 1.00 |
| update-franek.py | 80.363 ± 0.164 | 80.204 | 80.531 | 9.37 ± 0.06 |

Here is what it looks like:

⟩ hyperfine --min-runs 3 --export-markdown benchmark-table.md \
--parameter-list update update.py,update-ng.py,update-franek.py \
--prepare 'rm -rf data/musl/data/*' \
'TLEB_UPDATE={update} TLEB_NO_FETCH=1 ./utils/index ./data musl'
Benchmark 1: TLEB_UPDATE=update.py TLEB_NO_FETCH=1 ./utils/index ./data musl
  Time (mean ± σ):     40.472 s ±  0.196 s    [User: 71.356 s, System: 39.680 s]
  Range (min … max):   40.250 s … 40.617 s    3 runs

Benchmark 2: TLEB_UPDATE=update-ng.py TLEB_NO_FETCH=1 ./utils/index ./data musl
  Time (mean ± σ):      8.578 s ±  0.055 s    [User: 72.419 s, System: 38.537 s]
  Range (min … max):    8.531 s …  8.639 s    3 runs

Benchmark 3: TLEB_UPDATE=update-franek.py TLEB_NO_FETCH=1 ./utils/index ./data musl
  Time (mean ± σ):     80.363 s ±  0.164 s    [User: 78.747 s, System: 49.339 s]
  Range (min … max):   80.204 s … 80.531 s    3 runs

Summary
  TLEB_UPDATE=update-ng.py TLEB_NO_FETCH=1 ./utils/index ./data musl ran
    4.72 ± 0.04 times faster than TLEB_UPDATE=update.py TLEB_NO_FETCH=1 ./utils/index ./data musl
    9.37 ± 0.06 times faster than TLEB_UPDATE=update-franek.py TLEB_NO_FETCH=1 ./utils/index ./data musl
  • script.sh is limited to the first 10 tags because my laptop is on battery.
  • TLEB_NO_FETCH=1 makes sure that utils/index does not try doing Git fetches (avoids network ops in the benchmark).
  • Not run in a Docker container, so that hyperfine reports valid usr and sys timings.
  • I have a weird, reproducible issue with my update-ng.py that gives these timings: wallclock 7.040s, usr 18.589s, sys 178.334s. On the same system, update.py does wallclock 22.012s, usr 23.667s, sys 46.578s. Notice the massive sys time, which I cannot explain.

@fstachura
Collaborator Author

fstachura commented Feb 6, 2025

@tleb On how many tags did you test? I decreased the chunksize in my script to 100 (the "1000" argument in the update_version call) and it seems to run much faster now, without much performance difference between our scripts. But I also ran it only on a single tag. The chunksize calculation in yours makes much more sense, though.

@Daniil159x I'm not sure if async would help much here. It might if the threads were blocked on I/O most of the time, but both my script and @tleb's do database I/O on a single thread. There is also I/O related to reading from other processes (a lot of the processing happens in script.sh/ctags/git), but again, it's not clear to me whether that's the bottleneck. async also has some overhead AFAIK. On the other hand, I do see that neither of the scripts achieves 100% CPU utilization; it always hovers around 99%, so maybe?

@fstachura
Collaborator Author

With the chunksize calculation from @tleb's script (already on my faster-update branch), the performance of the scripts is very similar, at least on my machine:

Benchmark 1: TLEB_UPDATE=update.py TLEB_NO_FETCH=1 ./utils/index ./data musl
  Time (mean ± σ):     82.743 s ±  0.109 s    [User: 125.653 s, System: 88.404 s]
  Range (min … max):   82.623 s … 82.835 s    3 runs
 
Benchmark 2: TLEB_UPDATE=update-ng.py TLEB_NO_FETCH=1 ./utils/index ./data musl
  Time (mean ± σ):     41.873 s ±  0.280 s    [User: 154.509 s, System: 132.848 s]
  Range (min … max):   41.685 s … 42.195 s    3 runs

Benchmark 3: TLEB_UPDATE=update-franek.py TLEB_NO_FETCH=1 ./utils/index ./data musl
  Time (mean ± σ):     45.776 s ±  0.854 s    [User: 154.594 s, System: 138.161 s]
  Range (min … max):   45.131 s … 46.745 s    3 runs
 
Summary
  TLEB_UPDATE=update-ng.py TLEB_NO_FETCH=1 ./utils/index ./data musl ran
    1.09 ± 0.02 times faster than TLEB_UPDATE=update-franek.py TLEB_NO_FETCH=1 ./utils/index ./data musl
    1.98 ± 0.01 times faster than TLEB_UPDATE=update.py TLEB_NO_FETCH=1 ./utils/index ./data musl

script.sh (resolved review thread)
self.refs_lock = Lock()
self.docs_lock = Lock()
self.comps_lock = Lock()
self.comps_docs_lock = Lock()
Member (tleb):

Why are those locks required? I thought a single thread was writing into the database? If it is because you are indexing multiple versions at the same time, why do that? My PoC did one version after the other, which simplifies code without losing performance.

Collaborator Author (fstachura):

I think the locks are just defensive programming/leftover from an older design. I will remove them. I wish there was a simple way to ensure that UpdatePartialState does not end up in another thread somehow.

if hash in self.hash_to_idx:
return self.hash_to_idx[hash]
else:
return self.db.blob.get(hash)
Member (tleb):

Why do you have to look locally or in the DB? Why not only do one or the other?

Collaborator Author (fstachura):

Some data, like the hash -> idx and idx -> hash/filename mappings, and whatever was in vers.db, is not saved into the database until the update process finishes. I was hoping to make interrupting the update process a bit safer that way (although I'm actually not 100% sure anymore whether the default Berkeley DB configuration can handle interruptions without breaking).

The idea is pretty basic: refs/defs added in an interrupted update won't have entries in the hash/filename/blob databases.
numBlobs is updated first, permanently reserving id space for the blobs currently being processed. An interrupted update might leave entries with unknown blob ids, but AFAIK this is handled gracefully by the backend (and definitely could be if it's not). The unknown entries could be garbage collected later.
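
A minimal sketch of that reservation idea (the db accessor names here are assumptions, not the actual update.py API):

def reserve_blob_ids(db, new_hashes):
    # Reserve id space up front: even if the update is interrupted later,
    # the reserved ids are never reused by a future run.
    first_idx = db.vars.get('numBlobs')
    db.vars.put('numBlobs', first_idx + len(new_hashes), sync=True)
    # Keep hash -> idx only in memory for now; it is flushed into the
    # hash/filename/blob databases once the whole update succeeds.
    return {h: first_idx + i for i, h in enumerate(new_hashes)}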

Member (tleb) commented Feb 14, 2025:

Can we do the processing on a version-per-version basis? That requires storing all key-value pairs to be updated somewhere: either in memory or in an append-only file. Then, once we are done indexing the version, we do an "update database" step that does all the writes.

That would avoid any database issue caused by indexing raising an error. It also removes all DB ops from the indexing functions, which can stay purely focused on calling ctags or whatever.

So pseudocode would be like:

for version_name in versions_todo:
	all_blobs = ...
	new_blobs = ...
	new_defs = compute_new_defs_multithreaded(version_name, new_blobs)
	list_of_defs_in_version = find_all_defs_in_all_blobs(all_blobs)
	new_refs = find_new_refs_multithreaded(version_name, list_of_defs_in_version)
	# same thing for all types of values we have

	# OK, we are done, we can update the database with new_defs, new_refs, etc.
	save_defs(new_defs)
	save_refs(new_refs)
	# ...

elixir/update.py (3 resolved review threads)
for idx, path in buf:
obj.append(idx, path)

state.db.vers.put(state.tag, obj, sync=True)
Member (tleb):

Why is part of the "add to databases" done in UpdatePartialState and another part is done here?

Collaborator Author (fstachura):

I separate the parts that can have garbage entries left by an unfinished update from the parts that cannot. vers is used to tell whether a tag was already indexed. That's also (partially) why using a database while it's being updated was such a problem.

state.db.blob.put(hash, idx)

# Update versions
blobs = scriptLines('list-blobs', '-p', state.tag)
Member (tleb):

This is done twice, can we cache the value?

Collaborator Author (fstachura):

Good point, I think it could be cached.

# NOTE: it is assumed that update_refs and update_defs are not running
# concurrently. hence, defs are not locked
# defs database MUSNT be updated while get_refs is running
# Get references for a file
Member (tleb):

What are update_refs and update_defs? No functions are named that way in the file. Also, why mustn't they run at the same time? Why is a comment required to make this explicit? It looks hard to see from the code.

That comment wouldn't be required if the code was more readable, something like:

defs = get_defs(...)
refs = get_refs(..., defs)

Why is the code calling get_refs() and get_defs() more complex than the above?

Collaborator Author (fstachura):

What are update_refs and update_defs? No functions are named that way in the file

I think the names are out of date. Maybe I meant get_defs and get_refs.

Also why mustn't they run at the same time? Why is a comment required to explicit this, it looks hard to see from the code

The references update requires the definitions to already be computed.

Why is the code calling get_refs() and get_defs() more complex than the above?

It's submitted as a job to the futures pool. I can't, for example, pass a closure to a worker, because submitted callables have to be picklable. This is why the awkward batch_* functions exist.
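
A standalone illustration of that limitation, assuming the default pickle-based ProcessPoolExecutor (this is not code from the PR):

from concurrent.futures import ProcessPoolExecutor

def make_adder(n):
    # Returns a closure; closures cannot be pickled, so they cannot be
    # shipped to a worker process.
    def add(x):
        return x + n
    return add

def add_two(x):
    # A plain module-level function pickles fine.
    return x + 2

if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:
        print(pool.submit(add_two, 40).result())   # works
        # pool.submit(make_adder(2), 40).result()  # fails: the closure cannot be pickled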

elixir/update.py (2 resolved review threads)
def batch(job):
def f(chunk, **kwargs):
return [job(*args, **kwargs) for args in chunk]
return f
Member (tleb):

Why is that required when pool's *map*() methods all have a chunksize argument?

Collaborator Author (fstachura):

I opted for manually creating jobs in this weird way because I wanted to be able to see, from update_version, how many are finished. map just gives you an iterable of results.
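
For reference, a minimal sketch of tracking per-chunk progress from a list of futures (the chunking and worker here are placeholders, not the PR's batch_* functions):

from concurrent.futures import ProcessPoolExecutor, as_completed

def process_chunk(chunk):
    # Placeholder worker; imagine ctags/defs extraction here.
    return [x * x for x in chunk]

if __name__ == '__main__':
    chunks = [list(range(i, i + 100)) for i in range(0, 1000, 100)]
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(process_chunk, ch) for ch in chunks]
        # as_completed yields futures as they finish, so the main thread can
        # report progress in one place while the workers keep running.
        for done, future in enumerate(as_completed(futures), start=1):
            results = future.result()
            print(f'{done}/{len(futures)} chunks finished')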

return f

# NOTE: some of the following functions are kind of redundant, and could sometimes be
# higher-order functions, but that's not supported by multiprocessing
Member (tleb) commented Feb 7, 2025:

The question is: why are those required? Not how we could abstract them in a single function, but why we can't fit into the niceties provided by the pools from Python's stdlib.

Collaborator Author (fstachura):

I'm not sure what you would replace it with, although there probably are better options. The main issue was that, from what I remember, I couldn't pass callables returned by higher-order functions to the executor; only named module-level functions could be passed. That made some of this code awkward.

comps = BsdDB(getDataDir() + '/compatibledts.db', True, DefList)
result = [get_comps_docs(*args, comps=comps, **kwargs) for args in chunk]
comps.close()
return result
Member (tleb):

Surprised to see a wild database opened here. Aren't they all already opened by the caller?

Collaborator Author (fstachura):

When batch_comps_docs is called, the comps database is already closed in the main thread. You shouldn't open the same database from two different threads with the current berkeleydb config (at least not if one handle is open read-write). I think you also cannot pass handles between processes. Here it's only opened in read-only mode.


# Split list into sublists of chunk_size elements
def split_into_chunks(list, chunk_size):
return [list[i:i+chunk_size] for i in range(0, len(list), chunk_size)]
Member (tleb):

Again, we should use multiprocessing.pool.Pool.map(chunksize=) and not reimplement it ourselves.

Collaborator Author (fstachura):

Map only gives you an iterable that you can block on. I wanted to see how many tasks are finished from the main thread, to put progress logs in a single place, and I needed a list of futures for that.


# Start refs job
futures = [pool.submit(batch_refs, ch) for ch in chunks]
return ("refs", (futures, handle_refs_results(state), None))
Member (tleb):

Why is code not linear? Why do we need callbacks?

Collaborator Author (fstachura) commented Feb 11, 2025:

I maybe tried too hard to "engineer" this. Maybe it could be simplified a bit.

First, I wanted to avoid tracking progress all over the place: progress bars should be hidden in a single loop, without the functions that do the computation having to think about it.
Second, I was hoping to model the relationships between parts of the update job. The references update needs the definitions update to finish first; comps_docs needs comps. But I don't want to wait for defs to finish before starting comps_docs if comps finishes first.

I basically wanted the scheduling logic to be separate from the part that actually computes data, and all of that to be separate from the part that flushes results into the database.
I was also hoping to handle database safety better (do not share between threads, do not share between processes, do not read while it's open read-write); that's why databases are closed at certain points. It's still not great, I feel a bit limited by the multiprocessing lib (you can't/shouldn't share state).

So I decided to represent the jobs in a declarative way (submit the part that does computation to the pool, call the part that flushes to the database on each result, call the part that schedules the next step when all parts finish), and have a loop handle these declarations without caring about what is actually computed.
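
A simplified sketch of that loop shape (the handler names are placeholders; the real to_track entries and handlers in update.py differ):

from concurrent.futures import FIRST_COMPLETED, wait

# jobs maps a name to (futures, on_result, on_all_done); this loop only tracks
# completion and never looks at what is actually being computed.
def run_jobs(jobs):
    while any(futures for futures, _, _ in jobs.values()):
        pending = [f for futures, _, _ in jobs.values() for f in futures]
        done, _ = wait(pending, return_when=FIRST_COMPLETED)
        for name in list(jobs):
            futures, on_result, on_all_done = jobs[name]
            for future in futures:
                if future in done:
                    on_result(future.result())   # flush this result into the database
            remaining = [f for f in futures if f not in done]
            jobs[name] = (remaining, on_result, on_all_done)
            if futures and not remaining and on_all_done is not None:
                jobs.update(on_all_done())       # schedule the next stage (e.g. refs after defs)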

Member (tleb) commented Feb 14, 2025:

But I don't want to wait for defs to finish to start comps_docs, if comps finishes first.

Why? I think the whole issue with the current solution is that it tries doing everything all at the same time. It is fine to do defs then refs then docs, etc. Doing everything at the same time doesn't make things faster. Doing all X at once then all Y is as fast (or faster).

It's still not great, I feel limited by multiprocessing lib a bit (you can't/shouldn't share state).

We should make our processing fit into the abstractions it gives us. What we do is map from blobs to defs/refs inside of it. Not sharing state is a benefit to readability, it shouldn't be seen as something that hinders your solution.

# and bsddb cannot be shared between processes until/unless bsddb concurrent data store is implemented.
# Operations on closed databases raise exceptions that would in this case be indicative of a bug.
state.db.defs.sync()
state.db.defs.close()
Member (tleb):

Why do we only close this database? I would expect all databases to be closed once at the end.

Collaborator Author (fstachura):

The defs database is reopened read-only in a different worker, in batch_refs. I want to close the read-write handle first to make sure reading it later is safe.

to_track = {
"defs": ([], handle_defs_results(state), after_all_defs_done),
"docs": ([], handle_batch_results(state.add_docs), None),
}
Member (tleb):

That part (and the code below) is really hard to make sense of. We should aim for more linear code. Probably build arrays of tasks to do and pass them to a pool. Or something else (?).

Collaborator Author (fstachura):

Explained in previous comment

Member (tleb) left a review:

Overall, my complaint is that the code is not linear. That makes it really hard to understand, and I've been reading and writing Elixir indexing code more or less recently. I can't imagine myself or others making sense of it in two years' time.

As said in some comments, I'd expect code to be more like

x = get_required_values()

y = compute_y(x)
z = compute_z(x, y)

cleanup_foobar()

If things need to be done in parallel, it should be a single call to pool.*map*() function(s). That makes execution flow (what happens when) and resource management (what is used/freed when) easy to understand. Futures & co are nice features, but they make code obtuse.
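
For illustration, a minimal sketch of that shape (the worker and data here are placeholders, not the actual update.py functions):

from multiprocessing import Pool

def get_defs_for_blob(blob):
    # Placeholder worker; imagine a ctags invocation per blob here.
    return (blob, len(blob))

if __name__ == '__main__':
    blobs = ['blob-%d' % i for i in range(10000)]
    with Pool() as pool:
        # chunksize batches work items internally, so there is no manual
        # chunking and no future bookkeeping in the caller.
        defs = pool.map(get_defs_for_blob, blobs, chunksize=100)
    print(len(defs), 'blobs processed')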

Something else not touched on in the comments (but discussed in real life :-) ) is that the logs are not useful. I haven't run this version (please rebase on master, which has the utils/index script, to make testing easier), but the code suggests it prints not-so-useful things.

A good starting point for the logs would be a summary at the end of each indexed tag. Something like (ideally aligned):

tag v2.6: N blobs, M defs, O refs, P compatibles, S seconds

One last thing about logging: make sure to not lose errors! There are try except Exception blocks that ignore the exception and print a generic error message. That is not useful for debugging purposes. It must do what it can (maybe stop all processing of the current version and try indexing other versions).

@fstachura
Collaborator Author

fstachura commented Feb 12, 2025

Thanks for the review!

please rebase on master that has the utils/index script to make testing easier

Rebased.

One last thing about logging: make sure to not lose errors! There are try except Exception blocks that ignore the exception and print a generic error message. That is not useful for debugging purposes. It must do what it can (maybe stop all processing of the current version and try indexing other versions).

logging.exception also prints the exception. I also don't like that the exception is not explicitly passed as an argument.

I explained why the code is not linear in one of the review comments.

I thought I would state the design goals here; we should've discussed these earlier. I think we agree about most of this, but some of it is up for discussion.

Compared to the current script:

  • Avoid threads (GIL)
  • Make scheduling more effective (no tag locks, no long tail that runs on a single thread)
  • Separate scheduling from computation (this also refers to your ideas to compute an index update, and apply it later, maybe with another program)
  • Make logs more meaningful
  • Do not write to the database from multiple threads
  • Make update script crashes/interruptions safer

If not for the clunkiness of berkeleydb, the last two points would be unnecessary. But I'm assuming we are staying with berkeleydb for now.

tleb and others added 2 commits February 12, 2025 08:20
Avoid calling this parse-docs script that is expensive. This heuristic
avoids running it on most files, and is almost free.

Signed-off-by: Théo Lebrun <[email protected]>
By default ctags sorts entries. This is not useful to the update script,
but takes time.
user time for `update.py 16` on musl v1.2.5 went from 1m21.613s to
1m11.849s.
New update script uses futures to dynamically schedule many smaller
tasks between a constant number of threads, instead of statically
assigning a single long running task to each thread.
This results in better CPU saturation.

Database handles are not shared between threads anymore; instead,
the main thread is used to commit the results of other threads into the
database.
This trades locking on database access for serialization costs: since
multiprocessing is used, values returned from futures are pickled
(although in practice that depends on the ProcessPool configuration).