Add ExLlamaV2Sampler.Settings.logits_processor
#634
base: master
This is interesting, and I'll be giving it a closer look later today. I'm a little skeptical, though, for a couple of reasons.
Logit processors tend to do a lot of extraneous work. Many operations and temporary allocations that could be a single iteration over a block of memory in the CPU's L2 cache (sometimes fitting in L1, even), or even literally one line of C++ code in some cases, turn into multiple kernel launches, each of which has to process the entire logit array after all but a few dozen options have been masked out. And you end up performing multiple softmax operations too if you want to combine samplers, since each processor has to output logits for the next processor in the stack. Batched sampling would be a clear advantage in itself, except ExLlama doesn't require all sequences in a batch to use any of the same settings, so every processor would have to take batched parameters as well to take advantage of this. Not sure what's standard in that regard.
CPUs aren't that slow, either. You have AVX2 to help with anything that requires any real arithmetic (AVX512 is an option, too; blame Intel for screwing that one up for so many users), and you can split batches over multiple cores easily. I could also see issues arising from individual threads competing for the CUDA stream, unless logit processors were used exclusively and/or without multithreaded sampling enabled.
As for the Outlines example, currently with a library like Formatron, grammar constraints can be evaluated entirely in the background, adding essentially zero overhead, by using the dedicated filter interface. LMFE is written in Python, which blocks multithreading, but it can still run while the CPU is waiting for the GPU to complete the forward pass. The straightforward way to use a logit processor as a grammar constraint doesn't really allow for concurrency of any kind. (I haven't checked, but I also doubt it uses pinned memory for the allowed token mask (?), forcing a sync point that would reduce any benefit from running the other processors on the GPU.)
But the main concern is that performance is going to suffer. Samplers in general are kind of irksome and (I feel) often ill-conceived, and this feels like opening the floodgates to a whole host of new issues and complaints. I'll need to give it some careful consideration and run some tests, I suppose.
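The overlap described in the comment above (evaluating a grammar constraint while the forward pass is still in flight) can be sketched with a worker thread. This is only an illustration under stated assumptions, not ExLlamaV2's filter interface: `forward_pass`, `allowed_tokens`, and `generate_step` are hypothetical names, and `time.sleep` stands in for GPU work and for constraint code that releases the GIL.

```python
import time
from concurrent.futures import ThreadPoolExecutor

VOCAB_SIZE = 8  # toy vocabulary size (assumption for illustration)


def forward_pass(context):
    """Hypothetical stand-in for the GPU forward pass; returns one logit per token."""
    time.sleep(0.05)  # pretend the GPU is busy; sleep releases the GIL
    return [float((len(context) + i) % VOCAB_SIZE) for i in range(VOCAB_SIZE)]


def allowed_tokens(context):
    """Hypothetical stand-in for a grammar filter; here only even token IDs are legal."""
    time.sleep(0.03)  # pretend constraint evaluation takes a while too
    return {t for t in range(VOCAB_SIZE) if t % 2 == 0}


def generate_step(context, pool):
    # Submit the constraint first so it runs while the forward pass is in flight.
    pending = pool.submit(allowed_tokens, context)
    logits = forward_pass(context)
    allowed = pending.result()  # blocks only if the filter is still running
    # Greedy pick restricted to the allowed set.
    return max(allowed, key=lambda t: logits[t])


with ThreadPoolExecutor(max_workers=1) as pool:
    token = generate_step([1, 2, 3], pool)
```

In this toy setup the step costs roughly the longer of the two sleeps rather than their sum. A constraint applied as a logit processor would instead start only after the logits exist, which is the loss of concurrency the comment is pointing at.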
Thanks for your thoughtful reply! Your concerns about performance are valid, but for structured generation filtering, ExLlamaV2 lags behind both vLLM and Transformers. Recent benchmarks show that ExLlamaV2 incurs 2-15x the overhead compared to vLLM/Transformers. The key difference is that vLLM/Transformers support logits processors. In our own tests with Outlines, we saw a 50x performance boost by switching from list-based filtering to using a tensor of legal tokens within our logits processors. Also, I’d like to reaffirm that with
Based on this, and after some profiling, I agree that your current sampler implementation shouldn't be replaced with logits processors. The core benefit of this PR would be to take advantage of high-performance structured generation logits processors and reduce ExLlamaV2 overhead for that specific task. Please let me know if I'm missing something or if you have any other questions.
This may be the case for Outlines, idk. But with Formatron the overhead is negligible, often zero depending on model and batch size. It can even be net negative in some cases since sampling can be skipped when it's constrained to a single token. The way the pipeline works, the constraint is evaluated while the forward pass is still completing on the GPU and the CPU is idle/busywaiting anyway. For grammar libraries that do the bulk of their work in C++ or Rust with the GIL released, it starts at the same time as the forward pass and runs completely in the background on other CPU cores. This means the final overhead is almost entirely from:
There are several places this could be improved to reduce the overhead even further. But mostly it comes down to reducing the amount of time spent in the Python/Rust/C++ interop layers. If you pass a Python list to a C++ function, whether it's the sampling logic in exllamav2_ext or an indexing operation in libtorch, it has to be unboxed one element at a time, and this is slow. A tensor reduces to a single pointer, so it's thousands of times faster to pass as an argument. This really has nothing to do with CUDA, though, and it would be trivial to pass a mask tensor to ExLlama's sampler function instead of a list (provided the grammar library outputs such a tensor), eliminating most of the remaining overhead. For Formatron specifically, the Rust component internally produces a
I'm not sure what the current ExLlama integrations for Outlines look like, though. But I do plan to revisit the grammar stuff soon, and see if there's a way to integrate it into the current filters pipeline.
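The list-versus-buffer point above can be illustrated without any dependencies. The sketch below packs the allowed token IDs into one contiguous `bytearray` mask, which stands in for the mask tensor mentioned in the comment. `build_mask` and `apply_mask` are hypothetical helpers, not functions from exllamav2_ext, libtorch, or Formatron.

```python
import math


def build_mask(allowed_ids, vocab_size):
    """Pack allowed token IDs into one contiguous byte buffer (1 = allowed)."""
    mask = bytearray(vocab_size)  # zero-initialized
    for token_id in allowed_ids:
        mask[token_id] = 1
    return mask


def apply_mask(logits, mask):
    """Return a copy of logits with disallowed entries set to -inf."""
    return [x if keep else -math.inf for x, keep in zip(logits, mask)]


logits = [0.5, 2.0, -1.0, 3.5, 0.0]
mask = build_mask([1, 4], vocab_size=len(logits))
masked = apply_mask(logits, mask)
best = max(range(len(masked)), key=masked.__getitem__)

# A memoryview exposes the whole mask as a single buffer without copying it,
# which is the property that makes a tensor cheap to hand to native code.
view = memoryview(mask)
```

Here token 3 has the highest raw logit but is not allowed, so the greedy pick falls to token 1. A native extension receiving `view` reads one pointer and a length, whereas a Python list of IDs would have to be converted element by element.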
These benchmarks are from the Formatron repo. They indicate that their vLLM integration (FormatronLogitsProcessor) has an overhead of 0.0 to 0.23 ms/token, while their ExLlamaV2 integration has an overhead of 0.17 to 1.46 ms/token. I might be missing something, though; I haven't dug too deeply into Formatron's internals.
Nice to see you have fast-forward implemented! I'll look further into this later, since we'll need to consider how our implementation's interface might best be suited for downstream consumption :)
Currently we have one logits processor per generation type (regex, grammars, JSON schema, etc.). Each logits processor works with vLLM, transformers, mlxlm, llama.cpp, and hopefully ExLlamaV2 soon :). There is no distinct logits processor for any of these engines; their implementation is shared. We've tested the Outlines integration with this PR. Users would simply need to run
I'll let you take some time to review this further. Please let me know if you have any questions or requested changes to help ensure this change conforms to your vision for the project!
Overview / Motivation
Implements ExLlamaV2Sampler.Settings.logits_processor, which allows us to take advantage of third-party libraries' logits processors, such as Outlines, which implements JSON Schema, regex, and Lark structured generation logits processors.
Changes
ExLlamaV2Sampler.Settings.logits_processor, which allows for logits filtering and augmentation with torch
tests/test_logits_processors.py, which is the same as tests/test.py but using a logits processor for all sampling
examples/json_schema_outlines.py
Performance
tests.py between this branch and master
logits_processor argument is enabled
sample_basic and once with torch.normal
performance to 145 tokens/sec
master -> tests.py
(Note: Generating, batched multi cache fails in master, not due to this PR)
sampler-logits-processor -> tests.py
sampler-logits-processor -> test_logits_processor.py
Tests
All tests pass except for tests/test.py / Generating, batched multi cache, which also fails in master.
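For readers unfamiliar with the term, a logits processor is conventionally a callable that receives the tokens generated so far plus the next-token logits and returns adjusted logits. The sketch below shows that general shape only. It is not the signature this PR adds to ExLlamaV2Sampler.Settings, plain lists stand in for torch tensors so it runs without dependencies, and `DigitsOnlyProcessor` and the toy vocabulary are hypothetical.

```python
import math

VOCAB = ["a", "b", "1", "2", "<eos>"]  # toy vocabulary (assumption)


class DigitsOnlyProcessor:
    """Toy constraint: allow digit tokens, then only <eos> once max_digits is reached."""

    def __init__(self, vocab, max_digits):
        self.max_digits = max_digits
        self.digit_ids = {i for i, tok in enumerate(vocab) if tok.isdigit()}
        self.eos_id = vocab.index("<eos>")

    def __call__(self, input_ids, logits):
        if len(input_ids) < self.max_digits:
            allowed = self.digit_ids
        else:
            allowed = {self.eos_id}
        # Disallowed tokens get -inf so they can never be selected.
        return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


processor = DigitsOnlyProcessor(VOCAB, max_digits=2)
generated = []
while True:
    raw_logits = [3.0, 2.0, 1.0, 1.5, 0.5]  # pretend model output, fixed for the demo
    logits = processor(generated, raw_logits)
    next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
    if next_id == processor.eos_id:
        break
    generated.append(next_id)

text = "".join(VOCAB[i] for i in generated)
```

The unconstrained model would always pick "a" here. With the processor in place, the output is forced to two digits followed by end-of-sequence, which is the kind of structured-generation constraint the description refers to.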