0.10.2
This version introduces a new CLI argument: `--max-seq-len <n>`. It allows you to reduce the context size and, with it, the memory consumption. This argument works with the following commands: `dllama inference`, `dllama chat`, and `dllama-api`. You don't need to set it on worker nodes; the root node distributes this setting to the workers.
Example:

```
./dllama chat --model ... --nthreads 8 --max-seq-len 1024
```
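
As a sketch, the flag should be passed the same way to `dllama-api`, assuming it accepts the same model and thread arguments as `dllama chat` (the model path is elided here just as in the example above):

```
./dllama-api --model ... --nthreads 8 --max-seq-len 1024
```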