feat: version 3.0 #105
Conversation
* feat: evaluate multiple sequences in parallel with automatic batching
* feat: improve automatic chat wrapper resolution
* feat: smart context shifting
* feat: improve TS types
* refactor: improve API
* build: support beta releases
* build: improve dev configurations

BREAKING CHANGE: completely new API (docs will be updated before a stable version is released)
Hey, I have switched to the beta due to the infamous issue. I know that it's not always possible to detect the correct context length, but it would be great if this didn't crash the entire app and instead, e.g., threw an error. Is it possible to add a mechanism so the module doesn't crash if the provided context size is different from the training context size?
* feat: function calling support
* feat: stateless `LlamaChat`
* feat: improve chat wrapper
* feat: `LlamaText` util
* test: add basic model-dependent tests
* fix: threads parameter
* fix: disable Metal for `x64` arch by default
# Conflicts:
#	llama/addon.cpp
#	src/llamaEvaluator/LlamaContext.ts
#	src/llamaEvaluator/LlamaModel.ts
#	src/utils/getBin.ts
@nathanlesage I'm pretty sure that the reason your app crashes is that a larger context size requires more VRAM, and your machine doesn't have enough VRAM for a context length of 4096 but has enough for 2048. Unfortunately, it's not possible to safeguard against this at the moment. To mitigate this issue, I've created a feature request on the llama.cpp repo. If this issue is something you expect to happen frequently in your application lifecycle, you can wrap your code with a worker thread until this is fixed properly.
I thought that at first, but then I tried the same code on a Windows computer, also with 16 GB of RAM, and it didn't crash. Then I tried out the most recent llama.cpp "manually" (i.e., pulled and ran main) and it worked even with the larger context sizes. I'm beginning to think this was a bug in the Metal code of llama.cpp -- I'll try out beta.2 that you just released; hopefully that fixes the issue. And thanks for the tip with the worker, I'm beginning to feel a bit stupid for not realizing this earlier, but I've never worked so closely with native code in Node before 🙈
* feat: get embedding for text
* feat(minor): improve `resolveChatWrapperBasedOnModel` logic
* style: improve GitHub release notes formatting
# Conflicts:
#	llama/addon.cpp
#	src/cli/commands/ChatCommand.ts
#	src/llamaEvaluator/LlamaContext.ts
#	src/utils/getBin.ts
@giladgd Hi, regarding the embedding function, can you make it follow the interface?
* feat: manual binding loading - load the bindings using the `getLlama` method instead of it loading up by itself on import
* feat: log settings - configure the log level or even set a custom logger for llama.cpp logs
* fix: bugs
* fix: no thread limit when using a GPU
* fix: improve `defineChatSessionFunction` types and docs
* fix: format numbers printed in the CLI
* fix: disable the browser's autocomplete in the docs search
* feat: `resetChatHistory` function on a `LlamaChatSession`
* feat: copy model response in the example Electron app template
* fix: improve model downloader CI logs
* fix: `CodeGemma` adaptations
# Conflicts:
#	.github/workflows/build.yml
#	README.md
#	llama/CMakeLists.txt
#	llama/addon.cpp
#	package.json
#	src/config.ts
#	src/utils/compileLLamaCpp.ts
🎉 This PR is included in version 3.0.0 🎉
The release is available on:
Your semantic-release bot 📦🚀
How to use this beta
To install the beta version of `node-llama-cpp`, run this command inside of your project:
To get started quickly, generate a new project from a template by running this command:
The interface of `node-llama-cpp` will change multiple times before a new stable version is released, so the documentation of the new version will only be updated shortly before the stable version release.
If you'd like to use this beta, visit this PR for updated examples of how to use the latest beta version.
How you can help
Like this issue to help it get resolved sooner, or open a PR for it:
- Memory estimation utilities ggerganov/llama.cpp#4315 - implemented in `node-llama-cpp` eventually
Included in this beta
- `init` command to scaffold a new project from a template (feat: `init` command to scaffold a new project from a template #217, `3.0.0-beta.21`)
- `getLlama` method (feat: manual binding loading #153, `3.0.0-beta.6`)
- `inspect gpu` command (feat: use the best compute layer available by default #175, `3.0.0-beta.13`)
- `inspect gguf` CLI command (feat: automatically adapt to current free VRAM state #182, `3.0.0-beta.15`)
- `inspect estimate` CLI command (feat: new docs #309, `3.0.0-beta.45`)
- `TemplateChatWrapper` (feat: use the best compute layer available by default #175, `3.0.0-beta.13`)
- Fix for the `n_tokens <= n_batch` error (`3.0.0-beta.1`)
- Fix for the `threads` parameter (fix: `threads` doesn't work #114, `3.0.0-beta.2`)
Planned changes before release
- `LlamaChatSession` `prompt` function (#101)
- `llama.cpp` logs
- `contextSize` and `batchSize` defaults that depend on the current machine hardware (feat: automatically adapt to current free VRAM state #182)
CLI usage
Chat with popular recommended models in your terminal with a single command:
Check what GPU devices are automatically detected by `node-llama-cpp` in your project with this command:
Download and build the latest release of `llama.cpp` (learn more):
npx --no node-llama-cpp source download --release latest
Usage example
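For reference, a minimal sketch of basic usage with the 3.0 API (the model path is a placeholder; option names may have shifted between betas):

```typescript
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: "path/to/model.gguf" // placeholder path
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const q1 = "Hi there, how are you?";
const a1 = await session.prompt(q1);
console.log("AI: " + a1);
```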
How to stream a response
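A sketch of streaming with the `onTextChunk` callback of `session.prompt()`; the callback name follows the 3.0 API, and earlier betas may expose a token-level callback instead:

```typescript
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "path/to/model.gguf"}); // placeholder path
const context = await model.createContext();
const session = new LlamaChatSession({contextSequence: context.getSequence()});

const answer = await session.prompt("Tell me a short story about a llama", {
    // print each piece of generated text as soon as it's available
    onTextChunk(chunk) {
        process.stdout.write(chunk);
    }
});
```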
How to use function calling
Some models have official support for function calling in `node-llama-cpp` (such as Llama 3.1 Instruct and Llama 3 Instruct), while other models fall back to a generic function calling mechanism that works with many models, but not all of them.
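A sketch using `defineChatSessionFunction`; the function name, schema, and handler below are made-up examples:

```typescript
import {getLlama, LlamaChatSession, defineChatSessionFunction} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "path/to/model.gguf"}); // placeholder path
const context = await model.createContext();
const session = new LlamaChatSession({contextSequence: context.getSequence()});

// hypothetical example function the model can call while answering
const functions = {
    getFruitPrice: defineChatSessionFunction({
        description: "Get the price of a fruit",
        params: {
            type: "object",
            properties: {
                name: {type: "string"}
            }
        },
        handler({name}) {
            const prices: Record<string, string> = {apple: "$6", banana: "$4"};
            return prices[name.toLowerCase()] ?? `Unrecognized fruit "${name}"`;
        }
    })
};

const answer = await session.prompt("How much does an apple cost?", {functions});
console.log(answer);
```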
How to get embedding for text
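A sketch using `createEmbeddingContext()` and `getEmbeddingFor()`, per the 3.0 API; the model path and input text are placeholders:

```typescript
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "path/to/embedding-model.gguf"}); // placeholder path
const embeddingContext = await model.createEmbeddingContext();

const embedding = await embeddingContext.getEmbeddingFor("Hello world");
console.log(embedding.vector); // an array of numbers representing the text
```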
How to customize binding settings
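A sketch of configuring the bindings through `getLlama`, covering the log settings mentioned in the commits above; option names follow the 3.0 API:

```typescript
import {getLlama, LlamaLogLevel} from "node-llama-cpp";

const llama = await getLlama({
    // only pass through llama.cpp logs at warn level or above
    logLevel: LlamaLogLevel.warn

    // or route llama.cpp logs through your own logger:
    // logger(level, message) { console.log(level, message); }
});
```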
How to generate a completion
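A sketch using `LlamaCompletion` and `generateCompletion()`, per the 3.0 API; the input text is arbitrary:

```typescript
import {getLlama, LlamaCompletion} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "path/to/model.gguf"}); // placeholder path
const context = await model.createContext();
const completion = new LlamaCompletion({
    contextSequence: context.getSequence()
});

const output = await completion.generateCompletion("Here is a list of sweet fruits:\n* ", {
    maxTokens: 100
});
console.log(output);
```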
How to generate an infill
Infill, also known as fill-in-middle, is used to generate a completion for an input that should connect to a given continuation.
For example, for a prefix input 123 and suffix input 789, the model is expected to generate 456 to make the final text 123456789.
Not every model supports infill, so only models that do can be used to generate one.
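A sketch using `LlamaCompletion`; the `infillSupported` check and `generateInfillCompletion()` names follow the 3.0 API and should be treated as assumptions for older betas:

```typescript
import {getLlama, LlamaCompletion} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "path/to/code-model.gguf"}); // placeholder path
const context = await model.createContext();
const completion = new LlamaCompletion({contextSequence: context.getSequence()});

if (!completion.infillSupported) {
    console.log("This model doesn't support infill");
} else {
    const prefix = "4 sweet fruits: Apple,";
    const suffix = "and Grape.\n\n";
    const infill = await completion.generateInfillCompletion(prefix, suffix, {maxTokens: 100});
    console.log(prefix + infill + suffix);
}
```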
Using a specific compute layer
`node-llama-cpp` detects the available compute layers on the system and uses the best one by default.
If the best one fails to load, it'll try the next best option and so on until it manages to load the bindings.
To use this logic, just use `getLlama` without specifying the compute layer:
To force it to load a specific compute layer, you can use the `gpu` parameter on `getLlama`:
To inspect what compute layers are detected in your system, you can run this command:
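The inspect command itself isn't reproduced here; below is a sketch of the `getLlama` usage described above (the `gpu` values follow the 3.0 API):

```typescript
import {getLlama} from "node-llama-cpp";

// let node-llama-cpp pick the best compute layer available (Metal / CUDA / Vulkan / CPU)
const llama = await getLlama();

// force a specific compute layer; this throws if that layer fails to load
const cudaLlama = await getLlama({
    gpu: "cuda"
});

// disable GPU support entirely
const cpuLlama = await getLlama({
    gpu: false
});
```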
Using TemplateChatWrapper
To create a simple chat wrapper to use in a `LlamaChatSession`, you can use `TemplateChatWrapper`.
Example usage:
`{{systemPrompt}}` is optional and is replaced with the first system message (when it is used, that system message is not included in the history).
`{{history}}` is replaced with the chat history. Each message in the chat history is converted using the template passed to `historyTemplate`, and all messages are joined together.
`{{completion}}` is where the model's response is generated. The text that comes after `{{completion}}` is used to determine when the model has finished generating the response, and is thus mandatory.
`functionCallMessageTemplate` is used to specify the format in which functions can be called by the model and how their results are fed to the model after the function call.
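A sketch of such a wrapper; the `historyTemplate` shape shown here (one sub-template per role) follows the 3.0 API and may differ in earlier betas, and the template strings are made-up examples:

```typescript
import {getLlama, LlamaChatSession, TemplateChatWrapper} from "node-llama-cpp";

const chatWrapper = new TemplateChatWrapper({
    template: "{{systemPrompt}}\n{{history}}model: {{completion}}\nuser: ",
    historyTemplate: {
        system: "system: {{message}}\n",
        user: "user: {{message}}\n",
        model: "model: {{message}}\n"
    }
});

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "path/to/model.gguf"}); // placeholder path
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence(),
    chatWrapper
});

console.log(await session.prompt("Hi there"));
```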
Using JinjaTemplateChatWrapper
You can use an existing Jinja template by using `JinjaTemplateChatWrapper`, but note that not all Jinja functionality is supported yet.
If you want to create a new chat wrapper from scratch, using this chat wrapper is not recommended; instead, it's better to inherit from the `ChatWrapper` class and implement a custom chat wrapper of your own in TypeScript.
Example usage:
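A sketch of wrapping a Jinja template; the template string below is a made-up minimal example and should be replaced with your model's actual chat template:

```typescript
import {getLlama, LlamaChatSession, JinjaTemplateChatWrapper} from "node-llama-cpp";

const chatWrapper = new JinjaTemplateChatWrapper({
    // made-up minimal Jinja template; use your model's real template instead
    template: "{% for message in messages %}{{ message.role }}: {{ message.content }}\n{% endfor %}"
});

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "path/to/model.gguf"}); // placeholder path
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence(),
    chatWrapper
});
```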
Custom memory management options
`node-llama-cpp` adapts to the current free VRAM state to choose the best default `gpuLayers` and `contextSize` values that maximize those values within the available VRAM.
It's best not to customize `gpuLayers` and `contextSize` in order to utilize this feature, but you can also set a `gpuLayers` value with your own constraints, and `node-llama-cpp` will try to adapt to it.
`node-llama-cpp` also predicts how much VRAM is needed to load a model or create a context when you pass a specific `gpuLayers` or `contextSize` value, and throws an error if there isn't enough VRAM, to make sure the process won't crash.
Those estimations are not always accurate, so if you find that it throws an error when it shouldn't, you can pass `ignoreMemorySafetyChecks` to force `node-llama-cpp` to ignore those checks.
Also, in case those calculations are way too inaccurate, please let us know here, and attach the output of `npx --no node-llama-cpp inspect measure <model path>` with a link to the model file you used.
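A sketch of these options; whether `ignoreMemorySafetyChecks` is accepted by both `loadModel` and `createContext` in every beta is an assumption based on the 3.0 API, and the paths and numbers are placeholders:

```typescript
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();

// default: gpuLayers and contextSize are chosen automatically based on the free VRAM
const model = await llama.loadModel({
    modelPath: "path/to/model.gguf" // placeholder path
});
const context = await model.createContext();

// or set your own constraints; if the VRAM estimation says they won't fit,
// an error is thrown unless the safety checks are disabled
const constrainedModel = await llama.loadModel({
    modelPath: "path/to/model.gguf", // placeholder path
    gpuLayers: 33,
    ignoreMemorySafetyChecks: true
});
const constrainedContext = await constrainedModel.createContext({
    contextSize: 8192,
    ignoreMemorySafetyChecks: true
});
```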
Token bias
Here is an example of how to increase the probability of the word "hello" being generated and prevent the word "day" from being generated:
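A sketch of this; the `TokenBias` constructor and `set()` signatures are assumptions based on the 3.0 docs and may differ between betas:

```typescript
import {getLlama, LlamaChatSession, TokenBias} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "path/to/model.gguf"}); // placeholder path
const context = await model.createContext();
const session = new LlamaChatSession({contextSequence: context.getSequence()});

// assumed API: build a bias map from the model's tokenizer
const customBias = new TokenBias(model.tokenizer);
customBias.set("hello", 3);      // make "hello" more likely
customBias.set(" hello", 3);     // also bias the leading-space token variant
customBias.set("day", "never");  // prevent "day" from being generated
customBias.set(" day", "never");

const answer = await session.prompt("Greet me and tell me about your plans", {
    tokenBias: customBias
});
console.log(answer);
```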
Prompt preloading
Preloading a prompt while the user is still typing can make the model start generating a response to the final prompt much earlier, as it builds most of the context state needed to generate the response.
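A sketch using `preloadPrompt()`, per the 3.0 API:

```typescript
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "path/to/model.gguf"}); // placeholder path
const context = await model.createContext();
const session = new LlamaChatSession({contextSequence: context.getSequence()});

// start building the context state while the user is still typing
await session.preloadPrompt("Write a poem about");

// prompting with the final text later reuses the preloaded state
const answer = await session.prompt("Write a poem about llamas");
console.log(answer);
```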
Prompt completion
Prompt completion is a feature that allows you to generate a completion for a prompt without actually prompting the model.
The completion is context-aware and is generated based on the prompt and the current context state.
When generating a completion for a prompt, there's no need to preload that prompt beforehand, as the completion method will preload the prompt automatically.
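A sketch using `completePrompt()`, per the 3.0 API; the typed text is a placeholder:

```typescript
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "path/to/model.gguf"}); // placeholder path
const context = await model.createContext();
const session = new LlamaChatSession({contextSequence: context.getSequence()});

// suggest a completion for what the user is currently typing,
// without sending it to the model as an actual prompt
const completion = await session.completePrompt("Hi there! I'm ", {
    maxTokens: 30
});
console.log("Suggested completion: " + completion);
```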
Pull-Request Checklist
- Code is up to date with the `master` branch
- `npm run format` to apply eslint formatting
- `npm run test` passes with this change
Fixes #0000