
feat: version 3.0 #105

Merged
merged 85 commits into from
Sep 23, 2024

Conversation

giladgd
Contributor

@giladgd giladgd commented Nov 26, 2023

How to use this beta

To install the beta version of node-llama-cpp, run this command inside of your project:

npm install node-llama-cpp@beta

To get started quickly, generate a new project from a template by running this command:

npm create --yes node-llama-cpp@beta

The interface of node-llama-cpp will change multiple times before a new stable version is released, so the documentation for the new version will only be updated shortly before the stable release.
If you'd like to use this beta, check this PR for up-to-date examples of how to use the latest beta version.

How you can help

Included in this beta

Detailed changelog for every beta version can be found here

Planned changes before release

CLI usage

Chat with popular recommended models in your terminal with a single command:

npx --yes node-llama-cpp@beta chat

Check what GPU devices are automatically detected by node-llama-cpp in your project with this command:

npx --no node-llama-cpp inspect gpu

Run this command inside of your project directory

Download and build the latest release of llama.cpp (learn more)

npx --no node-llama-cpp source download --release latest

Run this command inside of your project directory

Usage example

Relevant for the 3.0.0-beta.45 version

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);


const q2 = "Summarize what you said";
console.log("User: " + q2);

const a2 = await session.prompt(q2);
console.log("AI: " + a2);

How to stream a response

Relevant for the 3.0.0-beta.45 version

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1, {
    onTextChunk(chunk) {
        process.stdout.write(chunk);
    }
});
console.log("AI: " + a1);

How to use function calling

Some models have official support for function calling in node-llama-cpp (such as Llama 3.1 Instruct and Llama 3 Instruct),
while other models fall back to a generic function calling mechanism that works with many models, but not all of them.

Relevant for the 3.0.0-beta.45 version

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, defineChatSessionFunction, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const functions = {
    getDate: defineChatSessionFunction({
        description: "Retrieve the current date",
        handler() {
            return new Date().toLocaleDateString();
        }
    }),
    getNthWord: defineChatSessionFunction({
        description: "Get an n-th word",
        params: {
            type: "object",
            properties: {
                n: {
                    enum: [1, 2, 3, 4]
                }
            }
        },
        handler(params) {
            return ["very", "secret", "this", "hello"][params.n - 1];
        }
    })
};
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "What is the second word?";
console.log("User: " + q1);

const a1 = await session.prompt(q1, {functions});
console.log("AI: " + a1);


const q2 = "What is the date? Also tell me the word I previously asked for";
console.log("User: " + q2);

const a2 = await session.prompt(q2, {functions});
console.log("AI: " + a2);

In this example I used this model

How to get embedding for text

Relevant for the 3.0.0-beta.45 version

import {fileURLToPath} from "url";
import path from "path";
import {getLlama} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const embeddingContext = await model.createEmbeddingContext();

const text = "Hello world";
const embedding = await embeddingContext.getEmbeddingFor(text);

console.log(text, embedding.vector);
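
The embedding vector is a plain array of numbers, so it can be compared against embeddings of other texts. Below is a minimal sketch of a cosine similarity comparison reusing the embeddingContext from above; the similarity helper itself is not part of node-llama-cpp:

// helper: cosine similarity between two embedding vectors (not part of node-llama-cpp)
function cosineSimilarity(a: readonly number[], b: readonly number[]) {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const otherEmbedding = await embeddingContext.getEmbeddingFor("Hello there");
console.log("similarity:", cosineSimilarity(embedding.vector, otherEmbedding.vector));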

How to customize binding settings

Relevant for the 3.0.0-beta.45 version

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession, LlamaLogLevel} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama({
    logLevel: LlamaLogLevel.debug // enable debug logs from llama.cpp
});
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf"),
    onLoadProgress(loadProgress: number) {
        console.log(`Load progress: ${loadProgress * 100}%`);
    }
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

How to generate a completion

Relevant for the 3.0.0-beta.45 version

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaCompletion} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "stable-code-3b.Q5_K_M.gguf")
});
const context = await model.createContext();
const completion = new LlamaCompletion({
    contextSequence: context.getSequence()
});

const input = "const arrayFromOneToTwenty = [1, 2, 3,";
console.log("Input: " + input);

const res = await completion.generateCompletion(input);
console.log("Completion: " + res);

In this example I used this model

How to generate an infill

Infill, also known as fill-in-the-middle, is used to generate a completion for an input that should connect to a given continuation.
For example, for a prefix input 123 and suffix input 789, the model is expected to generate 456 so that the final text is 123456789.

Not every model supports infill, so only models that do can be used to generate one.

Relevant for the 3.0.0-beta.45 version

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaCompletion, UnsupportedError} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "stable-code-3b.Q5_K_M.gguf")
});
const context = await model.createContext();
const completion = new LlamaCompletion({
    contextSequence: context.getSequence()
});

if (!completion.infillSupported)
    throw new UnsupportedError("Infill completions are not supported by this model");

const prefix = "const arrayFromOneToFourteen = [1, 2, 3, ";
const suffix = "10, 11, 12, 13, 14];";
console.log("prefix: " + prefix);
console.log("suffix: " + suffix);

const res = await completion.generateInfillCompletion(prefix, suffix);
console.log("Infill: " + res);

In this example I used this model

Using a specific compute layer

Relevant for the 3.0.0-beta.45 version

node-llama-cpp detects the available compute layers on the system and uses the best one by default.
If the best one fails to load, it'll try the next best option and so on until it manages to load the bindings.

To use this logic, just use getLlama without specifying the compute layer:

import {getLlama} from "node-llama-cpp";

const llama = await getLlama();

To force it to load a specific compute layer, you can use the gpu parameter on getLlama:

import {getLlama} from "node-llama-cpp";

const llama = await getLlama({
    gpu: "vulkan" // defaults to `"auto"`. can also be `"cuda"` or `false` (to not use the GPU at all)
});

To inspect what compute layers are detected in your system, you can run this command:

npx --no node-llama-cpp inspect gpu

If this command fails to detect CUDA or Vulkan even though using getLlama with gpu set to one of them works, please open an issue so we can investigate it

Using TemplateChatWrapper

Relevant for the 3.0.0-beta.45 version

To create a simple chat wrapper to use in a LlamaChatSession, you can use TemplateChatWrapper.

For more advanced cases, implement a custom wrapper by inheriting ChatWrapper.

Example usage:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession, TemplateChatWrapper} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext();
const chatWrapper = new TemplateChatWrapper({
    template: "{{systemPrompt}}\n{{history}}model: {{completion}}\nuser: ",
    historyTemplate: {
        system: "system: {{message}}\n",
        user: "user: {{message}}\n",
        model: "model: {{message}}\n"
    },
    // functionCallMessageTemplate: { // optional
    //     call: "[[call: {{functionName}}({{functionParams}})]]",
    //     result: " [[result: {{functionCallResult}}]]"
    // }
});
const session = new LlamaChatSession({
    contextSequence: context.getSequence(),
    chatWrapper
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);


const q2 = "Summarize what you said";
console.log("User: " + q2);

const a2 = await session.prompt(q2);
console.log("AI: " + a2);

{{systemPrompt}} is optional and is replaced with the first system message
(when it is, that system message is not included in the history).

{{history}} is replaced with the chat history.
Each message in the chat history is converted using the matching template passed to historyTemplate, and all messages are joined together.

{{completion}} is where the model's response is generated.
The text that comes after {{completion}} is used to determine when the model has finished generating the response,
and thus is mandatory.

functionCallMessageTemplate is used to specify the format in which functions can be called by the model and
how their results are fed to the model after the function call.
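
For example, enabling the commented-out functionCallMessageTemplate option from the snippet above would look like this (a minimal sketch; adjust the call and result formats to whatever your model expects):

const chatWrapperWithFunctions = new TemplateChatWrapper({
    template: "{{systemPrompt}}\n{{history}}model: {{completion}}\nuser: ",
    historyTemplate: {
        system: "system: {{message}}\n",
        user: "user: {{message}}\n",
        model: "model: {{message}}\n"
    },
    functionCallMessageTemplate: {
        call: "[[call: {{functionName}}({{functionParams}})]]",
        result: " [[result: {{functionCallResult}}]]"
    }
});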

Using JinjaTemplateChatWrapper

Relevant for the 3.0.0-beta.45 version

You can use an existing Jinja template by using JinjaTemplateChatWrapper, but note that not all Jinja functionality is supported yet.
If you want to create a new chat wrapper from scratch, using this chat wrapper is not recommended; instead, it's better to inherit
from the ChatWrapper class and implement a custom chat wrapper of your own in TypeScript.

Example usage:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession, JinjaTemplateChatWrapper} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext();
const chatWrapper = new JinjaTemplateChatWrapper({
    template: "<Jinja template here>"
});
const session = new LlamaChatSession({
    contextSequence: context.getSequence(),
    chatWrapper
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

Custom memory management options

Relevant for the 3.0.0-beta.45 version

node-llama-cpp adapts to the current free VRAM state to choose the best default gpuLayers and contextSize values, maximizing both within the available VRAM.
To take advantage of this feature, it's best not to customize gpuLayers and contextSize, but you can also set a gpuLayers value with your own constraints, and node-llama-cpp will try to adapt to it.

node-llama-cpp also predicts how much VRAM is needed to load a model or create a context when you pass a specific gpuLayers or contextSize value, and throws an error if there isn't enough VRAM, to make sure the process won't crash.
Those estimations are not always accurate, so if you find that it throws an error when it shouldn't, you can pass ignoreMemorySafetyChecks to force node-llama-cpp to skip those checks.
Also, if those calculations turn out to be way off, please let us know here, and attach the output of npx --no node-llama-cpp inspect measure <model path> with a link to the model file you used.

import {fileURLToPath} from "url";
import path from "path";
import {getLlama} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf"),
    gpuLayers: {
        min: 20,
        fitContext: {
            contextSize: 8192 // to make sure there will be enough VRAM left to create a context with this size
        }
    }
});
const context = await model.createContext({
    contextSize: {
        min: 8192 // will throw an error if a context with this context size cannot be created
    }
});

Token bias

Relevant for the 3.0.0-beta.45 version

Here is an example of how to increase the probability of the word "hello" being generated and prevent the word "day" from being generated:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession, TokenBias} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1, {
    tokenBias: (new TokenBias(model))
        .set("Hello", 1)
        .set("hello", 1)
        .set("Day", "never")
        .set("day", "never")
        .set(model.tokenize("day"), "never") // you can also do this to set bias for specific tokens
});
console.log("AI: " + a1);

Prompt preloading

Preloading a prompt while the user is still typing can make the model start generating a response to the final prompt much earlier, as it builds most of the context state needed to generate the response.

Relevant for the 3.0.0-beta.45 version

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Hi there, how are you?";

await session.preloadPrompt(q1);

console.log("User: " + q1);

// now prompting the model will start generating a response much earlier
const a1 = await session.prompt(q1);
console.log("AI: " + a1);

Prompt completion

Prompt completion is a feature that allows you to generate a completion for a prompt without actually prompting the model.

The completion is context-aware and is generated based on the prompt and the current context state.

When generating a completion for a prompt, there's no need to preload the prompt beforehand, as the completion method preloads it automatically.

Relevant for the 3.0.0-beta.45 version

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const partialPrompt = "What is the best ";
console.log("Partial prompt: " + partialPrompt);

const completion = await session.completePrompt(partialPrompt);
console.log("Completion: " + completion);

Pull-Request Checklist

  • Code is up-to-date with the master branch
  • npm run format to apply eslint formatting
  • npm run test passes with this change
  • This pull request links relevant issues as Fixes #0000
  • There are new or updated unit tests validating the change
  • Documentation has been updated to reflect this change
  • The new commits and pull request title follow conventions explained in pull request guidelines (PRs that do not follow this convention will not be merged)

* feat: evaluate multiple sequences in parallel with automatic batching
* feat: improve automatic chat wrapper resolution
* feat: smart context shifting
* feat: improve TS types
* refactor: improve API
* build: support beta releases
* build: improve dev configurations

BREAKING CHANGE: completely new API (docs will be updated before a stable version is released)
@nathanlesage

Hey, I have switched to the beta due to the infamous n_tokens <= n_batch error, and I saw that it is now possible to automatically detect the correct context size. However, there is a problem with that: I have been trying this out with Mistral's OpenOrca 7b in the Q4_K_M quantization, and the issue is that the training context is 2^15 (32,768), but the quantized version reduces this context to 2,048. With your code example, this will immediately crash the entire server, since contextSize: Math.min(4096, model.trainContextSize) will in this case resolve to contextSize: Math.min(4096, 32768) and then contextSize: 4096, which is > 2048.

I know that it's not always possible to detect the correct context length, but it would be great if this would not crash the entire app, and instead, e.g., throw an error.

Is it possible to add a mechanism to not crash the module if the provided context size is different from the training context size?

giladgd and others added 2 commits January 20, 2024 00:11
* feat: function calling support
* feat: stateless `LlamaChat`
* feat: improve chat wrapper
* feat: `LlamaText` util
* test: add basic model-dependent tests
* fix: threads parameter
* fix: disable Metal for `x64` arch by default
# Conflicts:
#	llama/addon.cpp
#	src/llamaEvaluator/LlamaContext.ts
#	src/llamaEvaluator/LlamaModel.ts
#	src/utils/getBin.ts
@giladgd
Contributor Author

giladgd commented Jan 20, 2024

@nathanlesage I'm pretty sure that the reason your app crashes is that larger context size requires more VRAM, and your machine doesn't have enough VRAM for a context length of 4096 but has enough for 2048.
If you try to create a context with a larger size than is supported by the model, it won't crash your app, but it may cause the model to generate gibberish once it crosses the supported context length.

Unfortunately, it's not possible to safeguard against this at the moment on node-llama-cpp's side since llama.cpp is the one that crashes the process, and node-llama-cpp is not aware of the available VRAM and memory requirements for creating a context with a specific size.

To mitigate this issue I've created this feature request on llama.cpp: ggerganov/llama.cpp#4315
After this feature is added on llama.cpp I'll be able to improve this situation on node-llama-cpp's side.

If this issue is something you expect to happen frequently in your application lifecycle, you can wrap your code with a worker thread until this is fixed properly.
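
A minimal sketch of that worker-based isolation, using Node's built-in worker_threads module and the beta API from the examples above (the file names and message format here are just placeholders; depending on how the native code fails, a separate child process may isolate crashes more reliably):

// main.mjs - send prompts to a worker so the main code can detect when it goes down
import {Worker} from "node:worker_threads";

const worker = new Worker(new URL("./llama-worker.mjs", import.meta.url));
worker.on("message", (answer) => console.log("AI: " + answer));
worker.on("exit", (code) => {
    if (code !== 0)
        console.error("Llama worker stopped with exit code " + code);
});
worker.postMessage("Hi there, how are you?");

// llama-worker.mjs - loads the model and answers prompts sent from the main thread
import {parentPort} from "node:worker_threads";
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

parentPort.on("message", async (prompt) => {
    const answer = await session.prompt(prompt);
    parentPort.postMessage(answer);
});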

@nathanlesage

I thought that at first, but then I tried the same code on a Windows computer, also with 16 GB of RAM, and it didn't crash. Then I tried out the most recent llama.cpp "manually" (i.e., pulled and ran main) and it worked even with the larger context sizes. I'm beginning to think that this was a bug in the Metal code of llama.cpp -- I'll try out beta.2 that you just released; hopefully that should fix the issue.

And thanks for the tip with the worker, I begin to feel a bit stupid for not realizing this earlier, but I've never worked so closely with native code in node before 🙈

* feat: get embedding for text
* feat(minor): improve `resolveChatWrapperBasedOnModel` logic
* style: improve GitHub release notes formatting
# Conflicts:
#	llama/addon.cpp
#	src/cli/commands/ChatCommand.ts
#	src/llamaEvaluator/LlamaContext.ts
#	src/utils/getBin.ts
@hiepxanh

@giladgd hi, regarding the embedding function: could it follow LangChain's EmbeddingsInterface?
See https://github.com/langchain-ai/langchainjs/blob/5df71ccbc734f41b79b486ae89281c86fbb70768/langchain-core/src/embeddings.ts#L9

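For reference, a rough sketch of wrapping node-llama-cpp's embedding context in an adapter that matches the shape of LangChain's EmbeddingsInterface (embedQuery and embedDocuments); this adapter is not part of either library and is only an illustration:

import {fileURLToPath} from "url";
import path from "path";
import {getLlama} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const embeddingContext = await model.createEmbeddingContext();

// adapter object mirroring LangChain's EmbeddingsInterface methods (illustration only)
const embeddings = {
    async embedQuery(text: string): Promise<number[]> {
        const embedding = await embeddingContext.getEmbeddingFor(text);
        return [...embedding.vector];
    },
    async embedDocuments(texts: string[]): Promise<number[][]> {
        const vectors: number[][] = [];
        for (const text of texts)
            vectors.push(await this.embedQuery(text));
        return vectors;
    }
};

console.log(await embeddings.embedQuery("Hello world"));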

* feat: add `--systemPromptFile` flag to the `chat` command
* feat: add `--promptFile` flag to the `chat` command
* feat: add `--batchSize` flag to the `chat` command
* feat: manual binding loading - load the bindings using the `getLlama` method instead of it loading up by itself on import
* feat: log settings - configure the log level or even set a custom logger for llama.cpp logs
* fix: bugs
* fix: no thread limit when using a GPU
* fix: improve `defineChatSessionFunction` types and docs
* fix: format numbers printed in the CLI
* fix: disable the browser's autocomplete in the docs search
@giladgd giladgd temporarily deployed to Documentation website September 21, 2024 22:19 — with GitHub Actions Inactive
* feat: `resetChatHistory` function on a `LlamaChatSession`
* feat: copy model response in the example Electron app template
giladgd and others added 2 commits September 23, 2024 02:08
* fix: improve model downloader CI logs
* fix: `CodeGemma` adaptations
# Conflicts:
#	.github/workflows/build.yml
#	README.md
#	llama/CMakeLists.txt
#	llama/addon.cpp
#	package.json
#	src/config.ts
#	src/utils/compileLLamaCpp.ts
@giladgd giladgd marked this pull request as ready for review September 23, 2024 21:24
@giladgd giladgd requested a review from ido-pluto September 23, 2024 21:25
@giladgd giladgd merged commit fc0fca5 into master Sep 23, 2024
23 of 24 checks passed
@giladgd giladgd deleted the beta branch September 23, 2024 21:25

github-actions bot commented Sep 24, 2024

🎉 This PR is included in version 3.0.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
