
Introduce llama-run #10291

Merged: 1 commit merged into ggerganov:master on Nov 25, 2024
Conversation

@ericcurtin (Contributor) commented Nov 14, 2024

It's like simple-chat, but it uses smart pointers to avoid manual memory cleanup, so there are fewer memory leaks in the code now. It also avoids printing multiple dots, splits the code into smaller functions, and uses no exception handling.
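
For reference, this is a minimal sketch of the pattern the change implies, using a unique_ptr with the llama.cpp C API free function as its deleter; "model.gguf" is a placeholder path and the wrapper shown is illustrative rather than the code in this PR:

#include <memory>

#include "llama.h"

int main() {
    llama_model_params mparams = llama_model_default_params();

    // Manual style (as in simple-chat): every exit path must remember to
    // call llama_free_model(model).
    //
    // Smart-pointer style: the deleter runs automatically when the
    // unique_ptr goes out of scope, even on early returns.
    std::unique_ptr<llama_model, decltype(&llama_free_model)> model(
        llama_load_model_from_file("model.gguf", mparams), llama_free_model);
    if (!model) {
        return 1;
    }

    // ... create the context and sampler the same way and run the chat loop ...
    return 0;
}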

@ericcurtin force-pushed the simple-chat-smart branch 11 times, most recently from b2a336e to 17f086b on November 14, 2024 14:00
@ericcurtin (Contributor Author)

Some of these builds seem hardcoded to C++11, when we use a feature from C++14.

Any reason we aren't using, say, C++17?

Any reasonable platform should be up to date with C++17, I think.
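
The thread doesn't say which C++14 feature is involved; std::make_unique is a typical example of something that is C++14-only, shown here purely for illustration:

#include <memory>

struct config {
    int context_size = 2048;
};

int main() {
    // C++14 and later:
    auto cfg14 = std::make_unique<config>();

    // C++11 equivalent, since make_unique only arrived in C++14:
    std::unique_ptr<config> cfg11(new config());

    return cfg14->context_size == cfg11->context_size ? 0 : 1;
}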

@ericcurtin force-pushed the simple-chat-smart branch 2 times, most recently from bf26504 to 0d016a4 on November 14, 2024 16:27
@ericcurtin (Contributor Author)

Converted to C++11 only

@ericcurtin force-pushed the simple-chat-smart branch 3 times, most recently from 0af3f55 to 33eb456 on November 14, 2024 17:00
@ericcurtin force-pushed the simple-chat-smart branch 4 times, most recently from d2711eb to ca45737 on November 15, 2024 12:06
@ericcurtin mentioned this pull request on Nov 15, 2024
@ericcurtin (Contributor Author)

@slaren @ggerganov PTAL, I'm hoping to add other features to this example, such as reading the prompt from a '-p' arg and reading the prompt from stdin.
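
A minimal sketch of what that could look like (the read_prompt helper is hypothetical, not code from this PR):

#include <iostream>
#include <string>

// Illustrative only: take the prompt from a "-p" argument if present,
// otherwise fall back to reading the whole prompt from stdin.
static std::string read_prompt(int argc, char ** argv) {
    for (int i = 1; i + 1 < argc; ++i) {
        if (std::string(argv[i]) == "-p") {
            return argv[i + 1];
        }
    }
    std::string prompt, line;
    while (std::getline(std::cin, line)) {
        prompt += line + "\n";
    }
    return prompt;
}

int main(int argc, char ** argv) {
    std::cout << read_prompt(argc, argv);
    return 0;
}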

@slaren (Collaborator) commented Nov 15, 2024

It would be good to have a more elaborate chat example, but the goal of this example is to show in the simplest way possible how to use the llama.cpp API. I don't think these changes achieve that; I think users will have a harder time understanding the llama.cpp API with all the extra boilerplate being added here.

If you want to use this as the base of a new example that adds more features that would be great, but I think we should keep this example as simple as possible.

@ericcurtin (Contributor Author)

Cool, sounds good @slaren. Mind if I call the new example ramalama-core?

@ericcurtin (Contributor Author) commented Nov 16, 2024

llama-inferencer? I really don't care too much about the name; I just want to agree on one to unblock this PR.

@rhatdan @slp any suggestions on names? I plan on using this as the main program during "ramalama run", but I'm happy for anyone to use it or make changes to it to suit their needs. It's like a drastically simplified version of llama-cli, with one or two additional features: read from stdin and read from a -p flag.

But it also seems more stable and less error-prone than llama-cli, and the verbose output is cleaned up so it only prints errors. It was based on llama-simple-chat initially.

@slaren (Collaborator) commented Nov 16, 2024

Maybe something like llama-chat. I mentioned before that I think it would be good to have an example focused on chat only that does that very well, and that in time could replace the current llama-cli as the main program of llama.cpp, which at this point is basically unmaintainable and should be retired.

@ericcurtin (Contributor Author)

SGTM

@ericcurtin (Contributor Author) commented Nov 17, 2024

It's also tempting to call this something like run and use a RamaLama/LocalAI/Ollama-style CLI interface to interact with models, kind of like a daemonless Ollama:

llama-run file://somedir/somefile.gguf

@ericcurtin (Contributor Author) commented Nov 17, 2024

We could even possibly add https://, http://, ollama://, and hf:// as valid syntaxes for pulling models, since they are all just an HTTP pull in the end.

@ericcurtin (Contributor Author)

That might be implemented as a llama-pull tool that llama-run can fork/exec (or they could share a common library).

@ericcurtin force-pushed the simple-chat-smart branch 2 times, most recently from 1d7f97c to 4defcf2 on November 17, 2024 23:40
@ericcurtin changed the title from "Introduce ramalama-core" to "Introduce llama-run" on Nov 17, 2024
@ericcurtin (Contributor Author) commented Nov 17, 2024

I will be AFK for 3 weeks, so expect inactivity in this PR. I did the rename in case we want to merge as-is and not let this go stale.

The syntax will completely change, though, to:

llama-run [file://]somedir/somefile.gguf [prompt] [flags]

file:// will be optional, but will set up the possibility of adding the pullers discussed above.

@ericcurtin
Copy link
Contributor Author

This drives the compiler crazy FWIW:

diff --git a/include/llama.h b/include/llama.h
index 5e742642..c3285da3 100644
--- a/include/llama.h
+++ b/include/llama.h
@@ -537,6 +537,13 @@ extern "C" {
                          int32_t   il_start,
                          int32_t   il_end);

+#ifdef __cplusplus
+    // Smart pointers
+    typedef std::unique_ptr<llama_model, decltype(&llama_free_model)> llama_model_ptr;
+    typedef std::unique_ptr<llama_context, decltype(&llama_free)> llama_context_ptr;
+    typedef std::unique_ptr<llama_sampler, decltype(&llama_sampler_free)> llama_sampler;
+#endif
+
     //
     // KV cache
     //

It seems like it only wants to build C code in that file or something.

@slaren (Collaborator) commented Nov 19, 2024

You are probably doing something wrong. Put it at the end of the file, include <memory>, and rename the llama_sampler ptr to llama_sampler_ptr. You should also use structs (function objects) as the deleters; see ggml-cpp.h for an example.
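
A minimal sketch of the struct-deleter approach described here, modeled on ggml-cpp.h; the exact names in the merged llama-cpp.h may differ:

// llama-cpp.h -- C++-only convenience wrappers (sketch)
#pragma once

#ifndef __cplusplus
#error "This header can only be included by a C++ compiler"
#endif

#include <memory>

#include "llama.h"

struct llama_model_deleter {
    void operator()(llama_model * model) { llama_free_model(model); }
};

struct llama_context_deleter {
    void operator()(llama_context * context) { llama_free(context); }
};

struct llama_sampler_deleter {
    void operator()(llama_sampler * sampler) { llama_sampler_free(sampler); }
};

typedef std::unique_ptr<llama_model, llama_model_deleter> llama_model_ptr;
typedef std::unique_ptr<llama_context, llama_context_deleter> llama_context_ptr;
typedef std::unique_ptr<llama_sampler, llama_sampler_deleter> llama_sampler_ptr;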

@ericcurtin (Contributor Author)

Added the smart pointer typedefs, @slaren:

typedef std::unique_ptr<llama_model, llama_model_deleter> llama_model_ptr;
typedef std::unique_ptr<llama_context, llama_context_deleter> llama_context_ptr;
typedef std::unique_ptr<llama_sampler, llama_sampler_deleter> llama_sampler_ptr;
typedef std::unique_ptr<char[]> char_array_ptr;
@slaren (Collaborator) commented Nov 23, 2024
The rest look good, but this one (char_array_ptr) does not belong here.

@ericcurtin (Contributor Author)

I also need to pass more parameters as references; I notice I still pass raw pointers in places where a reference to a smart pointer would do.

@ericcurtin (Contributor Author)
I removed char_array_ptr
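
For illustration, the point about passing references to smart pointers instead of raw pointers could look like this (generate is a hypothetical helper, not the merged code):

#include <string>

#include "llama-cpp.h" // assumed to provide llama_context_ptr and llama_sampler_ptr

// Raw-pointer style: the caller has to remember who owns what.
// static int generate(llama_context * ctx, llama_sampler * smpl, const std::string & prompt);

// Reference-to-smart-pointer style: ownership stays with the caller's
// unique_ptrs, and the signature makes that explicit.
static int generate(llama_context_ptr & ctx, llama_sampler_ptr & smpl, const std::string & prompt) {
    // use ctx.get() / smpl.get() when calling into the C API
    (void) ctx; (void) smpl; (void) prompt;
    return 0;
}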

@ericcurtin force-pushed the simple-chat-smart branch 2 times, most recently from 32b2999 to bdac00f on November 24, 2024 18:00
@slaren (Collaborator) left a comment
Looks like a good start for an improved chat example.

llama-cpp.h also needs to be added to the list of public headers in CMakeLists.txt.

@ericcurtin force-pushed the simple-chat-smart branch 2 times, most recently from bd63eb5 to a9054cb on November 25, 2024 04:29
The github-actions bot added the build (Compilation issues) label on Nov 25, 2024
@ericcurtin (Contributor Author)

Added as public header

Commit "Introduce llama-run" (message matches the PR description above)
Signed-off-by: Eric Curtin <[email protected]>
@ericcurtin (Contributor Author) commented Nov 25, 2024

Happy to merge this @slaren @ggerganov

There will be breaking changes in a follow-on PR, though; I want to change the syntax to this (ramalama/ollama style):

Description:
  Runs an LLM

Usage:
  llama-run [options] MODEL [PROMPT]

Options:
  -c, --context-size <value>
      Context size (default: 2048)
  -n, --ngl <value>
      Number of GPU layers (default: 0)
  -h, --help
      Show help message

Examples:
  llama-run ~/.local/share/ramalama/models/ollama/smollm\:135m
  llama-run --ngl 99 ~/.local/share/ramalama/models/ollama/smollm\:135m
  llama-run --ngl 99 ~/.local/share/ramalama/models/ollama/smollm\:135m Hello World

Eventually I want to add optional prefixes for the model as well (like file:// and hf://; maybe we implement http://, https://, and ollama:// eventually too, it's not that hard), but with no prefix we will just assume the string is a file path.
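
A rough sketch of how that prefix handling could look; resolve_model and download_model are hypothetical names, not the merged implementation:

#include <cstdio>
#include <string>

// Illustrative only: strip an optional scheme prefix and decide how to
// obtain the model. Anything without a recognized prefix is treated as a
// plain local file path.
static std::string resolve_model(std::string model) {
    if (model.rfind("file://", 0) == 0) {
        return model.substr(7); // local path, just drop the prefix
    }
    if (model.rfind("hf://", 0) == 0 || model.rfind("http://", 0) == 0 ||
        model.rfind("https://", 0) == 0 || model.rfind("ollama://", 0) == 0) {
        // these all boil down to an HTTP pull: download into a local cache
        // and return the cached path (download_model here is hypothetical)
        // return download_model(model);
    }
    return model; // no prefix: assume the string is already a file path
}

int main() {
    printf("%s\n", resolve_model("file://somedir/somefile.gguf").c_str());
    return 0;
}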

@slaren (Collaborator) commented Nov 25, 2024

Sounds good. Btw, to support dynamic loading of backends (#10469), you should add a call to ggml_backend_load_all at the beginning. The goal is to allow projects to distribute a single binary rather than needing different versions for each backend, which I imagine would be relevant to a project like ramalama.
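
A minimal sketch of where that call would go, assuming the dynamic-loading API from #10469:

#include "ggml-backend.h"

int main(int argc, char ** argv) {
    // Load every backend available as a dynamic library (CPU, CUDA, Metal, ...)
    // before using any llama.cpp API, so a single llama-run binary can work on
    // machines with different backends installed.
    ggml_backend_load_all();

    // ... parse arguments, load the model, run the chat loop ...
    (void) argc; (void) argv;
    return 0;
}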

@slaren merged commit 0cc6375 into ggerganov:master on Nov 25, 2024
54 checks passed
@ericcurtin deleted the simple-chat-smart branch on November 26, 2024 23:09
Labels: build (Compilation issues), examples