
Introduce llama-run #10291

Merged: 1 commit merged into ggerganov:master on Nov 25, 2024
Conversation

@ericcurtin (Contributor) commented Nov 14, 2024

It's like simple-chat, but it uses smart pointers to avoid manual memory cleanup, so there are fewer memory leaks in the code now. It also avoids printing multiple dots, splits the code into smaller functions, and uses no exception handling.
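
For reference, this is a minimal sketch of the pattern the change implies, using a unique_ptr with the llama.cpp C API free function as its deleter; "model.gguf" is a placeholder path and the wrapper shown is illustrative rather than the code in this PR:

#include <memory>

#include "llama.h"

int main() {
    llama_model_params mparams = llama_model_default_params();

    // Manual style (as in simple-chat): every exit path must remember to
    // call llama_free_model(model).
    //
    // Smart-pointer style: the deleter runs automatically when the
    // unique_ptr goes out of scope, even on early returns.
    std::unique_ptr<llama_model, decltype(&llama_free_model)> model(
        llama_load_model_from_file("model.gguf", mparams), llama_free_model);
    if (!model) {
        return 1;
    }

    // ... create the context and sampler the same way and run the chat loop ...
    return 0;
}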

@ericcurtin force-pushed the simple-chat-smart branch 11 times, most recently from b2a336e to 17f086b on November 14, 2024 14:00
@ericcurtin (Contributor Author)

Some of these builds seem hardcoded to C++11, when we use a feature from C++14.

Any reason we aren't using, say, C++17?

Any reasonable platform should be up to date with C++17, I think.
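
The thread doesn't say which C++14 feature is involved; std::make_unique is a typical example of something that is C++14-only, shown here purely for illustration:

#include <memory>

struct config {
    int context_size = 2048;
};

int main() {
    // C++14 and later:
    auto cfg14 = std::make_unique<config>();

    // C++11 equivalent, since make_unique only arrived in C++14:
    std::unique_ptr<config> cfg11(new config());

    return cfg14->context_size == cfg11->context_size ? 0 : 1;
}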

@ericcurtin force-pushed the simple-chat-smart branch 2 times, most recently from bf26504 to 0d016a4 on November 14, 2024 16:27
@ericcurtin (Contributor Author)

Converted to C++11 only

@ericcurtin force-pushed the simple-chat-smart branch 3 times, most recently from 0af3f55 to 33eb456 on November 14, 2024 17:00
@ericcurtin force-pushed the simple-chat-smart branch 4 times, most recently from d2711eb to ca45737 on November 15, 2024 12:06
@ericcurtin mentioned this pull request on Nov 15, 2024
@ericcurtin (Contributor Author)

@slaren @ggerganov PTAL, I'm hoping to add other features to this example, such as reading the prompt from a '-p' arg and reading the prompt from stdin.
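
A minimal sketch of what that could look like (the read_prompt helper is hypothetical, not code from this PR):

#include <iostream>
#include <string>

// Illustrative only: take the prompt from a "-p" argument if present,
// otherwise fall back to reading the whole prompt from stdin.
static std::string read_prompt(int argc, char ** argv) {
    for (int i = 1; i + 1 < argc; ++i) {
        if (std::string(argv[i]) == "-p") {
            return argv[i + 1];
        }
    }
    std::string prompt, line;
    while (std::getline(std::cin, line)) {
        prompt += line + "\n";
    }
    return prompt;
}

int main(int argc, char ** argv) {
    std::cout << read_prompt(argc, argv);
    return 0;
}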

@slaren (Collaborator) commented Nov 15, 2024

It would be good to have a more elaborate chat example, but the goal of this example is to show in the simplest way possible how to use the llama.cpp API. I don't think these changes achieve that; I think users will have a harder time understanding the llama.cpp API with all the extra boilerplate being added here.

If you want to use this as the base of a new example that adds more features that would be great, but I think we should keep this example as simple as possible.

@ericcurtin (Contributor Author)

Cool, sounds good @slaren. Mind if I call the new example ramalama-core?

@ericcurtin (Contributor Author) commented Nov 16, 2024

llama-inferencer? I really don't care too much about the name; I just want to agree on one to unblock this PR.

@rhatdan @slp any suggestions on names? I plan on using this as the main program during "ramalama run", but I'm happy for anyone to use it or make changes to it to suit their needs. It's like a drastically simplified version of llama-cli, with one or two additional features: read from stdin and read from a -p flag.

But it also seems more stable and less error-prone than llama-cli, and the verbose output is cleaned up so it only prints errors. It was based on llama-simple-chat initially.

@slaren (Collaborator) commented Nov 16, 2024

Maybe something like llama-chat. I mentioned before that I think it would be good to have an example focused on chat only that does that very well, and that in time could replace the current llama-cli as the main program of llama.cpp, which at this point is basically unmaintainable and should be retired.

@ericcurtin (Contributor Author)

SGTM

@ericcurtin (Contributor Author) commented Nov 17, 2024

It's also tempting to call this something like run and use a RamaLama/LocalAI/Ollama-style CLI interface to interact with models, kind of like a daemonless Ollama:

llama-run file://somedir/somefile.gguf

@ericcurtin (Contributor Author) commented Nov 17, 2024

We could even possibly add https://, http://, ollama://, and hf:// as valid syntaxes for pulling models, since they are all just an HTTP pull in the end.

@ericcurtin (Contributor Author)

That might be implemented as a llama-pull tool that llama-run can fork/exec (or they could share a common library).

@ericcurtin force-pushed the simple-chat-smart branch 2 times, most recently from 1d7f97c to 4defcf2 on November 17, 2024 23:40
@ericcurtin changed the title from "Introduce ramalama-core" to "Introduce llama-run" on Nov 17, 2024
@ericcurtin (Contributor Author) commented Nov 17, 2024

I will be AFK for 3 weeks, so expect inactivity in this PR. I did the rename in case we want to merge as-is and not let this go stale.

The syntax will completely change, though, to:

llama-run [file://]somedir/somefile.gguf [prompt] [flags]

file:// will be optional, but will set up the possibility of adding the pullers discussed above.

@ericcurtin
Copy link
Contributor Author

This drives the compiler crazy FWIW:

diff --git a/include/llama.h b/include/llama.h
index 5e742642..c3285da3 100644
--- a/include/llama.h
+++ b/include/llama.h
@@ -537,6 +537,13 @@ extern "C" {
                          int32_t   il_start,
                          int32_t   il_end);

+#ifdef __cplusplus
+    // Smart pointers
+    typedef std::unique_ptr<llama_model, decltype(&llama_free_model)> llama_model_ptr;
+    typedef std::unique_ptr<llama_context, decltype(&llama_free)> llama_context_ptr;
+    typedef std::unique_ptr<llama_sampler, decltype(&llama_sampler_free)> llama_sampler;
+#endif
+
     //
     // KV cache
     //

It seems like it only wants to build C code in that file or something.

@slaren (Collaborator) commented Nov 19, 2024

You are probably doing something wrong. Put it at the end of the file, include <memory>, and rename the llama_sampler ptr to llama_sampler_ptr. You should also use structs (function objects) as the deleters; see ggml-cpp.h for an example.
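
A minimal sketch of the struct-deleter approach described here, modeled on ggml-cpp.h; the exact names in the merged llama-cpp.h may differ:

// llama-cpp.h -- C++-only convenience wrappers (sketch)
#pragma once

#ifndef __cplusplus
#error "This header can only be included by a C++ compiler"
#endif

#include <memory>

#include "llama.h"

struct llama_model_deleter {
    void operator()(llama_model * model) { llama_free_model(model); }
};

struct llama_context_deleter {
    void operator()(llama_context * context) { llama_free(context); }
};

struct llama_sampler_deleter {
    void operator()(llama_sampler * sampler) { llama_sampler_free(sampler); }
};

typedef std::unique_ptr<llama_model, llama_model_deleter> llama_model_ptr;
typedef std::unique_ptr<llama_context, llama_context_deleter> llama_context_ptr;
typedef std::unique_ptr<llama_sampler, llama_sampler_deleter> llama_sampler_ptr;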

@ericcurtin (Contributor Author)

Added the smart pointer typedefs, @slaren:

typedef std::unique_ptr<llama_model, llama_model_deleter> llama_model_ptr;
typedef std::unique_ptr<llama_context, llama_context_deleter> llama_context_ptr;
typedef std::unique_ptr<llama_sampler, llama_sampler_deleter> llama_sampler_ptr;
typedef std::unique_ptr<char[]> char_array_ptr;
@slaren (Collaborator) commented Nov 23, 2024
The rest look good, but this one (char_array_ptr) does not belong here.

@ericcurtin (Contributor Author)

I also need to pass more parameters as references; I notice I still pass raw pointers in places where a reference to a smart pointer would do.

@ericcurtin (Contributor Author)
I removed char_array_ptr
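
For illustration, the point about passing references to smart pointers instead of raw pointers could look like this (generate is a hypothetical helper, not the merged code):

#include <string>

#include "llama-cpp.h" // assumed to provide llama_context_ptr and llama_sampler_ptr

// Raw-pointer style: the caller has to remember who owns what.
// static int generate(llama_context * ctx, llama_sampler * smpl, const std::string & prompt);

// Reference-to-smart-pointer style: ownership stays with the caller's
// unique_ptrs, and the signature makes that explicit.
static int generate(llama_context_ptr & ctx, llama_sampler_ptr & smpl, const std::string & prompt) {
    // use ctx.get() / smpl.get() when calling into the C API
    (void) ctx; (void) smpl; (void) prompt;
    return 0;
}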

@ericcurtin force-pushed the simple-chat-smart branch 2 times, most recently from 32b2999 to bdac00f on November 24, 2024 18:00
@slaren (Collaborator) left a comment
Looks like a good start for an improved chat example.

llama-cpp.h also needs to be added to the list of public headers in CMakeLists.txt.

@ericcurtin force-pushed the simple-chat-smart branch 2 times, most recently from bd63eb5 to a9054cb on November 25, 2024 04:29
The github-actions bot added the build (Compilation issues) label on Nov 25, 2024
@ericcurtin (Contributor Author)

Added as public header

Commit "Introduce llama-run" (message matches the PR description above)
Signed-off-by: Eric Curtin <[email protected]>
@ericcurtin (Contributor Author) commented Nov 25, 2024

Happy to merge this @slaren @ggerganov

There will be breaking changes in a follow-on PR, though; I want to change the syntax to this (ramalama/ollama style):

Description:
  Runs an LLM

Usage:
  llama-run [options] MODEL [PROMPT]

Options:
  -c, --context-size <value>
      Context size (default: 2048)
  -n, --ngl <value>
      Number of GPU layers (default: 0)
  -h, --help
      Show help message

Examples:
  llama-run ~/.local/share/ramalama/models/ollama/smollm\:135m
  llama-run --ngl 99 ~/.local/share/ramalama/models/ollama/smollm\:135m
  llama-run --ngl 99 ~/.local/share/ramalama/models/ollama/smollm\:135m Hello World

Eventually I want to add optional prefixes for the model as well (like file:// and hf://; maybe we implement http://, https://, and ollama:// eventually too, it's not that hard), but with no prefix we will just assume the string is a file path.
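
A rough sketch of how that prefix handling could look; resolve_model and download_model are hypothetical names, not the merged implementation:

#include <cstdio>
#include <string>

// Illustrative only: strip an optional scheme prefix and decide how to
// obtain the model. Anything without a recognized prefix is treated as a
// plain local file path.
static std::string resolve_model(std::string model) {
    if (model.rfind("file://", 0) == 0) {
        return model.substr(7); // local path, just drop the prefix
    }
    if (model.rfind("hf://", 0) == 0 || model.rfind("http://", 0) == 0 ||
        model.rfind("https://", 0) == 0 || model.rfind("ollama://", 0) == 0) {
        // these all boil down to an HTTP pull: download into a local cache
        // and return the cached path (download_model here is hypothetical)
        // return download_model(model);
    }
    return model; // no prefix: assume the string is already a file path
}

int main() {
    printf("%s\n", resolve_model("file://somedir/somefile.gguf").c_str());
    return 0;
}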

@slaren (Collaborator) commented Nov 25, 2024

Sounds good. Btw, to support dynamic loading of backends (#10469), you should add a call to ggml_backend_load_all at the beginning. The goal is to allow projects to distribute a single binary rather than needing different versions for each backend, which I imagine would be relevant to a project like ramalama.
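
A minimal sketch of where that call would go, assuming the dynamic-loading API from #10469:

#include "ggml-backend.h"

int main(int argc, char ** argv) {
    // Load every backend available as a dynamic library (CPU, CUDA, Metal, ...)
    // before using any llama.cpp API, so a single llama-run binary can work on
    // machines with different backends installed.
    ggml_backend_load_all();

    // ... parse arguments, load the model, run the chat loop ...
    (void) argc; (void) argv;
    return 0;
}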

@slaren merged commit 0cc6375 into ggerganov:master on Nov 25, 2024
54 checks passed
@ericcurtin deleted the simple-chat-smart branch on November 26, 2024 23:09
Labels: build (Compilation issues), examples