Add load balancing #17
This works now in the output branch. I opted to do something between naive and advanced. When you supply a vector of servers, you can assign a name to each, corresponding to the share of requests that should be fulfilled by that server. So:

library(rollama)
library(tidyverse)
reviews_df <- read_csv("https://raw.githubusercontent.com/AFAgarap/ecommerce-reviews-analysis/master/Womens%20Clothing%20E-Commerce%20Reviews.csv",
show_col_types = FALSE) |>
sample_n(500)
#> New names:
#> • `` -> `...1`
make_query <- function(t) {
tribble(
~role, ~content,
"system", "You assign texts into categories. Answer with just the correct category, which is either {positive}, {neutral} or {negative}.",
"user", t
)
}
start <- Sys.time()
reviews_df_annotated <- reviews_df |>
mutate(query = map(`Review Text`, make_query),
category = query(query, screen = FALSE,
model = "llama3.2:3b-instruct-q8_0",
server = c("0.6" = "http://localhost:11434/",
"0.4" = "http://192.168.2.45:11434/"),
output = "text"))
stop <- Sys.time()
stop - start
#> Time difference of 18.19546 secs

Created on 2024-10-18 with reprex v2.1.0
I'm trying to understand (and implement) this. Would this be the functional equivalent of having two GPUs (for example) in one system, so that they could handle a much larger model through combined VRAM? Or is this merely going to split a single request at the ratios you select, so it just processes everything quicker without taking advantage of the larger combined VRAM? Or is this an implementation of Ollama's new parallel request feature?
AFAIK there is no way to combine the VRAM of consumer GPUs. This is indeed just using parallel requests: if you have multiple machines (or GPUs running multiple instances of Ollama, I guess), you can divide the queue. E.g., you have two PCs with a GPU and one laptop. The laptop will be slow, but it could still fulfill some of the requests if you have thousands.
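To make "dividing the queue" concrete, here is a minimal sketch of how a batch of texts could be split between two servers according to named shares. This is an illustration in base R, not rollama's internal implementation; the server URLs are placeholders and `reviews_df` is the data frame from the reprex above.

```r
# Illustration only: split a queue of texts between servers by share.
servers <- c("0.6" = "http://localhost:11434/",
             "0.4" = "http://192.168.2.45:11434/")
texts <- reviews_df$`Review Text`  # the queue: one prompt template, many texts

shares <- as.numeric(names(servers))
# randomly assign each text to a server, proportional to its share
assignment <- sample(servers, length(texts), replace = TRUE, prob = shares)
batches <- split(texts, assignment)  # one batch per server, processed in parallel
lengths(batches)                     # roughly 60% / 40% of the texts
```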
OK, but in what sense do you mean queues? Within the context of a single Ollama call that may just take a long time to complete, or do you mean individual calls that are kept in a list or something?
A queue would be the same annotation task, with examples and a prompt, applied to 1,000 texts. I updated the vignette to show an example: https://jbgruber.github.io/rollama/articles/annotation.html#another-example-using-a-dataframe
The same approach implemented in #16 could also be used to send requests to multiple Ollama servers at once and process them in parallel. There are at least two approaches we could follow:

1. Split the queue of requests between the servers up front (e.g., evenly or by a fixed share) and let each server work through its batch.
2. Dispatch requests dynamically, sending each new request to whichever server is currently free.

In 1., the total run time would be determined by the slowest instance. 2. would be much more efficient in scenarios with a mix of fast and slow machines, but also harder to implement.
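For illustration, here is a rough sketch of what approach 2 could look like using the base `parallel` package: one local worker per Ollama server, with `clusterApplyLB()` handing out one text at a time to whichever worker is idle. This is not what the package currently does; the server URLs are placeholders, and it assumes the `rollama_server` option can be used to point each worker at a different server.

```r
# Sketch of approach 2 (dynamic dispatch), not the current implementation.
library(parallel)

servers <- c("http://localhost:11434", "http://192.168.2.45:11434")  # placeholder URLs
texts <- reviews_df$`Review Text`

cl <- makeCluster(length(servers))  # one local worker per server

# Bind each worker to one server (assumes the rollama_server option is honoured).
clusterApply(cl, servers, function(s) {
  library(rollama)
  options(rollama_server = s)
})

# clusterApplyLB() gives each idle worker the next text, so a fast server
# simply ends up answering more of the queue than a slow one.
annotations <- clusterApplyLB(cl, texts, function(t) {
  rollama::query(t, model = "llama3.2:3b-instruct-q8_0",
                 screen = FALSE, output = "text")
})
stopCluster(cl)
```

The upside of this design is that no shares need to be chosen in advance; the downside is the overhead of managing persistent workers, which is part of why it is harder to implement.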