RFC-0001-The-Llama-Stack #8

Merged
merged 5 commits into from Aug 21, 2024

Conversation

@raghotham (Contributor) commented Jul 23, 2024

As part of the Llama 3.1 release, Meta is releasing an RFC for ‘Llama Stack’, a comprehensive set of interfaces / APIs for ML developers building on top of Llama foundation models. We are looking for feedback on where the API can be improved, any corner cases we may have missed, and your general thoughts on how useful this will be. Ultimately, our hope is to create a standard for working with Llama models that simplifies the developer experience and fosters innovation across the Llama ecosystem.

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) Jul 23, 2024
@raghotham linked an issue (RFC-0001 - Llama Stack) on Jul 23, 2024 that may be closed by this pull request
@joshcarp commented Jul 23, 2024

One question I've got about the lifecycle is around the monitoring/human-feedback portion of the lifecycle diagram, and whether this proposal could look into adding a more complete observability standard for development and deployment.

I think it would be a good idea to include some standard for how observability is taken into account in this lifecycle, and there have already been first efforts to address this in the gen-ai semantic conventions in OpenTelemetry.

Issues with the existing OpenTelemetry semantic conventions:

  • They are only designed for black-box LLM APIs (Anthropic, OpenAI, Gemini, Cohere).
  • They currently expose only superficial attributes (model, max_tokens, temperature, response, request); a sketch of this is below.
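
For concreteness, here is a minimal sketch of roughly what those conventions capture today, using the OpenTelemetry Python API (the attribute names follow my reading of the still-incubating gen-ai conventions, and the values are placeholders):

```python
# Roughly what today's gen-ai semantic conventions let you record:
# surface-level request/response parameters, nothing about model internals.
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

with tracer.start_as_current_span("chat") as span:
    span.set_attribute("gen_ai.system", "openai")  # vendor of the black-box API
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    span.set_attribute("gen_ai.request.max_tokens", 256)
    span.set_attribute("gen_ai.request.temperature", 0.7)
    # ... make the model call here ...
    span.set_attribute("gen_ai.usage.input_tokens", 42)   # placeholder values
    span.set_attribute("gen_ai.usage.output_tokens", 128)
```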

With work on activation steering, in papers like Activation Addition and the sparse-autoencoder work done by Anthropic and OpenAI, it's only a matter of time before we get better information about the internals of these models. Last month Anthropic also opened a waitlist for their beta Steering API, which will hopefully expose feature clamping and monitoring.

I think recently there’s been a lot of work showing that the single shot we give models when answering a prompt is a fundamentally simplistic way of using them, and I imagine that a strong agent framework would allow for monitoring and observability of internal states, which are a richer, continuous, and differentiable representation of the model. Unfortunately, API models are fundamentally stuck in the “Chat API” era: they can’t give any access to the underlying activation space, because that is their multi-billion-dollar intellectual property. This is where I see open-source models completely leapfrogging in capability, since having this access would allow for so much more than we have today.
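
For example, with open weights you can already read the full activation space in a few lines. A minimal sketch using the Hugging Face transformers API (the model id is illustrative and assumes you have local access to the weights):

```python
# Reading per-layer activations from a local model: exactly the internals
# that no hosted chat API will expose.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B"  # illustrative; any local causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# One tensor per layer, shaped (batch, seq_len, hidden_dim); this is the
# representation an observability layer could monitor in deployment.
for i, acts in enumerate(outputs.hidden_states):
    print(f"layer {i}: {tuple(acts.shape)}")
```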

What I would like:

It would be nice to have an extensible tracing setup that could incorporate the useful information we know about models today, while also allowing for future improvements:

  • feature.<feature-name> attributes providing some form of activation monitoring, possibly from dictionary-learning approaches building on work from OpenAI and Anthropic.
    • In the Anthropic paper, the interesting features that could be extracted included “code correctness” and “internal conflict”. These would obviously be invaluable for deployment monitoring and guardrailing.
  • Perplexity: the ability for the monitoring and observability layer to report the likelihood of the output sequence or the input prompt. This would make it possible to correlate a quantitative measure with downstream LLM performance (a rough sketch follows this list).
    • I’m not exactly sure how to go about this, but it would also be nice to have something around “pretraining perplexity” vs. “fine-tuning perplexity”. In high-stakes applications you would want to monitor the KL divergence between the fine-tuning set and the deployed behaviour, to see how far out of distribution incoming requests are. If the KL divergence is high but downstream performance is still good, it might indicate either that your model is being used for something unintended (see the memes about using Amazon chat for generating JS code), or that your fine-tuning set isn’t diverse enough (in which case your evaluation sets are also incomplete).
  • Hallucination detection, as in To Believe or Not to Believe Your LLM. I’m not sure how this would be implemented, but hallucination detection in agentic systems would be extremely important, especially around the “in context” vs. “in weights” distinction that paper introduced.
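
To make the wishlist concrete, here is a hypothetical sketch of what richer trace attributes might look like. The feature.* attribute names and their values are invented for illustration (no such convention exists); only the perplexity computation from per-token log-probabilities is standard:

```python
# Hypothetical richer LLM trace attributes: sequence perplexity plus
# activation-derived feature scores. Attribute names are invented.
import math
from opentelemetry import trace

tracer = trace.get_tracer("llm-observability")

def sequence_perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(-mean per-token log-probability)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

with tracer.start_as_current_span("generate") as span:
    # ... run inference, collecting per-token log-probs and activations ...
    token_logprobs = [-0.21, -1.35, -0.08, -2.90]  # placeholder values
    span.set_attribute("llm.response.perplexity",
                       sequence_perplexity(token_logprobs))
    # Scalar scores from a dictionary-learning / sparse-autoencoder probe,
    # in the spirit of the Anthropic features mentioned above (hypothetical):
    span.set_attribute("feature.code_correctness", 0.91)
    span.set_attribute("feature.internal_conflict", 0.07)
```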

I’ve been thinking about this a lot, and whilst there are tools like transformer_lens for development and research, there don’t seem to be many tools or efforts that enable this for deployed systems. This might be a bunch of scope creep for this spec, but observability/monitoring solutions are an essential part of any complex software system, and it feels like the LLM-agent world hasn’t caught up yet; the lack of these capabilities has been limiting some of the development I want to do on other downstream tasks.

@prassanna-ravishankar commented

A couple of things I'm interested in:

  • Asynchronous post-training: a high-level protocol combining the post-training API with the inference API to granularly update the model as inputs come in and serve a new model.
  • Model monitoring: I like the idea @joshcarp mentioned; an entire observability stack would be helpful, but also with additional hooks to capture ML-specific metrics such as drift during inference.
    • Inference hardware monitoring (GPU/CPU) would also be super useful.
  • Prompt management: mechanisms to store prompts and evaluate them across model checkpoints.
  • Experiment-tracking callbacks: callbacks or hooks to plug into rich logging services for ML experimentation (Aim, MLflow, wandb); relevant only for training and post-training.
  • Export interface: mechanisms to export the model to various target hardware (almost like a real toolchain 😉).
  • Pipe interface: allowing various model instances to pipe their output to each other (as in DeepStream), enabling simple agentic applications and more complex systems (a sketch follows this list).
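
Roughly what I mean by the pipe interface, as a hypothetical sketch (none of these names come from the RFC; in practice each stage would wrap an inference-API call):

```python
# Hypothetical pipe interface: compose model stages so that one model's
# output feeds the next, DeepStream-style.
from typing import Callable

Stage = Callable[[str], str]

def pipe(*stages: Stage) -> Stage:
    """Compose stages left to right into a single callable."""
    def run(text: str) -> str:
        for stage in stages:
            text = stage(text)
        return text
    return run

# Stand-in stages; in practice each would call a model instance.
summarize = lambda text: f"[summary of: {text}]"
translate = lambda text: f"[translation of: {text}]"

pipeline = pipe(summarize, translate)
print(pipeline("a long source document"))
```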

@ashwinb (Contributor) commented Aug 21, 2024

@joshcarp

Thanks for the pointer, this is a great addition! As an update to this PR itself, we have added an Observability API, which can also be used from fine-tuning. Please take a look and let us know what you think.

@ashwinb (Contributor) commented Aug 21, 2024

@prassanna-ravishankar

We are just getting started :) and we'd like to make sure we cover the basics well and provide enough value that the ecosystem finds this worth building on and adopting. If things go well, everything you suggest could arrive.

@ashwinb ashwinb merged commit 2232bfa into main Aug 21, 2024
3 checks passed
@ashwinb ashwinb deleted the RFC-0001-The-Llama-Stack branch August 21, 2024 02:01
@cubxxw commented Aug 23, 2024

At present I am working on putting LLMs into production. I think many production-level standards are missing, which makes it difficult for me to know what is right. I hope Llama can become a standard through open source.

heyjustinai pushed a commit that referenced this pull request Nov 19, 2024