SCI for AI #385
Replies: 7 comments 9 replies
-
Thanks Asim for bringing this up!
100% aligned on the need to restart conversations on this and ensure that the SCI remains applicable and relevant for the people who will use it to make decisions.
On the point of software boundaries, I think a neat thing to try would be a software bill-of-materials (SBOM) that we could help templatize/standardize - or at the very least provide a few exemplars that make comparisons more useful across the range of folks who are creating different kinds of services and products.
On the point of the functional unit, I like the idea of a user journey. It also ties into the shape of the UX/UI, what kinds of interactions the product/service design enables in the first place, and how users commonly expect to use it. Helping users make in-context decisions on that would be a great way to bring transparency to the functional unit, and again, providing some exemplars here might be the way to go.
As for next steps, how do we want to proceed? Do we want to kickstart a discussion thread on GH or bring it up for broader discussion already during the live SWG meetings?
Abhishek Gupta ( @atg_abhishek ( https://twitter.com/atg_abhishek ) )
Founder and Principal Researcher, Montreal AI Ethics Institute ( https://montrealethics.ai/ )
Director, Responsible AI, Boston Consulting Group (BCG) ( https://www.bcg.com/about/people/experts/abhishek-gupta )
Fellow, Augmented Collective Intelligence, BCG Henderson Institute ( https://bcghendersoninstitute.com/contributors/abhishek-gupta/ )
Author, State of AI Ethics Report ( https://montrealethics.ai/state ) and AI Ethics Brief ( https://brief.montrealethics.ai/ )
Chair, Standards Working Group, Green Software Foundation ( https://greensoftware.foundation/ )
-
It might be worth looking at the recent Green Coding AI project from Green Coding Solutions, linked below. It's the first project I know of that works to expose SCI figures on a per-prompt basis, and it lets you see the results from using different models for the same query and how that impacts the SCI. I've linked to an example of the results for a query.
-
Very interesting @mrchrisadams! @chrisxie-fw, bringing the above green-coding.ai to your attention; I believe it's in a similar vein to the Hugging Face/SCER work you were showing the other day. In addition, in your research into SCER, is there anything regarding the "boundary of AI software" that might be useful to this conversation?

@atg-abhishek thanks, SBOM is also an interesting approach. I actually spoke to one of the maintainers of https://github.com/CycloneDX/specification this week at an Apache Software Foundation conference, and he mentioned there might be some overlap (I was talking about Impact Framework and Manifest Files). I might be able to get him in to discuss on a SWG call?

I suggest that some research is required first. The decision to include/exclude something in the software boundary should have some discussion and citations to back up the why. How about we structure this as a research topic trying to answer the question: "What software components/infrastructure/user-stories MUST be included to compute the environmental impact of a deployed AI software service?"

In terms of scope, I believe the goal must be to create guidance for organizations that are providing AI services, which is why I chose the word "deployed" above. It's useful to know the SCI score for OSS LLM models as a baseline, but the goal is a standard that organizations can use to compute the environmental impact of their proprietary AI solutions/services in such a way that it's useful for a consumer of the AI service.

I've put some sources of information that might be useful below:

Carbon Hack AI/LLM Submissions
We had quite a few people submit hackathon solutions (content and tooling) regarding measuring AI emissions for this year's hackathon. The winner of the content prize was a great analysis into the measurement of LLMs using Impact Framework.

Research Related to Measuring AI from the Awesome Green Software List
Gathered from our Awesome Green Software List.

General Research
-
SBOM is what I was referring to during our last Standards WG meeting. An SBOM is typically represented in the SPDX file format. It's a standard file format used to describe the components, licenses, and dependencies in software, and it is widely used for compliance and transparency in open-source software. The SCER Spec Ratings section is an attempt to visually represent the details of how a Rating is generated. It could be said that the SCER Rating is a visual representation of an SPDX SBOM for green software.
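For anyone who hasn't seen the format, here is a rough sketch of a minimal SPDX-style tag-value fragment for a few hypothetical components of an AI service, generated from Python. The package names are invented and this is not a complete or validated SPDX document:

```python
# Hypothetical components of an AI service, written out as a minimal
# SPDX-style tag-value fragment (illustrative only, not a validated SBOM).
components = [
    {"name": "llm-inference-server", "version": "1.4.0", "license": "Apache-2.0"},
    {"name": "vector-database",      "version": "0.9.2", "license": "MIT"},
    {"name": "web-frontend",         "version": "2.1.0", "license": "MIT"},
]

lines = [
    "SPDXVersion: SPDX-2.3",
    "DataLicense: CC0-1.0",
    "SPDXID: SPDXRef-DOCUMENT",
    "DocumentName: example-ai-service-sbom",
]
for i, c in enumerate(components):
    lines += [
        f"PackageName: {c['name']}",
        f"SPDXID: SPDXRef-Package-{i}",
        f"PackageVersion: {c['version']}",
        f"PackageLicenseDeclared: {c['license']}",
    ]
print("\n".join(lines))
```

The open question for this thread is whether a template like this could also carry the boundary information needed for an SCI score, so two reports could be checked for covering the same components.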
-
Thx @chrisxie-fw for the idea of placing the SCER value next to the values; I'll add it to the todo. Just a heads up, we also wrote a paper for HotCarbon [0] in which we measured a lot of LLMs with respect to their SCI and the task at hand. The green-coding.ai project came out of this paper. It just got accepted, so I will be able to share it soon. Otherwise, feel free to drop me an email. The abstract is:

The resource consumption for software and communication infrastructure is an increasing concern, particularly with the emergence of Large Language Models (LLMs). While energy consumption and carbon emission of LLMs during training have been a focus in research, the impact of LLM inference, which scales with the number of requests, remains underexplored. In this paper we show that energy efficiency and carbon emission for LLM inference vary depending on the model and on the task category, e.g. math, programming, general knowledge, with which the model is prompted, and that smaller specialized models can achieve comparable accuracy while using less resources. We analyze the differences across 8 open-source LLMs when processing prompts of different task categories. Our findings lead to a novel approach: classifying prompts by using embeddings to route them to the most energy-efficient and least carbon-intensive LLM in a federation of LLMs while keeping accuracy high. We validate the effectiveness of our method through empirical measurements.
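For anyone curious what the routing idea in the abstract could look like, here is a minimal nearest-centroid sketch. The embedding function, task categories, and model names are placeholders assumed purely for illustration, not the paper's actual implementation or data:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def route_prompt(prompt, embed, centroids, best_model):
    """Embed the prompt, pick the closest task-category centroid, and return
    the model assumed to be most energy-efficient for that category."""
    v = embed(prompt)
    category = max(centroids, key=lambda c: cosine(v, centroids[c]))
    return best_model[category]

# Toy embedding just for the example (a real system would use a sentence embedder).
def toy_embed(text: str) -> np.ndarray:
    v = np.zeros(32)
    for word in text.lower().split():
        v[hash(word) % 32] += 1.0
    return v

centroids = {
    "math":    toy_embed("solve equation integral derivative sum"),
    "code":    toy_embed("python function bug compile loop"),
    "general": toy_embed("history capital universe weather old"),
}
best_model = {"math": "small-math-model", "code": "small-code-model",
              "general": "general-model"}  # placeholder model names

print(route_prompt("How old is the universe", toy_embed, centroids, best_model))
```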
-
Thanks, @jawache, for restarting this thread. We should also look at how we can extend the SCI to managed services, particularly where LLM models are exposed as APIs and applications only consume the APIs. As the LLM models can be a black box, the details of the underlying infrastructure and telemetry would not be available, and we would need to rely on proxy values. @ribalba, can you please share the link to the paper? It would be helpful for our analysis (i.e., how energy efficiency and carbon emissions for LLM inference vary depending on the model and the task category).
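As a sketch of what relying on proxy values could look like for an API consumer (the per-token energy figure and grid intensity below are made-up placeholders, not published numbers):

```python
# Illustrative proxy-based estimate for a black-box LLM API.
# Both constants are made-up placeholders; real factors would have to come
# from provider disclosures or published research.
ENERGY_PER_TOKEN_KWH = 3e-6       # assumed energy per token processed/generated
GRID_INTENSITY_G_PER_KWH = 400.0  # assumed carbon intensity of the provider's region

def estimate_request_emissions_g(prompt_tokens: int, completion_tokens: int) -> float:
    """Rough gCO2e estimate for one API call, using only the token counts
    that most LLM APIs already return with each response."""
    total_tokens = prompt_tokens + completion_tokens
    return total_tokens * ENERGY_PER_TOKEN_KWH * GRID_INTENSITY_G_PER_KWH

print(f"{estimate_request_emissions_g(120, 480):.3f} gCO2e for one request")
```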
-
@atg-abhishek, @Henry-WattTime, @navveenb, @seanmcilroy29
Now that we have the SCI in ISO, perhaps it's time to revisit some early conversations we had about the SCI and work towards more concrete versions for specific domains. I'm proposing we explore how we update the specification to create a clear definition of an SCI score as applied to the domain of AI (or LLMs specifically).
I believe there are two areas where we need to agree on a clear definition as we apply the SCI score to AI: the Software Boundary and the Functional Unit.
Software Boundaries
The reason we spoke about a software boundary in the SCI was because we recognized that otherwise the metric could be "gamed" in a way that is not aligned with the goal of the spec, and also because we wanted to provide a method of comparability between two similar software systems.
Example:

* Product A, a SaaS LLM, has an SCI score of 5g per prompt and includes their database in their software boundary.
* Product B is a SaaS LLM with an SCI score of 4g per prompt, but it *doesn't* include their database in their software boundary.

It's not a comparable metric, because the boundary condition is different.

* Product A has an SCI score of 5g per prompt in 2024, and includes their database in their software boundary.
* Product A reduces their SCI score to 4g per prompt in 2025, but *doesn't* include their database in their software boundary. They reduced their score not through efficiency measures, but simply through a redefinition of the software boundary.
This isn't a problem with physical products, since two fridges have to have all the same components; you can't leave the door off one. But with software, the boundary is a lot more subjective and open to interpretation.
Suggestion:
The spec provides a clear set of software components that MUST be included in an SCI score as it's applied to AI, for example:

* All the software components involved in training the original models.
* All the software components involved in inference.
* Any and all databases.
* Data storage used in the training of models.
* Transmission and networking, both on the server/DC side and in the DC-to-user interactions.
* The user interfaces (mobile, web, API).
* All software components used in the operation of the AI, including CI/CD, QA machines, dev machines, logging, and metrics.
* All SaaS products used in the user journey of delivering the result of a prompt to the end user.
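To make this concrete, here is a minimal sketch (not part of the spec, with made-up numbers) of how the components above could be enumerated and rolled up using the SCI formula, SCI = ((E × I) + M) per R:

```python
from dataclasses import dataclass

@dataclass
class BoundaryComponent:
    name: str          # e.g. "training", "inference", "database"
    energy_kwh: float  # E: energy consumed by this component over the period
    intensity: float   # I: grid carbon intensity, gCO2e per kWh
    embodied_g: float  # M: embodied emissions amortised to the period, gCO2e

def sci_per_functional_unit(components: list[BoundaryComponent], r: int) -> float:
    """Sum (E * I) + M across the whole software boundary, then divide by R."""
    total_g = sum(c.energy_kwh * c.intensity + c.embodied_g for c in components)
    return total_g / r

# Illustrative placeholder values only -- not real measurements.
boundary = [
    BoundaryComponent("training (amortised)",  120.0, 400.0, 900.0),
    BoundaryComponent("inference",              35.0, 400.0, 300.0),
    BoundaryComponent("databases + storage",     8.0, 400.0, 150.0),
    BoundaryComponent("networking + UI + ops",   3.0, 400.0,  60.0),
]
print(f"{sci_per_functional_unit(boundary, r=10_000):.2f} gCO2e per prompt")
```

Two products only produce comparable numbers if their boundary lists cover the same categories of components, which is exactly the gaming problem described above.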
Functional Unit
The second is deciding the functional unit: to be comparable, the Functional Units must match. But also, to remain an honest metric over time, the functional unit can't be redefined to game the score.
For example:
If we decide on a functional unit of "prompt", does prompt mean all forms of prompts the end user is entering? E.g. image/video prompts are far more costly than text prompts. Does it include prompts that the user enters from the official front end, as well as prompts that come from an API? We want to provide enough clarity that there is no room for interpretation regarding what we mean by a prompt.
Suggestion
Per Prompt, where a Prompt is defined as any form of request to the LLM, whether it comes from the official user interface or from an API call.
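As a rough illustration of "any form of request" (the log records and field names below are hypothetical, just to show the counting rule): UI and API prompts count identically towards the functional unit R, so a provider can't shift the score by reporting only one channel.

```python
from collections import Counter

# Hypothetical request-log records; "channel" and "modality" are illustrative fields.
requests = [
    {"channel": "web_ui", "modality": "text"},
    {"channel": "api",    "modality": "text"},
    {"channel": "api",    "modality": "image"},
    {"channel": "mobile", "modality": "text"},
]

# R counts every prompt, regardless of where it came from.
r = len(requests)

# Reporting the modality mix separately is useful, since image/video prompts
# cost far more than text prompts.
mix = Counter(req["modality"] for req in requests)
print(f"R = {r} prompts, modality mix = {dict(mix)}")
```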
Note
The Functional Unit seems to be linked to a user journey. We should perhaps clarify a functional unit by describing a long list of user stories which describe, without ambiguity, the user journey we are expressing with the Functional Unit. E.g. "As a user, I click on the prompt box, enter 'How old is the universe', and then click enter." "As a user, I enter the prompt 'Generate an image of a monkey riding a bicycle'."