SCI for AI #385
Replies: 7 comments 9 replies
-
Thanks Asim for bringing this up!
100% aligned on the need to restart conversations on this and ensure that the SCI remains applicable and relevant for the people who will use it to make decisions.
On the point of software boundaries, I think a neat thing to try would be a software bill-of-materials (SBOM) that we could help templatize/standardize - or at the very least provide a few exemplars that make comparisons more useful across the range of folks who are creating different kinds of services and products.
On the point of the functional unit, I like the idea of a user journey. It also ties into the shape of the UX/UI, what kinds of interactions the product/service design enables in the first place, and how users commonly expect to use it. Helping users make in-context decisions on that would be a great way to bring transparency to the functional unit, and again, providing some exemplars here might be the way to go.
As for next steps, how do we want to proceed? Do we want to kickstart a discussion thread on GH or bring it up for broader discussion already during the live SWG meetings?
Abhishek Gupta ( @atg_abhishek ( https://twitter.com/atg_abhishek ) )
Founder and Principal Researcher, Montreal AI Ethics Institute ( https://montrealethics.ai/ )
Director, Responsible AI, Boston Consulting Group (BCG) ( https://www.bcg.com/about/people/experts/abhishek-gupta )
Fellow, Augmented Collective Intelligence, BCG Henderson Institute ( https://bcghendersoninstitute.com/contributors/abhishek-gupta/ )
Author, State of AI Ethics Report ( https://montrealethics.ai/state ) and AI Ethics Brief ( https://brief.montrealethics.ai/ )
Chair, Standards Working Group, Green Software Foundation ( https://greensoftware.foundation/ )
-
It might be worth looking at the recent Green Coding AI project from Green Coding Solutions, linked below. It's the first project I know of that works to expose SCI figures on a per-prompt basis, and it lets you see the results from using different models for the same query and how that impacts the SCI. I've linked to an example of the results for a query.
-
Very interesting @mrchrisadams! @chrisxie-fw, bringing the above green-coding.ai to your attention; I believe it's in a similar vein to the Hugging Face/SCER work you were showing the other day. In addition, in your research into SCER, is there anything regarding the "boundary of AI software" that might be useful to this conversation?

@atg-abhishek thanks, SBOM is also an interesting approach. I actually spoke to one of the maintainers of https://github.com/CycloneDX/specification this week at an Apache Software Foundation conference, and he mentioned there might be some overlap (I was talking about Impact Framework and Manifest Files). I might be able to get him in to discuss on a SWG call?

I suggest that some research is required first. The decision to include/exclude something in the software boundary should have some discussion and citations to back up the why. How about we structure this as a research topic trying to answer the question: "What software components/infrastructure/user-stories MUST be included to compute the environmental impact of a deployed AI software service?"

In terms of scope, I believe the goal must be to create guidance for organizations that are providing AI services, which is why I chose the word "deployed" above. It's useful to know the SCI score for OSS LLM models as a baseline, but the goal is a standard that organizations can use to compute the environmental impact of their proprietary AI solutions/services in such a way that it's useful for a consumer of the AI service.

I've put some sources of information that might be useful below:

Carbon Hack AI/LLM Submissions
We had quite a few people submit hackathon solutions (content and tooling) regarding measuring AI emissions for this year's hackathon. The winner of the content prize was a great analysis into the measurement of LLMs using Impact Framework.

Research Related to Measuring AI from the Awesome Green Software List
Gathered from our Awesome Green Software List.

General Research
-
SBOM is what I was referring to during our last Standards WG meeting. An SBOM is typically represented in the SPDX file format. It's a standard file format used to describe the components, licenses, and dependencies in software, and it is widely used for compliance and transparency in open-source software. The SCER Spec Ratings section is an attempt to visually represent the details of how a Rating is generated. It could be said that the SCER Rating is a visual representation of an SPDX SBOM for green software.
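For anyone who hasn't seen the format, here is a rough sketch of a minimal SPDX-style tag-value fragment for a few hypothetical components of an AI service, generated from Python. The package names are invented and this is not a complete or validated SPDX document:

```python
# Hypothetical components of an AI service, written out as a minimal
# SPDX-style tag-value fragment (illustrative only, not a validated SBOM).
components = [
    {"name": "llm-inference-server", "version": "1.4.0", "license": "Apache-2.0"},
    {"name": "vector-database",      "version": "0.9.2", "license": "MIT"},
    {"name": "web-frontend",         "version": "2.1.0", "license": "MIT"},
]

lines = [
    "SPDXVersion: SPDX-2.3",
    "DataLicense: CC0-1.0",
    "SPDXID: SPDXRef-DOCUMENT",
    "DocumentName: example-ai-service-sbom",
]
for i, c in enumerate(components):
    lines += [
        f"PackageName: {c['name']}",
        f"SPDXID: SPDXRef-Package-{i}",
        f"PackageVersion: {c['version']}",
        f"PackageLicenseDeclared: {c['license']}",
    ]
print("\n".join(lines))
```

The open question for this thread is whether a template like this could also carry the boundary information needed for an SCI score, so two reports could be checked for covering the same components.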
-
Thx @chrisxie-fw for the idea of placing the SCER value next to the values; I'll add it to the todo. Just a heads up, we also wrote a paper for HotCarbon [0] in which we measured a lot of LLMs with respect to their SCI and the task at hand. The green-coding.ai project came out of this paper. It just got accepted, so I will be able to share it soon. Otherwise, feel free to drop me an email. The abstract is:

The resource consumption for software and communication infrastructure is an increasing concern, particularly with the emergence of Large Language Models (LLMs). While energy consumption and carbon emission of LLMs during training have been a focus in research, the impact of LLM inference, which scales with the number of requests, remains underexplored. In this paper we show that energy efficiency and carbon emission for LLM inference vary depending on the model and on the task category, e.g. math, programming, general knowledge, with which the model is prompted, and that smaller specialized models can achieve comparable accuracy while using less resources. We analyze the differences across 8 open-source LLMs when processing prompts of different task categories. Our findings lead to a novel approach: classifying prompts by using embeddings to route them to the most energy-efficient and least carbon-intensive LLM in a federation of LLMs while keeping accuracy high. We validate the effectiveness of our method through empirical measurements.
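For anyone curious what the routing idea in the abstract could look like, here is a minimal nearest-centroid sketch. The embedding function, task categories, and model names are placeholders assumed purely for illustration, not the paper's actual implementation or data:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def route_prompt(prompt, embed, centroids, best_model):
    """Embed the prompt, pick the closest task-category centroid, and return
    the model assumed to be most energy-efficient for that category."""
    v = embed(prompt)
    category = max(centroids, key=lambda c: cosine(v, centroids[c]))
    return best_model[category]

# Toy embedding just for the example (a real system would use a sentence embedder).
def toy_embed(text: str) -> np.ndarray:
    v = np.zeros(32)
    for word in text.lower().split():
        v[hash(word) % 32] += 1.0
    return v

centroids = {
    "math":    toy_embed("solve equation integral derivative sum"),
    "code":    toy_embed("python function bug compile loop"),
    "general": toy_embed("history capital universe weather old"),
}
best_model = {"math": "small-math-model", "code": "small-code-model",
              "general": "general-model"}  # placeholder model names

print(route_prompt("How old is the universe", toy_embed, centroids, best_model))
```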
-
Thanks, @jawache, for restarting this thread. We should also look at how we can extend the SCI to managed services, particularly where LLM models are exposed as APIs and applications only consume the APIs. As the LLM models can be a black box, the details of the underlying infrastructure and telemetry would not be available, and we would need to rely on proxy values. @ribalba, can you please share the link to the paper? It would be helpful for our analysis (i.e., how energy efficiency and carbon emissions for LLM inference vary depending on the model and the task category).
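As a sketch of what relying on proxy values could look like for an API consumer (the per-token energy figure and grid intensity below are made-up placeholders, not published numbers):

```python
# Illustrative proxy-based estimate for a black-box LLM API.
# Both constants are made-up placeholders; real factors would have to come
# from provider disclosures or published research.
ENERGY_PER_TOKEN_KWH = 3e-6       # assumed energy per token processed/generated
GRID_INTENSITY_G_PER_KWH = 400.0  # assumed carbon intensity of the provider's region

def estimate_request_emissions_g(prompt_tokens: int, completion_tokens: int) -> float:
    """Rough gCO2e estimate for one API call, using only the token counts
    that most LLM APIs already return with each response."""
    total_tokens = prompt_tokens + completion_tokens
    return total_tokens * ENERGY_PER_TOKEN_KWH * GRID_INTENSITY_G_PER_KWH

print(f"{estimate_request_emissions_g(120, 480):.3f} gCO2e for one request")
```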
-
@atg-abhishek, @Henry-WattTime, @navveenb, @seanmcilroy29
Now that we have the SCI in ISO, perhaps it's time to revisit some early conversations we had about the SCI and work towards more concrete versions for specific domains. I'm proposing we explore how we update the specification to create a clear definition of an SCI score as applied to the domain of AI (or LLMs specifically).
I believe there are two areas where we need to agree on a clear definition as we apply the SCI score to AI: the Software Boundary and the Functional Unit.
Software Boundaries
The reason we spoke about a software boundary in the SCI was because we recognized that otherwise the metric could be "gamed" in a way that is not aligned with the goal of the spec, and also because we wanted to provide a method of comparability between two similar software systems.
Example:

* Product A, a SaaS LLM, has an SCI score of 5g per prompt and includes their database in their software boundary.
* Product B is a SaaS LLM with an SCI score of 4g per prompt, but it *doesn't* include their database in their software boundary.

It's not a comparable metric, because the boundary condition is different.

* Product A has an SCI score of 5g per prompt in 2024, and includes their database in their software boundary.
* Product A reduces their SCI score to 4g per prompt in 2025, but *doesn't* include their database in their software boundary. They reduced their score not through efficiency measures, but simply through a redefinition of the software boundary.
This isn't a problem with physical products, since two fridges have to have all the same components; you can't leave the door off one. But with software, the boundary is a lot more subjective and open to interpretation.
Suggestion:
The spec provides a clear set of software components that MUST be included in an SCI score as it's applied to AI, for example:

* All the software components involved in training the original models.
* All the software components involved in inference.
* Any and all databases.
* Data storage used in the training of models.
* Transmission and networking, both on the server/DC side and in the DC-to-user interactions.
* The user interfaces (mobile, web, API).
* All software components used in the operation of the AI, including CI/CD, QA machines, dev machines, logging, and metrics.
* All SaaS products used in the user journey of delivering the result of a prompt to the end user.
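To make this concrete, here is a minimal sketch (not part of the spec, with made-up numbers) of how the components above could be enumerated and rolled up using the SCI formula, SCI = ((E × I) + M) per R:

```python
from dataclasses import dataclass

@dataclass
class BoundaryComponent:
    name: str          # e.g. "training", "inference", "database"
    energy_kwh: float  # E: energy consumed by this component over the period
    intensity: float   # I: grid carbon intensity, gCO2e per kWh
    embodied_g: float  # M: embodied emissions amortised to the period, gCO2e

def sci_per_functional_unit(components: list[BoundaryComponent], r: int) -> float:
    """Sum (E * I) + M across the whole software boundary, then divide by R."""
    total_g = sum(c.energy_kwh * c.intensity + c.embodied_g for c in components)
    return total_g / r

# Illustrative placeholder values only -- not real measurements.
boundary = [
    BoundaryComponent("training (amortised)",  120.0, 400.0, 900.0),
    BoundaryComponent("inference",              35.0, 400.0, 300.0),
    BoundaryComponent("databases + storage",     8.0, 400.0, 150.0),
    BoundaryComponent("networking + UI + ops",   3.0, 400.0,  60.0),
]
print(f"{sci_per_functional_unit(boundary, r=10_000):.2f} gCO2e per prompt")
```

Two products only produce comparable numbers if their boundary lists cover the same categories of components, which is exactly the gaming problem described above.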
Functional Unit
The second is deciding the functional unit: to be comparable, the Functional Units must match. But also, to remain an honest metric over time, the functional unit can't be redefined to game the score.
For example:
If we decide on a functional unit of "prompt", does prompt mean all forms of prompts the end user is entering? E.g. image/video prompts are far more costly than text prompts. Does it include prompts that the user enters from the official front end, as well as prompts that come from an API? We want to provide enough clarity that there is no room for interpretation regarding what we mean by a prompt.
Suggestion
Per Prompt, where a Prompt is defined as any form of request to the LLM, whether it comes from the official user interface or from an API call.
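As a rough illustration of "any form of request" (the log records and field names below are hypothetical, just to show the counting rule): UI and API prompts count identically towards the functional unit R, so a provider can't shift the score by reporting only one channel.

```python
from collections import Counter

# Hypothetical request-log records; "channel" and "modality" are illustrative fields.
requests = [
    {"channel": "web_ui", "modality": "text"},
    {"channel": "api",    "modality": "text"},
    {"channel": "api",    "modality": "image"},
    {"channel": "mobile", "modality": "text"},
]

# R counts every prompt, regardless of where it came from.
r = len(requests)

# Reporting the modality mix separately is useful, since image/video prompts
# cost far more than text prompts.
mix = Counter(req["modality"] for req in requests)
print(f"R = {r} prompts, modality mix = {dict(mix)}")
```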
Note
The Functional Unit seems to be linked to a user journey. We should perhaps clarify a functional unit by describing a long list of user stories which describe, without ambiguity, the user journey we are expressing with the Functional Unit. E.g. "As a user, I click on the prompt box, enter 'How old is the universe', and then click enter." "As a user, I enter the prompt 'Generate an image of a monkey riding a bicycle'."