Verifiable Compute for Akash Network #614
Replies: 9 comments 11 replies
-
Sriram talked about discussion 614 during the Akash Steering Committee meeting on June 27th, 2024. The video can be found here, starting at the 24-minute mark: June 27th, 2024 Akash Network Steering Committee Monthly Meeting Video Recording. You can catch up on all past Akash Steering Committee meetings here.
-
Thank you for writing this incredibly important proposal, which adds a great amount of value to Akash Network by solving one of the harder challenges in growing the only fully permissionless network, as well as enabling on-chain incentives. As @brewsterdrinkwater mentioned above, this was presented and discussed at the last Steering Committee meeting with support from everyone there. As such, the core team (at OCL) will be working with @sriramvish to get this proposal on chain for a formal vote in the next day or two.
-
I'm working on something similar with Naoris protocol. I'd be very interested in talking with you on Discord. My Discord is 31trainman.
-
Thanks for stepping in and sharing. I still need to catch up with the last steering committee call recording; however, this text proposal looks solid, and the benefits it would bring to Akash Network are clear: reducing abuse within the network through verifiable provisioning of hardware.
-
I do not think this is the proper way to approach verifiable compute for the Akash network. There is a much simpler and more economical solution that has already been tested and produced very good results: an existing benchmarking API, which simply needs funding for 24/7/365 operation, can easily check all providers on the network. Additionally, I do not think there is any value in a third-party USB key that plugs into anything. If you run a provider and learn how Kubernetes works, you will find that this is not a reasonable solution at scale. Your plan also requires paying your tuition and hiring graduate students to "research" solutions. I do not think community funds should be used for this; they should instead go to builders who have working solutions (functional POC and beyond) and deep familiarity with the Akash network. Your budget does not make sense and is unreasonable:
$96,216 in tuition and "Research Assistants"
$52,907 in "indirect costs" with no explanation
$75k for Akash hardware, with no explanation or rationale behind this amount in your proposal
Finally, please feel free to reach out to me to work on the Benchmarking API solutions that have already been developed. I voted No and I would recommend others vote No as well.
-
I will be voting no. With no real answer to some of the questions in the discussion, I think this needs more discussion. It’s not that I’m against someone developing a verifiable compute add-on; I just don’t think the USB stick makes sense, and without replies to the discussion questions I want to know more before I can vote yes.
-
Thank you all for contributing your thoughts @Zblocker64 @88plug @KamuelBob. Your feedback and participation are what help make our community so robust. I would like to point out some of the practical points that make this effort extremely beneficial to Akash Network:
Improvements for future props and discussions
Improvements to community feedback
Good feedback should present tangible and reasonable solutions.
-
Yo @sriramvish do you have a Discord or anything? I'm working on integrating Naoris security protocol into Akash Network and your idea fits into my security integration.
-
This proposal was put on chain and passed: https://www.mintscan.io/akash/proposals/261 Work will be tracked here: https://github.com/orgs/akash-network/projects/5?pane=issue&itemId=71129502
-
Introduction
Verifiable computing is a class of algorithms and systems in which a particular portion of the compute stack is verifiable/provable in a trustless manner to participants within a decentralized network. Verifiable computing can take many forms, including:
Verifiable provisioning of hardware: This corresponds to the case where we desire to verify the nature and extent to which a piece of hardware is provisioned for the Akash network.
Specifically, if a 4090 GPU were to be incorporated in the Akash network, verifiable provisioning ensures that it indeed matches its hardware specifications, and it is genuinely allocated for functions on the Akash network.
Verifiable execution of program/software: This corresponds to the case where we desire to verify that a program (any AI program, ranging from inference to training) is correctly executed on a node/set of nodes in the Akash network. For example, it could establish that a particular piece of code was executed correctly in a cluster of 4090s on the Akash network. Verifiable execution of programs/software also comes in multiple flavors, including:
Non-real-time: An offline verification mechanism that presents a proof in non-real-time, where the proof has no time or size constraints.
Optimistic, real-time proofs: An optimistic proof mechanism that can be verified or contested in (near) real time.
Zero knowledge, real-time proofs: A zero knowledge proof mechanism that does not reveal anything about the inputs but can still be verified in (near) real time.
In this proposal, for the first year of this project, we focus on only the first type of verifiability: that of provisioning of hardware. After the completion of this first portion of the project, a further proposal will be submitted on non-real-time and, subsequently, real-time verifiable computing within the Akash network.
Benefits to Akash Network
The need for verifiable provisioning of hardware is significant for a variety of reasons, including the elimination/reduction of Sybil attacks, and of other forms of misrepresentation and abuse in the network.
Verifiable Hardware Provisioning
Verifiable hardware provisioning can be achieved in a variety of ways: by using schemes uniquely associated with particular types of hardware, by using access patterns and footprints associated with a particular make and model, and other ways. However, these schemes are dependent on hardware configurations and do not necessarily generalize well. In order to develop a scalable, universal solution, we take a trusted enclave (trusted execution environment) approach as follows:
Akash providers that intend to be “hardware verifiable” are equipped with a TEE configured by Akash (such as Trusty [1]; for more information on TEEs, see the tutorial [2]). Such a TEE contains a physically unclonable function (a PUF, see [3]) that can securely sign transactions. To ensure uniformity, this TEE will be designed as a USB A/C dongle that can be attached to any hardware configuration.
We will verify that the USB A/C dongle can be attached to any hardware configuration and provide a detailed set of instructions to install and use this dongle to enable each provider to become “hardware verifiable” on Akash.
This TEE will periodically perform the following two tasks, based on an internal pseudo-random timer:
Identification task:
Following a pseudo-random clock, the TEE will query every GPU in the specific Akash provider on its status and device-level details.
Provisioning task:
At pseudo-random intervals, a randomly chosen machine learning task will be assigned to the GPUs within this provider. These provisioning tasks are based on existing, well-known benchmarks of GPU performance on certain deep learning tasks, including particular types of models [4], more general deep learning models [5] and other tasks that are well-known benchmarks on existing GPUs [6].
After the conclusion of each pseudo-randomly repeated task, the TEE will securely sign the resulting message and share the signed message with the Akash network.
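The task-and-sign cycle described above can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the proposed TEE design: the function names, the report fields, the delay bounds and `DEVICE_KEY` are all hypothetical, and HMAC-SHA256 from the Python standard library stands in for whatever PUF-backed signature scheme the dongle would actually use (a real design would presumably use an asymmetric signature so the network can verify without holding the secret).

```python
import hashlib
import hmac
import json
import random

# Hypothetical stand-in for a PUF-derived device secret. In a real TEE this
# key would never be visible to the host operating system.
DEVICE_KEY = b"demo-puf-derived-key"


def next_delay(rng, low_s=60.0, high_s=3600.0):
    """Draw the pseudo-random wait (in seconds) before the next task."""
    return rng.uniform(low_s, high_s)


def identification_task(gpus):
    """Record the status and device-level details reported for each GPU."""
    return {"task": "identification", "gpus": gpus}


def provisioning_task(rng, benchmarks, run_benchmark):
    """Pick one benchmark at random and record the result the GPUs return."""
    name = rng.choice(benchmarks)
    return {"task": "provisioning", "benchmark": name, "result": run_benchmark(name)}


def sign_report(report, key=DEVICE_KEY):
    """Serialize the report deterministically and attach an HMAC-SHA256 tag."""
    payload = json.dumps(report, sort_keys=True)
    tag = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": tag}


if __name__ == "__main__":
    rng = random.Random()
    print("next task in", round(next_delay(rng)), "seconds")
    report = identification_task([{"index": 0, "name": "RTX 4090"}])
    print(sign_report(report))
```

Serializing with `sort_keys=True` keeps the signed bytes stable regardless of dictionary ordering, which matters because the verifier has to recompute the tag over exactly the same payload.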
The tasks are used to ensure the following properties:
Identification task:
The identification task sets up the base configuration for each GPU cluster and assigns a unique signature, associated with the TEE, to that cluster. As the identification is performed at the operating system level, it can potentially be spoofed, and therefore the provisioning/benchmarking tasks are required.
Provisioning task:
The provisioning/benchmarking tasks verify the identification while simultaneously ensuring that the associated GPUs are dedicated to the Akash network and are not prioritizing other tasks. If the GPUs are not provisioned for the Akash network, they will fail the provisioning task.
A key point is that the entire system (user, operating system) cannot differentiate between a provisioning/benchmarking task and a regular AI workload provided by the Akash network, and therefore cannot selectively serve a particular type of workload/task. This ensures that the GPUs are both correctly identified and made available to Akash network-centric tasks at all times.
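To illustrate how a provisioning result could expose a mislabelled or otherwise busy GPU, here is a rough sketch of a network-side check. Everything in it is an assumption made for illustration: the report fields (`claimed_model`, `score`), the `tolerance` parameter, and the `REFERENCE_SCORE` table, whose numbers are placeholders rather than measured benchmark results. HMAC-SHA256 again stands in for the TEE's actual signature scheme.

```python
import hashlib
import hmac
import json

# Hypothetical reference scores per GPU model. These are placeholder
# numbers for illustration only, not measured benchmark results.
REFERENCE_SCORE = {"RTX 4090": 100.0, "RTX 3090": 60.0}


def verify_report(signed, key, tolerance=0.2):
    """Return (ok, reason) for a signed provisioning report."""
    payload = signed["payload"].encode()
    expected_tag = hmac.new(key, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how much of the tag matched.
    if not hmac.compare_digest(expected_tag, signed["signature"]):
        return False, "bad signature"
    report = json.loads(payload)
    reference = REFERENCE_SCORE.get(report["claimed_model"])
    if reference is None:
        return False, "unknown model"
    # A GPU that is mislabelled, or busy with non-Akash work, should score
    # well below the reference for the model it claims to be.
    if report["score"] < reference * (1.0 - tolerance):
        return False, "score below expected range"
    return True, "ok"
```

Under these placeholder values, a report claiming an RTX 4090 with a score of 95 passes, while the same claim with a score of 60 is rejected because it falls below the 80-point floor implied by the 20% tolerance.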
Team
The team for this project is led by Prof. Sriram Vishwanath of The University of Texas at Austin, working with Shruti Raghavan, a PhD candidate in Computer Science at UT Austin. They are working together with the Harvard Medical School and MITRE on the design of new foundation/base models in healthcare, with causal learning incorporated into such a platform.
Sriram Vishwanath received the B. Tech. degree in Electrical Engineering from the Indian Institute of Technology (IIT), Madras, India in 1998, the M.S. degree in Electrical Engineering from California Institute of Technology (Caltech), Pasadena, USA in 1999, and the Ph.D. degree in Electrical Engineering from Stanford University, Stanford, CA, USA in 2003. Currently, he is a Professor in the Chandra Department of Electrical and Computer Engineering at The University of Texas at Austin and, recently, a Technical Fellow for Distributed Systems and Machine Learning at MITRE Labs.
Timeline
The timeline for this project is as follows:
Open Discussions: Starting end of June 2024
Governance Proposal: Through first half of July, 2024
Design Phase: Through Q3 and Q4 2024
Hacknet TEE Phase: Q1 2025
Devnet TEE Phase: Q2 2025
Conclusion of Hardware Provisioning testing and handover to Akash Team: End of Q2 2025
Note: This is subject to change based on feedback
Deliverables
Q3 2024 - High Level Design
Q4 2024 - Design Specification
Q1 2025 - Initial Hacknet Prototype
Q2 2025 - Devnet and Conclusion of Testing
Budget
The tentative budget for this project is presented in the spreadsheet attached here: https://docs.google.com/spreadsheets/d/1asmvyi5r7QgKRjsImZInAENXptr_cwoW/edit?usp=sharing&ouid=103645797398143147236&rtpof=true&sd=true
The high-level breakdown for the budget is:
R&D Costs (Student salaries + tuition + University Overhead): 146,547
Akash Computing/Hardware Costs: 75,000
Disbursement:
Disbursement will happen in two increments, each a few weeks before the beginning of a semester: Fall 2024 (on July 22nd, 2024) and Spring 2025 (on December 15th, 2024).
FAQ
Response: As H100s come with integrated TEEs, we will focus this project on older generations of GPUs, with a particular emphasis on RTX series of GPUs.
This FAQ section will be populated as soon as we have questions arising from discussions.
References
[1] Trusty TEE: Android Open Source Project https://source.android.com/docs/security/features/trusty
[2] TEE 101 White Paper https://www.securetechalliance.org/wp-content/uploads/TEE-101-White-Paper-FINAL2-April-2018.pdf
[3] Shamsoshoara, Alireza, et al. "A survey on physical unclonable function (PUF)-based security solutions for Internet of Things." Computer Networks 183 (2020): 107593.
[4] Wang, Yu Emma, Gu-Yeon Wei, and David Brooks. "Benchmarking TPU, GPU, and CPU platforms for deep learning." arXiv preprint arXiv:1907.10701 (2019).
[5] Shi, Shaohuai, et al. "Benchmarking state-of-the-art deep learning software tools." 2016 7th International Conference on Cloud Computing and Big Data (CCBD). IEEE, 2016.
[6] Araujo, Gabriell, et al. "NAS Parallel Benchmarks with CUDA and beyond." Software: Practice and Experience 53.1 (2023): 53-80.