
Commit

Merge pull request #2160 from FedML-AI/dev/v0.7.0
Dev/v0.7.0
fedml-alex authored Jun 11, 2024
2 parents 31d8e7c + af026fb commit 9c227bb
Showing 53 changed files with 861 additions and 9,055 deletions.
42 changes: 19 additions & 23 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,44 +1,40 @@

# FEDML Open Source: A Unified and Scalable Machine Learning Library for Running Training and Deployment Anywhere at Any Scale

Backed by FEDML Nexus AI: Next-Gen Cloud Services for LLMs & Generative AI (https://fedml.ai)
Backed by TensorOpera AI: Your Generative AI Platform at Scale (https://TensorOpera.ai)

<div align="center">
<img src="docs/images/fedml_logo_light_mode.png" width="400px">
<img src="docs/images/TensorOpera_arch.png" width="600px">
</div>

FedML Documentation: https://doc.fedml.ai
TensorOpera Documentation: https://docs.TensorOpera.ai

FedML Homepage: https://fedml.ai/ \
FedML Blog: https://blog.fedml.ai/ \
FedML Medium: https://medium.com/@FedML \
FedML Research: https://fedml.ai/research-papers/
TensorOpera Homepage: https://TensorOpera.ai/ \
TensorOpera Blog: https://blog.TensorOpera.ai/

Join the Community: \
Join the Community:
Slack: https://join.slack.com/t/fedml/shared_invite/zt-havwx1ee-a1xfOUrATNfc9DFqU~r34w \
Discord: https://discord.gg/9xkW8ae6RV


FEDML® stands for Foundational Ecosystem Design for Machine Learning. [FEDML Nexus AI](https://fedml.ai) is the next-gen cloud service for LLMs & Generative AI. It helps developers *launch* complex model *training*, *deployment*, and *federated learning* anywhere on decentralized GPUs, multi-clouds, edge servers, and smartphones, *easily, economically, and securely*.
TensorOpera® AI (https://TensorOpera.ai) is the next-gen cloud service for LLMs & Generative AI. It helps developers launch complex model training, deployment, and federated learning anywhere on decentralized GPUs, multi-clouds, edge servers, and smartphones, easily, economically, and securely.

Highly integrated with [FEDML open source library](https://github.com/fedml-ai/fedml), FEDML Nexus AI provides holistic support of three interconnected AI infrastructure layers: user-friendly MLOps, a well-managed scheduler, and high-performance ML libraries for running any AI jobs across GPU Clouds.
Highly integrated with TensorOpera open source library, TensorOpera AI provides holistic support of three interconnected AI infrastructure layers: user-friendly MLOps, a well-managed scheduler, and high-performance ML libraries for running any AI jobs across GPU Clouds.

![fedml-nexus-ai-overview.png](./docs/images/fedml-nexus-ai-overview.png)
A typical workflow is shown in the figure above. When a developer wants to run a pre-built job in Studio or Job Store, TensorOpera®Launch swiftly pairs AI jobs with the most economical GPU resources, auto-provisions, and effortlessly runs the job, eliminating complex environment setup and management. While running the job, TensorOpera®Launch orchestrates the compute plane in different cluster topologies and configurations, so that any complex AI job is enabled, whether it is model training, deployment, or even federated learning. TensorOpera®Open Source is a unified and scalable machine learning library for running these AI jobs anywhere at any scale.

A typical workflow is shown in the figure above. When a developer wants to run a pre-built job in Studio or Job Store, FEDML®Launch swiftly pairs AI jobs with the most economical GPU resources, auto-provisions, and effortlessly runs the job, eliminating complex environment setup and management. While running the job, FEDML®Launch orchestrates the compute plane in different cluster topologies and configurations, so that any complex AI job is enabled, whether it is model training, deployment, or even federated learning. FEDML®Open Source is a unified and scalable machine learning library for running these AI jobs anywhere at any scale.
In the MLOps layer of TensorOpera AI
- **TensorOpera® Studio** embraces the power of Generative AI! Access popular open-source foundational models (e.g., LLMs), fine-tune them seamlessly with your specific data, and deploy them scalably and cost-effectively using the TensorOpera Launch on GPU marketplace.
- **TensorOpera® Job Store** maintains a list of pre-built jobs for training, deployment, and federated learning. Developers are encouraged to run them directly with customized datasets or models on cheaper GPUs.

In the MLOps layer of FEDML Nexus AI
- **FEDML® Studio** embraces the power of Generative AI! Access popular open-source foundational models (e.g., LLMs), fine-tune them seamlessly with your specific data, and deploy them scalably and cost-effectively using the FEDML Launch on GPU marketplace.
- **FEDML® Job Store** maintains a list of pre-built jobs for training, deployment, and federated learning. Developers are encouraged to run them directly with customized datasets or models on cheaper GPUs.
In the scheduler layer of TensorOpera AI
- **TensorOpera® Launch** swiftly pairs AI jobs with the most economical GPU resources, auto-provisions, and effortlessly runs the job, eliminating complex environment setup and management. It supports a range of compute-intensive jobs for generative AI and LLMs, such as large-scale training, serverless deployments, and vector DB searches. TensorOpera Launch also facilitates on-prem cluster management and deployment on private or hybrid clouds.

In the scheduler layer of FEDML Nexus AI
- **FEDML® Launch** swiftly pairs AI jobs with the most economical GPU resources, auto-provisions, and effortlessly runs the job, eliminating complex environment setup and management. It supports a range of compute-intensive jobs for generative AI and LLMs, such as large-scale training, serverless deployments, and vector DB searches. FEDML Launch also facilitates on-prem cluster management and deployment on private or hybrid clouds.

In the Compute layer of FEDML Nexus AI
- **FEDML® Deploy** is a model serving platform for high scalability and low latency.
- **FEDML® Train** focuses on distributed training of large and foundational models.
- **FEDML® Federate** is a federated learning platform backed by the most popular federated learning open-source library and the world’s first FLOps (federated learning Ops), offering on-device training on smartphones and cross-cloud GPU servers.
- **FEDML® Open Source** is a unified and scalable machine learning library for running these AI jobs anywhere at any scale.
In the Compute layer of TensorOpera AI
- **TensorOpera® Deploy** is a model serving platform for high scalability and low latency.
- **TensorOpera® Train** focuses on distributed training of large and foundational models.
- **TensorOpera® Federate** is a federated learning platform backed by the most popular federated learning open-source library and the world’s first FLOps (federated learning Ops), offering on-device training on smartphones and cross-cloud GPU servers.
- **TensorOpera® Open Source** is a unified and scalable machine learning library for running these AI jobs anywhere at any scale.

# Contributing
FedML embraces and thrives through open source. We welcome all kinds of contributions from the community. Kudos to all of <a href="https://github.com/fedml-ai/fedml/graphs/contributors" target="_blank">our amazing contributors</a>!
Binary file added docs/images/TensorOpera_arch.png
10 changes: 10 additions & 0 deletions python/examples/deploy/debug/inference_timeout/config.yaml
@@ -0,0 +1,10 @@
workspace: "./src"
entry_point: "serve_main.py"
bootstrap: |
  echo "Bootstrap start..."
  sleep 5
  echo "Bootstrap finished"
auto_detect_public_ip: true
use_gpu: true

request_timeout_sec: 10
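The `request_timeout_sec` key above caps how long a pending inference request may run before it is failed. As an illustration only (this is not the FedML implementation), the timeout semantics can be sketched client-side with a worker-pool timeout; `slow_predict` and `REQUEST_TIMEOUT_SEC` are hypothetical names for this sketch:

```python
import concurrent.futures
import time


def slow_predict(request):
    """Pretend inference that takes 2 seconds to answer."""
    time.sleep(2)
    return {"ok": request}


REQUEST_TIMEOUT_SEC = 1  # analogous to request_timeout_sec in config.yaml

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_predict, {"text": "hi"})
    try:
        result = future.result(timeout=REQUEST_TIMEOUT_SEC)
    except concurrent.futures.TimeoutError:
        # The caller gives up after 1 s even though the worker is still busy
        result = {"error": "inference timed out"}

print(result)  # → {'error': 'inference timed out'}
```

Because the simulated inference (2 s) outlives the 1 s budget, the caller observes a timeout, which is the behavior this example's config is designed to exercise.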
32 changes: 32 additions & 0 deletions python/examples/deploy/debug/inference_timeout/src/serve_main.py
@@ -0,0 +1,32 @@
from fedml.serving import FedMLPredictor
from fedml.serving import FedMLInferenceRunner
import uuid
import torch

# Calculate the number of elements
num_elements = 1_073_741_824 // 4 # using integer division for whole elements


class DummyPredictor(FedMLPredictor):
    def __init__(self):
        super().__init__()
        # Create a tensor with this many elements
        tensor = torch.empty(num_elements, dtype=torch.float32)

        # Move the tensor to the GPU so it occupies device memory
        tensor_gpu = tensor.cuda()

        # For debugging: leave a marker file on disk
        with open("/tmp/dummy_gpu_occupier.txt", "w") as f:
            f.write("GPU is occupied")

        self.worker_id = uuid.uuid4()

    def predict(self, request):
        return {f"AlohaV0From{self.worker_id}": request}


if __name__ == "__main__":
    predictor = DummyPredictor()
    fedml_inference_runner = FedMLInferenceRunner(predictor)
    fedml_inference_runner.run()
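The dummy tensor in this example is sized so that it pins roughly 1 GiB of GPU memory, which is what makes it useful for debugging occupancy and timeouts. A quick back-of-the-envelope check of that sizing:

```python
# 1_073_741_824 bytes is exactly 1 GiB; dividing by the 4-byte width of a
# float32 gives the number of elements that fill 1 GiB of device memory.
num_elements = 1_073_741_824 // 4   # 268,435,456 float32 elements
bytes_per_element = 4               # torch.float32 is 4 bytes wide
total_bytes = num_elements * bytes_per_element

print(total_bytes == 2**30)         # → True (exactly 1 GiB)
print(total_bytes / (1024 ** 3))    # → 1.0
```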
Empty file.
21 changes: 4 additions & 17 deletions python/examples/deploy/quick_start/config.yaml
@@ -1,21 +1,8 @@
workspace: "./src"
workspace: "."
entry_point: "main_entry.py"

# If you want to install some packages
# Please write the command in the bootstrap.sh
bootstrap: |
echo "Bootstrap start..."
sh ./config/bootstrap.sh
echo "Bootstrap finished"
# If you do not have any GPU resource but want to serve the model
# Try the FedML® Nexus AI Platform, and uncomment the following lines.
# ------------------------------------------------------------
computing:
minimum_num_gpus: 1 # minimum # of GPUs to provision
maximum_cost_per_hour: $3000 # max cost per hour for your job per gpu card
#allow_cross_cloud_resources: true # true, false
#device_type: CPU # options: GPU, CPU, hybrid
resource_type: A100-80G # e.g., A100-80G,
# please check the resource type list by "fedml show-resource-type"
# or visiting URL: https://open.fedml.ai/accelerator_resource_type
# ------------------------------------------------------------
echo "Install some packages..."
echo "Install finished!"
27 changes: 27 additions & 0 deletions python/examples/deploy/quick_start/main_entry.py
@@ -0,0 +1,27 @@
from fedml.serving import FedMLPredictor
from fedml.serving import FedMLInferenceRunner


class Bot(FedMLPredictor):  # Inherit from FedMLPredictor
    def __init__(self):
        super().__init__()

        # --- Your model initialization code here ---

        # -------------------------------------------

    def predict(self, request: dict):
        input_dict = request
        question: str = input_dict.get("text", "").strip()

        # --- Your model inference code here ---
        response = "I do not know the answer to your question."
        # ---------------------------------------

        return {"generated_text": f"The answer to your question {question} is: {response}"}


if __name__ == "__main__":
    chatbot = Bot()
    fedml_inference_runner = FedMLInferenceRunner(chatbot)
    fedml_inference_runner.run()
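The request/response contract of this template can be sanity-checked without the fedml runtime installed. The sketch below mirrors `predict()` in a plain class; `BotLogic` is a hypothetical stand-in for illustration, not part of the FedML API:

```python
class BotLogic:
    """Hypothetical stand-in that mirrors Bot.predict without fedml installed."""

    def predict(self, request: dict) -> dict:
        # Same logic as the template above: read "text", strip whitespace,
        # and wrap a canned response into the expected output key.
        question: str = request.get("text", "").strip()
        response = "I do not know the answer to your question."
        return {"generated_text": f"The answer to your question {question} is: {response}"}


out = BotLogic().predict({"text": "  What is TensorOpera?  "})
print(out["generated_text"])
# → The answer to your question What is TensorOpera? is: I do not know the answer to your question.
```

Note that the contract is dict-in, dict-out: callers should expect the reply under the `"generated_text"` key, matching the template.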
Empty file.
Empty file.
Empty file.
68 changes: 0 additions & 68 deletions python/examples/deploy/quick_start/src/app/pipe/constants.py

This file was deleted.

