FastAPI and Ray Clusters for Deploying ML Services⚡⚡
Deploying ML has never been this easy 🧩 welcome to a fast deployment journey for ML applications
In the rapidly evolving landscape of Machine Learning (ML) operations, deploying models from development to production efficiently and at scale remains a significant challenge. While building accurate models is crucial, their true value is unlocked only when they can serve real-world applications with high performance and reliability. This is where the powerful combination of FastAPI and Ray Clusters emerges as a game-changer in 2025.
Why FastAPI + Ray is a Powerful Duo for ML Deployment in 2025
⚡ Performance and Scalability: The demand for real-time inference and handling massive data streams has never been higher. Traditional ML deployment strategies often struggle to keep up. FastAPI, with its asynchronous capabilities and high performance (on par with NodeJS and Go [1]), provides an ideal foundation for building responsive ML APIs. When coupled with Ray, an open-source unified framework for scaling AI and Python applications [2], you unlock unparalleled horizontal scalability. Ray allows you to distribute your ML workloads across a cluster of machines, managing resources and orchestrating tasks seamlessly.
🏗️ Simplified Development and Deployment: FastAPI's automatic interactive API documentation (Swagger UI and ReDoc) simplifies API development and consumption. Its modern Python type hints ensure robust code and excellent editor support. Ray further streamlines distributed computing, abstracting away the complexities of parallel processing and inter-node communication. This means ML engineers can focus more on model development and less on infrastructure.
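The type hints mentioned above do double duty: they also drive runtime validation. A small sketch using Pydantic (the validation library FastAPI is built on), with an illustrative request model:

```python
from pydantic import BaseModel, ValidationError

class PredictionRequest(BaseModel):
    features: list[float]

# Well-formed input is parsed and coerced to the declared types
req = PredictionRequest(features=[1, "2.5"])
print(req.features)  # [1.0, 2.5]

# Malformed input is rejected before any handler code runs
try:
    PredictionRequest(features=["not-a-number"])
except ValidationError as exc:
    print(f"rejected: {len(exc.errors())} error(s)")
```

FastAPI applies exactly this validation to every incoming request body, returning a structured 422 response on failure, so the handler only ever sees clean data.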
🧩 Robust Ecosystem Integration: Both FastAPI and Ray boast thriving ecosystems. FastAPI integrates effortlessly with various data science libraries, while Ray provides specialized libraries for ML (Ray ML), reinforcement learning (Ray RLlib), and model serving (Ray Serve). This rich integration allows for end-to-end ML pipelines, from data preprocessing and model training to serving and monitoring, all within a unified framework.
In essence, FastAPI provides the high-performance, developer-friendly API layer, while Ray offers the distributed computing backbone necessary for scaling ML services to meet the demands of 2025 and beyond. This synergy empowers organizations to deploy and manage complex ML applications with unprecedented agility and efficiency.
🧩 Components of a FastAPI + Ray ML Deployment
To build a robust ML service with FastAPI and Ray, you'll typically combine several key components:
FastAPI: The web framework for creating the RESTful API endpoints that expose your ML models.
Ray: The distributed computing framework that handles parallel processing, resource management, and model serving (via Ray Serve).
Docker/Kubernetes: For containerization and orchestration of your application, ensuring portability and scalability across different environments.
Model Serialization: Techniques to save and load your trained ML models (e.g., pickle, joblib, ONNX, HDF5 for TensorFlow/Keras, PyTorch save/load).
ML Model: Your trained machine learning model (e.g., scikit-learn, TensorFlow, PyTorch).
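As a sketch of the containerization component, a minimal Dockerfile for the FastAPI service might look like the following; the base image, file names, and port are illustrative assumptions, not a prescribed setup:

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first to leverage Docker's layer cache
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code and the serialized model
COPY main.py model.pkl ./

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Baking the model file into the image keeps the container self-contained; for large or frequently updated models, mounting them from a volume or model registry is a common alternative.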
🐍 Code Examples: Bringing the Duo to Life
Example 1: A Simple FastAPI ML Inference Endpoint
Let's create a basic FastAPI application that loads a pre-trained scikit-learn model and provides an inference endpoint. For simplicity, we'll assume a dummy model is saved as model.pkl.
First, create a dummy model and save it (e.g., create_model.py):
# create_model.py
import pickle
from sklearn.linear_model import LogisticRegression

# Create a dummy model
model = LogisticRegression()

# In a real scenario, you would train your model here
# For demonstration, let's fit it with some dummy data
X = [[1, 2], [3, 4], [5, 6]]
y = [0, 1, 0]
model.fit(X, y)

# Save the model
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

print("Dummy model saved as model.pkl")
Now, the FastAPI application (main.py):
# main.py
from fastapi import FastAPI
from pydantic import BaseModel
import pickle
import numpy as np

app = FastAPI()

# Load the pre-trained model
try:
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)
    print("Model loaded successfully!")
except FileNotFoundError:
    print("Error: model.pkl not found. Please run create_model.py first.")
    model = None  # Handle case where model is not found

class PredictionRequest(BaseModel):
    features: list[float]

@app.post("/predict/")
async def predict(request: PredictionRequest):
    if model is None:
        return {"error": "Model not loaded."}
    features_array = np.array(request.features).reshape(1, -1)
    prediction = model.predict(features_array).tolist()
    return {"prediction": prediction}

# To run this FastAPI app, save it as main.py and run:
# uvicorn main:app --host 0.0.0.0 --port 8000
Example 2: A Ray Cluster Example for Scaling (Conceptual with Ray Serve)
Ray Serve is Ray's scalable model serving library. Here's a conceptual example of how you might deploy the same model using Ray Serve, allowing for easy scaling.
# serve_model.py
import ray
from ray import serve
from fastapi import FastAPI
from pydantic import BaseModel
import pickle
import numpy as np

# Start a local Ray instance (pass address="auto" only to join an already
# running cluster; with no address, ray.init() starts one locally)
if not ray.is_initialized():
    ray.init()

app = FastAPI()

class PredictionRequest(BaseModel):
    features: list[float]

@serve.deployment(num_replicas=1)
@serve.ingress(app)
class MLModelDeployment:
    def __init__(self):
        # Load the model once per replica
        try:
            with open("model.pkl", "rb") as f:
                self.model = pickle.load(f)
            print("Model loaded successfully within Ray Serve deployment!")
        except FileNotFoundError:
            print("Error: model.pkl not found. Please run create_model.py first.")
            self.model = None

    @app.post("/")
    async def predict(self, request: PredictionRequest):
        if self.model is None:
            return {"error": "Model not loaded."}
        features_array = np.array(request.features).reshape(1, -1)
        prediction = self.model.predict(features_array).tolist()
        return {"prediction": prediction}

# Deploy the model (in Ray 2.x, route_prefix is passed to serve.run,
# not to the @serve.deployment decorator)
ml_app = MLModelDeployment.bind()
serve.run(ml_app, route_prefix="/predict")

# To deploy and scale this, you would run:
# ray start --head                          # on the head node
# ray start --address=<head_node_ip>:6379   # on each worker node
# python serve_model.py                     # on the head node to deploy
# You can then scale by changing num_replicas in @serve.deployment
🏗️ Architecture of an ML Service with FastAPI & Ray
A typical request flows through these layers:
Client/User -> FastAPI Gateway (handles incoming API requests and input validation)
FastAPI Gateway -> Ray Serve Deployment (routes requests and orchestrates model inference)
Ray Serve Deployment -> ML Model Replicas (run the actual inference, potentially distributed across multiple nodes in a Ray Cluster)
Docker/Kubernetes surrounds the FastAPI and Ray Serve components, providing containerization and orchestration across environments.
📚 Resources for Deeper Learning
To master ML service deployment with FastAPI and Ray, consider these resources:
FastAPI Official Documentation: The best place to learn about FastAPI's features and best practices. [1]
Ray Documentation (especially Ray Serve): Comprehensive guides on building and deploying scalable applications with Ray. [2]
Online Courses/Tutorials: Search for "FastAPI ML deployment," "Ray Serve tutorial," or "MLOps with Kubernetes" on platforms like Udemy, Coursera, or YouTube. Look for courses that cover practical deployment scenarios.
GitHub Repositories: Explore open-source projects that demonstrate FastAPI and Ray integration for real-world ML applications.
💡 Practical Deployment Tips and Next Steps
Containerize Everything: Use Docker to package your FastAPI application and its dependencies. This ensures consistent environments from development to production.
Orchestration with Kubernetes: For production-grade deployments, Kubernetes (or similar container orchestration platforms) is essential for managing, scaling, and healing your FastAPI and Ray Serve deployments.
Monitoring and Logging: Implement robust monitoring (e.g., Prometheus, Grafana) and logging (e.g., ELK stack) to track the performance and health of your ML services.
CI/CD Pipelines: Automate your deployment process with Continuous Integration/Continuous Delivery (CI/CD) pipelines to ensure rapid and reliable updates.
Model Versioning and Management: Use tools like MLflow or DVC to manage different versions of your models and track experiments.
Security: Secure your FastAPI endpoints with authentication and authorization mechanisms. Ensure your Ray cluster is properly secured.
By combining the high performance of FastAPI with the distributed computing power of Ray, you are well-equipped to build and deploy scalable, robust, and efficient machine learning services in 2025. Start experimenting with these powerful tools, and you'll be at the forefront of ML deployment!
🔗 References
[1] FastAPI Documentation: https://fastapi.tiangolo.com/
[2] Ray Documentation: https://docs.ray.io/en/latest/