Model Deployment

  • Model deployment is the process of packaging a trained model with its code and dependencies, exposing it in a production environment so that applications, dashboards, or APIs can send data and receive predictions, and running it reliably with monitoring and security.

Data scientists build and train the model, while data engineers ensure it runs reliably, securely, and efficiently in production.

Key Goals of a Deployment Task

  • Correctness:
    • The system must be deterministic.
    • For any given input, the output must be identical, regardless of how many times it’s called or when.
    • This is the foundation of reliable systems.
    • For example, a function calculateTax(income=50000) must always return the same amount (e.g., 12500). It should never return a different value due to internal state or randomness.
  • Performance:
    • The system must meet its speed (latency) and volume (throughput) requirements.
    • Low latency means quick responses, while high throughput means handling many requests per second.
    • For example, an autocomplete API must return suggestions in under 100ms (low latency).
    • A data ingestion service must process 10,000 events/second (high throughput).
  • Scalability:
    • The system can handle a sudden, significant increase in load by adding resources (scaling out) without failing or degrading performance critically.
    • For example, an e-commerce website scales its front-end servers from 10 to 100 instances automatically to handle a flash sale, preventing a crash.
  • Safety:
    • The system is protected from threats (security), can be updated without downtime (versioning), and can quickly revert to a previous stable state if a new version fails (rollbacks).
    • For example, deploying a new API version (v2) alongside v1. If v2 has a critical bug, traffic is instantly routed back to the stable v1 (rollback), all while enforcing authentication (security).
  • Observability:
    • The system provides deep internal visibility through structured logs, performance metrics, request traces, and notifications for when its behavior deviates from the expected norm (drift).
    • For example, a user reports an error: an engineer uses a trace ID to find the specific log entry, sees the slow database query in the metrics, and identifies the faulty microservice, all triggered by an alert on rising error rates.

Drift: A gradual and unintended deviation of a system from its expected performance, behavior, or resource usage baseline.

Key Approaches for ML Model Deployment

A) Batch (Offline) Scoring

  • This method processes pre-defined datasets on a scheduled trigger (e.g., Airflow DAG, cron job) using compute engines like Spark or Pandas on a cluster.
  • It is designed for generating bulk predictions, such as populating a nightly customer churn table in a data warehouse.
  • The architecture is simple and cost-optimized for large volumes but introduces inherent latency, since predictions are unavailable until the next job completes.
  • Example: An e-commerce company runs a daily job to predict which customers are most likely to churn in the next 30 days, saving the list to a database for the marketing team.
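A minimal sketch of such a nightly churn-scoring job, assuming a scikit-learn model saved as churn_model.pkl; the connection string, table, and column names are invented for illustration:

import pickle

import pandas as pd
from sqlalchemy import create_engine

# Connection string and table names are assumptions for illustration.
engine = create_engine("postgresql://user:password@warehouse-host/analytics")

with open("churn_model.pkl", "rb") as f:
    model = pickle.load(f)

# Pull the full customer set, score it in bulk, and write the results back.
customers = pd.read_sql(
    "SELECT customer_id, recency, frequency, spend FROM customers", engine
)
features = customers[["recency", "frequency", "spend"]]
customers["churn_probability"] = model.predict_proba(features)[:, 1]

customers[["customer_id", "churn_probability"]].to_sql(
    "churn_predictions", engine, if_exists="replace", index=False
)

A scheduler such as cron or an Airflow DAG would invoke this script once per night.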

B) Online (Synchronous) Inference

  • Model is hosted as a containerized microservice (e.g., FastAPI/Flask in Docker) behind a load balancer and API Gateway.
  • It serves individual predictions over HTTP/REST or gRPC with strict latency SLAs (e.g., <100ms).
  • This is mandatory for real-time applications like credit card fraud scoring.
  • Production readiness requires autoscaling, health checks, and robust service discovery to handle volatile traffic.
  • Example: A bank’s website calls a fraud detection API in real time to approve or decline a credit card transaction during checkout.
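A minimal sketch of such a service with FastAPI; the request fields and the model.pkl file are assumptions for illustration:

import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load the model once at startup, not per request.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

class Transaction(BaseModel):
    amount: float
    merchant_risk: float

@app.post("/score")
def score(tx: Transaction):
    # Return a fraud probability for a single transaction.
    prob = model.predict_proba([[tx.amount, tx.merchant_risk]])[0][1]
    return {"fraud_probability": float(prob)}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000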

C) Streaming/Event-Driven (Asynchronous)

  • Inference is embedded within a stream processing framework (e.g., Apache Flink, Kafka Streams).
  • The application consumes events from a message broker (Kafka, Kinesis), scores each record, and emits the result to a new topic.
  • This enables high-throughput, near-real-time processing for use cases like real-time alerting.
  • The complexity lies in managing state and ensuring fault-tolerant delivery semantics.
  • Example: A ride-sharing app calculates ETA and surge pricing in real-time by continuously processing streaming location data from drivers and passengers.
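A minimal sketch with the kafka-python client, assuming JSON events that carry id and features fields and a pickled model; the topic names are invented:

import json
import pickle

from kafka import KafkaConsumer, KafkaProducer  # kafka-python package

with open("model.pkl", "rb") as f:
    model = pickle.load(f)

consumer = KafkaConsumer(
    "events-in",                          # assumed input topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Score each event as it arrives and emit the result to a new topic.
for event in consumer:
    score = float(model.predict([event.value["features"]])[0])
    producer.send("events-scored", {"id": event.value["id"], "score": score})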

D) Edge / On-Device Inference

  • The model is converted to an optimized format (TFLite, ONNX, CoreML) and compiled into a mobile or IoT application.
  • Inference executes locally on the device’s hardware, often leveraging dedicated NPUs/GPUs for performance.
  • This is used for offline-capable applications (e.g., photo style transfer) or where latency is critical (e.g., autonomous robot navigation).
  • The constraint is the model’s size and complexity, which must fit the device’s limited computing and memory resources.
  • Example: The iPhone’s Face ID system runs a neural network on its dedicated Neural Engine to authenticate users without sending data to the cloud.
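A minimal sketch of local inference with the tflite-runtime package, assuming the model has already been converted to model.tflite; the dummy input stands in for real camera or sensor data:

import numpy as np
from tflite_runtime.interpreter import Interpreter

interpreter = Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input shaped and typed to match the model's expected tensor.
x = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], x)
interpreter.invoke()
prediction = interpreter.get_tensor(output_details[0]["index"])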

E) In-Database Inference

  • Uses the compute power of modern MPP data warehouses (Snowflake, BigQuery, Redshift) to run inference inside the database engine via SQL UDFs or built-in ML functions.
  • This eliminates data movement by scoring data directly at rest, ideal for creating massive batch prediction sets for BI dashboards.
  • Performance and cost are directly tied to the data platform’s SQL execution engine.
  • Example: A retailer uses Snowflake to score all customer records in its data warehouse for lifetime value prediction without moving any data to an external system.
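The same pattern applies across warehouses; here is a hedged sketch using BigQuery ML issued from Python, where the dataset, model, and column names are invented (ML.PREDICT emits a predicted_<label> column):

from google.cloud import bigquery

client = bigquery.Client()

# The SELECT runs entirely inside the warehouse; no data leaves BigQuery.
sql = """
CREATE OR REPLACE TABLE shop.ltv_predictions AS
SELECT customer_id, predicted_ltv
FROM ML.PREDICT(MODEL `shop.ltv_model`,
                (SELECT * FROM `shop.customers`))
"""
client.query(sql).result()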

F) Serverless Functions

  • The model is packaged into a serverless function (AWS Lambda, Google Cloud Functions) with a lightweight runtime.
  • It is triggered by HTTP events or from a message queue.
  • The platform manages scaling from zero to handle traffic spikes, making it cost-effective for intermittent or unpredictable workloads.
  • The key technical challenge is mitigating cold-start latency, often by using provisioned concurrency or optimizing the package size.
  • Example: A mobile app that uses image recognition for plant identification. A user uploads a photo, triggering a Lambda function to score the image and return the result. Traffic is spiky and unpredictable.
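A minimal sketch of the Lambda handler for this flow; load_model and classify are hypothetical helpers standing in for your framework’s loading and inference calls:

import base64
import json

# Hypothetical: load once at container init so warm invocations reuse it.
model = load_model("/opt/model")

def lambda_handler(event, context):
    # API Gateway delivers the uploaded photo as a base64-encoded body.
    image_bytes = base64.b64decode(event["body"])
    label, confidence = classify(model, image_bytes)  # hypothetical helper
    return {
        "statusCode": 200,
        "body": json.dumps({"label": label, "confidence": confidence}),
    }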

Model Deployment Lifecycle

1. Package the model + code + environment.

  • Bundle the trained model file, inference code, and all software dependencies into a single, reproducible unit. This ensures it runs the same everywhere.
  • For example, using a Dockerfile to create an image that includes Python, TensorFlow, your predict.py script, and the saved model.h5 file.

2. Serve the model via API, batch job, or stream.

  • Expose the model’s functionality through an API for real-time responses, run it on a schedule for bulk processing, or integrate it into a data stream.
  • For example, creating a Flask/FastAPI endpoint that returns a loan approval prediction. A separate batch job runs nightly to score all new user sign-ups.

3. Ship with CI/CD so changes are tested and automated.

  • Automate testing and deployment using pipelines.
  • Code and model changes are automatically validated and deployed to production upon passing tests.
  • For example, a GitHub Action pipeline that runs unit tests on the inference code, builds a new Docker image, and deploys it to a staging environment when a pull request is merged.

4. Run in containers/orchestrators with load balancing.

  • Deploy the packaged model inside containers managed by an orchestrator. This provides scalability, resilience, and efficient resource usage.
  • For example, deploying multiple container replicas of your model API on Kubernetes, which automatically distributes incoming traffic across them and restarts any that fail.

An orchestrator is a system that automates the deployment, management, scaling, and networking of containers.
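A minimal sketch of a Kubernetes Deployment manifest for such a model API; the image name, port, replica count, and /health endpoint are assumptions:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-api
spec:
  replicas: 3                      # three identical copies of the model API
  selector:
    matchLabels:
      app: model-api
  template:
    metadata:
      labels:
        app: model-api
    spec:
      containers:
        - name: model-api
          image: registry.example.com/model-api:1.0   # assumed image
          ports:
            - containerPort: 8000
          livenessProbe:           # restart the pod if the API stops responding
            httpGet:
              path: /health
              port: 8000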

5. Safely roll out new versions (A/B, canary, shadow).

  • Deploy new model versions to a small subset of users/traffic first to validate performance and minimize risk before a full rollout.
  • For example, using a canary release to send 5% of live API traffic to a new model version. If error rates stay low, gradually increase the traffic to 100%.

A/B Testing: Directing different user segments to two distinct versions (A and B) to statistically compare a specific business metric.

Canary Release: Gradually rolling out a new version to a small, increasing percentage of users to minimize the impact of potential failures.

Shadow Mode: Sending a copy of live traffic to the new version without affecting the user’s response, to validate performance against the current version in production.
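As a simplified sketch, a canary split can even be expressed at the application level (in production the split usually lives in the load balancer or service mesh); model_v1 and model_v2 are hypothetical:

import random

CANARY_FRACTION = 0.05  # start by sending ~5% of traffic to the new model

def route(features):
    if random.random() < CANARY_FRACTION:
        return model_v2.predict(features)  # canary version (hypothetical)
    return model_v1.predict(features)      # stable version (hypothetical)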

6. Monitor model + system health; alert and retrain as needed.

  • Track system metrics (latency, errors) and model metrics (accuracy, drift). Trigger alerts for degradation and initiate retraining pipelines.
  • For example, a dashboard monitors prediction drift; an alert fires when drift exceeds a threshold, triggering a pipeline to retrain the model on fresh data and redeploy it.
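A minimal sketch of one common drift check, the Population Stability Index; training_scores, live_scores, and trigger_retraining_pipeline are hypothetical placeholders:

import numpy as np

def psi(expected, actual, bins=10):
    # Population Stability Index between a baseline and a live distribution.
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + 1e-6
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# 0.2 is a common rule-of-thumb threshold for significant drift.
if psi(training_scores, live_scores) > 0.2:
    trigger_retraining_pipeline()  # hypothetical alerting/retraining hook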

Deploying a Machine Learning Model with Flask (web app deployment using a local web server)

What is Flask Deployment?

  • Flask is a Python web framework.
  • We use it to:
    • Take input from a user through a web form (HTML page).
    • Send that input to our machine learning model.
    • Show the prediction result back on the webpage.

This is called local deployment because it runs on our own computer.

Project Structure

Our goal is to create a new project folder as follows:

project/
├── app.py
├── model.pkl
├── train_model.py
└── templates/
    └── index.html

  • train_model.py → script to train and save the model.
  • model.pkl → saved ML model.
  • app.py → Flask app that connects model with web page.
  • templates/index.html → the webpage form.

Train and Save a Model

  • Write this code in an IDE like Spyder or PyCharm.
  • Save the file as train_model.py.
  • Run the code to create a pickle file, model.pkl.
  • For example, we will create a model to predict Profit from R&D, Marketing, and Admin Spend.

Pickle is a Python module used to save (serialize) Python objects into a file so that you can load (deserialize) them later. The saved file usually has the extension .pkl (or sometimes .pickle).

# Import Necessary Libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
import pickle

# Example dataset
data = {
    'R&D Spend': [20000, 30000, 40000, 50000, 60000],
    'Marketing Spend': [10000, 15000, 20000, 25000, 30000],
    'Admin Spend': [12000, 13000, 14000, 15000, 16000],
    'Profit': [22000, 33000, 45000, 58000, 70000]
}
df = pd.DataFrame(data)

# Features and target
X = df[['R&D Spend', 'Marketing Spend', 'Admin Spend']]
y = df['Profit']

# Train model
model = LinearRegression()
model.fit(X, y)

# Save model
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

print("Model saved as model.pkl")

Build the Flask App

  • Write the code below in an IDE like Spyder or PyCharm.
  • Load the model.pkl file we just created.
'''Import necessary libraries like 
Flask → a Python framework to make websites.
render_template → used to load HTML pages (from templates folder).
request → allows Flask to get values typed by the user in the form.
pickle → loads the saved ML model (model.pkl).
'''

from flask import Flask, render_template, request
import pickle
import numpy as np

'''Creating a Flask application object called app. Think of this as starting your web server. '''

app = Flask(__name__)

# Load saved model
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

'''@app.route('/') defines a route. '/' = the homepage (http://127.0.0.1:5000/). When someone opens the homepage, Flask will show index.html.'''

@app.route('/') 
def home():
    return render_template('index.html')

'''Create another route, /predict. It is triggered when a user submits the form (method=POST).'''

@app.route('/predict', methods=['POST'])
def predict():
    try:
        # Get the user's input values from the form (rd, marketing, admin)
        # and convert them to numbers (float).
        rd = float(request.form['rd'])
        marketing = float(request.form['marketing'])
        admin = float(request.form['admin'])

        # Create a 2D NumPy array (the model expects a matrix of features).
        # model.predict() returns an array, so [0] takes the first value.
        features = np.array([[rd, marketing, admin]])
        prediction = model.predict(features)[0]

        # Send the prediction back to index.html; prediction_text is the
        # template variable shown inside the HTML.
        return render_template('index.html', prediction_text=f"Predicted Profit: ${prediction:.2f}")
    except (ValueError, KeyError):
        return render_template('index.html', prediction_text="Please enter valid numbers.")

'''Runs the Flask app. debug=True means Flask will automatically reload when you change code and also show detailed error messages.'''

if __name__ == "__main__":
    app.run(debug=True)

Note: I have put comments on each step inside the above code to give you a clear understanding; feel free to skip them when copying.

Create the Webpage

  • Write the HTML code below in Notepad (Windows), TextEdit (Mac), or any plain text editor.
  • Save the file as index.html (the name our Flask app passes to render_template).
  • Open it in a web browser (Chrome, Edge, Firefox) to check the layout; the {{ prediction_text }} placeholder will only be filled in when served through Flask.
  • Keep this HTML file inside the templates folder so Flask can find it.
<!DOCTYPE html>
<html>
<head>
    <title>Fashion Retail Profit Predictor</title>
    <style>
        body {
            font-family: Arial, sans-serif;
            text-align: center;
            margin-top: 50px;
            background: linear-gradient(135deg, #95f8f8, #00c6ff, #0072ff);
            color: #fff;
        }
        form {
            background: rgba(0, 0, 0, 0.3);
            display: inline-block;
            padding: 25px 40px;
            border-radius: 20px;
            box-shadow: 0px 8px 20px rgba(0,0,0,0.2);
        }
        input[type="text"] {
            width: 250px;
            padding: 10px;
            margin: 10px 0;
            border-radius: 10px;
            border: none;
            outline: none;
            font-size: 15px;
            text-align: center;
        }
        input[type="submit"] {
            background-color: #ff9800;
            border: none;
            padding: 12px 25px;
            border-radius: 25px;
            font-size: 16px;
            font-weight: bold;
            cursor: pointer;
            transition: 0.3s;
            color: #fff;
        }
        input[type="submit"]:hover {
            background-color: #e68900;
            transform: scale(1.05);
        }
    </style>
</head>
<body>
    <h2>💹 Fashion Retail Profit Predictor</h2>
    <form action="/predict" method="post">
        <label>R&D Spend:</label><br>
        <input type="text" name="rd" placeholder="e.g. 20000 USD"><br>

        <label>Marketing Spend:</label><br>
        <input type="text" name="marketing" placeholder="e.g. 15000 USD"><br>

        <label>Admin Spend:</label><br>
        <input type="text" name="admin" placeholder="e.g. 12000 USD"><br>

        <input type="submit" value="🔮 Predict Profit">
    </form>
    <h3>{{ prediction_text }}</h3>
</body>
</html>

Run the App

  • Go to the project folder we just created.
  • Right-click and choose “Open in Terminal”.
  • Type python app.py and press Enter.
  • It will print a link like http://127.0.0.1:5000.

Use the Web App

  • Open that link in your browser.
  • Enter the input values.
  • Click the Predict Profit button.
  • The prediction value will appear.

To understand the full process, watch the video I have created below.

This is referred to as local Flask model deployment. Later, we can deploy it to Heroku, AWS, or Streamlit Cloud to share it with the world.

What Next?

Cloud Deployment to share with the world 

  • Hosting our Flask app online so anyone can access it via a URL.

There are different options as follows:

| Platform | Pros | Cons | Use Case |
|---|---|---|---|
| Heroku | Easy, beginner-friendly, integrates with Git | Free tier sleeps after inactivity | Small to medium ML apps, prototypes, learning deployments |
| Render | Similar to Heroku, simple | Slightly less documentation than Heroku | Web apps and APIs for side projects or demos |
| Streamlit Cloud | Minimal coding to turn ML scripts into apps | Limited customization for complex apps | Quick ML dashboards or data visualization apps |
| AWS / Azure / GCP | Full control, scalable, professional | Steeper learning curve, costs money | Production-level apps, large-scale APIs, enterprise solutions |

Containerization (for professional projects)

  • Using Docker to package your app + ML model + dependencies into a single “container”.
  • Ensures it runs exactly the same on any machine or cloud server.
  • Makes deployment portable and reproducible.
  • Often used in professional ML pipelines.

Steps:

  1. Create a Dockerfile specifying the Python version, dependencies, and the command to run your app.
  2. Build the Docker image:

docker build -t fashion-app .

  3. Run the container locally:

docker run -p 5000:5000 fashion-app

  4. Push it to a cloud platform (AWS ECS, GCP Cloud Run, or Azure Container Instances).
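A minimal sketch of such a Dockerfile for the Flask app above, assuming a requirements.txt that lists flask, numpy, pandas, and scikit-learn; note that inside a container app.run() must bind host="0.0.0.0" to be reachable:

FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so Docker can cache this layer.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy app.py, model.pkl, and the templates/ folder.
COPY . .

EXPOSE 5000
CMD ["python", "app.py"]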

API Deployment (backend style)

  • Instead of a full webpage, our app exposes an API endpoint (like /predict).
  • Apps can send data (JSON) and receive predictions.
  • Makes our ML model usable in mobile apps, dashboards, or other software.
  • Easier to integrate into real-world systems.

Example (Flask API route):

# Requires: from flask import request, jsonify
@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    prediction = model.predict([data['features']])
    return jsonify({"profit": float(prediction[0])})
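To try it out, any HTTP client can post JSON to the endpoint, e.g. with the requests library:

import requests

resp = requests.post(
    "http://127.0.0.1:5000/predict",
    json={"features": [20000, 15000, 12000]},
)
print(resp.json())  # {"profit": ...}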

CI/CD and Production Practices

  • CI/CD (Continuous Integration / Continuous Deployment) automates testing and deployment whenever you update your app or model.
  • Production practices include logging, monitoring, and error handling.
  • Ensures your app stays up-to-date without manual deployment.
  • Helps track bugs and usage in real time.
  • Makes your project professional and maintainable.

Steps:

  1. Set up GitHub Actions or GitLab CI/CD to test code automatically.
  2. Deploy automatically to cloud after successful tests.
  3. Use logging libraries (logging in Python) to track errors.
  4. Monitor app usage and performance (like response time, error rate).
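A minimal sketch of such a GitHub Actions workflow (e.g. saved as .github/workflows/deploy.yml); the deploy step is a placeholder for your cloud provider’s CLI:

name: test-and-deploy
on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: pytest                      # run unit tests on the inference code

  deploy:
    needs: test                          # deploy only if tests pass
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: echo "deploy to your cloud here"   # e.g., build and push the Docker image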
