Model Deployment

  • Model deployment is the process of packaging a trained model with its code and dependencies, exposing it in a production environment so that applications, dashboards, or APIs can send data and receive predictions, and running it reliably with monitoring and security.

Data scientists build and train the model, while data engineers ensure it runs reliably, securely, and efficiently in production.

Key Goals of a Deployment Task

  • Correctness:
    • The system must be deterministic.
    • For any given input, the output must be identical, regardless of how many times it’s called or when.
    • This is the foundation of reliable systems.
    • For example, a function calculateTax(income=50000) must always return the same amount (e.g., 12500). It should never return a different value due to internal state or randomness.
  • Performance:
    • The system must meet its speed (latency) and volume (throughput) requirements.
    • Low latency means quick responses, while high throughput means handling many requests per second.
    • For example, an autocomplete API must return suggestions in under 100ms (low latency).
    • A data ingestion service must process 10,000 events/second (high throughput).
  • Scalability:
    • The system can handle a sudden, significant increase in load by adding resources (scaling out) without failing or degrading performance critically.
    • For example, an e-commerce website scales its front-end servers from 10 to 100 instances automatically to handle a flash sale, preventing a crash.
  • Safety:
    • The system is protected from threats (security), can be updated without downtime (versioning), and can quickly revert to a previous stable state if a new version fails (rollbacks).
    • For example, deploying a new API version (v2) alongside v1. If v2 has a critical bug, traffic is instantly routed back to the stable v1 (rollback), all while enforcing authentication (security).
  • Observability:
    • The system provides deep internal visibility through structured logs, performance metrics, request traces, and notifications for when its behavior deviates from the expected norm (drift).
    • For example, a user reports an error: an engineer uses a trace ID to find the specific log entry, sees the slow database query in the metrics, and identifies the faulty microservice, all triggered by an alert on rising error rates.

Drift: A gradual and unintended deviation of a system from its expected performance, behavior, or resource usage baseline.

Key Approaches for ML Model Deployment

A) Batch (Offline) Scoring

  • This method processes pre-defined datasets on a scheduled trigger (e.g., Airflow DAG, cron job) using compute engines like Spark or Pandas on a cluster.
  • It is designed for generating bulk predictions, such as populating a nightly customer churn table in a data warehouse.
  • The architecture is simple and cost-optimized for large volumes but introduces inherent latency, since predictions are unavailable until the next job completes.
  • Example: An e-commerce company runs a daily job to predict which customers are most likely to churn in the next 30 days, saving the list to a database for the marketing team.
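A minimal sketch of such a nightly churn-scoring job, assuming a scikit-learn model saved as churn_model.pkl; the connection string, table, and column names are invented for illustration:

import pickle

import pandas as pd
from sqlalchemy import create_engine

# Connection string and table names are assumptions for illustration.
engine = create_engine("postgresql://user:password@warehouse-host/analytics")

with open("churn_model.pkl", "rb") as f:
    model = pickle.load(f)

# Pull the full customer set, score it in bulk, and write the results back.
customers = pd.read_sql(
    "SELECT customer_id, recency, frequency, spend FROM customers", engine
)
features = customers[["recency", "frequency", "spend"]]
customers["churn_probability"] = model.predict_proba(features)[:, 1]

customers[["customer_id", "churn_probability"]].to_sql(
    "churn_predictions", engine, if_exists="replace", index=False
)

A scheduler such as cron or an Airflow DAG would invoke this script once per night.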

B) Online (Synchronous) Inference

  • Model is hosted as a containerized microservice (e.g., FastAPI/Flask in Docker) behind a load balancer and API Gateway.
  • It serves individual predictions over HTTP/REST or gRPC with strict latency SLAs (e.g., <100ms).
  • This is mandatory for real-time applications like credit card fraud scoring.
  • Production readiness requires autoscaling, health checks, and robust service discovery to handle volatile traffic.
  • Example: A bank’s website calls a fraud detection API in real time to approve or decline a credit card transaction during checkout.
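A minimal sketch of such a service with FastAPI; the request fields and the model.pkl file are assumptions for illustration:

import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load the model once at startup, not per request.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

class Transaction(BaseModel):
    amount: float
    merchant_risk: float

@app.post("/score")
def score(tx: Transaction):
    # Return a fraud probability for a single transaction.
    prob = model.predict_proba([[tx.amount, tx.merchant_risk]])[0][1]
    return {"fraud_probability": float(prob)}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000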

C) Streaming/Event-Driven (Asynchronous)

  • Inference is embedded within a stream processing framework (e.g., Apache Flink, Kafka Streams).
  • The application consumes events from a message broker (Kafka, Kinesis), scores each record, and emits the result to a new topic.
  • This enables high-throughput, near-real-time processing for use cases like real-time alerting.
  • The complexity lies in managing state and ensuring fault-tolerant delivery semantics.
  • Example: A ride-sharing app calculates ETA and surge pricing in real-time by continuously processing streaming location data from drivers and passengers.
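A minimal sketch with the kafka-python client, assuming JSON events that carry id and features fields and a pickled model; the topic names are invented:

import json
import pickle

from kafka import KafkaConsumer, KafkaProducer  # kafka-python package

with open("model.pkl", "rb") as f:
    model = pickle.load(f)

consumer = KafkaConsumer(
    "events-in",                          # assumed input topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Score each event as it arrives and emit the result to a new topic.
for event in consumer:
    score = float(model.predict([event.value["features"]])[0])
    producer.send("events-scored", {"id": event.value["id"], "score": score})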

D) Edge / On-Device Inference

  • The model is converted to an optimized format (TFLite, ONNX, CoreML) and compiled into a mobile or IoT application.
  • Inference executes locally on the device’s hardware, often leveraging dedicated NPUs/GPUs for performance.
  • This is used for offline-capable applications (e.g., photo style transfer) or where latency is critical (e.g., autonomous robot navigation).
  • The constraint is the model’s size and complexity, which must fit the device’s limited computing and memory resources.
  • Example: The iPhone’s Face ID system runs a neural network on its dedicated Neural Engine to authenticate users without sending data to the cloud.
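A minimal sketch of local inference with the tflite-runtime package, assuming the model has already been converted to model.tflite; the dummy input stands in for real camera or sensor data:

import numpy as np
from tflite_runtime.interpreter import Interpreter

interpreter = Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input shaped and typed to match the model's expected tensor.
x = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], x)
interpreter.invoke()
prediction = interpreter.get_tensor(output_details[0]["index"])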

E) In-Database Inference

  • Uses the compute power of modern MPP data warehouses (Snowflake, BigQuery, Redshift) to run inference inside the database engine via SQL UDFs or built-in ML functions.
  • This eliminates data movement by scoring data directly at rest, ideal for creating massive batch prediction sets for BI dashboards.
  • Performance and cost are directly tied to the data platform’s SQL execution engine.
  • Example: A retailer uses Snowflake to score all customer records in its data warehouse for lifetime value prediction without moving any data to an external system.
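The same pattern applies across warehouses; here is a hedged sketch using BigQuery ML issued from Python, where the dataset, model, and column names are invented (ML.PREDICT emits a predicted_<label> column):

from google.cloud import bigquery

client = bigquery.Client()

# The SELECT runs entirely inside the warehouse; no data leaves BigQuery.
sql = """
CREATE OR REPLACE TABLE shop.ltv_predictions AS
SELECT customer_id, predicted_ltv
FROM ML.PREDICT(MODEL `shop.ltv_model`,
                (SELECT * FROM `shop.customers`))
"""
client.query(sql).result()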

F) Serverless Functions

  • The model is packaged into a serverless function (AWS Lambda, Google Cloud Functions) with a lightweight runtime.
  • It is triggered by HTTP events or from a message queue.
  • The platform manages scaling from zero to handle traffic spikes, making it cost-effective for intermittent or unpredictable workloads.
  • The key technical challenge is mitigating cold-start latency, often by using provisioned concurrency or optimizing the package size.
  • Example: A mobile app that uses image recognition for plant identification. A user uploads a photo, triggering a Lambda function to score the image and return the result. Traffic is spiky and unpredictable.
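A minimal sketch of the Lambda handler for this flow; load_model and classify are hypothetical helpers standing in for your framework’s loading and inference calls:

import base64
import json

# Hypothetical: load once at container init so warm invocations reuse it.
model = load_model("/opt/model")

def lambda_handler(event, context):
    # API Gateway delivers the uploaded photo as a base64-encoded body.
    image_bytes = base64.b64decode(event["body"])
    label, confidence = classify(model, image_bytes)  # hypothetical helper
    return {
        "statusCode": 200,
        "body": json.dumps({"label": label, "confidence": confidence}),
    }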

Model Deployment Lifecycle

1. Package the model + code + environment.

  • Bundle the trained model file, inference code, and all software dependencies into a single, reproducible unit. This ensures it runs the same everywhere.
  • For example, using a Dockerfile to create an image that includes Python, TensorFlow, your predict.py script, and the saved model.h5 file.

2. Serve the model via API, batch job, or stream.

  • Expose the model’s functionality through an API for real-time responses, run it on a schedule for bulk processing, or integrate it into a data stream.
  • For example, creating a Flask/FastAPI endpoint that returns a loan approval prediction. A separate batch job runs nightly to score all new user sign-ups.

3. Ship with CI/CD so changes are tested and automated.

  • Automate testing and deployment using pipelines.
  • Code and model changes are automatically validated and deployed to production upon passing tests.
  • For example, a GitHub Action pipeline that runs unit tests on the inference code, builds a new Docker image, and deploys it to a staging environment when a pull request is merged.

4. Run in containers/orchestrators with load balancing.

  • Deploy the packaged model inside containers managed by an orchestrator. This provides scalability, resilience, and efficient resource usage.
  • For example, deploying multiple container replicas of your model API on Kubernetes, which automatically distributes incoming traffic across them and restarts any that fail.

An orchestrator is a system that automates the deployment, management, scaling, and networking of containers.
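A minimal sketch of a Kubernetes Deployment manifest for such a model API; the image name, port, replica count, and /health endpoint are assumptions:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-api
spec:
  replicas: 3                      # three identical copies of the model API
  selector:
    matchLabels:
      app: model-api
  template:
    metadata:
      labels:
        app: model-api
    spec:
      containers:
        - name: model-api
          image: registry.example.com/model-api:1.0   # assumed image
          ports:
            - containerPort: 8000
          livenessProbe:           # restart the pod if the API stops responding
            httpGet:
              path: /health
              port: 8000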

5. Safely roll out new versions (A/B, canary, shadow).

  • Deploy new model versions to a small subset of users/traffic first to validate performance and minimize risk before a full rollout.
  • For example, using a canary release to send 5% of live API traffic to a new model version. If error rates stay low, gradually increase the traffic to 100%.

A/B Testing: Directing different user segments to two distinct versions (A and B) to statistically compare a specific business metric.

Canary Release: Gradually rolling out a new version to a small, increasing percentage of users to minimize the impact of potential failures.

Shadow Mode: Sending a copy of live traffic to the new version without affecting the user’s response, to validate performance against the current version in production.
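As a simplified sketch, a canary split can even be expressed at the application level (in production the split usually lives in the load balancer or service mesh); model_v1 and model_v2 are hypothetical:

import random

CANARY_FRACTION = 0.05  # start by sending ~5% of traffic to the new model

def route(features):
    if random.random() < CANARY_FRACTION:
        return model_v2.predict(features)  # canary version (hypothetical)
    return model_v1.predict(features)      # stable version (hypothetical)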

6. Monitor model + system health; alert and retrain as needed.

  • Track system metrics (latency, errors) and model metrics (accuracy, drift). Trigger alerts for degradation and initiate retraining pipelines.
  • For example, a dashboard monitors prediction drift; an alert fires when drift exceeds a threshold, triggering a pipeline to retrain the model on fresh data and redeploy it.
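A minimal sketch of one common drift check, the Population Stability Index; training_scores, live_scores, and trigger_retraining_pipeline are hypothetical placeholders:

import numpy as np

def psi(expected, actual, bins=10):
    # Population Stability Index between a baseline and a live distribution.
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + 1e-6
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# 0.2 is a common rule-of-thumb threshold for significant drift.
if psi(training_scores, live_scores) > 0.2:
    trigger_retraining_pipeline()  # hypothetical alerting/retraining hook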

Deploying a Machine Learning Model with Flask (web app deployment using a local web server)

What is Flask Deployment?

  • Flask is a Python web framework.
  • We use it to:
    • Take input from a user through a web form (HTML page).
    • Send that input to our machine learning model.
    • Show the prediction result back on the webpage.

This is called local deployment because it runs on our own computer.

Project Structure

Our goal is to create a new project folder as follows:

project/
├── app.py
├── model.pkl
├── train_model.py
└── templates/
    └── index.html

  • train_model.py → script to train and save the model.
  • model.pkl → saved ML model.
  • app.py → Flask app that connects model with web page.
  • templates/index.html → the webpage form.

Train and Save a Model

  • Write this code in an IDE like Spyder or PyCharm.
  • Save the file as train_model.py.
  • Run the code to create a pickle file, model.pkl.
  • For example, we will create a model to predict Profit from R&D, Marketing, and Admin Spend.

Pickle is a Python module used to save (serialize) Python objects into a file so that you can load (deserialize) them later. The saved file usually has the extension .pkl (or sometimes .pickle).

# Import Necessary Libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
import pickle

# Example dataset
data = {
    'R&D Spend': [20000, 30000, 40000, 50000, 60000],
    'Marketing Spend': [10000, 15000, 20000, 25000, 30000],
    'Admin Spend': [12000, 13000, 14000, 15000, 16000],
    'Profit': [22000, 33000, 45000, 58000, 70000]
}
df = pd.DataFrame(data)

# Features and target
X = df[['R&D Spend', 'Marketing Spend', 'Admin Spend']]
y = df['Profit']

# Train model
model = LinearRegression()
model.fit(X, y)

# Save model
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

print("Model saved as model.pkl")

Build the Flask App

  • Write the code below in an IDE like Spyder or PyCharm.
  • Load the model.pkl file we just created.
'''Import necessary libraries like 
Flask → a Python framework to make websites.
render_template → used to load HTML pages (from templates folder).
request → allows Flask to get values typed by the user in the form.
pickle → loads the saved ML model (model.pkl).
'''

from flask import Flask, render_template, request
import pickle
import numpy as np

'''Creating a Flask application object called app. Think of this as starting your web server. '''

app = Flask(__name__)

# Load saved model
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

'''@app.route('/') defines a route. '/' = the homepage (http://127.0.0.1:5000/). When someone opens the homepage, Flask will show index.html.'''

@app.route('/') 
def home():
    return render_template('index.html')

'''Create another route, /predict. It is triggered when a user submits the form (method=POST).'''

@app.route('/predict', methods=['POST'])
def predict():
    try:
        # Get the user's input values from the form (rd, marketing, admin)
        # and convert them to numbers (float).
        rd = float(request.form['rd'])
        marketing = float(request.form['marketing'])
        admin = float(request.form['admin'])

        # Create a 2D NumPy array (the model expects a matrix of features).
        # model.predict() returns an array, so [0] takes the first value.
        features = np.array([[rd, marketing, admin]])
        prediction = model.predict(features)[0]

        # Send the prediction back to index.html; prediction_text is the
        # template variable shown inside the HTML.
        return render_template('index.html', prediction_text=f"Predicted Profit: ${prediction:.2f}")
    except (ValueError, KeyError):
        return render_template('index.html', prediction_text="Please enter valid numbers.")

'''Runs the Flask app. debug=True means Flask will automatically reload when you change code and also show detailed error messages.'''

if __name__ == "__main__":
    app.run(debug=True)

Note: I have put comments on each step inside the above code to give you a clear understanding; feel free to skip them when copying.

Create the Webpage

  • Write the HTML code below in Notepad (Windows), TextEdit (Mac), or any plain text editor.
  • Save the file as index.html (the name our Flask app passes to render_template).
  • Open it in a web browser (Chrome, Edge, Firefox) to check the layout; the {{ prediction_text }} placeholder will only be filled in when served through Flask.
  • Keep this HTML file inside the templates folder so Flask can find it.
<!DOCTYPE html>
<html>
<head>
    <title>Fashion Retail Profit Predictor</title>
    <style>
        body {
            font-family: Arial, sans-serif;
            text-align: center;
            margin-top: 50px;
            background: linear-gradient(135deg, #95f8f8, #00c6ff, #0072ff);
            color: #fff;
        }
        form {
            background: rgba(0, 0, 0, 0.3);
            display: inline-block;
            padding: 25px 40px;
            border-radius: 20px;
            box-shadow: 0px 8px 20px rgba(0,0,0,0.2);
        }
        input[type="text"] {
            width: 250px;
            padding: 10px;
            margin: 10px 0;
            border-radius: 10px;
            border: none;
            outline: none;
            font-size: 15px;
            text-align: center;
        }
        input[type="submit"] {
            background-color: #ff9800;
            border: none;
            padding: 12px 25px;
            border-radius: 25px;
            font-size: 16px;
            font-weight: bold;
            cursor: pointer;
            transition: 0.3s;
            color: #fff;
        }
        input[type="submit"]:hover {
            background-color: #e68900;
            transform: scale(1.05);
        }
    </style>
</head>
<body>
    <h2>💹 Fashion Retail Profit Predictor</h2>
    <form action="/predict" method="post">
        <label>R&D Spend:</label><br>
        <input type="text" name="rd" placeholder="e.g. 20000 USD"><br>

        <label>Marketing Spend:</label><br>
        <input type="text" name="marketing" placeholder="e.g. 15000 USD"><br>

        <label>Admin Spend:</label><br>
        <input type="text" name="admin" placeholder="e.g. 12000 USD"><br>

        <input type="submit" value="🔮 Predict Profit">
    </form>
    <h3>{{ prediction_text }}</h3>
</body>
</html>

Run the App

  • Go to the project folder we just created.
  • Right-click and choose “Open in Terminal”.
  • Type python app.py and press Enter.
  • It will print a link like http://127.0.0.1:5000.

Use the Web App

  • Open that link in your browser.
  • Enter the input values.
  • Click the Predict Profit button.
  • The prediction value will appear.

To understand the full process, watch the video I have created below.

This is referred to as local Flask model deployment. Later, we can deploy it to Heroku, AWS, or Streamlit Cloud to share it with the world.

What Next?

Cloud Deployment to share with the world 

  • Hosting our Flask app online so anyone can access it via a URL.

There are different options as follows:

| Platform | Pros | Cons | Use Case |
|---|---|---|---|
| Heroku | Easy, beginner-friendly, integrates with Git | Free tier sleeps after inactivity | Small to medium ML apps, prototypes, learning deployments |
| Render | Similar to Heroku, simple | Slightly less documentation than Heroku | Web apps and APIs for side projects or demos |
| Streamlit Cloud | Minimal coding to turn ML scripts into apps | Limited customization for complex apps | Quick ML dashboards or data visualization apps |
| AWS / Azure / GCP | Full control, scalable, professional | Steeper learning curve, costs money | Production-level apps, large-scale APIs, enterprise solutions |

Containerization (for professional projects)

  • Using Docker to package your app + ML model + dependencies into a single “container”.
  • Ensures it runs exactly the same on any machine or cloud server.
  • Makes deployment portable and reproducible.
  • Often used in professional ML pipelines.

Steps:

  1. Create a Dockerfile specifying the Python version, dependencies, and the command to run your app.
  2. Build the Docker image:

docker build -t fashion-app .

  3. Run the container locally:

docker run -p 5000:5000 fashion-app

  4. Push it to a cloud platform (AWS ECS, GCP Cloud Run, or Azure Container Instances).
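A minimal sketch of such a Dockerfile for the Flask app above, assuming a requirements.txt that lists flask, numpy, pandas, and scikit-learn; note that inside a container app.run() must bind host="0.0.0.0" to be reachable:

FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so Docker can cache this layer.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy app.py, model.pkl, and the templates/ folder.
COPY . .

EXPOSE 5000
CMD ["python", "app.py"]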

API Deployment (backend style)

  • Instead of a full webpage, our app exposes an API endpoint (like /predict).
  • Apps can send data (JSON) and receive predictions.
  • Makes our ML model usable in mobile apps, dashboards, or other software.
  • Easier to integrate into real-world systems.

Example (Flask API route):

# Requires: from flask import request, jsonify
@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    prediction = model.predict([data['features']])
    return jsonify({"profit": float(prediction[0])})
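To try it out, any HTTP client can post JSON to the endpoint, e.g. with the requests library:

import requests

resp = requests.post(
    "http://127.0.0.1:5000/predict",
    json={"features": [20000, 15000, 12000]},
)
print(resp.json())  # {"profit": ...}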

CI/CD and Production Practices

  • CI/CD (Continuous Integration / Continuous Deployment) automates testing and deployment whenever you update your app or model.
  • Production practices include logging, monitoring, and error handling.
  • Ensures your app stays up-to-date without manual deployment.
  • Helps track bugs and usage in real time.
  • Makes your project professional and maintainable.

Steps:

  1. Set up GitHub Actions or GitLab CI/CD to test code automatically.
  2. Deploy automatically to cloud after successful tests.
  3. Use logging libraries (logging in Python) to track errors.
  4. Monitor app usage and performance (like response time, error rate).
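A minimal sketch of such a GitHub Actions workflow (e.g. saved as .github/workflows/deploy.yml); the deploy step is a placeholder for your cloud provider’s CLI:

name: test-and-deploy
on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: pytest                      # run unit tests on the inference code

  deploy:
    needs: test                          # deploy only if tests pass
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: echo "deploy to your cloud here"   # e.g., build and push the Docker image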
