What is Machine Learning?

  • Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed.
  • It is a branch of artificial intelligence (AI) that focuses on using data and algorithms to imitate the way humans learn, so that a system's accuracy gradually improves.
  • Some innovative products are based on machine learning, such as Netflix’s recommendation engine and self-driving cars.
  • Machine learning is an important component of the growing field of data science.
  • Through the use of statistical methods, algorithms are trained to make classifications or predictions to uncover key insights in data mining projects.
  • Difference between traditional programming and machine learning:
    • Traditional programming: data + rules = output
    • Machine learning: data + output = rules
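
The contrast above can be sketched in a few lines. This is only an illustration under stated assumptions: the temperature-conversion task, the function names, and the sample values are all made up for the example. The first function hard-codes the rule; the second recovers the same rule from data and outputs using ordinary least squares.

```python
# Traditional programming: the rule is written by hand.
def celsius_to_fahrenheit(celsius):
    return celsius * 1.8 + 32.0  # data + rule -> output


# Machine learning: the rule (slope and intercept) is estimated from examples.
def fit_line(xs, ys):
    """Ordinary least squares for a single input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


celsius = [-10.0, 0.0, 10.0, 20.0, 30.0]                  # data
fahrenheit = [celsius_to_fahrenheit(c) for c in celsius]  # output
slope, intercept = fit_line(celsius, fahrenheit)          # data + output -> rule
print(f"learned rule: f = {slope:.2f} * c + {intercept:.2f}")
```

Because the example outputs are noise-free, the fitted slope and intercept match the hand-written rule; with real, noisy data the learned rule would only approximate it.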

Types of Approaches in Machine Learning:

Supervised ML:

  • Supervised machine learning is defined by its use of labeled datasets to train algorithms to classify data or predict outcomes accurately.
  • Some methods used in supervised learning include neural networks, naïve Bayes, linear regression, logistic regression, random forest, and support vector machine (SVM).
  • Types of supervised learning:
    • Regression: predicts a continuous value (e.g., a price).
    • Classification: predicts a discrete target label (e.g., spam or not spam).

Unsupervised ML:

  • Unsupervised machine learning uses machine learning algorithms to analyze and cluster unlabeled datasets.
  • This method’s ability to discover similarities and differences in information makes it ideal for exploratory data analysis, cross-selling strategies, customer segmentation, and image and pattern recognition.
  • Principal component analysis (PCA) and singular value decomposition (SVD) are two common approaches for reducing the dimensionality of such data, while algorithms such as k-means group it into clusters.
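
To make the idea of clustering unlabeled data concrete, here is a minimal k-means sketch for one-dimensional data. It is an illustration, not a production implementation: the spending figures are made up, the starting centroids are chosen by hand, and the function name `k_means_1d` is hypothetical.

```python
def k_means_1d(values, centroids, iterations=10):
    """Tiny k-means for one-dimensional data with given starting centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


monthly_spend = [12, 15, 14, 90, 95, 88, 13, 92]  # unlabeled, made-up data
centroids, clusters = k_means_1d(monthly_spend, centroids=[0.0, 100.0])
print(centroids)
print(clusters)
```

No labels are supplied anywhere; the two groups (low and high spenders) emerge from the distances between the values alone, which is the sense in which this kind of learning is unsupervised.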

Reinforcement ML:

  • Reinforcement machine learning is similar to supervised learning in that the algorithm learns from feedback, but it isn't trained on labeled sample data. Instead, the model learns as it goes by trial and error, guided by rewards for the actions it takes.
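
Trial-and-error learning can be sketched with a multi-armed bandit, one of the simplest reinforcement learning settings. The example below assumes an epsilon-greedy strategy, which is one common choice among several; the reward probabilities, the function name `run_bandit`, and its parameters are made up for illustration.

```python
import random


def run_bandit(true_reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: explore at random sometimes, otherwise exploit."""
    rng = random.Random(seed)
    n_arms = len(true_reward_probs)
    pulls = [0] * n_arms
    estimates = [0.0] * n_arms  # running average reward per action
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # trial: try a random action
        else:
            arm = max(range(n_arms), key=lambda i: estimates[i])  # exploit
        reward = 1.0 if rng.random() < true_reward_probs[arm] else 0.0
        pulls[arm] += 1
        # Incremental mean: learn from the reward just observed.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return estimates, pulls


estimates, pulls = run_bandit([0.2, 0.5, 0.8])
best_arm = max(range(len(estimates)), key=lambda i: estimates[i])
print("estimated rewards:", [round(e, 2) for e in estimates])
print("best action found:", best_arm)
```

The agent is never told which action is best. It only sees the reward after each attempt and gradually shifts towards the action with the highest estimated payoff.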

Difference between AI, ML, DL, and DS:

  • Artificial Intelligence (AI) enables machines to reason by learning from data and making decisions or inferences based on patterns hidden in that data, including patterns that would be very difficult for humans to find manually. The end goal of using ML or DL is to create an AI application or machine as smart as humans.
    • Areas of Artificial Intelligence:
      1. Computer Vision
      2. Natural Language Processing
      3. Machine Learning & Deep Learning
      4. Decision Making
      5. Robotics
  • Machine Learning (ML) is a subset of AI. It provides statistical tools and techniques, such as supervised, unsupervised, and reinforcement learning, to explore and analyze data.
  • Deep Learning (DL) is in turn a subset of ML. The main idea behind it is to make machines learn in a way loosely modeled on the human brain, using multi-layered neural network architectures such as artificial neural networks (ANN), convolutional neural networks (CNN), and recurrent neural networks (RNN).
  • Data Science (DS) is the practice of drawing insights from structured and unstructured data, with or without ML and DL techniques. Visualization tools, statistics, and probability can also be used to gain these insights.

Several machine learning algorithms are commonly used. These include:

  • Neural networks: Neural networks simulate the way the human brain works, with a huge number of linked processing nodes. Neural networks are good at recognizing patterns and play an important role in applications including natural language translation, image recognition, speech recognition, and image creation.
  • Linear regression: This algorithm is used to predict numerical values, based on a linear relationship between different values. For example, the technique could be used to predict house prices based on historical data for the area.
  • Logistic regression: This supervised learning algorithm makes predictions for categorical response variables, such as “yes/no” answers to questions. It can be used for applications such as classifying spam and quality control on a production line.
  • Clustering: Using unsupervised learning, clustering algorithms can identify patterns in data so that it can be grouped. Computers can help data scientists by identifying differences between data items that humans have overlooked.
  • Decision trees: Decision trees can be used for both predicting numerical values (regression) and classifying data into categories. Decision trees use a branching sequence of linked decisions that can be represented with a tree diagram. One of the advantages of decision trees is that they are easy to validate and audit, unlike the black box of the neural network.
  • Random forests: In a random forest, the machine learning algorithm predicts a value or category by combining the results from a number of decision trees.

General Life Cycle/Pipeline for Machine Learning Projects:

Step 1:- Identify the business case and categorize the type of problem to solve, e.g., regression, classification, or time series analysis.

Step 2:- Data Collection

Data collection is a foundational step in the machine learning process. It involves gathering relevant data that will be used to train and evaluate machine learning models. The quality and quantity of data collected directly impact the model’s performance and accuracy.

Step 3:- Identify the independent and dependent variables.

Step 4:- Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a critical step in the data science process, where you analyze and visualize datasets to uncover patterns, relationships, and insights. EDA helps you understand the structure, distribution, and quality of the data, which informs the next steps in the modeling process.
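
A first pass at EDA often starts with simple summaries per column. The sketch below uses only the Python standard library; the records, column names, and the helper `summarize_numeric` are made up for illustration, and in practice a data-frame library would usually do this work.

```python
import statistics
from collections import Counter

# Made-up records; None marks a missing value.
rows = [
    {"age": 25, "region": "north", "income": 42000},
    {"age": 31, "region": "south", "income": 58000},
    {"age": None, "region": "north", "income": 51000},
    {"age": 47, "region": "east", "income": None},
    {"age": 38, "region": "north", "income": 250000},
]


def summarize_numeric(rows, column):
    """Count missing values and describe the distribution of one column."""
    present = [row[column] for row in rows if row[column] is not None]
    return {
        "missing": len(rows) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
    }


age_summary = summarize_numeric(rows, "age")
income_summary = summarize_numeric(rows, "income")
region_counts = Counter(row["region"] for row in rows)

print(age_summary)
print(income_summary)  # a mean far above the median hints at skew or outliers
print(region_counts.most_common())
```

Even this small summary surfaces things that shape later steps: missing values to impute, a skewed income column, and an uneven spread of categories.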

Step 5:- Data Preprocessing

  • Remove unwanted columns such as IDs, and columns where most of the values are missing.
  • Impute missing values
  • Check for outliers
  • Convert categorical to numerical
  • Feature scaling
  • Handle imbalanced datasets
  • Format the data
  • Clean the data
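
Two of the steps listed above, imputing missing values and checking for outliers, can be sketched as follows. The choices here are assumptions rather than the only options: missing values are filled with the median, and outliers are flagged with the common 1.5 × IQR rule. The data and function names are made up.

```python
import statistics


def impute_median(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


incomes = [42, 58, None, 51, 47, 250, 55, 49]  # made-up, in thousands
filled = impute_median(incomes)
outliers = iqr_outliers(filled)
print(filled)
print(outliers)
```

Whether a flagged value should be removed, capped, or kept depends on the problem; the check only tells you where to look.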

Step 6:- Feature Engineering

Feature Engineering is the process of selecting, modifying, or creating new features (variables) from raw data to improve the performance of a machine learning model.

    • Feature Selection:
      • Filter Methods: Use statistical techniques (e.g., correlation, often visualized as a heatmap, or the chi-square test) to select features that have the strongest relationships with the target variable.
      • Wrapper Methods: Use iterative processes (e.g., forward selection, backward elimination) to find the best subset of features based on model performance.
      • Embedded Methods: Feature selection is integrated into the model training process (e.g., Lasso regression).
    • Feature Transformation:
      • Normalization/Scaling: Adjust the range of numerical features to a common scale (e.g., min-max scaling, standardization) to ensure that all features contribute equally to the model.
      • Log Transformation: Apply a logarithmic function to skewed data to make it more normally distributed.
      • Binning: Convert continuous features into discrete bins or categories (e.g., age groups).
    • Encoding Categorical Variables:
      • One-Hot Encoding: Convert categorical variables into a series of binary columns (e.g., “red,” “blue,” “green” becomes separate binary columns).
      • Label Encoding: Assign numerical values to categorical variables (e.g., “red” = 1, “blue” = 2).
    • Feature Creation:
      • Polynomial Features: Generate new features by combining existing ones (e.g., creating interaction terms or higher-order polynomials).
      • Date/Time Features: Extract features from date and time (e.g., day of the week, month, hour) to capture seasonal patterns.
      • Domain-Specific Features: Use domain knowledge to create new features that capture important aspects of the data (e.g., creating a “body mass index” feature from height and weight).
    • Handling Missing Data:
      • Imputation: Fill in missing values with a specific value (e.g., mean, median) or use more advanced techniques like K-nearest neighbors (KNN) imputation.
      • Dummy Variables: Create a binary feature indicating whether a value was missing.
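
Several of the techniques above fit in a few lines each. The sketch below shows min-max scaling, one-hot encoding, binning, a log transformation, a domain-specific feature (body mass index), and a date feature. All data values, thresholds, and function names are made up for illustration.

```python
import math
from datetime import date


def min_max_scale(values):
    """Rescale values to the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def one_hot(values):
    """Turn a categorical column into one binary column per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


def age_bin(age):
    """Binning: map a continuous age to a coarse group (assumed cut-offs)."""
    if age < 18:
        return "child"
    return "adult" if age < 65 else "senior"


heights_m = [1.60, 1.75, 1.82]
weights_kg = [55.0, 80.0, 68.0]
ages = [12, 30, 70]
colors = ["red", "blue", "green"]
signup = [date(2024, 1, 1), date(2024, 3, 16), date(2024, 7, 4)]

scaled_height = min_max_scale(heights_m)
color_columns = one_hot(colors)
age_groups = [age_bin(a) for a in ages]
log_weight = [math.log(w) for w in weights_kg]             # log transformation
bmi = [w / h ** 2 for w, h in zip(weights_kg, heights_m)]  # domain-specific feature
weekday = [d.weekday() for d in signup]                    # 0 = Monday ... 6 = Sunday

print(scaled_height, color_columns, age_groups, bmi, weekday, sep="\n")
```

In a real project the scaling limits and category list would be learned from the training data only and then reused on new data, so that information from the test set does not leak into the features.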

Step 7:- Model Selection and Building

Selecting a model for the problem you are solving is a crucial step. There are two ways to approach model selection:

  • Test all candidate algorithms on your data to see which works best. The advantage is that you find out empirically which algorithm, or set of algorithms, suits your problem statement. The disadvantage is that this approach is computationally costly when you have huge datasets.
  • Another approach is to try and understand what the algorithm does before deciding if it is a good fit for your problem or not. Do not be afraid to go into the basics of the algorithm itself. The more you understand how the algorithm works and its limitations, the better your chances are of identifying whether it is a good choice for your problem or not.

Once you have narrowed down your algorithms, the next step is to build the model and train it on your data.
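
The first approach, trying candidates and comparing them, can be sketched as below. This is a toy setup under stated assumptions: only two candidates (a mean baseline and a least-squares line), a single train/validation split, RMSE as the yardstick, and made-up house data. The function names are hypothetical.

```python
import math


def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Candidate: least-squares line through the training data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return lambda x: mean_y + slope * (x - mean_x)


def rmse(model, xs, ys):
    """Root mean squared error of a model on the given points."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


# Made-up data: house size (square metres) against price (thousands).
sizes = [50, 60, 70, 80, 90, 100, 110, 120]
prices = [152, 178, 214, 239, 268, 305, 327, 362]
train_x, valid_x = sizes[::2], sizes[1::2]
train_y, valid_y = prices[::2], prices[1::2]

candidates = {"mean baseline": fit_mean, "linear regression": fit_linear}
scores = {
    name: rmse(fit(train_x, train_y), valid_x, valid_y)
    for name, fit in candidates.items()
}
best_name = min(scores, key=scores.get)
print(scores)
print("selected:", best_name)
```

Each candidate is scored on data it did not see during training, which is what makes the comparison meaningful. Including a trivial baseline shows how much a more complex model actually adds.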

Step 8:- Hyperparameter tuning

Hyperparameter tuning is the process of choosing values for the settings of a model that the model cannot learn from the data by itself (for example, the maximum depth of a decision tree), so we have to set them ourselves. Well-chosen hyperparameters improve the performance of the model.

Two approaches are widely used to tune hyperparameters. Depending on the type of problem and the size of the search space, you can choose between:

  • GridSearchCV
  • RandomizedSearchCV
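
GridSearchCV and RandomizedSearchCV are scikit-learn classes. The sketch below does not call scikit-learn; it is a hand-rolled illustration of the idea behind them, using a tiny 1-D k-nearest-neighbours classifier whose hyperparameter is `k`. The data, the leave-one-out scoring, and all function names are assumptions made for the example.

```python
import random
from collections import Counter


def knn_predict(train, x, k):
    """Majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loo_accuracy(data, k):
    """Leave-one-out cross-validation accuracy for a given k."""
    hits = 0
    for i, (x, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += knn_predict(train, x, k) == label
    return hits / len(data)


def grid_search(data, grid):
    """Score every candidate value and return the best one with all scores."""
    scores = {k: loo_accuracy(data, k) for k in grid}
    return max(scores, key=scores.get), scores


def random_search(data, low, high, n_iter, seed=0):
    """Score only n_iter values sampled at random from [low, high]."""
    rng = random.Random(seed)
    sampled = {rng.randint(low, high) for _ in range(n_iter)}
    return grid_search(data, sorted(sampled))


# Made-up labeled points; 2.6 and 7.4 act as label noise.
data = [(1.0, "a"), (1.4, "a"), (2.1, "a"), (2.6, "b"), (3.0, "a"),
        (6.0, "b"), (6.3, "b"), (7.1, "b"), (7.4, "a"), (8.0, "b")]

best_k, scores = grid_search(data, grid=[1, 3, 9])
random_best_k, random_scores = random_search(data, low=1, high=9, n_iter=3)
print(best_k, scores)
print(random_best_k, random_scores)
```

Grid search evaluates every listed value, which is thorough but grows expensive as more hyperparameters are added. Random search evaluates a fixed number of sampled values, trading completeness for a bounded cost.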

Step 9:- Model Evaluation

Choose an evaluation metric pertinent to your problem. Many people default to accuracy for classification tasks merely because it is the easiest metric to understand, but it is not always the right choice, for example when the classes are imbalanced. Here are some possible metrics for binary classification problems.

  • Confusion matrix
  • Accuracy
  • Precision
  • Recall/Sensitivity
  • Specificity
  • F1 score
  • Precision-Recall (PR) curve
  • ROC (Receiver Operating Characteristic) curve
  • PR vs. ROC curve comparison
  • Log loss

For regression problems, we can look at

  • RMSE
  • R squared/Adjusted R squared
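
Most of the scalar metrics above follow directly from the confusion matrix or from the prediction errors. The sketch below computes them by hand with made-up labels and predictions. It assumes binary labels with 1 as the positive class and non-zero denominators, and the function names are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Scalar metrics derived from the confusion matrix."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


def rmse(y_true, y_pred):
    """Root mean squared error for regression."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(labels, predicted)

actual_values = [3.0, 5.0, 7.0, 9.0]
predicted_values = [2.5, 5.5, 7.0, 9.0]
regression_rmse = rmse(actual_values, predicted_values)
regression_r2 = r_squared(actual_values, predicted_values)
print(metrics, regression_rmse, regression_r2)
```

In this example accuracy is 0.7 while recall is only 0.5: the model misses half of the positive cases, which accuracy alone would hide.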

Step 10:- Model Deployment

Model deployment is the process of taking a trained machine learning model and making it available in a production environment so it can be used to make predictions on new, unseen data. It’s the step where the model is integrated into applications, systems, or processes to provide real-world value.

  • Save the trained model in a format (like Pickle or ONNX) that can be loaded later.
  • Develop an API (e.g., RESTful API) that allows external systems to interact with the model.
  • Embed the model into the existing application or system where it will be used.
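
The save, load, and serve steps above can be sketched with the standard library's `pickle` module. Everything specific here is an assumption: the "model" is just a dictionary of made-up parameters, and `handle_request` is a hypothetical stand-in for what an API endpoint built with a web framework would do.

```python
import os
import pickle
import tempfile


def predict(params, x):
    """Apply the stored linear model parameters to one input."""
    return params["slope"] * x + params["intercept"]


def save_model(params, path):
    with open(path, "wb") as handle:
        pickle.dump(params, handle)


def load_model(path):
    # Only unpickle files you created yourself: pickle can run code on load.
    with open(path, "rb") as handle:
        return pickle.load(handle)


def handle_request(params, payload):
    """Shape of an API handler: validate the input, predict, respond."""
    if "x" not in payload:
        return {"error": "missing field 'x'"}
    return {"prediction": predict(params, float(payload["x"]))}


trained_params = {"slope": 2.0, "intercept": 1.0}  # pretend training produced these
with tempfile.TemporaryDirectory() as folder:
    model_path = os.path.join(folder, "model.pkl")
    save_model(trained_params, model_path)
    served_params = load_model(model_path)

response = handle_request(served_params, {"x": 3})
print(response)
```

Keeping the prediction logic in a plain function, separate from the web layer, makes it easier to test the model on its own and to move it between serving tools later.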

Tools to use:

  • Frameworks: TensorFlow Serving, TorchServe, ONNX Runtime.
  • APIs: Flask, FastAPI, Django (for creating web APIs).
  • Cloud Services: AWS SageMaker, Google AI Platform, Azure Machine Learning.
  • Containerization: Docker, Kubernetes (for scalable deployment).

Step 11:- Model Monitoring & Maintenance

Monitoring involves continuously tracking the performance of a deployed machine learning model. Key metrics include accuracy, latency, and prediction error rates.

Maintenance refers to the ongoing process of updating and improving the model after deployment. This can involve retraining the model with new data, fine-tuning hyperparameters, or even replacing the model if it becomes outdated.
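
One simple way to connect monitoring to maintenance is to track accuracy over a rolling window of recent predictions and flag the model for retraining when it drops. The sketch below is a minimal illustration; the class name `AccuracyMonitor`, the window size, and the threshold are assumptions, and it presumes the true outcomes eventually become available.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent predictions and flag degradation."""

    def __init__(self, window=100, threshold=0.8):
        self.outcomes = deque(maxlen=window)  # True where the prediction was right
        self.threshold = threshold

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def needs_retraining(self):
        # Wait for a full window so a few early errors do not trigger an alert.
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy() < self.threshold


monitor = AccuracyMonitor(window=10, threshold=0.8)
for _ in range(10):
    monitor.record(predicted=1, actual=1)  # the model starts out correct
healthy = monitor.needs_retraining()
for _ in range(4):
    monitor.record(predicted=1, actual=0)  # the data drifts and errors appear
degraded = monitor.needs_retraining()
print(healthy, degraded, monitor.accuracy())
```

The same pattern extends to the other metrics mentioned above, such as latency or error rate, by recording a different value per prediction.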


