Packages:
- A package is a directory containing one or more modules and possibly sub-packages.
- They group related modules, making code easier to manage, maintain, and share with others.
- Packages serve as a toolbox, allowing easy access and reuse of code across different projects.
Sub-packages:
- A sub-package is a package that is nested inside another package.
- It helps further organize and structure large codebases by grouping related modules into smaller, more manageable packages within a larger package.
Modules:
- A module is a single file (with a .py extension) that contains Python code, including functions, classes, and variables.
- Modules allow you to organize and reuse code by breaking it into separate, manageable files.
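For illustration, a small hypothetical package might be laid out like this (the names analytics, cleaning, stats, and outliers are invented for this sketch; imports then mirror the directory structure):

analytics/                  <- package (a directory with an __init__.py)
    __init__.py
    cleaning.py             <- module (a single .py file)
    stats/                  <- sub-package (a package nested inside the package)
        __init__.py
        outliers.py         <- module inside the sub-package

# Hypothetical imports following that layout:
from analytics.cleaning import drop_missing
from analytics.stats.outliers import zscore_filter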
Example:
from sklearn.datasets import load_iris
- Package (sklearn): The main package for scikit-learn, which provides tools for machine learning.
- Subpackage (datasets): Part of sklearn, it includes utilities to load datasets like load_iris.
- Function (load_iris): A function defined inside the sklearn.datasets subpackage (in one of its modules) that loads the Iris dataset, a well-known dataset for classification tasks.
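As a quick check, the import above can be used directly; a minimal sketch:

from sklearn.datasets import load_iris

iris = load_iris()            # Bunch object holding data, targets, and metadata
print(iris.data.shape)        # (150, 4): 150 samples, 4 features
print(iris.target_names)      # ['setosa' 'versicolor' 'virginica']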
Common Packages for Data Science:
Pandas
- Pandas is the go-to library for data manipulation and analysis.
- It provides data structures like DataFrames that are great for handling and analyzing structured data (e.g., CSV files, Excel files).
- Use for data cleaning, transformation, and exploration.
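A minimal sketch of typical Pandas usage (the column names and values here are made up; a real CSV file could be loaded with pd.read_csv instead):

import pandas as pd

# Build a small DataFrame in memory
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cleo"],
    "score": [88, None, 95],
})

df["score"] = df["score"].fillna(df["score"].mean())   # simple cleaning step: fill missing values
print(df.describe())                                   # quick exploration of the numeric columns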
NumPy
- NumPy provides support for large, multi-dimensional arrays and matrices, along with a large collection of mathematical functions to operate on these arrays.
- Use for numerical computations, performing mathematical operations on arrays, and working with multi-dimensional data.
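A short sketch of NumPy's array operations:

import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])   # 2-D array (matrix)
print(a.mean(axis=0))                     # column means: [2. 3.]
print(a @ a.T)                            # matrix multiplication
print(np.sqrt(a))                         # element-wise mathematical function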
Matplotlib
- Matplotlib is a plotting library that produces high-quality graphs and visualizations.
- Use for creating static, animated, and interactive visualizations in Python.
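A minimal static plot with Matplotlib:

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x), label="sin(x)")    # a simple line plot
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.legend()
plt.show()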
Seaborn
- Seaborn is built on top of Matplotlib and provides a high-level interface for creating attractive and informative statistical graphics.
- Use for visualizing complex datasets, especially for statistical analysis.
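A small sketch using one of Seaborn's bundled example datasets (load_dataset fetches it over the network the first time):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")                               # example dataset shipped for Seaborn demos
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="day")
plt.show()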
Scikit-learn
- Scikit-learn is a comprehensive library for machine learning in Python. It provides simple and efficient tools for data mining and data analysis.
- Use for classification, regression, clustering, and model evaluation.
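A minimal classification sketch with scikit-learn, reusing the Iris dataset from the earlier example:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=200)          # a simple classifier
model.fit(X_train, y_train)                       # training
print(accuracy_score(y_test, model.predict(X_test)))   # model evaluation on held-out data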
TensorFlow and PyTorch
- TensorFlow and PyTorch are the leading deep learning frameworks. They provide extensive tools for building and training neural networks.
- Use for deep learning, neural networks, and complex machine learning models.
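As one illustration (PyTorch is chosen here only for brevity; TensorFlow offers an equivalent workflow), a tiny feed-forward network sketch:

import torch
import torch.nn as nn

# A tiny network: 4 input features -> 3 output classes
model = nn.Sequential(
    nn.Linear(4, 16),
    nn.ReLU(),
    nn.Linear(16, 3),
)

x = torch.randn(8, 4)          # a batch of 8 random samples
logits = model(x)
print(logits.shape)            # torch.Size([8, 3])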
SciPy
- SciPy builds on NumPy and provides additional functionality for scientific computing, including modules for optimization, integration, interpolation, eigenvalue problems, and more.
- Use for advanced mathematical functions, optimization, and scientific computing.
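A small optimization sketch with SciPy:

from scipy import optimize

# Minimize a simple quadratic: f(x) = (x - 3)^2
result = optimize.minimize(lambda x: (x[0] - 3.0) ** 2, x0=[0.0])
print(result.x)    # approximately [3.]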
Statsmodels
- Statsmodels is a library for estimating and testing statistical models.
- It provides classes and functions for statistical analysis.
- Use for statistical modeling, hypothesis testing, and data exploration.
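A minimal ordinary least squares sketch with Statsmodels, using synthetic data generated for illustration:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=100)   # y is roughly 2x + 1 plus noise

X = sm.add_constant(x)                # add an intercept column
results = sm.OLS(y, X).fit()          # estimate the linear model
print(results.params)                 # estimated intercept and slope
print(results.pvalues)                # hypothesis tests on the coefficients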
NLTK (Natural Language Toolkit)
- NLTK is a popular Python library used for working with human language data (text).
- It provides tools and resources for natural language processing (NLP), making it easier to handle text data, perform text analysis, and build applications.
- Use for processing and understanding human language, e.g., text tokenization, text classification, sentiment analysis, and language translation.
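A minimal tokenization sketch with NLTK (the tokenizer models must be downloaded once; newer NLTK versions may expect the "punkt_tab" resource instead of "punkt"):

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")                          # one-time download of tokenizer models
tokens = word_tokenize("NLTK makes text tokenization straightforward.")
print(tokens)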
TextBlob
- TextBlob is a Python library for processing textual data.
- It provides a simple API for common natural language processing (NLP) tasks and is designed to be easy to use for both beginners and experienced developers.
- Use for processing and understanding human language, e.g., text tokenization and sentiment analysis.
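A minimal TextBlob sketch (TextBlob relies on NLTK corpora, which may need a one-time download via python -m textblob.download_corpora):

from textblob import TextBlob

blob = TextBlob("TextBlob makes simple NLP tasks very easy.")
print(blob.words)        # tokenized words
print(blob.sentiment)    # Sentiment(polarity=..., subjectivity=...)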