MACHINE LEARNING
Machine Learning (ML) is a subfield of artificial intelligence (AI) that focuses on developing algorithms and models that enable computers to learn from and make predictions or decisions based on data. The key idea behind machine learning is to allow computers to learn patterns and relationships in data without being explicitly programmed for every specific task.
Types of Machine Learning:
1. Supervised Learning: In supervised learning, the algorithm learns from a labeled dataset where the input data is associated with the correct output. The goal is to learn a mapping from inputs to outputs, enabling accurate predictions on new, unseen data.
Image Classification:
- Given a dataset of images with labeled categories (e.g., cats, dogs, cars), the algorithm learns to classify new images into the appropriate categories.
Email Spam Detection:
- The algorithm is trained on a dataset of emails labeled as spam or not spam. It learns to classify new, unseen emails into the correct category.
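For illustration, here is a minimal supervised-learning sketch for spam detection with scikit-learn; the emails, labels, and choice of model are illustrative assumptions, not part of the original text:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labeled dataset: 1 = spam, 0 = not spam (made-up examples)
emails = [
    "win a free prize now",
    "limited offer, claim your reward",
    "meeting rescheduled to 3pm",
    "please review the attached report",
]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)   # turn text into word-count features

model = MultinomialNB()
model.fit(X, labels)                   # learn the mapping from features to labels

new_email = vectorizer.transform(["claim your free prize"])
print(model.predict(new_email))        # expected output: [1] (spam)
```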
2. Unsupervised Learning: Unsupervised learning involves finding patterns and structures in unlabeled data without explicit output labels. Clustering and dimensionality reduction are common tasks in this category.
Example: Clustering - Customer Segmentation
- Input (Features): Purchase history, demographics
- Output: Grouping similar customers
Example: Dimensionality Reduction - Principal Component Analysis (PCA)
- Input (Features): High-dimensional data
- Output: Reduced-dimensional representation while retaining important information
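A minimal sketch of both unsupervised tasks with scikit-learn, assuming a made-up customer table (the spend and age values are purely illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Toy customer features: columns are yearly spend and age (made-up values)
X = np.array([
    [200.0, 25], [220.0, 30], [180.0, 28],    # lower spenders, younger
    [900.0, 52], [950.0, 48], [880.0, 55],    # higher spenders, older
])

# Clustering: group similar customers without any labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
print(kmeans.fit_predict(X))                   # cluster label for each customer

# Dimensionality reduction: compress 2-D features into 1-D
pca = PCA(n_components=1)
print(pca.fit_transform(X))                    # reduced representation
```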
3. Semi-Supervised Learning: Semi-supervised learning combines elements of supervised and unsupervised learning. It uses a small amount of labeled data along with a larger amount of unlabeled data for training.
Example: Anomaly Detection - Network Intrusion Detection
- Labeled Data: Known instances of network intrusions
- Unlabeled Data: Network traffic data
- Goal: Detect anomalies (intrusions) in the unlabeled data
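A minimal semi-supervised sketch using scikit-learn's LabelPropagation, where unlabeled points are marked with -1; the feature values are made up for illustration:

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# A few labeled points plus several unlabeled points (label -1)
X = np.array([[1.0], [1.2], [0.9], [8.0], [8.2], [7.9]])
y = np.array([0, -1, -1, 1, -1, -1])            # -1 means "unlabeled"

model = LabelPropagation()
model.fit(X, y)                                  # labels spread to nearby unlabeled points
print(model.transduction_)                       # inferred labels for all points
```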
4. Reinforcement Learning: In reinforcement learning, an agent learns by interacting with an environment, taking actions and receiving rewards or penalties, with the goal of maximizing cumulative reward over time.
Example: Game Playing - Chess AI
- Agent: Chess-playing AI
- Environment: Chessboard and opponent
- Actions: Moves on the chessboard
- Reward: Winning the game
Example: Robotics - Autonomous Navigation
- Agent: Robot
- Environment: Physical surroundings
- Actions: Movements and decisions made by the robot
- Reward: Reaching a destination safely
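As a rough illustration of the agent/environment/action/reward loop, here is a minimal tabular Q-learning sketch on a hypothetical 5-state corridor; the environment, rewards, and hyperparameters are invented for this example (chess or robot navigation would need far richer state and action spaces):

```python
import numpy as np

# Toy environment: states 0..4 in a corridor, reward +1 for reaching state 4
n_states, n_actions = 5, 2                 # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.3
rng = np.random.default_rng(0)

for episode in range(500):
    state = 0
    for step in range(100):                # cap episode length
        # epsilon-greedy action selection
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(Q[state]))
        next_state = max(0, state - 1) if action == 0 else min(4, state + 1)
        reward = 1.0 if next_state == 4 else 0.0
        # Q-learning update: move Q towards reward + discounted best future value
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state
        if state == 4:                     # goal reached, end the episode
            break

print(Q)  # "right" (column 1) should end up with the higher value in every state
```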
Types of Data:
Numerical Data:
- Integer: Whole numbers without decimal points (e.g., 1, 10, -5).
- Float: Numbers with decimal points (e.g., 3.14, -0.5, 2.71828).
- Date/Time: Timestamps or durations (e.g., 2023-08-09 15:30:00).
Categorical Data:
- Nominal: Categories without any inherent order (e.g., colors, types of animals).
- Ordinal: Categories with a meaningful order or ranking (e.g., low, medium, high).
Text Data:
- String: Sequence of characters (e.g., "Hello, world!", "Customer name").
Boolean Data:
- Boolean: Represents binary values (True or False) indicating the presence or absence of a characteristic.
Mixed Data Types:
- Mixed: Columns containing a mix of different data types (e.g., addresses with both strings and numbers).
Missing Data:
- Null/NaN: Represent missing or undefined values in the dataset.
Categorical with Hierarchical Structure:
- Tree or Hierarchy: Data that can be organized into a hierarchical structure, such as a category/subcategory relationship.
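A small pandas sketch showing how these data types typically appear in a single table; all values are hypothetical:

```python
import numpy as np
import pandas as pd

# Illustrative table covering the data types listed above
df = pd.DataFrame({
    "age": [25, 32, 47],                                    # integer
    "height_m": [1.75, 1.62, np.nan],                       # float with a missing value (NaN)
    "signup": pd.to_datetime(["2023-08-09", "2023-08-10", "2023-08-11"]),  # date/time
    "color": ["red", "blue", "red"],                        # nominal categorical
    "priority": pd.Categorical(["low", "high", "medium"],
                               categories=["low", "medium", "high"],
                               ordered=True),               # ordinal categorical
    "comment": ["Hello, world!", "Customer name", ""],      # text / string
    "is_active": [True, False, True],                       # boolean
})
print(df.dtypes)   # pandas infers a dtype for each column
```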
Rule-Based Approach:
- In a rule-based approach, decisions are made based on a set of predefined rules or conditions.
- Rules are often created manually by domain experts or through a knowledge acquisition process.
- If-then rules dictate how the model should behave for different input scenarios.
- Rule-based systems are interpretable and can be useful for tasks that require transparency.
- Example: Expert systems in medical diagnosis, where a set of rules guides the diagnosis based on patient symptoms.
- Rules: If income > $50,000 and credit score > 700, then approve credit. If income <= $50,000 and employment years > 5, then approve credit.
- This rule-based system evaluates applicants' income and credit score to make decisions about credit approval.
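A direct sketch of that rule-based credit-approval system in code; the two thresholds come from the rules above, while the function name and sample inputs are illustrative:

```python
def approve_credit(income: float, credit_score: int, employment_years: int) -> bool:
    # Rule 1: income > $50,000 and credit score > 700 -> approve
    if income > 50_000 and credit_score > 700:
        return True
    # Rule 2: income <= $50,000 and employment years > 5 -> approve
    if income <= 50_000 and employment_years > 5:
        return True
    return False

print(approve_credit(income=60_000, credit_score=720, employment_years=2))  # True
print(approve_credit(income=40_000, credit_score=650, employment_years=6))  # True
print(approve_credit(income=40_000, credit_score=650, employment_years=1))  # False
```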
Distance-Based Approach:
- In a distance-based approach, similarity or dissimilarity between data points is used to make predictions or decisions.
- The idea is that similar data points should have similar outcomes.
- Commonly used in clustering and classification tasks.
- Example: K-Nearest Neighbors (KNN) algorithm, which predicts the label of a data point based on the labels of its k nearest neighbors.
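A minimal KNN sketch with scikit-learn, using made-up 2-D points and k = 3:

```python
from sklearn.neighbors import KNeighborsClassifier

# Two small clusters of made-up points with labels 0 and 1
X = [[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [4.8, 5.1]]
y = [0, 0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
# Each prediction is the majority label among the 3 nearest training points
print(knn.predict([[1.1, 1.0], [5.1, 5.0]]))
```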
Boundary-Based Approach:
- A boundary-based approach focuses on finding decision boundaries that separate different classes or categories in the data.
- The model learns to distinguish between classes by identifying regions in the feature space.
- Decision boundaries can be linear or nonlinear, depending on the complexity of the problem.
- Example: Support Vector Machines (SVM), which aim to find the hyperplane that best separates classes.
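A minimal SVM sketch with scikit-learn; a linear kernel is assumed and the points are made up:

```python
from sklearn.svm import SVC

# Two linearly separable groups of made-up points
X = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]]
y = [0, 0, 1, 1]

svm = SVC(kernel="linear")
svm.fit(X, y)                       # learns a separating hyperplane
print(svm.predict([[0.5, 0.5], [4.5, 4.5]]))
```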
Probability-Based Approach:
- A probability-based approach involves estimating the probabilities of different outcomes or classes given the input data.
- The model calculates the likelihood of each class and makes predictions based on the highest probability.
- Often used in classification problems, particularly when dealing with uncertainty.
- Example: Naive Bayes classifier, which applies Bayes' theorem and assumes feature independence to make predictions.
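A minimal Naive Bayes sketch with scikit-learn; GaussianNB is used here because the made-up features are numeric:

```python
from sklearn.naive_bayes import GaussianNB

# Made-up numeric features for two classes
X = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
y = [0, 0, 1, 1]

nb = GaussianNB()
nb.fit(X, y)                               # estimates per-class likelihoods
print(nb.predict([[1.2, 2.1]]))            # most probable class
print(nb.predict_proba([[1.2, 2.1]]))      # probability of each class
```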
Data Preprocessing
Common preprocessing steps are listed below; a combined code sketch illustrating several of them appears after the list.
1. Data Cleaning:
- Handling missing values: Impute or remove missing data points.
- Removing duplicates: Eliminate duplicate records from the dataset.
- Handling outliers: Decide whether to keep, transform, or remove outliers based on the problem context.
2. Data Transformation:
- Feature scaling: Scale numerical features to similar ranges (e.g., standardization or normalization) to improve algorithm convergence.
- Encoding categorical variables: Convert categorical data into numerical format (e.g., one-hot encoding, label encoding) to make them suitable for algorithms.
- Feature extraction: Create new features or reduce dimensionality using techniques like PCA (Principal Component Analysis) or LDA (Linear Discriminant Analysis).
- Text preprocessing: Tokenization, stemming, and removing stop words when working with textual data.
3. Data Reduction:
- Dimensionality reduction: Reduce the number of features to improve efficiency and reduce noise (e.g., PCA).
- Sampling: Use techniques like random sampling, stratified sampling, or oversampling/undersampling to balance class distribution.
4. Data Integration:
- Combining data from multiple sources: Merge data from different databases or datasets to create a comprehensive dataset.
- Handling data from different formats: Convert data from various formats (CSV, Excel, JSON) into a consistent format.
5. Data Formatting:
- Ensure consistent units, formats, and scales across features.
- Convert dates and times into a standardized format.
6. Data Splitting:
- Splitting data into training, validation, and test sets for model evaluation.
- Ensuring data is representative of the overall distribution across sets.
7. Handling Imbalanced Data:
- Techniques to address class imbalance, such as oversampling the minority class or using specialized algorithms.
8. Data Normalization:
- Ensuring that data adheres to specific assumptions or requirements of the algorithm (e.g., normalizing data for neural networks).
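Here is the combined sketch referenced above, covering cleaning (duplicate removal, imputation), transformation (scaling, one-hot encoding), and splitting; the toy table, column names, and split ratio are illustrative assumptions:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy dataset with a missing value and a categorical column (made-up values)
df = pd.DataFrame({
    "income": [40_000, 55_000, None, 72_000],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "approved": [0, 1, 0, 1],
})
df = df.drop_duplicates()                        # data cleaning: remove duplicates

X, y = df[["income", "city"]], df["approved"]

# Data transformation: impute + scale the numeric column, one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), ["income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Data splitting: hold out part of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
print(preprocess.fit_transform(X_train))
```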
Normalization Types
- Normalization is a data preprocessing technique used to transform features to a common scale, helping machine learning algorithms work better and converge faster. There are several types of normalization methods, each with its own advantages and use cases:
- Min-Max Scaling
- Mean Scaling
- Absolute Maximum Scaling
1. Min-Max Scaling (Min-Max Normalization):
- Scales features to a specified range, often between 0 and 1.
- Formula: X_normalized = (X - X_min) / (X_max - X_min)
- Useful when features have a clear minimum and maximum value.
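A quick sketch of min-max scaling with scikit-learn on a made-up feature:

```python
from sklearn.preprocessing import MinMaxScaler

# Each value is mapped to [0, 1] via (x - min) / (max - min)
X = [[10.0], [20.0], [30.0], [40.0]]
print(MinMaxScaler().fit_transform(X))   # [[0.], [0.333...], [0.666...], [1.]]
```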
1.1 Mean Scaling
- Mean Scaling, also known as Centering, is a data preprocessing technique used to shift the values of a feature so that they have a mean of zero. It involves subtracting the mean of the feature from each data point, effectively centering the data around zero. This technique is particularly useful when the mean of the feature is significant for the analysis, or when you want to remove any potential bias caused by the mean.
- Formula: X_scaled = (X - mean(X)) / (X_max - X_min)
Where:
- X_scaled is the centered (scaled) data.
- X represents the original dataset.
- mean(X) is the mean of the original dataset.
- X_max and X_min are the maximum and minimum values of the original dataset.
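Scikit-learn has no dedicated mean-scaling transformer, so a hand-rolled NumPy sketch is shown instead; the values are made up:

```python
import numpy as np

# Mean scaling: subtract the mean, then divide by the range (max - min)
x = np.array([10.0, 20.0, 30.0, 40.0])
x_scaled = (x - x.mean()) / (x.max() - x.min())
print(x_scaled)   # centered around 0: [-0.5, -0.1667, 0.1667, 0.5]
```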
1.2 Absolute Maximum Scaling
- We first select the maximum absolute value out of all the entries of a particular feature.
- Then we divide each entry of the column by this maximum absolute value.
- After this step, each entry of the column lies in the range of -1 to 1.
- However, this method is not used very often, because it is too sensitive to outliers.
- Formula: X_scaled = X / max(|X|)
Where:
- X represents the original value of a data point.
- max(|X|) is the absolute maximum value of the dataset, calculated by taking the maximum of the absolute values of all data points.
- X_scaled is the scaled value of the data point after applying the formula.
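A quick sketch of absolute maximum scaling, using scikit-learn's MaxAbsScaler alongside the equivalent hand computation; the values are made up:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Divide each value by the largest absolute value, so results lie in [-1, 1]
X = np.array([[-50.0], [10.0], [25.0], [100.0]])
print(MaxAbsScaler().fit_transform(X))   # [[-0.5], [0.1], [0.25], [1.0]]
print(X / np.abs(X).max())               # same result computed by hand
```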
2. STANDARDIZATION
- Standardization, also known as Z-Score Normalization, is a common data preprocessing technique used to transform features so that they have a mean of 0 and a standard deviation of 1. This process helps bring features to a common scale, making them comparable and suitable for algorithms that assume Gaussian-distributed data or require features to be centered around zero.
- Formula: X_standardized = (X - mean(X)) / standard_deviation(X)
Where:
- X_standardized is the standardized feature.
- X represents the original feature.
- mean(X) is the mean (average) of the original feature.
- standard_deviation(X) is the standard deviation of the original feature.
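A quick sketch of standardization with scikit-learn's StandardScaler, checked against the formula by hand; the values are made up:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Z-score standardization: subtract the mean, divide by the standard deviation
X = np.array([[10.0], [20.0], [30.0], [40.0]])
print(StandardScaler().fit_transform(X))   # mean 0, standard deviation 1
print((X - X.mean()) / X.std())            # same result computed by hand
```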
3. ROBUST SCALING
Robust Scaling, also known as Robust Standardization, is a data preprocessing technique used to scale features in a way that is robust to the presence of outliers. It is similar to Z-Score Normalization (Standardization), but instead of using the mean and standard deviation, Robust Scaling uses the median and the interquartile range (IQR) to scale the data.
- Formula: X_robust_scaled = (X - median(X)) / IQR(X) = (X - Q2) / (Q3 - Q1)
Where:
- X_robust_scaled is the scaled feature using Robust Scaling.
- X represents the original feature.
- median(X) is the median (Q2) of the original feature.
- IQR(X) is the interquartile range (Q3 - Q1) of the original feature.
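A quick sketch of robust scaling with scikit-learn's RobustScaler; the deliberately extreme value 1000 shows how the median and IQR keep the outlier from dominating the scale (the values are made up):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Subtract the median and divide by the IQR (Q3 - Q1)
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])
print(RobustScaler().fit_transform(X))

q1, q2, q3 = np.percentile(X, [25, 50, 75])
print((X - q2) / (q3 - q1))               # same result computed by hand
```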






