ML Made Simple in 30 hours: Hour 5 Gradient Boosting

## Concept

Gradient Boosting is an ensemble learning technique that builds a strong predictive model by combining the predictions of multiple weaker models, typically decision trees. Unlike Random Forest, which builds trees independently, Gradient Boosting builds trees sequentially, each one correcting the errors of its predecessor.

The key idea is to minimize a loss function step by step, with each new learner reducing the error that remains:

1. Initialize the model with a constant value.

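2. Fit a weak learner (e.g., a shallow decision tree) to the residuals (errors) of the current model. More generally, the learner is fit to the negative gradient of the loss, which equals the plain residuals for squared-error loss.

3. Update the model by adding the fitted weak learner, scaled by a learning rate, so that the loss decreases.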

4. Repeat the process for a specified number of iterations or until convergence.
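
The four steps above can be sketched by hand for a regression problem with squared-error loss. This is a minimal illustration, not how scikit-learn implements it internally: the synthetic data, the names `n_rounds`, `mse_history`, and `boosted_predict`, and the choice of depth-2 trees are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

n_rounds = 50        # number of boosting iterations
learning_rate = 0.1  # shrinkage applied to each tree's contribution

# Step 1: initialize with a constant (the mean minimizes squared error)
init_value = y.mean()
prediction = np.full_like(y, init_value)
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

# Step 4: repeat steps 2 and 3 for a fixed number of rounds
for _ in range(n_rounds):
    # Step 2: fit a shallow tree to the current residuals
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Step 3: add the shrunken tree output to the running prediction
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))


def boosted_predict(X_new):
    """Sum the constant and every tree's shrunken contribution."""
    out = np.full(X_new.shape[0], init_value)
    for fitted_tree in trees:
        out += learning_rate * fitted_tree.predict(X_new)
    return out


print(f"Training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

Each tree only has to explain what the earlier trees got wrong, which is why the training error in `mse_history` shrinks round after round. A smaller `learning_rate` makes each step more cautious and usually needs more rounds.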

## Implementation Example

Suppose we have a dataset that records each applicant's age, income, and years of experience, and we want to predict whether a loan is approved.

# Import necessary libraries

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Example data
data = {
    'Age': [25, 45, 35, 50, 23, 37, 32, 28, 40, 27],
    'Income': [50000, 60000, 70000, 80000, 20000, 30000, 40000, 55000, 65000, 75000],
    'Years_Experience': [1, 20, 10, 25, 2, 5, 7, 3, 15, 12],
    'Loan_Approved': [0, 1, 1, 1, 0, 0, 1, 0, 1, 1]
}

df = pd.DataFrame(data)
# Independent variables (features) and dependent variable (target)
X = df[['Age', 'Income', 'Years_Experience']]
y = df['Loan_Approved']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                   random_state=0)


# Creating and training the gradient boosting model

model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, 
                                   max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Making predictions
y_pred = model.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")

# Feature importance
feature_importances = pd.DataFrame(
    model.feature_importances_, index=X.columns, columns=['Importance']
).sort_values('Importance', ascending=False)
print(f"Feature Importances:\n{feature_importances}")

# Plotting the feature importances

sns.barplot(x=feature_importances.index,
            y=feature_importances['Importance'])
plt.title('Feature Importances')
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.show()

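Results (the test set has only two rows, both from class 1, so these scores say little about how well the model generalizes):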

Accuracy: 1.0
Confusion Matrix:
[[2]]
Classification Report:
              precision    recall  f1-score   support

           1       1.00      1.00      1.00         2

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2

Feature Importances:
                  Importance
Years_Experience         1.0
Age                      0.0
Income                   0.0

Plot: bar chart of the feature importances, with Feature on the x-axis and Importance on the y-axis.
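
The confusion matrix in the results is 1x1 because class 0 never appears in the test labels or the predictions. One way to always get a full 2x2 matrix is to pass the `labels` argument. The sketch below uses small hypothetical arrays (`y_true`, `y_hat`) that mirror the printed output rather than re-running the split.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical labels mirroring the output above: two test rows, both approved
y_true = np.array([1, 1])
y_hat = np.array([1, 1])

# Passing `labels` gives every class a row and a column,
# even when a class is missing from the test data
full_matrix = confusion_matrix(y_true, y_hat, labels=[0, 1])
print(full_matrix)

# zero_division=0 avoids warnings for the class that has no test rows
report = classification_report(y_true, y_hat, labels=[0, 1], zero_division=0)
print(report)
```

Rows are the true classes and columns are the predicted classes, so the bottom-right cell holds the two correctly predicted approvals and the row for class 0 is all zeros.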

## Explanation of the Code

1. Libraries: We import numpy, pandas, scikit-learn, matplotlib, and seaborn.

2. Data Preparation: We create a DataFrame containing features (Age, Income, Years_Experience) and the target variable (Loan_Approved).

3. Feature and Target: We separate the features and the target variable.

4. Train-Test Split: We split the data into training and testing sets.

5. Model Training: We create a GradientBoostingClassifier model with 100 estimators (n_estimators=100), a learning rate of 0.1, and a maximum depth of 3, and train it using the training data.

6. Predictions: We use the trained model to predict loan approval for the test set.

7. Evaluation: We evaluate the model using accuracy, confusion matrix, and classification report.

8. Feature Importance: We compute and display the importance of each feature.

9. Visualization: We plot the feature importances to visualize which features contribute most to the model's predictions.
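
With only ten rows, a single train-test split leaves two rows for testing. A possible alternative is stratified k-fold cross-validation, which lets every row serve as test data once and keeps both classes in each fold. The sketch below reuses the toy data and model settings from the example; the choice of 4 folds and the names `cv` and `scores` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Same toy data as in the example above
df = pd.DataFrame({
    'Age': [25, 45, 35, 50, 23, 37, 32, 28, 40, 27],
    'Income': [50000, 60000, 70000, 80000, 20000, 30000, 40000, 55000, 65000, 75000],
    'Years_Experience': [1, 20, 10, 25, 2, 5, 7, 3, 15, 12],
    'Loan_Approved': [0, 1, 1, 1, 0, 0, 1, 0, 1, 1],
})
X = df[['Age', 'Income', 'Years_Experience']]
y = df['Loan_Approved']

# Class 0 has 4 rows, so each of the 4 folds can hold one row of that class
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                   max_depth=3, random_state=0)

# One accuracy value per fold
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {scores.mean():.2f}")
```

On a dataset this small the fold scores will still vary a lot, so they are best read as a rough sanity check rather than a reliable estimate.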

## Evaluation Metrics

- Accuracy: The proportion of correctly classified instances among the total instances.

- Confusion Matrix: Counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

- Classification Report: Provides precision, recall, F1-score, and support for each class.
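
To see how these metrics relate, we can compute them by hand from the four confusion-matrix counts and compare with scikit-learn. The labels below are hypothetical and chosen only so that every count is non-zero.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical true and predicted labels, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels [0, 1], ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all rows that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, "
      f"recall={recall:.2f}, f1={f1:.2f}")

# The hand-computed values match scikit-learn's helpers
assert np.isclose(accuracy, accuracy_score(y_true, y_hat))
assert np.isclose(precision, precision_score(y_true, y_hat))
assert np.isclose(recall, recall_score(y_true, y_hat))
assert np.isclose(f1, f1_score(y_true, y_hat))
```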

ENJOY LEARNING 👍👍


Friday, 3 January 2025
