#### Concept
Random Forest is an ensemble learning method that combines multiple decision trees to improve classification or regression performance. Each tree in the forest is trained on a bootstrap sample of the data, and each split considers only a random subset of the features. The final prediction aggregates the predictions of all individual trees: a majority vote for classification, the average for regression. The aggregation step is sketched in the short example after the list of advantages below.
Key advantages of Random Forest include:
- Reduced Overfitting: By averaging multiple trees, Random Forest reduces the risk of overfitting compared to individual decision trees.
- Robustness: Less sensitive to noise and variability in the training data, because the errors of individual trees tend to cancel out in the aggregate.
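To make the aggregation step concrete, here is a minimal sketch (with hypothetical tree outputs, not taken from the worked example below) of a majority vote over binary class predictions:

```python
import numpy as np

# Hypothetical predictions: each row is one tree's predicted class (0 or 1)
# for the same three samples.
tree_predictions = np.array([
    [1, 0, 1],  # tree 1
    [1, 1, 1],  # tree 2
    [0, 0, 1],  # tree 3
])

# Majority vote per sample; rounding the mean works for binary labels.
# For regression, the forest would return the mean itself.
majority_vote = np.round(tree_predictions.mean(axis=0)).astype(int)
print(majority_vote)  # -> [1 0 1]
```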
#### Implementation

```python
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
# Example data
data = {
'Age': [29, 45, 50, 39, 48, 50, 55, 60, 62, 43],
'Cholesterol': [220, 250, 230, 180, 240, 290, 310, 275, 300, 280],
'Max_Heart_Rate': [180, 165, 170, 190, 155, 160, 150, 140, 130, 148],
'Heart_Disease': [0, 1, 1, 0, 1, 1, 1, 1, 1, 0]
}
df = pd.DataFrame(data)
# Independent variables (features) and dependent variable (target)
X = df[['Age', 'Cholesterol', 'Max_Heart_Rate']]
y = df['Heart_Disease']
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# Creating and training the random forest model
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
# Making predictions
y_pred = model.predict(X_test)
# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")
# Feature importance
feature_importances = pd.DataFrame(
    model.feature_importances_, index=X.columns, columns=['Importance']
).sort_values('Importance', ascending=False)
print(f"Feature Importances:\n{feature_importances}")
# Plotting the feature importances
sns.barplot(x=feature_importances.index, y=feature_importances['Importance'])
plt.title('Feature Importances')
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.show()
```
## Results

```
Accuracy: 1.0
Confusion Matrix:
[[2]]
Classification Report:
              precision    recall  f1-score   support

           1       1.00      1.00      1.00         2

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2

Feature Importances:
                Importance
Age               0.404142
Cholesterol       0.360129
Max_Heart_Rate    0.235729
```
Plot: a bar chart of the three feature importances, with Feature on the x-axis and Importance on the y-axis.
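With only ten rows, the 80/20 split leaves just two test samples, so the perfect accuracy above says little about generalization. A k-fold cross-validation gives a more stable estimate; the snippet below is a sketch that reuses the X and y defined above (cv=3 because the smaller class has only three members):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 3-fold stratified cross-validation; cv must not exceed the size of the
# smallest class (class 0 has only 3 samples in this dataset).
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, cv=3, scoring='accuracy'
)
print(f"Cross-validated accuracy: {cv_scores.mean():.2f} "
      f"(std {cv_scores.std():.2f})")
```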
## Explanation of the Code
1. Libraries: We import the required libraries: numpy, pandas, scikit-learn, matplotlib, and seaborn.
2. Data Preparation: We create a DataFrame containing features (Age, Cholesterol, Max_Heart_Rate) and the target variable (Heart_Disease).
3. Feature and Target: We separate the features and the target variable.
4. Train-Test Split: We split the data into training (80%) and testing (20%) sets.
5. Model Training: We create a RandomForestClassifier model with 100 trees and train it using the training data.
6. Predictions: We use the trained model to predict heart disease for the test set.
7. Evaluation: We evaluate the model using accuracy, confusion matrix, and classification report.
8. Feature Importance: We compute and display the impurity-based importance of each feature (a permutation-based alternative is sketched after this list).
9. Visualization: We plot the feature importances to visualize which features contribute most to the model's predictions.
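Step 8 relies on the impurity-based feature_importances_ attribute, which can be biased toward features with many distinct values. Permutation importance is a common alternative; the sketch below reuses the fitted model and the test split from above (with only two test samples the numbers will be noisy):

```python
from sklearn.inspection import permutation_importance

# Shuffle one feature at a time and measure how much the test-set
# accuracy drops; a large drop means the model relied on that feature.
perm = permutation_importance(model, X_test, y_test, n_repeats=10,
                              random_state=0)
for name, mean_drop in zip(X.columns, perm.importances_mean):
    print(f"{name}: {mean_drop:.3f}")
```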
## Evaluation Metrics
- Accuracy: The proportion of correctly classified instances among the total instances.
- Confusion Matrix: Shows the counts of true positives, true negatives, false positives, and false negatives (see the sketch after this list for why the matrix above is 1x1).
- Classification Report: Provides precision, recall, F1-score, and support for each class.
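The 1x1 confusion matrix in the results appears because both test samples belong to class 1, so only one label occurs in y_test and y_pred. Passing an explicit labels argument to confusion_matrix forces the full 2x2 layout; a short sketch reusing the variables from the example:

```python
from sklearn.metrics import confusion_matrix

# Include both classes even when one is absent from the test data.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)
# Rows are the true classes (0, 1); columns are the predicted classes.
```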