ML Made Simple in 30 hours: Hour 3 DecisionTree Model

#### Concept

Decision trees are a non-parametric supervised learning method used for both classification and regression tasks. They model decisions and their possible consequences in a tree-like structure, where internal nodes represent tests on features, branches represent the outcome of the test, and leaf nodes represent the final prediction (class label or value).

For classification, decision trees use measures like Gini impurity or entropy to split the data:

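- Gini Impurity: Measures the probability that a randomly chosen element of a node would be misclassified if it were labeled at random according to the node's class distribution. A value of 0 means the node is pure.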

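- Entropy (Information Gain): Entropy measures the amount of uncertainty or impurity in the data. Information gain is the reduction in entropy achieved by a split, and the tree picks the split with the largest gain.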
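To make the two measures concrete, here is a minimal sketch that computes them by hand with NumPy. The function names and the label arrays are hypothetical, chosen only for illustration; scikit-learn computes these quantities internally and does not expose them under these names.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical labels: a balanced parent node and one candidate split
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left = np.array([0, 0, 0, 1])
right = np.array([0, 1, 1, 1])

print(gini_impurity(parent))                   # 0.5 (maximally mixed, 2 classes)
print(entropy(parent))                         # 1.0 bit
print(information_gain(parent, left, right))   # about 0.189
```

A perfectly balanced two-class node has the highest possible impurity under both measures, so any split that separates the classes even partially yields a positive gain.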

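For regression, decision trees choose the splits that minimize the variance of the target within each resulting node, which is equivalent to minimizing the mean squared error around the node's mean prediction.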
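The regression criterion can be sketched the same way. The snippet below scores candidate thresholds on a single feature by the size-weighted mean squared error of the two child nodes and keeps the best one. The helper names and the toy data are assumptions made for illustration, not scikit-learn's internal implementation.

```python
import numpy as np

def node_mse(values):
    """Mean squared error around the node mean (i.e. the variance)."""
    if len(values) == 0:
        return 0.0
    return np.mean((values - values.mean()) ** 2)

def weighted_child_mse(x, y, threshold):
    """Size-weighted MSE of the two children produced by x <= threshold."""
    left = y[x <= threshold]
    right = y[x > threshold]
    return (len(left) * node_mse(left) + len(right) * node_mse(right)) / len(y)

# Hypothetical one-feature regression data with two obvious groups
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 7.0, 20.0, 21.0, 22.0])

# Candidate thresholds: midpoints between consecutive sorted feature values
xs = np.sort(x)
candidates = (xs[:-1] + xs[1:]) / 2

best = min(candidates, key=lambda t: weighted_child_mse(x, y, t))
print(best)  # 6.5, the threshold that separates the two groups
```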

## Implementation Example

Suppose we have a dataset with features like age, income, and student status to predict whether a person buys a computer.

# Import necessary libraries

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report)
import matplotlib.pyplot as plt

# Example data
data = {
    'Age': [25, 45, 35, 50, 23, 37, 32, 28, 40, 27],
    'Income': ['High', 'High', 'High', 'Medium', 'Low', 'Low', 'Low',
               'Medium', 'Low', 'Medium'],
    'Student': ['No', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 
                'Yes', 'No'],
    'Buys_Computer': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 
                      'Yes', 'Yes']
}

df = pd.DataFrame(data)
# Convert categorical features to numeric
df['Income'] = df['Income'].map({'Low': 1, 'Medium': 2, 'High': 3})
df['Student'] = df['Student'].map({'No': 0, 'Yes': 1})
df['Buys_Computer'] = df['Buys_Computer'].map({'No': 0, 'Yes': 1})

# Independent variables (features) and dependent variable (target)
X = df[['Age', 'Income', 'Student']]
y = df['Buys_Computer']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                         test_size=0.2, random_state=0)

# Creating and training the decision tree model
model = DecisionTreeClassifier(criterion='gini', max_depth=3, 
                               random_state=0)
model.fit(X_train, y_train)

# Making predictions
y_pred = model.predict(X_test)

# Evaluating the model

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")

# Plotting the decision tree
# Create a single figure; calling plt.figure() twice would open two figures
plt.figure(figsize=(20, 8), dpi=200)
plot_tree(model, feature_names=['Age', 'Income', 'Student'], 
          class_names=['No', 'Yes'], filled=True)
plt.title('Decision Tree')
plt.show()

Results:

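Accuracy: 0.0
Confusion Matrix:
[[0 0]
 [2 0]]
Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       0.0
           1       0.00      0.00      0.00       2.0

    accuracy                           0.00       2.0
   macro avg       0.00      0.00      0.00       2.0
weighted avg       0.00      0.00      0.00       2.0

The accuracy is 0.0: both test samples belong to class 1 and both were predicted as class 0. With only 10 rows and a 2-row test set, a single split says little about the model. Improving this is left as an exercise.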
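One possible direction for the exercise, offered as a sketch rather than the intended solution: the score above comes from a single 2-row test set, so it depends heavily on which rows happen to be held out. Leave-one-out cross-validation uses every row as a test sample once and averages the results. The data below is the same table as in the example, already encoded with the same mappings. This does not guarantee a higher accuracy; it only gives a less split-dependent estimate.

```python
import pandas as pd
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Same 10 rows as the example, with Income, Student and the target
# already mapped to numbers (Low/Medium/High -> 1/2/3, No/Yes -> 0/1)
df = pd.DataFrame({
    'Age': [25, 45, 35, 50, 23, 37, 32, 28, 40, 27],
    'Income': [3, 3, 3, 2, 1, 1, 1, 2, 1, 2],
    'Student': [0, 0, 0, 0, 1, 1, 1, 1, 1, 0],
    'Buys_Computer': [0, 0, 1, 1, 1, 0, 1, 0, 1, 1],
})

X = df[['Age', 'Income', 'Student']]
y = df['Buys_Computer']

model = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)

# Each of the 10 folds trains on 9 rows and tests on the remaining one
scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print(f"Per-fold accuracy: {scores}")
print(f"Mean leave-one-out accuracy: {scores.mean():.2f}")
```

Other things worth trying are a stratified split (`stratify=y` in `train_test_split`), a different `max_depth`, or simply more data.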

Plot: (figure of the fitted decision tree; image not included)

#### Explanation of the Code

1. Libraries: We import necessary libraries like numpy, pandas, sklearn, and matplotlib.

2. Data Preparation: We create a DataFrame containing features and the target variable. Categorical features are converted to numeric values.

3. Feature and Target: We separate the features (Age, Income, Student) and the target (Buys_Computer).

4. Train-Test Split: We split the data into training and testing sets.

5. Model Training: We create a DecisionTreeClassifier model, specifying the criterion (Gini impurity) and maximum depth of the tree, and train it using the training data.

6. Predictions: We use the trained model to predict whether a person buys a computer for the test set.

7. Evaluation: We evaluate the model using accuracy, the confusion matrix, and the classification report.

8. Visualization: We plot the decision tree to visualize the decision-making process.
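As a complement to step 8, the learned rules can also be printed as plain text with `sklearn.tree.export_text`, which is handy when no plotting backend is available. This sketch fits the tree on all 10 encoded rows for simplicity, so the rules may differ from those of the model trained on the 8-row training split above.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# The example data, encoded as [Age, Income, Student]
X = [
    [25, 3, 0], [45, 3, 0], [35, 3, 0], [50, 2, 0], [23, 1, 1],
    [37, 1, 1], [32, 1, 1], [28, 2, 1], [40, 1, 1], [27, 2, 0],
]
y = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]  # Buys_Computer (0 = No, 1 = Yes)

model = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
model.fit(X, y)

# One line per node: the feature test on internal nodes, the class on leaves
rules = export_text(model, feature_names=['Age', 'Income', 'Student'])
print(rules)
```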

## Evaluation Metrics

- Accuracy: The fraction of predictions that match the true labels.

- Confusion Matrix: Shows the counts of true positives, true negatives, false positives, and false negatives.

- Classification Report: Provides precision, recall, F1-score, and support for each class.
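The relationship between these metrics is easiest to see by deriving them from the confusion matrix by hand. The labels below are hypothetical and unrelated to the computer-purchase data; the scikit-learn functions are used only to cross-check the manual arithmetic.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

# Hypothetical true and predicted labels for a binary problem
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 5 / 8 = 0.625
precision = tp / (tp + fp)                   # 2 / 3
recall = tp / (tp + fn)                      # 2 / 4 = 0.5
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")

# Cross-check against scikit-learn's own implementations
assert abs(accuracy - accuracy_score(y_true, y_pred)) < 1e-12
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12
```

Support, the last column of the classification report, is simply the number of true samples of each class (here 4 and 4).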

Like if you need similar content 😄👍

Hope this helps you 😊

Thursday, 2 January 2025
