Model: ML confusion matrix for Isolation

Published: 22 January 2024 | Version 1 | DOI: 10.17632/mpgzxmfhrc.1
Contributor:
Sunil Maria Benedict

Description

Purpose: Simulated dataset for exploring the performance of a machine learning model in classifying individuals as "at risk" or "not at risk" based on a set of features. Used for testing model accuracy, evaluating feature importance, and understanding model behavior under varying conditions.

Key Components:

Features: Synthetic, not based on real-world data. Five features: Question1, Question2, Question3 (random integers from 1 to 5, since the upper bound of np.random.randint(1, 6) is exclusive) and Trait1, Trait2 (normally distributed random numbers with mean 5, standard deviation 2 and mean 8, standard deviation 3 respectively). They represent potential questionnaire responses or other relevant attributes.

Target Variable: Binary (0 for "not at risk", 1 for "at risk"). Simulated with a 30/70 class distribution (30% "not at risk", 70% "at risk").

Sample Size: 2000 samples.

Machine Learning Model: Random Forest Classifier with 100 estimators and a maximum depth of 10. Trained on 80% of the samples and evaluated on the remaining 20% using standard metrics (accuracy, classification report, confusion matrix).

Considerations:

Simulated Data: Does not reflect the complexity and nuances of real-world data. The target is drawn independently of the features, so accuracy should be read against the majority-class rate of roughly 70% rather than against 50%.

Feature Meaning: Actual meanings of features are not specified, limiting interpretation of results.

Class Balance: Set to a 30/70 split, which is still imbalanced and not representative of all real-world scenarios.

Next Steps:

Validate with Real-World Data: Assess model performance on actual data to ensure generalisability.

Incorporate Additional Features: Explore incorporating more complex and realistic features.

Explore Different Models: Experiment with other algorithms to compare performance.

Address Class Imbalance: Consider techniques like oversampling or under-sampling to handle imbalanced datasets effectively.
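The oversampling idea listed under Next Steps could look roughly like the sketch below. It is not part of the published script: the stand-in DataFrame, the column subset, the seed, and the choice of sklearn.utils.resample are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Hypothetical stand-in data with the same 30/70 class split as the dataset.
rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
n = 1000
df = pd.DataFrame({
    "Question1": rng.integers(1, 6, n),   # integers 1..5
    "Trait1": rng.normal(5, 2, n),
    "target": rng.choice([0, 1], size=n, p=[0.3, 0.7]),
})

# Separate the two classes (1 = "at risk" is the majority here).
majority = df[df["target"] == 1]
minority = df[df["target"] == 0]

# Random oversampling: draw minority rows with replacement until the
# minority class is as large as the majority class.
minority_upsampled = resample(
    minority,
    replace=True,
    n_samples=len(majority),
    random_state=0,
)

# Recombine and shuffle so the classes are interleaved.
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=0)
counts = balanced["target"].value_counts()
print(counts)
```

In practice the resampling would normally be applied to the training split only, after train_test_split, so that duplicated rows do not leak into the test set.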

Steps to reproduce

# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

# Generate synthetic data for simulation
np.random.seed(42)
num_samples = 2000  # Increase the sample size

# Features: More complex features
features = pd.DataFrame({
    'Question1': np.random.randint(1, 6, num_samples),
    'Question2': np.random.randint(1, 6, num_samples),
    'Question3': np.random.randint(1, 6, num_samples),
    'Trait1': np.random.randn(num_samples) * 2 + 5,  # Additional feature for complexity
    'Trait2': np.random.randn(num_samples) * 3 + 8   # Additional feature for complexity
    # Add more features as needed
})

# Target: Simulated target variable (1 for at risk, 0 for not at risk)
# Generate a more balanced dataset with a higher prevalence of the "at risk" class
target = np.random.choice([0, 1], size=num_samples, p=[0.3, 0.7])  # Adjust class distribution

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

# Create and train the RandomForestClassifier with adjusted hyperparameters
model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
predictions = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy:.2f}')

# Display classification report
print('\nClassification Report:\n', classification_report(y_test, predictions))

# Plot confusion matrix
cm = confusion_matrix(y_test, predictions)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['Not at Risk', 'At Risk'],
            yticklabels=['Not at Risk', 'At Risk'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
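Because the script draws the target independently of the features, a useful sanity check is to compare the Random Forest against a majority-class baseline and to look at the impurity-based feature importances. The sketch below is an addition, not part of the published steps: it regenerates the same data and uses scikit-learn's DummyClassifier as the baseline, and the variable names baseline, baseline_acc, model_acc and importances are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Regenerate the synthetic data exactly as in the steps above.
np.random.seed(42)
num_samples = 2000
features = pd.DataFrame({
    'Question1': np.random.randint(1, 6, num_samples),
    'Question2': np.random.randint(1, 6, num_samples),
    'Question3': np.random.randint(1, 6, num_samples),
    'Trait1': np.random.randn(num_samples) * 2 + 5,
    'Trait2': np.random.randn(num_samples) * 3 + 8,
})
target = np.random.choice([0, 1], size=num_samples, p=[0.3, 0.7])

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# Same Random Forest configuration as the original script.
model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
model.fit(X_train, y_train)
model_acc = accuracy_score(y_test, model.predict(X_test))

print(f'Baseline accuracy:      {baseline_acc:.2f}')
print(f'Random Forest accuracy: {model_acc:.2f}')

# Impurity-based importances; with a target unrelated to the features,
# these reflect noise fitting rather than genuine predictive signal.
importances = pd.Series(model.feature_importances_, index=features.columns)
print(importances.sort_values(ascending=False))
```

If the two accuracies are close, the model has learned little beyond the class prior, which is the expected outcome for this simulation.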

Institutions

CMR Group of institutions

Categories

Psychology, Cognitive Psychology, Machine Learning

Licence