insurance_claims
Description
The dataset is accessible via a GitHub repository and includes features such as 'months_as_customer', 'age', and 'policy_number'. The variable of primary interest is 'fraud_reported', which indicates whether a claim was flagged as fraudulent.
Files
Steps to reproduce
Steps to Reproduce the Research on Fraud Detection in Insurance Claims through Machine Learning Techniques

Data Acquisition:
- Obtain the dataset titled "Insurance_claims" from the following GitHub repository: https://github.com/mwitiderrick/insurancedata/blob/master/insurance_claims.csv
- Download and store the dataset locally for easy access during subsequent steps.

Data Loading & Initial Exploration:
- Use Python's Pandas library to load the dataset into a DataFrame.
  Code used:
  import pandas as pd
  # Load the dataset file
  insurance_df = pd.read_csv('insurance_claims.csv')
- Inspect the initial rows, data types, and summary statistics to understand the dataset's structure.

Data Cleaning & Pre-processing:
- Handle missing values, if any; strategies may include imputation or deletion, depending on the nature of the missing data.
- Identify and handle outliers. In this research, outliers in the 'umbrella_limit' column in particular were addressed.
- Normalize or standardize features if necessary.

Exploratory Data Analysis (EDA):
- Use visualization libraries such as Matplotlib and Seaborn in Python for graphical exploration.
- Examine distributions, correlations, and patterns in the data, especially between the features and the target variable 'fraud_reported'.
- Identify features that exhibit distinct patterns for fraudulent and non-fraudulent claims.

Feature Engineering & Selection:
- Create or transform existing features to improve model performance.
- Use techniques like Recursive Feature Elimination with Cross-Validation (RFECV) to identify and retain only the most informative features.

Modeling:
- Split the dataset into training and test sets to ensure the model's generalizability.
- Implement machine learning algorithms such as Support Vector Machine, Random Forest, and a Voting Classifier using libraries like Scikit-learn.
- Handle class imbalance using methods like the Synthetic Minority Over-sampling Technique (SMOTE).

Model Evaluation:
- Evaluate the performance of each model using metrics like precision, recall, F1-score, ROC-AUC score, and the confusion matrix.
- Fine-tune the models based on the results. Hyperparameter tuning can be performed using techniques like Grid Search or Random Search.

Model Interpretation:
- Use methods like SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) to interpret and understand the predictions made by the model.

Deployment & Prediction:
- Use the best-performing model to make predictions on unseen data.
- If the intention is to deploy the model in a real-world scenario, serialize the trained model into a format suitable for deployment (e.g., using libraries like joblib or pickle).

Software & Tools:
- Programming language: Python (run via Google Colab)
- Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, Imbalanced-learn, LIME, and SHAP
- Environment: Google Colab, Jupyter Notebook, or any Python IDE

Short illustrative code sketches for the cleaning, EDA, feature-selection, modeling, evaluation, interpretation, and deployment steps are given below; they are minimal examples under stated assumptions, not the exact code used in the study.
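Cleaning sketch. The snippet below shows one plausible way to handle missing-value markers and the 'umbrella_limit' outliers; treating '?' as a missing-value marker, dropping fully empty columns, and clipping negative limits to zero are assumptions for illustration, not necessarily the choices made in the study.

  import pandas as pd
  import numpy as np

  insurance_df = pd.read_csv('insurance_claims.csv')

  # Treat '?' entries, if present, as missing values before imputing or deleting.
  insurance_df = insurance_df.replace('?', np.nan)

  # Drop columns that are entirely empty (a common artifact of exported CSVs).
  insurance_df = insurance_df.dropna(axis=1, how='all')

  # Example outlier handling for 'umbrella_limit': clip negative limits to zero,
  # since a negative umbrella limit is not meaningful.
  insurance_df['umbrella_limit'] = insurance_df['umbrella_limit'].clip(lower=0)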
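EDA sketch. Continuing from the cleaned DataFrame, this compares a numeric feature across the two classes of 'fraud_reported' and plots a correlation heatmap; the column 'total_claim_amount' and the 'Y'/'N' label coding are assumptions about this dataset.

  import matplotlib.pyplot as plt
  import seaborn as sns

  # Distribution of claim amounts by reported fraud status.
  sns.boxplot(data=insurance_df, x='fraud_reported', y='total_claim_amount')
  plt.title('Claim amount by reported fraud status')
  plt.show()

  # Correlation heatmap over the numeric columns only.
  numeric_cols = insurance_df.select_dtypes(include='number')
  sns.heatmap(numeric_cols.corr(), cmap='coolwarm', center=0)
  plt.show()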
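Feature-selection sketch. A minimal RFECV example with a random-forest estimator; dropping 'policy_number' as an identifier, one-hot encoding the categorical columns, and mapping the target as 'Y' -> 1 / 'N' -> 0 are assumptions for illustration.

  import pandas as pd
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.feature_selection import RFECV
  from sklearn.model_selection import StratifiedKFold

  # Encode features and binarize the target (label coding assumed).
  X = pd.get_dummies(insurance_df.drop(columns=['fraud_reported', 'policy_number']))
  y = insurance_df['fraud_reported'].map({'Y': 1, 'N': 0})

  selector = RFECV(
      estimator=RandomForestClassifier(n_estimators=200, random_state=42),
      step=1,
      cv=StratifiedKFold(n_splits=5),
      scoring='f1',
  )
  selector.fit(X, y)
  selected_features = X.columns[selector.support_]
  print(selector.n_features_, 'features retained:', list(selected_features))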
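Modeling sketch. Continuing from X and y above: a stratified train/test split, SMOTE applied only to the training split (so no synthetic samples leak into evaluation), and a soft-voting ensemble of an SVM and a random forest. The specific hyperparameters and the scaling pipeline around the SVM are illustrative choices.

  from sklearn.ensemble import RandomForestClassifier, VotingClassifier
  from sklearn.model_selection import train_test_split
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import StandardScaler
  from sklearn.svm import SVC
  from imblearn.over_sampling import SMOTE

  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.2, stratify=y, random_state=42)

  # Oversample the minority (fraud) class in the training data only.
  X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

  svm_clf = make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))
  rf_clf = RandomForestClassifier(n_estimators=200, random_state=42)
  voting_clf = VotingClassifier(
      estimators=[('svm', svm_clf), ('rf', rf_clf)], voting='soft')
  voting_clf.fit(X_train_res, y_train_res)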
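Evaluation and tuning sketch. Continuing from the fitted ensemble: the listed metrics on the held-out test set, plus a grid search over the random-forest component; the parameter grid here is an assumption, not the grid used in the study.

  from sklearn.ensemble import RandomForestClassifier
  from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
  from sklearn.model_selection import GridSearchCV

  y_pred = voting_clf.predict(X_test)
  y_proba = voting_clf.predict_proba(X_test)[:, 1]
  print(classification_report(y_test, y_pred))   # precision, recall, F1-score
  print(confusion_matrix(y_test, y_pred))
  print('ROC-AUC:', roc_auc_score(y_test, y_proba))

  # Hyperparameter tuning for the random-forest component via grid search.
  param_grid = {'n_estimators': [100, 200, 500], 'max_depth': [None, 5, 10]}
  grid = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring='f1', cv=5)
  grid.fit(X_train_res, y_train_res)
  print('Best parameters:', grid.best_params_)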
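Interpretation sketch. One way to explain a single prediction of the fitted ensemble with LIME; the class names and the choice of the first test row are illustrative, and SHAP could be used analogously.

  import numpy as np
  from lime.lime_tabular import LimeTabularExplainer

  explainer = LimeTabularExplainer(
      training_data=np.asarray(X_train_res, dtype=float),
      feature_names=list(X.columns),
      class_names=['legitimate', 'fraud'],
      mode='classification',
  )
  row = np.asarray(X_test.iloc[0], dtype=float)
  explanation = explainer.explain_instance(row, voting_clf.predict_proba, num_features=10)
  print(explanation.as_list())   # top features pushing this prediction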
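Deployment sketch. Serializing the best-performing model with joblib and reloading it for prediction on new data; the file name 'fraud_model.joblib' is an arbitrary example.

  import joblib

  joblib.dump(voting_clf, 'fraud_model.joblib')       # persist the trained model
  model = joblib.load('fraud_model.joblib')           # reload it for inference
  new_predictions = model.predict(X_test.head())      # stand-in for unseen claims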