Published: 22 August 2023 | Version 2 | DOI: 10.17632/992mh7dk9y.2


The dataset, named "insurance_claims.csv", is a comprehensive collection of insurance claim records. Each row represents an individual claim, and each column represents a feature associated with that claim, such as 'months_as_customer', 'age', and 'policy_number', among others. The main focus is the 'fraud_reported' variable, which indicates claim legitimacy. Claims data were sourced from various insurance providers and cover a diverse array of insurance types, including vehicular, property, and personal injury. Each claim record describes the individual's background, claim specifics, associated documentation, and feedback from insurance professionals. The dataset also includes the specific indicators and parameters that were considered during each claim's assessment. For privacy reasons, and in agreement with the participating insurance providers, certain personal details and specific identifiers have been anonymized: instead of names or direct identifiers, each entry is associated with a unique ID, which preserves privacy while retaining data integrity. The insurance claims were examined through both manual assessments and automated checks, and the outcome, whether a claim was deemed fraudulent or not, is indicated for each record.


Steps to reproduce

Steps to Reproduce the Research on Fraud Detection in Insurance Claims through Machine Learning Techniques

Data Acquisition:
- Obtain the dataset titled "Insurance_claims" from the following Mendeley repository: https://
- Download and store the dataset locally for easy access during subsequent steps.

Data Loading & Initial Exploration:
- Use Python's Pandas library to load the dataset into a DataFrame.
  Python code used:
    # Load the dataset file
    import pandas as pd
    insurance_df = pd.read_csv('insurance_claims.csv')
- Inspect the initial rows, data types, and summary statistics to understand the dataset's structure.

Data Cleaning & Pre-processing:
- Handle missing values, if any. Strategies may include imputation or deletion, depending on the nature of the missing data.
- Identify and handle outliers. In this research, outliers in the 'umbrella_limit' column in particular were addressed.
- Normalize or standardize features if necessary.

Exploratory Data Analysis (EDA):
- Use visualization libraries such as Matplotlib and Seaborn in Python for graphical exploration.
- Examine distributions, correlations, and patterns in the data, especially between features and the target variable 'fraud_reported'.
- Identify features that exhibit distinct patterns for fraudulent and non-fraudulent claims.

Feature Engineering & Selection:
- Create new features or transform existing ones to improve model performance.
- Use techniques like Recursive Feature Elimination with Cross-Validation (RFECV) to identify and retain only the most informative features.

Modeling:
- Split the dataset into training and test sets so that the model's generalizability can be assessed on held-out data.
- Implement machine learning algorithms such as Support Vector Machine, RandomForest, and Voting Classifier using libraries like Scikit-learn.
- Handle class imbalance using methods like the Synthetic Minority Over-sampling Technique (SMOTE).
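
The cleaning, splitting, and modeling steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the quantile clipping of 'umbrella_limit', the 'Y' label for fraudulent claims, the dropping of 'policy_number' as an identifier, and the helper names `clip_outliers`, `build_model`, and `train_fraud_model` are all assumptions. Class imbalance is handled here with `class_weight="balanced"` as a stand-in; SMOTE from Imbalanced-learn, which the steps name, is not shown.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC


def clip_outliers(series, lower_q=0.01, upper_q=0.99):
    """Clip a numeric column to a quantile range (an assumed outlier treatment)."""
    low, high = series.quantile([lower_q, upper_q])
    return series.clip(lower=low, upper=high)


def build_model(numeric_cols, categorical_cols, random_state=42):
    """Imputation, scaling, and encoding followed by a soft-voting SVM + random forest."""
    preprocess = ColumnTransformer([
        ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
         numeric_cols),
        ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                              OneHotEncoder(handle_unknown="ignore")),
         categorical_cols),
    ])
    ensemble = VotingClassifier(
        estimators=[
            # class_weight="balanced" stands in for SMOTE in this sketch.
            ("svm", SVC(probability=True, class_weight="balanced",
                        random_state=random_state)),
            ("rf", RandomForestClassifier(n_estimators=200, class_weight="balanced",
                                          random_state=random_state)),
        ],
        voting="soft",
    )
    return Pipeline([("preprocess", preprocess), ("classifier", ensemble)])


def train_fraud_model(df, target="fraud_reported", positive_label="Y",
                      id_cols=("policy_number",), random_state=42):
    """Fit the pipeline on a stratified training split; return it with the test split."""
    df = df.copy()
    if "umbrella_limit" in df.columns:
        # Assumption: extreme values are clipped rather than removed.
        df["umbrella_limit"] = clip_outliers(df["umbrella_limit"])

    # Assumption: fraudulent claims are labelled with positive_label ('Y').
    y = (df[target] == positive_label).astype(int)
    drop = [target] + [c for c in id_cols if c in df.columns]
    X = df.drop(columns=drop)

    numeric_cols = X.select_dtypes(include="number").columns.tolist()
    categorical_cols = [c for c in X.columns if c not in numeric_cols]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state
    )
    model = build_model(numeric_cols, categorical_cols, random_state)
    model.fit(X_train, y_train)
    return model, X_test, y_test
```

With the real file, a call such as `model, X_test, y_test = train_fraud_model(pd.read_csv('insurance_claims.csv'))` would return a fitted pipeline and a held-out split. Putting imputation and scaling inside the pipeline means they are fitted on the training split only, which avoids leaking test-set statistics into the model.
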

Model Evaluation:
- Evaluate the performance of each model using metrics like precision, recall, F1-score, ROC-AUC score, and the confusion matrix.
- Fine-tune the models based on the results. Hyperparameter tuning can be performed using techniques like Grid Search or Random Search.

Model Interpretation:
- Use methods like SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) to interpret the predictions made by the model.

Deployment & Prediction:
- Use the best-performing model to make predictions on unseen data.
- If the model is to be deployed in a real-world scenario, save the trained model in a format suitable for deployment (e.g., using libraries like joblib or pickle).

Software & Tools:
- Programming Language: Python (version as provided by Google Colab)
- Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, Imbalanced-learn, LIME, and SHAP.
- Environment: Jupyter Notebook or any Python IDE.
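
The evaluation, tuning, and persistence steps above might look roughly like the sketch below. It assumes a fitted Scikit-learn classifier that exposes `predict_proba` and a target encoded as 0/1 with 1 meaning fraud. The helper names (`tune_hyperparameters`, `evaluate`, `save_model`, `load_model`) and the choice of F1 as the tuning score are illustrative, not taken from the original research.

```python
import joblib
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV


def tune_hyperparameters(estimator, param_grid, X_train, y_train, cv=5):
    """Grid search over param_grid, scored by F1 on the positive (fraud) class."""
    search = GridSearchCV(estimator, param_grid, scoring="f1", cv=cv)
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_params_


def evaluate(model, X_test, y_test):
    """Collect precision/recall/F1, the confusion matrix, and ROC-AUC on held-out data."""
    y_pred = model.predict(X_test)
    # Assumption: labels are 0/1, so column 1 holds the fraud probability.
    y_score = model.predict_proba(X_test)[:, 1]
    return {
        "report": classification_report(y_test, y_pred, output_dict=True,
                                        zero_division=0),
        "confusion_matrix": confusion_matrix(y_test, y_pred, labels=[0, 1]),
        "roc_auc": roc_auc_score(y_test, y_score),
    }


def save_model(model, path):
    """Persist a fitted model to disk with joblib."""
    joblib.dump(model, path)


def load_model(path):
    """Reload a model previously written by save_model."""
    return joblib.load(path)
```

Because fraud is typically the minority class, accuracy alone can look good for a model that rarely flags anything, so the per-class precision and recall in `report` and the confusion matrix are usually more informative. A model reloaded with `load_model` should be used with the same library versions it was saved under.
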


Insurance Fraud