+1 (315) 557-6473 

A Linear Regression Approach for Insurance Claims and Fraud Detection Using Excel

Unlock the power of Excel for precise fraud detection in insurance claims with our comprehensive linear regression model. This skilful approach not only identifies key variables such as claim amount and annual income but also optimizes investigation thresholds, showcasing the efficiency and cost-effectiveness of an Excel-driven solution in safeguarding against fraudulent activities. Embrace the future of fraud detection with precision and ease using Excel.

Problem Description:

Fraudulent insurance claims pose a significant challenge, consuming valuable resources and costing companies substantial financial losses. This paper addresses this issue through a statistical fraud detection framework applied to insurance claims using linear regression using Excel assignment. The model, employing multiple linear regression, analyzes customer data, including variables such as annual income, marital status, age, and more, to identify potential fraud instances.

Data Selection: A random sample of 8,000 data points was selected from a pool of approximately 9,000 for model development.

Model Development: Variable selection is crucial in building an effective linear regression model. After exploring various variables against fraudulent frequency, the following were chosen:

  • Claim Amount
An analysis of the average claim amount used in the regression model

Chart 1: An analysis of the average claim amount used in the regression model

  • Employment status
Employment status vs. fraud detection

Chart 2: Employment status vs. fraud detection

  • Age
Age vs. fraud detection

Chart 3: Age vs. fraud detection

  • Annual income
Fraud
Average of Annual Income ( $ )
0 41,017.28
1 23,802.24
Grand Total 37507.57

Table 1: The average annual income vs. fraud detection

  • Monthly Premium
  • Marital Status (Single)
Marital status vs. fraud detection

Chart 4: Marital status vs. fraud detection

Model Results:

The multiple regression model explained approximately 19.4% of the variation in fraud using the selected regressors. The model was statistically significant (F = 275.34, p < 0.001), indicating its efficacy in predicting fraud likelihood. While marital status (single) and employment status were statistically insignificant, they were retained in the model due to similar adjusted R-squared values.

Prediction of Fraud:

The likelihood of fraud was predicted using the estimated equation, provided below. A higher value signifies a greater likelihood of a fraudulent claim.

y ̂=0.697-0.012×Age+0.0003×Montly Premium-0.0000098×Claim Amount-0.000024×Annual Income+0.002×I(Single)+0.063×I(Unemployed)+0.039×I(Retired)

Investigation Decision Rule:

The threshold for fraud investigation was optimized through a cost function. The model suggests a threshold of 0.25, resulting in 3,350 investigations and a total cost of $4.94 million to the firm.

Threshold Investigated? Non-Fraud Investigated Fraud Not Investigated Cost
0.1 5785 4195 41 $ 5,990,000
0.15 5106 3555 80 $ 5,506,000
0.2 4231 2783 183 $ 5,146,000
0.25 3350 2037 318 $ 4,940,000
0.3 2463 1360 528 $ 5,103,000
0.35 1731 846 746 $ 5,461,000
0.4 1076 432 987 $ 6,011,000
0.45 619 218 1230 $ 6,769,000
0.5 344 100 1387 $ 7,279,000
0.55 112 15 1534 $ 7,782,000
0.6 0 0 1631 $ 8,155,000

Table 2: The cost of investigation of fraud on the firm

Model Evaluation:

Comparing the model-based recommendation to a random decision model:

  • Total Investigated: 3,350
  • Investigation Cost: $3.35 million
  • Fraud Cost: $4.775 million
  • Total Cost: $8.125 million
  • Cost from Regression: $4.94 million
  • Money Saved: $3.185 million

The regression model's cost was $3.185 million lower, demonstrating a 64.5% cost reduction compared to the random decision model.

Areas of Improvement:

  1. Consider using logistic regression for binary response variables.
  2. Explore penalized methods or best subset selection for more optimal variable selection.
  3. Incorporate higher-order terms to capture additional variation in the data.