How to Do Linear Regression Analysis in Excel

Regression Analysis is a technique to forecast the relationship between data variables of entities such as customers, orders, transactions, etc. These are the parameters based on which business decisions can be taken for sales, finance, and marketing operations. To understand this in a simpler way, let us take some examples.

Example 1 – Risk Analytics in Retail Banking

Credit risk scores of customers are built by the banking industry to predict the customer’s delinquency behavior and to evaluate the creditworthiness of each customer while processing loan applications.

Example 2 – Marketing Analytics in eCommerce

To predict customers’ buying patterns based on the customer’s past transactions from an online portal. The e-commerce industry is highly competitive so they use analytics to predict customers’ purchasing patterns.

Example 3 – Pricing Analytics in Stock Market

Stockbrokers and investors use analytics to forecast stock pricing based on the past 52-week pricing trend and many other parameters. The analytics helps them to make the decisions on buying or selling the stocks based on the future pricing predictions.

Example 4 – Behavioral Analytics in Human Resource Industry

Human resource departments/companies use analytics to predict employees’ behavioral patterns that will enable them to decide the increment or promotion to be given on a yearly basis.

Regression Analysis Basics

Before we go for how to do regression in Excel by taking a case study, let us understand the dataset and its components.

Population vs. Sample

The population is referred to as all members of a defined group that are considered for studying information on data-driven decisions. For example – The current inflation rates of EU countries.

A sample is a part of the “population”. It can be biased or unbiased. For example – The current inflation rates of EU countries having per capita income < 50000 EURO per annum.

Types of Data Variables

The below diagram shows the types of data variables under each category.

Data Snapshot

The table below gives an example of a data snapshot with different data variables.

#	Cust.Name	Cust. ID	No.Of credit cards	Gender	Marital status	Age	Annua lSalary	Monthly Credit Card Usage
1	Josh	111669	5	F	Never Married	43	88,001	Low
2	John	223112	6	M	Married	24	592,330	Low
3	Alen	189123	4	M	Divorced	50	272,304	Low
4	Chaya	171690	3	F	Married	37	140,400	Low
5	Dandre	16625	6	M	Never Married	23	105, 234	Low
6	Justin	149171	7	M	Divorced	34	358,534	Low
7	Neil	169254	5	M	Married	36	510,321	Low
8	Emily	149771	2	F	Never Married	26	164,732	Low
9	Janice	175646	2	F	Married	35	103,345	Low
10	Farhan	170639	4	M	Never Married	63	724,788	Low
11	Tony	113136	3	M	Married	70	105,450	Low

Let us understand the data variables available in the above table.

#	Cust.Name	Cust. ID	No.of credit cards	Gender	Marital Status	Age	Annual Salary	Monthly Credit Card Usage
–	–	–	Numerical (Discrete)	Categorical (Binary)	Categorical (Nominal)	Numerical (Discrete)	Numerical (Continuous)	Categorical (Ordinal)

Summarizing Data

The next step is to summarize data in a presentable format that will give the report on which analysis can be done. The ways to summarize the data are as follows:

Frequency Distribution – this is a simple way of counting distinct discrete values. For example – The number of credit cards owned by 3000 customers as sample data. Can be shown in tabular format or as a graph using Excel.

Steps to do this in Excel:

Go to the Data tab in your Excel Worksheet.

Click Advanced. The dialog is shown

Go to the Number of Cards column. Sort them into ascending order of values.

Insert the formula in # of customers to set the frequency.

You will get the values below.

Go to the Insert tab. Click the Column icon. Select the 2-D Chart.

The chart will be generated and shown below.

Grouped Frequency Distribution – this is a method to summarize discrete variables having a large number of observations and a range of values. For example – The number of customers falling under different salary ranges or groups.

Follow the steps in Excel as per the instructions given on the screen below.

Cumulative Frequency Distribution – this is obtained by accumulating the frequencies to give the total number of observations up to and including the value or group in question. Illustration as per the screen below.

Steps to calculate the cumulative frequency distribution are given on the screen below.

Stem Leaf Diagram – this technique is used in a limited set of data. For a large set of data, this method is not used. Example: Given the age of 20 individuals in years.

Line Plots – this technique is also not suitable for a large set of data. It is not extensively used in the industry. Example: Given test scores of 20 students.

The measure of Central Tendency – Median

The central tendency measure is robust to the effects of extreme observations.

The median is a value, which splits the data set into two equal halves. Example – Calculate the median for a sample of 3000 individuals having credit cards along with a demonstration of extreme observations.

The measure of Central Tendency – Mode

It is the value that occurs with the greatest frequency or the most typical value. For Example, finding the mode for a sample of 3000 individuals having credit cards. Excel has an in-built function “Mode” for granular data. For summarized data, it can be found easily by visual inspection. Illustration as below.

Measure of Spread

The different measures of spread are –

Variance and Standard Deviation – below screen shows how to calculate the values.

The Range – It is a very simple measure of spread defined, as its name suggests, by the difference between the largest and smallest observations in the data set.

The Interquartile range – the quartiles divide a set of data into four quarters. They are denoted by Q1, Q2, and Q3. Q1 is called the lower quartile and Q3 is the upper quartile. The following image shows how to calculate the Interquartile Range for a sample of 3000 individuals having credit cards.

Symmetry and Skewness of data

It represents the shape of the distribution of a dataset, that is, whether it is symmetric or skewed to one side or the other. The approximate shape of a distribution can be determined by looking at a histogram. The below diagram shows the mean, median, mode, and variance for symmetric and skewed data.

A tabular representation of data is shown below.

After evaluating the dataset by the above measures and techniques, a regression analysis can be conducted. Now, the question is which type of regression model is required for predictions.

The next section describes the measures that will state the relationship between the data variables and determine how strong the relationship is between identified variables.

Regression Analysis Models

The following are different techniques or formulas that are used to do regression analysis. These techniques are nothing but different formulas or equations that are used to determine the relationship between X and Y variables.

Covariance and correlation coefficient are the measures used in regression analysis.

Covariance is a measure that helps to find out the direction of the relationship between the x variable and y variables. In simple terms, what happens to y when x increases or decreases.

The following equation is used to determine the covariance measure.

A correlation coefficient is a measure that tells how strong the linear relationship is between X and Y variables. It is termed an r-value that determines the quality of a model. The parameter is represented by p or r values.
- Population correlation is denoted by ρ (rho)
- Sample correlation is denoted by r.

The following equation is used to determine the correlation coefficient measure for population data.

In real-life data analytics, the correlation measure is calculated on sampling distribution. There are a couple of reasons why sampling is used for analytics.

Features of ρ and r

The physical impossibility of checking all items in the population.
The cost of studying all the items in a population.
The sample results are usually adequate.
Contacting the whole population would often be time-consuming.

Unit free and ranges between -1 and 1
The closer to -1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker the linear relationship

Negative Linear Relationship

Positive Linear Relationship

Relationship Not Linear

No Relationship

Tools used to do Regression Analysis

Following software tools are the most used in performing Regression Analysis. The oldest tool is Microsoft Excel.

SAS (www.sas.com/)
SPSS (http://www-01.ibm.com/software/analytics/spss/ )
R (http://www.r-project.org/)
Microsoft Excel (http://office.microsoft.com/en-in/)

As the spectrum of analytics is widened, dataset volume has increased, the data scientists have started using Python.

We will see a case study to run regression analysis in Excel.

Case Study

An Auto Insurance company has the data of “Loss” amount and policy-related information. The company wants to know the factors responsible for losses in a multivariate fashion using a regression model.

Prior to running the regression model, the following things need to be in place.

Variable identification –
- Identifying the dependent (response) variable.
- Identifying the independent (explanatory) variables.
- Variable categorization (e.g. Numeric, Categorical, Discrete, Continuous, etc.)
- Creation of Data Dictionary
Response variable exploration
- Distribution analysis
  - Percentiles
  - Variance
  - Frequency distribution
- Outlier treatment
  - Identify the outliers/threshold limit
  - Cap/floor the values at the thresholds
Independent variable analyses
- Identify the prospective independent variables (that can explain the response variable)
- Bivariate analysis of response variable against independent variables
- Variable treatment /transformation
  - Grouping of distinct values/levels
  - Mathematical transformation e.g. logs, splines, etc.

After performing the above data-related activities, the last step is iterative in terms of running the regression.

Fitting the regression (first time)

Check for correlation between independent variables
- This is to take care of Multicollinearity
Variable selection
- Check for the most suitable transformed variable
- Select the transformation giving the best fit
- Reject the statistically insignificant variables

Fitting the regression (iteration)

Analysis of results
- Model comparison
- Model performance check
Actual vs Predicted comparison

Steps to do regression analysis in Excel

We will perform the following steps in Excel with reference to the case study explained in the section above.

Data Analysis

Step 1. Snapshot of Data and Data Description

The above data contains policy holders’ and loss amount information (variables)

Policy Number
Age
Years of Driving Experience
Number of Vehicles
Gender
Married
Vehicle Age
Fuel Type
Losses (Dependent/Response Variable)

A Data Dictionary is formed as below.

Sl #	Variable Name	Variable Description	Values Stored	Variable Type
1	Policy Number	Unique Policy Number	Unique value identifying the policy	Identifier
2	Age	Age of Policyholder	16, 17,…,70	Numerical (Discrete)
3	Years of Driving Experience	Years of Driving Experience of the Policyholder	0,1,….,53	Numerical (Discrete)
4	Number of Vehicles	Number of Vehicles insured under the policy	1,2,3,4	Numerical (Discrete)
5	Gender	Gender of the Policyholder	F, M	Categorical (Binary)
6	Married	Marital status of the Policyholder	Married, Single	Categorical (binary)
7	Vehicle Age	Age of vehicle insured under the policy	0,1,…,15	Numerical (Discrete)
8	Fuel Type	Fuel type of the vehicle insured	D, P	Categorical (Binary)
9	Losses	Loss amount claimed under the policy	Range: 13- 3500	Numerical (Continuous)

Data Dictionary

Step 2. Frequency Distribution – Loss amount

Step 3. Calculate Capped Losses using grouped frequency distribution

Step 3. Perform variable analysis in Excel. The first bivariate profiling we will do for Age vs. Capped Loss for policies.

Step 4. The second bivariate profiling we will do for Years of Driving Experience vs. Capped Loss for policies.

Step 5. The next bivariate profiling we will do for Gender vs. Capped Loss for policies.

Step 6. The next bivariate profiling we will do for Marital Status vs. Capped Loss for policies.

Step 7: The next bivariate profiling we will do for Vehicle Age vs. Capped Loss for policies.

MblOl sS3tYL7Z R 9bIn WgmNc5PT9OL8Uy3IRSgLryNqxpiWa8jeAPuZSWaLFJlQm1Aeb0tNgjlRT7pvNJKKbVZimKLDSgo08sywYzeV49j BZ5UqGNh bR1qNPvyBegloJs

Step 8: The next bivariate profiling we will do for Vehicle Age Band vs. Capped Loss for policies.

Step 9: The next bivariate profiling we will do for Vehicles vs. Capped Loss for policies.

Step 10: The next bivariate profiling we will do for Fuel Type vs. Capped Loss for policies.

Preparing Excel for Regression Analysis

Step 1: Open the Office icon from the toolbar.

Step 2: Go to Excel Options.

Step 3: Click Add-Ins.

Step 4: From the Manage drop-down, select Excel Add-ins.

Step 5: Select the Analysis Toolpak check box and the Solver Add-in check box.

Step 6: Select the Data tab. Click Data Analysis.

Step 7: In the Data Analysis dialog box, click Regression.

Step 8: Select the data cell reference range. Click OK.

Step 9: See the summary results as below.

Step 10: Picking up the variables based on the following activities.

We will perform the following tests to select by running :

Multicollinearity
Banding of variables
Statistical significance of variables tested

List of independent variables:

Age
Age Band
Years of Driving Experience
Number of Vehicles
Gender
Married
Vehicle Age
Vehicle Age Band
Fuel Type

“Age” and “Years of Driving Experience” are highly correlated (Correlation Coefficient = 0.9972). We can use either of the variables in the regression.

Fit two separate models using either of the variables one at a time. Check for the goodness of fit (R² in this case). The variable producing higher R² gets accepted.

Age

*Regression Statistics (Age)*
Multiple R	0.475766
R Square	0.226354
Adjusted R Square	0.226303
Standard Error	201.2306
Observations	15290

Years of Driving Experience

*Regression Statistics(Yrs Driving Experience)*
Multiple R	0.475273
R Square	0.225885
Adjusted R Square	0.225834
Standard Error	201.2916
Observations	15290

R² for Age > R² for Years of Driving Experience
Reject Years of Driving Experience

Banding of Variables

Investigate whether to use “Age” or “Age band”
Fit regression independently using “Age” and “Age Band”
Before fitting regression, “Age Band” needs to be converted to numerical form from categorical. Replace “Age Band” values with “Average Age” for the particular band.

Age Band	Sum of Age	# Policies	Average Age
16-25	93,770.0	4,563.0	20.6
26-59	270,793.0	6,384.0	42.4
60+	282,636.0	4,343.0	65.1

Regressions results using “Age” and “Average Age”

*Regression Statistics (Age)*
Multiple R	0.475766
R Square	0.226354
Adjusted R Square	0.226303
Standard Error	201.2306
Observations	15290

*Regression Statistics (Average Age*)
Multiple R	0.509969
R Square	0.260068
Adjusted R Square	0.26002
Standard Error	196.7971
Observations	15290

R2 for Average Age > R2 for Age
Select “Average Age”

Investigate whether to use “Vehicle Age” or “Vehicle Age band”

Fit regression independently using “Vehicle Age” and “Vehicle Age Band”
Before fitting regression, “Vehicle Age Band” needs to be converted to numerical form from categorical. Replace “Vehicle Age Band” values with “Vehicle Average Age” for the particular band.

Vehicle Age Band	Sum of Vehicle Age	# Policies	Average Vehicle Age
0-5	9,229	3,688	2.50
6-10	44,298	5,523	8.02
11+	78,819	6,079	12.97

Regressions results using “Vehicle Age” and “Average Vehicle Age”

*Regression Statistics (Vehicle Age)*
Multiple R	0.289431325
R Square	0.083770492
Adjusted R Square	0.083710561
Standard Error	218.9903277
Observations	15290

*Regression Statistics (Average Vehicle Age*)
Multiple R	0.303099405
R Square	0.09186925
Adjusted R Square	0.091809848
Standard Error	218.0203272
Observations	15290

R² for Average Vehicle Age > R² for Vehicle Age
Select “Average Vehicle Age”

Now, we have identified the variables that can be used in fitting the regression model.

Age Band in the form of “Average Age” of the band (selected out of “Age” and “Age Band”). Also got selected over “Years of Driving Experience”.
Number of Vehicles
Gender
Married
Vehicle Age Band in the form of “Average Vehicle Age” of the band (selected out of “Vehicle Age” and “Vehicle Age Band”).
Fuel Type

Step 11: Categorical variables to be converted to their numerical equivalent (0,1)

Gender (F = 0 and M = 1)
Married (Married = 0 and Single = 1)
Fuel Type (P = 0, D = 1)

Snapshot of the final data on which we will run the multivariate regression.

Step 11: Run Regression.

Summary of Output

SUMMARY OUTPUT

Regression Statistics
Multiple R	0.865972274
R Square	0.749907979
Adjusted R Square	0.749809794
Standard Error	114.4310136
Observations	15290

ANOVA
	df	SS	MS	F	Significance F
Regression	6	600073213.5	100012202.3	7637.751088	0
Residual	15283	200122584.4	13094.45688
Total	15289	800195798

	Coefficients	Standard Error	t Stat	P-value	Lower 95%	Upper 95%	Lower 95.0%	Upper 95.0%
Intercept	624.56529	5.29192	118.02233	0.00000	614.19249	634.93809	614.19249	634.93809
Avg Age	-5.55974	0.06546	-84.93889	0.00000	-5.68804	-5.43144	-5.68804	-5.43144
Number of Vehicles	0.17875	0.97039	0.18420	0.85386	-1.72333	2.08082	-1.72333	2.08082
Gender Dummy	50.88326	1.89081	26.91084	0.00000	47.17705	54.58947	47.17705	54.58947
Married Dummy	78.39837	1.92148	40.80106	0.00000	74.63204	82.16469	74.63204	82.16469
Avg Vehicle Age	-15.14220	0.26734	-56.63987	0.00000	-15.66623	-14.61818	-15.66623	-14.61818
Fuel Type Dummy	267.93559	2.74845	97.48614	0.00000	262.54830	273.32287	262.54830	273.32287

Results Interpretation

Significant Test

Significance test of coefficients based on Normal distribution

H0: b is no different than 0 (i.e. 0 is the coefficient when the variable is not included in regression)
H1: b is different than 0

Test statistic, Z = (b-0)/σ (at 95% two tailed confidence interval, Z = 1.96)

Confidence interval = (b – 1.96 * σ, b + 1.96 * σ)

For the variable to be significant, the interval must not contain “0”.

Example1: Avg Age.

Confidence interval = (-5.560-1.96*0.065, -5.560+1.96*0.065) = (-5.688, -5.431)

No zero in the interval. Hence, significant.

Example2: Number of Vehicles

Confidence interval = (0.179-1.96*0.970, 0.179+1.96*0.970) = (-1.723, 2.080)

Zero is present in the interval. Hence, insignificant.

Summary output after removing insignificant variables.

Predicted Losses = 625.004932715948 – 5.5596551344537 * Avg Age + 50.8828923910091 * Gender Dummy + 78.4016899779131 * Married Dummy -15.1420259903571 * Avg Vehicle Age + 267.935139741526 * Fuel Type Dummy

Interpretation

Takeaway

Running regression in Excel gives you more confidence in your model as you are running each step manually. It is a time-consuming and lengthy process, but you get to learn more about datasets and output summaries. Other tools such as R, SAS, and Python give you quick results but you will not be able to know the algorithm behind the code that is executed and has given you the results.