How to Do Linear Regression Analysis in Excel

Regression Analysis is a technique to forecast the relationship between data variables of entities such as customers, orders, transactions, etc. These are the parameters based on which business decisions can be taken for sales, finance, and marketing operations. To understand this in a simpler way, let us take some examples.

Regression Analysis

Example 1 – Risk Analytics in Retail Banking

Credit risk scores of customers are built by the banking industry to predict the customer’s delinquency behavior and to evaluate the creditworthiness of each customer while processing loan applications.

Example 2 – Marketing Analytics in eCommerce

To predict customers’ buying patterns based on the customer’s past transactions from an online portal. The e-commerce industry is highly competitive so they use analytics to predict customers’ purchasing patterns.

Example 3 – Pricing Analytics in Stock Market

Stockbrokers and investors use analytics to forecast stock pricing based on the past 52-week pricing trend and many other parameters. The analytics helps them to make the decisions on buying or selling the stocks based on the future pricing predictions.

Example 4 – Behavioral Analytics in Human Resource Industry

Human resource departments/companies use analytics to predict employees’ behavioral patterns that will enable them to decide the increment or promotion to be given on a yearly basis.

Regression Analysis Basics

Before we go for how to do regression in Excel by taking a case study, let us understand the dataset and its components.

Population vs. Sample

The population is referred to as all members of a defined group that are considered for studying information on data-driven decisions. For example – The current inflation rates of EU countries.

A sample is a part of the “population”. It can be biased or unbiased. For example – The current inflation rates of EU countries having per capita income < 50000 EURO per annum.

Types of Data Variables

The below diagram shows the types of data variables under each category.

Data variables

Data Snapshot

The table below gives an example of a data snapshot with different data variables.

#Cust.NameCust. IDNo.Of credit cardsGenderMarital statusAgeAnnua lSalaryMonthly Credit Card Usage
1Josh1116695FNever Married4388,001Low
2John2231126MMarried24592,330Low
3Alen1891234MDivorced50272,304Low
4Chaya1716903FMarried37140,400Low
5Dandre166256MNever Married23105, 234Low
6Justin1491717MDivorced34358,534Low
7Neil1692545MMarried36510,321Low
8Emily1497712FNever Married26164,732Low
9Janice1756462FMarried35103,345Low
10Farhan1706394MNever Married63724,788Low
11Tony1131363MMarried70105,450Low

Let us understand the data variables available in the above table.

#Cust.NameCust. IDNo.of credit cardsGenderMarital StatusAgeAnnual SalaryMonthly Credit Card Usage
Numerical (Discrete)Categorical (Binary)Categorical (Nominal)Numerical (Discrete)Numerical (Continuous)Categorical (Ordinal)

Summarizing Data

The next step is to summarize data in a presentable format that will give the report on which analysis can be done. The ways to summarize the data are as follows:

  1. Frequency Distribution – this is a simple way of counting distinct discrete values. For example – The number of credit cards owned by 3000 customers as sample data. Can be shown in tabular format or as a graph using Excel.
Tabular representation

Steps to do this in Excel:

  • Go to the Data tab in your Excel Worksheet. 
Sort
  • Click Advanced. The dialog is shown
Advance filter
  • Go to the Number of Cards column. Sort them into ascending order of values.
Number of cards
customers
  • You will get the values below.
values
  • Go to the Insert tab. Click the Column icon. Select the 2-D Chart.
Colums
  • The chart will be generated and shown below.
Customers data
  1. Grouped Frequency Distribution – this is a method to summarize discrete variables having a large number of observations and a range of values. For example – The number of customers falling under different salary ranges or groups.
Graphical representation

Follow the steps in Excel as per the instructions given on the screen below.

Data
  1. Cumulative Frequency Distribution – this is obtained by accumulating the frequencies to give the total number of observations up to and including the value or group in question. Illustration as per the screen below.
Data representation

Steps to calculate the cumulative frequency distribution are given on the screen below.

cumulative frequency
  1. Stem Leaf Diagram – this technique is used in a limited set of data. For a large set of data, this method is not used. Example: Given the age of 20 individuals in years. 
Stem Leaf Diagram
  1. Line Plots – this technique is also not suitable for a large set of data. It is not extensively used in the industry. Example: Given test scores of 20 students.
Line Plots

The measure of Central Tendency – Median

The central tendency measure is robust to the effects of extreme observations. 

The median is a value, which splits the data set into two equal halves. Example – Calculate the median for a sample of 3000 individuals having credit cards along with a demonstration of extreme observations.

Measure of Central Tendency - Median

The measure of Central Tendency – Mode

It is the value that occurs with the greatest frequency or the most typical value. For Example, finding the mode for a sample of 3000 individuals having credit cards. Excel has an in-built function “Mode” for granular data. For summarized data, it can be found easily by visual inspection. Illustration as below.

Measure of Central Tendency - Mode

Measure of Spread

The different measures of spread are – 

  • Variance and Standard Deviation – below screen shows how to calculate the values.
Measure of Spread
  • The Range – It is a very simple measure of spread defined, as its name suggests, by the difference between the largest and smallest observations in the data set.
data range
  • The Interquartile range – the quartiles divide a set of data into four quarters. They are denoted by Q1, Q2, and Q3. Q1 is called the lower quartile and Q3 is the upper quartile. The following image shows how to calculate the Interquartile Range for a sample of 3000 individuals having credit cards.
Inter quartile range

Symmetry and Skewness of data

It represents the shape of the distribution of a dataset, that is, whether it is symmetric or skewed to one side or the other. The approximate shape of a distribution can be determined by looking at a  histogram. The below diagram shows the mean, median, mode, and variance for symmetric and skewed data.

Symmetry and Skewness of data

A tabular representation of data is shown below.

Data

After evaluating the dataset by the above measures and techniques, a regression analysis can be conducted. Now, the question is which type of regression model is required for predictions.

The next section describes the measures that will state the relationship between the data variables and determine how strong the relationship is between identified variables.

Regression Analysis Models

The following are different techniques or formulas that are used to do regression analysis. These techniques are nothing but different formulas or equations that are used to determine the relationship between X and Y variables.

Covariance and correlation coefficient are the measures used in regression analysis. 

  • Covariance is a measure that helps to find out the direction of the relationship between the x variable and y variables. In simple terms, what happens to y when x increases or decreases. 

The following equation is used to determine the covariance measure.

formula
  • A correlation coefficient is a measure that tells how strong the linear relationship is between X and Y variables. It is termed an r-value that determines the quality of a model. The parameter is represented by p or r values. 
    • Population correlation is denoted by ρ (rho)
    • Sample correlation is denoted by r.

The following equation is used to determine the correlation coefficient measure for population data.

formula

In real-life data analytics, the correlation measure is calculated on sampling distribution. There are a couple of reasons why sampling is used for analytics.

Features of ρ and r

  • The physical impossibility of checking all items in the population.
  • The cost of studying all the items in a population.
  • The sample results are usually adequate.
  • Contacting the whole population would often be time-consuming.
  • Unit free and ranges between -1 and 1
  • The closer to -1, the stronger the negative linear relationship
  • The closer to 1, the stronger the positive linear relationship
  • The closer to 0, the weaker the linear relationship
Features of ρ and r

Negative Linear Relationship 

Negative Linear Relationship

Positive Linear Relationship 

Positive Linear Relationship

Relationship Not Linear

Relationship Not Linear

No Relationship

No Relationship

Tools used to do Regression Analysis

Following software tools are the most used in performing Regression Analysis. The oldest tool is Microsoft Excel. 

As the spectrum of analytics is widened, dataset volume has increased, the data scientists have started using Python.

We will see a case study to run regression analysis in Excel.

Case Study

An Auto Insurance company has the data of “Loss” amount and policy-related information. The company wants to know the factors responsible for losses in a multivariate fashion using a regression model.

Prior to running the regression model, the following things need to be in place.

  • Variable identification –
    • Identifying the dependent (response) variable.
    • Identifying the independent (explanatory) variables.
    • Variable categorization (e.g. Numeric, Categorical, Discrete, Continuous, etc.)
    • Creation of Data Dictionary
  • Response variable exploration
    • Distribution analysis
      • Percentiles
      • Variance 
      • Frequency distribution
    • Outlier treatment
      • Identify the outliers/threshold limit
      • Cap/floor the values at the thresholds
  • Independent variable analyses
    • Identify the prospective independent variables (that can explain the response variable)
    • Bivariate analysis of response variable against independent variables
    • Variable treatment /transformation
      • Grouping of distinct values/levels
      • Mathematical transformation e.g. logs, splines, etc.

After performing the above data-related activities, the last step is iterative in terms of running the regression.

Fitting the regression (first time)

  • Check for correlation between independent variables
    • This is to take care of Multicollinearity
  • Variable selection
    • Check for the most suitable transformed variable
    • Select the transformation giving the best fit
    • Reject the statistically insignificant variables

Fitting the regression (iteration)

  • Analysis of results
    • Model comparison
    • Model performance check
  • Actual vs Predicted comparison

Steps to do regression analysis in Excel

We will perform the following steps in Excel with reference to the case study explained in the section above.

Data Analysis

Step 1. Snapshot of Data and Data Description

Data and Data Description

The above data contains policy holders’ and loss amount information (variables)

  • Policy Number
  • Age
  • Years of Driving Experience
  • Number of Vehicles
  • Gender
  • Married
  • Vehicle Age
  • Fuel Type
  • Losses (Dependent/Response Variable)

A Data Dictionary is formed as below.

Sl #Variable NameVariable DescriptionValues StoredVariable Type
1Policy NumberUnique Policy NumberUnique value identifying the policyIdentifier
2AgeAge of Policyholder16, 17,…,70Numerical (Discrete)
3Years of Driving ExperienceYears of Driving Experience of the Policyholder0,1,….,53Numerical (Discrete)
4Number of VehiclesNumber of Vehicles insured under the policy1,2,3,4Numerical (Discrete)
5GenderGender of the PolicyholderF, MCategorical (Binary)
6MarriedMarital status of the PolicyholderMarried, SingleCategorical (binary)
7Vehicle AgeAge of vehicle insured under the policy0,1,…,15Numerical (Discrete)
8Fuel TypeFuel type of the vehicle insuredD, PCategorical (Binary)
9LossesLoss amount claimed under the policyRange: 13- 3500Numerical (Continuous)
Data Dictionary

Step 2. Frequency Distribution – Loss amount

Frequency Distribution - Loss amount

Step 3. Calculate Capped Losses using grouped frequency distribution

Capped Losses

Step 3. Perform variable analysis in Excel. The first bivariate profiling we will do for Age vs. Capped Loss for policies.

Age vs. Capped Loss

Step 4. The second bivariate profiling we will do for Years of Driving Experience vs. Capped Loss for policies.

Years of Driving Experience vs. Capped Loss

Step 5. The next bivariate profiling we will do for Gender vs. Capped Loss for policies.

gender

Step 6. The next bivariate profiling we will do for Marital Status vs. Capped Loss for policies.

marital status

Step 7: The next bivariate profiling we will do for Vehicle Age vs. Capped Loss for policies.

MblOl sS3tYL7Z R 9bIn WgmNc5PT9OL8Uy3IRSgLryNqxpiWa8jeAPuZSWaLFJlQm1Aeb0tNgjlRT7pvNJKKbVZimKLDSgo08sywYzeV49j BZ5UqGNh bR1qNPvyBegloJs

Step 8: The next bivariate profiling we will do for Vehicle Age Band vs. Capped Loss for policies.

vehicle age based

Step 9: The next bivariate profiling we will do for Vehicles vs. Capped Loss for policies.

vehicles

Step 10: The next bivariate profiling we will do for Fuel Type vs. Capped Loss for policies.

fuel type

Preparing Excel for Regression Analysis

Step 1: Open the Office icon from the toolbar.

 toolbar

Step 2: Go to Excel Options.

Excel Options

Step 3: Click Add-Ins.

Add-Ins

Step 4: From the Manage drop-down, select Excel Add-ins.

 Excel Add-ins

Step 5: Select the Analysis Toolpak check box and the Solver Add-in check box.

Analysis Toolpak

Step 6: Select the Data tab. Click Data Analysis

Data
data analysis

Step 7: In the Data Analysis dialog box, click Regression.

data analysis

Step 8: Select the data cell reference range. Click OK.

cell reference range

Step 9: See the summary results as below.

data summary

Step 10: Picking up the variables based on the following activities.

We will perform the following tests to select by running :

  • Multicollinearity
  • Banding of variables
  • Statistical significance of variables tested

List of independent variables:

  1. Age
  2. Age Band
  3. Years of Driving Experience
  4. Number of Vehicles
  5. Gender
  6. Married
  7. Vehicle Age
  8. Vehicle Age Band
  9. Fuel Type

“Age” and “Years of Driving Experience” are highly correlated (Correlation Coefficient = 0.9972). We can use either of the variables in the regression.

Fit two separate models using either of the variables one at a time. Check for the goodness of fit (R2 in this case).  The variable producing higher R2 gets accepted.

Age

Regression Statistics (Age)
Multiple R0.475766
R Square0.226354
Adjusted R Square0.226303
Standard Error201.2306
Observations15290

Years of Driving Experience

Regression Statistics(Yrs Driving Experience)
Multiple R0.475273
R Square0.225885
Adjusted R Square0.225834
Standard Error201.2916
Observations15290
  • R2 for Age > R2 for Years of Driving Experience
  • Reject Years of Driving Experience

Banding of Variables

  • Investigate whether to use “Age” or “Age band”
  • Fit regression independently using “Age” and “Age Band”
  • Before fitting regression, “Age Band” needs to be converted to numerical form from categorical. Replace “Age Band” values with “Average Age” for the particular band.
Age BandSum of Age# PoliciesAverage Age
16-25      93,770.0     4,563.0                 20.6 
26-59    270,793.0     6,384.0                 42.4 
60+    282,636.0     4,343.0                 65.1 
  • Regressions results using “Age” and “Average Age”
Regression Statistics (Age)
Multiple R0.475766
R Square0.226354
Adjusted R Square0.226303
Standard Error201.2306
Observations15290
Regression Statistics (Average Age)
Multiple R0.509969
R Square0.260068
Adjusted R Square0.26002
Standard Error196.7971
Observations15290
  • R2 for Average Age > R2 for Age
  • Select “Average Age”

Investigate whether to use “Vehicle Age” or “Vehicle Age band”

  • Fit regression independently using “Vehicle Age” and “Vehicle Age Band”
  • Before fitting regression, “Vehicle Age Band” needs to be converted to numerical form from categorical. Replace “Vehicle Age Band” values with “Vehicle Average Age” for the particular band.
Vehicle Age BandSum of Vehicle Age# PoliciesAverage Vehicle Age
0-5                            9,229                 3,688 2.50
6-10                          44,298                 5,523 8.02
11+                          78,819                 6,079 12.97
  • Regressions results using “Vehicle Age” and “Average Vehicle Age”
Regression Statistics (Vehicle Age)
Multiple R0.289431325
R Square0.083770492
Adjusted R Square0.083710561
Standard Error218.9903277
Observations15290
Regression Statistics (Average  Vehicle Age)
Multiple R0.303099405
R Square0.09186925
Adjusted R Square0.091809848
Standard Error218.0203272
Observations15290
  • R2 for Average Vehicle Age > R2 for Vehicle Age
  • Select “Average Vehicle Age”

Now, we have identified the variables that can be used in fitting the regression model.

  1. Age Band in the form of “Average Age” of the band (selected out of “Age” and  “Age Band”). Also got selected over “Years of Driving Experience”.
  2. Number of Vehicles
  3. Gender
  4. Married
  5. Vehicle Age Band in the form of “Average Vehicle Age” of the band (selected out of “Vehicle Age” and  “Vehicle Age Band”). 
  6. Fuel Type

Step 11: Categorical variables to be converted to their numerical equivalent (0,1)

  1. Gender (F = 0 and M = 1) 
  2. Married (Married = 0 and Single = 1)
  3. Fuel Type (P = 0, D = 1)

Snapshot of the final data on which we will run the multivariate regression.

Age

Step 11: Run Regression. 

Summary of Output

SUMMARY  OUTPUT  
Regression Statistics
Multiple R0.865972274
R Square0.749907979
Adjusted R Square0.749809794
Standard Error114.4310136
Observations15290
ANOVA
 dfSSMSFSignificance F
Regression6600073213.5100012202.37637.7510880
Residual15283200122584.413094.45688
Total15289800195798   
 CoefficientsStandard Errort StatP-valueLower 95%Upper 95%Lower 95.0%Upper 95.0%
Intercept624.565295.29192118.022330.00000614.19249634.93809614.19249634.93809
Avg Age-5.559740.06546-84.938890.00000-5.68804-5.43144-5.68804-5.43144
Number of Vehicles0.178750.970390.184200.85386-1.723332.08082-1.723332.08082
Gender Dummy50.883261.8908126.910840.0000047.1770554.5894747.1770554.58947
Married Dummy78.398371.9214840.801060.0000074.6320482.1646974.6320482.16469
Avg Vehicle Age-15.142200.26734-56.639870.00000-15.66623-14.61818-15.66623-14.61818
Fuel Type Dummy267.935592.7484597.486140.00000262.54830273.32287262.54830273.32287

Results Interpretation

Results Interpretation

Significant Test

Significant Test

Significance test of coefficients based on Normal distribution

  • H0: b is no different than 0 (i.e. 0 is the coefficient when the variable is not included in regression)
  • H1: b is different than 0 

Test statistic, Z = (b-0)/σ  (at 95% two tailed confidence interval, Z = 1.96)

Confidence interval = (b – 1.96 * σ, b + 1.96 * σ)

For the variable to be significant, the interval must not contain “0”.

Example1: Avg Age.

Confidence interval = (-5.560-1.96*0.065, -5.560+1.96*0.065) = (-5.688, -5.431)

No zero in the interval. Hence, significant.

Example2: Number of Vehicles

Confidence interval = (0.179-1.96*0.970, 0.179+1.96*0.970) = (-1.723, 2.080)

Zero is present in the interval. Hence, insignificant.

Summary output after removing insignificant variables.

Predicted Losses = 625.004932715948 – 5.5596551344537 * Avg Age + 50.8828923910091 * Gender Dummy + 78.4016899779131 * Married Dummy -15.1420259903571 * Avg Vehicle Age + 267.935139741526 * Fuel Type Dummy

Interpretation

Interpretation

Takeaway

Running regression in Excel gives you more confidence in your model as you are running each step manually. It is a time-consuming and lengthy process, but you get to learn more about datasets and output summaries. Other tools such as R, SAS, and Python give you quick results but you will not be able to know the algorithm behind the code that is executed and has given you the results.

You may also like:

How to do Indent Text in Excel Sheet

How To Delete Multiple Rows in Excel Sheet at Once

How to Create a Waterfall Chart in Microsoft Excel?