Multiple Linear Regression
Question:
Create a scenario in your discipline in which multiple linear regression would be the appropriate analysis method. Assume that you need to obtain the best model with three predictors selected from a pool of five.
a. Briefly describe the underlying problem. Give the response variable, the variables you will choose from for predictors, and the reasons you believe they would be relevant.
b. Give, in detail, the steps you would go through to check that your predictor meets the requirements for regression. What would you do for your potential predictors?
c. Give, in detail, the process you would use to select the best model.
d. Suppose that you find two different models that meet your standard for best.
Answer:
Underlying problem
In this study, we construct a multiple linear regression model to analyze how strongly the following independent characteristics affect the life of a city dweller:
1. % Obesity
2. % Excessive drinking
3. % Smokers
4. % Physically active
5. % Insured
A multiple linear regression model was developed to analyze the effect of the above characteristics on loss of life among city dwellers.
Methodology
Data Source
We gathered the data from the 2014 New York health report published by the University of Wisconsin Population Health Institute. The data collection was sponsored by the Robert Wood Johnson Foundation, the most significant US charitable organization focusing on public health [3].
Software
According to Princeton’s online archive, regression testing is: “Any software testing that seeks to uncover software errors after changes to the program…have been made by retesting the program.” [4] We estimated the regression coefficients (beta values) for our regressors using the Data Analysis add-in for Microsoft Excel.
Model Refinement Procedure
As referenced by author James R. Evans, regression analysis is defined as: “A tool for building statistical models that characterize relationships among a dependent variable and one or more independent variables, all of which are numerical.” [5] To determine whether our regression variables have plausible explanatory power, we evaluated each individual characteristic that could potentially influence our dependent variable. Isolating each characteristic reinforced our understanding that they are independent and identically distributed [6] (IID).
We worked under the strict assumption that any finding regarding one characteristic as IID would not, in any way, affect another. The study was also conducted under the assumption that each x-variable was independent of the others.
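The assumption that the x-variables are independent of one another can be screened before fitting by looking at pairwise correlations. The following is only a minimal sketch: the column names, the data values, and the 0.8 cutoff are illustrative assumptions, not part of the study.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def flag_correlated(predictors, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(predictors.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Made-up values for illustration only; these are NOT the study's data.
predictors = {
    "pct_smokers": [18.0, 22.0, 15.0, 25.0, 20.0],
    "pct_obese": [27.0, 31.0, 25.0, 33.0, 29.0],
    "pct_excessive_drinking": [17.0, 14.0, 13.0, 16.0, 15.0],
}

for name_a, name_b, r in flag_correlated(predictors):
    print(f"{name_a} and {name_b} are strongly correlated (r = {r})")
```

A flagged pair does not by itself invalidate the model, but it suggests that the two predictors carry overlapping information and that their individual coefficients should be read with care.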
Results
ŷ = 76.12*X1 + 347.07*X2 + 75.73*X3 - 69.9
X1= % Smokers
X2= % Had Low Birth Weight
X3= % Physically Inactive
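To make the fitted equation concrete, here is a small sketch that evaluates it for one set of inputs. The coefficients are the ones reported above; the function name and the input values are hypothetical, and the code assumes the predictors are entered in the same units that were used when the model was fitted, which the text does not state.

```python
def predict_response(pct_smokers, pct_low_birth_weight, pct_physically_inactive):
    """Evaluate the reported fitted equation for one observation."""
    return (
        76.12 * pct_smokers
        + 347.07 * pct_low_birth_weight
        + 75.73 * pct_physically_inactive
        - 69.9
    )


# Hypothetical inputs, chosen only to illustrate the call.
baseline = predict_response(20.0, 8.0, 25.0)
one_more_smoker_unit = predict_response(21.0, 8.0, 25.0)

# Holding the other predictors fixed, a one-unit increase in X1
# changes the prediction by the X1 coefficient.
print(one_more_smoker_unit - baseline)
```

The difference printed at the end equals the X1 coefficient up to floating-point rounding, which is the usual reading of a regression coefficient: the expected change in ŷ per unit change in that predictor with the others held constant.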
The final model reveals that “% Smokers,” “% Had Low Birth Weight,” and “% Physically Inactive” were the most significant factors affecting the life of a city dweller, while “% Uninsured,” “% Excessive Drinking,” “% Obese,” and “Physically-inactive” had the highest p-values, indicating that these four x-variables were the least significant predictors in the regression model. The adjusted R-Square of this model was 0.54, suggesting that approximately 54% of the variation in y is explained by the x-variables in the regression. The t-stats are high, indicating that each individual x-variable is significant (assuming a 95% confidence level).
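For reference, adjusted R-Square is computed from the ordinary R-Square as 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The sketch below shows the calculation; the sample size and R-Square passed in are assumed values for illustration, since the text reports only the adjusted figure.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjusted R-Square: penalizes R-Square for the number of predictors used."""
    degrees_total = n_observations - 1
    degrees_residual = n_observations - n_predictors - 1
    return 1.0 - (1.0 - r_squared) * degrees_total / degrees_residual


# Assumed inputs for illustration; neither value comes from the study.
print(adjusted_r_squared(r_squared=0.5, n_observations=11, n_predictors=2))
```

Because the penalty grows with the number of predictors, adjusted R-Square is a fairer yardstick than plain R-Square when comparing models that use different numbers of x-variables.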
Model 1
The purpose of Model 1 is to show a regression analysis with all six independent x-variables. Our interest in this model is to identify which of those variables are most likely to affect the dependent variable and to remove the insignificant ones. The higher the absolute value of the t-stat, the stronger the evidence that the variable influences the dependent variable. We will remove, in a stepwise manner, the variables with excessively high p-values. Our initial hypothesis is that the “% Smokers” category will have the most considerable influence on our loss rate [2].
Model 2
The purpose of Model 2 is to show a regression analysis with five of the x-variables, excluding the one that had the highest p-value (% Obese). Our interest in this model is to find a better fit: our initial hypothesis is that repeating the regression without “% Obese” will improve the fit of the model.
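The stepwise removal described for Models 1 and 2 can be sketched as a backward-elimination loop. This is an illustration under stated assumptions, not the Excel procedure used in the study: the data are a made-up toy design, the function names are hypothetical, and the stopping rule uses |t| >= 2 as a rough stand-in for a p-value cutoff. Within a single fit, the variable with the smallest |t| is also the one with the highest p-value, so the removal order matches the description above.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def t_statistics(columns, y):
    """Fit y on an intercept plus the given columns; return {name: t-stat}."""
    names = list(columns)
    n, k = len(y), len(names) + 1
    X = [[1.0] + [columns[name][i] for name in names] for i in range(n)]
    xtx = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    residuals = [y[i] - sum(X[i][a] * beta[a] for a in range(k)) for i in range(n)]
    sigma2 = sum(e * e for e in residuals) / (n - k)  # residual variance
    return {
        name: beta[j + 1] / (sigma2 * inv[j + 1][j + 1]) ** 0.5
        for j, name in enumerate(names)
    }


def backward_eliminate(columns, y, t_threshold=2.0):
    """Drop the weakest predictor until every remaining |t| meets the threshold."""
    remaining = dict(columns)
    while remaining:
        t = t_statistics(remaining, y)
        weakest = min(t, key=lambda name: abs(t[name]))
        if abs(t[weakest]) >= t_threshold:
            break
        del remaining[weakest]
    return list(remaining)


# Toy data for illustration only; NOT the study's data.
predictors = {
    "pct_smokers": [22.0, 22.0, 22.0, 22.0, 18.0, 18.0, 18.0, 18.0],
    "pct_low_birth_weight": [9.0, 9.0, 7.0, 7.0, 9.0, 9.0, 7.0, 7.0],
    "pct_obese": [31.0, 25.0, 31.0, 25.0, 31.0, 25.0, 31.0, 25.0],
}
y = [18.5, 17.5, 11.5, 12.5, 7.5, 8.5, 2.5, 1.5]

print(backward_eliminate(predictors, y))
```

In practice a statistics package would report exact p-values, and the cutoff should be chosen with the sample size in mind, since a fixed |t| of 2 corresponds to different significance levels at different degrees of freedom.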
References
1. “About RWJF.” http://www.rwjf.org/en/about-rwjf.html
2. Evans, James R. “Statistics, Data Analysis, and Decision Modeling.” Pearson Education, Inc. 2013.
3. “Income, Poverty, and Health Insurance Coverage in the United States: 2012.” http://www.census.gov/prod/2013pubs/p60-245.pdf
4. “Regression Testing.” Princeton University. https://www.princeton.edu/~achaney/tmve/wiki100k/docs/Regression_testing.html