kruiser.RData


Level of measurement


Classroom notes


Classroom task

First, we learned the material found in the classroom notes above.

Then we estimated the Price variable using the variables Mileage, Cylinder, Sound, and Leather. We saw that the summary() function prints the most important statistics of the fitted linear regression. We recognized the R-squared statistic. We saw that each row ends with a p-value for the null hypothesis that the theoretical coefficient of that row's variable is 0.

We discussed that the magnitudes of the coefficients depend on the scales of the variables. For this reason, the absolute values of the coefficients are not appropriate for judging the importance of the variables. As a first approach, I advised using the p-values for this purpose: the smaller the p-value, the more firmly we reject the null hypothesis, and hence the stronger the evidence that the variable has an effect. Then we added the so-called beta coefficients to the regression. These are the coefficients obtained from the regression performed on the standardized variables (standardizing a variable means subtracting its mean and dividing by its sample standard deviation).

I noted that dependence among the explanatory variables can influence the observed significance and importance of the variables (multicollinearity analysis deals with this issue). To illustrate this, we ran a model containing Liter instead of Cylinder and a model containing both.
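The steps of this part can be sketched in R as follows. This is only an illustration under stated assumptions: the data frame `cars` is a made-up, simulated stand-in (it is not the content of kruiser.RData), and the beta coefficients are obtained here by refitting on variables standardized with scale(), which may differ from the tool used in class.

```r
# Synthetic stand-in data (assumption): values are simulated, not the course data.
set.seed(1)
n <- 200
cars <- data.frame(
  Mileage  = round(runif(n, 5000, 50000)),
  Cylinder = sample(c(4, 6, 8), n, replace = TRUE),
  Sound    = rbinom(n, 1, 0.5),
  Leather  = rbinom(n, 1, 0.5)
)
cars$Price <- 5000 - 0.15 * cars$Mileage + 4000 * cars$Cylinder +
  rnorm(n, sd = 3000)

# Linear regression of Price on the four explanatory variables.
fit <- lm(Price ~ Mileage + Cylinder + Sound + Leather, data = cars)
summary(fit)  # coefficient table (p-values in the last column) and R-squared

# Beta coefficients: refit on standardized variables.
# scale() subtracts the mean and divides by the sample standard deviation.
cars_std <- as.data.frame(scale(cars))
fit_std  <- lm(Price ~ Mileage + Cylinder + Sound + Leather, data = cars_std)
beta     <- coef(fit_std)[-1]  # drop the intercept, which is numerically zero

# Equivalent shortcut: rescale each raw slope by sd(x) / sd(y).
vars       <- c("Mileage", "Cylinder", "Sound", "Leather")
beta_check <- coef(fit)[vars] * sapply(cars[vars], sd) / sd(cars$Price)

# A Liter variable strongly related to Cylinder (again simulated),
# used to compare the three model variants.
cars$Liter <- cars$Cylinder / 2 + rnorm(n, sd = 0.3)
fit_liter  <- lm(Price ~ Mileage + Liter + Sound + Leather, data = cars)
fit_both   <- lm(Price ~ Mileage + Cylinder + Liter + Sound + Leather,
                 data = cars)
summary(fit_both)  # compare the p-values of Cylinder and Liter with the above
```

Because the beta coefficients are free of units, their absolute values can be compared across variables, which is not the case for the raw coefficients. Comparing summary(fit), summary(fit_liter), and summary(fit_both) shows how the p-values of related variables can change once both are in the model.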

After that, we discussed the different measurement levels (see the blog above). Then we converted the Cylinder variable into a factor variable using the as.factor() command (this is the right way to work with nominal and ordinal variables in R) and reran the first regression. We saw that the lm() command treated Cylinder differently this time. It created the 0-1 valued dummy variables Cylinder6 and Cylinder8, which equal 1 exactly when the original Cylinder is 6 or 8, respectively, and used them in the regression instead of the original Cylinder variable. Note that it did not create a dummy variable for category 4. The omitted category is called the reference category. The output gave the following linear Price predictor: 21364 - 0.16 Mileage + 2224 Cylinder6 + 20682 Cylinder8 - 1030 Sound1 + 604 Leather. We can interpret the result as allowing the "constant" to differ across the categories of Cylinder: it is 21364 if Cylinder is 4, it is 21364 + 2224 = 23588 if Cylinder is 6, and it is 21364 + 20682 = 42046 if Cylinder is 8.
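The dummy coding described above can be reproduced with a small R sketch. The data frame `cars` below is again simulated for illustration only (it is not the course data set), and the model is reduced to Mileage and Cylinder to keep the example short.

```r
# Synthetic stand-in data (assumption): values are simulated for illustration.
set.seed(2)
n <- 300
cars <- data.frame(
  Mileage  = round(runif(n, 5000, 50000)),
  Cylinder = sample(c(4, 6, 8), n, replace = TRUE)
)
cars$Price <- 21000 - 0.16 * cars$Mileage +
  ifelse(cars$Cylinder == 6, 2000, 0) +
  ifelse(cars$Cylinder == 8, 20000, 0) +
  rnorm(n, sd = 2500)

# Treat Cylinder as a categorical variable.
cars$Cylinder <- as.factor(cars$Cylinder)
levels(cars$Cylinder)  # "4" "6" "8"; the first level is the reference category

fit <- lm(Price ~ Mileage + Cylinder, data = cars)
names(coef(fit))  # "(Intercept)" "Mileage" "Cylinder6" "Cylinder8"

# The 0-1 dummy columns that lm() builds from the factor.
head(model.matrix(fit))

# Category-specific constants implied by the fitted model.
b <- coef(fit)
intercepts <- c(
  "4" = unname(b["(Intercept)"]),
  "6" = unname(b["(Intercept)"] + b["Cylinder6"]),
  "8" = unname(b["(Intercept)"] + b["Cylinder8"])
)
intercepts

# The same arithmetic with the coefficients reported in the notes.
notes_intercepts <- c("4" = 21364, "6" = 21364 + 2224, "8" = 21364 + 20682)
notes_intercepts
```

If a different reference category is preferred, relevel() can change it, for example relevel(cars$Cylinder, ref = "8"); the fitted values stay the same, only the interpretation of the dummy coefficients changes.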