Homepage of Kói Tamás

choleverybody.RData

Used car dataset (you have to find kuiper.xls)

Paper corresponding to the above dataset

kruiser.RData

Classroom notes (we skipped the last page)

Classroom task

First, we learned the Kolmogorov-Smirnov test (see the first page of classroom notes II of the previous class). Then we tried it on simulated data. We simulated a sample of size 200 from the N(2,3) distribution, tested with the Kolmogorov-Smirnov test whether the sample indeed comes from N(2,3), and saved the p-value. We repeated this procedure 1000 times. We learned that, under the null hypothesis, the p-value is uniformly distributed on the interval (0,1), that is, its distribution is E(0,1). We checked this theoretical fact by running the Kolmogorov-Smirnov test on the vector of the 1000 saved p-values.
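
The simulation described above can be sketched with the Python standard library alone. This is an illustrative sketch, not the code used in class: the function names are made up for this example, the second parameter of N(2,3) is assumed to be the standard deviation, and the p-value is computed from the asymptotic Kolmogorov distribution rather than from an exact finite-sample formula.

```python
import math
import random
from statistics import NormalDist


def ks_statistic(sample, cdf):
    """Two-sided Kolmogorov-Smirnov statistic against a fully specified CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def kolmogorov_sf(t, terms=100):
    """Asymptotic P(sqrt(n) * D_n > t), from the Kolmogorov distribution."""
    if t < 0.2:
        return 1.0  # the tail probability equals 1 to many digits here
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * t * t)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))


def ks_pvalue(sample, cdf):
    """Asymptotic p-value of the one-sample Kolmogorov-Smirnov test."""
    return kolmogorov_sf(math.sqrt(len(sample)) * ks_statistic(sample, cdf))


def simulate_pvalues(reps=1000, n=200, mu=2.0, sigma=3.0, seed=1):
    """Repeat 'simulate n normal values, test them, save the p-value'."""
    rng = random.Random(seed)
    # ASSUMPTION: the 3 in N(2,3) is the standard deviation, not the variance.
    null_cdf = NormalDist(mu, sigma).cdf
    return [ks_pvalue([rng.gauss(mu, sigma) for _ in range(n)], null_cdf)
            for _ in range(reps)]


def uniform_cdf(p):
    """CDF of the uniform distribution on (0, 1)."""
    return min(1.0, max(0.0, p))


pvalues = simulate_pvalues()
print("mean of the saved p-values:", sum(pvalues) / len(pvalues))
print("share of p-values below 0.05:",
      sum(p < 0.05 for p in pvalues) / len(pvalues))
# Second-level check: are the saved p-values compatible with E(0,1)?
print("p-value of the uniformity check:", ks_pvalue(pvalues, uniform_cdf))
```

If the p-values are indeed uniform, their mean should be close to 0.5 and roughly 5% of them should fall below 0.05; the last line repeats the check from class by testing the saved p-values against the uniform distribution.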

After that, we checked the normality assumptions of the t-tests that we performed on the cholesterol data two weeks ago. First, we used the Kolmogorov-Smirnov test with estimated parameters. I mentioned that using estimated parameters causes a bias toward accepting the null hypothesis. I remarked that there exist methods that correct this bias; however, instead of these modifications, we learned other alternatives: we performed the Shapiro-Wilk test (its null hypothesis is normality), and we created Q-Q plots.
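
The bias caused by estimated parameters can be seen in a small simulation. The sketch below is only an illustration on simulated standard normal data (not the cholesterol data), with made-up function names and an assumed sample size of 50. It runs the Kolmogorov-Smirnov test twice on each sample: once with the true parameters and once with the mean and standard deviation estimated from the same sample, and counts how often each version rejects at the 5% level. The p-value uses the asymptotic Kolmogorov distribution, which is only justified for a fully specified null distribution; corrected procedures such as the Lilliefors test exist for the estimated case.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def ks_pvalue(sample, cdf, terms=100):
    """Asymptotic Kolmogorov-Smirnov p-value (meant for a fully specified CDF)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    t = math.sqrt(n) * d
    if t < 0.2:
        return 1.0  # the tail probability equals 1 to many digits here
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * t * t)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))


def rejection_rates(reps=1000, n=50, alpha=0.05, seed=2):
    """Share of rejections with true parameters and with estimated parameters.

    The data are always normal, so every rejection is a false rejection.
    """
    rng = random.Random(seed)
    true_cdf = NormalDist(0.0, 1.0).cdf
    known = 0
    estimated = 0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Plug the sample mean and sample standard deviation into the null CDF.
        fitted_cdf = NormalDist(fmean(sample), stdev(sample)).cdf
        if ks_pvalue(sample, true_cdf) < alpha:
            known += 1
        if ks_pvalue(sample, fitted_cdf) < alpha:
            estimated += 1
    return known / reps, estimated / reps


known_rate, estimated_rate = rejection_rates()
print("rejection rate with true parameters:     ", known_rate)
print("rejection rate with estimated parameters:", estimated_rate)
```

With the true parameters the rejection rate should be near the nominal 5%, while with estimated parameters it should be far below it: the fitted distribution is by construction close to the sample, so the test rejects too rarely.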

Then we reviewed what independence of discrete variables means, and we learned the chi-square test of independence (see the first two pages of the classroom notes above). We concluded that, according to this test, the variables Sound and Leather in the above car dataset (kruiser.RData) are not independent.

After that, we focused on the dependence between the continuous variables Price and Mileage. First, we ran a correlation test. Its null hypothesis is that the theoretical correlation is 0. Based on the p-value, we rejected this null hypothesis. Hence, as independence implies 0 correlation, we concluded that Price and Mileage are not independent. If we had accepted the null hypothesis, we would only have known that there is no evidence of a linear relationship between the two variables; they could still be dependent. Then we looked at the empirical correlation. It differs significantly from 0; however, its absolute value is relatively small. Heuristically, we would expect a correlation much closer to -1. We concluded that the cause of the relatively small absolute value is the fact that all cars in the dataset are less than one year old.

Finally, we also checked the independence of Price and Mileage in the following way: we discretized the variables and then tested the independence of the discretized variables. We rejected the independence of the discretized variables; hence, the independence of the original variables can also be rejected. I noted that if we had accepted the independence of the discretized variables, we could only have suspected the independence of the original variables. The more categories are used in the discretization, the stronger this suspicion is. However, we have to be careful: increasing the number of categories can increase the number of empty cells (or cells containing only a few observations), and it is an empirical fact that the chi-square test does not work properly in the presence of such cells.
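
The chi-square test of independence and the discretization step can be sketched as follows. This is a standard-library illustration, not the code used in class. All names are made up for the example, the 2 x 2 table and the simulated variables are hypothetical stand-ins rather than the car data, the bins are cut at empirical quantiles (one possible choice), and no continuity correction is applied to 2 x 2 tables. The smallest expected cell count is returned as well, since small expected counts are the situation in which the test becomes unreliable.

```python
import bisect
import math
import random


def chi2_sf(x, df):
    """P(X > x) for a chi-square variable X with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    # Start from a closed form and use Q(a+1, z) = Q(a, z) + z^a e^(-z) / Gamma(a+1).
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(z))
    else:
        a, q = 1.0, math.exp(-z)
    while a < df / 2:
        q += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1
    return min(1.0, q)


def chi2_independence(table):
    """Pearson chi-square test of independence for an r x c table of counts.

    Returns (statistic, df, p-value, smallest expected count).
    Assumes every row total and column total is positive.
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    min_expected = float("inf")
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            min_expected = min(min_expected, expected)
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, chi2_sf(stat, df), min_expected


def discretize(values, k):
    """Assign each value to one of k bins with roughly equal counts."""
    s = sorted(values)
    cuts = [s[len(s) * j // k] for j in range(1, k)]
    return [bisect.bisect_right(cuts, v) for v in values]


def cross_table(a, b, k):
    """k x k table of counts for two lists of bin indices."""
    table = [[0] * k for _ in range(k)]
    for i, j in zip(a, b):
        table[i][j] += 1
    return table


# Hypothetical 2 x 2 table of counts for two binary variables (NOT the car data).
print("2 x 2 example:", chi2_independence([[10, 20], [20, 10]]))

# Hypothetical stand-ins for two negatively related continuous variables.
rng = random.Random(3)
x = [rng.gauss(0.0, 1.0) for _ in range(800)]
y = [-0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]
for k in (2, 3, 5):
    table = cross_table(discretize(x, k), discretize(y, k), k)
    stat, df, p, min_expected = chi2_independence(table)
    print(f"k={k}: statistic={stat:.1f}, df={df}, p={p:.3g}, "
          f"smallest expected count={min_expected:.1f}")
```

Increasing `k` gives a finer picture of the dependence but shrinks the expected count per cell, which is the trade-off mentioned above.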