The purpose of this appendix is twofold: to present a rudimentary review of the derivation of OLS regression estimates, and to offer a practical application that can readily be used in any sensitivity analysis or other simple regression analysis. The mathematical derivation is offered here merely as a rigorous foundation for the practical application below and for the general discussion in Chapter 5 Section E-1. For a complete treatment of the subject, see the sources listed in the readings or visit the Econometrics Journal online at http://www.econ.vu.nl/econometriclinks/, mentioned in the chapter.
1)- Derivation of regression coefficients
a)- Case of two variables
Let us take the simplest case of two variables, where we ask whether a linear relationship of the form
y = a + bx
exists between the two variables. To test the hypothesis that the coefficient b is different from zero, and thereby establish that a relationship indeed exists between variables x and y, we have n observations of x and y. The equation used is the one presented in the chapter, with the error term et indicating that the data will naturally not fit the expected linear function perfectly
yt = a + bxt + et
for all observations of y and x from t = 1 to t = n
The goal is to obtain estimates a* and b* of the unknown parameters a and b from the n available observations. The OLS method consists in minimizing the sum of the squared error terms et. The error terms are the deviations, or distances, from the actual values yt to the fitted line (yt* = a* + b*xt)
et = yt - a* - b*xt
which can be seen as a vertical segment in Graph G-5.1. As mentioned in the chapter, the mean (and hence the sum) of the error terms is zero
E(e) = Sum( et)/n = 0
because positive deviations cancel out negative ones. (This is one of the essential assumptions of OLS. If this assumption is violated, the estimates may be biased, as in the case of autocorrelation.)
The sum of squared error terms is
Sum( et2) = Sum( (yt - a* - b*xt)2)
This sum of squared error terms is at a minimum when both the partial derivative with respect to a* is zero
dSum( et2) / da* = - 2 Sum(yt - a* - b*xt) = 0
and the partial derivative with respect to b* is zero
dSum( et2) / db* = - 2 Sum( xt(yt - a* - b*xt)) = 0
The partial derivative with respect to a* simplifies to
Sum(yt) = na* + b*Sum(xt)
or since the mean (i.e. expected value) of y is E(y)=Sum(yt)/n and the mean of x is E(x)=Sum(xt)/n, we obtain
E(y) = a* + b*E(x)
The error term et can also be written in terms of the deviations dxt of xt from its mean E(x) (i.e. dxt = xt - E(x)), and deviations dyt of yt from its mean E(y) (i.e. dyt = yt - E(y)), as
et = dyt - b*dxt
The sum of squared error terms is then
Sum( et2) = Sum( (dyt - b*dxt)2)
and the partial derivative with respect to b* is
dSum( et2) / db* = - 2 Sum(dxt(dyt - b*dxt)) = 0
which gives
b* = Sum((dxt)(dyt)) / Sum((dxt)2)
The second derivative of the sum of squared error terms with respect to b* is 2 Sum(dxt2), which is necessarily positive and thus assures that the sum of squared error terms is indeed at a minimum. The estimate for a* is given by
a* = E(y) - b*E(x)
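The two formulas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `ols_two_variables` and the five data points are hypothetical and do not come from the chapter.

```python
def ols_two_variables(x, y):
    """Return (a_star, b_star) for the fitted line y* = a* + b* x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n                      # E(x)
    mean_y = sum(y) / n                      # E(y)
    dx = [xt - mean_x for xt in x]           # deviations dxt
    dy = [yt - mean_y for yt in y]           # deviations dyt
    sum_dxdy = sum(dxt * dyt for dxt, dyt in zip(dx, dy))
    sum_dx2 = sum(dxt * dxt for dxt in dx)
    if sum_dx2 == 0:
        raise ValueError("x does not vary, so b* is undefined")
    b_star = sum_dxdy / sum_dx2              # b* = Sum(dxt dyt) / Sum(dxt^2)
    a_star = mean_y - b_star * mean_x        # a* = E(y) - b* E(x)
    return a_star, b_star


# Hypothetical observations, chosen only to illustrate the formulas.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a_star, b_star = ols_two_variables(x, y)
residuals = [yt - a_star - b_star * xt for xt, yt in zip(x, y)]

print("a* =", round(a_star, 4))
print("b* =", round(b_star, 4))
print("sum of residuals =", round(sum(residuals), 10))
```

The last line checks numerically that the residuals of the fitted line sum to zero, as stated above.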
b)- General formulation
In this section, we go beyond the case of two variables and present the formulation of regression analysis for any number of variables. It is shown that the case of two variables is just a special case of the general formulation. To keep this material accessible to all readers, derivation of coefficient estimates is not presented, and interested readers are once again urged to consult specialized textbooks.
Let us specify the following linear equation
yt = b0 + b1x1t + b2x2t + . . . + bkxkt
Observe that the notation is different from that of the case of two variables (i.e. b0 in this equation corresponds to a above, b1 corresponds to b above, and x1t corresponds to xt above).
As before, the goal is to test whether (and which of) the k+1 coefficients b are different from zero. If so, that would establish the presence of a correlation between the endogenous variable y and the variables x. We have n observations, from t=1 to t=n, of y and of each x. The number of observations must exceed the number of exogenous variables k plus one.
The equation can be stated more economically in matrix form, with Y the vector of the n observations of y, X the matrix of n rows and k+1 columns holding a column of ones (for the constant b0) followed by the k variables x1 to xk, and B the vector of the k+1 coefficients b0 to bk
Y = XB
The estimated coefficients b* are those for which the sum of squared error terms et is at a minimum, where the error terms et are given by (xt denoting row t of matrix X)
et = yt - xt B
If E is the vector of the error terms et
E = Y - XB
The sum of squared error terms is
sum(et2) = E'E
or = (Y-XB)'(Y-XB)
= Y'Y - 2B'X'Y + B'X'XB
For the sum of squared error terms to be at a minimum, the first derivatives with respect to B must be zero
dE'E / dB = - 2X'Y + 2X'XB = 0
and the matrix of second derivatives, 2X'X, must be positive definite. The vector of estimated coefficients B* is given by
B* = (X'X)-1 (X'Y)
where (X'X)-1 denotes the inverse of the matrix X'X.
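As a minimal sketch of this formula, the following Python code builds X'X and X'Y and solves the system (X'X)B* = X'Y by elimination rather than by forming the inverse explicitly, which gives the same B*. The helper names (`transpose`, `matmul`, `solve`, `ols_coefficients`) and the small data set are assumptions made for illustration only; the data are generated exactly from y = 1 + 2 x1 + 3 x2, so the estimates should recover those coefficients.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    # Product of two matrices given as lists of rows.
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)]
            for row in a]


def solve(a, b):
    """Solve a z = b (a square, b a vector) by Gauss-Jordan elimination."""
    n = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:     # crude singularity check
            raise ValueError("X'X is singular; the x variables are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ols_coefficients(x_rows, y):
    """Return B* = (X'X)^-1 X'Y, with a column of ones added for b0."""
    X = [[1.0] + [float(v) for v in row] for row in x_rows]
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    XtY = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]
    return solve(XtX, XtY)


# Hypothetical data with k = 2 exogenous variables and n = 5 observations.
x_rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in x_rows]

B_star = ols_coefficients(x_rows, y)
print([round(b, 6) for b in B_star])
```

In practice a numerical library would normally be used for this step; the pure Python version is shown only to keep the arithmetic visible.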
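Applying this general formulation of coefficient estimates to the case of two variables (i.e. where there is only one x, so that row t of X is (1, xt)), the matrices X'X and X'Y that determine B* are

$$X'X = \begin{pmatrix} n & \sum x_t \\ \sum x_t & \sum x_t^2 \end{pmatrix}, \qquad X'Y = \begin{pmatrix} \sum y_t \\ \sum x_t y_t \end{pmatrix}$$

The system (X'X)B* = X'Y then reduces to the following two equations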
Sum(yt) = n b0 + b1 Sum(xt)
Sum(xtyt) = b0 Sum(xt) + b1Sum(xt2)
Solving for b1 and b0 gives
b*1 = (n Sum(xtyt) - Sum(xt) Sum(yt)) / (n Sum(xt2) - (Sum(xt))2)
Since E(x) = Sum(xt) / n and E(y) = Sum(yt) / n, this expression can be written with dxt = xt - E(x) and dyt = yt - E(y) as
b*1 = Sum((dxt)(dyt)) / Sum((dxt)2)
and b*0 = Sum(yt)/n - b*1 Sum(xt)/n
or b*0 = E(y) - b*1 E(x)
One can verify that the estimates for b*0 and b*1 are naturally identical to those for a* and b* derived in the previous section.
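The verification can be made explicit in one worked step. It uses only the definitions given above, namely dxt = xt - E(x), dyt = yt - E(y), E(x) = Sum(xt)/n and E(y) = Sum(yt)/n:

```latex
\begin{aligned}
\sum_t dx_t\,dy_t &= \sum_t \bigl(x_t - E(x)\bigr)\bigl(y_t - E(y)\bigr)
  = \sum_t x_t y_t - n\,E(x)\,E(y)
  = \frac{n\sum_t x_t y_t - \sum_t x_t \sum_t y_t}{n} \\
\sum_t dx_t^2 &= \sum_t x_t^2 - n\,E(x)^2
  = \frac{n\sum_t x_t^2 - \bigl(\sum_t x_t\bigr)^2}{n}
\end{aligned}
```

Dividing the first line by the second, the factor 1/n cancels, which gives the raw-sum expression for b*1 and shows that it equals the deviation form obtained in the case of two variables.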
c)- Testing quality of regression results
The test of the hypothesis that each coefficient estimate b* is different from zero is conducted with the t statistic
tb = b* / sb
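where sb = the standard deviation of b* (in the general formulation, it is given by the square root of the corresponding diagonal element of the variance-covariance matrix Sbb = Se(X'X)-1, where Se = variance of the error term).
Under the usual assumptions, the t statistic follows a Student t distribution with n - (k+1) degrees of freedom (i.e. the number of observations minus the number of estimated coefficients), whose values are given in tables; for large samples it approaches the standard normal distribution N(0,1). For instance, suppose we obtain a t statistic of 1.35 with 10 degrees of freedom. The table gives a critical value of 1.372 for a one-tailed test at the 10% level of significance, so the estimate falls just short of supporting, with 90% confidence, the hypothesis that b is greater than zero. The level of significance is the probability of wrongly rejecting the hypothesis that b=0 when it is in fact true; in a two-tailed test of whether b is different from zero, the same tabulated value corresponds to a 20% level of significance (i.e. twice 10%).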
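For the case of two variables, the t statistic of b* can be sketched as follows. The three sums and the number of observations are those of the Delta example reported in Table T-5.27 of this appendix; the variable names are hypothetical, and the critical value 2.201 is the tabulated two-tailed 5% value of the Student t distribution for 11 degrees of freedom.

```python
import math

# Sums taken from the Delta example (Table T-5.27 of this appendix).
n = 13
sum_dx2 = 24297638.2      # Sum(dxt^2)
sum_dy2 = 101168692.0     # Sum(dyt^2), i.e. TD
sum_dxdy = 47946344.1     # Sum(dxt dyt)

b_star = sum_dxdy / sum_dx2

# Sum of squared residuals and variance of the error term (se in the appendix).
ssr = sum_dy2 - b_star * sum_dxdy
se = ssr / (n - 2)

var_b = se / sum_dx2       # variance of b*
sb = math.sqrt(var_b)      # standard deviation of b*
tb = b_star / sb           # t statistic of b*

degrees_of_freedom = n - 2          # observations minus estimated coefficients
T_CRITICAL_5PCT_11DF = 2.201        # from a Student t table (two-tailed, 5%)

print("b* =", round(b_star, 6))
print("sb =", round(sb, 6))
print("tb =", round(tb, 3), "with", degrees_of_freedom, "degrees of freedom")
print("b* significantly different from zero at 5%:", abs(tb) > T_CRITICAL_5PCT_11DF)
```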
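To test the validity of the entire equation, the statistic used is R2
R2 = 1 - SSR/TD
or R2 = (TD - SSR)/TD
where SSR = Sum(et2) is the sum of squared residuals and TD = Sum(dyt2) is the total sum of squared deviations of y from its mean. One may verify that R2 lies between zero and one. An R2 value in excess of .50 is often taken as an indication that a correlation is present.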
When there is more than one exogenous variable, it is necessary to calculate an adjusted R2 to account for the loss of degrees of freedom due to the presence of additional variables
Adjusted R2 = 1 - (SSR/ (N-K)) / (TD/(N-1))
where N = number of observations and K = number of estimated coefficients, including the constant term (K = 2 in the case of two variables).
A more rigorous test of the "goodness of fit" than R2 is the F statistic, which follows an F distribution whose values can be found in tables for v = K-1 and o = N-K degrees of freedom. The F(v,o) statistic is calculated by
F(v,o) = ((TD-SSR)/(K-1)) / (SSR/(N-K))
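The three measures of fit can be computed together from SSR, TD, N and K. The sketch below assumes a hypothetical helper named `goodness_of_fit` and, as inputs, the values of the Delta example reported in Table T-5.27 (SSR = 6556537.2, TD = 101168692, N = 13, K = 2).

```python
def goodness_of_fit(ssr, td, n, k):
    """Return (R2, adjusted R2, F) from the sums of squares.

    ssr: sum of squared residuals, Sum(et^2)
    td:  total sum of squared deviations of y, Sum(dyt^2)
    n:   number of observations
    k:   number of estimated coefficients, constant included
    """
    if k < 2 or n <= k:
        raise ValueError("need at least one x variable and n > k")
    r2 = 1 - ssr / td
    adj_r2 = 1 - (ssr / (n - k)) / (td / (n - 1))
    f_stat = ((td - ssr) / (k - 1)) / (ssr / (n - k))
    return r2, adj_r2, f_stat


r2, adj_r2, f_stat = goodness_of_fit(ssr=6556537.2, td=101168692.0, n=13, k=2)

print("R2          =", round(r2, 4))
print("Adjusted R2 =", round(adj_r2, 4))
print("F(1,11)     =", round(f_stat, 2))
```

With a single x variable, the F statistic equals the square of the t statistic of b*, which offers a quick consistency check on the calculations.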
Finally, the Durbin-Watson statistic D-W, which is used to determine the presence of autocorrelation, is calculated by
D-W = Sum((et - et-1 )2 ) / Sum (et2)
where e = error term or residual
Tables of values of the Durbin-Watson statistic appear in statistical manuals. They allow one to reject the hypothesis of autocorrelation if D-W is higher than the value found in the column of the table for the number of observations and the number of variables used in the regression. D-W ranges from 0 to 4, and a value close to 2 indicates the absence of first-order autocorrelation.
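The D-W formula translates directly into code. In the sketch below the function name `durbin_watson` and the two short residual series are hypothetical; they are chosen only to show how the statistic moves away from 2 when successive residuals alternate in sign or drift slowly.

```python
def durbin_watson(residuals):
    """Return Sum((et - et-1)^2) / Sum(et^2), the sum in the numerator running from t = 2."""
    if len(residuals) < 2:
        raise ValueError("at least two residuals are needed")
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    if denominator == 0:
        raise ValueError("all residuals are zero")
    return numerator / denominator


# Hypothetical residual series.
alternating = [1.0, -1.0, 1.0, -1.0]   # signs flip at every observation
drifting = [2.0, 1.0, -1.0, -2.0]      # neighbouring residuals are similar

dw_alternating = durbin_watson(alternating)
dw_drifting = durbin_watson(drifting)

print("alternating residuals: D-W =", dw_alternating)
print("drifting residuals:    D-W =", dw_drifting)
```

The alternating series gives a value well above 2 (negative autocorrelation), while the drifting series gives a value well below 2 (positive autocorrelation).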
2)- Practical application
For this application we use the data of the example for Delta revenues and GDP in Chapter 5 Section E. The following schedules and tables illustrate how the above equations and formulas are used to calculate the estimates. First, Table T-5.25 below reproduces the data in Table T-5.23 and adds the values of the deviations d(yt) and d(xt).
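| Year | yt | xt | d(yt) | d(xt) |
| 1987 | 5318 | 4742.5 | -5435 | -2078.66 |
| 1988 | 6915 | 5108.3 | -3838 | -1712.86 |
| 1989 | 8039 | 5489.1 | -2714 | -1332.06 |
| 1990 | 8683 | 5803.2 | -2070 | -1017.96 |
| 1991 | 9171 | 5986.2 | -1582 | -834.96 |
| 1992 | 10837 | 6318.9 | 84 | -502.26 |
| 1993 | 11657 | 6642.3 | 904 | -178.86 |
| 1994 | 12077 | 7054.3 | 1324 | 233.14 |
| 1995 | 12194 | 7400.3 | 1441 | 579.14 |
| 1996 | 12455 | 7813.2 | 1702 | 992.04 |
| 1997 | 13594 | 8300.8 | 2841 | 1479.64 |
| 1998 | 14138 | 8759.9 | 3385 | 1938.74 |
| 1999 | 14711 | 9256.1 | 3958 | 2434.94 |
| Sum | 139789 | 88675.1 | 0 | 0 |
| Mean | 10753 | 6821.161538 | . | . |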
Table T-5.26 below is constructed to give the values of the sums of squared deviations and residuals, which are needed to calculate the coefficients.
(The row-by-row values of this table are not reproduced here; its column totals, which are the values used in Table T-5.27, are as follows.)
| Sum | Value |
| Sum(d(yt)2) = TD | 101168692 |
| Sum(d(xt)2) | 24297638.2 |
| Sum(d(yt)d(xt)) | 47946344.1 |
| Sum(et2) = SSR | 6556537.2 |
In Table T-5.27 below, each statistic is presented with its formula, as developed in the previous section of this appendix, its estimated value, and the calculation that derives the value from the formula using the elements presented in Tables T-5.25 and T-5.26.
| Name | Formula | Estimated value | Calculation |
| Coefficient b* estimated value | b*=Sum(d(yt)d(xt))/Sum(d(xt)2) | 1.973292366 | 47946344.1 / 24297638.2 |
| Coefficient a* estimated value | a*=E(y) - b*E(x) | -2707.145988 | 10753 - (1.973292366 * 6821.161538) |
| Sum of squared residuals | SSR=Sum(et2) | 6556537.2 | 6556537.2 |
| Variance of the error term | se= (Sum(d(yt)2) - (b*Sum(d(yt)d(xt)))) / (n-2) | 596048.83 | (101168692 - ( 1.973292366 * 47946344.1)) / (13 -2) |
| Variance of b* | var(b)=se/Sum(d(xt)2) | 0.024531 | 596048.83 / 24297638.2 |
| Standard deviation of b* | sb=SquareRoot(var(b)) | 0.156624 | SquareRoot(0.024531) |
| Variance of a* | var(a)=se((1/n)+(E(x)2/Sum(d(xt)2))) | 1187240.9 | 596048.83 * ((1/13) + (6821.161538 * 6821.161538/ 24297638.2)) |
| Standard deviation of a* | sa=SquareRoot(var(a)) | 1089.6 | SquareRoot(1187240.9 ) |
| Coefficient of determination or R2 | R2=b*2 Sum(d(xt)2)/Sum(d(yt)2) or = 1 - SSR/TD | 0.9352 | 1 - (6556537.2 / 101168692) |
| Adjusted R2 | Adjusted R2=1-((SSR/(n-2))/(Sum(d(yt)2)/(n-1))) | . | . |
| t statistic of b* | tb=b*/sb | 12.599 | 1.973292366 /0.156624 |
| t statistic of a* | ta=a*/sa | -2.4845 | -2707.145988 / 1089.6 |
| F statistic | F=((TD-SSR)/(k-1))/(SSR/(n-k)) | 158.73 | ((101168692 - 6556537.2)/(2-1))/(6556537.2/(13-2)) |
Finally, Table T-5.28 shows the content of each spreadsheet cell used to obtain the numbers shown in the previous three tables. Such a spreadsheet can be used to obtain OLS regression estimates for any data series of two variables, by simply repeating the formulas in cells D4 through M4 beyond line 16 as many times as needed, entering the number of observations in A2, and changing the summations in B2 to M2 accordingly (i.e. changing 16 in the summations to the number of observations plus 3).
| . | A | B | C | D | E | F | G | H | K | L | M | N | O |
| 1 | Years | yt | xt | dyt | dxt | dyt2 =TD | dxt2 | dytdxt | yt* | et | et2 = SSR | Name | Estimates |
| 2 | 13 | =SUM(B4:B16) | =SUM(C4:C16) | =SUM(D4:D16) | =SUM(E4:E16) | =SUM(F4:F16) | =SUM(G4:G16) | =SUM(H4:H16) | =SUM(K4:K16) | =SUM(L4:L16) | =SUM(M4:M16) | b* | =H2/G2 |
| 3 | Mean | =B2/A2 | =C2/A2 | . | . | . | . | . | . | . | . | a* | =B3-O2*C3 |
| 4 | 1987 | 5318 | 4742.5 | =B4-B3 | =C4-C3 | =D4*D4 | =E4*E4 | =D4*E4 | =O3+O2*C4 | =B4-K4 | =L4*L4 | SSR | =M2 |
| 5 | 1988 | 6915 | 5108.3 | =B5-B3 | =C5-C3 | =D5*D5 | =E5*E5 | =D5*E5 | =O3+O2*C5 | =B5-K5 | =L5*L5 | se | =(F2-(O2*H2))/(A2-2) |
| 6 | 1989 | 8039 | 5489.1 | =B6-B3 | =C6-C3 | =D6*D6 | =E6*E6 | =D6*E6 | =O3+O2*C6 | =B6-K6 | =L6*L6 | VAR(b) | =O5/G2 |
| 7 | 1990 | 8683 | 5803.2 | =B7-B3 | =C7-C3 | =D7*D7 | =E7*E7 | =D7*E7 | =O3+O2*C7 | =B7-K7 | =L7*L7 | sb | =SQRT(O6) |
| 8 | 1991 | 9171 | 5986.2 | =B8-B3 | =C8-C3 | =D8*D8 | =E8*E8 | =D8*E8 | =O3+O2*C8 | =B8-K8 | =L8*L8 | VAR(a) | =O5*((1/A2)+(C3*C3/G2)) |
| 9 | 1992 | 10837 | 6318.9 | =B9-B3 | =C9-C3 | =D9*D9 | =E9*E9 | =D9*E9 | =O3+O2*C9 | =B9-K9 | =L9*L9 | sa | =SQRT(O8) |
| 10 | 1993 | 11657 | 6642.3 | =B10-B3 | =C10-C3 | =D10*D10 | =E10*E10 | =D10*E10 | =O3+O2*C10 | =B10-K10 | =L10*L10 | R2 | =1-(M2/F2) |
| 11 | 1994 | 12077 | 7054.3 | =B11-B3 | =C11-C3 | =D11*D11 | =E11*E11 | =D11*E11 | =O3+O2*C11 | =B11-K11 | =L11*L11 | AdjR2 | . |
| 12 | 1995 | 12194 | 7400.3 | =B12-B3 | =C12-C3 | =D12*D12 | =E12*E12 | =D12*E12 | =O3+O2*C12 | =B12-K12 | =L12*L12 | tb | =O2/O7 |
| 13 | 1996 | 12455 | 7813.2 | =B13-B3 | =C13-C3 | =D13*D13 | =E13*E13 | =D13*E13 | =O3+O2*C13 | =B13-K13 | =L13*L13 | ta | =O3/O9 |
| 14 | 1997 | 13594 | 8300.8 | =B14-B3 | =C14-C3 | =D14*D14 | =E14*E14 | =D14*E14 | =O3+O2*C14 | =B14-K14 | =L14*L14 | F | =((F2-M2)/(2-1))/(M2/(A2-2)) |
| 15 | 1998 | 14138 | 8759.9 | =B15-B3 | =C15-C3 | =D15*D15 | =E15*E15 | =D15*E15 | =O3+O2*C15 | =B15-K15 | =L15*L15 | . | . |
| 16 | 1999 | 14711 | 9256.1 | =B16-B3 | =C16-C3 | =D16*D16 | =E16*E16 | =D16*E16 | =O3+O2*C16 | =B16-K16 | =L16*L16 | . | . |
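For readers who prefer a script to a spreadsheet, the same calculation can be sketched in Python. The data are the 13 observations of yt and xt listed in Table T-5.28; the variable names are hypothetical, and the comments indicate the spreadsheet cell or column each line is meant to mirror.

```python
import math

# Observations from Table T-5.28 (years 1987 to 1999).
y = [5318, 6915, 8039, 8683, 9171, 10837, 11657,
     12077, 12194, 12455, 13594, 14138, 14711]
x = [4742.5, 5108.3, 5489.1, 5803.2, 5986.2, 6318.9, 6642.3,
     7054.3, 7400.3, 7813.2, 8300.8, 8759.9, 9256.1]

n = len(y)                                         # A2
mean_y = sum(y) / n                                # B3
mean_x = sum(x) / n                                # C3
dy = [yt - mean_y for yt in y]                     # column D
dx = [xt - mean_x for xt in x]                     # column E
td = sum(d * d for d in dy)                        # F2
sum_dx2 = sum(d * d for d in dx)                   # G2
sum_dxdy = sum(p * q for p, q in zip(dy, dx))      # H2

b_star = sum_dxdy / sum_dx2                        # O2
a_star = mean_y - b_star * mean_x                  # O3
fitted = [a_star + b_star * xt for xt in x]        # column K
e = [yt - ft for yt, ft in zip(y, fitted)]         # column L
ssr = sum(et * et for et in e)                     # M2 and O4

# The spreadsheet computes se as (F2 - O2*H2)/(A2 - 2), which is
# algebraically the same as SSR/(n - 2).
se = ssr / (n - 2)                                 # O5
var_b = se / sum_dx2                               # O6
sb = math.sqrt(var_b)                              # O7
var_a = se * ((1 / n) + (mean_x ** 2 / sum_dx2))   # O8
sa = math.sqrt(var_a)                              # O9
r2 = 1 - ssr / td                                  # O10
tb = b_star / sb                                   # O12
ta = a_star / sa                                   # O13
f_stat = ((td - ssr) / (2 - 1)) / (ssr / (n - 2))  # O14

for name, value in [("b*", b_star), ("a*", a_star), ("SSR", ssr),
                    ("se", se), ("sb", sb), ("sa", sa),
                    ("R2", r2), ("tb", tb), ("ta", ta), ("F", f_stat)]:
    print(f"{name:>4} = {value:.6f}")
```

To apply the script to another pair of series, only the two lists need to be replaced; unlike the spreadsheet, no summation range has to be adjusted.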