© 2000 John Petroff  

Appendix 5A to Chapter 5

 

 

The purpose of this appendix is twofold: to present a rudimentary review of the derivation of ordinary least squares (OLS) regression estimates, and to work through a practical application that can be reused in any sensitivity analysis or other simple regression analysis. The mathematical derivation is offered here merely as a rigorous foundation for the practical application below and the general discussion in Chapter 5 Section E-1. For a complete treatment of the subject, see the sources listed in the readings or visit Econometrics Journal online at http://www.econ.vu.nl/econometriclinks/ mentioned in the chapter.

1)- Derivation of regression coefficients

a)- Case of two variables

Let us take the simplest case of two variables, where we ask whether a linear relationship of the form

y = a + bx

exists between the two variables. To test the hypothesis that the coefficient b is different from zero, and so establish that a relationship indeed exists between variables x and y, we have n observations of x and y at our disposal. The equation used is the one presented in the chapter, with the error term et indicating that the data will naturally not fit the expected linear function perfectly

yt = a + bxt + et

where a = constant term (intercept)
b = slope coefficient (regression coefficient of x)
e = error term
t = time (index of the observation)

for all observations of y and x from t = 1 to t = n

The goal is to obtain estimates a* and b* for the unknown parameters a and b in terms of the available n observations. The OLS method is to minimize the sum of squares of the error terms et. The error terms are the deviations, or distances, from the actual values of yt to the fitted line (y*t = a* + b*xt)

et = yt - a* - b*xt

which can be seen as vertical segments in Graph G-5.1. As mentioned in the chapter, the mean of the error terms is zero

E(e) = Sum(et)/n = 0

because positive deviations cancel out negative ones. (This is one of the essential assumptions of OLS. If this assumption is violated, the estimates may be biased. Autocorrelation of the error terms violates a different assumption: it generally leaves the estimates unbiased but makes their standard deviations unreliable.)

The sum of squared error terms is

Sum(et^2) = Sum((yt - a* - b*xt)^2)

This sum of squared error terms is at a minimum when both the partial derivative with respect to a* is zero

dSum(et^2) / da* = - 2 Sum(yt - a* - b*xt) = 0

and the partial derivative with respect to b* is zero

dSum(et^2) / db* = - 2 Sum(xt(yt - a* - b*xt)) = 0

The partial derivative with respect to a* simplifies to

Sum(yt) = na* + b*Sum(xt)

or since the mean (i.e. expected value) of y is E(y)=Sum(yt)/n and the mean of x is E(x)=Sum(xt)/n, we obtain

E(y) = a* + b*E(x)

Substituting a* = E(y) - b*E(x) into et = yt - a* - b*xt, the error term et can also be written in terms of the deviations dxt of xt from its mean E(x) (i.e. dxt = xt - E(x)) and the deviations dyt of yt from its mean E(y) (i.e. dyt = yt - E(y)), as

et = dyt - b*dxt

The sum of squared error terms is then

Sum(et^2) = Sum((dyt - b*dxt)^2)

and the partial derivative with respect to b* is

dSum(et^2) / db* = - 2 Sum(dxt(dyt - b*dxt)) = 0

which gives

b* = Sum((dxt)(dyt)) / Sum((dxt)^2)

The second derivative of the sum of squared error terms with respect to b* is 2 Sum(dxt^2), which is necessarily positive as long as the observations of x are not all equal, and which assures that the sum of squared error terms is indeed at a minimum. The estimate for a* is given by

a* = E(y) - b*E(x)
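
The two closed-form estimates translate directly into a few lines of code. The following is a minimal sketch in Python, not part of the original appendix; the function name ols_two_variables and the five data points are assumptions chosen only for illustration.

```python
def ols_two_variables(x, y):
    """Return (a_star, b_star) for the fitted line y* = a* + b* x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y need the same length, with at least 2 points")
    mean_x = sum(x) / n                      # E(x)
    mean_y = sum(y) / n                      # E(y)
    dx = [xt - mean_x for xt in x]           # dxt = xt - E(x)
    dy = [yt - mean_y for yt in y]           # dyt = yt - E(y)
    sum_dx2 = sum(d * d for d in dx)         # Sum(dxt^2)
    if sum_dx2 == 0:
        raise ValueError("x must not be constant")
    b_star = sum(p * q for p, q in zip(dx, dy)) / sum_dx2
    a_star = mean_y - b_star * mean_x
    return a_star, b_star


# Illustrative data (hypothetical, not from the chapter).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
a_star, b_star = ols_two_variables(x, y)
residuals = [yt - a_star - b_star * xt for xt, yt in zip(x, y)]
print(a_star, b_star, sum(residuals))
```

The last printed number is the sum of the residuals, which should be zero up to floating-point rounding, as stated earlier in this section.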

b) - General formulation

In this section, we go beyond the case of two variables and present the formulation of regression analysis for any number of variables. It is shown that the case of two variables is just a special case of the general formulation. To keep this material accessible to all readers, derivation of coefficient estimates is not presented, and interested readers are once again urged to consult specialized textbooks.

Let us specify the following linear equation

yt = b0 + b1x1t + b2x2t + . . . + bkxkt

Observe that the notation is different from that of the case of two variables (i.e. b0 in this equation corresponds to a above, b1 corresponds to b above, and x1t corresponds to xt above).

As before, the goal is to test whether (or which of) the k+1 coefficients b are different from zero. If so, that would establish the presence of a relationship between the endogenous variable y and the variables x. We have n observations from t=1 to t=n for y and x. The number of observations must exceed the number of exogenous variables k plus one.

The equation can be stated more economically in matrix form, with Y the vector of the n observations of y, X the matrix of the n observations of the k variables x1 to xk preceded by a column of ones for the constant term, and B the vector of the k+1 coefficients b0 to bk

Y = XB

The estimated coefficients b* are those for which the sum of squared error terms et is at a minimum, where the error terms et are given by

et = yt - xt B

with xt the row of X for observation t. If E is the vector of the error terms et

E = Y - XB

The sum of squared error terms is

Sum(et^2) = E'E

or = (Y-XB)'(Y-XB)

= Y'Y - 2B'X'Y + B'X'XB

For the sum of squared error terms to be at a minimum, the first derivatives with respect to B must be null

dE'E / dB = - 2X'Y + 2X'XB = 0

and the matrix of second derivatives 2X'X must be positive definite. The vector of estimated coefficients B* is given by

B* = (X'X)^-1 (X'Y)

Applying this general formulation of coefficient estimates to the case of two variables (i.e. where there is only one x, and X has a column of ones followed by the column of xt), the components of B* are

X'Y =
[ Sum(yt)   ]
[ Sum(xtyt) ]

X'X =
[ n          Sum(xt)   ]
[ Sum(xt)    Sum(xt^2) ]

(X'X)^-1 = 1/(n Sum(xt^2) - (Sum(xt))^2) times
[   Sum(xt^2)    - Sum(xt) ]
[ - Sum(xt)        n       ]

 

The normal equations (X'X)B* = X'Y reduce to the following two equations

Sum(yt) = n b*0 + b*1 Sum(xt)

Sum(xtyt) = b*0 Sum(xt) + b*1 Sum(xt^2)

Solving for b*1 and b*0 gives

b*1 = (n Sum(xtyt) - Sum(xt) Sum(yt)) / (n Sum(xt^2) - (Sum(xt))^2)

Since E(x) = Sum(xt) / n and E(y) = Sum(yt) / n, this expression can be written with dxt = xt - E(x) and dyt = yt - E(y) as

b*1 = Sum((dxt)(dyt)) / Sum((dxt)^2)

and b*0 = Sum(yt)/n - b*1 Sum(xt)/n

or b*0 = E(y) - b*1 E(x)

One can verify that the estimates for b*0 and b*1 are naturally identical to those for a* and b* derived in the previous section.
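
The general formulation can also be checked numerically. The sketch below, which is an illustration and not the author's method, builds X with a leading column of ones, forms X'X and X'Y, and solves the normal equations by Gaussian elimination instead of inverting X'X explicitly. The helper names (transpose, matmul, solve, ols_normal_equations) and the data are hypothetical.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]


def solve(a, rhs):
    """Solve a z = rhs by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):          # back substitution
        s = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z


def ols_normal_equations(rows, y):
    """rows: one list of k regressor values per observation (no constant).

    Returns [b0*, b1*, ..., bk*], the solution of (X'X) B* = X'Y.
    """
    X = [[1.0] + list(r) for r in rows]   # column of ones for b0
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    XtY = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]
    return solve(XtX, XtY)


# Two-variable case with hypothetical data: one regressor x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
b0, b1 = ols_normal_equations([[xt] for xt in x], y)
print(b0, b1)
```

With a single regressor, b0 and b1 agree with a* and b* computed from the deviation formulas, which is the equivalence stated in the text. Solving the linear system rather than computing the inverse is a design choice made here for brevity and numerical stability.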

 

c)- Testing quality of regression results

The test of the hypothesis that each coefficient estimate b* is different from zero is conducted with the t statistic

tb = b* / sb

where sb = the standard deviation of b* (in the general formulation, it is the square root of the corresponding diagonal element of the variance-covariance matrix Sbb = Se(X'X)^-1, where Se = variance of the error term)

The t statistic follows a Student t distribution, whose values are given in tables, with degrees of freedom equal to the number of observations minus the number of estimated coefficients (i.e. the number of exogenous variables plus one for the constant). For a large number of observations this distribution approaches the normal distribution N(0,1). For instance, suppose we obtain an estimate for t of 1.35 with 10 degrees of freedom. For a two-sided test, the table gives a critical value of 1.812 at the 10% level of significance, so an estimate of 1.35 does not allow us to state with 90% confidence that b is different from zero; a t value above 1.812 would. The level of significance (here 10%, or 5% in each tail of the distribution) is the probability of rejecting the hypothesis that b=0 when it is in fact true.
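
For the case of two variables, the standard deviations and t statistics can be computed from a handful of sums. The following is a hedged sketch: the function name t_statistics and its argument names are assumptions, the variance formulas are the standard ones for simple regression, and the input values are those reported in the application in section 2 of this appendix.

```python
import math


def t_statistics(n, a_star, b_star, mean_x, sum_dx2, ssr):
    """t statistics of a* and b* in a regression with one x and a constant."""
    error_variance = ssr / (n - 2)        # estimated variance of the error term
    var_b = error_variance / sum_dx2
    var_a = error_variance * (1.0 / n + mean_x ** 2 / sum_dx2)
    sb = math.sqrt(var_b)
    sa = math.sqrt(var_a)
    return {"sb": sb, "sa": sa, "tb": b_star / sb, "ta": a_star / sa, "df": n - 2}


# Values taken from the application tables (Delta sales regressed on GDP).
stats = t_statistics(n=13, a_star=-2707.145988, b_star=1.973292366,
                     mean_x=6821.161538, sum_dx2=24297638.2, ssr=6556537.2)
print(stats)
```

The resulting t values are then compared with the tabulated critical value for the degrees of freedom returned under "df".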

To test the validity of the entire equation, the statistic used is R^2

R^2 = 1 - SSR/TD

or R^2 = (TD - SSR)/TD

where TD = total deviation (the sum of squared deviations of yt from its mean E(y))
SSR = sum of squared error terms or residuals

One may verify that the maximum value of R^2 is one, reached when all residuals are zero. An R^2 value in excess of .50 is usually taken as an indication that a correlation is present.

When there is more than one exogenous variable, it is necessary to calculate an adjusted R^2 to account for the loss of degrees of freedom due to the presence of additional variables

Adjusted R^2 = 1 - (SSR/(N-K)) / (TD/(N-1))

where TD = total deviation
SSR = sum of squared error terms or residuals
N = the number of observations
K = the number of estimated coefficients (the exogenous variables plus the constant term)

A more rigorous test of the "goodness of fit" than R^2 is the F statistic, which follows an F distribution whose values can be found in tables for the degrees of freedom v = K-1 (numerator) and o = N-K (denominator). The F(v,o) statistic is calculated by

F(v,o) = ((TD-SSR)/(K-1)) / (SSR/(N-K))

where TD = total deviation
SSR = sum of squared error terms or residuals
N = the number of observations
K = the number of estimated coefficients (the exogenous variables plus the constant term)
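
The three goodness-of-fit measures depend only on TD, SSR, N and K, so they can be computed together. This is a minimal sketch with a hypothetical function name; K is counted here as the number of estimated coefficients, constant included, which is how the application in section 2 uses it (k = 2 for one exogenous variable). The inputs are the sums reported in that application.

```python
def goodness_of_fit(td, ssr, n, k):
    """Return (R^2, adjusted R^2, F) from the sums of squares.

    k counts every estimated coefficient, including the constant term.
    """
    if k < 2 or n <= k:
        raise ValueError("need k >= 2 coefficients and more observations than k")
    r2 = 1 - ssr / td
    adjusted_r2 = 1 - (ssr / (n - k)) / (td / (n - 1))
    f_stat = ((td - ssr) / (k - 1)) / (ssr / (n - k))
    return r2, adjusted_r2, f_stat


# TD and SSR as reported in the application (13 observations, 2 coefficients).
r2, adjusted_r2, f_stat = goodness_of_fit(td=101168692, ssr=6556537.2, n=13, k=2)
print(r2, adjusted_r2, f_stat)
```

With a single exogenous variable the F statistic equals the square of the t statistic of the slope, which gives a quick consistency check on the results.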

Finally, the Durbin-Watson statistic D-W, which is used to detect the presence of autocorrelation, is calculated by

D-W = Sum((et - et-1)^2) / Sum(et^2)

where e = error term or residual, the sum in the numerator running from t = 2 to t = n

Tables of values of the Durbin-Watson statistic appear in statistical manuals. The statistic lies between 0 and 4, with values near 2 indicating no first-order autocorrelation. The hypothesis of positive autocorrelation can be rejected if D-W is higher than the upper value found in the table for the number of observations and the number of variables used in the regression.
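
The Durbin-Watson formula is short enough to sketch directly. The function name durbin_watson is an assumption; the residuals listed are et = yt - y*t for the application in section 2, rounded to one decimal, so the result is only approximate.

```python
def durbin_watson(residuals):
    """D-W = Sum((e_t - e_(t-1))^2) / Sum(e_t^2), numerator from t = 2 to n."""
    ssr = sum(e * e for e in residuals)
    if ssr == 0:
        raise ValueError("all residuals are zero")
    diffs = sum((cur - prev) ** 2 for prev, cur in zip(residuals, residuals[1:]))
    return diffs / ssr


# Residuals of the Delta sales regression, 1987 to 1999 (rounded).
residuals = [-1333.1, -458.0, -85.4, -61.2, 65.6, 1075.1, 1256.9,
             863.9, 298.1, -255.5, -78.7, -440.6, -846.8]
dw = durbin_watson(residuals)
print(dw)
```

A value well below 2 points toward positive autocorrelation and a value well above 2 toward negative autocorrelation; the tabulated bounds decide whether the departure from 2 is significant.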

2)- Application

For this application we use the data of the example for Delta revenues and GDP in Chapter 5 Section E. The following schedules and tables illustrate how the above equations and formulas are used to calculate the estimates. First, Table T-5.25 below reproduces the data in Table T-5.23 and adds the values of deviations d(yt) and d(xt).

Table T-5.25

Regression application data

Row  Label       A      B         C            D              E
1    Definition  years  yt=sales  xt=gdp       d(yt)=yt-E(y)  d(xt)=xt-E(x)
2    Sum         13     139789    88675.1      0              -5.45697E-12
3    Mean        -      10753     6821.161538  .              .
4                1987   5318      4742.5       -5435          -2078.661538
5                1988   6915      5108.3       -3838          -1712.861538
6                1989   8039      5489.1       -2714          -1332.061538
7                1990   8683      5803.2       -2070          -1017.961538
8                1991   9171      5986.2       -1582          -834.9615385
9                1992   10837     6318.9       84             -502.2615385
10               1993   11657     6642.3       904            -178.8615385
11               1994   12077     7054.3       1324           233.1384615
12               1995   12194     7400.3       1441           579.1384615
13               1996   12455     7813.2       1702           992.0384615
14               1997   13594     8300.8       2841           1479.638462
15               1998   14138     8759.9       3385           1938.738462
16               1999   14711     9256.1       3958           2434.938462

Table T-5.26 below is constructed to give the values of the sums of squared deviations and residuals, which are needed to calculate the coefficients.

Table T-5.26

Regression elements calculation

Row  Label       F                G           H            I           J              K
1    Definition  TD=sum(d(yt)^2)  d(xt)^2     d(yt)d(xt)   y*=a*+b*xt  et=yt-b*xt-a*  SSR=sum(et^2)
2    Sum         101168692        24297638.2  47946344.1   139789      3.63798E-12    6556537.2
3    Mean        .                .           .            .           .              .
4                29539225         4320833.7   11297525.4   6651.1      -1333.1        1777403.7
5                14730244         2933894.6   6573962.5    7373.0      -458.0         209785.4
6                7365796          1774387.9   3615215.0    8124.4      -85.4          7302.2
7                4284900          1036245.6   2107180.3    8744.2      -61.2          3753.3
8                2502724          697160.7    1320909.1    9105.3      65.6           4306.4
9                7056             252266.6    -42189.9     9761.8      1075.1         1155859.0
10               817216           31991.4     -161690.8    10400.0     1256.9         1579913.5
11               1752976          54353.5     308675.3     11213.0     863.9          746409.0
12               2076481          335401.3    834538.5     11895.8     298.1          88917.5
13               2896804          984140.3    1688449.4    12710.5     -255.5         65322.1
14               8071281          2189329.9   4203652.8    13672.7     -78.7          6203.0
15               11458225         3758706.8   6562629.6    14578.6     -440.6         194214.5
16               15665764         5928925.3   9637486.4    15557.8     -846.8         717147.2

In Table T-5.27 below, each statistic is presented with its formula as developed in the previous section of this appendix, its estimated value, and the calculation that derives the value from the formula using the elements presented in Tables T-5.25 and T-5.26.

Table T-5.27

Regression estimates calculation
Name | Formula | Estimated value | Calculation
Coefficient b* estimated value | b*=Sum(d(yt)d(xt))/Sum(d(xt)^2) | 1.973292366 | 47946344.1 / 24297638.2
Coefficient a* estimated value | a*=E(y) - b*E(x) | -2707.145988 | 10753 - (1.973292366 * 6821.161538)
Sum of squared residuals | SSR=Sum(et^2) | 6556537.2 | 6556537.2
Variance of the error term (square of the standard error of regression) | se=(Sum(d(yt)^2) - (b*Sum(d(yt)d(xt)))) / (n-2) | 596048.83 | (101168692 - (1.973292366 * 47946344.1)) / (13-2)
Variance of b* | var(b)=se/Sum(d(xt)^2) | 0.024531 | 596048.83 / 24297638.2
Standard deviation of b* | sb=SquareRoot(var(b)) | 0.156624 | SquareRoot(0.024531)
Variance of a* | var(a)=se((1/n)+(E(x)^2/Sum(d(xt)^2))) | 1187240.9 | 596048.83 * ((1/13) + (6821.161538 * 6821.161538 / 24297638.2))
Standard deviation of a* | sa=SquareRoot(var(a)) | 1089.6 | SquareRoot(1187240.9)
Coefficient of determination or R^2 | R^2=b*^2 Sum(d(xt)^2)/Sum(d(yt)^2) or = 1 - SSR/TD | 0.935192033 | 1 - (6556537.2 / 101168692)
Adjusted R^2 | Adjusted R^2=1-(se/(Sum(d(yt)^2)/(n-1))) | . | .
t statistic of b* | tb=b*/sb | 12.59 | 1.973292366 / 0.156624
t statistic of a* | ta=a*/sa | -2.48 | -2707.145988 / 1089.6
F statistic | F=((TD-SSR)/(k-1))/(SSR/(n-k)) | 158.73 | ((101168692 - 6556537.2)/(2-1))/(6556537.2/(13-2))
 

Finally, Table T-5.28 shows the content of each spreadsheet cell used to obtain the numbers shown in the previous three tables. Such a spreadsheet can be used to obtain OLS regression estimates for any two-variable data series by simply repeating the formulas in cells D4 through M4 beyond line 16 as many times as needed, entering the number of observations in A2 and changing the summations in B2 to M2 accordingly (i.e. changing 16 in the summations to the number of observations plus 3).

Table T-5.28

. A B C D E F G H K L M N O
1 Years yt xt dyt dxt dyt^2(=TD) dxt^2 dytdxt yt* et et^2(=SSR) Name Estimates
2 13 =SUM(B4:B16) =SUM(C4:C16) =SUM(D4:D16) =SUM(E4:E16) =SUM(F4:F16) =SUM(G4:G16) =SUM(H4:H16) =SUM(K4:K16) =SUM(L4:L16) =SUM(M4:M16) b* =H2/G2
3 Mean =B2/A2 =C2/A2 . . . . . . . . a* =B3-O2*C3
4 1987 5318 4742.5 =B4-$B$3 =C4-$C$3 =D4*D4 =E4*E4 =D4*E4 =$O$3+$O$2*C4 =B4-K4 =L4*L4 SSR =M2
5 1988 6915 5108.3 =B5-$B$3 =C5-$C$3 =D5*D5 =E5*E5 =D5*E5 =$O$3+$O$2*C5 =B5-K5 =L5*L5 se =(F2-(O2*H2))/(A2-2)
6 1989 8039 5489.1 =B6-$B$3 =C6-$C$3 =D6*D6 =E6*E6 =D6*E6 =$O$3+$O$2*C6 =B6-K6 =L6*L6 VAR(b) =O5/G2
7 1990 8683 5803.2 =B7-$B$3 =C7-$C$3 =D7*D7 =E7*E7 =D7*E7 =$O$3+$O$2*C7 =B7-K7 =L7*L7 sb =SQRT(O6)
8 1991 9171 5986.2 =B8-$B$3 =C8-$C$3 =D8*D8 =E8*E8 =D8*E8 =$O$3+$O$2*C8 =B8-K8 =L8*L8 VAR(a) =O5*((1/A2)+(C3*C3/G2))
9 1992 10837 6318.9 =B9-$B$3 =C9-$C$3 =D9*D9 =E9*E9 =D9*E9 =$O$3+$O$2*C9 =B9-K9 =L9*L9 sa =SQRT(O8)
10 1993 11657 6642.3 =B10-$B$3 =C10-$C$3 =D10*D10 =E10*E10 =D10*E10 =$O$3+$O$2*C10 =B10-K10 =L10*L10 R2 =1-(M2/F2)
11 1994 12077 7054.3 =B11-$B$3 =C11-$C$3 =D11*D11 =E11*E11 =D11*E11 =$O$3+$O$2*C11 =B11-K11 =L11*L11 AdjR2 .
12 1995 12194 7400.3 =B12-$B$3 =C12-$C$3 =D12*D12 =E12*E12 =D12*E12 =$O$3+$O$2*C12 =B12-K12 =L12*L12 tb =O2/O7
13 1996 12455 7813.2 =B13-$B$3 =C13-$C$3 =D13*D13 =E13*E13 =D13*E13 =$O$3+$O$2*C13 =B13-K13 =L13*L13 ta =O3/O9
14 1997 13594 8300.8 =B14-$B$3 =C14-$C$3 =D14*D14 =E14*E14 =D14*E14 =$O$3+$O$2*C14 =B14-K14 =L14*L14 F =((F2-M2)/(2-1))/(M2/(A2-2))
15 1998 14138 8759.9 =B15-$B$3 =C15-$C$3 =D15*D15 =E15*E15 =D15*E15 =$O$3+$O$2*C15 =B15-K15 =L15*L15 . .
16 1999 14711 9256.1 =B16-$B$3 =C16-$C$3 =D16*D16 =E16*E16 =D16*E16 =$O$3+$O$2*C16 =B16-K16 =L16*L16 . .
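
For readers who prefer a script to a spreadsheet, the same column-by-column calculation can be sketched in Python. This is an illustrative port, not the author's tool: the variable names are assumptions, and the comments indicate the spreadsheet cell or column each line is meant to mirror. The data are those of Table T-5.25.

```python
# Data from Table T-5.25: Delta sales (yt) and GDP (xt), 1987 to 1999.
sales = [5318, 6915, 8039, 8683, 9171, 10837, 11657,
         12077, 12194, 12455, 13594, 14138, 14711]
gdp = [4742.5, 5108.3, 5489.1, 5803.2, 5986.2, 6318.9, 6642.3,
       7054.3, 7400.3, 7813.2, 8300.8, 8759.9, 9256.1]

n = len(sales)                                   # cell A2
mean_y = sum(sales) / n                          # cell B3
mean_x = sum(gdp) / n                            # cell C3
dy = [y - mean_y for y in sales]                 # column D
dx = [x - mean_x for x in gdp]                   # column E
td = sum(d * d for d in dy)                      # cell F2 (TD)
sum_dx2 = sum(d * d for d in dx)                 # cell G2
sum_dydx = sum(p * q for p, q in zip(dy, dx))    # cell H2

b_star = sum_dydx / sum_dx2                      # cell O2
a_star = mean_y - b_star * mean_x                # cell O3
fitted = [a_star + b_star * x for x in gdp]      # column K
resid = [y - f for y, f in zip(sales, fitted)]   # column L
ssr = sum(e * e for e in resid)                  # cell M2 (SSR)
r2 = 1 - ssr / td                                # cell O10

print("b* =", b_star)
print("a* =", a_star)
print("SSR =", ssr)
print("R2 =", r2)
```

To reuse the sketch on another pair of series, only the two lists need to change; every other quantity is derived from them, so no summation range has to be adjusted by hand.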

 


Last modified: Jun/01/01