Simple OLS Regression

Section 2.1 of Heiss (2020)
Consider the following empirical model: $$ y = \beta_0 + \beta_1 x + u \tag{2.1} $$
According to Wooldridge (2006, Section 2.2), the ordinary least squares (OLS) estimators are given by

\begin{align} \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \tag{2.2}\\ \hat{\beta}_1 &= \frac{Cov(x,y)}{Var(x)} \tag{2.3} \end{align}

The fitted (predicted) values, $\hat{y}$, are given by $$ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \tag{2.4} $$ such that $$ y = \hat{y} + \hat{u} $$

Example 2.3: CEO Compensation and Stock Returns (Wooldridge, 2006)

Consider the following simple regression model: $$ \text{salary} = \beta_0 + \beta_1 \text{roe} + u $$ where salary is CEO compensation in thousands of dollars and roe is the return on equity, measured in percentage points.

Estimating Simple Regression “By Hand”

# Load the dataset from the 'wooldridge' package
data(ceosal1, package="wooldridge")

attach(ceosal1) # avoids writing 'ceosal1$' before every variable

cov(salary, roe) # covariance between the dependent and independent variables

## [1] 1342.538

var(roe) # variance of the return on equity

## [1] 72.56499

mean(roe) # mean return on equity

## [1] 17.18421

mean(salary) # mean salary

## [1] 1281.12

# Compute the OLS coefficients "by hand"
( b1_hat = cov(salary, roe) / var(roe) ) # by (2.3)

## [1] 18.50119

( b0_hat = mean(salary) - var(roe)*mean(salary) ) # by (2.2)

## [1] -91683.31

detach(ceosal1) # stop looking for variables inside the 'ceosal1' object

We see that a one-percentage-point increase in return on equity (roe) is associated with an increase of about 18 thousand dollars in CEO compensation.

Estimating Simple Regression with `lm()`

A more convenient way to estimate an OLS model is to use the lm() function.
In a univariate model, we write the dependent and independent variables separated by a tilde (~):

lm(ceosal1$salary ~ ceosal1$roe)

## 
## Call:
## lm(formula = ceosal1$salary ~ ceosal1$roe)
## 
## Coefficients:
## (Intercept)  ceosal1$roe  
##       963.2         18.5

We can also omit the ceosal1$ prefix by supplying data = ceosal1.

lm(salary ~ roe, data=ceosal1)

## 
## Call:
## lm(formula = salary ~ roe, data = ceosal1)
## 
## Coefficients:
## (Intercept)          roe  
##       963.2         18.5

We can use lm() to add the regression line to a scatterplot.

# Scatterplot
plot(ceosal1$roe, ceosal1$salary)

# Add the regression line
abline(lm(salary ~ roe, data=ceosal1), col="red")

Coefficients, Fitted Values, and Residuals

Section 2.2 of Heiss (2020)
We can store the estimation results in an object (a list) and then extract information from it.

# Store the regression results in an object
CEOregres = lm(salary ~ roe, data=ceosal1)

# Check the names of the components stored in the object
names(CEOregres)

##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

We can use coef() to extract the estimated regression coefficients.

( bhat = coef(CEOregres) )

## (Intercept)         roe 
##   963.19134    18.50119

bhat_0 = bhat["(Intercept)"] # or bhat[1]
bhat_1 = bhat["roe"] # or bhat[2]

Given these estimated parameters, we can calculate the fitted values, $\hat{y}$, and the residuals, $\hat{u}$, for each observation $i=1, ..., n$:

\begin{align} \hat{y}_i &= \hat{\beta}_0 + \hat{\beta}_1 . x_i \tag{2.5} \\ \hat{u}_i &= y_i - \hat{y}_i \tag{2.6} \end{align}

# Extract ceosal1 columns as vectors
sal = ceosal1$salary
roe = ceosal1$roe

# Compute fitted values
sal_hat = bhat_0 + (bhat_1 * roe)

# Compute residuals
u_hat = sal - sal_hat

# Display the first 6 rows of sal, roe, sal_hat, and u_hat
head( cbind(sal, roe, sal_hat, u_hat) )

##       sal  roe  sal_hat     u_hat
## [1,] 1095 14.1 1224.058 -129.0581
## [2,] 1001 10.9 1164.854 -163.8543
## [3,] 1122 23.5 1397.969 -275.9692
## [4,]  578  5.9 1072.348 -494.3483
## [5,] 1368 13.8 1218.508  149.4923
## [6,] 1145 20.0 1333.215 -188.2151

With fitted() and resid(), we can extract the fitted values and residuals directly from the regression object:

head( cbind(fitted(CEOregres), resid(CEOregres)) )

##       [,1]      [,2]
## 1 1224.058 -129.0581
## 2 1164.854 -163.8543
## 3 1397.969 -275.9692
## 4 1072.348 -494.3483
## 5 1218.508  149.4923
## 6 1333.215 -188.2151

# Or equivalently
head( cbind(CEOregres$fitted.values, CEOregres$residuals) )

##       [,1]      [,2]
## 1 1224.058 -129.0581
## 2 1164.854 -163.8543
## 3 1397.969 -275.9692
## 4 1072.348 -494.3483
## 5 1218.508  149.4923
## 6 1333.215 -188.2151

In Section 2.3 of Wooldridge (2006), OLS estimation implies the following sample properties: \begin{align} &\sum^n_{i=1}{\hat{u}_i} = 0 \quad \implies \quad \bar{\hat{u}} = 0 \tag{2.7} \\ &\sum^n_{i=1}{x_i \hat{u}_i} = 0 \quad \implies \quad Cov(x,\hat{u}) = 0 \tag{2.8} \\ &\bar{y}=\hat{\beta}_0 + \hat{\beta}_1.\bar{x} \tag{2.9} \end{align}
We can verify them in our example:

# Checking (2.7)
mean(u_hat) # very close to 0

## [1] -2.666235e-14

# Checking (2.8)
cor(ceosal1$roe, u_hat) # very close to 0

## [1] -6.038735e-17

# Checking (2.9)
mean(ceosal1$salary)

## [1] 1281.12

mean(sal_hat)

## [1] 1281.12

IMPORTANT: This only means that OLS chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ so that 2.7, 2.8, and 2.9 hold in the sample.
This does NOT mean that the following population assumptions are necessarily true: \begin{align} &E(u) = 0 \tag{2.7'} \\ &Cov(x, u) = 0 \tag{2.8'} \end{align}
In fact, if 2.7’ and 2.8’ do not hold, OLS estimation will be biased.

Log Transformations

Section 2.4 of Heiss (2020)
We can also estimate models after transforming variables from levels into logarithms.
This is especially useful for turning nonlinear models into linear ones, such as when the parameter appears in the exponent rather than multiplicatively:

$$ y = A K^\alpha L^\beta\quad \overset{\text{log}}{\rightarrow}\quad \log(y) = \log(A) + \alpha \log(K) + \beta \log(L) $$

Log transformations are also commonly used for dependent variables with $y \ge 0$.

There are two main ways to apply a log transformation:
- create a new vector/column containing the logged variable, or
- use the log() function directly inside lm().

Example 2.11: CEO Compensation and Firm Sales (Wooldridge, 2006)

Consider the variables:
- wage: annual salary, in thousands of dollars
- sales: firm sales, in millions of dollars
Level-level model:

# Load the dataset
data(ceosal1, package="wooldridge")

# Estimate the level-level model
lm(salary ~ sales, data=ceosal1)

## 
## Call:
## lm(formula = salary ~ sales, data = ceosal1)
## 
## Coefficients:
## (Intercept)        sales  
##   1.174e+03    1.547e-02

An increase of US$1 million in sales is associated with an increase of US$0.01547 thousand in CEO compensation.
Log-level model:

# Estimate the log-level model
lm(log(salary) ~ sales, data=ceosal1)

## 
## Call:
## lm(formula = log(salary) ~ sales, data = ceosal1)
## 
## Coefficients:
## (Intercept)        sales  
##   6.847e+00    1.498e-05

An increase of US$1 million in sales tends to raise CEO salary by about 0.0015% ($=100 \beta_1%$).
Log-log model:

# Estimate the log-log model
lm(log(salary) ~ log(sales), data=ceosal1)

## 
## Call:
## lm(formula = log(salary) ~ log(sales), data = ceosal1)
## 
## Coefficients:
## (Intercept)   log(sales)  
##      4.8220       0.2567

A 1% increase in sales is associated with an increase of about 0.257% in salary ($=\beta_1%$).

Regression Through the Origin and on a Constant

Section 2.5 of Heiss (2020)
To estimate the model without an intercept, we need to add 0 + to the regressors inside lm():

data(ceosal1, package="wooldridge")
lm(salary ~ 0 + roe, data=ceosal1)

## 
## Call:
## lm(formula = salary ~ 0 + roe, data = ceosal1)
## 
## Coefficients:
##   roe  
## 63.54

If we regress a dependent variable on a constant only (1), we obtain its sample mean.

lm(salary ~ 1, data=ceosal1)

## 
## Call:
## lm(formula = salary ~ 1, data = ceosal1)
## 
## Coefficients:
## (Intercept)  
##        1281

mean(ceosal1$salary, na.rm=TRUE)

## [1] 1281.12

Difference in Means

Based on Example C.6: the effect of job-training grants on worker productivity (Wooldridge, 2006).
We can calculate a difference in means by regressing on a dummy variable that takes values 0 or 1.
First, we create a single vector of scrap rates by stacking SR87 and SR88:

SR87 = c(10, 1, 6, .45, 1.25, 1.3, 1.06, 3, 8.18, 1.67, .98,
         1, .45, 5.03, 8, 9, 18, .28, 7, 3.97)
SR88 = c(3, 1, 5, .5, 1.54, 1.5, .8, 2, .67, 1.17, .51, .5, 
         .61, 6.7, 4, 7, 19, .2, 5, 3.83)

( SR = c(SR87, SR88) ) # stack SR87 and SR88 into a single vector

##  [1] 10.00  1.00  6.00  0.45  1.25  1.30  1.06  3.00  8.18  1.67  0.98  1.00
## [13]  0.45  5.03  8.00  9.00 18.00  0.28  7.00  3.97  3.00  1.00  5.00  0.50
## [25]  1.54  1.50  0.80  2.00  0.67  1.17  0.51  0.50  0.61  6.70  4.00  7.00
## [37] 19.00  0.20  5.00  3.83

Note that the first 20 values refer to scrap rates in 1987 and the last 20 refer to 1988.
Next, we create a dummy variable called group88 that assigns value 1 to observations from 1988 and 0 to observations from 1987:

( group88 = c(rep(0, 20), rep(1, 20)) ) # 0/1 values for the first/last 20 observations

##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [39] 1 1

Regressing the scrap rate on the dummy yields the difference in means:

lm(SR ~ group88)

## 
## Call:
## lm(formula = SR ~ group88)
## 
## Coefficients:
## (Intercept)      group88  
##       4.381       -1.154

Expected Values, Variance, and Standard Errors

Section 2.6 of Heiss (2020)
Wooldridge (2006, Section 2.5) derives the estimator of the error variance: $$ \hat{\sigma}^2 = \frac{1}{n-2} \sum^n_{i=1}{\hat{u}^2_i} = \frac{n-1}{n-2} Var(\hat{u}) \tag{2.14} $$ where $Var(\hat{u}) = \frac{1}{n-1} \sum^n_{i=1}{\hat{u}^2_i}$.
Notice that we need to account for degrees of freedom because we are estimating two parameters ( $\hat{\beta}_0$ and $\hat{\beta}_1$).
Note that $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$ is called the standard error of the regression (SER). In R, this appears as the residual standard error.
We can also obtain the standard errors of the estimators:

\begin{align} se(\hat{\beta}_0) &= \sqrt{\frac{\hat{\sigma}\bar{x}^2}{\sum^n_{i=1}{(x_i - \bar{x})^2}}} = \frac{1}{\sqrt{n-1}} \frac{\hat{\sigma}}{sd(x)} \sqrt{\bar{x^2}} \tag{2.15}\\ se(\hat{\beta}_1) &= \sqrt{\frac{\hat{\sigma}}{\sum^n_{i=1}{(x_i - \bar{x})^2}}} = \frac{1}{\sqrt{n-1}} \frac{\hat{\sigma}}{sd(x)} \tag{2.16} \end{align}

Example 2.12: Student Math Performance and the School Lunch Program (Wooldridge, 2006)

Consider the variables:
- math10: the percentage of tenth graders who pass a standardized math exam
- lnchprg: the percentage of students eligible for the school lunch program
The simple regression model is $$ \text{math10} = \beta_0 + \beta_1 \text{lnchprg} + u $$

data(meap93, package="wooldridge")

# Estimate the model and store it in the object 'results'
results = lm(math10 ~ lnchprg, data=meap93)

# Extract the number of observations
( n = nobs(results) )

## [1] 408

# Compute the standard error of the regression (square root of 2.14)
( SER = sqrt( (n-1)/(n-2) ) * sd(resid(results)) )

## [1] 9.565938

# Standard error of bhat_0 (2.15)
(1 / sqrt(n-1)) * (SER / sd(meap93$lnchprg)) * sqrt( mean(meap93$lnchprg^2) ) # Standard error of bhat_0

## [1] 0.9975824

(1 / sqrt(n-1)) * (SER / sd(meap93$lnchprg)) # bhat_1

## [1] 0.03483933

The standard-error calculations can also be obtained with the summary() function applied to the regression object:

summary(results)

## 
## Call:
## lm(formula = math10 ~ lnchprg, data = meap93)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.386  -5.979  -1.207   4.865  45.845 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 32.14271    0.99758  32.221   <2e-16 ***
## lnchprg     -0.31886    0.03484  -9.152   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.566 on 406 degrees of freedom
## Multiple R-squared:  0.171,	Adjusted R-squared:  0.169 
## F-statistic: 83.77 on 1 and 406 DF,  p-value: < 2.2e-16

Also note that, by default, summary() reports two-sided hypothesis tests whose null hypotheses are $\beta_0 = 0$ and $\beta_1=0$.
In other words, it tests whether the estimated coefficients are statistically equal to zero and reports the corresponding t statistics and p-values.
In this case, the p-values are extremely small (<2e-16, that is, smaller than $2 \times 10^{-16}$), so we reject both null hypotheses and conclude that the estimates are statistically significant.
We can also compute these statistics by hand:

# Extract the estimated coefficients
( estim = coef(summary(results)) )

##               Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 32.1427116 0.99758239 32.220609 6.267302e-114
## lnchprg     -0.3188643 0.03483933 -9.152422  2.746609e-18

# t statistics for H0: bhat = 0
( t_bhat_0 = (estim["(Intercept)", "Estimate"] - 0) / estim["(Intercept)", "Std. Error"] )

## [1] 32.22061

( t_bhat_1 = (estim["lnchprg", "Estimate"] - 0) / estim["lnchprg", "Std. Error"] )

## [1] -9.152422

# p-values for H0: bhat = 0
2 * (1 - pt(abs(t_bhat_0), n-1)) # p-value for bhat_0

## [1] 0

2 * (1 - pt(abs(t_bhat_1), n-1)) # p-value for bhat_1

## [1] 0

Assumption Violations

Subsection 2.7.3 of Heiss (2020), but using the material from Worked Example 1.
Simulating a linear model (John Hopkins/Coursera)
In practice, we run regressions using observed data and do not know the true model that generated those observations.
In R, however, we can posit a true data-generating process and simulate observations to study what happens when an econometric assumption is violated.
We will use the example from Worked Example 1, where we want to study the relationship between hours of cooking practice and the number of kitchen burns.

No Assumption Violation: Example 1

Let $y$ denote the number of kitchen burns and $x$ the number of hours spent learning how to cook.
Suppose the true model is $$ y = a_0 + b_0 x + \varepsilon, \qquad \varepsilon \sim N(0, 2^2) \tag{1}$$ where $a_0=50$ and $b_0=-5$.

Set $a_0$ and $b_0$, then simulate observations of $x$ and $y$.
- For convenience, we generate random draws from $x \sim N(5; 0.5^2)$. Here, the specific distribution of $x$ is not important.

a0 = 50
b0 = -5
N = 200 # number of observations

set.seed(1)
u = rnorm(N, 0, 2) # disturbances: 200 obs. with mean 0 and sd 2
x = rnorm(N, 5, 0.5) # 200 obs. with mean 5 and sd 0.5
y = a0 + b0*x + u # compute y from u and x

plot(x, y)

These simulated $x$ and $y$ values are the information we would observe in practice.

Estimate the parameters $\hat{a}$ and $\hat{b}$ by OLS using the simulated observations of $y$ and $x$.
- Suppose an economist writes the following empirical model: $$ y = a + b x + u, \tag{1a}$$ assuming that $E[u] = 0$ and $E[ux]=0$.
- To estimate the model by OLS, we use lm().

lm(y ~ x) # regress y on x by OLS

## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##      50.463       -5.078

Notice that the population parameters are recovered reasonably well ( $\hat{a} \approx 50 = a_0$ and $\hat{b} \approx -5 = b_0$).

plot(x, y) # plot x against y
abline(a=50, b=-5, col="red") # true model line
abline(lm(y ~ x), col="blue") # estimated line from the observed data

No Assumption Violation: Example 2

Now suppose that, in the true model, the number of burns $y$ is determined both by hours of practice $x$ and by hours actually spent cooking $z$:

$$ y = a_0 + b_0 x + c_0 z + u, \qquad u \sim N(0, 2^2) \tag{2} $$ where $a_0=50$, $b_0=-5$, and $c_0=3$. For convenience, we generate random draws from $x \sim N(5; 0.5^2)$ and $z \sim N(1.875; 0.25^2)$. By construction, $z$ is uncorrelated with $x$ in the true model.

First, let us simulate the observations:

a0 = 50
b0 = -5
c0 = 3
N = 200 # number of observations

set.seed(1)
u = rnorm(N, 0, 2) # disturbances: 200 obs. with mean 0 and sd 2
x = rnorm(N, 5, 0.5) # 200 obs. with mean 5 and sd 0.5
z = rnorm(N, 1.875, 0.25) # 200 obs. with mean 1.875 and sd 0.25
y = a0 + b0*x + c0*z + u # compute y from u, x, and z

Suppose an economist writes the following empirical model: $$ y = a + b x + u, \tag{2a}$$ assuming that $E[u] = 0$ and $E[ux] = 0$.
Notice that the economist omitted the hours-cooking variable $z$, so it is absorbed into the regression error term.
However, because $z$ is unrelated to $x$, this omission does not bias the estimate of $\hat{b}$:

cor(x, z) # correlation between x and z -> close to 0

## [1] -0.02625278

lm(y ~ x) # OLS estimation

## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##       56.27        -5.12

Since $\hat{b} \approx -5 = b_0$, OLS successfully recovers the population parameter $b_0$ even though the economist excluded $z$ from the model.
In many economic applications, the main goal is to estimate the relationship or causal effect between $y$ and $x$. We therefore do not need to include every variable that affects $y$, as long as $E(ux) = 0$ holds; that is, as long as no omitted explanatory variable correlated with $x$ is hiding inside the error term.

Violation of E(ux)=0

Now suppose that, in the true model, the more someone practices cooking, the more time they actually spend cooking. In other words, $x$ is related to $z$.
- Assume that $z \sim N(1.875x; (0.25)^2)$:

set.seed(1)
e = rnorm(N, 0, 2) # disturbances: 200 obs. with mean 0 and sd 2
x = rnorm(N, 5, 0.5) # 200 obs. with mean 5 and sd 0.5
z = rnorm(N, 1.875*x, 0.25) # 200 obs. with mean 1.875x and sd 0.25
y = a0 + b0*x + c0*z + e # compute y from e, x, and z
cor(x, z) # correlation between x and z

## [1] 0.9618748

Now $x$ and $z$ are strongly correlated.
Estimate the empirical model $$ y = a + b x + u,$$ assuming that $E[u] = 0$ and $E[ux]=0$.

lm(y ~ x) # OLS estimation

## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##     50.6406       0.5053

Here, $\hat{b} = 0.5 \neq -5 = b_0$. This happens because $z$ was omitted from the model and is therefore absorbed into the residual $\hat{u}$. Since $z$ is correlated with $x$, we have $E(ux)\neq 0$, which violates the key OLS exogeneity assumption.
If we include $z$ in the regression, we recover an estimate close to $b_0$:

lm(y ~ x + z)

## 
## Call:
## lm(formula = y ~ x + z)
## 
## Coefficients:
## (Intercept)            x            z  
##      50.435       -5.953        3.470

Violation of E(u)=0

Now assume that $E[u] = k$, where $k \neq 0$ is a constant.
Let $k = 10$:

a0 = 50
b0 = -5
k = 10

set.seed(1)
u = rnorm(N, k, 2) # disturbances: 200 obs. with mean k and sd 2
x = rnorm(N, 5, 0.5) # 200 obs. with mean 5 and sd 0.5
y = a0 + b0*x + u # compute y from u and x

If an economist estimates an empirical model that imposes $E[u] = 0$, then:

lm(y ~ x) # OLS estimation

## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##      60.463       -5.078

Notice that $E[u] \neq 0$ affects the intercept estimate, so $\hat{a} \neq a_0$, but it does not affect $\hat{b} \approx b_0$, which is usually the parameter of primary interest in economic applications.

Violation of Homoskedasticity

Now suppose that $u \sim N(0, (2x)^2)$, so the variance increases with $x$. In other words, the error variance depends on $x$, and homoskedasticity fails.

a0 = 50
b0 = -5

set.seed(1)
x = rnorm(N, 5, 0.5) # 200 obs. with mean 5 and sd 0.5
u = rnorm(N, 0, 2*x) # disturbances with sd equal to 2x
y = a0 + b0*x + u # compute y from u and x

lm(y ~ x) # OLS estimation

## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##      51.221       -5.166

Even with heteroskedasticity, we can still recover $\hat{b} \approx b_0$ in this simulation. But with a smaller sample, it becomes more likely that $\hat{b} \neq b_0$. Try repeating the exercise with smaller values of $N$.

Goodness of Fit

Section 2.3 of Heiss (2020)
The total sum of squares (TSS), explained sum of squares (ESS), and residual sum of squares (RSS) can be written as

\begin{align} SQT &= \sum^n_{i=1}{(y_i - \bar{y})^2} = (n-1) . Var(y) \tag{2.10}\\ SQE &= \sum^n_{i=1}{(\hat{y}_i - \bar{y})^2} = (n-1) . Var(\hat{y}) \tag{2.11}\\ SQR &= \sum^n_{i=1}{(\hat{u}_i - 0)^2} = (n-1) . Var(\hat{u}) \tag{2.12} \end{align} where $Var(x) = \frac{1}{n-1} \sum^n_{i=1}{(x_i - \bar{x})^2}$.

Wooldridge (2006) defines the coefficient of determination as \begin{align} R^2 &= \frac{SQE}{SQT} = 1 - \frac{SQR}{SQT}\\ &= \frac{Var(\hat{y})}{Var(y)} = 1 - \frac{Var(\hat{u})}{Var(y)} \tag{2.13} \end{align} because $SQT = SQE + SQR$.

# Compute R^2 in three equivalent ways:
var(sal_hat) / var(sal)

## [1] 0.01318862

1 - var(u_hat)/var(sal)

## [1] 0.01318862

cor(sal, sal_hat)^2 # squared correlation between the dependent variable and fitted values

## [1] 0.01318862

A more convenient way to obtain $R^2$ is to use summary() on the regression object. This produces a more detailed regression table, including $R^2$:

summary(CEOregres)

## 
## Call:
## lm(formula = salary ~ roe, data = ceosal1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1160.2  -526.0  -254.0   138.8 13499.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   963.19     213.24   4.517 1.05e-05 ***
## roe            18.50      11.12   1.663   0.0978 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1367 on 207 degrees of freedom
## Multiple R-squared:  0.01319,	Adjusted R-squared:  0.008421 
## F-statistic: 2.767 on 1 and 207 DF,  p-value: 0.09777

👉 Proceed to Multiple Regression

Simple OLS Regression