
RECALL: In the last class, we learned statistical inference for the population mean.

Notation

Notation       Meaning
$\mu$          The population mean
$\bar{X}$      The sample mean
$\sigma$       The population standard deviation
$s$            The sample standard deviation
$n$            The sample size

RECALL:

Point estimation (sample mean $\bar{X}$)
Distribution of $\bar{X}$
Confidence interval:

One-sample z-interval (population SD is known)

One-sample t-interval (only sample SD is known):

$$\bar{X} \pm t^{*}_{n-1} \frac{s}{\sqrt{n}}$$

Remarks:
1. The t-interval needs the normality assumption.
2. $t^{*}_{n-1}$, which depends on n-1 and C%, can be obtained from the t-table.
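As a rough sketch of the one-sample t-interval recalled above, the snippet below computes the interval from summary statistics. The function name `t_interval` and the numbers are illustrative assumptions, not part of the lecture; the critical value `t_star` is passed in by hand because it is read from the t-table.

```python
import math

def t_interval(xbar, s, n, t_star):
    """One-sample t-interval: xbar +/- t_star * s / sqrt(n)."""
    margin = t_star * s / math.sqrt(n)
    return xbar - margin, xbar + margin

# Illustrative summary statistics (an assumption for this sketch).
# t_star = 2.447 is the t-table value for n - 1 = 6 degrees of freedom
# at the 95% confidence level.
low, high = t_interval(xbar=590, s=89.81, n=7, t_star=2.447)
print(f"95% t-interval: ({low:.1f}, {high:.1f})")
```

The interval is only trustworthy when the normality assumption in Remark 1 is reasonable.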

RECALL:

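Hypothesis Testing about $\mu$

Null hypothesis: $H_0: \mu = \mu_0$

Alternative hypothesis $H_A$:
$H_A: \mu \neq \mu_0$ (two-sided)
$H_A: \mu > \mu_0$ (one-sided)
$H_A: \mu < \mu_0$ (one-sided)

Z-Test (population SD is known)

Test statistic: $z = \dfrac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$

P-value:

Alternative hypothesis $H_A$            P-value formula
$H_A: \mu \neq \mu_0$ (two-sided)       P-value = 2P(Z > |z|)
$H_A: \mu > \mu_0$ (one-sided)          P-value = P(Z > z)
$H_A: \mu < \mu_0$ (one-sided)          P-value = P(Z < z)

Conclusion: if P-value < alpha level, reject H0, and we say the test is statistically significant at this alpha level; if P-value > alpha level, fail to reject H0, and we say the test is not statistically significant at this alpha level.

Errors:
Type I error: decide to reject H0, but actually H0 is true.
Type II error: decide to retain H0, but actually H0 is false.
P(Type I error) = alpha level.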
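The z-test recalled above can be sketched in a few lines. This is a minimal illustration, assuming the usual test statistic z = (sample mean - mu0) / (sigma / sqrt(n)); the function name `z_test`, its `alternative` labels and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n, alternative="two-sided"):
    """Return (z, p_value) for a one-sample z-test with known population SD."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()            # Z ~ N(0, 1)
    if alternative == "two-sided":       # HA: mu != mu0, P-value = 2P(Z > |z|)
        p = 2 * (1 - std_normal.cdf(abs(z)))
    elif alternative == "greater":       # HA: mu > mu0, P-value = P(Z > z)
        p = 1 - std_normal.cdf(z)
    elif alternative == "less":          # HA: mu < mu0, P-value = P(Z < z)
        p = std_normal.cdf(z)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z, p

# Hypothetical numbers, for illustration only.
z, p = z_test(xbar=52.0, mu0=50.0, sigma=10.0, n=25, alternative="two-sided")
alpha = 0.05
print(f"z = {z:.2f}, P-value = {p:.4f}")
print("reject H0" if p < alpha else "fail to reject H0")
```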

Exploring Relationship Between Variables

Chapter 7: Scatterplots, Association, and Correlation Chapter 8: Linear Regression

WHERE ARE WE GOING?

People might ask the following questions in real life:

1. Is the price of sneakers related to how long they last?
2. Is smoking related to lung cancer?
3. Do baseball teams that score more runs sell more tickets to their games?

Chapter 7 looks at relationships between two quantitative variables X and Y, using two tools: scatterplots and correlation.

TERM 1: SCATTERPLOTS

Is the price of sneakers related to how long they last?

The following table shows some data collected for sneakers:

Years    Price ($)
1        20.00
2        21.99
3        23.29
4        25.99
5        29.99
6        34.99
7        39.99
8        44.99
9        49.99
10       59.99

[Figure: scatterplot of Price (y-axis, 0 to 70) against Years (x-axis, 0 to 12)]

This is an example of a scatterplot. The x-axis represents the variable Years and the y-axis represents Price.
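To make the idea concrete, here is a small sketch that draws the sneakers data as a text scatterplot, placing one mark per (years, price) pair. The helper `ascii_scatter` is a hypothetical name written for this illustration; in practice a plotting package would normally be used instead.

```python
def ascii_scatter(xs, ys, width=40, height=12):
    """Return a list of text rows with '*' at each (x, y) pair."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"   # row 0 is the top of the plot
    return ["".join(r) for r in grid]

years = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
price = [20.00, 21.99, 23.29, 25.99, 29.99, 34.99, 39.99, 44.99, 49.99, 59.99]

rows = ascii_scatter(years, price)
print("\n".join(rows))
```

The marks run from the lower left to the upper right, which is what a positive direction looks like.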

TERM 1: SCATTERPLOT

Scatterplots may be the most common and most effective display for paired data. Scatterplots are the best way to start observing the relationship and the ideal way to picture associations between two quantitative variables.

X-axis: Years, the explanatory variable, which explains or influences changes in the other variable.
Y-axis: Price, the response variable, which measures an outcome of a study.

[Figure: scatterplot of Price against Years for the sneakers data]

TERM 1: SCATTERPLOTS

How do we describe the scatterplot? In other words, what information about the relationship between the two variables can we get by looking at the scatterplot? Please look at the scatterplot of the sneakers example, and think about what you can tell about the relationship between years and price.

We are going to describe the relationship from four different aspects:
1) Direction
2) Form
3) Strength
4) Unusual features

TERM 1: SCATTERPLOT

Look for direction: What’s my sign (positive, negative or neither)?

Negative: A pattern that runs from the upper left to the lower right is said to be negative. The Y variable decreases as the X variable increases.
Positive: A pattern running the other way is called positive. The Y variable increases as the X variable increases.

[Figure: two scatterplots of Y against X, one showing a negative direction and one showing a positive direction]

TERM 1: SCATTERPLOT

The example in the text shows a negative association between central pressure and maximum wind speed: as the central pressure increases, the maximum wind speed decreases.

TERM 1: SCATTERPLOTS

Look for form: straight, curved, something exotic, or no pattern?

[Figure: three scatterplots of Y against X, labeled "Straight line, linear", "Curved", and "No pattern"]

In this part, we are more interested in the linear pattern.

TERM 1: SCATTERPLOTS

Look for strength: how much scatter? In other words, how strong is the relationship?

Strong: the points appear tightly clustered in a single stream.
Weak: the swarm of points seems to form a vague cloud through which we can barely discern any trend or pattern.

[Figure: four scatterplots of Y against X, ranging from strong to weak relationships]

TERM 1: SCATTERPLOTS

Look for unusual features: Are there outliers or subgroups?

[Figure: scatterplot of Y against X with one point circled. The circled point is a potential outlier.]
[Figure: scatterplot of Y against X with two separate groups of points. There are two clusters.]

TERM 1: SCATTERPLOT-ROLES FOR VARIABLES

It is important to determine which of the two quantitative variables goes on the x-axis and which on the y-axis. This determination is made based on the roles played by the variables. When the roles are clear, the explanatory or predictor variable goes on the x-axis, and the response variable goes on the y-axis.


TERM 1: SCATTERPLOTS

Summary

A scatterplot shows the relationship between two quantitative variables measured on the same individuals.
The variable designated as the X variable is called the explanatory variable.
The variable designated as the Y variable is called the response variable.
Always plot the explanatory variable on the horizontal (x) axis.
Always plot the response variable on the vertical (y) axis.
In examining scatterplots, look for an overall pattern showing the form, direction and strength of the relationship.
Look also for outliers or other deviations from this pattern.

TERM 1: SCATTERPLOT

Example: Fast food is often considered unhealthy because much of it is high in fat. Are fat and calories related? Here are the fat and calorie contents of several brands of burgers. Analyze the association between fat content and calories.

Fat (g)     20    30    35    36    40    40    44
Calories    410   580   590   570   640   680   660

[Figure: scatterplot of Calories (y-axis, 400 to 700) against Fat (x-axis, 18 to 48)]

Comment on the scatterplot:
1) Direction: Positive
2) Form: Roughly linear
3) Strength: Moderately strong
4) Unusual features: None

TERM 2: CORRELATION

From scatterplots, we can look for the relationship between two quantitative variables and judge whether the relationship is strong or weak. But how strong is it?

The correlation coefficient (or simply correlation) is a quantitative measure of the linear relationship (association) between two quantitative variables.

Finding the correlation coefficient, denoted by r, by hand:

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}$$

where $s_x$ and $s_y$ are the standard deviations of X and Y, respectively.

Remarks: Before you use correlation, you must check several conditions:
Quantitative Variables Condition
Straight Enough Condition
Outlier Condition

TERM 2: CORRELATION

(Revisit the calories example) Here are the fat and calorie contents of several brands of burgers.

X: Fat (g)     20    30    35    36    40    40    44
Y: Calories    410   580   590   570   640   680   660

What is the correlation coefficient of x (fat) and y (calories)?

Solution: The means are $\bar{x} = 35$ and $\bar{y} = 590$.

Deviations in x    Deviations in y    Product
20-35=-15          410-590=-180       (-15)*(-180)=2700
30-35=-5           580-590=-10        (-5)*(-10)=50
35-35=0            590-590=0          0*0=0
36-35=1            570-590=-20        1*(-20)=-20
40-35=5            640-590=50         5*50=250
40-35=5            680-590=90         5*90=450
44-35=9            660-590=70         9*70=630

Add up the products: 2700+50+0+(-20)+250+450+630=4060

With $s_x = 7.98$ and $s_y = 89.81$, the correlation is r=4060/{(7-1)*7.98*89.81}=0.9442
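The hand calculation above can be checked with a short script. This is a sketch that follows the same formula step by step (deviations, products, then division by (n - 1) times the two standard deviations); the helper names `mean`, `sample_sd` and `correlation` are chosen here for illustration.

```python
import math

fat = [20, 30, 35, 36, 40, 40, 44]
calories = [410, 580, 590, 570, 640, 680, 660]

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """r = sum((x - xbar)(y - ybar)) / ((n - 1) * s_x * s_y)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sum_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sum_products / ((len(xs) - 1) * sample_sd(xs) * sample_sd(ys))

# Print the table of deviations and products, one row per burger.
x_bar, y_bar = mean(fat), mean(calories)
for x, y in zip(fat, calories):
    print(x - x_bar, y - y_bar, (x - x_bar) * (y - y_bar))

r = correlation(fat, calories)
print(f"s_x = {sample_sd(fat):.2f}, s_y = {sample_sd(calories):.2f}, r = {r:.4f}")
```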

TERM 2: CORRELATION

CORRELATION PROPERTIES

The sign of a correlation coefficient gives the direction of the linear association.
Positive sign: positive linear association
Negative sign: negative linear association

Correlation is always between -1 and +1.

Example: The correlation between fat and calories, r=0.9442, indicates a strong positive linear association between them.

Correlation can be exactly equal to -1 or +1, but these values are unusual in real data because they mean that all the data points fall exactly on a single straight line.

A correlation near zero corresponds to a weak linear association.
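A quick way to see the extreme values is to compute r for points that lie exactly on a line. The data below are made up for this sketch, and `correlation` is a helper defined here; it uses the equivalent form r = Sxy / sqrt(Sxx * Syy), which equals the (n - 1) s_x s_y version because the (n - 1) factors cancel.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4]
r_up = correlation(xs, [2 * x + 1 for x in xs])      # points on a rising line
r_down = correlation(xs, [10 - 3 * x for x in xs])   # points on a falling line
print(r_up, r_down)
```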

TERM 2: CORRELATION

Cautions about correlation:

Quantitative Variables Condition: Correlation applies only to quantitative variables.

Straight Enough Condition: Correlation measures the strength only of the linear association.

[Figure: two scatterplots of y against x, labeled r=0.92 and r=0.098]

Outlier Condition: Outliers can distort the correlation dramatically.

[Figure: scatterplot of y against x containing an outlier. With the outlier: r=0.795. Without the outlier: r=0.938]
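The two cautions about straightness and outliers can be illustrated with tiny made-up data sets (these are not the data behind the slide figures). In the first, y is completely determined by x through a curve, yet r is 0; in the second, a single outlier pulls r from 1 down to about 0.14.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Straight Enough Condition: a perfect curved relationship, y = x^2.
curve_x = [-2, -1, 0, 1, 2]
curve_y = [x ** 2 for x in curve_x]
r_curve = correlation(curve_x, curve_y)

# Outlier Condition: five points on a line, then one added outlier at (6, 0).
line_x = [1, 2, 3, 4, 5]
line_y = [1, 2, 3, 4, 5]
r_without = correlation(line_x, line_y)
r_with = correlation(line_x + [6], line_y + [0])

print(f"curve: r = {r_curve:.3f}")
print(f"without outlier: r = {r_without:.3f}, with outlier: r = {r_with:.3f}")
```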

TERM 2: CORRELATION

Correlation≠Causation

Fast food is often considered unhealthy because much of it is high in fat. Are fat and calories related? Based on the fat and calorie contents of several brands of burgers, the correlation between them is r=0.9442. Which conclusion is most accurate?

A. More fat in the burgers causes higher calories.
B. The burgers containing more fat tend to have higher calories.

Comment: Even though A sounds all right, it is not a conclusion that can be derived from the correlation. Correlation objectively describes the linear association between two variables; it cannot establish causation. B is the most accurate conclusion.

CORRELATION PROPERTIES (CONT.)

Correlation treats x and y symmetrically: the correlation of x with y is the same as the correlation of y with x.

Correlation has no units.

Correlation is not affected by shifting and rescaling of either variable. Correlation depends only on the z-scores, and they are unaffected by changes in center or scale, i.e. corr(aX+b,cY+d)=corr(X,Y), where a, b, c, d are constants with a>0 and c>0. (If exactly one of a and c is negative, the correlation keeps its magnitude but changes sign.)
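These properties can be checked numerically on the burger data. The sketch below swaps the variables, changes units (grams to ounces, calories to kilojoules, using approximate conversion factors chosen for illustration) and shifts one variable. It also shows that multiplying one variable by a negative constant flips the sign of r, so the invariance holds as stated only for positive scale factors.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

fat_g = [20, 30, 35, 36, 40, 40, 44]
calories = [410, 580, 590, 570, 640, 680, 660]

fat_oz = [g / 28.35 for g in fat_g]          # rescale x: grams -> ounces (approx.)
kilojoules = [4.184 * c for c in calories]   # rescale y: calories -> kJ (approx.)
shifted = [c + 100 for c in calories]        # shift y by a constant

r_original = correlation(fat_g, calories)
r_swapped = correlation(calories, fat_g)                   # symmetry
r_rescaled = correlation(fat_oz, kilojoules)               # change of units
r_shifted = correlation(fat_g, shifted)                    # change of center
r_flipped = correlation(fat_g, [-c for c in calories])     # negative scale factor

print(r_original, r_swapped, r_rescaled, r_shifted, r_flipped)
```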


TERM 2: CORRELATION

Example: Here are several scatterplots. The calculated correlations are -0.923, -0.487, 0.006 and 0.777. Which is which?

[Figure: four scatterplots of Y against X, labeled (a), (b), (c) and (d)]

QUESTION: CAN WE DO MORE?

Scatterplots and correlation are useful tools for learning about the (linear) association between two quantitative variables. Can we answer the following question?

Fast food is often considered unhealthy because much of it is high in fat. What is the calorie content of a kind of fast food with 28g fat?

If we want to estimate an unknown value based on the known values, this is called a prediction.

One way to do the prediction is by constructing a linear model.

[Figure: scatterplot of Calories (y-axis, 400 to 700) against Fat (x-axis, 18 to 48)]

TERM 3: LINEAR MODEL

Let’s look at the burger example again.

Fat (g)     20    30    35    36    40    40    44
Calories    410   580   590   570   640   680   660

[Figure: "BURGERS" scatterplot of CALORIES against FAT with a red line drawn through the points]

The red line does not go through all the points, but it can summarize the general pattern with only a couple of parameters: Calories = a+b*fat.

This model can be used to predict the calories based on the fat content.
Explanatory variable: Fat
Response variable: Calories

TERM 3: LINEAR MODEL

Predicted value: we call the estimate made from a model the predicted value, denoted as $\hat{y}$.

Residual: The difference between the observed value and its associated predicted value is called the residual: residual = $y - \hat{y}$.

The line of best fit is the line for which the sum of the squared residuals is smallest. It is called the least squares line.

[Figure: "BURGERS" scatterplot of CALORIES against FAT with the fitted line, marking the prediction and the residual for one point]
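The least squares idea can be sketched on the burger data: fit the line, then compare its sum of squared residuals with that of two other lines. The closed form used for the slope, b = Sxy / Sxx, is the standard least squares solution (equivalent to r * s_y / s_x); the function names and the two comparison lines are arbitrary choices made for this illustration.

```python
fat = [20, 30, 35, 36, 40, 40, 44]
calories = [410, 580, 590, 570, 640, 680, 660]

def least_squares(xs, ys):
    """Return (intercept a, slope b) of the least squares line y_hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def sum_squared_residuals(a, b, xs, ys):
    """Sum of (observed - predicted)^2 for the line y_hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

a, b = least_squares(fat, calories)
best = sum_squared_residuals(a, b, fat, calories)

# Two arbitrary alternative lines, used only for comparison.
other_1 = sum_squared_residuals(a + 10, b, fat, calories)
other_2 = sum_squared_residuals(a, b + 0.5, fat, calories)

print(f"least squares line: {best:.1f}")
print(f"shifted line: {other_1:.1f}, steeper line: {other_2:.1f}")
```

Any other choice of a and b gives a sum of squared residuals at least as large as the least squares line's.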

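TERM 3: LINEAR MODEL

X: Fat (g)     20    30    35    36    40    40    44
Y: Calories    410   580   590   570   640   680   660

Q1: Please construct a linear regression model to predict the calories based on fat.

Fat: $\bar{x} = 35$, $s_x = 7.98$
Calories: $\bar{y} = 590$, $s_y = 89.81$
Correlation: r=0.9442
Slope: $b = r \dfrac{s_y}{s_x} = 0.9442 \times \dfrac{89.81}{7.98} = 10.63$
Intercept: $a = \bar{y} - b\bar{x} = 590 - 10.63 \times 35 = 217.95$
Linear model: $\hat{y} = 217.95 + 10.63x$

Q2: What is the predicted calorie content when the fat is 30g?
When x=30, $\hat{y} = 217.95 + 10.63 \times 30 = 536.85 \approx 536.9$.

Q3: What is the residual for the burger with 30g fat?
When x=30, the residual is $y - \hat{y} = 580 - 536.9 = 43.1$.

[Figure: "BURGERS" scatterplot of CALORIES against FAT with the fitted line]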
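Q1 to Q3 can be reproduced with a short script, assuming the usual formulas slope = r * s_y / s_x and intercept = mean(y) - slope * mean(x). Working without intermediate rounding gives a slope of about 10.63 and an intercept of about 218.01; the value 217.95 that appears in the model equation later in these notes comes from rounding the slope to 10.63 before computing the intercept. The variable names are choices made for this sketch.

```python
import math

fat = [20, 30, 35, 36, 40, 40, 44]
calories = [410, 580, 590, 570, 640, 680, 660]

n = len(fat)
x_bar = sum(fat) / n
y_bar = sum(calories) / n
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in fat) / (n - 1))
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in calories) / (n - 1))
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(fat, calories)) / ((n - 1) * s_x * s_y)

# Q1: slope and intercept of the least squares line
slope = r * s_y / s_x
intercept = y_bar - slope * x_bar

def predict(x):
    """Predicted calories for a burger with x grams of fat."""
    return intercept + slope * x

# Q2: predicted calories at 30 g of fat
predicted_30 = predict(30)

# Q3: residual = observed - predicted, for the burger with 30 g fat (580 calories)
residual_30 = 580 - predicted_30

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"predicted = {predicted_30:.1f}, residual = {residual_30:.1f}")
```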

TERM 3: LINEAR MODEL

Remarks: Since regression and correlation are closely related, we need to check the same conditions for regression as we did for correlation:

Quantitative Variables Condition
Straight Enough Condition
Outlier Condition

TERM 3: LINEAR MODEL (PARAMETERS)

We write a and b for the intercept and slope of the line. They are called the coefficients of the linear model.

The coefficient b is the slope, which tells us how rapidly the predicted value ($\hat{y}$) changes with respect to x. As the value of x increases by 1 unit, the predicted value of y changes by b units (it increases if b is positive and decreases if b is negative).

The coefficient a is the intercept, which tells where the line hits (intercepts) the y-axis. In other words, the intercept a is the predicted value of y when x=0.

Intercept and Slope (examples)

Fast food is often considered unhealthy because much of it is high in fat. Are fat and calories related? Here are the fat and calorie contents of several brands of burgers. To analyze the association between fat content and calories, the equation of the regression model is:

Predicted calories=217.95+10.63*fat

For this linear equation, slope=10.63 and intercept=217.95.

Q1: What does the slope 10.63 mean?
A1: An increase in fat of 1 gram is associated with an increase of 10.63 in predicted calories.

Q2: If the fat increases by 2 grams, how many more calories are expected to be contained in the burger?
A2: 2*10.63=21.26

Q3: What does the intercept 217.95 mean here?
A3: Theoretically, it means that a burger containing no fat at all is predicted to have 217.95 calories. Because no burger in the data has a fat content near 0g, this is an extrapolation and should be interpreted with caution.

TERM 4: RESIDUAL PLOT

After you construct the linear model, you have to check whether the linear model makes sense. A residual plot can be used to check the appropriateness of the linear model.

A residual plot is the scatterplot of the residuals versus the x-values.

If a linear model is appropriate, then the residual plot shouldn’t have any interesting features, like a direction or shape. It should stretch horizontally, with about the same amount of scatter throughout. It should show no bends, and it should have no outliers.

[Figure: example residual plot of Residuals (y-axis, -2 to 2) against X (x-axis, -10 to 10)]

TERM 4: RESIDUAL SCATTERPLOT

Now, let’s try to diagnose the model for the calorie and fat example.

Fat (g): x            20      30      35      36      40      40      44
Calories: y           410     580     590     570     640     680     660
Predicted calories    430.6   536.9   590     600.6   643.2   643.2   685.7
Residual              -20.6   43.1    0       -30.6   -3.2    36.8    -25.7

[Figure: residual plot of the residuals (y-axis, -30 to 40) against fat x (x-axis, 20 to 40)]
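The predicted values and residuals for the burger data can be recomputed as follows. This sketch fits the least squares line without intermediate rounding, so its values may differ from the rounded slide values in the first decimal place. It also checks a general property of least squares residuals: they sum to zero (up to floating-point error).

```python
fat = [20, 30, 35, 36, 40, 40, 44]
calories = [410, 580, 590, 570, 640, 680, 660]

n = len(fat)
x_bar = sum(fat) / n
y_bar = sum(calories) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(fat, calories))
         / sum((x - x_bar) ** 2 for x in fat))
intercept = y_bar - slope * x_bar

predicted = [intercept + slope * x for x in fat]
residuals = [y - y_hat for y, y_hat in zip(calories, predicted)]   # observed - predicted

# One row per burger: fat, calories, predicted calories, residual.
for x, y, y_hat, e in zip(fat, calories, predicted, residuals):
    print(f"{x:>3} {y:>4} {y_hat:7.1f} {e:7.1f}")

print(f"sum of residuals = {sum(residuals):.6f}")
```

Plotting `residuals` against `fat` gives the residual plot. With only seven points, any pattern should be read cautiously.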

TERM 4: RESIDUAL PLOT

Example: Tell what each of the residual plots below indicates about the appropriateness of the linear model that was fit to the data.

[Figure: three residual plots labeled (a), (b) and (c), with axes y1 against x1, y2 against x2, and y3 against x3]

TI for correlation and regression equation

The first time you do this:
1. Press 2nd, CATALOG (above 0).
2. Scroll down to DiagnosticOn.
3. Press ENTER, ENTER.
4. Read “Done”.
Your calculator will remember this setting even when turned off.

Enter predictor (x) values in L1.
Enter response (y) values in L2.
Pairs must line up: there must be the same number of predictor and response values.

Press STAT, > (to CALC).

Scroll down to 8:LinReg(a+bx), press ENTER, ENTER.

Read intercept a, slope b and correlation r on the screen.

IMPORTANT NOTES:
Take-home quiz is due on Monday. No late submission will be accepted.
Keep the ID assignment and bring it to class on Monday.
Sample exam will be handed out on Monday. We will discuss the questions on Wednesday.
Suggested Problem Set 4 will be collected next Thursday.
Final exam will be next Thursday: 2 hours, in class.
Please prepare a one-page, A4-size cheat sheet (one-sided) on your own. A formula sheet will not be provided in the final exam. The cheat sheet will be collected together with the final exam.
