In last class, we learned statistical inference for population mean. Meaning. The population mean. The sample mean
February 7, 2017 | Author: Posy Franklin | Category: N/A
RECALL: In the last class, we learned statistical inference for the population mean.
Notation for the problem:
Notation     Meaning
$\mu$        The population mean
$\bar{X}$    The sample mean
$\sigma$     The population standard deviation
s            The sample standard deviation
n            The sample size
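RECALL:
Point estimation: the sample mean $\bar{X}$ estimates the population mean $\mu$.
Distribution of $\bar{X}$
Confidence interval:
One-sample z-interval (population SD is known): $\bar{X} \pm z^* \frac{\sigma}{\sqrt{n}}$
One-sample t-interval (only sample SD is known): $\bar{X} \pm t^*_{n-1} \frac{s}{\sqrt{n}}$
Remarks:
1. The t-interval needs the normal assumption.
2. $t^*_{n-1}$, which depends on n-1 and the confidence level C%, can be obtained from the t-table.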
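As a rough illustration of the one-sample z-interval (sample mean plus or minus a critical value times sigma over the square root of n), here is a minimal Python sketch. The sample values and the population SD of 0.25 are hypothetical and are not taken from the lecture; the function name `z_interval` is also just an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_interval(sample, sigma, confidence=0.95):
    """Return (lower, upper) for a one-sample z-interval with known population SD."""
    n = len(sample)
    x_bar = mean(sample)
    # The critical value z* leaves (1 - confidence) / 2 in each tail of N(0, 1).
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

# Hypothetical measurements, assumed to come from a population with SD 0.25.
sample = [9.8, 10.4, 10.1, 9.9, 10.3, 10.0, 9.7, 10.2]
low, high = z_interval(sample, sigma=0.25)
print(f"95% z-interval: ({low:.3f}, {high:.3f})")
```

A t-interval would follow the same pattern, with the sample SD in place of sigma and a critical value read from the t-table.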
RECALL:
Hypothesis Testing about $\mu$
Null hypothesis: $H_0: \mu = \mu_0$
vs.
Alternative hypothesis, one of:
$H_A: \mu \neq \mu_0$ (two-sided)
$H_A: \mu > \mu_0$ (one-sided)
$H_A: \mu < \mu_0$ (one-sided)
Z-Test (population SD is known)
Test statistic: $z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$
P-value:
Alternative hypothesis                 P-value formula
$H_A: \mu \neq \mu_0$ (two-sided)      P-value=2P(Z>|z|)
$H_A: \mu > \mu_0$ (one-sided)         P-value=P(Z>z)
$H_A: \mu < \mu_0$ (one-sided)         P-value=P(Z<z)
Decision: if P-value < alpha level, reject H0, and we say the test is statistically significant at this alpha level; if P-value > alpha level, fail to reject H0, and we say the test is not statistically significant at this alpha level.
Errors:
Type I error: decide to reject H0, but actually H0 is true.
Type II error: decide to retain H0, but actually H0 is false.
P(Type I error)=alpha level.
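The three P-value formulas for the z-test can be sketched in Python as follows. This is only an illustration under assumed inputs: the numbers passed to `z_test` (sample mean 10.05, hypothesized mean 10.0, population SD 0.25, n = 25) are hypothetical, and the function name and the `alternative` labels are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist

def z_test(x_bar, mu0, sigma, n, alternative="two-sided"):
    """Return (z, p_value) for a one-sample z-test with known population SD."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()  # standard normal distribution N(0, 1)
    if alternative == "two-sided":   # HA: mu != mu0
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    elif alternative == "greater":   # HA: mu > mu0
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":      # HA: mu < mu0
        p_value = std_normal.cdf(z)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z, p_value

# Hypothetical inputs, not from the lecture.
z, p_two_sided = z_test(10.05, 10.0, 0.25, 25)
_, p_greater = z_test(10.05, 10.0, 0.25, 25, alternative="greater")
_, p_less = z_test(10.05, 10.0, 0.25, 25, alternative="less")

alpha = 0.05
decision = "reject H0" if p_two_sided < alpha else "fail to reject H0"
print(f"z = {z:.2f}, two-sided P-value = {p_two_sided:.4f}, decision: {decision}")
```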
Exploring Relationship Between Variables
Chapter 7: Scatterplots, Association, and Correlation Chapter 8: Linear Regression
WHERE ARE WE GOING?
People might ask questions like these in real life:
1. Is the price of sneakers related to how long they last?
2. Is smoking related to lung cancer?
3. Do baseball teams that score more runs sell more tickets to their games?
Chapter 7 looks at relationships between two quantitative variables X and Y, using two tools:
Scatterplot
Correlation
TERM 1: SCATTERPLOTS
Is the price of sneakers related to how long they last?
The following table shows some data collected for sneakers:
Years    Price ($)
1        20.00
2        21.99
3        23.29
4        25.99
5        29.99
6        34.99
7        39.99
8        44.99
9        49.99
10       59.99
[Figure: scatterplot of Price against Years]
This is an example of a scatterplot. The x-axis represents the variable Years and the y-axis represents Price.
TERM 1: SCATTERPLOT
Scatterplots may be the most common and most effective display for paired data. They are the best way to start observing the relationship, and the ideal way to picture associations, between two quantitative variables.
[Figure: scatterplot of Price against Years]
X-axis: Years, the explanatory variable, which explains or influences changes in the other variable.
Y-axis: Price, the response variable, which measures an outcome of a study.
TERM 1: SCATTERPLOTS
How do we describe the scatterplot? Put another way, what information about the relationship between the two variables can we get by looking at it? Look at the scatterplot of the sneakers example and think about what you can tell about the relationship between years and price.
[Figure: scatterplot of Price against Years]
We are going to describe the relationship from four different aspects:
1) Direction
2) Form
3) Strength
4) Unusual features
TERM 1: SCATTERPLOT
Look for direction: What's my sign: positive, negative, or neither?
Negative: A pattern that runs from the upper left to the lower right is said to be negative. The Y variable decreases as the X variable increases.
Positive: A pattern running the other way is called positive. The Y variable increases as the X variable increases.
[Figure: two scatterplots of Y against X]
TERM 1: SCATTERPLOT
The example in the text shows a negative association between central pressure and maximum wind speed. As the central pressure increases, the maximum wind speed decreases.
TERM 1: SCATTERPLOTS
Look for form: straight, curved, something exotic, or no pattern?
[Figure: three scatterplots of Y against X, captioned "Straight line, linear", "Curved", and "No pattern"]
In this part, we are more interested in the linear pattern.
TERM 1: SCATTERPLOTS
Look for strength: how much scatter is there? Or, how strong is the relationship?
Strong: the points appear tightly clustered in a single stream.
Weak: the swarm of points seems to form a vague cloud through which we can barely discern any trend or pattern.
[Figure: four scatterplots of Y against X with different amounts of scatter]
TERM 1: SCATTERPLOTS
Look for unusual features: are there outliers or subgroups?
[Figure: two scatterplots of Y against X. In the first, the circled point is a potential outlier. In the second, there are two clusters.]
TERM 1: SCATTERPLOT-ROLES FOR VARIABLES
It is important to determine which of the two quantitative variables goes on the x-axis and which on the y-axis. This determination is made based on the roles played by the variables. When the roles are clear, the explanatory or predictor variable goes on the x-axis, and the response variable goes on the y-axis.
TERM 1: SCATTERPLOTS
Summary
A scatterplot shows the relationship between two quantitative variables measured on the same individual.
The variable that is designated the X variable is called the explanatory variable.
The variable that is designated the Y variable is called the response variable.
Always plot the explanatory variable on the horizontal (x) axis.
Always plot the response variable on the vertical (y) axis.
In examining scatterplots, look for an overall pattern showing the form, direction and strength of the relationship.
Look also for outliers or other deviations from this pattern.
TERM 1: SCATTERPLOT
Example: Fast food is often considered unhealthy because much of it is high in fat. Are fat and calories related? Here are the fat and calorie contents of several brands of burgers. Analyze the association between fat content and calories.
Fat (g)     20    30    35    36    40    40    44
Calories    410   580   590   570   640   680   660
[Figure: scatterplot of Calories against Fat]
Comment on the scatterplot:
1) Direction: Positive
2) Form: Roughly linear
3) Strength: Moderately strong
4) Unusual features: None
TERM 2: CORRELATION
From scatterplots, we can see whether two quantitative variables are related and whether the relationship looks strong or weak. But how strong is it?
The correlation coefficient (or simply correlation) is a quantitative measure of the linear relationship (association) between two quantitative variables.
Finding the correlation coefficient, denoted by r, by hand:
$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1) s_x s_y}$
where $s_x$ and $s_y$ are the standard deviations of X and Y respectively.
Remarks: Before you use correlation, you must check several conditions:
Quantitative Variables Condition
Straight Enough Condition
Outlier Condition
TERM 2: CORRELATION
(Revisit the calories example) Here are the fat and calorie contents of several brands of burgers.
X: Fat (g)     20    30    35    36    40    40    44
Y: Calories    410   580   590   570   640   680   660
What is the correlation coefficient of x (fat) and y (calories)?
Solution: The means are $\bar{x} = 35$ and $\bar{y} = 590$.
Deviations in x    Deviations in y    Product
20-35=-15          410-590=-180       (-15)*(-180)=2700
30-35=-5           580-590=-10        (-5)*(-10)=50
35-35=0            590-590=0          0*0=0
36-35=1            570-590=-20        1*(-20)=-20
40-35=5            640-590=50         5*50=250
40-35=5            680-590=90         5*90=450
44-35=9            660-590=70         9*70=630
Add up the products: 2700+50+0+(-20)+250+450+630=4060
With $s_x = 7.98$ and $s_y = 89.81$, the correlation is r=4060/{(7-1)*7.98*89.81}=0.9442
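The hand calculation for the burger data can be checked with a short Python sketch. It applies the by-hand formula (sum of the products of deviations, divided by (n-1) times the two sample standard deviations) to the fat and calories values from the slides. The helper names `sample_sd` and `correlation` are illustrative only.

```python
from math import sqrt

# Burger data from the example.
fat = [20, 30, 35, 36, 40, 40, 44]
calories = [410, 580, 590, 570, 640, 680, 660]

def sample_sd(values):
    """Sample standard deviation, using the n - 1 divisor."""
    m = sum(values) / len(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(x, y):
    """Correlation coefficient r, following the by-hand formula."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of the products of the deviations (4060 for the burger data).
    sum_products = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sum_products / ((n - 1) * sample_sd(x) * sample_sd(y))

r = correlation(fat, calories)
print(f"s_x = {sample_sd(fat):.2f}, s_y = {sample_sd(calories):.2f}, r = {r:.4f}")
```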
TERM 2: CORRELATION
CORRELATION PROPERTIES
The sign of a correlation coefficient gives the direction of the linear association.
Positive sign: positive linear association
Negative sign: negative linear association
Correlation is always between -1 and +1.
Example: The correlation between fat and calories of 0.9442 indicates a strong positive linear association between them.
Correlation can be exactly equal to -1 or +1, but these values are unusual in real data because they mean that all the data points fall exactly on a single straight line.
A correlation near zero corresponds to a weak linear association.
TERM 2: CORRELATION
Cautions about correlation:
Quantitative Variables Condition: Correlation applies only to quantitative variables.
Straight Enough Condition: Correlation measures the strength only of the linear association.
[Figure: two scatterplots of y against x, labeled r=0.92 and r=0.098]
Outlier Condition: Outliers can distort the correlation dramatically.
[Figure: scatterplot of y against x with an outlier. With the outlier: r=0.795. Without the outlier: r=0.938]
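To see how a single outlier can distort the correlation, here is a small Python sketch. The data are made up for illustration and are not the points plotted on the slide: eight points lie close to a straight line, and one extra point is placed far from it.

```python
from math import sqrt

def pearson_r(x, y):
    """Correlation coefficient (algebraically the same as the by-hand formula)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: eight points close to a straight line.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
r_without = pearson_r(x, y)

# Add one hypothetical outlier far below the line.
r_with = pearson_r(x + [9], y + [2.0])

print(f"Without the outlier: r = {r_without:.3f}")
print(f"With the outlier:    r = {r_with:.3f}")
```

One point out of nine is enough to pull a nearly perfect correlation down substantially, which is why the Outlier Condition is checked before reporting r.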
TERM 2: CORRELATION
Correlation≠Causation
Fast food is often considered unhealthy because much of it is high in fat. Are fat and calories related? Based on the fat and calorie contents of several brands of burgers, the correlation between them is r=0.9442. Which conclusion is most accurate?
A. More fat in the burgers causes higher calories.
B. The burgers containing more fat tend to have higher calories.
Comment: Even though A sounds all right, it is not a conclusion that can be derived from the correlation, so B is the more accurate conclusion. Correlation objectively describes the linear association between two variables. It cannot establish causation.
CORRELATION PROPERTIES (CONT.)
Correlation treats x and y symmetrically: the correlation of x with y is the same as the correlation of y with x.
Correlation has no units.
Correlation is not affected by shifting and rescaling of either variable. Correlation depends only on the z-scores, and they are unaffected by changes in center or scale, i.e. corr(aX+b,cY+d)=corr(X,Y), where a,b,c,d are constants with a>0 and c>0. (Multiplying one variable by a negative constant flips the sign of the correlation but does not change its size.)
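The shifting and rescaling property can be checked numerically. The Python sketch below uses the burger data from the slides; the particular transformations (grams to ounces, and an arbitrary positive rescale plus shift of the calories) are illustrative assumptions. It also shows that multiplying one variable by a negative constant reverses the sign of r, so the equality corr(aX+b, cY+d) = corr(X, Y) is demonstrated here for positive a and c.

```python
from math import sqrt

def pearson_r(x, y):
    """Correlation coefficient (algebraically the same as the by-hand formula)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

fat_g = [20, 30, 35, 36, 40, 40, 44]
calories = [410, 580, 590, 570, 640, 680, 660]

r_original = pearson_r(fat_g, calories)

# Rescale x (grams to ounces) and rescale plus shift y (arbitrary positive constants).
fat_oz = [g / 28.35 for g in fat_g]
energy_shifted = [4.184 * c + 100 for c in calories]
r_transformed = pearson_r(fat_oz, energy_shifted)

# A negative multiplier on one variable flips the sign of r.
r_flipped = pearson_r([-g for g in fat_g], calories)

# Swapping the roles of x and y leaves r unchanged.
r_swapped = pearson_r(calories, fat_g)

print(f"{r_original:.4f} {r_transformed:.4f} {r_flipped:.4f} {r_swapped:.4f}")
```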
TERM 2: CORRELATION
Example: Here are several scatterplots. The calculated correlations are -0.923, -0.487, 0.006 and 0.777. Which is which?
[Figure: four scatterplots of Y against X, panels (a), (b), (c), and (d)]
QUESTION: CAN WE DO MORE?
Scatterplots and correlation are useful tools that help us learn about the (linear) association between two quantitative variables. Can we answer the following question?
Fast food is often considered unhealthy because much of it is high in fat. What is the calorie content of a kind of fast food with 28g fat?
Estimating an unknown value based on known values is called a prediction.
One way to do the prediction is by constructing a linear model.
[Figure: scatterplot of Calories against Fat]
TERM 3: LINEAR MODEL
Let's look at the burger example again.
Fat (g)     20    30    35    36    40    40    44
Calories    410   580   590   570   640   680   660
[Figure: "BURGERS", scatterplot of CALORIES against FAT with a red fitted line]
The red line does not go through all the points, but it can summarize the general pattern with only a couple of parameters: Calories = a+b*fat.
This model can be used to predict the calories based on the fat content.
Explanatory variable: Fat
Response variable: Calories
TERM 3: LINEAR MODEL
[Figure: "BURGERS", scatterplot of CALORIES against FAT with the fitted line, marking a prediction and its residual]
Predicted value: we call the estimate made from a model the predicted value, denoted as $\hat{y}$.
Residual: the difference between the observed value and its associated predicted value is called the residual, that is, residual = $y - \hat{y}$.
The line of best fit is the line for which the sum of the squared residuals is smallest. It is called the least squares line.
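TERM 3: LINEAR MODEL
X: Fat (g)     20    30    35    36    40    40    44
Y: Calories    410   580   590   570   640   680   660
Q1: Please construct a linear regression model to predict the calories based on fat.
Fat: $\bar{x} = 35$, $s_x = 7.98$
Calories: $\bar{y} = 590$, $s_y = 89.81$
Correlation: r=0.9442
Slope: $b = r \frac{s_y}{s_x} = 0.9442 \times \frac{89.81}{7.98} = 10.63$
Intercept: $a = \bar{y} - b\bar{x} = 590 - 10.63 \times 35 = 217.95$
Linear model: $\hat{y} = 217.95 + 10.63x$
[Figure: "BURGERS", scatterplot of CALORIES against FAT with the fitted line]
Q2: What is the predicted calorie content when the fat is 30g? When x=30, $\hat{y} = 217.95 + 10.63 \times 30 = 536.9$
Q3: What is the residual for the burger with 30g fat? When x=30, the residual is $y - \hat{y} = 580 - 536.9 = 43.1$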
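A minimal Python sketch of fitting the least squares line to the burger data follows. It computes the slope as the sum of products of deviations divided by the sum of squared x deviations, which is algebraically the same as r times s_y over s_x, and the intercept as the mean of y minus the slope times the mean of x. The function name `least_squares` is illustrative. Because the code does not round the slope before computing the intercept, it gives an intercept of about 218.01, slightly different from a value obtained with the slope rounded to 10.63 first (590 - 10.63*35 = 217.95).

```python
fat = [20, 30, 35, 36, 40, 40, 44]
calories = [410, 580, 590, 570, 640, 680, 660]

def least_squares(x, y):
    """Return (intercept a, slope b) of the least squares line y-hat = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx                   # same as r * s_y / s_x
    intercept = y_bar - slope * x_bar   # the line passes through (x_bar, y_bar)
    return intercept, slope

a, b = least_squares(fat, calories)

# Prediction and residual for the burger with 30 g of fat (observed: 580 calories).
predicted_30 = a + b * 30
residual_30 = 580 - predicted_30

print(f"y-hat = {a:.2f} + {b:.2f} x")
print(f"predicted at 30 g: {predicted_30:.1f}, residual: {residual_30:.1f}")
```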
TERM 3: LINEAR MODEL
Remarks: Since regression and correlation are closely related, we need to check the same conditions for regression as we did for correlation:
Quantitative Variables Condition
Straight Enough Condition
Outlier Condition
TERM 3: LINEAR MODEL (PARAMETERS)
We write a and b for the intercept and slope of the line. They are called the coefficients of the linear model.
The coefficient b is the slope, which tells us how rapidly the predicted value ($\hat{y}$) changes with respect to x. As the value of x increases by 1 unit, the predicted value of y changes by b units.
The coefficient a is the intercept, which tells where the line hits (intercepts) the y-axis. In other words, the intercept a is the predicted value of y when x=0.
Intercept and Slope (examples)
Fast food is often considered unhealthy because much of it is high in fat. Are fat and calories related? Here are the fat and calorie contents of several brands of burgers. To analyze the association between fat content and calories, the equation of the regression model is:
Predicted calories=217.95+10.63*fat
For this linear equation, slope=10.63, intercept=217.95
Q1: What does the slope 10.63 mean?
A1: An increase in fat of 1 gram is associated with an increase in calories of 10.63.
Q2: If the fat increases by 2 grams, how many more calories are expected to be contained in the burger?
A2: 2*10.63=21.26
Q3: What does the intercept 217.95 mean here?
A3: Theoretically, it means: when the burger contains no fat at all, the predicted amount of calories is 217.95.
TERM 4: RESIDUAL PLOT
After you construct the linear model, you have to check whether it makes sense. A residual plot can be used to check the appropriateness of the linear model.
A residual plot is the scatterplot of the residuals versus the x-values.
If a linear model is appropriate, then the residual plot shouldn't have any interesting features, like a direction or shape. It should stretch horizontally, with about the same amount of scatter throughout. It should show no bends, and it should have no outliers.
[Figure: residual plot of Residuals against X]
TERM 4: RESIDUAL SCATTERPLOT
Now, let’s try to diagnose the model for the calorie and fat example.
Fat (g): x                       20      30      35      36      40      40      44
Calories: y                      410     580     590     570     640     680     660
Predicted calories: $\hat{y}$    430.6   536.9   590     600.6   643.2   643.2   685.7
Residual: $y - \hat{y}$          -20.6   43.1    0       -30.6   -3.2    36.8    -25.7
[Figure: residual plot of residuals against fat]
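The predicted values and residuals for the burger data can be reproduced with a short Python sketch. It assumes the rounded model used in these slides, predicted calories = 217.95 + 10.63*fat, so the printed values may differ from the slide's figures in the last digit because of rounding.

```python
fat = [20, 30, 35, 36, 40, 40, 44]
calories = [410, 580, 590, 570, 640, 680, 660]

# Rounded coefficients of the linear model from the slides.
INTERCEPT = 217.95
SLOPE = 10.63

predicted = [INTERCEPT + SLOPE * x for x in fat]
# Residual = observed value minus predicted value.
residuals = [y - y_hat for y, y_hat in zip(calories, predicted)]

print("fat  calories  predicted  residual")
for x, y, y_hat, e in zip(fat, calories, predicted, residuals):
    print(f"{x:>3}  {y:>8}  {y_hat:>9.2f}  {e:>8.2f}")

# For a least squares line the residuals sum to (essentially) zero.
print(f"sum of residuals: {sum(residuals):.6f}")
```

Plotting `residuals` against `fat` gives the residual plot. With only seven points it is hard to judge, but there is no obvious bend or trend.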
TERM 4: RESIDUAL PLOT
Example: Tell what each of the residual plots below indicates about the appropriateness of the linear model that was fit to the data.
[Figure: three residual plots, panels (a), (b), and (c), plotting y1 against x1, y2 against x2, and y3 against x3]
TI for correlation and regression equation
The first time you do this:
Press 2nd, CATALOG (above 0)
Scroll down to DiagnosticOn
Press ENTER, ENTER
Read “Done”
Your calculator will remember this setting even when turned off.
Every time:
Enter predictor (x) values in L1
Enter response (y) values in L2
Pairs must line up: there must be the same number of predictor and response values.
Press STAT, > (to CALC)
Scroll down to 8:LinReg(a+bx), press ENTER, ENTER
Read intercept a, slope b and correlation r on the screen
IMPORTANT NOTES:
The take-home quiz is due on Monday. No late submission will be accepted.
Keep the ID assignment and bring it to class on Monday.
The sample exam will be handed out on Monday. We will discuss the questions on Wednesday.
Suggested Problem Set 4 will be collected next Thursday.
The final exam will be next Thursday: 2 hours, in class.
Please prepare a one-page, A4-size cheat sheet (one-sided) on your own. A formula sheet will not be provided in the final exam. The cheat sheet will be collected together with the final exam.