Education 1 – Statistics Primer


This primer is designed for those in education or health-related fields. It covers basic statistical terminology and presents various (hopefully helpful) examples from the "real world".

RATIO  A RATIO expresses the magnitude of one quantity in relation to the magnitude of another quantity, without making further assumptions about the numbers and without requiring that the two numbers share a common unit. Often, ratios are expressed as A:B or A per B. (p. 340, Statistics in a Nutshell)

RISK RATIO A RISK RATIO, also known as relative risk, is defined as

RR = [a/(a+b)] / [c/(c+d)]

(p. 349, Statistics in a Nutshell) The risk ratio (RR) lets you calculate the relative risk of contracting a disease (like Type II diabetes) given exposure to some condition (like a high fat diet). This is done by constructing what is called the classic 2 × 2 table.

                         Disease
                   D+        D-        Total
Exposure   E+      a         b         a+b
           E-      c         d         c+d
           Total   a+c       b+d       a+b+c+d

For example, what is the relative risk of contracting Type II diabetes for people on a high fat diet versus people on a low fat or normal diet?

                                         D+        D-        Total
Exposure   E+ (high fat diet)            350       1200      1550
           E- (low fat or normal diet)   200       1900      2100
           Total                         550       3100      3650

(D+ = has Type II diabetes; D- = does not have Type II diabetes)

Risk of Type II diabetes given a high fat diet:

a/(a+b)= 350/1550 = 0.226

Risk of Type II diabetes given a low fat to normal diet:

c/(c+d)= 200/2100 = 0.095

Now you can calculate RR.

RR = [a/(a+b)] / [c/(c+d)]

Plug in the numbers and you get:

RR = 0.226 / 0.095 = 2.38

A relative risk greater than 1 indicates that exposure increases the risk of contracting the disease. We would say that people consuming a high fat diet have 2.38 times the risk of Type II diabetes, compared to people consuming a low fat to normal diet.
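To make the arithmetic easy to check, here is a minimal Python sketch of the risk ratio calculation from the 2 × 2 table above; the function name and layout are just one possible illustration, not code from the book.

    def risk_ratio(a, b, c, d):
        """Relative risk from a 2 x 2 table: exposed row (a, b), unexposed row (c, d)."""
        risk_exposed = a / (a + b)      # risk of disease given exposure
        risk_unexposed = c / (c + d)    # risk of disease given no exposure
        return risk_exposed / risk_unexposed

    # Type II diabetes example from the table above
    print(round(risk_ratio(350, 1200, 200, 1900), 2))
    # 2.37 (the text's 2.38 comes from rounding the two risks to 0.226 and 0.095 first)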

ODDS RATIO An ODDS RATIO is defined as

OR = (a/c) / (b/d) = ad / bc

The odds ratio was developed for what are called "case-control studies." Case-control studies were developed in epidemiology to facilitate research into diseases that are rare or slow to develop. In a case-control study, individuals are selected on the basis of their disease status, and their exposure status is then determined. Because the investigator decides how many people with and without the disease to include, risk ratios, which are sensitive to the number of people without the disease, cannot be calculated in case-control studies.

The odds ratio is simply the ratio of the odds of disease for the exposed group to the odds of disease in the unexposed group. If a disease or condition is rare (less than 10%), the odds ratio provides a reasonable estimate of the risk ratio. (p. 352, Statistics in a Nutshell ).
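A companion Python sketch for the odds ratio, using the same 2 × 2 cell labels; the diabetes numbers are reused here purely for illustration (the original example was not a case-control study).

    def odds_ratio(a, b, c, d):
        """Odds ratio from a 2 x 2 table: OR = (a/c) / (b/d) = (a*d) / (b*c)."""
        return (a * d) / (b * c)

    # Illustration only, reusing the diabetes table cells
    print(round(odds_ratio(350, 1200, 200, 1900), 2))  # 2.77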

PROPORTION     A PROPORTION is a particular type of ratio in which all cases included in the numerator are also included in the denominator. For example, if you wanted to know what proportion of people living with AIDS in the United States were male, you would divide the number of male cases by the total number of cases (the number of males in the U.S. living with AIDS plus the number of females in the U.S. living with AIDS).

Proportion = 769,635 / (769,635 + 186,383) = 0.805

(p. 340, Statistics in a Nutshell) Proportions are usually expressed as percentages (from the Latin centum, meaning 100), so we would say 0.805 * 100 = 80.5%. Therefore, 80.5% of the people living with AIDS in the U.S. are male.
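A short sketch of the proportion-to-percentage step, using the AIDS figures quoted above:

    males = 769_635
    females = 186_383
    proportion_male = males / (males + females)
    print(round(proportion_male, 3), f"{proportion_male:.1%}")  # 0.805 80.5%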

RATE     A RATE is a proportion in which the denominator includes a measure of time. Morbidity and mortality (disease and death) statistics are often reported as rates per 1,000 or per 100,000 per time unit, because it is easier to interpret numbers like 3.57 versus 12.9 annually per 100,000 population than 0.0000357 versus 0.000129 annually per person.

Converting rates to standard quantities facilitates comparison across populations of different sizes.

One example is death rates and populations.

Year    Deaths    Population    Deaths per 100,000
1940    75        50,000        150.0
1950    95        60,000        158.3
1960    110       75,000        146.7
1970    125       90,000        138.9

The formula for the death rate per 100,000 is deaths/population * 100,000. In this example, with made-up numbers, you can see that even though deaths increased over the years, the population did too, so the overall deaths per 100,000 went down. (p. 341, Statistics in a Nutshell).
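The following sketch reproduces the "Deaths per 100,000" column from the made-up table above; the list-of-tuples layout is just an assumed convenience for the example.

    data = [(1940, 75, 50_000), (1950, 95, 60_000),
            (1960, 110, 75_000), (1970, 125, 90_000)]

    for year, deaths, population in data:
        rate = deaths / population * 100_000   # rate per 100,000
        print(year, round(rate, 1))
    # 1940 150.0, 1950 158.3, 1960 146.7, 1970 138.9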

CRUDE and AGE-ADJUSTED RATES           If not otherwise qualified, the term RATE usually means CRUDE RATE. The crude rate is the rate for the entire population under study, with no weighting or adjustment. For instance, according to the CDC, the overall death rate for cancer in the U.S. in 2003 was 195.5 per 100,000, but rates varied widely across gender, race, and age group, and across different types of cancer. When we want to compare groups whose age distributions differ, we use AGE-ADJUSTED RATES, which weight the age-specific rates by a standard population's age distribution so that differences in age structure do not distort the comparison.

For example, look at the differences between crude and age-adjusted mortality rates for the U. S. population in 2003 (per 100,000):

Population Ethnicity              Crude    Age-Adjusted
Overall                           191.5    190.1
White                             203.8    188.3
African American                  164.3    234.5
Asian/Pacific Islander            79.4     114.3
American Indian/Alaska Native     69.3     121.0
Hispanic                          60.3     127.4

The crude mortality rate from cancer is highest among white Americans, but this is largely due to their longer life expectancy: the older we get, the more likely we are to develop some sort of cancer. When the age-adjusted rates are considered, African Americans have the highest cancer mortality rate, according to the CDC. (p. 345, Statistics in a Nutshell).
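To make the idea of age adjustment concrete, here is a minimal sketch of direct standardization. The age groups, weights, and rates below are entirely invented for illustration; they are not CDC data and do not reproduce the table above, they only show how age-specific rates get weighted by a standard population.

    # Hypothetical standard-population weights and age-specific rates (per 100,000)
    standard_weights = {"0-44": 0.62, "45-64": 0.25, "65+": 0.13}
    age_specific_rates = {"0-44": 30.0, "45-64": 250.0, "65+": 1100.0}

    # Direct standardization: weight each age-specific rate by the standard population
    age_adjusted = sum(standard_weights[g] * age_specific_rates[g]
                       for g in standard_weights)
    print(round(age_adjusted, 1))  # 224.1 per 100,000 (hypothetical)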

PREVALENCE     PREVALENCE describes the number of cases that exist in a population at a particular point in time. It describes the disease burden on a population without differentiating between new versus existing cases. The calculation is P = number of cases/total population at a given moment in time.

It is sometimes referred to as prevalence rate. If a survey of a city with a population of 150,000 people found that 671 were diabetics, the prevalence of diabetes at the time of the survey in that city would be 671 per 150,000 or 447.3 per 100,000. (p. 342, Statistics in a Nutshell).
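A minimal sketch of the prevalence calculation, using the diabetes survey figures above:

    cases = 671
    population = 150_000
    prevalence_per_100k = cases / population * 100_000
    print(round(prevalence_per_100k, 1))  # 447.3 per 100,000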

INCIDENCE is somewhat more complicated to calculate, because it requires three elements to be defined. Incidence describes the number of new cases of a disease or condition that develop in a population at risk during a particular time interval. There are two types of incidence: cumulative incidence and incidence density (also known as incidence rate).

Cumulative incidence is the proportion of people who contract a disease during a specific time interval and is defined as: CI = (number of new cases/population at risk) for a specified time period. The formula to calculate CI assumes the entire population is at risk and can be studied for the entire time period specified.

If the population at risk changes at all over the period included in the incidence calculations, then the incidence density or the incidence rate should be calculated instead. Calculation of the IR requires expressing the denominator in “person-time units”, which represent the amount of time each person was observed.

For example, let’s look at hypothetical data on the annual rate of post-surgical infections at two hospitals. Because the hospitals serve different numbers of patients and patients are in the hospital for different lengths of time, we need to calculate the IR using person-time units in the denominator. Our statistic of comparison will be the number of complications per 100 patient-days. Each patient-day can be considered an opportunity for an infection to occur, so using patient-days in the denominator corrects for the different exposure lengths in the two hospitals.

Here’s our data:

Hospital                Patient ID    Days Followed    Infection?
1                       1             30               N
1                       2             25               Y
1                       3             15               N
Total for Hospital 1                  70               1
2                       1             45               Y
2                       2             30               N
2                       3             50               N
2                       4             75               Y
Total for Hospital 2                  200              2

The rate of infections per 100 patient-days is calculated as: (number of infections/person-days studied) * 100

So for this example, the IRs are:

Hospital 1            1/70 *100 = 1.43 per 100

Hospital 2            2/200 *100 = 1.00 per 100

Even though Hospital 2 had more post-surgical infections in the period studied, these occurred over proportionally more patient-days, so Hospital 2 has a lower rate of post-surgical infections than Hospital 1. (pp. 343–344, Statistics in a Nutshell).
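Here is a minimal sketch that computes the infection rate per 100 patient-days from the records above; the (hospital, days followed, infected) tuple layout is simply an assumed structure for the example.

    # (hospital, days_followed, infected?) records from the table above
    records = [
        (1, 30, False), (1, 25, True), (1, 15, False),
        (2, 45, True), (2, 30, False), (2, 50, False), (2, 75, True),
    ]

    for hospital in (1, 2):
        patient_days = sum(days for h, days, _ in records if h == hospital)
        infections = sum(1 for h, _, infected in records if h == hospital and infected)
        rate = infections / patient_days * 100   # infections per 100 patient-days
        print(f"Hospital {hospital}: {rate:.2f} per 100 patient-days")
    # Hospital 1: 1.43, Hospital 2: 1.00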

MEAN (Average)   The mean is commonly known as the average of a set of values, and it gives you an idea of what a typical or common value is for a set of data. It is usually obtained by summing the values and then dividing by the number of values. For a population (the entire group of interest, often very large), the mean is indicated by the Greek letter mu (μ). For a sample (a smaller group drawn from the population), the mean is indicated by x̄ (x with a bar over it). (p. 55, Statistics in a Nutshell)

An example of calculating the mean for a set of values in a sample:

Your sample: 100, 115, 93, 102, 97

The equation:  (100+115+93+102+97)/5 = 507/5 = 101.4

You may also want to calculate the median, which is literally the middle value when the values are ranked in ascending or descending order. If there are N values, the median is the value in position (N+1)/2. If N = 7, the median is the (7+1)/2 = 4th value. If there is an even number of values, the median is the average of the two middle values. (p. 57, Statistics in a Nutshell)

Another measure of central tendency (how common or typical a value is in the data set you are looking at) is the mode, which is the most frequently occurring value. The mode is most useful when describing categorical or ordinal data. For example, if you asked college students what their favorite news source is and got the following data, you could report the mode.

1 = newspapers                 2 = television                     3 = Internet

And your survey reported the following answers, put in order:

1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3

The mode would be 3 (the Internet), since it is the most frequently occurring value. (p. 58, Statistics in a Nutshell)
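All three measures of central tendency can be checked with Python's built-in statistics module; the data below are the sample values and survey codes used above.

    import statistics

    sample = [100, 115, 93, 102, 97]
    print(statistics.mean(sample))    # 101.4
    print(statistics.median(sample))  # 100 (middle of 93, 97, 100, 102, 115)

    survey = [1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3]  # 1=newspapers, 2=television, 3=Internet
    print(statistics.mode(survey))    # 3 (the Internet)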

VARIANCE
The variance helps you determine how spread out around the mean your data are. This matters because you want to know whether the data are gathered closely around the mean or spread out, with what are called outliers. Outliers are important to look at because they can influence your results; sometimes this is described as skewing the data.
The variance is the average of the squared deviations from the mean. For a population variance you use the Greek symbol σ², and for a sample variance you use the symbol s².

The formula for the sample variance is:

s² = 1/(n − 1) × Σ (xᵢ − x̄)²

where x̄ is the sample mean.

For example, if you had this set of data, you could calculate the variance.

Data    Mean    Difference from Mean    Squared Difference from Mean
1       3       -2                      4
2       3       -1                      1
3       3        0                      0
4       3       +1                      1
5       3       +2                      4

The variance is now computed:

s² = 1/(5 − 1) × ((−2)² + (−1)² + 0² + 1² + 2²)

   = (4 + 1 + 0 + 1 + 4) / 4

   = 10 / 4

   = 2.5

STANDARD DEVIATION   The standard deviation is the square root of the variance. Taking the square root expresses the spread in the same units as the original data. For example, if your mean measured the average weight of Tyrannosaurus rex in pounds, your variance would be in squared pounds, not pounds; to get back to pounds, you use the standard deviation.

The formula for the standard deviation is quite simple (here we show the sample standard deviation): s = √s²

So, using the data from the variance example above, we can calculate the standard deviation: s = √2.5 = 1.58
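Both calculations can be verified with the statistics module, which, like the formula above, uses the n − 1 (sample) form:

    import statistics

    data = [1, 2, 3, 4, 5]
    print(statistics.variance(data))         # 2.5
    print(round(statistics.stdev(data), 2))  # 1.58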

CONFIDENCE INTERVALS   When you calculate a mean for a sample, that is a point estimate: the mean is represented by a single number. The sample mean is only an estimate of the population mean, so you need some way of expressing how confident you are that, if you sampled again and calculated another mean, it would be similar to your first one. The way you do that is with confidence intervals.

Beware any statistics that show you means and standard deviations but do not show you the confidence intervals.

The confidence interval is the interval between a lower and an upper confidence limit, or bound. Roughly speaking, you are saying that if you sampled over and over again and constructed a confidence interval from each sample, X% of those intervals would contain the true population mean.

Different confidence levels are used, and authors should tell you which one they are using. For example, people commonly calculate 90%, 95%, or 99% confidence intervals. If they use a 90% confidence level, you can calculate the interval limits using this formula:

Confidence Interval (90%) = Mean + or – (1.65 * SE) where SE means standard error.

Standard error is calculated by (standard deviation/square root of N). (AllPsychOnLine, slide 2.42).

Confidence Interval (95%) = Mean + or – (1.96 * SE).

Confidence Interval (99%) = Mean + or – (2.58 * SE).

Working example:

Your sample data: 100, 115, 93, 102, 97

N = 5

The mean is:  (100+115+93+102+97)/5 = 507/5 = 101.4

Variance is: 69.3

Standard deviation is: 8.324662

Standard error is: (8.324662/2.236068) = 3.722902

So if we had a mean of 101.4, our confidence intervals would be:

Confidence Interval (90%) = 101.4 + or – (1.65 * 3.722902) = (95.25721, 107.5428).

Confidence Interval (95%) = 101.4 + or – (1.96 * 3.722902) = (94.10311, 108.6969).

Confidence Interval (99%) = 101.4 + or – (2.58 * 3.722902) = (91.79491, 111.0051).

The higher the confidence level, the wider the interval: to be more confident of capturing the true mean, you must allow a larger range of values. For example, if we use the 95% confidence level, we are saying that over an infinite number of repetitions of the study, 95% of the confidence intervals calculated this way would contain the true population mean.

You can use this to determine whether a study is more or less precise. For example, suppose you have two samples of students and in both cases the mean IQ score is 100. In one case, the 95% confidence interval is (95, 105) and in the other case, the 95% confidence interval is (80, 120). Because the former confidence interval is much narrower than the latter, the estimate of the mean is more precise for the first sample. (p. 144, Statistics in a Nutshell)
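A minimal sketch of the confidence-interval arithmetic for the sample above; the z multipliers (1.65, 1.96, 2.58) are the ones quoted in the text.

    import statistics

    sample = [100, 115, 93, 102, 97]
    mean = statistics.mean(sample)                      # 101.4
    se = statistics.stdev(sample) / len(sample) ** 0.5  # standard error = s / sqrt(N)

    for level, z in [("90%", 1.65), ("95%", 1.96), ("99%", 2.58)]:
        lower, upper = mean - z * se, mean + z * se
        print(f"{level} CI: ({lower:.2f}, {upper:.2f})")
    # 90% CI: (95.26, 107.54)   95% CI: (94.10, 108.70)   99% CI: (91.79, 111.01)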

References

Boslaugh, Sarah, and Paul Andrew Watters. Statistics in a Nutshell. O'Reilly Media, 2008. ISBN: 978-0-596-51049-7

Huff, Darrell. How to Lie with Statistics. New York: W. W. Norton, 1954 (reprint 1993). ISBN-13: 978-0393310726

National Institute of Standards and Technology. Engineering Statistics Handbook: Gallery of Distributions. http://www.itl.nist.gov/div898/handbook/eda/section3/eda366.htm

Ahrens, Wolfgang, and Iris Pigeot, eds. Handbook of Epidemiology. New York: Springer, 2004. ISBN-13: 978-3540005667

Hennekens, Charles H., and Julie E. Buring. Epidemiology in Medicine. Boston: Little, Brown, 1987. ISBN-13: 978-0316356367

Pagano, Marcello, and Kimberlee Gauvreau. Principles of Biostatistics, 2nd ed. Pacific Grove, CA: Duxbury Press, 2000. ISBN-13: 978-0534229023

"Ask Dr. Math." http://mathforum.org/dr.math/

AllPsych Online psychology and psychiatry website. http://allpsych.com/stats/unit1/02.html
