Lesson 2: Summarizing Data
Section 6: Measures of Central Location
A measure of central location provides a single value that summarizes an entire distribution of data. Suppose you had data from an outbreak of gastroenteritis affecting 41 persons who had recently attended a wedding. If your supervisor asked you to describe the ages of the affected persons, you could simply list the ages of each person. Alternatively, your supervisor might prefer one summary number — a measure of central location. Saying that the mean (or average) age was 48 years rather than reciting 41 ages is certainly more efficient, and most likely more meaningful.
Measures of central location include the mode, median, arithmetic mean, midrange, and geometric mean. Selecting the best measure to use for a given distribution depends largely on two factors:
 The shape or skewness of the distribution, and
 The intended use of the measure.
Each measure — what it is, how to calculate it, and when best to use it — is described in this section.
Mode
Definition of mode
The mode is the value that occurs most often in a set of data. It can be determined simply by tallying the number of times each value occurs. Consider, for example, the number of doses of diphtheria-pertussis-tetanus (DPT) vaccine each of seventeen 2-year-old children in a particular village received:
0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4
Two children received no doses; two children received 1 dose; three received 2 doses; six received 3 doses; and four received all 4 doses. Therefore, the mode is 3 doses, because more children received 3 doses than any other number of doses.
Method for identifying the mode
 Step 1. Arrange the observations into a frequency distribution, indicating the values of the variable and the frequency with which each value occurs. (Alternatively, for a data set with only a few values, arrange the actual values in ascending order, as was done with the DPT vaccine doses above.)
 Step 2. Identify the value that occurs most often.
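The two steps above can be sketched in Python. This is a minimal illustration rather than part of the lesson's own method; the function name find_modes is hypothetical. It is applied to the DPT dose counts listed above and returns an empty list when no value occurs more than once.

```python
from collections import Counter

def find_modes(values):
    """Return the most frequent value(s), or [] if no value occurs more than once."""
    # Step 1: build a frequency distribution (value -> number of occurrences).
    freq = Counter(values)
    if not freq:
        return []
    # Step 2: identify the value(s) that occur most often.
    top = max(freq.values())
    if top == 1:
        return []
    return sorted(v for v, c in freq.items() if c == top)

dpt_doses = [0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4]
print(find_modes(dpt_doses))  # [3]
```

Returning a list rather than a single number lets the same function report one mode, several modes, or none.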
EXAMPLES: Identifying the Mode
Example A: Table 2.8 (below) provides data from 30 patients who were hospitalized and received antibiotics. For the variable “length of stay” (LOS) in the hospital, identify the mode.

Step 1. Arrange the data in a frequency distribution.
LOS  Frequency
0  1
1  0
2  1
3  1
4  1
5  2
6  1
7  1
8  1
9  3
10  5
11  1
12  3
13  1
14  1
15  0
16  1
17  0
18  2
19  1
20  0
21  0
22  1
.  0
.  0
27  1
.  0
.  0
49  1
Alternatively, arrange the values in ascending order.
0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 49
Step 2. Identify the value that occurs most often.
Most values appear once, but the distribution includes two 5s, three 9s, five 10s, three 12s, and two 18s.
Because 10 appears most frequently, the mode is 10.
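Example A can be checked with a short Python sketch, assuming the 30 length-of-stay values are typed in by hand from Table 2.8. The variable names are illustrative only.

```python
from collections import Counter

# Length of stay (days) for the 30 patients in Table 2.8, in ascending order.
los = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
       9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
       12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

# Step 1: frequency distribution (value -> count).
freq = Counter(los)

# Values that occur more than once, with their counts.
repeated = {value: count for value, count in sorted(freq.items()) if count > 1}
print(repeated)  # {5: 2, 9: 3, 10: 5, 12: 3, 18: 2}

# Step 2: most_common(1) returns the single most frequent (value, count) pair.
mode, count = freq.most_common(1)[0]
print(mode, count)  # 10 5
```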
Example B: Find the mode of the following incubation periods for hepatitis A: 27, 31, 15, 30, and 22 days.

Step 1. Arrange the values in ascending order.
15, 22, 27, 30, and 31 days

Step 2. Identify the value that occurs most often.
None
Note: When no value occurs more than once, the distribution is said to have no mode.
Example C: Find the mode of the following incubation periods for Bacillus cereus food poisoning:

Step 1. Arrange the values in ascending order.
Done

Step 2. Identify the values that occur most often.
Five 3s and five 12s
Example C illustrates the fact that a frequency distribution can have more than one mode. When this occurs, the distribution is said to be bimodal. Indeed, Bacillus cereus is known to cause two syndromes with different incubation periods: a short-incubation-period (1–6 hours) syndrome characterized by vomiting; and a long-incubation-period (6–24 hours) syndrome characterized by diarrhea.
Table 2.8 Sample Data from the Northeast Consortium Vancomycin Quality Improvement Project
ID  Admission Date  Discharge Date  LOS  DOB (mm/dd)  DOB (year)  Age  Sex  ESRD  

1  1/01  1/10  9  11/18  1928  66  M  Y  3  N 
2  1/08  1/30  22  01/21  1916  78  F  N  10  Y 
3  1/16  3/06  49  04/22  1920  74  F  N  32  Y 
4  1/23  2/04  12  05/14  1919  75  M  N  5  Y 
5  1/24  2/01  8  08/17  1929  65  M  N  4  N 
6  1/27  2/14  18  01/11  1918  77  M  N  6  Y 
7  2/06  2/16  10  01/09  1920  75  F  N  2  Y 
8  2/12  2/22  10  06/12  1927  67  M  N  1  N 
9  2/22  3/04  10  05/09  1915  79  M  N  8  N 
10  2/22  3/08  14  04/09  1920  74  F  N  10  N 
11  2/25  3/04  7  07/28  1915  79  F  N  4  N 
12  3/02  3/14  12  04/24  1928  66  F  N  8  N 
13  3/11  3/17  6  11/09  1925  69  M  N  3  N 
14  3/18  3/23  5  04/08  1924  70  F  N  2  N 
15  3/19  3/28  9  09/13  1915  79  F  N  1  Y 
16  3/27  4/01  5  01/28  1912  83  F  N  4  Y 
17  3/31  4/02  2  03/14  1921  74  M  N  2  Y 
18  4/12  4/24  12  02/07  1927  68  F  N  3  N 
19  4/17  5/06  19  03/04  1921  74  F  N  11  Y 
20  4/29  5/26  27  02/23  1921  74  F  N  14  N 
21  5/11  5/15  4  05/05  1923  72  M  N  4  Y 
22  5/14  5/14  0  01/03  1911  84  F  N  1  N 
23  5/20  5/30  10  11/11  1922  72  F  N  9  Y 
24  5/21  6/08  18  08/08  1912  82  M  N  14  Y 
25  5/26  6/05  10  09/28  1924  70  M  Y  5  N 
26  5/27  5/30  3  05/14  1899  96  F  N  2  N 
27  5/28  6/06  9  07/22  1921  73  M  N  1  Y 
28  6/07  6/20  13  12/30  1896  98  F  N  3  N 
29  6/07  6/23  16  08/31  1906  88  M  N  1  N 
30  6/16  6/27  11  07/07  1917  77  F  N  7  Y 
To identify the mode from a data set in Analysis Module:
Epi Info does not have a Mode command. Thus, the best way to identify the mode is to create a histogram and look for the tallest column(s).
Select graphs, then choose histogram under Graph Type.
The tallest column(s) is(are) the mode(s).
NOTE: The Means command provides a mode, but only the lowest value if a distribution has more than one mode.
Properties and uses of the mode
The mode is the easiest measure of central location to understand and explain. It is also the easiest to identify, and requires no calculations.
 The mode is the preferred measure of central location for addressing which value is the most popular or the most common. For example, the mode is used to describe which day of the week people most prefer to come to the influenza vaccination clinic, or the “typical” number of doses of DPT the children in a particular community have received by their second birthday.
 As demonstrated, a distribution can have a single mode. However, a distribution has more than one mode if two or more values tie as the most frequent values. It has no mode if no value appears more than once.
 The mode is used almost exclusively as a “descriptive” measure. It is almost never used in statistical manipulations or analyses.
 The mode is not typically affected by one or two extreme values (outliers).
Exercise 2.3
Using the same vaccination data as in Exercise 2.2, find the mode. (If you answered Exercise 2.2, find the mode from your frequency distribution.)
2, 0, 3, 1, 0, 1, 2, 2, 4, 8, 1, 3, 3, 12, 1, 6, 2, 5, 1
Median
Definition of median
The median is the middle value of a set of data that has been put into rank order. Similar to the median on a highway that divides the road in two, the statistical median is the value that divides the data into two halves, with one half of the observations being smaller than the median value and the other half being larger. The median is also the 50th percentile of the distribution. Suppose you had the following ages in years for patients with a particular illness:
4, 23, 28, 31, 32
The median age is 28 years, because it is the middle value, with two values smaller than 28 and two values larger than 28.
Method for identifying the median
Step 1. Arrange the observations into increasing or decreasing order.
Step 2. Find the middle position of the distribution by using the following formula:
Middle position = (n + 1) / 2
 If the number of observations (n) is odd, the middle position falls on a single observation.
 If the number of observations is even, the middle position falls between two observations.
Step 3. Identify the value at the middle position.
 If the number of observations (n) is odd and the middle position falls on a single observation, the median equals the value of that observation.
 If the number of observations is even and the middle position falls between two observations, the median equals the average of the two values.
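The three steps above can be sketched in Python as follows. This is an illustrative implementation under the middle-position rule (n + 1) / 2, not the only way to compute a median. The four-value list in the last line is a hypothetical data set added to show the even case.

```python
def median(values):
    """Median via the middle-position rule (n + 1) / 2."""
    ordered = sorted(values)  # Step 1: arrange in rank order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    if n % 2 == 1:
        # Odd n: the middle position (n + 1) / 2 is a whole number (1-based),
        # so subtract 1 to get the 0-based list index.
        return ordered[(n + 1) // 2 - 1]
    # Even n: the middle position falls between positions n/2 and n/2 + 1,
    # so average the two observations on either side.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

print(median([4, 23, 28, 31, 32]))  # 28
print(median([4, 23, 28, 31]))      # 25.5
```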
Properties and uses of the median
 The median is a good descriptive measure, particularly for data that are skewed, because it is the central point of the distribution.
 The median is relatively easy to identify. It is equal to either a single observed value (if odd number of observations) or the average of two observed values (if even number of observations).
 The median, like the mode, is not generally affected by one or two extreme values (outliers). For example, if the values in the earlier example had been 4, 23, 28, 31, and 131 (instead of 32), the median would still be 28.
 The median has less-than-ideal statistical properties. Therefore, it is not often used in statistical manipulations and analyses.
Exercise 2.4
Determine the median for the same vaccination data used in Exercises 2.2 and 2.3.
2, 0, 3, 1, 0, 1, 2, 2, 4, 8, 1, 3, 3, 12, 1, 6, 2, 5, 1
Arithmetic mean
Definition of mean
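The arithmetic mean is a more technical name for what is more commonly called the mean or average. The arithmetic mean is the value that is closest to all the other values in a distribution, in the sense that it minimizes the sum of the squared differences between itself and the observed values.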
Method for calculating the mean
Step 1. Add all of the observed values in the distribution.
Step 2. Divide the sum by the number of observations.
EXAMPLE: Finding the Mean
Find the mean of the following incubation periods for hepatitis A: 27, 31, 15, 30, and 22 days.
Step 1. Add all of the observed values in the distribution.
27 + 31 + 15 + 30 + 22 = 125
Step 2. Divide the sum by the number of observations.
125 / 5 = 25.0
Therefore, the mean incubation period is 25.0 days.
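The same two steps can be written as a short Python sketch. The variable names are illustrative only.

```python
incubation_days = [27, 31, 15, 30, 22]

total = sum(incubation_days)         # Step 1: add all of the observed values
mean = total / len(incubation_days)  # Step 2: divide by the number of observations

print(total, mean)  # 125 25.0
```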
Properties and uses of the arithmetic mean
 The mean has excellent statistical properties and is commonly used in additional statistical manipulations and analyses. One such property is called the centering property of the mean. When the mean is subtracted from each observation in the data set, the sum of these differences is zero (i.e., the negative sum is equal to the positive sum). For the data in the previous hepatitis A example:
Value minus Mean  Difference 
15 – 25.0  – 10.0 
22 – 25.0  – 3.0 
27 – 25.0  + 2.0 
30 – 25.0  + 5.0 
31 – 25.0  + 6.0 
125 – 125.0 = 0  + 13.0 – 13.0 = 0 
This demonstrates that the mean is the arithmetic center of the distribution.
 Because of this centering property, the mean is sometimes called the center of gravity of a frequency distribution. If the frequency distribution is plotted on a graph, and the graph is balanced on a fulcrum, the point at which the distribution would balance would be the mean.
 The arithmetic mean is the best descriptive measure for data that are normally distributed.
 On the other hand, the mean is not the measure of choice for data that are severely skewed or have extreme values in one direction or another. Because the arithmetic mean uses all of the observations in the distribution, it is affected by any extreme value. Suppose that the last value in the previous distribution was 131 instead of 31. The mean would be 225 / 5 = 45.0 rather than 25.0. As a result of one extremely large value, the mean is much larger than all values in the distribution except the extreme value (the “outlier”).
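Both the centering property and the sensitivity to extreme values can be verified with a short Python sketch. The helper arithmetic_mean is a hypothetical name; the median is taken from Python's standard statistics module for comparison.

```python
from statistics import median

def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

incubation_days = [15, 22, 27, 30, 31]

# Centering property: the differences from the mean sum to zero.
m = arithmetic_mean(incubation_days)
deviations = [x - m for x in incubation_days]
print(deviations)       # [-10.0, -3.0, 2.0, 5.0, 6.0]
print(sum(deviations))  # 0.0

# Replace the largest value (31) with an extreme value (131).
with_outlier = [15, 22, 27, 30, 131]
print(arithmetic_mean(with_outlier))  # 45.0
print(median(with_outlier))           # 27
```

One extreme value moves the mean from 25.0 to 45.0, while the median of these five values remains 27.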
Epi Info Demonstration: Finding the Mean
Question: In the data set named SMOKE, what is the mean weight of the participants?
Answer: In Epi Info:
Select Analyze Data.
Select Read (Import). The default data set should be Sample.mdb. Under Views, scroll down to view SMOKE, and double click, or click once and then click OK. Note that 9 persons have a weight of 777, and 10 persons have a weight of 999. These are codes for “refused” and “missing.” To delete these records, enter the following commands:
Click on Select. Then type in weight < 770, or select weight from available values, then type < 770, and click on OK.
Select Means. Then click on the down arrow beneath Means of, scroll down and select WEIGHT, then click OK.
The resulting output should indicate a mean weight of 158.116 pounds.
Your Turn: What is the mean number of cigarettes smoked per day? [Answer: 17]
Exercise 2.5
Determine the mean for the same set of vaccination data.
2, 0, 3, 1, 0, 1, 2, 2, 4, 8, 1, 3, 3, 12, 1, 6, 2, 5, 1
The midrange (midpoint of an interval)
Definition of midrange
The midrange is the half‑way point or the midpoint of a set of observations. The midrange is usually calculated as an intermediate step in determining other measures.
Method for identifying the midrange
 Identify the smallest (minimum) observation and the largest (maximum) observation.
 Add the minimum plus the maximum, then divide by two.
Exception: Age differs from most other variables because age does not follow the usual rules for rounding to the nearest integer. Someone who is 17 years and 360 days old cannot claim to be 18 years old for at least 5 more days. Thus, to identify the midrange for age (in years) data, you must add the smallest (minimum) observation plus the largest (maximum) observation plus 1, then divide by two.
Midrange (most types of data) = (minimum + maximum) / 2
Midrange (age data) = (minimum + maximum + 1) / 2
Consider the following example:
In a particular preschool, children are assigned to rooms on the basis of age on September 1. Room 2 holds all of the children who were at least 2 years old but not yet 3 years old as of September 1. In other words, every child in room 2 was 2 years old on September 1. What is the midrange of ages of the children in room 2 on September 1?
For descriptive purposes, a reasonable answer is 2. However, recall that the midrange is usually calculated as an intermediate step in other calculations. Therefore, more precision is necessary.
Consider that children born in August have just turned 2 years old. Others, born in September the previous year, are almost but not quite 3 years old. Ignoring seasonal trends in births and assuming a very large room of children, birthdays are expected to be uniformly distributed throughout the year. The youngest child, born on September 1, is exactly 2.000 years old. The oldest child, whose birthday is September 2 of the previous year, is 2.997 years old. For statistical purposes, the mean and midrange of this theoretical group of 2-year-olds are both 2.5 years.
Properties and uses of the midrange
 The midrange is not commonly reported as a measure of central location.
 The midrange is more commonly used as an intermediate step in other calculations, or for plotting graphs of data collected in intervals.
EXAMPLES: Identifying the Midrange
Example A: Find the midrange of the following incubation periods for hepatitis A: 27, 31, 15, 30, and 22 days.
 Identify the minimum and maximum values.
Minimum = 15, maximum = 31
 Add the minimum plus the maximum, then divide by two.
Midrange = (15 + 31) / 2 = 46 / 2 = 23 days
Example B: Find the midrange of the grouping 15–24 (e.g., number of alcoholic beverages consumed in one week).
 Identify the minimum and maximum values.
Minimum = 15, maximum = 24
 Add the minimum plus the maximum, then divide by two.
Midrange = (15 + 24) / 2 = 39 / 2 = 19.5
This calculation assumes that the grouping 15–24 really covers 14.50–24.49…. Since the midrange of 14.50–24.49… = 19.49…, the midrange can be reported as 19.5.
Example C: Find the midrange of the age group 15–24 years.
 Identify the minimum and maximum values.
Minimum = 15, maximum = 24
 Add the minimum plus the maximum plus 1, then divide by two.
Midrange = (15 + 24 + 1) / 2 = 40 / 2 = 20 years
Age differs from the majority of other variables because age does not follow the usual rules for rounding to the nearest integer. For most variables, 15.99 can be rounded to 16. However, an adolescent who is 15 years and 360 days old cannot claim to be 16 years old (and hence get his driver’s license or learner’s permit) for at least 5 more days. Thus, the interval of 15–24 years really spans 15.0–24.99… years. The midrange of 15.0 and 24.99… = 19.99… = 20.0 years.
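The three examples above can be reproduced with a small Python sketch. The function name midrange and its is_age parameter are hypothetical, introduced here only to switch between the two formulas given in this section.

```python
def midrange(minimum, maximum, is_age=False):
    """Midpoint of a set of observations or of an interval.

    For age recorded in completed years, the interval really extends to
    just under (maximum + 1), so 1 is added before dividing by two.
    """
    if is_age:
        return (minimum + maximum + 1) / 2
    return (minimum + maximum) / 2

print(midrange(15, 31))               # 23.0  (Example A, days)
print(midrange(15, 24))               # 19.5  (Example B, drinks per week)
print(midrange(15, 24, is_age=True))  # 20.0  (Example C, age in years)
```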
Geometric mean
To calculate the geometric mean, you need a scientific calculator with log and y^x keys.
Definition of geometric mean
The geometric mean is the mean or average of a set of data measured on a logarithmic scale. The geometric mean is used when the logarithms of the observations are distributed normally (symmetrically) rather than the observations themselves. The geometric mean is particularly useful in the laboratory for data from serial dilution assays (1/2, 1/4, 1/8, 1/16, etc.) and in environmental sampling data.
More About Logarithms
A logarithm is the power to which a base is raised.
To what power would you need to raise a base of 10 to get a value of 100?
Because 10 times 10 or 10^2 equals 100, the log of 100 at base 10 equals 2. Similarly, the log of 16 at base 2 equals 4, because 2^4 = 2 x 2 x 2 x 2 = 16.
2^0 = 1 (anything raised to the 0 power is 1)
2^1 = 2 = 2
2^2 = 2 x 2 = 4
2^3 = 2 x 2 x 2 = 8
2^4 = 2 x 2 x 2 x 2 = 16
2^5 = 2 x 2 x 2 x 2 x 2 = 32
2^6 = 2 x 2 x 2 x 2 x 2 x 2 = 64
2^7 = 2 x 2 x 2 x 2 x 2 x 2 x 2 = 128
and so on.
10^0 = 1 (Anything raised to the 0 power equals 1)
10^1 = 10
10^2 = 100
10^3 = 1,000
10^4 = 10,000
10^5 = 100,000
10^6 = 1,000,000
10^7 = 10,000,000
and so on.
An antilog raises the base to the power (logarithm). For example, the antilog of 2 at base 10 is 10^2, or 100. The antilog of 4 at base 2 is 2^4, or 16. The majority of titers are reported as powers of 2 (e.g., 2, 4, 8, etc.); therefore, base 2 is typically used when dealing with titers.
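For readers who prefer software to a calculator, logs and antilogs can be checked with Python's standard math module. This is only a sketch; the three titer values are the ones mentioned in the text (2, 4, 8), and the variable names are illustrative.

```python
import math

# Logarithms: the power to which the base must be raised.
log_of_100 = math.log10(100)  # base 10
log_of_16 = math.log2(16)     # base 2
print(log_of_100, log_of_16)

# Antilogs: raise the base to the power (the ** operator).
print(10 ** 2)  # 100
print(2 ** 4)   # 16

# Titers reported as powers of 2 become evenly spaced on a base-2 log scale.
titers = [2, 4, 8]
log2_titers = [math.log2(t) for t in titers]
print(log2_titers)
```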
Method for calculating the geometric mean
There are two methods for calculating the geometric mean.
Method A
 Take the logarithm of each value.
 Calculate the mean of the log values by summing the log values, then dividing by the number of observations.
 Take the antilog of the mean of the log values to get the geometric mean.
Method B
 Calculate the product of the values by multiplying all of the values together.
 Take the nth root of the product (where n is the number of observations) to get the geometric mean.
EXAMPLES: Calculating the Geometric Mean
Example A: Using Method A
Calculate the geometric mean from the following set of data.
10, 10, 100, 100, 100, 100, 10,000, 100,000, 100,000, 1,000,000
Because these values are all powers of 10, it makes sense to use logs of base 10.
Take the log (in this case, to base 10) of each value.
log10(xi) = 1, 1, 2, 2, 2, 2, 4, 5, 5, 6
Calculate the mean of the log values by summing and dividing by the number of observations (in this case, 10).
Mean of log10(xi) = (1+1+2+2+2+2+4+5+5+6) / 10 = 30 / 10 = 3
 Take the antilog of the mean of the log values to get the geometric mean.
 Antilog10(3) = 10^3 = 1,000.
 The geometric mean of the set of data is 1,000.
Example B: Using Method B
Calculate the geometric mean of the following 95% confidence limits of an odds ratio: 1.0, 9.0

 Calculate the product of the values by multiplying all values together.
1.0 x 9.0 = 9.0
 Take the square root of the product.
The geometric mean = square root of 9.0 = 3.0.
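Methods A and B can both be sketched in Python. The two function names are hypothetical. Because the calculations use floating-point arithmetic, the results may differ from the hand-calculated values in the last few decimal places, so they are rounded before printing.

```python
import math

def geometric_mean_logs(values, base=10):
    """Method A: antilog of the mean of the log values."""
    logs = [math.log(x, base) for x in values]  # take the log of each value
    mean_log = sum(logs) / len(logs)            # mean of the log values
    return base ** mean_log                     # antilog of that mean

def geometric_mean_root(values):
    """Method B: nth root of the product of the values."""
    product = math.prod(values)                 # multiply all values together
    return product ** (1 / len(values))         # take the nth root

example_a = [10, 10, 100, 100, 100, 100, 10_000, 100_000, 100_000, 1_000_000]
example_b = [1.0, 9.0]

print(round(geometric_mean_logs(example_a), 3))  # approximately 1000
print(round(geometric_mean_root(example_b), 3))  # approximately 3
```

Both methods give the same answer for the same data, apart from rounding. Method A is usually safer for long lists of large values, because the product in Method B can become very large.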
Properties and uses of the geometric mean
The geometric mean is the average of logarithmic values, converted back to the base. The geometric mean tends to dampen the effect of extreme values and is never larger than the corresponding arithmetic mean (the two are equal only when all of the values are identical). In that sense, the geometric mean is less sensitive than the arithmetic mean to one or a few extreme values.
 The geometric mean is the measure of choice for variables measured on an exponential or logarithmic scale, such as dilutional titers or assays.
 The geometric mean is often used for environmental samples, when levels can range over several orders of magnitude. For example, levels of coliforms in samples taken from a body of water can range from less than 100 to more than 100,000.
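The dampening effect can be seen by comparing the two means for the data from Example A. This sketch assumes Python 3.8 or later, where the standard statistics module provides geometric_mean; the variable names are illustrative.

```python
from statistics import geometric_mean

values = [10, 10, 100, 100, 100, 100, 10_000, 100_000, 100_000, 1_000_000]

arithmetic = sum(values) / len(values)
geometric = geometric_mean(values)

print(arithmetic)           # 121042.0
print(round(geometric, 3))  # approximately 1000
```

The single value of 1,000,000 pulls the arithmetic mean above 100,000, while the geometric mean stays near the middle of the values on the log scale.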
Exercise 2.6
Using the dilution titers shown below, calculate the geometric mean titer of convalescent antibodies against tularemia among 10 residents of Martha’s Vineyard. [Hint: Use only the second number in the ratio, i.e., for 1:640, use 640.]
ID #  Acute  Convalescent 
1  1:16  1:512 
2  1:16  1:512 
3  1:32  1:128 
4  not done  1:512 
5  1:32  1:1024 
6  “negative”  1:1024 
7  1:256  1:2048 
8  1:32  1:128 
9  “negative”  1:4096 
10  1:16  1:1024 
Selecting the appropriate measure
Measures of central location are single values that summarize the observed values of a distribution. The mode provides the most common value, the median provides the central value, the arithmetic mean provides the average value, the midrange provides the midpoint value, and the geometric mean provides the logarithmic average.
The mode and median are useful as descriptive measures. However, they are not often used for further statistical manipulations. In contrast, the mean is not only a good descriptive measure, but it also has good statistical properties. The mean is used most often in additional statistical manipulations.
While the arithmetic mean is the measure of choice when data are normally distributed, the median is the measure of choice for data that are not normally distributed. Because epidemiologic data tend not to be normally distributed (incubation periods, doses, ages of patients), the median is often preferred. The geometric mean is used most commonly with laboratory data, particularly dilution titers or assays and environmental sampling data.
The arithmetic mean uses all the data, which makes it sensitive to outliers. Although the geometric mean also uses all the data, it is not as sensitive to outliers as the arithmetic mean. The midrange, which is based on the minimum and maximum values, is more sensitive to outliers than any other measure. The mode and median tend not to be affected by outliers.
In summary, each measure of central location — mode, median, mean, midrange, and geometric mean — is a single value that is used to represent all of the observed values of a distribution. Each measure has its advantages and limitations. The selection of the most appropriate measure requires judgment based on the characteristics of the data (e.g., normally distributed or skewed, with or without outliers, arithmetic or log scale) and the reason for calculating the measure (e.g., for descriptive or analytic purposes).
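As an illustration of how the measures can differ for skewed data, the sketch below computes four of them for the length-of-stay values in Table 2.8, using Python's standard statistics module (multimode requires Python 3.8 or later). The geometric mean is omitted because the data include a value of 0, and the logarithm of 0 is undefined.

```python
from statistics import median, multimode

# Length of stay (days) for the 30 patients in Table 2.8.
los = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
       9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
       12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

mean_los = sum(los) / len(los)
median_los = median(los)
modes_los = multimode(los)
midrange_los = (min(los) + max(los)) / 2

print(mean_los)      # 12.0
print(median_los)    # 10.0
print(modes_los)     # [10]
print(midrange_los)  # 24.5
```

The mode and median agree at 10 days, the mean is pulled upward to 12 days by the few long stays, and the midrange (24.5 days) is dominated by the single 49-day stay.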
Exercise 2.7
For each of the variables listed below from the line listing in Table 2.9, identify which measure of central location is best for representing the data.
 A. Mode
 B. Median
 C. Mean
 D. Geometric mean
 E. No measure of central location is appropriate
________ 6. Year of diagnosis
________ 7. Age (years)
________ 8. Sex
________ 9. Highest IFA titer
________ 10. Platelets x 10^6/L
________ 11. White blood cell count x 10^9/L
Table 2.9 Line Listing for 12 Patients with Human Monocytotropic Ehrlichiosis — Missouri, 1998–1999
Patient ID  Year of Diagnosis  Age (years)  Sex  Highest IFA* Titer  Platelets x 10^6/L  White Blood Cell Count x 10^9/L 
01  1999  44  M  1:1024  90  1.9 
02  1999  42  M  1:512  114  3.5 
03  1999  63  M  1:2048  83  6.4 
04  1999  53  F  1:512  180  4.5 
05  1999  77  M  1:1024  44  3.5 
06  1999  43  F  1:512  89  1.9 
10  1998  22  F  1:128  142  2.1 
11  1998  59  M  1:256  229  8.8 
12  1998  67  M  1:512  36  4.2 
14  1998  49  F  1:4096  271  2.6 
15  1998  65  M  1:1024  207  4.3 
18  1998  27  M  1:64  246  8.5 
Mean:  1998.5  50.92  na  1:976.00  144.25  4.35 
Median:  1998.5  51  na  1:512  128  3.85 
Geometric Mean:  1998.5  48.08  na  1:574.70  120.84  3.81 
Mode:  none  none  M  1:512  none  1.9, 3.5 
*Immunofluorescence assay
Data Source: Olano JP, Masters E, Hogrefe W, Walker DH. Human monocytotropic ehrlichiosis, Missouri. Emerg Infect Dis 2003;9:1579–86.