Module 9: Statistics Review
This module focuses on learning how to apply statistical concepts to real world data in order to answer research questions in nutritional anthropology.
1. Provide an intuitive understanding of the application of several statistical concepts, including mean, standard deviation, correlations, Z-Scores, and t-tests, to concepts in nutritional anthropology.
2. Use the above statistics to evaluate information from several data sets that measure different aspects of nutritional anthropology.
3. Interpret the statistical results from the SPSS output in your own words.
1. Let’s practice calculating and interpreting means, standard deviations, and Z-Scores.
 Start SPSS for Windows.
 Open the data set anthpomet.sav, by choosing File, Open, Choose the G: or Data drive from the Look in: box. Highlight the file, anthpomet.sav and click the Open button.
 Calculate means and standard deviation: Choose Analyze, Descriptives, and select years in schoolfemale, years in schoolmale, number of births, number of living children. Click the box next to Save standardized values as variables. Print out the results.
 Based on these results answer the following questions:
 What is the average number of years a male spends in school?
 What is the average number of years a female spends in school?
 Which gender spends more time in school on average?
 What is the average number of births in the sample?
 What is the average number of living children in the sample?
 Multiply the standard deviation of the number of living children by 2. Add this to the mean number of living children. What is this value?
 Based on the value you just calculated above, how likely is it to find a woman with 4 living children? (You can answer high medium or low.)
 Based on the value you just calculated above, how likely is it to find a woman with 10 living children? (You can answer high medium or low.)
 Make sure you are in the spreadsheet view of the data set by clicking on the SPSS Data Editor button at the bottom of your screen. Click on the Data View tab on the bottom left-hand side of your screen. Scroll to the right as far as you can go and still see numbers. Look at the column Zv84: these are the Z-Scores for the variable v84, the number of living children. Look up and down in this column and find a Z-Score larger than 3. Click on the row number to highlight the row. Scroll to the left to find the column v84. What is the value in that column for the row you highlighted? How likely is it to find a woman with this many children? How is this related to the Z-Score?
2. Let’s apply the concept of Z-Scores to nutritional anthropology. There are three anthropometric indices measured in the data set anthpomet.sav. These are var67: height for age Z-Score, v68: weight for height Z-Score, and v69: weight for age Z-Score. Scroll to these columns in the data editor. Using the information in these columns, answer the following questions.
 Write down the observation number (row number) of those persons that have low anthropometric levels by the height for age score. How many of these are contained in the sample? What proportion of the sample is this? Does this indicate that this is a healthy population?
 Write down the observation number (row number) of those persons that have low anthropometric levels by the weight for height score. How many of these are contained in the sample? What proportion of the sample is this? Does this indicate that this is a healthy population?
 Write down the observation number (row number) of those persons that have low anthropometric levels by the weight for age score. How many of these are contained in the sample? What proportion of the sample is this? Does this indicate that this is a healthy population?
 Write down the observation number (row number) of those persons that have low anthropometric levels by all three measures. How many of these are contained in the sample? What proportion of the sample is this? Does this indicate that this is a healthy population?
3. Now let’s practice applying a difference in means test. Open the data set calorie.sav, by choosing File, Open, Choose the G: or Data drive from the Look in: box. Highlight the file, calorie.sav and click the Open button.
 Let’s test to see if there is a difference in daily calorie intake between males and females. First of all, what is the null hypothesis? (Hint: look back at the notes on doing a T-test.)
 Let’s conduct the test. Choose Analyze, Compare Means, Independent-Samples T Test. Use the variable calories as your test variable, and gender as your grouping variable. Make sure you define the groups for gender: press the Define Groups button, put in a 0 for Group 1 and a 1 for Group 2, press Continue, then OK.
 Answer the following questions based on the SPSS Output:
 Focus on the Group Statistics table. What is the mean amount of calorie intake for males? For females? Do they look different for the two groups? Do you think that you will reject your null hypothesis?
 Look at the Sig value in the Independent Samples Test box. Is it smaller or larger than .05? Do you accept or reject the null hypothesis?
 Can you state in your own words as to whether males and females differ in their calorie intake based on your test results?
4. The final exercise involves using crosstabs to see if either high cholesterol or high blood pressure leads to a greater likelihood of a coronary incident. Open the data set hgbloodp.sav, by choosing File, Open, Choose the G: or Data drive from the Look in: box. Highlight the file, hgbloodp.sav and click the Open button.
 Let’s test to see if those with high cholesterol have a greater likelihood of having a coronary incident. First choose Analyze, Descriptive Statistics, Crosstabs. Choose high cholesterol (highchol) as the row variable, and Coronary incident as the column variable. Click the Statistics… button; check the box next to Chi-square, and click the Continue button. Click the Cells… button; check the box next to Row under Percentages, and click the Continue button. Click OK. Answer the following questions based on the output.
 Look at the crosstabulation table. What percentage of those persons with low cholesterol (Less than 240) has had a coronary incident? What percentage of those persons with high cholesterol (greater than 240) has had a coronary incident? Which is higher?
 What is the null hypothesis? Based on what you found above, do you think you will reject the null hypothesis?
 Based on the Chi-square test, do you accept or reject the null hypothesis?
 Can you state in your own words your conclusion as to whether persons with high cholesterol have a higher likelihood of coronary incidents?
 Let’s test to see if those with high systolic blood pressure have a greater likelihood of having a coronary incident. First choose Analyze, Descriptive Statistics, Crosstabs. Choose high systolic blood pressure (hghsystl) as the row variable, and Coronary incident as the column variable. Click the Statistics… button; check the box next to Chi-square, and click the Continue button. Click the Cells… button; check the box next to Row under Percentages, and click the Continue button. Click OK. Answer the following questions based on the output.
 Look at the crosstabulation table. What percentage of those persons with low systolic blood pressure has had a coronary incident? What percentage of those persons with high systolic blood pressure has had a coronary incident? Which is higher?
 What is the null hypothesis? Based on what you found above, do you think you will reject the null hypothesis?
 Based on the Chi-square test, do you accept or reject the null hypothesis?
 Can you state in your own words your conclusion as to whether persons with high systolic blood pressure have a higher likelihood of coronary incidents?
 Repeat the same exercise for diastolic blood pressure.
Part I: SPSS For Windows: Z-Scores
A statistical measure called a Z-Score is used to gauge the health of a population. The measure helps us sort the unhealthy members from the healthy members of a population. We’ll start with anthropometric measures such as height for age, weight for height, and weight for age. Then we will use the notion of a Z-Score to help us find which individuals are extremely different from the typical values of these measures in a healthy population. For example, the Z-Score will help us find individuals that have extremely low weight for height as compared to what would be found in a healthy population. The low weight for height value may indicate poor nutrition in that individual.
We need to review some statistical terminology to be able to understand the notion of a Z-Score. In particular we need to review the ideas of a mean and a standard deviation.
Mean
The mean of a random variable is a measure of central tendency of that random variable. An estimate of a mean is the arithmetic average. This is the sum of all the values in your sample, divided by the number of cases. The mean can be very loosely interpreted as the “typical” value of a population.
For example: If the mean age of a population is 45, we can interpret this as the average age in this population. It is loosely the typical age in the population. (See Figure A) Suppose we have another population that has a mean age of 15. We could conclude that it would be more likely to find younger people in the second population. The typical person is younger in the second population. (See Figure B)
Standard Deviation (SD)
The standard deviation is a measure of dispersion of the values of a random variable around the mean. It tells us the range of likely values we should find in a population. In a normal distribution, about 68% of cases fall within 1 SD of the mean, about 95% fall within 2 SD of the mean, and about 99.7% fall within 3 SD of the mean. For example, if the mean age of a population were 45, with a standard deviation of 10, about 95% of the cases would be between 25 and 65 in a normal distribution. In this case, two standard deviations are 2 times 10, or 20. If we add two standard deviations to the mean we get 65; if we subtract two standard deviations from the mean we get 25. Intuitively, this means that most of the people in this population (95%) are between the ages of 25 and 65. It is rare for someone to be older than 65 or younger than 25: this happens only about 5% of the time.
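The percentages above can be checked directly. The following is a minimal sketch using Python's standard library rather than SPSS; the function name share_within is our own, chosen for illustration.

```python
from statistics import NormalDist


def share_within(k_sd: float) -> float:
    """Share of a normal distribution lying within k_sd SDs of its mean."""
    std_normal = NormalDist()  # mean 0, SD 1
    return std_normal.cdf(k_sd) - std_normal.cdf(-k_sd)


for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.1%}")

# The age example from the text: mean 45, SD 10, range 25 to 65.
ages = NormalDist(mu=45, sigma=10)
share_25_to_65 = ages.cdf(65) - ages.cdf(25)
print(f"ages 25 to 65: {share_25_to_65:.1%}")
```

The three shares come out near 68%, 95%, and 99.7%, and the 25 to 65 age range is simply the "within 2 SD" case.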
Z-Scores
A Z-Score is a measure that tells you how many standard deviation units a value is above or below the mean. From the above example, if the mean age of a population is 45, with a standard deviation of 10, an age of 35 has a Z-Score of -1 (1 SD below the mean). Likewise, an age of 25 has a Z-Score of -2 (2 SD below the mean). The way we interpret this is that ages with a Z-Score below -2 are quite rare in this population. Observations that have a Z-Score of -2 or smaller and observations that have a Z-Score of 2 or larger are rare values. Together, these values appear only about 5% of the time.
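To make the definition concrete, here is a small sketch of the Z-Score calculation in Python. The first part uses the age example from the text. The second part standardizes a sample the same way SPSS does when you save standardized values; the list of living-children counts is made up for illustration and is not taken from anthpomet.sav.

```python
from statistics import mean, stdev


def z_score(value: float, mu: float, sd: float) -> float:
    """Number of SD units that value lies above (+) or below (-) mu."""
    return (value - mu) / sd


# Age example from the text: mean 45, SD 10.
print(z_score(35, 45, 10))  # -1.0
print(z_score(25, 45, 10))  # -2.0

# Hypothetical sample of living-children counts (illustration only).
children = [2, 3, 3, 4, 4, 5, 5, 6, 7, 14]
m = mean(children)
s = stdev(children)  # sample SD (divides by n - 1)
z_values = [z_score(x, m, s) for x in children]

# Print each value next to its Z-Score; large |Z| marks unusual cases.
for x, z in zip(children, z_values):
    print(f"{x:>3}  z = {z:+.2f}")
```

Standardized values always have a mean of 0 and a standard deviation of 1, which is what lets us compare them against the same cutoffs regardless of the original units.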
The distribution of the anthropometric indices can be expressed in terms of Z-Scores. The three indices are height for age, weight for height, and weight for age. In calculating these Z-Scores for a population, we use the mean (and standard deviation) of the index from a healthy (reference) population. The Z-Score cutoff point recommended by WHO, CDC, and others to classify low anthropometric levels is 2 SD units below the reference mean (a Z-Score of -2) for the three indices. The proportion of the population that falls below a Z-Score of -2 is generally compared with the reference (healthy) population proportion. In the healthy population, only about 2.3% of people fall below this cutoff. In a less healthy population, this proportion would be greater than 2.3%. The reason is that the mean value of the index tends to be lower in the less healthy population, so more of its values fall below the cutoff set from the healthy population. There is also a more stringent cutoff: very low anthropometric levels are usually defined as more than 3 SD units below the reference mean (a Z-Score below -3).
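The cutoff logic can be sketched in a few lines of Python. The 2.3% figure follows from the normal distribution; the list of height for age Z-Scores below is invented for illustration (it is not from anthpomet.sav), and the variable names are our own.

```python
from statistics import NormalDist

CUTOFF = -2.0  # "low" anthropometric level: more than 2 SD below reference mean

# Share of a healthy (reference) population expected below the cutoff.
reference_share = NormalDist().cdf(CUTOFF)
print(f"reference share below {CUTOFF}: {reference_share:.1%}")

# Hypothetical height for age Z-Scores for ten children (illustration only).
haz = [-2.6, -1.1, 0.3, -2.1, -0.4, -3.2, 0.8, -1.9, -0.7, -2.4]

# Row numbers (starting at 1, as in the SPSS Data View) of the low cases.
low_rows = [row for row, z in enumerate(haz, start=1) if z < CUTOFF]
sample_share = len(low_rows) / len(haz)

print("rows below cutoff:", low_rows)
print(f"sample share below cutoff: {sample_share:.1%}")
print("above reference share?", sample_share > reference_share)
```

A sample share well above the reference share suggests the population is less healthy on that index; in a real analysis the sample would be far larger than ten cases.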
Part II: SPSS For Windows Tests For Difference In Means: T-Tests And Crosstabs
Differences in Means T-tests
The differences in means t-test is a statistical test one would use to test whether the mean of a variable (measurement) differs by group. For example, suppose a researcher wants to test whether the level of female education differs by household type. More specifically, the research question is whether the average level of female education is different in dual/male-headed households as compared to female-headed households. The measurement in this example is the level of female education, the grouping variable is household type, group A is dual/male-headed households, and group B is female-headed households. This research question can be tested by using a differences in means t-test in SPSS.
Independent-Samples t-tests
There are two different kinds of difference in means tests: one for independent samples and one for dependent samples. The particular differences in means t-test we will use in this case is the Independent-Samples t-test. The word independent has a very specific meaning. If the independent-samples version of the test is used, the measurements you are comparing across groups must be statistically independent: knowing a measurement in one group should tell you nothing about any measurement in the other group. A practical requirement for independence is that the subjects in the two groups be different individuals, with each individual present in only one group. Gender fulfills this criterion for independence, as one cannot belong to both the male and female groups at once.
If the measurements cannot be assumed to be determined independently, then we say the samples are dependent. The samples would be dependent if there were two measurements made using the same individual (e.g. blood pressure for the same person before and after a treatment). Group A would be blood pressure measurements for a group of people before a treatment; Group B would be blood pressure measurements after a treatment. Both groups contain the same people; the difference between the groups is the treatment. In this case one should use a dependent-samples t-test instead. Another example of when to use a dependent t-test is when the group members are different individuals but are somehow related, for example if the male and female groups were composed of boyfriend and girlfriend pairs. In our example, it is okay to do an independent t-test, because dual/male-headed and female-headed households are two separate groups, with unrelated individuals in them.
The basic underlying assumption (called a null hypothesis) behind the test is that there is no difference between the means of the groups (e.g. the average female education level for female-headed households will be the same as that for dual/male-headed households). After conducting the test we will either fail to reject this hypothesis (conclude that we have no evidence of a difference in the means) or reject it (conclude that the means of the two groups differ). The test essentially calculates the means, compares them statistically and, based on this comparison, will either cause us to reject the null hypothesis or not give us enough evidence to reject it. Let’s first look at how to conduct the test in SPSS and then look at how we make the decision either to reject or not reject the null hypothesis.
To conduct an Independent-Samples T Test in SPSS:
· Open the Data set you want to use in SPSS: File, Open, Data, H:\ECDIET.SAV
· From the menus choose: Analyze, Compare Means, Independent-Samples T Test.
· Select one or more quantitative test variables. (e.g. var 82a: femeduc). A separate t test is computed for each variable.
· Select a single grouping variable (e.g. var 1: Household type), and click the Define Groups… button to specify two codes for the groups you want to compare. (e.g. Put a 1 in the Group 1 box for female-headed household type (matricentric), and put a 2 in the Group 2 box for dual/male-headed household type (patricentric).)
· If you would like, you can click Options to control the treatment of missing data and the level of the confidence interval.
Interpreting The Results Of The T-Test
The test produces two tables. The first table, titled Group Statistics, summarizes the means by group. The first row gives the number of cases N, mean, standard deviation, and standard error of the mean for the first group (1), which is female-headed households. The second row gives the same statistics for the second group (2), which is dual/male-headed households. The level of female education for female-headed households is higher, at 3.64 years, whereas the level of female education for dual/male-headed households is slightly lower, at 3.37 years.
Group Statistics: FEMEDUC level of education of adult female in household

VAR1 household type    N     Mean    Std. Deviation    Std. Error Mean
1                      14    3.64    3.079             .823
2                      30    3.37    2.297             .419

To see if this difference is statistically significant, we look at the second table (Independent Samples Test). We will use the first-row results in this table (we are assuming the variances are equal). We want to focus on the column titled Sig. (2-tailed). This reports the statistical significance of the t-statistic. The decision to reject the null hypothesis is made if this number is less than 0.05. The sig. value tells us the probability of seeing a difference in means at least as large as the one in our sample if the null hypothesis were true. If this value falls below 5%, we would reject our null hypothesis that the means across household types are the same, and conclude that there is a significant difference. The intuitive idea is that a sig. value lower than 5% says that a difference this large would be very unlikely if the means were really the same, so we should reject the null hypothesis.
In our current case the sig. value is 0.714. We do not reject the null hypothesis in this case because the sig. value is much larger than .05: a difference this large would be quite likely even if the null hypothesis were true. We conclude that we have no evidence for a significant difference in average female education level for dual/male-headed versus female-headed households.
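To see where the numbers in the Group Statistics table come from, the sketch below recomputes the standard errors and the equal-variances t-statistic from the summary figures alone. It uses plain Python rather than SPSS, and because the means are rounded to two decimals the t-statistic is only approximate. It does not compute the Sig. value, which requires the t distribution and is best read from the SPSS output.

```python
from math import sqrt

# Summary figures copied from the Group Statistics table.
n1, mean1, sd1 = 14, 3.64, 3.079  # group 1: female-headed households
n2, mean2, sd2 = 30, 3.37, 2.297  # group 2: dual/male-headed households

# Standard error of each group mean: SD divided by the square root of N.
se1 = sd1 / sqrt(n1)
se2 = sd2 / sqrt(n2)

# Equal variances assumed: pool the two sample variances.
df = n1 + n2 - 2
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df

# Standard error of the difference in means, then the t-statistic.
se_diff = sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat = (mean1 - mean2) / se_diff

print(f"SE group 1 = {se1:.3f}, SE group 2 = {se2:.3f}")
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
```

A t-statistic this close to zero means the difference between the group means is small relative to its standard error, which is consistent with not rejecting the null hypothesis.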
Crosstabulations (Crosstabs) and Chi Square Tests
Suppose a researcher wanted to find out if femaleheaded households and dual/maleheaded households differed in whether or not they gathered wild food. That is, are dual/maleheaded households as likely to gather wild food as femaleheaded households? We would use the crosstabs procedure because both of these variables (household type and gath) are categorical.
The null hypothesis, or basic assumption behind the statistical test, is that there is no difference between the household types in the probability that they gather wild food. In other words, these probabilities are equal: female-headed households are just as likely to gather wild food as dual/male-headed households. The test we will use to test this hypothesis is a Chi Square test.
How To Compute Cross tabulations in SPSS:
· Open the Data set you want to use in SPSS: File, Open, Data, H:\ECDIET.SAV
· From the menus choose: Analyze, Descriptive Statistics, Crosstabs.
· Select one or more row variables (This is the category or grouping variable. In our example it is household type or var1) and one or more column variables (In this example we chose ‘hh gathers wild food’.)
· Also click Statistics and check the Chi Square box, Click the Continue Button.
· Click Cells for observed and expected values, percentages, and residuals. (Under this you should check the Observed box under the Counts heading, and check the Row box under the Percentages heading)
How to interpret the results of the Cross tabulation
Three tables are produced: a case summary table, the table containing the cross tabulations and a table reporting the results. The case summary table simply lists the number of valid cases, number of missing cases, the total number of cases, and the percentages of each.
The second table contains the cross tabulations of food gathering and household type. The rows represent household type; the columns represent food gathering habit. The first row counts how many persons are in household type 1, female-headed households. The first column represents the number of persons who do not gather wild food. There are 12 persons in the sample that are in female-headed households and do not gather wild food, whereas there are relatively few persons (3) in female-headed households that do gather wild food. Dual/male-headed households follow the same pattern: few (3) gather wild food, and most (30) do not gather wild food. The percentages of each household type in each food-gathering category are also calculated. For example, 12 of 15 persons, or 80% of the persons in female-headed households, are NOT wild food gatherers.
The final row represents the counts and percents of the total number of persons in the sample in both foodgathering categories. 42 of 48 persons or 87.5% of all persons in the sample are NOT wild food gatherers.
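The row percentages described above are simple to reproduce. The sketch below rebuilds them in Python from the four counts given in the text; the dictionary layout and labels are our own choice for illustration.

```python
# Counts from the crosstabulation described in the text.
counts = {
    "female-headed": {"does not gather": 12, "gathers": 3},
    "dual/male-headed": {"does not gather": 30, "gathers": 3},
}

# Row percentages: each cell divided by its row (household type) total.
row_pct = {}
for household, cells in counts.items():
    row_total = sum(cells.values())
    row_pct[household] = {k: 100 * v / row_total for k, v in cells.items()}
    print(household, {k: f"{p:.1f}%" for k, p in row_pct[household].items()})

# Total row: both household types combined.
n_total = sum(sum(cells.values()) for cells in counts.values())
n_not = sum(cells["does not gather"] for cells in counts.values())
total_not_pct = 100 * n_not / n_total
print(f"total: {n_not} of {n_total} = {total_not_pct:.1f}% do not gather")
```

Comparing the "gathers" percentages across the two rows is the informal version of the question that the Chi Square test answers formally.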
Are the differences in food gathering by household type in our sample larger than we would expect from chance alone if there were no difference in the population? To answer this question we use the Chi Square statistic.
Footnotes from the Chi-Square Tests table:
a. Computed only for a 2×2 table
b. 2 cells (50.0%) have expected count less than 5. The minimum expected count is 1.88.
We want to focus on the column titled Asymp. Sig. (2-sided) and the row Pearson Chi-Square. This reports the statistical significance of the Chi Square statistic. The decision to reject the null hypothesis is made if this number is less than 0.05. The sig. value tells us the probability of seeing a difference between the household types at least as large as the one in our sample if the null hypothesis were true. If this value falls below 5%, we would reject our null hypothesis that wild food gathering is just as likely for both household types, and conclude that there is a significant difference. The intuitive idea is that a sig. value lower than 5% says that a difference this large would be very unlikely if the two household types were really equally likely to gather wild food, so we should reject the null hypothesis.
In our case, the Sig. value is 0.289, which is much higher than 0.05. So we fail to reject the null hypothesis, and say that we do not have evidence to support a different likelihood of wild food gathering for female-headed versus dual/male-headed households.
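For readers who want to see what SPSS is doing behind the Pearson Chi-Square row, here is a minimal hand calculation in Python using the four counts from the crosstabulation. It is a sketch, not the SPSS procedure itself: it covers only the 2×2 case (1 degree of freedom), where the Sig. value can be obtained from the standard library function math.erfc.

```python
from math import erfc, sqrt

# Observed counts: rows are household types, columns are
# (does not gather, gathers), as described in the text.
observed = [[12, 3],   # female-headed
            [30, 3]]   # dual/male-headed

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under the null hypothesis: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]
min_expected = min(min(row) for row in expected)

# Pearson Chi Square: sum of (observed - expected)^2 / expected over all cells.
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

# Upper-tail probability for 1 degree of freedom (2x2 table only).
p_value = erfc(sqrt(chi2 / 2))

print(f"minimum expected count = {min_expected:.2f}")
print(f"Chi Square = {chi2:.3f}, Sig. = {p_value:.3f}")
```

The minimum expected count of about 1.88 and the Sig. value of about 0.289 agree with the figures quoted above. Because two cells have expected counts below 5, the Chi Square approximation is rough for a sample this small, so the result should be read with some caution.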
There are no assignments for Module 9.