Module 9: Statistics Review

Overview

This module focuses on applying statistical concepts to real-world data in order to answer research questions in nutritional anthropology.

Objectives

1. Provide an intuitive understanding of the application of several statistical concepts including: mean, standard deviation, correlations, z-scores, and t-tests to concepts in nutritional anthropology.

2. Use the above statistics to evaluate information from several data sets that measure different aspects of nutritional anthropology.

3. Interpret the statistical results from the SPSS output in your own words.

Activities
  1. Let’s practice calculating and interpreting means, standard deviations and Z-Scores.
    1. Start SPSS for Windows.
    2. Open the data set anthpomet.sav, by choosing File, Open, Choose the G: or Data drive from the Look in: box. Highlight the file, anthpomet.sav and click the Open button.
    3. Calculate means and standard deviations: Choose Analyze, Descriptive Statistics, Descriptives, and select years in school-female, years in school-male, number of births, number of living children. Check the box next to Save standardized values as variables. Print out the results.
    4. Based on these results answer the following questions:
      1. What is the average number of years a male spends in school?
      2. What is the average number of years a female spends in school?
      3. Which gender spends more time in school on average?
      4. What is the average number of births in the sample?
      5. What is the average number of living children in the sample?
      6. Multiply the standard deviation of the number of living children by 2. Add this to the mean number of living children. What is this value?
      7. Based on the value you just calculated above, how likely is it to find a woman with 4 living children? (You can answer high, medium, or low.)
      8. Based on the value you just calculated above, how likely is it to find a woman with 10 living children? (You can answer high, medium, or low.)
      9. Make sure you are in the spreadsheet view of the data set by clicking on the SPSS Data Editor button at the bottom of your screen. Click on the Data View tab on the bottom left-hand side of your screen. Scroll to the right as far as you can go and still see numbers. Look at the column Zv84. These are the Z-Scores for the variable v84, the number of living children. Look up and down in this column and find a Z-Score larger than 3. Click on the row number to highlight the row. Scroll to the left to find the column v84. What is the value in that column for the row you highlighted? How likely is it to find a woman with this many children? How is this related to the Z-Score?

2. Let’s apply the concept of Z-Scores to nutritional anthropology. There are three anthropometric indices measured in the data set anthpomet.sav. These are var67: height for age z-score, v68: weight for height z-score, and v69: weight for age z-score. Scroll to these columns in the data editor. Using the information in these columns, answer the following questions.

  1. Write down the observation number (row number) of those persons that have low anthropometric levels by the height for age score. How many of these are contained in the sample? What proportion of the sample is this? Does this indicate that this is a healthy population?
  2. Write down the observation number (row number) of those persons that have low anthropometric levels by the weight for height score. How many of these are contained in the sample? What proportion of the sample is this? Does this indicate that this is a healthy population?
  3. Write down the observation number (row number) of those persons that have low anthropometric levels by the weight for age score. How many of these are contained in the sample? What proportion of the sample is this? Does this indicate that this is a healthy population?
  4. Write down the observation number (row number) of those persons that have low anthropometric levels by all three measures. How many of these are contained in the sample? What proportion of the sample is this? Does this indicate that this is a healthy population?

3. Now let’s practice applying a difference in means test. Open the data set calorie.sav, by choosing File, Open, Choose the G: or Data drive from the Look in: box. Highlight the file, calorie.sav and click the Open button.

  1. Let’s test to see if there is a difference in daily calorie intake between males and females. First of all, what is the null hypothesis? (Hint: look back at the notes on doing a T-test.)
  2. Let’s conduct the test. Choose Analyze, Compare Means, Independent-Samples T Test. Use the variable calories as your test variable, and gender as your grouping variable. Make sure you define the groups for gender: click the Define Groups button, put in a 0 for Group 1 and a 1 for Group 2, click Continue, then OK.
  3. Answer the following questions based on the SPSS Output:
    1. Focus on the Group Statistics table. What is the mean amount of calorie intake for males? For females? Do they look different for the two groups? Do you think that you will reject your null hypothesis?
    2. Look at the Sig value in the Independent Samples Test box. Is it smaller or larger than .05? Do you reject or fail to reject the null hypothesis?
    3. Based on your test results, state in your own words whether males and females differ in their calorie intake.

4. The final exercise involves using crosstabs to see if either high cholesterol or high blood pressure leads to a greater likelihood of a coronary incident. Open the data set hgbloodp.sav, by choosing File, Open, Choose the G: or Data drive from the Look in: box. Highlight the file, hgbloodp.sav and click the Open button.

  1. Let’s test to see if those with high cholesterol have a greater likelihood of having a coronary incident. First choose Analyze, Descriptive Statistics, Crosstabs. Choose high cholesterol (highchol) as the row variable, and Coronary incident as the column variable. Click the Statistics… button; check the box next to Chi-square, and click the Continue button. Click the Cells… button; check the box next to Row under Percentages, and click the Continue button. Click OK. Answer the following questions based on the output.
    1. Look at the crosstabulation table. What percentage of those persons with low cholesterol (less than 240) have had a coronary incident? What percentage of those persons with high cholesterol (greater than 240) have had a coronary incident? Which is higher?
    2. What is the null hypothesis? Based on what you found above, do you think you will reject the null hypothesis?
    3. Based on the chi-square test, do you reject or fail to reject the null hypothesis?
    4. State in your own words your conclusion as to whether persons with high cholesterol have a higher likelihood of coronary incidents.
  2. Let’s test to see if those with high systolic blood pressure have a greater likelihood of having a coronary incident. First choose Analyze, Descriptive Statistics, Crosstabs. Choose high systolic blood pressure (hghsystl) as the row variable, and Coronary incident as the column variable. Click the Statistics… button; check the box next to Chi-square, and click the Continue button. Click the Cells… button; check the box next to Row under Percentages, and click the Continue button. Click OK. Answer the following questions based on the output.
    1. Look at the crosstabulation table. What percentage of those persons with low systolic blood pressure have had a coronary incident? What percentage of those persons with high systolic blood pressure have had a coronary incident? Which is higher?
    2. What is the null hypothesis? Based on what you found above, do you think you will reject the null hypothesis?
    3. Based on the chi-square test, do you reject or fail to reject the null hypothesis?
    4. State in your own words your conclusion as to whether persons with high systolic blood pressure have a higher likelihood of coronary incidents.
  3. Repeat the same exercise for diastolic blood pressure.


Notes

Part I: SPSS For Windows: Z-Scores

A statistical measure called a Z-Score is used to gauge the health of a population. The measure helps us sort the unhealthy members from the healthy members of a population. We’ll start with anthropometric measures such as height for age, weight for height, and weight for age. Then we will use the notion of a Z-Score to help us find which individuals are extremely different from the typical values of these measures in a healthy population. For example, the Z-Score will help us find individuals that have extremely low weight for height as compared to what would be found in a healthy population. The low weight for height value may indicate poor nutrition in that individual.

We need to review some statistical terminology to be able to understand the notion of a Z-Score. In particular we need to review the ideas of a mean and standard deviation.

Mean
The mean of a random variable is a measure of central tendency of that random variable. An estimate of a mean is the arithmetic average. This is the sum of all the values in your sample, divided by the number of cases. The mean can be very loosely interpreted as the “typical” value of a population.

For example: If the mean age of a population is 45, we can interpret this as the average age in this population. It is loosely the typical age in the population. (See Figure A) Suppose we have another population that has a mean age of 15. We could conclude that it would be more likely to find younger people in the second population. The typical person is younger in the second population. (See Figure B)

Standard Deviation (SD)
The standard deviation is a measure of dispersion of values of a random variable around the mean. It tells us the range of likely values we should find in a population. In a normal distribution, about 68% of cases fall within 1 SD of the mean, about 95% fall within 2 SD of the mean, and about 99.7% fall within 3 SD of the mean. For example, if the mean age of a population were 45, with a standard deviation of 10, about 95% of the cases would be between 25 and 65 in a normal distribution. In this case, two standard deviations are 2 times 10, or 20. If we add two standard deviations to the mean we get 65; if we subtract two standard deviations from the mean we get 25. Intuitively, this means that most of the people in this population (about 95%) are between the ages of 25 and 65. Finding someone older than 65 or younger than 25 is rare. It happens only about 5% of the time.
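The calculations described above can be sketched in a few lines of Python using only the standard library. The ages below are hypothetical values invented for illustration, not taken from any course data set. The second half of the sketch computes the share of a normal distribution that lies within 1, 2, and 3 SD of the mean.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical ages, invented purely for illustration.
ages = [25, 35, 40, 45, 45, 50, 55, 65]

avg = mean(ages)   # arithmetic average: sum of the values / number of cases
sd = stdev(ages)   # sample standard deviation (divides by n - 1)

# Range covered by the mean plus or minus two standard deviations.
low, high = avg - 2 * sd, avg + 2 * sd
print(f"mean = {avg:.1f}, SD = {sd:.1f}")
print(f"mean +/- 2 SD: {low:.1f} to {high:.1f}")

# Share of a normal distribution that lies within k SD of the mean.
std_normal = NormalDist()  # mean 0, SD 1
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}
for k, share in coverage.items():
    print(f"within {k} SD: {share:.1%}")
```

The sample standard deviation is used here on the assumption that it matches what a descriptive statistics procedure reports for a sample.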

Z-Scores: A Z-Score is a measure that tells you how many standard deviation units a value is above or below the mean. It is calculated by subtracting the mean from the value and dividing by the standard deviation. From the above example, if the mean age of a population is 45, with a standard deviation of 10, an age of 35 has a Z-Score of -1 (1 standard deviation below the mean). Likewise, an age of 25 has a Z-Score of -2 (2 standard deviations below the mean). The way we interpret this is that ages with Z-Scores below -2 are quite rare in this population. Observations that have a Z-Score of negative two or smaller and observations that have a Z-Score of two or larger are rare values. Together, these values appear only approximately 5% of the time.
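As a small worked sketch of this calculation, the Python function below (the name z_score is chosen here for illustration) applies the definition to the age example, with a mean of 45 and a standard deviation of 10.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

mean_age, sd_age = 45, 10  # figures from the age example in the text

# Z-Scores for ages one and two standard deviations away from the mean.
results = {age: z_score(age, mean_age, sd_age) for age in (35, 25, 65)}
for age, z in results.items():
    print(f"age {age}: z = {z:+.1f}")
```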

The distribution of the anthropometric indices can be expressed in terms of Z-Scores. The three indices are height for age, weight for height, and weight for age. In calculating these Z-Scores for a population, we use the mean and standard deviation of the index from a healthy (reference) population. The Z-Score cutoff point recommended by WHO, CDC, and others to classify low anthropometric levels is 2 SD units below the reference mean for the three indices. The proportion of the population that falls below a Z-Score of -2 is generally compared with the reference (healthy) population proportion. In the healthy population, only about 2.3% of people fall below this cutoff. In a less healthy population, this proportion would be greater than 2.3%. The reason is that index values in a less healthy population tend to be shifted lower than those in the reference population, so more of them fall below the cutoff. There is also a more stringent cutoff. The cutoff for very low anthropometric levels is usually 3 SD units below the reference mean.
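The cutoff logic can be illustrated with a short Python sketch. The Z-Scores below are hypothetical values invented for illustration; they are not from the course data sets. The sketch lists the rows that fall below each cutoff and compares the sample proportion with the share expected in a normally distributed reference population.

```python
from statistics import NormalDist

LOW_CUTOFF = -2.0       # low anthropometric level
VERY_LOW_CUTOFF = -3.0  # very low anthropometric level

# Hypothetical weight-for-height Z-Scores, one per person (row 1, row 2, ...).
z_scores = [0.4, -1.2, -2.5, -0.3, -3.4, 1.1, -2.1, -0.8, 0.0, -1.9]

# Row numbers (starting at 1, as in a data editor) that fall below each cutoff.
low_rows = [row for row, z in enumerate(z_scores, start=1) if z < LOW_CUTOFF]
very_low_rows = [row for row, z in enumerate(z_scores, start=1) if z < VERY_LOW_CUTOFF]

proportion_low = len(low_rows) / len(z_scores)

# Share of a normal reference population expected below the -2 cutoff.
reference_share = NormalDist().cdf(LOW_CUTOFF)

print("rows below -2:", low_rows)
print("rows below -3:", very_low_rows)
print(f"sample proportion below -2: {proportion_low:.1%}")
print(f"reference proportion below -2: {reference_share:.1%}")
```

A sample proportion well above the reference share would suggest a less healthy population, which is the comparison the exercises ask for.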

Part II: SPSS For Windows Tests For Difference In Means: T-Tests And Crosstabs

Differences in Means T-tests
The differences in means t-test is a statistical test used to determine whether the mean of a variable (measurement) differs by group. For example, suppose a researcher wants to test whether the level of female education is different by household type. More specifically, the research question is whether the average level of female education is different in dual/male-headed households as compared to female-headed households. The measurement in this example is the level of female education, the grouping variable is household type, group A is dual/male-headed households, and group B is female-headed households. This research question can be tested with a differences in means t-test in SPSS.

Independent-Samples t-tests
There are two different kinds of difference in means tests: one for independent samples, one for dependent samples. The particular differences in means t-test we will use in this case is the Independent-Samples t-test. The word independent has a very specific meaning. If the independent samples version of the test is used, one must know that the measurements being compared across groups are statistically independent. This means that the measurements in one group are not linked to, and tell us nothing about, the measurements in the other group. One helpful condition for independence is that the subjects in the two groups are different individuals, and that each individual is present in only one group. Gender fulfills this criterion, as one cannot belong to both the “male” and “female” groups at once.

If the measurements cannot be assumed to be determined independently, then we say the samples are dependent. The samples would be dependent if two measurements were made on the same individual (e.g. blood pressure for the same person before and after a treatment). Group A would be blood pressure measurements for a group of people before the treatment; Group B would be blood pressure measurements after the treatment. Both groups contain the same people; the difference between the groups is the treatment. In this case one should use a dependent samples t-test instead. Another case that calls for a dependent t-test is when the group members are different individuals but are somehow related, for example if the male and female groups were composed of boyfriend and girlfriend pairs. In our example, it is okay to do an independent t-test, because dual/male-headed and female-headed households are two separate groups, with unrelated individuals in them.

The basic underlying assumption (called a null hypothesis) behind the test is that there is no difference between the means of the groups (e.g. the average female education level for female-headed households will be the same as that for dual/male-headed households). After conducting the test we will either fail to reject this hypothesis (conclude that we have no evidence of a difference in the means) or reject it (conclude that the means of the two groups differ). The test essentially calculates the means, compares them statistically and, based on this comparison, will either cause us to reject the null hypothesis or not give us enough evidence to reject it. Let’s first look at how to conduct the test in SPSS and then look at how we make the decision to reject or not reject the null hypothesis.

To conduct an Independent-Samples T Test in SPSS:
· Open the Data set you want to use in SPSS: File, Open, Data, H:\ECDIET.SAV
· From the menus choose: Analyze, Compare Means, Independent-Samples T Test.

· Select one or more quantitative test variables. (e.g. var 82a: femeduc). A separate t test is computed for each variable.

· Select a single grouping variable (e.g. var 1: Household type), and click the Define Groups… button to specify two codes for the groups you want to compare. (e.g. Put a 1 in the Group 1 box for female-headed household type (matricentric), and put a 2 in the Group 2 box for dual/male-headed household type (patricentric).)

· If you would like, you can click Options to control the treatment of missing data and the level of the confidence interval.

Interpreting The Results Of The T-Test

The test produces two tables. The first table summarizes the means by group. This is titled Group Statistics. The first row gives the number of cases N, mean, standard deviation, and standard error of the mean for the first group (1), which is female-headed households. The second row gives the same statistics for the second group (2), which is dual/male-headed households. The level of female education for female-headed households is higher, at 3.64 years, whereas the level of female education for dual/male-headed households is slightly lower, at 3.37 years.

Group Statistics: FEMEDUC (level of education of adult female in household)

VAR1 household type    N     Mean    Std. Deviation    Std. Error Mean
1                      14    3.64    3.079             .823
2                      30    3.37    2.297             .419
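The Std. Error Mean column in the Group Statistics output is the standard deviation divided by the square root of N. The sketch below recomputes it in Python from the N and Std. Deviation values reported above; the dictionary layout is just one convenient way to hold the figures.

```python
from math import sqrt

# N and standard deviation by household type, copied from the Group Statistics output.
groups = {
    1: {"n": 14, "sd": 3.079},  # female-headed households
    2: {"n": 30, "sd": 2.297},  # dual/male-headed households
}

# Standard error of the mean = SD / sqrt(N).
std_errors = {code: g["sd"] / sqrt(g["n"]) for code, g in groups.items()}

for code, se in std_errors.items():
    print(f"group {code}: standard error of the mean = {se:.3f}")
```

The smaller group has the larger standard error, which reflects the greater uncertainty in a mean estimated from fewer cases.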

To see if this difference is statistically significant, we look at the second table (Independent Samples Test). We will use the first row of results in this table (we are assuming the variances are equal). We want to focus on the column titled Sig. (2-tailed). This reports the statistical significance of the t-statistic. The decision to reject the null hypothesis is made if this number is less than 0.05. The sig. value tells us the probability of observing a difference at least as large as the one in our sample if the null hypothesis were true. If this value falls below 5%, we would reject our null hypothesis that the means across household types are the same, and conclude that there is a significant difference. The intuitive idea is that a sig. value lower than 5% says that data like ours would be very unlikely if the null hypothesis were true, so we should reject the null hypothesis that the means are the same.

In our current case the sig. value is 0.714. We do not reject the null hypothesis in this case because the sig. value is much larger than .05: a difference of this size would be quite likely even if the null hypothesis were true. We fail to reject the null hypothesis, and conclude that we have no evidence for a significant difference in average female education level for dual/male-headed versus female-headed households.
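To make the comparison more concrete, the sketch below computes an equal-variance (pooled) t-statistic in Python from the rounded Group Statistics values. It is an illustration under stated assumptions, not a reproduction of the SPSS output: the inputs are rounded, so the result will not match SPSS exactly, and the Python standard library has no t distribution, so the sketch compares the t-statistic with a critical value of about 2.018 (a t-table lookup for 42 degrees of freedom at the two-tailed 0.05 level) instead of computing the Sig. value.

```python
from math import sqrt

# Rounded summary statistics from the Group Statistics table.
n1, mean1, sd1 = 14, 3.64, 3.079  # group 1: female-headed households
n2, mean2, sd2 = 30, 3.37, 2.297  # group 2: dual/male-headed households

df = n1 + n2 - 2  # degrees of freedom for the equal-variance test

# Pooled variance: weighted average of the two group variances.
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df

# Standard error of the difference between the two means.
se_diff = sqrt(pooled_var * (1 / n1 + 1 / n2))

t_stat = (mean1 - mean2) / se_diff

# Assumption: two-tailed 5% critical value for 42 df, taken from a t-table.
T_CRITICAL = 2.018
reject_null = abs(t_stat) > T_CRITICAL

print(f"t = {t_stat:.3f} on {df} degrees of freedom")
print("reject null hypothesis:", reject_null)
```

Because the t-statistic is far smaller in absolute value than the critical value, the sketch leads to the same decision as the Sig. value: the null hypothesis is not rejected.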


Crosstabulations (Crosstabs) and Chi Square Tests

Suppose a researcher wanted to find out if female-headed households and dual/male-headed households differed in whether or not they gathered wild food. That is, are dual/male-headed households as likely to gather wild food as female-headed households? We would use the crosstabs procedure because both of these variables (household type and gath) are categorical.

The null hypothesis or basic assumption behind the statistical test is that there is no difference between the household types in the probability that they gather wild food. In other words, these probabilities are equal: female-headed households are just as likely to gather wild food as dual/male-headed households. The test we will use to evaluate this hypothesis is a Chi Square test.

How To Compute Cross tabulations in SPSS:
· Open the Data set you want to use in SPSS: File, Open, Data, H:\ECDIET.SAV
· From the menus choose: Analyze, Descriptive Statistics, Crosstabs.
· Select one or more row variables (This is the category or grouping variable. In our example it is household type or var1) and one or more column variables (In this example we chose ‘hh gathers wild food’.)
· Click Statistics, check the Chi Square box, and click the Continue button.
· Click Cells for observed and expected values, percentages, and residuals. (Under this you should check the Observed box under the Counts heading, and check the Row box under the Percentages heading)

How to interpret the results of the Cross tabulation

Three tables are produced: a case summary table, the table containing the cross tabulations, and a table reporting the Chi Square test results. The case summary table simply lists the number of valid cases, number of missing cases, the total number of cases, and the percentages of each.

The second table contains the cross tabulations of food gathering and household type. The rows represent household type; the columns represent food gathering habit. The first row counts how many persons are in household type 1, female-headed households. The first column represents the number of persons who do not gather wild food. There are 12 persons in the sample that are in female-headed households and do not gather wild food, whereas there are relatively few persons (3) in female-headed households that do gather wild food. Dual/male-headed households follow the same pattern: few (3) gather wild food, and most (30) do not gather wild food. The percentages of each household type in each food-gathering category are also calculated. For example, 12/15 people, or 80% of the persons in female-headed households, are NOT wild food gatherers.

The final row represents the counts and percents of the total number of persons in the sample in both food-gathering categories. 42 of 48 persons or 87.5% of all persons in the sample are NOT wild food gatherers.
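The row percentages described above can be recomputed from the four cell counts. The Python sketch below uses the counts reported in the text; the dictionary keys are labels chosen here for readability and are not the variable names in the data set.

```python
# Cell counts from the crosstabulation: household type by wild food gathering.
counts = {
    "female-headed": {"no": 12, "yes": 3},
    "dual/male-headed": {"no": 30, "yes": 3},
}

# Row percentage = cell count / row total * 100.
row_percent = {}
for household, row in counts.items():
    row_total = sum(row.values())
    row_percent[household] = {k: 100 * v / row_total for k, v in row.items()}

# Totals across both household types (the final row of the table).
total_no = sum(row["no"] for row in counts.values())
total_n = sum(sum(row.values()) for row in counts.values())
total_percent_no = 100 * total_no / total_n

for household, pct in row_percent.items():
    print(f"{household}: {pct['no']:.1f}% do not gather, {pct['yes']:.1f}% gather")
print(f"all households: {total_percent_no:.1f}% do not gather")
```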


Are the differences in food gathering by household type in our sample large enough to conclude that the household types differ in the population? To answer this question we use the Chi Square statistic.


Footnotes to the Chi-Square Tests table:
a. Computed only for a 2×2 table
b. 2 cells (50.0%) have expected count less than 5. The minimum expected count is 1.88.

We want to focus on the column titled Asymp. Sig. (2-sided) and the row Pearson Chi-Square. This reports the statistical significance of the Chi Square statistic. The decision to reject the null hypothesis is made if this number is less than 0.05. The sig. value tells us the probability of observing a difference at least as large as the one in our sample if the null hypothesis were true. If this value falls below 5%, we would reject our null hypothesis that wild food gathering is just as likely for both household types, and conclude that there is a significant difference. The intuitive idea is that a sig. value lower than 5% says that data like ours would be very unlikely if the null hypothesis were true, so we should reject the null hypothesis that the probabilities are the same.

In our case, the Sig. value is 0.289, which is much higher than 0.05. So we fail to reject the null hypothesis, and say that we do not have evidence to support a different likelihood of wild food gathering for female-headed versus dual/male-headed households.
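For readers who want to see where the Pearson Chi Square value comes from, the following Python sketch recomputes it from the four cell counts given in the text. It is a minimal illustration using only the standard library. For a 2×2 table the test has 1 degree of freedom, and in that special case the significance value can be obtained from the complementary error function; this shortcut does not apply to larger tables.

```python
from math import erfc, sqrt

# Observed counts: rows are household types, columns are (does not gather, gathers).
observed = [
    [12, 3],  # female-headed households
    [30, 3],  # dual/male-headed households
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under the null hypothesis = row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square: sum of (observed - expected)^2 / expected over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Significance for 1 degree of freedom (2x2 tables only).
p_value = erfc(sqrt(chi_square / 2))

min_expected = min(min(row) for row in expected)

print(f"chi-square = {chi_square:.3f}, sig. = {p_value:.3f}")
print(f"minimum expected count = {min_expected:.2f}")
```

The minimum expected count is also printed because small expected counts (below 5) are a common warning sign that the Chi Square approximation may be unreliable for a table.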

Assignments

There are no assignments for Module 9.