Sampling Methods

This section is adapted from Chapter 1 of OpenIntro Statistics, second edition.

The first step in conducting research is to identify the topics or questions to be investigated. A clearly laid-out research question helps identify the subjects or cases that should be studied and what variables might be important. 

Consider the following four research questions:

1. What is the average mercury content in swordfish in the Atlantic Ocean?

2. Over the last 5 years, what is the average time a Washington community college student needed to achieve an AA degree?

3. Does a new drug reduce the number of deaths among patients with severe heart disease?

4. What percentage of registered voters in Washington state approve of the job the governor is doing?

Each research question refers to a target population. In the first question, the target population is all swordfish in the Atlantic Ocean, and each fish represents a case. Often it is impractical or too expensive to collect data for every case in a population. Instead, we collect data about a sample selected from the population. (In rare cases, we are able to collect data about all of the cases in a population; this is called a census.)

Anecdotal evidence
Consider the following possible responses to the four research questions:

1. A man on the news got mercury poisoning from eating swordfish, so the average mercury concentration in swordfish must be dangerously high.

2. I met two students who needed more than 5 years to get their AA, so it must take way longer than two years to get an AA degree for most people.

3. My neighbor had a heart attack and died after they gave him a new heart disease drug, so the drug must not work.

4. All of my Facebook friends think the governor is doing a horrible job, so most of the voters in the state must think the same thing.

Each of these conclusions is based on some data, but there are two problems. First, the data represent only one or two cases. Second, and more importantly, it is unclear whether these few cases are actually representative of the population. We refer to data collected in this haphazard fashion as anecdotal evidence.

Anecdotal evidence often consists of unusual cases that we recall based on their striking characteristics. For instance, we are more likely to remember the two people we met who took 5 years to graduate than the six others who graduated in two years. Instead of looking at the most unusual cases, we should examine a (typically larger) sample with cases more representative of the population.

Sampling from a population
We might try to estimate the time it takes for a community college student to get an AA degree by collecting a sample of students. In general, we seek to select a sample from a population at random. The most basic type of random selection is equivalent to how raffles are conducted. For example, in selecting graduates, we could write each graduate's name on a raffle ticket, put those tickets in a large box, shake up the box, and draw 100 tickets. The selected names would represent a random sample of 100 graduates.

Why pick a sample randomly? Why not just pick a sample by hand? Consider the following scenario.

Suppose we ask a student who happens to be majoring in accounting to select several graduates for the study. It's possible she would pick a disproportionate number of graduates from business-related fields like accounting, simply because she knows more students in that area than those seeking to major in science or music or psychology. (Or perhaps her selection might be representative of the population.) When selecting samples by hand, we run the risk of picking a biased sample, even if that bias is unintentional or difficult to discern.

The most basic random sample is called a simple random sample, and it is the equivalent of using a raffle as described above to select cases. This means that each case in the population has an equal chance of being included (and furthermore that each group of a given size in the population has an equal chance of being selected). Taking a simple random sample helps minimize bias, but does not absolutely guarantee a representative sample. Even when people are selected at random, caution must be exercised if the non-response rate is high. For instance, if only 30% of the people randomly sampled to participate in a survey actually respond, then it is unclear whether the results are representative of the entire population. This non-response bias can skew results.

Another common pitfall is a convenience sample, where individuals who are easily accessible are more likely to be included in the sample. For instance, if a political survey stops people walking in downtown Seattle to ask them what they think about the governor, the responses will not represent all of Washington state. It is often difficult to discern what sub-population a convenience sample represents.

There are two primary types of data collection: observational studies and experiments. Researchers perform an observational study when they collect data in a way that does not directly interfere with how the data arise. For instance, researchers may collect information via surveys, review medical or company records, or follow a cohort of many similar individuals to study why certain diseases might develop. In each of these situations, researchers merely observe the data that arise. In general, observational studies can provide evidence of a naturally occurring association between variables, but they cannot by themselves show a cause-and-effect connection.

When researchers want to investigate the possibility of a cause-and-effect connection, they conduct an experiment. Usually there will be both an explanatory and a response variable. For instance, we may suspect administering a drug will reduce mortality in heart attack patients over the following year. To check if there really is a causal connection between the explanatory variable and the response, researchers will collect a sample of individuals and split them into groups. The individuals in each group are randomly assigned a treatment. When individuals are randomly assigned to a group, the experiment is called a randomized experiment. For example, each heart attack patient in a drug trial could be randomly assigned, perhaps by flipping a coin, into one of two groups: the first group receives a placebo (fake treatment) and the second group receives the actual drug.
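The coin-flip assignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patient labels, the function name randomly_assign, and the group size of 200 are all hypothetical and do not come from any actual trial.

```python
import random

def randomly_assign(patients, seed=None):
    """Assign each patient to 'placebo' or 'drug' by a simulated fair coin flip."""
    rng = random.Random(seed)
    groups = {"placebo": [], "drug": []}
    for patient in patients:
        # One independent coin flip per patient decides the group.
        group = "drug" if rng.random() < 0.5 else "placebo"
        groups[group].append(patient)
    return groups

# Hypothetical patient labels, used only for illustration.
patients = [f"patient_{i}" for i in range(1, 201)]
groups = randomly_assign(patients, seed=42)
print({name: len(members) for name, members in groups.items()})
```

Because every patient gets a separate coin flip, the two groups will usually not be exactly the same size. A trial that needs equal groups could instead shuffle the list of patients and split it in half, which is still a random assignment.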

Observational studies
Generally, data in observational studies are collected only by monitoring what occurs, while experiments require the primary explanatory variable in a study be assigned for each subject by the researchers.

Making causal conclusions based on experiments is often reasonable. However, making the same causal conclusions based on observational data can be treacherous and is not recommended. Thus, observational studies are generally only sufficient to show associations.

Suppose an observational study tracked sunscreen use and skin cancer, and found that the more sunscreen someone used, the more likely the person was to have skin cancer. Does this mean sunscreen causes skin cancer? Some previous research tells us that using sunscreen actually reduces skin cancer risk, so maybe there is another variable that can explain this hypothetical association between sunscreen usage and skin cancer. One important piece of information that is absent is sun exposure. If someone is out in the sun all day, she is more likely to use sunscreen and more likely to get skin cancer. Exposure to the sun is unaccounted for in the simple investigation.

In this example, sun exposure is a confounding (or lurking) variable, a variable that is associated with both the explanatory and response variables. While one method to justify making causal conclusions from observational studies is to exhaust the search for confounding variables, there is no guarantee that all confounding variables can be examined or measured.
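A small simulation can make the effect of a confounding variable concrete. The sketch below uses entirely made-up probabilities (they are assumptions chosen for illustration, not research findings): people with high sun exposure are more likely to use sunscreen and more likely to develop skin cancer, while sunscreen lowers risk at every exposure level.

```python
import random

def simulate(n=200_000, seed=0):
    """Simulate people with made-up risks; return counts by (high_exposure, sunscreen)."""
    rng = random.Random(seed)
    # counts[(high_exposure, uses_sunscreen)] = [number of people, number of cancer cases]
    counts = {(e, s): [0, 0] for e in (False, True) for s in (False, True)}
    for _ in range(n):
        high = rng.random() < 0.5                          # assumed: half have high exposure
        sunscreen = rng.random() < (0.8 if high else 0.2)  # assumed: exposure drives sunscreen use
        # Assumed risks: exposure raises risk, sunscreen cuts it to 60% of baseline.
        risk = (0.12 if high else 0.02) * (0.6 if sunscreen else 1.0)
        cancer = rng.random() < risk
        counts[(high, sunscreen)][0] += 1
        counts[(high, sunscreen)][1] += cancer
    return counts

def rate(counts, keys):
    """Cancer rate pooled over the given (high_exposure, sunscreen) groups."""
    people = sum(counts[k][0] for k in keys)
    cases = sum(counts[k][1] for k in keys)
    return cases / people

counts = simulate()
overall_users = rate(counts, [(False, True), (True, True)])
overall_nonusers = rate(counts, [(False, False), (True, False)])
print(f"ignoring exposure: users {overall_users:.3f}, non-users {overall_nonusers:.3f}")
for high in (False, True):
    label = "high exposure" if high else "low exposure"
    users = rate(counts, [(high, True)])
    nonusers = rate(counts, [(high, False)])
    print(f"{label}: users {users:.3f}, non-users {nonusers:.3f}")
```

When exposure is ignored, sunscreen users show the higher cancer rate, yet within each exposure level they show the lower rate. The simulation can separate the two only because it records sun exposure; in a real observational study, a confounder that was never measured cannot be adjusted for this way.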

Sampling methods
Almost all statistical methods are based on the notion of implied randomness. If observational data are not collected in a random framework from a population, the conclusions that arise from these statistical methods are not reliable. Here we consider three random sampling techniques: simple, stratified and cluster sampling.

Simple random sampling is probably the most intuitive form of random sampling. Consider the salaries of Major League Baseball (MLB) players, where each player is a member of one of the league's 30 teams. To take a simple random sample of 120 baseball players and their salaries from players active during the 2010 season, we could write the names of that season's 828 players onto slips of paper, drop the slips into a bucket, shake the bucket around until we are sure the names are all mixed up, then draw out slips until we have the sample of 120 players.
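The slips-in-a-bucket procedure has a direct software counterpart. The following is a minimal sketch, assuming a roster of 828 placeholder labels in place of the real player names; the function name simple_random_sample is hypothetical.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct cases so that every case has the same chance of selection."""
    rng = random.Random(seed)
    # random.sample draws without replacement, like pulling slips from a bucket.
    return rng.sample(population, n)

# Placeholder labels stand in for the 828 real player names (an assumption).
players = [f"player_{i}" for i in range(1, 829)]
sample = simple_random_sample(players, 120, seed=2010)
print(len(sample), len(set(sample)))
```

Each player has the same chance of being drawn, 120 out of 828, or about 14.5%. Fixing the seed only makes the draw repeatable; leaving it out gives a fresh sample each run.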

Stratified sampling is a divide-and-conquer sampling strategy. The population is divided into groups called strata. The strata are chosen so that similar cases are grouped together, then a second sampling method, usually simple random sampling, is employed within each stratum. In the baseball salary example, the teams could represent the strata; some teams have a lot more money (we're looking at you, Yankees). Then we might randomly sample 4 players from each team for a total of 120 players. Stratified sampling is especially useful when the cases in each stratum are very similar with respect to what is being measured about the cases. The downside is that analyzing data from a stratified sample is a more complex task than analyzing data from a simple random sample. The analysis methods introduced in this course would need to be extended to analyze data collected using stratified sampling.
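As a rough sketch of the team-by-team procedure, the code below groups cases into strata and takes a simple random sample of 4 from each one. The roster is invented for illustration: the team labels, player labels, and per-team roster sizes (27 or 28, chosen only so the total is 828) are assumptions, as is the function name stratified_sample.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(cases, stratum_of, per_stratum, seed=None):
    """Take a simple random sample of per_stratum cases from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for case in cases:
        strata[stratum_of(case)].append(case)
    sample = []
    for name in sorted(strata):  # every stratum contributes to the sample
        sample.extend(rng.sample(strata[name], per_stratum))
    return sample

# Invented roster of (team, player) pairs: 18 teams of 28 and 12 teams of 27.
roster = [
    (f"team_{t:02d}", f"player_{t:02d}_{i:02d}")
    for t in range(30)
    for i in range(28 if t < 18 else 27)
]
sample = stratified_sample(roster, lambda case: case[0], per_stratum=4, seed=1)
per_team = Counter(team for team, _ in sample)
print(len(sample), len(per_team), set(per_team.values()))
```

Unlike a simple random sample of 120, this design guarantees that every team appears exactly 4 times, which is why the analysis has to account for how the sample was built.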

A cluster sample is much like a two-stage simple random sample. We break up the population into many groups, called clusters. Then we sample a fixed number of clusters and collect a simple random sample within each cluster. This technique is similar to stratified sampling in its process, except that there is no requirement in cluster sampling to sample from every cluster. Stratified sampling requires observations be sampled from every stratum.
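The two stages can be sketched as follows: first draw some clusters at random, then draw a simple random sample inside each chosen cluster. The setup is hypothetical (50 neighborhoods of 40 households each, with 5 clusters and 10 households per cluster sampled), and the function name cluster_sample is an assumption made for this example.

```python
import random
from collections import defaultdict

def cluster_sample(cases, cluster_of, n_clusters, per_cluster, seed=None):
    """Randomly pick n_clusters clusters, then sample per_cluster cases within each."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for case in cases:
        clusters[cluster_of(case)].append(case)
    # Stage 1: a simple random sample of the clusters themselves.
    chosen = rng.sample(sorted(clusters), n_clusters)
    # Stage 2: a simple random sample of cases within each chosen cluster.
    sample = []
    for name in chosen:
        sample.extend(rng.sample(clusters[name], per_cluster))
    return chosen, sample

# Invented population of (neighborhood, household) pairs.
households = [
    (f"neighborhood_{n:02d}", f"household_{n:02d}_{h:02d}")
    for n in range(50)
    for h in range(40)
]
chosen, sample = cluster_sample(households, lambda case: case[0], 5, 10, seed=7)
print(sorted(chosen), len(sample))
```

Only the 5 chosen neighborhoods contribute any households, which is the key contrast with stratified sampling, where every group is represented.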

Sometimes cluster sampling can be a more economical random sampling technique than the alternatives. Also, unlike stratified sampling, cluster sampling is most helpful when there is a lot of case-to-case variability within a cluster but the clusters themselves don't look very different from one another. For example, if neighborhoods represented clusters, then this sampling method works best when the neighborhoods are very diverse.

A downside of cluster sampling is that more advanced analysis techniques are typically required, although the methods in this course can be extended to handle such data.

Exercises

1. Identify the sampling method employed in each situation.

a) A statistics student posts a survey for his Facebook friends in order to gather data for a class project.

b) A college conducts a survey of its students by randomly selecting 10 classes that meet at 8:30 a.m. and surveying all of the students within those classes.

c) An orchestra collects ticket stubs from its subscribers who attend a concert, puts those stubs in a paper bag, shakes up the bag, and draws out 20 stubs, then calls the people who bought those tickets to ask them questions about the type of music they most want to hear on concert programs.

d) The host of a cable talk show asks his viewers to contact him via Twitter to answer "yes" or "no" to a question posed on his program about raising taxes.

e) A political research firm randomly selects 100 voters from each of the 50 states to participate in a phone survey about an upcoming presidential election.

2. A statistics instructor wants to investigate whether online instruction is less effective than a traditional lecture course. At the end of a quarter, he gives the same exam to his online students and to his lecture students, then compares the grades of the online students to the lecture students.

a) What is the explanatory variable in this situation?

b) What is the response variable?

c) Is this an observational study or an experiment?

d) If, on average, the grades for the online section are significantly higher than the grades for the lecture section, can he conclude that online instruction is more effective?