
Defining "Statistics"
The word statistics, like economics or mathematics, denotes a field of academic study. We can define statistics as:
A branch of mathematics that involves collecting, summarizing, analyzing and interpreting data, along with effectively communicating the results.
Data, in turn, means "information" and is the plural of the Latin word datum, which denotes a single piece of information. (For example: Your age would be a datum, while the ages of all the students in your statistics class would be data.)
The English word statistics comes from the German Statistik, meaning "the study of political facts and figures"; first coined in 1749, it was derived from the Latin word statisticus, meaning "of the state."
To this day, statistics remains widely employed in government and political science, but its use has spread across the academic spectrum. Statistics lies at the heart of the scientific method and provides tools essential for everyone from biologists to engineers, psychologists to sociologists, accountants to computer programmers. The explosion of data gathering and storage at the dawn of the 21st century (think Google and Facebook, not to mention the NSA) has focused attention on techniques for effectively analyzing massive quantities of data. Businesses and government agencies offer highly competitive salaries for trained statisticians.
For some of you reading this, an introductory course in statistics may be the first step toward a degree—and a career—in statistics. Many more of you will not follow that path, but you may very well use statistical techniques in your chosen field: even if you end up hiring a statistician to perform more advanced data-gathering and analysis, being able to speak the language of statistics will provide you with an advantage over your co-workers (just as knowing how an automobile works might allow you to more effectively communicate with a mechanic). And for the rest of you, a basic knowledge of the tools and principles of statistical thinking is an essential part of becoming an informed citizen of the modern world.
A second meaning
The data we collect sometimes includes information about all of the people or things we wish to study. For example, each quarter roughly 13,000 students enroll in classes at Edmonds Community College. If a student government officer wants to learn more about the people she serves, these 13,000 students would be the population of interest. The college collects information about each student: for example, the number of credits each student is taking during the current quarter and the student's gender. A parameter is a number that summarizes some attribute of a population: in our current example, it might be the average number of credits for all students enrolled this quarter, or the percentage of all students who are female.
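To make the idea of a parameter concrete, here is a minimal sketch in Python. The five student records are invented purely for illustration (the real population would have roughly 13,000 records), and the field names `credits` and `gender` are our own assumptions, not the college's actual data format.

```python
from statistics import mean

# Hypothetical population: one record per enrolled student.
# These five records are made up; a real roster would be far longer.
population = [
    {"credits": 15, "gender": "F"},
    {"credits": 10, "gender": "M"},
    {"credits": 12, "gender": "F"},
    {"credits": 5, "gender": "M"},
    {"credits": 18, "gender": "F"},
]

# A parameter summarizes an attribute of the ENTIRE population.
mean_credits = mean(s["credits"] for s in population)
pct_female = 100 * sum(s["gender"] == "F" for s in population) / len(population)

print(f"Average credits per student: {mean_credits}")
print(f"Percentage of students who are female: {pct_female}%")
```

Because every member of this small population is included in the calculation, both numbers are parameters.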
In many situations it can be extremely difficult—if not impossible—to gather data from everyone (or everything) in a population. For example, if a political reporter wants to know how all registered voters in Washington state feel about legislation passed on January 1, 2013, to avert the so-called "fiscal cliff," he would face a mightily expensive and time-consuming task if he tried to interview all 3,125,516 Washington voters who participated in the most recent presidential election. Instead, he (or someone he hires) might interview a subset of those voters. We call such a subset a sample drawn from the population.
For example, a poll of 524 Washington voters conducted September 7–9, 2012, by SurveyUSA found that 57% of those surveyed supported Initiative 502, a ballot measure legalizing the use of marijuana, while 32% planned to vote against I-502, with the rest undecided. Here, we call 57% a statistic, defined to be a number that summarizes some attribute of a sample. Unsurprisingly, we use the word statistics to refer to two or more such numbers. In the actual election, which took place on November 6, 2012, 55.7% of Washington voters cast their ballots in favor of I-502; here 55.7% would represent a parameter, because it summarizes data from the entire population. (A common mnemonic for keeping these terms straight: population and parameter both begin with the letter "p," while sample and statistic both begin with "s.")
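The relationship between a parameter and a statistic can be explored with a small simulation. The sketch below is a simplification under stated assumptions: it builds a made-up population of 100,000 voters (far fewer than the real electorate) in which exactly 55.7% vote yes, ignores undecided voters, and then draws a random sample of 524. The names `POPULATION_SIZE`, `TRUE_SUPPORT`, and `SAMPLE_SIZE` are ours.

```python
import random

random.seed(502)  # arbitrary seed so the run can be repeated

POPULATION_SIZE = 100_000  # hypothetical; much smaller than the real electorate
TRUE_SUPPORT = 0.557       # share voting yes in the actual election
SAMPLE_SIZE = 524          # number of voters in the SurveyUSA poll

# Made-up population: True means "votes yes", False means "votes no".
yes_votes = round(POPULATION_SIZE * TRUE_SUPPORT)
population = [True] * yes_votes + [False] * (POPULATION_SIZE - yes_votes)

# Parameter: computed from every member of the population.
parameter = sum(population) / len(population)

# Statistic: computed from a random sample drawn without replacement.
sample = random.sample(population, SAMPLE_SIZE)
statistic = sum(sample) / len(sample)

print(f"Parameter (population): {parameter:.1%}")
print(f"Statistic (sample):     {statistic:.1%}")
```

Changing the seed draws a different sample and typically produces a slightly different statistic, while the parameter stays fixed.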
Under most circumstances, whether statistics refers to the first meaning (a field of study) or the second (more than one statistic) should be clear from the context.
The process of statistical analysis
As indicated in the first definition above, statistics involves much more than mere number-crunching. Typically, we begin by asking a question:
- Are Edmonds Community College students signing up for more credits this quarter, on average, than five years ago?
- Do Washington voters approve of the fiscal cliff legislation passed on January 1, 2013?
- Is high-fructose corn syrup less healthy than "real" sugar?
- Is a new medication more effective at preventing heart attacks than a simple aspirin?
Carefully stating the question allows us to identify the population of interest.
Next, we develop a plan to gather data, which often involves selecting a sample from the population; careful sampling techniques help us find a sample that is reasonably representative of the population. (For example, if our population is all students attending EdCC this quarter, a sample that included only females, or included students who graduated a decade ago, would likely differ in important ways from our population of interest.)
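The effect of a poorly chosen sample can be sketched in a few lines of Python. The roster below is hypothetical: we assume 13,000 students of whom 58% are female, a figure invented for illustration. The helper `pct_female` is our own.

```python
import random

random.seed(2013)  # arbitrary seed so the run can be repeated

# Hypothetical roster of 13,000 students; the 58% female share is made up.
roster = ["F"] * 7540 + ["M"] * 5460


def pct_female(group):
    """Return the percentage of a group recorded as female."""
    return 100 * group.count("F") / len(group)


# A simple random sample: every student is equally likely to be chosen.
random_sample = random.sample(roster, 200)

# A biased sample: only female students are included.
biased_sample = [s for s in roster if s == "F"][:200]

print(f"Population:    {pct_female(roster):.1f}% female")
print(f"Random sample: {pct_female(random_sample):.1f}% female")
print(f"Biased sample: {pct_female(biased_sample):.1f}% female")
```

The random sample should land reasonably close to the population percentage, while the biased sample cannot, no matter how many students it contains.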
Once we've collected our data, we create graphical displays and compute summary statistics to help identify patterns and interesting features. This process, sometimes called descriptive statistics (because it helps describe attributes of our sample), falls under the umbrella of exploratory data analysis (EDA), which also includes some more sophisticated tools. EDA techniques may provide hints about the answer to our initial question, or in some cases suggest better questions to ask. Data mining sometimes applies EDA techniques to search for patterns and relationships in massive data sets.
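As a small illustration of descriptive statistics, the following sketch computes a few summary statistics and prints a crude text histogram. The list of credit loads is a hypothetical sample of eight students, invented for this example.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical sample: credits taken this quarter by eight students.
credits = [5, 10, 12, 12, 15, 15, 15, 18]

# Summary statistics describe attributes of the sample.
summary = {
    "n": len(credits),
    "mean": mean(credits),
    "median": median(credits),
    "min": min(credits),
    "max": max(credits),
}
for name, value in summary.items():
    print(f"{name}: {value}")

# A crude graphical display: one asterisk per student at each credit load.
for value, count in sorted(Counter(credits).items()):
    print(f"{value:>2} | {'*' * count}")
```

Even this simple display shows a feature the mean alone hides: most of these students cluster between 12 and 15 credits, with one student taking far fewer.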
We then often employ the tools of inferential statistics to infer conclusions about the population from the information collected about our sample. If 57% of the 524 Washington voters surveyed planned to vote to legalize marijuana, should we have concluded that a majority of all voters planned to do so? We'll rarely be able to answer such questions definitively, but the tools of inferential statistics often allow us to quantify the degree of (un)certainty with which we can state such conclusions.
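One common way to quantify this uncertainty is a margin of error based on the normal approximation. The sketch below is only a preview under stated assumptions: it treats the 524 respondents as a simple random sample and uses the conventional multiplier 1.96 for roughly 95% confidence. The polling firm's own method may differ.

```python
import math

n = 524       # number of voters surveyed
p_hat = 0.57  # sample proportion supporting I-502 (a statistic)
z = 1.96      # approximate multiplier for 95% confidence (assumption)

# Standard error of a sample proportion under the normal approximation.
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
margin = z * standard_error

lower = p_hat - margin
upper = p_hat + margin

print(f"Margin of error: about {margin:.1%}")
print(f"Approximate 95% interval: {lower:.1%} to {upper:.1%}")
```

Under these assumptions the whole interval lies above 50%, which suggests (but does not prove) that a majority of voters supported the measure at the time of the poll.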
Finally, we need to express our conclusions in a clear, concise manner so that someone who has not taken a course in statistics can readily understand our answer to the initial question.
While it might seem natural to first study how to pose sound investigative questions and then learn about sampling techniques before moving on to descriptive statistics, it actually helps to know how we can summarize data once we have it before we take an in-depth look at data-collection methods. So, for the time being, we will look at various ways to effectively organize and summarize data before returning to these other important topics.
Exercises
1. Mason-Dixon, a nonpartisan polling firm based in Jacksonville, Florida, conducted a phone survey of 800 registered Florida voters (whom the company deemed likely to vote in the November 2012 election) on behalf of the Tampa Bay Times, Miami Herald, El Nuevo Herald, Bay News 9 and Central Florida News 13. The poll, conducted Sept. 17–19, found that 46% of those surveyed planned to vote to re-elect President Barack Obama, while 45% planned to vote for his Republican rival, former Massachusetts governor Mitt Romney.
a) Identify the population.
b) Identify the sample.
c) Are the numbers 46% and 45% mentioned above statistics or parameters?
2. Of the 100 senators in the United States Senate, 20 are women (according to Wikipedia) as of January 3, 2013.
a) Are the 100 senators a sample or a population?
b) If we report that 20% of all senators are currently female, is 20% a statistic or a parameter?
c) If we consider the 100 senators to be the population, would the group of senators from Washington state (Patty Murray and Maria Cantwell) be a representative sample?
d) Wikipedia also reports that 82 of the 433 members of the House of Representatives are female (as of June 4, 2013; two of the 435 seats are vacant). Are these 433 members of Congress a representative sample of all U.S. citizens?
3. On September 21, 2012, The New England Journal of Medicine published an article online ("A Randomized Trial of Sugar-Sweetened Beverages and Adolescent Body Weight") reporting the results of research conducted by a team led by Dr. David S. Ludwig of Boston Children's Hospital.
a) Download the article from the link above and read the abstract on the first page. (Don't worry about terms you don't understand; you'll become familiar with many of them by the end of this course.) Identify the population associated with the study.
b) Identify the sample.
c) The article reports that after one year, the average weight gain for study participants who consumed non-sugary beverages was 1.9 kg less than the average for those who did not modify their beverage consumption. Is 1.9 kg a statistic or a parameter?