Statistics

Visual Organizers in Statistics

Statistics is the grammar of science.
Karl Pearson

The average human has one breast and one testicle.
Des McHale

Nothing like starting with two very different quotations for inspiration. I discovered that among the quotations about statistics, nearly nine out of ten are sarcastic or negative. These were among the more positive ones. It took me close to a minute to make sense of the second one. I hope you were faster. Once you understand, it really is meaningful. We can never assume that the statistical “average” will actually describe anyone at all, and certainly it is unlikely to describe most people.

[Table: biology grades for boys and girls]

Many of you know immediately how visuals are used in statistics. From kindergarten on, we have been expected to interpret and occasionally create graphs. What always surprised me, as a teacher, was how little students know about graphs.

So that’s where we will begin. I created some data about students taking a test.

You really don’t need to spend any time studying the numbers. This could be any set of grades. The only difference is that grades for girls and those for boys are listed separately.

I’m not suggesting you add up the numbers and find the averages or anything that will take any effort. There is no need to actually create a graph. Simply decide what kind of graph or graphs would be the best choice.

Please click to see The Best Choices

What makes a Statistical Graphic Good?
Beginning with Bar and Line Graphs

Many students seem to believe that they can take their data and create any kind of graph they want. It is crucial to use the right sort of graph for the data. Let us use a simple bar graph as an example.

There are some students who would use a line graph, supposing the data showed the number of jobs decreasing. But actually, all of these jobs are increasing. Much more important is that these jobs could be arranged in any order. They could, for example, be in alphabetical order. Then the “line” would zig-zag up and down.

A bar graph is used to compare separate categories of data. It shows that some categories are higher or larger than others, but there is no change from one category to the next.

A line graph is used to show change, usually change over time. It might show a child’s increasing height or weight over a period of years. It might show a store’s increasing and decreasing business by the month. Income may be higher in certain months like around Christmas or other holidays. You might use a line graph to show your car’s changing value in relationship to the miles on the car (something that changes over time.) It might show how much sugar can be dissolved in water as the temperature is increased.

In any of these cases, the independent variable is shown along the x-axis (horizontal). This would include the time, temperature or mileage on your car. The dependent variable (it depends on the time, temperature or mileage) is always shown on the y-axis (vertical).

In recent years there has been disagreement about the terms we use. Traditionally, these are called Line Graphs. But mathematicians are now using that term for something entirely unrelated. (In a specialized area called “graph theory,” line graphs represent adjacency between edges.)

What we commonly call Line Graphs are more properly called Line Charts or Line Plots. Some insist that in a Line Chart, you connect the dots with straight lines, and that in Line Plots you draw a line that doesn’t always touch the dots but seems to be the best fit. For example, if students measured the amount of sugar that dissolved at various temperatures, they would get different measurements. The best line would probably go between the central or more frequent measurements. To add to the confusion, Edward Tufte recommends that you simply show the dots and not connect them.
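For readers who want to see what a "best fit" line means in practice, here is a minimal sketch using the ordinary least-squares method. The temperatures and sugar amounts are made up for illustration, and the function name best_fit_line is my own choice, not something from a particular graphing program.

```python
def best_fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical classroom data: two groups measured at each temperature
temps = [20, 20, 40, 40, 60, 60]           # degrees Celsius
grams = [198, 206, 232, 240, 284, 292]     # grams of sugar dissolved

slope, intercept = best_fit_line(temps, grams)
predicted_at_40 = slope * 40 + intercept   # the line's value at 40 degrees
print(slope, intercept, predicted_at_40)
```

With these invented numbers the line passes through 242 grams at 40 degrees, which is neither of the two measurements taken there (232 and 240). That is the sense in which a fitted line need not touch the dots.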

A few words about Pie Charts

Before computer graphing programs, many people avoided creating Pie Charts (also called Circle Charts). It isn’t hard to understand why. First you had to find the total. In the case of the biology grades, the total number of points is 2,979. Then for each section of data, you needed to calculate what percent it was of the whole. That’s a lot of math. But that’s not the hard part. You then calculated that percentage of the 360 degrees in the circle. The result would be your angle. Then all you needed to do was measure the central angles carefully.
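The hand calculation described above can be sketched in a few lines. This is only an illustration: the three categories and their values are invented, and pie_angles is a name I made up for the example.

```python
def pie_angles(values):
    """Map each label to (percent of the whole, central angle in degrees)."""
    total = sum(values.values())
    return {
        label: (100 * value / total, 360 * value / total)
        for label, value in values.items()
    }


# Invented data: three sections that add up to 100
sections = {"A": 50, "B": 30, "C": 20}
angles = pie_angles(sections)

for label, (percent, degrees) in angles.items():
    print(f"{label}: {percent:.1f}% of the whole, {degrees:.1f} degrees")
```

Each section's share of the total is found first, and the same share of 360 degrees gives the angle to measure with a protractor.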

[Figure: A Pie Chart showing Eras on the Geologic Time Scale. A better title would be Geologic Time in Eras.]

With a computer graphing program, it’s ridiculously simple. You enter the data with a label for each on a spreadsheet. Then the computer creates the chart.

I assume that in more expensive or more advanced programs, I could have created a more meaningful chart. I would have liked my own choice of colors, showing Cenozoic in red, and Mesozoic and Paleozoic in other bright colors. When I was in graduate school, the Time Scale stopped here. Nearly all known fossils were from these eras.

I would show the Eoarchean Era in black. As far as scientists know, there was no life in that time. Then I would show the Neoproterozoic through Paleoarchean Eras in shades of brown. We know much less about these eras than we do about the first three.

The only real advantage of this sort of chart is that it does show their comparative lengths. The usual Geologic Time Scale often lists the time in ma or mya (units meaning million years ago). But the times are so large that a chart drawn to scale would either not be able to include the important data in the first three eras or it would need to be a very long chart.

To fill in a little of the data, the Paleozoic is the era that for many years was considered to have the first fossils, thus the first evidence of life on earth. It includes first jawless fish and later the more familiar jawed fish. There were trilobites, sharks, insects, and eventually the first land plants. It is divided into six periods and includes four major extinctions. The Mesozoic is the era of the dinosaurs and earliest mammals. The Cenozoic is divided into two periods. The Tertiary Period included the development of many new kinds of mammals. The Quaternary includes the development of man. The earliest epoch here is where we find Neanderthals. The most recent time, the Holocene Epoch, includes, toward the end, the rise of modern man.

I tried to divide the Cenozoic Era to show the development of modern man, but even the Quaternary period starting 1.8 million years ago was too small to even show a sliver of color on the chart.
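A rough calculation shows why the sliver disappears. The 1.8 million years comes from the text above. The total span of the chart is my assumption: I take it to be roughly 4,000 million years, about the start of the Eoarchean, so treat the result as approximate.

```python
QUATERNARY_MY = 1.8        # length of the Quaternary in millions of years (from the text)
CHART_SPAN_MY = 4000.0     # assumed total span of the pie chart, in millions of years

# Share of the full circle taken up by the Quaternary
angle = 360 * QUATERNARY_MY / CHART_SPAN_MY
percent = 100 * QUATERNARY_MY / CHART_SPAN_MY

print(f"{angle:.3f} degrees, {percent:.3f}% of the circle")
```

Under that assumption the Quaternary would take up well under a fifth of one degree, far too thin for a graphing program to draw.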

This simple effort resembles a much better developed chart laid out like a clock, showing modern man appearing in the last few seconds of the day.

To see a wonderful version of this clock, click here: The Geologic Clock

All graphs should be honest, clear and easy to understand. Be careful to avoid distortion of your data.

My rules for creating graphs include the following.

1. Make sure you use the kind of graph that is appropriate for your data and what you want to learn or display about that data.

2. Always include a title that accurately describes your data. For example, with the grades on the Biology test, the title “Grades in Biology” is much too vague. “Grades in Mr. Smith’s Biology Test on October 3, 2012” is more accurate. Even better would be “A Comparison of Grades by Boys and Girls in Mr. Smith’s Biology Test on October 3, 2012.”

3. Always divide the numbers along each axis in even units. You can count by 1, 2, 5, 10, 100, or whatever but you cannot count 2, 4, 6, 8, 10, 15, 20. This distorts the graph.

4. It is best to begin your charts with 0 where possible. The temperature axis need not start at zero, because the water would be ice. It can begin where the experiment began. You certainly cannot begin the line at 0 for the car’s value. It will approach zero as it gets older. Graphs of grades are best starting with 0 and going to 100. If you only show the range from 70-90 where the grades are, it appears that those making grades in the low 70s did very poorly. The truth is that no one did poorly in that case.

5. Always label the information on both axes with a general description and the units. Examples would be Time in minutes, Time in years, Temperature in degrees Celsius, Distance traveled in miles, Volume of sugar in teaspoons, Mass of sugar in grams, Value of car in US dollars.
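Two of these rules, even units along the axis and starting at 0, can be checked with simple arithmetic. The sketch below is only an illustration; the function names are my own, and the two grades (72 and 88) are invented to match the 70-90 range mentioned above.

```python
def evenly_spaced(ticks, tol=1e-9):
    """Return True if every step between neighboring axis numbers is the same."""
    steps = [b - a for a, b in zip(ticks, ticks[1:])]
    return all(abs(step - steps[0]) <= tol for step in steps)


def bar_height_ratio(low, high, baseline=0):
    """How many times taller the higher bar is drawn when the axis starts at baseline."""
    return (high - baseline) / (low - baseline)


good_axis = evenly_spaced([0, 5, 10, 15, 20])
bad_axis = evenly_spaced([2, 4, 6, 8, 10, 15, 20])   # step jumps from 2 to 5

honest = bar_height_ratio(72, 88)                    # axis starts at 0
distorted = bar_height_ratio(72, 88, baseline=70)    # axis starts at 70

print(good_axis, bad_axis, round(honest, 2), distorted)
```

With the axis starting at 0, the bar for 88 is drawn about 1.2 times as tall as the bar for 72. With the axis starting at 70, it is drawn 9 times as tall, even though the grades themselves have not changed.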

Edward Tufte, in his book, The Visual Display of Quantitative Information, says, “Tables usually outperform graphics in reporting on small data sets of 20 numbers or less. The special power of graphics comes in the display of large data sets.” (p. 56) He describes six principles for creating graphics. (p. 77)

1. The representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the numerical quantities represented.
2. Clear, detailed, and thorough labeling should be used to defeat graphical distortion and ambiguity. Write out explanations of the data on the graphic itself. Label important events in the data.
3. Show data variation, not design variation.
4. In time-series displays of money, deflated and standardized units of monetary measurement are nearly always better than nominal units.
5. The number of information-carrying (variable) dimensions depicted should not exceed the number of dimensions in the data.
6. Graphics must not quote data out of context.
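Principle 4 may be the least familiar of the six, so here is a small sketch of what "deflated" units mean. All the numbers are hypothetical, including the price index values, and to_real is a name I chose for the example. The idea is simply to restate each year's dollars in the prices of one chosen base year.

```python
def to_real(nominal, index_that_year, index_base_year):
    """Restate a nominal amount in base-year dollars using a price index."""
    return nominal * index_base_year / index_that_year


# Hypothetical store sales and a hypothetical price index
sales = {2000: 100_000, 2010: 125_000}
price_index = {2000: 100.0, 2010: 125.0}

BASE_YEAR = 2010
real_sales = {
    year: to_real(amount, price_index[year], price_index[BASE_YEAR])
    for year, amount in sales.items()
}

print(real_sales)
```

In these made-up figures, sales appear to rise by 25 percent in nominal dollars, but once both years are expressed in 2010 dollars they are the same. A time-series graph of the nominal numbers would show growth that is really just rising prices.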

Avoid creating distortions in your data.

“No one can write decently who is distrustful of the reader’s intelligence, or whose attitude is patronizing.”
E.B. White, The Elements of Style, p. 70

Tufte, page 81, says creating statistical graphics is much the same.

Contempt for graphics and their audience, along with the lack of quantitative skills among the illustrators, has deadly consequences for graphical work: over-decorated and simplistic designs, tiny data sets, and big lies.

There are many common errors made in creating graphs. You need to avoid them.

We mentioned several of these above. When possible, start your list of numbers at 0. Never change the interval size in mid-graph. Never use a line chart unless you are showing how one thing changes in relation to changes in the other.

A website called The Gallery of Visualization (www.datavis.ca/index/php) shows graphs they consider the best and some of the worst. My favorite was seen on Fox TV. Click to see a Strange Pie Chart

I assume readers on this website understand what is strange about this chart. If you don’t, ask a friend.

The best graphics are simple and include only what is important. What Tufte calls “Chartjunk” is all the unnecessary and distracting material. He says computer graphics add to the problem. They allow you to use three dimensional bar graphs that might look interesting but add no information. A simple bar graph is better. Adding color just to make it look interesting is also distracting. Color should help you distinguish different types of data.

According to Tufte, the grid lines can also be a distraction. They should be pale gray instead of black, or removed where not needed. The worst chartjunk, according to Tufte, is the black and white patterns such as slanted lines and cross-hatching or even a checkerboard effect that supposedly made the chart look more interesting. Instead, they often create moiré patterns and distract the poor reader trying to make sense of them. Tufte concludes that it is better to avoid computerized graphics and “tell the story with a table.” (p. 120)

A very helpful website for those who want more information is Perceptualedge.com. For their page showing complex graphics and how they suggest simplifying them for greater understanding, see Perceptual Edge

Students should have many opportunities to use graphs when dealing with data. You will most commonly use them in science.

I will end this page with a final quote from Tufte.

If the statistics are boring, then you’ve got the wrong numbers. Finding the right numbers requires as much specialized skill — and hard work as creating a beautiful design or covering a complex news story. p. 80

Next: Diagrams
