Your Ruby Slippers: five key data takeaways

Hi there, readers! We have so enjoyed having you on this data journey with us. The posts we’ve shared since March are an introduction to data literacy, and we’re wrapping up that theme today. Fear not! This series—Between a Graph & a Hard Place—isn’t going anywhere. We’re just starting a new theme, like the next chapter in a book. (We’re data people, but who can resist a book metaphor?)  

We hope that you’ve learned something—preferably lots of things—and will join us on the next leg of our journey. Based on surveying you, our readers, the new direction we’re taking is to share how you can actually do research and evaluation in the library. After today’s post, we’re going to post every other week. We love writing these, and they take time to write well. If you’re worried you’ll forget when we’re posting again, it’s easy to sign up here to get notified when we have a new post.

We’d like to give a good send off to this chapter and show you how all the posts tie together. As we review each post today, I want you to keep five big themes in mind. These key ideas apply to every area of data literacy and each post from the series connects to them. 

Five themes in data literacy:

  • The quality of research varies. Details matter, so take the time to think about them. 
  • Your common sense will take you far. Does what you’re reading make sense?
  • Our human brain has feelings, biases, and preferences. Stay aware of yours.
  • Researchers are also human. They have feelings, biases, and preferences too.
  • When considering what data mean, err on the cautious side. What do we know from these data? What is more of a guess?

These themes are your data literacy ruby slippers. You have them with you all the time, and if you start to feel lost or confused, they can show you the way home. You just have to remember you have them! With these big themes in mind, we’re ready to review data literacy.

How to compare apples to oranges

  • When data are compared, think carefully about what two things are being compared and if they are truly similar to each other.
  • One way to make things more comparable is to use per capita, or per person, data.
  • Comparisons can be messy. Keep your thinking cap on.
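
If you like to see the arithmetic, here is a minimal sketch of a per capita comparison. The two libraries and all of their numbers are made up for illustration; they do not come from any of our posts.

```python
# Hypothetical figures for two libraries (illustration only).
libraries = {
    "Big City Library": {"circulation": 900_000, "population": 600_000},
    "Small Town Library": {"circulation": 45_000, "population": 15_000},
}


def per_capita(total, population):
    """Return the total divided by the number of people served."""
    return total / population


# Checkouts per person for each library.
rates = {
    name: per_capita(figures["circulation"], figures["population"])
    for name, figures in libraries.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} checkouts per person")
```

In this made-up case the big library circulates far more items in total, but the small one circulates twice as many per person.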

Habits of mind for working with data

  • Give yourself permission to struggle and get help.
  • Acknowledge your feelings about the topic.
  • Whether you like the data or not, that information gives you an opportunity to learn.

Do the data have an alibi?

  • The quality of the data matters.
  • Where were the data published and when were they collected?
  • Who the authors are is also important. What is their area of expertise? Why did they publish this?

What’s typical and why does it matter?

  • Means and medians are measures of what’s typical.
  • Knowing what’s typical can be very helpful for comparisons.
  • The mean (average of a data set) is impacted by extreme values, so sometimes the median (middle value in a data set) is more representative. 
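
Here is a small sketch of how one extreme value pulls the mean away from the median. The attendance numbers are hypothetical, chosen only to make the effect easy to see.

```python
from statistics import mean, median

# Hypothetical program attendance; the last value is an extreme one.
attendance = [12, 15, 14, 13, 16, 120]

typical_mean = mean(attendance)      # pulled upward by the 120
typical_median = median(attendance)  # middle of the sorted values

print(f"mean:   {typical_mean:.1f}")
print(f"median: {typical_median:.1f}")
```

Five of the six programs drew between 12 and 16 people, so the median describes a typical program better than the mean does here.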

Correlation doesn’t equal causation

  • Correlation is one way that two variables relate to each other.
  • A strong correlation is when we can predict with a high level of accuracy the values of one variable based on the values of the other. They co-occur. 
  • Causation is different because it’s a cause and effect relationship: we know that A leads to B. 
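
One common way to put a number on a correlation is the Pearson correlation coefficient, which runs from -1 to 1. The sketch below computes it by hand for two hypothetical variables; the variable names and values are invented for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)


# Hypothetical monthly figures (illustration only).
storytimes_offered = [2, 4, 6, 8, 10]
picture_books_checked_out = [110, 205, 290, 410, 495]

r = pearson_r(storytimes_offered, picture_books_checked_out)
print(f"r = {r:.3f}")
```

A value close to 1 means the two variables rise together very predictably. It does not tell us that more storytimes caused more checkouts; something else could be driving both.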

The right data for the job – part 1

  • Do the data collected make sense based on the research question?
  • What data were collected and how they were collected are both important.

The right data for the job – part 2

  • Definitions impact what data are collected and how they are interpreted. 
  • The data collected for research are usually a sample of a larger population.
  • To be representative, the sample needs to reflect the population in key ways.

Visualizing Data: a misleading y-axis

  • The y-axis (vertical axis) does not always begin at zero on a chart.
  • The y-axis may be shown on a larger or smaller scale (zoomed in or out).
  • Depending on how the y-axis is displayed, the data will look different—which can highlight or obscure differences between groups or changes over time.

Visualizing Data: the logarithmic scale

  • Logarithmic (or log) scales are another way to display the y-axis. 
  • On a log scale, the distances between intervals increase by a percentage: multiplying by x each time.
  • Log scales are useful because they show rates of change—the percent something increases or decreases.

Visualizing Data: color

  • Color can help you understand visual information, but it can also confuse or mislead you.
  • We have feelings about colors and their meanings, which are not always conscious.
  • Red holds a special place in our brain. It says “pay attention.” 

Visualizing Data: choosing the right chart

  • The best chart for showing change over time is a line or bar chart. 
  • The best chart for showing multiple variables is a bar chart.
  • The best chart for comparing something to the total is a pie chart.

Here we are, at the end of this chapter! We are delighted to have come this far. Knowing that these blog posts have been useful for you all makes us so happy. Please join us on July 29th to continue the journey. We look forward to seeing you then!

Visualizing Data: choosing the right chart

If you walk into a hardware store, you might see an entire aisle of screws—short ones, long ones, phillips head, flat head, ones with weird little anchors on the ends. They might all be screws, but they each serve a specific purpose—for wood or cement, for different screwdrivers, for thick or thin materials. It’s the same with data visualizations. They might all be charts, but pie charts, bar charts, and line charts all serve a different purpose. When data visualizers use the wrong one (often unintentionally), you’re left with a chart that doesn’t really make sense. 

Below are charts using the same data—the number of reference questions, by topic, asked each month from January through April. Let’s take a look at what information we can gather based on how those data are displayed in the visualization.

Line Charts

Line charts are commonly used to track changes over a period of time. They have a y-axis (up and down) and an x-axis (left to right) to plot two different variables. While a bar chart can also be used for this purpose, a line chart is particularly helpful when smaller changes exist or when you’re comparing changes over the same period of time for more than one group, like in the chart above. 

Here we can see that something might have happened in February to cause healthcare, business, and employment to all increase. Homework questions dropped off a bit though. Did schools give kids time off before online learning started? We know to investigate those questions because the line chart helps us identify trends. 

Pie/Donut Charts

Pie/donut charts should only be used to compare parts to a whole. Each category is associated with a slice of the pie which corresponds to that category’s proportion (or percentage) of the total. We can see that the majority of questions asked during this time period were about employment because it’s the largest slice. The fewest questions were about genealogy. However, there’s a lot we can’t see. For instance, we have no idea how many reference questions in each category were asked in each month. We can’t see if there was a spike in healthcare questions in February when flu season hit its peak.

If you added up the values of each slice, they would equal 100 percent because each slice of the chart is determined by dividing the part (the number of questions on one topic) by the whole (the total number of reference questions). As a reader, a huge red flag should go up if they don’t (unless the chart states it’s due to rounding). Sometimes pie charts will only have a legend that tells you what each slice represents, rather than data labels. In these cases, it’s even harder to discern how slices compare to one another because our brains are terrible at making spatial comparisons between circular areas. In general, pie charts should not contain more than five slices. When they do, the chart becomes difficult to read and some slices might be so small that you can’t interpret them anyway, rendering the data visualization pretty much useless. 
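
To make the slice arithmetic concrete, here is a short sketch that turns counts into percentages and checks that they add up to 100. The topics echo the example chart, but the counts themselves are hypothetical.

```python
# Hypothetical counts of reference questions by topic (illustration only).
questions = {
    "Employment": 120,
    "Healthcare": 80,
    "Business": 60,
    "Homework": 30,
    "Genealogy": 10,
}

total = sum(questions.values())  # the whole

# Each slice is the part divided by the whole, expressed as a percentage.
shares = {topic: count * 100 / total for topic, count in questions.items()}

for topic, share in shares.items():
    print(f"{topic}: {share:.1f}%")

print(f"All slices together: {sum(shares.values()):.1f}%")
```

If a published pie chart’s labels do not sum to roughly 100 percent, that is the red flag to look into.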

Bar Charts

Bar charts are used to compare things between different groups or to track changes over time. They can also be used to present data that sum to more/less than 100 percent because, unlike pie charts, they aren’t limited to presenting parts to a whole. Like a line chart, they have an x-axis and y-axis, but bar charts aren’t confined to using a unit of time across the x-axis. For instance, a bar chart could use a demographic variable like age group. They can also be stacked, like in the example below.

Conclusion

When looking at charts, think about whether the one the creator chose makes sense for the data story they’re trying to tell. Are they talking about changes over time, comparisons between multiple groups, or how much something makes up of the total? If the story doesn’t match the visual, be careful about drawing any conclusions based on the chart. In addition, 3D renderings of any of these charts are likely to cause distortion and be visually inaccurate, even if it’s the right type of chart for the job. Here’s a nifty cheat sheet that always helps me recall when each chart should be used, and some important notes to remember: 

  • If it’s talking about something changing over time, it should be a line or bar chart 
  • If it’s talking about multiple variables, it should be a bar chart
  • If it’s talking about comparing something to the total, it should be a pie chart.

LRS’s Between a Graph and a Hard Place blog series provides strategies for looking at data with a critical eye. Every week we’ll cover a different topic. You can use these strategies with any kind of data, so while the series may be inspired by the many COVID-19 statistics being reported, the examples we’ll share will focus on other topics. To receive posts via email, please complete this form.

Let us know what you think!

When the COVID-19 pandemic began a couple of months ago, we at LRS began thinking about how we could help. What skills could we share that might be useful to library staff and our communities?  So many different sources were releasing charts and graphs to help us all understand what was happening, and we were all trying to process a lot of data every day. LRS created the Between a Graph and a Hard Place blog series to provide strategies for looking at all kinds of data with a critical eye—strategies that could be used in a library or in our everyday lives. 

We are wrapping up the first part of that series and we would love to get your feedback about what worked, what didn’t, and what you think we should do next. Don’t worry—we’re going to keep writing these posts for you! However, in lieu of publishing a post this week, we have created a survey to collect your thoughts to help guide our future posts. If you have ten minutes, we would greatly appreciate it if you’re able to fill it out. 

Thank you so much and see you next week! 

Visualizing Data: Color

I love color. As long as I can remember, I have kept my crayons organized in rainbow order. It makes me happy to see them that way! It’s a little tedious with the magical 64 pack of crayons, but totally worth it. I am an extreme example, but humans in general are visual creatures. Color impacts how we perceive and understand visual information—including graphs, charts, and infographics. 

A good data visualization combines a thoughtful display of the data with strong art and design principles, including color. Our brains are wired to pay attention to color, even if some of us perceive it differently (read more here). While color can help you understand visual information, it can also confuse or mislead you. Understanding the principles that data visualization designers use can give you insight into the role that color plays when you process visual information. 

When we make charts at LRS, we try to use several different shades of one color or one main color and a highlight color. Why just one or two colors? Believe me, if it worked, I would make all of our charts look like rainbows. The problem is that for each color you use, a viewer has to process how they personally feel about it and what that color symbolizes in our culture. Then they have to sort out what that color means in the chart. 

Our emotional reactions to color are not always conscious. If I went to the dentist and found myself sitting in a neon yellow waiting room, I would become incredibly anxious, but I may not know why. Designers spend a lot of time studying color and use it strategically, which is both good and bad for you, the viewer. The power of color can help you understand and it can emotionally manipulate you. 

What’s your favorite color? Do you know why? What about your least favorite color? Why? You carry around those preferences in your brain all the time. We’re going to look at some examples now, and I want you to keep track of how you feel about the colors.

Look at that beautiful rainbow! These pale shades of basic colors make me think about spring and a happy version of childhood where nothing ever goes wrong, like a fanciful children’s book about talking animals. As a designer, I would use these colors to evoke viewers’ sense of nostalgia about childhood before I talk about children’s programming at the library. 

As a viewer, I’m distracted by the colors even though I like them. I really like that shade of green, so I just want to think about that column. Is the green column the most important data in this chart? I have no idea. My eye also keeps getting drawn to the red color—is that where I’m supposed to focus? While these colors are all different, they still have a similar level of saturation or brightness. What happens when that is not consistent?

This chart is really hard for me to interpret as a viewer. I think I’m supposed to focus on 2010; I can barely pull my eyes away from it. The data from 1980 are bright too. I don’t know why the data for 1980 and 2010 are shown in brighter, more saturated colors. I’m losing track of 2020, even though it has the largest value. 

Two color choices are creating a lot of confusion here. One is the use of red. Red holds a special place in our brain (read more here). It’s one of our brain’s priority colors—meaning that we are particularly skilled at perceiving it and its different shades. The cultural symbolism of red is also important. Think about the places red shows up in our world: stop signs, stop lights, warning symbols, and sports cars. Red says to us: pay attention. And sometimes also “bad” or “danger.” We can’t help but stare at the saturated red color and assume it’s important. 

The second confusing choice is the saturation of the colors. The green column is as saturated as the red column, which makes me assume it is the second most important data here. My intuition thinks more color = more attendance, but in this chart the two most saturated columns are not the ones with the highest values. Overall, color is not helping here.

Ah, ok. I’m still not sure what the takeaway message from this chart is, but at least I don’t want to run away from it. It’s easier for me to think about the data now that I’m not distracted by the colors. I can focus and develop some questions. The one thing that is missing is a visual cue about where I should focus or what is most important.

Ah, there’s my cue! This chart provides both a cohesive experience and a good indication of where the viewer is supposed to focus. I don’t need to spend a lot of energy deciphering it. I still don’t know what happened in 2010, but I feel curious and ready to find out more. The use of color augments my understanding of the data.

Out there in the wilds of the internet, there are some data visualizations where color is a barrier to understanding the data or used to elicit an emotional response. As a viewer, you don’t get to change the colors to be less distracting or add in a helpful cue about where to focus. If only we could! Instead, notice if the colors are distracting you or producing a strong emotional reaction and do your best to work around it. Often that means focusing on the data in spite of the colors. I have also printed data visualizations in grayscale to strip the color out myself. 

I could go on about color, but I want you to get back out there, using these skills! If you want to learn more about color, I recommend this episode of the podcast Radiolab.


Visualizing Data: the logarithmic scale

Welcome to part 2 on data visualizations. If you’re just joining us, we talked last week about how the y-axis can be altered to mislead a reader about the data. You can find that post here. Now, let’s jump right back into another big data visualization misunderstanding. 

The goal of data visualizations is to allow readers to easily understand complex data, but sometimes it’s the data visualization that we don’t understand. Certain techniques are utilized because they are the best fit for the data—not the best fit for the reader—and that can cause quite a bit of confusion if we don’t know what we’re looking at! Such is the case with logarithmic scales, which most people are unfamiliar with, but encounter all the time. Let’s break it down together.

That scale is growing out of control! 

Logarithmic (or log) scales are simply another way to display your y-axis. Unlike linear scales, where the distance between each interval increases by the same amount (adding x each time), the distance between each interval in a log scale increases by the same percentage (multiplying by x each time). Log scales are useful because they show rates of change—the percent something increases or decreases.

Imagine your library grows its print collection yearly by 100 percent. That means every year you double the number of books on your shelves. The first year you have only 192, the next year 384, then 768…1,536…Fifteen years later you’d have more than 3 million books! Good luck using a linear scale to show that kind of growth in your annual report. A better option would be to use a log scale where you can show your collection has grown annually by 100 percent. Take a look at the same data using a linear scale versus a log scale. Can you tell which one is which?

That’s right, the one on top uses a log scale (x10) and the one on the bottom uses a linear scale (+1 million). As you can see, the linear scale makes these data look like you didn’t get any books until eight years after you opened! However, if you weren’t familiar with log scales you might also think you increased your collection by the same number of books every year, instead of at the same rate.
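
The doubling example can be worked through in a few lines. This is a sketch under one assumption: we start at 192 books in year one and count fifteen years in total. It prints each year’s collection next to its base-10 logarithm, which is the position a log scale would plot.

```python
from math import log10

START = 192   # books in year one, from the example
YEARS = 15    # assumption: fifteen years in total, counting year one

# Doubling every year is 100 percent annual growth.
collection = [START * 2 ** year for year in range(YEARS)]

for year, books in enumerate(collection, start=1):
    print(f"year {year:2d}: {books:>9,} books   log10 = {log10(books):.2f}")

# Year-to-year steps as a linear scale sees them (books added)...
linear_steps = [b - a for a, b in zip(collection, collection[1:])]
# ...and as a log scale sees them (the same distance every year).
log_steps = [log10(b) - log10(a) for a, b in zip(collection, collection[1:])]
```

The number of books added each year keeps growing, while the step on the log scale stays the same every year. That constant step is the steady 100 percent growth rate.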

Let’s say instead of expanding your book collection by 100 percent annually, that growth rate begins to slow down after eight years. You still increased your collection by 27,000 books in the last year, but the log scale might make you assume you got fewer books than you did the first couple of years. This flattening effect is often misleading, but it simply shows a decrease in the rate, not in absolute numbers. 

Log scales have their advantages and are often used to display data that cover a wide range of values or numbers that are growing exponentially. For epidemiologists who study disease spread, log scales allow them to chart the first outbreak (often a couple of people) up to community or global spread. The volcanic explosivity index and the Richter scale, which measures earthquakes, are other common uses of a log scale.


Like we mentioned last week, data visualizations are all about conveying the data’s story. When you see a log scale, remember that the story is about the rate of change, not the absolute numbers. Understanding how and why certain data visualization tactics are used will help you read any data story. Next week we’ll cover some new tactics so be sure to join us! 

Visualizing Data: a misleading y-axis

With great power comes great responsibility—that’s how I feel about data visualizations. Good ones help readers quickly understand the data and can convey an important message to a lot of people. However, bad data visualizations can intentionally or unintentionally mislead, causing us to come to the wrong conclusions. In this multi-part post, we’ll unpack some of the most common mistakes and give you the tools to spot them. 

Omitting the baseline

Imagine you’re telling someone the plot of a novel, but you start in the middle. All of a sudden the protagonist slays the dragon, which is a huge jump from introduction to climax. The character arc would feel pretty extreme, right? This is the literary equivalent to omitting the baseline in data visualizations. 

Omitting the baseline means the y-axis (height of the graph) doesn’t start at zero, resulting in a truncated graph. Truncated graphs might be used unintentionally to save space or intentionally to make one group look better than it should. Take a look at the truncated graph below. The creator of this visualization (me) made a design choice to start the y-axis at 3,000 instead of zero because all of the data were around 3,000. Then it’s easier to see the differences, right? 

 Yes, but now Library A appears to have circulated more than twice as many books as Library B or C this month and that’s just not true. Below are the same data with the baseline. In comparison, all three libraries circulated about the same number of books. The difference between Library A and C appears much less significant. 

Some graphs might leave off the y-axis entirely for a cleaner look, making it harder to tell if the data are truncated. Ask yourself if the different columns look proportional. For instance, a column for 3,200 should look nearly as tall as a column for 3,500, not half its size. If it doesn’t, then the baseline isn’t zero. 
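
Here is a quick sketch of why a truncated baseline exaggerates differences. It compares how tall two columns are drawn when the axis starts at zero and when it starts at 3,000. The values 3,500 and 3,200 are the ones mentioned above; the function name is just something we made up for this illustration.

```python
def apparent_ratio(taller, shorter, baseline):
    """How many times taller one column is drawn than the other
    when the y-axis starts at `baseline` instead of zero."""
    return (taller - baseline) / (shorter - baseline)


for baseline in (0, 3_000):
    ratio = apparent_ratio(3_500, 3_200, baseline)
    print(f"y-axis starts at {baseline:>5,}: "
          f"the 3,500 column looks {ratio:.2f} times as tall as the 3,200 column")
```

With the baseline at zero, the columns differ by about 9 percent. Starting the axis at 3,000 makes the same difference look like two and a half times as much.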

Manipulating the y-axis

Manipulating the y-axis can be thought of as the exact opposite of truncating data. This visualization tactic is used to blow out the scale of a graph to minimize or maximize a change. For instance, this graph shows average annual global temperature from 1880 to 2015. 

Source: National Review

The scale goes from -10 to 110, which I suppose is the range of possible temperatures in Fahrenheit. However, this scale doesn’t make sense for these data. Instead, it serves to flatten the line and convey the idea that average annual global temperatures really haven’t changed in the last 135 years. Here’s the graph again with a more meaningful y-axis. Now we can see the upward trend more clearly.

Source: Quartz

You may have noticed this graph has a truncated scale (missing baseline)! So why is it OK in this situation and not others? When talking about scales and axes, one way isn’t necessarily wrong or right; it’s about the message the visualization is designed to convey. You want to make sure that the data are not visualized to be intentionally misleading, making you think something is more or less important than it really is.


None of these data visualization tactics that we covered today are inherently wrong. Remember that data visualizations are all about conveying the data’s story and like any story, people can take creative license. It’s important to be able to spot these scale manipulations to avoid getting the wrong idea about what the data are really telling you.

Next week we’ll cover some more common tactics, so stay tuned! 

The right data for the job – part II

Hello, again! Are you ready to learn more about the right data for the job? We are reviewing the qualifications of various data to answer different kinds of research questions, just like we would review job candidates’ qualifications for a job. Last week we talked about the importance of what data were collected and how they were collected. This week we’re going to consider the importance of definitions and what it means for data to be representative. I know you have been waiting anxiously to figure out what we were going to do with those hats, so let’s jump back in! 

1) Can you define that for me?

In research, definitions matter a lot. How researchers define important concepts impacts both what data are collected and how they are interpreted. For instance, last week we talked about collecting data by looking at something people created – hats in a knitting class – and whether these hats could be defined as a “success.” Those hats can be used as our data, but we need to specify how we are defining success.

So how does one go about measuring if a hat was “successful”? Is a hat successful simply if it is completed? Or, does it need to be round and fit on someone’s head? What if it’s too itchy for any human to wear, but a cat decides it’s an amazing toy? To come to a conclusion about the success of these particular hats, and use that to evaluate the success of the program, researchers need to make decisions about these types of questions and how they relate to the research question. 

As a reader of research, look for a clear connection between the research question, how the concepts being studied are defined, and the conclusions that were drawn. They should all align. When you’re researching a new topic, be aware that there can be wide variety in how a concept is defined in different fields and by different researchers.

2) Representative

Do the data actually represent the thing that is being studied? Let’s say you want to know how many people in your service area read a book last month. You could call every single person to ask, but this is unrealistic because of the resources it would require. An alternative approach is to collect data from a sample of the population. In this scenario, everyone in your service area is the population and your sample is the people you actually collect data from. 

Creating a truly representative sample is difficult because it must meet these criteria:

  1. Your sample should equal a certain percentage of your population. There are tools, like this one, to easily calculate what your sample size should be.  In general, if your population is smaller than 100, you should be surveying everyone. 
  2. Every member of the population needs to have an equal chance of being included in the study – meaning that the sample is randomly selected. This reduces bias and the potential for certain groups to be over-represented and their opinions magnified while others are under-represented. 

Results from a sample can be generalized to the population if it meets these criteria. 

What if the sample doesn’t meet these criteria? Then, check for another criterion – whether the sample otherwise mirrors the characteristics of the population.

Let’s say your sample size is 250, so you ask the first 250 people who walk into the library if they read a book last month. These data are going to be skewed because not everyone in your service area visits the library and those individuals that don’t haven’t had a chance to participate. Those that walk in also might not be representative of your population. For instance, if 50 percent of your population has a college education, 20 percent are African American, and 10 percent are above the age of 65, your sample should also reflect that.  
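
The difference between asking the first 250 people through the door and drawing a random sample can be sketched in code. Everything here is hypothetical: a service area of 10,000 residents, 30 percent of whom visit the library. The seed is fixed only so the sketch gives the same result each time.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical service area: 10,000 residents, 30% of whom visit the library.
POPULATION_SIZE = 10_000
residents = [
    {"id": i, "visits_library": i < 3_000} for i in range(POPULATION_SIZE)
]

SAMPLE_SIZE = 250

# Walk-in sample: the first 250 people through the door are all visitors.
walk_ins = [r for r in residents if r["visits_library"]][:SAMPLE_SIZE]

# Simple random sample: every resident has the same chance of being chosen.
random_sample = random.sample(residents, SAMPLE_SIZE)


def share_of_visitors(people):
    """Fraction of the given people who visit the library."""
    return sum(person["visits_library"] for person in people) / len(people)


print(f"population:    {share_of_visitors(residents):.0%} visit the library")
print(f"walk-ins:      {share_of_visitors(walk_ins):.0%} visit the library")
print(f"random sample: {share_of_visitors(random_sample):.0%} visit the library")
```

The walk-in sample is made up entirely of library visitors, so it cannot speak for residents who never come in. The random sample should land close to the population’s 30 percent, and the same kind of check works for any characteristic you know about the population.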

When reading research, check to see whether the sample meets the three criteria above. If it doesn’t meet the first two, you can be more confident that the results are still somewhat representative of the population if the demographics of the sample are similar to the population’s.

Getting a representative sample can be challenging, and researchers may acknowledge that some groups were over- or under-represented in their study. That doesn’t mean that research can’t provide valuable information. It does mean that this particular research may not be able to draw accurate conclusions beyond those individuals who participated in the study, or about the groups that were under-represented in their study. Be cautious about research that does not acknowledge or discuss significant differences between the population and their sample. 


You made it! You are ready to go interview some data! Let’s review: research results are based on data, and the quality of those data matters. Do the data collected actually answer the question that was asked? What data were collected, how they were collected, what definitions were used, and whether the data are representative all impact the quality and interpretation of the data. You don’t need to be an expert to consider whether the data used are really answering the research question. Use your common sense and these tips to think critically about the right data for the job.

The right data for the job

Imagine that different types of data are different people that you are interviewing for a job. The job is to answer a specific research question. You want to know what their qualifications are—will they do a good job? Are their strengths a good match for the task? Like we’ve mentioned before, if there are issues with the underlying data, then there will be issues with the results. It’s important to consider if the data collected make sense for the question that was asked. 

Today, the data “qualifications” we’ll be looking at are what kind of data were collected and how the data were collected. Next week, in part two of this post, we’ll cover definitions and what it means for data to be representative. Are you ready to get in there and interview some data? Come on, it’ll be fun!

1) What data were collected? 

The data collected need to be appropriate for the research question. One step of deciding what data are most appropriate is selecting quantitative or qualitative data. Quantitative research is concerned with things we can count. How many books were checked out last year? Of all the people who have a library card, how many of them checked out at least one item last year? Qualitative research is concerned with capturing people’s experiences and perspectives. Why did someone check out an item? How did having that item make them feel? Qualitative research often aims to give us insight into the thinking or feeling behind an action. 

Which type of data is most appropriate depends on the research question. If the researchers want to know how many children completed a summer learning program, the quantitative data from registration and completion work well. If researchers want to know what children got out of their experience, it would be more appropriate to collect qualitative data on a question like, “Please tell us more about your experience with our summer programming. What did you enjoy? What could be improved?”

What doesn’t work is if the researcher wants to know about the thinking behind an action, but asks for a number. Or if the researcher wants to know a number, but asks participants how they felt about their experience. The key is for the research question and the data collected to be a good match. 

2) How were the data collected?

Once researchers figure out what they want to know, they have to decide how to collect the data. There are many ways to collect data—too many for this post. A few common ways that data are collected in libraries are to ask people (such as with surveys or interviews), observe people, or evaluate something they created. Each of these strategies is more appropriate in some situations than others.


Asking is great when you want to understand someone’s experience. People understand what’s going on internally for them. An example is asking participants what they wanted to get out of an experience and what they felt like they did get out of it. They are the experts on that. Asking, in particular using surveys, is also an effective strategy for collecting a large amount of information relatively quickly.

There are a few challenges with asking people. One issue is that we all want to feel good about ourselves and look good to others. Even if I receive an anonymous survey in the mail, I may put down an inaccurate answer if it fulfills these desires. We call this social desirability bias.

For example, if you want to know how many books I checked out last year, I would be happy to tell you. The problem is that I don’t remember. Maybe 50? Maybe 100? Or maybe I do remember, but I want to tell you it was 500 because that makes me feel good.

The second issue is that sometimes we simply don’t know, but we’ll guess if you ask us to. Our memory is unreliable. When asking people to report on something, it needs to be something that they can report relatively easily and accurately. A final issue is that sometimes people don’t understand the question as the researchers intended it.

Researchers keep these challenges in mind so they can take steps to mitigate them. Some common practices are to avoid asking highly sensitive questions, to ask about things that are easy to remember, and to test survey and interview questions in advance for understanding.


Observation enables researchers to directly witness people’s behaviors and interactions, rather than relying on their self-report. This can be very useful in situations where people don’t know or struggle to articulate the information researchers are trying to collect.

For example, let’s say a researcher wants to know more about children’s experiences at a storytime. Children are still learning words to express how they feel. If children who attended storytime were asked what they got out of the experience, they might say it was fun. They are probably not going to say they worked on fine motor skills and social skills. An observer, however, can collect data about the activities during storytime and watch children’s interactions with each other and their facial expressions.

The challenge with any observation is that the information is captured or analyzed by a person. We are all biased and subjective in unique ways, and notice some things and miss others. To mitigate this challenge, researchers use a structure or guide while observing and, when possible, more than one observer. 

Create something

Creating something is a good way to collect data in situations where researchers are evaluating skills. Let’s say there was a knitting class at the library. The main goal was for participants to knit a hat. At the end of the class, every participant selects their best work and those hats all go on display in the library. Those hats are the data. To determine if the class’s goal was achieved, you would look at all the hats and decide if they were “successful” hats.

As you read research, take note of the reason why researchers chose a particular method of collecting data. Does their reasoning make sense? Do the strengths of that method match well with the research question? What’s important here again is the match between the research question and how the data were collected. 


Wow! That was a lot. How does your brain feel? A little gooey? Very understandable. Remember, you need the right data for the job. What data were collected and how they were collected are two ways to see if the data are a good fit for the research question. We’ll cover the other data qualifications next week: definitions and what it means for data to be representative. And we’ll talk more about those hats! 


Correlation doesn’t equal causation (but it does equal a lot of other things)

Correlation ≠ causation

Data are pieces of information, like the number of books checked out at the library or reference questions asked. Those pieces of information are simply points on a chart or numbers in a spreadsheet until someone interprets their meaning. People create charts and graphs so that we can visualize that meaning more easily. However, sometimes the visualization misleads us and we come to the wrong conclusions. Such is the case when we confuse correlation (a statistical measurement of how two variables move in relation to each other) with causation (a cause-and-effect relationship). In other words, we assume one thing is the result of the other when that might not be the case.

Strong correlation = predictability

The confusion often occurs when we see what’s called a strong correlation—when we can predict with a high level of accuracy the values of one variable based on the values of the other. As an example, let’s say we notice our library is busier during the hotter months of the year, so we start writing down the temperature and number of people in the library each day. Our two variables are temperature and number of people. A graph representing these data might look like this: 

This graph is called a scatterplot, and researchers often use it to visualize data and identify any trends that might be occurring. In this case, it looks like as the temperature increases, more people are visiting the library. We would call this a strong positive correlation, which means both variables are moving in the same direction with a high level of predictability.

Correlation = positive or negative; weak or strong

You can also have a strong negative correlation, which would show one value increasing as the other decreases. It would look something like this, where the number of housing-insecure patrons in the library is decreasing as the temperature outside increases.

The closer the points are to forming a compact sloped line, the stronger the correlation appears. If the points were more scattered, but we could still see them trending up or down, we would call that a “weak” correlation. In a weak correlation the values of one variable are related to the other, but with many exceptions. 

Correlation = a statistical measurement known as r

Without getting too deep into statistical calculations, you can determine how strong a correlation is by the correlation coefficient, which is also called r. Values for r always fall between 1 and -1. 

  • The closer r is to 1, the stronger the positive correlation is. In the first example graph above, if r = 1, this would mean there is a uniform increase in temperature and patrons visiting the library, with no exceptions. An 80-degree day would always have more visitors than a 75-degree day. The points on the graph would form a straight line sloping up. 
  • The closer r is to -1, the stronger the negative correlation is. In the second example graph above, if r = -1, this would mean there is a uniform increase in temperature and decrease in housing-insecure patrons visiting the library, with no exceptions. A 40-degree day would always have fewer housing-insecure patrons than a 35-degree day. The points on the graph would form a straight line sloping down.
  • The correlation becomes weaker as r approaches 0, with a value of 0 meaning there is no correlation whatsoever. Knowing the value of one variable tells us nothing about the value of the other. In the first example above, if r = 0, one 80-degree day may have more visitors than a 40-degree day, whereas a second 80-degree day may have fewer visitors than a 40-degree day. There is no consistent pattern.
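If you are curious what the calculation behind r looks like, here is a minimal sketch in Python. The function name pearson_r and all of the numbers are made up for illustration; they are not the data behind the graphs in this post. The sketch uses the standard Pearson formula, which is the most common way r is calculated.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the correlation coefficient r for two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together around their means.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("r is undefined when one variable never changes")
    return co_movement / sqrt(spread_x * spread_y)


# Hypothetical daily temperatures (made-up values).
temperatures = [40, 50, 60, 70, 80]

# Visitors rise by the same amount for every 10 degrees: a perfect line up.
visitors_perfect = [100, 120, 140, 160, 180]
# Visitors generally rise, but with exceptions.
visitors_noisy = [110, 105, 150, 145, 190]
# Patrons fall by the same amount for every 10 degrees: a perfect line down.
patrons_falling = [50, 40, 30, 20, 10]

r_perfect = pearson_r(temperatures, visitors_perfect)
r_noisy = pearson_r(temperatures, visitors_noisy)
r_falling = pearson_r(temperatures, patrons_falling)

print("perfect positive:", round(r_perfect, 3))
print("noisy positive:  ", round(r_noisy, 3))
print("perfect negative:", round(r_falling, 3))
```

The perfectly straight lines land at the two ends of the scale, and the data with exceptions land somewhere in between.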

Correlation = an observed association 

Let’s focus on the first chart. If we did this calculation, we would find that r = 0.947. Should we conclude that high outside temperatures cause more people to visit the library? Does that mean we should crank up the air conditioning so we can draw in more visitors? Not so fast. 

All we can conclude from these data is that there is an association between the outside temperature and people in the library. It’s a good first step to figuring out what is going on, but it’s not possible to conclude temperature causes people to visit or not visit the library. There could be other causes at play. We call these lurking variables.

Correlation (might) = something else entirely

A lurking variable is a variable that we have not measured, but that affects the relationship between the other two variables (outside temperature and number of people in the library). Warmer weather usually occurs in the summertime when kids are out of school. So the increase in the number of people could be because of your summer reading program and kids having more time to come visit. The temperature outside might also affect the hours of your library. Did you have to close often during the winter because of snowstorms? Maybe you operate longer hours in the summer because you know it’s busier that time of year.
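To see how a lurking variable can create a strong correlation, here is a small sketch in Python with made-up numbers. We assume, purely for illustration, that visits depend only on whether school is out and not on temperature at all. The helper pearson_r and every value below are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient r for two equal-length lists (Pearson formula)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / sqrt(spread_x * spread_y)


# Made-up days during the school year: cooler, around 100 visitors.
school_temps = [30, 40, 50, 35, 45]
school_visitors = [105, 90, 105, 100, 100]

# Made-up days during summer break: hotter, around 200 visitors.
summer_temps = [75, 85, 95, 80, 90]
summer_visitors = [205, 190, 205, 200, 200]

# Within each season, temperature says nothing about visits.
r_school = pearson_r(school_temps, school_visitors)
r_summer = pearson_r(summer_temps, summer_visitors)

# Lump all the days together, ignoring the season (the lurking variable).
r_all_days = pearson_r(school_temps + summer_temps,
                       school_visitors + summer_visitors)

print("school year only:", round(r_school, 3))
print("summer only:     ", round(r_summer, 3))
print("all days pooled: ", round(r_all_days, 3))
```

In this invented example the pooled data show a strong positive correlation even though, inside each season, hotter days are no busier than cooler ones. The season is doing the work, not the temperature.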

The point of the previous example is to show that association does not imply causation. You could find support for a cause-and-effect link by asking patrons their reasons for coming to the library through surveys or interviews. However, only by conducting an experiment can you truly demonstrate causation.

Correlation = a starting point, not a conclusion

Before I leave you, there’s one very important point to make. Sometimes the best we can do is say there’s a correlation between these data and that’s it. In the real world, dealing with real people, it can be difficult or controversial to investigate causation through experiments. For instance, does education reduce poverty? There’s a strong correlation, but we can’t run an experiment where we educate one group of children and withhold education from another. Poverty is also a really complex issue and it’s difficult to control for all other interacting variables. In this case, and many others, researchers use the observed association as a first step in building a case for causation. 

LRS’s Between a Graph and a Hard Place blog series provides strategies for looking at data with a critical eye. Every week we’ll cover a different topic. You can use these strategies with any kind of data, so while the series may be inspired by the many COVID-19 statistics being reported, the examples we’ll share will focus on other topics. To receive posts via email, please complete this form.


What’s typical and why does it matter?

Average is one of those statistics that comes up a lot. What does it mean? How can we use it? What are its limitations? Today we’re going to talk about both the average, also known as the mean, and another statistic called the median. Means and medians are both ways to find out what’s typical and to compare multiple things.

It’s easier to understand what these two statistics tell us when you know how they are calculated. Don’t worry if you don’t think of yourself as a “math person.” We’re only going to use addition and division. 

What is it?

Let’s do the basic math of how a mean is calculated using an example. I have a storytime at my library, and I want to know the mean age of the children attending today. With their caregivers’ help, I find out the ages of the five children who are there: 3, 3, 4, 2, 3.

So, to calculate the mean age:
3+3+4+2+3 = 15 ← add up all the ages to make a total
15/5 = 3 ← divide the total by the number of children
The mean age is: 3
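
If you would rather let a computer do the arithmetic, the same two steps can be sketched in a few lines of Python. This is only an illustration; the variable names are ours.

```python
# Ages of the five children at today's storytime (from the example above).
ages = [3, 3, 4, 2, 3]

# Step 1: add up all the ages to make a total.
total = sum(ages)

# Step 2: divide the total by the number of children.
mean_age = total / len(ages)

print("Total of all ages:", total)
print("Number of children:", len(ages))
print("Mean age:", mean_age)
```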

Why is it used?

The mean tells us what is typical for a group of values. It’s useful to know what a typical value is because it can help you compare multiple groups of values. Let’s say you think that one of your regular storytimes has younger children than another. You find out participants’ ages at your storytimes on Tuesdays and on Saturdays. After doing this for many months, you see a pattern: usually the typical age of participants on Tuesdays is three, but the typical age on Saturdays is five. Because you have these data, you decide to start planning slightly different activities for the Tuesday and Saturday storytimes. Useful, right?

What can go wrong?

Means, like all statistics, have pros and cons. Outliers are unusual pieces of data that can really change the mean. Let’s say someone’s older sibling comes to storytime that same day we calculated the mean for already. So now our data are: 3, 3, 4, 2, 3, 9.

We calculate:
3+3+4+2+3+9 = 24 ← add up all the ages to make a total
24/6 = 4 ← divide the total by the number of children
The mean age is: 4

Four! Add one nine-year-old sibling, and the mean jumps all the way to four. Should you change the storytime to be more geared toward four-year-olds because this nine-year-old came once? No, probably not.

Enter the median. The median is another way of calculating a typical value, and is less impacted by an outlier. The median is the middle value in a dataset. Or to put it another way, half of the values in the dataset are higher than the median and half are lower. 

To calculate the median, put the data in order from lowest to highest, and identify the middle value. Here are our data in order: 2, 3, 3, 3, 4, 9.

In this case, we have an even number of data points, so three and three are both middle values. If we had an odd number of data points, the middle value would be the median, and the calculation would end there. When you have two middle values, you get to bring in your old friend mean to help figure out the median:

3+3 = 6 ← add up the two middle ages to make a total
6/2 = 3 ← divide the total by the number of middle ages 

Surprise, surprise. Three is our median! This is why the median can be so helpful. When there are outliers that will change the mean, the median is not impacted as much and is a more accurate indicator of what’s typical.
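
Here is a minimal sketch in Python of the comparison we just walked through. The function name median_by_hand is ours and is only there to mirror the steps above; Python’s built-in statistics module already provides mean and median, and we use those to double-check the result.

```python
from statistics import mean, median

# Storytime ages before and after the nine-year-old sibling arrives.
ages = [3, 3, 4, 2, 3]
ages_with_sibling = ages + [9]


def median_by_hand(values):
    """Find the median using the steps described in the post."""
    ordered = sorted(values)          # put the data in order, lowest to highest
    count = len(ordered)
    middle = count // 2
    if count % 2 == 1:
        # Odd number of data points: the middle value is the median.
        return ordered[middle]
    # Even number of data points: take the mean of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


print("Mean without sibling:  ", mean(ages))
print("Mean with sibling:     ", mean(ages_with_sibling))
print("Median without sibling:", median_by_hand(ages))
print("Median with sibling:   ", median_by_hand(ages_with_sibling))

# The built-in median follows the same rule as our by-hand version.
print("Built-in median agrees:", median(ages_with_sibling) == median_by_hand(ages_with_sibling))
```

One outlier moves the mean from three to four, while the median stays at three both times.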

Cool math lesson, now what?

Means and medians are both ways to find out what’s typical and to compare multiple things. Here are some examples of how this comes up in everyday life.

What’s typical? Why would we want to know?

  • What is the mean temperature in Colorado in May? 
    • Should I keep a sweater out?
  • What’s the mean value of this car I might buy? 
    • Am I paying too much?

Are these two things similar or different? Why would we want to know?

  • What is the mean salary for library staff in one state compared to another state?
    • Maybe there’s a place that pays similarly where it doesn’t snow in the spring?
  • What was the mean ebook circulation in public libraries in 2019 compared to 2009?
    • Is ebook circulation increasing, decreasing, or staying the same?

The mean and median are a good place to start investigating a question to orient yourself. The key to using means and medians well is to not stop with them. They both indicate what is typical, but not the whole picture. It is important to check, like we discussed before, that what is being compared is actually comparable. The mean doesn’t necessarily take other variables into consideration. For example, comparing the mean salary for library staff in two different states doesn’t take into account the cost of living. The same salary could result in very different qualities of life in two different places. We’ll talk more about the importance of other variables soon. 

Any statistic is tied to the underlying data

Keep in mind that the accuracy of statistics depends on the quality of the dataset that statistic is about. How the data were collected, how much data were collected, and to what extent the data represent the subject all impact the quality. For example, if you collected the data by guessing children’s ages instead of asking, we don’t know if the mean is accurate because we don’t know if the underlying data are correct. 

Even with accurate data, there are limits on the conclusions you can draw. In our example, the data we were collecting about the age of storytime participants would be helpful to your specific library, but you can’t conclude that the mean age of all participants in all Tuesday storytimes everywhere is three. We didn’t collect those data. We have no idea if your storytime on Tuesday is like other libraries’ Tuesday storytimes.

Numbers can’t do the thinking for you

Life is unpredictable and messy. You may have storytimes for months where the mean age is three, and then one week a bunch of two-year-olds come. Your mean will change then, and you have to decide what to do with that information. Do you want to adjust the storytime? Do you find that there’s a key developmental change between ages two and three and you’d like to market some storytimes for children two and younger? The statistics can help guide your decision, but they will never tell you what to do. You have to decide how to use the statistics and the other information you have to understand what’s happening and what you want to do. 

The tip of the statistics iceberg

For the mean and median to be good measures of what’s typical, the dataset needs to meet some criteria. Those criteria get into probability and what the dataset looks like when it’s arranged a certain way (its distribution). For our purposes here, you don’t need to be deeply familiar with those concepts. If, however, you want to learn more, you could start here.
