# Finding real data for real data stories

This post is first in a series of posts where I’m going to share some strategies for getting real data for real data stories, specifically to use for statistical investigations that require sample to population inference. As I write them, you will be able to find them all on this page.

**Key considerations for finding real data for sample to population inference tasks**

It’s really important that I stress that the approaches I’ll discuss are not necessarily what I would typically use when finding data to explore. Generally, I’d let the data drive the analysis not the analysis drive the data I try to find. These are specific examples so that the data that is obtained can be used sensibly to perform sample to population inference. It’s also really important to talk about why I’m stressing the above š In NZ we have specific standards that are designed to assess understanding of sample to population inference, using informal and formal methods that have developed by exploring the behaviour of random samples from populations (AS91035, AS91264, AS91582). So, for the students’ learning about rules of thumbs and confidence intervals to make sense, we need to provide students with clearly defined named populations with data that are (or are able to be) randomly sampled from these populations. At high school level at least, these strict conditions are in place so that students can focus on one central question:Ā *What can and can’t I say about the population(s) based on the random sample data?*

For all the examples I’ll cover in this series of posts, there are four key considerations/requirements:

- You need a clearly defined and nameable population. Ideally this should be as simple and clear as possible to help students out but to ensure (2) the “name” can end up being quite specific.
- You need a sampling frame that is a very close match to your population. This means you need a way to access every member of your population to measure stuff about them (variables). Sure, this is not the reality of what happens in the real world in terms of sampling, but remember what I said earlier about what was important š
- You need to select from your sampling frame using a random sampling method to obtain the members of your sample. It is sufficient (and recommended) to stick to simple random sampling. In some cases, you may be able to make an assumption that what you have can be considered a random sample, but I’d prefer to avoid these kinds of situations where possible at high school level.
- You need to define and measure variables from each member of the sample/population. We want students working with multivariate data sets, with several options possible for numerical and categorical data (but don’t forget there is the option to create new variables from what was measured).

I’ll try to refer back to these four considerations/requirements when I discuss examples in the posts that will follow.

Just one very relevant NZ NCEA assessment-specific comment before we talk data. For AS91035 and AS91582, the standards state that students are to be provided with the sample multivariate data for the task – so all of (1) (2) (3) and (4) is done by the teacher. Similarly with AS91264, the requirement for the standard is that students select a random sample (3) from a provided population dataset – so (1) (2) and (4) are done by the teacher. This does not mean the students can’t do more in terms of the sampling/collecting processes, just that these are not requirements for the standards and asking students to do more should not limit their ability to complete the task. I’ll try to give some ideas for how to manage any related issues in the examples.

Just one more point. I haven’t made this (5) in the previous section, but something to watch out for is the nature of your “cases”. Tables of data (which we refer to as datasets) that play nicely with statistical software like iNZightĀ are ones where the data is organised so that each row is a case and each column is a variable. Typically at high school level, the datasets we use are ones where each case (e.g. each individual in the defined population) is measured directly to obtain different variables. Things can get a little tricky conceptually when some of the variables for a case are actually measured by grouping/aggregating related but different cases.

For example, if I take five movies from the internet movie database that have “dog” in the title (imdb.com) and another five with “cat” in the title, I could construct a mini dataset like the one below using information from the website:

For this dataset, each row is a different movie, so the cases are the movies. Each column provides a different variable for each movie. The variablesĀ *Movie title, Year released, Movie length mins, Average rating, Number of ratings, Number photos* andĀ *Genre* were taken straight from the webpage for each movie. I created the variablesĀ *Number words title, Number letters title, Average letters per word, Animal in title, Years since releaseĀ *andĀ *Millennium*. [Something I won’t tackle in this post is what to do about the Genre variable to make this usable for analysis.]

TheĀ *Average rating* variable looks like a nice numerical variable to use, for example, to compare ratings of these movies with “dog” in the title and those with “cat”. The thing is, this particular variable has been measured by aggregating individual’s ratings of the movie using a mean (the related but different cases here are the individuals who rated the movies). You can see why this may be an issue when you look at the variableĀ *Number of ratings*, which again is an aggregate measure (a count) – some of these movies have received less than 200 ratings while others are in the hundreds of thousands. We also can’t see what the distribution of these individual ratings for each movie looks like to decide whether the mean is telling us something useful about the ratings. [For some more really interesting discussion of using online movie ratings, check out this fivethirtyeight article.]

The variableĀ *Average letters per word* has been measured directly from each case, using the characteristics of the movie title. There are still some potential issues with using the variableĀ *Average letters per word* as a measure of, let’s say, complexity of words used in the movie title, since the mean is being used, but at least in thisĀ *case* students can see the movie title.

Another example of case awareness can be seen in the mini dataset below, using data on PhD candidates from the University of Auckland online directory:

For this dataset, each row is a different department, so the cases are the departments. Each column provides a different variable for each department. Gender was estimated based on the information provided in the directory and the data may be inaccurate for this reason. TheĀ *% of PhD candidates that are female* looks like a nice numerical variable to use, for example, to compare gender rates between these departments from the Arts and Science faculties. Generally with numerical variables we would use the mean or median as a measure of central tendency. But this variable was measured by aggregating information about each PhD candidate in that department and presenting this measure as a percentage (the related but different cases here are the PhD candidates). Just think about it, does it really make sense to make a statement like:Ā *The mean % of PhD candidates that are female for these departments of the Arts faculty is 73% whereas the mean % of PhD candidates that are female for these departments of the Science faculty is 44%,*Ā especially when the numbers of PhD candidates varies so much between departments?

Looking at the individual percentages is interesting to see how they vary across departments, but combining them to get an overall measure for each faculty should involve calculating another percentage using the original counts for PhD candidates for each department (e.g. group by faculty). If I want to compare gender rates between the Arts and Science faculties for PhD candidates, I would calculate the proportion of all PhD candidates across these department that are female for each faculty e.g. *58% of the PhD candidates from these departments of the Arts faculty are female, 53% of the PhD candidates from these departments of the Science faculty are female*.

[If you’d like to read more about structuring data in the context of creating a dataset, then check out this excellent post by Rob Gould.]

**Where to next?**

This post was not supposed to deter you from finding and creating your own real datasets! But we do need to think carefully about the data that we provide to students, especially our high school students. Not all datasets are the same and while I’ve seen some really cool and interesting ideas out there for finding/collecting data for investigations, some of these ideas unintentionally produce data that makes it very difficult for students to engage with the core question:Ā *What can and can’t I say about the population(s) based on the random sample data?Ā *

In the next post, I’ll discuss some examples of finding real data online. Until I find time to write this next post, check out these existing data finding posts:

Cat and whisker plots: sampling from the Quick, Draw! dataset