This post is second in a series of posts where I’m going to share some strategies for getting real data to use for statistical investigations that require sample to population inference. As I write them, you will be able to find them all on this page.
What’s your favourite board game?
I read an article posted on fivethirtyeight about the worst board games ever invented and it got me thinking about the board games I like to play. The Game of Life has a low average rating on the online database of games referred to in this article, but I remember kind of enjoying playing it as a kid. boardgamegeek.com features user-submitted information about hundreds of thousands of games (not just board games) and is constantly being updated. While there are some data sets out there that already feature data from this website (e.g. from kaggle datasets), I am purposely demonstrating a non-programming approach to getting this data that maximises the participation of teachers and students in the data collection process.
To end up with data that can be used as part of a sample to population inference task:
(1) You need a clearly defined and nameable population (in this case, all board games listed on boardgamegeek.com).
(2) You need a sampling frame that is a very close match to your population.
(3) You need to select from your sampling frame using a random sampling method to obtain the members of your sample.
(4) You need to define and measure variables from each member of the sample/population so the resulting data is multivariate.
boardgamegeek.com actually provides a link that you can use to select one of the games on their site at random (https://boardgamegeek.com/boardgame/random), so using this “random” link (hopefully) takes care of (2) and (3). For (4), there are so many potential variables that could be defined and measured. To decide on what variables to measure, I spent some time exploring the content of the webpages for a few different games to get a feel for what might make for good variables. I decided to stick to variables that are measured directly for each game, rather than ones that were based on user polls, and went with these variables:
Millennium the game was released (1000, 2000, all others)
Number of words in game title
Minimum number of players
Maximum number of players
Playing time in minutes (if a range was provided, the average of the limits was used)
Minimum age in years
Game type (strategy or war, family or children’s, other)
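If you do end up recording the raw details in a spreadsheet first, three of the variables above can be derived with a few lines of code. This is only a rough sketch and not how I collected the data: the function names are made up for illustration, I am assuming "1000" means a release year from 1000 to 1999 and "2000" means 2000 to 2999, and I am assuming the playing time is typed in as text like "45" or "60-120".

```python
import re


def millennium(year):
    """Classify a release year as '1000', '2000' or 'other' (assumed cut-offs)."""
    if 1000 <= year <= 1999:
        return "1000"
    if 2000 <= year <= 2999:
        return "2000"
    return "other"


def words_in_title(title):
    """Count the whitespace-separated words in a game title."""
    return len(title.split())


def playing_time(minutes_text):
    """Return the playing time in minutes.

    A single value such as '45' is returned as is; for a range such as
    '60-120' the average of the two limits is returned.
    """
    limits = [float(x) for x in re.findall(r"\d+", minutes_text)]
    return (min(limits) + max(limits)) / 2


title = "The Lord of the Rings: The Card Game"
print(words_in_title(title), playing_time("60-120"), millennium(2011))
```

Doing the same calculations by hand for a few games first is still worthwhile, since that is where students notice awkward cases (e.g. is "Rings:" one word?).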
I wrote a post recently – Just Google it – which featured real data distributions. boardgamegeek.com also provides simple graphs of the ratings for each game, so we can play a similar matching game. You could also try estimating the mean and standard deviation of the ratings from the graph, with the added game feature of reverse ordering!
Which games do you think match which ratings graphs?
The Lord of the Rings: The Card Game
I couldn’t find a game that had a clear bi-modal distribution for its ratings but I reckon there must be games out there that people either love or hate 🙂 Let me know if you find one! To get students familiar with boardgamegeek.com, you could ask them to first search for their favourite game and then explore what information and ratings have been provided for this on the site. Let the games begin 🙂
This post is first in a series of posts where I’m going to share some strategies for getting real data for real data stories, specifically to use for statistical investigations that require sample to population inference. As I write them, you will be able to find them all on this page.
Key considerations for finding real data for sample to population inference tasks
It’s really important that I stress that the approaches I’ll discuss are not necessarily what I would typically use when finding data to explore. Generally, I’d let the data drive the analysis, not the analysis drive the data I try to find. These are specific examples so that the data that is obtained can be used sensibly to perform sample to population inference. It’s also really important to talk about why I’m stressing the above 🙂 In NZ we have specific standards that are designed to assess understanding of sample to population inference, using informal and formal methods that have been developed by exploring the behaviour of random samples from populations (AS91035, AS91264, AS91582). So, for the students’ learning about rules of thumb and confidence intervals to make sense, we need to provide students with clearly defined named populations with data that are (or are able to be) randomly sampled from these populations. At high school level at least, these strict conditions are in place so that students can focus on one central question: What can and can’t I say about the population(s) based on the random sample data?
For all the examples I’ll cover in this series of posts, there are four key considerations/requirements:
(1) You need a clearly defined and nameable population. Ideally this should be as simple and clear as possible to help students out but to ensure (2) the “name” can end up being quite specific.
(2) You need a sampling frame that is a very close match to your population. This means you need a way to access every member of your population to measure stuff about them (variables). Sure, this is not the reality of what happens in the real world in terms of sampling, but remember what I said earlier about what was important 🙂
(3) You need to select from your sampling frame using a random sampling method to obtain the members of your sample. It is sufficient (and recommended) to stick to simple random sampling. In some cases, you may be able to make an assumption that what you have can be considered a random sample, but I’d prefer to avoid these kinds of situations where possible at high school level.
(4) You need to define and measure variables from each member of the sample/population. We want students working with multivariate data sets, with several options possible for numerical and categorical data (but don’t forget there is the option to create new variables from what was measured).
I’ll try to refer back to these four considerations/requirements when I discuss examples in the posts that will follow.
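To make (2) and (3) a bit more concrete, here is a minimal sketch of simple random sampling from a sampling frame. The frame here is a made-up list of 1000 labels standing in for the members of a population, and the function name and the seed are just for illustration; any tool that selects without replacement, with every member equally likely, does the same job.

```python
import random


def simple_random_sample(sampling_frame, n, seed=None):
    """Select n members from the sampling frame without replacement,
    giving every member the same chance of being selected."""
    rng = random.Random(seed)  # a seed makes the selection repeatable
    return rng.sample(sampling_frame, n)


# Hypothetical sampling frame: one label per member of the population
frame = [f"member_{i}" for i in range(1, 1001)]

sample = simple_random_sample(frame, 30, seed=1)
print(sample[:5])
```

The important part is the list called frame: if it is missing members of the population you named in (1), then no amount of random selection will fix that.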
Just one very relevant NZ NCEA assessment-specific comment before we talk data. For AS91035 and AS91582, the standards state that students are to be provided with the sample multivariate data for the task – so all of (1) (2) (3) and (4) is done by the teacher. Similarly with AS91264, the requirement for the standard is that students select a random sample (3) from a provided population dataset – so (1) (2) and (4) are done by the teacher. This does not mean the students can’t do more in terms of the sampling/collecting processes, just that these are not requirements for the standards and asking students to do more should not limit their ability to complete the task. I’ll try to give some ideas for how to manage any related issues in the examples.
Just one more point. I haven’t made this (5) in the previous section, but something to watch out for is the nature of your “cases”. Tables of data (which we refer to as datasets) that play nicely with statistical software like iNZight are ones where the data is organised so that each row is a case and each column is a variable. Typically at high school level, the datasets we use are ones where each case (e.g. each individual in the defined population) is measured directly to obtain different variables. Things can get a little tricky conceptually when some of the variables for a case are actually measured by grouping/aggregating related but different cases.
For example, if I take five movies from the Internet Movie Database (imdb.com) that have “dog” in the title and another five with “cat” in the title, I could construct a mini dataset like the one below using information from the website:
For this dataset, each row is a different movie, so the cases are the movies. Each column provides a different variable for each movie. The variables Movie title, Year released, Movie length mins, Average rating, Number of ratings, Number photos and Genre were taken straight from the webpage for each movie. I created the variables Number words title, Number letters title, Average letters per word, Animal in title, Years since release and Millennium. [Something I won’t tackle in this post is what to do about the Genre variable to make this usable for analysis.]
The Average rating variable looks like a nice numerical variable to use, for example, to compare ratings of these movies with “dog” in the title and those with “cat”. The thing is, this particular variable has been measured by aggregating individuals’ ratings of the movie using a mean (the related but different cases here are the individuals who rated the movies). You can see why this may be an issue when you look at the variable Number of ratings, which again is an aggregate measure (a count) – some of these movies have received fewer than 200 ratings while others are in the hundreds of thousands. We also can’t see what the distribution of these individual ratings for each movie looks like to decide whether the mean is telling us something useful about the ratings. [For some more really interesting discussion of using online movie ratings, check out this fivethirtyeight article.]
The variable Average letters per word has been measured directly from each case, using the characteristics of the movie title. There are still some potential issues with using the variable Average letters per word as a measure of, let’s say, complexity of words used in the movie title, since the mean is being used, but at least in this case students can see the movie title.
For this dataset, each row is a different department, so the cases are the departments. Each column provides a different variable for each department. Gender was estimated based on the information provided in the directory and the data may be inaccurate for this reason. The % of PhD candidates that are female looks like a nice numerical variable to use, for example, to compare gender rates between these departments from the Arts and Science faculties. Generally with numerical variables we would use the mean or median as a measure of central tendency. But this variable was measured by aggregating information about each PhD candidate in that department and presenting this measure as a percentage (the related but different cases here are the PhD candidates). Just think about it: does it really make sense to make a statement like “The mean % of PhD candidates that are female for these departments of the Arts faculty is 73% whereas the mean % of PhD candidates that are female for these departments of the Science faculty is 44%”, especially when the number of PhD candidates varies so much between departments?
Looking at the individual percentages is interesting to see how they vary across departments, but combining them to get an overall measure for each faculty should involve calculating another percentage using the original counts for PhD candidates for each department (e.g. group by faculty). If I want to compare gender rates between the Arts and Science faculties for PhD candidates, I would calculate the proportion of all PhD candidates across these departments that are female for each faculty, e.g. 58% of the PhD candidates from these departments of the Arts faculty are female, 53% of the PhD candidates from these departments of the Science faculty are female.
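A small sketch of why the two calculations can disagree. The counts below are completely made up (they are not the counts behind the percentages quoted above) and the function names are just for illustration; the point is only that a mean of percentages treats a department of 10 candidates the same as a department of 60.

```python
# Hypothetical counts of PhD candidates for four departments
departments = [
    {"faculty": "Arts", "female": 9, "total": 10},
    {"faculty": "Arts", "female": 30, "total": 60},
    {"faculty": "Science", "female": 4, "total": 10},
    {"faculty": "Science", "female": 30, "total": 50},
]


def mean_of_percentages(rows, faculty):
    """Average the department percentages, ignoring department size."""
    percentages = [
        100 * r["female"] / r["total"] for r in rows if r["faculty"] == faculty
    ]
    return sum(percentages) / len(percentages)


def pooled_percentage(rows, faculty):
    """Percentage of all candidates in the faculty who are female,
    calculated from the original counts."""
    selected = [r for r in rows if r["faculty"] == faculty]
    female = sum(r["female"] for r in selected)
    total = sum(r["total"] for r in selected)
    return 100 * female / total


for faculty in ("Arts", "Science"):
    print(
        faculty,
        round(mean_of_percentages(departments, faculty), 1),
        round(pooled_percentage(departments, faculty), 1),
    )
```

With these made-up counts the mean of the percentages makes Arts look much more female than Science, while the pooled percentages for the two faculties are close to each other.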
This post was not supposed to deter you from finding and creating your own real datasets! But we do need to think carefully about the data that we provide to students, especially our high school students. Not all datasets are the same and while I’ve seen some really cool and interesting ideas out there for finding/collecting data for investigations, some of these ideas unintentionally produce data that makes it very difficult for students to engage with the core question: What can and can’t I say about the population(s) based on the random sample data?
In the next post, I’ll discuss some examples of finding real data online. Until I find time to write this next post, check out these existing data finding posts:
Last night, I saw a tweet announcing that Google had made data available on over 50 million drawings from the game Quick, Draw! I had never played the game before, but it is pretty cool. The idea behind the game is whether a neural network can learn to recognize doodling – watch the video below for more about this (with an example about cats of course!)
For each game, you are challenged to draw a certain object within 20 secs, and you get to see if the algorithm can classify your drawing correctly or not. See my attempt below to draw a trumpet, and the neural network correctly identifying that I was drawing a trumpet.
Since I am clearly obsessed with cats at the moment, I went straight to the drawings of cats. You can see ALL the drawings made for cats (hundreds of thousands) and can see variation in particular features of these drawings. I thought it would be cool to be able to take a random sample from all the drawings for a particular category, so after some coding I set up this page: learning.statistics-is-awesome.org/draw/. I’ve included below each drawing the other data provided in the following order:
the word the user was told to draw
the two letter country code
whether the drawing was correctly classified
number of individual strokes made for the drawing
[Update: There are now more variables available – see this post for more details]
So, on average, how many whiskers do Quick, Draw! players draw on their cats?
So, turns out it’s a fairly safe bet that the mean number of whiskers per cat drawing made by Quick, Draw! players is somewhere between 2.2 and 3.5 whiskers. Of course, these are the drawings that have been moderated (I’m assuming for appropriateness/decency). When you look at the drawings, with that 20 second limit on drawing time, you can see that many players went for other features of cats like their ears, possibly running out of time to draw the whiskers. In that respect, it would be interesting to see if there is something going on with whether the drawing was correctly classified as being a cat or not – are whiskers a defining feature of cat drawings?
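One way to get an interval like this from a random sample of counted whiskers is a bootstrap confidence interval for the mean. The sketch below is only an illustration under my own assumptions: the whisker counts are made up (they are not the counts from my sample), the function name is hypothetical, and a percentile bootstrap is assumed rather than any particular software's method.

```python
import random
import statistics


def bootstrap_mean_interval(values, n_resamples=1000, level=0.95, seed=None):
    """Percentile bootstrap interval for the mean.

    Resample the data with replacement many times, calculate the mean of
    each resample, and take the middle `level` proportion of those means.
    """
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower_index = round((1 - level) / 2 * n_resamples)
    upper_index = round((1 + level) / 2 * n_resamples) - 1
    return means[lower_index], means[upper_index]


# Hypothetical number of whiskers counted on 20 randomly sampled cat drawings
whiskers = [0, 0, 2, 4, 6, 0, 3, 4, 0, 2, 6, 4, 0, 0, 2, 4, 3, 6, 0, 4]

lower, upper = bootstrap_mean_interval(whiskers, seed=1)
print(f"Sample mean: {statistics.mean(whiskers):.2f}")
print(f"Bootstrap interval for the mean: {lower:.2f} to {upper:.2f}")
```

The interval only makes sense as a statement about all the cat drawings because the drawings were randomly sampled in the first place.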
I reckon there are a tonne of cool things to explore with this dataset, and with the ability to randomly sample from the hundreds and hundreds of thousands of drawings available under each category, a good reason to use statistical inference 🙂 I like that students can develop their own measures based on features of the drawings, based on what they are interested in exploring.
After I published this post, I took a look at the drawings for octopus and then for octagon, a fascinating comparison.
I wonder if players of Quick, Draw! are more likely to draw eight sides for an octagon or eight legs for an octopus? I wonder if the mean number of sides drawn for an octagon is higher than the mean number of legs drawn for an octopus?