Recently I’ve been developing and trialling learning tasks where the learner is working with a provided data set but has to do something “human” that motivates using a random sample as part of the strategy to learn something from the data.
Since I already had a tool that creates data cards from the Quick, Draw! data set, I’ve created a prototype for the kind of tool that would support this approach using the same data set.
I’ve written about the Quick, Draw! data set already:
For this new tool, called different strokes, users sort drawings into two or more groups based on something visible in the drawing itself. Since you have to drag the drawings around to manually “classify” them, the larger the sample you take, the longer it will take you.
There’s also the novelty and creativity of being able to create your own rules for classifying drawings. I’ll use cats for the example below, but from a teaching and assessment perspective there are SO many drawings of so many things and so many variables with so many opportunities to compare and contrast what can be learned about how people draw in the Quick, Draw! game.
Here’s a precis of the kinds of questions I might ask myself to explore the general question What can we learn from the data about how people draw cats in the Quick, Draw! game?
Are drawings of cats more likely to be heads only or the whole body? [I can take a sample of cat drawings, and then sort the drawings into heads vs bodies. From here, I could bootstrap a confidence interval for the population proportion].
Is how someone draws a cat linked to the game time? [I can use the same data as above, but compare game times by the two groups I’ve created – head vs bodies. I could bootstrap a confidence interval for the difference of two population means/medians]
Is there a relationship between the number of strokes and the pause time for cat drawings? [And what do these two variables actually measure – I’ll need some contextual knowledge!]
Do people draw dogs similarly to cats in the Quick, Draw! game? [I could grab new samples of cat and dog drawings, sort all drawings into “heads” or “bodies”, and then bootstrap a confidence interval for the difference of two population proportions]
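To make the bootstrapping idea in the questions above concrete, here is a minimal sketch in Python (not the tool itself) of a percentile bootstrap interval for a proportion. The sample of 40 sorted cat drawings and the function name are made up for illustration, and the percentile method is just one common way to build the interval.

```python
import random


def bootstrap_proportion_ci(labels, target, n_resamples=1000, level=0.95, seed=1):
    """Percentile bootstrap interval for the proportion of `target` in `labels`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(labels)
    proportions = []
    for _ in range(n_resamples):
        resample = rng.choices(labels, k=n)  # resample WITH replacement
        proportions.append(resample.count(target) / n)
    proportions.sort()
    tail = (1 - level) / 2
    lower = proportions[int(tail * n_resamples)]
    upper = proportions[int((1 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical result of sorting a sample of 40 cat drawings by hand
sample = ["head"] * 26 + ["body"] * 14
sample_prop = sample.count("head") / len(sample)
lower, upper = bootstrap_proportion_ci(sample, "head")

print(f"Sample proportion of head-only drawings: {sample_prop:.2f}")
print(f"Bootstrap interval: {lower:.2f} to {upper:.2f}")
```

The same resampling loop works for a difference of two proportions (cats vs dogs): resample each group separately and store the difference each time.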
But I haven’t seen dot-shaped ones yet, so this led me to re-develop the Quick, Draw! sampling tool to be able to create some 🙂
I was also motivated to work some more on the tool after the fantastic Wendy Gibbs asked me at the NZAMT (New Zealand Association of Mathematics Teachers) writing camp if I could include variables related to the times involved with each drawing. I suspect she has read this super cool post by Jim Vallandingham (while you’re at his site, check out some of his other cool posts and visualisations) which came out after I first released the sampling tool and compares strokes and drawing/pause times for different words/concepts – including cats and dogs!
The drawing and pause times are in seconds. The drawing time captures the time taken for each stroke from beginning to end and the pause time captures all the time between strokes. If you add these two times together, you will get the total time the person spent drawing the word/concept before either the 20 seconds was up, or Google tried to identify the word/concept. Below the word/concept drawn is whether the drawing was correctly recognised (true) or not (false).
I also added three ways to use the data cards once they have been generated using the sampling tool (scroll down to below the data cards). You can now:
download a PDF version of the data cards, with circles the same size as the circle punch shown above (2″/5cm)
download the CSV file for the sample data
show the sample data as an HTML table (which makes it easy to copy and paste into a Google sheet for example)
In terms of options (2) and (3) above, I had resisted making the data this accessible in the previous version of the sampling tool. One of the reasons for this is that I wanted the drawings themselves to be considered as data, and as a human would be involved in developing this variable, there was a need to work with just a sample of all the millions of drawings. I still feel this way, so I encourage you to get students to develop at least one new variable for their sample data that is based on a feature of the drawing 🙂 For example, whether the drawing of a cat is the face only, or includes the body too.
There are other cool things possible to expand the variables provided. Students could create a new variable by adding drawing_time and pause_time together. They could also create a variable which compares the number_strokes to the drawing_time e.g. average time per stroke. Students could also use the day_sketched variable to classify sketches as weekday or weekend drawings. Students should soon find the hemisphere is not that useful for comparisons, so could explore another country-related classification like continent. More advanced manipulations could involve working with the time stamps, which are given for all drawings using UTC time. This has consequences for the variable day_sketched as many countries (and places within countries) will be behind or ahead of the UTC time.
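Here is a rough sketch in Python of how these new variables could be created for one row of sample data. The names drawing_time, pause_time and number_strokes come from the data cards; the timestamp_utc column name, the example values and the utc_offset_hours argument are assumptions made for illustration only.

```python
from datetime import datetime, timedelta, timezone


def add_variables(row, utc_offset_hours=0):
    """Return a copy of `row` with some derived variables added."""
    new = dict(row)
    new["total_time"] = row["drawing_time"] + row["pause_time"]
    new["time_per_stroke"] = row["drawing_time"] / row["number_strokes"]
    # Treat the time stamp as UTC, then shift it to the (assumed) local offset
    utc = datetime.fromisoformat(row["timestamp_utc"]).replace(tzinfo=timezone.utc)
    local = utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    new["local_weekday"] = local.weekday()  # Monday is 0, Sunday is 6
    new["weekend"] = local.weekday() >= 5
    return new


# Hypothetical drawing, sketched late on a Friday evening in UTC
row = {
    "word": "cat",
    "drawing_time": 6.4,
    "pause_time": 3.1,
    "number_strokes": 8,
    "timestamp_utc": "2017-03-10 22:30:00",
}

utc_view = add_variables(row)                      # classified using UTC
nz_view = add_variables(row, utc_offset_hours=13)  # assumed offset of 13 hours ahead

print(utc_view["total_time"], utc_view["time_per_stroke"])
print("Weekend using UTC:", utc_view["weekend"])
print("Weekend using the local offset:", nz_view["weekend"])
```

The same drawing is a weekday sketch in UTC but a weekend sketch for someone 13 hours ahead, which is exactly the issue with day_sketched described above.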
If you’ve made it this far in the post…. why not play with a little R 🙂
I wonder which common household pet Quick! drawers tend to use the most strokes to draw? Cats, dogs, or fish?
Have a go at modifying the R code below, using the iNZightPlots package by Tom Elliott and my [very-much-in-its-initial-stages-of-development] iNZightR package, to see what we can learn from the data 🙂 If you’re feeling extra adventurous, why not try modifying the code to explore the relationship between number of strokes and drawing time!
This post is the second in a series of posts where I’m going to share some strategies for getting real data to use for statistical investigations that require sample to population inference. As I write them, you will be able to find them all on this page.
What’s your favourite board game?
I read an article posted on fivethirtyeight about the worst board games ever invented and it got me thinking about the board games I like to play. The Game of Life has a low average rating on the online database of games referred to in this article but I remember kind of enjoying playing it as a kid. boardgamegeek.com features user-submitted information about hundreds of thousands of games (not just board games) and is constantly being updated. While there are some data sets out there that already feature data from this website (e.g. from kaggle datasets), I am purposely demonstrating a non-programming approach to getting this data that maximises the participation of teachers and students in the data collection process.
To end up with data that can be used as part of a sample to population inference task:
You need a clearly defined and nameable population (in this case, all board games listed on boardgamegeek.com)
You need a sampling frame that is a very close match to your population.
You need to select from your sampling frame using a random sampling method to obtain the members of your sample.
You need to define and measure variables from each member of the sample/population so the resulting data is multivariate.
boardgamegeek.com actually provide a link that you can use to select one of the games on their site at random (https://boardgamegeek.com/boardgame/random), so using this “random” link (hopefully) takes care of (2) and (3). For (4), there are so many potential variables that could be defined and measured. To decide on what variables to measure, I spent some time exploring the content of the webpages for a few different games to get a feel for what might make for good variables. I decided to stick to variables that are measured directly for each game, rather than ones that were based on user polls, and went with these variables:
Millennium the game was released (1000, 2000, all others)
Number of words in game title
Minimum number of players
Maximum number of players
Playing time in minutes (if a range was provided, the average of the limits was used)
Minimum age in years
Game type (strategy or war, family or children’s, other)
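For anyone who does want to script the variable definitions above after collecting the data by hand, here is a small sketch in Python. The game details are made up, the function names are hypothetical, and reading “1000” as the years 1000 to 1999 and “2000” as the years 2000 to 2999 is my assumption about the millennium groups.

```python
def millennium_group(year):
    """Group a release year as '1000', '2000' or 'other' (assumed cut-offs)."""
    if 1000 <= year <= 1999:
        return "1000"
    if 2000 <= year <= 2999:
        return "2000"
    return "other"


def playing_time_minutes(shortest, longest=None):
    """Use the average of the limits when a range of playing times is given."""
    if longest is None:
        return float(shortest)
    return (shortest + longest) / 2


def words_in_title(title):
    """Count the words in a title, splitting on whitespace."""
    return len(title.split())


# Hypothetical game, recorded by hand from its web page
game = {"title": "Cats in the Cupboard", "year": 2014, "time_range": (30, 60)}

record = {
    "millennium": millennium_group(game["year"]),
    "title_words": words_in_title(game["title"]),
    "playing_time": playing_time_minutes(*game["time_range"]),
}
print(record)
```

Writing the rules down this precisely is a useful exercise even without code, since students need to agree on things like what counts as a word before they measure.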
I wrote a post recently – Just Google it – which featured real data distributions. boardgamegeek.com also provides simple graphs of the ratings for each game, so we can play a similar matching game. You could also try estimating the mean and standard deviation of the ratings from the graph, with the added game feature of reverse ordering!
Which games do you think match which ratings graphs?
The Lord of the Rings: The Card Game
I couldn’t find a game that had a clear bi-modal distribution for its ratings but I reckon there must be games out there that people either love or hate 🙂 Let me know if you find one! To get students familiar with boardgamegeek.com, you could ask them to first search for their favourite game and then explore what information and ratings have been provided for this on the site. Let the games begin 🙂
This post is the first in a series of posts where I’m going to share some strategies for getting real data for real data stories, specifically to use for statistical investigations that require sample to population inference. As I write them, you will be able to find them all on this page.
Key considerations for finding real data for sample to population inference tasks
It’s really important that I stress that the approaches I’ll discuss are not necessarily what I would typically use when finding data to explore. Generally, I’d let the data drive the analysis not the analysis drive the data I try to find. These are specific examples so that the data that is obtained can be used sensibly to perform sample to population inference. It’s also really important to talk about why I’m stressing the above 🙂 In NZ we have specific standards that are designed to assess understanding of sample to population inference, using informal and formal methods that have been developed by exploring the behaviour of random samples from populations (AS91035, AS91264, AS91582). So, for the students’ learning about rules of thumb and confidence intervals to make sense, we need to provide students with clearly defined named populations with data that are (or are able to be) randomly sampled from these populations. At high school level at least, these strict conditions are in place so that students can focus on one central question: What can and can’t I say about the population(s) based on the random sample data?
For all the examples I’ll cover in this series of posts, there are four key considerations/requirements:
You need a clearly defined and nameable population. Ideally this should be as simple and clear as possible to help students out but to ensure (2) the “name” can end up being quite specific.
You need a sampling frame that is a very close match to your population. This means you need a way to access every member of your population to measure stuff about them (variables). Sure, this is not the reality of what happens in the real world in terms of sampling, but remember what I said earlier about what was important 🙂
You need to select from your sampling frame using a random sampling method to obtain the members of your sample. It is sufficient (and recommended) to stick to simple random sampling. In some cases, you may be able to make an assumption that what you have can be considered a random sample, but I’d prefer to avoid these kinds of situations where possible at high school level.
You need to define and measure variables from each member of the sample/population. We want students working with multivariate data sets, with several options possible for numerical and categorical data (but don’t forget there is the option to create new variables from what was measured).
I’ll try to refer back to these four considerations/requirements when I discuss examples in the posts that will follow.
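As a quick illustration of requirements (2) and (3), here is a minimal sketch in Python of simple random sampling from a sampling frame. The frame of 500 numbered members and the function name are made up; in practice the frame would be your list of every member of the named population.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Select n different members of the frame, each equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(frame, k=n)  # sampling WITHOUT replacement


# Hypothetical sampling frame: one entry per member of the population
frame = [f"member_{i}" for i in range(1, 501)]

sample = simple_random_sample(frame, n=30, seed=2017)
print(sample[:5])
```

Variables are then measured only for the members that were selected, which is what makes the sample-to-population question meaningful.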
Just one very relevant NZ NCEA assessment-specific comment before we talk data. For AS91035 and AS91582, the standards state that students are to be provided with the sample multivariate data for the task – so all of (1) (2) (3) and (4) is done by the teacher. Similarly with AS91264, the requirement for the standard is that students select a random sample (3) from a provided population dataset – so (1) (2) and (4) are done by the teacher. This does not mean the students can’t do more in terms of the sampling/collecting processes, just that these are not requirements for the standards and asking students to do more should not limit their ability to complete the task. I’ll try to give some ideas for how to manage any related issues in the examples.
Just one more point. I haven’t made this (5) in the previous section, but something to watch out for is the nature of your “cases”. Tables of data (which we refer to as datasets) that play nicely with statistical software like iNZight are ones where the data is organised so that each row is a case and each column is a variable. Typically at high school level, the datasets we use are ones where each case (e.g. each individual in the defined population) is measured directly to obtain different variables. Things can get a little tricky conceptually when some of the variables for a case are actually measured by grouping/aggregating related but different cases.
For example, if I take five movies from the internet movie database that have “dog” in the title (imdb.com) and another five with “cat” in the title, I could construct a mini dataset like the one below using information from the website:
For this dataset, each row is a different movie, so the cases are the movies. Each column provides a different variable for each movie. The variables Movie title, Year released, Movie length mins, Average rating, Number of ratings, Number photos and Genre were taken straight from the webpage for each movie. I created the variables Number words title, Number letters title, Average letters per word, Animal in title, Years since release and Millennium. [Something I won’t tackle in this post is what to do about the Genre variable to make this usable for analysis.]
The Average rating variable looks like a nice numerical variable to use, for example, to compare ratings of these movies with “dog” in the title and those with “cat”. The thing is, this particular variable has been measured by aggregating individuals’ ratings of the movie using a mean (the related but different cases here are the individuals who rated the movies). You can see why this may be an issue when you look at the variable Number of ratings, which again is an aggregate measure (a count) – some of these movies have received fewer than 200 ratings while others are in the hundreds of thousands. We also can’t see what the distribution of these individual ratings for each movie looks like to decide whether the mean is telling us something useful about the ratings. [For some more really interesting discussion of using online movie ratings, check out this fivethirtyeight article.]
The variable Average letters per word has been measured directly from each case, using the characteristics of the movie title. There are still some potential issues with using the variable Average letters per word as a measure of, let’s say, complexity of words used in the movie title, since the mean is being used, but at least in this case students can see the movie title.
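To see why an aggregated Average rating can hide what the individual raters did, here is a small sketch in Python with completely made-up ratings for three hypothetical movies. All three end up with the same mean rating, even though the number of ratings and their distributions are very different.

```python
from collections import Counter
from statistics import mean


def summarise(ratings):
    """Return the mean, the number of ratings and the count of each rating value."""
    return {
        "mean": mean(ratings),
        "n": len(ratings),
        "counts": dict(sorted(Counter(ratings).items())),
    }


# Made-up individual ratings (out of 10) for three hypothetical movies
middling = [5, 6, 5, 6, 5, 6, 5, 6]        # everyone thinks it is OK
love_or_hate = [1, 10, 1, 10, 1, 10, 1, 10]  # people love it or hate it
barely_rated = [5, 6]                        # only two people rated it

for name, ratings in [("middling", middling),
                      ("love_or_hate", love_or_hate),
                      ("barely_rated", barely_rated)]:
    print(name, summarise(ratings))
```

If only the mean is recorded for each movie, these three cases look identical in the dataset.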
For this dataset, each row is a different department, so the cases are the departments. Each column provides a different variable for each department. Gender was estimated based on the information provided in the directory and the data may be inaccurate for this reason. The % of PhD candidates that are female looks like a nice numerical variable to use, for example, to compare gender rates between these departments from the Arts and Science faculties. Generally with numerical variables we would use the mean or median as a measure of central tendency. But this variable was measured by aggregating information about each PhD candidate in that department and presenting this measure as a percentage (the related but different cases here are the PhD candidates). Just think about it, does it really make sense to make a statement like: The mean % of PhD candidates that are female for these departments of the Arts faculty is 73% whereas the mean % of PhD candidates that are female for these departments of the Science faculty is 44%, especially when the numbers of PhD candidates varies so much between departments?
Looking at the individual percentages is interesting to see how they vary across departments, but combining them to get an overall measure for each faculty should involve calculating another percentage using the original counts for PhD candidates for each department (e.g. group by faculty). If I want to compare gender rates between the Arts and Science faculties for PhD candidates, I would calculate the proportion of all PhD candidates across these departments that are female for each faculty e.g. 58% of the PhD candidates from these departments of the Arts faculty are female, 53% of the PhD candidates from these departments of the Science faculty are female.
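Here is a small worked sketch in Python of the difference between averaging the department percentages and calculating one percentage from the original counts. The three departments and their counts are made up (they are not the data from the table discussed above), chosen so that the department sizes vary a lot.

```python
from statistics import mean

# Hypothetical departments in one faculty: (female PhD candidates, all PhD candidates)
departments = {
    "Dept A": (9, 10),
    "Dept B": (3, 5),
    "Dept C": (40, 100),
}

# Percentage female within each department
percentages = {name: 100 * female / total
               for name, (female, total) in departments.items()}

# Option 1: the mean of the department percentages (treats every department equally)
mean_of_percentages = mean(percentages.values())

# Option 2: one percentage from the pooled counts (treats every candidate equally)
total_female = sum(female for female, _ in departments.values())
total_candidates = sum(total for _, total in departments.values())
pooled_percentage = 100 * total_female / total_candidates

print(f"Mean of department percentages: {mean_of_percentages:.1f}%")
print(f"Percentage of all candidates:   {pooled_percentage:.1f}%")
```

The small departments pull the mean of the percentages well above the pooled percentage, which is why the two approaches can tell quite different stories about the same faculty.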
This post was not supposed to deter you from finding and creating your own real datasets! But we do need to think carefully about the data that we provide to students, especially our high school students. Not all datasets are the same and while I’ve seen some really cool and interesting ideas out there for finding/collecting data for investigations, some of these ideas unintentionally produce data that makes it very difficult for students to engage with the core question: What can and can’t I say about the population(s) based on the random sample data?
In the next post, I’ll discuss some examples of finding real data online. Until I find time to write this next post, check out these existing data finding posts:
Last night, I saw a tweet announcing that Google had made data available on over 50 million drawings from the game Quick, Draw! I had never played the game before, but it is pretty cool. The idea behind the game is whether a neural network can learn to recognize doodling – watch the video below for more about this (with an example about cats of course!)
For each game, you are challenged to draw a certain object within 20 secs, and you get to see if the algorithm can classify your drawing correctly or not. See my attempt below to draw a trumpet, and the neural network correctly identifying that I was drawing a trumpet.
Since I am clearly obsessed with cats at the moment, I went straight to the drawings of cats. You can see ALL the drawings made for cats (hundreds of thousands) and can see variation in particular features of these drawings. I thought it would be cool to be able to take a random sample from all the drawings for a particular category, so after some coding I set up this page: learning.statistics-is-awesome.org/draw/. I’ve included below each drawing the other data provided in the following order:
the word the user was told to draw
the two letter country code
whether the drawing was correctly classified
number of individual strokes made for the drawing
[Update: There are now more variables available – see this post for more details]
So, on average, how many whiskers do Quick, Draw! players draw on their cats?
So, turns out it’s a fairly safe bet that the mean number of whiskers per cat drawing made by Quick, Draw! players is somewhere between 2.2 and 3.5 whiskers. Of course, these are the drawings that have been moderated (I’m assuming for appropriateness/decency). When you look at the drawings, with that 20 second limit on drawing time, you can see that many players went for other features of cats like their ears, possibly running out of time to draw the whiskers. In that respect, it would be interesting to see if there is something going on with whether the drawing was correctly classified as being a cat or not – are whiskers a defining feature of cat drawings?
I reckon there are a tonne of cool things to explore with this dataset, and with the ability to randomly sample from the hundreds and hundreds of thousands of drawings available under each category, a good reason to use statistical inference 🙂 I like that students can develop their own measures based on features of the drawings, based on what they are interested in exploring.
After I published this post, I took a look at the drawings for octopus and then for octagon, a fascinating comparison.
I wonder if players of Quick, Draw! are more likely to draw eight sides for an octagon or eight legs for an octopus? I wonder if the mean number of sides drawn for an octagon is higher than the mean number of legs drawn for an octopus?
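One way to explore the second question is to bootstrap an interval for the difference between the two means. The sketch below is in Python, uses made-up counts for 15 octagon drawings and 15 octopus drawings, and uses the percentile method as one possible choice; the function name is hypothetical.

```python
import random
from statistics import mean


def bootstrap_mean_difference_ci(group_a, group_b, n_resamples=1000, level=0.95, seed=1):
    """Percentile bootstrap interval for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    differences = []
    for _ in range(n_resamples):
        # Resample each group separately, with replacement
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        differences.append(mean(resample_a) - mean(resample_b))
    differences.sort()
    tail = (1 - level) / 2
    lower = differences[int(tail * n_resamples)]
    upper = differences[int((1 - tail) * n_resamples) - 1]
    return lower, upper


# Made-up counts from two hypothetical random samples of drawings
octagon_sides = [8, 8, 7, 6, 8, 9, 8, 7, 8, 6, 8, 8, 10, 7, 8]
octopus_legs = [8, 6, 5, 8, 4, 7, 6, 8, 5, 6, 8, 7, 4, 6, 5]

observed = mean(octagon_sides) - mean(octopus_legs)
lower, upper = bootstrap_mean_difference_ci(octagon_sides, octopus_legs)

print(f"Observed difference in means: {observed:.2f}")
print(f"Bootstrap interval: {lower:.2f} to {upper:.2f}")
```

If the whole interval sits above zero, the sample data supports the claim that the mean number of sides drawn is higher back in the population of drawings.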
In April 2017, I presented an ASA K-12 statistics education webinar: Statistical reasoning with data cards (webinar). Towards the end of the webinar, I encouraged teachers to get students to make their own data cards about their cats. A few days later, I then thought that this could be something to get NZ teachers and students involved with. Imagine a huge collection of real data cards about dogs and cats? Real data that comes from NZ teachers and students? Like Census At School but for pets 🙂 I persuaded a few of my teacher friends to create data cards for their pets (dogs or cats) and to get their students involved, to see whether this project could work. Below is a small selection of the data cards that were initially created (beware of potential cuteness overload!)
The project then expanded to include more teachers and students across NZ, and even the US, and I’ve now decided to keep the data card generator (and collection) page open so that the set of data cards can grow over time. Please use the steps below to get students creating and sharing data cards about their pets.
Creating and sharing data cards about dogs and cats
Inevitably, there will be submissions made that are “fake”, silly or offensive (see below).
Data cards submitted to the project won’t automatically be added to any public sets of data cards, and will be checked first. Just like with any surveying process that is based on self-selection, is internet based and relies on humans to give honest and accurate answers, there is the potential for non-sampling errors. To help reduce the quantity of “fake” data cards, if you are keen to have your students involved with this project it would be great if you could do the following:
1. Talk to your students about the project and explain that the data cards will be shared with other students. They will be sharing information about their pet and need to be OK with this (and don’t have to!). The data will be displayed with a picture of their pet, so participation is not strictly anonymous. All of this is important to discuss with students as we need to educate students about data privacy 🙂
2. When students submit their data, they are given the finished data card which they can save. Set up a system where students need to share the data card they have created with you e.g. by saving into a shared Google drive or Dropbox, or by emailing the data card to you. The advantage for you of setting up this system is that you get your class/school set of data cards to use however you want. The advantage for me is that this level of “watching” might discourage silly data cards being created.
Inspired by Fisher’s Iris data, this sample of flowers was created through simulation from a carefully designed model. From a student’s perspective, these flowers represent a random sample of flowers from a much bigger population of statistics flowers. The idea is that students get all of the 300 cards and need to measure different features of the flowers and determine other variables to create their sample data.
Designed variables are: type of statistics flower (tictastics, stistactis, or castistist), petal colour (red, orange, blue, green), number of petals, petal length, petal width and stigma diameter. The diagram below shows how the measurements should be taken by students:
I have made the sample size 300 to allow for categorical and distributional exploration e.g. What proportion of all statistics flowers have a black stigma? Does stigma colour appear to be linked to petal colour for statistics flowers? How could the number of petals for statistics flowers be distributed? But I appreciate that it would take a long time for students to measure 300 different flowers and record necessary data! Perhaps students could look at the flowers visually first, sort them by type of flower and see if they can detect any features that appear to differ (e.g. colour, petal length, etc.). Students could then measure some of the flowers and chuck this data into a graph for an initial view before being given access to the digital sample to do some more exploring. Remember these data cards represent a sample and the true population parameters, for example the mean petal length of all statistics flowers, are unknown to you and the students. It is not intended that these cards are used for “population bags”.
The data for each runner entered in the Auckland Marathon 2015 was obtained from https://www.aucklandmarathon.co.nz/. This data is owned by the organisers of the Auckland Marathon and cannot be used for commercial purposes unless by prior written permission from the organisers.
For each runner, the following was recorded:
time in hours (this is blank if the runner did not compete in the race)
place (this is blank if the runner did not compete in the race)
distance in km (this is blank if the runner did not compete in the race)
mean pace km per hr (this is blank if the runner did not compete in the race)
NB: This data set contains information about the five different races which are part of the Auckland Marathon 2015. It may be necessary to focus on just one of these races for a meaningful investigation, for example if comparing running times for male and female runners (whether as part of a sample-to-population inference or as part of exploring the population data).
The data for each player in the Rugby World Cup 2015 was obtained from http://www.rugbyworldcup.com/. This data is owned by the Rugby World Cup Ltd (RWC) and cannot be used for commercial purposes unless by prior written permission from the RWC.
NB: This data set should be used with care for sample-to-population inference involving comparison, as both categorical variables (team and position) involve a large number of outcomes (16 teams and 11 positions). This means it is not likely that a random sample of 80 players from the population of Rugby World Cup 2015 players, for example, will contain sufficient numbers of players in any two groups for comparison e.g. England vs New Zealand OR forwards vs backs. If you use all the data for NZ and all the data for England to compare the age of players, for example, you will have used all of the data for this population and so there is no need to “make a call” about what is going on “back in the population” 🙂
My advice would be to use this data set for either single variable sampling investigations OR exploratory data analysis for the entire population. There is also something interesting in using the time variable (debut) to explore other variables 🙂
This population of stick people was created using data from the Census at School 2015 database. For the data cards, rather than put/indicate gender on the card I have used a fictional name, taken from the names of children entered in the 2015 Auckland kids marathon. The relevant questions from the Census at School 2015 survey are Q1, Q2, Q17, Q27 cellphone, facebook, snapchat, Q31 TV, and Q32 reading (the questions can be found here). The diagram below shows what each part of the data card represents:
For some great teaching notes for using data cards, check out Pip Arnold’s resources on Census at School, here are a couple: ID cards | Using data cards. I also used these data cards in a workshop on data challenges which you can read more about here.
Here is the population data set as a CSV file for teacher reference: CAS2015_edited