## Cat and whisker plots – sampling from the Quick, Draw! dataset

Last night, I saw a tweet announcing that Google had made data available on over 50 million drawings from the game Quick, Draw! I had never played the game before, but it is pretty cool. The idea behind the game is to see whether a neural network can learn to recognize doodles. Watch the video below for more about this (with an example about cats, of course!).

For each game, you are challenged to draw a certain object within 20 seconds, and you get to see whether the algorithm can classify your drawing correctly. See below my attempt to draw a trumpet, with the neural network correctly identifying that I was drawing a trumpet.

Since I am clearly obsessed with cats at the moment, I went straight to the drawings of cats. You can see ALL the drawings made for cats (hundreds of thousands) and can see variation in particular features of these drawings. I thought it would be cool to be able to take a random sample from all the drawings for a particular category, so after some coding I set up this page: learning.statistics-is-awesome.org/draw/. Below each drawing, I’ve included the other data provided, in the following order:

• the word the user was told to draw
• the two letter country code
• the timestamp
• whether the drawing was correctly classified
• number of individual strokes made for the drawing

[Update: There are now more variables available – see this post for more details]

So, on average, how many whiskers do Quick, Draw! players draw on their cats?

I took a random sample of 50 drawings from those under the cat category using the sampling tool on learning.statistics-is-awesome.org/draw/. Below are the drawings selected 🙂
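For anyone who wants to try this kind of sampling outside the web tool, here is a minimal Python sketch. It is not how the sampling page itself works. It assumes the drawings for a category are available as NDJSON lines (one JSON record per line) with field names such as `key_id`, `word`, `countrycode`, `recognized` and `drawing`, and the records built here are made up purely for illustration.

```python
import json
import random

def sample_drawings(ndjson_lines, k, seed=None):
    """Return k records chosen at random, without replacement, from NDJSON lines."""
    records = [json.loads(line) for line in ndjson_lines if line.strip()]
    return random.Random(seed).sample(records, k)

def stroke_count(record):
    """Number of individual strokes: one list entry per stroke (assumed layout)."""
    return len(record["drawing"])

# Made-up records standing in for a real category file (hypothetical data).
LINES = [
    json.dumps({
        "key_id": str(i),
        "word": "cat",
        "countrycode": "NZ",
        "recognized": i % 2 == 0,
        "drawing": [[[0, 10], [0, 10]]] * (1 + i % 4),
    })
    for i in range(10)
]

for record in sample_drawings(LINES, 3, seed=1):
    print(record["key_id"], record["recognized"], stroke_count(record))
```

Setting `seed` makes a sample reproducible, which is handy if students need to come back to the same set of drawings later.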

Counting how many individual whiskers were drawn was not super easy but, according to my interpretation of the drawings, here is my sample data (csv file). Using the awesome iNZight VIT bootstrapping module (and the handy option to add the csv file directly to the URL e.g. https://www.stat.auckland.ac.nz/~wild/VITonline/bootstrap/bootstrap.html?file=http://learning.statistics-is-awesome.org/draw/cat-and-whisker-plots.csv), I constructed a bootstrap confidence interval for the mean number of whiskers on cat drawings made by Quick, Draw! players.

So, turns out it’s a fairly safe bet that the mean number of whiskers per cat drawing made by Quick, Draw! players is somewhere between 2.2 and 3.5 whiskers. Of course, these are the drawings that have been moderated (I’m assuming for appropriateness/decency). When you look at the drawings, with that 20 second limit on drawing time, you can see that many players went for other features of cats like their ears, possibly running out of time to draw the whiskers. In that respect, it would be interesting to see if there is something going on with whether the drawing was correctly classified as being a cat or not – are whiskers a defining feature of cat drawings?
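To show roughly what sits behind an interval like this, here is a minimal sketch of a percentile bootstrap for a mean in Python. It may differ in detail from what the VIT module does, and the whisker counts below are made up for illustration (they are not my sample data): resample the sample with replacement many times, record each resample mean, and take the middle 95% of those means.

```python
import random
import statistics

def bootstrap_mean_ci(values, n_resamples=1000, level=0.95, seed=None):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Each resample has the same size as the original sample, drawn with replacement.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lower = means[int(tail * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - tail) * n_resamples))]
    return lower, upper

# Hypothetical whisker counts for ten cat drawings (not the real sample).
whiskers = [0, 0, 2, 4, 6, 3, 2, 0, 4, 6]
low, high = bootstrap_mean_ci(whiskers, seed=1)
print(f"sample mean = {statistics.fmean(whiskers):.1f}")
print(f"bootstrap interval for the mean: {low:.1f} to {high:.1f}")
```

With only ten made-up values the interval is wide; a sample of 50 drawings, as used above, gives a narrower one.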

I reckon there are a tonne of cool things to explore with this dataset, and the ability to randomly sample from the hundreds of thousands of drawings available under each category gives a good reason to use statistical inference 🙂 I like that students can develop their own measures from features of the drawings, based on what they are interested in exploring.

After I published this post, I took a look at the drawings for octopus and then for octagon, a fascinating comparison.

I wonder if players of Quick, Draw! are more likely to draw eight sides for an octagon or eight legs for an octopus? I wonder if the mean number of sides drawn for an octagon is higher than the mean number of legs drawn for an octopus?

## Auckland Marathon 2015 runners (population data)

The data for each runner entered in the Auckland Marathon 2015 was obtained from https://www.aucklandmarathon.co.nz/. This data is owned by the organisers of the Auckland Marathon and cannot be used for commercial purposes without prior written permission from the organisers.

For each runner, the following was recorded:

• bib number
• name
• time in hours (this is blank if the runner did not compete in the race)
• place (this is blank if the runner did not compete in the race)
• gender
• division
• age division
• distance in km (this is blank if the runner did not compete in the race)
• mean pace km per hr (this is blank if the runner did not compete in the race)

NB: This data set contains information about the five different races which are part of the Auckland Marathon 2015. It may be necessary to focus on just one of these races for a meaningful investigation, for example if comparing running times for male and female runners (whether as part of a sample-to-population inference or as part of exploring the population data).
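As a rough sketch of focusing on just one race, the Python below keeps only runners who completed a given distance and groups their times by gender. The column names (`bib`, `gender`, `distance_km`, `time_hours`) and the rows are assumptions made up for illustration, so check them against the headers in the actual CSV file before reusing this.

```python
import csv
import io
import statistics
from collections import defaultdict

# Made-up rows with assumed column names (not taken from the real file).
SAMPLE_CSV = """bib,gender,distance_km,time_hours
101,F,42.2,4.10
102,M,42.2,3.85
103,F,21.1,2.05
104,M,21.1,1.90
105,F,,
106,M,42.2,4.40
"""

def times_by_gender(csv_text, distance_km):
    """Group finishing times by gender for one race distance."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if not row["time_hours"]:
            continue  # blank time: the runner did not compete
        if float(row["distance_km"]) == distance_km:
            groups[row["gender"]].append(float(row["time_hours"]))
    return dict(groups)

marathon = times_by_gender(SAMPLE_CSV, 42.2)
medians = {gender: statistics.median(times) for gender, times in marathon.items()}
print(medians)
```

Skipping the blank rows first matters, because runners who did not compete have no distance or time recorded.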

Here is the population data set as a CSV file: all_races_auckland_marathon_2015_final

## Rugby World Cup 2015 players (population data)

The data for each player in the Rugby World Cup 2015 was obtained from http://www.rugbyworldcup.com/. This data is owned by Rugby World Cup Ltd (RWC) and cannot be used for commercial purposes without prior written permission from the RWC.

Thanks to @cushlat for the idea 🙂

For each player, the following was recorded:

• team played for (team)
• name (name)
• number of international matches played (caps)
• position (position)
• number of years since debuted (years_since_debut)
• date of debut (debut)
• age at Rugby World Cup 2015 (age)
• age minus years_since_debut (approx_age_debuted)
• height in cm (height_cm)
• weight in kg (weight_kg)

NB: This data set should be used with care for sample-to-population inference involving comparison, as both categorical variables (team and position) involve a large number of outcomes (16 teams and 11 positions). This means it is not likely that a random sample of 80 players from the population of Rugby World Cup 2015 players, for example, will contain sufficient numbers of players in any two groups for comparison e.g. England vs New Zealand OR forwards vs backs. If you use all the data for NZ and all the data for England to compare the age of players, for example, you will have used all of the data for this population and so there is no need to “make a call” about what is going on “back in the population” 🙂
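To get a feel for why a sample of 80 spreads so thinly across teams, here is a small Python simulation. It is only a sketch: it takes the 16 teams mentioned above and assumes every squad is the same size (the `squad_size=31` default is an assumption, not a figure from the data set). Under those assumptions each team is expected to contribute about 80 / 16 = 5 players to the sample.

```python
import random
from collections import Counter

def team_counts_in_sample(n_teams=16, squad_size=31, sample_size=80, seed=None):
    """Count players per team in one simple random sample of the population."""
    # Teams are labelled 0..n_teams-1; each appears squad_size times.
    population = [team for team in range(n_teams) for _ in range(squad_size)]
    sample = random.Random(seed).sample(population, sample_size)
    return Counter(sample)

counts = team_counts_in_sample(seed=2015)
expected_per_team = 80 / 16  # 5.0 players per team on average

print("expected per team:", expected_per_team)
print("smallest and largest team counts in this sample:",
      min(counts.values()), max(counts.values()))
```

Running this with a few different seeds shows how much the counts for any two teams bounce around, which is the problem with comparing, say, England and New Zealand from a sample of this size.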

My advice would be to use this data set for either single variable sampling investigations OR exploratory data analysis for the entire population. There is also something interesting in using the time variable (debut) to explore other variables 🙂

Here is the population data set as a CSV file: rubgy_world_cup_2015