Want to make some awesome gift tags/labels for Christmas or holiday-related presents? Here’s a fun little statistical art project. Write whatever words you want in the app below, create some secret snowflakes (the secret part being no one else will know what words you used unless of course you choose to display them), play around with colours if you want (uncheck the option to use random colours), freeze the snowflakes when you get something you like, download your masterpiece and use in some way.

Oh yeah, the snowflakes are made by rotating each letter in the words in a magical statistical way (i.e. randomness).

To make our gift labels, I made the first colour white (the background #ffffff), made the other two colours black (#000000), and then printed on to adhesive sticker paper I had left over from our wedding.

Enjoy and have a great holiday break!

**Secret snowflakes app should be shown below** (otherwise here is the link) – works best using a Chrome browser

For many high school teachers here in New Zealand, the teaching year is over and it’s now a six-week summer break before school starts again next year. Despite the well-deserved break, some teachers are already thinking about ideas for next year. I’ve been amazed (and inspired) by the teachers who have signed up to spend a day with Liza and me on Friday 15th December to learn more about working with modern data (more details here). We are both really looking forward to the full-day workshop. One of the tools we’ll be working with at the workshop is the platform IFTTT (If This Then That). It’s basically a way to connect devices and online accounts using APIs (application programming interfaces) without writing code.

I used IFTTT recently to collect data on New York Times articles. One of the reasons why I started collecting data on New York Times articles was because of their free, online feature “*What’s Going On in This Graph?*”. On Tuesday, December 12 and every second Tuesday of the month through the US school year, *The New York Times Learning Network*, in partnership with the American Statistical Association, hosts a live online discussion about a timely graph like the one shown below.

Students from around the world “read” the graph by posting comments about what they notice and wonder in an online forum. Teachers live-moderate by responding to the comments in real time and encouraging students to go deeper. All releases are archived so that teachers can use previous graphs anytime (read this introductory post to learn more). I used “*What’s Going On in This Graph?*” when I was teaching our Lies, Damned lies and Statistics course, and it is such an awesome resource for helping build statistical literacy and thinking.

So, inspired by the New York Times graphs, about two months ago I created an “applet” on IFTTT that creates a new row in a Google spreadsheet every time a new article is posted to the New York Times website. It stopped working for some reason at the end of November – check out the “raw” data here: https://docs.google.com/spreadsheets/d/1PXGh0xBrJbmrfWq3nRylH5GBqzVd4SYWWiXQj3v9tdQ/edit?usp=sharing

So what’s going on with the data I collected? Your first thought on viewing the data might be – huh? You call this data? The only variable that is “graph ready” is the section in which each of the nearly 6000 articles was published. But there are so many variables in data sets just like this one waiting to be defined and explored. After our workshop on Friday, I’ll post an “after” version of this same data set.

This post is the second in a series of posts where I’m going to share some strategies for getting real data to use for statistical investigations that require sample to population inference. As I write them, you will be able to find them all on this page.

**What’s your favourite board game?**

I read an article posted on fivethirtyeight about the worst board games ever invented and it got me thinking about the board games I like to play. The *Game of Life* has a low average rating on the online database of games referred to in this article but I remember kind of enjoying playing it as a kid. boardgamegeek.com features user-submitted information about hundreds of thousands of games (not just board games) and is constantly being updated. While there are some data sets out there that already feature data from this website (e.g. from kaggle datasets), I am purposely demonstrating a non-programming approach to getting this data that maximises the participation of teachers and students in the data collection process.

To end up with data that can be used as part of a sample to population inference task:

- *You need a clearly defined and nameable population* (in this case, all board games listed on boardgamegeek.com)
- *You need a sampling frame that is a very close match to your population.*
- *You need to select from your sampling frame using a random sampling method to obtain the members of your sample.*
- *You need to define and measure variables from each member of the sample/population so the resulting data is multivariate.*

boardgamegeek.com actually provides a link that you can use to select one of the games on their site at random (https://boardgamegeek.com/boardgame/random), so using this “random” link (hopefully) takes care of (2) and (3). For (4), there are so many potential variables that could be defined and measured. To decide on what variables to measure, I spent some time exploring the content of the webpages for a few different games to get a feel for what might make for good variables. I decided to stick to variables that are measured directly for each game, rather than ones that were based on user polls, and went with these variables:

- Millennium the game was released (1000, 2000, all others)
- Number of words in game title
- Minimum number of players
- Maximum number of players
- Playing time in minutes (if a range was provided, the average of the limits was used)
- Minimum age in years
- Game type (strategy or war, family or children’s, other)
- Game available in multiple languages (yes or no)

**Time to play!**

I’ve set up a Google form with instructions for how you can help create a random sample of games from boardgamegeek.com at this link: https://goo.gl/forms/8yBqryGTzrZGhEVx2. As people play along, the sample data will be added here: https://docs.google.com/spreadsheets/d/e/2PACX-1vSzR_VSVzaaeWpCvYbAQCUewaM3Tad2zfTBO7AWuDgFFTj5Jaq2TBo6N-gQGCe5e5t_qKW7Knuq6-pr/pub?gid=552938859&single=true&output=csv. The URL to the game is included so that the data can be checked. Feel free to copy and adapt however you want, but do keep in mind the nature of the variables you use. In particular, be very careful about using any of the aggregate ratings measures (another great article by fivethirtyeight about movie ratings explains some of the reasons why).

**Bonus round**

I wrote a post recently – Just Google it – which featured real data distributions. boardgamegeek.com also provides simple graphs of the ratings for each game, so we can play a similar matching game. You could also try estimating the mean and standard deviation of the ratings from the graph, with the added game feature of reverse ordering!

Which games do you think match which ratings graphs?

- Monopoly
- The Lord of the Rings: The Card Game
- Risk
- Tic-tac-toe

I couldn’t find a game that had a clear bi-modal distribution for its ratings but I reckon there must be games out there that people either love or hate. Let me know if you find one! To get students familiar with boardgamegeek.com, you could ask them to first search for their favourite game and then explore what information and ratings have been provided for this on the site. Let the games begin!

Here’s a really quick idea for a matching activity, totally building off Pip Arnold’s excellent work on shape.

At the bottom of this post are six “Popular times” graphs generated today by Google when searching for the following places of interest:

- Cafe
- Shopping mall
- Library
- Swimming pool
- Gym
- Supermarket

Can you match which graphs go with which places?

[you can find the answers at the bottom]



This post is first in a series of posts where I’m going to share some strategies for getting real data for real data stories, specifically to use for statistical investigations that require sample to population inference. As I write them, you will be able to find them all on this page.

**Key considerations for finding real data for sample to population inference tasks**

It’s really important that I stress that the approaches I’ll discuss are not necessarily what I would typically use when finding data to explore. Generally, I’d let the data drive the analysis, not the analysis drive the data I try to find. These are specific examples so that the data that is obtained can be used sensibly to perform sample to population inference. It’s also really important to talk about why I’m stressing the above. In NZ we have specific standards that are designed to assess understanding of sample to population inference, using informal and formal methods that have been developed by exploring the behaviour of random samples from populations (AS91035, AS91264, AS91582). So, for the students’ learning about rules of thumb and confidence intervals to make sense, we need to provide students with clearly defined named populations with data that are (or are able to be) randomly sampled from these populations. At high school level at least, these strict conditions are in place so that students can focus on one central question: *What can and can’t I say about the population(s) based on the random sample data?*

For all the examples I’ll cover in this series of posts, there are four key considerations/requirements:

- You need a clearly defined and nameable population. Ideally this should be as simple and clear as possible to help students out but to ensure (2) the “name” can end up being quite specific.
- You need a sampling frame that is a very close match to your population. This means you need a way to access every member of your population to measure stuff about them (variables). Sure, this is not the reality of what happens in the real world in terms of sampling, but remember what I said earlier about what was important.
- You need to select from your sampling frame using a random sampling method to obtain the members of your sample. It is sufficient (and recommended) to stick to simple random sampling. In some cases, you may be able to make an assumption that what you have can be considered a random sample, but I’d prefer to avoid these kinds of situations where possible at high school level.
- You need to define and measure variables from each member of the sample/population. We want students working with multivariate data sets, with several options possible for numerical and categorical data (but don’t forget there is the option to create new variables from what was measured).

I’ll try to refer back to these four considerations/requirements when I discuss examples in the posts that will follow.

Just one very relevant NZ NCEA assessment-specific comment before we talk data. For AS91035 and AS91582, the standards state that students are to be provided with the sample multivariate data for the task – so all of (1) (2) (3) and (4) is done by the teacher. Similarly with AS91264, the requirement for the standard is that students select a random sample (3) from a provided population dataset – so (1) (2) and (4) are done by the teacher. This does not mean the students can’t do more in terms of the sampling/collecting processes, just that these are not requirements for the standards and asking students to do more should not limit their ability to complete the task. I’ll try to give some ideas for how to manage any related issues in the examples.

Just one more point. I haven’t made this (5) in the previous section, but something to watch out for is the nature of your “cases”. Tables of data (which we refer to as datasets) that play nicely with statistical software like iNZight are ones where the data is organised so that each row is a case and each column is a variable. Typically at high school level, the datasets we use are ones where each case (e.g. each individual in the defined population) is measured directly to obtain different variables. Things can get a little tricky conceptually when some of the variables for a case are actually measured by grouping/aggregating related but different cases.

For example, if I take five movies from the internet movie database (imdb.com) that have “dog” in the title and another five with “cat” in the title, I could construct a mini dataset like the one below using information from the website:

For this dataset, each row is a different movie, so the cases are the movies. Each column provides a different variable for each movie. The variables *Movie title, Year released, Movie length mins, Average rating, Number of ratings, Number photos* and *Genre* were taken straight from the webpage for each movie. I created the variables *Number words title, Number letters title, Average letters per word, Animal in title, Years since release* and *Millennium*. [Something I won’t tackle in this post is what to do about the Genre variable to make this usable for analysis.]

The *Average rating* variable looks like a nice numerical variable to use, for example, to compare ratings of these movies with “dog” in the title and those with “cat”. The thing is, this particular variable has been measured by aggregating individuals’ ratings of the movie using a mean (the related but different cases here are the individuals who rated the movies). You can see why this may be an issue when you look at the variable *Number of ratings*, which again is an aggregate measure (a count) – some of these movies have received fewer than 200 ratings while others are in the hundreds of thousands. We also can’t see what the distribution of these individual ratings for each movie looks like to decide whether the mean is telling us something useful about the ratings. [For some more really interesting discussion of using online movie ratings, check out this fivethirtyeight article.]

The variable *Average letters per word* has been measured directly from each case, using the characteristics of the movie title. There are still some potential issues with using the variable *Average letters per word* as a measure of, let’s say, complexity of words used in the movie title, since the mean is being used, but at least in this *case* students can see the movie title.

Another example of case awareness can be seen in the mini dataset below, using data on PhD candidates from the University of Auckland online directory:

For this dataset, each row is a different department, so the cases are the departments. Each column provides a different variable for each department. Gender was estimated based on the information provided in the directory and the data may be inaccurate for this reason. The *% of PhD candidates that are female* looks like a nice numerical variable to use, for example, to compare gender rates between these departments from the Arts and Science faculties. Generally with numerical variables we would use the mean or median as a measure of central tendency. But this variable was measured by aggregating information about each PhD candidate in that department and presenting this measure as a percentage (the related but different cases here are the PhD candidates). Just think about it, does it really make sense to make a statement like: *The mean % of PhD candidates that are female for these departments of the Arts faculty is 73% whereas the mean % of PhD candidates that are female for these departments of the Science faculty is 44%,* especially when the numbers of PhD candidates vary so much between departments?

Looking at the individual percentages is interesting to see how they vary across departments, but combining them to get an overall measure for each faculty should involve calculating another percentage using the original counts for PhD candidates for each department (e.g. group by faculty). If I want to compare gender rates between the Arts and Science faculties for PhD candidates, I would calculate the proportion of all PhD candidates across these departments that are female for each faculty e.g. *58% of the PhD candidates from these departments of the Arts faculty are female, 53% of the PhD candidates from these departments of the Science faculty are female*.
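To see why the two calculations can disagree, here is a minimal Python sketch. The department names and counts are made up for illustration (they are not the directory data), and the function names are hypothetical. It compares the mean of the department percentages with the percentage calculated from the pooled counts.

```python
# Made-up counts for illustration only: (department, female candidates, all candidates)
departments = [
    ("Dept A", 9, 10),
    ("Dept B", 30, 60),
    ("Dept C", 4, 5),
]


def mean_of_percentages(rows):
    """Average the department percentages, ignoring how big each department is."""
    percentages = [100 * female / total for _, female, total in rows]
    return sum(percentages) / len(percentages)


def pooled_percentage(rows):
    """Percentage of all candidates that are female, using the original counts."""
    total_female = sum(female for _, female, _ in rows)
    total_candidates = sum(total for _, _, total in rows)
    return 100 * total_female / total_candidates


print(round(mean_of_percentages(departments), 1))
print(round(pooled_percentage(departments), 1))
```

With these made-up counts the small departments pull the mean of the percentages well above the pooled percentage, which is the same kind of gap seen between the two statements above.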

[If you’d like to read more about structuring data in the context of creating a dataset, then check out this excellent post by Rob Gould.]

**Where to next?**

This post was not supposed to deter you from finding and creating your own real datasets! But we do need to think carefully about the data that we provide to students, especially our high school students. Not all datasets are the same and while I’ve seen some really cool and interesting ideas out there for finding/collecting data for investigations, some of these ideas unintentionally produce data that makes it very difficult for students to engage with the core question: *What can and can’t I say about the population(s) based on the random sample data?*

In the next post, I’ll discuss some examples of finding real data online. Until I find time to write this next post, check out these existing data finding posts:

Cat and whisker plots: sampling from the Quick, Draw! dataset

This post provides the notes for the plenary I gave for the Auckland Mathematical Association (AMA) about using images as a source of data for teaching statistical investigations.

You might be disappointed to find out that my talk (and this post) is not about the movie *Pixels*, as my husband initially thought it was. It’s probably a good thing I decided to focus on pixels in terms of data about a computer or digital image, as the box office data about *Pixels* the movie suggests that the movie didn’t perform so well. Instead, for this talk I presented some examples of using images as part of statistical investigations that (hopefully) demonstrated how the different combinations of humans, digital technologies, and modelling can lead to some pretty interesting data. The abstract for the talk is below:

How are photos of cats different from photos of dogs? How could someone determine where you come from based on how you draw a circle? How could the human job of counting cars at an intersection be cheaply replaced by technology? I will share some examples of simple models that I and others have developed to answer these kinds of questions through statistical investigations involving the analysis of both static and dynamic images. We will also discuss how the process of creating these models utilises statistical, mathematical and computational thinking.

As I was using a photo of my cat Elliot to explain the different ways we can use images to collect data, a really funny thing happened (see the embedded tweet below).

When @annafergussonnz is talking about a cat image in her plenary talk “Power of Pixels” #thatisarealcat #mtbos #mathchat pic.twitter.com/WsHoTNCG3Y

– Subash Chandar K (@elsubash) September 1, 2017

Yes, an actual real #statscat appeared in the room! What are the chances of that?

Pixels are the squares of colour that make up computer or digital (raster) images. Each image has a certain number of pixels e.g. an image that is 41 pixels in width and 15 pixels in height contains 615 pixels, which is an obvious link to concepts of area. The 615 pixels are stored in an ordered list, so the computer knows how to display them, and each pixel contains information about colour. Using RGB colour values (other systems exist), each pixel contains information about the amounts of red, green and blue on a scale of 0 to 255 inclusive. To get at the information about the pixels is going to require some knowledge of digital technologies, and so the use of images within statistical investigations can be a nice way to teach objectives from across the different curriculum learning areas.
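As a rough illustration of this structure, here is a small Python sketch. The image is made up (every pixel is the same grey), and storing the pixels row by row from the top-left corner is an assumption about the ordering, as is the helper name `pixel_at`.

```python
# A made-up image: 41 pixels wide and 15 pixels high, every pixel mid-grey.
width, height = 41, 15
grey = (128, 128, 128)  # (red, green, blue), each on a scale of 0 to 255 inclusive

# The pixels as one ordered list, assumed here to run row by row from the top-left.
pixels = [grey] * (width * height)


def pixel_at(pixels, width, x, y):
    """Return the colour of the pixel in column x, row y (both counted from 0)."""
    return pixels[y * width + x]


print(len(pixels))                   # number of pixels = width x height
print(pixel_at(pixels, width, 40, 14))  # colour of the bottom-right pixel
```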

Using images as a source of data can happen on at least three levels. Using the aforementioned photo of my cat Elliot, humans could extract data from the image by focusing on things they can see, for example, that the image is a black and white photo and not in colour, that there are two cats in the photo, and that Elliot does not appear to be smiling. Data that is also available about the image using digital tech includes variables such as the number of pixels, the file type and the file size. Data that can be generated using models related to this image could be identifying the most prominent shade of grey, the likelihood this photo will get more than 100 likes on Instagram and what the photo is of (cat vs dog for example, a popular machine learning task).

**Static images**

The first example used the data, in particular the photos, collected as part of the ongoing data collection project I have running about cats and dogs (the current set of pet data cards can be downloaded here). As humans, we can look at images, notice things that are different and these features can be used to create variables. For example, if you look at some of the photos submitted: some pets are outside while others are inside; some pets are looking at the camera while others are looking away from the camera; and some are “close ups” while others are taken from a distance.

These potential variables are all initially categorical, but by using digital technologies, numerical variables are also possible. To create a measure of whether a photo is a “close up” shot of a pet, the area of the photo that the pet takes up can be measured. This is where pixels are super helpful. I used paint.net, free image editing software, to show that if I trace around the dog in this photo using the lasso tool, the dog makes up about 61 000 pixels. If you compare this figure to the total number of pixels in the image (90 000), you can calculate the percentage the dog makes up of the photo.
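The calculation itself is a single division. A minimal Python sketch using the two pixel counts quoted above (the function name is hypothetical):

```python
def percentage_of_image(subject_pixels, total_pixels):
    """Percentage of the image's pixels that belong to the traced subject."""
    return 100 * subject_pixels / total_pixels


# About 61 000 pixels traced for the dog, out of 90 000 pixels in the whole photo.
print(round(percentage_of_image(61_000, 90_000), 1))  # 67.8
```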

For the current set of pet data cards, each photo now has this percentage displayed. Based on this very small sample of six pets, it kind of looks like maybe cats typically make up a larger percentage of the photo than dogs, but I will leave this up to you to investigate using appropriate statistical modelling.

For a pretty cool example of using static images, humans, digital technologies and models, you should take a look at how-old.net. As humans, we can look at photos of people and estimate their age and compare our estimates to people’s actual ages. What how-old.net has done is used machine learning to train a model to predict someone’s age based on the features of the photo submitted. I asked teachers at the talk to select which of the three photos they thought I looked the youngest in (most said B), which is the same photo that the how-old.net model predicted I looked the youngest in. A good teaching point about the model used by how-old.net is that it does get updated, as new data is used to refine its predictions.

You can also demonstrate how models can be evaluated by comparing what the model predicts to the actual value (if known). Fortunately I have a large number of siblings and so a handy (and frequently used) range of different aged people to test the how-old.net model. Students could use public figures, such as athletes, politicians, media personalities or celebrities, to compare each person’s actual age to what the model predicts (since it’s likely that both photos and ages are available on the internet).

There is also the possibility of setting up an activity around comparing *humans vs models* – for the same set of photos, are humans better at predicting ages than how-old.net? Students could be asked to consider how they could set up this kind of activity, what photos they could use, and how they would decide who was better – humans or models?

**Drawings**

The next example used the set of drawings Google has made available from their Quick, Draw! game and artificial intelligence experiment. I’ve already written a post about this data set, so have a read of that post if you haven’t already. In this talk, I asked teachers to draw a quick sketch of a cat and then asked them to tell me whether they drew just the face, or the body as well (most drew the face and body – I’m not sure if the appearance of an actual cat during the talk influenced this at all!). I also asked them to think about how many times they lifted their pen off the paper. I probably forgot to say this at the time, but for some things humans are pretty good at providing data but for others, digital technologies are better. In the case of drawing and thinking about how many strokes you made while drawing, we would get more accurate data if we could measure this using a mouse, stylus or touchscreen than by asking people to remember.

Using the random sampler tool that I have set up, which allows you to choose one of the objects players have been asked to draw for Quick, Draw!, I generated a random sample of 200 of the drawings made when asked to draw a cat. The data that can be used from each drawing is a combination of what *humans* and *digital technologies* can measure. The drawing itself (similar to the photos of pets in the first example) can be used to create different variables, for example whether the sketch is of the face only, or the face and body. Other variables are also provided, such as the timestamp and country code, both examples of data that is captured from players of the game without them necessarily realising (e.g. digital traces).

After manually reviewing all 200 drawings and recording data about the variables, I used iNZight VIT to construct two bootstrap confidence intervals: one for the proportion of all drawings of cats in the Quick, Draw! dataset that were only of faces, and one for the difference between the mean number of strokes for drawings of cats with bodies and the mean number of strokes for drawings of cats with only faces. Interestingly, while the teachers at the talk mostly drew sketches of cats with bodies, most players of Quick, Draw! only sketch the faces of cats. This could be due to the 20 second time limit enforced when playing the game. It makes sense that, on average, Quick, Draw! players use more strokes to draw cats with bodies than cats with just faces. I wished at the time that I had also recorded information about the other variables provided for each drawing, as it would have been good to further explore how the drawings compare in terms of whether the game correctly identified more of the face-only drawings of cats than the body drawings.
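For readers curious about what a bootstrap confidence interval for a proportion involves, here is a minimal Python sketch of the percentile method. It is not necessarily how iNZight VIT constructs its intervals, the function name and parameters are hypothetical, and the sample is made up (120 face-only drawings out of 200) rather than the data I recorded.

```python
import random


def bootstrap_proportion_ci(sample, n_resamples=1000, level=0.95, seed=1):
    """Percentile bootstrap interval for the proportion of True values in sample."""
    rng = random.Random(seed)
    n = len(sample)
    proportions = []
    for _ in range(n_resamples):
        # Resample with replacement, keeping the same sample size.
        resample = rng.choices(sample, k=n)
        proportions.append(sum(resample) / n)
    proportions.sort()
    # Keep the central `level` share of the resampled proportions.
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return proportions[lower_index], proportions[upper_index]


# Made-up sample for illustration: True means the drawing was of the face only.
sample = [True] * 120 + [False] * 80
low, high = bootstrap_proportion_ci(sample)
print(low, high)
```

The same resampling idea extends to the difference between two means: resample each group with replacement, calculate the difference each time, and take the central 95% of those differences.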

What is also really interesting is the artificial intelligence aspect of the game. The video below explains this pretty well, but basically the model that is used to guess what object is being drawn is trained on what previous players of the game have drawn.

From a maths teacher’s perspective, this is a good example of what can go wrong with technology and modelling. For example, players are asked to draw a square, and because the model is trained on *how* they draw the object, players who draw four lines that are roughly perpendicular behave similarly from the machine’s perspective because the technology is looking for commonalities between the drawings. What the technology is not detecting is that some players do not know what a square is, or think squares and rectangles are the same thing. So the data being used to train the model is biased. The consequence of this bias is that the model will now reinforce players’ misunderstanding that a rectangle is a square by “correctly” predicting they are drawing a square when they draw a rectangle! An interesting investigation I haven’t done yet would be to estimate what percentage of drawings made for squares are rectangles. I would also suggest checking out some of the other “shape” objects to see other examples e.g. octagons.

Using a more complex form of the Google Quick, Draw! dataset, Thu-Huong Ha and Nikhil Sonnad analysed over 100 000 of the drawings made of circles to show how language and culture influence sketches. For example, they found that 86% of the circles drawn by players in the US were drawn counter clockwise, while 80% of the circles drawn by players in Japan were drawn clockwise. To me, this is really fascinating stuff, and a really cool example of how using images as a source of data can result in really meaningful investigations about the world.

**Animation**

The last example I used was about using videos as a source of data for probability distribution modelling activities. I’ve presented some workshops before where I used a video (traffic.mp4) from a live streaming traffic camera positioned above a section of the motorway in Wellington. Focusing on the lane of traffic closest to the front of the screen, I got teachers to count how many cars arrived at a fixed point in that lane every five seconds. This gave us a nice set of data which we could then use to test the suitability of a Poisson distribution as a model.
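One informal way to check the fit is to compare the observed proportion of intervals with 0, 1, 2, ... cars to the probabilities from a Poisson distribution whose rate is the sample mean. The Python sketch below does this with made-up counts (not the counts from the workshops); the function names are hypothetical.

```python
import math
from collections import Counter


def poisson_pmf(k, rate):
    """Probability of exactly k arrivals under a Poisson model with the given rate."""
    return math.exp(-rate) * rate**k / math.factorial(k)


def compare_to_poisson(counts):
    """Return the sample mean and rows of (k, observed proportion, Poisson probability)."""
    n = len(counts)
    rate = sum(counts) / n  # the sample mean estimates the Poisson rate
    observed = Counter(counts)
    rows = [
        (k, observed.get(k, 0) / n, poisson_pmf(k, rate))
        for k in range(max(counts) + 1)
    ]
    return rate, rows


# Made-up counts of cars per five-second interval, for illustration only.
counts = [0, 1, 2, 1, 0, 3, 1, 2, 1, 0, 2, 1]
rate, rows = compare_to_poisson(counts)
for k, observed_proportion, model_probability in rows:
    print(k, round(observed_proportion, 3), round(model_probability, 3))
```

Large gaps between the two columns of proportions would suggest the Poisson model is not a good description of the arrivals.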

For this talk, I wanted to demonstrate how humans could be replaced (potentially) by digital technologies and models. Since the video is a collection of images shown quickly (around 50 frames per second), we can use pixels, or potentially just a single pixel, in the images to measure various attributes of the cars. About a year ago, I set myself the challenge of exploring whether it would be possible to glean information about car counts, car colours etc. and shared my progress with this personal project at the end of the talk.

So, yes there does exist pretty fancy video analysis software out there that I could *use* to extract the data I want, but I wanted to investigate whether I could use a combination of *statistical, mathematical and computational* thinking to *create* my own model to generate the data. As part of my PhD, I’m interested in finding out what activities could help introduce students to the modern art and science of learning from data, and what is nice about this example is that the idea of how the model could count how many cars are arriving every five seconds at a fixed point on the motorway is actually pretty simple and so potentially a good entry point for students.

The basic idea behind the model is that when there are no cars at the point on the motorway, the pixel I am tracking is a certain colour. This colour becomes my reference colour for the model. Using the RGB colour system, for each frame/image in the traffic video, I can compare the current colour of the pixel *e.g. rgb(100, 250, 141)* to the reference colour *e.g. rgb(162, 158, 162)*. As soon as the colour changes from the reference colour, I can infer this means a car has arrived at the point on the motorway. And as soon as the colour changes back to the reference colour, I can infer that the car has left the point on the motorway. While the car is moving past the point, I can also collect data on the colour of the pixel from each frame, and use this to determine the colour of the car.
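The idea can be sketched in a few lines of Python. This is not the CODAP model itself: the frame colours are made up, the function names are hypothetical, and the `tolerance` parameter is an added assumption so that small flickers in the pixel colour are not counted as cars.

```python
def colour_distance(a, b):
    """Largest difference across the red, green and blue channels."""
    return max(abs(x - y) for x, y in zip(a, b))


def count_cars(pixel_colours, reference, tolerance=30):
    """Count how many times the tracked pixel changes away from the reference colour."""
    cars = 0
    car_present = False
    for colour in pixel_colours:
        differs = colour_distance(colour, reference) > tolerance
        if differs and not car_present:
            cars += 1  # the colour has just changed, so a car has arrived
        car_present = differs  # back at the reference colour means the car has left
    return cars


reference = (162, 158, 162)
# Made-up colours of the tracked pixel in seven consecutive frames.
frames = [
    (162, 158, 162), (160, 159, 161),  # empty road
    (100, 250, 141), (102, 248, 140),  # first car passing
    (162, 158, 162),                   # empty road again
    (30, 30, 30),                      # second car passing
    (161, 158, 163),                   # empty road again
]
print(count_cars(frames, reference))  # 2
```

The colours collected while `car_present` is true are the ones that could be combined to estimate the colour of each car.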

I’m still working on the model (in that I haven’t actually modified it since I first played around with the idea last year) and the video below shows where my project within CODAP (Common Online Data Analysis Platform) is currently at. When I get some time, I will share the link to this CODAP data interactive so you and your students can play around with choosing different pixels to track and changing other parameters of the model I’ve developed.

You might notice by watching this video that the model needs some work. The colours being recorded for each car are not always that good (average colour is an interesting concept in itself, and I’ve learned a lot more about how to work with colour since I developed the model) and some cars end up being recorded twice or not at all. But now that I’ve developed an initial model to count the cars that arrive every five seconds, I can compare the data generated from the model to the data generated by humans to see how well my model performed.

You can see at the moment that the data looks very different when comparing what the humans counted and what the digital tech + model counted. So maybe the job of traffic counter (my job during university!) is still safe – for now.

**Going crackers**

I didn’t get time in the talk to show an example of a statistical investigation that used images (photos of animal crackers or biscuits) to create an informal prediction model. I’ll write about this in another post soon – watch this space!

Last night, I saw a tweet announcing that Google had made data available on over 50 million drawings from the game Quick, Draw! I had never played the game before, but it is pretty cool. The idea behind the game is whether a neural network can learn to recognize doodling – watch the video below for more about this (with an example about cats of course!)

For each game, you are challenged to draw a certain object within 20 secs, and you get to see if the algorithm can classify your drawing correctly or not. See my attempt below to draw a trumpet, and the neural network correctly identifying that I was drawing a trumpet.

Since I am clearly obsessed with cats at the moment, I went straight to the drawings of cats. You can see ALL the drawings made for cats (hundreds of thousands) and can see *variation* in particular features of these drawings. I thought it would be cool to be able to take a random sample from all the drawings for a particular category, so after some coding I set up this page: learning.statistics-is-awesome.org/draw/. I’ve included below each drawing the other data provided in the following order:

- the word the user was told to draw
- the two letter country code
- the timestamp
- whether the drawing was correctly classified
- number of individual strokes made for the drawing
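As a rough sketch of how these values could be pulled out of the data and randomly sampled, the Python below assumes the newline-delimited JSON layout that Google published for Quick, Draw! (one drawing per line, with keys `word`, `countrycode`, `timestamp`, `recognized` and `drawing`, where `drawing` is a list of strokes). The function names `summarise` and `sample_drawings` are made up for this example, the two records are invented rather than taken from the dataset, and this is not the code behind the sampling page.

```python
import json
import random


def summarise(record):
    """Keep the fields listed above, counting strokes from the drawing."""
    return {
        "word": record["word"],
        "countrycode": record["countrycode"],
        "timestamp": record["timestamp"],
        "recognized": record["recognized"],
        "strokes": len(record["drawing"]),  # one entry per stroke
    }


def sample_drawings(ndjson_lines, n, seed=None):
    """Take a simple random sample (without replacement) of n drawings."""
    records = [json.loads(line) for line in ndjson_lines if line.strip()]
    rng = random.Random(seed)
    chosen = rng.sample(records, min(n, len(records)))
    return [summarise(record) for record in chosen]


# Two invented records, laid out like lines of a category file.
lines = [
    '{"word": "cat", "countrycode": "NZ", "timestamp": "2017-03-01 20:41:36 UTC",'
    ' "recognized": true, "drawing": [[[0, 5], [0, 5]], [[1, 2], [3, 4]]]}',
    '{"word": "cat", "countrycode": "US", "timestamp": "2017-03-02 08:15:02 UTC",'
    ' "recognized": false, "drawing": [[[0, 9], [0, 9]]]}',
]
for card in sample_drawings(lines, 2, seed=1):
    print(card)
```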

**So, on average, how many whiskers do Quick, Draw! players draw on their cats?**

I took a random sample of 50 drawings from those under the cat category using the sampling tool on learning.statistics-is-awesome.org/draw/. Below are the drawings selected.

Counting how many individual whiskers were drawn was not super easy but, according to my interpretation of the drawings, here is my sample data (csv file). Using the awesome iNZight VIT bootstrapping module (and the handy option to add the csv file directly to the URL e.g. https://www.stat.auckland.ac.nz/~wild/VITonline/bootstrap/bootstrap.html?file=http://learning.statistics-is-awesome.org/draw/cat-and-whisker-plots.csv), I constructed a bootstrap confidence interval for the mean number of whiskers on cat drawings made by Quick, Draw! players.
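For anyone curious about what a bootstrap confidence interval involves, here is a minimal Python sketch of one common version, the percentile bootstrap: resample the sample with replacement many times, record the mean each time, and take the middle 95% of those means. This is an illustration only, not a description of how the VIT module is implemented. The function name `bootstrap_mean_interval` is hypothetical and the whisker counts are made up (they are not my sample data).

```python
import random
from statistics import mean


def bootstrap_mean_interval(sample, n_resamples=1000, level=0.95, seed=None):
    """Approximate percentile bootstrap interval for the mean."""
    rng = random.Random(seed)
    # Each resample is the same size as the sample, drawn with replacement.
    means = sorted(
        mean(rng.choices(sample, k=len(sample))) for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lower = means[int(tail * n_resamples)]
    upper = means[int((1 - tail) * n_resamples) - 1]
    return lower, upper


# Made-up whisker counts for 20 imaginary cat drawings.
whiskers = [0, 0, 0, 0, 2, 2, 2, 3, 4, 4, 4, 4, 4, 4, 6, 6, 6, 6, 6, 8]
low, high = bootstrap_mean_interval(whiskers, seed=2017)
print(f"Sample mean: {mean(whiskers):.2f}")
print(f"Bootstrap interval for the mean: {low:.2f} to {high:.2f}")
```

The `seed` argument only makes the example repeatable. Leaving it out gives a slightly different interval on every run, which is a nice discussion point in itself.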

So, turns out it’s a fairly safe bet that the mean number of whiskers per cat drawing made by Quick, Draw! players is somewhere between 2.2 and 3.5 whiskers. Of course, these are the drawings that have been moderated (I’m assuming for appropriateness/decency). When you look at the drawings, with that 20 second limit on drawing time, you can see that many players went for other features of cats like their ears, possibly running out of time to draw the whiskers. In that respect, it would be interesting to see if there is something going on with whether the drawing was correctly classified as being a cat or not – are whiskers a defining feature of cat drawings?

I reckon there are a tonne of cool things to explore with this dataset, and with the ability to randomly sample from the hundreds and hundreds of thousands of drawings available under each category, a good reason to use statistical inference. I like that students can develop their own measures based on features of the drawings and on what they are interested in exploring.

After I published this post, I took a look at the drawings for octopus and then for octagon, a fascinating comparison.

I wonder if players of Quick, Draw! are more likely to draw eight sides for an octagon or eight legs for an octopus? I wonder if the mean number of sides drawn for an octagon is higher than the mean number of legs drawn for an octopus?

Estimating the mean and standard deviation of a discrete random variable is something we expect NZ students to be able to do by the time they finish Year 13 (Grade 12). The idea is that students estimate these properties of a distribution using visual features of a display (e.g. a dot plot) and, ideally, these measures are visually and conceptually attached to a real data distribution with a context and not treated entirely as mathematical concepts.

At the start of this year I went looking for an interactive dot plot to use when reviewing mean and standard deviation with my intro-level statistics students. Initially, I wanted something where I could drag dots around on a dot plot and show what happens to the mean, standard deviation etc. as I do this. Then I wanted something where you could drag dots on and off the dot plot, rather than having an initial starting dot plot, so students could build dot plots based on various situations. I came across a few examples of interactive-ish dot plots out there in Google-land but none quite did what I wanted (or kept the focus on what I wanted), so I decided to write my own. [**Note:** CODAP would have been my choice if I had just wanted to drag dots around. **Extra note:** CODAP is pretty awesome for many, many reasons].

In my head as I developed the app was an activity I’ve used in the past to introduce standard deviation as a measure – Exploring statistical measures by estimating the ages of famous people – as well as a workshop by the awesome Christine Franklin. For NZ-based teachers (or teachers who want to come to beautiful New Zealand for our national mathematics teachers conference), Chris is one of the keynote speakers at the NZAMT 2017 conference and is running a workshop at this conference called *Conceptualizing Variation from the Mean: Evolving from ‘Number of Steps’ to the ‘SAD’ to the ‘MAD’ to the ‘Standard Deviation’* which you should get along to if you can. Also in my head was the idea of the mean of a distribution being like the “balancing point”, and other activities I have used in the past based on this analogy and also see-saws! My teaching colleague Liza Bolton was also super helpful at listening to my ideas, suggesting awesome ones of her own, and testing the app throughout its various versions.

*dots* – an interactive dot plot

You can access *dots* at this address: learning.statistics-is-awesome.org/dots/ but you might want to keep reading to find out a little more about how it works. Below is a screenshot of the app, with some brief descriptions of how things are *supposed* to work. Current limitations for *dots* are that no more than 35 dots will be displayed, the axis is fixed between 0 and 34, and that dots can only be placed on whole numbers. I had played around with making these aspects of the app more flexible, but then decided not to pursue this as I’m not trying to re-create graphing/statistical software with this interactive.

Since I’ve got the It’s raining cats and dogs (hopefully) project running, I thought I’d use some of the data collected so far to show a few examples of how to use *dots*. [**Note:** The data collection phase of the cats and dogs data cards project is still running, so you can get your students involved]. Here are 15 randomly selected cats from the data cards created so far, with the age of each cat removed.

Once you get past how cute these cats are, what do you think the mean age of these cats is (in years)? Can you tell which cat is the oldest? How much variation do you think there is between the ages of these cats?

**Dragging dots onto the dot plot**

A dot plot can be created by dragging dots on to the plot (don’t forget to add a label for the axis like I did!).

**Sending data to the dot plot**

You can also add the data and the label to the URL so that the plot is ready to go. Use the structure shown below to do this, and then click on the link to see the ages of these cats on the interactive dot plot.

Turns out China is the oldest cat in this sample.

**Exploring the balance point**

You can click below the dots on the axis to indicate your estimate for the mean. You could do a couple of things after this. You could click the **Mean** button to show the mean, and check how this compares to your estimated mean. Or you could click the **Balance test** button to turn it on (green), and see how well the dots balance on the point you have estimated as the mean (or both like I did).
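The reason the mean works as a balance point is that the distances of the dots above the mean exactly cancel out the distances of the dots below it. A small Python sketch of that check follows. The function name `net_tilt` and the ages are made up for illustration, and this is not how the **Balance test** button is coded.

```python
from statistics import mean


def net_tilt(values, pivot):
    """Sum of signed distances from the pivot; zero means the dots balance."""
    return sum(value - pivot for value in values)


# Made-up ages (in years) for seven imaginary cats.
ages = [1, 2, 2, 3, 5, 8, 14]

print(mean(ages))         # prints 5
print(net_tilt(ages, 5))  # prints 0: the dots balance at the mean
print(net_tilt(ages, 3))  # prints 14: too much weight to the right of 3
```

A positive result means the estimate is too low (the plot would tip to the right), and a negative result means the estimate is too high.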

*Estimating standard deviation*

Estimating standard deviation is hard. I try not to use “rules” that only work with Normally distributed-ish data (like take the range and divide by six) and aren’t based on what the standard deviation is a measure of. Visualising standard deviation is also a tricky thing. In the video below I’ve gone with two approaches: one uses a Chrome extension, Web Paint, to draw on the plot where I think the *average distance each dot is from the mean* is, and one uses the absolute deviations.
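To make the link between “average distance from the mean” and standard deviation concrete, the Python sketch below compares the mean absolute deviation with the standard deviation for some made-up ages. The helper `mean_absolute_deviation` is written for this example, and `pstdev` is the population standard deviation (dividing by n). Whether that matches the version of the standard deviation shown in *dots* is an assumption.

```python
from statistics import mean, pstdev


def mean_absolute_deviation(values):
    """Average distance of each value from the mean."""
    centre = mean(values)
    return mean(abs(value - centre) for value in values)


# Made-up ages (in years) for seven imaginary cats; the mean is 5.
ages = [1, 2, 2, 3, 5, 8, 14]

print(round(mean_absolute_deviation(ages), 2))  # prints 3.43
print(round(pstdev(ages), 2))                   # prints 4.28
```

The standard deviation is never smaller than the mean absolute deviation, because squaring gives the dots furthest from the mean (like the 14-year-old cat here) more influence.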

**Using “random distribution”**

This is the option I have used the most when working with students individually. Yes, there is no context when using this option, but in my conversations with students when talking about the mean and standard deviation I’m not sure the lack of context makes it a non-concept-building activity. The short video below shows using the median as a starting point for the estimate of the mean, and then adjusting from here depending on other features of the distribution (e.g. shape). The video ends by dragging a dot around to see what happens to the different measures, since that was the starting point for developing *dots*.

**Other ideas for using dots?**

Share them below the related Facebook post, on Twitter, or wherever – I’d be super keen to hear whether you find this interactive dot plot useful for teaching students how to estimate mean and standard deviation.

PS no cats were harmed in the making of this GIF

In April 2017, I presented an ASA K-12 statistics education webinar: Statistical reasoning with data cards (webinar). Towards the end of the webinar, I encouraged teachers to get students to make their own data cards about their cats. A few days later, I then thought that this could be something to get NZ teachers and students involved with. Imagine a huge collection of real data cards about dogs and cats? Real data that comes from NZ teachers and students? Like Census At School but for pets. I persuaded a few of my teacher friends to create data cards for their pets (dogs or cats) and to get their students involved, to see whether this project could work. Below is a small selection of the data cards that were initially created (beware of potential cuteness overload!)

The project then expanded to include more teachers and students across NZ, and even the US, and I’ve now decided to keep the data card generator (and collection) page open so that the set of data cards can grow over time. Please use the steps below to get students creating and sharing data cards about their pets.

**Creating and sharing data cards about dogs and cats**

Inevitably, there will be submissions made that are “fake”, silly or offensive (see below).

Data cards submitted to the project won’t automatically be added to any public sets of data cards, and will be checked first. Just like with any surveying process that is based on self-selection, is internet based and relies on humans to give honest and accurate answers, there is the potential for non-sampling errors. To help reduce the quantity of “fake” data cards, if you are keen to have your students involved with this project, it would be great if you could do the following:

**1.** Talk to your students about the project and explain that the data cards will be shared with other students. They will be sharing information about their pet and need to be OK with this (and don’t have to!). The data will be displayed with a picture of their pet, so participation is not strictly anonymous. All of this is important to discuss with students as we need to educate students about data privacy.

**2.** When students submit their data, they are given the finished data card which they can save. Set up a system where students need to share the data card they have created with you e.g. by saving into a shared Google drive or Dropbox, or by emailing the data card to you. The advantage for you of setting up this system is that you get your class/school set of data cards to use however you want. The advantage for me is that this level of “watching” might discourage silly data cards being created.

**3.** Share this link with your students http://learning.statistics-is-awesome.org/dogsvscats/ and let the rain of cats and dogs begin!

**Pet data cards**

The data collection period for this set of data cards was 1 May 17 to 19 May 17.

The diagram below shows the data included on each data card:

Additional data that could be used from each data card includes:

- Whether the pet photo was taken inside or outside
- Whether the pet photo is rotated (and the angle of rotation)
- The number of letters in the pet name
- The number of syllables in the pet name

PDF of all data cards: click to download


If you haven’t heard of the activity *Which one doesn’t belong?* (WODB), it involves showing students four “things” and asking them to describe/argue which one doesn’t belong. There are heaps of examples of *Which one doesn’t belong?* in action for math(s) on the web, Twitter, and even in a book. From what I’ve seen, for math(s) I think the activity is pretty cool. In terms of whether WODB works for stats, however, I’m not so sure. Perhaps for definitions, facts, static pieces of knowledge it *could* work (?), but in terms of making comparisons involving data and its various representations (including graphs/displays), I need more convincing. There’s something different between comparing properties of shapes (for example), which remain fixed, and comparing data about something/someone, which could vary.

For example, *What cat doesn’t belong?* for the four “stats cats” data cards shown below.

To make comparisons between the four cats means to reason with data, but if I am considering only the data provided in these four data cards then these comparisons are made without uncertainty. For example, I can say definitively, for these four cats, that:

- Elliot is the only cat with a name that has three syllables,
- Molly is the only female cat,
- Joey is the only cat that is both an inside and outside cat,
- Classic is the only cat that uses a cat door.

I could argue many different cases for which cat (or photo) does not belong. This is all cool, but doesn’t feel like statistics to me. Statistics is all about using data to make decisions in the face of uncertainty, by appreciating different sources of variation and considering how to deal with these. In particular, inferential reasoning involves going beyond the data at hand, thinking about generalisability, considering the quality and quantity of data available, and appreciating/communicating the possibility of being wrong no matter how “right” the methodology.

So while I appreciate that WODB allows for “not just one correct answer” and the development of argumentation skills, I’d be happier if this kind of activity *within statistics teaching* led to the posing of statistical investigative questions (SIQ): **WODB->SIQ.** Why? We need more data and more of an idea of where the data came from to *really answer* the *really interesting questions* that comparing these four cats might provoke us to consider. We need students to *feel the uncertainty* that comes from thinking and reasoning statistically and to help students find ways to *deal with this uncertainty*. We also need students to care about the questions being asked of the data – my worry here is that otherwise the question students might ask when using WODB is *Who cares which one doesn’t belong?*

Questions I have when looking at these stats cats data cards, which are interesting to me, are: *I wonder* … *How many syllables do cats’ names have? Do most cats have two syllable names? Is Elliot (my cat!) an unusual name for this reason? Do I spend too much on cat food ($NZD30 per week)? Or maybe black cats are more expensive to feed?* I won’t be able to get **definitive answers** to these questions, but by collecting more data and investigating these questions using statistical methods I can get a better understanding of what could be **plausible answers.**

PS Want some of these data cards? Head here –> It’s raining cats and dogs (hopefully)
