## You say data, I say data cards …

This long weekend (in Auckland anyway!), I spent some time updating the Quick! Draw! sampling tool (read more about it in the post Cat and whisker plots: sampling from the Quick, Draw! dataset). You may need to clear your browser cache/data to see the most recent version of the sampling tool.

One of the motivations for doing so was a visit to my favourite kind of store – a stationery store – where I saw (and bought!) this lovely gadget:

It’s a circle punch with a 2″/5 cm diameter. When I saw it, my first thought was “oh cool I can make dot-shaped data cards”, like a normal person right?

Using data cards to make physical plots is not a new idea – see censusatschool.org.nz/resource/growing-scatterplots/ by Pip Arnold for one example:

But I haven’t seen dot-shaped ones yet, so this led me to re-develop the Quick! Draw! sampling tool to be able to create some 🙂

I was also motivated to work some more on the tool after the fantastic Wendy Gibbs asked me at the NZAMT (New Zealand Association of Mathematics Teachers) writing camp if I could include variables related to the times involved with each drawing. I suspect she has read this super cool post by Jim Vallandingham (while you’re at his site, check out some of his other cool posts and visualisations) which came out after I first released the sampling tool and compares strokes and drawing/pause times for different words/concepts – including cats and dogs!

So, with the Quick! Draw! sampling tool you can now get the following variables for each drawing in the sample:

The drawing and pause times are in seconds. The drawing time captures the time taken for each stroke from beginning to end and the pause time captures all the time between strokes. If you add these two times together, you will get the total time the person spent drawing the word/concept before either the 20 seconds was up, or Google tried to identify the word/concept. Below the word/concept drawn is whether the drawing was correctly recognised (true) or not (false).

I also added three ways to use the data cards once they have been generated using the sampling tool (scroll down to below the data cards). You can now:

1. download a PDF version of the data cards, with circles the same size as the circle punch shown above (2″/5cm)
2. download the CSV file for the sample data
3. show the sample data as an HTML table (which makes it easy to copy and paste into a Google sheet for example)

In terms of options (2) and (3) above, I had resisted making the data this accessible in the previous version of the sampling tool. One of the reasons for this is that I wanted the drawings themselves to be considered as data, and as a human would be involved in developing this variable, there was a need to work with just a sample of all the millions of drawings. I still feel this way, so I encourage you to get students to develop at least one new variable for their sample data that is based on a feature of the drawing 🙂 For example, whether the drawing of a cat is the face only, or includes the body too.

There are other cool things possible to expand the variables provided. Students could create a new variable by adding drawing_time and pause_time together. They could also create a variable which compares the number_strokes to the drawing_time e.g. average time per stroke. Students could also use the day_sketched variable to classify sketches as weekday or weekend drawings. Students should soon find the hemisphere is not that useful for comparisons, so could explore another country-related classification like continent. More advanced manipulations could involve working with the time stamps, which are given for all drawings using UTC time. This has consequences for the variable day_sketched as many countries (and places within countries) will be behind or ahead of the UTC time.
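If you want to see what those new variables could look like in code, here is a minimal sketch in Python (not the sampling tool's own code). The variable names drawing_time, pause_time, number_strokes and day_sketched come from the data cards; the `timestamp` column name, its text format, the assumption that day_sketched holds an English day name, and the two sample rows are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sample rows: the values are invented, and the "timestamp"
# column name and format are assumptions, not the tool's actual export.
sample = [
    {"word": "cat", "number_strokes": 8, "drawing_time": 6.4, "pause_time": 5.6,
     "day_sketched": "Saturday", "timestamp": "2017-03-04 23:30:00"},
    {"word": "dog", "number_strokes": 5, "drawing_time": 4.0, "pause_time": 3.5,
     "day_sketched": "Tuesday", "timestamp": "2017-03-07 09:15:00"},
]

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
WEEKEND = {"Saturday", "Sunday"}


def add_derived_variables(row, utc_offset_hours=0):
    """Return a copy of one drawing's row with four extra variables."""
    new = dict(row)
    # Total time spent on the drawing, in seconds.
    new["total_time"] = row["drawing_time"] + row["pause_time"]
    # Average drawing time per stroke, in seconds.
    new["time_per_stroke"] = row["drawing_time"] / row["number_strokes"]
    # Weekday/weekend classification based on the (UTC) day_sketched.
    new["day_type"] = "weekend" if row["day_sketched"] in WEEKEND else "weekday"
    # Day of the week after shifting the UTC time stamp by a fixed offset.
    utc = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
    utc = utc.replace(tzinfo=timezone.utc)
    local = utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    new["local_day"] = DAYS[local.weekday()]
    return new


# Using a fixed +13 hour offset (roughly Auckland during daylight saving).
for row in sample:
    print(add_derived_variables(row, utc_offset_hours=13))
```

With the made-up cat drawing above, the UTC day is Saturday but the shifted day is Sunday, which is exactly the kind of mismatch students could investigate. A fixed offset ignores daylight saving changes, so treat it as a rough first pass only.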

If you’ve made it this far in the post…. why not play with a little R 🙂

I wonder which common household pet Quick! drawers tend to use the most strokes to draw? Cats, dogs, or fish?

Have a go at modifying the R code below, using the iNZightPlots package by Tom Elliott and my [very-much-in-its-initial-stages-of-development] iNZightR package, to see what we can learn from the data 🙂 If you’re feeling extra adventurous, why not try modifying the code to explore the relationship between number of strokes and drawing time!

## Hey! You’ve got to hide that population away …

Back in 2012, I first set up an online tool for taking a random sample from a hidden population. I didn’t share or promote this tool at the time because it was always meant to be a short term solution to a short term problem for my department. 2012 in NZ was the first year of AS91264 Use statistical methods to make an inference and we had hundreds of Year 12 students and far fewer computers. We wanted a quick way for students to use the computer to get their random sample, graph it, print/save it and then move back to a desk to write up their report by hand. We also didn’t want them to see all the data that was in the population data set, as we thought that would be distracting.

Note: The title of this post is based on a song by The Beatles but I don’t think I believe that the population has always got to be hidden. You can read more about my thoughts on stuff related to sampling in this post Using awesome real data

So I wrote some code which was completely based on the data viewer tool on Census At School NZ, where you can get a random sample from the Census At School database of your choice and then get the graphs and summary statistics displayed for that sample. The idea was that we could put whatever population data we wanted “behind the scenes” and students would choose what to sample using an interface. While initially it was intended for Year 12 only (since AS91264 has the requirement to sample), I extended this tool to include bootstrapping analysis for AS91582 (under type of analysis – Year 13) and the randomisation test for AS91583 (for this, students would just paste in their data directly to the webpage).

Below are some screen shots of this old tool from 2012:

This online inference tool had limitations as I am sure you will have identified 🙂 Unlike iNZight which has an interface designed to allow students to get into data faster and deeper, this tool was completely focused on getting the output for the inference, and the sample data generated by the tool could not be explored. The graphics are also not that great, and I needed to set up a page for each data set we wanted to use. Additionally, for the bootstrapping confidence interval, there was no animation to show how the interval was constructed (unlike the awesome iNZight VIT), which is such an important and essential part of using this method.

Fortunately, in the years that followed, our Principal gave us more and more desktop computers, and so students were able to complete their entire assessment on computers at a much slower pace using awesome tools such as Google docs (with great addons like Doctopus for us to manage their work). Later, we were also able to trial iNZight lite (we used it for AS91580 Investigate time series data), which is the online version of iNZight.

Time for a sampling tool update?

One of the awesome teachers I worked with emailed me recently wanting to set up something like the Census At School NZ random sampler tool. The Census At School random sampler tool gives you access to Census At School data sets since 2005, and also other data sets such as Kiwi Kapers, NZ incomes, Census at School data from other countries and Statistics NZ SURFs (income and births). One of the benefits of the tool is that the complete population data set is hidden behind the interface.

In terms of setting up something similar, there were a couple of options:

(1) not develop anything but instead put more population data sets up on the Census At School NZ site since they have a great sampling interface set up. This is a valid option and if you have any great population data sets to share, just get in touch with the friendly people at Census At School NZ.

(2) set up something similar to my 2012 tool but without the graphs, where teachers send me data sets and I make them available for sampling on my website. This is essentially the same as option (1) except that I would have responsibility for setting up and maintaining the data sets, and the teachers sharing them would lose control of them. However, we often use data collected from our own population of students, which wouldn’t be that interesting or appropriate for students from other schools.

(3) set up a sampling interface where teachers can use whatever data set they want, whenever they want, and keep ownership of the data set. I’m calling this BYOP – Bring Your Own Population 🙂

After revisiting the code I used in 2012 and the code I used recently to set up the random redirect tool, I realised it wouldn’t take too much time to create a sampling tool for option 3. All you need for this new sampling tool is a csv file which is hosted publicly somewhere on the web, and where the first row consists of the variable names and the second row consists of a full set of values for each variable (no missing values for any variable).
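Before you host a csv file somewhere, it might be worth checking it against those two requirements. Here is a rough sketch of such a check in Python. It is not part of the sampling tool: the function name `check_population_csv` and the example data are hypothetical, and it only looks at the first two rows described above.

```python
import csv
import io


def check_population_csv(text):
    """Return a list of problems found in the first two rows of a csv."""
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) < 2:
        return ["need a row of variable names and at least one row of values"]

    problems = []
    header, first = rows[0], rows[1]
    # The first row should consist of variable names, so none may be blank.
    if any(name.strip() == "" for name in header):
        problems.append("blank variable name in the first row")
    # The second row should have a value for every variable.
    if len(first) != len(header):
        problems.append("second row does not have one value per variable")
    elif any(value.strip() == "" for value in first):
        problems.append("missing value in the second row")
    return problems


# Made-up example data, loosely in the spirit of the marathon example.
good = "division,time_minutes\n35 - 39,241\nup to 34,\n"
bad = "division,time_minutes\n35 - 39,\n"

print(check_population_csv(good))  # no problems in the first two rows
print(check_population_csv(bad))   # second row is missing a value
```

Note that the third row of the "good" example has a missing value and still passes, since the requirement above is only about the second row.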

You can see it in action here: http://www.mathstatic.co.nz/sampler-new/UFNXFXDF (for this example I used the Auckland Marathon 2015 data; this link has information about the data).

You can enter in the sample size you want (to a max of 30% of the size of the data set), and if you want, you can choose to only sample from certain groups within the population e.g. age division (up to 34 vs 35 – 39). You can then copy and paste the sample generated to wherever you like, export the sample as a csv file, or jump straight into iNZight lite with the data. I’ve made the page deliberately plain, so it will be up to you to provide the information about the data being used and how to use the tool.
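For anyone curious about the logic behind those options, here is a minimal sketch in Python of sampling with a 30% cap and an optional group filter. It is an illustration under my own assumptions, not the code behind the tool: the function name `take_sample`, its parameters, sampling without replacement, rounding the cap down, and the made-up population are all hypothetical.

```python
import random


def take_sample(rows, n, group_var=None, groups=None, max_percent=30, seed=None):
    """Draw n rows at random, optionally only from the chosen groups."""
    # Cap the sample size at a percentage of the whole data set (rounded down).
    max_n = len(rows) * max_percent // 100
    if n > max_n:
        raise ValueError(f"sample size {n} is more than the maximum of {max_n}")

    # Optionally keep only rows belonging to the chosen groups.
    if group_var is not None and groups is not None:
        rows = [row for row in rows if row[group_var] in groups]
    if n > len(rows):
        raise ValueError("not enough rows in the chosen groups")

    # Simple random sample without replacement (an assumption).
    return random.Random(seed).sample(rows, n)


# A made-up population of 20 runners in two age divisions.
population = [
    {"id": i, "division": "up to 34" if i % 2 == 0 else "35 - 39"}
    for i in range(20)
]

# Up to 6 rows (30% of 20) can be sampled; here 5 from one division only.
for row in take_sample(population, 5, group_var="division",
                       groups={"35 - 39"}, seed=1):
    print(row)
```

Here the cap is worked out from the size of the whole data set rather than from the filtered groups, which is one reading of the limit described above.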

To read more about this new sampling tool and how to set up your own sampling URL, head here: BYOP sampling tool