Recently I’ve been developing and trialling learning tasks where the learner is working with a provided data set but has to do something “human” that motivates using a random sample as part of the strategy to learn something from the data.
Since I already had a tool that creates data cards from the Quick, Draw! data set, I’ve created a prototype for the kind of tool that would support this approach using the same data set.
I’ve written about the Quick, Draw! data set already:
For this new tool, called different strokes, users sort drawings into two or more groups based on something visible in the drawing itself. Since you have to drag the drawings around to manually “classify” them, the larger the sample you take, the longer it will take you.
There’s also the novelty and creativity of being able to create your own rules for classifying drawings. I’ll use cats for the example below, but from a teaching and assessment perspective there are SO many drawings of so many things, with so many variables and so many opportunities to compare and contrast what can be learned about how people draw in the Quick, Draw! data set.
Here’s a précis of the kinds of questions I might ask myself to explore the general question: What can we learn from the data about how people draw cats in the Quick, Draw! game?
Are drawings of cats more likely to be heads only or the whole body? [I can take a sample of cat drawings, and then sort the drawings into heads vs bodies. From here, I could bootstrap a confidence interval for the population proportion].
Is how someone draws a cat linked to the game time? [I can use the same data as above, but compare game times by the two groups I’ve created – head vs bodies. I could bootstrap a confidence interval for the difference of two population means/medians]
Is there a relationship between the number of strokes and the pause time for cat drawings? [And what do these two variables actually measure – I’ll need some contextual knowledge!]
Do people draw dogs similarly to cats in the Quick, Draw! game? [I could grab new samples of cat and dog drawings, sort all drawings into “heads” or “bodies”, and then bootstrap a confidence interval for the difference of two population proportions]
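The bootstrap idea behind the first question above can be sketched in a few lines of Python. The counts below are made up purely for illustration (imagine 16 of 25 sampled cat drawings were sorted as “heads only”), and `bootstrap_ci` is just a hypothetical helper, not part of any tool mentioned here:

```python
import random

# Hypothetical hand-sorted sample of 25 cat drawings:
# 1 = head only, 0 = whole body (made-up data for illustration)
cats = [1] * 16 + [0] * 9

def bootstrap_ci(sample, reps=10_000, level=0.95):
    """Percentile bootstrap confidence interval for a population proportion."""
    props = []
    for _ in range(reps):
        # Resample with replacement, same size as the original sample
        resample = random.choices(sample, k=len(sample))
        props.append(sum(resample) / len(resample))
    props.sort()
    lo = props[int((1 - level) / 2 * reps)]
    hi = props[int((1 + level) / 2 * reps)]
    return lo, hi

random.seed(1)
low, high = bootstrap_ci(cats)
print(f"95% CI for proportion of head-only cats: {low:.2f} to {high:.2f}")
```

The same resampling loop extends naturally to the other questions: resample each group separately and keep the difference of the two proportions (or means/medians) each time.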
I’ve been working on a little side project for the last year or so. I thought this might be a good time to share this with you, particularly since I probably (with a very high probability) won’t be making any more posts for the rest of the year due to a few little things called a dissertation and a wedding 🙂
The idea was to create a digital learning environment for working with data cards, in an attempt to make stronger connections between data cards, data structures and data displays, and to make effective use of tablets/devices (particularly in large lecture groups like my current teaching situation). This first digital environment is based on the C@S stick people data cards I created last year, but could involve any population/data etc, since everything is created dynamically. The idea to use stick people (figures) for the data cards was based on material Rob Gould presented at the NZAMT conference in 2015 regarding the Introduction to Data Science (IDS) course the Mobilize team created for high school students.
In stickland, the members of its population (the C@S stick people) ride by on skateboards. The numbers displayed on each stick person are their unique three-digit ID number. The environment is set up so that the stick people arrive at this stretch of road in stickland in a random order and at random times. Students could check this out by watching the stick people skate on by and recording their ID numbers. They should see no pattern to the numbers and be convinced that they cannot predict what ID number the next stick person will have (well, I guess if you watched for long enough you would be able to predict the last ID number…)
To select stick people to find out more about them, students click on a stick person as they skate past. Some of the stick people are faster than others (more about that next year!) so it’s not always easy to catch them. This means it will take students different amounts of time to collect the same number of data cards. As the stick people are selected, a stack of data cards starts to be built on the top right-hand side of the data card screen below.
At this point we’re in a similar position to where we would be if we had given students a set of data cards each, or if we had asked them to select a random sample of data cards from a population bag. One of the really awesome things about data cards is the physical nature of them – students can move them around, sort them, line them up, etc. So in this digital environment, students can drag the stick people data cards around by tapping their heads and dragging with a finger.
I love getting students to sort the data cards by a categorical variable (e.g. Facebook user) and then by another categorical variable (e.g. Snapchat user) to build ideas of two-way tables and conditioning.
You can also get students to make graphs out of the data cards (see one of Pip Arnold’s excellent resources along these lines here on Census At School NZ). In this digital environment, students can make the cards bigger or smaller, and can move into “dot” mode as they move into graphical representations by encoding the data.
To help students build understanding of what are essential features of their graphs, there is a drawing tool so they can add in additional information like axes, labels, numbers etc. I can see a whole lot of potential here, particularly with students exploring different ways to organise and display data.
To help build understanding of the relationship between units, variables and data structures (specifically rectangular data sets), an interactive spreadsheet builds below the data card screen as the cards are collected. When a student selects a data card, this stick person’s row of data is highlighted in the spreadsheet, and vice versa. To check each student can match the data shown on the data card to the spreadsheet, data is missing from the spreadsheet (shown by grey boxes).
Students will need to find the relevant stick person, read the card for the appropriate variable, and enter this data to make a complete data set. At the moment, I’ve set this feature so that there is missing data for 10 different stick people (one missing value for each of the variables on the data card) and the data cannot be visualised using software (iNZight lite) until the missing data has been found.
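Behind the scenes, the missing-data feature could be set up along these lines. This is only a sketch with made-up variable names and placeholder values, not the actual code of the tool:

```python
import random

# Hypothetical variable names standing in for the ten data card variables
variables = ["age", "cellphone", "facebook", "snapchat", "tv_hours",
             "reading_hours", "name", "year_level", "travel", "height"]

# Made-up population: each row is one stick person's data card
population = [{"id": i, **{v: f"{v}-{i}" for v in variables}}
              for i in range(100, 200)]

# Pick 10 stick people and blank a different variable on each one,
# so every variable goes missing exactly once across the spreadsheet
random.seed(7)
chosen = random.sample(population, k=10)
for person, variable in zip(chosen, random.sample(variables, k=10)):
    person[variable] = None  # shown as a grey box in the spreadsheet
```

Because `random.sample(variables, k=10)` returns a permutation of all ten variable names, each variable is blanked exactly once and each of the ten chosen stick people loses exactly one value.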
The final link is to explore the data using software like iNZight lite, which has been designed by Chris Wild to help students “get into data deeper and faster” (PS I’m not sure if that is an exact quote!). The data cards are not automatically linked to the data in iNZight lite, so if more data cards are collected, the iNZight button will need to be pressed again to update. I’m excited about getting students to explore relationships and build informal predictive models (after trying this out with the data cards earlier), and then checking these models out by easily selecting more stick people (see more about this kind of activity in my post about data challenges).
Back in 2012 was when I first set up an online tool for taking a random sample from a hidden population. I didn’t share or promote this tool at the time because it was always meant to be a short term solution to a short term problem for my department. 2012 in NZ was the first year of AS91264 Use statistical methods to make an inference and we had hundreds of Year 12 students and far fewer computers. We wanted a quick way for students to use the computer to get their random sample, graph it, print/save it and then move back to a desk to write up their report by hand. We also didn’t want them to see all the data that was in the population data set, as we thought that would be distracting.
Note: The title of this post is based on a song by The Beatles, but I don’t actually believe that the population has always got to be hidden. You can read more about my thoughts on stuff related to sampling in this post: Using awesome real data
So I wrote some code which was completely based on the data viewer tool on Census At School NZ, where you can get a random sample from the Census At School database of your choice and then get the graphs and summary statistics displayed for that sample. The idea was that we could put whatever population data we wanted “behind the scenes” and students would choose what to sample using an interface. While initially it was intended for Year 12 only (since AS91264 has the requirement to sample), I extended this tool to include bootstrapping analysis for AS91582 (under type of analysis – Year 13) and the randomisation test for AS91583 (for this, students would just paste in their data directly to the webpage).
Below are some screen shots of this old tool from 2012:
This online inference tool had limitations as I am sure you will have identified 🙂 Unlike iNZight which has an interface designed to allow students to get into data faster and deeper, this tool was completely focused on getting the output for the inference, and the sample data generated by the tool could not be explored. The graphics are also not that great, and I needed to set up a page for each data set we wanted to use. Additionally, for the bootstrapping confidence interval, there was no animation to show how the interval was constructed (unlike the awesome iNZight VIT), which is such an important and essential part of using this method.
Fortunately, in the years that followed, our Principal gave us more and more desktop computers, and so students were able to complete their entire assessment on computers at a much slower pace using awesome tools such as Google docs (with great addons like Doctopus for us to manage their work). Later, we were also able to trial iNZight lite (we used it for AS91580 Investigate time series data), which is the online version of iNZight.
Time for a sampling tool update?
One of the awesome teachers I worked with emailed me recently wanting to set up something like the Census At School NZ random sampler tool. The Census At School random sampler tool gives you access to Census At School data sets since 2005, and also other data sets such as Kiwi Kapers, NZ incomes, Census at School data from other countries and Statistics NZ SURFs (income and births). One of the benefits of the tool is that the complete population data set is hidden behind the interface.
In terms of setting up something similar, there were a couple of options:
(1) not develop anything but instead put more population data sets up on the Census At School NZ site since they have a great sampling interface set up. This is a valid option and if you have any great population data sets to contribute, just get in touch with the friendly people at Census At School NZ.
(2) set up something similar to my 2012 tool but without the graphs, where teachers send me data sets and I make them available for sampling on my website. This is essentially the same as option (1) except that I would have responsibility for setting up and maintaining the data sets, and the teachers sharing them would lose control of them. However, we often use data collected from our own population of students, which wouldn’t be that interesting or appropriate for students from other schools.
(3) set up a sampling interface where teachers can use whatever data set they want, whenever they want, and keep ownership of the data set. I’m calling this BYOP – Bring Your Own Population 🙂
After revisiting the code I used in 2012 and the code I used recently to set up the random redirect tool, I realised it wouldn’t take too much time to create a sampling tool for option 3. All you need for this new sampling tool is a csv file which is hosted publicly somewhere on the web, where the first row consists of the variable names and every row after that consists of a full set of values for each variable (no missing values for any variable).
You can enter the sample size you want (to a max of 30% of the size of the data set), and if you want, you can choose to only sample from certain groups within the population e.g. age division (up to 34 vs 35 – 39). You can then copy and paste the sample generated to wherever you like, export the sample as a csv file, or jump straight into iNZight lite with the data. I’ve made the page deliberately plain, so it will be up to you to provide the information about the data being used and how to use the tool.
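The core sampling logic is simple enough to sketch. This is not the tool’s actual code – it’s a minimal Python illustration with made-up column names, assuming the csv has already been fetched as text from its public URL:

```python
import csv
import io
import random

def byop_sample(csv_text, n, group_var=None, groups=None):
    """Simple random sample (without replacement) from csv text,
    optionally restricted to certain groups within the population.
    The sample size is capped at 30% of the (filtered) data set."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if group_var and groups:
        rows = [r for r in rows if r[group_var] in groups]
    n = min(n, int(len(rows) * 0.3))
    return random.sample(rows, n)

# Made-up population for illustration -- in the real tool the full
# data set stays hidden behind the interface
csv_text = "id,age_division\n" + "\n".join(
    f"{i},{'up to 34' if i % 2 else '35-39'}" for i in range(1, 101))

sample = byop_sample(csv_text, n=50, group_var="age_division",
                     groups={"up to 34"})
print(len(sample))  # capped at 30% of the 50 filtered rows -> 15
```

Capping at 30% of the filtered rows means a student can ask for “too many” and still get a legitimate sample rather than the whole (sub)population.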
To read more about this new sampling tool and how to set up your own sampling URL, head here: BYOP sampling tool
This population of stick people was created using data from the Census at School 2015 database. For the data cards, rather than put/indicate gender on the card I have used a fictional name, taken from the names of children entered in the 2015 Auckland kids marathon. The relevant questions from the Census at School 2015 survey are Q1, Q2, Q17, Q27 cellphone, facebook, snapchat, Q31 TV, and Q32 reading (the questions can be found here). The diagram below shows what each part of the data card represents:
For some great teaching notes for using data cards, check out Pip Arnold’s resources on Census at School, here are a couple: ID cards | Using data cards. I also used these data cards in a workshop on data challenges which you can read more about here.
Here is the population data set as a CSV file for teacher reference: CAS2015_edited
I haven’t heard anything from anyone with any problems, and there seems to be a bit of traffic to the challenge page, so hopefully this is going well. I’ll allow checking of the first list of reserved words tomorrow. Students should put in what they predict the readability score will be for each word. These predicted scores will be checked against the actual readability scores and students will be given an overall result e.g. 85%. Oh, and just because you’re a teacher too, you’ll get this idea for an investigative question/problem… How long does it take a student to submit a swear word into a text analysis tool?
Related “reading themed” statistical investigation ideas
Check out http://josephrocca.com/randomsentence/ where you can generate “random” sentences from books that are no longer under U.S.A. copyright restrictions – so books generally published before the early 20th century. You could compare the process for random sampling sentences from digital books to processes for random sampling sentences from physical books (so much here with different sampling methods). You could give students an actual physical book and challenge them to estimate the total word count (check using the digital version!), or get students to devise a way to compare the “readability” of two books, or….?
So what was so surprising?
Recap: I got 10 dominoes from a supermarket recently and was surprised to find that all 10 were different (there are 50 different dominoes to collect). Ok, so on the face of it this may look like a familiar (and not super awesome) starter. Collecting cereal cards, ice block sticks, seed packets… But I was surprised to see this because I was thinking that a random process like this would mean I should expect to see at least one double up e.g. like seeing runs of heads when you flip a coin. When I thought about it more, I realised I wasn’t taking into account that there were 50 dominoes – this makes a difference.
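A quick calculation and simulation back this up, assuming each of the 50 dominoes is equally likely and the picks are independent (i.e. sampling with replacement from the 50 types):

```python
import random

def all_different(n_picks=10, n_types=50):
    """Pick n_picks dominoes at random (with replacement, all types
    equally likely); return True if no type is repeated."""
    picks = [random.randrange(n_types) for _ in range(n_picks)]
    return len(set(picks)) == n_picks

# Exact probability that all 10 are different: 50/50 * 49/50 * ... * 41/50
exact = 1.0
for i in range(10):
    exact *= (50 - i) / 50

random.seed(2)
reps = 100_000
p = sum(all_different() for _ in range(reps)) / reps
print(f"exact: {exact:.3f}, simulated: {p:.3f}")  # both about 0.38
```

So with 50 dominoes, getting all 10 different happens roughly 38% of the time – a duplicate is more likely, but “no doubles” is far from shocking. With fewer types (say 20) the birthday-problem intuition of expecting a double up would be much stronger.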
More about SOLO
SOLO stands for the Structure of the Observed Learning Outcome. It’s a model/taxonomy for defining different levels of understanding or thinking and was developed by J. Biggs and K. Collis in 1982. I’ve been using SOLO in my teaching of statistics since around 2006 and think it’s awesome. It fits so well with building conceptual understandings of statistics rather than just procedural ones. I use SOLO in (at least) two ways: (1) to structure good questions for students to use when working with data, questions to make them think at different levels, and (2) to plan my teaching of a topic e.g. what are the key ideas (not skills)?
The prices increased from Jan to Feb and then decreased from Mar to May and then increased again…
I think I like this answer on Quora about how to explain over-fitting of models. Some of the language is a bit off – I think if you swap the word “hypothesis” for “model” and replace “experiments” with “observations” it reads better. But I like the idea of explaining to students that a model is not about getting a perfect fit to the observed data and that simpler can be better (e.g. go for as few trend lines as possible that still tell the general story of what is happening…).
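One way to show this concretely is to fit both a simple and an over-complicated model to the same small data set. This sketch uses made-up monthly prices (stdlib-only Python): a least-squares line versus a degree-5 polynomial that passes through every point exactly, then compares their predictions one month ahead:

```python
# Made-up monthly prices with a roughly linear trend plus noise
xs = [1, 2, 3, 4, 5, 6]                     # Jan..Jun
ys = [10.2, 11.1, 11.9, 13.2, 13.8, 15.1]   # invented prices

# Simple model: least-squares line, fitted by hand
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

# Complex model: Lagrange polynomial through all six points (a perfect fit)
def lagrange(x):
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Predict July (x = 7): the simple line extrapolates sensibly,
# while the "perfect" polynomial chases the noise and overshoots
print(f"line predicts       {intercept + slope * 7:.1f}")  # 15.9
print(f"polynomial predicts {lagrange(7):.1f}")            # 25.5
```

The polynomial has zero error on the observed months but predicts a wild jump for July, while the line – which fits none of the points exactly – gives the far more believable forecast. That’s over-fitting in one picture.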