If you’ve been keeping track of my various talks & workshops over the last year or so, you will have noticed that I’ve become a little obsessed with analysing images (see power of pixels and/or read more here). As part of my PhD research, I’ve been using images to broaden students’ awareness of what is data, and data science, and it’s been so much fun!
The power of pixels: Using R to explore and exploit features of images
Thursday, Oct 18, 2018, 6:00 PM
G15, Science Building 303, University of Auckland 38 Princes Street Auckland, NZ
Kia ora koutou Anna Fergusson, one of our R-ladies Auckland co-organisers, will be the speaker at this meetup. We’ll explore a range of techniques and R packages for working with images, all at an introductory level. Time: 6:00 arrival for a 6:30pm start. What to bring: Laptops with R installed, arrive early if you are a beginner and would like hel…
This is not a teaching-focused workshop; it’s more about learning fun and cool things you can do with images, such as making GIFs like the one below….
…. and other cool things, like classifying photos as cats or dogs, or finding the most similar drawing of a duck!
It will be at an introductory level, and you don’t need to be a “lady” to come along, just supportive of gender diversity in the R community (or more broadly, data science)! If you’ve never used R before, don’t worry – just bring yourself along with a laptop and we’ll look after you 🙂
But I haven’t seen dot-shaped ones yet, so this led me to re-develop the Quick! Draw! sampling tool to be able to create some 🙂
I was also motivated to work some more on the tool after the fantastic Wendy Gibbs asked me at the NZAMT (New Zealand Association of Mathematics Teachers) writing camp if I could include variables related to the times involved with each drawing. I suspect she has read this super cool post by Jim Vallandingham (while you’re at his site, check out some of his other cool posts and visualisations) which came out after I first released the sampling tool and compares strokes and drawing/pause times for different words/concepts – including cats and dogs!
The drawing and pause times are in seconds. The drawing time captures the time taken for each stroke from beginning to end and the pause time captures all the time between strokes. If you add these two times together, you will get the total time the person spent drawing the word/concept before either the 20 seconds was up, or Google tried to identify the word/concept. Below the word/concept drawn is whether the drawing was correctly recognised (true) or not (false).
I also added three ways to use the data cards once they have been generated using the sampling tool (scroll down to below the data cards). You can now:
1. download a PDF version of the data cards, with circles the same size as the circle punch shown above (2″/5cm)
2. download the CSV file for the sample data
3. show the sample data as an HTML table (which makes it easy to copy and paste into a Google sheet, for example)
In terms of options (2) and (3) above, I had resisted making the data this accessible in the previous version of the sampling tool. One of the reasons for this is that I wanted the drawings themselves to be considered as data and, as a human would be involved in developing this variable, there was a need to work with just a sample of all the millions of drawings. I still feel this way, so I encourage you to get students to develop at least one new variable for their sample data that is based on a feature of the drawing 🙂 For example, whether the drawing of a cat is the face only, or includes the body too.
There are other cool things possible to expand the variables provided. Students could create a new variable by adding drawing_time and pause_time together. They could also create a variable which compares the number_strokes to the drawing_time e.g. average time per stroke. Students could also use the day_sketched variable to classify sketches as weekday or weekend drawings. Students should soon find the hemisphere is not that useful for comparisons, so could explore another country-related classification like continent. More advanced manipulations could involve working with the time stamps, which are given for all drawings using UTC time. This has consequences for the variable day_sketched as many countries (and places within countries) will be behind or ahead of the UTC time.
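As a rough illustration of these derived variables, here is a minimal sketch in Python (not the sampling tool's own code). The names drawing_time, pause_time and number_strokes come from the data cards; the timestamp_utc field, the utc_offset_hours parameter and the example card itself are assumptions made up for this sketch.

```python
from datetime import datetime, timedelta, timezone


def add_derived_variables(card, utc_offset_hours=0):
    """Return a copy of one data card with three derived variables added."""
    # Shift the UTC time stamp to local time before deciding on the day.
    local_time = card["timestamp_utc"] + timedelta(hours=utc_offset_hours)
    return {
        **card,
        # Drawing time plus pause time gives the total time spent (seconds).
        "total_time": card["drawing_time"] + card["pause_time"],
        # Average drawing time per stroke (seconds).
        "time_per_stroke": card["drawing_time"] / card["number_strokes"],
        # Monday is 0, so 5 and 6 are Saturday and Sunday.
        "day_type": "weekend" if local_time.weekday() >= 5 else "weekday",
    }


# A made-up data card: 20:00 on Friday 5 October 2018, UTC.
card = {
    "drawing_time": 6.0,
    "pause_time": 3.0,
    "number_strokes": 4,
    "timestamp_utc": datetime(2018, 10, 5, 20, 0, tzinfo=timezone.utc),
}

as_utc = add_derived_variables(card)
as_nz = add_derived_variables(card, utc_offset_hours=13)
print(as_utc["total_time"], as_utc["time_per_stroke"], as_utc["day_type"])
print(as_nz["day_type"])
```

The same sketch is classified as a weekday drawing in UTC but a weekend drawing at UTC+13, which is the kind of shift that affects day_sketched.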
If you’ve made it this far in the post…. why not play with a little R 🙂
I wonder which common household pet Quick! drawers tend to use the most strokes to draw? Cats, dogs, or fish?
Have a go at modifying the R code below, using the iNZightPlots package by Tom Elliott and my [very-much-in-its-initial-stages-of-development] iNZightR package, to see what we can learn from the data 🙂 If you’re feeling extra adventurous, why not try modifying the code to explore the relationship between number of strokes and drawing time!
Simulation-based inference is taught as part of the New Zealand curriculum for Statistics at school level, specifically the randomisation test and bootstrap confidence intervals. Some of the reasons for promoting and using simulation-based inference for testing and for constructing confidence intervals are that:
students are working with data (rather than abstracting to theoretical sampling distributions)
students can see the re-randomisation/re-sampling process as it happens
the “numbers” that are used (e.g. tail proportion or limits for confidence interval) are linked to this process.
If we work with the output only, for example the final histogram/dot plot of re-sampled/bootstrap differences, in my opinion, we might as well just use a graphics calculator to get the values for the confidence interval 🙂
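To make the re-randomisation and re-sampling processes concrete, here is a minimal sketch in Python. It is not how VIT is implemented; the two groups are made-up numbers, the statistic is assumed to be a difference between group means, the tail proportion is one-tailed, and the interval is assumed to be the central 95% of the bootstrap differences.

```python
import random
from statistics import mean


def randomisation_test(group_a, group_b, n_shuffles=1000, seed=1):
    """Return the observed difference in means and its tail proportion."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    as_extreme = 0
    for _ in range(n_shuffles):
        # Re-randomise: shuffle all values, then re-split into two groups.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        # Count differences at least as far in the observed direction.
        if (observed >= 0 and diff >= observed) or (observed < 0 and diff <= observed):
            as_extreme += 1
    return observed, as_extreme / n_shuffles


def bootstrap_interval(group_a, group_b, n_resamples=1000, seed=1):
    """Return the central 95% of bootstrap differences in means."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Re-sample each group with replacement, keeping the group sizes.
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(mean(resample_a) - mean(resample_b))
    diffs.sort()
    return diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples) - 1]


# Made-up data for two groups.
group_a = [10, 11, 12, 13, 14]
group_b = [1, 2, 3, 4, 5]

observed, tail_proportion = randomisation_test(group_a, group_b)
lower, upper = bootstrap_interval(group_a, group_b)
print(observed, tail_proportion)
print(lower, upper)
```

Both "numbers" fall straight out of the loop that produced the re-randomised or bootstrap differences, which is the link a final histogram alone does not show.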
In our intro stats course, we use the suite of VIT (Visual Inference Tools) designed and developed by Chris Wild to construct bootstrap confidence intervals and perform randomisation tests. Below is an example of the randomisation test “in action” using VIT:
Last year, VIT was made available as a web-based app thanks to ongoing work by Ben Halsted! So, in this short post I’ll show how to use VIT Online with Google sheets – my two favourite tools for teaching simulation-based inference 🙂
2. Under File –> Publish to web, choose the following settings (this will temporarily make your Google sheet “public” – just “unpublish” once you have the data in VIT Online)
Be careful to select “Sheet1”, or whichever sheet your data is in, not “Entire document”. Then, select “Comma-separated values (.csv)” for the type of file. Directly below is the link to your published data, which you need to copy for step 3.
4. At this point, your data is in VIT online, so you can go back and unpublish your Google sheet by going back to File –> Publish to web, and pressing the button that says “Stop publishing”.
The same steps work to get data from a Google spreadsheet into VIT online for the other modules (bootstrapping etc.).
[Actually, the steps are pretty similar for getting data from a Google spreadsheet into iNZight lite. Copy the published sheet link from step 2 in the appropriately named “paste/enter URL” text box under the File –> Import dataset menu option.]
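If you want to check what a published sheet link returns before handing it to a tool, something like the following sketch in Python would do it. The function names are made up for this example, the URL you pass in is whatever link you copied in step 2, and the two example rows are invented values using the giraffe variable names.

```python
import csv
import io
from urllib.request import urlopen


def parse_csv_text(text):
    """Turn CSV text into a list of row dictionaries keyed by the header."""
    return list(csv.DictReader(io.StringIO(text)))


def read_published_sheet(url):
    """Download a sheet published as CSV and parse it (needs internet access)."""
    with urlopen(url) as response:
        return parse_csv_text(response.read().decode("utf-8"))


# Made-up rows, parsed from text so the example runs without a real link.
example = "Prompt,Height estimate in metres\nA,5.5\nB,4\n"
rows = parse_csv_text(example)
print(rows[0]["Prompt"], rows[0]["Height estimate in metres"])
```

Every value comes back as text, so numeric variables would still need converting before any calculations.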
In terms of how to use VIT online to conduct the randomisation test, I’ll leave you with some videos by Chris Wild to take a look at (scroll down). Before I do, just a couple of differences between the VIT Chris uses and VIT Online and a couple of hints for using VIT Online with students.
You will need to hold down ctrl to select more than one variable before pressing the “Analyse” button e.g. to select both the “Prompt” and “Height estimate in metres” variables in the giraffe data set.
Also, to define the statistic to be tested, in VIT Online you need to press the button that says “Precalculate Display” rather than “Record my choices” as shown in the videos.
Note: VIT Online is not optimised to work on small screen devices, due to the nature of the visualisations. For example, it’s important that students can see all three panels at the same time during the process, and can see what is happening!
Want to make some awesome gift tags/labels for Christmas or holiday-related presents? Here’s a fun little statistical art project. Write whatever words you want in the app below, create some secret snowflakes (the secret part being no one else will know what words you used unless of course you choose to display them), play around with colours if you want (uncheck the option to use random colours), freeze the snowflakes when you get something you like, download your masterpiece and use in some way.
Oh yeah, the snowflakes are made by rotating each letter in the words in a magical statistical way (i.e. randomness).
To make our gift labels, I made the first colour white (the background #ffffff), made the other two colours black (#000000), and then printed on to adhesive sticker paper I had left over from our wedding.
Enjoy and have a great holiday break!
Secret snowflakes app should be shown below (otherwise here is the link) – works best using a Chrome browser 🙂
For many high school teachers here in New Zealand, the teaching year is over and it’s now a six-week summer break before school starts again next year. Despite the well-deserved break, some teachers are already thinking about ideas for next year. I’ve been amazed (and inspired) by the teachers who have signed up to spend a day with Liza and me on Friday 15th December to learn more about working with modern data (more details here). We are both really looking forward to the full-day workshop 🙂 One of the tools we’ll be working with at the workshop is the platform IFTTT (If This Then That). It’s basically a way to connect devices and online accounts using APIs (application programming interfaces) without using code.
I used IFTTT recently to collect data on New York Times articles. One of the reasons why I started collecting data on New York Times articles was because of their free, online feature “What’s Going On in This Graph?”. On Tuesday, December 12 and every second Tuesday of the month through the US school year, The New York Times Learning Network, in partnership with the American Statistical Association, hosts a live online discussion about a timely graph like the one shown below.
Students from around the world “read” the graph by posting comments about what they notice and wonder in an online forum. Teachers live-moderate by responding to the comments in real time and encouraging students to go deeper. All releases are archived so that teachers can use previous graphs anytime (read this introductory post to learn more). I used “What’s Going On in This Graph?” when I was teaching our Lies, Damned lies and Statistics course, and it is such an awesome resource for helping build statistical literacy and thinking.
So what’s going on with the data I collected? Your first thought on viewing the data might be – huh? You call this data? The only variable that is “graph ready” is which section each of the nearly 6000 articles was published in. But there are so many variables in data sets just like this one waiting to be defined and explored. After our workshop on Friday, I’ll post an “after” version of this same data set 🙂
Estimating the mean and standard deviation of a discrete random variable is something we expect NZ students to be able to do by the time they finish Year 13 (Grade 12). The idea is that students estimate these properties of a distribution using visual features of a display (e.g. a dot plot) and, ideally, these measures are visually and conceptually attached to a real data distribution with a context and not treated entirely as mathematical concepts.
At the start of this year I went looking for an interactive dot plot to use when reviewing mean and standard deviation with my intro-level statistics students. Initially, I wanted something where I could drag dots around on a dot plot and show what happens to the mean, standard deviation etc. as I do this. Then I wanted something where you could drag dots on and off the dot plot, rather than having an initial starting dot plot, so students could build dot plots based on various situations. I came across a few examples of interactive-ish dot plots out there in Google-land but none quite did what I wanted (or kept the focus on what I wanted), so I decided to write my own. [Note: CODAP would have been my choice if I had just wanted to drag dots around. Extra note: CODAP is pretty awesome for many, many reasons.]
In my head as I developed the app was an activity I’ve used in the past to introduce standard deviation as a measure – Exploring statistical measures by estimating the ages of famous people – as well as a workshop by the awesome Christine Franklin. For NZ-based teachers (or teachers who want to come to beautiful New Zealand for our national mathematics teachers conference), Chris is one of the keynote speakers at the NZAMT 2017 conference and is running a workshop at this conference called Conceptualizing Variation from the Mean: Evolving from ‘Number of Steps’ to the ‘SAD’ to the ‘MAD’ to the ‘Standard Deviation’ which you should get along to if you can. Also in my head was the idea of the mean of a distribution being like the “balancing point”, and other activities I have used in the past based on this analogy and also see-saws! My teaching colleague Liza Bolton was also super helpful at listening to my ideas, suggesting awesome ones of her own, and testing the app throughout its various versions.
dots – an interactive dot plot
You can access dots at this address: learning.statistics-is-awesome.org/dots/ but you might want to keep reading to find out a little more about how it works 🙂 Below is a screenshot of the app, with some brief descriptions of how things are supposed to work. Current limitations for dots are that no more than 35 dots will be displayed, the axis is fixed between 0 and 34, and that dots can only be placed on whole numbers. I had played around with making these aspects of the app more flexible, but then decided not to pursue this as I’m not trying to re-create graphing/statistical software with this interactive.
Since I’ve got the It’s raining cats and dogs (hopefully) project running, I thought I’d use some of the data collected so far to show a few examples of how to use dots. [Note: The data collection phase of the cats and dogs data cards project is still running, so you can get your students involved]. Here are 15 randomly selected cats from the data cards created so far, with the age of each cat removed.
Once you get past how cute these cats are, what do you think the mean age of these cats is (in years)? Can you tell which cat is the oldest? How much variation do you think there is between the ages of these cats?
Dragging dots onto the dot plot
A dot plot can be created by dragging dots on to the plot (don’t forget to add a label for the axis like I did!)
Sending data to the dot plot
You can also add the data and the label to the URL so that the plot is ready to go. Use the structure shown below to do this, and then click on the link to see the ages of these cats on the interactive dot plot.
You can click below the dots on the axis to indicate your estimate for the mean. You could do a couple of things after this. You could click the Mean button to show the mean, and check how this compares to your estimated mean. Or you could click the Balance test button to turn it on (green), and see how well the dots balance on the point you have estimated as the mean (or both like I did).
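The "balancing point" idea can be written down in a few lines. This is a minimal sketch in Python, not the code behind the Balance test button: it assumes each dot pulls with a weight equal to its signed distance from the estimate, so the dots balance when those distances sum to zero. The ages are made up.

```python
def tilt(values, estimate):
    """Sum of signed distances from the estimate; zero means the dots balance."""
    return sum(value - estimate for value in values)


def describe_balance(values, estimate):
    """Say which way the plot would tip for a given estimate of the mean."""
    total = tilt(values, estimate)
    if total == 0:
        return "balanced"
    return "tips right" if total > 0 else "tips left"


# Made-up ages (whole numbers, as on the dots axis).
ages = [1, 2, 3, 10]
print(describe_balance(ages, 3))  # estimate is below the mean
print(describe_balance(ages, 4))  # estimate equals the mean
print(describe_balance(ages, 6))  # estimate is above the mean
```

Because one old cat at 10 pulls the balance point to 4, an estimate at the median (2.5) or at 3 would tip to the right.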
Estimating standard deviation
Estimating standard deviation is hard. I try not to use “rules” that only work with Normally distributed-ish data (like take the range and divide by six) and aren’t based on what the standard deviation is a measure of. Visualising standard deviation is also a tricky thing. In the video below I’ve gone with two approaches: one uses a Chrome extension, Web Paint, to draw on the plot what I think is the average distance of each dot from the mean, and one uses the absolute deviations.
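To compare these ideas numerically, here is a minimal sketch in Python using made-up values. It assumes the standard deviation divides by the number of dots (n) rather than n - 1; which version dots displays is not stated here, so treat that as an assumption.

```python
from math import sqrt


def mean_absolute_deviation(values):
    """Average distance of each value from the mean."""
    centre = sum(values) / len(values)
    return sum(abs(value - centre) for value in values) / len(values)


def standard_deviation(values):
    """Square root of the average squared distance from the mean (divides by n)."""
    centre = sum(values) / len(values)
    return sqrt(sum((value - centre) ** 2 for value in values) / len(values))


def range_over_six(values):
    """The range-divided-by-six rule, shown only for comparison."""
    return (max(values) - min(values)) / 6


# Made-up values with a mean of 5.
values = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean_absolute_deviation(values))
print(standard_deviation(values))
print(range_over_six(values))
```

For these values the average distance from the mean is 1.5 and the standard deviation is 2, while the range rule gives about 1.17, so the rule is not a safe shortcut for a small, skewed set of dots.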
Using “random distribution”
This is the option I have used the most when working with students individually. Yes, there is no context when using this option, but in my conversations with students when talking about the mean and standard deviation I’m not sure the lack of context makes it a non-concept-building activity. The short video below shows using the median as a starting point for the estimate of the mean, and then adjusting from here depending on other features of the distribution (e.g. shape). The video ends by dragging a dot around to see what happens to the different measures, since that was the starting point for developing dots 🙂
Other ideas for using dots?
Share them below the related Facebook post, on Twitter, or wherever – I’d be super keen to hear whether you find this interactive dot plot useful for teaching students how to estimate mean and standard deviation 🙂
In April 2017, I presented an ASA K-12 statistics education webinar: Statistical reasoning with data cards (webinar). Towards the end of the webinar, I encouraged teachers to get students to make their own data cards about their cats. A few days later, I then thought that this could be something to get NZ teachers and students involved with. Imagine a huge collection of real data cards about dogs and cats? Real data that comes from NZ teachers and students? Like Census At School but for pets 🙂 I persuaded a few of my teacher friends to create data cards for their pets (dogs or cats) and to get their students involved, to see whether this project could work. Below is a small selection of the data cards that were initially created (beware of potential cuteness overload!)
The project then expanded to include more teachers and students across NZ, and even the US, and I’ve now decided to keep the data card generator (and collection) page open so that the set of data cards can grow over time. Please use the steps below to get students creating and sharing data cards about their pets.
Creating and sharing data cards about dogs and cats
Inevitably, there will be submissions made that are “fake”, silly or offensive (see below).
Data cards submitted to the project won’t automatically be added to any public sets of data cards, and will be checked first. Just like with any surveying process that is based on self-selection, is internet based and relies on humans to give honest and accurate answers, there is the potential for non-sampling errors. To help reduce the quantity of “fake” data cards, if you are keen to have your students involved with this project it would be great if you could do the following:
1. Talk to your students about the project and explain that the data cards will be shared with other students. They will be sharing information about their pet and need to be OK with this (and don’t have to!). The data will be displayed with a picture of their pet, so participation is not strictly anonymous. All of this is important to discuss with students as we need to educate students about data privacy 🙂
2. When students submit their data, they are given the finished data card which they can save. Set up a system where students need to share the data card they have created with you e.g. by saving into a shared Google drive or Dropbox, or by emailing the data card to you. The advantage for you of setting up this system is that you get your class/school set of data cards to use however you want. The advantage for me is that this level of “watching” might discourage silly data cards being created.
Since I happen to have a floor, a cat, and tape I thought I’d give it a go. You can see the result at the top of this post 🙂 Amazing right?
Well, no, not really. I marked out the square two days ago, and our cat Elliot only sat in the square today.
our cat often sits on the floor
our cat often sits on different parts of said floor
that we have a limited amount of floor
I marked out the square in an area that he likes to sit
that we were paying attention to where on the floor our cat sat
… and a whole lot of other conditions, it actually isn’t as amazing as Twitter thinks. Also, my hunch is that people who do witness their cat sitting in the square post this on Twitter more often than those who give up waiting for the cat to sit in the square.
Below is a little simulation based on our floor size and the square size we used, taking into account our cat’s disposition for lying down in places. It’s just a bit of fun, but the point is that with random moving and stopping within a fixed area, if you watch long enough the cat will sit in the square 🙂
PS The cat image is by Lucie Parker. And yes, the cat only has to be partially in the square when it stops but I figured that was close enough 🙂
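The idea behind the simulation can be sketched in a few lines of Python. This is a simplified stand-in, not the simulation in this post: it assumes the cat stops at a uniformly random point each time (ignoring any favourite spots), counts a stop only when that point is inside the square, and uses made-up floor and square sizes.

```python
import random


def stops_until_in_square(floor_w, floor_h, square, rng, max_stops=10_000):
    """Count random stops until the cat's position lands inside the square."""
    left, bottom, side = square
    for stop in range(1, max_stops + 1):
        # Each stop is a uniformly random point on the floor.
        x = rng.uniform(0, floor_w)
        y = rng.uniform(0, floor_h)
        if left <= x <= left + side and bottom <= y <= bottom + side:
            return stop
    return None  # gave up watching


# Made-up sizes: a 4 m by 3 m floor with a 0.5 m taped square.
rng = random.Random(1)
square = (1.0, 1.0, 0.5)
runs = [stops_until_in_square(4.0, 3.0, square, rng) for _ in range(2000)]
average_stops = sum(runs) / len(runs)
print(average_stops)
```

With these sizes the square covers about 2% of the floor, so on average it takes roughly 48 stops before the cat "chooses" the square, with no attraction to the tape needed.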
This post provides the notes for a workshop I ran at the Otago Mathematics Association (OMA) Conference about using data challenges to encourage statistical thinking.
Until last week, I had never re-presented or adapted a workshop that I had developed in a previous year. So it was really interesting to take this workshop on data challenges, which I had presented at the AMA and CMA stats days last year, and work through it again with a new bunch of awesome teachers in Dunedin. I wrote notes about this workshop last year – Using data challenges to encourage statistical thinking – so this post will just share a few things I tweaked the second time around, including an activity we tried in Stickland 🙂
Some changes and additions
To show an example of a predictive model in action, we used one of a few online tools which attempt to predict your age using your name (based on US data) e.g. rhiever.github.io/name-age-calculator/index.html. I also demonstrated another online tool that attempts to predict your gender based on writing (hackerfactor.com/GenderGuesser.php) by using my abstract for this workshop (it did correctly predict, based on the writing being formal, that it was written by a female). For the actual data challenge itself using the celebrity data, I purposefully removed Dr Dre from the training data set to make it easier to explore the data without worrying about how to handle his extremely high earnings for 2014 (new link here).
Another thing I changed about the workshop this time around was that rather than use physical data cards (these Census at school stick people data cards), we tried out my new digital data cards in the virtual world of Stickland. I’ve already shared a little bit about the ideas behind Stickland – see the Welcome to stickland! post – so what follows is an example of how we used Stickland in the workshop. (Just a quick reminder that the data cards are real students from the NZ Census At School 2015 data, the names being the only variable that is not real).
The activity starts with the idea of wanting to predict whether a stick person chosen at random from Stickland uses Facebook or not. If you head to learning.statistics-is-awesome.org/stickland, the first thing you could do is select a sample of stick people and see what proportion of them use Facebook. I got the teachers in this workshop to select 20 stick people and then let them play with moving the data cards around in the grey screen below (click or touch the card to drag the card to somewhere else on the screen e.g. to sort the cards into Facebook users and non-Facebook users).
For the sample shown above, an equal number of stick people are Facebook users as not, but of course this will vary from sample to sample. I then told the teachers that this particular stick person is a Snapchat user, and asked them if this changes their prediction of whether they are a Facebook user or not. One way to explore this is to create a two-way table with the cards (see below) and then reason with this.
Most of the different samples showed a similar story to the sample above: Of the Snapchat users, most were Facebook users and of the non-Snapchat users, most were non-Facebook users. I then suggested (if we had time) we could also explore whether knowing the gender and age of the stick person would help us build a better model for predicting Facebook usage. At this stage (considering multiple variables/factors) I would want the students to move into software that allows them to explore the data more deeply (more about how that is possible is discussed in the Welcome to stickland! post). We didn’t do this in the workshop and the teachers had to leave Stickland perhaps before they wanted to 🙂
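The reasoning with the two-way table can be sketched in Python. The counts below are made up to match the story (20 stick people, half of them Facebook users); they are not the sample from the workshop, and the variable names facebook and snapchat are assumptions for this sketch.

```python
from collections import Counter


def two_way_table(cards, row_var, col_var):
    """Count the cards in each (row value, column value) cell."""
    return Counter((card[row_var], card[col_var]) for card in cards)


def proportion_given(table, row_value, col_value):
    """Proportion of cards in a row that also have the column value."""
    row_total = sum(n for (row, _), n in table.items() if row == row_value)
    return table[(row_value, col_value)] / row_total if row_total else None


# Made-up sample of 20 cards as (Snapchat user, Facebook user, count).
cells = [("yes", "yes", 8), ("yes", "no", 3), ("no", "yes", 2), ("no", "no", 7)]
cards = [
    {"snapchat": snap, "facebook": face}
    for snap, face, count in cells
    for _ in range(count)
]

table = two_way_table(cards, "snapchat", "facebook")
print(proportion_given(table, "yes", "yes"))  # Facebook users among Snapchat users
print(proportion_given(table, "no", "yes"))   # Facebook users among non-Snapchat users
```

Without the Snapchat information the best guess is a coin toss (10 of 20), but conditioning on Snapchat use moves the proportion to 8 out of 11 for users and 2 out of 9 for non-users.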
Where to next?
Stickland is just in “proof of concept” form at the moment and will no doubt have lots of bugs and weird features. In the Welcome to stickland! post, I discuss the influence of others in developing these digital data cards, in particular Pip Arnold and her work with statistical investigations and data cards that stretches back to at least 2005 (if not earlier!). Feel free to have a play and to let me know what you think about the concept, but this is definitely a possible project for 2017 and not intended to be a fully featured product yet.
I’ve been working on a little side project for the last year or so. I thought this might be a good time to share this with you, particularly since I probably (with a very high probability) won’t be making any more posts for the rest of the year due to a few little things called a dissertation and a wedding 🙂
The idea was to create a digital learning environment for working with data cards, in an attempt to make stronger connections between data cards, data structures and data displays, and to make effective use of tablets/devices (particularly in large lecture groups like my current teaching situation). This first digital environment is based on the C@S stick people data cards I created last year, but could involve any population/data etc, since everything is created dynamically. The idea to use stick people (figures) for the data cards was based on material Rob Gould presented at the NZAMT conference in 2015 regarding the Introduction to Data Science (IDS) course the Mobilize team created for high school students.
In stickland, the members of its population (the C@S stick people) ride by on skateboards. The numbers displayed on each stick person are their unique three digit ID number. The environment is set up so that the stick people arrive to this stretch of road in stickland in a random order and at random times. Students could check this out by watching the stick people skate on by and recording their ID numbers. They should see no pattern to the numbers and be convinced that they cannot predict what ID number the next stick person will have (well, I guess if you watched for long enough you would be able to predict the last ID number……)
To select stick people to find out more about them, students click on the stick person as they skate past. Some of the stick people are faster than others (more about that next year!) so it’s not always easy to catch them. This means that it will take different times for students to collect the same number of data cards. As the stick people are selected, a stack of data cards starts to be built on the top right hand side of data card screen below.
At this point we’re in a similar position to where we would be if we had given students a set of data cards each, or if we had asked them to select a random sample of data cards from a population bag. One of the really awesome things about data cards is the physical nature of them – students can move them around, sort them, line them up, etc. So in this digital environment, students can drag the stick people data card around by tapping their heads and dragging their finger.
I love getting students to sort the data cards by a categorical variable (e.g. Facebook user) and then by another categorical variable (e.g. Snapchat user) to build ideas of two-way tables and conditioning.
You can also get students to make graphs out of the data cards (see one of Pip Arnold’s excellent resources along these lines here on Census At School NZ). In this digital environment, students can make the cards bigger or smaller, and can move into “dot” mode as they move into graphical representations by encoding the data.
To help students build understanding of what are essential features of their graphs, there is a drawing tool so they can add in additional information like axes, labels, numbers etc. I can see a whole lot of potential here, particularly with students exploring different ways to organise and display data.
To help build understanding of the relationship between units, variables and data structures (specifically rectangular data sets), an interactive spreadsheet builds below the data card screen as the cards are collected. When a student selects a data card, this stick person’s row of data is highlighted in the spreadsheet, and vice versa. To check each student can match the data shown on the data card to the spreadsheet, data is missing from the spreadsheet (shown by grey boxes).
Students will need to find the relevant stick person, read the card for the appropriate variable, and enter this data to make a complete data set. At the moment, I’ve set this feature so that there is missing data for 10 different stick people (one for each variable on the data card) and that the data cannot be visualised using software (iNZight lite) until the missing data has been found.
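The missing-data feature can be sketched in Python as follows. This is not the stickland code: the function names, the three variables and the five rows are all made up for the example, and it assumes exactly one value is hidden per variable, each in a different stick person's row.

```python
import random


def hide_values(rows, variables, rng):
    """Blank one value per variable, each in a different row.

    Needs at least as many rows as variables.
    """
    chosen_rows = rng.sample(range(len(rows)), k=len(variables))
    masked = [dict(row) for row in rows]
    hidden = {}
    for row_index, variable in zip(chosen_rows, variables):
        hidden[(row_index, variable)] = masked[row_index][variable]
        masked[row_index][variable] = None  # shown as a grey box
    return masked, hidden


def is_complete(rows):
    """True once every grey box has been filled in."""
    return all(value is not None for row in rows for value in row.values())


# Made-up rows for five stick people and three variables.
variables = ["age", "facebook", "snapchat"]
rows = [
    {"age": 13, "facebook": "yes", "snapchat": "yes"},
    {"age": 15, "facebook": "no", "snapchat": "yes"},
    {"age": 12, "facebook": "no", "snapchat": "no"},
    {"age": 16, "facebook": "yes", "snapchat": "no"},
    {"age": 14, "facebook": "yes", "snapchat": "yes"},
]

masked, hidden = hide_values(rows, variables, random.Random(1))
print(is_complete(masked))  # the data cannot be sent on yet

# A student reads each card and enters the missing values.
restored = [dict(row) for row in masked]
for (row_index, variable), value in hidden.items():
    restored[row_index][variable] = value
print(is_complete(restored))
```

Checking for completeness before allowing the next step mirrors the rule that the data cannot be visualised until every missing value has been found.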
The final link is to explore the data using software like iNZight lite, which has been designed by Chris Wild to help students “get into data deeper and faster” (PS I’m not sure if that is an exact quote!). The data cards are not automatically linked to the data in iNZight lite, so if more data cards are collected, the iNZight button will need to be pressed again to update. I’m excited about getting students to explore relationships and build informal predictive models (after trying this out with the data cards earlier), and then checking these models out by easily selecting more stick people (see more about this kind of activity in my post about data challenges).