I’m pretty excited about the talks and workshops I’m doing over the next month or so! Below are the summaries or abstracts for each talk/workshop and when I get a chance I’ll write up some of the ideas presented in separate posts.
Keynote: Searching for meaningful sampling in apple orchards, YouTube videos, and many other places! (AMA, Auckland, September 14, 2019)
In this talk, I shared some of my ideas and adventures with developing more meaningful learning tasks for sampling. Using the “Apple orchard” exemplar task, I presented some ideas for “renovating” existing tasks and then introduced some new opportunities for teaching sample-to-population inference in the context of modern data and associated technologies. I shared a simple online version of the apple orchard and also talked about how my binge watching of DIY YouTube videos led to my personal (and meaningful) reason to sample and compare YouTube videos.
Workshop: Expanding your toolkit for teaching statistics (AMA, Auckland, September 14, 2019)
In this workshop, we explored some tools and apps that I’ve developed to support students’ statistical understanding. Examples were: an interactive dot plot for building understanding of mean and standard deviation, a modelling tool for building understanding of distributional variation, tools for carrying out experiments online and some new tools for collecting data through sampling.
The slides for both the keynote and workshop are embedded below:
Talk: Introducing high school statistics teachers to code-driven tools for statistical modelling (VUW/NZCER, Wellington, September 30, 2019)
Abstract: The advent of data science has led to statistics education researchers re-thinking and expanding their ideas about tools for teaching and learning statistical modelling. Algorithmic methods for statistical inference, such as the randomisation test, are typically taught within NZ high school classrooms using GUI-driven tools such as VIT. A teaching experiment was conducted over three five-hour workshops with six high school statistics teachers, using new tasks designed to blend the use of both GUI-driven and code-driven tools for learning statistical modelling. Our findings from this exploratory study indicate that teachers began to enrich and expand their ideas about statistical modelling through the complementary experiences of using both GUI-driven and code-driven tools.
Keynote: Follow the data (NZAMT, Wellington, October 3, 2019)
Abstract: Data science is transforming the statistics curriculum. The amount, availability, diversity and complexity of data that are now available in our modern world requires us to broaden our definitions and understandings of what data is, how we can get data, how data can be structured and what it means to teach students how to learn from data. In particular, students will need to integrate statistical and computational thinking and to develop a broader awareness of, and practical skills with, digital technologies. In this talk I will demonstrate how we can follow the data to develop new learning tasks for data science that are inclusive, engaging, effective, and build on existing statistics pedagogy.
Workshop: Just hit like! Data science for everyone, including cats (and maybe dogs) (NZAMT, Wellington, October 2, 2019)
Abstract: Data science is all about integrating statistical and computational thinking with data. In this hands-on workshop we will explore a collection of learning tasks I have designed to introduce students to the exciting world of image data, measures of popularity on the web, machine learning, algorithms, and APIs. We’ll explore questions such as “Are photos of cats or dogs more popular on the web?”, “What makes a good black and white photo?”, “How can we sort photos into a particular order?”, “How can I make a cat selfie?” and many more. We’ll use familiar statistics tools and approaches, such as data cards, collaborative group tasks and sampling activities, and also try out some new computational tools for learning from data. Statistical concepts covered include features of data distributions, informal inference, exploratory data analysis and predictive modelling. We’ll also discuss how each task can also be extended or adapted to focus on specific aspects and levels of the statistics curriculum. Please bring along a laptop to the workshop.
Recently I’ve been developing and trialling learning tasks where the learner is working with a provided data set but has to do something “human” that motivates using a random sample as part of the strategy to learn something from the data.
Since I already had a tool that creates data cards from the Quick, Draw! data set, I’ve created a prototype for the kind of tool that would support this approach using the same data set.
I’ve written about the Quick, Draw! data set already:
For this new tool, called different strokes, users sort drawings into two or more groups based on something visible in the drawing itself. Since you have to drag the drawings around to manually “classify” them, the larger the sample you take, the longer it will take you.
There’s also the novelty and creativity of being able to create your own rules for classifying drawings. I’ll use cats for the example below, but from a teaching and assessment perspective there are SO many drawings of so many things and so many variables with so many opportunities to compare and contrast what can be learned about how people draw in the Quick, Draw! game.
Here’s a precis of the kinds of questions I might ask myself to explore the general question What can we learn from the data about how people draw cats in the Quick, Draw! game?
Are drawings of cats more likely to be heads only or the whole body? [I can take a sample of cat drawings, and then sort the drawings into heads vs bodies. From here, I could bootstrap a confidence interval for the population proportion].
Is how someone draws a cat linked to the game time? [I can use the same data as above, but compare game times by the two groups I’ve created – head vs bodies. I could bootstrap a confidence interval for the difference of two population means/medians]
Is there a relationship between the number of strokes and the pause time for cat drawings? [And what do these two variables actually measure – I’ll need some contextual knowledge!]
Do people draw dogs similarly to cats in the Quick, Draw! game? [I could grab new samples of cat and dog drawings, sort all drawings into “heads” or “bodies”, and then bootstrap a confidence interval for the difference of two population proportions]
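To make the bootstrapping idea in the questions above a little more concrete, here is a minimal sketch of a percentile bootstrap interval for a proportion. It is only an illustration under stated assumptions: the function name, the “head”/“body” labels and the counts in the example are hypothetical (not results from the tool), and the percentile method is just one common way to build a bootstrap interval.

```python
import random


def bootstrap_proportion_ci(labels, target, n_resamples=1000, level=0.95, seed=None):
    """Percentile bootstrap interval for the proportion of `labels` equal to `target`."""
    rng = random.Random(seed)
    n = len(labels)
    proportions = []
    for _ in range(n_resamples):
        # Resample the sorted drawings with replacement, keeping the sample size.
        resample = rng.choices(labels, k=n)
        proportions.append(sum(1 for label in resample if label == target) / n)
    proportions.sort()
    tail = (1 - level) / 2
    lower = proportions[int(tail * n_resamples)]
    upper = proportions[min(n_resamples - 1, int((1 - tail) * n_resamples))]
    return lower, upper


# Hypothetical sample of 50 cat drawings sorted by hand: 34 heads, 16 bodies.
sample = ["head"] * 34 + ["body"] * 16
low, high = bootstrap_proportion_ci(sample, "head", seed=1)
print(f"Sample proportion of heads: {34 / 50:.2f}")
print(f"Bootstrap interval: ({low:.2f}, {high:.2f})")
```

The same resampling loop could be adapted for a difference of two proportions (cats vs dogs) by resampling each group separately and storing the difference each time.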
But I haven’t seen dot-shaped ones yet, so this led me to re-develop the Quick, Draw! sampling tool to be able to create some 🙂
I was also motivated to work some more on the tool after the fantastic Wendy Gibbs asked me at the NZAMT (New Zealand Association of Mathematics Teachers) writing camp if I could include variables related to the times involved with each drawing. I suspect she has read this super cool post by Jim Vallandingham (while you’re at his site, check out some of his other cool posts and visualisations) which came out after I first released the sampling tool and compares strokes and drawing/pause times for different words/concepts – including cats and dogs!
The drawing and pause times are in seconds. The drawing time captures the time taken for each stroke from beginning to end and the pause time captures all the time between strokes. If you add these two times together, you will get the total time the person spent drawing the word/concept before either the 20 seconds was up, or Google tried to identify the word/concept. Below the word/concept drawn is whether the drawing was correctly recognised (true) or not (false).
I also added three ways to use the data cards once they have been generated using the sampling tool (scroll down to below the data cards). You can now:
(1) download a PDF version of the data cards, with circles the same size as the circle punch shown above (2″/5cm)
(2) download the CSV file for the sample data
(3) show the sample data as an HTML table (which makes it easy to copy and paste into a Google sheet, for example)
In terms of options (2) and (3) above, I had resisted making the data this accessible in the previous version of the sampling tool. One reason for this is that I wanted the drawings themselves to be considered as data, and as a human would be involved in developing this variable, there was a need to work with just a sample of all the millions of drawings. I still feel this way, so I encourage you to get students to develop at least one new variable for their sample data that is based on a feature of the drawing 🙂 For example, whether the drawing of a cat is the face only, or includes the body too.
There are other cool things possible to expand the variables provided. Students could create a new variable by adding drawing_time and pause_time together. They could also create a variable which compares the number_strokes to the drawing_time e.g. average time per stroke. Students could also use the day_sketched variable to classify sketches as weekday or weekend drawings. Students should soon find the hemisphere is not that useful for comparisons, so could explore another country-related classification like continent. More advanced manipulations could involve working with the time stamps, which are given for all drawings using UTC time. This has consequences for the variable day_sketched as many countries (and places within countries) will be behind or ahead of the UTC time.
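The variable-building ideas above could be sketched as follows. This is a rough illustration, not how the sampling tool works: drawing_time, pause_time and number_strokes are the variable names used above, but the timestamp_utc field name, the example values and the fixed UTC offset (which ignores daylight saving) are assumptions made for the sketch.

```python
from datetime import datetime, timedelta, timezone

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def add_derived_variables(card, utc_offset_hours=0):
    """Return a copy of one data card with some new variables added."""
    # The time stamps are in UTC, so shift to a local time before naming the day.
    utc_time = datetime.fromisoformat(card["timestamp_utc"]).replace(tzinfo=timezone.utc)
    local_time = utc_time.astimezone(timezone(timedelta(hours=utc_offset_hours)))

    new_card = dict(card)
    new_card["total_time"] = card["drawing_time"] + card["pause_time"]
    new_card["time_per_stroke"] = card["drawing_time"] / card["number_strokes"]
    new_card["local_day_sketched"] = DAY_NAMES[local_time.weekday()]
    new_card["weekend"] = local_time.weekday() >= 5  # Saturday or Sunday
    return new_card


# Hypothetical data card: a Friday night in UTC is already Saturday at UTC+12.
card = {"drawing_time": 6.0, "pause_time": 3.5, "number_strokes": 4,
        "timestamp_utc": "2017-03-10 22:30:00"}
print(add_derived_variables(card, utc_offset_hours=12))
```

Changing utc_offset_hours shows how the same drawing can be classified as a weekday or a weekend drawing depending on where it was sketched.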
If you’ve made it this far in the post…. why not play with a little R 🙂
I wonder which common household pet Quick! drawers tend to use the most strokes to draw? Cats, dogs, or fish?
Have a go at modifying the R code below, using the iNZightPlots package by Tom Elliott and my [very-much-in-its-initial-stages-of-development] iNZightR package, to see what we can learn from the data 🙂 If you’re feeling extra adventurous, why not try modifying the code to explore the relationship between number of strokes and drawing time!
This post is the second in a series of posts where I’m going to share some strategies for getting real data to use for statistical investigations that require sample to population inference. As I write them, you will be able to find them all on this page.
What’s your favourite board game?
I read an article posted on fivethirtyeight about the worst board games ever invented and it got me thinking about the board games I like to play. The Game of Life has a low average rating on the online database of games referred to in this article but I remember kind of enjoying playing it as a kid. boardgamegeek.com features user-submitted information about hundreds of thousands of games (not just board games) and is constantly being updated. While there are some data sets out there that already feature data from this website (e.g. from kaggle datasets), I am purposely demonstrating a non-programming approach to getting this data that maximises the participation of teachers and students in the data collection process.
To end up with data that can be used as part of a sample to population inference task:
(1) You need a clearly defined and nameable population (in this case, all board games listed on boardgamegeek.com).
(2) You need a sampling frame that is a very close match to your population.
(3) You need to select from your sampling frame using a random sampling method to obtain the members of your sample.
(4) You need to define and measure variables from each member of the sample/population so the resulting data is multivariate.
boardgamegeek.com actually provides a link that you can use to select one of the games on their site at random (https://boardgamegeek.com/boardgame/random), so using this “random” link (hopefully) takes care of (2) and (3). For (4), there are so many potential variables that could be defined and measured. To decide on what variables to measure, I spent some time exploring the content of the webpages for a few different games to get a feel for what might make for good variables. I decided to stick to variables that are measured directly for each game, rather than ones that were based on user polls, and went with these variables:
Millennium the game was released (1000, 2000, all others)
Number of words in game title
Minimum number of players
Maximum number of players
Playing time in minutes (if a range was provided, the average of the limits was used)
Minimum age in years
Game type (strategy or war, family or children’s, other)
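A few of the variables in the list above are derived rather than read straight off the page, and the rules for them can be written down explicitly. The sketch below is one possible reading of those rules, not the exact process used for the post: the function names and the example game are made up, and the millennium groups are assumed to mean years 1000 to 1999, years 2000 to 2999, and everything else.

```python
def millennium_group(year_released):
    """Group a release year as "1000", "2000" or "all others" (assumed cut-offs)."""
    if 1000 <= year_released <= 1999:
        return "1000"
    if 2000 <= year_released <= 2999:
        return "2000"
    return "all others"


def words_in_title(title):
    """Count the words in a game title, splitting on whitespace."""
    return len(title.split())


def playing_time_minutes(shortest, longest=None):
    """Use the single playing time, or the average of the limits if a range is given."""
    if longest is None:
        return float(shortest)
    return (shortest + longest) / 2


# Hypothetical game entry, typed in by hand from a game's webpage.
title, year, time_range = "An Example Game of Dice", 1995, (30, 60)
print(millennium_group(year), words_in_title(title), playing_time_minutes(*time_range))
```

Writing the rules out like this is also a useful classroom discussion point, since students collecting the data by hand need to agree on the same rules.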
I wrote a post recently – Just Google it – which featured real data distributions. boardgamegeek.com also provides simple graphs of the ratings for each game, so we can play a similar matching game. You could also try estimating the mean and standard deviation of the ratings from the graph, with the added game feature of reverse ordering!
Which games do you think match which ratings graphs?
The Lord of the Rings: The Card Game
I couldn’t find a game that had a clear bi-modal distribution for its ratings but I reckon there must be games out there that people either love or hate 🙂 Let me know if you find one! To get students familiar with boardgamegeek.com, you could ask them to first search for their favourite game and then explore what information and ratings have been provided for this on the site. Let the games begin 🙂
This post is the first in a series of posts where I’m going to share some strategies for getting real data for real data stories, specifically to use for statistical investigations that require sample to population inference. As I write them, you will be able to find them all on this page.
Key considerations for finding real data for sample to population inference tasks
It’s really important that I stress that the approaches I’ll discuss are not necessarily what I would typically use when finding data to explore. Generally, I’d let the data drive the analysis, not the analysis drive the data I try to find. These are specific examples so that the data that is obtained can be used sensibly to perform sample to population inference. It’s also really important to talk about why I’m stressing the above 🙂 In NZ we have specific standards that are designed to assess understanding of sample to population inference, using informal and formal methods that have been developed by exploring the behaviour of random samples from populations (AS91035, AS91264, AS91582). So, for the students’ learning about rules of thumb and confidence intervals to make sense, we need to provide students with clearly defined named populations with data that are (or are able to be) randomly sampled from these populations. At high school level at least, these strict conditions are in place so that students can focus on one central question: What can and can’t I say about the population(s) based on the random sample data?
For all the examples I’ll cover in this series of posts, there are four key considerations/requirements:
(1) You need a clearly defined and nameable population. Ideally this should be as simple and clear as possible to help students out, but to ensure (2), the “name” can end up being quite specific.
(2) You need a sampling frame that is a very close match to your population. This means you need a way to access every member of your population to measure stuff about them (variables). Sure, this is not the reality of what happens in the real world in terms of sampling, but remember what I said earlier about what was important 🙂
(3) You need to select from your sampling frame using a random sampling method to obtain the members of your sample. It is sufficient (and recommended) to stick to simple random sampling. In some cases, you may be able to make an assumption that what you have can be considered a random sample, but I’d prefer to avoid these kinds of situations where possible at high school level.
(4) You need to define and measure variables from each member of the sample/population. We want students working with multivariate data sets, with several options possible for numerical and categorical data (but don’t forget there is the option to create new variables from what was measured).
I’ll try to refer back to these four considerations/requirements when I discuss examples in the posts that will follow.
Just one very relevant NZ NCEA assessment-specific comment before we talk data. For AS91035 and AS91582, the standards state that students are to be provided with the sample multivariate data for the task – so all of (1) (2) (3) and (4) is done by the teacher. Similarly with AS91264, the requirement for the standard is that students select a random sample (3) from a provided population dataset – so (1) (2) and (4) are done by the teacher. This does not mean the students can’t do more in terms of the sampling/collecting processes, just that these are not requirements for the standards and asking students to do more should not limit their ability to complete the task. I’ll try to give some ideas for how to manage any related issues in the examples.
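Requirement (3), selecting a simple random sample from a sampling frame or a provided population dataset, can be sketched in a few lines. This is only an illustration: the function name is made up and the sampling frame of 500 numbered population members is hypothetical.

```python
import random


def simple_random_sample(sampling_frame, sample_size, seed=None):
    """Select `sample_size` different members of the frame, each equally likely."""
    rng = random.Random(seed)
    # random.sample selects without replacement, so no member is chosen twice.
    return rng.sample(sampling_frame, sample_size)


# Hypothetical sampling frame: ID numbers for a population of 500 members.
frame = list(range(1, 501))
sample_ids = simple_random_sample(frame, 30, seed=2019)
print(sorted(sample_ids))
```

Setting a seed makes the selection reproducible, which can help when a teacher needs to check which members a student’s sample should contain; leaving it as None gives a different sample each time.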
Just one more point. I haven’t made this (5) in the previous section, but something to watch out for is the nature of your “cases”. Tables of data (which we refer to as datasets) that play nicely with statistical software like iNZight are ones where the data is organised so that each row is a case and each column is a variable. Typically at high school level, the datasets we use are ones where each case (e.g. each individual in the defined population) is measured directly to obtain different variables. Things can get a little tricky conceptually when some of the variables for a case are actually measured by grouping/aggregating related but different cases.
For example, if I take five movies from the Internet Movie Database (imdb.com) that have “dog” in the title and another five with “cat” in the title, I could construct a mini dataset like the one below using information from the website:
For this dataset, each row is a different movie, so the cases are the movies. Each column provides a different variable for each movie. The variables Movie title, Year released, Movie length mins, Average rating, Number of ratings, Number photos and Genre were taken straight from the webpage for each movie. I created the variables Number words title, Number letters title, Average letters per word, Animal in title, Years since release and Millennium. [Something I won’t tackle in this post is what to do about the Genre variable to make this usable for analysis.]
The Average rating variable looks like a nice numerical variable to use, for example, to compare ratings of these movies with “dog” in the title and those with “cat”. The thing is, this particular variable has been measured by aggregating individuals’ ratings of the movie using a mean (the related but different cases here are the individuals who rated the movies). You can see why this may be an issue when you look at the variable Number of ratings, which again is an aggregate measure (a count) – some of these movies have received fewer than 200 ratings while others are in the hundreds of thousands. We also can’t see what the distribution of these individual ratings for each movie looks like to decide whether the mean is telling us something useful about the ratings. [For some more really interesting discussion of using online movie ratings, check out this fivethirtyeight article.]
The variable Average letters per word has been measured directly from each case, using the characteristics of the movie title. There are still some potential issues with using the variable Average letters per word as a measure of, let’s say, complexity of words used in the movie title, since the mean is being used, but at least in this case students can see the movie title.
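Variables like Number letters title, Average letters per word, Animal in title and Years since release are measured directly from each case, which is easy to see when the rules are written as code. The sketch below is one possible set of rules, not necessarily the ones used for the dataset: the function name and the example movie are made up, only alphabetic characters are counted as letters, and the current year is passed in as an assumption.

```python
def title_variables(title, year_released, current_year):
    """Measure title-based variables directly from one movie (one case)."""
    words = title.split()
    # Assumption: only alphabetic characters count as letters.
    letters = sum(1 for character in title if character.isalpha())
    lowered = title.lower()
    if "dog" in lowered:
        animal = "dog"
    elif "cat" in lowered:
        animal = "cat"
    else:
        animal = "neither"
    return {
        "number_words_title": len(words),
        "number_letters_title": letters,
        "average_letters_per_word": letters / len(words),
        "animal_in_title": animal,
        "years_since_release": current_year - year_released,
    }


# Hypothetical movie, not taken from the dataset discussed in the post.
print(title_variables("A Dog Named Example", 1995, current_year=2019))
```

Because every value comes from the title and release year of a single movie, students can check each one by hand, which is not possible for an aggregate like Average rating.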
For this dataset, each row is a different department, so the cases are the departments. Each column provides a different variable for each department. Gender was estimated based on the information provided in the directory and the data may be inaccurate for this reason. The % of PhD candidates that are female looks like a nice numerical variable to use, for example, to compare gender rates between these departments from the Arts and Science faculties. Generally with numerical variables we would use the mean or median as a measure of central tendency. But this variable was measured by aggregating information about each PhD candidate in that department and presenting this measure as a percentage (the related but different cases here are the PhD candidates). Just think about it, does it really make sense to make a statement like: The mean % of PhD candidates that are female for these departments of the Arts faculty is 73% whereas the mean % of PhD candidates that are female for these departments of the Science faculty is 44%, especially when the number of PhD candidates varies so much between departments?
Looking at the individual percentages is interesting to see how they vary across departments, but combining them to get an overall measure for each faculty should involve calculating another percentage using the original counts for PhD candidates for each department (e.g. group by faculty). If I want to compare gender rates between the Arts and Science faculties for PhD candidates, I would calculate the proportion of all PhD candidates across these departments that are female for each faculty e.g. 58% of the PhD candidates from these departments of the Arts faculty are female, 53% of the PhD candidates from these departments of the Science faculty are female.
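The difference between averaging the department percentages and calculating one percentage from the original counts can be seen with a tiny example. The counts below are invented purely for illustration (they are not the directory data behind the percentages quoted above), and the function names are made up.

```python
def mean_of_percentages(departments):
    """Average the department percentages, treating every department equally."""
    percentages = [100 * d["female"] / d["total"] for d in departments]
    return sum(percentages) / len(percentages)


def pooled_percentage(departments):
    """Percentage of all candidates who are female, using the original counts."""
    female = sum(d["female"] for d in departments)
    total = sum(d["total"] for d in departments)
    return 100 * female / total


# Hypothetical faculty: two small departments and one much larger one.
faculty = [
    {"female": 9, "total": 10},   # 90% female
    {"female": 3, "total": 4},    # 75% female
    {"female": 20, "total": 50},  # 40% female
]
print(f"Mean of department percentages: {mean_of_percentages(faculty):.1f}%")
print(f"Pooled percentage: {pooled_percentage(faculty):.1f}%")
```

In this made-up faculty exactly half of the 64 candidates are female, but the mean of the three percentages is over 68% because the two small departments carry as much weight as the large one.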
This post was not supposed to deter you from finding and creating your own real datasets! But we do need to think carefully about the data that we provide to students, especially our high school students. Not all datasets are the same and while I’ve seen some really cool and interesting ideas out there for finding/collecting data for investigations, some of these ideas unintentionally produce data that makes it very difficult for students to engage with the core question: What can and can’t I say about the population(s) based on the random sample data?
In the next post, I’ll discuss some examples of finding real data online. Until I find time to write this next post, check out these existing data finding posts: