The power of pixels: Modelling with images

This post provides the notes for the plenary I gave for the Auckland Mathematical Association (AMA) about using images as a source of data for teaching statistical investigations.

You might be disappointed to find out that my talk (and this post) is not about the movie pixels, as my husband initially thought it was. It’s probably a good thing I decided to focus on pixels in terms of data about a computer or digital image, as the box office data about pixels the movie suggests that the movie didn’t perform so well 🙂 Instead for this talk I presented some examples of using images as part of statistical investigations that (hopefully) demonstrated how the different combinations of humans, digital technologies, and modelling can lead to some pretty interesting data. The abstract for the talk is below:

How are photos of cats different from photos of dogs? How could someone determine where you come from based on how you draw a circle? How could the human job of counting cars at an intersection be cheaply replaced by technology? I will share some examples of simple models that I and others have developed to answer these kinds of questions through statistical investigations involving the analysis of both static and dynamic images. We will also discuss how the process of creating these models utilises statistical, mathematical and computational thinking.

As I was using a photo of my cat Elliot to explain the different ways we can use images to collect data, a really funny thing happened (see the embedded tweet below).

Yes, an actual real #statscat appeared in the room! What are the chances of that? 🙂

Pixels are the squares of colour that make up computer or digital (raster) images. Each image has a certain number of pixels e.g. an image that is 41 pixels in width and 15 pixels in height contains 615 pixels, which is an obvious link to concepts of area. The 615 pixels are stored in an ordered list, so the computer knows how to display them, and each pixel contains information about colour. Using RGB colour values (other systems exist), each pixel contains information about the amounts of red, green and blue on a scale of 0 to 255 inclusive. To get at the information about the pixels is going to require some knowledge of digital technologies, and so the use of images within statistical investigations can be a nice way to teach objectives from across the different curriculum learning areas.

Using images as a source of data can happen on at least three levels. Using the aforementioned photo of my cat Elliot, humans could extract data from the image by focusing on things they can see, for example, that that image is a black and white photo and not in colour, that there are two cats in the photo, and that Elliot does not appear to be smiling. Data that is also available about the image using digital tech includes variables such as the number of pixels, the file type and the file size. Data that can be generated using models related to this image could be identifying the most prominent shade of grey, the likelihood this photo will get more than 100 likes on instagram and what the photo is of (cat vs dog for example, a popular machine learning task).

Static images

The first example used the data, in particular the photos, collected as part of the ongoing data collection project I have running about cats and dogs (the current set of pet data cards can be downloaded here). As humans, we can look at images, notice things that are different and these features can be used to create variables. For example, if you look at some of the photos submitted: some pets are outside while others are inside; some pets are looking at the camera while others are looking away from the camera; and some are “close ups” while others taken from a distance.

These potential variables are all initially categorical, but by using digital technologies, numerical variables are also possible. To create a measure of whether a photo is a “close up” shot of a pet, the area the pet takes up of the photo can be measured. This is where pixels are super helpful. I used paint.net, free image editing software, to show that if I trace around the dog in this photo using the lasso tool that the dog makes up about 61 000 pixels. If you compare this figure to the total number of pixels in the image (90 000), you can calculate the percentage the dog makes up of the photo.

For the current set of pet data card, each photo now has this percentage displayed. Based on this very small sample of six pets, it kind of looks like maybe cats typically make up a larger percentage of the photo than dogs, but I will leave this up to you to investigate using appropriate statistical modelling 🙂

For a pretty cool example of using static images, humans, digital technologies and models, you should take a look at how-old.net. As humans, we can look at photos of people and estimate their age and compare our estimates to people’s actual ages. What how-old.net has done is used machine learning to train a model to predict someone’s age based on the features of the photo submitted. I asked teachers at the talk to select which of the three photos they thought I looked the youngest in (most said B), which is the same photo that the how-old.net model predicted I looked the youngest in. A good teaching point about the model used by how-old.net is that it does get updated, as new data is used to refine its predictions.

You can also demonstrate how models can be evaluated by comparing what the model predicts to the actual value (if known). Fortunately I have a large number of siblings and so a handy (and frequently used) range of different aged people to test the how-old.net model. Students could use public figures, such as athletes, politicians, media personalities or celebrities, to compare each person’s actual age to what the model predicts (since it’s likely that both photos and ages are available on the internet).

There is also the possibility of setting up an activity around comparing humans vs models – for the same set of photos, are humans better at predicting ages than how-old.net? Students could be asked to consider how they could set up this kind of activity, what photos could they use, and how would they decided who was better – humans or models?

Drawings

The next example used the set of drawings Google has made available from their Quick! Draw! game and artificial intelligence experiment. I’ve already written a post about this data set, so have a read of that post if you haven’t already 🙂 In this talk, I asked teachers to draw a quick sketch of cat and then asked them to tell me whether they drew just the face, or the body as well (most drew the face and body – I’m not sure if the appearance of an actual cat during the talk influenced this at all!) I also asked them to think about how many times they lifted their pen off the paper. I probably forgot to say this at the time, but for some things humans are pretty good at providing data but for others, digital technologies are better. In the case of drawing and thinking about how many strokes you made while drawing, we would get more accurate data if we could measure this using a mouse, stylus or touchscreen than asking people to remember.

Using the random sampler tool that I have set up that allows you to choose one of the objects players have been asked to draw for Quick! Draw!, I generated a random sample of 200 of the drawings made when asked to draw a cat. The data the can be used from each drawing is a combination of what humans and digital technologies can measure. The drawing itself (similar to the photos of pets in the first example) can be used to create different variables, for example whether the sketch is of the face only, or the face and body. Other variables are also provided, such as the timestamp and country code, both examples of data that is captured from players of the game without them necessarily realising (e.g. digital traces).

After manually reviewing all 200 drawings and recording data about the variables, I used iNZight VIT to construct bootstrap confidence intervals for the proportion of all drawings made of cats in the Quick! Draw! dataset that were only of faces and for the difference between the mean number of strokes made for drawings of cats in the Quick! Draw! dataset that were of bodies and mean number of strokes made for drawings of cats in the Quick! Draw! dataset that were of faces. Interestingly, while the teachers at the talk mostly drew sketches of cats with bodies, most players of Quick! Draw! only sketch the faces of cats. This could be due to the 20 second time limit enforced when playing the game. It makes sense that the, on average, Quick! Draw! players use more strokes to draw cats with bodies versus cats with just faces. I wished at the time that I had also recorded information about the other variables provided for each drawing, as it would have been good to further explore how the drawings compare in terms of whether the game correctly identified more of the face-only drawings of cats than the body drawings.

What is also really interesting is the artificial intelligence aspect of the game. The video below explains this pretty well, but basically the model that is used to guess what object is being drawn is trained on what previous players of the game have drawn.

From a maths teachers perspective, this is a good example of what can go wrong with technology and modelling. For example, players are asked to draw a square, and because the model is trained on how they draw the object, players who draw four lines that are roughly perpendicular behave similarly from the machine’s perspective because the technology is looking for commonalities between the drawings. What the technology is not detecting is that some players do not know what a square is, or think squares and rectangles are the same thing. So the data being used to train the model is biased. The consequence of this bias is that the model will now reinforce players misunderstanding that a rectangle is a square by “correctly” predicting they are drawing a square when they draw a rectangle! An interesting investigation I haven’t done yet would be to estimate what percentage of drawings made for squares are rectangles 🙂 I would also suggest checking out some of the other “shape” objects to see other examples e.g. octagons.

Using a more complex form of the Google Quick! Draw! dataset, Thu-Huong Ha and Nikhil Sonnad analysed over 100 000 of the drawings made of circles to show how language and culture influences sketches. For example, they found that 86% of the circles drawn by players in the US were drawn counter clockwise, while 80% of the circles drawn by players in Japan were drawn clockwise. To me, this is really fascinating stuff, and really cool examples of how using images as a source of data can result in really meaningful investigations about the world.

Animation

The last example I used was about using videos as a source of data for probability distribution modelling activities. I’ve presented some workshops before where I used a video (traffic.mp4) from a live streaming traffic camera positioned above a section of the motorway in Wellington. Focusing on the lane of traffic closest to the front of the screen, I got teachers to count how many cars arrived to a fixed point in that lane every five seconds. This gave us a nice set of data which we could then use to test the suitability of a Poisson distribution as a model.

For this talk, I wanted to demonstrate how humans could be replaced (potentially) by digital technologies and models. Since the video is a collection of images shown quickly (around 50 frames per second), we can use pixels, or potentially just a single pixel, in the images to measure various attributes of the cars. About a year ago, I set myself the challenge of exploring whether it would be possible to glean information about car counts, car colours etc. and shared my progress with this personal project at the end of the talk.

So, yes there does exist pretty fancy video analysis software out there that I could use to extract the data I want, but I wanted to investigate whether I could use a combination of statistical, mathematical and computational thinking to create my own model to generate the data. As part of my PhD, I’m interesting in finding out what activities could help introduce students to the modern art and science of learning from data, and what is nice about this example is that idea of how the model could count how many cars are arriving every five seconds to a fixed point on the motorway is actually pretty simple and so potentially a good entry point for students.

The basic idea behind the model is that when there are no cars at the point on the motorway, the pixel I am tracking is a certain colour. This colour becomes my reference colour for the model. Using the RBG colour system, for each frame/image in the traffic video, I can compare the current colour of the pixel e.g. rgb(100, 250, 141) to the reference colour e.g. rgb(162, 158, 162). As soon as the colour changes from the reference colour, I can infer this means a car has arrived to the point on the motorway. And as soon as the colour changes back to the reference colour, I can infer that the car has left the point on the motorway. While the car is moving past the point, I can also collect data on the colour of the pixel from each frame, and use this to determine the colour of the car.

I’m still working on the model (in that I haven’t actually modified it since I first played around with the idea last year) and the video below shows where my project within CODAP (Common Online Data Analysis Platform) is currently at. When I get some time, I will share the link to this CODAP data interactive so you and your students can play around with choosing different pixels to track and changing other parameters of the model I’ve developed 🙂

You might notice by watching this video that the model needs some work. The colours being recorded for each car are not always that good (average colour is an interesting concept in itself, and I’ve learned a lot more about how to work with colour since I developed the model) and some cars end up being recorded twice or not at all. But now that I’ve developed an initial model to count the cars that arrive every five seconds, I can compare the data generated from the model to the data generated by humans to see how well my model performed.

You can see at the moment, that the data looks very different when comparing what the humans counted and what the digital tech + model counted. So maybe the job of traffic counter (my job during university!) is still safe – for now 🙂

Going crackers

I didn’t get time in the talk to show an example of a statistical investigation that used images (photos of animal crackers or biscuits) to create a informal prediction model. I’ll write about this in another post soon – watch this space!

It’s raining cats and dogs (hopefully)

In April 2017, I presented an ASA K-12 statistics education webinar: Statistical reasoning with data cards (webinar). Towards the end of the webinar, I encouraged teachers to get students to make their own data cards about their cats. A few days later, I then thought that this could be something to get NZ teachers and students involved with. Imagine a huge collection of real data cards about dogs and cats? Real data that comes from NZ teachers and students? Like Census At School but for pets 🙂 I persuaded a few of my teacher friends to create data cards for their pets (dogs or cats) and to get their students involved, to see whether this project could work. Below is a small selection of the data cards that were initially created (beware of potential cuteness overload!)

The project then expanded to include more teachers and students across NZ, and even the US, and I’ve now decided to keep the data card generator (and collection) page open so that the set of data cards can grow over time. Please use the steps below to get students creating and sharing data cards about their pets.

Creating and sharing data cards about dogs and cats

Inevitably, there will be submissions made that are “fake”, silly or offensive (see below).

Data cards submitted to the project won’t automatically be added to any public sets of data cards, and will be checked first. Just like with any surveying process that is based on self-selection, is internet based and relies on humans to give honest and accurate answers, there is the potential for non-sampling errors. To help reduce the quantify of “fake” data cards, if you are keen to have your students involved with this project it would be great if you could do the following:

1. Talk to your students about the project and explain that the data cards will be shared with other students. They will be sharing information about their pet and need to be OK with this (and don’t have to!). The data will be displayed with a picture of their pet, so participation is not strictly anonymous. All of this is important to discuss with students as we need to educate students about data privacy 🙂

2. When students submit their data, they are given the finished data card which they can save. Set up a system where students need to share the data card they have created with you e.g. by saving into a shared Google drive or Dropbox, or by emailing the data card to you. The advantage for you of setting up this system is that you get your class/school set of data cards to use however you want. The advantage for me is that this level of “watching” might discourage silly data cards being created.

3. Share this link with your students http://learning.statistics-is-awesome.org/dogsvscats/ and let the rain of cats and dogs begin!

Pet data cards

The data collection period for this set of data cards was 1 May 17 to 19 May 17.

The diagram below shows the data included on each data card:


Additional data that could be used from each data card includes:

  • Whether the pet photo was taken inside or outside
  • Whether the pet photo is rotated (and the angle of rotation)
  • The number of letters in the pet name
  • The number of syllables in the pet name

PDF of all data cards: click to download

 

Auckland Marathon 2015 runners (population data)

auckland_marathon_2015

The data for each runner entered in the Auckland Marathon 2015 was obtained from https://www.aucklandmarathon.co.nz/. This data is owned by the organisers of the Auckland Marathon and can not be used for commercial purposes unless by prior written permission from the organisers.

For each runner, the following was recorded:

  • bib number
  • name
  • time in hours (this is blank if the runner did not compete in the race)
  • place (this is blank if the runner did not compete in the race)
  • gender
  • division
  • age division
  • distance in km (this is blank if the runner did not compete in the race)
  • mean pace km per hr (this is blank if the runner did not compete in the race)

NB: This data set contains information about the five different races which are part of the Auckland Marathon 2015. It may be necessary to focus on just one of these races for a meaningful investigation, for example if comparing running times for male and female runners (whether as part of a sample-to-population inference or as part of exploring the population data).

Here is the population data set as a CSV file: all_races_auckland_marathon_2015_final

Rugby World Cup 2015 players (population data)

rugby_world_cup_2015

The data for each player in the Rugby World Cup 2015 was obtained from http://www.rugbyworldcup.com/. This data is owned by the Rugby World Cup Ltd (RWC) and can not be used for commercial purposes unless by prior written permission from the RWC.

Thanks to @cushlat for the idea 🙂

For each player, the following was recorded:

  • team played for (team)
  • name (name)
  • number of international matches played (caps)
  • position (position)
  • number of years since debuted (years_since_debut)
  • date of debut (debut)
  • age at Rugby World Cup 2015 (age)
  • age minus years_since_debut (approx_age_debuted)
  • height in cm (height_cm)
  • weight in kg (weight_kg)

NB: This data set should be used with care for sample-to-population inference involving comparison, as both categorical variables (team and position) involve a large number of outcomes (16 teams and 11 positions). This means it is not likely that a random sample of 80 players from the population of Rugby World Cup 2015 players, for example, will contain sufficient numbers of players in any two groups for comparison e.g. England vs New Zealand OR forwards vs backs. If you use all the data for NZ and all the data for England to compare the age of players, for example, you will have used all of the data for this population and so there is no need to “make a call” about what is going on “back in the population” 🙂

My advice would be to use this data set for either single variable sampling investigations OR exploratory data analysis for the entire population. There is also something interesting in using the time variable (debut) to explore other variables 🙂

Here is the population data set as a CSV file: rubgy_world_cup_2015