This post is second in a series of posts where I’m going to share some strategies for getting real data to use for statistical investigations that require sample to population inference. As I write them, you will be able to find them all on this page.

What’s your favourite board game?

I read an article posted on fivethirtyeight about the worst board games ever invented and it got me thinking about the board games I like to play. The Game of life has a low average rating on the online database of games referred to in this article but I remember kind of enjoying playing it as a kid. boardgamegeek.com features user-submitted information about hundreds of thousands of games (not just board games) and is constantly being updated. While there are some data sets out there that already feature data from this website (e.g. from kaggle datasets), I am purposely demonstrating a non-programming approach to getting this data that maximises the participation of teachers and students in the data collection process.

To end up with data that can be used as part of a sample to population inference task:

  1. You need a clearly defined and nameable population (in this case, all board games listed on boardgamegeek.com)
  2. You need a sampling frame that is a very close match to your population.
  3. You need to select from your sampling frame using a random sampling method to obtain the members of your sample.
  4. You need to define and measure variables from each member of the sample/population so the resulting data is multivariate.

boardgamegeek.com actually provide a link that you can use to select one of the games on their site at random (https://boardgamegeek.com/boardgame/random), so using this “random” link (hopefully) takes care of (2) and (3). For (4), there are so many potential variables that could be defined and measured. To decide on what variables to measure, I spent some time exploring the content of the webpages for a few different games to get a feel for what might make for good variables. I decided to stick to variables that are measured directly for each game, rather than ones that were based on user polls, and went with these variables:

  • Millennium the game was released (1000, 2000, all others)
  • Number of words in game title
  • Minimum number of players
  • Maximum number of players
  • Playing time in minutes (if a range was provided, the average of the limits was used)
  • Minimum age in years
  • Game type (strategy or war, family or children’s, other)
  • Game available in multiple languages (yes or no)

Time to play!

I’ve set up a Google form with instructions of how you can help create a random sample of games from boardgamegeek.com at this link: https://goo.gl/forms/8yBqryGTzrZGhEVx2. As people play along, the sample data will be added here: https://docs.google.com/spreadsheets/d/e/2PACX-1vSzR_VSVzaaeWpCvYbAQCUewaM3Tad2zfTBO7AWuDgFFTj5Jaq2TBo6N-gQGCe5e5t_qKW7Knuq6-pr/pub?gid=552938859&single=true&output=csv . The URL to the game is included so that the data can be checked. Feel free to copy and adapt however you want, but do keep in mind that nature of the variables you use. In particular, be very careful about using any of the aggregate ratings measures (and another great article by fivethirtyeight about movie ratings explains some of the reasons why.)

Bonus round

I wrote a post recently – Just Google it – which featured real data distributions. boardgamegeek.com also provides simple graphs of the ratings for each game, so we can play a similar matching game. You could also try estimating the mean and standard deviation of the ratings from the graph, with the added game feature of reverse ordering!

Which games do you think match which ratings graphs?

  1. Monopoly
  2. The Lord of the Rings: The Card Game
  3. Risk
  4. Tic-tac-toe
A
B
C
D

I couldn’t find a game that had a clear bi-modal distribution for its ratings but I reckon there must be games out there that people either love or hate 🙂 Let me know if you find one! To get students familiar with boardgamegeek.com, you could ask them to first search for their favourite game and then explore what information and ratings have been provided for this on the site. Let the games begin 🙂

Anna teaches introductory-level statistics at the University of Auckland. She enjoys facilitating workshops to support professional development of statistics teachers and thinks teaching statistics (and mathematics) is awesome. Anna is also undertaking a PhD in statistics and data science education.

Game of data