[RPG] How to test whether a die is fair

dice, loaded-dice

I have a d20 that seems to be, well, remarkably lucky.* How can I determine whether it's really just luck, or whether the die is in fact unfairly biased?

*) Well, I don't, really. This is actually a spin-off from this question, which is specifically about determining whether a die is loaded. This one is intended as a more general question about how to detect any kind of bias in dice, since we apparently don't have one yet. I've posted my own answer below, but feel free to add more.

Best Answer

What kinds of bias can dice have?

Lots of kinds, actually. Perhaps the most common accidentally occurring types of bias are:

  • "Shaved" dice, which are not quite symmetrical, but slightly wider or narrower on one axis than on others. A shaved d6 with, say, the 1–6 axis longer than the others will roll those sides less often, making it "less swingy" than a fair d6 should be (but leaving the average roll unchanged). The name comes from cheaters actually shaving or sanding down dice to flatten them, but cheap dice may have this kind of bias simply due to being poorly made. Other similar biases due to asymmetric shape are also possible, especially in dice with many sides.

  • Uneven (concave / convex) faces may be more or less likely to "stick" to the table, favoring or disfavoring the opposite side. The precise effect may depend on the table material, and on how the dice are rolled. Again, cheap plastic dice can easily have this kind of bias, e.g. due to the plastic shrinking unevenly as it cools after molding. Uneven edges can also create bias, particularly if the edge is asymmetric (i.e. sharper on one side).

  • Actual "loaded" dice, i.e. dice with a center of gravity offset from their geometric center, may occur accidentally due to either bubbles trapped inside the plastic or, more commonly, simply due to the embossed numbers on the sides of the die affecting the balance. In fact, almost all dice, with the exception of high-quality casino dice deliberately balanced to avoid this kind of bias, will likely have it to some small extent.

How do I find out whether a die is fair?

Obviously, you need to roll it. Preferably, you should do this the same way, on the same kind of table, as you'd use in a game; while truly fair dice should be fair on any surface, some types of bias may show up only on some surfaces.

Keep rolling the same die several times, and count how many times each side comes up. If you've got a friend to help you, you can have them tally up the rolls as you call them out, so you don't have to switch between rolling and marking the results all the time. Once your arm gets tired of rolling dice, switch roles.

How many times do you need to roll?

For the type of statistical test described below (Pearson's \$\chi^2\$ test), a common rule of thumb is to have at least five times as many rolls as there are sides on the die. Thus, for a d20, you need at least 100 rolls for the test to be valid. (There are other statistical tests that can be used with fewer rolls, but they require slightly more complicated math.) Obviously, more rolls won't hurt if you have the patience for it, and the more rolls you tally up, the better the test will detect subtle biases.

(Note: If you've, say, bought a large bunch of cheap d6's for rolling large dice pools, it can be OK to just roll them all together and tally up the number of times each face comes up. Sure, this way you won't detect if one of the dice is, say, slightly more likely to roll a 6, while another one is slightly less likely to roll it, but you'll still detect any systematic biases due to, say, all the dice being unsymmetrical in the same way.)

OK, I've rolled the die 100 times. Now what?

Now it's time to do some math.

  1. First, look up the tally of how many times each side came up. Below, I'll call the number of times side 1 came up \$n_1\$, the number of times side 2 came up \$n_2\$, and so on up to \$n_{20}\$ for a d20. I'll also use \$N\$ to denote the total number of rolls, i.e. \$N = n_1 + n_2 + \dots + n_{20}\$.

  2. Next, calculate the expected number of times each side should have come up for a fair die, i.e. the total number of rolls divided by the number of sides.1 (It's OK for this to be a fractional number.) Call this number \$n_{\exp}\$. For example, for \$N = 100\$ rolls of a d20, \$n_{\exp} = \frac{N}{20} = 5\$.

  3. Now, for each side k (from 1 to 20, for a d20), calculate the difference between the actual and the expected count of times the side came up, square it (i.e. multiply it by itself), and divide it by the expected count. That is, calculate:

    $$\chi^2_k = \frac{ \left( n_k - n_{\exp} \right) ^2}{n_{\exp}}$$

    for each possible number \$k\$ of your die (i.e. from \$k = 1\$ to \$k = 20\$, for a d20).2

  4. Finally, add up all the results from the previous step to obtain the test statistic $$\chi^2 = \chi^2_1 + \chi^2_2 + \dots + \chi^2_{20} = \sum_{k=1}^{20} \frac{ \left( n_k - n_{\exp} \right) ^2}{n_{\exp}}.$$
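
If you'd rather let a computer do the arithmetic, the whole calculation fits in a few lines. Here's a minimal sketch in Python; the tally counts are made-up example numbers, so substitute your own:

    # Pearson's chi-squared statistic for a d20, computed from a tally of rolls.
    # The counts below are made-up example data; replace them with your own tally.
    tallies = {
         1: 3,  2: 6,  3: 5,  4: 4,  5: 7,  6: 5,  7: 4,  8: 6,  9: 3, 10: 5,
        11: 6, 12: 4, 13: 5, 14: 7, 15: 4, 16: 5, 17: 6, 18: 4, 19: 5, 20: 6,
    }

    N = sum(tallies.values())    # total number of rolls
    n_exp = N / len(tallies)     # expected count per side for a fair die

    # Add up (n_k - n_exp)^2 / n_exp over every side k.
    chi_squared = sum((n_k - n_exp) ** 2 / n_exp for n_k in tallies.values())

    print(f"{N} rolls, chi-squared = {chi_squared:.2f}")

(For this particular made-up tally, \$N = 100\$ and the result is \$\chi^2 = 5.2\$.)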

OK, I've got this \$\chi^2\$ figure. What do I do with it?

The \$\chi^2\$ value you've calculated is a measure of how biased the die appears to be, based on the numbers you've rolled with it. But what counts as a reasonable value of \$\chi^2\$, and where's the threshold at which you should start getting suspicious?

For that, you either need to do some more math, or, more easily, just look it up in a table.

To use the table, you first need to know how many "degrees of freedom" your test has. That's simpler than it sounds: for a \$d\$-sided die, the test has \$\nu = d - 1\$ degrees of freedom (i.e. \$\nu = 19\$ for a d20).3 This will tell you which row in the table to look at.

In that table, the row for \$\nu = 19\$ looks like this:

                Probability less than the critical value
  ν           0.90      0.95     0.975      0.99     0.999
----------------------------------------------------------
 19         27.204    30.144    32.852    36.191    43.820

What does this mean? Well, it means that, if the die is actually fair, then \$\chi^2\$ will be less than 27.204 in 90% of all tests, less than 30.144 in 95% of all tests, and so on. Only once in a thousand tests will a fair d20 actually produce a \$\chi^2\$ value higher than 43.820.

Thus, by comparing \$\chi^2\$ to the critical values in the table, you can estimate how likely it is to be biased.4 If \$\chi^2 \le 27\$, the die probably has no bias, or at least you haven't counted enough rolls to detect it; around \$\chi^2 \ge 30\$ or so, you might want to be concerned, and maybe set the die aside for further testing; if \$\chi^2 \ge 40\$, you can declare the die biased with pretty high confidence.
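
If you'd rather not squint at a table at all, and you happen to have Python with SciPy available (an assumption on my part), you can compute the relevant numbers directly. This sketch turns a \$\chi^2\$ value into a p-value (the probability that a perfectly fair die would score at least that high) and reproduces the critical values for a d20:

    # p-value and critical values for the chi-squared test on a d20.
    # Requires SciPy; the chi2_stat value here is just an example.
    from scipy.stats import chi2 as chi2_dist

    dof = 19          # degrees of freedom for a d20 (sides minus one)
    chi2_stat = 26.0  # substitute the statistic computed from your own rolls

    # Probability that a fair die would produce a chi-squared this large or larger.
    print(f"p-value = {chi2_dist.sf(chi2_stat, dof):.3f}")

    # The critical values from the table row above:
    for p in (0.90, 0.95, 0.975, 0.99, 0.999):
        print(f"{p:5.3f}: {chi2_dist.ppf(p, dof):.3f}")

(SciPy can also run the whole test in one call: scipy.stats.chisquare takes the list of observed counts and returns both the statistic and the p-value.)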

Note that the chi-squared test does not say anything about how the die is biased: a die that, say, rolls 10 more often and 11 less often than it should is just as likely to fail the test as one that rolls 20 more often and 1 less often. Of course, if the chi-squared test does detect bias, you can just look at the tally counts yourself to see which ones occur more often than you'd expect.
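
If you want to see at a glance which sides are driving a suspicious \$\chi^2\$ value, it helps to sort the individual \$\chi^2_k\$ contributions. A small sketch, again with made-up tallies:

    # Rank each side of a d20 by its contribution to the chi-squared statistic.
    # Made-up example: a die that never rolled 1 and rolled 20 twice as often as expected.
    tallies = {side: 5 for side in range(1, 21)}
    tallies[1], tallies[20] = 0, 10

    N = sum(tallies.values())
    n_exp = N / len(tallies)

    contributions = {side: (count - n_exp) ** 2 / n_exp for side, count in tallies.items()}
    for side, contrib in sorted(contributions.items(), key=lambda kv: kv[1], reverse=True):
        print(f"side {side:2d}: rolled {tallies[side]:2d} times, contribution {contrib:.2f}")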

Ps. For convenience, here are the table rows for a few other commonly used types of dice:5

Upper-tail critical values of χ² distribution with ν degrees of freedom (source: NIST)

                Probability less than the critical value
  ν           0.90      0.95     0.975      0.99     0.999
----------------------------------------------------------
  1 (d2)     2.706     3.841     5.024     6.635    10.828
  2 (d3)     4.605     5.991     7.378     9.210    13.816
  3 (d4)     6.251     7.815     9.348    11.345    16.266
  5 (d6)     9.236    11.070    12.833    15.086    20.515
  7 (d8)    12.017    14.067    16.013    18.475    24.322
  9 (d10)   14.684    16.919    19.023    21.666    27.877
 11 (d12)   17.275    19.675    21.920    24.725    31.264
 19 (d20)   27.204    30.144    32.852    36.191    43.820

Footnotes:

1) For an ordinary fair die, the expected number of times each side comes up is obviously the same, but we could use the chi-squared test also for dice that we don't expect to roll each number equally often (like, say, dice where the same number appears several times). In that case, we'd just have a different \$n_{\exp}\$ for each possible roll of the die.
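
For example (a hypothetical die of my own invention, not one from the question above), a d6 labeled 1, 2, 3, 4, 5, 5 would be expected to roll a 5 twice as often as any other number, so the expected counts are simply proportional to how many faces show each value:

    # Expected counts for a hypothetical d6 labeled 1, 2, 3, 4, 5, 5.
    faces = [1, 2, 3, 4, 5, 5]
    N = 120  # total number of rolls

    n_exp = {value: faces.count(value) * N / len(faces) for value in set(faces)}
    # -> {1: 20.0, 2: 20.0, 3: 20.0, 4: 20.0, 5: 40.0}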

2) I'm not aware of a conventional symbol for these intermediate values, but \$\chi^2_k\$ seems like a reasonable choice, given both that they add up to the test statistic \$\chi^2\$, and that each of them is the square of an (approximately) normally distributed random variable, and thus is itself \$\chi^2\$-distributed. Your favorite statistics text, if it bothers to give them a symbol at all, may use something else.

3) The number of degrees of freedom is essentially the number of values in our measurements that can vary independently. Here, we're measuring 20 values, \$n_1\$ to \$n_{20}\$, but they're not quite independent: we know that \$n_1 + n_2 + \dots + n_{20} = N\$, so once we know 19 of the values, we can calculate the last one based on the other 19. Hence, 19 degrees of freedom.

4) Note that the numbers in the table header give the probability that a perfectly fair die will produce a \$\chi^2\$ value higher than the critical value in that column. This is not the same as the probability that a die with \$\chi^2\$ less than the critical value is fair, or that a die with \$\chi^2\$ higher than the critical value is biased; to calculate those probabilities, you'd first have to know the a priori frequency of bias among your dice. Indeed, in some sense, these questions are not even meaningful to ask: truly fair dice only exist in the platonic realm of ideas, and every real die almost certainly has some bias, if you measure it carefully enough. Thus, in a sense, any claim that a given die is fair is false; all we can really say is that it's close enough to fair that we can't tell the difference.

5) A "d2" is, of course, a coin. Use the "d3" column (\$\nu = 2\$) e.g. for Fudge dice.


Addendum: So, just how many rolls do we need to actually detect biased dice? Well, I did some quick simulation tests, using an extremely biased virtual d20 that never rolls a 1, and rolls 20 twice as often as it should. Using the different \$\chi^2\$ thresholds given in the table above, and various numbers of test rolls, from the minimum of 100 up to 400, here's the fraction of runs on which the \$\chi^2\$ value exceeded the threshold:

                   Probability of passing a fair die
          |   0.90      0.95      0.975     0.99     0.999
Rolls     +-----------------------------------------------
          |        Probability of detecting the bias
100       |   0.50      0.37      0.26      0.17     0.054
200       |   0.89      0.80      0.69      0.55     0.28
300       |   0.9932    0.972     0.938     0.87     0.62
400       |   0.9999    0.9992    0.9961    0.985    0.88

In each case, the probability of falsely detecting bias in a fair die is essentially independent of the number of rolls — this is a deliberate feature of the \$\chi^2\$ test. The probability of correctly detecting the biased die, however, increases significantly with more rolls.

From the table above, we can see that 100 rolls (the minimum number for the \$\chi^2\$ test to even be valid) is far too few to detect even such an egregious bias: even if we set the \$\chi^2\$ threshold so low that we end up rejecting 10% of all fair dice, we still catch only about 50% of the biased ones, and it only gets worse as we increase the threshold.

On the other hand, with 400 rolls, things look a lot better: setting the threshold at \$\chi^2 \le 36.191\$, 99% of all fair dice will pass this test, while about 98.5% of all the biased dice in this test will fail it. (Of course, we're still talking about very strongly biased dice here; more subtle bias will be harder to detect.)
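
If you'd like to reproduce or extend these numbers, the simulation is easy to sketch. The code below is not the exact script I used, just an illustration of the idea, with the biased die modeled by turning every rolled 1 into a 20:

    # Estimate how often the chi-squared test flags a crudely biased d20
    # (never rolls 1, rolls 20 twice as often as it should). Rough sketch only.
    import random

    def chi_squared(rolls, sides=20):
        n_exp = len(rolls) / sides
        counts = [rolls.count(side) for side in range(1, sides + 1)]
        return sum((n - n_exp) ** 2 / n_exp for n in counts)

    def biased_roll():
        roll = random.randint(1, 20)
        return 20 if roll == 1 else roll   # every 1 becomes a 20

    def detection_rate(num_rolls, threshold, trials=2000):
        detected = 0
        for _ in range(trials):
            rolls = [biased_roll() for _ in range(num_rolls)]
            if chi_squared(rolls) > threshold:
                detected += 1
        return detected / trials

    for num_rolls in (100, 200, 300, 400):
        print(num_rolls, detection_rate(num_rolls, threshold=30.144))  # the 0.95 column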

OK, but surely a die that never rolls 1 should be easy to spot? After all, with a fair d20, the probability of rolling 100 times and never seeing a 1 is only \$\left(\frac{19}{20}\right)^{100} \approx 0.006\$. Shouldn't that be plenty of reason to consider the die biased? What gives?

Well, one reason why the \$\chi^2\$ test seems so ineffective here is that it's looking for any kind of bias. Sure, if we rolled a d20 a hundred times, and never saw a 1, we might be justifiably suspicious. But what if we never saw a 7, or a 15, or any of the other possible rolls? Would those also be reason to call the die biased?

Well, it turns out that, even though the probability of never rolling a 1 in 100 rolls on a d20 is only about 0.6%, the probability of never rolling some number is about 20 times that, or about 12%. So if we rejected all 20-sided dice that never rolled some number in 100 rolls, we'd end up rejecting about 12% of all fair dice, too. And, of course, there also are many other kinds of possible biases that the \$\chi^2\$ test will also detect; thus, with just 100 rolls, it's actually quite likely to detect some bias even in a d20 that's perfectly fair, and so we need to set the threshold value quite high to compensate.
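
For the record, the exact figure comes from an inclusion–exclusion sum over which faces go missing; the first term, \$20 \times \left(\frac{19}{20}\right)^{100} \approx 0.118\$, is the rough "20 times" estimate above, and the corrections for two or more missing faces only trim it slightly:

$$P(\text{some face missing in 100 rolls}) = \sum_{j=1}^{19} (-1)^{j+1} \binom{20}{j} \left( \frac{20-j}{20} \right)^{100} \approx 0.11.$$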

If we were only interested in bias affecting the most extreme rolls (1 and 20), we could modify the \$\chi^2\$ test to e.g. lump all the rolls between 2 and 19 into a single category, with \$n_{\exp} = \frac{18 \times N}{20}\$, and use the \$\chi^2\$ threshold for two degrees of freedom (since we now have only three possible outcomes: 1, 20, or something else). Such a modified \$\chi^2\$ test is a lot better at detecting this particular form of bias, with more than half of the biased dice failing the test at the 1% false positive rate even with just 100 rolls, and over 99.99% of them failing it with 200 rolls.
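
Here's what that modified test looks like in code, with made-up tallies and the \$\nu = 2\$ critical values copied from the d3 row of the table above:

    # Modified chi-squared test for a d20 that only looks at the extreme faces:
    # three categories (1, 20, everything else), hence 2 degrees of freedom.
    # The counts are made-up example data.
    count_1, count_20, total_rolls = 2, 11, 100
    count_other = total_rolls - count_1 - count_20

    expected = {"1": total_rolls / 20, "20": total_rolls / 20, "other": 18 * total_rolls / 20}
    observed = {"1": count_1, "20": count_20, "other": count_other}

    chi_squared = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in expected)
    print(f"chi-squared = {chi_squared:.2f}")  # compare against 5.991 (0.95) or 9.210 (0.99)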

Of course, the price we pay for that extra discriminatory power is that this modified test will be completely oblivious to most other kinds of bias — for example, it will happily pass a die that never rolls a 2, and that rolls 19 twice as often as it should.