## 15. Benford's Law

Benford's law refers to probability distributions that seem to govern the significant digits in real data sets. The law is named for the American physicist and engineer Frank Benford, although the law was actually discovered earlier by the astronomer and mathematician Simon Newcomb.

To understand Benford's law, we need some preliminaries. Recall that a positive real number $x$ can be written uniquely in the form $x = y \cdot 10^n$ (sometimes called scientific notation) where $y \in \left[\frac{1}{10}, 1\right)$ is the mantissa and $n \in \mathbb{Z}$ is the exponent (both of these terms are base 10, of course). Note that

$\log(x) = \log(y) + n$

where the logarithm function is the base 10 common logarithm instead of the usual base $e$ natural logarithm. In the old days BC (before calculators), one would compute the logarithm of a number by looking up the logarithm of the mantissa in a table of logarithms, and then adding the exponent. Of course, these remarks apply to any base $b > 1$, not just base 10. Just replace 10 with $b$ and the common logarithm with the base $b$ logarithm.
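The decomposition into mantissa and exponent can be sketched in a few lines of Python. The function name `mantissa_exponent` is hypothetical, and the correction loops are an assumption added here to guard against floating-point logarithms landing one step off near exact powers of $b$.

```python
import math

def mantissa_exponent(x, b=10):
    """Return (y, n) with x = y * b**n and 1/b <= y < 1 (hypothetical helper)."""
    if x <= 0 or b <= 1:
        raise ValueError("need x > 0 and b > 1")
    # First guess for the exponent from the base-b logarithm.
    n = math.floor(math.log(x, b)) + 1
    y = x / b**n
    # Floating-point logs can be off by one near exact powers of b,
    # so nudge (y, n) until the mantissa lies in [1/b, 1).
    while y >= 1:
        y /= b
        n += 1
    while y < 1 / b:
        y *= b
        n -= 1
    return y, n

y, n = mantissa_exponent(324.0)
print(y, n)
# Check the identity log(x) = log(y) + n in base 10.
print(math.isclose(math.log10(324.0), math.log10(y) + n))
```

With this convention a power of the base such as $1000$ has mantissa $\frac{1}{10}$ and exponent $4$, since the mantissa interval is closed on the left.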

#### Distribution of the Mantissa

Suppose now that $X$ is a number selected at random from a certain data set of positive numbers. Based on empirical evidence from a number of different types of data, Newcomb, and later Benford, noticed that the mantissa $Y$ of $X$ seemed to have distribution function $F(y) = 1 + \log(y)$ for $y \in \left[\frac{1}{10}, 1\right)$. We will generalize this to an arbitrary base $b > 1$. Thus, let

$F(y) = 1 + \log_b(y), \quad \frac{1}{b} \le y < 1$

Show that $F$ satisfies the mathematical properties of a distribution function for a continuous distribution on $\left[\frac{1}{b}, 1\right)$.

Note that the corresponding probability density function is $f(y) = \frac{1}{y \ln(b)}$ for $y \in \left[\frac{1}{b}, 1\right)$.

Show that

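1. $\mathbb{E}(Y) = \frac{b - 1}{b \ln(b)}$
2. $\text{var}(Y) = \frac{b - 1}{b^2 \ln(b)} \left[ \frac{b + 1}{2} - \frac{b - 1}{\ln(b)} \right]$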
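As a numerical sanity check (not a proof), the moment formulas can be compared with a simulation. The sketch below assumes the mantissa is sampled by the inverse transform method: solving $F(y) = u$ gives $y = b^{u - 1}$ for $u$ uniform on $[0, 1)$. The function names, the seed, and the sample size are arbitrary choices made for illustration.

```python
import math
import random

def mean_formula(b):
    # E(Y) = (b - 1) / (b ln b)
    return (b - 1) / (b * math.log(b))

def var_formula(b):
    # var(Y) = (b - 1) / (b^2 ln b) * [(b + 1)/2 - (b - 1)/ln b]
    lnb = math.log(b)
    return (b - 1) / (b**2 * lnb) * ((b + 1) / 2 - (b - 1) / lnb)

def sample_mantissa(b, rng):
    # Inverse transform: F(y) = 1 + log_b(y) = u  =>  y = b**(u - 1).
    return b ** (rng.random() - 1)

rng = random.Random(0)  # fixed seed, chosen arbitrarily
b = 10
ys = [sample_mantissa(b, rng) for _ in range(200_000)]
m = sum(ys) / len(ys)
v = sum((y - m) ** 2 for y in ys) / (len(ys) - 1)
print("mean:     sample", m, "formula", mean_formula(b))
print("variance: sample", v, "formula", var_formula(b))
```

The sample values should agree with the formulas only up to sampling error, which shrinks as the sample size grows.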

For the standard base 10 decimal case

1. Sketch the graph of $f$.
2. Compute the mean and variance explicitly.

#### Distribution of the Digits

Assume now that the base is a positive integer $b \in \mathbb{N}_+$, which of course is the case in standard number systems. Suppose that the sequence of digits of our mantissa $Y$ (in base $b$) is $(N_1, N_2, \ldots)$, so that

$Y = \sum_{k=1}^\infty \frac{N_k}{b^k}$

Thus, our leading digit $N_1$ takes values in $\{1, 2, \ldots, b - 1\}$, while each of the other significant digits takes values in $\{0, 1, \ldots, b - 1\}$. Our goal is to compute the joint probability density function of the first $k$ digits. But let's start, appropriately enough, with the first digit law, the discrete probability density function of the leading digit:

Show that $\mathbb{P}(N_1 = n) = \log_b\left(1 + \frac{1}{n}\right) = \log_b(n + 1) - \log_b(n)$ for $n \in \{1, 2, \ldots, b - 1\}$. Hint: Note that $N_1 = n$ if and only if $Y \in \left[\frac{n}{b}, \frac{n + 1}{b}\right)$.

Consider the standard base 10 decimal case.

1. Explicitly compute the values of the probability density function of $N_1$ and sketch the graph.
2. Find $\mathbb{E}(N_1)$
3. Find $\text{var}(N_1)$
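One way to check hand computations for this exercise is to tabulate the first digit law numerically. The following is a minimal sketch; the function name `first_digit_pmf` is hypothetical.

```python
import math

def first_digit_pmf(b=10):
    """P(N_1 = n) = log_b(1 + 1/n) for n = 1, ..., b - 1."""
    return {n: math.log(1 + 1 / n, b) for n in range(1, b)}

pmf = first_digit_pmf(10)
mean = sum(n * p for n, p in pmf.items())
var = sum((n - mean) ** 2 * p for n, p in pmf.items())

for n, p in pmf.items():
    print(n, round(p, 4))
print("total probability:", sum(pmf.values()))
print("mean:", mean, "variance:", var)
```

The probabilities telescope to $\log_b(b) = 1$, so the printed total is a quick check that no digit was missed.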

Now, to compute the joint probability density function of the first $k$ significant digits, some additional notation will help. If $n_1 \in \{1, 2, \ldots, b - 1\}$ and $n_j \in \{0, 1, \ldots, b - 1\}$ for $j \in \{2, 3, \ldots, k\}$, let

$[n_1 n_2 \cdots n_k]_b = \sum_{j=1}^k n_j b^{k - j}$

Of course, this is just the base $b$ version of what we do in our standard base 10 system: we represent integers as strings of digits between 0 and 9 (except that the first digit cannot be 0). Here is a base 5 example:

$[324]_5 = 3 \cdot 5^2 + 2 \cdot 5^1 + 4 \cdot 5^0$
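The bracket notation is ordinary positional notation, which can be evaluated with Horner's rule. The helper `digits_to_int` below is a hypothetical name used only for illustration; Python's built-in `int(string, base)` performs the same conversion for bases up to 36.

```python
def digits_to_int(digits, b):
    """Value of [n_1 n_2 ... n_k]_b, the sum of n_j * b**(k - j)."""
    if not digits or not (1 <= digits[0] < b):
        raise ValueError("leading digit must lie in 1..b-1")
    if any(not (0 <= d < b) for d in digits[1:]):
        raise ValueError("remaining digits must lie in 0..b-1")
    value = 0
    for d in digits:
        # Horner's rule: shift one place to the left, then add the digit.
        value = value * b + d
    return value

print(digits_to_int([3, 2, 4], 5))  # 3*25 + 2*5 + 4 = 89
print(int("324", 5))                # built-in conversion gives the same value
```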

Show that

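$\mathbb{P}(N_1 = n_1, N_2 = n_2, \ldots, N_k = n_k) = \log_b\left(1 + \frac{1}{[n_1 n_2 \cdots n_k]_b}\right)$

Hint: Note that $\{N_1 = n_1, N_2 = n_2, \ldots, N_k = n_k\} = \left\{ Y \in \left[ \frac{[n_1 n_2 \cdots n_k]_b}{b^k}, \frac{[n_1 n_2 \cdots n_k]_b + 1}{b^k} \right) \right\}$. Now use the distribution function of $Y$ and properties of logarithms.

In the standard base 10 decimal case, explicitly compute the values of the joint probability density function of $(N_1, N_2)$.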
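The joint law of the leading digits is straightforward to tabulate. The sketch below builds the base 10 table for the first two digits; `joint_digit_prob` is a hypothetical helper name, and it works for any number of leading digits.

```python
import math
from itertools import product

def joint_digit_prob(digits, b=10):
    """P(N_1 = n_1, ..., N_k = n_k) = log_b(1 + 1 / [n_1 ... n_k]_b)."""
    m = 0
    for d in digits:
        m = m * b + d  # build the integer [n_1 ... n_k]_b
    return math.log(1 + 1 / m, b)

b = 10
joint = {
    (n1, n2): joint_digit_prob((n1, n2), b)
    for n1, n2 in product(range(1, b), range(b))
}
print(len(joint), "pairs, total probability", sum(joint.values()))
print("P(N_1 = 1, N_2 = 0) =", joint[(1, 0)])
```

The 90 probabilities telescope to $\log_{10}(100) - \log_{10}(10) = 1$, which the printed total confirms up to rounding.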

Of course, the probability density function of a given digit can be obtained by summing the joint probability density over the unwanted digits in the usual way. However, except for the first digit, these functions do not reduce to simple expressions.

Show that

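$\mathbb{P}(N_2 = n) = \sum_{k=1}^{b-1} \log_b\left(1 + \frac{1}{[k n]_b}\right) = \sum_{k=1}^{b-1} \log_b\left(1 + \frac{1}{k b + n}\right), \quad n \in \{0, 1, \ldots, b - 1\}$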

Consider the standard base 10 decimal case.

1. Explicitly compute the values of the probability density function of $N_2$, and sketch the graph.
2. Find $\mathbb{E}(N_2)$
3. Find $\text{var}(N_2)$
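The second digit distribution can be tabulated by summing the two-digit joint probabilities over the leading digit, as in the formula above. This is a minimal sketch with a hypothetical function name, `second_digit_pmf`.

```python
import math

def second_digit_pmf(b=10):
    """P(N_2 = n) = sum over k = 1..b-1 of log_b(1 + 1/(k*b + n))."""
    return {
        n: sum(math.log(1 + 1 / (k * b + n), b) for k in range(1, b))
        for n in range(b)
    }

pmf2 = second_digit_pmf(10)
mean2 = sum(n * p for n, p in pmf2.items())
var2 = sum((n - mean2) ** 2 * p for n, p in pmf2.items())

for n, p in pmf2.items():
    print(n, round(p, 4))
print("mean:", mean2, "variance:", var2)
```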

Comparing Exercise 6 and Exercise 10, note that the distribution of $N_2$ is flatter than the distribution of $N_1$. In general, it turns out that the distribution of $N_k$ converges to the uniform distribution on $\{0, 1, \ldots, b - 1\}$ as $k \to \infty$. Interestingly, the digits are dependent.

Use the results of Exercise 6, Exercise 8, and Exercise 10 to show that $N_1$ and $N_2$ are dependent in the standard base 10 decimal case.
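The flattening can be observed numerically by computing the marginal distribution of the $k$th digit for a few small values of $k$. The sketch below assumes base 10, uses the hypothetical helper `kth_digit_pmf`, and measures flatness by the largest deviation from $\frac{1}{b}$; it illustrates the trend and is not a proof of convergence.

```python
import math

def kth_digit_pmf(k, b=10):
    """Marginal distribution of N_k for k >= 2, obtained by summing the
    joint law of the first k digits over all leading blocks of k - 1 digits."""
    if k < 2:
        raise ValueError("this sketch covers k >= 2 only")
    pmf = [0.0] * b
    # m ranges over the integers [n_1 ... n_{k-1}]_b with n_1 >= 1.
    for m in range(b ** (k - 2), b ** (k - 1)):
        for n in range(b):
            pmf[n] += math.log(1 + 1 / (m * b + n), b)
    return pmf

gaps = {}
for k in (2, 3, 4):
    pmf = kth_digit_pmf(k)
    gaps[k] = max(abs(p - 1 / 10) for p in pmf)
    print(k, gaps[k])
```

The printed deviations shrink as $k$ increases, consistent with the convergence claim.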
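A single pair of digits is enough to exhibit the dependence: independence would require the joint probability to equal the product of the marginal probabilities for every pair. The sketch below checks the pair $(1, 0)$ in base 10; the choice of pair and the variable names are arbitrary.

```python
import math

# Marginal probability that the first digit is 1.
p1 = math.log10(1 + 1 / 1)
# Marginal probability that the second digit is 0 (sum over the first digit k).
p2 = sum(math.log10(1 + 1 / (10 * k)) for k in range(1, 10))
# Joint probability that the first two digits are 1 and 0.
p12 = math.log10(1 + 1 / 10)

print("joint:  ", p12)
print("product:", p1 * p2)
```

The two printed numbers differ, so $N_1$ and $N_2$ cannot be independent.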

In the standard base 10 decimal case, find each of the following.

1. $\mathbb{P}(N_1 = 5, N_2 = 3, N_3 = 1)$
2. $\mathbb{P}(N_1 = 3, N_2 = 1, N_3 = 5)$
3. $\mathbb{P}(N_1 = 1, N_2 = 3, N_3 = 5)$

#### Theoretical Explanation

Aside from the empirical evidence noted by Newcomb and Benford (and many others since), why does Benford's law work? For a theoretical explanation, see the article "A Statistical Derivation of the Significant-Digit Law," by Ted Hill.