(OT) Benford's Law (for my math friends)

greenspun.com : LUSENET : TimeBomb 2000 (Y2000) : One Thread

Sorry for the length. Came in my mail; no URL~~~ Hallyx

New Scientist (10th July 1999 page 27) THE POWER OF ONE

Everyday numbers obey a law so unexpected it is hard to believe it's true. Armed with this knowledge, says Robert Matthews, it's easy to catch those who have been faking research results or cooking the books.

Alex had no idea what dark little secret he was about to uncover when he asked his brother-in-law to help him out with his term project. As an accountancy student at Saint Mary's University in Halifax, Nova Scotia, Alex needed some real-life commercial figures to work on, and his brother-in-law's hardware store seemed the obvious place to get them.

Trawling through the year's sales figures, Alex could find nothing obviously strange about them. Still, he did what he was supposed to do for his project, and performed a bizarre little ritual requested by his accountancy professor, Mark Nigrini. He went through the sales figures and made a note of how many started with the digit 1. It came out at 93 per cent. He handed it in and thought no more about it.

Later, when Nigrini was marking the coursework, he took one look at that figure and realised that an embarrassing situation was looming. His suspicions hardened as he looked through the rest of Alex's analysis of his brother-in-law's accounts. None of the sales figures began with the digits 2 through to 7, and there were just 4 beginning with the digit 8, and 21 with 9. After a few more checks, Nigrini was in no doubt: Alex's brother-in-law was a fraudster, systematically cooking the books to avoid the attentions of bank managers and tax inspectors.

It was a nice try. At first glance, the sales figures showed nothing very suspicious, with none of the sudden leaps or dives that often attract the attentions of the authorities. But that was just it: they were too regular. And this is why they fell foul of that ritual he had asked Alex to perform.

Because what Nigrini knew - and Alex's brother-in-law clearly didn't - was that the digits making up the shop's sales figures should have followed a mathematical rule discovered accidentally over 100 years ago. Known as Benford's Law, it is a rule obeyed by a stunning variety of phenomena, from stock market prices to census data to the heat capacities of chemicals. Even a ragbag of figures extracted from newspapers will obey the law's demands that around 30 per cent of the numbers will start with a 1, 18 per cent with a 2, right down to just 4.6 per cent starting with a 9.

It is a law so unexpected that at first many people simply refuse to believe it can be true. Indeed, only in the past few years has a really solid mathematical explanation of its existence emerged. But after years of being regarded as a mathematical curiosity, Benford's Law is now being eyed by everyone from tax inspectors to computer designers - all of whom think it could help them solve some tricky problems with astonishing ease. In two weeks' time, the US Institute of Internal Auditors will begin holding training courses on how to apply Benford's Law in fraud investigations, hailing it as the biggest advance in the field for years.

The story behind the law's discovery is every bit as weird as the law itself. In 1881, the American astronomer Simon Newcomb penned a note to the American Journal of Mathematics about a strange quirk he'd noticed about books of logarithms, then widely used by scientists performing calculations. The first pages of such books seemed to get grubby much faster than the last ones.

The obvious explanation was perplexing. For some reason, people did more calculations involving numbers starting with 1 than 8 and 9. Newcomb came up with a little formula that matched the pattern of use pretty well: nature seems to have a penchant for arranging numbers so that the proportion beginning with the digit D is equal to log10 of 1 + (1/D) (see "Here, there and everywhere" below).

With no very convincing argument for why the formula should work, Newcomb's paper failed to arouse any interest, and the Grubby Pages Effect was forgotten for over half a century. But in 1938, a physicist with the General Electric Company in the US, Frank Benford, rediscovered the effect and came up with the same law as Newcomb. But Benford went much further. Using more than 20 000 numbers culled from everything from listings of the drainage areas of rivers to numbers appearing in old magazine articles, Benford showed that they all followed the same basic law: around 30 per cent began with the digit 1, 18 per cent with 2 and so on.

Like Newcomb, Benford did not have any really good explanation for the existence of the law. Even so, the sheer wealth of evidence he provided to demonstrate its reality and ubiquity has led to his name being linked with the law ever since.

It was nearly a quarter of a century before anyone came up with a plausible answer to the central question: why on earth should the law apply to so many different sources of numbers? The first big step came in 1961 with some neat lateral thinking by Roger Pinkham, a mathematician then at Rutgers University in New Brunswick, New Jersey. Just suppose, said Pinkham, there really is a universal law governing the digits of numbers that describe natural phenomena such as the drainage areas of rivers and the properties of chemicals. Then any such law must work regardless of what units are used. Even the inhabitants of the Planet Zob, who measure area in grondekis, must find exactly the same distribution of digits in drainage areas as we do, using hectares. But how is this possible, if there are 87.331 hectares to the grondeki?

The answer, said Pinkham, lies in ensuring that the distribution of digits is unaffected by changes of units. Suppose you know the drainage area in hectares for a million different rivers. Translating each of these values into grondekis will change the individual numbers, certainly. But overall, the distribution of numbers would still have the same pattern as before. This is a property known as "scale invariance".

Pinkham showed mathematically that Benford's Law is indeed scale-invariant. Crucially, however, he also showed that Benford's Law is the only way to distribute digits that has this property. In other words, any "law" of digit frequency with pretensions of universality has no choice but to be Benford's Law.
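Pinkham's scale-invariance argument can be checked numerically. The following is only a sketch: the "drainage areas" are a synthetic sample spread evenly on a logarithmic scale (an assumption standing in for real river data), and the helper names are made up for illustration. Only the 87.331 conversion factor comes from the article.

```python
from collections import Counter

def leading_digit(x: float) -> int:
    """First significant digit of a positive number, read off scientific notation."""
    return int(f"{x:.15e}"[0])

def digit_shares(values):
    """Fraction of values starting with each digit 1-9."""
    counts = Counter(leading_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}

# Assumption: synthetic stand-in for drainage areas, spread evenly on a
# logarithmic scale over six orders of magnitude (1 to 1,000,000 hectares).
N = 60_000
hectares = [10 ** (6 * i / N) for i in range(N)]

# The article's conversion factor: 87.331 hectares to the grondeki.
HECTARES_PER_GRONDEKI = 87.331
grondekis = [h / HECTARES_PER_GRONDEKI for h in hectares]

in_hectares = digit_shares(hectares)
in_grondekis = digit_shares(grondekis)
for d in range(1, 10):
    print(f"{d}: {in_hectares[d]:.3f} (hectares)  {in_grondekis[d]:.3f} (grondekis)")
```

Every individual value changes under the conversion, but the two columns of leading-digit shares should agree to within the sample's granularity.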

Pinkham's work gave a major boost to the credibility of the law, and prompted others to start taking it seriously and thinking up possible applications. But a key question remained: just what kinds of numbers could be expected to follow Benford's Law? Two rules of thumb quickly emerged. For a start, the sample of numbers should be big enough to give the predicted proportions a chance to assert themselves. Second, the numbers should be free of artificial limits, and allowed to take pretty much any value they please. It is clearly pointless expecting, say, the prices of 10 different types of beer to conform to Benford's law. Not only is the sample too small, but - more importantly - the prices are forced to stay within a fixed, narrow range by market forces.

Random numbers:

On the other hand, truly random numbers won't conform to Benford's Law either: the proportions of leading digits in such numbers are, by definition, equal. Benford's Law applies to numbers occupying the "middle ground" between the rigidly constrained and the utterly unfettered.

Precisely what this means remained a mystery until just three years ago, when mathematician Theodore Hill of Georgia Institute of Technology in Atlanta uncovered what appears to be the true origin of Benford's Law. It comes, he realised, from the various ways that different kinds of measurements tend to spread themselves. Ultimately, everything we can measure in the Universe is the outcome of some process or other: the random jolts of atoms, say, or the exigencies of genetics. Mathematicians have long known that the spread of values for each of these follows some basic mathematical rule. The heights of bank managers, say, follow the bell-shaped Gaussian curve, daily temperatures rise and fall in a wave-like pattern, while the strength and frequency of earthquakes are linked by a logarithmic law.

Now imagine grabbing random handfuls of data from a hotchpotch of such distributions. Hill proved that as you grab ever more of such numbers, the digits of these numbers will conform ever closer to a single, very specific law. This law is a kind of ultimate distribution, the "Distribution of Distributions". And he showed that its mathematical form is...Benford's Law.

Hill's theorem, published in 1996, seems finally to explain the astonishing ubiquity of Benford's law. For while numbers describing some phenomena are under the control of a single distribution such as the bell curve, many more - describing everything from census data to stock market prices - are dictated by a random mix of all kinds of distributions. If Hill's theorem is correct, this means that the digits of these data should follow Benford's Law. And, as Benford's own monumental study and many others have shown, they really do.
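A toy simulation gives a feel for the "random handfuls from a hotchpotch of distributions" idea. It is an illustration, not a proof, and not Hill's construction: the three distribution families, the log-uniform spread of their scales, the seed and the sample size are all assumptions chosen for the sketch.

```python
import math
import random
from collections import Counter

def leading_digit(x: float) -> int:
    """First significant digit of a positive number."""
    return int(f"{x:.15e}"[0])

def draw_from_random_distribution(rng: random.Random) -> float:
    """Pick a distribution at random, then draw one value from it."""
    # Assumption: each distribution's typical size is itself random,
    # spread over six orders of magnitude.
    scale = 10 ** rng.uniform(0, 6)
    kind = rng.choice(["uniform", "exponential", "gaussian"])
    if kind == "uniform":
        return scale * rng.random()
    if kind == "exponential":
        return scale * rng.expovariate(1.0)
    return abs(rng.gauss(scale, scale / 4))

rng = random.Random(1999)  # fixed seed so the run is repeatable
sample = [draw_from_random_distribution(rng) for _ in range(20_000)]
sample = [x for x in sample if x > 0]  # a zero has no leading digit

counts = Counter(leading_digit(x) for x in sample)
shares = {d: counts[d] / len(sample) for d in range(1, 10)}
for d in range(1, 10):
    print(f"{d}: pooled {shares[d]:.3f}   Benford {math.log10(1 + 1 / d):.3f}")
```

None of the three families follows Benford's Law on its own at a fixed scale; it is the pooling across randomly chosen distributions that pulls the leading digits toward the logarithmic pattern.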

Mark Nigrini, Alex's former project supervisor and now a professor of accountancy at the Southern Methodist University, Dallas, sees Hill's theorem as a crucial breakthrough: "It. . . helps explain why the significant-digit phenomenon appears in so many contexts."

It has also helped Nigrini to convince others that Benford's Law is much more than just a bit of mathematical frivolity. Over the past few years, Nigrini has become the driving force behind a far from frivolous use of the law: fraud detection.

In a ground-breaking doctoral thesis published in 1992, Nigrini showed that many key features of accounts, from sales figures to expenses claims, follow Benford's Law - and that deviations from the law can be quickly detected using standard statistical tests. Nigrini calls the fraud-busting technique "digital analysis", and its successes are starting to attract interest in the corporate world and beyond.
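The article does not say which "standard statistical tests" Nigrini uses, so the sketch below assumes one common choice, a chi-square goodness-of-fit test on first digits. The counts are also an inference rather than figures from the article: if the 4 figures starting with 8 and the 21 starting with 9 are the remaining 7 per cent of Alex's data, the set would hold roughly 357 figures, about 332 of them starting with 1.

```python
import math

# Critical value of the chi-square distribution with 8 degrees of freedom
# at the 5 per cent significance level.
CHI2_CRITICAL_DF8_5PCT = 15.507

def benford_chi_square(observed: dict) -> float:
    """Chi-square statistic of first-digit counts against Benford's Law."""
    total = sum(observed.values())
    stat = 0.0
    for d in range(1, 10):
        expected = total * math.log10(1 + 1 / d)
        stat += (observed.get(d, 0) - expected) ** 2 / expected
    return stat

# Assumption: counts reconstructed from the percentages in the story.
hardware_store = {1: 332, 8: 4, 9: 21}
stat = benford_chi_square(hardware_store)
print(f"chi-square = {stat:.1f}, suspicious = {stat > CHI2_CRITICAL_DF8_5PCT}")

# For comparison: counts that track Benford's Law closely give a tiny statistic.
well_behaved = {d: round(1000 * math.log10(1 + 1 / d)) for d in range(1, 10)}
print(f"chi-square = {benford_chi_square(well_behaved):.3f}")
```

A statistic above the critical value only says the digits are unlikely to be Benford-distributed. As the article notes later, innocent causes such as rounding can produce the same signal.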

Some of the earliest cases - including the sharp practices of Alex's store-keeping brother-in-law - emerged from student projects set up by Nigrini. But soon he was using digital analysis to unmask much bigger frauds. One recent case involved an American leisure and travel company with a nationwide chain of motels. Using digital analysis, the company's audit director discovered something odd about the claims being made by the supervisor of the company's healthcare department. "The first two digits of the healthcare payments were checked for conformity to Benford's Law, and this revealed a spike in numbers beginning with the digits '65'," says Nigrini. "An audit showed 13 fraudulent cheques for between $6500 and $6599...related to fraudulent heart surgery claims processed by the supervisor, with the cheque ending up in her hands."

Benford's Law had caught the supervisor out, despite her best efforts to make the claims look plausible. "She carefully chose to make claims for employees at motels with a higher than normal number of older employees," says Nigrini. "The analysis also uncovered other fraudulent claims worth around $1 million in total."
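A first-two-digits check of the kind described here can be sketched as follows. Everything in the data is hypothetical: the "honest" payments are a synthetic Benford-like sample, and the 13 padded cheques are made-up amounts between $6500 and $6599. The scoring rule, a simple standardised excess over the expected count, is likewise an assumption and not necessarily what the audit director used.

```python
import math
from collections import Counter

def first_two_digits(x: float) -> int:
    """First two significant digits, e.g. 6543.21 -> 65."""
    s = f"{x:.15e}"  # e.g. '6.543210000000000e+03'
    return int(s[0] + s[2])

def benford_two_digit(n: int) -> float:
    """Expected share of numbers whose first two digits are n (10-99)."""
    return math.log10(1 + 1 / n)

def most_overrepresented_prefix(amounts) -> int:
    """Two-digit prefix with the largest standardised excess over Benford."""
    counts = Counter(first_two_digits(a) for a in amounts)
    total = len(amounts)

    def excess(n: int) -> float:
        expected = total * benford_two_digit(n)
        return (counts[n] - expected) / math.sqrt(expected)

    return max(range(10, 100), key=excess)

# Hypothetical honest payments: $100 to $100,000, evenly spread on a log scale.
N = 2000
honest = [10 ** (2 + 3 * i / N) for i in range(N)]
# Hypothetical fraudulent cheques, all between $6500 and $6599.
fraudulent = [6500 + 7 * k for k in range(13)]

flagged = most_overrepresented_prefix(honest + fraudulent)
print(flagged)
```

In this synthetic data set the 13 extra cheques roughly double the number of payments starting with "65", which is enough to make that prefix stand out from the other 89.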

Not surprisingly, big businesses and central governments are now also starting to take Benford's law seriously. "Digital analysis is being used by listed companies, large private companies, professional firms and government agencies in the US and Europe - and by one of the world's biggest audit firms," says Nigrini.

Warning signs:

The technique is also attracting interest from those hunting for other kinds of fraud. At the International Institute for Drug Development in Brussels, Mark Buyse and his colleagues believe Benford's Law could reveal suspicious data in clinical trials, while a number of university researchers have contacted Nigrini to find out if digital analysis could help reveal fraud in laboratory notebooks.

Inevitably, the increasing use of digital analysis will lead to greater awareness of its power by fraudsters. But according to Nigrini, that knowledge won't do them much good - apart from warning them off. "The problem for fraudsters is that they have no idea what the whole picture looks like until all the data are in," says Nigrini. "Frauds usually involve just a part of a data set, but the fraudsters don't know how that set will be analysed: by quarter, say, or department, or by region. Ensuring the fraud always complies with Benford's Law is going to be tough - and most fraudsters aren't rocket scientists."

In any case, says Nigrini, there is more to Benford's Law than tracking down fraudsters. Take the data explosion that threatens to overwhelm computer data storage technology. Mathematician Peter Schatte at the Bergakademie Technical University, Freiberg, has come up with rules that optimise computer data storage, by allocating disk space according to the proportions dictated by Benford's law.

Ted Hill at Georgia Tech thinks that the ubiquity of Benford's law could also prove useful to those such as Treasury forecasters and demographers who need a simple "reality check" for their mathematical models. "Nigrini showed recently that the populations of the 3000-plus counties in the US are very close to Benford's law," says Hill. "That suggests it could be a test for models which predict future populations - if the figures predicted are not close to Benford, then rethink the model."

Both Nigrini and Hill stress that Benford's Law is not a panacea for fraud-busters or the world's data-crunching ills. Deviations from the law's predictions can be caused by nothing more nefarious than people rounding numbers up or down, for example. And both accept that there is plenty of scope for making a hash of applying it to real-life situations: "Every mathematical theorem or statistical test can be misused - that does not worry me," says Hill.

But they share a sense that there are some really clever uses of Benford's law still waiting to be dreamt up. Says Hill: "For me the law is a prime example of a mathematical idea which is a surprise to everyone - even the experts."

Robert Matthews is Science Correspondent for The Sunday Telegraph.

Further reading: Digital Analysis Tests and Statistics, written and published by Mark Nigrini, is available from mark.nigrini@msn.com

Alex is not the real name of Nigrini's former student.

******

Here, there and everywhere:

Nature's preference for certain numbers and sequences has long fascinated mathematicians. The so-called Golden Mean - roughly equal to 1.62 and supposedly giving the most aesthetically pleasing dimensions for rectangles - has been found lurking in all kinds of places, from seashells to knots, while the Fibonacci sequence - 1, 1, 2, 3, 5, 8 and so on, every figure being the sum of its two predecessors - crops up everywhere in nature, from the arrangement of leaves on plants to the pattern on pineapple skins.

Benford's Law appears to be another fundamental feature of the mathematical universe, with the proportion of numbers starting with the digit D given by log10 of 1+(1/D). In other words, around 100 x log10(2) (30 per cent) of such numbers will begin with "1"; 100 x log10(1.5) (17.6 per cent) with "2"; down to 100 x log10(1.11) (4.6 per cent) with "9".
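The formula is easy to tabulate. A minimal sketch (the function name is just illustrative):

```python
import math

def benford_first_digit(d: int) -> float:
    """Expected share of numbers whose leading digit is d (1-9)."""
    return math.log10(1 + 1 / d)

for d in range(1, 10):
    print(f"{d}: {100 * benford_first_digit(d):.1f} per cent")
```

The nine shares add up to exactly 1, because the product of the terms (1 + 1/D) for D = 1 to 9 telescopes to 10.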

But the mathematics of Benford's Law goes further, predicting the proportion of digits in the rest of the numbers as well. For example, the law predicts that "0" is the most likely second digit - accounting for around 12 per cent of all second digits - while 9 is the least likely, at 8.5 per cent.
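The second-digit figures follow from the same logarithmic rule applied to two-digit prefixes: the chance of a given second digit is the sum, over every possible first digit, of the chance of that two-digit start. A short sketch, with an illustrative function name:

```python
import math

def benford_second_digit(d: int) -> float:
    """Expected share of numbers whose second significant digit is d (0-9)."""
    # Sum over first digits k = 1..9 of the share starting with the prefix "kd".
    return sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))

for d in range(10):
    print(f"{d}: {100 * benford_second_digit(d):.1f} per cent")
```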

Benford's law thus suggests that the most common non-random numbers are those starting with "10...", which should be almost 10 times more abundant than the least likely, which will be those starting "99..."

As one might expect, Benford's law predicts that the relative proportions of 1, 2, 3 and so on making up later digits of numbers become progressively more even, tending towards precisely 10 per cent for the least significant digit of every large number.

In a nice little twist, it turns out that the Fibonacci sequence, the Golden Mean and Benford's law are all linked. The ratio of successive terms in a Fibonacci sequence tends toward the golden mean, while the digits of all the numbers making up the Fibonacci sequence tend to conform to Benford's law.
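Both claims can be checked in a few lines. This is a sketch only; the choice of the first 1000 Fibonacci numbers, and of the 40th and 41st terms for the ratio, is arbitrary.

```python
import math
from collections import Counter

def fibonacci(count: int) -> list:
    """First `count` Fibonacci numbers, starting 1, 1, 2, 3, 5, ..."""
    a, b = 1, 1
    out = []
    for _ in range(count):
        out.append(a)
        a, b = b, a + b
    return out

fibs = fibonacci(1000)

# Leading digits of the first 1000 terms versus Benford's Law.
counts = Counter(int(str(f)[0]) for f in fibs)
shares = {d: counts[d] / len(fibs) for d in range(1, 10)}
for d in range(1, 10):
    print(f"{d}: Fibonacci {shares[d]:.3f}   Benford {math.log10(1 + 1 / d):.3f}")

# Ratio of successive terms versus the Golden Mean, (1 + sqrt(5)) / 2.
ratio = fibs[40] / fibs[39]
golden_mean = (1 + math.sqrt(5)) / 2
print(ratio, golden_mean)
```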

end

-- (Hallyx@aol.com), July 20, 1999

Answers

Hmm....

Does this mean that Y2K will most likely be a "1" on a scale of 1 to 10??

Or that all hell will break loose on the date 1/1 ??

Or that it has nothing at all to do with Y2K and simply goes to prove once again that math is cool?

-- ace (x@y.z), July 20, 1999.


Wow, and too cool!

Funny though, the article was getting good till they got into fraud and banks.

This also looks like a parallel with "chaotic" strange attractors. It has been shown that they exist within all natural phenomena and even the stock market and populations.

I would even go out on a limb and suggest a multidimensional feel to the "levels". But at the moment that is a weak limb. Try thinking about General Systems.

-- Brian (imager@home.com), July 20, 1999.


It would be interesting to see if Benford's Law could be applied to Y2k numbers reported by various organizations. The number of critical systems, for example, or the total lines of code needing remediation. Maybe the number of embedded systems for each organization, or the number of embeddeds needing replacement.

-- Dean -- from (almost) Duh Moines (dtmiller@nevia.net), July 20, 1999.

Dean: Good point.
Ace: Well, "10" also starts with a "1". Then there's also "11", "12" ...
All: Look at a logarithmic scale. The distance between "1" and "2" (all numbers 1.xxx) is the same as from "5" to "10" (ratio 1:2)
So you've got all the numbers from 5 to 9.999... occupying the same amount of "space" as just the "1" numbers. "Kuhl", eh. This would be way off topic except for possible use as Dean suggests. I.e., no one is reporting only 12% compliance. It's always 70%, 80%, 90%. Hmmmm.

-- A (A@AisA.com), July 20, 1999.


Benford's Law cannot legitimately be applied to reported compliance percentages. Such percentages do not satisfy the condition "the numbers should be free of artificial limits, and allowed to take pretty much any value they please." Percentages are artificially constrained to the range 0-100. Furthermore, they would tend to rise toward the 100 limit over time.

Number of critical systems? Those might qualify _if_ such numbers were exact counts, not estimates, and if there were precise definitions of "critical system". But I think they don't, because they aren't, and there aren't. Ditto for lines of code needing remediation.

-- No Spam Please (nos_pam_please@hotmail.com), July 20, 1999.


That's really neat! Trouble is, while I understand the "Law", as expressed -- "the proportion of numbers starting with the digit D [is] given by log10 of 1+(1/D)." -- I don't understand the explanation for it. This isn't going to keep me awake at night. It's just another "hmmmm!"

It's stated (and presumably proved) that Benford's Law does not apply to random numbers. In that connection it would be interesting to study the "universe" of winning lottery numbers -- including not just state lotteries, but multistate, and the many foreign lotteries. The numerical formats of foreign lotteries are substantively different from those in the U.S., and also vary considerably among themselves. So the distribution would seem to be sufficiently unconstrained.

Winning lottery numbers are supposed to be random. If Benford's Law were shown to hold for this entire set, then its randomness would be refuted. Which might cause some questions to be asked.

-- Tom Carey (tomcarey@mindspring.com), July 21, 1999.


Turns out this matter is on topic after all.

Google turned this up: Following Benford's Law, or Looking Out for No. 1

It's a little more comprehensive than the New Scientist piece.

For instance:

Dr. Nigrini said he believes that conformity with Benford's Law will make it possible to validate procedures developed to fix the Year 2000 problem -- the expectation that many computer systems will go awry because of their inability to distinguish the year 2000 from the year 1900. A variant of his Benford's Law software already in use, he said, could spot any significant change in a company's accounting figures between 1999 and 2000, thereby detecting a computer problem that might otherwise go unnoticed.


-- Tom Carey (tomcarey@mindspring.com), July 21, 1999.

Another nice hit by Google. Benford's Law

This is a fairly technical discussion. It includes a graphic representation of the probabilities of leading digits for all bases from 2 to 10.

-- Tom Carey (tomcarey@mindspring.com), July 21, 1999.


Folks

Benford's law is also the basis of a splendid "Sucker Bet".

The bet goes like this, first find yourself a handy book with lots of tables of real world data in it, a gazetteer or something like that.

You, the wise guy, say to the sucker...

"I'm feeling lucky today! Let's choose a table at random, and, for every number therein which starts with 1, 2, 3 or 4 you pay me a dollar. For every number which starts with a 5,6,7,8 or 9 I'll pay you a dollar! It's a great bet, you've got 5 numbers which will win for you, I've only got 4 that win for me, how can you lose!"

You will win with great consistency, of course! You know this if you've understood Benford's Law.
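For anyone who wants the odds spelled out, here is a quick sketch. It assumes the numbers in the chosen table follow Benford's Law closely, which a small or constrained table may not.

```python
import math

# Chance that a Benford-distributed number starts with 1, 2, 3 or 4.
p_win = sum(math.log10(1 + 1 / d) for d in range(1, 5))  # equals log10(5)
p_lose = 1 - p_win

# Average profit per number in the table, at a dollar a number.
expected_gain = p_win - p_lose

print(f"wise guy wins {p_win:.1%} of the numbers")
print(f"expected gain per number: ${expected_gain:.2f}")
```

So the four "unlucky" digits win about 70 per cent of the time, worth roughly 40 cents per number to the wise guy.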

What about the well known "Birthday Proposition".

At a gathering of people you say "I bet you X dollars there are at least 2 people here who share the same birthday, (day and month only, not birth DATE)"

It's obvious that if there are over 365 people it's a certainty, if there are only 2 it's extremely unlikely, but the question is, just how many people do you need to have in the gathering before it becomes a fair (fifty-fifty) bet? That's the magic number you need to know when making the bet!!

I'll reveal the answer on this thread tomorrow, if nobody else has.

You never know, this stuff might help you fund your preps!

RonD

-- Ron Davis (rdavis@ozemail.com.au), July 21, 1999.



I don't recall the exact number of people, but I believe it is around 15-20.

Fraud in reporting percent completion? Look at how many are nice "even" integer units.....75%, 90%, 80%, 95% done. Never 88%, 93.5%, or 96.2%. Percent completion is a short hand way for saying: "We're working on it, I hope/wish/think/been told/might be xxxxx percent through, but I KNOW we are not 100% through and the bigger the number the better I sound."

-- Robert A Cook, PE (Kennesaw, GA) (cook.r@csaatl.com), July 21, 1999.


Robert, Well, you're not far off the mark, the bet is slightly unfavorable to you at 22 people, slightly favorable at 23 people.

If there are 50 people the probability that you will win is 0.97, in other words (for the mathematically disadvantaged) you would win 97 times out of a hundred. Quite surprising at first glance!
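These figures are easy to reproduce. The sketch below assumes 365 equally likely birthdays and ignores February 29; the function name is illustrative.

```python
def shared_birthday_probability(people: int, days: int = 365) -> float:
    """Chance that at least two of `people` share a birthday."""
    if people > days:
        return 1.0
    p_all_different = 1.0
    for i in range(people):
        # Person i must avoid the i birthdays already taken.
        p_all_different *= (days - i) / days
    return 1 - p_all_different

for n in (22, 23, 50):
    print(n, round(shared_birthday_probability(n), 4))
```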

RonD

-- Ron Davis (rdavis@ozemail.com.au), July 21, 1999.

