[personal profile] redfiona99
Introduction:

Some years ago, I read the book “How Long Is a Piece of String?: More Hidden Mathematics of Everyday Life” by Rob Eastaway (as reviewed here), and one chapter fascinated me. It was chapter 12, “Is it a fake?”, and the section that particularly caught my interest was about Benford’s Law. To oversimplify: in naturally occurring numbers, the leading digits follow a distinct pattern rather than being evenly distributed, with smaller leading digits turning up more often than larger ones.

The expected percentage of numbers starting with each digit can be seen in the table below:

[Image: table of the expected percentage for each leading digit, 1 to 9]
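
For anyone who wants to reproduce the expected percentages, Benford's Law is usually stated as P(d) = log10(1 + 1/d) for a leading digit d. The following is a minimal Python sketch of that formula; the function name is my own invention and nothing here comes from the book.

```python
import math


def benford_expected_percentages():
    """Return {digit: expected percentage} for leading digits 1 to 9.

    Uses the standard statement of Benford's Law: P(d) = log10(1 + 1/d).
    """
    return {d: 100 * math.log10(1 + 1 / d) for d in range(1, 10)}


# Print the expected share of numbers starting with each digit.
for digit, pct in benford_expected_percentages().items():
    print(f"{digit}: {pct:.1f}%")
```

The percentages fall steadily from roughly 30% for a leading 1 to under 5% for a leading 9, and together they add up to 100%.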

If you have a large naturally occurring data set that doesn’t conform to this, it tells you that either there are constraints on it so that the data doesn’t cover all of the possibilities (e.g. human heights in metres will start with a 1 or a 2, as no one has ever been 4 m tall) or something else is going on.
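
To make "conforming" a bit more concrete, here is a rough Python sketch that tallies leading digits and prints them next to the expected percentages. The two data sets are stand-ins I picked for illustration, not real measurements: the first 1000 powers of 2 (a textbook example of numbers that follow Benford's Law) and an artificial set of "heights" running from 1.00 to 2.99 to mimic a constrained range. The function names are hypothetical.

```python
import math
from collections import Counter

# Expected percentages from the usual formula P(d) = log10(1 + 1/d).
EXPECTED = {d: 100 * math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(value):
    """Return the first non-zero digit of a number, or None for zero."""
    for ch in str(abs(value)):
        if ch in "123456789":
            return int(ch)
    return None


def observed_percentages(values):
    """Return {digit: observed percentage} for leading digits 1 to 9."""
    counts = Counter(
        d for d in (leading_digit(v) for v in values) if d is not None
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no non-zero numbers to tally")
    return {d: 100 * counts[d] / total for d in range(1, 10)}


# Illustrative data only: powers of 2, and made-up constrained "heights".
powers_of_two = [2 ** n for n in range(1000)]
constrained = [h / 100 for h in range(100, 300)]

for name, data in [("powers of 2", powers_of_two), ("heights", constrained)]:
    observed = observed_percentages(data)
    print(name)
    for d in range(1, 10):
        print(f"  {d}: observed {observed[d]:5.1f}%  expected {EXPECTED[d]:5.1f}%")
```

The powers of 2 should land close to the expected column, while the constrained set puts everything on 1 and 2. This only eyeballs the difference; it is not a formal statistical test.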

Testing this theory:

I wanted to test this out on *something*. Problem was, what? Most sports data is possibility-limited, e.g. in football fewer goals will be scored in the 9th or 9xth minutes than in the 8th and 8xth minutes, not because of the minute itself, but because the game stops at the 90th minute. Other data isn’t big enough. I needed a source of numbers that was large and unlimited.

Eventually, possibly in a fit of cynicism, I decided to try the leading digits of numbers reported in the news. Advantages to this plan - I can use a single, traceable data source - one article a day from the BBC news website. The BBC doesn’t tend to delete pages so if someone wanted to double check my numbers, I could give them the links.

Disadvantages to this plan - when I first attempted it, Article 50 was in the news, and skewing my results.

Having looked at the results and realised this, along with a few methodological errors, and going a bit stir-crazy because of lockdown 3, I decided to try it again.

Attempt Number 2:

These were the rules I developed to try to avoid that and similar pitfalls:
1 - no numbers in names e.g. 19 in COVID-19 does not count as a leading digit
2 - no numbers from dates (I had done this originally, but worth restating)
3 - only numbers written as digits. This threw up an unexpected problem - the BBC has somewhat inconsistent editorial control over whether numbers under 10 are written as words or digits, and this may skew results. I’ve saved the links to the articles I’ve used to put the project together so I can go through them again if I want to (or if someone else wants to look at them).
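
I did the counting by hand, but the three rules could be roughly approximated in code. The Python sketch below is only an illustration under some loud assumptions: the regular expressions are my own crude heuristics, dates are only recognised when an English month name is present (a bare year like "in 2021" would still be counted), and a name such as "Article 50" cannot be told apart from an ordinary number. The sample sentence is made up.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Rule 2 (heuristic): dates like "1 February 2021", "1st of February"
# or "February 2021". Bare years are NOT caught by this pattern.
DATE_PATTERN = re.compile(
    rf"\b(?:\d{{1,2}}(?:st|nd|rd|th)?\s+(?:of\s+)?)?(?:{MONTHS})(?:\s+\d{{4}})?\b"
)

# Rules 1 and 3 (heuristic): numbers written as digits, skipping any that
# are glued to a letter or joined to one by a hyphen, as in "COVID-19".
NUMBER_PATTERN = re.compile(
    r"(?<![\w.])(?<![A-Za-z]-)\d+(?:,\d{3})*(?:\.\d+)?"
)


def leading_digits(text):
    """Return the leading (first non-zero) digit of each counted number."""
    text = DATE_PATTERN.sub(" ", text)  # drop dates before counting
    digits = []
    for match in NUMBER_PATTERN.finditer(text):
        first = next((ch for ch in match.group() if ch in "123456789"), None)
        if first is not None:  # skip a bare "0"
            digits.append(int(first))
    return digits


# Made-up sentence, not taken from any real article.
sample = (
    "On 1 February 2021, COVID-19 cases rose by 3,200 to 45,000, while "
    "two hospitals received £1.5m and 0.7% of staff were absent."
)
print(leading_digits(sample))
```

Here the date and the 19 in COVID-19 are skipped, "two" is ignored because it is written as a word, and 0.7 counts as a leading 7 since the first non-zero digit is what matters.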

I started on the 1st of February 2021, and will carry on till the 1st of February 2022 (barring disaster). The other advantage of this system is that if I miss any days, I can make up for them with extra days at the end. I will give monthly updates and running totals, plus some commentary if I have any.