Sunday, December 2, 2012


partially unrelated internet image

"9.5% of words make up 90.5% of Reuters news by word count." "The Pareto Principle",
Network Science, Jerome Kunegis, 2012-02-07

In a Poisson process, the waiting times between events decay exponentially, so long gaps are rare. Pareto distributions, or power laws, have a long tail: lots of activity packed into a short period of time, followed by long stretches of no activity. The title of Barabasi's Bursts refers to the latter, which he found when studying the patterns of human activity.
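
To make the contrast concrete, here is a minimal sketch that samples waiting times between events from an exponential distribution and from a Pareto distribution with the same theoretical mean, then counts how often a gap exceeds ten times that mean. The shape parameter `alpha = 1.5`, the sample size, and the seed are illustrative assumptions, not values taken from Bursts.

```python
import random


def tail_fraction(gaps, threshold):
    """Return the share of waiting times that exceed the threshold."""
    return sum(g > threshold for g in gaps) / len(gaps)


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 100_000              # number of simulated gaps (assumption)
alpha = 1.5              # Pareto shape, chosen for illustration only

# Mean of a Pareto distribution with minimum 1 and shape alpha > 1.
mean_gap = alpha / (alpha - 1)

# Exponential gaps (Poisson-like activity) with the same mean.
exp_gaps = [rng.expovariate(1 / mean_gap) for _ in range(n)]
# Pareto gaps (bursty activity); paretovariate returns values >= 1.
par_gaps = [rng.paretovariate(alpha) for _ in range(n)]

threshold = 10 * mean_gap
exp_tail = tail_fraction(exp_gaps, threshold)
par_tail = tail_fraction(par_gaps, threshold)

print(f"gaps longer than {threshold:.0f}: "
      f"exponential {exp_tail:.5f}, Pareto {par_tail:.5f}")
```

Under these assumptions the very long silences are far more common in the Pareto sample, which is the "long times of no activity" side of a bursty pattern.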

Consider also that the computer has magnified the temporal resolution at which human activity is recorded, forcing us to see ourselves this way.

A.L. Barabasi, Bursts, 2010
A Bit on Networks
The Big Heap, Time and Network Configuration

...if "the" is the most popular word in any random text, and it shows up a number of 1,000 times, then "of" (the second-most popular word) would show up 500 times...
Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.

For example, in the Brown University Standard Corpus of Present-Day American English, the word "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's Law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852). Only 135 vocabulary items are needed to account for half the Brown Corpus.
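
The Brown Corpus counts quoted above can be checked against the simplest form of Zipf's law, in which the word at rank r is expected to occur f(1)/r times. The sketch below is only a rough check: it assumes an exponent of exactly 1, and the helper name `zipf_prediction` is made up for illustration.

```python
# Observed counts quoted from the Brown Corpus, in rank order.
observed = [("the", 69971), ("of", 36411), ("and", 28852)]


def zipf_prediction(top_count, rank):
    """Idealized Zipf count at a given rank, assuming an exponent of 1."""
    return top_count / rank


top_count = observed[0][1]
ratios = []
for rank, (word, count) in enumerate(observed, start=1):
    predicted = zipf_prediction(top_count, rank)
    ratio = count / predicted  # 1.0 would be a perfect fit
    ratios.append(ratio)
    print(f"rank {rank} {word!r}: observed {count}, "
          f"predicted {predicted:.0f}, ratio {ratio:.2f}")
```

The second-place word lands within a few percent of the prediction, while "and" occurs noticeably more often than the idealized law suggests, so the law is best read as an approximation.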

The same relationship occurs in many other rankings unrelated to language, such as the population ranks of cities in various countries, corporation sizes, income rankings, and so on. (Wikipedia, "Zipf's law")

*Zipf's law is referenced in science fiction author Robert J. Sawyer's novel WWW: Wake, when the main character is searching for intelligent life on the web.
