12. Zipf's law

The last task concerns an interesting empirical observation known as Zipf's law. Wikipedia states: “Zipf's law was originally formulated in terms of quantitative linguistics, stating that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.: the rank-frequency distribution is an inverse relation.”

If this rule holds for a text, then the graph of the rank-frequency function is a straight line with slope -1 in a coordinate system where both axes are logarithmic. The task is to check this rule graphically for a natural-language text, that is, to plot the frequencies in a log-log coordinate system. For testing, some text samples can be found at http://sandbox.hlt.bme.hu/~gaebor/zipf/. The tokeletes.txt file among the samples follows the rule exactly (the Hungarian word "tökéletes" means "perfect"). The Hungarian sample files are smaller, the English ones are larger.
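The straight-line claim can be checked numerically without any plotting. The following is a minimal sketch, not part of the exercise skeleton: the function names `zipf_frequencies` and `loglog_slopes` and the sample numbers are assumptions chosen for illustration. It builds ideal frequencies of the form C/r and computes the slope between neighbouring points after taking logarithms of both the rank and the frequency.

```python
import math

def zipf_frequencies(top_frequency, n_ranks):
    """Ideal Zipf frequencies: the word of rank r occurs top_frequency / r times."""
    return [top_frequency / r for r in range(1, n_ranks + 1)]

def loglog_slopes(frequencies):
    """Slopes between consecutive points (log rank, log frequency)."""
    slopes = []
    for r in range(1, len(frequencies)):
        # frequencies[r - 1] belongs to rank r, frequencies[r] to rank r + 1
        dx = math.log(r + 1) - math.log(r)
        dy = math.log(frequencies[r]) - math.log(frequencies[r - 1])
        slopes.append(dy / dx)
    return slopes

freqs = zipf_frequencies(1200, 6)   # illustrative numbers, not taken from a real text
slopes = loglog_slopes(freqs)
print(freqs)
print([round(s, 6) for s in slopes])
```

Every slope comes out as -1 (up to floating-point rounding), because log(C/r) = log C - log r. For a real text the points should only lie approximately on such a line.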

This time, since this is the last task, we give substantial help: most of the program code is included, and only the few missing lines (marked with dots) have to be added. What to do can be worked out from the context and the comments.

import matplotlib.pyplot as plt
import sys

def plot_zipf(filename):
    """
    Count the frequencies of the words of a text
    and plot the rank-frequency function in a log-log
    coordinate system.
    """

    d = {}                                           # dictionary
    with open(filename, "r") as f:
        for line in f:
            for word in line.strip().split():        # split into words
                word = word.strip(',.-_?! ').lower() # strip punctuation, lowercase
                #
                # count the nonempty words
                # and write the frequencies into the dictionary d,
                # where the word is the KEY and the frequency is the VALUE
                #
                if word != '':
                    ........# counting the words

    #
    # plot the frequencies on a log-log scale
    #
    data = ........
    plt.loglog(range(1, len(data)+1), data)
    plt.show()

    # convert the dictionary into a list of pairs, and sort by the
    # second element of the pairs and finally print the first 10 pairs
    print(sorted(d.items(), key=lambda x:x[1], reverse=True)[0:10])

def main():
    filename = sys.argv[1]
    plot_zipf(filename)

if __name__ == "__main__":
    main()

Test your program by running it from the terminal, passing the name of a text file as the command-line argument.
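If the sample files are not at hand, a small synthetic test file can be generated locally. The sketch below is only one possible way to do this: the function name `write_zipf_sample`, the placeholder words `w1`, `w2`, ... and the file name `zipf_sample.txt` are hypothetical, and the result is not a copy of tokeletes.txt. The word of rank r is written 60 // r times, so the file obeys Zipf's law exactly for the first five ranks.

```python
import os
import tempfile

def write_zipf_sample(path, top_count=60, n_words=5):
    """Write a synthetic text in which the word of rank r occurs top_count // r times."""
    with open(path, "w", encoding="utf-8") as f:
        for rank in range(1, n_words + 1):
            word = "w{}".format(rank)  # hypothetical placeholder word
            # one line per word, repeated as often as its ideal frequency
            f.write(" ".join([word] * (top_count // rank)) + "\n")

# Put the file into a fresh temporary directory (an assumption, any path works).
sample_path = os.path.join(tempfile.mkdtemp(), "zipf_sample.txt")
write_zipf_sample(sample_path)

# Read the file back to see how many tokens were written.
with open(sample_path, encoding="utf-8") as f:
    tokens = f.read().split()
print(sample_path, len(tokens))
```

The printed path can then be given to the finished program as its command-line argument; the plotted curve should be a straight line.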

The aim of the exercise is to get to know an empirical distribution and to choose an appropriate representation for it, to learn the basics of processing natural-language data, and to recall the methods of dictionaries.
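As a reminder of the dictionary methods involved, here is a short sketch on toy data that has nothing to do with the text samples. The variable names and the list of colours are assumptions made for illustration, and counting with `dict.get` is only one of several possible approaches.

```python
# Toy data: a few observed colours.
observations = ["red", "blue", "red", "green", "blue", "red"]

counts = {}
for item in observations:
    # dict.get returns the default 0 when the key is not present yet
    counts[item] = counts.get(item, 0) + 1

# items() yields (key, value) pairs; sort them by value, largest first
ranked = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

# values() yields only the counts, which is what a rank-frequency plot needs
frequencies = sorted(counts.values(), reverse=True)

print(counts)
print(ranked)
print(frequencies)
```

Sorting the values in decreasing order gives the frequency that belongs to rank 1, rank 2, and so on.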