Well folks, I currently have two or three significant blog entries whizzing around in my head, and I have to post one of them, even though I've already started writing another, or else I might explode. I guess this can be a little tidbit to keep the posts going until some of my more major endeavors arrive.

So the inspiration for this post came from a Hack Day at the company where I'm working for the summer. One group presented an auto-summarization feature dubbed "talking points," which essentially took group content and displayed the major bullet points nicely on the group's page. The presenters pitched this as a more impartial view of groups. After all, I for one trust algorithms a lot more than I trust self-promotion designed to get you to join the group. At any rate, I didn't catch the name of the algorithm until it was sent out in an email a week later.

I was pretty interested in how it worked, so I checked out the paper online, and reading it gave me a few insights. It turns out that the German computer scientist H. P. Luhn came up with a method for automatically generating abstracts of scientific papers, along with many other cool information-theoretic ideas. It also turns out that this idea is old: published-in-1958, data-on-punch-cards old. It isn't even his most famous work; the "Luhn algorithm" article on Wikipedia points to a checksum-like algorithm for validating identification numbers. Given its age, the algorithm is described in pretty straightforward English, and it seemed easy enough to implement, so I went ahead and did it.

First I'd like to give you a brief rundown of how the algorithm works. As Luhn states in his paper, the algorithm doesn't do anything fancy in terms of NLP; it relies almost entirely on frequency analysis and word spacing:

The justification of measuring word significance by use frequency is based on the fact that a writer normally repeats certain words as he advances or varies his arguments and as he elaborates on an aspect of a subject. This means of emphasis is taken as an indicator of significance. The more often certain words are found in each other’s company within a sentence, the more significance may be attributed to each of these words. - H.P. Luhn


Makes sense to me: certain significant words get repeated, and the denser those clusters are, the more valuable they become.

The algorithm itself runs in two phases. In the first phase, we must determine which words are significant. Luhn does this with a frequency analysis, keeping words that occur often but are not merely common English words. I made a sweeping generalization here by treating the top 100 English words as the unimportant ones. Wikipedia tells me that these words account for roughly half of all written English, so I'll use that set as my list of unimportant words, though nothing in the code requires this particular choice. The first thing I'll do is build a hash set from these words, pretty straightforward:

def load_common_words():
    # words.txt holds the stopword list, one word per line
    with open("words.txt", "r") as f:
        return {line.strip().lower() for line in f}

Next I have to get the most common words from my document, whatever it is, and then keep the subset that are not among those common English words but still occur frequently. I did this by reading through the file and counting word frequencies, dropping the common words from the dictionary, and then sorting the rest in descending order of frequency. I decided to keep the top 10% of these, which I think falls pretty squarely into the 'important' category Luhn was looking for.

def top_words(file_name):
    common_words = load_common_words()
    record = {}
    with open(file_name, "r") as f:
        for line in f:
            for word in line.split():
                w = word.strip('.!?,()\n').lower()
                record[w] = record.get(w, 0) + 1
    # drop the common English words entirely
    for word in common_words:
        record.pop(word, None)
    # sort by frequency, descending, and keep the top 10%
    occur = sorted(record, key=record.get, reverse=True)
    return set(occur[: len(occur) // 10])
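If you'd rather not hand-roll the counting, this first phase can also be sketched with the standard library's collections.Counter. The stopword set below is a tiny stand-in for words.txt, and the max(1, ...) floor (so short texts still yield at least one word) is my own addition:

```python
from collections import Counter

STOPWORDS = {"the", "of", "and", "a", "to", "in"}  # tiny stand-in for words.txt

def top_words_from_text(text, fraction=0.10):
    # count every word, normalized the same way as above
    counts = Counter(w.strip('.!?,()').lower() for w in text.split())
    for stop in STOPWORDS:       # drop the common English words
        counts.pop(stop, None)
    ranked = [w for w, _ in counts.most_common()]
    return set(ranked[: max(1, int(len(ranked) * fraction))])

text = "the cat sat on the mat and the cat ran to the mat"
print(top_words_from_text(text))  # {'cat'}
```

Counter.most_common does the sort-by-frequency step for us, which keeps the whole phase to a handful of lines.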

Moving on to the second phase, we must give a score to each sentence or other chunk of text. This really generalizes to any body of text, but I was curious how it would pick out important sentences, so I went ahead and just used sentences. This function scores a sentence by how tightly the important words cluster inside it. Luhn called this the "significance factor": bracket the portion of the sentence from the first significant word to the last, square the number of significant words, and divide by the number of words inside that bracket. So a 100-word string consisting entirely of significant words would receive a score of 100 (100 squared over 100). The calculate_score method is a little hacky looking, but that's to save space and make things a little more interesting for you.

def calculate_score(sentence, metric):
    words = sentence.split()
    imp_words, total_words, end = 0, 0, 0
    begin = None  # index of the first significant word
    for word in words:
        w = word.strip('.!?,();').lower()
        end += 1  # words seen since the last significant word
        if w in metric:
            imp_words += 1
            if begin is None:
                begin = total_words
            end = 0
        total_words += 1
    if imp_words == 0:
        return 0.0
    # length of the bracket from the first to the last significant word
    bracket = total_words - begin - end
    return float(imp_words ** 2) / float(bracket)
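To see the significance factor in action, here's a compact, self-contained restatement of the same formula (the helper name and the toy word set are mine), run on a dense sentence versus a sparse one:

```python
def significance_factor(words, significant):
    # positions of the significant words within the sentence
    hits = [i for i, w in enumerate(words) if w in significant]
    if not hits:
        return 0.0
    bracket = hits[-1] - hits[0] + 1  # span from first to last significant word
    return len(hits) ** 2 / bracket

sig = {"upload", "peer"}
dense = "each peer upload matters".split()               # 2 hits spanning 2 words
sparse = "each peer we saw would never upload".split()   # 2 hits spanning 6 words
print(significance_factor(dense, sig))   # 2**2 / 2 = 2.0
print(significance_factor(sparse, sig))  # 2**2 / 6 = 0.666...
```

Same number of significant words in both sentences, but the tighter cluster scores three times higher, which is exactly the behavior Luhn describes.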

All right, and with that we are pretty much ready to put everything together. The summarize function runs the two phases, then calculates a score for each sentence. The top sentences, in this case the top 5% by score, are returned to give a decent summary:

ABSTRACT_SIZE = .05

def summarize(file_name):
    with open(file_name, "r") as f:
        text = "".join(
            line.replace('!', '.').replace('?', '.').replace('\n', ' ')
            for line in f
        )
    sentences = text.split(".")
    metric = top_words(file_name)
    scores = {s: calculate_score(s, metric) for s in sentences}
    top_sentences = sorted(sentences, key=scores.get, reverse=True)   # sort by score
    top_sentences = top_sentences[:int(len(scores) * ABSTRACT_SIZE)]  # keep top 5%
    top_sentences.sort(key=sentences.index)                           # sort by occurrence
    return '. '.join(top_sentences)
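The one subtle step above is the final re-sort: after picking the highest-scoring sentences, they're put back in document order so the summary reads naturally instead of jumping from the best sentence to the second best. In isolation, with made-up sentences and scores of my own, the trick looks like this:

```python
sentences = ["intro", "filler", "key result", "aside", "conclusion"]
scores = {"intro": 1.0, "filler": 0.1, "key result": 5.0,
          "aside": 0.2, "conclusion": 3.0}

top = sorted(sentences, key=scores.get, reverse=True)[:2]  # best two by score
top.sort(key=sentences.index)  # restore original document order
print(top)  # ['key result', 'conclusion']
```

Without the second sort, the summary would lead with "key result" and then backtrack, which reads badly for anything longer than a couple of sentences.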

And wham! Summarization completed. Now, this does depend upon a couple of key assumptions that Luhn made. He states that technical writers tend to refer to the same thing over and over with the same words, and that even if they use alternate terms for their readers' sake, they will eventually fall back on very specific terms to describe their points. Well, I wanted to give it a test, so I ran it on one of the papers we read in our systems class: the BitTyrant paper, an excellent read in itself that describes an extremely opportunistic and greedy BitTorrent agent.

Here are the beginning and ending snippets of what it came up with. You can check out the whole file/summary and setup with the attached code samples.

 We evaluate BitTyrant performance on real swarms, establishing that all peers, regardless of upload capacity, can significantly improve download performance while reducing upload contributions.  First, although modeling Bit- Torrent has seen a large body of work (see Section 6), our model is simpler and still suffices to capture the correlation between upload and download rates for real swarms. [... Calvin's editorial cut ...]

Unfortunately, our work shows that BitTorrent fails to attain such an equilibrium for typical file sizes in swarms with realistic bandwidth distributions and churn, which Bit- Tyrant exploits through strategic peer and rate selection.  In contrast, BitTyrant maximizes download per unit of upload bandwidth and can drastically reduce its upload contribution by varying the active set size and not sharing its upload bandwidth uniformly with active peers.

I could limit the text further, but then the summary tends to deteriorate, given the sheer amount of information in the paper. Other than the references to figures, which are out of place without the accompanying text, the introduction and conclusion come out pretty well. As promised, here is the code. Next week I should have a more in-depth project to demonstrate, but until then, have fun summarizing and abstracting.