Text Summarisation with Python NLP

2022-07-03T22:00:00Z

Since I used my thesis to test this program, and because it is still unpublished, I cannot upload it here. However, you can save my code locally and run it on one of your own documents. The only prerequisite is that the text is saved as a “.txt” file.

Introduction

This project is part of Codecademy’s course Apply Natural Language Processing with Python.

The main aim of the project is to apply a simple extractive text summarization technique in Python: count how often each word occurs, weight every sentence by the frequencies of the words it contains, and keep the highest-weighted sentences as the summary. The text used for this project is my Master’s Thesis in Digital Humanities at the University of Groningen.

The text is 47 pages long, excluding the References section and the front matter of the thesis (title, author, student number, etc.). As shown below, the text used for this project is 16,236 tokens in total.
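Before running the script on a fresh Python environment, NLTK’s tokenizer models and stopword list need to be downloaded once. A minimal setup sketch (a one-time step, not part of the summarizer itself):

import nltk

# one-time downloads of the resources the script relies on
nltk.download("punkt")       # sentence and word tokenizer models
nltk.download("stopwords")   # the English stopword list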

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

with open("thesis_final_version.txt", encoding="utf-8") as inp:
    text = inp.read()

stop_words = set(stopwords.words("english"))
tokens = word_tokenize(text)
print(f"The overall length of the text used is: {len(tokens)} tokens")

# calculate how often each word occurs in the text,
# skipping stopwords so they do not dominate the counts
freqTable = dict()
for word in tokens:
    word = word.lower()
    if word in stop_words:
        continue
    # if the word is already in the freqTable, add one more;
    # otherwise start its count at 1
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1

# create a dictionary that sums, for every sentence, the frequency
# of each word it contains; this gives every sentence a weight,
# which is used to summarize the text with the most
# frequent/important words and sentences
sentences = sent_tokenize(text)
sentenceValue = dict()
for sentence in sentences:
    for word, freq in freqTable.items():
        if word in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq

# optional: sort the sentence weights in descending order to inspect them
sort_sentenceValue = sorted(sentenceValue.items(), key=lambda x: x[1], reverse=True)

# compute the average sentence weight
sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]
print(sumValues)
average = sumValues / len(sentenceValue)
print(average)

# only sentences with a weight higher than the average
# are included in the summary
summary = ""
for sentence in sentences:
    if (sentence in sentenceValue) and (sentenceValue[sentence] > average):
        summary += "\n" + sentence

with open("text/thesis_summary.txt", "w", encoding="utf-8") as out:
    out.write(summary)
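The above-average threshold keeps roughly the top half of the weighted sentences, so the summary length depends on the input text. If you prefer a fixed-length summary, a small variation can pick the N highest-weighted sentences instead. A sketch reusing the sentences and sentenceValue variables from above (the choice of 10 sentences is an arbitrary assumption):

import heapq

# pick the 10 highest-weighted sentences instead of thresholding on the average
top_sentences = set(heapq.nlargest(10, sentenceValue, key=sentenceValue.get))

# keep them in their original order so the summary reads naturally
summary = "\n".join(s for s in sentences if s in top_sentences)
print(summary)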

Thank you for your time! Please feel free to comment on my code and give me any feedback that could make it even better.