A common type of analysis of text data is to look at the frequency with which different words appear in the text. In this worksheet we will look at applying word frequency analysis to novel full-texts.
To begin, we will focus on Pride and Prejudice by Jane Austen. For this novel, the file fulltext.txt contains an annotated version of the original text of the novel as provided on Project Gutenberg.
# import the Python packages that we will need to use later on
import re
from collections import Counter
import matplotlib.pyplot as plt
# the relevant files for this novel are stored in the directory below
from pathlib import Path
dir_pride = Path("data") / "pride_prejudice"
Before applying word frequency analysis, we need to apply a number of preparation steps to the full-text.
Firstly, read the entire novel into a single string. Print the length of this string.
# specify the path of the file
novel_fulltext_path = dir_pride / "fulltext.txt"
# load the file
lines = []
with open(novel_fulltext_path, "r", encoding="utf-8") as fin:
    # read in the entire novel
    fulltext = fin.read()
print("Novel full-text has length %d characters" % len(fulltext))
Next, convert the text that you have loaded into all lowercase.
fulltext = fulltext.lower()
Using the text prepared from Task 1, we now can start to identify the words in the novel's text.
Firstly, split the text into a list of all words appearing in the text. We can define a word as a substring that is separated by whitespace characters (e.g., spaces, tabs) and/or punctuation symbols.
(Hint: We can do this a number of different ways, including by using regular expressions)
# split based on whitespace characters and punctuation
pattern = re.compile(r"\W+")
all_words = pattern.split(fulltext)
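As the hint suggests, there are other ways to do this split besides regular expressions. One regex-free sketch maps every punctuation symbol to a space with str.translate and then splits on whitespace; the sample sentence below is illustrative only, not drawn from the file.

```python
import string

def simple_split(text):
    # build a translation table sending each punctuation symbol to a space
    table = str.maketrans({p: " " for p in string.punctuation})
    # replace punctuation with spaces, then split on any whitespace
    return text.translate(table).split()

print(simple_split("it is a truth, universally acknowledged!"))
```

Both approaches give the same word list for simple cases, although the regex version handles any non-word character (not just ASCII punctuation).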
Next, filter out any words from the list which contain fewer than 2 characters (symbols). Report the number of filtered words and the number of remaining words.
keep_words = []
for word in all_words:
    if len(word) >= 2:
        keep_words.append(word)
num_filtered = len(all_words) - len(keep_words)
print("Filtered %d words - Kept %d words" % (num_filtered, len(keep_words)))
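The filtering loop above can also be written as a list comprehension; both forms produce the same list. The words below are illustrative, not taken from the novel.

```python
# keep only the words with at least 2 characters
all_words = ["it", "is", "a", "truth", "universally", "acknowledged"]
keep_words = [word for word in all_words if len(word) >= 2]
print(keep_words)
```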
Report the number of unique words that appear in the list of remaining words.
# use a set to find the number of unique values in a list
unique_words = set(keep_words)
print("Novel full-text has %d remaining unique words" % len(unique_words))
Using the remaining words from Task 2, count the number of times that each word appears in the list (i.e., the word frequencies).
Display the top-20 most common words.
(Hint: a Python Counter might be useful here)
# turn the list of filtered words into a counter
word_freqs = Counter(keep_words)
# print the top 20
print("Top-20 most common words in the novel are:")
for word, count in word_freqs.most_common(20):
    print("%d \t %s" % (count, word))
Many of the words we see above are common stop-words which frequently appear in the English language and might not convey much information about the novel itself. An example set of stop-words is given below.
stopwords = set(["am", "an", "and", "are", "as", "at", "be", "been", "but", "by", "can", "could", "do", "did",
                 "for", "from", "had", "has", "have", "how", "i", "if", "in", "is", "it", "its", "me", "must", "my", "no", "not",
                 "of", "on", "one", "or", "our", "say", "said", "shall", "so", "some", "such", "that", "than", "the",
                 "them", "there", "this", "these", "to", "was", "were", "what", "when", "where", "which", "who", "why",
                 "will", "with", "would", "you", "your"])
Remove all of the stop-words from the current list of word frequencies and display the top-20 most common remaining words.
# remove the frequencies for these words; some stop-words (e.g. "i")
# were already dropped by the length filter, so use pop() with a
# default value to avoid a KeyError for missing keys
for stopword in stopwords:
    word_freqs.pop(stopword, None)
# print the top 20 again
print("Top-20 most common words in the novel are:")
for word, count in word_freqs.most_common(20):
    print("%d \t %s" % (count, word))
Visualise the frequencies for the top-20 words from above using a horizontal bar chart.
# from the Counter get the top words and corresponding frequencies
top_words = []
top_freqs = []
for word, freq in word_freqs.most_common(20):
    top_words.append(word)
    top_freqs.append(freq)
# we have to reverse the lists to get the largest value to appear at the top of a horizontal bar chart
top_words.reverse()
top_freqs.reverse()
# now create a plot to display them
plt.figure(figsize=(7, 7))
ax = plt.barh(top_words, top_freqs, color="darkgreen")
# add axis labels to the chart
plt.xlabel("Frequency", fontsize=13)
plt.show()
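The two lists above can also be built in a single step: zip(*...) transposes a list of (word, frequency) pairs into separate tuples, and reversed() flips the order so that the largest bar appears at the top of the chart. The pairs below are illustrative, not actual counts from the novel.

```python
# transpose a list of (word, frequency) pairs, in reverse order
pairs = [("mr", 785), ("elizabeth", 635), ("very", 488)]
top_words, top_freqs = zip(*reversed(pairs))
print(list(top_words))
print(list(top_freqs))
```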
Now try removing additional stopwords and regenerating the chart above to see how it affects the top-20 word visualisation.
# remove further stopwords... we will just try adding a few more here
extra_stopwords = ["all", "any", "every", "know", "more", "much", "they", "very"]
for stopword in extra_stopwords:
    word_freqs.pop(stopword, None)
# recreate the plot
top_words = []
top_freqs = []
for word, freq in word_freqs.most_common(20):
    top_words.append(word)
    top_freqs.append(freq)
# we have to reverse the lists to get the largest value to appear at the top of a horizontal bar chart
top_words.reverse()
top_freqs.reverse()
# now create a plot to display them
plt.figure(figsize=(7, 7))
ax = plt.barh(top_words, top_freqs, color="darkgreen")
# add axis labels to the chart
plt.xlabel("Frequency", fontsize=13)
plt.show()
Next, we expand the word frequency analysis above to consider the full-texts for two different novels in our dataset:
Dracula by Bram Stoker
Frankenstein by Mary Shelley
# the relevant files for these novels are stored in the directories below
dir_dracula = Path("data") / "dracula"
dir_frankenstein = Path("data") / "frankenstein"
Load and prepare the two full-texts using the steps that we saw previously.
# define a function to load the text and apply the text preparation
def load_and_prepare(in_path):
    # load the file into a string
    with open(in_path, "r", encoding="utf-8") as fin:
        # read in the entire novel
        fulltext = fin.read()
    # convert it to lowercase and return it
    return fulltext.lower()
# load the text for this first novel
fulltext_path_dracula = dir_dracula / "fulltext.txt"
fulltext_dracula = load_and_prepare(fulltext_path_dracula)
# second novel
fulltext_path_frankenstein = dir_frankenstein / "fulltext.txt"
fulltext_frankenstein = load_and_prepare(fulltext_path_frankenstein)
For each text, identify a list of the top-30 most common words. You should filter short words (< length 2) and common stop-words as part of this process.
# define a function to split the texts, filter the words, and return the top 30
def find_top30_words(fulltext):
    # find all of the words
    pattern = re.compile(r"\W+")
    all_words = pattern.split(fulltext)
    # remove the short words
    keep_words = []
    for word in all_words:
        if len(word) >= 2:
            keep_words.append(word)
    # count the word frequencies
    word_freqs = Counter(keep_words)
    # remove the stopwords, ignoring any that are not present
    for stopword in stopwords:
        word_freqs.pop(stopword, None)
    # return the top words in a list (without their frequencies)
    top_list = []
    for word, freq in word_freqs.most_common(30):
        top_list.append(word)
    return top_list
top_dracula = find_top30_words(fulltext_dracula)
print("Top 30 most common words in the book Dracula:")
for i, word in enumerate(top_dracula):
    print("%02d) %s" % (i+1, word))
top_frankenstein = find_top30_words(fulltext_frankenstein)
print("Top 30 most common words in the book Frankenstein:")
for i, word in enumerate(top_frankenstein):
    print("%02d) %s" % (i+1, word))
From the top-30 word lists, identify:
The top words common to both novels
The top words unique to Dracula
The top words unique to Frankenstein
# convert the lists to sets first
set_top_dracula = set(top_dracula)
set_top_frankenstein = set(top_frankenstein)
# get the words common to both (set intersection)
print("Top-30 words common to both Dracula and Frankenstein:")
print(set_top_dracula.intersection(set_top_frankenstein))
# use set difference operators
print("Top-30 words unique to Dracula:")
print(set_top_dracula.difference(set_top_frankenstein))
print("Top-30 words unique to Frankenstein:")
print(set_top_frankenstein.difference(set_top_dracula))
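Python's set operators are shorthand for the methods used above: `&` is intersection and `-` is difference. A small illustration with made-up word sets:

```python
# operator forms of the set methods used above
a = {"night", "time", "good", "dear"}
b = {"life", "time", "good", "father"}
print(a & b)   # same as a.intersection(b)
print(a - b)   # same as a.difference(b)
```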