In this worksheet we will start with a single novel, Pride and Prejudice, by Jane Austen, and we will look at the metadata associated with this novel, which represents the "data about our data".
The metadata takes two forms. For each novel, it is stored in two separate files in the JSON data interchange format:
Metadata relating to the novel and its author. This is stored in the file metadata.json.
Metadata relating to the characters in the novel. We refer to this as the character dictionary. This is stored in the file characters.json.
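The contents of these two files are not reproduced in this worksheet. As a rough sketch, the way the data is accessed later on suggests a layout like the one below. The key names are taken from the code in this worksheet; the alias and attribute values, the type of the year field, and the variable names sample_metadata and sample_characters are illustrative assumptions, not the real file contents.

```python
import json

# Hypothetical miniature of metadata.json. The keys mirror those accessed
# later in this worksheet; the real file may contain further fields, and
# the year may be stored as a string rather than a number.
sample_metadata = {
    "novel": {"title": "Pride and Prejudice", "year": 1813},
    "author": {"firstname": "Jane", "surname": "Austen"},
}

# Hypothetical miniature of characters.json: each definitive name maps to
# a record holding a list of aliases and a list of attributes.
sample_characters = {
    "elizabeth bennet": {
        "aliases": ["Elizabeth", "Lizzy"],  # illustrative values only
        "attributes": ["female", "daughter", "sister"],  # illustrative values only
    }
}

# Round-trip through JSON text: parsing yields plain Python dicts and lists
parsed_metadata = json.loads(json.dumps(sample_metadata))
parsed_characters = json.loads(json.dumps(sample_characters))

print(parsed_metadata["novel"]["title"], parsed_metadata["novel"]["year"])
print(parsed_characters["elizabeth bennet"]["attributes"])
```

Because parsed JSON objects become ordinary dictionaries and JSON arrays become lists, the usual indexing, iteration and len() calls all apply to them.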
# import the Python packages that we will need to use later on
import json
from collections import Counter
from pathlib import Path
import matplotlib.pyplot as plt
# the relevant files for this novel are stored in the directory below
dir_pride = Path("data") / "pride_prejudice"
Start by loading the metadata relating to the novel and the author of the novel from the JSON file metadata.json.
The JSON data from the file should be parsed into a Python data structure, so that we can access it more easily.
# specify the path of the file
novel_meta_path = dir_pride / "metadata.json"
# open the file and parse all of the JSON data
with open(novel_meta_path, "r", encoding="utf-8") as fin:
    novel_metadata = json.load(fin)
Once you have loaded and parsed the novel data, extract and display:
The novel title and publication year information.
The full name of the novel's author.
print("Title is", novel_metadata["novel"]["title"])
print("Publication year is", novel_metadata["novel"]["year"])
print("Author is", novel_metadata["author"]["firstname"], novel_metadata["author"]["surname"])
Next, load the data about all of the characters in the novel. Again, the JSON data from the file should be parsed into a Python data structure, so that we can access it more easily. In this case the structure will be a Python dictionary - we will refer to this as a character dictionary.
# specify the path of the file
character_path = dir_pride / "characters.json"
# open the file and parse all of the JSON data
with open(character_path, "r", encoding="utf-8") as fin:
    characters = json.load(fin)
How many characters have been identified for this novel?
print("Novel has %d characters" % len(characters))
Display the definitive names for the characters in the character dictionary.
print("All character definitive names:")
print(list(characters.keys()))
Display the aliases provided for a single character, the protagonist of the novel, Elizabeth Bennet.
print("Aliases for Elizabeth Bennet:")
print(characters["elizabeth bennet"]["aliases"])
Next, display the attributes provided for Elizabeth Bennet.
print("Attributes for Elizabeth Bennet:")
print(characters["elizabeth bennet"]["attributes"])
We have character attributes associated with most of the characters in our character dictionary.
First, count the number of characters with either the attribute female or male.
female_count = 0
male_count = 0
# iterate over all of the characters in the character dictionary
for definitive_name in characters:
    char = characters[definitive_name]
    if "female" in char["attributes"]:
        female_count += 1
    if "male" in char["attributes"]:
        male_count += 1
# display the counts
print("%d female characters, %d male characters" % (female_count, male_count))
Next, count the number of times that each character attribute appears in the dictionary and display the top 20 most common attributes in the dictionary.
(Hint: a Python Counter might be useful here)
# create an empty Counter
counts = Counter()
# iterate over all of the characters in the character dictionary
for definitive_name in characters:
    char = characters[definitive_name]
    for attribute in char["attributes"]:
        counts[attribute] += 1
# print the most common entries in the Counter
print("Top-20 most common character attributes for this book are:")
for attribute, count in counts.most_common(20):
    print("%d \t %s" % (count, attribute))
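As an alternative sketch, a Counter can also be built in one step by passing it an iterable of attributes. The small sample_characters dictionary below is made-up data standing in for the real character dictionary, and the name attribute_counts is chosen only for this example.

```python
from collections import Counter

# Hypothetical stand-in for the character dictionary loaded from characters.json
sample_characters = {
    "character a": {"aliases": [], "attributes": ["female", "sister"]},
    "character b": {"aliases": [], "attributes": ["male", "brother"]},
    "character c": {"aliases": [], "attributes": ["female", "mother"]},
}

# Feed every attribute of every character into the Counter in a single pass
attribute_counts = Counter(
    attribute
    for char in sample_characters.values()
    for attribute in char["attributes"]
)

# Display the two most common attributes in the sample
for attribute, count in attribute_counts.most_common(2):
    print(f"{count}\t{attribute}")
```

A Counter returns 0 for keys it has never seen, so looking up an attribute that no character has does not raise a KeyError.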
Create a bar chart which visualises the attribute counts for the following character attributes:
required_attributes = ["mother", "father", "wife", "husband", "son", "daughter", "brother", "sister"]
# extract only the counts for the required attributes
required_counts = []
for attribute in required_attributes:
    required_counts.append(counts[attribute])
# now create a plot to display them
plt.figure(figsize=(9, 5))
plt.bar(required_attributes, required_counts, color="purple")
# add axis labels to the chart
plt.xlabel("Attribute", fontsize=13)
plt.ylabel("Number of Characters", fontsize=13)
plt.show()
For this task, we will consider metadata related to all three novels in our dataset:
Pride and Prejudice by Jane Austen
Dracula by Bram Stoker
Frankenstein by Mary Shelley
# the relevant files for these novels are stored in the directories below
dir_pride = Path("data") / "pride_prejudice"
dir_dracula = Path("data") / "dracula"
dir_frankenstein = Path("data") / "frankenstein"
Load in the character dictionary for each of the three novels and store them in separate Python dictionaries.
# read the characters for the first novel
character_path_pride = dir_pride / "characters.json"
with open(character_path_pride, "r", encoding="utf-8") as fin:
    characters_pride = json.load(fin)
# read the characters for the second novel
character_path_dracula = dir_dracula / "characters.json"
with open(character_path_dracula, "r", encoding="utf-8") as fin:
    characters_dracula = json.load(fin)
# read the characters for the third novel
character_path_frankenstein = dir_frankenstein / "characters.json"
with open(character_path_frankenstein, "r", encoding="utf-8") as fin:
    characters_frankenstein = json.load(fin)
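The three loading steps above follow the same pattern, so they could also be wrapped in a small helper. The sketch below is one possible way to do this, not part of the worksheet's own solution: the function name load_characters and the dictionary all_characters are hypothetical, and the example writes made-up data to a temporary directory so that it runs without the real dataset.

```python
import json
import tempfile
from pathlib import Path


def load_characters(novel_dir):
    """Parse characters.json from the given novel directory."""
    with open(Path(novel_dir) / "characters.json", "r", encoding="utf-8") as fin:
        return json.load(fin)


folders = ["pride_prejudice", "dracula", "frankenstein"]

# Build a throwaway directory tree containing made-up data
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    for folder in folders:
        (base / folder).mkdir()
        sample = {"character a": {"aliases": [], "attributes": []}}
        with open(base / folder / "characters.json", "w", encoding="utf-8") as fout:
            json.dump(sample, fout)

    # One character dictionary per novel, keyed by folder name
    all_characters = {folder: load_characters(base / folder) for folder in folders}

# Number of characters found in each (sample) novel
print({folder: len(chars) for folder, chars in all_characters.items()})
```

With the real dataset, the same helper could be called as load_characters(dir_pride), and likewise for the other two directories.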
Count the total number of characters in each novel. Display this information visually using a bar chart.
# create the values for the chart
novel_names = ["Pride and Prejudice", "Dracula", "Frankenstein"]
total_character_counts = [len(characters_pride), len(characters_dracula), len(characters_frankenstein)]
# create the chart
plt.figure(figsize=(8, 5))
plt.bar(novel_names, total_character_counts, color="navy")
# add axis labels to the chart
plt.xlabel("Novel", fontsize=13)
plt.ylabel("Number of Characters", fontsize=13)
plt.show()
Finally, use the character attributes from each novel to calculate the ratio of female-to-male characters in each novel. Display the results as a new bar chart.
# create a simple function to calculate the female-to-male ratio
def calc_ratio(characters):
    female_count = 0
    male_count = 0
    for definitive_name in characters:
        char = characters[definitive_name]
        if "female" in char["attributes"]:
            female_count += 1
        if "male" in char["attributes"]:
            male_count += 1
    # guard against division by zero if no character is tagged as male
    if male_count == 0:
        return float("nan")
    return female_count / male_count
# apply the function for each of our novels
gender_ratios = []
gender_ratios.append(calc_ratio(characters_pride))
gender_ratios.append(calc_ratio(characters_dracula))
gender_ratios.append(calc_ratio(characters_frankenstein))
# plot the ratios
plt.figure(figsize=(8, 5))
plt.bar(novel_names, gender_ratios, color="teal")
# add axis labels to the chart
plt.xlabel("Novel", fontsize=13)
plt.ylabel("Ratio of Female-to-Male Characters", fontsize=13)
plt.show()