Grad Research: Data-mining Jane Austen

Published on January 29, 2018

Before there was Mr. Big, there was Mr. Darcy.

Tall and handsome, wealthy and witty, aloof and arrogant… Fitzwilliam Darcy makes a poor first impression in Jane Austen’s Pride and Prejudice before winning the affections of protagonist Elizabeth Bennet with his gentlemanliness and kindness. Elizabeth marries him for love, and preserves the family fortune in the process.

The End.

Except not really. Austen’s oeuvre has persisted, thriving through the age of aristocratic suitors and the rise of modern dating, and on into the era of Tinder.

Research by Carleton’s Jenna Herdman illustrates this. The second-year English PhD student has used Google Ngram Viewer to track mentions of Austen. The tool searches Google Books, and shows that Austen’s work is mentioned more often as time passes – particularly Pride and Prejudice, which has been on a steady upward trend since about 1990.

Herdman also uses text-mining and distant reading to create data visualizations of the content of Austen’s novels. These visualizations help students understand how the novels are structured, and the techniques she is using have helped academics critique entire bodies of literature and move beyond the literary canon.

There may have been as many as 60,000 novels published in 19th-century England. At a rate of one per day, it would take more than 160 years to read them all. But by aggregating data on grammar and language, it’s possible to recognize patterns within the full body of work.

“Thematically, Austen novels generally focus on a female protagonist and a marriage plot,” Herdman says. “The heroine surmounts the financial difficulty of her inheritance position by settling into a marriage that, conveniently, fulfills a domestic ideal of having both romantic love and economic security. The heroine rejects the ‘wrong’ choice of husband – often defined by sexual attractiveness, but which will lead to a ruinous union – in favour of the ‘right’ choice.”

In addition to Pride and Prejudice, Herdman used Voyant – a web-based text analysis tool – to analyze patterns in the romantic rivalries in Sense and Sensibility, Northanger Abbey and Mansfield Park.

Dividing each book into 10 segments, Herdman counts the mentions of each romantic rival in each segment. The resulting graph for Pride and Prejudice shows mentions of Darcy, the romantic hero, fluctuating alongside those of Wickham, who falsely accuses Darcy of denying him a lucrative post. Darcy ultimately wins the protagonist’s heart, and dominates mentions in the novel’s conclusion.

Sense and Sensibility has a similar pattern, with its chosen suitor Colonel Brandon given less focus throughout the novel than rival Mr. Willoughby before becoming a focal point at the end.
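
The segment-counting approach described above can be sketched in a few lines of Python. This is only an illustration, not Voyant's implementation or Herdman's own code: the function name `mentions_by_segment`, the word-based splitting, and the toy sentence used in the demo are all assumptions made for the example.

```python
import re
from collections import Counter


def mentions_by_segment(text, names, n_segments=10):
    """Count mentions of each single-word name in n_segments roughly equal slices of text.

    Hypothetical helper for illustration; segments are measured in word tokens.
    """
    # Lowercase and keep only runs of letters, so "Darcy's" yields the token "darcy".
    tokens = re.findall(r"[a-z]+", text.lower())
    wanted = {name.lower() for name in names}
    total = len(tokens)

    counts = {name: [0] * n_segments for name in names}
    for i in range(n_segments):
        # Integer boundaries so that every token falls in exactly one segment.
        start = i * total // n_segments
        end = (i + 1) * total // n_segments
        tally = Counter(t for t in tokens[start:end] if t in wanted)
        for name in names:
            counts[name][i] = tally[name.lower()]
    return counts


if __name__ == "__main__":
    # Toy text standing in for a full novel (an assumption, not Austen's prose).
    sample = ("Wickham spoke. Wickham smiled. Darcy frowned. "
              "Darcy's letter came. Darcy won.")
    for name, series in mentions_by_segment(sample, ["Darcy", "Wickham"], 2).items():
        print(name, series)
```

Run on the full text of a novel with the default of 10 segments, each list could be plotted as one line on a graph, giving a rough version of the rivalry arcs described here. Matching on a bare surname is a simplification: it misses pronouns and titles such as "the Colonel", and it cannot tell apart characters who share a name.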

Visualizing these arcs can help readers better understand plot progression, and Herdman considers these graphs a first step, envisioning more intricate ways that Austen’s characters’ traits can be probed.

“There’s something fascinating about peeling a 19th-century novel open with digital tools to look for patterns and irregularities,” says Herdman. “Typical undergraduates might read the introduction to the scholarly edition, or look up the summary on Wikipedia or Sparknotes. Similarly, distant reading lets us visualize and explore a novel’s language or themes to supplement our close reading. These graphs aren’t authoritative representations, they’re texts in their own right, which should be interrogated and critiqued.”

Herdman, whose previous work has explored patterns in Dickens and vampire novels, chose to do her PhD at Carleton to work with Dr. Barbara Leckie and Dr. Janice Schroeder, both of whom influenced her during her Master of Arts, which specialized in Digital Humanities. In her PhD, she’ll focus on Henry Mayhew’s London Labour and the London Poor: archiving, cataloguing, and representing it for scholars and the public.

Published in the 1850s, Mayhew’s journalism catalogued poverty, public health and political economy. His team researched poverty, engaging directly with the poor, and London Labour was the first study of urban poverty to communicate the stories of the poor in their own words.

“I’m interested in how the experience of reading London Labour and the London Poor could be enhanced through a hypertextual model of engagement – links to related works – which encourages readers to encounter it in a nonlinear way,” Herdman says.

The online archive she’s creating will facilitate comparisons of various versions, with a scholarly edition that lets readers navigate the text through digital architecture instead of encountering it as individual print forms. She’s exploring how this transforms readers’ interaction with the work, and how digitization enables them to encounter texts in new ways.

“Digital tools offer unprecedented opportunities for representing the massive amount of print culture from the Victorian period. Scholars have digitized periodicals that could previously only be viewed in archives. Digitization is especially useful for making texts searchable using optical character recognition to scan for words.”

Knowledge of digital tools – their applications and limitations – is key to ensuring English literature graduates are prepared for the workforce.

“We increasingly read in non-linear ways, using hypertext, mediated through existing digital interfaces and architecture. English majors need digital skills, so learning the perils of digital analysis first-hand is important. I’m interested in how we might teach digital literacy. It might seem paradoxical, but one reason to use these tools is to interrogate their methods, biases, and failings. Digital production is often characterized by positive rhetoric and claims of democracy, open access, and increased ‘freedom’. However, this ‘library’ is organized by algorithms, not bibliographers, and it enhances the monopoly that search engines like Google have over information.”

–This story was written by Tyrone Burke.