In the first paragraph of “Messy Data and Faulty Tools,” Joanna Swafford cites the Stanford Trans-Historical Poetry Project. The Stanford Literary Lab aims to create an algorithm that scans meter by breaking lines into their smallest metrical units in order to classify the metrical scheme of any poem (Algee-Hewitt et al.). Of course, the project is interesting in its own right—for telling us the popularity of a given meter over the history of English or American literature, or for identifying the limits of metrical rule-breaking—but the project also invites just what Swafford calls for: partnership and peer review. Ryan Heuser, of the Trans-Historical Poetry Project team, has posted the code used in the project, provides step-by-step directions for users to run the software, and offers up two years of his team’s hard labor “for any purpose whatever” (“Code used in the Literary Lab’s Trans-historical Poetry Project”). Such open access allows other scholars to benefit from the team’s work and to test the code’s accuracy. More importantly, outside input can help the Literary Lab address the challenges the project has faced.
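To make the scansion idea concrete, here is a minimal sketch of foot-based classification in Python. It is my own toy illustration, not the Literary Lab’s published pipeline: it assumes each line has already been reduced to a binary stress string (say, via a pronouncing dictionary) and simply asks which classical foot tiles the line best.

```python
# A minimal sketch of foot-based meter classification (my assumption, not the
# project's actual code): "0" marks an unstressed syllable, "1" a stressed one.

FEET = {
    "iambic":    "01",
    "trochaic":  "10",
    "anapestic": "001",
    "dactylic":  "100",
}

def classify_meter(stress):
    """Return (foot name, foot count, match score) for the best-fitting foot."""
    best = ("unknown", 0, 0.0)
    for name, foot in FEET.items():
        if len(stress) % len(foot):
            continue  # only consider feet that tile the line evenly
        repeats = len(stress) // len(foot)
        template = foot * repeats
        score = sum(a == b for a, b in zip(stress, template)) / len(stress)
        if score > best[2]:
            best = (name, repeats, score)
    return best

# "Shall I compare thee to a summer's day?" scans roughly as 0101010101:
print(classify_meter("0101010101"))  # ('iambic', 5, 1.0), i.e. iambic pentameter
```

Even this toy version hints at why the team’s challenges are hard: elision and extrametrical syllables change the length of the stress string itself, so no clean template will tile it evenly.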
Like Swafford, Alan Liu advocates for interdisciplinary collaboration in “N+1”—to explore cross-domains of data to “advance the digital humanities both technically and on a broader methodological front.” He suggests that pushing scholars out of their familiar domains and making their clean data sets messy can disrupt a feedback loop of set assumptions and create unexpected results (Liu). The Trans-Historical Poetry Project has only an 80% success rate in correctly classifying meter, and its specific challenges include elision, extrametrical syllables, feminine endings, and foreign words. The program works best from the late sixteenth to the late nineteenth centuries because “metrical forms were most stable and recognizable in this period” (Algee-Hewitt et al.). It seems to me that the project would be especially troubled by Middle English poetry: Chaucer’s iambic pentameter is fraught with all of the challenges the team lists, and his constant metrical rule-bending can leave scholars and students puzzled. I wonder whether including punctuation and manuscript variants in the project’s data pool might help the program identify Middle English poetry with greater success. Though irregular, punctuation like the virgule (/) separates metrical units, signals when not to elide, and generally informs readers of a poem’s intended pronunciation. Including sets of manuscripts with the same contents, written around the same period, would give the program far more regular rules and examples for all of the challenges listed above.
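As a thought experiment, here is what a virgule-aware preprocessing step might look like in Python. The segment() helper and its no-elision flag are hypothetical illustrations of my suggestion, not features of the project’s released code.

```python
# A hypothetical preprocessing step for the suggestion above: treat a
# manuscript virgule as both a metrical-unit boundary and a "do not elide
# across this point" signal. This is my own sketch, not the project's code.

def segment(line):
    """Split a line on virgules into metrical units, flagging boundaries
    where the scribe's punctuation blocks elision into the next unit."""
    chunks = [c.split() for c in line.split("/") if c.strip()]
    units = []
    for i, words in enumerate(chunks):
        units.append({
            "words": words,
            # a virgule follows every unit but the last in this line
            "no_elide_final": i < len(chunks) - 1,
        })
    return units

# Chaucer, General Prologue, with a virgule as in some manuscript copies:
for unit in segment("Whan that Aprill / with his shoures soote"):
    print(unit)
# {'words': ['Whan', 'that', 'Aprill'], 'no_elide_final': True}
# {'words': ['with', 'his', 'shoures', 'soote'], 'no_elide_final': False}
```

The appeal of this design is that, rather than guessing elision statistically, the program would take its boundaries from the scribe’s own punctuation, which is exactly the kind of evidence a manuscript-rich data pool could supply.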
Of course, the Trans-Historical Poetry Project may be limited in the same way that many digital humanities projects in literature are limited: by access. Google Ngram Viewer’s search begins at 1500, and Bookworm starts at 1600. These two programs are representative of the broader lack of open-access, digitally transcribed medieval texts. In her article “More Scale, More Questions,” Tressie McMillan Cottom suggests that economic forces influence “the data being produced, and the scale of which is being produced, and contains how that data can be accessed, analyzed, and politicized” (Cottom). There are many reasons why medieval texts aren’t the first to be digitally transcribed, including the difficulty of machine-reading scribal hands and blackletter type, but the lack of digitized texts preceding 1500 undeniably shapes the kinds of questions scholars, including the Literary Lab, can ask if they hope to spend their time analyzing data instead of transcribing it.