Friday 3 February 2017

Thoughts on text-mining and intertext


Friends, Romans, countrymen: why haven’t we built a text-mining tool to facilitate one of our favourite occupations within the field, namely the spotting of intertext?

Most of the time, I imagine most of us establish intertext by dint of knowing the one (or two, or three) author(s) of our specialisms extremely well – well enough that certain phrases may trigger our recognition so that we connect the two. Sometimes the idea of intertext has to be discarded if no meaningful engagement can be argued; sometimes we craft entire articles around what the allusion by one author to the words of another can mean. I’m not saying there is anything wrong with this. What I am wondering is whether we are missing opportunities.

The current approach produces results which, though valid in the individual instances they establish, cannot say anything about the extent of borrowing and engagement between two authors more widely. We may recognize, if we know our Plautus well, a Ciceronian borrowing from him in a particular speech that we’re working on, but there it ends. Is it the only speech in which he engages with Plautus? Does he do this frequently? Does he draw on some passages or works more than others? Does he borrow from Plautus, and within that from particular works, more than he borrows from other authors who predate him? These questions are much more difficult to answer without a deep and long immersion in both texts, with no guarantee of further results. (I’m not saying this would be ‘wasted’ time in the absence of results, but within the current realities of academic knowledge production and career assessment, time must to a certain extent ‘pay off’.)

People have written theses constructing these sorts of relationships on the basis of individual occurrences, and may even in recent years have started feeding their authors of choice into text-parsing programmes to help with the work. But the results are still limited to a pre-selected set of authors which we either know well or suspect of such intertextual engagement, and are therefore informed by our bias and expectations from the beginning.

So why can’t we build a database tool that makes connections across all of Latin literature, and then select from the results what strikes us as interesting? Granted, it would throw up a lot that is misleading or meaningless -- fixed expressions being one of the obvious pitfalls (if 80% of authors from 3rd century BC Plautus to 4th century AD Ausonius used the phrase ‘getting up at the crack of dawn’, we can probably safely discard the possibility of intertext between them all). But by widening the net I suspect we would notice a lot that no one has noticed before, simply because few of us become closely conversant with more than a handful of authors or texts in the course of our lifetimes. We would also eliminate the possibility that we’d simply ‘missed’ stuff, even within our own field of expertise.
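To make that ‘fixed expressions’ filter concrete: one crude but workable criterion is document frequency -- any phrase shared by more than (say) 80% of authors is probably formulaic rather than allusive, and can be discarded before a human ever looks at the results. A minimal sketch in Python; the corpus and the Latin in it are invented purely for illustration:

```python
from collections import Counter

def ngrams(text, n=3):
    """Set of word n-grams in a (pre-cleaned, lower-cased) text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def common_formulae(corpus, n=3, threshold=0.8):
    """N-grams occurring in more than `threshold` of the authors in
    `corpus` -- fixed expressions too widespread to signal intertext."""
    doc_freq = Counter()
    for text in corpus.values():
        doc_freq.update(ngrams(text, n))
    cutoff = threshold * len(corpus)
    return {gram for gram, freq in doc_freq.items() if freq > cutoff}

# Toy corpus, author -> text; the phrases are invented for illustration.
corpus = {
    "Plautus":  "prima luce surgere solebat ut ad forum iret",
    "Cicero":   "prima luce surgere constituit ut ad forum iret",
    "Tacitus":  "prima luce surgere iussit milites ad castra ire",
    "Ausonius": "prima luce surgere amat poeta in horto suo",
}
print(common_formulae(corpus))  # -> {'prima luce surgere'}
```

Here the ‘crack of dawn’ phrase (‘prima luce surgere’) appears in all four authors and gets flagged as a formula, while the phrases shared only by Plautus and Cicero survive as candidate intertexts.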

What I would greatly enjoy using, should it ever exist, would be an online data-churning resource for the entire body of Latin classical texts (yes, the chronological parameters would require some thinking) which compares syntax and vocab across these texts and flags up passages of similarity, possibly with a percentage indicator judging how great the overlap is. There are already programmes designed to flag up plagiarism, such as CrossCheck/iThenticate, which do similar things. I never dealt with them much during my time in publishing, nor with Turnitin in my academic teaching so far, so sadly I don’t know much about the detail.
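That percentage indicator needn’t be anything exotic to start with: even Jaccard similarity over word n-grams (shared n-grams divided by total distinct n-grams) gives a usable 0–100 score. A sketch, with invented example passages; a real tool would of course want something cleverer than raw surface forms:

```python
def word_ngrams(text, n=2):
    """Set of word n-grams in a lower-cased, whitespace-split text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_percentage(passage_a, passage_b, n=2):
    """Jaccard similarity of the two passages' n-gram sets, as a
    percentage: 100 = identical n-gram inventories, 0 = no overlap."""
    a, b = word_ngrams(passage_a, n), word_ngrams(passage_b, n)
    if not (a | b):
        return 0.0
    return 100 * len(a & b) / len(a | b)

# One bigram of three distinct ones is shared, so roughly 33%.
print(overlap_percentage("arma virumque cano", "arma virumque canebat"))
```

The choice of n matters: unigrams flag shared vocabulary, longer n-grams flag shared phrasing, which is usually closer to what we mean by intertext.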

Of course these programmes operate slightly differently: you put in a chunk of text, say a submitted journal article or a student’s work, and the programme compares this chunk to the whole body of articles in its database. But even if we simply copied this format, this would already be extremely useful for classical scholars who, as I said above, still start from a point where they pre-select the material they’d like to work with.
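Copying that format is straightforward in principle: score the submitted chunk against every text in the database and rank the results. A sketch of the shape such a comparison might take -- the scoring here is simple n-gram containment (what share of the query’s phrases recur in each candidate text), and the authors and texts are invented:

```python
def grams(text, n=3):
    """Set of word n-grams in a lower-cased, whitespace-split text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def containment(query, text, n=3):
    """Share of the query chunk's word n-grams that also occur in `text`."""
    q = grams(query, n)
    return len(q & grams(text, n)) / len(q) if q else 0.0

def rank_candidates(query, corpus, n=3):
    """Rank every text in the corpus by its overlap with the query chunk."""
    scores = {author: containment(query, text, n)
              for author, text in corpus.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

corpus = {
    "Caesar": "gallia est omnis divisa in partes tres",
    "Vergil": "arma virumque cano troiae qui primus ab oris",
}
print(rank_candidates("omnis divisa in partes tres", corpus))
# -> [('Caesar', 1.0), ('Vergil', 0.0)]
```

A real version would obviously need to report *which* passages in each candidate text matched, not just a per-author score, but the ranking step is the same.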

After all, I know of no projects centered around questions as broad as ‘let’s see what comes up if we compare everything ever’. I suspect the processing power required to do this, depending on the database size, would be too great, and building an intelligible and user-friendly interface for viewing the results would be very difficult. But ‘let’s see what comes up if I copy and paste Tacitus’ description of the battlefield of the Varian disaster in the Teutoburgerwald’ would already produce wider-reaching results than the human mind is capable of, and I, for one, would be extremely interested in seeing them. The machine would not draw your conclusions for you, but you could get straight to the interpreting without having to slog through the data in a human, and thus flawed, manner.

(Amusingly, about a year ago I was at my funder the AHRC’s annual skills training conference, about which I blogged here. I remember having a conversation with an artist from Falmouth named Dane Watkins at the time, who had an interest in such things and told me that as I was already doing some of this stuff anyway (as indeed I was, in the third chapter of my thesis, which I actually wrote first and must have told him about at the time) I should do it more systematically. I wish I remembered more of the conversation, but it clearly didn’t prompt me into action or even serious consideration at the time.)

So why doesn’t this kind of resource exist yet? I’ve had a long old think about this, and I think it’s possible to split the reasons into the two broad categories of ‘technical’ and ‘emotional’.

Classicists don’t tend to be techies. As a species we are not educated into becoming masters of digital skills, but into becoming linguists, historians, translators, and critical thinkers, in a way which has so far not required any (or not much) external mediation between us and our source material. With the advent of Digital Humanities, thankfully, many people have woken up to the possibilities and importance of IT to the field, with lovely results such as my personal favourites ORBIS and PHI Latin Texts. Though it is pretty low-grade in its looks (I don’t know about the back end), I am also rather fond of Perseus, both its texts and its dictionary and word-study tools. The most exciting resource I’ve come across is Diogenes, made by the classics scholar and digital humanist Dr P. J. Heslin at the University of Durham, which draws on various other databases of classical texts to allow for rather complicated searching, such as (in Dr Heslin’s own examples from this talk) all possible declensions of the word ‘Caesar’ in the works of Cicero, or only certain declensions of the word ‘Athena’ in texts marked in the database as belonging to the genre ‘epic’. What I can’t ascertain at the moment, because I don’t yet understand how to use the thing, is whether it is set up to do what I have outlined above, even though that seems to me only a moderate expansion of what it already does.

So why haven’t we gone for full-on text-mining, given that some of us do have the technical skills, and that others who lack them have clearly managed to find suitable partners to help them build things? This brings me to the ‘emotional’ reasons.

I’ve already said we do intertext all the time, and we like making connections. Is there, however, some unspoken or even unconscious feeling that these things should be spotted through hard graft rather than automated comparison? Do we feel it’s somehow at odds with the literary nature of these texts? Does it demean their art to involve a level of automation in our engagement with them? Do we fear it may undermine past research on intertextual connections which have been argued to be unique but may in fact occur elsewhere, in authors one doesn’t happen to be familiar with or interested in? Do we think it would throw up too much that is irrelevant, making it too labour-intensive to use? Have others had this idea (I struggle to imagine they haven’t) but not had the time, money, inclination, technical skills, or contacts to pursue it? Is someone somewhere already working on this in silence?

Or is it because we like our current method of selecting the direction of our research before starting it, as opposed to seeing what the data throws up and then selecting what we’d like to get our teeth into? Would narrowing down what to follow up on be too difficult? An admittedly quick google (but, in fact, on DuckDuckGo, because they don’t track your searches) for text mining in the humanities threw up only these two examples, on text mining ancient Chinese medical literature and classical music scores.

Why are we not doing this? (If we are, please tell me.)

IT-savvy friends have assured me that it wouldn’t be very difficult at all to build such a data-churning tool. Much of our raw data is already out there in digital form (Perseus has most of the classical Latin corpus uploaded, as does PHI, as does even a bare-bones, quick-reference resource such as The Latin Library), and it’s not as if we have to worry about copyright.

I imagine the fact that Latin is an inflected language could be a bit of a problem, but there must be ways round that. Texts on Perseus, for example, are coded so that if you click on any word in the Latin text it will take you through to a list identifying the word’s morphology, which means the back end of the resource recognizes word stems and dictionary entry forms as well as their possible modifications when declined or conjugated. It would have to be explored whether building a supra-engine for text mining which drew on the databases of these already existing resources is possible, and whether the institutions which host them would give permission, or under what conditions and for what remuneration they would do so. The Diogenes ‘Thanks’ page refers to Perseus’ Creative Commons licencing, which has allowed Dr Heslin to draw on their database, so I can’t imagine it would be very hard.
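The obvious way round inflection is to compare lemmata rather than surface forms: map every inflected form to its dictionary headword first, then run the comparison over the lemma sequence. A toy sketch -- the lemma table here is hand-written and tiny purely for illustration, where a real tool would draw on full morphological data of the kind sitting behind Perseus’s word-study tool:

```python
# Hand-written toy lemma table (invented for illustration); a real tool
# would use a full morphological database rather than a dictionary like this.
LEMMATA = {
    "caesar": "caesar", "caesaris": "caesar", "caesari": "caesar",
    "caesarem": "caesar", "caesare": "caesar",
    "exercitus": "exercitus", "exercitum": "exercitus",
    "exercitui": "exercitus",
    "ducit": "duco", "duxit": "duco", "ducebat": "duco",
}

def lemmatize(text):
    """Map each word to its headword where known, so that e.g.
    'Caesarem' and 'Caesaris' compare as the same item."""
    return [LEMMATA.get(w.lower(), w.lower()) for w in text.split()]

print(lemmatize("Caesar exercitum ducebat"))
# -> ['caesar', 'exercitus', 'duco']
```

Comparison over lemmata would then treat ‘exercitum duxit’ and ‘exercitus ducit’ as identical, which surface-form matching would miss entirely.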

If sufficient collaborations could not be established, having to duplicate the data entry work would be a disadvantage in terms of the time and money required to set it up. But at the same time a resource not reliant on the others’ continued hosting of compatible (!) databases into the future would have the advantage of complete control over both its future and its design. A newbuild could make its coding open source and its licencing format CC-BY-NC, allowing others to borrow (for example for other fields of literature), but not for commercial purposes.

Techie people with a classics background must be hard to find, so there would still be a large and important role for classical ‘editors’ to test successive versions of the resource and help refine the rules which determine results, so that the output would be as accurate and relevant as possible. Presumably the developers would also need people on hand to explain to them, never mind to the programme, how Latin actually works.

Is this the stuff that postdocs are made of?

I need to do more research and then have a long hard think. But really, I need to get on and write the fourth chapter of my thesis.

(**The answer to the question of how I came to have these thoughts is longer and less interesting than these loose thoughts, but briefly: a Tacitean passage I’m working on struck me as very Caesarian in ‘feel’. Leaving aside the difficulty of establishing which criteria to adopt in order to potentially verify this, I also realized that I didn’t have the time to get to know Caesar as well as I do Tacitus during the scope of my current project, and this led me to think of the idea of the digital tool I am ruminating on in this blog post.**)