Language Modelling and Visualization for Parallel Text Analysis

University College Cork (Ireland)

Existing text reuse detection systems depend heavily on naïve, brute-force word matching; because of unstable spelling, text reuse detection remains a major challenge for literary and historical documents. This project will apply and further develop representation learning techniques (e.g. large language models), which enable us to move from strict word matching to a deeper analysis of textual parallels, unhindered by issues such as spelling variation and lack of context awareness, and which can also spot allusions in addition to literal quotations. The role of the expert or lay user supporting the process will also be taken into account by means of novel visualisation techniques tailored to the application.

The successful candidate will be expected to pursue PhD-quality research which locates and tracks candidates for reuse categories in light of current user-centred large language models. This research might lay the foundation for future applications of text reuse detection in the computer-assisted analysis of literary texts, currently a major gap in the capabilities of digital tools and methods for humanities research. An appropriate literary test case with a rich tradition of exposition, allusion, and adaptation will be identified. The researcher should: 1) develop methods and tools for detecting text reuse in works of literature; 2) track the variation and/or stability with which text was reused; 3) utilise machine learning techniques for accurate yet efficient text matching and indexing; 4) interpret specific cases of style and content reuse in literature and other documentary sources.