State of the Art

Three areas of innovation

The network crucially advances the state of the art in historical computational linguistics in its multidisciplinary combination of statistical Natural Language Processing with humanities expertise from longitudinal fields including languages, linguistics, literature and history. Each research domain informs and improves the other. CASCADE will advance the state of the art in the following areas:

  • More accurate semantically-aware information retrieval. The network proposes to develop methods that are both data-driven and sensitive to the longitudinal characteristics of data and meaning (semantic change). 

  • A better understanding of the characteristics and appropriateness of different methods because the network proposes to apply them within different longitudinal studies that rely on natural (human) language.

  • Fundamental advancement in the field of historical linguistics which is traditionally ‘case based’ (i.e. using small, predefined corpora) and observational rather than computational in its methods of analysis.

Current methodologies

Semantically-aware information retrieval (often termed ‘semantic search’ because search is the most common aspiration) requires the use of methodologies that process contextual information in order to determine the meaning of a particular word or group of words. In semantic search services more specifically, the challenge is complicated by the requirement to understand the intent of the user also (i.e. what do they intend their search term or phrase to mean?).

Current methodologies are overwhelmingly predicated on the target data source being highly structured using a data model and/or additional metadata that implies semantic relationships between the words that occur in the source.  For example, structuring data as a triplestore (e.g. RDF XML) using subject-predicate-object triples.

More advanced data models, but using the same principles, are graph databases and, in particular, ontologies in which entire knowledge domains can be formalised as a schema and then the source data modelled to fit the schema. Social network analysis relies on these types of data models.

These approaches all have drawbacks: They impose a standard semantic model onto the data that has been pre-determined and, because they require an enormous investment of time to model source data in line with the schema, they are untenable at scale.  

The need to be data-driven

Genuine semantically-aware information retrieval must be capable of handling natural (human) language at scale and without the prerequisite of a semantic model. That is, it needs to be data-driven using natural language processing and computational linguistic approaches that can undertake probabilistic reasoning based only on the evidence to hand.

Computational linguistics and NLP are not new fields, dating back to the 1950s, but it is with the rise of probabilistic approaches facilitated by more powerful computing that we have seen these fields move away from the supervised or pre-modelled approaches discussed above and towards the current state of the art in statistical NLP.

With statistical NLP,  employing neural networks and Machine Learning, linguistic data can be automatically structured using algorithms based on probabilistic measures and descriptive and inferential statistics. Further, patterns can be observed within the structured data using similar algorithms. Such patterns constitute linguistic norms or rules in particular contexts (historical, lexical, grammatical, or genre- or style-related contexts, for example). Such patterns can then be identified in data that has not been seen before. 

Meaning is usually de-emphasised

Statistical NLP has traditionally viewed words as data points identical to any other data points in any other data science (e.g. via ‘bag of words’ models). More recent work in NLP incorporates a more sophisticated understanding of words (including features such as part of speech and syntactic functions, as well as lexical co-occurrence or proximity information).

Humanities scholars recognise that words are not simple data points, and are even more complex than their parts of speech, syntactic functions, or co-occurrence: they additionally function within complex structures of human interaction and communication, and within still other complex structures of grammar-meaning relationships and proximity relationships.

Probabilistic measures of words, therefore, must account for expressions of meaning in human communicative interaction in historical, social, and cultural contexts, as well as expressions of meaning in grammatical structure. It is precisely this variable—meaning—that is de-emphasised in much of the work in data analytics and computational linguistics.

Emphasising meaning and time

In contrast, meaning and its manifestations in time are at the centre of CASCADE’s research. This emphasis allows us to further develop approaches in the following areas:

  • Probabilistic measures of language: How are probabilities of linguistic features such as words or meanings measured? A huge array of measurements is possible, but only a few have been employed and assessed.

  • Descriptive and inferential statistics: Which algorithms, built on which measures of probability, are actually meaningful in relation to our understanding of linguistic semantics and pragmatics? A wide array of algorithms is possible; only a smaller number have been tested towards particular purposes.

  • Descriptive and explanatory adequacy: What can we conclude, based on carefully applied descriptive and inferential statistics, and rigorous measurements of linguistic probability, about linguistic norms or rules, and how can this help us to describe and explain the nature of language and meaning? 

Towards semantically-aware information retrieval

Semantically aware information retrieval must be built upon semantically informed probability measures and statistical analyses. Advancement in the state of the art of identifying lexical variation for text correction must incorporate the state of the art in linguistic knowledge across all domains; not just lexical, orthographic, and part-of-speech, as is already common, but semantic as well. Identifying meaning change in historical linguistics, similarly, relies on the application of semantics in quantitative processes, as do identification of genre and style. 

Despite the promise of statistical NLP, it is difficult to argue that natural language processing and computational linguistics have delivered real-world semantically-aware information retrieval systems, despite the urgent need for them.

True, there is an aspect of semantic search underpinning Google, using statistical inference to group similar documents around particular keywords, but this is hardly semantic in the sense of understanding what the user means when s/he is conducting a search.

A search for “powerful controversial leader usa” exposes this (with or without the term “caucasian”). Google returns a group of documents in which the keywords appear within proximity of one another but it does not try to understand what the search phrase probably means, and that it should be returning results that do not necessarily include the search phrase. We still have a long way to go.

The elephant in the room

All these characteristics constitute an elephant in the room for data analytics and the aim of retrieving semantically meaningful results from large text corpora. They are represented by the questions posed in CASCADE’s Research Objectives:

  • To what degree does data need to be improved by annotation?

  • Should meaning be identified using internal or external measures?

  • How do various algorithms influence the results?

  • And to what extent does time as a variable enhance our understanding of other variables in language data?

These are fundamental questions that currently limit the success of large scale data analytics because many NLP approaches ‘flatten out’ problems in data by using annotated data, external measures (e.g. training files), supervised algorithms and/or algorithms that do not make their reasoning transparent.

Trying to avoid variation in data

This concern is reflected in the emergent domain of ‘BERTology’ and research to understand the inner workings of NLP models. In the majority of instances ‘flattening’ is done to minimise variation in the data. Most NLP approaches give the impression that data is created synchronically. However, when we take a diachronic interest in language, such as in the domain of historical linguistics or in social media content analytics where the scale of data creation is rapid and informal, it becomes impossible to avoid these research problems because avoiding variation in the data is no longer the chief objective—quite the opposite. 

The CASCADE approach

The fundamental advancements sought by CASCADE concern the identifying and processing diachronic characteristics such as semantic change and spelling variation, grammatical and stylistic change, and social variation for the purpose of understanding meaning in language over time.

Statistical NLP approaches to date have accommodated these characteristics as errors or noise in the system and sought to identify meaning synchronically. In our view, the neglect of such variation and change is a chief reason for genuine semantically-aware information retrieval systems failing to materialise. By identifying language change, one can disambiguate or inductively identify meaning.

These characteristics are thus the object of study for all CASCADE’s research projects. The natural field for understanding language change over time—historical linguistics—has traditionally limited itself to small, hand-crafted corpora and non-computational (or supervised computational) methods of analysis.

Transferring this know-how to larger datasets, CASCADE brings together Europe’s leading practitioners in data-driven, computational historical linguistics to train a new generation of researchers and achieve a fundamental advancement in historical linguistics which will impact on semantic search and retrieval more generally.