The problem with data

We lack the knowledge and the tools to interrogate diachronic text data at scale, whether to locate a specific piece of information or to extract new knowledge from a macroscopic perspective on the data. This problem is familiar to academic researchers who use online archives of primary source data; to organisations that need to understand the knowledge bases that underpin their business (such as internal technical documentation); and to the general public, who have been deluged with social media content since 1999. By attending to the particular and neglected variable of time, we enhance understanding of other variables in both large and small data.

A significant quantity of the world’s data is unstructured natural (human) language, created over time and exhibiting marked variation at every level, from features such as non-standard spelling and syntax to variation in meaning. For example, the linguistic variation evidenced in the millions of comments associated with a YouTube channel is not dissimilar to what one finds in an archive of nineteenth-century newspapers: both exhibit variation and change in spelling, orthography and meaning among texts and across time.

Yet semantic search algorithms and text-analytic methodologies accommodate such critical facets weakly, if at all, partly because those who design them are unfamiliar with the nuances and goals of humanities-led reading practices.
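To illustrate the gap: an exact-match search over a historical corpus silently misses period spelling variants, whereas even simple approximate matching recovers them. The sketch below is a minimal illustration in Python; the corpus tokens and query are hypothetical examples, not drawn from any project dataset.

```python
from difflib import get_close_matches

# Hypothetical tokens from a historical corpus, including
# early-modern spelling variants ("publick", "musick", "shew").
tokens = ["publick", "musick", "shew", "public", "show"]

query = "public"

# Exact matching finds only the modern spelling.
exact = [t for t in tokens if t == query]

# Approximate string matching also recovers the close
# historical variant "publick" (similarity above the cutoff),
# while rejecting unrelated tokens such as "musick".
fuzzy = get_close_matches(query, tokens, n=5, cutoff=0.8)
```

Production systems would of course go much further (normalisation dictionaries, period-specific language models, semantic change detection), but the sketch shows why exact matching alone fails on diachronic data.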

CASCADE’s key innovation lies in its aim to equip humanities researchers with interdisciplinary digital skills for language analysis applicable to the wider information society: the ability to retrieve and analyse unstructured natural language diachronically and at scale, and the ability to refine the tools available to others whose work requires the inspection and interpretation of big textual data in a reliable and semantically sensitive way.

An opportunity for the Humanities

Humanities students are uniquely placed to address the text-analytics skills deficit, given their prior training in the close, rigorous, principled and systematic analysis of texts and textual meaning. Our network will provide them with skills in data science (drawing on the EDSA curriculum) and will facilitate the use of these skills in critical and creative ways to improve the state of the art in text analytics.

Such training will enhance their employability within and beyond the academic sector; create opportunities for technical innovation in sectors concerned with information retrieval (as evidenced by our non-academic partners in, for example, climate research and InsurTech); and begin a ‘trickle-down’ effect as they in turn pass on their knowledge and expertise to non-academic audiences through digital media such as Wikipedia and educational vlogs.

Further, our researchers will be able to facilitate more reliable data-driven approaches to Europe’s societal challenges, because high-volume, historical, evidential data underpins much of this challenge-based research: qualitative health data, agricultural research papers, and data on the socio-cultural impacts of energy policy and development.

Addressing the skills deficit

The skills deficit cannot be addressed by a single institution in a single country. The methodologies for longitudinal semantic analysis are complex and varied, and distinct approaches to different aspects of the challenge have been developed in universities and companies independently of one another.

Ours will be the first network to bring this knowledge together, sharing and developing it for the benefit of an international cohort of early-stage researchers and their academic and non-academic colleagues. Only through a multi-institutional, multinational, interdisciplinary collaboration can we offer the holistic and forensic training in text analytics (technical, applied, and/or qualitative) required to tackle the emergent challenges of the information society and encourage innovation in the use of cutting-edge tools.