Timely and Novel

The data deluge

This network is timely because data-analytical skills in the field of longitudinal semantic search are critical for the next generation of data discovery services, particularly in the archives and cultural heritage domains, where the data deluge remains a barrier to access.

The volume of digital data produced by the world is increasing at an exponential rate. The European Strategy for Data, citing the International Data Corporation (IDC 2018), notes that the volume of data in the world is predicted to grow from 33 zettabytes in 2018 to 175 zettabytes in 2025.[1] In other words, over those seven years the world will create more than four times as much data as the entire volume of data created between 1970 (generally accepted as the start of the ‘information age’) and 2018.

Most of this data will not be natural (human) language but machine-generated data (images, videos, audio, mobile signals and data transmitted by the Internet of Things). Even so, there are over 1.9 billion websites on the web, including 600 million blogs. Calculations based on WordPress’s market share suggest that seven million blog posts are written every day, and Twitter estimates that 500 million tweets are sent every day.

Meanwhile, each year the data archived from the UK Government’s Home Office is equivalent to the total sum of documents created in Britain between the medieval period and the mid-1980s, and the situation is likely similar in other European states. Finally, large-scale digitisation projects using diverse technologies are adding many billions of words to the (theoretically) searchable digital resources available to the general public and to researchers. For example, the Dutch National Library’s digitised archive contains more than 1.3 million newspaper issues (15 million digitised newspaper pages), 320,000 books, 1.5 million magazine pages and 1.5 million radio news reports: a grand total of 60 million pages of fully searchable text.

An opportunity and a challenge

Clearly, as a society we are creating a great deal of data, and with the IDC’s European Data Market Study 2021-2023 valuing the EU27 data economy at over 400 billion euros in 2021,[2] this data represents an opportunity for Europe to be world-leading. However, our ability to query, retrieve and analyse data is not developing at a rate, or in ways, that allow us to make best use of it. This is one of the chief barriers to achieving a ‘digital single market’ in Europe, and it limits Europe’s ability to succeed in the data-agile economy.

To address this in part, the GDPR and the EU’s Data Act seek to empower individuals and businesses by improving access to and control of data. But according to the European Data Market Study, there was an unmet demand for 338,000 data specialists in 2021, and “the supply-demand asymmetry for data workers is widening”.

The European Strategy for Data reinforces this view when articulating the skills investment ambitions of its Digital Europe programme and the common European skills data space, which aim to halve a projected deficit of one million data specialists by 2025. Speaking seven years ago, in 2015,[3] when the European data market was worth 50.4 billion euros, Gabriella Cattaneo of IDC’s European Government Consulting unit identified the most worrying shortfall in the data economy as data scientists with ‘hard’ technology skills such as programming and data analytics: people who can build and use tools that enable data to be understood and value to be extracted.

This skills gap persists in spite of the widely held view that access to more data should directly and immediately yield more social, cultural and economic value and insight, because it enables us to analyse existing phenomena (systems, behaviour, trends, etc.) more accurately and to perceive them more holistically, at the macroscopic level. Further, there is the view that access to more data should yield new value because it reveals qualities within a phenomenon that are not discernible at a small scale.

All this is of interest to the emergent, interdisciplinary field of data science, in which it is believed that a new generation of data scientists will be able to unlock the potential of data for social, cultural and economic gain. Much of this promise rests on those 338,000 missing data specialists, identified by the IDC, who can develop methodologies and tools to retrieve, shape and present large-scale data in ways that are meaningful to the end-user.

The need to be semantically meaningful

In most domains, the idea of data being retrieved and analysed in a meaningful way means semantically meaningful: a search for the string “powerful controversial caucasian leader usa” should return President Donald Trump, and the phrase “our wonderful leader” should be understood computationally as satirical in the context of most European cultures.
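To make the contrast concrete, the sketch below (a minimal illustration, not part of the network’s methodology) compares literal keyword matching with embedding-based semantic retrieval in Python. The sentence-transformers library, the ‘all-MiniLM-L6-v2’ model and the example documents are all illustrative assumptions chosen for demonstration.

    # Minimal sketch: literal keyword matching vs. embedding-based semantic
    # retrieval. Assumes the open-source sentence-transformers library; the
    # model name and example documents are illustrative choices only.
    from sentence_transformers import SentenceTransformer, util

    documents = [
        "Donald Trump, the 45th President of the United States",
        "A field guide to the birds of the Caucasus mountains",
        "Angela Merkel, Chancellor of Germany from 2005 to 2021",
    ]
    query = "powerful controversial caucasian leader usa"

    # Literal keyword search: none of the query terms appears verbatim in
    # any document, so string matching retrieves nothing useful.
    keyword_hits = [d for d in documents
                    if any(term in d.lower() for term in query.split())]
    print("keyword hits:", keyword_hits)  # -> []

    # Semantic search: embed the query and the documents in a shared vector
    # space and rank documents by cosine similarity rather than by shared
    # character strings.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vectors = model.encode(documents, convert_to_tensor=True)
    query_vector = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_vector, doc_vectors)[0]
    best = int(scores.argmax())
    print("semantic best match:", documents[best])

The point of the sketch is the design choice it embodies: retrieval operates on representations of meaning rather than on shared strings, which is what allows a query with no overlapping vocabulary to find the relevant document.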

As we become increasingly overwhelmed with data, the literal, keyword-based approach to data retrieval and analysis is insufficient, largely because human language is rarely literal or keyword-based, as humanities scholars are acutely aware. Further, the internet-driven globalisation of data, in which content is created and accessed on broadly egalitarian terms across most developed countries, produces rapid changes in meaning, together with informality and variation in spelling and grammar across language groups.

These features mirror the characteristics of large historical corpora, except that the changes which unfold diachronically across historical corpora take place almost synchronically, and at scale, in today’s data deluge. We need to move to models that are responsive to polysemy, spelling variation and semantic change, as the sketch following this paragraph illustrates for the simplest of these, spelling variation.
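As a minimal illustration of what ‘responsive to spelling variation’ can mean in practice, the following sketch uses only Python’s standard library; the historical variant spellings are invented examples, not drawn from a real corpus.

    # Minimal sketch: exact keyword lookup vs. approximate string matching
    # over (hypothetical) historical spelling variants, using only Python's
    # standard library.
    import difflib

    # Hypothetical early-modern variants of "physician" and "music".
    corpus_vocabulary = ["phisicion", "phisition", "physitian",
                         "musick", "musique"]

    for query in ["physician", "music"]:
        exact = [w for w in corpus_vocabulary if w == query]
        fuzzy = difflib.get_close_matches(query, corpus_vocabulary,
                                          n=3, cutoff=0.6)
        # Exact matching finds nothing; similarity-based matching recovers
        # the variant spellings a modern-spelling query would otherwise miss.
        print(query, "exact:", exact, "fuzzy:", fuzzy)

Polysemy and semantic change call for richer, context-sensitive representations (for example, diachronic word embeddings), but the principle is the same: matching must be tolerant of variation rather than strictly literal.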

The potential of the Humanities

Computational linguistics, and the humanities more broadly, have a unique contribution to make to this problem, as recognised by Horizon 2020’s interest in language technologies as an enabler of the digital single market, and by Horizon Europe’s funding focus on advanced computing and Big Data in Cluster 4: Digital, Industry & Space. In our view, innovation in Big Data and learning analytics arising from Cluster 4 can have a direct impact on innovation in Cluster 2: Culture, Creativity & Inclusive Society.

Humanities scholars are skilled in selecting, analysing and comprehending language-based evidence that is often unstructured, messy and idiosyncratic, and that exhibits significant variation over time; they are correspondingly sensitive to the significance of variation in data. As such, humanities researchers are uniquely equipped with the principles and techniques for comprehending language data over time. Computational linguists, in turn, focus on methodologies for comprehending language-based evidence at scale, aiming to reveal the underlying meaning of text data rather than merely identifying the keywords used to communicate that meaning.

CASCADE aims to help address the skills deficit in the EU data economy by training a new generation of data scientists who originate in the humanities and whose skills are valuable both to a new era of data-driven academic research and to our information society more broadly.

  1. See section ‘C. Competences: Empowering individuals, investing in skills and in SMEs’ in https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52020DC0066
  2. See https://www.idc.com/eu/for-eu/explore/news?id=750722d1b647ef66ba7a, p. 49.
  3. See http://2015.data-forum.eu/sites/default/files/1140-1155_Gabriela%20Cattaneo_SEC.pdf