The last ten years have seen a significant step change in the European data landscape. The Consortium of European Social Science Data Archives (CESSDA) has created and co-located a range of data services which enable both structural and semantic interoperability, allowing cross-national discoverability of social science data as part of the European Open Science Cloud (EOSC).
The Metacurate-ML project, funded by the ESRC Future Data Service, brings together CLOSER, UK Data Service (UKDS), the Computer Science Department at the University of Surrey, and the Scottish Centre for Social Research (ScotCen) to generate metadata which is FAIR-ready and can be utilised by these emerging data services.
The specific challenges the project seeks to solve are:
The large-scale capture of questionnaire metadata into mature standards (DDI-Lifecycle), by extending existing work using pre-trained language models with recent developments from vision research and zero-shot techniques. This will allow us to extract the specific items within questionnaires, such as question texts, responses and routing, to create a rich source of metadata which links the provenance of the data collection methodology to the resultant data. Growing Up in Scotland will be the candidate study for this automated extraction.
Because variable labels are often misleading or uninformative, and because multiple-response items may be coded differently, variable-level approaches are not the best enablers of question or conceptual comparison. Questions used in real-world surveys, when put through models such as BERT, yield poor predictions even when the models are trained on large amounts of data.
Although pair-wise comparison of a question text from one study with all questions in a second study provides a basis for understanding the question asked, it is limited by not considering factors such as the response domain, where in the questionnaire the question was asked and of whom (the target population), and whether the question is related to other questions in the same study or similar studies.
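As a rough illustration of the pair-wise baseline described above, the following minimal sketch ranks the questions of one study against a single question text from another. It is not the project's method: a simple bag-of-words cosine similarity stands in for a language-model comparison, and the function names and question texts are hypothetical, invented purely for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case a question text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_matches(query, candidates):
    """Score every candidate question against the query, highest first."""
    q = Counter(tokenize(query))
    scored = [(cosine_similarity(q, Counter(tokenize(c))), c) for c in candidates]
    return sorted(scored, key=lambda s: s[0], reverse=True)


# Hypothetical question texts (assumptions, not taken from any real study).
study_a_question = "How often do you smoke cigarettes?"
study_b_questions = [
    "How many cigarettes do you smoke per day?",
    "How often do you drink alcohol?",
    "What is your current employment status?",
]

ranked = best_matches(study_a_question, study_b_questions)
for score, text in ranked:
    print(f"{score:.2f}  {text}")
```

In this toy case the alcohol question scores almost as highly as the cigarette question because the two share surface wording. This is the kind of weakness a text-only comparison has when it ignores the response domain, questionnaire position and target population.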
This rich metadata will be annotated with conceptual and vocabulary information which aligns with the European Language Social Science Thesaurus, originally developed at UKDS and now deployed across 26 countries, using the 50,000 questions available in CLOSER Discovery as the base training dataset.
Conceptual annotations and provenance can provide contextual information to inform a range of data processing activities. In this project we will combine such metadata with the data itself to support human-in-the-loop automated disclosure evaluation of data items, using the techniques deployed in software such as sdcMicro.