Predicting Concepts

Strategic context

Finding questions that are useful for research has been a challenge. Questionnaires are scattered across PDFs, and where question banks exist they tend to be spread across different resources, so discovery is limited (in the main) to text search. The process is time-consuming and depends on knowing where questions are located. The CESSDA question bank seeks to bring together questions from across Europe, but aligning them with the European Language Social Science Thesaurus (ELSST) by hand will be time-consuming and will not scale.

Extracting the questions is the first problem, and it is being tackled elsewhere in this project. Providing automatic suggestions for humans to validate will be a major help in getting "the content through the system" so that data linked to the questions can be discovered.

Project aims

The RCNIC project (Machine learning to enhance metadata in cohort studies, ST/S003916/1) will take the outputs from the extraction project and auto-tag them with concepts that align with the European Language Social Science Thesaurus (ELSST). ELSST has been adopted as the primary ontology for CESSDA data; it has been in use for over a decade in the form of its English-language predecessor, HASSET.

Tagging questions and data in CLOSER Discovery with ELSST terms accounts for a significant proportion of the workload and is one of the limiting factors in scaling. CESSDA and its service providers face similar problems.

The UCL RCNIC-funded project will use the questions (question text and response domains) and linked concepts (based on ELSST) held in CLOSER Discovery, the CLOSER metadata store. The aim is to create a model that can classify existing questions against these existing concepts and predict concepts for new questions.
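As a rough illustration of the classification task described above, the sketch below trains a tiny multinomial naive Bayes classifier on question text (with the response domain appended) and returns ranked concept suggestions for a person to validate. It is not the project's model. The technique is chosen only because it fits in a few lines, and the class name ConceptSuggester, the training questions and the concept labels SMOKING and EMPLOYMENT are all invented for illustration and are not taken from CLOSER Discovery or ELSST.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class ConceptSuggester:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, concepts):
        self.word_counts = defaultdict(Counter)  # concept -> token counts
        self.doc_counts = Counter(concepts)      # concept -> number of questions
        self.vocab = set()
        for text, concept in zip(texts, concepts):
            tokens = tokenize(text)
            self.word_counts[concept].update(tokens)
            self.vocab.update(tokens)
        return self

    def suggest(self, text, top_n=2):
        """Return up to top_n (concept, probability) pairs, best first."""
        # Tokens never seen in training carry no evidence, so drop them.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for concept, n_docs in self.doc_counts.items():
            counts = self.word_counts[concept]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)  # smoothed likelihood
            log_scores[concept] = score
        # Normalise the log scores into probabilities across concepts.
        peak = max(log_scores.values())
        exp = {c: math.exp(s - peak) for c, s in log_scores.items()}
        norm = sum(exp.values())
        ranked = sorted(exp.items(), key=lambda kv: kv[1], reverse=True)
        return [(c, v / norm) for c, v in ranked[:top_n]]


# Invented examples: question text plus response domain, with a hypothetical concept.
TRAIN = [
    ("Do you smoke cigarettes at all nowadays? Yes / No", "SMOKING"),
    ("How many cigarettes a day do you usually smoke? Number", "SMOKING"),
    ("Are you currently in paid employment? Yes / No", "EMPLOYMENT"),
    ("How many hours a week do you usually work in your main job? Number", "EMPLOYMENT"),
]

model = ConceptSuggester().fit([q for q, _ in TRAIN], [c for _, c in TRAIN])
suggestions = model.suggest("Have you ever smoked cigarettes regularly?")
print(suggestions)
```

Returning a ranked list with probabilities, rather than a single label, reflects the intended workflow in which suggestions are validated by a person before being accepted.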

Further funding from DiRAC (Understanding the multiple dimensions of prediction of concepts in social and biomedical science questionnaires, ST/S003916/1) has extended this work to investigate the role that labelling accuracy, additional contextual metadata and training composition play in the accuracy and confidence of prediction.