Project Overview

The project was conceived as a collaboration between Jon Johnson at CLOSER and Dr Suparna De at the University of Surrey to investigate the intersection of metadata annotation in the social sciences and machine learning.

We aim to disseminate our findings to the Computer Science, Social Science and Metadata communities at conferences and in journals, so that these communities can gain a better understanding of the subjects under consideration and of the benefits of cross-disciplinary collaboration.

Pilot projects

We secured a small grant from ESRC in 2021 to look at the extraction of metadata from social science questionnaires. This was supplemented by a grant from DIRAC to look at predicting concepts from question text, which is related to an adjacent problem in astronomy: that field produces a large amount of unstructured text that would benefit from conceptual classification.

DIRAC supported further work to look at more sophisticated models for concept prediction.


This is a project funded by the ESRC as part of the Future Data Services Data & Infrastructure call.

It builds on our pilot projects to automate data and metadata curation for survey data, to help overcome the current reliance on non-sustainable manual processes.

Such automation will also provide additional metadata which can be used to improve the discovery, evaluation and curation of these rich and widely used research investments. It can also be the basis for further innovations, such as automation of disclosure risk control, to support the needs of an emerging FAIR research landscape.

It will use CLOSER Discovery as a training dataset to develop ML models for the extraction of metadata from survey questionnaires, and for the annotation of that metadata to an established vocabulary to support discovery.

The Growing Up in Scotland study will be used as the target metadata and data sources to develop ML models for the identification of key variables for input into the Anonymisation Decision Making Framework and automation of disclosure risk assessment.

We will be developing novel machine learning models which are tailored to the specific challenges of semantically rich survey data collection and research datasets.

The output from these ML models aligns structural metadata (standards) with semantic metadata (controlled vocabularies and conceptual frameworks), and can be used to create metadata resources which meet the evolving needs of researchers from a range of disciplines who use longitudinal population and other survey data.

This is a collaboration between CLOSER (Jon Johnson), the University of Surrey (Dr Suparna De), the UK Data Service (Dr Deirdre Lungley) and ScotCen (Paul Bradshaw).

Metadata Automation Project


  • Automating capturing structured content from questionnaires. (2021). ESRC. ES/K000357/1
  • Machine learning to enhance metadata in cohort studies. (2021). STFC. ST/S003916/1
  • Understanding the multiple dimensions of prediction of concepts in social and biomedical science questionnaires. (2022). STFC. ST/S003916/1
  • Extraction and Utilisation of Metadata from Non-machine-actionable Documents to Improve Data Curation and Discovery. (2024). ESRC. ES/Z502935/1

Our Funders