Automating Question Capture

Strategic Context

There is growing recognition that current data discovery resources are not meeting the changing needs of researchers. The recent launch of the Catalogue of Mental Health Measures (which primarily focuses on provenance), the reproducibility movement, the adoption of the FAIR principles, and the active re-examination of data infrastructures by funders in the UK and Europe are all indications of this. What is not evident is how existing resources can be uplifted to meet those needs in a practical sense.

Barriers to Progress in Questionnaire Capture

CLOSER Discovery was tasked with providing a richer layer of metadata about the data collection process, both for ESRC resources and for allied MRC-funded studies in biomedical science. The aim was to provide enhanced discoverability and context for the available data. In addition, it aims to repurpose these resources as a reusable question bank that provides input into questionnaire design and development, material that was not previously available as actionable metadata. It also aims to provide sufficient detail for use in post-collection data management and dissemination, either through the UKDA (ESRC) or through other mechanisms (MRC).

If the scope of CLOSER Discovery is to be expanded, or if other resources with similar capabilities are to be created, a mechanism will need to be found that ingests questionnaires into structured metadata an order of magnitude more quickly than is currently possible.

There are three main challenges:

  • Historic questionnaire capture will require high-accuracy auto-extraction from (primarily) PDFs;
  • Current and future collection will require the ability to move away from manual specification of questionnaires;
  • High-quality survey question banks will need to be provided to make the development of tools that support this viable.

Capturing Content

The general approach is that the questions and the responses, along with the instructions (which form the core part of the questions), would be extracted by algorithms trained on the corpus of CLOSER structured questionnaires and PDFs. Natural language processing (NLP) techniques such as Named Entity Recognition (NER), together with machine learning techniques (including Bayesian learning, among others), would be used to identify the questionnaire elements.
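As a purely illustrative sketch of the kind of Bayesian learning mentioned above, the following Python example trains a small multinomial naive Bayes classifier that labels a line of questionnaire text as a question, a response or an instruction. It is not the method used by CLOSER: the training lines, the three labels and the function names (tokenize, train, classify) are all hypothetical, and a real system would be trained on the CLOSER corpus and would need to handle layout, routing and multi-line items.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy training lines (assumption); a real system would be
# trained on the corpus of CLOSER structured questionnaires and PDFs.
TRAINING = [
    ("How old were you when you left school?", "question"),
    ("Do you currently smoke cigarettes?", "question"),
    ("What is your current marital status?", "question"),
    ("Yes", "response"),
    ("No", "response"),
    ("Don't know", "response"),
    ("Married or living as married", "response"),
    ("Please tick one box only", "instruction"),
    ("If no, go to question 12", "instruction"),
    ("Interviewer: show card B", "instruction"),
]


def tokenize(line):
    """Lowercase word tokens, plus a marker token for a trailing question mark."""
    tokens = re.findall(r"[a-z']+", line.lower())
    if line.strip().endswith("?"):
        tokens.append("<qmark>")
    return tokens


def train(examples):
    """Count labels and per-label tokens for a multinomial naive Bayes model."""
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for tok in tokenize(text):
            token_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, token_counts, vocab


def classify(line, model):
    """Return the label with the highest posterior log-probability."""
    label_counts, token_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # Log prior for the label.
        score = math.log(count / total)
        # Add-one (Laplace) smoothing so unseen tokens do not zero the score.
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokenize(line):
            score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING)
```

With this toy model, classify("Do you smoke?", model) returns "question" and classify("No", model) returns "response". In practice, a classifier of this kind would be only one component, used alongside NER and other techniques to identify questionnaire elements.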

The creation of this content is one of the building blocks upon which the provenance and enhanced description of data can be established.