Metacurate-ML

The last ten years have seen a step change in the European data landscape. The Consortium of European Social Science Data Archives (CESSDA) has created and co-located a range of data services that enable both structural and semantic interoperability, allowing cross-national discovery of social science data as part of the European Open Science Cloud (EOSC).

The Metacurate-ML project is funded by the ESRC Future Data Services program and by the EPSRC through the Department for Science, Innovation and Technology AI for Science program. It brings together CLOSER, the UK Data Service (UKDS), the Computer Science Department at the University of Surrey, and the Scottish Centre for Social Research (ScotCen) to generate FAIR-ready metadata that can be used both by these emerging data services and by AI/ML technologies.

Metadata Extraction

Questionnaires and forms used in surveys are built from structurally small but semantically diverse entities, and they sit in the long tail of document types. This is a distinctly different problem from those at which state-of-the-art (SOTA) LLMs excel. The project developed a new approach, Pixel Based Density Segmentation (PBDS), which identifies entities by mimicking the way a human reads the document. The accuracy of entity matching and extraction is validated using a relatively small amount of entity-labelled training data that represents the structural dimension of the document layout. As well as improving accuracy, this approach mitigates known limitations of LLM-based retrieval on such large documents: contiguous information that crosses page boundaries, leading to hallucination; left-right information flows; and the consumption of large numbers of tokens, which reduces the cost-effectiveness of such extraction in a production pipeline.
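
The idea of segmenting a page by pixel density can be illustrated with a classic horizontal projection profile. This is only a minimal sketch and not the project's PBDS implementation: the function names, the binarised page representation (1 = ink, 0 = background) and the `min_density` and `min_gap` thresholds are all assumptions made for illustration.

```python
from typing import List, Tuple


def row_density(page: List[List[int]]) -> List[float]:
    """Fraction of inked pixels (value 1) in each non-empty row of a binarised page."""
    return [sum(row) / len(row) for row in page]


def segment_rows(
    page: List[List[int]],
    min_density: float = 0.05,  # assumed threshold for "this row contains ink"
    min_gap: int = 2,           # assumed number of blank rows that ends a band
) -> List[Tuple[int, int]]:
    """Return (start, end) row ranges, end exclusive, of dense bands.

    A band is closed only after `min_gap` consecutive low-density rows,
    so a single blank row inside an entity does not split it.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    gap = 0
    for y, density in enumerate(row_density(page)):
        if density >= min_density:
            if start is None:
                start = y
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The band ended at the first row of the current gap.
                segments.append((start, y - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Page ended inside a band; drop any trailing blank rows.
        segments.append((start, len(page) - gap))
    return segments
```

Each returned band is a candidate region (for example a question with its response options) that a downstream classifier could then label. A real pipeline would work on rendered page images and would also need column-wise segmentation.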

The PBDS approach improved the classification of entities within a questionnaire, including question text, interviewer instructions and response options, using a model fine-tuned on a wider corpus of semantically diverse, entity-labelled training data.

The output is easily transformed into the DDI-Lifecycle metadata standard, the most widely used FAIR standard in the social sciences for this type of data. Furthermore, requiring less training data than an all-LLM approach allows a wider set of potential users to adopt these methods to produce FAIR metadata that enhances existing data collections.

Semantic Interoperability

Aligning semantically equivalent data is an important consideration when linking, comparing and evaluating social science data. The information available alongside data in archives is often insufficient to make this machine-actionable, and hence usable in discovery or curation pipelines. Metadata obtained from questionnaires (see above) provides a rich resource to assist automation. In question construction, the addition or omission of a single word, or a small number of words, can completely alter the concept. Traditional textual comparison and existing LLM embeddings are insensitive to the subtle changes that discriminate between these fine-grained concepts. New embedding methods (Riemannian embeddings) have been developed that distinguish between ‘self’, ‘middle-level’ and ‘unrelated’ concepts. This reduces the search space to a size acceptable for ‘human in the loop’ validation, which is especially important in privacy-critical scenarios such as disclosure control, and in the development of ontological frameworks, such as controlled vocabularies, that support discovery and AI-ready datasets.
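
The sensitivity problem can be seen with a simple token-overlap measure. The sketch below is illustrative only and does not implement the project's embedding method: the example questions, the `triage` helper and its thresholds are assumptions, and the score passed to `triage` could come from any similarity model.

```python
import re


def tokens(text: str) -> set:
    """Lower-cased word tokens of a question."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two questions (1.0 = identical token sets)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def triage(score: float, self_min: float = 0.9, related_min: float = 0.5) -> str:
    """Map a similarity score to one of three bands (thresholds are assumed)."""
    if score >= self_min:
        return "self"
    if score >= related_min:
        return "middle-level"
    return "unrelated"


q1 = "Have you ever been diagnosed with asthma?"
q2 = "Have you never been diagnosed with asthma?"
# One changed word reverses the meaning, yet 6 of the 8 distinct tokens are shared.
lexical_score = jaccard(q1, q2)  # 0.75
```

A purely lexical score rates these two questions as close, although they ask opposite things. Banding the scores from a more discriminating model is one way to limit what a human reviewer has to check, for example by sending only the "middle-level" pairs for review.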

A new method, Dual-Branch LoRA for Invariant Representation (DLIR), has been developed. It supports multilingual conceptual comparison by disentangling domain knowledge from its linguistic features, enabling conceptual discovery and comparison without recourse to translation.

Question Bank

Output from the questionnaire extraction has successfully been ingested into a DDI-Lifecycle based RDF store. The schema is more lightweight than other serialisations such as XML, simplifying the future curation of extracted metadata and its incorporation into automated pipelines for use in classification for discovery and other data processing use cases.
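
As a rough illustration of how lightweight an RDF representation can be, the sketch below serialises one extracted question as N-Triples lines. The `http://example.org/metacurate/` namespace and the property names are hypothetical placeholders, not the DDI-Lifecycle RDF vocabulary, and the function is not the project's ingestion code.

```python
from typing import List, Optional

# Hypothetical namespace used only for this example.
EX = "http://example.org/metacurate/"


def escape_literal(value: str) -> str:
    """Escape the characters that must not appear raw in an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def question_to_ntriples(
    qid: str,
    text: str,
    instruction: Optional[str],
    options: List[str],
) -> List[str]:
    """Return one N-Triples line per extracted entity of a question.

    Assumes `qid` contains only characters that are safe inside an IRI.
    """
    subject = f"<{EX}question/{qid}>"
    lines = [f'{subject} <{EX}questionText> "{escape_literal(text)}" .']
    if instruction:
        lines.append(
            f'{subject} <{EX}interviewerInstruction> "{escape_literal(instruction)}" .'
        )
    for option in options:
        lines.append(f'{subject} <{EX}responseOption> "{escape_literal(option)}" .')
    return lines
```

Each line is a self-contained statement, so extracted entities can be appended, filtered or loaded into a triple store one at a time, without rebuilding a nested document.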

Automation of Disclosure Control

Conceptual annotations and provenance can provide contextual information to inform a range of data processing activities. In this project we will combine such metadata with the data itself so that individual data items can be evaluated, assisting human-in-the-loop automated disclosure evaluation using techniques deployed in software such as sdcMicro.
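
One common disclosure-risk technique, available in tools such as sdcMicro, is a k-anonymity check on combinations of key variables. The sketch below shows how concept annotations might select those variables before the check runs. It is a minimal illustration under stated assumptions: the annotation format, the concept labels and both function names are hypothetical, and this is not sdcMicro's API.

```python
from collections import Counter
from typing import Dict, List, Set


def key_variables(
    annotations: Dict[str, Set[str]],
    identifying_concepts: Set[str],
) -> List[str]:
    """Pick variables whose concept annotations mark them as potentially identifying."""
    return sorted(
        name
        for name, concepts in annotations.items()
        if concepts & identifying_concepts
    )


def k_anonymity_violations(
    records: List[Dict[str, str]],
    key_vars: List[str],
    k: int = 3,
) -> List[int]:
    """Return indices of records whose key-variable combination occurs fewer than k times."""
    keys = [tuple(record[var] for var in key_vars) for record in records]
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] < k]
```

The flagged record indices could then be presented to a human reviewer, who decides whether to recode, suppress or release the affected values.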


The Team

Bringing together expertise in data management, metadata and computer science

  • Jon Johnson - UCL, CLOSER (Principal Investigator)
  • Paul Bradshaw - Scottish Centre for Social Research (Co-Investigator)
  • Suparna De - University of Surrey (Co-Investigator)
  • Deirdre Lungley - UK Data Service (Co-Investigator)
  • Ivan Evodokimov - UK Data Service (Senior Developer)
  • Jacob Joy - UK Data Service (Senior Developer)
  • Justina Li - University of Surrey (Research Fellow)
  • Michelle Lin - University of Surrey (Research Engineer)
  • Oliver Lyttleton - UCL, CLOSER (Developer)
  • Becky Oldroyd - UCL, CLOSER (Metadata Manager)
  • Chandresh Pravin - University of Surrey (Research Fellow)
  • Zeqiang Wang - University of Surrey (PhD Student)
  • Sarah White - UCL, CLOSER (Metadata Assistant)