Metadata Enhancement Project
The goal of the NSHD Showcase work has been to help the research community discover relevant NSHD datasets. To bring this about, we have undertaken a programme of work to improve the quality of the metadata held in our databases, to make labels more descriptive, and to help users discover variables in both the Showcase and Condor platforms.
This project is due for completion at the end of 2026, by which time we hope to have gone through all 35,000+ variables in the NSHD repository. Once variables have been updated, they will be deposited in the Showcase. The Essential Information section of the Showcase lists the questionnaires that have been cleaned and deposited, and those that are in the pipeline.
The metadata enhancement work can be split into two parts:
- Enhance the metadata for fields already in the NSHD databases.
- Create new metadata fields to make use of the Showcase features.
Enhancements to existing fields
- Modify all variable labels to make them more descriptive. Not all metadata in the database was entered from the questionnaires. It was usually extracted from SPSS files, which in past versions had a more stringent restriction on the number of characters allowed in a label. Variable labels will be expanded to be more descriptive and to reflect the questionnaire text more closely.
- Where possible, add the age of the participant at the end of each variable label for Condor. For new researchers not familiar with the 1946 birth cohort, it was not always obvious at what age a variable was collected just by looking at the year of data collection. The age in the label quickly allows researchers to identify at what point in the study member's life the variable was collected.
- For repeat measures (longitudinal variables), we will standardise the variable label, so the main part of the label is the same for each instance of the variable. The only difference would be the age of the participant towards the end of the label.
- Condor already had the ability to categorise variables into topics. There were 23 top-level categories. Variables could belong to multiple categories, but categories were not assigned to all variables, so many variables had none. All variables will be re-categorised into new topics. These, where possible, will match the UK Biobank categorisation. Researchers will be able to navigate their way through hierarchical categories and subcategories.
- A flag already existed in the database to highlight whether a variable was a derived variable. This was not populated for all the variables. We are updating this field as variables are prepared for the Showcase.
- The Condor database already has a field to record whether a value label is a missing value category. Such values are then excluded from statistics and plots. This field was not populated for all the variables. As part of the Showcase project, this field is being updated and is used to produce new plots and statistics for the variable.
- Value labels are also being tidied up to make them consistent. For example, there are cases where ‘No’ is labelled in different ways, e.g. No, no, NO. We have tried to make them consistent whilst cleaning the metadata.
- Both the data and metadata will be compared to identify missing value labels, which will then be curated.
- Condor already had the capability to link variables together. For the Showcase we can link variables which are the same but were collected at different data collections. The requirements for linking variables for the Showcase are:
- Question text is the same
- The variable was also coded using the same codings.
Linking in Condor can be done in the same way, but there are no conditions on how variables can be linked together. This allows us to link variables even if the Showcase requirements are not met. For example, we can link variables that form part of a scale, which is not possible on the Showcase.
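Several of the checks described in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for Condor or the Showcase: the `Variable` class, the function names, the "(age N)" label format, the sentence-case rule and the example variables are all hypothetical, and the age is approximated as the collection year minus 1946.

```python
from dataclasses import dataclass, field

BIRTH_YEAR = 1946  # the NSHD is the 1946 birth cohort


@dataclass
class Variable:
    """Hypothetical stand-in for one variable's metadata record."""
    name: str
    question_text: str
    # Maps each stored code to its value label, e.g. {1: "Yes", 2: "No"}.
    value_labels: dict = field(default_factory=dict)


def tidy_label(label: str) -> str:
    """Collapse whitespace and apply sentence case, so 'NO' and 'no' become 'No'.

    Assumption: a real clean-up would need to protect acronyms such as 'GP'.
    """
    cleaned = " ".join(label.split())
    return cleaned[:1].upper() + cleaned[1:].lower()


def label_with_age(base_label: str, collection_year: int) -> str:
    """Append the approximate age at collection to a standardised label."""
    return f"{base_label} (age {collection_year - BIRTH_YEAR})"


def unlabelled_codes(data_values, value_labels):
    """Return codes that occur in the data but have no value label."""
    return sorted(set(data_values) - set(value_labels))


def showcase_linkable(a: Variable, b: Variable) -> bool:
    """Apply the two Showcase conditions: same question text, same codings."""
    same_text = a.question_text.strip() == b.question_text.strip()
    codings_a = {code: tidy_label(lbl) for code, lbl in a.value_labels.items()}
    codings_b = {code: tidy_label(lbl) for code, lbl in b.value_labels.items()}
    return same_text and codings_a == codings_b


# Hypothetical example: the same question asked at two data collections.
smoke_a = Variable("SMOKE_A", "Do you smoke cigarettes?", {1: "Yes", 2: "NO"})
smoke_b = Variable("SMOKE_B", "Do you smoke cigarettes?", {1: "yes", 2: "No"})

print(showcase_linkable(smoke_a, smoke_b))
print(unlabelled_codes([1, 2, 9, 1], smoke_a.value_labels))
print(label_with_age("Smokes cigarettes", 1982))
```

Comparing the codings only after tidying the value labels means that purely cosmetic differences such as "NO" versus "No" do not block a link.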
Creation of new metadata
The following is a list of new metadata created for both Condor and the Showcase.
- Units – A separate field is created to capture the units of measurement.
- Reason why variable is not public – There is a flag in the database which can hide variables from researchers. We have created a field to record the reason why this has been done.
- Flag if variable is sensitive – Some variables are sensitive but are not hidden, e.g. ICD and BNF codes. These variables are flagged to notify users that they are sensitive.
- Reason why variable is flagged as sensitive – This field records the reason why the variable is flagged as sensitive.
- Notes – Notes field to record variable specific notes to help the researcher.
- New plots/statistics for the variables – Histograms and deciles are created for continuous variables, and bar charts and counts are created for categorical variables.
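As a rough illustration of the summary statistics in the last item, the sketch below computes decile cut points for a continuous variable and labelled counts for a categorical one, leaving out values flagged as missing value categories. The function names, parameter names and sample data are assumptions made for this example, and the plotting step is omitted.

```python
from collections import Counter
from statistics import quantiles


def decile_cut_points(values, missing_codes=()):
    """Return the nine cut points that split the valid values into deciles."""
    valid = [v for v in values if v not in missing_codes]
    return quantiles(valid, n=10)


def category_counts(values, value_labels, missing_codes=()):
    """Count each valid code and report it under its value label."""
    counts = Counter(v for v in values if v not in missing_codes)
    return {value_labels.get(code, str(code)): n
            for code, n in sorted(counts.items())}


# Hypothetical data: -99 and 9 stand for missing value categories.
heights = [150.5, 162.0, 171.3, 158.8, 180.2, 166.4, 175.9, 169.1,
           173.0, 160.7, 177.5, -99]
answers = [1, 2, 2, 9, 1, 2]

print(decile_cut_points(heights, missing_codes={-99}))
print(category_counts(answers, {1: "Yes", 2: "No", 9: "Not known"},
                      missing_codes={9}))
```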
The following is a list of new metadata created for just the Showcase.
- Coded_as – Derived using an algorithm that classifies variables as categorical or continuous.
- Code_type – Records the data type of the variable.
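The algorithm behind Coded_as is not described here, so the following is only one plausible heuristic, shown to make the idea concrete: a variable is treated as categorical when every valid value has a value label or when any value is non-numeric, and as continuous otherwise. The function name, its parameters and the rule itself are assumptions, not the Showcase implementation.

```python
def coded_as(values, value_labels, missing_codes=()):
    """Classify a variable as 'categorical' or 'continuous' (hypothetical rule)."""
    valid = [v for v in values if v not in missing_codes]
    # Categorical if there are valid values and each one has a value label.
    all_labelled = bool(valid) and all(v in value_labels for v in valid)
    # bool is excluded because it is a subclass of int in Python.
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in valid
    )
    if all_labelled or not all_numeric:
        return "categorical"
    return "continuous"


# Hypothetical examples.
print(coded_as([1, 2, 2, 1], {1: "Yes", 2: "No"}))
print(coded_as([172.5, 180.1, -99], {-99: "Missing"}, missing_codes={-99}))
```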