Our most recent addition to the Energy Data Centre (EDC) came to us because the data needed publishing to support a journal article. Journal requirements are an increasingly common prompt for publishing data sets. Of course, there are plenty of other good reasons too. It helps to get the most benefit from publicly funded research. And it can enable future research to build on the original work.
The data set included the references, costs, and impacts from a systematic review first undertaken for a UKERC report on the costs and impacts of intermittent renewable energy generation. The data set was recently updated and expanded for a journal publication.
Westmill wind and solar farms – photo credit Westmill Sustainable Energy Trust (WeSET)
The data for the project was submitted to us in a multi-sheet Excel spreadsheet. Excel can be really helpful for viewing and working with data, but it’s not our preferred format because the software and format change over time and different versions behave differently. Excel nearly always rewrites date and time formats using local settings and this isn’t helpful for research data. Where possible we ask for plain text in comma-separated values (CSV) format. Lots of software can read this and you can open it in any text editor. We think that there’s a good chance that will remain true in the very long term too.
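As a small illustration of why plain text helps with dates, the sketch below writes dates to CSV as unambiguous ISO 8601 text (YYYY-MM-DD) using only Python's standard library. The function name, column names and rows are made up for illustration and are not taken from the EDC data set.

```python
import csv
import io
from datetime import date


def rows_to_csv(rows, fieldnames):
    """Write rows to CSV text, storing any dates in ISO 8601 form."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Convert date objects to text so no software has to guess the format.
        cleaned = {
            key: value.isoformat() if isinstance(value, date) else value
            for key, value in row.items()
        }
        writer.writerow(cleaned)
    return buffer.getvalue()


# Hypothetical example rows
example_rows = [
    {"study_id": "S001", "published": date(2006, 3, 1)},
    {"study_id": "S002", "published": date(2016, 12, 5)},
]
csv_text = rows_to_csv(example_rows, ["study_id", "published"])
```

Because the dates are stored as plain text, they read the same in any text editor whatever the local date settings.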
For some data, it’s not possible to use simple formats. For example, the Strategic UK Carbon Capture and Storage (CCS) Storage Appraisal Project produced geological reservoir models of five potential carbon dioxide storage sites around the UK. They are available in the EDC. The models are created in proprietary software and can only be viewed with that same software. For the data set described here, a simple format was possible, and the data provider reformatted the data for us as a collection of CSV files. We also worked on the files to ensure text values were quoted consistently and none of the commas found in the text could confuse the CSV format. The data provider also provided the metadata.
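The kind of consistent quoting described above can be sketched with Python's standard csv module. This is only an illustration, not the process we actually used; the function names and the example row (including the reference text) are invented.

```python
import csv
import io


def write_quoted_csv(rows):
    """Write rows as CSV text, quoting every non-numeric value."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def read_quoted_csv(text):
    """Read CSV text back into a list of rows (all values come back as text)."""
    return list(csv.reader(io.StringIO(text)))


# Hypothetical rows: the reference and note contain commas and quote marks.
rows = [
    ["reference", "year", "note"],
    ["Smith, J. and Jones, A.", 2015, 'Costs quoted as "system" costs'],
]
csv_text = write_quoted_csv(rows)
round_trip = read_quoted_csv(csv_text)
```

Reading the file back and checking that each row still has the expected number of columns is a quick way to confirm that no comma inside a text value has been mistaken for a column separator.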
Metadata is data about data. It’s really important because it allows users to find your data in a catalogue and to understand what they have found. Our core metadata includes the data set name and description, time and geographic coverage, the type and format of the data, who created it and who made it available, what the access rights are, and which IEA energy categories it fits in. This is all included in the EDC data catalogue entry for the data set.
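To make the list of core metadata concrete, here is a minimal sketch of a completeness check on a draft record. The field names are our own shorthand for the items listed above, not the actual EDC catalogue schema, and the draft record is invented.

```python
# Shorthand names for the core metadata items (assumed, not the real schema).
REQUIRED_FIELDS = [
    "name",
    "description",
    "time_coverage",
    "geographic_coverage",
    "data_type",
    "data_format",
    "creator",
    "publisher",
    "access_rights",
    "iea_categories",
]


def missing_metadata(record):
    """Return the required fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Hypothetical draft record with two items still to be filled in.
draft_record = {
    "name": "Example data set",
    "description": "Illustrative record only",
    "time_coverage": "2000-2016",
    "geographic_coverage": "UK",
    "data_type": "Tabular",
    "data_format": "CSV",
    "creator": "Example research team",
    "publisher": "Example data centre",
    "access_rights": "",
}
still_needed = missing_metadata(draft_record)
```

A check like this only confirms that each item has been supplied; whether the description actually makes sense to a new user still needs a human reader.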
Our advice to data providers is to think about what information they would need to make use of the data confidently themselves in a few years. The data should also make sense to someone who wasn’t connected to the original project. Do your filenames and column names make immediate sense? It is best to have a “README” file with the data – this is a plain text file with “README” in the name to guide users to it first. It should describe what’s in the files, and what the formats and column names are to help users interpret and use the data.
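One way to start a “README” is to generate a skeleton that lists every file and column, then fill in the descriptions by hand. The sketch below assumes the CSV contents are already available as text; the function name, filenames and column names are hypothetical.

```python
import csv
import io


def readme_skeleton(files):
    """Build README text listing each file and its column names.

    `files` maps a filename to the CSV text it contains.
    """
    lines = ["README", "", "Files and columns", ""]
    for filename, text in files.items():
        # The first CSV row is assumed to hold the column names.
        header = next(csv.reader(io.StringIO(text)))
        lines.append(filename)
        for column in header:
            lines.append(f"  {column}: <describe this column and its units>")
        lines.append("")
    return "\n".join(lines)


# Hypothetical files for illustration
example_files = {
    "costs.csv": "study_id,cost_gbp_per_mwh\nS001,5.0\n",
    "references.csv": "study_id,reference\nS001,Example reference\n",
}
readme_text = readme_skeleton(example_files)
```

The placeholders are a prompt for the data provider: any column that is hard to describe in one line is probably one that a new user would struggle with too.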
If there are any reports or documentation that will give the research context and describe the methodology we will link to them or put a copy with the data. We linked to the original UKERC report in the data catalogue entry for this data set. It provides a much richer description of how the research was conducted than a “README” file or catalogue description can. We then review these documents to sense check them and suggest clarifications where they can help a new user.
The data held in the Energy Data Centre is actually stored in the long-term archive at the Centre for Environmental Data Analysis (CEDA). The CEDA Archive is a designated NERC data centre for atmospheric and earth observation research. The infrastructure that hosts the CEDA Archive is called JASMIN – a unique data analysis facility for environmental science. Managed jointly by the Centre for Environmental Data Analysis and STFC Scientific Computing Department, JASMIN is located at the Rutherford Appleton Laboratory in Oxfordshire. Racks of computers face each other across tight, noisy corridors of flashing lights. Everything is arranged to accommodate miles of neatly laid coloured cables. In using the CEDA Archive we know that the EDC data is backed up and actively managed for long-term preservation.
JASMIN – photo credit STFC
Once the data files and documents are ready, we just have to copy them over to our EDC area in the CEDA Archive. At the same time, we make the data catalogue page public, and then users can find the data record. Most data in the EDC is open access, but some data sets have access restrictions. Users can still see that a restricted data set is there and what the restrictions are. They can then apply for access if appropriate.
The final step is to assign the data a Digital Object Identifier (DOI). A DOI is a link to the data set that won’t change and will always take you to the data record. DOIs are widely used for citing and linking to journal papers and, increasingly, data sets and other research outputs. Once the DOI was available the journal paper could cite it. The paper then links to the data set that supported that research.
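A DOI becomes a clickable link when it is placed after the address of the DOI resolver, https://doi.org/. The sketch below tidies a few common ways of writing a DOI into that form. The function name is our own, and the DOI shown is a made-up placeholder, not the identifier for this data set.

```python
from urllib.parse import quote


def doi_to_url(doi):
    """Turn a DOI, however it was written, into a https://doi.org/ link."""
    doi = doi.strip()
    # Remove common prefixes so only the bare identifier remains.
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Percent-encode any characters that are not safe in a web address.
    return "https://doi.org/" + quote(doi, safe="/")


# Hypothetical DOI for illustration only
example_url = doi_to_url("doi:10.1234/example-dataset")
```

Citing the data set with the full link form means readers of the paper can reach the data record with one click.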
The data underpinning research is valuable and wherever possible should be accessible to users after reports and papers are published. Please get in touch if you want to discuss archiving data from a project you are working on, or have worked on.
Access the data for the project mentioned above here.