School of Mathematics

Energy data availability: The good, the bad and the ugly

Open access to energy data is currently high on the agenda. Quoting from the Energy Data Task Force (EDTF): "In all these [government, industry and societal energy] strategies and ambitions, data is recognised as crucial to building a smart system that supports achieving decarbonisation objectives and creates significant economic opportunities." Data can support all aspects of energy system operation and planning, from how local and national levels can be coordinated in operation, through to the longest term policy questions around the mix of resources we need. This agenda is being taken forward by the Modernising Energy Data Access (MEDA) innovation programme, which is developing the necessary technical platforms and governance for data sharing.

This article examines the range of reasons why data might currently be unavailable, and proposes possible ways forward where this is not covered by EDTF and successor activity. It is structured into two broad categories of data: those which do exist, and those which do not - clearly these bring quite different issues!

1. Energy data which already exist

a. Data which are, and should remain, non-discloseable

Data might be classified as non-discloseable due to established principles regarding commercial sensitivity and/or national security. While on a case-by-case basis such classification might be revisited, if it is in the public interest for such data not to be openly available, then it should remain confidential to the organisation which holds it.

b. Data which cannot be published openly, but for which limited disclosure is in the public interest

This is a very important category in which data cannot be published openly, but where there is a significant public interest in data being shared in confidence with relevant bodies, e.g. for system operation or regulatory purposes. Where relevant, aggregated summary statistics might still be published openly - a good example of this is the component availability databases coordinated by the North American Electric Reliability Corporation.
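
As a purely illustrative sketch of how confidential records might be turned into publishable summary statistics, the Python fragment below averages availability by component class and withholds any class with too few contributing companies. The records, the field layout and the threshold of three contributors are all hypothetical assumptions made for this example; they are not taken from any real database or disclosure rule.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical confidential records: (company, component class, availability).
# One record per company per class is assumed for simplicity.
RECORDS = [
    ("A", "transformer", 0.97),
    ("B", "transformer", 0.95),
    ("C", "transformer", 0.99),
    ("A", "circuit_breaker", 0.98),
    ("B", "circuit_breaker", 0.96),
]

# Assumed minimum number of contributing companies before a figure is
# published; this is an illustrative value, not a regulatory one.
MIN_CONTRIBUTORS = 3


def publishable_summary(records, min_contributors=MIN_CONTRIBUTORS):
    """Return mean availability per component class, or None if suppressed."""
    by_class = defaultdict(dict)
    for company, component_class, availability in records:
        by_class[component_class][company] = availability

    summary = {}
    for component_class, per_company in by_class.items():
        if len(per_company) >= min_contributors:
            summary[component_class] = round(mean(per_company.values()), 3)
        else:
            # Too few contributors: an average could reveal individual values.
            summary[component_class] = None
    return summary


summary = publishable_summary(RECORDS)
```

Here the transformer average is released because three companies contribute to it, while the circuit breaker figure is suppressed because only two do. A real scheme would need a properly justified disclosure rule rather than a single fixed threshold.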

c. Data which are regarded as non-discloseable, without adequate reason

Data in this category might arise when the appropriate balance between commercial and public interest has not been reached, or where the statutory framework was developed before the public interest in data availability was widely recognised. For some data it would be very difficult to persuade individual companies one by one to publish openly, but publication might be much more acceptable to companies if all of them published the same data together.

d. Data where there are no significant sensitivities, but which are not published

This is a significant category, particularly in the regulated network sector where commercial sensitivity issues are reduced by the absence of direct competition. I expect that making more network data available will be one of the major early consequences of the EDTF agenda.

e. Data which are published but not in a useful format

This frustrating category covers issues such as poor web archiving (e.g. not having good batch download functions), and poor metadata (making it difficult to identify what datasets exist or what fields within a dataset mean). The EDTF and MEDA activities should help spread good practice here, and regulatory bodies might also consider what quality of presentation of data is required to satisfy statutory responsibilities.

2. Energy data which do not yet exist

This is another category of frustrating circumstances, where a dataset would have great value, but does not exist anywhere in a useful form. This category is less directly the business of the EDTF and direct successor activity, but should still be part of research, innovation and policy agendas. Here, I address this largely through illustrative examples:

a. The dataset has never been created at all

Throughout the last 15 years of intensive activity in the areas of renewables deployment and integration, there has not been a widely available and well calibrated spatially disaggregated dataset of historic renewable resource for the UK. There have been some significant efforts on a smaller scale, e.g. Renewables Ninja, and open data releases following EDTF might contribute further valuable source data. However, the combination of producing a very high quality dataset and providing ongoing updating/curation would need to be a national strategic project.

b. Issues of experimental design

There have been numerous surveys of smart meter data or other forms of consumer/network trial in the UK, but many have not followed good practice in design of the statistical sampling - and without good sampling design it can be impossible to do useful ex post analysis (or, possibly worse, people might think there are useful conclusions to be drawn when actually there are not).
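
To make the idea of sampling design concrete, here is a minimal Python sketch of one standard approach, proportional stratified random sampling, in which the same fraction of households is drawn at random from each stratum. The population, the choice of heating type as the stratifying variable, and the function name stratified_sample are all hypothetical; this is one possible design under those assumptions, not a description of how any particular trial was or should be run.

```python
import random


def stratified_sample(households, stratum_of, fraction, seed=0):
    """Draw the same fraction at random from every stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Group the population by stratum.
    strata = {}
    for household in households:
        strata.setdefault(stratum_of(household), []).append(household)

    # Sample without replacement within each stratum.
    sample = []
    for key in sorted(strata):
        members = strata[key]
        n = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, n))
    return sample


# Hypothetical population: 800 gas-heated and 200 electrically heated homes.
households = [(i, "gas" if i < 800 else "electric") for i in range(1000)]
sample = stratified_sample(households, stratum_of=lambda h: h[1], fraction=0.1)
```

With these assumed numbers the sample contains 80 gas-heated and 20 electrically heated households, matching their shares of the population. A sample of volunteers offers no such guarantee, which is one reason ex post analysis of self-selected trials can mislead.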

c. Existing datasets are incomplete

Generally, datasets associated with larger scale components or systems are of high quality and are reasonably complete - this is simply a consequence of economies of scale. However, data on smaller components (e.g. domestic voltage electricity networks, or domestic pressure gas networks) may be much less comprehensive, or lack appropriate metadata. While the cost-benefit trade-off may be against very widespread universal data collection at more local level, this balance might be explored more carefully, or the benefits of improved sampling practice might be investigated.

One complication in the UK is that in some cases we do not have a body which can naturally take on the ongoing curation and updating of models and datasets. In the USA, for instance, this function might naturally sit with the Department of Energy National Laboratories, or, within its technical scope, the North American Electric Reliability Corporation. It usually does not work for universities to do this: while working out how to carry out data projects might be bona fide basic research, carrying them out is typically not. In this country we also tend to fund such work as a series of fixed term projects, rather than through the more natural funding structure of a long-term ongoing activity.

Conclusion and way forward

There is clearly great value in wider sharing of energy data for coordinated operation and planning of systems, and to enable analysis by a wider range of stakeholders. For data which already exist, the Energy Data Task Force and Modernising Energy Data Access activities are already developing necessary technical capabilities and governance structures - though beyond simply making numerical data available, there are wider issues such as providing important context (e.g. How were data collected? Are there measurement uncertainties?) which need to be addressed.

Where datasets do not already exist, there would be benefit in a more strategic view of where a national effort to produce datasets would add value, and what kind of bodies would be best placed to create and curate datasets. There would also be great benefit in coordination between energy analysts and other relevant communities, such as those in data technology and applied statistics, to ensure that appropriate combinations of professional skills are available.

All of this does of course need to be motivated by use cases, which is for instance the philosophy of the National Digital Twin Programme - and a forthcoming article in this series will unpick the meanings of "digital twin" and "artificial intelligence", two terms which are in common use, but which lack commonly agreed definitions.

Acknowledgments: The author acknowledges collaboration with Paul Plumptre on an earlier unpublished version of the article, and valuable advice from Matt Hastings and Sue Chadwick in the preparation of this version.

Chris Dent is Professor of Industrial Mathematics; Director of the Statistical Consulting Unit at the University of Edinburgh; and a Turing Fellow at the Alan Turing Institute.