Ensuring the Long Term Impact of Earth Science Data through Data Curation and Preservation

Governments fund science and its associated archives to generate impact for the societies they represent, and they demand that the research they fund has demonstrable impact. For example, the UK-based Natural Environment Research Council (NERC) is tasked with the following strategic objectives:

To deliver world-leading environmental research at the frontiers of knowledge:

  • enabling society to respond urgently to global climate change and the increasing pressures on natural resources;

  • contributing to UK leadership in predicting the regional and local impacts of environmental change from days to decades; and

  • creating and supporting vibrant, integrated research communities.

NERC funds data centers in order to accomplish its science aims; they are not preserving merely for the sake of it. In order to justify funding for long-term preservation in today’s challenging economic climate, archives must be able to present a convincing case for the value of the data in the long term. They must present strong arguments for potential impact of the data they aim to preserve, in line with the strategic goals of the bodies which fund them.

The challenge of maximizing the impact of archived data lies in much more than storage and providing access to “the bits”; it requires addressing the usability of the data both in the long term and to scientists outside the immediate field or group of producers. With much of earth science data containing observations that are not repeatable, key areas of scientific investigation become dependent on the meaningful exploitation and reuse of such data.

This article presents the potential impact of earth science data in terms of societal benefit in areas such as disaster management, water, climate change, health, ecosystems, and agriculture. Using illustrative examples from their archives, the Science Data Infrastructure for Preservation (SCIDIP-ES) earth science partners discuss how data curation and digital preservation activities are vital for ensuring impact in these areas.

Disasters

The long-term preservation of earth observation data allows researchers to more accurately assess vulnerability, strengthen preparedness and early-warning measures, and, after disaster strikes, rebuild housing and infrastructure in ways that limit future risks. The longer the earth observation time series, the more effective and accurate these assessments will be. Good quality data curation ensures such lengthy time series are available and enhances the compatibility of observations from diverse sources. Preserving data and its usability over the long term allows a better understanding of the relationship between natural disasters and climate change. Climate forecasts informed by this understanding must become an integral part of sustainable development and planning, and of strategies for adaptation and risk management. Interoperability is also a positive consequence of curation and preservation activities, which make it easier to integrate different types of disaster-related data and information from diverse sources. As a result, analysis and decision making for disaster response improve, and risk is reduced.

Risk management/reduction and planning through the analysis of historic data also has applications in many domains other than earth science and disaster management. For example, economists frequently look to the past to understand financial crises and how to manage risk within them. Urban planners frequently need access to population and transport data to base resourcing decisions on. Data curation supports research in many domains through the availability of lengthy time-series and historic data.

Advanced networks for geohazards are complex systems that acquire more than plain seismic data. They perform long-term geophysical and environmental monitoring by acquiring seismological, geomagnetic, gravimetric, accelerometric, physico-oceanographic, hydroacoustic, and bio-acoustic measurements. Scientific objectives include the study of seismic signals, tsunami generation and its hydroacoustic precursors, warning, and ambient noise characterization in terms of marine mammal sounds and environmental and anthropogenic sources. An example is the multidisciplinary abyssal observatory SN1 in the Western Ionian Sea off Eastern Sicily. Seismic data provide information on ground motion, earthquake localization, and magnitude. The SN1 observatory is a real-time tsunami detector (Figure 1; the illustrations that appeared in the print version of this article are available in the PDF) whose main customer is the Italian Civil Defense. Data are a continuous collection of samples over time. As time goes by, new algorithms and increasing computational power provide new tools to improve knowledge of the Earth using past data, which must therefore be stored with all the ancillary data needed to decode them.

As a concrete example, a seismic dataset consists of hourly time series in binary files for every acquisition channel. Keys to decode the file are contained both in its header and in code files and must contain information on the acquisition chain: type of instrument (manufacturer and model), bytes per sample, samples per second, voltage per bit count, voltage per meter, poles and zeroes of the transfer function that describes the chain (needed for the deconvolution from bytes and voltage back to the original crustal displacement), and coordinates.
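To illustrate why these keys must be preserved alongside the bits, the following is a minimal sketch of decoding one channel file. The binary layout (big-endian signed integers with no embedded header), the class and function names, and the simple scaling from counts to meters are all assumptions made for illustration; they do not describe the actual SN1 file format. The sketch also omits the poles-and-zeroes deconvolution, which a real processing chain would need.

```python
import struct
from dataclasses import dataclass
from typing import List


@dataclass
class ChannelMetadata:
    """Hypothetical subset of the decoding keys listed in the text."""
    bytes_per_sample: int       # 2 or 4 in this sketch
    samples_per_second: float
    volts_per_count: float      # "voltage per bit count"
    volts_per_meter: float      # sensor sensitivity


def decode_samples(raw: bytes, meta: ChannelMetadata) -> List[float]:
    """Convert raw counts to meters using only the scalar gains.

    Assumes big-endian signed integers; the instrument's frequency
    response (poles and zeroes) is deliberately ignored here.
    """
    if len(raw) % meta.bytes_per_sample != 0:
        raise ValueError("file length is not a whole number of samples")
    fmt_char = {2: "h", 4: "i"}[meta.bytes_per_sample]
    n = len(raw) // meta.bytes_per_sample
    counts = struct.unpack(f">{n}{fmt_char}", raw)
    return [c * meta.volts_per_count / meta.volts_per_meter for c in counts]


def sample_times(n_samples: int, meta: ChannelMetadata) -> List[float]:
    """Seconds since the start of the file for each sample."""
    return [i / meta.samples_per_second for i in range(n_samples)]
```

Without the metadata object, the same bytes could be read with a different sample width or gain and would yield plausible-looking but wrong values, which is the practical reason the ancillary information has to travel with the data.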

Information is linked through entities, and their relationships are captured in a relational database. In a physical view, some of the data are in a Storage Area Network. Processed data provide the upper level of information: the geographical coordinates of each earthquake along with its depth and magnitude. All data must be preserved and periodically reanalyzed to better understand the natural process in order to provide advice on hazards to the community and the Italian Civil Defense. This informs decisions such as where and how to build houses or power stations. The earthquake and seismic hazard history of the country relies on stored data whose design must consider performance and usability from now into the future.

Health

It has been established that changes in the natural environment can compromise human health. Climate change and extreme weather events are associated with a wide range of health risks. Emerging infectious diseases such as HIV/AIDS and Lyme disease appear to be linked to land-use changes that have opened up previously hidden pathways for disease transmission. Comprehensive datasets support prevention, research, health-care planning and delivery, and timely public alerts. It is therefore vital that a sufficiently long and comprehensible earth observation record is maintained. Data curation supports the analysis of historical records that makes this possible. Human health is influenced by many factors, and preserved data from different domains are required to gain a fuller understanding in a number of areas. Such data may come from within the health domain itself, for example historic neuroimaging scans used to track disease progression and indicators, or from social data such as poverty indicators, which have important links to health.

Air pollution is another of the major environmental problems having an impact on human health. Short- and long-term exposure to air pollutants, at levels usually experienced by urban populations, can produce a broad range of adverse health effects, ranging from respiratory symptoms to mortality (as documented by several studies).

The analysis of air quality data from monitoring networks allows the evaluation of the health impact of air pollution on urban populations. When long-term data series are available, it is possible to examine air quality trends over past years; this can provide significant information to health scientists for clinical and toxicological studies related to chronic exposure. ISPRA (the Institute for Environmental Protection and Research in Italy) manages a national-level repository of air quality data that was developed in compliance with EU Commission Directives. The repository centralizes the air quality data flows required by national or European legislation.

The repository content is published on a dedicated web portal, BRACE, which provides access to raw measurements and air quality statistics (Figure 2). The BRACE repository is constantly updated with near-real-time air quality measurements coming from local authorities (Regional Focal Points). Additionally, validated air quality measurements are uploaded to the platform on a yearly basis, where they are aggregated and sent to the European Environment Agency (EEA). This reporting supports European and national policy development and management in the field of air quality. Specific studies on the influence of air pollution on human health have been carried out using the BRACE data. For example, the impact of PM10 and ozone concentrations on the urban populations of 13 large Italian cities was investigated for the period 2002-2004, and large impacts on human health were observed.

Long-term preservation of air quality measurements is extremely important for evaluating air pollution trends, their influence on human health, and the effects of regulatory measures on these trends. In this context, long-term data preservation is a crucial tool for improving human health in the future.

Energy

A long-term, high quality earth observation record is also necessary for understanding the processes that cause fluctuations in hydro, solar, ocean, and wind power. Such a record is produced and maintained by the careful curation and preservation of data. This understanding of processes facilitates innovation in the area of sustainable energy resources.

The Natural Environment Research Council (NERC) Mesosphere-Stratosphere-Troposphere (MST) Radar Facility at Aberystwyth is an atmospheric science research station in West Wales, UK. The MST radar (Figure 3), which measures profiles of the wind through the lowest 20 kilometers of the atmosphere, has been in operation for 20 years. The high time-resolution of the observations makes this dataset unique for the British Isles. Although the data have been predominantly used by members of the atmospheric science community, an increasing number of recent projects have come from other fields of research. These have mostly been aeronautical.

For over 10 years, the Facility has also operated a number of auxiliary instruments, e.g. for measuring ground-level wind, temperature, pressure, and rainfall. Although similar datasets exist, this one has been extensively accessed by researchers from outside of the atmospheric science community. The facts that the existence of the data is easily discoverable through web searches and that NERC has an open access policy are cited as reasons. Such projects have covered subjects as diverse as animal behavior, plant growth, wind energy, and economics.

Traditionally, data from this site have been used to study atmospheric phenomena such as Rossby waves and turbulence. However, high quality curation and support by the project scientist have seen usage of these data move into new areas for exploitation outside their traditional user group.

Researchers such as Issaeva in 2009 have been able to use MST facility data to study the incorporation of wind energy into the electricity generation system. This requires a detailed analysis of wind speed in order to minimize system balancing cost and avoid a significant mismatch between supply and demand. Power generation and consumption in the electricity networks have to be balanced every minute; therefore, it is necessary to study wind speed on a one-minute time scale. The researchers were able to examine the statistical characteristics of one-minute average values of wind speed with data from the site.
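As a rough illustration of what deriving one-minute average values involves, the sketch below groups higher-rate wind speed samples into one-minute bins and averages each bin. The input layout (pairs of seconds since the start of the record and wind speed in meters per second) and the function name are assumptions for illustration; this is not the processing used in the cited thesis.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def one_minute_means(samples: Iterable[Tuple[float, float]]) -> Dict[int, float]:
    """Average wind speed per one-minute bin.

    samples: (seconds_since_start, wind_speed_m_s) pairs, in any order.
    Returns a dict mapping minute index (0, 1, 2, ...) to the mean speed
    of the samples that fall in that minute. Minutes with no samples are
    simply absent, so gaps in the record remain visible to the caller.
    """
    buckets = defaultdict(list)
    for t, speed in samples:
        buckets[int(t // 60)].append(speed)
    return {minute: mean(values) for minute, values in sorted(buckets.items())}
```

An analysis of this kind is only possible if the archive retains the original high time-resolution observations rather than pre-aggregated hourly or daily summaries.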

Another example is the work of Aglietti et al., who used MST data to investigate the collection of solar energy using a high-altitude aerostatic platform. The researchers were able to produce a preliminary sizing and costing of a facility stationed at 6 km altitude, based on realistic values.

Climate

While the socio-economic impact of climate change is widely recognized and understood, many of the underlying processes and systems causing it are not adequately understood. This impacts society’s ability to predict climate change, to take evasive action, and to plan for its consequences. An accessible, comprehensible long-term earth observation record is again vital for the development of such an understanding. Earth science archives help by not only preserving the data over the long term but also by improving the broader scientific community’s ability to access and interpret them correctly. This is necessary for understanding of climate change processes and systems. Similar requirements for the analysis and study of complex systems over long periods of time are present in economic, social, biological, and physical science domains.

The MERIS sensor on board ENVISAT (Figure 4) is a programmable, medium-spectral resolution, imaging spectrometer operating in the solar reflective spectral range. Its main mission objective is the measurement of sea color in the oceans and in coastal areas. Knowledge of sea color can be converted into a measurement of chlorophyll pigment concentration, suspended sediment concentration, and atmospheric aerosol loads over water. The impact of MERIS data on climate studies is of paramount importance, since measuring the color of the oceans allows scientists to understand the ocean carbon cycle. This is a fundamental parameter in forecasting the rise of atmospheric CO2, which is naturally absorbed by the oceans. MERIS data are also used by scientists to understand the thermal regime of the upper ocean and human-related concerns such as fisheries management.

Data produced by the MERIS sensor have been used throughout the years for different studies of the atmosphere and the land. MERIS data allow the estimation of, among others, cloud type, top height and albedo, surface pressure, and vegetation indices. The state and evolution of terrestrial vegetation is characterized by a large number of physical, biochemical, and physiological variables. A few of these can be extracted from space data, provided the appropriate inversion tools are available. They jointly determine the Fraction of Absorbed Photosynthetically Active Radiation (FAPAR), which acts at once as an integrated indicator of the status of the plant canopy and can reasonably be expected to be retrieved by remote sensing techniques. The importance of the FAPAR with respect to further applications in biosphere studies has motivated the hypothesis that the FAPAR variable can be used to quantify the presence of vegetation.

Spaceborne Earth observation data have been collected for over 40 years now. Valuable time series of data have been compiled, in particular when data continuity has been ensured through continuous missions operating the same or radiometrically similar sensors. One such example is the series of NOAA satellites carrying the Advanced Very High Resolution Radiometer (AVHRR). The mission and data delivery started in the early 1970s and is ongoing. The sensor’s medium-resolution data are still being received and archived by data centers worldwide. Products are derived and analyzed by a wide community of users (Figure 5).

The 40-year time series of NOAA AVHRR data constitutes an invaluable dataset for monitoring surface parameters for climate research, which requires long time spans of monitoring in order to be relevant for assessing changes in climate and helping to improve prediction models. Beyond the sensor data themselves, extensive scientific expertise has been developed over time on AVHRR data calibration and exploitation. Alongside the AVHRR sensor data and mission documentation, this expert knowledge, together with documented methodologies, algorithms, and scientific results, has to be preserved and made accessible to current and future generations. It is the combined body of sensor data and associated knowledge that makes the preserved data valuable and exploitable in the future.

Water

Freshwater is vital for households, agriculture, and industry and ever larger quantities will be needed for burgeoning human populations over the coming decades. Better forecasting models and understanding of the global processes that affect water supply are needed to deal with a finite resource that is coming under increasing stress. It is again the study of a long-term record that will permit the development of scientific understanding in this area. Forecasting based on historic observations is also a common requirement for economic and demographic studies.

Groundwater provides the baseline flow for streams and rivers and is also a primary source of drinking water. The assessment of groundwater reserves and quality is determined from long-term datasets. Groundwater levels provide an indication of the range of potentially variable parameters relating to resource availability, including the direction of groundwater flow, the magnitude of seasonal fluctuations in groundwater levels, and the interplay between recharge, storage, and discharge from the aquifer. The broadly cyclic nature of groundwater level fluctuations (seasonality) requires long-term datasets for the identification of any trend in the data, which may be a consequence of natural (e.g., climate) or anthropogenic changes to the stresses on these resources. Hindcast modeling is performed on long-term datasets in order to calibrate predictive modeling for resource planning. Groundwater-level data are therefore essential for the long-term management of available water resources. Long-term groundwater chemistry datasets provide information with respect to changing patterns in diffuse and dispersed contaminant sources, groundwater-bedrock interactions, and seasonality in groundwater quality. Groundwater-level and chemistry datasets are used in conjunction with each other for the development of process understanding (Figure 6), for example when predicting the impacts of climate change on water quality or undertaking source apportionment analysis. The availability of long time-series data has a direct impact on the reliability of these predictive models and therefore on the management of water resources.
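The dependence of trend detection on record length can be made concrete with a small sketch. Assuming a series of monthly mean groundwater levels that starts at the beginning of a year, averaging over each complete year cancels a repeating seasonal cycle, and a least-squares line through the annual means then gives a trend per year. The function names and the input layout are hypothetical, and this simple approach is a stand-in for illustration only, not the hindcast or predictive modeling described above.

```python
from typing import List


def annual_means(monthly_levels: List[float]) -> List[float]:
    """Mean level for each complete 12-month block; a trailing partial year is dropped."""
    full_years = len(monthly_levels) // 12
    return [sum(monthly_levels[12 * y: 12 * y + 12]) / 12 for y in range(full_years)]


def trend_per_year(monthly_levels: List[float]) -> float:
    """Ordinary least-squares slope of the annual means, in level units per year."""
    means = annual_means(monthly_levels)
    n = len(means)
    if n < 2:
        raise ValueError("at least two complete years are needed to fit a trend")
    x_bar = (n - 1) / 2
    y_bar = sum(means) / n
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(means))
    denominator = sum((x - x_bar) ** 2 for x in range(n))
    return numerator / denominator
```

With fewer than two complete years the function refuses to return a trend at all, and with only a handful of years a single wet or dry year can dominate the slope, which is the quantitative face of the argument for long, well-curated records.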

Ecosystems

Terrestrial, coastal, and marine ecosystems provide essential socio-economic and environmental benefits. Ecosystems the world over, however, are under tremendous stress from rapid land-use change, pollution, and the overexploitation of natural resources. The protection of ecosystems requires long-term monitoring. SCIDIP-ES supports the interoperability of diverse data from different time periods, which greatly improves a scientist’s ability to monitor changes in these ecosystems.

For example, the European Commission’s Marine Strategy Framework Directive promotes the use of an ecosystem-level approach to the management of the marine environment and recommends the monitoring of the human impacts on an ecosystem to ensure its preservation whilst allowing sustainable use of the marine resources. This monitoring and prediction of the response of a specified ecosystem to any change due to these pressures requires access to good quality and well-documented long time-series data. Large volumes of marine data of this type are available but there are often significant barriers to their discovery and reuse; the most significant of these being the use of different standards, formats, and co-ordinate systems both across national boundaries and even between organizations in the same country.

This challenge is currently being addressed by several European initiatives with which SCIDIP-ES will be collaborating closely, including the EU funded Geo-Seas project that is developing an e-infrastructure for the delivery and exchange of marine geoscience data. The project has adopted and adapted the existing architecture and technologies developed within the oceanography community, resulting in a common e-infrastructure for the delivery of both marine geoscience and oceanography data. The implementation of this joint e-infrastructure has led to increased interoperability across a wide range of data types in use within the marine domain. This has been achieved through the use of a common metadata standard that conforms to the ISO 19115 standard, makes use of a suite of common vocabularies, and provides access to the data in an agreed set of data delivery and exchange formats that are commonly used throughout the marine domain.

The implementation of this joint infrastructure across a large number of European marine data centers is facilitating the development of this ecosystem level multidisciplinary science through the development of a common approach to management of the data.

Agriculture

Data curation also supports the sustainable management of agriculture by helping, as discussed above, to forecast climate change, water shortages, and damage to ecosystems. Food supplies depend on these trends within the natural environment, including weather and climate, freshwater supplies, soil moisture, and other variables. This will allow the management of agriculture in a sustainable way. Sustainability relies not only on environmental factors but also on supply and demand, population changes, and transport issues.

The UK’s Natural Environment Research Council Airborne Research & Survey Facility (ARSF) records hyperspectral data, LiDAR measurements of ground height, and aerial photographs of specific survey sites onboard an extensively modified Dornier 228 aircraft. This airborne remote sensing facility provides an efficient method for the rapid collection of data over a specified area—supporting a wide range of applications, including environmental science, geomorphology, archaeology, ecology, geological surveying, pollution control, and disaster management.

Data are processed by the ARSF data analysis team at the Plymouth Marine Laboratory and are archived at the National Earth Observation Data Centre (NEODC). As partners in the European Facility for Airborne Research in Environmental and Geo-sciences (EUFAR), an EU framework 7 project, the NERC ARSF supported the Soil Erosion Detection within MEDiterranean agricultural areas using HYperspectral data Project (SEDMEDHY). SEDMEDHY collected hyperspectral field measurements of different soil (volcanic and calcareous), vegetation (esparto and dwarf palm), and natural mixtures (non-photosynthetic pasture and biogenic soil crust), which were used to analyze spectral variability in semi-arid environments (Figure 7). The combination of the Visible/Near Infra Red and Short Wave Infra Red wavelengths enabled the discrimination and quantification of soil properties and plant species based on the differences in absorption properties, although the high variability of the landscape and subtle differences between components make it necessary to use high-spectral resolution information.

Escribano, et al. used such data to monitor changes in soil and vegetation in fragile Mediterranean semi-arid environments. The preservation and discovery mechanisms will enhance the prospect of comparative analysis of such environments in the future.

Return on Investment

Initial data production and subsequent archiving and preservation are expensive. The project will ensure that data are available in the long term and eventually usable for re-purposing, with small additional costs. In this way the cost-effectiveness of research is improved dramatically. The development of business models that examine value, cost, and sustainability for data preservation (in particular Earth Science data) and service provision permits the design and long-term management of scientific data assets. Such data assets offer a return both in the form of direct financial return and of benefits to society, which are funded through government, charitable, and other forms of financial support providing sustainability.

Preserved datasets, both in the Earth Science domain and in other domains (scientific and non-scientific), can be utilized to generate additional data and information, and can contribute to the output of additional scientific research activities. They can therefore be used for research evaluation and evolution, which is becoming increasingly important. There is often a lack of awareness and an underestimation of the real intellectual capital held by different organizations. Many organizations fail to appreciate the wealth of knowledge they hold or have access to, or the benefits of allowing users to fully exploit their holdings. Losing such information, or preserving it so poorly that it cannot be used in the future, is tantamount to devaluing the real research assets over time.

Innovation Through Interoperability

The availability and interoperability of research datasets pushes forward the research agenda, especially in the inter- and multi-disciplinary areas. It is the cross fertilization of ideas from different domains and the integration of data from different sources that permits new discoveries and innovations. Cross fertilization and data integration often require the availability of long-term series of data. Time-series data are vitally important in fields such as climate change, social studies, economics, epidemiology, and many others.

It is this interoperability with the future, supported by good data curation practices, that will allow high-quality time series of data to be created and maintained for both present and future use. Curation practices ensure the longevity of data by maintaining the semantic description and all the information necessary to make them reusable. This simultaneously makes the data accessible, usable, and interoperable. It is this availability and interoperability between datasets that will stimulate innovation and lead directly to wealth creation and improvement in the quality of life.

Sharing the Cost and Avoiding Duplication of Resources

Preservation on a dataset-by-dataset basis can be very costly. However, representation information such as structural descriptions of common file formats and semantic descriptions of common parameters can be reused across many datasets. Collaborative curation activities, such as the SCIDIP-ES project, support the reuse of such information through their registry repository services, which aim to reduce costs for archives. Smaller, less well-funded archives can also take advantage of representation information produced by larger, better-funded archives that are willing to open up their representation information for general use. Scientific domains in areas such as health, social science, and economics that capture, store, and utilize information in common ways can share these costs in the same way, making the long-term exploitation of their research assets feasible.

Improved Social and Environmental Responsibility

As a consortium we wish to raise awareness of earth science archives’ responsibility and to encourage archives to carefully curate and manage their data. In the Earth Observation domain, for example, several commercial providers own and manage satellite missions and the related data archives. The commercial exploitation of these data generates the revenues on which these commercial entities depend. The commercial value of Earth Observation data drastically reduces as data become older than three years.

As preservation of the data is costly, requiring for example migration to new archive technologies and other activities, old data are often not properly preserved by commercial providers and become unreadable or unusable after their initial commercial value has passed. It is useful to recall that even if these data have low commercial value, their scientific value remains very high (e.g., for climate change and disaster monitoring applications). Earth Science data should therefore be preserved responsibly by their owners, as they constitute an asset for humankind. Raising awareness of this will encourage archives to care for their data or, if unable or unwilling to do so themselves, to transfer them in a safe and timely manner to another organization that can then maintain them.

Esther Conway (esther.conway@stfc.ac.uk) is Data Scientist: Earth Observation, Sam Pepler (sam.pepler@stfc.ac.uk) is Head of Curation, and Wendy Garland (wendy.garland@stfc.ac.uk) is Senior Data Scientist: Aircraft, all with the Centre for Environmental Data Archival (www.ceda.ac.uk/) at Rutherford Appleton Laboratory in the UK.

David Hooper (david.hooper@stfc.ac.uk) is Project Scientist with the Natural Environment Research Council MST Radar Facility (mst.nerc.ac.uk/) in the UK.

Fulvio Marelli (fulvio.marelli@esa.int) is SCIDIP-ES Project Manager with the European Space Agency (www.esa.int) in Italy.

Luca Liberti (luca.liberti@isprambiente.it) is Researcher and Emanuela Piervitali (emanuela.piervitali@isprambiente.it) is Professor and Researcher, both with the ISPRA Institute for Environmental Protection and Research (www.isprambiente.gov.it/en/) in Italy.

Katrin Molch (katrin.molch@dlr.de) is Earth Observation Data Librarian with the German Aerospace Center [DLR] (www.dlr.de/dlr/en/) in Germany.

Helen Graves (hmg@bgs.ac.uk) is Senior Data Manager with the British Geological Survey (www.bgs.ac.uk/) in the UK.

Lucio Badiali (lucio.badiali@ingv.it) is Senior Researcher in Technology with the Istituto Nazionale di Geofisica e Vulcanologia (http://www.ingv.it/en/) in Italy.

Acknowledgments

The authors thank Vanessa Bank, Hydrogeologist with the British Geological Survey, who contributed to the project discussed in this paper. The work presented in this article was funded by the Seventh Framework Programme (FP7) of the European Commission (EC) under the Grant Agreement 283401. Data from the NERC ARSF are provided courtesy of NERC via the NERC Earth Observation Data Centre (NEODC).

Footnotes

Airborne Research & Survey Facility (ARSF)
arsf.nerc.ac.uk/

Advanced Very High Resolution Radiometer (AVHRR)
noaasis.noaa.gov/NOAASIS/ml/avhrr.html

Aglietti, G. S., S. Redi, A. R. Tatnall, and T. Markvart. “Harnessing high-altitude solar power.” IEEE Transactions on Energy Conversion, June 2009, 24 (2): 442–451. http://dx.doi.org/10.1109/TEC.2009.2016026

BRACE
www.brace.sinanet.apat.it

Envisat
https://earth.esa.int/web/guest/missions/esa-operational-eo-missions/envisat

Escribano, P., A. Palacios-Orueta, C. Oyonarte, and S. Chabrillat. “Spectral properties and sources of variability of ecosystem components in a Mediterranean semiarid environment.” Journal of Arid Environments, September 2010, 74 (9): 1041–1051. http://dx.doi.org/10.1016/j.jaridenv.2010.02.001

European Commission’s Marine Strategy Framework Directive
ec.europa.eu/environment/water/marine/directive_en.htm

EUFAR
www.eufar.net/

Geo-Seas project
www.geo-seas.eu/

ISO 19115, Geographic information – Metadata
www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=26020

Issaeva, Natalia. Quantifying the system balancing cost when wind energy is incorporated into electricity generation system. PhD thesis, The University of Edinburgh, 2009. hdl.handle.net/1842/3804

MERIS
https://earth.esa.int/web/guest/missions/esa-operational-eo-missions/envisat/instruments/meris

MST radar
mst.nerc.ac.uk/

Natural Environment Research Council (NERC)
www.nerc.ac.uk

NERC Earth Observation Data Centre (NEODC)
www.neodc.rl.ac.uk/

National Oceanic and Atmospheric Administration (NOAA) – Satellites
www.noaa.gov/satellites.html

Plymouth Marine Laboratory
www.pml.ac.uk/

SEDMEDHY project
www.eufar.net/members/bo/TA/project.php?id=1213

Science Data Infrastructure for Preservation (SCIDIP-ES) project
www.scidip-es.eu/