Managing large multidimensional hydrologic datasets: A case study comparing NetCDF and SciDB

Haicheng Liu; Peter Van Oosterom; Theo Tijssen; Tom Commandeur; Wen Wang

doi:10.2166/hydro.2018.136

Urban Data Science

Research output: Contribution to journal › Article › Scientific › peer-review

2 Citations (Scopus)

Abstract

Management of large hydrologic datasets including storage, structuring, clustering, indexing, and query is one of the crucial challenges in the era of big data. This research originates from a specific problem: time series extraction at specific locations takes a long time when a large multidimensional (MD) dataset is stored in the NetCDF classic or the 64-bit offset format. The essence of this issue lies in the contiguous storage structure adopted by NetCDF. In this research, NetCDF file-based solutions and an MD array database management system applying a chunked storage structure are benchmarked to determine the best solution for storing and querying large MD hydrologic datasets. Expert consultancy was conducted to establish benchmark sets, with the HydroNET-4 system being utilized to provide the benchmark environment. In the final benchmark tests, the effect of data storage configurations, elaborating chunk size, dimension order (spatio-temporal clustering) and compression on the query performance, is explored. Results indicate that for big hydrologic MD data management, the properly chunked NetCDF-4 solution without compression is, in general, more efficient than the SciDB DBMS. However, benefits of a DBMS should not be neglected, for example, the integration with other data types, smart caching strategies, transaction support, scalability, and out-of-the-box support for parallelization.
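
The contrast between contiguous and chunked storage described in the abstract can be illustrated with a rough cost model. The sketch below is not taken from the paper: the array shape, chunk shape, 4-byte element size, and the function names are all hypothetical, and it assumes a (time, y, x) array stored row-major with time as the slowest-varying dimension. It ignores file-system block sizes and caching, so it only shows why a single-location time series needs many scattered reads in a contiguous layout and far fewer in a chunked one.

```python
import math

# Hypothetical sizes, chosen only for illustration (not from the paper).
SHAPE = (1000, 500, 500)   # (time, y, x)
CHUNK = (100, 10, 10)      # chunk shape, elongated along time
ITEMSIZE = 4               # bytes per value, e.g. a 32-bit float


def contiguous_reads(shape):
    """Non-adjacent reads for one full time series at a single (y, x) cell
    when the array is stored contiguously in (time, y, x) order."""
    nt, _, _ = shape
    # Each time step of the same cell sits in a different 2-D slice,
    # so every time step needs its own read.
    return nt


def contiguous_stride_bytes(shape, itemsize):
    """Distance in bytes between consecutive time steps of the same cell."""
    _, ny, nx = shape
    return ny * nx * itemsize


def chunked_reads(shape, chunk):
    """Chunks touched by the same time series in a chunked layout."""
    nt, _, _ = shape
    ct, _, _ = chunk
    return math.ceil(nt / ct)


def chunk_bytes(chunk, itemsize):
    """Uncompressed size of one chunk in bytes."""
    ct, cy, cx = chunk
    return ct * cy * cx * itemsize


if __name__ == "__main__":
    print("contiguous reads:", contiguous_reads(SHAPE))
    print("stride between reads (bytes):", contiguous_stride_bytes(SHAPE, ITEMSIZE))
    print("chunks touched:", chunked_reads(SHAPE, CHUNK))
    print("bytes per chunk:", chunk_bytes(CHUNK, ITEMSIZE))
```

In this model the chunked layout reads more bytes in total, because whole chunks are fetched, but it replaces many widely separated small reads with a handful of larger ones. That trade-off is what makes chunk size and dimension order relevant to query performance.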

Original language	English
Pages (from-to)	1058-1070
Number of pages	13
Journal	Journal of Hydroinformatics
Volume	20
Issue number	5
DOIs	https://doi.org/10.2166/hydro.2018.136
Publication status	Published - 2018

Keywords

Chunked storage structure
Hydrologic dataset
NetCDF
SciDB

Cite this

@article{de98bc8fec524784aba17e77ee807f15,

title = "Managing large multidimensional hydrologic datasets: A case study comparing NetCDF and SciDB",

abstract = "Management of large hydrologic datasets including storage, structuring, clustering, indexing, and query is one of the crucial challenges in the era of big data. This research originates from a specific problem: time series extraction at specific locations takes a long time when a large multidimensional (MD) dataset is stored in the NetCDF classic or the 64-bit offset format. The essence of this issue lies in the contiguous storage structure adopted by NetCDF. In this research, NetCDF file-based solutions and an MD array database management system applying a chunked storage structure are benchmarked to determine the best solution for storing and querying large MD hydrologic datasets. Expert consultancy was conducted to establish benchmark sets, with the HydroNET-4 system being utilized to provide the benchmark environment. In the final benchmark tests, the effect of data storage configurations, elaborating chunk size, dimension order (spatio-temporal clustering) and compression on the query performance, is explored. Results indicate that for big hydrologic MD data management, the properly chunked NetCDF-4 solution without compression is, in general, more efficient than the SciDB DBMS. However, benefits of a DBMS should not be neglected, for example, the integration with other data types, smart caching strategies, transaction support, scalability, and out-of-the-box support for parallelization.",

keywords = "Chunked storage structure, Hydrologic dataset, NetCDF, SciDB",

author = "Haicheng Liu and {Van Oosterom}, Peter and Theo Tijssen and Tom Commandeur and Wen Wang",

year = "2018",

doi = "10.2166/hydro.2018.136",

language = "English",

volume = "20",

pages = "1058--1070",

journal = "Journal of Hydroinformatics",

issn = "1464-7141",

publisher = "International Water Association (IWA)",

number = "5",

}

TY - JOUR

T1 - Managing large multidimensional hydrologic datasets

T2 - A case study comparing NetCDF and SciDB

AU - Liu, Haicheng

AU - Van Oosterom, Peter

AU - Tijssen, Theo

AU - Commandeur, Tom

AU - Wang, Wen

PY - 2018

Y1 - 2018

N2 - Management of large hydrologic datasets including storage, structuring, clustering, indexing, and query is one of the crucial challenges in the era of big data. This research originates from a specific problem: time series extraction at specific locations takes a long time when a large multidimensional (MD) dataset is stored in the NetCDF classic or the 64-bit offset format. The essence of this issue lies in the contiguous storage structure adopted by NetCDF. In this research, NetCDF file-based solutions and an MD array database management system applying a chunked storage structure are benchmarked to determine the best solution for storing and querying large MD hydrologic datasets. Expert consultancy was conducted to establish benchmark sets, with the HydroNET-4 system being utilized to provide the benchmark environment. In the final benchmark tests, the effect of data storage configurations, elaborating chunk size, dimension order (spatio-temporal clustering) and compression on the query performance, is explored. Results indicate that for big hydrologic MD data management, the properly chunked NetCDF-4 solution without compression is, in general, more efficient than the SciDB DBMS. However, benefits of a DBMS should not be neglected, for example, the integration with other data types, smart caching strategies, transaction support, scalability, and out-of-the-box support for parallelization.

KW - Chunked storage structure

KW - Hydrologic dataset

KW - NetCDF

KW - SciDB

UR - http://www.scopus.com/inward/record.url?scp=85054614809&partnerID=8YFLogxK

U2 - 10.2166/hydro.2018.136

DO - 10.2166/hydro.2018.136

M3 - Article

SN - 1464-7141

VL - 20

SP - 1058

EP - 1070

JO - Journal of Hydroinformatics

JF - Journal of Hydroinformatics

IS - 5

ER -
