TY - JOUR
T1 - Managing large multidimensional hydrologic datasets
T2 - A case study comparing NetCDF and SciDB
AU - Liu, Haicheng
AU - Van Oosterom, Peter
AU - Tijssen, Theo
AU - Commandeur, Tom
AU - Wang, Wen
PY - 2018
Y1 - 2018
N2 - Management of large hydrologic datasets, including storage, structuring, clustering, indexing, and querying, is one of the crucial challenges in the era of big data. This research originates from a specific problem: time series extraction at specific locations takes a long time when a large multidimensional (MD) dataset is stored in the NetCDF classic or the 64-bit offset format. The essence of this issue lies in the contiguous storage structure adopted by NetCDF. In this research, NetCDF file-based solutions and an MD array database management system applying a chunked storage structure are benchmarked to determine the best solution for storing and querying large MD hydrologic datasets. Experts were consulted to establish the benchmark sets, and the HydroNET-4 system provided the benchmark environment. In the final benchmark tests, the effect of data storage configurations, namely chunk size, dimension order (spatio-temporal clustering), and compression, on query performance is explored. Results indicate that for big hydrologic MD data management, the properly chunked NetCDF-4 solution without compression is, in general, more efficient than the SciDB DBMS. However, the benefits of a DBMS should not be neglected, for example integration with other data types, smart caching strategies, transaction support, scalability, and out-of-the-box support for parallelization.
AB - Management of large hydrologic datasets, including storage, structuring, clustering, indexing, and querying, is one of the crucial challenges in the era of big data. This research originates from a specific problem: time series extraction at specific locations takes a long time when a large multidimensional (MD) dataset is stored in the NetCDF classic or the 64-bit offset format. The essence of this issue lies in the contiguous storage structure adopted by NetCDF. In this research, NetCDF file-based solutions and an MD array database management system applying a chunked storage structure are benchmarked to determine the best solution for storing and querying large MD hydrologic datasets. Experts were consulted to establish the benchmark sets, and the HydroNET-4 system provided the benchmark environment. In the final benchmark tests, the effect of data storage configurations, namely chunk size, dimension order (spatio-temporal clustering), and compression, on query performance is explored. Results indicate that for big hydrologic MD data management, the properly chunked NetCDF-4 solution without compression is, in general, more efficient than the SciDB DBMS. However, the benefits of a DBMS should not be neglected, for example integration with other data types, smart caching strategies, transaction support, scalability, and out-of-the-box support for parallelization.
KW - Chunked storage structure
KW - Hydrologic dataset
KW - NetCDF
KW - SciDB
UR - http://www.scopus.com/inward/record.url?scp=85054614809&partnerID=8YFLogxK
U2 - 10.2166/hydro.2018.136
DO - 10.2166/hydro.2018.136
M3 - Article
SN - 1464-7141
VL - 20
SP - 1058
EP - 1070
JO - Journal of Hydroinformatics
JF - Journal of Hydroinformatics
IS - 5
ER -