Managing Large Multidimensional Array Hydrologic Datasets: A Case Study Comparing NetCDF and SciDB

H. Liu, P.J.M. van Oosterom, C. Hu, Wen Wang

    Research output: Contribution to journal › Conference article › Scientific › peer-review

    Abstract

    Management of large hydrologic datasets, including storage, structuring, indexing and query, is one of the crucial challenges in the era of big data. This research originates from a specific data query problem: time series extraction at specific locations takes a long time when a large multidimensional dataset is stored in non-chunked NetCDF classic or 64-bit offset format. The essence of this issue lies in the contiguous storage structure adopted by NetCDF. In this research, NetCDF file-based solutions and a multidimensional (MD) array database management system (DBMS) applying a chunked storage structure are benchmarked to determine the best solution for storing and querying large hydrologic datasets. To achieve this, expert consultancy was conducted to establish benchmark sets. To guarantee a fair benchmark test environment, the HydroNET-4 system was utilized, and adapters for NetCDF files and SciDB were developed to manage and query data. In the final benchmark tests, the effect of data storage configurations such as chunk size and compression on query performance is also explored. Results indicate that SciDB arrays utilizing small chunk sizes show favorable performance. However, with the current implementation of SciDB, large numbers of small chunks cause a huge overload of main memory, which constrains SciDB's scalability. Compression in SciDB has either a negative effect or no effect on query performance, while it causes significant query degradation for the NetCDF-4 solution. The research illustrates that, for big hydrologic array data management, the properly chunked NetCDF-4 solution without compression is in general more efficient than the SciDB DBMS. So, in the current big data environment, traditionally adopted file-based hydroinformatic solutions can still be applicable after proper updating.
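
    The access-pattern argument in the abstract can be illustrated with a small back-of-the-envelope model. This is only a sketch under stated assumptions, not the benchmark from the paper: it assumes a row-major (time, y, x) array with time as the slowest-varying dimension, assumes whole chunks are the unit of I/O, and uses hypothetical array and chunk shapes and function names.

```python
import math

def contiguous_reads(shape):
    """Separate disk regions touched to extract one time series at a fixed
    (y, x) from a contiguous row-major (time, y, x) array.

    Assumption: time varies slowest, so consecutive values of the series
    are y * x elements apart and each needs its own small read."""
    t, _, _ = shape
    return t

def chunked_reads(shape, chunk):
    """Chunks touched, and total values read, for the same query when the
    array is split into chunks and whole chunks are read (assumption)."""
    t, _, _ = shape
    ct, cy, cx = chunk
    n_chunks = math.ceil(t / ct)
    return n_chunks, n_chunks * ct * cy * cx

# Hypothetical sizes: hourly data for one year on a 500 x 500 grid,
# with chunks elongated along the time dimension.
shape = (8760, 500, 500)
chunk = (876, 10, 10)

print(contiguous_reads(shape))      # 8760 scattered reads
print(chunked_reads(shape, chunk))  # (10, 876000): 10 chunks, 876000 values
```

    In this model the chunked layout reads more values in total but needs far fewer separate reads, which is the trade-off that chunk size tuning in the benchmark is exploring.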

    Original language: English
    Pages (from-to): 207-214
    Journal: Procedia Engineering
    Volume: 154
    Publication status: Published - 2016
    Event: 12th International Conference on Hydroinformatics - Songdo ConvensiA, Incheon, Korea, Republic of
    Duration: 21 Aug 2016 → 26 Aug 2016
    Conference number: 12

    Keywords

    • benchmark
    • chunked storage structure
    • hydrologic dataset
    • NetCDF
    • SciDB
