A recipe for fast(er) processing of netCDF files with Python and custom C modules

Abstract: netCDF (Network Common Data Form) is a data format commonly used to store array-oriented scientific data. A number of software tools exist for processing and viewing netCDF data. In some cases, though, the existing tools cannot do the processing required, and we have to write code for some of the data processing operations ourselves. One such case is merging two netCDF files. The core library for read-write access to netCDF files is written in C, but interfaces to the C library are available in a number of languages, including Python, C#, Perl and R. Python with the NumPy package is widely used in scientific computing, where NumPy provides fast and sophisticated array handling. Because netCDF is an array-oriented data format, the Python/NumPy combination is a good candidate for netCDF processing. In terms of performance, however, Python is not as efficient as C, which becomes a problem for computationally large tasks because the compute time grows too long. If we loop over the netCDF data, as is the case when we merge two files, the computing time increases with the dimension sizes of the arrays. The problem is compounded if we have to merge thousands of such files. In these cases we have to look for opportunities to speed up the process. This paper describes an approach in which the processing time is shortened by extracting the ‘looping’ code into a custom C module that is then called from the Python code. The C ‘looping’ code can be further parallelised using OpenMP. This gives us the best of both worlds: ease of development in Python and fast execution in C. Furthermore, the problem setup can also reduce processing time if the files are sorted in such a way that adjoining files show increasing overlap in their dimensions.
If one has access to a cluster of machines, exploiting parallelism at a coarser level by running multiple merge processes simultaneously will speed up the process more than parallelising the loop alone, provided that the number of machines available exceeds the number of cores in a single machine.
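
To make the ‘looping’ code mentioned in the abstract concrete, the following is a minimal sketch of an element-wise merge written as explicit Python loops over NumPy arrays. It is not the paper's implementation: the function name `merge_loop`, the 2-D (coordinate, column) layout, and the rule that values from the second input win on overlapping coordinates are all assumptions made for illustration. Real netCDF input and output is left out, so the arrays stand in for variables already read from two files.

```python
import numpy as np


def merge_loop(coords_a, data_a, coords_b, data_b, fill=np.nan):
    """Merge two datasets onto the union of their leading coordinates.

    Assumed semantics (hypothetical, not taken from the paper): data arrays
    are 2-D with shape (len(coords), ncols), and where both inputs hold a
    row for the same coordinate, the row from B overwrites the row from A.
    """
    merged_coords = np.union1d(coords_a, coords_b)  # sorted, unique values
    ncols = data_a.shape[1]
    merged = np.full((merged_coords.size, ncols), fill, dtype=float)

    # Explicit Python-level loops: this is the kind of hot spot that could
    # be moved into a compiled C function and parallelised there.
    for coords, data in ((coords_a, data_a), (coords_b, data_b)):
        for i, c in enumerate(coords):
            j = np.searchsorted(merged_coords, c)  # row index in the output
            for k in range(ncols):
                merged[j, k] = data[i, k]
    return merged_coords, merged


# Two small inputs that overlap on coordinates 2 and 3.
coords_a = np.array([0, 1, 2, 3])
data_a = np.full((4, 2), 1.0)
coords_b = np.array([2, 3, 4, 5])
data_b = np.full((4, 2), 2.0)

merged_coords, merged = merge_loop(coords_a, data_a, coords_b, data_b)
print(merged_coords)
print(merged)
```

The cost of the nested loops grows with the product of the dimension sizes, which is why this part of the work is a natural candidate for a compiled routine, while file handling and orchestration can stay in Python.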

[Publication date] 2017-03-16 [Publisher] CSIRO

[Subject classification] Earth Sciences (General)

