Parallel load and query processing in a distributed array database
[摘要] Scientists across many research domains collect large amounts of multi-dimensional data in their day to day work. They require high performance, scalable systems to manage and process their data. Oftentimes, the underlying distribution of these types of data is skewed and sparse, rather than dense and uniform. As input data sizes continue to grow at a rapid rate, main memory and storage capacity become bottlenecks on single machines. Thus, we look to distributed array databases as a long term solution for managing and querying this type of data. This thesis presents Multinode-TileDB, a distributed framework that extends TileDB, a new array database management system designed, from the ground up, to handle skewed and sparse arrays. We design the overall distributed architecture and propose and implement parallel algorithms for load, join, subarray, and filter while focusing on load balance and performance. Our experiments show speedup gains as cluster size increases and how different data partitioning schemes benefit the different parallel queries.
[发布日期] [发布机构] Massachusetts Institute of Technology
[效力级别] [学科分类]
[关键词] [时效性]