An Experimental Comparison of Complex Objects Implementations in Big Data Systems
[Abstract] Many data management and analytics systems support complex objects. Dataflow platforms such as Spark and Flink allow programmers to manipulate sets consisting of objects from a host programming language, often Java. Document databases such as MongoDB make use of hierarchical interchange formats (most popularly JSON) that embody a data model in which individual records can themselves contain sets of records. Systems such as Dremel and AsterixDB allow complex nesting of data structures. The desire to support such complex objects forces a system designer to ask: how should complex objects be implemented in a modern data management system? In this thesis, over a suite of representative data management tasks, I experimentally evaluate the performance implications of a wide variety of complex object implementations. The choice of object implementation can have a profound effect on performance. For example, the same external sort used to perform duplicate removal can take anywhere from half an hour to fourteen and a half hours, depending on the complex object implementation. A corollary is that a bad object implementation can doom system performance. In addition, I reaffirm the value, within a modern big data system, of the classical database way of storing complex objects, in which there is no distinction between the in-memory and over-the-wire data representations.
[Publication date] [Publishing institution] Rice University
[Validity level] Objects [Subject classification]
[Keywords] [Timeliness]