Automated Cache Performance Analysis And Optimization

[Abstract] While there is no lack of performance counter tools for coarse-grained measurement of cache activity, there is a critical lack of tools for relating data layout to cache behavior and, in turn, to application performance. Generally, nontrivial optimizations are either not done at all or are done "by hand," requiring significant time and expertise. To the best of our knowledge, no available tool measures the latency of memory reference instructions for particular addresses and makes this information available to users in an easy-to-use and intuitive way. In this project, we worked to enable the Open|SpeedShop performance analysis tool to gather memory reference latency information for specific instructions and memory addresses, and to display this information in an easy-to-use and intuitive way to aid performance analysts in identifying problematic data structures in their codes. This tool was primarily designed for use in the supercomputer domain as well as grid, cluster, cloud-based parallel e-commerce, and engineering systems and middleware. Ultimately, we envision a tool to automate optimization of application cache layout and utilization in the Open|SpeedShop performance analysis tool. To commercialize this software, we worked to develop core capabilities for gathering enhanced memory usage performance data from applications and to create and apply novel methods for automatic data structure layout optimizations, tailoring the overall approach to support existing supercomputer and cluster programming models and constraints. In this Phase I project, we focused on the infrastructure necessary to gather performance data and present it to users in an intuitive way. With the advent of enhanced Precise Event-Based Sampling (PEBS) counters on recent Intel processor architectures and equivalent technology on AMD processors, we are now in a position to access memory reference information for particular addresses. 
Prior to the introduction of PEBS counters, cache behavior could only be measured reliably in the aggregate across tens or hundreds of thousands of instructions. With the newest iteration of PEBS technology, cache events can be tied to a tuple of instruction pointer, target address (for both loads and stores), memory hierarchy level, and observed latency. With this information we can now begin asking questions about not only the efficiency of regions of code, but also how these regions interact with particular data structures and how these interactions evolve over time. In the short term, this information will be vital for performance analysts seeking to understand and optimize the behavior of their codes for the memory hierarchy. In the future, we can begin to ask how data layouts might be changed to improve performance and, for a particular application, what the theoretical optimal performance might be. The overall benefit to be produced by this effort was a commercial-quality, easy-to-use, and scalable performance tool that will allow both beginner and experienced parallel programmers to automatically tune their applications for optimal cache usage. Effective use of such a tool can save weeks of performance tuning effort. Easy to use. With the proposed innovations, finding and fixing memory performance issues would be more automated and would hide most or all of the performance engineering expertise "under the hood" of the Open|SpeedShop performance tool. One of the biggest public benefits of the proposed innovations is that they make performance analysis usable by a larger group of application developers. Intuitive reporting of results. The Open|SpeedShop performance analysis tool has a rich set of intuitive, yet detailed reports for presenting performance results to application developers. Our goal was to leverage this existing technology to present the results from our memory performance addition to Open|SpeedShop. Suitable for experts as well as novices. 
Application performance is getting more difficult to measure as the hardware platforms that applications run on become more complicated. This makes life difficult for application developers, who need to know more about the hardware platform, including the memory system hierarchy, in order to understand the performance of their applications. Some application developers are comfortable in that scenario, while others want to do their scientific research without having to understand all the nuances of the hardware platform on which their application runs. Our proposed innovations were aimed at supporting both expert and novice performance analysts. Useful in many markets. The enhancement to Open|SpeedShop would appeal to a broader market space, as it would be useful in scientific, commercial, and cloud computing environments. Our goal was to use technology developed initially at the and Lawrence Livermore National Laboratory combined with the development and commercial software experience of Argo Navis Technologies, LLC (ANT) to form a powerful combination to deliver these objectives.
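
As a rough illustration of the kind of analysis the abstract describes, the sketch below takes sampled memory references in the form of the tuple mentioned above (instruction pointer, target address, memory hierarchy level, observed latency) and attributes them to data structures by address range. It is a minimal sketch under stated assumptions, not the Open|SpeedShop implementation: the `MemSample` type, the `latency_by_structure` function, the region names, and all sample values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MemSample:
    """One sampled memory reference (all field names are illustrative)."""
    ip: int        # instruction pointer of the load or store
    addr: int      # target data address
    level: str     # memory hierarchy level that served the access
    latency: int   # observed latency in cycles


def latency_by_structure(samples, regions):
    """Attribute samples to named address ranges and summarize latency.

    regions maps a data-structure name to a half-open (start, end) range.
    Returns {name: (sample_count, mean_latency)}; samples that match no
    range are grouped under "<unknown>".
    """
    totals = defaultdict(lambda: [0, 0])  # name -> [count, summed latency]
    for s in samples:
        name = next(
            (n for n, (lo, hi) in regions.items() if lo <= s.addr < hi),
            "<unknown>",
        )
        totals[name][0] += 1
        totals[name][1] += s.latency
    return {n: (c, total / c) for n, (c, total) in totals.items()}


# Made-up address ranges and samples, purely for illustration.
regions = {"grid": (0x1000, 0x2000), "particles": (0x8000, 0x9000)}
samples = [
    MemSample(ip=0x400A10, addr=0x1008, level="L1", latency=4),
    MemSample(ip=0x400A10, addr=0x1F00, level="L2", latency=12),
    MemSample(ip=0x400B24, addr=0x8040, level="DRAM", latency=210),
    MemSample(ip=0x400B24, addr=0x8840, level="DRAM", latency=190),
]
report = latency_by_structure(samples, regions)

# List structures from highest to lowest mean latency.
for name, (count, mean) in sorted(report.items(), key=lambda kv: -kv[1][1]):
    print(f"{name}: {count} samples, mean latency {mean:.1f} cycles")
```

Sorting by mean latency puts the structures most likely to be problematic at the top. A real tool would also need to resolve address ranges from allocation or symbol information, which this sketch simply assumes is given.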

[Publication date] 2013-12-23 [Publishing institution]

[Validity level] [Subject classification] Mathematics (general)

[Keywords] [Timeliness]
