SCI Monitoring Hardware and Software: Supporting Performance Evaluation and Debugging.
The development of a parallel program which runs efficiently on a parallel machine is a difficult task and takes much more effort than the development of a sequential one. A programmer has to consider communication and synchronization requirements and the complexity of data accesses, as well as the problem of partitioning work and data, depending on the underlying programming model. Additionally, the potentially nondeterministic behavior of concurrent activities running on the parallel machine complicates the test and debugging phase in the software development cycle. Even when a program is validated and produces correct results, a considerable amount of work has to be done in order to tune the parallel program to efficiently exploit the resources of the parallel machine. This task becomes even more complicated on architectures supporting fine-grained execution, such as PCs clustered with novel high-speed, low-latency networks like the Scalable Coherent Interface (SCI). SCI supports memory-oriented transactions over a ringlet-based network, effectively providing a global virtual bus. Remote read latencies are on the order of 5 μs for I/O bus-based SCI adapters, as demonstrated in Chapter 4 for the LRR-TUM adapter and in Chapter 3 for the Dolphin adapters. Through these properties, SCI implements a hardware-supported distributed shared memory (DSM) system on a network of PCs. On this class of architectures, communication events cannot be observed easily by appropriate tools, since they are potentially very frequent, comparatively short, and cannot be easily distinguished from local memory reads and writes. The SMiLE system (Shared Memory in a LAN-Like Environment) [4] belongs to this class of architectures and represents a network of Pentium-II PCs clustered with the SCI interconnect. Its NUMA (non-uniform memory access) characteristics are implemented in hardware and are based on a custom PCI/SCI adapter, described in Chapter 4, that plugs into the PC's PCI local bus.
A hardware monitor, as part of an event-driven hybrid monitoring system for the SMiLE PC cluster, is able to deliver detailed information about the run-time and communication behavior of parallel programs. This information can be utilized by tools for performance evaluation and tuning as well as debugging [5][6]. The hardware monitor is implemented as a second PCI card attached side-by-side to the PCI/SCI adapter. The controlled deterministic execution approach codex [3] provides a generic method to overcome the problems arising from the nondeterministic behavior of parallel programs during the test and debugging phase. It is based on POEM (Parallel Object Execution Model) [3], a framework for modeling parallel execution independently from the underlying programming model. This allows the specification of the requirements for a deterministic execution of test cases. While for message passing codex can be implemented in a fairly straightforward way by instrumenting messaging layers, this is not possible for the DSM-oriented execution of the type of architecture mentioned above. This chapter deals with our approach to deliver run-time information to tools for performance analysis, and to integrate controlled deterministic execution into the hardware-supported DSM execution paradigm provided by the SMiLE PC cluster. We will not focus on a particular programming model; however, remote memory transactions are considered to be the base of any execution on this machine. Section 24.2 will therefore present in detail the SCI hardware of the SMiLE cluster and its accompanying hardware monitoring system. Section 24.3 then provides the background necessary to understand codex in general, while Section 24.4 explains in detail how this approach can be mapped onto the SMiLE architecture. A short description of related work follows in Section 24.5, leading to the conclusion in Section 24.6.