MPI Bandwidth and Latency Benchmarks
Benchmark-specific Instructions and Constraints
***********************************************

Table of contents
=================

  o Benchmark Requirements
  o Benchmark Checklist
  o Optimization Constraints
  o Timing Issues

Benchmark Requirements
======================

The following tests are required to be performed, and the output of each
test must be returned.  A description of each test can be found in the
file README.

If the reference machine is based on a NUMA architecture, the bandwidth
and latency tests must be run both for "out of box" results and for
inter-SMP results.  For example, if a shared-memory system is composed of
8-processor boxes accessing local memory, these tests must additionally be
run with 16 processes, with 8 processes running on each box.

Inter-process Communication Bandwidth:

Four tests must be performed to measure inter-process communication
bandwidth.  To provide data on process communication performance on a
single SMP, the "com" test must be run with an MPI process count equal to
the total number of processors on one SMP, with all MPI processes
allocated on one SMP.  Additionally, the "com" application must be run
with the MPI process count equal to the total number of processors on two
SMPs.  The MPI processes must be block allocated so that the processes
with the lower half of the rank ids of MPI_COMM_WORLD are allocated to one
SMP and the processes with the upper half of the rank ids are allocated to
a second SMP.

Finally, the "com" application must be run with the MPI process count
equal to the total number of processors in the system in order to generate
a bisectional bandwidth measurement for the system.  Consider the system
divided into two halves containing equal numbers of SMPs and separated by
the fewest number of interconnect links.  The MPI processes must be block
allocated so that the processes with the lower half of the rank ids of
MPI_COMM_WORLD are allocated to SMPs on one side of the system and the
processes with the upper half of the rank ids are allocated to SMPs on the
other side of the system.  Two runs of com must be made over the entire
system to measure bandwidth, first when increasing the number of active
SMPs and second when increasing the number of active processes per SMP.
The figure of merit for each test will be the maximum bandwidth achieved.

Inter-process Latency:

As with the com test, the laten test must be run both with an MPI process
count equal to the total number of processors on one SMP, with all MPI
processes allocated to one SMP, and with the MPI process count equal to
the total number of processors on two SMPs, with the processes allocated
over two SMPs.  The figure of merit for each test will be the latency
reported when all processors are active.  As with the com test, two laten
tests will be performed over the entire system.

MPI-2 RMA Performance:

As with the com test, the rma test must be run both with an MPI process
count equal to the total number of processors on one SMP, with all MPI
processes allocated to one SMP, and with the MPI process count equal to
the total number of processors on two SMPs, with the processes allocated
over two SMPs.  The figure of merit for each test will be the maximum
bandwidth achieved.  Four rma tests will be performed over the entire
system.

Allreduce Performance:

The test "allred" must be run with the number of MPI processes equal to
the total number of processors on the reference system.  The figure of
merit for this test will be the average operation time.
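For illustration only, the following minimal sketch shows one way an
average collective operation time of the kind reported by allred could be
measured with MPI_Wtime.  It is not the allred source; the vector length,
repetition count, and reduction operation are assumptions chosen for the
example.

    /*
     * Illustrative sketch only -- not the allred source.  The vector
     * length, repetition count, and reduction operation are assumptions.
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define VEC_LEN 1024    /* assumed element count per reduction */
    #define NUM_OPS 1000    /* assumed operations per timed sample */

    int main(int argc, char **argv)
    {
        int     rank, i;
        double *in, *out, start, elapsed;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        in  = malloc(VEC_LEN * sizeof(double));
        out = malloc(VEC_LEN * sizeof(double));
        for (i = 0; i < VEC_LEN; i++)
            in[i] = (double)(rank + i);

        /* Synchronize, then time NUM_OPS back-to-back reductions. */
        MPI_Barrier(MPI_COMM_WORLD);
        start = MPI_Wtime();
        for (i = 0; i < NUM_OPS; i++)
            MPI_Allreduce(in, out, VEC_LEN, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
        elapsed = MPI_Wtime() - start;

        if (rank == 0)
            printf("mean MPI_Allreduce time: %g s over %d operations\n",
                   elapsed / NUM_OPS, NUM_OPS);

        free(in);
        free(out);
        MPI_Finalize();
        return 0;
    }

The same timing pattern, repeated for MPI_Reduce followed by MPI_Bcast,
extends naturally to the comparison used by the globalop test described
below.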
Global-Op Performance:

The test "globalop" must be run with the number of MPI processes equal to
the total number of processors on the reference system.  The figure of
merit for this test will be the MPI_Allreduce time relative to the
MPI_Reduce/MPI_Bcast time; however, all operation results will be
considered.  The "crunch" function may be modified if errors are
encountered; however, the errors and modifications must be described.

Benchmark Checklist
===================

As indicated in the README file, each test provides a means of specifying
the number of operations between time samples.  To ensure that accurate
timings are being made, the test output field "Ticks for minimum sample"
must be greater than 1000.  Note that this also requires an accurate
implementation of the MPI_Wtick function.

The following is a list of the specific tests which must be run.  If a
"[y]" appears in a command, the argument must produce application results
consistent with the MPI_Wtime granularity requirements.  The "[x]" in the
allred specification indicates that this value should be tuned so that the
allred "Time between Allreduce" field is twice as large as the "Op mean"
value.  The allred test results must show stddev/mean values of less than
.05 for both Sample sections.

The message size stop value may be reduced for the rma test if excessive
paging is encountered; however, the maximum message size returning
reasonable results must be identified.  The default behavior for the com,
laten, and rma tests, which includes a barrier within the measurement,
must be used.  The rma test may be modified to allow for specific window
allocation needs or optimizations.  All modifications must be indicated
and described.  In the event that the default test message sizes cause the
com, laten, or rma tests to fail, the message size range can be modified
by using the '-e' command line flag.  The cause of the failure must be
provided.

Command                                 Processes/Allocation
----------------------------------------------------------------------------
com -o [y]                              1 process/processor on 1 SMP
com -o [y]                              1 process/processor on 2 SMPs, block
com -o [y] -t [procs/SMP]               All processors, bisectional BW, block
com -o [y] -t [procs/SMP] -p c          All processors, bisectional BW, block
laten -o [y]                            1 process/processor on 1 SMP
laten -o [y]                            1 process/processor on 2 SMPs, block
laten -o [y] -t [procs/SMP]             All processors, bisectional BW, block
laten -o [y] -t [procs/SMP] -p c        All processors, bisectional BW, block
rma -o 1 -c [y]                         1 process/processor on 1 SMP
rma -o 1 -c [y]                         1 process/processor on 2 SMPs, block
rma -o 128 -c [y]                       1 process/processor on 1 SMP
rma -o 128 -c [y]                       1 process/processor on 2 SMPs, block
rma -o 1 -c [y] -t [procs/SMP]          All processors, bisectional BW, block
rma -o 1 -c [y] -t [procs/SMP] -p c     All processors, bisectional BW, block
rma -o 128 -c [y] -t [procs/SMP]        All processors, bisectional BW, block
rma -o 128 -c [y] -t [procs/SMP] -p c   All processors, bisectional BW, block
allred [x] [y] 10                       All processors
globalop                                All processors

Optimization Constraints
========================

It is possible that compiler optimizations may obscure the intended result
of the benchmarks by optimizing certain parts of the code.  In particular,
it is necessary to repeat the communication in each loop several times in
order to obtain a measurable time.  The vendor is required to do whatever
is needed to ensure that the compiler does not optimize the code so as to
remove, entirely, the actual communications to be timed.
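For illustration only, the following sketch shows one common way a timed
communication loop can be kept observable (the received data is consumed
after the loop) and how the "Ticks for minimum sample" criterion from the
checklist can be checked against MPI_Wtick.  It is not the benchmark's own
code; the message length, repetition count, and even/odd pairing of ranks
are assumptions.

    /*
     * Illustrative sketch only -- not the benchmark's code.  The message
     * length, repetition count, and even/odd rank pairing are assumptions.
     */
    #include <mpi.h>
    #include <stdio.h>

    #define MSG_LEN 4096    /* assumed message length in doubles    */
    #define REPS    1000    /* assumed repetitions per timed sample */

    int main(int argc, char **argv)
    {
        int        rank, size, peer, i;
        double     sbuf[MSG_LEN], rbuf[MSG_LEN];
        double     start, elapsed, checksum = 0.0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size % 2 != 0) {        /* the sketch pairs even/odd ranks */
            if (rank == 0)
                fprintf(stderr, "run with an even number of processes\n");
            MPI_Finalize();
            return 1;
        }
        peer = rank ^ 1;

        for (i = 0; i < MSG_LEN; i++)
            sbuf[i] = (double)rank;

        /* Repeat the exchange enough times to obtain a measurable time. */
        MPI_Barrier(MPI_COMM_WORLD);
        start = MPI_Wtime();
        for (i = 0; i < REPS; i++)
            MPI_Sendrecv(sbuf, MSG_LEN, MPI_DOUBLE, peer, 0,
                         rbuf, MSG_LEN, MPI_DOUBLE, peer, 0,
                         MPI_COMM_WORLD, &status);
        elapsed = MPI_Wtime() - start;

        /* Consume the received data so the transfers remain observable. */
        for (i = 0; i < MSG_LEN; i++)
            checksum += rbuf[i];

        if (rank == 0)
            printf("sample time %g s = %g timer ticks (want > 1000), "
                   "checksum %g\n",
                   elapsed, elapsed / MPI_Wtick(), checksum);

        MPI_Finalize();
        return 0;
    }

Summing the received buffer and printing the checksum gives the compiler a
visible use for the communicated data without adding work inside the timed
region.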
Other optimizations that increase performance are allowed so long as they
do not circumvent in any way the intention of the benchmark.  The
University will be the sole judge of whether or not this condition has
been satisfied.  Actual assembly language code generated by the compiler
may be required in order to make such a determination.

Note that values of "INF" (infinity), zero, or other similarly meaningless
numbers for rates in any of the results are unacceptable.  A positive,
non-zero value for time and rate must be obtained during execution of each
test.

Timing Issues
=============

The timer used in this MPI code is the standard portable MPI elapsed
timer, MPI_Wtime().  The vendor should implement this standard MPI timer
with acceptable accuracy and resolution.

Last modified on January 31, 2002 by Chris Chambreau
For information contact: Chris Chambreau -- chcham@llnl.gov

UCRL-CODE-2001-028