Chapter 3


Operating Procedures

This chapter explains how to use NEC MPI, including how to compile, link, and execute MPI programs.


3.1   Compiling and Linking MPI Programs

First, each time you log in to a VH, please execute the following command to read the setup script and set up the MPI compilation environment. {version} is the directory name corresponding to the version of NEC MPI you use. The setting remains in effect until you log out.
(For bash)
$ source /opt/nec/ve/mpi/{version}/bin/necmpivars.sh

(For csh)
% source /opt/nec/ve/mpi/{version}/bin/necmpivars.csh

It is possible to compile and link MPI programs with the MPI compilation commands corresponding to each programming language as follows:

To compile and link MPI programs written in Fortran, please execute the mpinfort/mpifort command as follows:

$ mpinfort [options] {sourcefiles}

To compile and link MPI programs written in C, please execute the mpincc/mpicc command as follows:

$ mpincc [options] {sourcefiles}

To compile and link MPI programs written in C++, please execute the mpinc++/mpic++ command as follows:

$ mpinc++ [options] {sourcefiles}
In the command lines above, {sourcefiles} means MPI program source files, and [options] means optional compiler options.
In addition to the compiler options provided by the Fortran compiler (nfort), C compiler (ncc), or C++ compiler (nc++), the NEC MPI compiler options in the following table are available.

The NEC MPI compile commands mpincc/mpicc, mpinc++/mpic++, and mpinfort/mpifort use the default versions of the compilers ncc, nc++, and nfort, respectively. If another compiler version must be used, it can be selected with the NEC MPI compile command option -compiler or with an environment variable. In this case, the compiler version and the NEC MPI version must be chosen carefully so that they match each other.

example: compiling and linking a C program with compiler version 2.x.x

$ mpincc -compiler /opt/nec/ve/bin/ncc-2.x.x program.c

Table 3-1 The List of NEC MPI Compiler Commands Options
Option Meaning
-mpimsgq | -msgq Use the MPI message queue facility for the Debugger
-mpiprof Use the MPI communication information facility and use MPI profiling interface (MPI procedure with names beginning with PMPI_). Please refer to this section for the MPI communication information facility.
-mpitrace Use the MPI procedures tracing facility. The MPI communication information facility and MPI profiling interface are also available. Please refer to this section for the MPI procedures tracing facility.
-mpiverify Use the debug assist feature for MPI collective procedures. The MPI communication information facility and MPI profiling interface are also available. Please refer to this section for the debug assist feature for MPI collective procedures.
-ftrace Use the FTRACE facility for MPI program. The MPI communication information facility and MPI profiling interface are also available. Please refer to this section for the FTRACE facility.
-show Display the sequence of compiler execution invoked by the MPI compilation command without actual execution
-ve Compile and link MPI programs to run on VE (default)
-vh | -sh Compile and link MPI programs to run on VH or SH
-static-mpi Link against MPI libraries statically, but MPI memory management library is linked dynamically (default)
-shared-mpi Link against all MPI libraries dynamically
-compiler <compiler> Specify the compiler to be invoked by the MPI compilation command; the compiler is given after this option, separated by a space. If this option is not specified, each compilation command invokes the compiler shown in the tables below. The following compilers are supported for compiling and linking MPI programs to run on VH or Scalar Host.
  • GNU Compiler Collection
    • 4.8.5
    • 8.3.0 and 8.3.1
    • 9.1.0 and compatible version
  • Intel C++ Compiler and Intel Fortran Compiler
    • 19.0.4.243 (Intel Parallel Studio XE 2019 Update 4) and compatible version
    • 19.1.2.254 (Intel Parallel Studio XE 2020 Update 2)
See also 2.10 for the use of the mpi_f08 Fortran module.
Compilation Command Invoked Compiler
mpincc/mpicc ncc
mpinc++/mpic++ nc++
mpinfort/mpifort nfort
Compilation Command with -vh/-sh Invoked Compiler
mpincc/mpicc gcc
mpinc++/mpic++ g++
mpinfort/mpifort gfortran
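
For example, you can check which compiler and options an MPI compilation command would invoke by using the -show option described above; the source file name a.c is used only for illustration:

$ mpincc -show a.c          # display the ncc invocation without actual compilation
$ mpincc -vh -show a.c      # display the gcc invocation for VH/SH without actual compilation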
Table 3-2 The List of Environment Variables of NEC MPI Compiler Commands
Environment Variable Meaning
NMPI_CC Change the compiler used by the mpincc command to compile and link MPI programs to run on VE.
NMPI_CXX Change the compiler used by the mpinc++ command to compile and link MPI programs to run on VE.
NMPI_FC Change the compiler used by the mpinfort command to compile and link MPI programs to run on VE.
NMPI_CC_H Change the compiler used by the mpincc command to compile and link MPI programs to run on VH or Scalar Host.
NMPI_CXX_H Change the compiler used by the mpinc++ command to compile and link MPI programs to run on VH or Scalar Host.
NMPI_FC_H Change the compiler used by the mpinfort command to compile and link MPI programs to run on VH or Scalar Host.

The environment variables in Table 3-2 are overridden by the -compiler option.

An example of each compiler is shown below.

example1: NEC Compiler

$ source /opt/nec/ve/mpi/2.x.x/bin/necmpivars.sh
$ mpincc a.c
$ mpinc++ a.cpp
$ mpinfort a.f90
example2: GNU compiler
(Set up the GNU compiler environment, e.g. PATH and LD_LIBRARY_PATH)
$ source /opt/nec/ve/mpi/2.x.x/bin/necmpivars.sh
$ mpincc -vh a.c
$ mpinc++ -vh a.cpp
$ mpinfort -vh a.f90
example3: Intel compiler
(Set up the Intel compiler environment, e.g. PATH and LD_LIBRARY_PATH)
$ source /opt/nec/ve/mpi/2.x.x/bin/necmpivars.sh
$ export NMPI_CC_H=icc
$ export NMPI_CXX_H=icpc
$ export NMPI_FC_H=ifort
$ mpincc -vh a.c
$ mpinc++ -vh a.cpp
$ mpinfort -vh a.f90
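example4: selecting a compiler version with an environment variable. This is an illustrative sketch; the compiler path is an assumption. The environment variable NMPI_CC (Table 3-2) selects the ncc version used by mpincc instead of the -compiler option.

$ source /opt/nec/ve/mpi/2.x.x/bin/necmpivars.sh
$ export NMPI_CC=/opt/nec/ve/bin/ncc-2.x.x
$ mpincc a.c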


3.2   Starting MPI Programs

Before use, please set up your compiler as described in 3.1, and execute the following command to read the setup script each time you log in to a VH, in order to set up the MPI execution environment. {version} is the directory name corresponding to the version of NEC MPI you use. This setting remains in effect until you log out.
(For bash)
$ source /opt/nec/ve/mpi/{version}/bin/necmpivars.sh

(For csh)
% source /opt/nec/ve/mpi/{version}/bin/necmpivars.csh

By default, the MPI libraries of the same version as the one used at compile and link time are searched, and the MPI program is dynamically linked against them as needed. By loading the setup script, the MPI libraries corresponding to the {version} above are searched instead.
Thus, when the MPI program is dynamically linked against all MPI libraries with -shared-mpi, you can switch the MPI libraries to the ones corresponding to the {version} above at runtime.

When -shared-mpi is not specified at compile and link time, the MPI program is dynamically linked against the MPI memory management library and statically linked against the other MPI libraries. The statically linked MPI libraries cannot be changed at runtime. In this case, use at runtime the same setup script that was used at compile and link time. If you use a different setup script, unexpected behavior may occur because of a version inconsistency between the MPI memory management library, which is linked dynamically at runtime, and the other MPI libraries, which were linked statically at compile and link time.

If you use hybrid execution, which consists of vector processes and scalar processes, execute the command below instead of the one above. By loading the setup script with the command below, the MPI program executed on a VH or a scalar host, in addition to the one executed on VE, is dynamically linked against the MPI libraries corresponding to the {version} below.

(For bash)
$ source /opt/nec/ve/mpi/{version}/bin/necmpivars.sh [gnu|intel] [compiler-version]

(For csh)
% source /opt/nec/ve/mpi/{version}/bin/necmpivars.csh [gnu|intel] [compiler-version]
{version} is the directory name corresponding to the version of NEC MPI that contains the MPI libraries the MPI program is dynamically linked against. [gnu|intel] is specified as the first argument, and [compiler-version] as the second argument. [compiler-version] is the compiler version used at compile and link time. You can obtain the value of each argument from the RUNPATH of the MPI program. In the example below, the first argument is gnu and the second argument is 9.1.0.
$ /usr/bin/readelf -W -d vh.out | grep RUNPATH
0x000000000000001d (RUNPATH) Library runpath: [/opt/nec/ve/mpi/2.3.0/lib64/vh/gnu/9.1.0]
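
In this case, the setup script would be read with the arguments taken from the RUNPATH above, for example:

$ source /opt/nec/ve/mpi/2.3.0/bin/necmpivars.sh gnu 9.1.0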

NEC MPI provides the MPI execution commands mpirun and mpiexec to launch MPI programs. Any of the following command lines is available:

$ mpirun [global-options] [local-options] {MPIexec} [args] [ : [local-options] {MPIexec} [args] ]...
$ mpiexec [global-options] [local-options] {MPIexec} [args] [ : [local-options] {MPIexec} [args] ]...
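
For example (the executable names prog_a and prog_b are illustrative), two different programs can be launched together in one MPI execution by separating their specifications with a colon:

$ mpirun -np 4 ./prog_a : -np 4 ./prog_b    # 4 processes of prog_a and 4 processes of prog_b in one MPI execution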


3.2.1   Specification of Program Execution

The following can be specified as MPI-execution specification {MPIexec} in the MPI execution commands:

The explanation above assumes that the Linux binfmt_misc capability has been configured, which is the default in the SX-Aurora TSUBASA software development environment. Configuring the binfmt_misc capability requires system administrator privileges. Please refer to the "SX-Aurora TSUBASA Installation Guide", or contact the system administrator for details.

It is possible to execute MPI programs by specifying the MPI-execution specification {MPIexec} as follows, even if the binfmt_misc capability has not been configured.


3.2.2   Runtime Options

The term host in runtime options indicates a VH or a VE. Please refer to the clause for how to specify hosts.

The following table shows available global options.

Table 3-3 The List of Global Options
Global Option Meaning
-machinefile | -machine <filename> A file that describes hosts and the number of processes to be launched.
The format is "hostname[:value]" per line. The default value of the number of processes (":value") is 1, if it is omitted.
-configfile <filename> A file containing runtime options.
In the file <filename>, specify one or more option lines.
Runtime options and MPI execution specifications {MPIexec} such as MPI executable file are specified on each line. If the beginning of the line is "#", that line is treated as a comment.
-hosts <host-list> Comma-separated list of hosts on which MPI processes are launched.
When the options -hosts and -hostfile are specified more than once, the hosts specified in each successive option are treated as a continuation of the list of the specified hosts.
This option must not be specified together with the option -host, -nn, or -node.
-hostfile <filename> Name of a file that specifies hosts on which MPI processes are launched.
When the options -hosts and -hostfile are specified more than once, the hosts specified in each successive option are treated as a continuation of the list of the specified hosts.
This option must not be specified together with the option -host, -nn, or -node.
-gvenode Hosts specified in the options indicate VEs.
-perhost | -ppn | -N | -npernode | -nnp <value> MPI processes in groups of the specified number <value> are assigned to respective hosts.
The assignment of MPI processes to hosts is circularly performed until every process is assigned to a host.
When this option is omitted, the default value is (P+H-1)/H, where P is the total number of MPI processes and H is the number of hosts.
-max_np <max_np> Specify the maximum number of MPI processes, including MPI processes dynamically generated at runtime. The default value is the number specified with the -np option. If multiple -np options are specified, the default value is the sum of the numbers specified with those options.
-multi Specify that the MPI program is executed on multiple hosts. Use this option if all MPI processes are generated on a single host at the start of program execution and MPI processes are then generated on the other hosts by the MPI dynamic process generation function, resulting in multiple-host execution.
-genv <varname> <value> Pass the environment variable <varname> with the value <value> to all MPI processes.
-genvall (Default) Pass all environment variables to all MPI processes except for the default environment variables set by NQSV in the NQSV request execution. Please refer to "NEC Network Queuing System V (NQSV) User's Guide" for details.
-genvlist <varname-list> Comma-separated list of environment variables to be passed to all MPI processes.
-genvnone Do not pass any environment variables.
-gpath <dirname> Set the PATH environment variable passed to all MPI processes to <dirname>.
-gumask <mode> Execute "umask <mode>" for all MPI processes.
-gwdir <dirname> Set the working directory in which all MPI processes run to <dirname>.
-gdb | -debug Open one debug screen per MPI process, and run MPI programs under the gdb debugger.
-display | -disp <X-server> X display server for debug screens in the format "host:display" or "host:display:screen".
-v | -V | -version Display the version of NEC MPI and runtime information such as environment variables.
-h | -help Display help for the MPI execution commands.
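
As an illustration of the -machinefile option (the hostnames host1 and host2 and the program name a.out are hypothetical), a machine file listing two hosts with four processes each could be used as follows:

$ cat machines
host1:4
host2:4
$ mpirun -machinefile machines -np 8 ./a.out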

Only one of the local options in the following table that select the execution target (-ve, -nve, -venode, and -vh/-sh) can be specified for each MPI executable file. When all of them are omitted, the host specified in runtime options indicates a VH.

Table 3-4 The List of Local Options
Local Option Meaning
-ve <first>[-<last>] The range of VEs on which MPI processes are executed. If this option is specified, the term host in runtime options indicates a VH.
In the interactive execution, specify the range of VE numbers.
In the NQSV request execution, specify the range of logical VE numbers.
<first> indicates the first VE number, and <last> the last VE number. <last> must not be smaller than <first>. When -<last> is omitted, -<first> is assumed to be specified.
The specified VEs are the ones attached to VHs specified immediately before this option in local options or specified in global options.
If this option is omitted and no VEs are specified, VE#0 is assumed to be specified. If this option is omitted and host or the number of hosts are not specified in the NQSV request execution, all VEs assigned by NQSV are assumed to be specified.
-nve <value> The number of VEs on which MPI processes are executed.
Corresponds to: -ve 0-<value-1>
The specified number of VEs are the ones attached to the VHs specified immediately before this option in local options or specified in global options.
-venode The term host in the options indicates a VE.
-vh | -sh Create MPI processes on Vector Hosts or Scalar hosts.
-host <host> One host on which MPI processes are launched.
-node <hostrange> The range of hosts on which MPI processes are launched.
In the interactive execution, the -venode option also needs to be specified.
If the option -hosts, -hostfile, -host, or -nn is specified, this option is ignored.
-nn <value> The number of hosts on which MPI processes are launched.
This option is available only in the NQSV request execution.
This option can be specified only once corresponding to each MPI executable file.
If this option is omitted and host or the number of hosts are not specified in the NQSV request execution, the number of hosts assigned by NQSV is assumed to be specified.
If the option -hosts, -hostfile, or -host is specified, this option is ignored.
-numa <first>[-<last>][,<...>] The range of NUMA nodes on VE on which MPI processes are executed.
<first> indicates the first NUMA node number, and <last> the last NUMA node number. <last> must not be smaller than <first>. When -<last> is omitted, -<first> is assumed to be specified.
-nnuma <value> The number of NUMA nodes on VE on which MPI processes are executed.
Corresponds to: -numa 0-<value-1>
-c | -n | -np <value> The total number of processes launched on the corresponding hosts.
The specified processes correspond to the hosts specified immediately before this option in local options or specified in global options.
When this option is omitted, the default value is 1.
-env <varname> <value> Pass the environment variable <varname> with the value <value> to MPI processes.
-envall (Default) Pass all environment variables to MPI processes except the default environment variables set by NQSV in the NQSV request execution. Please refer to "NEC Network Queuing System V (NQSV) User's Guide" for details about the default environment variables.
-envlist <varname-list> Comma-separated list of environment variables to be passed.
-envnone Do not pass any environment variables.
-path <dirname> Set the PATH environment variable passed to MPI processes to <dirname>.
-umask <mode> Execute "umask <mode>" for MPI processes.
-wdir <dirname> Set the working directory in which MPI processes run to <dirname>.
-ib_vh_memcpy_send <auto | on | off> Use VH memory copy on the sender side of a VE process for InfiniBand communication. This option has higher priority than the environment variable NMPI_IB_VH_MEMCPY_SEND.

auto:
Use sender side VH memory copy for InfiniBand communication through Root Complex.
(default for Intel machines)

on:
Use sender side VH memory copy for InfiniBand communication (independent on Root Complex).
(default for non-Intel machines)

off:
Don't use sender side VH memory copy for InfiniBand communication.
-ib_vh_memcpy_recv <auto | on | off> Use VH memory copy on the receiver side of a VE process for InfiniBand communication. This option has higher priority than the environment variable NMPI_IB_VH_MEMCPY_RECV.

auto:
Use receiver side VH memory copy for InfiniBand communication through Root Complex.

on:
Use receiver side VH memory copy for InfiniBand communication (independent on Root Complex).
(default for non-Intel machines)

off:
Don't use receiver side VH memory copy for InfiniBand communication.
(default for Intel machines)
-dma_vh_memcpy <auto | on | off> Use VH memory copy for a communication between VEs in VH. This option has higher priority than the environment variable NMPI_DMA_VH_MEMCPY.

auto:
Use VH memory copy for a communication between VEs in VH through Root Complex.
(default)

on:
Use VH memory copy for a communication between VEs in VH.
(independent on Root Complex).

off:
Don't use VH memory copy for a communication between VEs in VH.
-vh_memcpy <auto | on | off> Use VH memory copy for the InfiniBand communication and the communication between VEs in VH. This option has higher priority than the environment variable NMPI_VH_MEMCPY.


auto:
In the case of InfiniBand communication, sender side VH memcpy is used if the communication goes through Root Complex. In the case of a communication between VEs in VH, VH memory copy is used if the communication goes through Root Complex.
on:
VH memory copy is used.
off:
VH memory copy is not used.

Note:
The option -ib_vh_memcpy_send, -ib_vh_memcpy_recv and -dma_vh_memcpy are higher priority than this option.
-vpin | -vpinning Print info on assigned cpu id's of MPI processes on VH's, scalar hosts or NUMA nodes on VEs.
This option is valid for -pin_mode, -cpu_list, -numa, -nnuma option.
-pin_mode <consec | spread | consec_rev | spread_rev | scatter | no | none | off> Specify how the affinity of MPI processes on a VH or scalar host is controlled.

consec | spread :
Assign next free cpu ids to MPI processes. Assigning of cpu ids starts with cpu id 0.

consec_rev | spread_rev:
Assign next free (in reverse order) cpu ids to MPI processes. Assigning of cpu ids starts with highest cpu id.

scatter:
Look for a maximal distance to already assigned cpu ids and assign next free cpu ids to MPI processes.

none | off | no :
No pinning of MPI processes to cpu id's. The default pinning mode is 'none'.

Note:
(*) Specifying flag "-pin_mode" disables preceding "-cpu_list".
(*) If the number of free cpu id's is not sufficient to assign cpu_id's, NO cpu id is assigned to the MPI process.
-pin_reserve <num-reserved-ids>[H|h] Specify the number of cpu ids to be reserved per MPI process on VH or scalar host for the pinning method specified with the flag "-pin_mode". If the optional 'h' or 'H' is added to the number, the cpu id's of associated Hyperthreads are also utilized if available.
The number of reserved ids must be greater than 0.
The default number is 1.
-cpu_list | -pin_cpu <first-id>[-<last-id>[-<increment>[-<num-reserved-ids>[H|h]]]][,...] Specify a comma-separated list of cpu ids for the processes to be created. <first-id> specifies the cpu id assigned to the first MPI process on the node. Cpu id <first-id + increment> is assigned to the next MPI process, and so on. <last-id> specifies the last cpu id that is assigned. <num-reserved-ids> specifies the number of cpu ids reserved per MPI process for multithreaded applications. If the optional 'h' or 'H' is added to <num-reserved-ids>, the cpu ids of Hyperthreads are also utilized if available.

Default values if not specified:
<last-id> = <first-id>
<increment> = 1
<num-reserved-ids> = 1

Note:
(*) Specifying flag "-cpu_list" disables preceding "-pin_mode".
(*) If the number of free cpu ids is not sufficient to assign <num-reserved-ids> cpu ids, NO cpu id is assigned to the MPI process.
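
A few interactive invocations combining the local options above are sketched below; the executable names, VE numbers, and NUMA configuration are illustrative assumptions:

$ mpirun -vh -np 2 ./vh.out : -ve 0 -np 8 ./ve.out   # hybrid execution: 2 VH processes and 8 VE processes
$ mpirun -ve 0 -numa 0-1 -np 8 ./ve.out              # 8 processes over NUMA nodes 0 and 1 of VE#0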


3.2.3   Specification of Hosts

Hosts corresponding to MPI executable files are determined according to the specified runtime options as follows:
  1. MPI executable files for which the -venode option is not specified (Default)

    A host indicates a VH in this case. VHs are specified as shown in the following table.

    Table 3-5 Specification of VHs
    Execution Method Format Description
    Interactive execution VH name
    • The hostname of a VH, which is a host computer.
    NQSV request execution <first>[-<last>]
    • <first> is the first logical VH number and <last> the last.
    • To specify one VH, omit -<last>.
      In particular specify only <first> in the options -hosts, -hostfile, and -host.
    • <last> must not be smaller than <first>.
  2. MPI executable files for which the -venode option is specified

    A host indicates a VE in this case. VEs are specified as shown in the following table.
    Please note that the -ve option cannot be specified for the MPI executable file for which the -venode option is specified.

    Table 3-6 Specification of VEs
    Execution Method Format Description
    Interactive execution <first>[-<last>][@<VH>]
    • <first> is the first VE number and <last> the last.
    • <VH> is a VH name. When omitted, the VH on which the MPI execution command has been executed is selected.
    • To specify one VE, omit -<last>.
      In particular, specify only <first> in the options -hosts, -hostfile, and -host.
    • <last> must not be smaller than <first>.
    NQSV request execution <first>[-<last>][@<VH>]
    • <first> is the first logical VE number and <last> the last.
    • <VH> is a logical VH number. When omitted, hosts (VEs) are selected from the ones NQSV allocated.
    • To specify one VE, omit -<last>.
      In particular specify only <first> in the options -hosts, -hostfile, and -host.
    • <last> must not be smaller than <first>.
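
For example, VEs can be specified directly in the interactive execution as sketched below, based on Table 3-6; the VH name hostA, the VE numbers, and the executable name ve.out are illustrative assumptions:

$ mpirun -gvenode -hosts 0,1 -np 8 ./ve.out      # VE#0 and VE#1 on the local VH
$ mpirun -venode -host 2@hostA -np 4 ./ve.out    # VE#2 attached to the VH hostA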


3.2.4   Environment Variables

The following table shows the environment variables whose values users can set.

Table 3-7   Environment Variables Set by Users
Environment Variable Available Value Meaning
NMPI_COMMINF Control the display of MPI communication information. To use the MPI communication information facility, you need to build the MPI program with the option -mpiprof, -mpitrace, -mpiverify, or -ftrace. Please refer to this section for the MPI communication information facility.
NO (Default) Not display the communication information.
YES Display the communication information in the reduced format.
ALL Display the communication information in the extended format.
MPICOMMINF The same as the environment variable NMPI_COMMINF The same as the environment variable NMPI_COMMINF.
If both are specified, the environment variable NMPI_COMMINF takes precedence.
NMPI_COMMINF_VIEW Specify the display format of the aggregated portion of MPI communication information.
VERTICAL (Default) Aggregate vector processes and scalar processes separately and display them vertically.
HORIZONTAL Aggregate vector processes and scalar processes separately and display them horizontally.
MERGED Aggregate and display vector processes and scalar processes.
NMPI_PROGINF Control the display of runtime performance information of MPI program. Please refer to this section for runtime performance information of MPI program.
NO (Default) Not display the performance information.
YES Display the performance information in the reduced format.
ALL Display the performance information in the extended format.
DETAIL Display the detailed performance information in the reduced format.
ALL_DETAIL Display the detailed performance information in the extended format.
MPIPROGINF The same as the environment variable NMPI_PROGINF The same as the environment variable NMPI_PROGINF.
If both are specified, the environment variable NMPI_PROGINF takes precedence.
NMPI_PROGINF_COMPAT 0 (Default) The runtime performance information of MPI program is displayed in the latest format.
1 The runtime performance information of MPI program is displayed in old format.
In this format, performance item "Non Swappable Memory Size Used", VE Card Data section and location information of VE where the MPI process is executed are not displayed.
VE_PROGINF_USE_SIGNAL YES (Default) Signals are used for collecting performance information.
NO Signals are not used for collecting performance information. See this section before using this option.
VE_PERF_MODE Control the HW performance counter set. MPI performance information outputs items corresponding to selected counters.
VECTOR-OP (Default) Select the set of HW performance counters related to vector operation mainly.
VECTOR-MEM Select the set of HW performance counters related to vector and memory access mainly.
NMPI_EXPORT "<string>" Space-separated list of the environment variables to be passed to MPI processes.
MPIEXPORT The same as the environment variable NMPI_EXPORT The same as the environment variable NMPI_EXPORT.
If both are specified, the environment variable NMPI_EXPORT takes precedence.
NMPI_SEPSELECT To enable this environment variable, the shell script mpisep.sh must also be used. Please refer to this section for details.
1 The standard output from each MPI process is saved in a separate file.
2 (Default) The standard error output from each MPI process is saved in a separate file.
3 The standard output and standard error output from each MPI process are saved in respective separate files.
4 The standard output and standard error output from each MPI process are saved in one separate file.
MPISEPSELECT The same as the environment variable NMPI_SEPSELECT The same as the environment variable NMPI_SEPSELECT.
If both are specified, the environment variable NMPI_SEPSELECT takes precedence.
NMPI_VERIFY Control error detection of the debug assist feature for MPI collective procedures. To use the debug assist feature for MPI collective procedures, you need to build the MPI program with the option -mpiverify. Please refer to this section for the feature.
0 Errors in invocations of MPI collective procedures are not detected.
3 (Default) Errors other than those in the argument assert of the procedure MPI_WIN_FENCE are detected.
4 Errors in the argument assert of the procedure MPI_WIN_FENCE are detected, in addition to the default errors.
NMPI_BLOCKLEN0 OFF (Default) Blocks with blocklength 0 are not included in the calculation of the values of the lower bound and upper bound of a datatype created by MPI procedures that create derived datatypes and have the argument blocklength.
ON Blocks with blocklength 0 are also included in the calculation of the values of the lower bound and upper bound of a datatype created by MPI procedures that create derived datatypes and have the argument blocklength.
MPIBLOCKLEN0 The same as the environment variable NMPI_BLOCKLEN0 The same as the environment variable NMPI_BLOCKLEN0.
If both are specified, the environment variable NMPI_BLOCKLEN0 takes precedence.
NMPI_COLLORDER OFF (Default)
1. Predefined operations, processes consecutive on nodes:
Canonical order, but bracketing depends on the distribution of processes over nodes, for example, could be (a+b)+(c+d) or ((a+b)+c)+d or a+((b+c)+d). More concretely, inside nodes reduction is performed left-to-right, over the nodes the bracketing depends on the number of nodes.
2. Predefined operations, processes not consecutive on nodes:
Commutativity is exploited, reduction order will not be canonical
3. User-defined operations:
Canonical reduction order, bracketing dependent on the number of processes, and commutativity is not exploited.
ON Canonical order, bracketing independent of process distribution, dependent only on the number of processes.
MPICOLLORDER The same as the environment variable NMPI_COLLORDER The same as the environment variable NMPI_COLLORDER.
If both are specified, the environment variable NMPI_COLLORDER takes precedence.
NMPI_PORT_RANGE <integer>:<integer> The range of port numbers NEC MPI uses to accept TCP/IP connections.
The default value is 25257:25266.
NMPI_INTERVAL_CONNECT <integer> Retry interval in seconds for establishing connections among MPI daemons at the beginning of execution of MPI programs.
The default value is 1.
NMPI_RETRY_CONNECT <integer> The number of retries for establishing connections among MPI daemons at the beginning of execution of MPI programs.
The default value is 2.
NMPI_LAUNCHER_EXEC <string> Full path name of the remote shell that launches MPI daemons.
The default value is /usr/bin/ssh.
NMPI_IB_ADAPTER_NAME <string> Comma-or-Space separated list of InfiniBand adaptor names NEC MPI uses. This environment variable is available only in the interactive execution.
When omitted, NEC MPI automatically selects the optimal ones.
NMPI_IB_DEFAULT_PKEY <integer> Partition key for InfiniBand Communication. The default value is 0.
NMPI_IB_FAST_PATH ON Use the InfiniBand RDMA fast path feature to transfer eager messages.
(Default on Intel machines)
Don't set this value if InfiniBand HCA Relaxed Ordering or Adaptive Routing is enabled.
MTU MTU limits the message size of fast path feature to actual OFED mtu size.
Don't set this value if InfiniBand HCA Relaxed Ordering is enabled.
OFF Don't use the InfiniBand RDMA fast path feature.
(Default on Non-Intel machines)
NMPI_IB_VBUF_TOTAL_SIZE <integer> Size of each InfiniBand communication buffer in bytes. The default value is 12248.
NMPI_IB_VH_MEMCPY_SEND AUTO Use sender side VH memory copy for InfiniBand communication through Root Complex.
(default for Intel machines)
ON Use sender side VH memory copy for InfiniBand communication (independent on Root Complex).
(default for non-Intel machines)
OFF Don't use sender side VH memory copy for InfiniBand communication.
NMPI_IB_VH_MEMCPY_RECV AUTO Use receiver side VH memory copy for InfiniBand communication through Root Complex.
ON Use receiver side VH memory copy for InfiniBand communication (independent on Root Complex).
(default for non-Intel machines)
OFF Don't use receiver side VH memory copy for InfiniBand communication.
(default for Intel machines)
NMPI_DMA_VH_MEMCPY AUTO Use VH memory copy for a communication between VEs in VH through Root Complex.
(Default)
ON Use VH memory copy for a communication between VEs in VH.
OFF Don't use VH memory copy for a communication between VEs in VH.
NMPI_VH_MEMCPY AUTO In the case of InfiniBand communication, sender side VH memcpy is used if the communication goes through Root Complex. In the case of a communication between VEs in VH, VH memory copy is used if the communication goes through Root Complex.
ON VH memory copy is used.
OFF VH memory copy is not used.
Note:
NMPI_IB_VH_MEMCPY_SEND, NMPI_IB_VH_MEMCPY_RECV, NMPI_DMA_VH_MEMCPY are higher priority than this environment variable.
NMPI_DMA_RNDV_OVERLAP
ON In the case of DMA communication, the communication and calculation can overlap when the buffer is contiguous, its transfer length is 200KB or more, and it is non-blocking point-to-point communication.
OFF (Default) In the case of DMA communication, the communication and calculation cannot overlap when the transfer length is 200KB or more and it is non-blocking point-to-point communication.
Note:
Setting NMPI_DMA_RNDV_OVERLAP to ON internally disables the use of VH memory copy; the value of the environment variable NMPI_DMA_VH_MEMCPY is ignored for non-blocking point-to-point DMA communication.
NMPI_IB_VH_MEMCPY_THRESHOLD <integer> Minimal message size to transfer InfiniBand message to/from VE processes via VH memory. Smaller messages are sent directly without copy to/from VH memory. Message size is given in bytes and must be greater or equal to 0. The default value is 1048576.
NMPI_IB_VH_MEMCPY_BUFFER_SIZE <integer> Maximal size of a buffer located in VH memory to transfer (parts of) an InfiniBand message to/from VE processes. Size of buffer is given in bytes and must be at least 8192 bytes. The default value is 1048576.
NMPI_IB_VH_MEMCPY_SPLIT_THRESHOLD <integer> Minimal message size to split transfer of InfiniBand messages to/from VE processes via VH Memory. The messages are split in nearly equal parts in order to increase the transfer bandwidth. Message size is given in bytes and must be greater or equal to 0. The default value is 1048576.
NMPI_IB_VH_MEMCPY_SPLIT_NUM <integer> Maximal number of parts used to transfer InfiniBand messages to/from VE processes using VH memory. The number must be in range of [1:8]. The default value is 2.
NMPI_EXEC_MODE NECMPI (Default) Work with NECMPI runtime option.
INTELMPI Work with IntelMPI's basic runtime options (see below).
OPENMPI Work with OPENMPI's basic runtime options (see below).
MPICH Work with MPICH's basic runtime options (see below).
MPISX Work with MPISX's runtime options.
NMPI_SHARP_ENABLE ON Use SHARP.
OFF Do not use SHARP. (default)
NMPI_SHARP_NODES <integer> The minimal number of VE nodes to use SHARP if SHARP usage is enabled. (default: 4)
NMPI_SHARP_ALLREDUCE_MAX <integer> Maximal data size (in bytes) in MPI_Allreduce for which the SHARP API is used. (Default: 64)
UNLIMITED SHARP is always used.
NMPI_SHARP_REPORT ON Report on MPI Communicators using SHARP collective support.
OFF No report. (default)
NMPI_DCT_ENABLE Control the use of InfiniBand DCT (Dynamically Connected Transport Service). Using DCT reduces the memory usage for InfiniBand communication.
(Note: DCT may affect the performance of InfiniBand communication.)
AUTOMATIC DCT is used if the number of MPI processes is equal or greater than the number specified by NMPI_DCT_SELECT_NP environment variable. (default)
ON DCT is always used.
OFF DCT is not used.
NMPI_DCT_SELECT_NP <integer> The minimal number of MPI processes that DCT is used if the environment variable NMPI_DCT_ENABLE is set to AUTOMATIC. (default: 2049)
NMPI_DCT_NUM_CONNS <integer> The number of requested DCT connections. (default: 4)
NMPI_COMM_PNODE Control the automatic selection of communication type between logical nodes in the execution under NQSV.
OFF Select the communication type automatically based on the logical node (default).
ON Select the communication type automatically based on the physical node.

Support options for setting NMPI_EXEC_MODE = INTELMPI
-hosts, -f, -hostfile, -machinefile, -machine, -configfile, -perhost, -ppn, -genv, -genvall, -genvnone, -genvlist, -gpath, -gwdir, -gumask, -host, -n , -np, -env, -envall, -envnone, -envlist, -path, -wdir, -umask, and common options for Aurora
Support options for setting NMPI_EXEC_MODE = OPENMPI
-N, -npernode, --npernode, -path, --path, -H, -host, --host, -n, --n, -c, -np, --np, -wdir, --wdir, -wd, --wd, -x, and common options for Aurora
Support options for setting NMPI_EXEC_MODE = MPICH
-hosts, -f, -configfile, -ppn, -genv, -genvall, -genvnone, -genvlist, -wdir, -host, -n, -np, -env, -envall, -envnone, -envlist, and common options for Aurora
Common options for Aurora
-launcher-exec, -max_np, -multi, -debug, -display, -disp, -v, -V, -version, -h, -help, -gvenode, -ve, -venode
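
For example (the hostnames and the program name a.out are illustrative), NEC MPI can be driven with Open MPI style options by setting NMPI_EXEC_MODE before invoking mpirun:

$ export NMPI_EXEC_MODE=OPENMPI
$ mpirun -H host1,host2 -np 4 ./a.out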


3.2.5   Environment Variables for MPI Process Identification

NEC MPI provides the following environment variables, the values of which are automatically set by NEC MPI, for MPI process identification.

Environment Variable Value
MPIUNIVERSE The identification number of the predefined communication universe at the beginning of program execution corresponding to the communicator MPI_COMM_WORLD.
MPIRANK The rank of the executing process in the communicator MPI_COMM_WORLD at the beginning of program execution.
MPISIZE The total number of processes in the communicator MPI_COMM_WORLD at the beginning of program execution.
MPINODEID The logical node number of the node where the MPI process is running.
MPIVEID The VE node number of the VE where the MPI process is running. In the NQSV request execution, this is the logical VE node number. If the MPI process is not running on a VE, this variable is not set.

These environment variables can be referenced whenever MPI programs are running including before the invocation of the procedure MPI_INIT or MPI_INIT_THREAD.

When an MPI program is initiated, there is a predefined communication universe that includes all MPI processes and corresponds to a communicator MPI_COMM_WORLD. The predefined communication universe is assigned the identification number 0.

In a communication universe, each process is assigned a unique integer value called the rank, which is in the range from zero to one less than the number of processes.

If the dynamic process creation facility is used and a set of MPI processes is dynamically created, a new communication universe corresponding to a new communicator MPI_COMM_WORLD is created. Communication universes created at runtime are assigned consecutive integer identification numbers starting at 1. In such a case, two or more MPI_COMM_WORLDs can exist at the same time in a single MPI application.
Therefore, an MPI process can be identified using a pair of values of MPIUNIVERSE and MPIRANK.

If an MPI program is indirectly initiated with a shell script, these environment variables can also be referenced in the shell script and be used, for example, to allow different MPI processes to handle mutually different files. The shell script in Figure 3-1 makes each MPI process read data from and store data to files specific to that process, and it is executed as shown in Figure 3-2.

#!/bin/sh
INFILE=infile.$MPIUNIVERSE.$MPIRANK
OUTFILE=outfile.$MPIUNIVERSE.$MPIRANK
{MPIexec} < $INFILE > $OUTFILE    # Refer to this clause for {MPIexec}, MPI-execution specification
exit $?
Figure 3-1   A Shell Script "mpi.shell" to Start an MPI Program

$ mpirun -np 8 /execdir/mpi.shell
Figure 3-2   Indirect Initiation of an MPI Program with a Shell Script


3.2.6   Environment Variables for Other Processors

The environment variables supported by other processors such as the Fortran compiler (nfort), C compiler (ncc), or C++ compiler (nc++) are passed to MPI processes because the runtime option -genvall is enabled by default. In the following example, OMP_NUM_THREADS and VE_LD_LIBRARY_PATH are passed to the MPI processes.

#!/bin/sh
#PBS -T necmpi
#PBS -b 2

OMP_NUM_THREADS=8 ; export OMP_NUM_THREADS
VE_LD_LIBRARY_PATH={your shared library path} ; export VE_LD_LIBRARY_PATH

mpirun -node 0-1 -np 2 a.out


3.2.7   Rank Assignment

Ranks are assigned to MPI processes in ascending order, following the order in which NEC MPI assigns the processes to hosts.


3.2.8   The Working Directory under NQSV

The working directory in the NQSV request execution is determined as follows:
  1. The current working directory on the VHs where the MPI execution commands of NEC MPI are available.
  2. The home directory on the VHs where the MPI execution commands of NEC MPI are not available.


3.2.9   Execution with the singularity container

You can execute MPI programs in a Singularity container. As shown in the following example, the singularity command is specified as an argument of the mpirun command. In this kind of execution, the options of the singularity command related to namespaces are not available.
For how to build a Singularity image file containing NEC MPI, please refer to the following site.

https://github.com/veos-sxarr-NEC/singularity
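
The following is a minimal sketch of such an execution; the image file name mympi.sif and the program name ve.out are assumptions for illustration:

$ mpirun -ve 0 -np 2 /usr/bin/singularity exec ./mympi.sif ./ve.out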


3.2.10   Execution Examples

The following examples show how to launch MPI programs on the SX-Aurora TSUBASA.
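
A few typical command lines are sketched below; the executable name ve.out and the VE and node numbers are illustrative:

$ mpirun -np 8 ./ve.out               # interactive execution: 8 processes on VE#0 of the local VH
$ mpirun -ve 0-1 -np 16 ./ve.out      # interactive execution: 16 processes over VE#0 and VE#1
$ mpirun -node 0-1 -np 16 ./ve.out    # NQSV request execution: 16 processes over logical nodes 0 and 1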


3.3   Standard Output and Standard Error of MPI Programs

To separate output streams from MPI processes, NEC MPI provides the shell script mpisep.sh, which is placed in the path /opt/nec/ve/bin/.

It is possible to redirect output streams from MPI processes into respectively different files in the current working directory by specifying this script before MPI-execution specification {MPIexec} as shown in the following example. (Please refer to this clause for MPI-execution specification {MPIexec}.)

$ mpirun -np 2 /opt/nec/ve/bin/mpisep.sh {MPIexec}

The destinations of output streams can be specified with the environment variable NMPI_SEPSELECT as shown in the following table, in which uuu is the identification number of the predefined communication universe corresponding to the communicator MPI_COMM_WORLD and rrr is the rank of the executing MPI process in the universe.

NMPI_SEPSELECT Action
1 Only the stdout stream from each process is put into the separate file stdout.uuu:rrr.
2 (Default) Only the stderr stream from each process is put into the separate file stderr.uuu:rrr.
3 The stdout and stderr streams from each process are put into the separate files stdout.uuu:rrr and stderr.uuu:rrr, respectively.
4 The stdout and stderr streams from each process are put into one separate file std.uuu:rrr.
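
For example (the program name ve.out is illustrative), the following puts the stdout and stderr streams of each of two processes into a single file per process, producing std.0:0 and std.0:1 in the current working directory:

$ export NMPI_SEPSELECT=4
$ mpirun -np 2 /opt/nec/ve/bin/mpisep.sh ./ve.out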


3.4   Runtime Performance of MPI Programs

The performance of MPI programs can be obtained with the environment variable NMPI_PROGINF. There are four formats of runtime performance information available in NEC MPI as follows:
Format Description
Reduced Format This format consists of three parts: The first part is the Global Data section in which maximum, minimum and average performance of all MPI processes is displayed. The second part is the Overall Data section in which performance of overall MPI processes is displayed. The third part is the VE Card section in which maximum, minimum and average performance of VE card is displayed. The results of the vector processes and scalar processes are output separately.
Extended Format Performance of each MPI process is displayed in the ascending order of their ranks in the communicator MPI_COMM_WORLD after the information in the reduced format.
Detailed Reduced Format This format consists of three parts: The first part is the Global Data section in which maximum, minimum, and average detailed performance of all MPI processes is displayed. The second part is the Overall Data section in which performance of overall MPI processes is displayed. The third part is the VE Card section in which maximum, minimum and average performance of VE card is displayed. The results of the vector processes and scalar processes are output separately.
Detailed Extended Format Detailed performance of each MPI process is displayed in the ascending order of their ranks in the communicator MPI_COMM_WORLD after the information in the detailed reduced format.
The format of displayed information can be specified by setting the environment variable NMPI_PROGINF at runtime as shown in the following table.

Table 3-8  The Settings of NMPI_PROGINF
NMPI_PROGINF Displayed Information
NO (Default) No Output
YES Reduced Format
ALL Extended Format
DETAIL Detailed Reduced Format
ALL_DETAIL Detailed Extended Format
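
For example (the program name ve.out is illustrative), the detailed reduced format can be requested as follows:

$ export NMPI_PROGINF=DETAIL
$ mpirun -np 4 ./ve.out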

The following figure is an example of the detailed extended format.

MPI Program Information:
========================
Note: It is measured from MPI_Init till MPI_Finalize.
      [U,R] specifies the Universe and the Process Rank in the Universe.
      Times are given in seconds.


Global Data of 4 Vector processes       :          Min [U,R]          Max [U,R]      Average
=================================

Real Time (sec)                         :       25.203 [0,3]       25.490 [0,2]       25.325
User Time (sec)                         :      199.534 [0,0]      201.477 [0,2]      200.473
Vector Time (sec)                       :       42.028 [0,2]       42.221 [0,1]       42.104
Inst. Count                             :  94658554061 [0,1]  96557454164 [0,2]  95606075636
V. Inst. Count                          :  11589795409 [0,3]  11593360015 [0,0]  11591613166
V. Element Count                        : 920130095790 [0,3] 920199971948 [0,0] 920161556564
V. Load Element Count                   : 306457838070 [0,1] 306470712295 [0,3] 306463228635
FLOP Count                              : 611061870735 [0,3] 611078144683 [0,0] 611070006844
MOPS                                    :     6116.599 [0,2]     6167.214 [0,0]     6142.469
MOPS (Real)                             :    48346.004 [0,2]    48891.767 [0,3]    48624.070
MFLOPS                                  :     3032.988 [0,2]     3062.528 [0,0]     3048.186
MFLOPS (Real)                           :    23972.934 [0,2]    24246.003 [0,3]    24129.581
A. V. Length                            :       79.372 [0,1]       79.391 [0,3]       79.382
V. Op. Ratio (%)                        :       93.105 [0,2]       93.249 [0,1]       93.177
L1 Cache Miss (sec)                     :        3.901 [0,0]        4.044 [0,2]        3.983
CPU Port Conf. (sec)                    :        3.486 [0,1]        3.486 [0,2]        3.486
V. Arith. Exec. (sec)                   :       15.628 [0,3]       15.646 [0,1]       15.637
V. Load Exec. (sec)                     :       23.156 [0,2]       23.294 [0,1]       23.225
VLD LLC Hit Element Ratio (%)           :       90.954 [0,2]       90.965 [0,1]       90.959
Power Throttling (sec)                  :        0.000 [0,0]        0.000 [0,0]        0.000
Thermal Throttling (sec)                :        0.000 [0,0]        0.000 [0,0]        0.000
Max Active Threads                      :            8 [0,0]            8 [0,0]            8
Available CPU Cores                     :            8 [0,0]            8 [0,0]            8
Average CPU Cores Used                  :        7.904 [0,2]        7.930 [0,3]        7.916
Memory Size Used (MB)                   :     1616.000 [0,0]     1616.000 [0,0]     1616.000
Non Swappable Memory Size Used (MB)     :      115.000 [0,1]      179.000 [0,0]      131.000

Global Data of 8 Scalar processes       :          Min [U,R]          Max [U,R]      Average
=================================

Real Time (sec)                         :       25.001 [0,7]       25.010 [0,8]       25.005
User Time (sec)                         :      199.916 [0,7]      199.920 [0,8]      199.918
Memory Size Used (MB)                   :      392.000 [0,7]      392.000 [0,8]      392.000


Overall Data of 4 Vector processes
==================================

Real Time (sec)                         :       25.490
User Time (sec)                         :      801.893
Vector Time (sec)                       :      168.418
GOPS                                    :        5.009
GOPS (Real)                             :      157.578
GFLOPS                                  :        3.048
GFLOPS (Real)                           :       95.890
Memory Size Used (GB)                   :        6.313
Non Swappable Memory Size Used (GB)     :        0.512

Overall Data of 8 Scalar processes
==================================
Real Time (sec)                         :       25.010
User Time (sec)                         :     1599.344
Memory Size Used (GB)                   :        3.063


VE Card Data of 2 VEs
=====================

Memory Size Used (MB) Min               :     3232.000 [node=0,ve=0]
Memory Size Used (MB) Max               :     3232.000 [node=0,ve=0]
Memory Size Used (MB) Avg               :     3232.000
Non Swappable Memory Size Used (MB) Min :      230.000 [node=0,ve=1]
Non Swappable Memory Size Used (MB) Max :      294.000 [node=0,ve=0]
Non Swappable Memory Size Used (MB) Avg :      262.000


Data of Vector Process [0,0] [node=0,ve=0]:
-------------------------------------------

  Real Time (sec)                         :            25.216335
  User Time (sec)                         :           199.533916
  Vector Time (sec)                       :            42.127823
  Inst. Count                             :          94780214417
  V. Inst. Count                          :          11593360015
  V. Element Count                        :         920199971948
  V. Load Element Count                   :         306461345333
  FLOP Count                              :         611078144683
  MOPS                                    :          6167.214211
  MOPS (Real)                             :         48800.446081
  MFLOPS                                  :          3062.527699
  MFLOPS (Real)                           :         24233.424158
  A. V. Length                            :            79.373018
  V. Op. Ratio (%)                        :            93.239965
  L1 Cache Miss (sec)                     :             3.901453
  CPU Port Conf. (sec)                    :             3.485787
  V. Arith. Exec. (sec)                   :            15.642353
  V. Load Exec. (sec)                     :            23.274564
  VLD LLC Hit Element Ratio (%)           :            90.957228
  Power Throttling (sec)                  :             0.000000
  Thermal Throttling (sec)                :             0.000000
  Max Active Threads                      :                    8
  Available CPU Cores                     :                    8
  Average CPU Cores Used                  :             7.912883
  Memory Size Used (MB)                   :          1616.000000
  Non Swappable Memory Size Used (MB)     :           179.000000
...
Figure 3-3   Performance Information in the Detailed Extended Format
(NMPI_PROGINF=ALL_DETAIL)

The following table shows the meanings of the items in the Global Data section and the Process section. In the case of a vector process, in addition to the MPI universe number and the MPI rank in MPI_COMM_WORLD, the hostname or logical node number and the logical VE number are shown in the header of the Process section as the location of the VE where the MPI process is executed. For scalar processes, only the items marked (*1) are output. The items marked (*2) are output only in the detailed reduced format or the detailed extended format. The items marked (*3) are output only in the detailed reduced format or the detailed extended format in multi-threaded execution.

Table 3-9   The Meanings of the Items in the Global Data Section and Process Section
Item Unit Description
Real Time (sec) second Elapsed time(*1)
User Time (sec) second User CPU time(*1)
Vector Time (sec) second Vector instruction execution time
Inst. Count The number of executed instructions
V.Inst. Count The number of executed vector instructions
V.Element Count The number of elements processed with vector instructions
V.Load Element Count The number of vector-loaded elements
FLOP Count The number of elements processed with floating-point operations
MOPS The number of million operations divided by the user CPU time
MOPS (Real) The number of million operations divided by the real time
MFLOPS The number of million floating-point operations divided by the user CPU time
MFLOPS (Real) The number of million floating-point operations divided by the real time
A.V.Length Average Vector Length
V.OP.RATIO percent Vector operation ratio
L1 Cache Miss (sec) second L1 cache miss time
CPU Port Conf. second CPU port conflict time (*2)
V. Arith Exec. second Vector operation execution time (*2)
V. Load Exec. second Vector load instruction execution time (*2)
VLD LLC Hit Element Ratio Ratio of the number of elements loaded from LLC to the number of elements loaded with vector load instructions (*2)
Power Throttling second Duration of time the hardware was throttled due to the power consumption (*2)
Thermal Throttling second Duration of time the hardware was throttled due to the temperature (*2)
Max Active Threads The maximum number of threads that were active at the same time (*3)
Available CPU Cores The number of CPU cores a process was allowed to use (*3)
Average CPU Cores Used The average number of CPU cores used (*3)
Memory Size Used (MB) megabyte Peak usage of memory(*1)
Non Swappable Memory Size Used (MB) megabyte Peak usage of memory that cannot be swapped out by Partial Process Swapping function

The following table shows the meanings of the items in the Overall Data section in the Figure above. For scalar processes, only the items marked (*1) are output.

Table 3-10   The Meanings of the Items in the Overall Data Section
Item Unit Description
Real Time (sec) second The maximum elapsed time of all MPI processes(*1)
User Time (sec) second The sum of the user CPU time of all MPI processes(*1)
Vector Time (sec) second The sum of the vector time of all MPI processes
GOPS The total number of giga operations executed on all MPI processes divided by the total user CPU time of all MPI processes
GOPS (Real) The total number of giga operations executed on all MPI processes divided by the maximum real time of all MPI processes
GFLOPS The total number of giga floating-point operations executed on all MPI processes divided by the total user CPU time of all MPI processes
GFLOPS (Real) The total number of giga floating-point operations executed on all MPI processes divided by the maximum real time of all MPI processes
Memory Size Used (GB) gigabyte The sum of peak usage of memory of all MPI processes(*1)
Non Swappable Memory Size Used (GB) gigabyte The sum of peak usage of memory that cannot be swapped out by Partial Process Swapping function of all MPI processes
The following table shows the meanings of the items in the VE Card Data section in the Figure above. For the maximum and minimum values, the hostname or logical node number and the logical VE number are shown as the location of the VE where the value was recorded.

Table 3-11   The Meanings of the Items in the VE Card Data Section
Item Unit Description
Memory Size Used (MB) Min megabyte The minimum of peak usage of memory aggregated for each VE card
Memory Size Used (MB) Max megabyte The maximum of peak usage of memory aggregated for each VE card
Memory Size Used (MB) Avg megabyte The average of peak usage of memory aggregated for each VE card
Non Swappable Memory Size Used (MB) Min megabyte The minimum of peak usage of memory that cannot be swapped out by Partial Process Swapping function aggregated for each VE card
Non Swappable Memory Size Used (MB) Max megabyte The maximum of peak usage of memory that cannot be swapped out by Partial Process Swapping function aggregated for each VE card
Non Swappable Memory Size Used (MB) Avg megabyte The average of peak usage of memory that cannot be swapped out by Partial Process Swapping function aggregated for each VE card
MPI performance information outputs program execution analysis information using the Aurora HW performance counters. You can select the set of performance counters with the environment variable VE_PERF_MODE, and PROGINF outputs the items corresponding to the selected set. The output above is the case in which VE_PERF_MODE is unset or set to VECTOR-OP; in this case, PROGINF mainly outputs items related to vector instructions. The output below is the case in which VE_PERF_MODE is set to VECTOR-MEM; in this case, PROGINF mainly outputs items related to vector and memory access.

Global Data of 16 Vector processes      :          Min [U,R]           Max [U,R]       Average
==================================

Real Time (sec)                         :      123.871 [0,12]      123.875 [0,10]      123.873
User Time (sec)                         :      123.695 [0,0]       123.770 [0,4]       123.753
Vector Time (sec)                       :       33.675 [0,8]        40.252 [0,14]       36.871
Inst. Count                             :  94783046343 [0,8]  120981685418 [0,5]  109351879970
V. Inst. Count                          :   2341570533 [0,8]    3423410840 [0,0]    2479317774
V. Element Count                        : 487920413405 [0,15] 762755268183 [0,0]  507278230562
V. Load Element Count                   :  47201569500 [0,8]   69707680610 [0,0]   49406464759
FLOP Count                              : 277294180692 [0,15] 434459800790 [0,0]  287678800758
MOPS                                    :     5558.515 [0,8]      8301.366 [0,0]      5863.352
MOPS (Real)                             :     5546.927 [0,8]      8276.103 [0,0]      5850.278
MFLOPS                                  :     2243.220 [0,15]     3518.072 [0,0]      2327.606
MFLOPS (Real)                           :     2238.588 [0,13]     3507.366 [0,0]      2322.405
A. V. Length                            :      197.901 [0,5]       222.806 [0,0]       204.169
V. Op. Ratio (%)                        :       83.423 [0,5]        90.639 [0,0]        85.109
L1 I-Cache Miss (sec)                   :        4.009 [0,5]         8.310 [0,0]         5.322
L1 O-Cache Miss (sec)                   :       11.951 [0,5]        17.844 [0,9]        14.826
L2 Cache Miss (sec)                     :        7.396 [0,5]        15.794 [0,0]         9.872
FMA Element Count                       : 106583464050 [0,8]  166445323660 [0,0]  110529497704
Required B/F                            :        2.258 [0,0]         3.150 [0,5]         2.948
Required Store B/F                      :        0.914 [0,0]         1.292 [0,5]         1.202
Required Load B/F                       :        1.344 [0,0]         1.866 [0,6]         1.746
Actual V. Load B/F                      :        0.223 [0,0]         0.349 [0,14]        0.322
Power Throttling (sec)                  :        0.000 [0,0]         0.000 [0,0]         0.000
Thermal Throttling (sec)                :        0.000 [0,0]         0.000 [0,0]         0.000
Memory Size Used (MB)                   :      598.000 [0,0]       598.000 [0,0]       598.000
Non Swappable Memory Size Used (MB)     :      115.000 [0,1]       179.000 [0,0]       131.000

When VE_PERF_MODE is set to VECTOR-MEM, MPI performance information outputs the following items instead of L1 Cache Miss, CPU Port Conf., V. Arith Exec., V. Load Exec., and VLD LLC Hit Element Ratio, which are output when VE_PERF_MODE is set to VECTOR-OP or is unset.

Items marked (*1) are output only in the detailed reduced format or the detailed extended format.
For items marked (*2), values over 100 are truncated.

Item Unit Description
L1 I-Cache Miss (sec) second L1 instruction cache miss time
L1 O-Cache Miss (sec) second L1 operand cache miss time
L2 Cache Miss (sec) second L2 cache miss time
Required B/F B/F calculated from bytes specified by load and store instructions (*1) (*2)
Required Store B/F B/F calculated from bytes specified by store instructions (*1) (*2)
Required Load B/F B/F calculated from bytes specified by load instructions (*1) (*2)
Actual V. Load B/F B/F calculated from bytes of actual memory access by vector load instructions (*1) (*2)


3.5   MPI Communication Information

NEC MPI provides a facility for displaying MPI communication information. To use this facility, you need to build the MPI program with one of the options -mpiprof, -mpitrace, -mpiverify, or -ftrace. The following two formats of MPI communication information are available:
Reduced Format

The maximum, minimum, and average values of MPI communication information of all MPI processes are displayed.

Extended Format

MPI communication information of each MPI process is displayed in the ascending order of their ranks in the communicator MPI_COMM_WORLD after the information in the reduced format.

You can control the display and format of MPI communication information by setting the environment variable NMPI_COMMINF at runtime as shown in the following table.

Table 3-12   The Settings of NMPI_COMMINF
NMPI_COMMINF Displayed Information
NO (Default) No Output
YES Reduced Format
ALL Extended Format
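
For example, the extended format shown in Figure 3-4 below might be requested as follows; the program name and launch command are illustrative only.

$ mpinfort -mpiprof program.f90      # build with the MPI communication information facility
$ export NMPI_COMMINF=ALL            # request the extended format
$ mpirun -np 4 ./a.out               # illustrative launch command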

You can also change the view of the reduced format by setting the environment variable NMPI_COMMINF_VIEW.

Table 3-13   The Settings of NMPI_COMMINF_VIEW
NMPI_COMMINF_VIEW Displayed Information
VERTICAL (Default) Summarized separately for vector processes and for scalar processes and arranged vertically. Items that apply only to vector processes are not output in the scalar-process part.
HORIZONTAL Summarized separately for vector processes and for scalar processes and arranged horizontally. N/A is output in the scalar-process part for items that apply only to vector processes.
MERGED Summarized over vector and scalar processes together. For items that apply only to vector processes, (V) is output at the end of the line, and only the vector processes are aggregated in those items.
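
Similarly, the merged view of the reduced format (see Figure 3-5 below) might be selected as follows; the launch command is illustrative only.

$ export NMPI_COMMINF=YES            # reduced format
$ export NMPI_COMMINF_VIEW=MERGED    # summarize vector and scalar processes together
$ mpirun -np 12 ./a.out              # illustrative launch command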

The following figure is an example of the extended format.


MPI Communication Information of 4 Vector processes
---------------------------------------------------
                                                   Min [U,R]           Max [U,R]       Average

Real MPI Idle Time (sec)                :        9.732 [0,1]        10.178 [0,3]         9.936
User MPI Idle Time (sec)                :        9.699 [0,1]        10.153 [0,3]         9.904
Total real MPI Time (sec)               :       13.301 [0,0]        13.405 [0,3]        13.374
Send       count                        :         1535 [0,2]          2547 [0,1]          2037
   Memory Transfer                      :          506 [0,3]          2024 [0,0]          1269
   DMA Transfer                         :            0 [0,0]          1012 [0,1]           388
Recv       count                        :         1518 [0,2]          2717 [0,0]          2071
   Memory Transfer                      :          506 [0,2]          2024 [0,1]          1269
   DMA Transfer                         :            0 [0,3]          1012 [0,2]           388
Barrier       count                     :         8361 [0,2]          8653 [0,0]          8507
Bcast         count                     :          818 [0,2]           866 [0,0]           842
Reduce        count                     :          443 [0,0]           443 [0,0]           443
Allreduce     count                     :         1815 [0,2]          1959 [0,0]          1887
Scan          count                     :            0 [0,0]             0 [0,0]             0
Exscan        count                     :            0 [0,0]             0 [0,0]             0
Redscat       count                     :          464 [0,0]           464 [0,0]           464
Redscat_block count                     :            0 [0,0]             0 [0,0]             0
Gather        count                     :          864 [0,0]           864 [0,0]           864
Gatherv       count                     :          506 [0,0]           506 [0,0]           506
Allgather     count                     :          485 [0,0]           485 [0,0]           485
Allgatherv    count                     :          506 [0,0]           506 [0,0]           506
Scatter       count                     :          485 [0,0]           485 [0,0]           485
Scatterv      count                     :          506 [0,0]           506 [0,0]           506
Alltoall      count                     :          506 [0,0]           506 [0,0]           506
Alltoallv     count                     :          506 [0,0]           506 [0,0]           506
Alltoallw     count                     :            0 [0,0]             0 [0,0]             0
Neighbor Allgather  count               :            0 [0,0]             0 [0,0]             0
Neighbor Allgatherv count               :            0 [0,0]             0 [0,0]             0
Neighbor Alltoall   count               :            0 [0,0]             0 [0,0]             0
Neighbor Alltoallv  count               :            0 [0,0]             0 [0,0]             0
Neighbor Alltoallw  count               :            0 [0,0]             0 [0,0]             0
Number of bytes sent                    :    528482333 [0,2]     880803843 [0,1]     704643071
   Memory Transfer                      :    176160755 [0,3]     704643020 [0,0]     440401904
   DMA Transfer                         :            0 [0,0]     352321510 [0,1]     132120600
Number of bytes recvd                   :    528482265 [0,2]     880804523 [0,0]     704643207
   Memory Transfer                      :    176160755 [0,2]     704643020 [0,1]     440401904
   DMA Transfer                         :            0 [0,3]     352321510 [0,2]     132120600
Put        count                        :            0 [0,0]             0 [0,0]             0
Get        count                        :            0 [0,0]             0 [0,0]             0
Accumulate count                        :            0 [0,0]             0 [0,0]             0
Number of bytes put                     :            0 [0,0]             0 [0,0]             0
Number of bytes got                     :            0 [0,0]             0 [0,0]             0
Number of bytes accum                   :            0 [0,0]             0 [0,0]             0

MPI Communication Information of 8 Scalar processes
---------------------------------------------------
                                                   Min [U,R]           Max [U,R]       Average

Real MPI Idle Time (sec)                :        4.837 [0,6]         5.367 [0,11]        5.002
User MPI Idle Time (sec)                :        4.825 [0,6]         5.363 [0,11]        4.992
Total real MPI Time (sec)               :       12.336 [0,11]       12.344 [0,5]        12.340
Send       count                        :         1535 [0,4]          1535 [0,4]          1535
   Memory Transfer                      :          506 [0,11]         1518 [0,5]          1328
Recv       count                        :         1518 [0,4]          1518 [0,4]          1518
   Memory Transfer                      :          506 [0,4]          1518 [0,5]          1328
...
Number of bytes accum                   :            0 [0,0]             0 [0,0]             0


Data of Vector Process [0,0] [node=0,ve=0]:
-------------------------------------------

  Real MPI Idle Time (sec)                :            10.071094
  User MPI Idle Time (sec)                :            10.032894
  Total real MPI Time (sec)               :            13.301340
...
Figure 3-4 MPI Communication Information in the Extended Format
(NMPI_COMMINF=ALL)

The following figure is an example of the reduced format with NMPI_COMMINF_VIEW=MERGED.

MPI Communication Information of 4 Vector and 8 Scalar processes
----------------------------------------------------------------
                                                   Min [U,R]           Max [U,R]       Average

Real MPI Idle Time (sec)                :        4.860 [0,10]       10.193 [0,3]         6.651
User MPI Idle Time (sec)                :        4.853 [0,10]       10.167 [0,3]         6.635
Total real MPI Time (sec)               :       12.327 [0,4]        13.396 [0,3]        12.679
Send       count                        :         1535 [0,2]          2547 [0,1]          1702
   Memory Transfer                      :          506 [0,3]          2024 [0,0]          1309
   DMA Transfer                         :            0 [0,0]          1012 [0,1]           388 (V)
Recv       count                        :         1518 [0,2]          2717 [0,0]          1702
   Memory Transfer                      :          506 [0,2]          2024 [0,1]          1309
   DMA Transfer                         :            0 [0,3]          1012 [0,2]           388 (V)
...
Number of bytes accum                   :            0 [0,0]             0 [0,0]             0
  
Figure 3-5 MPI Communication Information in the Reduced Format
(NMPI_COMMINF_VIEW=MERGED)

The following table shows the meanings of the items in the MPI communication information. The item "DMA Transfer" is supported only for vector processes.

Table 3-14   The Meanings of the Items in the MPI Communication Information
Item Unit Description
Real MPI Idle Time second Elapsed time for waiting for messages
User MPI Idle Time second User CPU time for waiting for messages
Total real MPI Time second Elapsed time for executing MPI procedures
Send count The number of invocations of point-to-point send procedures
Memory Transfer The number of invocations of point-to-point send procedures that use memory copy
DMA Transfer The number of invocations of point-to-point send procedures that use DMA transfer
Recv count The number of invocations of point-to-point receive procedures
Memory Transfer The number of invocations of point-to-point receive procedures that use memory copy
DMA Transfer The number of invocations of point-to-point receive procedures that use DMA transfer
Barrier count The number of invocations of the procedures MPI_BARRIER and MPI_IBARRIER
Bcast count The number of invocations of the procedures MPI_BCAST and MPI_IBCAST
Reduce count The number of invocations of the procedures MPI_REDUCE and MPI_IREDUCE
Allreduce count The number of invocations of the procedures MPI_ALLREDUCE and MPI_IALLREDUCE
Scan count The number of invocations of the procedures MPI_SCAN and MPI_ISCAN
Exscan count The number of invocations of the procedures MPI_EXSCAN and MPI_IEXSCAN
Redscat count The number of invocations of the procedures MPI_REDUCE_SCATTER and MPI_IREDUCE_SCATTER
Redscat_block count The number of invocations of the procedures MPI_REDUCE_SCATTER_BLOCK and MPI_IREDUCE_SCATTER_BLOCK
Gather count The number of invocations of the procedures MPI_GATHER and MPI_IGATHER
Gatherv count The number of invocations of the procedures MPI_GATHERV and MPI_IGATHERV
Allgather count The number of invocations of the procedures MPI_ALLGATHER and MPI_IALLGATHER
Allgatherv count The number of invocations of the procedures MPI_ALLGATHERV and MPI_IALLGATHERV
Scatter count The number of invocations of the procedures MPI_SCATTER and MPI_ISCATTER
Scatterv count The number of invocations of the procedures MPI_SCATTERV and MPI_ISCATTERV
Alltoall count The number of invocations of the procedures MPI_ALLTOALL and MPI_IALLTOALL
Alltoallv count The number of invocations of the procedures MPI_ALLTOALLV and MPI_IALLTOALLV
Alltoallw count The number of invocations of the procedures MPI_ALLTOALLW and MPI_IALLTOALLW
Neighbor Allgather count The number of invocations of the procedures MPI_NEIGHBOR_ALLGATHER and MPI_INEIGHBOR_ALLGATHER
Neighbor Allgatherv count The number of invocations of the procedures MPI_NEIGHBOR_ALLGATHERV and MPI_INEIGHBOR_ALLGATHERV
Neighbor Alltoall count The number of invocations of the procedures MPI_NEIGHBOR_ALLTOALL and MPI_INEIGHBOR_ALLTOALL
Neighbor Alltoallv count The number of invocations of the procedures MPI_NEIGHBOR_ALLTOALLV and MPI_INEIGHBOR_ALLTOALLV
Neighbor Alltoallw count The number of invocations of the procedures MPI_NEIGHBOR_ALLTOALLW and MPI_INEIGHBOR_ALLTOALLW
Number of bytes sent byte The number of bytes sent by point-to-point send procedures
Memory Transfer byte The number of bytes sent using memory copy by point-to-point send procedures
DMA Transfer byte The number of bytes sent using DMA transfer by point-to-point send procedures
Number of bytes recvd byte The number of bytes received by point-to-point receive procedures
Memory Transfer byte The number of bytes received using memory copy by point-to-point receive procedures
DMA Transfer byte The number of bytes received using DMA transfer by point-to-point receive procedures
Put count The number of invocations of the procedures MPI_PUT and MPI_RPUT
Memory Transfer The number of invocations of the procedures MPI_PUT and MPI_RPUT that use memory copy
DMA Transfer The number of invocations of the procedures MPI_PUT and MPI_RPUT that use DMA transfer
Get count The number of invocations of the procedures MPI_GET and MPI_RGET
Memory Transfer The number of invocations of the procedures MPI_GET and MPI_RGET that use memory copy
DMA Transfer The number of invocations of the procedures MPI_GET and MPI_RGET that use DMA transfer
Accumulate count The number of invocations of the procedures MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP and MPI_COMPARE_AND_SWAP
Memory Transfer The number of invocations of the procedures MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP and MPI_COMPARE_AND_SWAP that use memory copy
DMA Transfer The number of invocations of the procedures MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP and MPI_COMPARE_AND_SWAP that use DMA transfer
Number of bytes put byte The number of bytes put by the procedures MPI_PUT and MPI_RPUT
Memory Transfer byte The number of bytes put using memory copy by the procedures MPI_PUT and MPI_RPUT
DMA Transfer byte The number of bytes put using DMA transfer by the procedures MPI_PUT and MPI_RPUT
Number of bytes got byte The number of bytes got by the procedures MPI_GET and MPI_RGET
Memory Transfer byte The number of bytes got using memory copy by the procedures MPI_GET and MPI_RGET
DMA Transfer byte The number of bytes got using DMA transfer by the procedures MPI_GET and MPI_RGET
Number of bytes accum byte The number of bytes accumulated by the procedures MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP and MPI_COMPARE_AND_SWAP
Memory Transfer byte The number of bytes accumulated using memory copy by the procedures MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP and MPI_COMPARE_AND_SWAP
DMA Transfer byte The number of bytes accumulated using DMA transfer by the procedures MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP and MPI_COMPARE_AND_SWAP


3.6   FTRACE Facility

The FTRACE facility enables users to obtain detailed performance information for each procedure and each specified execution region of a program on each MPI process, including MPI communication information. Please refer to the "PROGINF / FTRACE User's Guide" for details. Note that FTRACE is available only for programs executed on VE.

The following table shows the MPI communication information displayed with the FTRACE facility.

Table 3-15   MPI Communication Information Displayed with the FTRACE Facility
Item Unit Meaning
ELAPSE second Elapsed time
COMM.TIME second Elapsed time for executing MPI procedures
COMM.TIME / ELAPSE The ratio of the elapsed time for executing MPI procedures to the elapsed time of each process
IDLE TIME second Elapsed time for waiting for messages
IDLE TIME / ELAPSE The ratio of the elapsed time for waiting for messages to the elapsed time of each process
AVER.LEN Byte Average amount of communication per MPI procedure
COUNT Total number of transfers by MPI procedures
TOTAL LEN Byte Total amount of communication by MPI procedures


The steps for using the FTRACE facility are as follows:

  1. Specify the -ftrace option at compile and link time as follows:

    $ mpincc -ftrace mpi.c
    $ mpinfort -ftrace mpifort.f90

  2. Analysis information files are generated in the working directory at runtime. The name of an analysis information file is ftrace.out.uuu.rrr, where uuu and rrr are the values of the environment variables MPIUNIVERSE and MPIRANK, respectively.

  3. Execute the ftrace command to read the analysis information files and display the performance information to the standard output as follows:

    $ ftrace -all -f ftrace.out.0.0 ftrace.out.0.1
    $ ftrace -f ftrace.out.*
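
Putting these steps together, a complete run might look like the following; the launch command, process count, and program name are illustrative only.

$ mpinfort -ftrace mpifort.f90       # step 1: build with the FTRACE facility
$ mpirun -np 4 ./a.out               # step 2: illustrative launch; writes ftrace.out.0.0 ... ftrace.out.0.3
$ ftrace -f ftrace.out.*             # step 3: display the performance information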


The following figure shows an example displayed by the FTRACE facility.


Figure 3-6   Performance Information by the FTRACE Facility
*----------------------*
  FTRACE ANALYSIS LIST
*----------------------*

Execution Date : Sat Feb 17 12:44:49 2018 JST
Total CPU Time : 0:03'24"569 (204.569 sec.)


FREQUENCY  EXCLUSIVE       AVER.TIME     MOPS   MFLOPS  V.OP  AVER.    VECTOR L1CACHE .... PROC.NAME
           TIME[sec](  % )    [msec]                    RATIO V.LEN      TIME    MISS

     1012    49.093( 24.0)    48.511  23317.2  14001.4  96.97  83.2    42.132   5.511      funcA
   160640    37.475( 18.3)     0.233  17874.6   9985.9  95.22  52.2    34.223   1.973      funcB
   160640    30.515( 14.9)     0.190  22141.8  12263.7  95.50  52.8    29.272   0.191      funcC
   160640    23.434( 11.5)     0.146  44919.9  22923.2  97.75  98.5    21.869   0.741      funcD
   160640    22.462( 11.0)     0.140  42924.5  21989.6  97.73  99.4    20.951   1.212      funcE
 53562928    15.371(  7.5)     0.000   1819.0    742.2   0.00   0.0     0.000   1.253      funcG
        8    14.266(  7.0)  1783.201   1077.3     55.7   0.00   0.0     0.000   4.480      funcH
   642560     5.641(  2.8)     0.009    487.7      0.2  46.45  35.1     1.833   1.609      funcF
     2032     2.477(  1.2)     1.219    667.1      0.0  89.97  28.5     2.218   0.041      funcI
        8     1.971(  1.0)   246.398  21586.7   7823.4  96.21  79.6     1.650   0.271      funcJ
------------------------------------------------------------------------------------- .... -----------
 54851346   204.569(100.0)     0.004  22508.5  12210.7  95.64  76.5   154.524  17.740      total


ELAPSED     COMM.TIME  COMM.TIME   IDLE TIME  IDLE TIME  AVER.LEN      COUNT  TOTAL LEN PROC.NAME
   TIME[sec]       [sec]  / ELAPSED       [sec]  / ELAPSED    [byte]                [byte]

      12.444       0.000                  0.000                 0.0           0       0.0  funcA
       9.420       0.000                  0.000                 0.0           0       0.0  funcB
       7.946       0.000                  0.000                 0.0           0       0.0  funcG
       7.688       0.000                  0.000                 0.0           0       0.0  funcC
       7.372       0.000                  0.000                 0.0           0       0.0  funcH
       5.897       0.000                  0.000                 0.0           0       0.0  funcD
       5.653       0.000                  0.000                 0.0           0       0.0  funcE
       1.699       1.475                  0.756                 3.1K     642560       1.9G funcF
       1.073       1.054                  0.987                 1.0M       4064       4.0G funcI
       0.704       0.045                  0.045                80.0           4     320.0  funcK
------------------------------------------------------------------------------------------------------


FREQUENCY  EXCLUSIVE       AVER.TIME     MOPS   MFLOPS  V.OP  AVER.    VECTOR L1CACHE .... PROC.NAME
           TIME[sec](  % )    [msec]                    RATIO V.LEN      TIME    MISS

     1012    49.093( 24.0)    48.511  23317.2  14001.4  96.97  83.2    42.132   5.511      funcA
      253    12.089           47.784  23666.9  14215.9  97.00  83.2    10.431   1.352       0.0
      253    12.442           49.177  23009.2  13811.8  96.93  83.2    10.617   1.406       0.1
      253    12.118           47.899  23607.4  14180.5  97.00  83.2    10.463   1.349       0.2
      253    12.444           49.185  23002.8  13808.2  96.93  83.2    10.622   1.404       0.3
...
------------------------------------------------------------------------------------- .... ----------
 54851346   204.569(100.0)     0.004  22508.5  12210.7  95.64  76.5   154.524  17.740      total

   ELAPSED     COMM.TIME  COMM.TIME   IDLE TIME  IDLE TIME  AVER.LEN      COUNT  TOTAL LEN PROC.NAME
   TIME[sec]       [sec]  / ELAPSED       [sec]  / ELAPSED    [byte]                [byte]

      12.444       0.000                  0.000                 0.0           0       0.0  funcA
      12.090       0.000      0.000       0.000      0.000      0.0           0       0.0   0.0
      12.442       0.000      0.000       0.000      0.000      0.0           0       0.0   0.1
      12.119       0.000      0.000       0.000      0.000      0.0           0       0.0   0.2
      12.444       0.000      0.000       0.000      0.000      0.0           0       0.0   0.3


3.7   MPI Procedures Tracing Facility

NEC MPI provides the facility to trace invocations of and returns from MPI procedures, and the progress of each MPI process is output to the standard output.

The following information is displayed.

The tracing facility makes it easy to see where a program is executing and to debug it.

To use this facility, please build the MPI program with the -mpitrace option, as in the example below.

Note that the amount of trace output can be huge if a program calls MPI procedures many times.
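
A minimal usage sketch follows; the program name, process count, and launch command are illustrative only.

$ mpinfort -mpitrace program.f90     # build with the MPI procedures tracing facility
$ mpirun -np 2 ./a.out               # illustrative launch; trace records are written to the standard output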


3.8   Debug Assist Feature for MPI Collective Procedures

The debug assist feature for MPI collective procedures assists users in debugging invocations of MPI collective procedures by detecting incorrect uses across processes and outputting detected errors in detail to the standard error output.
The incorrect uses include the cases listed in Table 3-17 below.

Please build the MPI program with the -mpiverify option to use this feature, as follows:

$ mpinfort -mpiverify f.f90

When an error is detected, a message describing it is output to the standard error output.

The following example shows the message output when the process with rank 3 invoked the procedure MPI_BCAST with the argument root set to 2, while the process with rank 0 invoked the procedure with root set to 1.

VERIFY MPI_Bcast(3): root 2 inconsistent with root 1 of 0

The errors to be detected can be specified by setting the environment variable NMPI_VERIFY at runtime as shown in the following table.

Table 3-16   The Settings of NMPI_VERIFY
NMPI_VERIFY Detected Errors
0 No errors are detected.
3 (Default) Errors other than those in the argument assert of the procedure MPI_WIN_FENCE
4 Errors in the argument assert of the procedure MPI_WIN_FENCE, in addition to the errors detected by default
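
For example, checking of the argument assert of the procedure MPI_WIN_FENCE might be enabled in addition to the default checks as follows; the launch command is illustrative only.

$ mpinfort -mpiverify f.f90          # build with the debug assist feature
$ export NMPI_VERIFY=4               # also check the argument assert of MPI_WIN_FENCE
$ mpirun -np 4 ./a.out               # illustrative launch command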

The following table shows the errors that can be detected by the debug assist feature.

Table 3-17 Errors Detected by the Debug Assist Feature
Procedure Target of Checking Condition
All collective procedures Order of invocations Processes in the same communicator, or corresponding to the same window or file handle invoked different MPI collective procedures at the same time.
Procedures with the argument root Argument root The values of the argument root were not the same across processes.
Collective communication procedures Message length (extent of an element * the number of elements transferred) The length of a sent message was not the same as that of the corresponding received message.
Collective communication procedures that perform reduction operations Argument op The values of the argument op (reduction operator) were not the same across processes.
Topology collective procedures Graph information and dimensional information Information of a graph or dimensions specified with arguments was inconsistent across processes.
MPI_COMM_CREATE Argument group The groups specified with the argument group were not the same across processes.
MPI_INTERCOMM_CREATE Arguments local_leader and tag The values of the argument local_leader were not the same across processes in the local communicator, or the values of the argument tag were not the same across the processes corresponding to the argument local_leader or remote_leader.
MPI_INTERCOMM_MERGE Argument high The values of the argument high were not the same across processes.
MPI_FILE_SET_VIEW Arguments etype and datarep The datatypes specified with the argument etype or the data representation specified with the argument datarep were not the same across processes.
MPI_WIN_FENCE Argument assert The values of the argument assert were inconsistent across processes.
Note that this feature involves overhead for checking invocations of MPI collective procedures and can lower performance. Therefore, please rebuild the MPI program without the -mpiverify option once the correct use of collective procedures has been verified.


3.9   Exit Status of an MPI Program

NEC MPI watches the exit statuses of MPI processes to determine whether program execution terminated normally or with an error. Normal termination occurs if and only if every MPI process returns 0 as its exit status; otherwise, error termination occurs.
Therefore, the exit status of each MPI process should be set appropriately so that NEC MPI can recognize the termination status correctly.
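
As a quick check from the shell, the exit status reported by the launcher can be inspected after the run; the launch command below is illustrative, and it is assumed here that the launcher's exit status reflects the termination status determined by NEC MPI.

$ mpirun -np 4 ./a.out               # illustrative launch command
$ echo $?                            # expected to be 0 only when every MPI process returned exit status 0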


3.10   Miscellaneous

This section describes additional notes on using NEC MPI.

  1. In MPI execution, the same version of the MPI library must be linked with all the executable files and shared libraries. The MPI library version of an executable file can be checked in either of the following two ways.

    1. You can obtain the directory path (RUNPATH) of the dynamically linked MPI library with the nreadelf command. The version directory in this path (2.2.0 in the example below) is the version of the MPI library.
      $ /opt/nec/ve/bin/nreadelf -W -d a.out | grep RUNPATH
      0x000000000000001d (RUNPATH) Library runpath: [/opt/nec/ve/mpi/2.2.0/lib64/ve:...]
    2. When an executable file is generated without the option -shared-mpi, you can obtain the version of the statically linked MPI library with the strings and grep commands.

      $ /usr/bin/strings a.out | /bin/grep "library version"
      NEC MPI: library Version 2.2.0 (17. April 2019): Copyright (c) NEC Corporation 2018-2019

    When the option -shared-mpi is not specified at compile and link time (the default -static-mpi behavior), the MPI program is linked dynamically against the MPI memory management library and statically against the other MPI libraries. In this case, you need to use the same version of the MPI library at compile and link time and at runtime. If different versions are used, unexpected behavior may occur because of the version inconsistency between the MPI memory management library, which is linked dynamically at runtime, and the other MPI libraries, which were linked statically at compile and link time.

  2. When users use the extended precision features of the Fortran compiler at compile time of MPI programs written in Fortran, both of the compiler options -fdefault-integer=8 and -fdefault-real=8 must be specified, and other extended precision compiler options must not be specified.

  3. NEC MPI cannot be used in a program in which Fortran procedures that have been compiled with the extended precision compile options and C functions are mixed.

  4. NEC MPI handles signals SIGINT, SIGTERM, and SIGXCPU to appropriately control abnormal terminations of programs. For user programs to handle these signals by themselves, they must call previously defined signal handlers. Otherwise, the proper termination of programs is not guaranteed.

  5. Interfaces in C++ format (C++ bindings), which have been removed in MPI-3, cannot be used in NEC MPI. If they are used, please change them into those in C format (C bindings), or define the macro NMPI_ENABLE_CXX prior to inclusion of the file mpi.h in source programs.

  6. Programming languages used for source programs cannot be specified with the -x compiler option in the MPI compilation commands.

  7. When acquiring the MPI execution performance information by specifying the environment variable NMPI_PROGINF, you need to use the -pthread option to link the MPI program with the pthread library. If -lpthread is specified instead of -pthread, the MPI execution performance information may not be displayed correctly.

  8. By default, the MPI libraries are linked statically except for the MPI memory management library; however, when creating a shared library by specifying the -shared compiler option in the MPI compilation commands, all MPI libraries are linked dynamically. When linking such a shared library, in which all MPI libraries are linked dynamically, into an executable file, specify the option -shared-mpi so that the executable also links all MPI libraries dynamically.

  9. MPI programs require shared system libraries and the shared MPI memory management library. If the compiler option -static is specified in the MPI compilation commands, the option is processed as follows.

    If you want to link an MPI program against static libraries, you can use the linker option -Bstatic, together with the compiler options for linking against static compiler libraries, instead of the compiler option -static. When you use the linker option -Bstatic, surround the libraries with -Wl,-Bstatic and -Wl,-Bdynamic; the surrounded libraries are linked statically. In the following example, libww and libxx are linked statically.

    mpincc a.c -lvv -Wl,-Bstatic -lww -lxx -Wl,-Bdynamic -lyy

    For the compiler options used to link a program against static compiler libraries, please refer to the compiler's manual.

  10. The execution directory of the MPI program needs write permission. If the permission is insufficient, the following warning message may be output and MPI communication performance may be degraded.
    mkstemp: Permission denied
  11. When MPI performance information is used, this facility sends the signal SIGUSR1 to threads in MPI_Init and MPI_Finalize in order to collect performance information. When an MPI program is executed under a debugger, the debugger may capture SIGUSR1 and stop the MPI execution. Also, when non-blocking MPI-IO procedures are used, a POSIX AIO worker thread created for the non-blocking MPI-IO operations does not respond to SIGUSR1, and the MPI execution may stop. In these cases, signal issuance can be suppressed by setting the environment variable VE_PROGINF_USE_SIGNAL=NO. When signal issuance is suppressed, MPI performance information is collected only from the threads of OpenMP and compiler automatic parallelization; values from other threads cannot be collected and are not shown in the performance information items, except for User Time, Real Time, Memory Size Used, and Non Swappable Memory Size Used.
  12. MPI uses HugePages to optimize MPI communications. If MPI cannot allocate HugePages on a host, the following warning message is output and MPI communication may slow down. Configuring HugePages requires system administrator privileges. If the message is output, please refer to the "SX-Aurora TSUBASA Installation Guide" or contact the system administrator for details.

    mpid(0): Allocate_system_v_shared_memory: key = 0x420bf67e, len = 16777216 shmget allocation: Cannot allocate memory
  13. The memlock resource limit needs to be set to "unlimited" for MPI to use InfiniBand communication and HugePages. Because this setting is applied automatically, do not change the memlock resource limit from "unlimited" with the ulimit command or similar means. If the memlock resource limit is not "unlimited", MPI execution may abort or MPI communication may slow down with the following messages.

    libibverbs: Warning: RLIMIT_MEMLOCK is 0 bytes.
    This will severely limit memory registrations.
    [0] MPID_OFED_Open_hca: open device failed ib_dev 0x60100002ead0 name mlx5_0
    [0] Error in Infiniband/OFED initialization. Execution aborts
    mpid(0): Allocate_system_v_shared_memory: key = 0xd34d79c0, len = 16777216
    shmget allocation: Operation not permitted
    Even if the memlock resource limit is set to "unlimited", the following message may be output to the system log. This message is not a problem, and MPI execution works correctly.
    kernel: mpid (20934): Using mlock ulimits for SHM_HUGETLB is deprecated
  14. If a process terminates abnormally during application execution, information related to the cause of the abnormal termination (error details, termination status, etc.) is output together with the universe number and rank number. However, depending on the timing of the abnormal termination, many messages such as the following may be output, making it difficult to find the information related to the cause of the abnormal termination.

    [3] mpisx_sendx: left (abnormally) (rc=-1), sock = -1 len 0 (12)
    Error in send () called by mpisx_sendx: Bad filedescriptor
    In this case, it may be easier to find that information by filtering out the above messages. An example command is shown below.
    $ grep -v mpisx_sendx <outputfile>

