3.1   Compiling and Linking MPI Programs
First, please execute the following command to read a setup script each time you log in to a VH, in order to set up the MPI compilation environment. {version} is the directory name corresponding to the version of NEC MPI you use.
The setting remains in effect until you log out.
(For bash)
$ source /opt/nec/ve/mpi/{version}/bin/necmpivars.sh
(For csh)
% source /opt/nec/ve/mpi/{version}/bin/necmpivars.csh
It is possible to compile and link MPI programs with the MPI compilation commands corresponding to each programming language as follows:
To compile and link MPI programs written in Fortran, please execute the mpinfort/mpifort command as follows:
$ mpinfort [options] {sourcefiles}
To compile and link MPI programs written in C, please execute the mpincc/mpicc command as follows:
$ mpincc [options] {sourcefiles}
To compile and link MPI programs written in C++, please execute the mpinc++/mpic++ command as follows:
$ mpinc++ [options] {sourcefiles}
In the command lines above, {sourcefiles} means MPI program source files, and [options] means optional compiler options.
The NEC MPI compilation commands mpincc/mpicc, mpinc++/mpic++, and mpinfort/mpifort invoke the default versions of the compilers ncc, nc++, and nfort, respectively. If another compiler version must be used, it can be selected with the NEC MPI compilation command option -compiler or with an environment variable. In this case, the compiler version and the NEC MPI version must be selected carefully so that they match each other.
Example: compiling and linking a C program with compiler version 2.x.x.
$ mpincc -compiler /opt/nec/ve/bin/ncc-2.x.x program.c
Table 3-1 The List of NEC MPI Compiler Command Options

Option | Meaning |
---|---|
-mpimsgq | -msgq | Use the MPI message queue facility for the debugger. |
-mpiprof | Use the MPI communication information facility and the MPI profiling interface (MPI procedures with names beginning with PMPI_). Please refer to this section for the MPI communication information facility. |
-mpitrace | Use the MPI procedure tracing facility. The MPI communication information facility and the MPI profiling interface are also available. Please refer to this section for the MPI procedure tracing facility. |
-mpiverify | Use the debug assist feature for MPI collective procedures. The MPI communication information facility and the MPI profiling interface are also available. Please refer to this section for the debug assist feature for MPI collective procedures. |
-ftrace | Use the FTRACE facility for MPI programs. The MPI communication information facility and the MPI profiling interface are also available. Please refer to this section for the FTRACE facility. |
-show | Display the sequence of compiler invocations performed by the MPI compilation command, without actual execution. |
-ve | Compile and link MPI programs to run on VE. (Default) |
-vh | -sh | Compile and link MPI programs to run on VH or Scalar Host. |
-static-mpi | Link against the MPI libraries statically; however, the MPI memory management library is linked dynamically. (Default) |
-shared-mpi | Link against all MPI libraries dynamically. |
-compiler <compiler> | Specify the compiler invoked by the MPI compilation command; the compiler is given after this option, separated by a space. If this option is not specified, each compilation command starts the compiler shown in the tables below. The compilers supported for compiling and linking MPI programs to run on VH or Scalar Host are listed below. See also 2.10 about using the mpi_f08 Fortran module. |

Supported compilers for MPI programs to run on VH or Scalar Host:
- GNU Compiler Collection
  - 4.8.5
  - 8.3.0 and 8.3.1
  - 9.1.0 and compatible versions
- Intel C++ Compiler and Intel Fortran Compiler
  - 19.0.4.243 (Intel Parallel Studio XE 2019 Update 4) and compatible versions
  - 19.1.2.254 (Intel Parallel Studio XE 2020 Update 2)

Compilation Command | Invoked Compiler |
---|---|
mpincc/mpicc | ncc |
mpinc++/mpic++ | nc++ |
mpinfort/mpifort | nfort |

Compilation Command with -vh/-sh | Invoked Compiler |
---|---|
mpincc/mpicc | gcc |
mpinc++/mpic++ | g++ |
mpinfort/mpifort | gfortran |
Table 3-2 The List of Environment Variables of NEC MPI Compiler Commands

Environment Variable | Meaning |
---|---|
NMPI_CC | Change the compiler used to compile and link an MPI program for VE with the mpincc command. |
NMPI_CXX | Change the compiler used to compile and link an MPI program for VE with the mpinc++ command. |
NMPI_FC | Change the compiler used to compile and link an MPI program for VE with the mpinfort command. |
NMPI_CC_H | Change the compiler used to compile and link an MPI program for VH or Scalar Host with the mpincc command. |
NMPI_CXX_H | Change the compiler used to compile and link an MPI program for VH or Scalar Host with the mpinc++ command. |
NMPI_FC_H | Change the compiler used to compile and link an MPI program for VH or Scalar Host with the mpinfort command. |
The environment variables in Table 3-2 are overridden by the -compiler option.
An example for each compiler is shown below.
Example 1: NEC Compiler
$ source /opt/nec/ve/mpi/2.x.x/bin/necmpivars.sh
$ mpincc a.c
$ mpinc++ a.cpp
$ mpinfort a.f90

Example 2: GNU compiler
(Set up the GNU compiler first, e.g. PATH, LD_LIBRARY_PATH)
$ source /opt/nec/ve/mpi/2.x.x/bin/necmpivars.sh
$ mpincc -vh a.c
$ mpinc++ -vh a.cpp
$ mpinfort -vh a.f90

Example 3: Intel compiler
(Set up the Intel compiler first, e.g. PATH, LD_LIBRARY_PATH)
$ source /opt/nec/ve/mpi/2.x.x/bin/necmpivars.sh
$ export NMPI_CC_H=icc
$ export NMPI_CXX_H=icpc
$ export NMPI_FC_H=ifort
$ mpincc -vh a.c
$ mpinc++ -vh a.cpp
$ mpinfort -vh a.f90
3.2   Starting MPI Programs
Before use, please set up your compiler as described in 3.1, and execute the following command to read a setup script each time you log in to a VH, in order to set up the MPI execution environment. {version} is the directory name corresponding to the version of NEC MPI you use.
This setting remains in effect until you log out.
(For bash)
$ source /opt/nec/ve/mpi/{version}/bin/necmpivars.sh
(For csh)
% source /opt/nec/ve/mpi/{version}/bin/necmpivars.csh
By default, the MPI libraries of the same version as the one used for compiling and linking are searched, and the MPI program is dynamically linked against them as needed. Loading the setup script causes the MPI libraries corresponding to the above {version} to be searched instead. Thus, when an MPI program is dynamically linked against all MPI libraries with -shared-mpi, you can switch at runtime to the MPI libraries corresponding to the above {version}.
When -shared-mpi is not specified at compile and link time, the MPI program is dynamically linked against the MPI memory management library and statically linked against the other MPI libraries. The statically linked MPI libraries cannot be changed at runtime.
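As a quick way to see the difference, you can build the same source twice and inspect the dynamic dependencies recorded in each binary with readelf; the source file name a.c and the output file names below are placeholders assumed for this sketch.
$ mpincc a.c -o prog_default.out
$ mpincc -shared-mpi a.c -o prog_shared.out
$ /usr/bin/readelf -W -d prog_shared.out | grep NEEDED    # the dynamically linked MPI libraries appear here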
If you use hybrid execution, which consists of vector processes and scalar processes, execute the command below instead of the one above. By loading the setup script with the command below, the MPI program executed on a VH or a scalar host, in addition to the one executed on VE, is dynamically linked against the MPI libraries corresponding to the {version} below.
{version} is the directory name corresponding to the version of NEC MPI that contains the MPI libraries the MPI program is dynamically linked against. Specify [gnu|intel] as the first argument and [compiler-version] as the second argument, where [compiler-version] is the compiler version used at compile and link time. You can obtain the value of each argument from the RUNPATH of the MPI program. In the example below, the first argument is gnu and the second argument is 9.1.0.
(For bash)
$ source /opt/nec/ve/mpi/{version}/bin/necmpivars.sh [gnu|intel] [compiler-version]
(For csh)
% source /opt/nec/ve/mpi/{version}/bin/necmpivars.csh [gnu|intel] [compiler-version]
$ /usr/bin/readelf -W -d vh.out | grep RUNPATH
 0x000000000000001d (RUNPATH)            Library runpath: [/opt/nec/ve/mpi/2.3.0/lib64/vh/gnu/9.1.0]
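For this binary, the setup script would therefore be sourced as follows; the NEC MPI version directory 2.3.0 is taken from the RUNPATH shown above, so adjust it to your installation.
$ source /opt/nec/ve/mpi/2.3.0/bin/necmpivars.sh gnu 9.1.0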
NEC MPI provides the MPI execution commands mpirun and mpiexec to launch MPI programs. Any of the following command lines is available:
$ mpirun [global-options] [local-options] {MPIexec} [args] [ : [local-options] {MPIexec} [args] ]...
$ mpiexec [global-options] [local-options] {MPIexec} [args] [ : [local-options] {MPIexec} [args] ]...
3.2.1   Specification of Program Execution
The following can be specified as
MPI-execution specification {MPIexec}
in the MPI execution commands:
Specify an MPI executable file {execfile} as follows:
$ mpirun -np 2 {execfile}
Specify a shell script that executes an MPI executable file {execfile} as follows:
$ cat shell.sh
#!/bin/sh
{execfile}
$ mpirun -np 2 ./shell.sh
The explanation above assumes that the Linux binfmt_misc capability has been configured, which is the default in the SX-Aurora TSUBASA software development environment. The configuration of the binfmt_misc capability requires system administrator privileges. Please refer to the "SX-Aurora TSUBASA Installation Guide", or contact the system administrator for details.
Even if the binfmt_misc capability has not been configured, it is possible to execute MPI programs by specifying the MPI-execution specification {MPIexec} as follows.
- The ve_exec command "/opt/nec/ve/bin/ve_exec" and an MPI executable file {execfile}
Specify the ve_exec command "/opt/nec/ve/bin/ve_exec" and an MPI executable file {execfile} as follows:
$ mpirun -np 2 /opt/nec/ve/bin/ve_exec {execfile}
- A shell script that specifies the ve_exec command "/opt/nec/ve/bin/ve_exec" and an MPI executable file {execfile}
Specify a shell script that specifies the ve_exec command "/opt/nec/ve/bin/ve_exec" and an MPI executable file {execfile} as follows:
$ cat shell.sh
#!/bin/sh
/opt/nec/ve/bin/ve_exec {execfile}
$ mpirun -np 2 ./shell.sh
3.2.2   Runtime Options
The term host in runtime options indicates a VH or a VE. Please refer to 3.2.3 "Specification of Hosts" for how to specify hosts.
The following table shows available global options.
Table 3-3 The List of Global Options

Global Option | Meaning |
---|---|
-machinefile | -machine <filename> | A file that describes hosts and the number of processes to be launched. The format is "hostname[:value]" per line. If ":value" is omitted, the default number of processes is 1. |
-configfile <filename> | A file containing runtime options. In the file <filename>, specify one or more option lines. Runtime options and MPI-execution specifications {MPIexec}, such as MPI executable files, are specified on each line. If a line begins with "#", that line is treated as a comment. |
-hosts <host-list> | Comma-separated list of hosts on which MPI processes are launched. When the options -hosts and -hostfile are specified more than once, the hosts specified in each successive option are treated as a continuation of the list of the specified hosts. This option must not be specified together with the option -host, -nn, or -node. |
-hostfile <filename> | Name of a file that specifies hosts on which MPI processes are launched. When the options -hosts and -hostfile are specified more than once, the hosts specified in each successive option are treated as a continuation of the list of the specified hosts. This option must not be specified together with the option -host, -nn, or -node. |
-gvenode | Hosts specified in the options indicate VEs. |
-perhost | -ppn | -N | -npernode | -nnp <value> | MPI processes in groups of the specified number <value> are assigned to respective hosts. The assignment of MPI processes to hosts is performed circularly until every process is assigned to a host. When this option is omitted, the default value is (P+H-1)/H, where P is the total number of MPI processes and H is the number of hosts. |
-max_np <max_np> | Specify the maximum number of MPI processes, including MPI processes dynamically generated at runtime. The default value is the number specified with the -np option. If several -np options are specified, the default value is the sum of the numbers specified with those options. |
-multi | Specify that the MPI program is executed on multiple hosts. Use this option if all MPI processes are generated on a single host at the start of program execution and MPI processes are then generated on other hosts by the MPI dynamic process creation facility, resulting in multiple-host execution. |
-genv <varname> <value> | Pass the environment variable <varname> with the value <value> to all MPI processes. |
-genvall | (Default) Pass all environment variables to all MPI processes, except for the default environment variables set by NQSV in the NQSV request execution. Please refer to "NEC Network Queuing System V (NQSV) User's Guide" for details. |
-genvlist <varname-list> | Comma-separated list of environment variables to be passed to all MPI processes. |
-genvnone | Do not pass any environment variables. |
-gpath <dirname> | Set the PATH environment variable passed to all MPI processes to <dirname>. |
-gumask <mode> | Execute "umask <mode>" for all MPI processes. |
-gwdir <dirname> | Set the working directory in which all MPI processes run to <dirname>. |
-gdb | -debug | Open one debug screen per MPI process and run MPI programs under the gdb debugger. |
-display | -disp <X-server> | X display server for debug screens, in the format "host:display" or "host:display:screen". |
-v | -V | -version | Display the version of NEC MPI and runtime information such as environment variables. |
-h | -help | Display help for the MPI execution commands. |
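The following sketch illustrates how a few of these global options combine; the host names in hosts.txt and the process counts are placeholders assumed for this example, and the machine file is assumed to supply both the hosts and the per-host process counts.
$ cat hosts.txt
host1:4
host2:4
$ mpirun -machinefile hosts.txt -genv OMP_NUM_THREADS 2 ./ve.out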
Only one of the local options in the following table can be specified for each MPI executable file. When all of them are omitted, the host specified in runtime options indicates a VH.
Table 3-4 The List of Local Options

Local Option | Meaning |
---|---|
-ve <first>[-<last>] | The range of VEs on which MPI processes are executed. If this option is specified, the term host in runtime options indicates a VH. In the interactive execution, specify the range of VE numbers; in the NQSV request execution, specify the range of logical VE numbers. <first> indicates the first VE number, and <last> the last VE number. <last> must not be smaller than <first>. When -<last> is omitted, -<first> is assumed to be specified. The specified VEs are the ones attached to the VHs specified immediately before this option in local options or specified in global options. If this option is omitted and no VEs are specified, VE#0 is assumed to be specified. If this option is omitted and neither hosts nor the number of hosts is specified in the NQSV request execution, all VEs assigned by NQSV are assumed to be specified. |
-nve <value> | The number of VEs on which MPI processes are executed. Corresponds to: -ve 0-<value-1>. The specified VEs are the ones attached to the VHs specified immediately before this option in local options or specified in global options. |
-venode | The term host in the options indicates a VE. |
-vh | -sh | Create MPI processes on Vector Hosts or Scalar Hosts. |
-host <host> | One host on which MPI processes are launched. |
-node <hostrange> | The range of hosts on which MPI processes are launched. In the interactive execution, the -venode option also needs to be specified. If the option -hosts, -hostfile, -host, or -nn is specified, this option is ignored. |
-nn <value> | The number of hosts on which MPI processes are launched. This option is available only in the NQSV request execution and can be specified only once for each MPI executable file. If this option is omitted and neither hosts nor the number of hosts is specified in the NQSV request execution, the number of hosts assigned by NQSV is assumed to be specified. If the option -hosts, -hostfile, or -host is specified, this option is ignored. |
-numa <first>[-<last>][,<...>] | The range of NUMA nodes on the VE on which MPI processes are executed. <first> indicates the first NUMA node number, and <last> the last NUMA node number. <last> must not be smaller than <first>. When -<last> is omitted, -<first> is assumed to be specified. |
-nnuma <value> | The number of NUMA nodes on the VE on which MPI processes are executed. Corresponds to: -numa 0-<value-1> |
-c | -n | -np <value> | The total number of processes launched on the corresponding hosts. The specified processes correspond to the hosts specified immediately before this option in local options or specified in global options. When this option is omitted, the default value is 1. |
-env <varname> <value> | Pass the environment variable <varname> with the value <value> to MPI processes. |
-envall | (Default) Pass all environment variables to MPI processes, except the default environment variables set by NQSV in the NQSV request execution. Please refer to "NEC Network Queuing System V (NQSV) User's Guide" for details about the default environment variables. |
-envlist <varname-list> | Comma-separated list of environment variables to be passed. |
-envnone | Do not pass any environment variables. |
-path <dirname> | Set the PATH environment variable passed to MPI processes to <dirname>. |
-umask <mode> | Execute "umask <mode>" for MPI processes. |
-wdir <dirname> | Set the working directory in which MPI processes run to <dirname>. |
-ib_vh_memcpy_send <auto | on | off> | Use VH memory copy on the sender side of a VE process for InfiniBand communication. This option has higher priority than the environment variable NMPI_IB_VH_MEMCPY_SEND. |
 | auto: Use sender-side VH memory copy for InfiniBand communication through the Root Complex. (Default for Intel machines) |
 | on: Use sender-side VH memory copy for InfiniBand communication (independent of the Root Complex). (Default for non-Intel machines) |
 | off: Do not use sender-side VH memory copy for InfiniBand communication. |
-ib_vh_memcpy_recv <auto | on | off> | Use VH memory copy on the receiver side of a VE process for InfiniBand communication. This option has higher priority than the environment variable NMPI_IB_VH_MEMCPY_RECV. |
 | auto: Use receiver-side VH memory copy for InfiniBand communication through the Root Complex. |
 | on: Use receiver-side VH memory copy for InfiniBand communication (independent of the Root Complex). (Default for non-Intel machines) |
 | off: Do not use receiver-side VH memory copy for InfiniBand communication. (Default for Intel machines) |
-dma_vh_memcpy <auto | on | off> | Use VH memory copy for communication between VEs in a VH. This option has higher priority than the environment variable NMPI_DMA_VH_MEMCPY. |
 | auto: Use VH memory copy for communication between VEs in a VH through the Root Complex. (Default) |
 | on: Use VH memory copy for communication between VEs in a VH (independent of the Root Complex). |
 | off: Do not use VH memory copy for communication between VEs in a VH. |
-vh_memcpy <auto | on | off> | Use VH memory copy for InfiniBand communication and for communication between VEs in a VH. This option has higher priority than the environment variable NMPI_VH_MEMCPY. |
 | auto: In the case of InfiniBand communication, sender-side VH memory copy is used if the communication goes through the Root Complex. In the case of communication between VEs in a VH, VH memory copy is used if the communication goes through the Root Complex. |
 | on: VH memory copy is used. |
 | off: VH memory copy is not used. |
 | Note: The options -ib_vh_memcpy_send, -ib_vh_memcpy_recv, and -dma_vh_memcpy have higher priority than this option. |
-vpin | -vpinning | Print information on the CPU IDs assigned to MPI processes on VHs, Scalar Hosts, or NUMA nodes on VEs. This option is valid with the -pin_mode, -cpu_list, -numa, and -nnuma options. |
-pin_mode <consec | spread | consec_rev | spread_rev | scatter | no | none | off> | Specify the method by which the affinity of MPI processes on a VH or Scalar Host is controlled. |
 | consec | spread: Assign the next free CPU IDs to MPI processes. The assignment of CPU IDs starts with CPU ID 0. |
 | consec_rev | spread_rev: Assign the next free CPU IDs (in reverse order) to MPI processes. The assignment of CPU IDs starts with the highest CPU ID. |
 | scatter: Look for the maximal distance to already assigned CPU IDs and assign the next free CPU IDs to MPI processes. |
 | none | off | no: No pinning of MPI processes to CPU IDs. The default pinning mode is 'none'. |
 | Note: (*) Specifying the flag "-pin_mode" disables a preceding "-cpu_list". (*) If the number of free CPU IDs is not sufficient, no CPU ID is assigned to the MPI process. |
-pin_reserve <num-reserved-ids>[H|h] | Specify the number of CPU IDs to be reserved per MPI process on a VH or Scalar Host for the pinning method specified with the flag "-pin_mode". If the optional 'h' or 'H' is added to the number, the CPU IDs of the associated Hyperthreads are also utilized if available. The number of reserved IDs must be greater than 0. The default number is 1. |
-cpu_list | -pin_cpu <first-id>[-<last-id>[-<increment>[-<num-reserved-ids>[H|h]]]][,...] | Specify a comma-separated list of CPU IDs for the processes to be created. <first-id> specifies the CPU ID assigned to the first MPI process on the node. CPU ID <first-id + increment> is assigned to the next MPI process, and so on. <last-id> specifies the last CPU ID that is assigned. <num-reserved-ids> specifies the number of reserved CPU IDs per MPI process for multithreaded applications. If the optional 'h' or 'H' is added to <num-reserved-ids>, the CPU IDs of Hyperthreads are also utilized if available. Default values if not specified: <last-id> = <first-id>, <increment> = 1, <num-reserved-ids> = 1. |
 | Note: (*) Specifying the flag "-cpu_list" disables a preceding "-pin_mode". (*) If the number of free CPU IDs is not sufficient to assign <num-reserved-ids> CPU IDs, no CPU ID is assigned to the MPI process. |
- When all of the options -hosts, -hostfile, -host, -node, and -nn are omitted in the NQSV request execution, all the hosts allocated by NQSV are used.
- The precedence of the options -hosts, -hostfile, -host, -node, and -nn is
-hosts, -hostfile, -host > -nn > -node.
- The following local options have higher priority than the corresponding global options.
Local options : -env, -envall, -envlist, -envnone, -path, -umask, -wdir
Global options : -genv, -genvall, -genvlist, -genvnone, -gpath, -gumask, -gwdir
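The local/global distinction matters mainly in MPMD-style command lines such as the following sketch (executable names, the working directory, and the values are placeholders): the global -genv applies to both executables, while the local -wdir applies only to the second one.
$ mpirun -genv OMP_NUM_THREADS 4 -ve 0 -np 8 ./ve_a.out : -wdir /scratch/work -ve 1 -np 8 ./ve_b.out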
3.2.3   Specification of Hosts
Hosts corresponding to MPI executable files are determined according to the specified runtime options as follows:
When neither the -venode nor the -gvenode option is specified, a host indicates a VH. VHs are specified as shown in the following table.
Table 3-5 Specification of VHs

Execution Method | Format | Description |
---|---|---|
Interactive execution | VH name | The hostname of a VH, which is a host computer. |
NQSV request execution | <first>[-<last>] | <first> is the first logical VH number and <last> the last. To specify one VH, omit -<last>; in particular, specify only <first> in the options -hosts, -hostfile, and -host. <last> must not be smaller than <first>. |
When the -venode or -gvenode option is specified, a host indicates a VE. VEs are specified as shown in the following table.
Please note that the -ve option cannot be specified for an MPI executable file for which the -venode option is specified.
Table 3-6 Specification of VEs

Execution Method | Format | Description |
---|---|---|
Interactive execution | <first>[-<last>][@<VH>] | <first> is the first VE number and <last> the last. <VH> is a VH name; when omitted, the VH on which the MPI execution command has been executed is selected. To specify one VE, omit -<last>; in particular, specify only <first> in the options -hosts, -hostfile, and -host. <last> must not be smaller than <first>. |
NQSV request execution | <first>[-<last>][@<VH>] | <first> is the first logical VE number and <last> the last. <VH> is a logical VH number; when omitted, hosts (VEs) are selected from the ones NQSV allocated. To specify one VE, omit -<last>; in particular, specify only <first> in the options -hosts, -hostfile, and -host. <last> must not be smaller than <first>. |
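As an illustration of the interactive notation above, the following sketch combines -gvenode with -host to launch 8 processes on VE#2 attached to a VH named host1; the host name and process count are placeholders, and the command line illustrates the format rather than a verified invocation.
$ mpirun -gvenode -host 2@host1 -np 8 ./ve.out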
3.2.4   Environment Variables
The following table shows the environment variables whose values users can set.
Environment Variable | Available Value | Meaning |
---|---|---|
NMPI_COMMINF | | Control the display of MPI communication information. To use the MPI communication information facility, you need to generate the MPI program with the option -mpiprof, -mpitrace, -mpiverify, or -ftrace. Please refer to this section for the MPI communication information facility. |
 | NO | (Default) Do not display the communication information. |
 | YES | Display the communication information in the reduced format. |
 | ALL | Display the communication information in the extended format. |
MPICOMMINF | The same as the environment variable NMPI_COMMINF | The same as the environment variable NMPI_COMMINF. If both are specified, the environment variable NMPI_COMMINF takes precedence. |
NMPI_COMMINF_VIEW | | Specify the display format of the aggregated portion of MPI communication information. |
 | VERTICAL | (Default) Aggregate vector processes and scalar processes separately and display them vertically. |
 | HORIZONTAL | Aggregate vector processes and scalar processes separately and display them horizontally. |
 | MERGED | Aggregate and display vector processes and scalar processes together. |
NMPI_PROGINF | | Control the display of runtime performance information of MPI programs. Please refer to this section for runtime performance information of MPI programs. |
 | NO | (Default) Do not display the performance information. |
 | YES | Display the performance information in the reduced format. |
 | ALL | Display the performance information in the extended format. |
 | DETAIL | Display the detailed performance information in the reduced format. |
 | ALL_DETAIL | Display the detailed performance information in the extended format. |
MPIPROGINF | The same as the environment variable NMPI_PROGINF | The same as the environment variable NMPI_PROGINF. If both are specified, the environment variable NMPI_PROGINF takes precedence. |
NMPI_PROGINF_COMPAT | 0 | (Default) The runtime performance information of the MPI program is displayed in the latest format. |
 | 1 | The runtime performance information of the MPI program is displayed in the old format. In this format, the performance item "Non Swappable Memory Size Used", the VE Card Data section, and the location information of the VE where the MPI process is executed are not displayed. |
VE_PROGINF_USE_SIGNAL | YES | (Default) Signals are used for collecting performance information. |
 | NO | Signals are not used for collecting performance information. See this section before using this option. |
VE_PERF_MODE | | Control the HW performance counter set. The MPI performance information outputs the items corresponding to the selected counters. |
 | VECTOR-OP | (Default) Select the set of HW performance counters mainly related to vector operations. |
 | VECTOR-MEM | Select the set of HW performance counters mainly related to vector and memory access. |
NMPI_EXPORT | "<string>" | Space-separated list of the environment variables to be passed to MPI processes. |
MPIEXPORT | The same as the environment variable NMPI_EXPORT | The same as the environment variable NMPI_EXPORT. If both are specified, the environment variable NMPI_EXPORT takes precedence. |
NMPI_SEPSELECT | | To enable this environment variable, the shell script mpisep.sh must also be used. Please refer to this section for details. |
 | 1 | The standard output from each MPI process is saved in a separate file. |
 | 2 | (Default) The standard error output from each MPI process is saved in a separate file. |
 | 3 | The standard output and standard error output from each MPI process are saved in respective separate files. |
 | 4 | The standard output and standard error output from each MPI process are saved in one separate file. |
MPISEPSELECT | The same as the environment variable NMPI_SEPSELECT | The same as the environment variable NMPI_SEPSELECT. If both are specified, the environment variable NMPI_SEPSELECT takes precedence. |
NMPI_VERIFY | | Control error detection of the debug assist feature for MPI collective procedures. To use the feature, you need to generate the MPI program with the option -mpiverify. Please refer to this content for the feature. |
 | 0 | Errors in invocations of MPI collective procedures are not detected. |
 | 3 | (Default) Errors other than those in the argument assert of the procedure MPI_WIN_FENCE are detected. |
 | 4 | Errors in the argument assert of the procedure MPI_WIN_FENCE are detected, in addition to the default errors. |
NMPI_BLOCKLEN0 | OFF | (Default) Blocks with blocklength 0 are not included in the calculation of the values of the lower bound and upper bound of a datatype created by MPI procedures that create derived datatypes and have the argument blocklength. |
 | ON | Blocks with blocklength 0 are also included in the calculation of the values of the lower bound and upper bound of a datatype created by MPI procedures that create derived datatypes and have the argument blocklength. |
MPIBLOCKLEN0 | The same as the environment variable NMPI_BLOCKLEN0 | The same as the environment variable NMPI_BLOCKLEN0. If both are specified, the environment variable NMPI_BLOCKLEN0 takes precedence. |
NMPI_COLLORDER | OFF | (Default) |
 | ON | Canonical order; bracketing independent of process distribution, dependent only on the number of processes. |
MPICOLLORDER | The same as the environment variable NMPI_COLLORDER | The same as the environment variable NMPI_COLLORDER. If both are specified, the environment variable NMPI_COLLORDER takes precedence. |
NMPI_PORT_RANGE | | The range of port numbers NEC MPI uses to accept TCP/IP connections. The default value is 25257:25266. |
NMPI_INTERVAL_CONNECT | | Retry interval in seconds for establishing connections among MPI daemons at the beginning of execution of MPI programs. The default value is 1. |
NMPI_RETRY_CONNECT | | The number of retries for establishing connections among MPI daemons at the beginning of execution of MPI programs. The default value is 2. |
NMPI_LAUNCHER_EXEC | | Full path name of the remote shell that launches MPI daemons. The default value is /usr/bin/ssh. |
NMPI_IB_ADAPTER_NAME | | Comma- or space-separated list of InfiniBand adaptor names NEC MPI uses. This environment variable is available only in the interactive execution. When omitted, NEC MPI automatically selects the optimal ones. |
NMPI_IB_DEFAULT_PKEY | | Partition key for InfiniBand communication. The default value is 0. |
NMPI_IB_FAST_PATH | ON | Use the InfiniBand RDMA fast path feature to transfer eager messages. (Default on Intel machines) Do not set this value if InfiniBand HCA Relaxed Ordering or Adaptive Routing is enabled. |
 | MTU | Limit the message size of the fast path feature to the actual OFED MTU size. Do not set this value if InfiniBand HCA Relaxed Ordering is enabled. |
 | OFF | Do not use the InfiniBand RDMA fast path feature. (Default on non-Intel machines) |
NMPI_IB_VBUF_TOTAL_SIZE | | Size of each InfiniBand communication buffer in bytes. The default value is 12248. |
NMPI_IB_VH_MEMCPY_SEND | AUTO | Use sender-side VH memory copy for InfiniBand communication through the Root Complex. (Default for Intel machines) |
 | ON | Use sender-side VH memory copy for InfiniBand communication (independent of the Root Complex). (Default for non-Intel machines) |
 | OFF | Do not use sender-side VH memory copy for InfiniBand communication. |
NMPI_IB_VH_MEMCPY_RECV | AUTO | Use receiver-side VH memory copy for InfiniBand communication through the Root Complex. |
 | ON | Use receiver-side VH memory copy for InfiniBand communication (independent of the Root Complex). (Default for non-Intel machines) |
 | OFF | Do not use receiver-side VH memory copy for InfiniBand communication. (Default for Intel machines) |
NMPI_DMA_VH_MEMCPY | AUTO | Use VH memory copy for communication between VEs in a VH through the Root Complex. (Default) |
 | ON | Use VH memory copy for communication between VEs in a VH. |
 | OFF | Do not use VH memory copy for communication between VEs in a VH. |
NMPI_VH_MEMCPY | AUTO | In the case of InfiniBand communication, sender-side VH memory copy is used if the communication goes through the Root Complex. In the case of communication between VEs in a VH, VH memory copy is used if the communication goes through the Root Complex. |
 | ON | VH memory copy is used. |
 | OFF | VH memory copy is not used. |
 | | Note: NMPI_IB_VH_MEMCPY_SEND, NMPI_IB_VH_MEMCPY_RECV, and NMPI_DMA_VH_MEMCPY have higher priority than this environment variable. |
NMPI_DMA_RNDV_OVERLAP | ON | In the case of DMA communication, the communication and the calculation can overlap when the buffer is contiguous, its transfer length is 200KB or more, and it is non-blocking point-to-point communication. |
 | OFF | (Default) In the case of DMA communication, the communication and the calculation cannot overlap when the transfer length is 200KB or more and it is non-blocking point-to-point communication. |
 | | Note: Setting NMPI_DMA_RNDV_OVERLAP to ON internally disables the usage of VH memory copy; the value of the environment variable NMPI_DMA_VH_MEMCPY is ignored for non-blocking point-to-point DMA communication. |
NMPI_IB_VH_MEMCPY_THRESHOLD | | Minimal message size for transferring an InfiniBand message to/from VE processes via VH memory. Smaller messages are sent directly, without a copy to/from VH memory. The message size is given in bytes and must be greater than or equal to 0. The default value is 1048576. |
NMPI_IB_VH_MEMCPY_BUFFER_SIZE | | Maximal size of a buffer located in VH memory used to transfer (parts of) an InfiniBand message to/from VE processes. The buffer size is given in bytes and must be at least 8192 bytes. The default value is 1048576. |
NMPI_IB_VH_MEMCPY_SPLIT_THRESHOLD | | Minimal message size for splitting the transfer of InfiniBand messages to/from VE processes via VH memory. The messages are split into nearly equal parts in order to increase the transfer bandwidth. The message size is given in bytes and must be greater than or equal to 0. The default value is 1048576. |
NMPI_IB_VH_MEMCPY_SPLIT_NUM | | Maximal number of parts used to transfer InfiniBand messages to/from VE processes using VH memory. The number must be in the range [1:8]. The default value is 2. |
NMPI_IP_USAGE | | TCP/IP usage if the fast InfiniBand interconnect is not available on an InfiniBand system (for example, if InfiniBand ports are down or no HCA was assigned to a job). |
 | ON or FALLBACK | Use TCP/IP as a fallback for the fast InfiniBand interconnect. |
 | OFF | (Default) Terminate the application if the InfiniBand interconnect is not available on an InfiniBand system. |
NMPI_EXEC_MODE | NECMPI | (Default) Work with NEC MPI runtime options. |
 | INTELMPI | Work with Intel MPI's basic runtime options (see below). |
 | OPENMPI | Work with Open MPI's basic runtime options (see below). |
 | MPICH | Work with MPICH's basic runtime options (see below). |
 | MPISX | Work with MPISX's runtime options. |
NMPI_SHARP_ENABLE | ON | Use SHARP. |
 | OFF | (Default) Do not use SHARP. |
NMPI_SHARP_NODES | <integer> | The minimal number of VE nodes for which SHARP is used if SHARP usage is enabled. (Default: 4) |
NMPI_SHARP_ALLREDUCE_MAX | <integer> | Maximal data size (in bytes) in MPI_Allreduce for which the SHARP API is used. (Default: 64) |
 | UNLIMITED | SHARP is always used. |
NMPI_SHARP_REPORT | ON | Report on MPI communicators using SHARP collective support. |
 | OFF | (Default) No report. |
NMPI_DCT_ENABLE | | Control the usage of InfiniBand DCT (Dynamically Connected Transport Service). Using DCT, the memory usage for InfiniBand communication is reduced. (Note: DCT may affect the performance of InfiniBand communication.) |
 | AUTOMATIC | (Default) DCT is used if the number of MPI processes is equal to or greater than the number specified by the environment variable NMPI_DCT_SELECT_NP. |
 | ON | DCT is always used. |
 | OFF | DCT is not used. |
NMPI_DCT_SELECT_NP | <integer> | The minimal number of MPI processes for which DCT is used if the environment variable NMPI_DCT_ENABLE is set to AUTOMATIC. (Default: 2049) |
NMPI_DCT_NUM_CONNS | <integer> | The number of requested DCT connections. (Default: 4) |
NMPI_COMM_PNODE | | Control the automatic selection of the communication type between logical nodes in the execution under NQSV. |
 | OFF | (Default) Select the communication type automatically based on the logical node. |
 | ON | Select the communication type automatically based on the physical node. |
NMPI_VH_MEMORY_USAGE | | VH memory usage required for MPI application execution. |
 | ON | (Default) VH memory is required. If VH memory is requested and not available, the MPI application is aborted. |
 | OFF or FALLBACK | If VH memory is requested and not available, a possibly slower communication path is used. |
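A small sketch of the NMPI_EXPORT syntax (the variable choices, VE range, and process count are placeholders): the listed variables are passed, space separated, to the MPI processes.
$ export OMP_NUM_THREADS=4
$ export NMPI_EXPORT="OMP_NUM_THREADS VE_LD_LIBRARY_PATH"
$ mpirun -ve 0 -np 8 ./ve.out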
3.2.5   Environment Variables for MPI Process Identification
NEC MPI provides the following environment variables, the values of which are automatically set by NEC MPI, for MPI process identification.
Environment Variable | Value |
---|---|
MPIUNIVERSE | The identification number of the predefined communication universe at the beginning of program execution corresponding to the communicator MPI_COMM_WORLD. |
MPIRANK | The rank of the executing process in the communicator MPI_COMM_WORLD at the beginning of program execution. |
MPISIZE | The total number of processes in the communicator MPI_COMM_WORLD at the beginning of program execution. |
MPINODEID | The logical node number of the node where the MPI process is running. |
MPIVEID | The VE node number of the VE where the MPI process is running. In the execution under NQSV, this is the logical VE node number. If the MPI process is not running on a VE, this variable is not set. |
These environment variables can be referenced whenever MPI programs are running including before the invocation of the procedure MPI_INIT or MPI_INIT_THREAD.
When an MPI program is initiated, there is a predefined communication universe that includes all MPI processes and corresponds to a communicator MPI_COMM_WORLD. The predefined communication universe is assigned the identification number 0.
In a communication universe, each process is assigned a unique integer value called the rank, which is in the range from zero to one less than the number of processes.
If the dynamic process creation facility is used and a set of MPI processes
is dynamically created, a new communication universe corresponding to a new
communicator MPI_COMM_WORLD is created.
Communication universes created at runtime are assigned
consecutive integer identification numbers starting at 1.
In such a case, two or more MPI_COMM_WORLDs can exist at the same time
in a single MPI application.
Therefore, an MPI process can be identified
using a pair of values of MPIUNIVERSE and MPIRANK.
If an MPI program is indirectly initiated with a shell script, these environment variables can also be referenced in the shell script and used, for example, to allow different MPI processes to handle mutually different files. The following shell script makes each MPI process read data from, and store data to, its own files; it is executed as shown below.
#!/bin/sh
INFILE=infile.$MPIUNIVERSE.$MPIRANK
OUTFILE=outfile.$MPIUNIVERSE.$MPIRANK
{MPIexec} < $INFILE > $OUTFILE    # Refer to this clause for {MPIexec}, the MPI-execution specification
exit $?
$ mpirun -np 8 /execdir/mpi.shell
3.2.6   Environment Variables for Other Processors
The environment variables supported by other processors such as the Fortran compiler (nfort), C compiler (ncc), or C++ compiler (nc++) are passed to MPI processes because the runtime option -genvall is enabled by default. In the following example, OMP_NUM_THREADS and VE_LD_LIBRARY_PATH are passed to MPI processes.
#!/bin/sh
#PBS -T necmpi
#PBS -b 2
OMP_NUM_THREADS=8 ; export OMP_NUM_THREADS
VE_LD_LIBRARY_PATH={your shared library path} ; export VE_LD_LIBRARY_PATH
mpirun -node 0-1 -np 2 a.out
3.2.7   Rank Assignment
Ranks are assigned to MPI processes in ascending order, according to the order in which NEC MPI assigns the processes to hosts.
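For example, with a command line like the following sketch (the host names are placeholders), NEC MPI assigns the processes of the first segment to their host first, so they receive ranks 0 to 3, and the processes of the second segment receive ranks 4 to 7; treat this mapping as an illustration of the rule above rather than a guaranteed layout.
$ mpirun -host host1 -ve 0 -np 4 ./ve.out : -host host2 -ve 0 -np 4 ./ve.out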
3.2.8   The Working Directory under NQSV
The working directory in the NQSV request execution
is determined as follows:
3.2.9   Execution with the singularity container
You can execute MPI programs in a Singularity container.
As in the following example, the singularity command is specified as an argument of the mpirun command.
$ mpirun -ve 0 -np 8 /usr/bin/singularity exec --bind /var/opt/nec/ve/veos ./nmpi.sif ./ve.out
3.2.10   Execution Examples
The following examples show how to launch MPI programs on the
SX-Aurora TSUBASA.
$ mpirun -ve 3 -np 4 ./ve.out
$ mpirun -ve 0-7 -np 16 ./ve.out
$ mpirun -hosts host1,host2 -ve 0-1 -np 32 ./ve.out
$ mpirun -host host1 -ve 0-1 -np 16 -host host2 -ve 2-3 -np 16 ./ve.out
$ mpirun -vh -host host1 -np 8 vh.out : -host host1 -ve 0-1 -np 16 ./ve.out
Assignment of MPI processes to VEs and VHs is automatically performed by NQSV, and users can only specify their logical numbers.
The following examples show the content of job script files in the batch job, but the commands in the scripts are available in the interactive job, too.
#PBS -T necmpi
#PBS -b 2
#PBS --venum-lhost=4    # Number of VEs
source /opt/nec/ve/mpi/2.3.0/bin/necmpivars.sh
mpirun -host 0 -ve 0-3 -np 32 ./ve.out

#PBS -T necmpi
#PBS -b 4               # Number of VHs
#PBS --venum-lhost=8    # Number of VEs
#PBS --use-hca=1        # Number of HCAs
source /opt/nec/ve/mpi/2.3.0/bin/necmpivars.sh
mpirun -np 32 ./ve.out

#PBS -T necmpi
#PBS -b 4               # Number of VHs
#PBS --venum-lhost=8    # Number of VEs
#PBS --use-hca=1        # Number of HCAs
source /opt/nec/ve/mpi/2.3.0/bin/necmpivars.sh
mpirun -vh -host 0 -np 1 vh.out : -np 32 ./ve.out
3.3   Standard Output and Standard Error of MPI Programs
To separate output streams from MPI processes, NEC MPI provides the shell script mpisep.sh, which is placed in the path
/opt/nec/ve/bin/.
It is possible to redirect output streams from MPI processes into respectively different files in the current working directory by specifying this script before MPI-execution specification {MPIexec} as shown in the following example. (Please refer to this clause for MPI-execution specification {MPIexec}.)
$ mpirun -np 2 /opt/nec/ve/bin/mpisep.sh {MPIexec}
The destinations of output streams can be specified with the environment variable NMPI_SEPSELECT as shown in the following table, in which uuu is the identification number of the predefined communication universe corresponding to the communicator MPI_COMM_WORLD and rrr is the rank of the executing MPI process in the universe.
NMPI_SEPSELECT | Action |
---|---|
1 | Only the stdout stream from each process is put into the separate file stdout.uuu:rrr. |
2 | (Default) Only the stderr stream from each process is put into the separate file stderr.uuu:rrr. |
3 | The stdout and stderr streams from each process are put into the separate files stdout.uuu:rrr and stderr.uuu:rrr, respectively. |
4 | The stdout and stderr streams from each process are put into one separate file std.uuu:rrr. |
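For example, the following sketch (the process count and program name are placeholders) keeps only the stdout streams, producing files such as stdout.0:0 and stdout.0:1 in the current working directory:
$ export NMPI_SEPSELECT=1
$ mpirun -np 2 /opt/nec/ve/bin/mpisep.sh ./ve.out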
3.4   Runtime Performance of MPI Programs
The performance of MPI programs can be obtained
with the environment variable NMPI_PROGINF.
There are four formats of runtime performance information available in NEC MPI as
follows:
Format | Description |
---|---|
Reduced Format | This format consists of three parts: the Global Data section, in which the maximum, minimum, and average performance of all MPI processes is displayed; the Overall Data section, in which the performance of all MPI processes as a whole is displayed; and the VE Card section, in which the maximum, minimum, and average performance per VE card is displayed. The results of vector processes and scalar processes are output separately. |
Extended Format | The performance of each MPI process is displayed in the ascending order of their ranks in the communicator MPI_COMM_WORLD, after the information in the reduced format. |
Detailed Reduced Format | This format consists of three parts: the Global Data section, in which the maximum, minimum, and average detailed performance of all MPI processes is displayed; the Overall Data section, in which the performance of all MPI processes as a whole is displayed; and the VE Card section, in which the maximum, minimum, and average performance per VE card is displayed. The results of vector processes and scalar processes are output separately. |
Detailed Extended Format | The detailed performance of each MPI process is displayed in the ascending order of their ranks in the communicator MPI_COMM_WORLD, after the information in the detailed reduced format. |

The format of the displayed information can be specified by setting the environment variable NMPI_PROGINF at runtime as shown in the following table.
NMPI_PROGINF | Displayed Information |
---|---|
NO | (Default) No Output |
YES | Reduced Format |
ALL | Extended Format |
DETAIL | Detailed Reduced Format |
ALL_DETAIL | Detailed Extended Format |
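A minimal usage sketch (the executable name, VE range, and process count are placeholders): request the detailed reduced format and run as usual; the report, measured from MPI_Init to MPI_Finalize, is printed at the end of the run, similar to the figure below.
$ export NMPI_PROGINF=DETAIL
$ mpirun -ve 0-1 -np 12 ./ve.out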
MPI Program Information: ======================== Note: It is measured from MPI_Init till MPI_Finalize. [U,R] specifies the Universe and the Process Rank in the Universe. Times are given in seconds. Global Data of 4 Vector processes : Min [U,R] Max [U,R] Average ================================= Real Time (sec) : 25.203 [0,3] 25.490 [0,2] 25.325 User Time (sec) : 199.534 [0,0] 201.477 [0,2] 200.473 Vector Time (sec) : 42.028 [0,2] 42.221 [0,1] 42.104 Inst. Count : 94658554061 [0,1] 96557454164 [0,2] 95606075636 V. Inst. Count : 11589795409 [0,3] 11593360015 [0,0] 11591613166 V. Element Count : 920130095790 [0,3] 920199971948 [0,0] 920161556564 V. Load Element Count : 306457838070 [0,1] 306470712295 [0,3] 306463228635 FLOP Count : 611061870735 [0,3] 611078144683 [0,0] 611070006844 MOPS : 6116.599 [0,2] 6167.214 [0,0] 6142.469 MOPS (Real) : 48346.004 [0,2] 48891.767 [0,3] 48624.070 MFLOPS : 3032.988 [0,2] 3062.528 [0,0] 3048.186 MFLOPS (Real) : 23972.934 [0,2] 24246.003 [0,3] 24129.581 A. V. Length : 79.372 [0,1] 79.391 [0,3] 79.382 V. Op. Ratio (%) : 93.105 [0,2] 93.249 [0,1] 93.177 L1 Cache Miss (sec) : 3.901 [0,0] 4.044 [0,2] 3.983 CPU Port Conf. (sec) : 3.486 [0,1] 3.486 [0,2] 3.486 V. Arith. Exec. (sec) : 15.628 [0,3] 15.646 [0,1] 15.637 V. Load Exec. (sec) : 23.156 [0,2] 23.294 [0,1] 23.225 VLD LLC Hit Element Ratio (%) : 90.954 [0,2] 90.965 [0,1] 90.959 Power Throttling (sec) : 0.000 [0,0] 0.000 [0,0] 0.000 Thermal Throttling (sec) : 0.000 [0,0] 0.000 [0,0] 0.000 Max Active Threads : 8 [0,0] 8 [0,0] 8 Available CPU Cores : 8 [0,0] 8 [0,0] 8 Average CPU Cores Used : 7.904 [0,2] 7.930 [0,3] 7.916 Memory Size Used (MB) : 1616.000 [0,0] 1616.000 [0,0] 1616.000 Non Swappable Memory Size Used (MB) : 115.000 [0,1] 179.000 [0,0] 131.000 Global Data of 8 Scalar processes : Min [U,R] Max [U,R] Average ================================= Real Time (sec) : 25.001 [0,7] 25.010 [0,8] 25.005 User Time (sec) : 199.916 [0,7] 199.920 [0,8] 199.918 Memory Size Used (MB) : 392.000 [0,7] 392.000 [0,8] 392.000 Overall Data of 4 Vector processes ================================== Real Time (sec) : 25.490 User Time (sec) : 801.893 Vector Time (sec) : 168.418 GOPS : 5.009 GOPS (Real) : 157.578 GFLOPS : 3.048 GFLOPS (Real) : 95.890 Memory Size Used (GB) : 6.313 Non Swappable Memory Size Used (GB) : 0.512 Overall Data of 8 Scalar processes ================================== Real Time (sec) : 25.010 User Time (sec) : 1599.344 Memory Size Used (GB) : 3.063 VE Card Data of 2 VEs ===================== Memory Size Used (MB) Min : 3232.000 [node=0,ve=0] Memory Size Used (MB) Max : 3232.000 [node=0,ve=0] Memory Size Used (MB) Avg : 3232.000 Non Swappable Memory Size Used (MB) Min : 230.000 [node=0,ve=1] Non Swappable Memory Size Used (MB) Max : 294.000 [node=0,ve=0] Non Swappable Memory Size Used (MB) Avg : 262.000 Data of Vector Process [0,0] [node=0,ve=0]: ------------------------------------------- Real Time (sec) : 25.216335 User Time (sec) : 199.533916 Vector Time (sec) : 42.127823 Inst. Count : 94780214417 V. Inst. Count : 11593360015 V. Element Count : 920199971948 V. Load Element Count : 306461345333 FLOP Count : 611078144683 MOPS : 6167.214211 MOPS (Real) : 48800.446081 MFLOPS : 3062.527699 MFLOPS (Real) : 24233.424158 A. V. Length : 79.373018 V. Op. Ratio (%) : 93.239965 L1 Cache Miss (sec) : 3.901453 CPU Port Conf. (sec) : 3.485787 V. Arith. Exec. (sec) : 15.642353 V. Load Exec. 
(sec) : 23.274564 VLD LLC Hit Element Ratio (%) : 90.957228 Power Throttling (sec) : 0.000000 Thermal Throttling (sec) : 0.000000 Max Active Threads : 8 Available CPU Cores : 8 Average CPU Cores Used : 7.912883 Memory Size Used (MB) : 1616.000000 Non Swappable Memory Size Used (MB) : 179.000000 ... |
The following table shows the meanings of the items in the Global Data section and the Process section. In the case of a vector process, in addition to the MPI universe number and the MPI rank number in MPI_COMM_WORLD, the hostname or logical node number and the logical VE number are shown in the header of the Process section as the location information of the VE where the MPI process is executed. For scalar processes, only the items marked (*1) are output. Items marked (*2) are output only in the detailed reduced format or detailed extended format. Items marked (*3) are output only in the detailed reduced format or detailed extended format in multi-threaded execution.
Item | Unit | Description |
---|---|---|
Real Time (sec) | second | Elapsed time (*1) |
User Time (sec) | second | User CPU time (*1) |
Vector Time (sec) | second | Vector instruction execution time |
Inst. Count | | The number of executed instructions |
V. Inst. Count | | The number of executed vector instructions |
V. Element Count | | The number of elements processed with vector instructions |
V. Load Element Count | | The number of vector-loaded elements |
FLOP Count | | The number of elements processed with floating-point operations |
MOPS | | The number of million operations divided by the user CPU time |
MOPS (Real) | | The number of million operations divided by the real time |
MFLOPS | | The number of million floating-point operations divided by the user CPU time |
MFLOPS (Real) | | The number of million floating-point operations divided by the real time |
A. V. Length | | Average vector length |
V. Op. Ratio (%) | percent | Vector operation ratio |
L1 Cache Miss (sec) | second | L1 cache miss time |
CPU Port Conf. (sec) | second | CPU port conflict time (*2) |
V. Arith. Exec. (sec) | second | Vector operation execution time (*2) |
V. Load Exec. (sec) | second | Vector load instruction execution time (*2) |
VLD LLC Hit Element Ratio (%) | percent | Ratio of the number of elements loaded from the LLC to the number of elements loaded with vector load instructions (*2) |
Power Throttling (sec) | second | Duration of time the hardware was throttled due to power consumption (*2) |
Thermal Throttling (sec) | second | Duration of time the hardware was throttled due to temperature (*2) |
Max Active Threads | | The maximum number of threads that were active at the same time (*3) |
Available CPU Cores | | The number of CPU cores a process was allowed to use (*3) |
Average CPU Cores Used | | The average number of CPU cores used (*3) |
Memory Size Used (MB) | megabyte | Peak usage of memory (*1) |
Non Swappable Memory Size Used (MB) | megabyte | Peak usage of memory that cannot be swapped out by the Partial Process Swapping function |
The following table shows the meanings of the items in the Overall Data section in the figure above. For scalar processes, only the items marked (*1) are output.
Item | Unit | Description |
---|---|---|
Real Time (sec) | second | The maximum elapsed time of all MPI processes(*1) |
User Time (sec) | second | The sum of the user CPU time of all MPI processes(*1) |
Vector Time (sec) | second | The sum of the vector time of all MPI processes |
GOPS | | The total number of giga operations executed on all MPI processes divided by the total user CPU time of all MPI processes |
GOPS (Real) | | The total number of giga operations executed on all MPI processes divided by the maximum real time of all MPI processes |
GFLOPS | | The total number of giga floating-point operations executed on all MPI processes divided by the total user CPU time of all MPI processes |
GFLOPS (Real) | | The total number of giga floating-point operations executed on all MPI processes divided by the maximum real time of all MPI processes |
Memory Size Used (GB) | gigabyte | The sum of peak usage of memory of all MPI processes(*1) |
Non Swappable Memory Size Used (GB) | gigabyte | The sum of peak usage of memory that cannot be swapped out by Partial Process Swapping function of all MPI processes |
The following table shows the meanings of the items in the VE Card Data section.
Item | Unit | Description |
---|---|---|
Memory Size Used (MB) Min | megabyte | The minimum of peak usage of memory aggregated for each VE card |
Memory Size Used (MB) Max | megabyte | The maximum of peak usage of memory aggregated for each VE card |
Memory Size Used (MB) Avg | megabyte | The average of peak usage of memory aggregated for each VE card |
Non Swappable Memory Size Used (MB) Min | megabyte | The minimum of peak usage of memory that cannot be swapped out by Partial Process Swapping function aggregated for each VE card |
Non Swappable Memory Size Used (MB) Max | megabyte | The maximum of peak usage of memory that cannot be swapped out by Partial Process Swapping function aggregated for each VE card |
Non Swappable Memory Size Used (MB) Avg | megabyte | The average of peak usage of memory that cannot be swapped out by Partial Process Swapping function aggregated for each VE card |
Global Data of 16 Vector processes : Min [U,R] Max [U,R] Average ================================== Real Time (sec) : 123.871 [0,12] 123.875 [0,10] 123.873 User Time (sec) : 123.695 [0,0] 123.770 [0,4] 123.753 Vector Time (sec) : 33.675 [0,8] 40.252 [0,14] 36.871 Inst. Count : 94783046343 [0,8] 120981685418 [0,5] 109351879970 V. Inst. Count : 2341570533 [0,8] 3423410840 [0,0] 2479317774 V. Element Count : 487920413405 [0,15] 762755268183 [0,0] 507278230562 V. Load Element Count : 47201569500 [0,8] 69707680610 [0,0] 49406464759 FLOP Count : 277294180692 [0,15] 434459800790 [0,0] 287678800758 MOPS : 5558.515 [0,8] 8301.366 [0,0] 5863.352 MOPS (Real) : 5546.927 [0,8] 8276.103 [0,0] 5850.278 MFLOPS : 2243.220 [0,15] 3518.072 [0,0] 2327.606 MFLOPS (Real) : 2238.588 [0,13] 3507.366 [0,0] 2322.405 A. V. Length : 197.901 [0,5] 222.806 [0,0] 204.169 V. Op. Ratio (%) : 83.423 [0,5] 90.639 [0,0] 85.109 L1 I-Cache Miss (sec) : 4.009 [0,5] 8.310 [0,0] 5.322 L1 O-Cache Miss (sec) : 11.951 [0,5] 17.844 [0,9] 14.826 L2 Cache Miss (sec) : 7.396 [0,5] 15.794 [0,0] 9.872 FMA Element Count : 106583464050 [0,8] 166445323660 [0,0] 110529497704 Required B/F : 2.258 [0,0] 3.150 [0,5] 2.948 Required Store B/F : 0.914 [0,0] 1.292 [0,5] 1.202 Required Load B/F : 1.344 [0,0] 1.866 [0,6] 1.746 Actual V. Load B/F : 0.223 [0,0] 0.349 [0,14] 0.322 Power Throttling (sec) : 0.000 [0,0] 0.000 [0,0] 0.000 Thermal Throttling (sec) : 0.000 [0,0] 0.000 [0,0] 0.000 Memory Size Used (MB) : 598.000 [0,0] 598.000 [0,0] 598.000 Non Swappable Memory Size Used (MB) : 115.000 [0,1] 179.000 [0,0] 131.000 |
When VE_PERF_MODE is set to VECTOR-MEM, MPI performance information outputs the following items instead of L1 Cache Miss, CPU Port Conf., V. Arith Exec., V. Load Exec. and VLD LLC Hit Element Ratio that are output when VE_PERF_MODE is set to VECTOR-OP or VE_PERF_MODE is unset.
Items marked (*1) are output only in the detailed reduced format or detailed extended format.
For items marked (*2), values over 100 are truncated.
Item | Unit | Description |
---|---|---|
L1 I-Cache Miss (sec) | second | L1 instruction cache miss time |
L1 O-Cache Miss (sec) | second | L1 operand cache miss time |
L2 Cache Miss (sec) | second | L2 cache miss time |
Required B/F | | B/F calculated from bytes specified by load and store instructions (*1) (*2) |
Required Store B/F | | B/F calculated from bytes specified by store instructions (*1) (*2) |
Required Load B/F | | B/F calculated from bytes specified by load instructions (*1) (*2) |
Actual V. Load B/F | | B/F calculated from bytes of actual memory access by vector load instructions (*1) (*2) |
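A minimal sketch of switching the counter set (the executable name, VE number, and process count are placeholders); NMPI_PROGINF is set to DETAIL here because the B/F items above are printed only in the detailed formats:
$ export VE_PERF_MODE=VECTOR-MEM
$ export NMPI_PROGINF=DETAIL
$ mpirun -ve 0 -np 4 ./ve.out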
3.5   MPI Communication Information
NEC MPI provides the facility of displaying MPI communication information.
To use this facility, you need to generate MPI program
with the option -mpiprof, -mpitrace, -mpiverify or -ftrace.
There are two formats of MPI communication information available, as follows:
- Reduced format: The maximum, minimum, and average values of the MPI communication information of all MPI processes are displayed.
- Extended format: The MPI communication information of each MPI process is displayed in the ascending order of their ranks in the communicator MPI_COMM_WORLD, after the information in the reduced format.
The format of the displayed information can be specified by setting the environment variable NMPI_COMMINF at runtime as shown in the following table.
NMPI_COMMINF | Displayed Information |
---|---|
NO | (Default) No Output |
YES | Reduced Format |
ALL | Extended Format |
Also, you can change the view of the reduced format by specifying the environment variable NMPI_COMMINF_VIEW.
NMPI_COMMINF_VIEW | Displayed Information |
---|---|
VERTICAL | (Default) Aggregate vector processes and scalar processes separately and arrange them vertically. Items that correspond only to vector processes are not output in the scalar process part. |
HORIZONTAL | Aggregate vector processes and scalar processes separately and arrange them horizontally. N/A is output in the scalar process part for items that correspond only to vector processes. |
MERGED | Aggregate vector processes and scalar processes together. (V) is output at the end of the line for items that correspond only to vector processes; for such items, only vector processes are aggregated. |
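A minimal sketch of using the facility (the source file, executable name, VE range, and process count are placeholders): build with one of the options that enables it, then select the extended format at run time.
$ mpinfort -mpiprof a.f90 -o ve.out
$ export NMPI_COMMINF=ALL
$ mpirun -ve 0-1 -np 12 ./ve.out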
The following figure is an example of the extended format.
MPI Communication Information of 4 Vector processes --------------------------------------------------- Min [U,R] Max [U,R] Average Real MPI Idle Time (sec) : 9.732 [0,1] 10.178 [0,3] 9.936 User MPI Idle Time (sec) : 9.699 [0,1] 10.153 [0,3] 9.904 Total real MPI Time (sec) : 13.301 [0,0] 13.405 [0,3] 13.374 Send count : 1535 [0,2] 2547 [0,1] 2037 Memory Transfer : 506 [0,3] 2024 [0,0] 1269 DMA Transfer : 0 [0,0] 1012 [0,1] 388 Recv count : 1518 [0,2] 2717 [0,0] 2071 Memory Transfer : 506 [0,2] 2024 [0,1] 1269 DMA Transfer : 0 [0,3] 1012 [0,2] 388 Barrier count : 8361 [0,2] 8653 [0,0] 8507 Bcast count : 818 [0,2] 866 [0,0] 842 Reduce count : 443 [0,0] 443 [0,0] 443 Allreduce count : 1815 [0,2] 1959 [0,0] 1887 Scan count : 0 [0,0] 0 [0,0] 0 Exscan count : 0 [0,0] 0 [0,0] 0 Redscat count : 464 [0,0] 464 [0,0] 464 Redscat_block count : 0 [0,0] 0 [0,0] 0 Gather count : 864 [0,0] 864 [0,0] 864 Gatherv count : 506 [0,0] 506 [0,0] 506 Allgather count : 485 [0,0] 485 [0,0] 485 Allgatherv count : 506 [0,0] 506 [0,0] 506 Scatter count : 485 [0,0] 485 [0,0] 485 Scatterv count : 506 [0,0] 506 [0,0] 506 Alltoall count : 506 [0,0] 506 [0,0] 506 Alltoallv count : 506 [0,0] 506 [0,0] 506 Alltoallw count : 0 [0,0] 0 [0,0] 0 Neighbor Allgather count : 0 [0,0] 0 [0,0] 0 Neighbor Allgatherv count : 0 [0,0] 0 [0,0] 0 Neighbor Alltoall count : 0 [0,0] 0 [0,0] 0 Neighbor Alltoallv count : 0 [0,0] 0 [0,0] 0 Neighbor Alltoallw count : 0 [0,0] 0 [0,0] 0 Number of bytes sent : 528482333 [0,2] 880803843 [0,1] 704643071 Memory Transfer : 176160755 [0,3] 704643020 [0,0] 440401904 DMA Transfer : 0 [0,0] 352321510 [0,1] 132120600 Number of bytes recvd : 528482265 [0,2] 880804523 [0,0] 704643207 Memory Transfer : 176160755 [0,2] 704643020 [0,1] 440401904 DMA Transfer : 0 [0,3] 352321510 [0,2] 132120600 Put count : 0 [0,0] 0 [0,0] 0 Get count : 0 [0,0] 0 [0,0] 0 Accumulate count : 0 [0,0] 0 [0,0] 0 Number of bytes put : 0 [0,0] 0 [0,0] 0 Number of bytes got : 0 [0,0] 0 [0,0] 0 Number of bytes accum : 0 [0,0] 0 [0,0] 0 MPI Communication Information of 8 Scalar processes --------------------------------------------------- Min [U,R] Max [U,R] Average Real MPI Idle Time (sec) : 4.837 [0,6] 5.367 [0,11] 5.002 User MPI Idle Time (sec) : 4.825 [0,6] 5.363 [0,11] 4.992 Total real MPI Time (sec) : 12.336 [0,11] 12.344 [0,5] 12.340 Send count : 1535 [0,4] 1535 [0,4] 1535 Memory Transfer : 506 [0,11] 1518 [0,5] 1328 Recv count : 1518 [0,4] 1518 [0,4] 1518 Memory Transfer : 506 [0,4] 1518 [0,5] 1328 ... Number of bytes accum : 0 [0,0] 0 [0,0] 0 Data of Vector Process [0,0] [node=0,ve=0]: ------------------------------------------- Real MPI Idle Time (sec) : 10.071094 User MPI Idle Time (sec) : 10.032894 Total real MPI Time (sec) : 13.301340 ... |
The following figure is an example of the reduced format with NMPI_COMMINF_VIEW=MERGED.
MPI Communication Information of 4 Vector and 8 Scalar processes
----------------------------------------------------------------
                                   Min [U,R]          Max [U,R]         Average
Real MPI Idle Time (sec)    :        4.860 [0,10]       10.193 [0,3]       6.651
User MPI Idle Time (sec)    :        4.853 [0,10]       10.167 [0,3]       6.635
Total real MPI Time (sec)   :       12.327 [0,4]        13.396 [0,3]      12.679
Send count                  :         1535 [0,2]          2547 [0,1]        1702
  Memory Transfer           :          506 [0,3]          2024 [0,0]        1309
  DMA Transfer              :            0 [0,0]          1012 [0,1]         388 (V)
Recv count                  :         1518 [0,2]          2717 [0,0]        1702
  Memory Transfer           :          506 [0,2]          2024 [0,1]        1309
  DMA Transfer              :            0 [0,3]          1012 [0,2]         388 (V)
...
Number of bytes accum       :            0 [0,0]             0 [0,0]           0
The following table shows the meaning of each item in the MPI communication information. The item "DMA Transfer" is supported only for vector processes.
Item | Unit | Description |
---|---|---|
Real MPI Idle Time | second | Elapsed time for waiting for messages |
User MPI Idle Time | second | User CPU time for waiting for messages |
Total real MPI Time | second | Elapsed time for executing MPI procedures |
Send count | | The number of invocations of point-to-point send procedures |
Memory Transfer | | The number of invocations of point-to-point send procedures that use memory copy |
DMA Transfer | | The number of invocations of point-to-point send procedures that use DMA transfer |
Recv count | | The number of invocations of point-to-point receive procedures |
Memory Transfer | | The number of invocations of point-to-point receive procedures that use memory copy |
DMA Transfer | | The number of invocations of point-to-point receive procedures that use DMA transfer |
Barrier count | | The number of invocations of the procedures MPI_BARRIER and MPI_IBARRIER |
Bcast count | | The number of invocations of the procedures MPI_BCAST and MPI_IBCAST |
Reduce count | | The number of invocations of the procedures MPI_REDUCE and MPI_IREDUCE |
Allreduce count | | The number of invocations of the procedures MPI_ALLREDUCE and MPI_IALLREDUCE |
Scan count | | The number of invocations of the procedures MPI_SCAN and MPI_ISCAN |
Exscan count | | The number of invocations of the procedures MPI_EXSCAN and MPI_IEXSCAN |
Redscat count | | The number of invocations of the procedures MPI_REDUCE_SCATTER and MPI_IREDUCE_SCATTER |
Redscat_block count | | The number of invocations of the procedures MPI_REDUCE_SCATTER_BLOCK and MPI_IREDUCE_SCATTER_BLOCK |
Gather count | | The number of invocations of the procedures MPI_GATHER and MPI_IGATHER |
Gatherv count | | The number of invocations of the procedures MPI_GATHERV and MPI_IGATHERV |
Allgather count | | The number of invocations of the procedures MPI_ALLGATHER and MPI_IALLGATHER |
Allgatherv count | | The number of invocations of the procedures MPI_ALLGATHERV and MPI_IALLGATHERV |
Scatter count | | The number of invocations of the procedures MPI_SCATTER and MPI_ISCATTER |
Scatterv count | | The number of invocations of the procedures MPI_SCATTERV and MPI_ISCATTERV |
Alltoall count | | The number of invocations of the procedures MPI_ALLTOALL and MPI_IALLTOALL |
Alltoallv count | | The number of invocations of the procedures MPI_ALLTOALLV and MPI_IALLTOALLV |
Alltoallw count | | The number of invocations of the procedures MPI_ALLTOALLW and MPI_IALLTOALLW |
Neighbor Allgather count | | The number of invocations of the procedures MPI_NEIGHBOR_ALLGATHER and MPI_INEIGHBOR_ALLGATHER |
Neighbor Allgatherv count | | The number of invocations of the procedures MPI_NEIGHBOR_ALLGATHERV and MPI_INEIGHBOR_ALLGATHERV |
Neighbor Alltoall count | | The number of invocations of the procedures MPI_NEIGHBOR_ALLTOALL and MPI_INEIGHBOR_ALLTOALL |
Neighbor Alltoallv count | | The number of invocations of the procedures MPI_NEIGHBOR_ALLTOALLV and MPI_INEIGHBOR_ALLTOALLV |
Neighbor Alltoallw count | | The number of invocations of the procedures MPI_NEIGHBOR_ALLTOALLW and MPI_INEIGHBOR_ALLTOALLW |
Number of bytes sent | byte | The number of bytes sent by point-to-point send procedures |
Memory Transfer | byte | The number of bytes sent using memory copy by point-to-point send procedures |
DMA Transfer | byte | The number of bytes sent using DMA transfer by point-to-point send procedures |
Number of bytes recvd | byte | The number of bytes received by point-to-point receive procedures |
Memory Transfer | byte | The number of bytes received using memory copy by point-to-point receive procedures |
DMA Transfer | byte | The number of bytes received using DMA transfer by point-to-point receive procedures |
Put count | | The number of invocations of the procedures MPI_PUT and MPI_RPUT |
Memory Transfer | | The number of invocations of the procedures MPI_PUT and MPI_RPUT that use memory copy |
DMA Transfer | | The number of invocations of the procedures MPI_PUT and MPI_RPUT that use DMA transfer |
Get count | | The number of invocations of the procedures MPI_GET and MPI_RGET |
Memory Transfer | | The number of invocations of the procedures MPI_GET and MPI_RGET that use memory copy |
DMA Transfer | | The number of invocations of the procedures MPI_GET and MPI_RGET that use DMA transfer |
Accumulate count | | The number of invocations of the procedures MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP and MPI_COMPARE_AND_SWAP |
Memory Transfer | | The number of invocations of the procedures MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP and MPI_COMPARE_AND_SWAP that use memory copy |
DMA Transfer | | The number of invocations of the procedures MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP and MPI_COMPARE_AND_SWAP that use DMA transfer |
Number of bytes put | byte | The number of bytes put by the procedures MPI_PUT and MPI_RPUT |
Memory Transfer | byte | The number of bytes put using memory copy by the procedures MPI_PUT and MPI_RPUT |
DMA Transfer | byte | The number of bytes put using DMA transfer by the procedures MPI_PUT and MPI_RPUT |
Number of bytes got | byte | The number of bytes got by the procedures MPI_GET and MPI_RGET |
Memory Transfer | byte | The number of bytes got using memory copy by the procedures MPI_GET and MPI_RGET |
DMA Transfer | byte | The number of bytes got using DMA transfer by the procedures MPI_GET and MPI_RGET |
Number of bytes accum | byte | The number of bytes accumulated by the procedures MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP and MPI_COMPARE_AND_SWAP |
Memory Transfer | byte | The number of bytes accumulated using memory copy by the procedures MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP and MPI_COMPARE_AND_SWAP |
DMA Transfer | byte | The number of bytes accumulated using DMA transfer by the procedures MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP and MPI_COMPARE_AND_SWAP |
3.6   FTRACE Facility
The FTRACE facility enables users to obtain detailed performance information, including MPI communication information, for each procedure and each specified execution region of a program on each MPI process. Please refer to the "PROGINF / FTRACE User's Guide" for details.
Note that the FTRACE facility is available only for programs executed on VE.
The following table shows the MPI communication information displayed with the FTRACE facility.
Table 3-15 MPI Communication Information Displayed with the FTRACE Facility

Item | Unit | Meaning |
---|---|---|
ELAPSE | second | Elapsed time |
COMM.TIME | second | Elapsed time for executing MPI procedures |
COMM.TIME / ELAPSE | | The ratio of the elapsed time for executing MPI procedures to the elapsed time of each process |
IDLE TIME | second | Elapsed time for waiting for messages |
IDLE TIME / ELAPSE | | The ratio of the elapsed time for waiting for messages to the elapsed time of each process |
AVER.LEN | byte | Average amount of communication per MPI procedure |
COUNT | | Total number of transfers by MPI procedures |
TOTAL LEN | byte | Total amount of communication by MPI procedures |
The steps for using the FTRACE facility are as follows:

1. Compile and link the MPI program with the -ftrace option.

$ mpincc -ftrace mpi.c
$ mpinfort -ftrace mpifort.f90

2. Execute the MPI program. Analysis information files (ftrace.out.*) are generated.

3. Display the performance information with the ftrace command, specifying the analysis information files.

$ ftrace -all -f ftrace.out.0.0 ftrace.out.0.1
$ ftrace -f ftrace.out.*
The following figure shows an example of the information displayed by the FTRACE facility.
*----------------------*  FTRACE ANALYSIS LIST  *----------------------*

Execution Date : Sat Feb 17 12:44:49 2018 JST
Total CPU Time : 0:03'24"569 (204.569 sec.)

FREQUENCY  EXCLUSIVE       AVER.TIME    MOPS    MFLOPS  V.OP  AVER.  VECTOR  L1CACHE .... PROC.NAME
           TIME[sec](  % )    [msec]                    RATIO V.LEN    TIME     MISS

     1012    49.093( 24.0)    48.511  23317.2  14001.4  96.97  83.2  42.132    5.511      funcA
   160640    37.475( 18.3)     0.233  17874.6   9985.9  95.22  52.2  34.223    1.973      funcB
   160640    30.515( 14.9)     0.190  22141.8  12263.7  95.50  52.8  29.272    0.191      funcC
   160640    23.434( 11.5)     0.146  44919.9  22923.2  97.75  98.5  21.869    0.741      funcD
   160640    22.462( 11.0)     0.140  42924.5  21989.6  97.73  99.4  20.951    1.212      funcE
 53562928    15.371(  7.5)     0.000   1819.0    742.2   0.00   0.0   0.000    1.253      funcG
        8    14.266(  7.0)  1783.201   1077.3     55.7   0.00   0.0   0.000    4.480      funcH
   642560     5.641(  2.8)     0.009    487.7      0.2  46.45  35.1   1.833    1.609      funcF
     2032     2.477(  1.2)     1.219    667.1      0.0  89.97  28.5   2.218    0.041      funcI
        8     1.971(  1.0)   246.398  21586.7   7823.4  96.21  79.6   1.650    0.271      funcJ
------------------------------------------------------------------------------------- .... -----------
 54851346   204.569(100.0)     0.004  22508.5  12210.7  95.64  76.5 154.524   17.740      total

ELAPSED    COMM.TIME  COMM.TIME  IDLE TIME  IDLE TIME  AVER.LEN     COUNT  TOTAL LEN  PROC.NAME
TIME[sec]      [sec]  / ELAPSED      [sec]  / ELAPSED    [byte]                [byte]

   12.444      0.000      0.000        0.0          0        0.0  funcA
    9.420      0.000      0.000        0.0          0        0.0  funcB
    7.946      0.000      0.000        0.0          0        0.0  funcG
    7.688      0.000      0.000        0.0          0        0.0  funcC
    7.372      0.000      0.000        0.0          0        0.0  funcH
    5.897      0.000      0.000        0.0          0        0.0  funcD
    5.653      0.000      0.000        0.0          0        0.0  funcE
    1.699      1.475      0.756       3.1K     642560       1.9G  funcF
    1.073      1.054      0.987       1.0M       4064       4.0G  funcI
    0.704      0.045      0.045       80.0          4      320.0  funcK
------------------------------------------------------------------------------------------------------

FREQUENCY  EXCLUSIVE       AVER.TIME    MOPS    MFLOPS  V.OP  AVER.  VECTOR  L1CACHE .... PROC.NAME
           TIME[sec](  % )    [msec]                    RATIO V.LEN    TIME     MISS

     1012    49.093( 24.0)    48.511  23317.2  14001.4  96.97  83.2  42.132    5.511      funcA
      253    12.089           47.784  23666.9  14215.9  97.00  83.2  10.431    1.352      0.0
      253    12.442           49.177  23009.2  13811.8  96.93  83.2  10.617    1.406      0.1
      253    12.118           47.899  23607.4  14180.5  97.00  83.2  10.463    1.349      0.2
      253    12.444           49.185  23002.8  13808.2  96.93  83.2  10.622    1.404      0.3
...
------------------------------------------------------------------------------------- .... ----------
 54851346   204.569(100.0)     0.004  22508.5  12210.7  95.64  76.5 154.524   17.740      total

ELAPSED    COMM.TIME  COMM.TIME  IDLE TIME  IDLE TIME  AVER.LEN     COUNT  TOTAL LEN  PROC.NAME
TIME[sec]      [sec]  / ELAPSED      [sec]  / ELAPSED    [byte]                [byte]

   12.444      0.000      0.000        0.0          0        0.0  funcA
   12.090      0.000      0.000      0.000      0.000       0.0         0        0.0   0.0
   12.442      0.000      0.000      0.000      0.000       0.0         0        0.0   0.1
   12.119      0.000      0.000      0.000      0.000       0.0         0        0.0   0.2
   12.444      0.000      0.000      0.000      0.000       0.0         0        0.0   0.3
3.7   MPI Procedures Tracing Facility
NEC MPI provides a facility that traces invocations of and returns from MPI procedures and outputs the progress of each MPI process to the standard output.
This tracing facility makes it easy to see where a program is executing and to debug it.
To use this facility, please compile and link the MPI program with the -mpitrace option.
Note that the amount of trace output can be huge if a program calls MPI procedures many times.
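For example, tracing could be enabled when building a C program as follows (the source file name mpi.c is illustrative); the trace is then written to the standard output when the program is executed:
$ mpincc -mpitrace mpi.c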
3.8   Debug Assist Feature for MPI Collective Procedures
The debug assist feature for MPI collective procedures assists users in debugging invocations of MPI collective procedures by detecting incorrect uses across processes and outputting the detected errors in detail to the standard error output. The incorrect uses that can be detected are listed in Table 3-17 below.
To use this feature, please compile and link the MPI program with the -mpiverify option as follows:
$ mpinfort -mpiverify f.f90
When an error is detected, a message such as the following, describing the detected inconsistency, is output to the standard error output.
VERIFY MPI_Bcast(3): root 2 inconsistent with root 1 of 0
The errors to be detected can be specified by setting the environment variable NMPI_VERIFY at runtime as shown in the following table.
NMPI_VERIFY | Detected Errors |
---|---|
0 | No errors are detected. |
3 | (Default) Errors other than those in the argument assert of the procedure MPI_WIN_FENCE |
4 | Errors in the argument assert of the procedure MPI_WIN_FENCE, in addition to the errors detected by default |
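For example, assuming a bash shell, checking of the argument assert of the procedure MPI_WIN_FENCE can be enabled in addition to the default checks as follows:
$ export NMPI_VERIFY=4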
The following table shows the errors that can be detected by the debug assist feature.
Note that this feature involves overhead for checking invocations of MPI collective procedures and can degrade performance. Therefore, please rebuild the MPI program without the -mpiverify option once the correct use of collective procedures has been verified.
Table 3-17 Errors Detected by the Debug Assist Feature

Procedure | Target of Checking | Condition |
---|---|---|
All collective procedures | Order of invocations | Processes in the same communicator, or corresponding to the same window or file handle, invoked different MPI collective procedures at the same time. |
Procedures with the argument root | Argument root | The values of the argument root were not the same across processes. |
Collective communication procedures | Message length (extent of an element * the number of elements transferred) | The length of a sent message was not the same as that of the corresponding received message. |
Collective communication procedures that perform reduction operations | Argument op | The values of the argument op (reduction operator) were not the same across processes. |
Topology collective procedures | Graph information and dimensional information | Information of a graph or dimensions specified with arguments was inconsistent across processes. |
MPI_COMM_CREATE | Argument group | The groups specified with the argument group were not the same across processes. |
MPI_INTERCOMM_CREATE | Arguments local_leader and tag | The values of the argument local_leader were not the same across processes in the local communicator, or the values of the argument tag were not the same across the processes corresponding to the argument local_leader or remote_leader. |
MPI_INTERCOMM_MERGE | Argument high | The values of the argument high were not the same across processes. |
MPI_FILE_SET_VIEW | Arguments etype and datarep | The datatypes specified with the argument etype or the data representation specified with the argument datarep were not the same across processes. |
MPI_WIN_FENCE | Argument assert | The values of the argument assert were inconsistent across processes. |
3.9   Exit Status of an MPI Program
NEC MPI watches the exit statuses of MPI processes to determine whether program execution terminated normally or with an error. Normal termination occurs if and only if every MPI process returns 0 as its exit status; otherwise, error termination occurs. Therefore, the termination status of program execution should be specified as follows so that NEC MPI recognizes it correctly.
#!/bin/sh
{MPIexec}    # MPI-execution specification (Launch of MPI processes: Refer to this clause)
RC=$?        # holds the exit status
command      # non-MPI program/command
exit $RC     # specify the exit code
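As a concrete illustration, the template above might be filled in as follows (a minimal sketch; the launcher invocation, process count, and post-processing command are assumptions for illustration, not prescriptions):
#!/bin/sh
mpirun -np 4 ./a.out    # launch of MPI processes (assumed invocation)
RC=$?                   # hold the exit status of the MPI execution
./postprocess.sh        # hypothetical non-MPI command executed afterwards
exit $RC                # return the saved exit status so NEC MPI recognizes the termination status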
3.10   Miscellaneous
This section describes additional notes on NEC MPI.
You can check which version of NEC MPI an executable file was linked with, for example by inspecting its RUNPATH and the embedded library version string as follows:
$ /opt/nec/ve/bin/nreadelf -W -d a.out | grep RUNPATH
0x000000000000001d (RUNPATH) Library runpath: [/opt/nec/ve/mpi/2.2.0/lib64/ve:...]
$ /usr/bin/strings a.out | /bin/grep "library version"
NEC MPI: library Version 2.2.0 (17. April 2019): Copyright (c) NEC Corporation 2018-2019
If you want to link an MPI program against static libraries, you can use the linker option -Bstatic together with the compiler options for linking against static compiler libraries, instead of the compiler option -static. When you use the linker option -Bstatic, surround the libraries with -Wl,-Bstatic and -Wl,-Bdynamic; the surrounded libraries are linked statically. In the following example, libww and libxx are linked statically.
$ mpincc a.c -lvv -Wl,-Bstatic -lww -lxx -Wl,-Bdynamic -lyy
For the compiler options used to link a program against static compiler libraries, please refer to the compiler's manual.
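To confirm which libraries remain dynamically linked after such a build, the dynamic section of the executable could be inspected, for example with the nreadelf command shown above (the executable name a.out is illustrative):
$ /opt/nec/ve/bin/nreadelf -W -d a.out | grep NEEDED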
NEC MPI uses HugePages to optimize MPI communication. If MPI cannot allocate HugePages on a host, warning messages such as the following are output and MPI communication may slow down. The configuration of HugePages requires system administrator privileges. If such messages are output, please refer to the "SX-Aurora TSUBASA Installation Guide", or contact the system administrator for details.
mkstemp: Permission denied
mpid(0): Allocate_system_v_shared_memory: key = 0x420bf67e, len = 16777216
shmget allocation: Cannot allocate memory
The memlock resource limit needs to be set to "unlimited" for MPI to use InfiniBand communication and HugePages. Because this setting is applied automatically, do not change the memlock resource limit from "unlimited" with the ulimit command or similar means. If the memlock resource limit is not "unlimited", MPI execution may abort or MPI communication may slow down with the following messages.
libibverbs: Warning: RLIMIT_MEMLOCK is 0 bytes.
This will severely limit memory registrations.
[0] MPID_OFED_Open_hca: open device failed ib_dev 0x60100002ead0 name mlx5_0
[0] Error in Infiniband/OFED initialization. Execution aborts
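The current memlock limit on a host can be checked, for example, with the shell's ulimit built-in; it should report "unlimited":
$ ulimit -l
unlimited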
Even if the memlock resource limit is set to "unlimited", the following messages may be output to the system log. These messages are not a problem, and MPI execution works correctly.
mpid(0): Allocate_system_v_shared_memory: key = 0xd34d79c0, len = 16777216
shmget allocation: Operation not permitted
kernel: mpid (20934): Using mlock ulimits for SHM_HUGETLB is deprecated
If a process terminates abnormally during application execution, information related to the cause of the abnormal termination (error details, termination status, etc.) is output together with the universe number and rank number. However, depending on the timing of the abnormal termination, many messages such as the following may be output, making it difficult to find the information related to the cause.
[3] mpisx_sendx: left (abnormally) (rc=-1), sock = -1 len 0 (12)
Error in send () called by mpisx_sendx: Bad filedescriptor
In this case, it may be easier to find the relevant information by filtering out these messages. An example command is shown below.
$ grep -v mpisx_sendx <outputfile>
When an MPI program is executed on Model A412-8 or B401-8 using an NQSV request that requests multiple logical nodes, the NQSV option --use-hca needs to be set to the number of available HCAs so that NEC MPI can select appropriate HCAs. Otherwise, the following error may occur at the end of MPI execution.
mpid(1): error in proxy: Resource temporarily unavailable