Chapter 3


Operating Procedures

This chapter explains how to use NEC MPI, including how to compile, link, and execute MPI programs.


3.1   Compiling and Linking MPI Programs

First, please execute the following command to read the setup script each time you log in to a VH, in order to set up the MPI compilation environment. {version} is the directory name corresponding to the version of NEC MPI you use. The setting remains effective until you log out.
(For bash)
$ source /opt/nec/ve/mpi/{version}/bin/necmpivars.sh
(For VE30: $ source /opt/nec/ve3/mpi/{version}/bin/necmpivars.sh)

(For csh)
% source /opt/nec/ve/mpi/{version}/bin/necmpivars.csh
(For VE30: % source /opt/nec/ve3/mpi/{version}/bin/necmpivars.csh)

MPI programs can be compiled and linked with the MPI compilation command corresponding to each programming language, as follows:

To compile and link MPI programs written in Fortran, please execute the mpinfort/mpifort command as follows:

$ mpinfort [options] {sourcefiles}

To compile and link MPI programs written in C, please execute the mpincc/mpicc command as follows:

$ mpincc [options] {sourcefiles}

To compile and link MPI programs written in C++, please execute the mpinc++/mpic++ command as follows:

$ mpinc++ [options] {sourcefiles}

In the command lines above, {sourcefiles} means MPI program source files, and [options] means optional compiler options.
In addition to the compiler options provided by the Fortran compiler (nfort), C compiler (ncc), or C++ compiler (nc++), the NEC MPI compiler options in the following table are available.

The NEC MPI compilation commands mpincc/mpicc, mpinc++/mpic++, and mpinfort/mpifort invoke the default versions of the compilers ncc, nc++, and nfort, respectively. If another compiler version must be used, it can be selected with the NEC MPI compilation command option -compiler or with an environment variable. In this case, the compiler version and the NEC MPI version must be chosen carefully so that they match each other.

Example: using compiler version 2.x.x to compile and link a C program:

$ mpincc -compiler /opt/nec/ve/bin/ncc-2.x.x program.c

Table 3-1 The List of NEC MPI Compiler Commands Options
Option Meaning
-mpimsgq | -msgq Use the MPI message queue facility for the Debugger
-mpiprof Use the MPI communication information facility and the MPI profiling interface (MPI procedures with names beginning with PMPI_). Please refer to this section for the MPI communication information facility.
-mpitrace Use the MPI procedures tracing facility. The MPI communication information facility and MPI profiling interface are also available. Please refer to this section for the MPI procedures tracing facility.
-mpiverify Use the debug assist feature for MPI collective procedures. The MPI communication information facility and MPI profiling interface are also available. Please refer to this section for the debug assist feature for MPI collective procedures.
-ftrace Use the FTRACE facility for MPI program. The MPI communication information facility and MPI profiling interface are also available. Please refer to this section for the FTRACE facility.
-show Display the sequence of compiler execution invoked by the MPI compilation command without actual execution
-ve Compile and link MPI programs to run on VE (default)
-vh | -sh Compile and link MPI programs to run on VH or Scalar Host
-static-mpi Link against MPI libraries statically; the MPI memory management library is still linked dynamically (default)
-shared-mpi Link against all MPI libraries dynamically
-compiler <compiler> Specify the compiler invoked by the MPI compilation command, separated from this option by a space. If this option is not specified, each compilation command starts the compiler shown in the tables below. The following compilers are supported for compiling and linking MPI programs to run on VH or Scalar Host.
  • GNU Compiler Collection
    • 4.8.5
    • 8.3.0 and 8.3.1
    • 8.4.0 and 8.4.1
    • 8.5.0
    • 9.1.0 and compatible version
  • Intel C++ Compiler and Intel Fortran Compiler
    • 19.0.4.243 (Intel Parallel Studio XE 2019 Update 4) and compatible version
    • 19.1.2.254 (Intel Parallel Studio XE 2020 Update 2)
  • NVIDIA Cuda compiler
    • 11.1
    • 11.8
  • NVIDIA HPC SDK compiler
    • 22.7
See also 2.10 about using the mpi_f08 Fortran module.
Compilation Command Invoked Compiler
mpincc/mpicc ncc
mpinc++/mpic++ nc++
mpinfort/mpifort nfort
Compilation Command with -vh/-sh Invoked Compiler
mpincc/mpicc gcc
mpinc++/mpic++ g++
mpinfort/mpifort gfortran
-compiler_host <compiler> For a VH or scalar host, if the compiler specified with the -compiler option is neither a GNU nor an Intel compiler (for example, nvcc for CUDA), this option must specify a GNU or Intel compiler compatible with the compiler specified by the -compiler option. If that GNU or Intel compiler is identical to the default one for NEC MPI (see the -compiler option above), this option can be omitted.
-mpifp16 <binary16|bfloat16> Assume that the MPI primitive data types NEC_MPI_FLOAT16 and MPI_REAL2 use the format specified with this option, regardless of the floating-point binary format option -mfp16-format. The default is binary16 if the -mfp16-format option is omitted.
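As an illustration of the options above, a hypothetical Fortran source file myprog.f90 could be built with the MPI communication information facility enabled, and the invoked compiler sequence could be checked without actual compilation, as follows (the file and program names are placeholders):

$ mpinfort -mpiprof -o myprog myprog.f90
$ mpinfort -show myprog.f90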
Table 3-2 The List of Environment Variables of NEC MPI Compiler Commands
Environment Variable Meaning
NMPI_CC Specify the compiler used by the mpincc command to compile and link MPI programs that run on VE.
NMPI_CXX Specify the compiler used by the mpinc++ command to compile and link MPI programs that run on VE.
NMPI_FC Specify the compiler used by the mpinfort command to compile and link MPI programs that run on VE.
NMPI_CC_H Specify the compiler used by the mpincc command to compile and link MPI programs that run on VH or Scalar Host.
NMPI_CXX_H Specify the compiler used by the mpinc++ command to compile and link MPI programs that run on VH or Scalar Host.
NMPI_FC_H Specify the compiler used by the mpinfort command to compile and link MPI programs that run on VH or Scalar Host.

The environment variables in Table 3-2 are overridden by the -compiler option.
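For example, a specific VE compiler version can also be selected with the environment variable instead of the -compiler option; in the following sketch the version number 3.x.x and the installation path are illustrative:

$ export NMPI_CC=/opt/nec/ve/bin/ncc-3.x.x
$ mpincc a.c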

An example for each compiler is shown below.

example1: NEC Compiler

$ source /opt/nec/ve/mpi/3.x.x/bin/necmpivars.sh
(For VE30: $ source /opt/nec/ve3/mpi/3.x.x/bin/necmpivars.sh)
$ mpincc a.c
$ mpinc++ a.cpp
$ mpinfort a.f90
example2: GNU compiler
(set up the GNU compiler environment, e.g. PATH and LD_LIBRARY_PATH)
$ source /opt/nec/ve/mpi/3.x.x/bin/necmpivars.sh
(For VE30: $ source /opt/nec/ve3/mpi/3.x.x/bin/necmpivars.sh)
$ mpincc -vh a.c
$ mpinc++ -vh a.cpp
$ mpinfort -vh a.f90
example3: Intel compiler
(set up the Intel compiler environment, e.g. PATH and LD_LIBRARY_PATH)
$ source /opt/nec/ve/mpi/3.x.x/bin/necmpivars.sh
(For VE30: $ source /opt/nec/ve3/mpi/3.x.x/bin/necmpivars.sh)
$ export NMPI_CC_H=icc
$ export NMPI_CXX_H=icpc
$ export NMPI_FC_H=ifort
$ mpincc -vh a.c
$ mpinc++ -vh a.cpp
$ mpinfort -vh a.f90
example4: NVIDIA HPC SDK compiler
(set up the NVIDIA HPC SDK compiler environment, e.g. PATH and LD_LIBRARY_PATH)
$ source /opt/nec/ve/mpi/3.x.x/bin/necmpivars.sh
(For VE30: $ source /opt/nec/ve3/mpi/3.x.x/bin/necmpivars.sh)
$ export NMPI_CC_H=nvc
$ export NMPI_CXX_H=nvc++
$ export NMPI_FC_H=nvfortran
$ mpincc -vh a.c
$ mpinc++ -vh a.cpp
$ mpinfort -vh a.f90

If an MPI process running on a VH or a scalar host uses VEO or CUDA features, programs can be compiled and linked as in the sketch below.
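The following is a sketch only: it builds a CUDA-using program for a VH by selecting nvcc with the -compiler option and a compatible GNU host compiler with the -compiler_host option. The source file name cuda_prog.cu is a placeholder, and the exact compiler names and additional libraries depend on your environment.

$ mpincc -vh -compiler nvcc -compiler_host gcc cuda_prog.cu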


3.2   Starting MPI Programs

Before use, please set up your compiler referring to 3.1, and execute the following command to read the setup script each time you log in to a VH, in order to set up the MPI execution environment. {version} is the directory name corresponding to the version of NEC MPI you use. This setting remains effective until you log out.
(For bash)
$ source /opt/nec/ve/mpi/{version}/bin/necmpivars.sh
(For VE30: $ source /opt/nec/ve3/mpi/{version}/bin/necmpivars.sh)

(For csh)
% source /opt/nec/ve/mpi/{version}/bin/necmpivars.csh
(For VE30: % source /opt/nec/ve3/mpi/{version}/bin/necmpivars.csh)

By default, the MPI libraries of the same version as the one used for compiling and linking are searched, and the MPI program is dynamically linked against them as needed. Loading the setup script makes the MPI libraries corresponding to the {version} above the ones that are searched.
Therefore, when an MPI program has been dynamically linked against all MPI libraries with -shared-mpi, you can switch at runtime to the MPI libraries corresponding to the {version} above.

When -shared-mpi is not specified at compile and link time, the MPI program is dynamically linked against the MPI memory management library and statically linked against the other MPI libraries. The statically linked MPI libraries cannot be changed at runtime.

If you use hybrid execution, which consists of vector processes and scalar processes, execute the command below instead of the one above. By loading the setup script with the command below, the MPI program executed on a VH or a scalar host, in addition to the one executed on VE, is also dynamically linked against the MPI libraries corresponding to the {version} below.

(For bash)
$ source /opt/nec/ve/mpi/{version}/bin/necmpivars.sh [gnu|intel] [compiler-version]
(For VE30: $ source /opt/nec/ve3/mpi/{version}/bin/necmpivars.sh [gnu|intel] [compiler-version])

(For csh)
% source /opt/nec/ve/mpi/{version}/bin/necmpivars.csh [gnu|intel] [compiler-version]
(For VE30: % source /opt/nec/ve3/mpi/{version}/bin/necmpivars.csh [gnu|intel] [compiler-version])
The {version} is the directory name corresponding to the version of NEC MPI that contains the MPI libraries the MPI program is dynamically linked against. [gnu|intel] is specified as the first argument, and [compiler-version] as the second argument. [compiler-version] is the compiler version used at compile and link time. You can obtain the value of each argument from the RUNPATH of the MPI program. In the example below, the first argument is gnu and the second argument is 9.1.0, both taken from the RUNPATH.
$ /usr/bin/readelf -W -d vh.out | grep RUNPATH
0x000000000000001d (RUNPATH) Library runpath: [/opt/nec/ve/mpi/2.3.0/lib64/vh/gnu/9.1.0]
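With the RUNPATH above, the setup script would be loaded as follows:

$ source /opt/nec/ve/mpi/2.3.0/bin/necmpivars.sh gnu 9.1.0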

NEC MPI provides the MPI execution commands mpirun and mpiexec to launch MPI programs. Any of the following command lines is available:

$ mpirun [global-options] [local-options] {MPIexec} [args] [ : [local-options] {MPIexec} [args] ]...
$ mpiexec [global-options] [local-options] {MPIexec} [args] [ : [local-options] {MPIexec} [args] ]...

The MPI execution commands support executing MPI programs linked with MPI libraries whose version is the same as or older than that of the commands.

If you use the MPI execution commands located in the system standard path /opt/nec/ve/bin, load necmpivars.sh or necmpivars.csh before executing the MPI program.

If you use a specific version of the MPI execution commands that is not located in the system standard path, load necmpivars-runtime.sh or necmpivars-runtime.csh located in the /opt/nec/ve/mpi/{version}/bin directory instead of necmpivars.sh or necmpivars.csh. {version} is the directory name corresponding to the version of NEC MPI that contains the MPI execution commands to use. necmpivars-runtime.sh and necmpivars-runtime.csh can be used in the same way as necmpivars.sh and necmpivars.csh; they configure the use of the specified version of the MPI execution commands in addition to the settings configured by necmpivars.sh and necmpivars.csh.

Note that a specific version of the MPI execution commands that is not located in the system standard path cannot be used in an NQSV request submitted to a batch queue for which MPD is selected as the NEC MPI process manager. If you load necmpivars-runtime.sh or necmpivars-runtime.csh in such a request, the following warning message is shown and the settings for MPI execution are not configured.

necmpivars-runtime.sh: Warning: This script cannot be used in NQSV Request submitted to a batch queue that MPD is selected as NEC MPI Process Manager.

3.2.1   Specification of Program Execution

The following can be specified as MPI-execution specification {MPIexec} in the MPI execution commands:

The explanation above is based on the assumption that the Linux binfmt_misc capability has been configured, which is the default in the SX-Aurora TSUBASA software development environment. The configuration of the binfmt_misc capability requires system administrator privileges. Please refer to the "SX-Aurora TSUBASA Installation Guide", or contact the system administrator for details.

It is possible to execute MPI programs by specifying MPI-execution specification {MPIexec} as follows, even in the case that the binfmt_misc capability has not been configured.
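As a sketch, assuming the VE loader ve_exec is installed at its standard path, a VE executable could then be started by invoking the loader explicitly (the program name ve.out is a placeholder):

$ mpirun -np 4 /opt/nec/ve/bin/ve_exec ./ve.out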


3.2.2   Runtime Options

The term host in runtime options indicates a VH or a VE. Please refer to the clause for how to specify hosts.

The following table shows available global options.

Table 3-3 The List of Global Options
Global Option Meaning
-machinefile | -machine <filename> A file that describes hosts and the number of processes to be launched.
The format is "hostname[:value]", one entry per line. If ":value" is omitted, the default number of processes is 1.
-configfile <filename> A file containing runtime options.
In the file <filename>, specify one or more option lines.
Runtime options and MPI execution specifications {MPIexec} such as MPI executable file are specified on each line. If the beginning of the line is "#", that line is treated as a comment.
-hosts <host-list> Comma-separated list of hosts on which MPI processes are launched.
When the options -hosts and -hostfile are specified more than once, the hosts specified in each successive option are treated as a continuation of the list of the specified hosts.
This option must not be specified together with the option -host, -nn, or -node.
-hostfile | -f <filename> Name of a file that specifies hosts on which MPI processes are launched.
When the options -hosts, -f and -hostfile are specified more than once, the hosts specified in each successive option are treated as a continuation of the list of the specified hosts.
This option must not be specified together with the option -host, -nn, or -node.
-gvenode Hosts specified in the options indicates VEs.
-perhost | -ppn | -N | -npernode | -nnp <value> MPI processes in groups of the specified number <value> are assigned to respective hosts.
The assignment of MPI processes to hosts is circularly performed until every process is assigned to a host.
When this option is omitted, the default value is (P+H-1)/H, where P is the total number of MPI processes and H is the number of hosts.
-launcher-exec <fullpath> Full path name of the remote shell that launches MPI daemons.
The default value is /usr/bin/ssh. This option is available only in interactive execution.
-max_np | -universe_size <max_np> Specify the maximum number of MPI processes, including MPI processes dynamically generated at runtime. The default value is the number specified with the -np option. If multiple -np options are specified, the default value is the sum of the specified numbers.
-multi Specify that MPI program is executed on multiple hosts. Use this option, if all MPI processes are generated in a single host at the start of program execution and then MPI processes are generated on the other hosts by the MPI dynamic process generation function, resulting in multiple host execution.
-genv <varname> <value> Pass the environment variable <varname> with the value <value> to all MPI processes.
-genvall (Default) Pass all environment variables to all MPI processes except for the default environment variables set by NQSV in the NQSV request execution or set by PBS in the PBS request execution.
-genvlist <varname-list> Comma-separated list of environment variables to be passed to all MPI processes.
-genvnone Do not pass any environment variables.
-gpath <dirname> Set the PATH environment variable passed to all MPI processes to <dirname>.
-gumask <mode> Execute "umask <mode>" for all MPI processes.
-gwdir <dirname> Set the working directory in which all MPI processes run to <dirname>.
-gdb | -debug Open one debug screen per MPI process, and run MPI programs under the gdb debugger.
-display | -disp <X-server> X display server for debug screens in the format "host:display" or "host:display:screen".
-gvh | -gsh Specify that executables should run by default on Vector Hosts or Scalar Hosts
Note: When running some executables on VE, it is necessary to use an option such as -ve to indicate that the executables should run on VE.
-vpin | -vpinning | -vnuma Print information on the CPU IDs assigned to MPI processes on VHs and scalar hosts, or on the NUMA nodes on VEs.
This option is valid together with the -pin_mode, -cpu_list, -numa, and -nnuma options.
-v | -V | -version Display the version of NEC MPI and runtime information such as environment variables.
-h | -help Display help for the MPI execution commands.
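As an illustration of the global options above, a hypothetical machine file and launch could look as follows; the host names, file name, and program name are placeholders:

$ cat machines.txt
host1:2
host2:2
$ mpirun -machinefile machines.txt -np 4 -genv OMP_NUM_THREADS 2 ./ve.out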

Only one of the local options in the following table can be specified for each MPI executable file. When all of them are omitted, the host specified in runtime options indicates a VH.

Table 3-4 The List of Local Options
Local Option Meaning
-ve <first>[-<last>] The range of VEs on which MPI processes are executed. If this option is specified, the term host in runtime options indicates a VH.
In the interactive execution, specify the range of VE numbers.
In the NQSV request execution, specify the range of logical VE numbers.
<first> indicates the first VE number, and <last> the last VE number. <last> must not be smaller than <first>. When -<last> is omitted, -<first> is assumed to be specified.
The specified VEs are the ones attached to VHs specified immediately before this option in local options or specified in global options.
If this option is omitted and no VEs are specified, VE#0 is assumed to be specified. If this option is omitted and host or the number of hosts are not specified in the NQSV request execution, all VEs assigned by NQSV are assumed to be specified.
-nve <value> The number of VEs on which MPI processes are executed.
Corresponds to: -ve 0-<value-1>
The specified VEs are the ones attached to VHs specified immediately before this option in local options or specified in global options.
-venode The term host in the options indicates a VE.
-vh | -sh Create MPI processes on Vector Hosts or Scalar hosts.
-host <host> One host on which MPI processes are launched.
-node <hostrange> The range of hosts on which MPI processes are launched. Please refer to this section for the format of <hostrange>.
In the interactive execution, the -venode option also needs to be specified.
If the option -hosts, -hostfile, -f, -host, or -nn is specified, this option is ignored.
-nn <value> The number of hosts on which MPI processes are launched.
This option is available only in the NQSV request execution.
This option can be specified only once corresponding to each MPI executable file.
If this option is omitted and host or the number of hosts are not specified in the NQSV request execution, the number of hosts assigned by NQSV is assumed to be specified.
If the option -hosts, -hostfile, -f or -host is specified, this option is ignored.
-numa <first>[-<last>][,<...>] The range of NUMA nodes on VE on which MPI processes are executed.
<first> indicates the first NUMA node number, and <last> the last NUMA node number. <last> must not be smaller than <first>. When -<last> is omitted, -<first> is assumed to be specified.
-nnuma <value> The number of NUMA nodes on VE on which MPI processes are executed.
Corresponds to: -numa 0-<value-1>
-c | -n | -np <value> The total number of processes launched on the corresponding hosts.
The specified processes correspond to the hosts specified immediately before this option in local options or specified in global options.
When this option is omitted, the default value is 1.
-ve_nnp | -nnp_ve | -vennp <value> The number of processes launched per VE.
This option is ignored where other options that specify the number of MPI processes to be launched, such as the -np option, -nnp option and so on, are specified. This option cannot be used where the -gvenode option or -venode option is specified.
When this option is omitted, the default value is 1.
-env <varname> <value> Pass the environment variable <varname> with the value <value> to MPI processes.
-envall (Default) Pass all environment variables to MPI processes except the default environment variables set by NQSV in the NQSV request execution or set by PBS in the PBS request execution.
-envlist <varname-list> Comma-separated list of environment variables to be passed.
-envnone Do not pass any environment variables.
-path <dirname> Set the PATH environment variable passed to MPI processes to <dirname>.
-umask <mode> Execute "umask <mode>" for MPI processes.
-wdir <dirname> Set the working directory in which MPI processes run to <dirname>.
-ib_vh_memcpy_send <auto | on | off> Use VH memory copy on the sender side of a VE process for InfiniBand communication. This option has higher priority than the environment variable NMPI_IB_VH_MEMCPY_SEND.

auto:
Use sender side VH memory copy for InfiniBand communication through Root Complex.
(default for Intel machines)

on:
Use sender side VH memory copy for InfiniBand communication (independent of Root Complex).
(default for non-Intel machines)

off:
Don't use sender side VH memory copy for InfiniBand communication.
-ib_vh_memcpy_recv <auto | on | off> Use VH memory copy on the receiver side of a VE process for InfiniBand communication. This option has higher priority than the environment variable NMPI_IB_VH_MEMCPY_RECV.

auto:
Use receiver side VH memory copy for InfiniBand communication through Root Complex.

on:
Use receiver side VH memory copy for InfiniBand communication (independent of Root Complex).
(default for non-Intel machines)

off:
Don't use receiver side VH memory copy for InfiniBand communication.
(default for Intel machines)
-dma_vh_memcpy <auto | on | off> Use VH memory copy for a communication between VEs in VH. This option has higher priority than the environment variable NMPI_DMA_VH_MEMCPY.

auto:
Use VH memory copy for a communication between VEs in VH through Root Complex.
(default)

on:
Use VH memory copy for a communication between VEs in VH.
(independent of Root Complex).

off:
Don't use VH memory copy for a communication between VEs in VH.
-vh_memcpy <auto | on | off> Use VH memory copy for the InfiniBand communication and the communication between VEs in VH. This option has higher priority than the environment variable NMPI_VH_MEMCPY.


auto:
In the case of InfiniBand communication, sender side VH memcpy is used if the communication goes through Root Complex. In the case of a communication between VEs in VH, VH memory copy is used if the communication goes through Root Complex.
on:
VH memory copy is used.
off:
VH memory copy is not used.

Note:
The options -ib_vh_memcpy_send, -ib_vh_memcpy_recv, and -dma_vh_memcpy have higher priority than this option.
-vh_thread_yield <0 | 1 | 2> Control the waiting method for a VH process.


0:
Do the busy wait.
(default)
1:
Do the sched_yield().
2:
Do the sleep. It is implemented by pselect().

-vh_spin_count <spin count value> Control the spin count value for a VH process. The value must be greater than 0.
-vh_thread_sleep <sleep timeout value> Control the sleep microseconds timeout for a VH process.
-pin_mode <consec | spread | consec_rev | spread_rev | scatter | no | none | off>
Specify how the affinity of MPI processes on a VH or scalar host is controlled.

consec | spread :
Assign next free cpu ids to MPI processes. Assigning of cpu ids starts with cpu id 0.

consec_rev | spread_rev:
Assign next free (in reverse order) cpu ids to MPI processes. Assigning of cpu ids starts with highest cpu id.

scatter:
Look for a maximal distance to already assigned cpu ids and assign next free cpu ids to MPI processes.

none | off | no :
No pinning of MPI processes to cpu id's. The default pinning mode is 'none'.

Note:
(*) Specifying flag "-pin_mode" disables preceding "-cpu_list".
(*) If the number of free cpu id's is not sufficient to assign cpu_id's, NO cpu id is assigned to the MPI process.
-pin_reserve <num-reserved-ids>[H|h] Specify the number of cpu ids to be reserved per MPI process on VH or scalar host for the pinning method specified with the flag "-pin_mode". If the optional 'h' or 'H' is added to the number, the cpu id's of associated Hyperthreads are also utilized if available.
The number of reserved ids must be greater than 0.
The default number is 1.
-cpu_list | -pin_cpu <first-id>[-<last-id>[-<increment>[-<num-reserved-ids>[H|h]]]][,...]
Specify a comma-separated list of cpu ids for the processes to be created. <first-id> specifies the cpu id assigned to the first MPI process on the node. Cpu id <first-id + increment> is assigned to the next MPI process, and so on. <last-id> specifies the last cpu id to be assigned. <num-reserved-ids> specifies the number of cpu ids reserved per MPI process for multithreaded applications. If the optional 'h' or 'H' is added to <num-reserved-ids>, the cpu ids of Hyperthreads are also utilized if available.

Default values if not specified:
<last-id> = <first-id>
<increment> = 1
<num-reserved-ids> = 1

Note:
(*) Specifying flag "-cpu_list" disables preceding "-pin_mode".
(*) If the number of free cpu ids is not sufficient to assign <num-reserved-ids> cpu ids, NO cpu id is assigned to the MPI process.
-veo Specify that MPI processes use VEO features.
-cuda Specify that MPI processes use CUDA features.
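For example, the pinning options above could be combined as in the following sketch, which launches four processes on a VH and reserves two CPU IDs per process; the host name and program name are placeholders:

$ mpirun -host host1 -vh -np 4 -pin_mode scatter -pin_reserve 2 ./vh.out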


3.2.3   Specification of Hosts

Hosts corresponding to MPI executable files are determined according to the specified runtime options as follows:
  1. MPI executable files for which the -venode option is not specified (Default)

    A host indicates a VH in this case. VHs are specified as shown in the following table.

    Table 3-5 Specification of VHs
    Execution Method Format Description
    Interactive execution VH name
    • The hostname of a VH, which is a host computer.
    NQSV request execution <first>[-<last>]
    • <first> is the first logical VH number and <last> the last.
    • To specify one VH, omit -<last>.
      In particular specify only <first> in the options -hosts, -hostfile, -f and -host.
    • <last> must not be smaller than <first>.
  2. MPI executable files for which the -venode option is specified

    A host indicates a VE in this case. VEs are specified as shown in the following table.
    Please note that the -ve option cannot be specified for the MPI executable file for which the -venode option is specified.

    Table 3-6 Specification of VEs
    Execution Method Format Description
    Interactive execution <first>[-<last>][@<VH>]
    • <first> is the first VE number and <last> the last.
    • <VH> is a VH name. When omitted, the VH on which the MPI execution command has been executed is selected.
    • To specify one VE, omit -<last>.
      In particular, specify only <first> in the options -hosts, -hostfile, -f and -host.
    • <last> must not be smaller than <first>.
    NQSV request execution <first>[-<last>][@<VH>]
    • <first> is the first logical VE number and <last> the last.
    • <VH> is a logical VH number. When omitted, hosts (VEs) are selected from the ones NQSV allocated.
    • To specify one VE, omit -<last>.
      In particular specify only <first> in the options -hosts, -hostfile, -f and -host.
    • <last> must not be smaller than <first>.
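As an illustration of the format in Table 3-6, the following sketch launches eight processes on VE#0 and VE#1 attached to a VH in interactive execution; the host name and program name are placeholders:

$ mpirun -venode -hosts 0@host1,1@host1 -np 8 ./ve.out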


3.2.4   Environment Variables

The following table shows the environment variables whose values users can set. The names of environment variables in NEC MPI start with NMPI_, and some of them also have alternative names that start with MPI. Additionally, the behavior and output of MPI runtime performance information may vary depending on environment variables unrelated to NEC MPI that start with VE_ and are described in the table, such as VE_PROGINF_USE_SIGNAL and VE_PERF_MODE. Environment variables that start with NMPI_ can be referred to in the help of the mpirun and mpiexec commands.

Table 3-7   Environment Variables Set by Users
Environment Variable Available Value Meaning
NMPI_COMMINF Control the display of MPI communication information. To use the MPI communication information facility, you need to build the MPI program with the option -mpiprof, -mpitrace, -mpiverify or -ftrace. Please refer to this section for the MPI communication information facility.
NO (Default) Not display the communication information.
YES Display the communication information in the reduced format.
ALL Display the communication information in the extended format.
MPICOMMINF The same as the environment variable NMPI_COMMINF The same as the environment variable NMPI_COMMINF.
If both are specified, the environment variable NMPI_COMMINF takes precedence.
NMPI_COMMINF_VIEW Specify the display format of the aggregated portion of MPI communication information.
VERTICAL (Default) Aggregate vector processes and scalar processes separately and display them vertically.
HORIZONTAL Aggregate vector processes and scalar processes separately and display them horizontally.
MERGED Aggregate and display vector processes and scalar processes.
NMPI_PROGINF Control the display of runtime performance information of MPI program. Please refer to this section for runtime performance information of MPI program.
NO (Default) Not display the performance information.
YES Display the performance information in the reduced format.
ALL Display the performance information in the extended format.
DETAIL Display the detailed performance information in the reduced format.
ALL_DETAIL Display the detailed performance information in the extended format.
MPIPROGINF The same as the environment variable NMPI_PROGINF The same as the environment variable NMPI_PROGINF.
If both are specified, the environment variable NMPI_PROGINF takes precedence.
NMPI_PROGINF_VIEW Specify the display format of the aggregated portion about VE of runtime performance information of MPI program.
VE_SPLIT Aggregate processes executed on VE30 and processes executed on VE10/VE10E/VE20 separately and display them.
VE_MERGED (Default) Aggregate all processes executed on VE together as vector processes and display them.
NMPI_PROGINF_COMPAT 0 (Default) The runtime performance information of MPI program is displayed in the latest format.
1 The runtime performance information of MPI program is displayed in old format.
In this format, performance item "Non Swappable Memory Size Used", VE Card Data section and location information of VE where the MPI process is executed are not displayed.
VE_PROGINF_USE_SIGNAL YES (Default) Signals are used for collecting performance information.
NO Signals are not used for collecting performance information. See this section before using this option.
VE_PERF_MODE Control the HW performance counter set. MPI performance information outputs items corresponding to selected counters.
VECTOR-OP (Default) Select the set of HW performance counters related to vector operation mainly.
VECTOR-MEM Select the set of HW performance counters related to vector and memory access mainly.
NMPI_EXPORT "<string>" Space-separated list of the environment variables to be passed to MPI processes.
MPIEXPORT The same as the environment variable NMPI_EXPORT The same as the environment variable NMPI_EXPORT.
If both are specified, the environment variable NMPI_EXPORT takes precedence.
NMPI_SEPSELECT To enable this environment variable, the shell script mpisep.sh must also be used. Please refer to this section for details.
1 The standard output from each MPI process is saved in a separate file.
2 (Default) The standard error output from each MPI process is saved in a separate file.
3 The standard output and standard error output from each MPI process are saved in respective separate files.
4 The standard output and standard error output from each MPI process are saved in one separate file.
MPISEPSELECT The same as the environment variable NMPI_SEPSELECT The same as the environment variable NMPI_SEPSELECT.
If both are specified, the environment variable NMPI_SEPSELECT takes precedence.
NMPI_VERIFY Control error detection of the debug assist feature for MPI collective procedures. To use the debug assist feature for MPI collective procedures, you need to build the MPI program with the option -mpiverify. Please refer to this section for the feature.
0 Errors in invocations of MPI collective procedures are not detected.
3 (Default) Errors other than those in the argument assert of the procedure MPI_WIN_FENCE are detected.
4 Errors in the argument assert of the procedure MPI_WIN_FENCE are detected, in addition to the default errors.
NMPI_VE_TRACEBACK Controls format of traceback output by the VE MPI.
ON Output traceback in the same format as NEC compiler when the environment variable VE_TRACEBACK is set to VERBOSE.
OFF Output traceback in the same format as backtrace_symbols. (default)
NMPI_TRACEBACK_DEPTH <integer> Controls the maximum depth of traceback output by MPI. (default:50)
0 has special meaning:
The maximum depth is unlimited in the case of VE MPI.
The maximum depth is at least 50 in the case of VH MPI.
NMPI_OUTPUT_COLLECT Controls the output of MPI programs when NQSV batch jobs are executed in a queue for which hydra is selected as the NEC MPI process manager.
ON The output of the MPI program is set as the standard output and standard error output of the MPI execution command. This setting takes precedence over qsub -f.
OFF The output of the MPI program is output for each logical node, as in the case of mpd. (default)
NMPI_BLOCKLEN0 OFF (Default) Blocks with blocklength 0 are not included in the calculation of the values of the lower bound and upper bound of a datatype created by MPI procedures that create derived datatypes and have the argument blocklength.
ON Blocks with blocklength 0 are also included in the calculation of the values of the lower bound and upper bound of a datatype created by MPI procedures that create derived datatypes and have the argument blocklength.
MPIBLOCKLEN0 The same as the environment variable NMPI_BLOCKLEN0 The same as the environment variable NMPI_BLOCKLEN0.
If both are specified, the environment variable NMPI_BLOCKLEN0 takes precedence.
NMPI_COLLORDER OFF (Default)
1. Predefined operations, processes consecutive on nodes:
Canonical order, but bracketing depends on the distribution of processes over nodes, for example, could be (a+b)+(c+d) or ((a+b)+c)+d or a+((b+c)+d). More concretely, inside nodes reduction is performed left-to-right, over the nodes the bracketing depends on the number of nodes.
2. Predefined operations, processes not consecutive on nodes:
Commutativity is exploited, reduction order will not be canonical
3. User-defined operations:
Canonical reduction order, bracketing dependent on the number of processes, and commutativity is not exploited.
ON Canonical order, bracketing independent of process distribution, dependent only on the number of processes.
MPICOLLORDER The same as the environment variable NMPI_COLLORDER The same as the environment variable NMPI_COLLORDER.
If both are specified, the environment variable NMPI_COLLORDER takes precedence.
NMPI_PORT_RANGE <integer>:<integer> The range of port numbers NEC MPI uses to accept TCP/IP connections.
The default value is 25257:25266.
NMPI_INTERVAL_CONNECT <integer> Retry interval in seconds for establishing connections among MPI daemons at the beginning of execution of MPI programs.
The default value is 1.
NMPI_RETRY_CONNECT <integer> The number of retries for establishing connections among MPI daemons at the beginning of execution of MPI programs.
The default value is 2.
NMPI_LAUNCHER_EXEC <string> Full path name of the remote shell that launches MPI daemons.
The default value is /usr/bin/ssh. This environment variable is available only in interactive execution.
NMPI_IB_ADAPTER_NAME <string> Comma- or space-separated list of InfiniBand adapter names NEC MPI uses. This environment variable is available only in interactive execution.
When omitted, NEC MPI automatically selects the optimal ones.
NMPI_IB_DEFAULT_PKEY <integer> Partition key for InfiniBand Communication. The default value is 0.
NMPI_IB_FAST_PATH ON Use the InfiniBand RDMA fast path feature to transfer eager messages.
(Default on Intel machines)
Don't set this value if InfiniBand HCA Relaxed Ordering or Adaptive Routing is enabled.
MTU MTU limits the message size of fast path feature to actual OFED mtu size.
Don't set this value if InfiniBand HCA Relaxed Ordering is enabled.
OFF Don't use the InfiniBand RDMA fast path feature.
(Default on Non-Intel machines)
NMPI_IB_VBUF_TOTAL_SIZE <integer> Size of each InfiniBand communication buffer in bytes. The default value is 12248.
NMPI_IB_VH_MEMCPY_SEND AUTO Use sender side VH memory copy for InfiniBand communication through Root Complex.
(default for Intel machines)
ON Use sender side VH memory copy for InfiniBand communication (independent of Root Complex).
(default for non-Intel machines)
OFF Don't use sender side VH memory copy for InfiniBand communication.
NMPI_IB_VH_MEMCPY_RECV AUTO Use receiver side VH memory copy for InfiniBand communication through Root Complex.
ON Use receiver side VH memory copy for InfiniBand communication (independent of Root Complex).
(default for non-Intel machines)
OFF Don't use receiver side VH memory copy for InfiniBand communication.
(default for Intel machines)
NMPI_DMA_VH_MEMCPY AUTO Use VH memory copy for a communication between VEs in VH through Root Complex.
(Default)
ON Use VH memory copy for a communication between VEs in VH.
OFF Don't use VH memory copy for a communication between VEs in VH.
NMPI_VH_MEMCPY AUTO In the case of InfiniBand communication, sender side VH memcpy is used if the communication goes through Root Complex. In the case of a communication between VEs in VH, VH memory copy is used if the communication goes through Root Complex.
ON VH memory copy is used.
OFF VH memory copy is not used.
Note:
NMPI_IB_VH_MEMCPY_SEND, NMPI_IB_VH_MEMCPY_RECV, NMPI_DMA_VH_MEMCPY are higher priority than this environment variable.
NMPI_DMA_RNDV_OVERLAP
ON In the case of DMA communication, the communication and calculation can overlap when the buffer is contiguous, its transfer length is 200KB or more, and it is non-blocking point-to-point communication.
OFF (Default) In the case of DMA communication, the communication and calculation cannot overlap when the transfer length is 200KB or more and it is non-blocking point-to-point communication.
Note:
Setting NMPI_DMA_RNDV_OVERLAP to ON internally disables the use of VH memory copy; the value of the environment variable NMPI_DMA_VH_MEMCPY is ignored for non-blocking point-to-point DMA communication.
NMPI_IB_VH_MEMCPY_THRESHOLD <integer> Minimal message size to transfer InfiniBand message to/from VE processes via VH memory. Smaller messages are sent directly without copy to/from VH memory. Message size is given in bytes and must be greater or equal to 0. The default value is 1048576.

This value corresponds to the following item output by specifying the runtime option "-v":
"Threshold" of "IB Parameters for message transfer via VH memory"
NMPI_IB_VH_MEMCPY_BUFFER_SIZE <integer> Maximal size of a buffer located in VH memory to transfer (parts of) an InfiniBand message to/from VE processes. Size of buffer is given in bytes and must be at least 8192 bytes. The default value is 1048576.

This value corresponds to the following item output by specifying the runtime option "-v":
"Buffer size" of "IB Parameters for message transfer via VH memory"
NMPI_IB_VH_MEMCPY_SPLIT_THRESHOLD <integer> Minimal message size to split transfer of InfiniBand messages to/from VE processes via VH Memory. The messages are split in nearly equal parts in order to increase the transfer bandwidth. Message size is given in bytes and must be greater or equal to 0. The default value is 1048576.

This value corresponds to the following item output by specifying the runtime option "-v":
"Split threshold" of "IB Parameters for message transfer via VH memory"
NMPI_IB_VH_MEMCPY_SPLIT_NUM <integer> Maximal number of parts used to transfer InfiniBand messages to/from VE processes using VH memory. The number must be in range of [1:8]. The default value is 2.

This value corresponds to the following item output by specifying the runtime option "-v":
"Split number" of "IB Parameters for message transfer via VH memory"
NMPI_IP_USAGE TCP/IP usage if the fast InfiniBand interconnect is not available on an InfiniBand system (for example, if InfiniBand ports are down or no HCA was assigned to a job).
ON | FALLBACK Use TCP/IP as a fallback for the fast InfiniBand interconnect.
OFF (Default) Terminate the application if the InfiniBand interconnect is not available on an InfiniBand system.
NMPI_EXEC_MODE NECMPI (Default) Work with NECMPI runtime option.
INTELMPI Work with IntelMPI's basic runtime options (see below).
OPENMPI Work with OPENMPI's basic runtime options (see below).
MPICH Work with MPICH's basic runtime options (see below).
MPISX Work with MPISX's runtime options.
NMPI_SHARP_ENABLE ON To use SHARP
OFF Not to use SHARP (default)
NMPI_SHARP_NODES <integer> The minimal number of VE nodes to use SHARP if SHARP usage is enabled. (default: 4)
NMPI_SHARP_ALLREDUCE_MAX <integer> Maximal data size (in bytes) in MPI_Allreduce for which the SHARP API is used. (Default: 64)
UNLIMITED SHARP is always used.
NMPI_SHARP_REPORT ON Report on MPI Communicators using SHARP collective support.
OFF No report. (default)
NMPI_DCT_ENABLE Control the usage of InfiniBand DCT (Dynamically Connected Transport Service). Using DCT reduces the memory usage for InfiniBand communication.
(Note: DCT may affect the performance of InfiniBand communication)
AUTOMATIC DCT is used if the number of MPI processes is equal to or greater than the number specified by the NMPI_DCT_SELECT_NP environment variable. (default)
ON DCT is always used.
OFF DCT is not used.
NMPI_DCT_SELECT_NP <integer> The minimal number of MPI processes that DCT is used if the environment variable NMPI_DCT_ENABLE is set to AUTOMATIC. The default value is automatically decided by the number of cores in one VE and the number of VEs in one VH. (up to 2049)
NMPI_DCT_NUM_CONNS <integer> The number of requested DCT connections. (default: 4)
NMPI_COMM_PNODE Control the automatic selection of communication type between logical nodes in the execution under NQSV.
OFF Select the communication type automatically based on the logical node (default).
ON Select the communication type automatically based on the physical node.
NMPI_EXEC_LNODE Control the logical node execution in the interactive execution. In the logical node execution, the communication is selected automatically based on the logical node. The format of the specified logical node is "hostname/string".
The following example shows how to execute a program on 3 logical nodes using 1 physical node.
$ mpirun -host HOST1 -ve 0 -host HOST1/A -ve 1 -host HOST1/B -ve 2 ve.out
OFF The logical node execution in interactive execution is not used (default).
ON The logical node execution in interactive execution is used.
NMPI_LNODEON
MPILNODEON
The same as the environment variable NMPI_EXEC_LNODE. If both are specified, the environment variable NMPI_EXEC_LNODE takes precedence.
1 The logical node execution in interactive execution is used.
NMPI_VH_MEMORY_USAGE VH memory usage required for MPI application execution.
ON (Default) VH Memory is required. If VH memory is requested and not available, the MPI application is aborted.
OFF | FALLBACK If VH Memory is requested and not available, a possibly slower communication path is used.
NMPI_CUDA_ENABLE Control the usage of CUDA memory transfer.
AUTO CUDA memory transfer is used if it is available. (default)
OFF CUDA memory transfer is not used.
ON CUDA memory transfer is used. The MPI application is aborted when CUDA memory transfer is not available.
NMPI_GDRCOPY_ENABLE Control whether GDRCopy is used for data transfer between GPU and VH in a node.
AUTOMATIC Transfer data using GDRCopy if it is available. (default)
ON Transfer data using GDRCopy. If GDRCopy is not available, MPI application will be aborted.
OFF Do not transfer data using GDRCopy.
NMPI_GDRCOPY_LIB <path> Path to the GDRCopy dynamic library.
NMPI_GDRCOPY_FROM_DEVICE_LIMIT <integer> Maximal transfer size for usage of GDRCopy from GPU memory to Host memory. The default value is 8192.
NMPI_GDRCOPY_TO_DEVICE_LIMIT <integer> Maximal transfer size for usage of GDRCopy from Host memory to GPU memory. The default value is 8192.
NMPI_VE_AIO_METHOD Controls asynchronous I/O method used by non-blocking MPI-IO procedures of VE MPI programs.
VEAIO Use VE AIO (default)
POSIX Use POSIX AIO
NMPI_SWAP_ON_HOLD When the Partial Process Swapping function is used to suspend a regular request in an NQSV job, this variable controls the release of the Non Swappable Memory used by an MPI process. The default value depends on the system settings. You can check it in the "Swappable on hold" item displayed when you specify the runtime option -v.
ON A part of the Non Swappable Memory used by the MPI process is released.
OFF A Non Swappable Memory used by the MPI process is not released.
NMPI_AVEO_UDMA_ENABLE Control the AVEO UserDMA feature.
ON Enable the AVEO UserDMA feature (default)
OFF Disable the AVEO UserDMA feature
NMPI_USE_COMMAND_SEARCH_PATH Controls whether the PATH environment variable is used in order to search for the executable file specified in the MPI execution command.
(*) If you specify a file path that includes path separators instead of the file name, it will not be affected by this environment variable.
ON Use the PATH environment variable.
The file is searched for in order from the beginning of the directory specified in the PATH environment variable.
OFF Do not use the PATH environment variable.
The file is searched for only in the current working directory. (default)
NMPI_OUTPUT_RUNTIMEINFO Control how often the runtime information displayed by the -v option of the MPI execution command is output when executing NQSV batch jobs in a queue for which mpd is selected as the NEC MPI process manager.
ON Output runtime information every time an MPI execution command with the -v option is run in the job script. (default)
OFF Output runtime information only for the first MPI execution command, even if there are multiple MPI execution commands with the -v option in the job script.
NMPI_IB_CONNECT_IN_INIT Control the timing of establishing InfiniBand connections. The default can be changed to AUTO by specifying "ib_connect_in_init auto" in /etc/opt/nec/ve/mpi/necmpi.conf.
ON All connections are established in MPI_Init(). The performance of the first collective communication may be improved.
OFF Each connection is established when the first communication over it is issued. (default)
AUTO This feature is enabled when the number of processes is 4096 or more.
NMPI_VH_THREAD_YIELD Control the waiting method for a VH process.
0 Do the busy wait.(default)
1 Do the sched_yield().
2 Do the sleep. It is implemented by pselect().
NMPI_VH_SPIN_COUNT <integer> Control the spin count value for a VH process. The value must be greater than 0. (Default: 100)
NMPI_VH_THREAD_SLEEP <integer> Control the sleep microseconds timeout for a VH process. (Default: 100)
NMPI_IB_MEDIUM_BUFFERING Use buffering for reducing Non Swappable Memory when the transfer is issued over InfiniBand, and the transfer size is equal to or larger than NMPI_IB_VBUF_TOTAL_SIZE and less than NMPI_IB_VH_MEMCPY_THRESHOLD.
AUTO Buffering is used when NMPI_SWAP_ON_HOLD=ON.(default)
ON Buffering is used.
OFF Buffering is not used.
NMPI_ALLOC_MEM_LOCAL Controls whether local memory is allocated in the MPI procedures MPI_Alloc_mem, MPI_Win_allocate, and MPI_Win_allocate_shared (only with a single process).
Note: Local memory is not available for some high performance communications such as direct transfers of RMA. Global memory is Non Swappable Memory during Switch Over.
ON Local memory is allocated.
OFF Global memory is allocated. (default)
NMPI_IB_GPUDIRECT_ENABLE Controls the GPUDirect RDMA feature.
AUTO Enable the GPUDirect RDMA feature if GPU and InfiniBand HCA are connected under the same PCIe Root Port. (default)
ON Enable the GPUDirect RDMA feature regardless of the PCIe topology.
OFF Disable the GPUDirect RDMA feature.
NMPI_GDRCOPY_GPUDIRECT_THRESHOLD <integer> The threshold transfer size to change GDRCopy to GPUDirect RDMA. (Default: 128)
NMPI_VE_USE_256MB_MEM Controls the usage of the memory managed in units of 256 MB.
Note: This affects only MPI processes executed on VE
ON Use the memory managed in units of 256 MB
OFF Don't use the memory managed in units of 256 MB
AUTO (Default) The respective processes have the different value as follows:
  • The processes executed on VE30 have ON
  • The processes executed on VE10/VE10E/VE20 have OFF
NMPI_VE_ALLOC_MEM_BACKEND Specify the function used by the memory management of MPI_Alloc_mem when it allocates memory.
Note: This affects only MPI processes executed on VE
MALLOC Use malloc and variants
MMAP Use mmap and variants
AUTO (Default) The respective processes have the different value as follows:
  • The processes have MMAP when the processes have NMPI_VE_USE_256MB_MEM=ON
  • The processes have MALLOC when the processes have NMPI_VE_USE_256MB_MEM=OFF
NMPI_IB_RNDV_PROTOCOL Specifies the type of InfiniBand Transfer that NEC MPI mainly uses for MPI communication.
PUT or RPUT RDMA-WRITE is mainly used.
GET or RGET RDMA-READ is mainly used.
AUTO (Default) Either RDMA-WRITE or RDMA-READ is automatically selected according to the system configuration, distribution and layout of MPI processes in the program execution.
NMPI_IB_RMA_PUT_PROTOCOL Specify the transfer type of InfiniBand communication with MPI_Put and MPI_Rput procedures.
RDMA (Default) RDMA-WRITE is used if possible.
PT2PT Point-to-Point communication is used.
PT2PT_DCT or PT2PT4DCT Point-to-Point communication is used if a DCT connection between both processes is active.
REMOTE_GET RDMA-READ is used if possible.
NMPI_IB_RMA_PUT_THRESHOLD Specify the minimal transfer size at which the transfer type selected by the environment variable NMPI_IB_RMA_PUT_PROTOCOL is used in the MPI_Put and MPI_Rput procedures. When the transfer size is smaller than the value of NMPI_IB_RMA_PUT_THRESHOLD, RDMA is used. (Default: 0)

Support options for setting NMPI_EXEC_MODE = INTELMPI
-hosts, -f, -hostfile, -machinefile, -machine, -configfile, -perhost, -ppn, -genv, -genvall, -genvnone, -genvlist, -gpath, -gwdir, -gumask, -host, -n , -np, -env, -envall, -envnone, -envlist, -path, -wdir, -umask, and common options for Aurora
Support options for setting NMPI_EXEC_MODE = OPENMPI
-N, -npernode, --npernode, -path, --path, -H, -host, --host, -n, --n, -c, -np, --np, -wdir, --wdir, -wd, --wd, -x, and common options for Aurora
Support options for setting NMPI_EXEC_MODE = MPICH
-hosts, -f, -configfile, -ppn, -genv, -genvall, -genvnone, -genvlist, -wdir, -host, -n, -np, -env, -envall, -envnone, -envlist, and common options for Aurora
Common options for Aurora
-launcher-exec, -max_np, -universe_size, -multi, -debug, -display, -disp, -v, -V, -version, -h, -help, -gvenode, -ve, -venode, -ve_nnp, -nnp_ve, -vennp, -gsh, -gvh, -vpin, -vpinning, -vnuma
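For example, the environment variables in Table 3-7 can be passed to all MPI processes on the command line with the -genv global option; the following sketch (the program name is a placeholder) enables the display of MPI communication information, assuming the program was built with an option such as -mpiprof:

$ mpirun -genv NMPI_COMMINF YES -np 4 ./ve.out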


3.2.5   Environment Variables for MPI Process Identification

NEC MPI provides the following environment variables, the values of which are automatically set by NEC MPI, for MPI process identification.

Environment Variable Value
MPIUNIVERSE The identification number of the predefined communication universe at the beginning of program execution corresponding to the communicator MPI_COMM_WORLD.
MPIRANK The rank of the executing process in the communicator MPI_COMM_WORLD at the beginning of program execution.
MPISIZE The total number of processes in the communicator MPI_COMM_WORLD at the beginning of program execution.
MPINODEID The logical node number of node where the MPI process is running.
MPIVEID The VE node number of VE where the MPI process is running. If the execution is under NQSV, this shows logical VE node number. If the MPI process is not running on VE, this variable is not set.
NMPI_LOCAL_RANK The relative rank of MPI process in MPI_COMM_WORLD on this node.
NMPI_LOCAL_RANK_VHVE The relative rank of MPI process in MPI_COMM_WORLD on host CPUs or VE cards of this node.
NMPI_LOCAL_RANK_DEVICE The relative rank of MPI process in MPI_COMM_WORLD on host CPUs or a VE card of this node.

These environment variables can be referenced whenever MPI programs are running including before the invocation of the procedure MPI_INIT or MPI_INIT_THREAD.

When an MPI program is initiated, there is a predefined communication universe that includes all MPI processes and corresponds to a communicator MPI_COMM_WORLD. The predefined communication universe is assigned the identification number 0.

In a communication universe, each process is assigned a unique integer value called the rank, which is in the range from zero to one less than the number of processes.

If the dynamic process creation facility is used and a set of MPI processes is dynamically created, a new communication universe corresponding to a new communicator MPI_COMM_WORLD is created. Communication universes created at runtime are assigned consecutive integer identification numbers starting at 1. In such a case, two or more MPI_COMM_WORLDs can exist at the same time in a single MPI application.
Therefore, an MPI process can be identified using a pair of values of MPIUNIVERSE and MPIRANK.

In the case of the Aurora system, MPI processes run on host CPUs or VE cards, which are components of a node. With the environment variables MPINODEID, MPIVEID, NMPI_LOCAL_RANK, NMPI_LOCAL_RANK_VHVE and NMPI_LOCAL_RANK_DEVICE, you can obtain the location where an MPI process runs and the number that is unique to the MPI process within the process group of a node, the CPU side, the VE side, or a single VE. The environment variable MPIRANK indicates the number unique within the process group of MPI_COMM_WORLD, whereas the environment variables NMPI_LOCAL_RANK, NMPI_LOCAL_RANK_VHVE and NMPI_LOCAL_RANK_DEVICE indicate the numbers unique within the smaller groups into which MPI_COMM_WORLD is split, as follows.

mpirun \
-host hostA -vh     -np 2 ./vh.out : \
-host hostA -ve 0-1 -np 4 ./ve.out : \
-host hostB -ve 0-1 -np 6 ./ve.out : \
-host hostB -vh     -np 2 ./vh.out


MPIRANK                 0  1  2  3  4  5  6  7  8  9 10 11 12 13
NMPI_LOCAL_RANK         0  1  2  3  4  5  0  1  2  3  4  5  6  7
NMPI_LOCAL_RANK_VHVE    0  1  0  1  2  3  0  1  2  3  4  5  0  1
NMPI_LOCAL_RANK_DEVICE  0  1  0  1  0  1  0  1  2  0  1  2  0  1
MPINODEID               0  0  0  0  0  0  1  1  1  1  1  1  1  1
MPIVEID                 -  -  0  0  1  1  0  0  0  1  1  1  -  -
Example of the environment variables of each MPI process in VH-VE hybrid execution.

Indirect Initiation of an MPI Program with a Shell Script

If an MPI program is indirectly initiated with a shell script, these environment variables can also be referenced in the shell script and used, for example, to allow different MPI processes to handle mutually different files. The shell script in Figure 3-1 makes each MPI process read data from and store data to respectively different files, and it is executed as shown in Figure 3-2.

#!/bin/sh
INFILE=infile.$MPIUNIVERSE.$MPIRANK
OUTFILE=outfile.$MPIUNIVERSE.$MPIRANK
{MPIexec} < $INFILE > $OUTFILE    # Refer to this clause for {MPIexec}, MPI-execution specification
exit $?
Figure 3-1   A Shell Script "mpi.shell" to Start an MPI Program

$ mpirun -np 8 /execdir/mpi.shell
Figure 3-2   Indirect Initiation of an MPI Program with a Shell Script


3.2.6   Environment Variables for Other Processors

The environment variables supported by other processors such as the Fortran compiler (nfort), C compiler (ncc), or C++ compiler (nc++) are passed to MPI processes because the runtime option -genvall is enabled by default. In the following example, OMP_NUM_THREADS and VE_LD_LIBRARY_PATH are passed to MPI processes.

#!/bin/sh
#PBS -T necmpi
#PBS -b 2

OMP_NUM_THREADS=8 ; export OMP_NUM_THREADS
VE_LD_LIBRARY_PATH={your shared library path} ; export VE_LD_LIBRARY_PATH

mpirun -node 0-1 -np 2 a.out


3.2.7   Rank Assignment

Ranks are assigned to MPI processes in ascending order, according to the order in which NEC MPI assigns them to hosts.


3.2.8   The Working Directory

The working directory is determined as follows:
  1. The current working directory where the MPI execution command of NEC MPI is executed.
  2. The home directory if the above directory is not available.


3.2.9   Execution with the apptainer container

You can execute MPI programs in an apptainer (formerly known as singularity) container. As in the following example, the apptainer command is specified as an argument of the mpirun command. In this execution, options of the apptainer command related to namespaces are not available.
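A minimal sketch follows; the image file name mpi.sif and the executable name ve.out are placeholders, and the apptainer sub-command may differ depending on how the image was built:

$ mpirun -np 2 apptainer exec ./mpi.sif ./ve.out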
As for how to build the apptainer image file of NEC MPI, please refer to the following site.

https://github.com/veos-sxarr-NEC/singularity


3.2.10   Execution Examples

The following examples show how to launch MPI programs on the SX-Aurora TSUBASA.
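
The command lines below are illustrative sketches that reuse the options shown earlier in this chapter (ve.out and vh.out are placeholder executables).

Execution of 8 MPI processes on VEs 0 and 1 of the local host:

$ mpirun -ve 0-1 -np 8 ./ve.out

Hybrid execution of VH and VE programs on one host:

$ mpirun -host hostA -vh -np 2 ./vh.out : -host hostA -ve 0-1 -np 4 ./ve.out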


3.3   Standard Output and Standard Error of MPI Programs

To separate output streams from MPI processes, NEC MPI provides the shell script mpisep.sh, which is placed in the path /opt/nec/ve/bin/.

It is possible to redirect the output streams of MPI processes into separate files in the current working directory by specifying this script before the MPI-execution specification {MPIexec}, as shown in the following example. (Please refer to this clause for MPI-execution specification {MPIexec}.)

$ mpirun -np 2 /opt/nec/ve/bin/mpisep.sh {MPIexec}

The destinations of output streams can be specified with the environment variable NMPI_SEPSELECT as shown in the following table, in which uuu is the identification number of the predefined communication universe corresponding to the communicator MPI_COMM_WORLD and rrr is the rank of the executing MPI process in the universe.

NMPI_SEPSELECT Action
1 Only the stdout stream from each process is put into the separate file stdout.uuu:rrr.
2 (Default) Only the stderr stream from each process is put into the separate file stderr.uuu:rrr.
3 The stdout and stderr streams from each process are put into the separate files stdout.uuu:rrr and stderr.uuu:rrr, respectively.
4 The stdout and stderr streams from each process are put into one separate file std.uuu:rrr.
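
For example, to put both the stdout and stderr streams of each process into one file per process (setting 4), the variable can be exported before the MPI execution command, as in the following sketch (a.out is a placeholder executable):

$ export NMPI_SEPSELECT=4
$ mpirun -np 2 /opt/nec/ve/bin/mpisep.sh ./a.out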


3.4   Runtime Performance of MPI Programs

The performance of MPI programs can be obtained with the environment variable NMPI_PROGINF. There are four formats of runtime performance information available in NEC MPI as follows:
Format Description
Reduced Format This format consists of three parts: the first part is the Global Data section, in which the maximum, minimum, and average performance of all MPI processes is displayed; the second part is the Overall Data section, in which the performance of all MPI processes as a whole is displayed; the third part is the VE Card Data section, in which the maximum, minimum, and average performance per VE card is displayed. The results of vector processes and scalar processes are output separately.
Extended Format The performance of each MPI process is displayed in ascending order of their ranks in the communicator MPI_COMM_WORLD, after the information in the reduced format.
Detailed Reduced Format This format consists of three parts: the first part is the Global Data section, in which the maximum, minimum, and average detailed performance of all MPI processes is displayed; the second part is the Overall Data section, in which the performance of all MPI processes as a whole is displayed; the third part is the VE Card Data section, in which the maximum, minimum, and average performance per VE card is displayed. The results of vector processes and scalar processes are output separately.
Detailed Extended Format The detailed performance of each MPI process is displayed in ascending order of their ranks in the communicator MPI_COMM_WORLD, after the information in the detailed reduced format.
The format of displayed information can be specified by setting the environment variable NMPI_PROGINF at runtime as shown in the following table.

Table 3-8  The Settings of NMPI_PROGINF
NMPI_PROGINF Displayed Information
NO (Default) No Output
YES Reduced Format
ALL Extended Format
DETAIL Detailed Reduced Format
ALL_DETAIL Detailed Extended Format
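
For example, the detailed reduced format can be requested by exporting the variable before execution, as in the following sketch (a.out is a placeholder executable):

$ export NMPI_PROGINF=DETAIL
$ mpirun -np 4 ./a.out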

Also, you can change the view of the reduced format for VE processes by setting the environment variable NMPI_PROGINF_VIEW.

NMPI_PROGINF_VIEW Displayed Information
VE_SPLIT Processes executed on VE30 and processes executed on VE10/VE10E/VE20 are aggregated and displayed separately.
VE_MERGED (Default) All processes on VE are aggregated and displayed together as vector processes.

The following figure is an example of the detailed extended format.

MPI Program Information:
========================
Note: It is measured from MPI_Init till MPI_Finalize.
      [U,R] specifies the Universe and the Process Rank in the Universe.
      Times are given in seconds.


Global Data of 4 Vector processes       :          Min [U,R]          Max [U,R]      Average
=================================

Real Time (sec)                         :       25.203 [0,3]       25.490 [0,2]       25.325
User Time (sec)                         :      199.534 [0,0]      201.477 [0,2]      200.473
Vector Time (sec)                       :       42.028 [0,2]       42.221 [0,1]       42.104
Inst. Count                             :  94658554061 [0,1]  96557454164 [0,2]  95606075636
V. Inst. Count                          :  11589795409 [0,3]  11593360015 [0,0]  11591613166
V. Element Count                        : 920130095790 [0,3] 920199971948 [0,0] 920161556564
V. Load Element Count                   : 306457838070 [0,1] 306470712295 [0,3] 306463228635
FLOP Count                              : 611061870735 [0,3] 611078144683 [0,0] 611070006844
MOPS                                    :     6116.599 [0,2]     6167.214 [0,0]     6142.469
MOPS (Real)                             :    48346.004 [0,2]    48891.767 [0,3]    48624.070
MFLOPS                                  :     3032.988 [0,2]     3062.528 [0,0]     3048.186
MFLOPS (Real)                           :    23972.934 [0,2]    24246.003 [0,3]    24129.581
A. V. Length                            :       79.372 [0,1]       79.391 [0,3]       79.382
V. Op. Ratio (%)                        :       93.105 [0,2]       93.249 [0,1]       93.177
L1 Cache Miss (sec)                     :        3.901 [0,0]        4.044 [0,2]        3.983
CPU Port Conf. (sec)                    :        3.486 [0,1]        3.486 [0,2]        3.486
V. Arith. Exec. (sec)                   :       15.628 [0,3]       15.646 [0,1]       15.637
V. Load Exec. (sec)                     :       23.156 [0,2]       23.294 [0,1]       23.225
VLD LLC Hit Element Ratio (%)           :       90.954 [0,2]       90.965 [0,1]       90.959
FMA Element Count                       :       100000 [0,0]       100000 [0,0]       100000
Power Throttling (sec)                  :        0.000 [0,0]        0.000 [0,0]        0.000
Thermal Throttling (sec)                :        0.000 [0,0]        0.000 [0,0]        0.000
Max Active Threads                      :            8 [0,0]            8 [0,0]            8
Available CPU Cores                     :            8 [0,0]            8 [0,0]            8
Average CPU Cores Used                  :        7.904 [0,2]        7.930 [0,3]        7.916
Memory Size Used (MB)                   :     1616.000 [0,0]     1616.000 [0,0]     1616.000
Non Swappable Memory Size Used (MB)     :      115.000 [0,1]      179.000 [0,0]      131.000

Global Data of 8 Scalar processes       :          Min [U,R]          Max [U,R]      Average
=================================

Real Time (sec)                         :       25.001 [0,7]       25.010 [0,8]       25.005
User Time (sec)                         :      199.916 [0,7]      199.920 [0,8]      199.918
Memory Size Used (MB)                   :      392.000 [0,7]      392.000 [0,8]      392.000


Overall Data of 4 Vector processes
==================================

Real Time (sec)                         :       25.490
User Time (sec)                         :      801.893
Vector Time (sec)                       :      168.418
GOPS                                    :        5.009
GOPS (Real)                             :      157.578
GFLOPS                                  :        3.048
GFLOPS (Real)                           :       95.890
Memory Size Used (GB)                   :        6.313
Non Swappable Memory Size Used (GB)     :        0.512

Overall Data of 8 Scalar processes
==================================
Real Time (sec)                         :       25.010
User Time (sec)                         :     1599.344
Memory Size Used (GB)                   :        3.063


VE Card Data of 2 VEs
=====================

Memory Size Used (MB) Min               :     3232.000 [node=0,ve=0]
Memory Size Used (MB) Max               :     3232.000 [node=0,ve=0]
Memory Size Used (MB) Avg               :     3232.000
Non Swappable Memory Size Used (MB) Min :      230.000 [node=0,ve=1]
Non Swappable Memory Size Used (MB) Max :      294.000 [node=0,ve=0]
Non Swappable Memory Size Used (MB) Avg :      262.000


Data of Vector Process [0,0] [node=0,ve=0]:
-------------------------------------------

  Real Time (sec)                         :            25.216335
  User Time (sec)                         :           199.533916
  Vector Time (sec)                       :            42.127823
  Inst. Count                             :          94780214417
  V. Inst. Count                          :          11593360015
  V. Element Count                        :         920199971948
  V. Load Element Count                   :         306461345333
  FLOP Count                              :         611078144683
  MOPS                                    :          6167.214211
  MOPS (Real)                             :         48800.446081
  MFLOPS                                  :          3062.527699
  MFLOPS (Real)                           :         24233.424158
  A. V. Length                            :            79.373018
  V. Op. Ratio (%)                        :            93.239965
  L1 Cache Miss (sec)                     :             3.901453
  CPU Port Conf. (sec)                    :             3.485787
  V. Arith. Exec. (sec)                   :            15.642353
  V. Load Exec. (sec)                     :            23.274564
  VLD LLC Hit Element Ratio (%)           :            90.957228
  FMA Element Count                       :               100000
  Power Throttling (sec)                  :             0.000000
  Thermal Throttling (sec)                :             0.000000
  Max Active Threads                      :                    8
  Available CPU Cores                     :                    8
  Average CPU Cores Used                  :             7.912883
  Memory Size Used (MB)                   :          1616.000000
  Non Swappable Memory Size Used (MB)     :           179.000000
...
Figure 3-3   Performance Information in the Detailed Extended Format
(NMPI_PROGINF=ALL_DETAIL)

When the environment variable NMPI_PROGINF_VIEW is set to VE_SPLIT, the reduced sections are aggregated and displayed separately for processes on VE30 and processes on VE10/VE10E/VE20.

The following table shows the meanings of the items in the Global Data section and the Process section. For a vector process, in addition to the MPI universe number and the MPI rank number in MPI_COMM_WORLD, the hostname or logical node number and the logical VE number are shown in the header of the Process section as the location of the VE on which the MPI process is executed.
(*1) For scalar processes, only these items are output.
(*2) These items are output only in the detailed reduced format or the detailed extended format.
(*3) These items are output only in the detailed reduced format or the detailed extended format in multi-threaded execution.
(*4) This item is output only for processes executed on VE30. It is output in the Global Data section only when all processes in the aggregation range are executed on the corresponding VEs.
(*5) The smaller of Max Active Threads and Available CPU Cores is the upper limit.

Table 3-9   The Meanings of the Items in the Global Data Section and Process Section
Item Unit Description
Real Time (sec) second Elapsed time(*1)
User Time (sec) second User CPU time(*1)
Vector Time (sec) second Vector instruction execution time
Inst. Count The number of executed instructions
V.Inst. Count The number of executed vector instructions
V.Element Count The number of elements processed with vector instructions
V.Load Element Count The number of vector-loaded elements
FLOP Count The number of elements processed with floating-point operations
MOPS The number of million operations divided by the user CPU time
MOPS (Real) The number of million operations divided by the real time
MFLOPS The number of million floating-point operations divided by the user CPU time
MFLOPS (Real) The number of million floating-point operations divided by the real time
A.V.Length Average Vector Length
V.OP.RATIO percent Vector operation ratio
L1 Cache Miss (sec) second L1 cache miss time
CPU Port Conf. second CPU port conflict time (*2)
V. Arith Exec. second Vector operation execution time (*2)
V. Load Exec. second Vector load instruction execution time (*2)
LD L3 Hit Element Ratio Ratio of the number of elements loaded from L3 cache to the number of elements loaded with load instructions (*4)
VLD LLC Hit Element Ratio Ratio of the number of elements loaded from LLC to the number of elements loaded with vector load instructions
FMA Element Count Number of FMA execution elements (*2)
Power Throttling second Duration of time the hardware was throttled due to the power consumption (*2)
Thermal Throttling second Duration of time the hardware was throttled due to the temperature (*2)
Max Active Threads The maximum number of threads that were active at the same time (*3)
Available CPU Cores The number of CPU cores a process was allowed to use (*3)
Average CPU Cores Used The average number of CPU cores used (*3) (*5)
Memory Size Used (MB) megabyte (using base 1024) Peak usage of memory(*1)
Non Swappable Memory Size Used (MB) megabyte (using base 1024) Peak usage of memory that cannot be swapped out by Partial Process Swapping function

The following table shows the meanings of the items in the Overall Data section in the figure above. For scalar processes, only the items marked (*1) are output.

Table 3-10   The Meanings of the Items in the Overall Data Section
Item Unit Description
Real Time (sec) second The maximum elapsed time of all MPI processes(*1)
User Time (sec) second The sum of the user CPU time of all MPI processes(*1)
Vector Time (sec) second The sum of the vector time of all MPI processes
GOPS The total number of giga operations executed on all MPI processes divided by the total user CPU time of all MPI processes
GOPS (Real) The total number of giga operations executed on all MPI processes divided by the maximum real time of all MPI processes
GFLOPS The total number of giga floating-point operations executed on all MPI processes divided by the total user CPU time of all MPI processes
GFLOPS (Real) The total number of giga floating-point operations executed on all MPI processes divided by the maximum real time of all MPI processes
Memory Size Used (GB) gigabyte (using base 1024) The sum of peak usage of memory of all MPI processes(*1)
Non Swappable Memory Size Used (GB) gigabyte (using base 1024) The sum of peak usage of memory that cannot be swapped out by Partial Process Swapping function of all MPI processes
The following table shows the meanings of the items in the VE Card Data section in the figure above. For the maximum and minimum values, the hostname or logical node number and the logical VE number are shown as the location of the VE on which the value was recorded.

Table 3-11   The Meanings of the Items in the VE Card Data Section
Item Unit Description
Memory Size Used (MB) Min megabyte (using base 1024) The minimum of peak usage of memory aggregated for each VE card
Memory Size Used (MB) Max megabyte (using base 1024) The maximum of peak usage of memory aggregated for each VE card
Memory Size Used (MB) Avg megabyte (using base 1024) The average of peak usage of memory aggregated for each VE card
Non Swappable Memory Size Used (MB) Min megabyte (using base 1024) The minimum of peak usage of memory that cannot be swapped out by Partial Process Swapping function aggregated for each VE card
Non Swappable Memory Size Used (MB) Max megabyte (using base 1024) The maximum of peak usage of memory that cannot be swapped out by Partial Process Swapping function aggregated for each VE card
Non Swappable Memory Size Used (MB) Avg megabyte (using base 1024) The average of peak usage of memory that cannot be swapped out by Partial Process Swapping function aggregated for each VE card
The MPI performance information outputs program execution analysis information using the Aurora hardware performance counters. You can select the set of performance counters with the environment variable VE_PERF_MODE, and PROGINF outputs the items corresponding to the selected set. The output above is the case in which VE_PERF_MODE is unset or set to VECTOR-OP; in this case, PROGINF mainly outputs items related to vector instructions. The output below is the case in which VE_PERF_MODE is set to VECTOR-MEM; in this case, PROGINF mainly outputs items related to vector instructions and memory access.

Global Data of 16 Vector processes      :          Min [U,R]           Max [U,R]       Average
==================================

Real Time (sec)                         :      123.871 [0,12]      123.875 [0,10]      123.873
User Time (sec)                         :      123.695 [0,0]       123.770 [0,4]       123.753
Vector Time (sec)                       :       33.675 [0,8]        40.252 [0,14]       36.871
Inst. Count                             :  94783046343 [0,8]  120981685418 [0,5]  109351879970
V. Inst. Count                          :   2341570533 [0,8]    3423410840 [0,0]    2479317774
V. Element Count                        : 487920413405 [0,15] 762755268183 [0,0]  507278230562
V. Load Element Count                   :  47201569500 [0,8]   69707680610 [0,0]   49406464759
FLOP Count                              : 277294180692 [0,15] 434459800790 [0,0]  287678800758
MOPS                                    :     5558.515 [0,8]      8301.366 [0,0]      5863.352
MOPS (Real)                             :     5546.927 [0,8]      8276.103 [0,0]      5850.278
MFLOPS                                  :     2243.220 [0,15]     3518.072 [0,0]      2327.606
MFLOPS (Real)                           :     2238.588 [0,13]     3507.366 [0,0]      2322.405
A. V. Length                            :      197.901 [0,5]       222.806 [0,0]       204.169
V. Op. Ratio (%)                        :       83.423 [0,5]        90.639 [0,0]        85.109
L1 I-Cache Miss (sec)                   :        4.009 [0,5]         8.310 [0,0]         5.322
L1 O-Cache Miss (sec)                   :       11.951 [0,5]        17.844 [0,9]        14.826
L2 Cache Miss (sec)                     :        7.396 [0,5]        15.794 [0,0]         9.872
FMA Element Count                       : 106583464050 [0,8]  166445323660 [0,0]  110529497704
Required B/F                            :        2.258 [0,0]         3.150 [0,5]         2.948
Required Store B/F                      :        0.914 [0,0]         1.292 [0,5]         1.202
Required Load B/F                       :        1.344 [0,0]         1.866 [0,6]         1.746
Actual V. Load B/F                      :        0.223 [0,0]         0.349 [0,14]        0.322
Power Throttling (sec)                  :        0.000 [0,0]         0.000 [0,0]         0.000
Thermal Throttling (sec)                :        0.000 [0,0]         0.000 [0,0]         0.000
Memory Size Used (MB)                   :      598.000 [0,0]       598.000 [0,0]       598.000
Non Swappable Memory Size Used (MB)     :      115.000 [0,1]       179.000 [0,0]       131.000

When VE_PERF_MODE is set to VECTOR-MEM, MPI performance information outputs the following items instead of L1 Cache Miss, CPU Port Conf., V. Arith Exec., V. Load Exec. and VLD LLC Hit Element Ratio that are output when VE_PERF_MODE is set to VECTOR-OP or VE_PERF_MODE is unset.

(*1) These items are output only in the detailed reduced format or the detailed extended format.
(*2) For these items, values exceeding 100 are truncated.
(*3) This item is output only for processes on VE30. It is output in the Global Data section only when all processes in the aggregation range are executed on the corresponding VEs.
(*4) This item is output only for processes on VE10/VE10E/VE20. It is output in the Global Data section only when all processes in the aggregation range are executed on the corresponding VEs.

Item Unit Description
L1 I-Cache Miss (sec) second L1 instruction cache miss time
L1 O-Cache Miss (sec) second L1 operand cache miss time
L2 Cache Miss (sec) second L2 cache miss time
LD L3 Hit Element Ratio Ratio of the number of elements loaded from L3 cache to the number of elements loaded with load instructions (*3)
VLD LLC Hit Element Ratio Ratio of the number of elements loaded from LLC to the number of elements loaded with vector load instructions (*3)
Required B/F B/F calculated from bytes specified by load and store instructions (*1) (*2)
Required Store B/F B/F calculated from bytes specified by store instructions (*1) (*2)
Required Load B/F B/F calculated from bytes specified by load instructions (*1) (*2)
Actual Load B/F B/F calculated from bytes of actual memory access by load instructions (*1) (*2) (*3)
Actual V. Load B/F B/F calculated from bytes of actual memory access by vector load instructions (*1) (*2) (*4)
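
For example, the following sketch requests the detailed reduced format with the VECTOR-MEM counter set (a.out is a placeholder executable; the variables are assumed to be exported before the MPI execution command):

$ export VE_PERF_MODE=VECTOR-MEM
$ export NMPI_PROGINF=DETAIL
$ mpirun -np 16 ./a.out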


3.5   MPI Communication Information

NEC MPI provides a facility for displaying MPI communication information. To use this facility, you need to generate the MPI program with the option -mpiprof, -mpitrace, -mpiverify or -ftrace. There are two formats of MPI communication information available, as follows:
Reduced Format

The maximum, minimum, and average values of MPI communication information of all MPI processes are displayed.

Extended Format

MPI communication information of each MPI process is displayed in the ascending order of their ranks in the communicator MPI_COMM_WORLD after the information in the reduced format.

You can control the display and format of MPI communication information by setting the environment variable NMPI_COMMINF at runtime as shown in the following table.

Table 3-12   The Settings of NMPI_COMMINF
NMPI_COMMINF Displayed Information
NO (Default) No Output
YES Reduced Format
ALL Extended Format
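
For example, assuming the program is built with one of the options listed above, such as -mpiprof, the reduced format can be requested as in the following sketch (a.c and a.out are placeholders):

$ mpincc -mpiprof a.c
$ export NMPI_COMMINF=YES
$ mpirun -np 4 ./a.out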

Also, you can change the view of the reduced format by setting the environment variable NMPI_COMMINF_VIEW.

Table 3-13   The Settings of NMPI_COMMINF_VIEW
NMPI_COMMINF_VIEW Displayed Information
VERTICAL (Default) Vector processes and scalar processes are summarized separately, and the two summaries are arranged vertically. Items that correspond only to vector processes are not output in the scalar-process part.
HORIZONTAL Vector processes and scalar processes are summarized separately, and the two summaries are arranged horizontally. N/A is output in the scalar-process part for items that correspond only to vector processes.
MERGED Vector processes and scalar processes are summarized together. For items that correspond only to vector processes, (V) is output at the end of the line and only vector processes are aggregated in those items.

The following figure is an example of the extended format.


MPI Communication Information of 4 Vector processes
---------------------------------------------------
                                                   Min [U,R]           Max [U,R]       Average

Real MPI Idle Time (sec)                :        9.732 [0,1]        10.178 [0,3]         9.936
User MPI Idle Time (sec)                :        9.699 [0,1]        10.153 [0,3]         9.904
Total real MPI Time (sec)               :       13.301 [0,0]        13.405 [0,3]        13.374
Send       count                        :         1535 [0,2]          2547 [0,1]          2037
   Memory Transfer                      :          506 [0,3]          2024 [0,0]          1269
   DMA Transfer                         :            0 [0,0]          1012 [0,1]           388
Recv       count                        :         1518 [0,2]          2717 [0,0]          2071
   Memory Transfer                      :          506 [0,2]          2024 [0,1]          1269
   DMA Transfer                         :            0 [0,3]          1012 [0,2]           388
Barrier       count                     :         8361 [0,2]          8653 [0,0]          8507
Bcast         count                     :          818 [0,2]           866 [0,0]           842
Reduce        count                     :          443 [0,0]           443 [0,0]           443
Allreduce     count                     :         1815 [0,2]          1959 [0,0]          1887
Scan          count                     :            0 [0,0]             0 [0,0]             0
Exscan        count                     :            0 [0,0]             0 [0,0]             0
Redscat       count                     :          464 [0,0]           464 [0,0]           464
Redscat_block count                     :            0 [0,0]             0 [0,0]             0
Gather        count                     :          864 [0,0]           864 [0,0]           864
Gatherv       count                     :          506 [0,0]           506 [0,0]           506
Allgather     count                     :          485 [0,0]           485 [0,0]           485
Allgatherv    count                     :          506 [0,0]           506 [0,0]           506
Scatter       count                     :          485 [0,0]           485 [0,0]           485
Scatterv      count                     :          506 [0,0]           506 [0,0]           506
Alltoall      count                     :          506 [0,0]           506 [0,0]           506
Alltoallv     count                     :          506 [0,0]           506 [0,0]           506
Alltoallw     count                     :            0 [0,0]             0 [0,0]             0
Neighbor Allgather  count               :            0 [0,0]             0 [0,0]             0
Neighbor Allgatherv count               :            0 [0,0]             0 [0,0]             0
Neighbor Alltoall   count               :            0 [0,0]             0 [0,0]             0
Neighbor Alltoallv  count               :            0 [0,0]             0 [0,0]             0
Neighbor Alltoallw  count               :            0 [0,0]             0 [0,0]             0
Number of bytes sent                    :    528482333 [0,2]     880803843 [0,1]     704643071
   Memory Transfer                      :    176160755 [0,3]     704643020 [0,0]     440401904
   DMA Transfer                         :            0 [0,0]     352321510 [0,1]     132120600
Number of bytes recvd                   :    528482265 [0,2]     880804523 [0,0]     704643207
   Memory Transfer                      :    176160755 [0,2]     704643020 [0,1]     440401904
   DMA Transfer                         :            0 [0,3]     352321510 [0,2]     132120600
Put        count                        :            0 [0,0]             0 [0,0]             0
Get        count                        :            0 [0,0]             0 [0,0]             0
Accumulate count                        :            0 [0,0]             0 [0,0]             0
Number of bytes put                     :            0 [0,0]             0 [0,0]             0
Number of bytes got                     :            0 [0,0]             0 [0,0]             0
Number of bytes accum                   :            0 [0,0]             0 [0,0]             0

MPI Communication Information of 8 Scalar processes
---------------------------------------------------
                                                   Min [U,R]           Max [U,R]       Average

Real MPI Idle Time (sec)                :        4.837 [0,6]         5.367 [0,11]        5.002
User MPI Idle Time (sec)                :        4.825 [0,6]         5.363 [0,11]        4.992
Total real MPI Time (sec)               :       12.336 [0,11]       12.344 [0,5]        12.340
Send       count                        :         1535 [0,4]          1535 [0,4]          1535
   Memory Transfer                      :          506 [0,11]         1518 [0,5]          1328
Recv       count                        :         1518 [0,4]          1518 [0,4]          1518
   Memory Transfer                      :          506 [0,4]          1518 [0,5]          1328
...
Number of bytes accum                   :            0 [0,0]             0 [0,0]             0


Data of Vector Process [0,0] [node=0,ve=0]:
-------------------------------------------

  Real MPI Idle Time (sec)                :            10.071094
  User MPI Idle Time (sec)                :            10.032894
  Total real MPI Time (sec)               :            13.301340
...
Figure 3-4 MPI Communication Information in the Extended Format
(NMPI_COMMINF=ALL)

The following figure is an example of the reduced format with NMPI_COMMINF_VIEW=MERGED.

MPI Communication Information of 4 Vector and 8 Scalar processes
----------------------------------------------------------------
                                                   Min [U,R]           Max [U,R]       Average

Real MPI Idle Time (sec)                :        4.860 [0,10]       10.193 [0,3]         6.651
User MPI Idle Time (sec)                :        4.853 [0,10]       10.167 [0,3]         6.635
Total real MPI Time (sec)               :       12.327 [0,4]        13.396 [0,3]        12.679
Send       count                        :         1535 [0,2]          2547 [0,1]          1702
   Memory Transfer                      :          506 [0,3]          2024 [0,0]          1309
   DMA Transfer                         :            0 [0,0]          1012 [0,1]           388 (V)
Recv       count                        :         1518 [0,2]          2717 [0,0]          1702
   Memory Transfer                      :          506 [0,2]          2024 [0,1]          1309
   DMA Transfer                         :            0 [0,3]          1012 [0,2]           388 (V)
...
Number of bytes accum                   :            0 [0,0]             0 [0,0]             0
  
Figure 3-5 MPI Communication Information in the Reduced Format
(NMPI_COMMINF_VIEW=MERGED)

The following table shows the meanings of the items in the MPI communication information. The item "DMA Transfer" is only supported for a vector process.

Table 3-14   The Meanings of the Items in the MPI Communication Information
Item Unit Description
Real MPI Idle Time second Elapsed time for waiting for messages
User MPI Idle Time second User CPU time for waiting for messages
Total real MPI Time second Elapsed time for executing MPI procedures
Send count The number of invocations of point-to-point send procedures
Memory Transfer The number of invocations of point-to-point send procedures that use memory copy
DMA Transfer The number of invocations of point-to-point send procedures that use DMA transfer
Recv count The number of invocations of point-to-point receive procedures
Memory Transfer The number of invocations of point-to-point receive procedures that use memory copy
DMA Transfer The number of invocations of point-to-point receive procedures that use DMA transfer
Barrier count The number of invocations of the procedures MPI_BARRIER and MPI_IBARRIER
Bcast count The number of invocations of the procedures MPI_BCAST and MPI_IBCAST
Reduce count The number of invocations of the procedures MPI_REDUCE and MPI_IREDUCE
Allreduce count The number of invocations of the procedures MPI_ALLREDUCE and MPI_IALLREDUCE
Scan count The number of invocations of the procedures MPI_SCAN and MPI_ISCAN
Exscan count The number of invocations of the procedures MPI_EXSCAN and MPI_IEXSCAN
Redscat count The number of invocations of the procedures MPI_REDUCE_SCATTER and MPI_IREDUCE_SCATTER
Redscat_block count The number of invocations of the procedures MPI_REDUCE_SCATTER_BLOCK and MPI_IREDUCE_SCATTER_BLOCK
Gather count The number of invocations of the procedures MPI_GATHER and MPI_IGATHER
Gatherv count The number of invocations of the procedures MPI_GATHERV and MPI_IGATHERV
Allgather count The number of invocations of the procedures MPI_ALLGATHER and MPI_IALLGATHER
Allgatherv count The number of invocations of the procedures MPI_ALLGATHERV and MPI_IALLGATHERV
Scatter count The number of invocations of the procedures MPI_SCATTER and MPI_ISCATTER
Scatterv count The number of invocations of the procedures MPI_SCATTERV and MPI_ISCATTERV
Alltoall count The number of invocations of the procedures MPI_ALLTOALL and MPI_IALLTOALL
Alltoallv count The number of invocations of the procedures MPI_ALLTOALLV and MPI_IALLTOALLV
Alltoallw count The number of invocations of the procedures MPI_ALLTOALLW and MPI_IALLTOALLW
Neighbor Allgather count The number of invocations of the procedures MPI_NEIGHBOR_ALLGATHER and MPI_INEIGHBOR_ALLGATHER
Neighbor Allgatherv count The number of invocations of the procedures MPI_NEIGHBOR_ALLGATHERV and MPI_INEIGHBOR_ALLGATHERV
Neighbor Alltoall count The number of invocations of the procedures MPI_NEIGHBOR_ALLTOALL and MPI_INEIGHBOR_ALLTOALL
Neighbor Alltoallv count The number of invocations of the procedures MPI_NEIGHBOR_ALLTOALLV and MPI_INEIGHBOR_ALLTOALLV
Neighbor Alltoallw count The number of invocations of the procedures MPI_NEIGHBOR_ALLTOALLW and MPI_INEIGHBOR_ALLTOALLW
Number of bytes sent byte The number of bytes sent by point-to-point send procedures
Memory Transfer byte The number of bytes sent using memory copy by point-to-point send procedures
DMA Transfer byte The number of bytes sent using DMA transfer by point-to-point send procedures
Number of bytes recvd byte The number of bytes received by point-to-point receive procedures
Memory Transfer byte The number of bytes received using memory copy by point-to-point receive procedures
DMA Transfer byte The number of bytes received using DMA transfer by point-to-point receive procedures
Put count The number of invocations of the procedures MPI_PUT and MPI_RPUT
Memory Transfer The number of invocations of the procedures MPI_PUT and MPI_RPUT that use memory copy
DMA Transfer The number of invocations of the procedures MPI_PUT and MPI_RPUT that use DMA transfer
Get count The number of invocations of the procedures MPI_GET and MPI_RGET
Memory Transfer The number of invocations of the procedures MPI_GET and MPI_RGET that use memory copy
DMA Transfer The number of invocations of the procedures MPI_GET and MPI_RGET that use DMA transfer
Accumulate count The number of invocations of the procedures MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP and MPI_COMPARE_AND_SWAP
Memory Transfer The number of invocations of the procedures MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP and MPI_COMPARE_AND_SWAP that use memory copy
DMA Transfer The number of invocations of the procedures MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP and MPI_COMPARE_AND_SWAP that use DMA transfer
Number of bytes put byte The number of bytes put by the procedures MPI_PUT and MPI_RPUT
Memory Transfer byte The number of bytes put using memory copy by the procedures MPI_PUT and MPI_RPUT
DMA Transfer byte The number of bytes put using DMA transfer by the procedures MPI_PUT and MPI_RPUT
Number of bytes got byte The number of bytes got by the procedures MPI_GET and MPI_RGET
Memory Transfer byte The number of bytes got using memory copy by the procedures MPI_GET and MPI_RGET
DMA Transfer byte The number of bytes got using DMA transfer by the procedures MPI_GET and MPI_RGET
Number of bytes accum byte The number of bytes accumulated by the procedures MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP and MPI_COMPARE_AND_SWAP
Memory Transfer byte The number of bytes accumulated using memory copy by the procedures MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP and MPI_COMPARE_AND_SWAP
DMA Transfer byte The number of bytes accumulated using DMA transfer by the procedures MPI_ACCUMULATE, MPI_RACCUMULATE, MPI_GET_ACCUMULATE, MPI_RGET_ACCUMULATE, MPI_FETCH_AND_OP and MPI_COMPARE_AND_SWAP


3.6   FTRACE Facility

The FTRACE facility enables users to obtain detailed performance information of each procedure and each specified execution region of a program on each MPI process, including MPI communication information. Please refer to the "PROGINF / FTRACE User's Guide" for details. Note that the FTRACE facility is available only for programs executed on VE.

The following table shows the MPI communication information displayed with the FTRACE facility.

Table 3-15 MPI Communication information Displayed with the FTRACE Facility
Item Unit Meaning
ELAPSE second Elapsed time
COMM.TIME second Elapsed time for executing MPI procedures
COMM.TIME / ELAPSE The ratio of the elapsed time for executing MPI procedures to the elapsed time of each process
IDLE TIME second Elapsed time for waiting for messages
IDLE TIME / ELAPSE The ratio of the elapsed time for waiting for messages to the elapsed time of each process
AVER.LEN Byte Average amount of communication per MPI procedure (The unit is using base 1024)
COUNT Total number of transfers by MPI procedures
TOTAL LEN Byte Total amount of communication by MPI procedures (The unit is using base 1024)


The steps for using the FTRACE facility are as follows:

  1. Specify the -ftrace option at compile and link time as follows:

    $ mpincc -ftrace mpi.c
    $ mpinfort -ftrace mpifort.f90

  2. Analysis information files are generated in the working directory at runtime. The name of an analysis information file is ftrace.out.uuu.rrr, where uuu and rrr are the values of the environment variables MPIUNIVERSE and MPIRANK, respectively.

  3. Execute the ftrace command to read the analysis information files and display the performance information to the standard output as follows:

    $ ftrace -all -f ftrace.out.0.0 ftrace.out.0.1
    $ ftrace -f ftrace.out.*


The following figure shows an example displayed by the FTRACE facility.


Figure 3-6   Performance Information by the FTRACE Facility
*----------------------*
  FTRACE ANALYSIS LIST
*----------------------*

Execution Date : Sat Feb 17 12:44:49 2018 JST
Total CPU Time : 0:03'24"569 (204.569 sec.)


FREQUENCY  EXCLUSIVE       AVER.TIME     MOPS   MFLOPS  V.OP  AVER.    VECTOR L1CACHE .... PROC.NAME
           TIME[sec](  % )    [msec]                    RATIO V.LEN      TIME    MISS

     1012    49.093( 24.0)    48.511  23317.2  14001.4  96.97  83.2    42.132   5.511      funcA
   160640    37.475( 18.3)     0.233  17874.6   9985.9  95.22  52.2    34.223   1.973      funcB
   160640    30.515( 14.9)     0.190  22141.8  12263.7  95.50  52.8    29.272   0.191      funcC
   160640    23.434( 11.5)     0.146  44919.9  22923.2  97.75  98.5    21.869   0.741      funcD
   160640    22.462( 11.0)     0.140  42924.5  21989.6  97.73  99.4    20.951   1.212      funcE
 53562928    15.371(  7.5)     0.000   1819.0    742.2   0.00   0.0     0.000   1.253      funcG
        8    14.266(  7.0)  1783.201   1077.3     55.7   0.00   0.0     0.000   4.480      funcH
   642560     5.641(  2.8)     0.009    487.7      0.2  46.45  35.1     1.833   1.609      funcF
     2032     2.477(  1.2)     1.219    667.1      0.0  89.97  28.5     2.218   0.041      funcI
        8     1.971(  1.0)   246.398  21586.7   7823.4  96.21  79.6     1.650   0.271      funcJ
------------------------------------------------------------------------------------- .... -----------
 54851346   204.569(100.0)     0.004  22508.5  12210.7  95.64  76.5   154.524  17.740      total


ELAPSED     COMM.TIME  COMM.TIME   IDLE TIME  IDLE TIME  AVER.LEN      COUNT  TOTAL LEN PROC.NAME
   TIME[sec]       [sec]  / ELAPSED       [sec]  / ELAPSED    [byte]                [byte]

      12.444       0.000                  0.000                 0.0           0       0.0  funcA
       9.420       0.000                  0.000                 0.0           0       0.0  funcB
       7.946       0.000                  0.000                 0.0           0       0.0  funcG
       7.688       0.000                  0.000                 0.0           0       0.0  funcC
       7.372       0.000                  0.000                 0.0           0       0.0  funcH
       5.897       0.000                  0.000                 0.0           0       0.0  funcD
       5.653       0.000                  0.000                 0.0           0       0.0  funcE
       1.699       1.475                  0.756                 3.1K     642560       1.9G funcF
       1.073       1.054                  0.987                 1.0M       4064       4.0G funcI
       0.704       0.045                  0.045                80.0           4     320.0  funcK
------------------------------------------------------------------------------------------------------


FREQUENCY  EXCLUSIVE       AVER.TIME     MOPS   MFLOPS  V.OP  AVER.    VECTOR L1CACHE .... PROC.NAME
           TIME[sec](  % )    [msec]                    RATIO V.LEN      TIME    MISS

     1012    49.093( 24.0)    48.511  23317.2  14001.4  96.97  83.2    42.132   5.511      funcA
      253    12.089           47.784  23666.9  14215.9  97.00  83.2    10.431   1.352       0.0
      253    12.442           49.177  23009.2  13811.8  96.93  83.2    10.617   1.406       0.1
      253    12.118           47.899  23607.4  14180.5  97.00  83.2    10.463   1.349       0.2
      253    12.444           49.185  23002.8  13808.2  96.93  83.2    10.622   1.404       0.3
...
------------------------------------------------------------------------------------- .... ----------
 54851346   204.569(100.0)     0.004  22508.5  12210.7  95.64  76.5   154.524  17.740      total

   ELAPSED     COMM.TIME  COMM.TIME   IDLE TIME  IDLE TIME  AVER.LEN      COUNT  TOTAL LEN PROC.NAME
   TIME[sec]       [sec]  / ELAPSED       [sec]  / ELAPSED    [byte]                [byte]

      12.444       0.000                  0.000                 0.0           0       0.0  funcA
      12.090       0.000      0.000       0.000      0.000      0.0           0       0.0   0.0
      12.442       0.000      0.000       0.000      0.000      0.0           0       0.0   0.1
      12.119       0.000      0.000       0.000      0.000      0.0           0       0.0   0.2
      12.444       0.000      0.000       0.000      0.000      0.0           0       0.0   0.3


3.7   MPI Procedures Tracing Facility

NEC MPI provides the facility to trace invocations of and returns from MPI procedures, and the progress of each MPI process is output to the standard output.

The following information is displayed.

The tracing facility makes it easy to see where a program runs and to debug it.

In order to use this facility, please generate the MPI program with the -mpitrace option.
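
For example (f.f90 and the process count are placeholders):

$ mpinfort -mpitrace f.f90
$ mpirun -np 2 ./a.out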

Note that the amount of trace output can be huge if a program calls MPI procedures many times.


3.8    Traceback

By calling MPI_Abort, traceback information can be obtained. The following is an example.

[0,0] MPI Abort by user Aborting program !
[0,0] Obtained 5 stack frames.
[0,0] aborttest() [0x60000003eb18]
[0,0] aborttest() [0x600000006ad0]
[0,0] aborttest() [0x600000005b48]
[0,0] aborttest() [0x600000005cf8]
[0,0] /opt/nec/ve/lib/libc.so.6(__libc_start_main+0x340) [0x600c01c407b0]
[0,0] aborttest() [0x600000005a08]
[0,0] Aborting program!


In the example above, the traceback information is output in the form of libc backtrace symbols by default. When a program built with -traceback=verbose at compile and link time is executed with the environment variable NMPI_VE_TRACEBACK set to "ON", the file name and the line number are output instead. Changing the output format with this environment variable is valid only for VE MPI programs. The following is an example.
[0,0] MPI Abort by user Aborting program !
[0,0] [ 0] 0x600000001718 abort_test       abort.c:33
[0,0] [ 1] 0x600000001600 out              out.c:9
[0,0] [ 2] 0x600000001460 hey              hey.c:9
[0,0] [ 3] 0x600000001530 main             main.c:13
[0,0] [ 4] 0x600c01c407a8 ?                ?:?
[0,0] [ 5] 0x600000000b00 ?                ?:?
[0,0] Aborting program!


The maximum number of lines of traceback information can be set with the environment variable NMPI_TRACEBACK_DEPTH. If no maximum is set, the default value of 50 is used.
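
A sketch of the build and run steps described above (a.c, the depth value, and the process count are illustrative; -traceback=verbose is assumed to be passed through to the compiler):

$ mpincc -traceback=verbose a.c
$ export NMPI_VE_TRACEBACK=ON
$ export NMPI_TRACEBACK_DEPTH=100
$ mpirun -np 4 ./a.out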


3.9   Debug Assist Feature for MPI Collective Procedures

The debug assist feature for MPI collective procedures assists users in debugging invocations of MPI collective procedures by detecting incorrect uses across processes and outputting detected errors in detail to the standard error output.
The incorrect uses include the following cases

Please generate MPI program with the -mpiverify option to use this feature as follows:

$ mpinfort -mpiverify f.f90

When an error is detected, a message including the following information is output to the standard error output.

The following example shows the message output when the process with rank 3 invoked the procedure MPI_BCAST with the argument root whose value was 2 and the process with rank 0 invoked the procedure with the argument root whose value was 1.

VERIFY MPI_Bcast(3): root 2 inconsistent with root 1 of 0

The errors to be detected can be specified by setting the environment variable NMPI_VERIFY at runtime as shown in the following table.

Table 3-16   The Settings of NMPI_VERIFY
NMPI_VERIFY Detected Errors
0 No errors are detected.
3 (Default) Errors other than those in the argument assert of the procedure MPI_WIN_FENCE
4 Errors in the argument assert of the procedure MPI_WIN_FENCE, in addition to the errors detected by default
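
For example, to also check the argument assert of the procedure MPI_WIN_FENCE, setting 4 can be specified at runtime as in the following sketch (a.out is a placeholder executable built with -mpiverify):

$ export NMPI_VERIFY=4
$ mpirun -np 4 ./a.out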

The following table shows the errors that can be detected by the debug assist feature.

Table 3-17 Errors Detected by the Debug Assist Feature
Procedure Target of Checking Condition
All collective procedures Order of invocations Processes in the same communicator, or corresponding to the same window or file handle invoked different MPI collective procedures at the same time.
Procedures with the argument root Argument root The values of the argument root were not the same across processes.
Collective communication procedures Message length (extent of an element * the number of elements transferred) The length of a sent message was not the same as that of the corresponding received message.
Collective communication procedures that perform reduction operations Argument op The values of the argument op (reduction operator) were not the same across processes.
Topology collective procedures Graph information and dimensional information Information of a graph or dimensions specified with arguments was inconsistent across processes.
MPI_COMM_CREATE Argument group The groups specified with the argument group were not the same across processes.
MPI_INTERCOMM_CREATE Arguments local_leader and tag The values of the argument local_leader were not the same across processes in the local communicator, or the values of the argument tag were not the same across the processes corresponding to the argument local_leader or remote_leader.
MPI_INTERCOMM_MERGE Argument high The values of the argument high were not the same across processes.
MPI_FILE_SET_VIEW Arguments etype and datarep The datatypes specified with the argument etype or the data representation specified with the argument datarep were not the same across processes.
MPI_WIN_FENCE Argument assert The values of the argument assert were inconsistent across processes.
Note that this feature involves overhead for checking invocations of MPI collective procedures and can result in lower performance. Therefore, please re-generate MPI program without the -mpiverify option once the correctness of uses of collective procedures is verified.


3.10   Exit Status of an MPI Program

NEC MPI watches exit statuses of MPI processes to determine whether termination of program execution is normal termination or error termination. Normal termination occurs if and only if every MPI process returns 0 as its exit status. Otherwise error termination occurs.
Therefore, the exit status of each MPI process should be set appropriately (0 for normal termination and a non-zero value for error termination) so that NEC MPI can recognize the termination status correctly.
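
As an illustrative sketch, the termination status reported by the MPI execution command can be inspected in the shell after execution as a quick check (a.out is a placeholder executable; a value of 0 indicates normal termination):

$ mpirun -np 4 ./a.out
$ echo $?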


3.11   Miscellaneous

This section describes additional notes in NEC MPI.

  1. In MPI execution, the same version of the MPI library must be linked with all the executable files and shared libraries, except for the cases mentioned below. It is possible to check the library version of executable files in either of the following two ways.

    1. You can obtain the default directory path (RUNPATH) of the dynamically linked MPI library with the nreadelf command, as shown below. The version of the MPI library appears in this path (2.2.0 in the example). Note that when the executable file is generated with the option -shared-mpi, the version of the MPI library that corresponds to the setup script sourced before execution is preferentially linked at run time.
      $ /opt/nec/ve/bin/nreadelf -W -d a.out | grep RUNPATH
      0x000000000000001d (RUNPATH) Library runpath: [/opt/nec/ve/mpi/2.2.0/lib64/ve:...]
    2. When the executable file is generated without the option -shared-mpi, you can obtain the version of the statically linked MPI library with the strings and grep commands.

      $ /usr/bin/strings a.out | /bin/grep "library version"
      NEC MPI: library Version 2.2.0 (17. April 2019): Copyright (c) NEC Corporation 2018-2019

    The MPI memory management library is always dynamically linked, even if the other MPI libraries are statically linked. In this case, dynamically linking a newer version of the MPI memory management library at runtime is fine, as long as its major version matches that of the statically linked MPI libraries.

  2. When users use the extended precision features of the Fortran compiler at compile time of MPI programs written in Fortran, both of the compiler options -fdefault-integer=8 and -fdefault-real=8 must be specified, and other extended precision compiler options must not be specified.

  3. NEC MPI cannot be used in a program in which Fortran procedures that have been compiled with the extended precision compile options and C functions are mixed.

  4. NEC MPI handles signals SIGINT, SIGTERM, and SIGXCPU to appropriately control abnormal terminations of programs. For user programs to handle these signals by themselves, they must call previously defined signal handlers. Otherwise, the proper termination of programs is not guaranteed.

  5. Interfaces in C++ format (C++ bindings), which have been removed in MPI-3.0, cannot be used in NEC MPI. If they are used, please change them into those in C format (C bindings), or specify the option -mpicxx. Note that the option -mpicxx cannot be used when the compiler option -stdlib=libc++ is enabled in NEC C++ compiler.

  6. Programming languages used for source programs cannot be specified with the -x compiler option in the MPI compilation commands.

  7. When acquiring the MPI execution performance information by specifying the environment variable NMPI_PROGINF, you need to use the option -pthread to link the pthread library with MPI programs executed on VE10/VE10E/VE20. If -lpthread is specified instead of the option -pthread, the MPI execution performance information may not be displayed correctly.

  8. By default, the MPI libraries are linked statically except for the MPI memory management library, but when creating a shared library by specifying the -shared compiler option in the MPI compilation commands, all MPI libraries are linked dynamically. When linking such a shared library, in which all MPI libraries are dynamically linked, to an executable file, specify the option -shared-mpi so that all MPI libraries are also linked dynamically in the executable.

  9. The MPI compile commands dynamically link MPI programs even if the compiler option -static is specified. Using the compiler option -static with the MPI compile commands is not recommended. MPI programs require shared system libraries and the shared MPI memory management library to execute, so the MPI compile commands append -Wl,-Bdynamic to the end of the command line to force dynamic linking. The mix of the -Wl,-Bdynamic appended by the MPI compile commands and -static may lead to unexpected behavior.

    If you want to link an MPI program against static libraries, you can use the linker option -Bstatic and the compiler options for linking against static compiler libraries instead of the compiler option -static. When you use the linker option -Bstatic, surround the libraries with -Wl,-Bstatic and -Wl,-Bdynamic; the surrounded libraries are linked statically. In the following example, libww and libxx are linked statically.

    mpincc a.c -lvv -Wl,-Bstatic -lww -lxx -Wl,-Bdynamic -lyy

    About the compiler options to link a program against static compiler libraries, please refer to the compiler's manual.

  10. The execution directory of the MPI program needs write permission. If the permission is insufficient, the following warning message may be output and MPI communication performance may be degraded.
    mkstemp: Permission denied
  11. When the MPI performance information is used, this facility issues the signal SIGUSR1 to VE threads to collect performance information in MPI_Init and MPI_Finalize in MPI processes running on VE10/VE10E/VE20. When an MPI program is executed under a debugger, the debugger may capture SIGUSR1 and stop the MPI execution. Also, when VE MPI programs use non-blocking MPI-IO procedures and POSIX AIO is selected as the asynchronous I/O method used by the procedures, the POSIX AIO worker thread created for asynchronous I/O does not respond to SIGUSR1 and the MPI execution may stop. In these cases, signal issuance can be suppressed by setting the environment variable VE_PROGINF_USE_SIGNAL=NO. When signal issuance is suppressed, the MPI performance information facility terminates only the threads of OpenMP and compiler automatic parallelization and collects information from those threads, so information cannot be collected from the other threads and values are not shown for the performance information items, except for User Time, Real Time, Memory Size Used and Non Swappable Memory Size Used.
  12. MPI uses HugePages to optimize MPI communications. If MPI cannot allocate HugePages on a host, the following warning message is output and the MPI program may terminate abnormally. The configuration of HugePages requires system administrator privileges. If the message is output, please refer to the "SX-Aurora TSUBASA Installation Guide" or contact the system administrator for details.

    mpid(0): Allocate_system_v_shared_memory: key = 0x420bf67e, len = 16777216
    shmget allocation: Cannot allocate memory
  13. The memlock resource limit needs to be set to "unlimited" for MPI to use InfiniBand communication and HugePages. Because this setting is applied automatically, do not change the memlock resource limit from "unlimited" with the ulimit command or similar means. If the memlock resource limit is not "unlimited", MPI execution may abort or MPI communication may slow down with the following messages.

    libibverbs: Warning: RLIMIT_MEMLOCK is 0 bytes.
    This will severely limit memory registrations.
    [0] MPID_OFED_Open_hca: open device failed ib_dev 0x60100002ead0 name mlx5_0
    [0] Error in Infiniband/OFED initialization. Execution aborts
    mpid(0): Allocate_system_v_shared_memory: key = 0xd34d79c0, len = 16777216
    shmget allocation: Operation not permitted
    Even if the memlock resource limit is set to "unlimited", the following message may be output to the system log. This message is not a problem and MPI execution works correctly.
    kernel: mpid (20934): Using mlock ulimits for SHM_HUGETLB is deprecated
  14. If the process terminates abnormally during the application execution, information related to the cause of the abnormal termination (error details, termination status, etc.) is output with the universe number and rank number. However, depending on the timing of abnormal termination, many messages such as the following may be output, making it difficult to refer to the information related to the cause of the abnormal termination.

    [3] mpisx_sendx: left (abnormally) (rc=-1), sock = -1 len 0 (12)
    Error in send () called by mpisx_sendx: Bad filedescriptor
    In this case, it may be easier to refer to this information by excluding the above message. An example command is shown below.
    $ grep -v mpisx_sendx <outputfile>
  15. When an MPI program is executed on Model A412-8, B401-8 or C401-8 using an NQSV request that requests multiple logical nodes, the NQSV option --use-hca needs to be set to the number of available HCAs so that NEC MPI can select appropriate HCAs. Otherwise, the following error may occur at the end of the MPI execution.

    mpid(0): accept_process_answer: Application 0: No valid IB device found which is requested by environment variable NMPI_IP_USAGE=OFF. Specify NMPI_IP_USAGE=FALLBACK if TCP/IP should be used in this case !
  16. When using VEO, VE memory passed directly to MPI procedures must be allocated with veo_alloc_hmem, as in the sketch below.
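
    The following is a minimal sketch, not taken from this manual, of a vector-host MPI process that allocates VE memory with veo_alloc_hmem and passes it directly to MPI procedures; the VE node number, buffer size, and communication pattern are illustrative assumptions.

    /* Hedged sketch: VE memory handed to MPI comes from veo_alloc_hmem. */
    #include <mpi.h>
    #include <ve_offload.h>    /* AVEO API */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        struct veo_proc_handle *proc = veo_proc_create(0);  /* VE node 0 (illustrative) */
        void *hmem = NULL;
        const size_t n = 1024;
        if (proc == NULL || veo_alloc_hmem(proc, &hmem, n * sizeof(double)) != 0)
            MPI_Abort(MPI_COMM_WORLD, 1);

        /* The veo_alloc_hmem address is passed to MPI directly. */
        if (rank == 0)
            MPI_Send(hmem, (int)n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(hmem, (int)n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        veo_free_hmem(hmem);
        veo_proc_destroy(proc);
        MPI_Finalize();
        return 0;
    }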

  17. MPI processes cannot execute the following system calls and library functions.

    processes on VE: fork, popen, posix_spawn
    processes on VH or scalar host: fork, system, popen, posix_spawn

    Additionally, if a process on VE uses non-blocking MPI-IO and VE AIO (the default) is selected as the asynchronous I/O method, the process cannot execute system() until the MPI-IO completes.

    If any of these system calls or library functions is executed, the MPI program may encounter problems such as a stall or abnormal termination.

  18. The malloc_info function cannot be used in MPI programs; if it is called, it may return incorrect values. MPI programs ignore the M_PERTURB, M_ARENA_MAX, and M_ARENA_TEST arguments of the mallopt function and the MALLOC_PERTURB_, MALLOC_ARENA_MAX, and MALLOC_ARENA_TEST environment variables. (Note: in the case of VE programs, VE_ is prefixed to those environment variables.)

  19. If you source the setup script "necmpivars.sh", "necmpivars.csh", "necmpivars-runtime.sh" or "necmpivars-runtime.csh" without explicit parameters inside a shell script, parameters passed to the shell script may be passed on to the setup script. If invalid parameters are passed to the setup script, the following message is output and LD_LIBRARY_PATH is not updated. A workaround sketch follows the message.

    necmpivars.sh: Warning: invalid argument. LD_LIBRARY_PATH is not updated.
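
    As a sketch, assuming bash, one way to avoid passing the script's own arguments to the setup script is to clear the positional parameters before sourcing ({version} is the directory name corresponding to the version of NEC MPI you use):

    args=("$@")            # save the script's own arguments if they are still needed
    set --                 # clear positional parameters so they are not passed to the setup script
    source /opt/nec/ve/mpi/{version}/bin/necmpivars.sh
    set -- "${args[@]}"    # restore the original arguments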
  20. When the AVEO UserDMA feature is enabled, available VE memory may not increase even if veo_free_hmem is called to free VE memory or veo_proc_destroy is called to terminate a VEO process.

  21. When the AVEO UserDMA feature is enabled, users cannot call veo_proc_create or similar functions to create a new VEO process after calling veo_proc_destroy. Doing so may cause abnormal termination or incorrect results.

  22. When an MPI program is executed through the NQSV and the following conditions are all fulfilled, NEC MPI uses SIGSTOP, SIGCONT and SIGUSR2. Therefore, user programs cannot handle (trap, hold, or ignore) those signals, and processes cannot be controlled by a debugger such as gdb; otherwise, the MPI program may be stopped or terminated abnormally.

  23. When using CUDA, GPU memory passed directly to MPI procedures must be allocated with cudaMalloc, cudaMallocPitch, or cudaMalloc3D, as in the sketch below.
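
    The following minimal sketch, not taken from this manual, allocates GPU memory with cudaMalloc and hands the device pointer directly to MPI procedures; the buffer size and communication pattern are illustrative assumptions.

    /* Hedged sketch: GPU memory passed to MPI comes from cudaMalloc. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *d_buf = NULL;              /* device (GPU) buffer */
        const int n = 1024;                /* illustrative element count */
        if (cudaMalloc((void **)&d_buf, n * sizeof(double)) != cudaSuccess)
            MPI_Abort(MPI_COMM_WORLD, 1);

        /* The device pointer is handed to MPI directly. */
        if (rank == 0)
            MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }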

  24. When a single MPI execution invokes MPI processes on VE30 and also MPI processes on VE10/VE10E/VE20, you cannot source the MPI setup script (necmpivars.sh and so on). In this case, the mpirun command needs to be specified with its full path, /opt/nec/ve/bin/mpirun or /opt/nec/ve3/{version}/bin/runtime/mpirun ({version} is the directory name corresponding to the version of NEC MPI you use).

  25. When using CUDA with NVIDIA CUDA Toolkit 11.2 or earlier, the following message may be displayed. It indicates that the API for enabling the GPUDirect RDMA feature is missing. Please ignore this message if you do not use the GPUDirect RDMA feature. You can suppress the message by specifying the environment variable NMPI_IB_GPUDIRECT_ENABLE=OFF, which disables the GPUDirect RDMA feature (an example follows the message).

    MPID_CUDA_Init_GPUDirect: Cannot dynamically load CUDA symbol cuFlushGPUDirectRDMAWrites
    MPID_CUDA_Init_GPUDirect: Error message /lib64/libcuda.so: undefined symbol: cuFlushGPUDirectRDMAWrites
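
    For example, assuming bash, the variable can be exported before launching the MPI execution:

    $ export NMPI_IB_GPUDIRECT_ENABLE=OFF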
  26. The MPI compile commands dynamically link the MPI memory management library so that calls to functions such as malloc and free in a program execute the functions provided by the MPI memory management library. For this reason, programs linked with the MPI compile commands should not call malloc, free, or similar functions provided by other libraries; doing so can cause memory corruption. If you want to implement wrappers for functions such as malloc and free, obtain the functions provided by the MPI memory management library with dlsym(RTLD_NEXT) and call them, as in the sketch below.
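
    The following is a minimal wrapper sketch, not taken from this manual, assuming glibc-style dynamic linking; it forwards malloc and free to the functions resolved via dlsym(RTLD_NEXT), i.e. the MPI memory management library when it precedes other allocators in the search order.

    /* Hedged sketch of wrapping malloc/free via dlsym(RTLD_NEXT). */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stddef.h>

    static void *(*next_malloc)(size_t) = NULL;
    static void  (*next_free)(void *)   = NULL;

    void *malloc(size_t size)
    {
        if (next_malloc == NULL)
            next_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
        /* add bookkeeping here, but avoid calls that allocate memory */
        return next_malloc(size);
    }

    void free(void *ptr)
    {
        if (next_free == NULL)
            next_free = (void (*)(void *))dlsym(RTLD_NEXT, "free");
        next_free(ptr);
    }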

  27. Even if you specify the environment variable NMPI_SWAP_ON_HOLD=ON to reduce Non Swappable Memory during Switch Over, Non Swappable Memory may not be reduced as expected.

    The memory targeted by the direct transfer of InfiniBand communication and the global memory returned by MPI procedures such as MPI_Alloc_mem become Non Swappable Memory; however, because reducing them could change performance, NMPI_SWAP_ON_HOLD=ON does not reduce these kinds of Non Swappable Memory.

    If you want to prioritize reducing Non Swappable Memory, please specify additional environment variables depending on the output of mpirun -v.

