The Tutorial and API Reference of VEDA  2.11.1
Getting Started with VEDA

VEDA (VE Driver API) and VERA (VE Runtime API) are a CUDA Driver and Runtime API-like APIs for hybrid programming. It is based on AVEO. Most of the functionality is identical to the CUDA Driver API and CUDA Runtime API.

Introduction

VEDA is a parallel computing platform and hybrid programming model. It enables Vector engine (VE) for general purpose computing in a simple and elegant manner. VEDA API's are inspired by the widely used CUDA Driver API. It builds upon AVEO and enables easy porting of existing CUDA (and other hybrid) applications to VE. VEDA uses CUDA's design principles and maps these onto the execution model of AVEO.

Using VEDA, a programmer can execute code on VE and can control the execution from VH main program.

Installation

Prerequisite

VEDA API are build upon AVEO, Hence as a prerequisite first please install AVEO.
For veda program execution, veoffload-aveorun package is required.
For veda program development, veoffload-aveo and veoffload-aveo-devel packages are required.
For installation of aveo packages refer link https://www.hpc.nec/documents/veos/en/aveo/md_GettingStarted.html

Installing runtime package

To run programs, please install veoffload-veda and the runtime packages of the compiler (2.2.2 or later).

To install the packages to run programs by yum, execute the following command as root:

# yum install veoffload-veda

Installing development package

To develop programs, veoffload-veda-devel and the development packages of the compiler (2.2.2 or later) are also required.

To install the packages to develop programs by yum, execute the following command as root:

# yum install veoffload-veda-devel

Hello World

First, let's try a "Hello, World!" program on VE.

The Required Number of HugePages for VEDA

VEDA requires HugePages for data transfer. The required number of HugePages 32 per VEDA thread context.

VE Code

Code to run on VE is shown below. Standard C functions are available, hence, you can use printf(3).

#include <stdio.h>
void hello_world(void) {
printf("Hello World!\n");
}

Save the above code as libvehello.vc.

A function on VE called via VEDA needs to return a 64-bit unsigned integer. A function on VE called via VEDA can have arguments as mentioned later.

Compile VE Code

VEDA supports a function in a shared library.

To execute a function on VE using VEDA, compile and link a source file into a binary for VE.

To build a shared library with the functions for dynamic loading, execute as follows:

/opt/nec/ve/bin/ncc -x c -fpic -I/opt/nec/ve/share/veda/include -o libvehello.o -c libvehello.vc
/opt/nec/ve/bin/ncc -shared -o libvehello.vso libvehello.o

VH Main Program

Main routine on VH side to run VE program is shown here.

A program using VEDA needs to include "veda.h" and "vera.h" for VERA. In the header, the prototypes of functions and constants for VEDA and VERA API are defined.

The example VH program to call a VE function in a dynamic library with VEDA:

#include <stdio.h>
#include <stdlib.h>
#include <veda.h>
int main(int argc, char** argv) {
printf("Hello World from Host!\n");
VEDAcontext ctx;
VEDAmodule mod;
vedaModuleLoad(&mod, "libvehello.vso");
VEDAfunction func;
vedaModuleGetFunction(&func, mod, "hello_world");
VEDAargs args;
vedaLaunchKernelEx(func, 0, args, 1);
return 0;
}

Save the above code as hello.c

  1. API vedaInit() is called to initialized VE devices, the initialized VE devices may be termed as VEDA device.
  2. VE process is created on the VEDA device and the handle to the VE process is returned to the VH process. The returned handle to the VE process is termed as the VEDA context in the VEDA hybrid program.
  3. vedaDevicePrimaryCtxRetain() retain the primary Context on the VEDA device or create a new context incase of no context and vedaCtxPushCurrent() pushes a context on the current CPU thread.
  4. Internally VEDA context instantiate the VE threads on the VE device to execute the VE functions. These VE threads are termed as VEDA Streaming Multiprocessor(SM) throughout this mechanism.
  5. VE compiled device code(libvehello.vc) is loaded into the VE memory by calling vedaModuleLoad(), which loads a VE shared library.
  6. VE device address of the VE device function “hello_world” retrieved by calling vedaModuleGetFunction() function.
  7. VE device function is termed as VEDA device function in VEDA hybrid program. Although VEDA device function and VEDA kernel function can be used interchangeably and logically both the term refers to the same entity VE function.
  8. But VEDA kernel functions are different in terms as they are the preloaded VE device functions at the time of context creation. On the other hand, VEDA device function are loaded by the VH program.
  9. By calling vedaLaunchKernelEx(), VEDA hybrid program submits the request for the execution of the VEDA device function to VEDA streaming Multiprocessor(SM). It may be termed as launching of the VEDA device function. VEDA arguement would be destroyed after VEDA device function is called as fourth argument is set to 1.
  10. Execution of the VH programs is blocked by calling vedaCtxSynchronize() since the execution of the “hello_world” program is finished, this is termed as VEDA streaming multiprocessor(SM) synchronization.
  11. API vedaExit() releases the VEDA driver API library gracefully and perform the proper cleanup of the VEDA driver library.

Compile VH Main Program

Compile source code on VH side as shown below.

$ g++ -o hello hello.c -I/opt/nec/ve/share/veda/include -L/opt/nec/ve/veos/lib64 \
-rdynamic -Wl,-rpath,/opt/nec/ve/veos/lib64 -lveda

The headers for VEDA and VERA are installed in /opt/nec/ve/veos/include. libveda and libvera, the shared library of VEDA and VERA, is in /opt/nec/ve/veos/lib64.

Run a program with VEDA

Execute the compiled VEDA program.

$ ./hello
Hello, world!

Various Arguments for a VE function

You can pass one or more arguments to a function on VE. To specify arguments, VEDA arguments object is used. A VEDA argument object is created by vedaArgsCreate(). When a VEDA argument object is created, the VEDA argument object is empty, without any arguments passed. Even if a VE function has no arguments, a VEDA arguments object is still necessary.

VEDA provides functions to set an argument in various types.

Basic Types

To pass an integer value, the following functions are used.

VEDAresult vedaArgsSetI64 (VEDAargs args, const int idx, const int64_t value);
VEDAresult vedaArgsSetU16 (VEDAargs args, const int idx, const uint16_t value);
VEDAresult vedaArgsSetI32 (VEDAargs args, const int idx, const int32_t value);
VEDAresult vedaArgsSetU32 (VEDAargs args, const int idx, const uint32_t value);
VEDAresult vedaArgsSetI16 (VEDAargs args, const int idx, const int16_t value);
VEDAresult vedaArgsSetU16 (VEDAargs args, const int idx, const int16_t value);
VEDAresult vedaArgsSetU8 (VEDAargs args, const int idx, const uint8_t value);
VEDAresult vedaArgsSetI8 (VEDAargs args, const int idx, const int8_t value);

You can pass also a floating point number argument.

VEDAresult vedaArgsSetF32 (VEDAargs args, const int idx, const float value);
VEDAresult vedaArgsSetF64 (VEDAargs args, const int idx, const double value);

For instance: suppose that veda device is initialized and func(int, double) is defined in a VE library whose handle is func1.

VEDAargs args;
vedaArgsSetI32(args, 0, 1);
vedaArgsSetF64(args, 1, 2.0);
VEDAmodule mod;
vedaModuleLoad(&mod, "libvehello.vso"));
VEDAfunction func1;
vedaModuleGetFunction(&func1, mod, "func");
vedaLaunchKernel(func1, 0, args);

In this case, func(1, 2.0) is called on VE.

Stack Arguments

Non basic typed arguments and arguments by reference are put on a stack. VEDA supports an argument on a stack.

To set a stack argument to a VEDA arguments object, call vedaArgsSetStack().

VEDAresult vedaArgsSetStack (VEDAargs args, const int idx, void* ptr,
VEDAargs_intent intent, const size_t size);

The fourth argument specifies the argument is for input and/or output.

  • VEDA_ARGS_INTENT_IN: the argument is for input; data is copied into a VE stack on call.
  • VEDA_ARGS_INTENT_OUT: the argument is for output; a VE stack area is allocated without copy-in and data is copied out to VH memory on completion.
  • VEDA_ARGS_INTENT_INOUT: the argument is for both input and output; data is copied into and out from a VE stack area.

How to call the function written by Fortran

VE Code (Fortran)

Code written by Fortran to run on VE is shown below.

SUBROUTINE SUB1(x, ret)
implicit none
INTEGER, INTENT(IN) :: x
INTEGER, INTENT(OUT) :: ret
ret = x + 1
END SUBROUTINE SUB1

Save the above code as libvefortran.vf90.

Compile VE Code (Fortran)

To build a shared library with the functions for dynamic loading, execute as follows:

$/opt/nec/ve/bin/nfort -x f95 -I/opt/nec/ve/share/veda/include -c -o libvefortran.o libvefortran.vf90
$/opt/nec/ve/bin/nfort -shared -fpic -o libvefortran.vso libvefortran.o

VH Main Program (Fortran)

Main routine on VH side to run VE program written by Fortran is shown here.

The example VH program to call a VE Fortran function in a dynamically linked executable:

#include <stdio.h>
#include <stdlib.h>
#include <veda.h>
#define VEDA(err) check(err, __FILE__, __LINE__)
void check(VEDAresult err, const char* file, const int line) {
if(err != VEDA_SUCCESS) {
const char *name, *str;
vedaGetErrorName (err, &name);
vedaGetErrorString (err, &str);
printf("%s: %s @ %s:%i\n", name, str, file, line);
exit(1);
}
}
int main(int argc, char** argv) {
printf("Hello World from Host!\n");
VEDA(vedaInit(0));
VEDAcontext ctx;
VEDA(vedaCtxPushCurrent(ctx));
VEDAmodule mod;
VEDA(vedaModuleLoad(&mod, "libvefortran.vso"));
VEDAfunction func;
VEDA(vedaModuleGetFunction(&func, mod, "sub1_"));
VEDAargs args;
VEDA(vedaArgsCreate(&args));
size_t x = 42;
size_t y = 1;
VEDA(vedaArgsSetStack(args, 0, &x, VEDA_ARGS_INTENT_IN, sizeof(x)));
VEDA(vedaArgsSetStack(args, 1, &y, VEDA_ARGS_INTENT_OUT, sizeof(y)));
VEDA(vedaLaunchKernel(func, 0, args));
VEDA(vedaArgsDestroy(args));
printf("SUB1 return %lu\n",y);
VEDA(vedaExit());
return 0;
}

Save the above code as fortran.c.

  1. API vedaInit() is called to initialized VE devices, the initialized VE devices may be termed as VEDA device.
  2. VE process is created on the VEDA device and the handle to the VE process is returned to the VH process. The returned handle to the VE process is termed as the VEDA context in the VEDA hybrid program.
  3. vedaDevicePrimaryCtxRetain() retain the primary Context on the VEDA device or create a new context incase of no context and vedaCtxPushCurrent() pushes a context on the current CPU thread.
  4. Internally VEDA context instantiate the VE threads on the VE device to execute the VE functions. These VE threads are termed as VEDA Streaming Multiprocessor(SM) throughout this mechanism.
  5. VE compiled device code(libvefortran.vf90) is loaded into the VE memory by calling vedaModuleLoad(), which loads a VE shared library.
  6. VE device address of the VE device function “sub1_” retrieved by calling vedaModuleGetFunction() function.
  7. VE device function is termed as VEDA device function in VEDA hybrid program. Although VEDA device function and VEDA kernel function can be used interchangeably and logically both the term refers to the same entity VE function.
  8. But VEDA kernel functions are different in terms as they are the preloaded VE device functions at the time of context creation. On the other hand, VEDA device function are loaded by the VH program.
  9. API vedaArgsCreate() instantiate the VEDA function argument handler.
  10. API vedaArgsSetStack() initialize the VEDA function argument to point the buffer on stack. Where x is treated as the Input buffer to the VEDA device function.
  11. API vedaArgsSetStack() initialize the VEDA function argument to point the buffer on stack. Where y is treated as the Output buffer, some output is expected.
  12. By calling vedaLaunchKernel(), VEDA hybrid program submits the request for the execution of the VEDA device function to VEDA streaming Multiprocessor(SM). It may be termed as launching of the VEDA device function.
  13. Execution of the VH programs is blocked by calling vedaCtxSynchronize() since the execution of the “sub1_” program is finished, this is termed as VEDA streaming multiprocessor(SM) synchronization.
  14. API vedaExit() releases the VEDA driver API library gracefully and perform the proper clean-up of the VEDA driver library.

For passing arguments to VE Fortran function, please use vedaArgsSetStack() to pass arguments as stack arguments. However, for passing arguments to arguments with VALUE attribute in Fortran function, please pass arguments by value in the same way as VE C function.

When you want to load VE Fortran function by vedaModuleGetFunction() with the name of a Fortran function, please change the name of the Fortran function to lowercase, and add "_" at the end of the function name.

Taking libvefortran.vf90 and fortran.c as an example, pass "sub1_" as an argument to vedaModuleGetFunction() in fortran.c when calling the Fortran function named "SUB1" in libvefortran.f90.

The method of compiling and running VH main program are same as C program.

Compile VH Main Program (Fortran)

Compile source code on VH side as shown below. This is the same as the compilation method described above.

$ gcc -o fortran fortran.c -I/opt/nec/ve/share/veda/include -L/opt/nec/ve/veos/lib64 \
-rdynamic -Wl,-rpath,/opt/nec/ve/veos/lib64 -lveda

Run a program with VEDA

Execute the compiled VEDA program. This is also the same as the execution method described above.

$ ./fortran
SUB1 return 43

How to parallelize code using OpenMP

VE code using OpenMP in C

The following is an example of VE code using OpenMP written in C.

#include <stdio.h>
int omp_hello(void)
{
int tid, nthreads = 0;
#pragma omp parallel private(nthreads, tid)
{
tid = omp_get_thread_num();
printf("Hello, World! from thread = %d\n", tid);
if (tid == 0)
{
nthreads = omp_get_num_threads();
printf("Number of threads = %d\n", nthreads);
}
} /* All threads join master thread and disband */
fflush(stdout);
return 0;
}

Save the above code in libomphello.vc.

How to build VE code

To use OpenMP parallelization, specify -fopenmp at compilation and linking.

Here is an example of building VE code written in C.

To build a shared library, execute as follows:

$/opt/nec/ve/bin/ncc -x c -fpic -I/opt/nec/ve/share/veda/include -o libomphello.o -c libomphello.vc -fopenmp
$/opt/nec/ve/bin/ncc -shared -o libomphello.vso libomphello.o

Compile VH Main Program (Omp)

Example omphello.c which calls the above omp VE code,

Compile source code on VH side as shown below.

$ gcc -o omphello omphello.c -I/opt/nec/ve/share/veda/include -L/opt/nec/ve/veos/lib64 \
-rdynamic -Wl,-rpath,/opt/nec/ve/veos/lib64 -lveda

Execute the compiled omp VEDA program.

$ ./omphello
Hello World from Host!
Hello, World! from thread = 4
Hello, World! from thread = 2
Hello, World! from thread = 0
Hello, World! from thread = 1
Hello, World! from thread = 5
Hello, World! from thread = 3
Hello, World! from thread = 7
Hello, World! from thread = 6
Number of threads = 8

VE code using OpenMP in fortran

The following shows the example written in Fortran.

INTEGER FUNCTION OMP_HELLO()
INTEGER :: TID = 0
INTEGER :: NTHREADS = 0
!$OMP PARALLEL PRIVATE(TID, NTHREADS)
TID = omp_get_thread_num()
WRITE(*,*) "Hello, World! from thread = ", TID
IF ( TID == 0 ) THEN
NTHREADS = omp_get_num_threads()
OMP_HELLO = NTHREADS
WRITE(*,*) "Number of threads = ", NTHREADS
END IF
!$OMP END PARALLEL
END FUNCTION OMP_HELLO

Save the above code in libompfortran.vf90.

How to build VE code

To use OpenMP parallelization, specify -fopenmp at compilation and linking.

Here is an example of building VE code written in Fortran.

To build a shared library, execute as follows:

$/opt/nec/ve/bin/nfort -x f95 -fpic -I/opt/nec/ve/share/veda/include -o libompfortran.o -c libompfortran.vf90 -fopenmp
$/opt/nec/ve/bin/nfort -shared -o libompfortran.vso libompfortran.o

To build code written in Fortran, change the compiler to nfort.

Compile VH Main Program (Omp)

Example omphellofortran.c which calls the above omp VE code,

Compile source code on VH side as shown below.

$ gcc -o omphellofortran omphellofortran.c -I/opt/nec/ve/share/veda/include -L/opt/nec/ve/veos/lib64 \
-rdynamic -Wl,-rpath,/opt/nec/ve/veos/lib64 -lveda

Execute the compiled omp VEDA program.

$ ./omphellofortran
Hello World from Host!
Hello, World! from thread = 4
Hello, World! from thread = 2
Hello, World! from thread = 0
Hello, World! from thread = 1
Hello, World! from thread = 5
Hello, World! from thread = 3
Hello, World! from thread = 7
Hello, World! from thread = 6
Number of threads = 8

VEDA for different context mode

Example program of stream in OMP context mode.

#include <veda.h>
#include <cstdio>
#include <cstdlib>
#include <cassert>
#define CHECK(err) check(err, __FILE__, __LINE__)
void check(VEDAresult err, const char* file, const int line) {
if(err != VEDA_SUCCESS) {
const char* name = 0;
vedaGetErrorName(err, &name);
printf("Error: %i %s @ %s (%i)\n", err, name, file, line);
assert(false);
exit(1);
}
}
int main(int argc, char** argv) {
CHECK(vedaInit(0));
int devcnt;
CHECK(vedaDeviceGetCount(&devcnt));
for(int dev = 0; dev < devcnt; dev++) {
VEDAcontext ctx;
int cnt;
CHECK(vedaCtxCreate(&ctx, VEDA_CONTEXT_MODE_OMP, dev));
CHECK(vedaCtxStreamCnt(&cnt));
if(cnt == 1)
{
printf("Passed\n");
}
printf("Stream count in omp mode is %d for device %d\n",cnt,dev);
}
CHECK(vedaExit());
return 0;
}

Save the above code in Omp_stream.cpp.

Compile source code on VH side as shown below.

$ g++ -o Omp_stream Omp_stream.cpp -I/opt/nec/ve/share/veda/include -L/opt/nec/ve/veos/lib64 \
-rdynamic -Wl,-rpath,/opt/nec/ve/veos/lib64 -lveda

Execute the compiled VEDA program.

$ ./Omp_stream
Passed
Stream count in omp mode is 1 for device 0
Passed
Stream count in omp mode is 1 for device 1

Example program of streams in SCALAR context mode.

#include <veda.h>
#include <cstdio>
#include <cstdlib>
#include <cassert>
#define CHECK(err) check(err, __FILE__, __LINE__)
void check(VEDAresult err, const char* file, const int line) {
if(err != VEDA_SUCCESS) {
const char* name = 0;
vedaGetErrorName(err, &name);
printf("Error: %i %s @ %s (%i)\n", err, name, file, line);
assert(false);
exit(1);
}
}
int main(int argc, char** argv) {
CHECK(vedaInit(0));
int devcnt;
CHECK(vedaDeviceGetCount(&devcnt));
for(int dev = 0; dev < devcnt; dev++) {
VEDAcontext ctx;
int cnt, cores;
CHECK(vedaCtxCreate(&ctx, VEDA_CONTEXT_MODE_SCALAR, dev));
CHECK(vedaDeviceGetAttribute(&cores,VEDA_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT,dev));
CHECK(vedaCtxStreamCnt(&cnt));
if(cnt == cores)
{
printf("Passed\n");
}
printf("For device %d: Stream count is %d and avaliable device core is %d\n",dev, cnt, cores);
}
CHECK(vedaExit());
//In above code, we can change the default stream value by updating "VE_OMP_NUM_THREADS" env variable.
setenv("VE_OMP_NUM_THREADS","1", 1);
CHECK(vedaInit(0));
CHECK(vedaDeviceGetCount(&devcnt));
for(int dev = 0; dev < devcnt; dev++) {
VEDAcontext ctx;
int cnt;
CHECK(vedaCtxCreate(&ctx, VEDA_CONTEXT_MODE_SCALAR, dev));
CHECK(vedaCtxStreamCnt(&cnt));
if(cnt == 1)
{
printf("Passed\n");
}
printf("For device %d: Stream count is %d which should be same as VE_OMP_NUM_THREADS env variable i.e. 1\n",dev,cnt);
}
CHECK(vedaExit());
return 0;
}

Save the above code in Scalar_stream.cpp.

Compile source code on VH side as shown below.

$ g++ -o Scalar_stream Scalar_stream.cpp -I/opt/nec/ve/share/veda/include -L/opt/nec/ve/veos/lib64 \
-rdynamic -Wl,-rpath,/opt/nec/ve/veos/lib64 -lveda

Execute the compiled VEDA program.

$ ./Scalar_stream
Passed
For device 0: Stream count is 4 and avaliable device core is 4
Passed
For device 1: Stream count is 4 and avaliable device core is 4
Passed
For device 0: Stream count is 1 which should be same as VE_OMP_NUM_THREADS env variable i.e. 1
Passed
For device 1: Stream count is 1 which should be same as VE_OMP_NUM_THREADS env variable i.e. 1

Example program of streams in SCALAR context mode for different devices.

#include <veda.h>
#include <cstdio>
#include <cstdlib>
#include <cassert>
#define CHECK(err) check(err, __FILE__, __LINE__)
void check(VEDAresult err, const char* file, const int line) {
if(err != VEDA_SUCCESS) {
const char* name = 0;
vedaGetErrorName(err, &name);
printf("Error: %i %s @ %s (%i)\n", err, name, file, line);
assert(false);
exit(1);
}
}
int main(int argc, char** argv) {
CHECK(vedaInit(0));
int devcnt = 0;
CHECK(vedaDeviceGetCount(&devcnt));
for(int dev = 0; dev < devcnt; dev++) {
VEDAcontext ctx;
CHECK(vedaCtxCreate(&ctx, VEDA_CONTEXT_MODE_SCALAR, dev));
VEDAmodule mod;
const char* modName = "libvehello.vso";
CHECK(vedaModuleLoad(&mod, modName));
VEDAfunction func;
const char* funcName = "hello_world";
CHECK(vedaModuleGetFunction(&func, mod, funcName));
int num = 0;
CHECK(vedaCtxStreamCnt(&num));
for(int stream =0; stream<num; stream++){
printf("stream =%d \n",stream);
CHECK(vedaLaunchKernel(func, stream, 0));
}
for(int stream = 0; stream < num; stream++)
CHECK(vedaStreamSynchronize(stream));
CHECK(vedaCtxDestroy(ctx));
}
CHECK(vedaExit());
return 0;
}

Save the above code in Device_stream.cpp.

Compile source code on VH side as shown below.

$ g++ -o Device_stream Device_stream.cpp -I/opt/nec/ve/share/veda/include -L/opt/nec/ve/veos/lib64 \
-rdynamic -Wl,-rpath,/opt/nec/ve/veos/lib64 -lveda

To build libvehello.vc for shared library with the functions for dynamic loading, execute as follows:

/opt/nec/ve/bin/ncc -x c -fpic -I/opt/nec/ve/share/veda/include -o libvehello.o -c libvehello.vc
/opt/nec/ve/bin/ncc -shared -o libvehello.vso libvehello.o

Execute the compiled VEDA program.

$ ./Device_stream
stream =0
stream =1
stream =2
stream =3
Hello World!
Hello World!
Hello World!
Hello World!
stream =0
stream =1
stream =2
stream =3
Hello World!
Hello World!
Hello World!
Hello World!

Alternate way of building VEDA

VEDA Hybrid Offloading:

(1) VEDA projects can be compiled including VEDA host and device code within a single CMake build system.
(2) We need to write separate files for host and device code.
(3) VEDA cmake project can be configured by setting CMAKE_MODULE_PATH to the VEDA cmake directory path.
(4) The languages VEDA_C, VEDA_CXX or VEDA_Fortran need to be enabled in cmake configuration file, by which it finds the scripts to compile the appropriate source code extension files of veda device.
(5) VEDA device source code file extensions need to be prefixed with v as filename.vc, filename.vcpp, filename.vf etc. afterwards, the entire compilation and linking process is automatically handled by CMake.

SET(CMAKE_MODULE_PATH /opt/nec/ve/share/veda/cmake)
FIND_PACKAGE(VE)
ENABLE_LANGUAGE(VEDA_C)
INCLUDE_DIRECTORIES(${VEDA_INCLUDES})
ADD_EXECUTABLE (hello_world host.c)
TARGET_LINK_LIBRARIES (hello_world ${VEDA_LIBRARY})
ADD_LIBRARY (hello_world_device SHARED device.vc)

Save the above code in CMakeLists.txt for C project

Sample C project directory:
CMakeLists.txt device.vc host.c

Sample CXX project directory:
CMakeLists.txt device.vcpp host.cpp

Sample Fortran project directory:
CMakeLists.txt device.vf host.c

How to build your project:

To build a project, execute as follows:

mkdir build
cd build
cmake ../
make

Execute the build VEDA project binary:

$ ./hello_world
Hello World from Host!
Hello World from Device!

NUMA Support

VEDA supports VE NUMA nodes since v0.10. To enable NUMA on your system you need to execute (set -N ? to specific device index):

VCMD="sudo /opt/nec/ve/bin/vecmd -N ?"
$VCMD vconfig set partitioning_mode on
$VCMD state set off
$VCMD state set mnt
$VCMD reset card

VEDA then recognizes each NUMA node as a separate device, i.e. with 2 physical devices in NUMA mode, VEDA would show 4 devices. You can use VEDAresult vedaDeviceDistance(float* distance, VEDAdevice devA, VEDAdevice devB) to determine the relationship of two VEDAdevices.

distance == 0.0; // same device
distance == 0.5; // same physical device, different NUMA node
distance == 1.0; // differeny physical device

Example program of NUMA node

#include <veda.h>
#include <cstdio>
#include <cstdlib>
#include <cassert>
#define CHECK(err) check(err, __FILE__, __LINE__)
void check(VEDAresult err, const char* file, const int line) {
if(err != VEDA_SUCCESS) {
const char* name = 0;
vedaGetErrorName(err, &name);
printf("Error: %i %s @ %s (%i)\n", err, name, file, line);
assert(false);
exit(1);
}
}
int main(int argc, char** argv) {
CHECK(vedaInit(0));
int devcnt;
CHECK(vedaDeviceGetCount(&devcnt));
for(int devA = 0; devA < devcnt; devA++) {
for(int devB = 0; devB < devcnt; devB++) {
float distance = 0;
CHECK(vedaDeviceDistance(&distance, devA, devB));
printf("Distance %i >> %i = %f (%s)\n", devA, devB, distance, distance == 0 ? "same device" : distance == 0.5f ? "different numa node" : "different device");
}
}
CHECK(vedaExit());
return 0;
}

Save the above code in numa_node.cpp.

Compile source code on VH side as shown below.

$ g++ -o numa_node numa_node.cpp -I/opt/nec/ve/share/veda/include -L/opt/nec/ve/veos/lib64 \
-rdynamic -Wl,-rpath,/opt/nec/ve/veos/lib64 -lveda

Execute the compiled code.

$ ./numa_node
Distance 0 >> 0 = 0.000000 (same device)
Distance 0 >> 1 = 0.500000 (different numa node)
Distance 1 >> 0 = 0.500000 (different numa node)
Distance 1 >> 1 = 0.000000 (same device)

VEDA-smi

The executable veda-smi displays available VEDA devices in your system. It uses the VEDA_VISIBLE_DEVICES env var and therefore only shows the devices that your VEDA application would be able to use. Use VEDA_VISIBLE_DEVICES= veda-smi to ensure that you see all installed devices.

╔ veda-smi ═════════════════════════════════════════════════════════════════════╗
║ VEDA Version: 2.10.0 AVEO Version: 2.8.2 ║
╚═══════════════════════════════════════════════════════════════════════════════╝
┌── #0 NEC SX-Aurora Tsubasa VE10B ────────────────────────────────────────────┐
┌ Physical: 1.0
├ AVEO: 0.0
├ Clock: current: 1400 MHz, base: 800 MHz, memory: 1600 MHz
├ Firmware: 5399
├ Memory: 49152 MiB
├ Cache: LLC: 8192kB, L2: 256kB, L1d: 32kB, L1i: 32kB
├ Temp: 56.4°C 56.4°C 57.0°C 56.1°C
└ Power: 18.0W (11.9V, 1.5A)
└───────────────────────────────────────────────────────────────────────────────┘
┌── #1 NEC SX-Aurora Tsubasa VE10B ────────────────────────────────────────────┐
┌ Physical: 1.1
├ AVEO: 0.1
├ Clock: current: 1400 MHz, base: 800 MHz, memory: 1600 MHz
├ Firmware: 5399
├ Memory: 49152 MiB
├ Cache: LLC: 8192kB, L2: 256kB, L1d: 32kB, L1i: 32kB
├ Temp: 56.1°C 56.4°C 55.9°C 56.0°C
└ Power: 18.0W (11.9V, 1.5A)
└───────────────────────────────────────────────────────────────────────────────┘
┌── #2 NEC SX-Aurora Tsubasa VE10B ────────────────────────────────────────────┐
┌ Physical: 0.0
├ AVEO: 1.0
├ Clock: current: 1400 MHz, base: 800 MHz, memory: 1600 MHz
├ Firmware: 5399
├ Memory: 49152 MiB
├ Cache: LLC: 16384kB, L2: 256kB, L1d: 32kB, L1i: 32kB
├ Temp: 53.8°C 53.5°C 54.1°C 53.8°C 53.8°C 54.1°C 53.2°C 53.5°C
└ Power: 36.3W (11.9V, 3.1A)
└───────────────────────────────────────────────────────────────────────────────┘

Environment variables for VEDA

Environment variables Brief Default value
VE_LD_LIBRARY_PATH Default library path of VE to check dynamic and shared libraries required for current VEDA program. .(Current directory)
LD_LIBRARY_PATH Default library path for VE to check for available dynamic and shared libraries of VEDA. None
VEDA_VISIBLE_DEVICES To restrict VEDA to only use those GPUs that have peer-to-peer support. 0

Environment variables to optimize data transfer

AVEO(2.7.5 or later) supports the environment variables to optimize the performance of data transfer.

Please refer below link to set AVEO related environment variables, https://www.hpc.nec/documents/veos/en/aveo/md_GettingStarted.html