The Tutorial and API Reference of VEDA
2.11.1
VEDA provides an API inspired by the widely used CUDA Driver API. It builds upon AVEO and enables easy porting of existing CUDA (and other hybrid) applications to the VE. VEDA uses CUDA's design principles and maps them onto the execution model of AVEO. It supports multiple devices, NUMA nodes, asynchronous execution queues, and many more features, closely mirroring CUDA's best practices that have been tried and tested for over a decade.
Similar to CUDA, VEDA enumerates the physical devices and NUMA nodes starting from zero, whereby NUMA nodes always have adjacent indices. The environment variable VEDA_VISIBLE_DEVICES determines which devices are visible within the application. In contrast to CUDA, VEDA only supports a single device context at a time, which maintains all loaded libraries/modules and allocations.
VEDA provides most of the same APIs as CUDA; however, as the programming model of the SX-Aurora differs from that of NVIDIA GPUs, there are some differences:
1. In VEDA, all function calls start with veda* instead of cu*, and VERA runtime API calls start with vera* instead of cuda*.
2. Objects start with VEDA* instead of CU*, and vera* instead of cuda*.
3. Similar to the CUDA Runtime API, calls from VEDA and VERA can be mixed.
4. VEDA uses the environment variable VEDA_VISIBLE_DEVICES in contrast to CUDA_VISIBLE_DEVICES.
5. As in CUDA, vedaInit(0) needs to be called at the beginning of the application. In addition, vedaExit() must be called at the end to ensure that no dead device processes stay alive.
6. VEDA/VERA supports asynchronous malloc and free via vedaMemAllocAsync and vedaMemFreeAsync. They can be used like the synchronous calls and don't require synchronizing execution between device and host.
7. VEDA streams differ from CUDA streams, as there are two modes for creating a VEDA context. See the chapter "Two modes of VEDA Context" for more details.
8. Due to the different programming model, launching kernels looks different (see the first sketch after this list).
9. A VEDAdeviceptr needs to be dereferenced first on the device side (see the second sketch after this list).
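A minimal sketch of a kernel launch in VEDA: arguments are packed into a VEDAargs object instead of a CUDA-style void* array. The module name libmy_kernels.vso and the kernel name my_kernel are hypothetical; error checking is omitted for brevity:

    #include <veda.h>

    int main() {
        vedaInit(0);
        VEDAcontext ctx;
        vedaCtxCreate(&ctx, VEDA_CONTEXT_MODE_OMP, 0);

        VEDAmodule mod;
        vedaModuleLoad(&mod, "libmy_kernels.vso");      // hypothetical device library
        VEDAfunction func;
        vedaModuleGetFunction(&func, mod, "my_kernel"); // hypothetical kernel name

        VEDAdeviceptr ptr;
        vedaMemAllocAsync(&ptr, 128 * sizeof(float), 0);

        // Pack the kernel arguments into a VEDAargs object.
        VEDAargs args;
        vedaArgsCreate(&args);
        vedaArgsSetVPtr(args, 0, ptr);   // argument 0: device pointer
        vedaArgsSetU64(args, 1, 128);    // argument 1: element count

        vedaLaunchKernel(func, 0, args); // enqueue on stream 0
        vedaCtxSynchronize();            // wait for completion
        vedaArgsDestroy(args);           // destroy manually; args was not handed over

        vedaMemFreeAsync(ptr, 0);
        vedaExit();
        return 0;
    }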
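On the device side, the raw pointer behind a VEDAdeviceptr is obtained with vedaMemPtr before it can be dereferenced. A sketch matching the hypothetical my_kernel above:

    #include <veda_device.h>

    // Device code: translate the opaque VEDAdeviceptr into a raw pointer first.
    extern "C" void my_kernel(VEDAdeviceptr vptr, size_t cnt) {
        float* ptr;
        vedaMemPtr((void**)&ptr, vptr);
        for(size_t i = 0; i < cnt; i++)
            ptr[i] = static_cast<float>(i);
    }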
1. Delayed Memory Allocation: In contrast to CUDA, the VE's memory allocator is operated by the device itself, which gives VEDA the opportunity to enable asynchronous memory allocations from the host. This can reduce unnecessary synchronizations between the host and the device.
If a VEDA pointer is allocated with a size of 0, only a virtual pointer is created, which can be allocated later using vedaMemAlloc(...) inside the device code. This makes it possible to model execution graphs without knowing the required sizes before calling the device code. Virtual pointers behave identically to normal pointers, i.e. they can be dereferenced using offsets such as A = B + offset or A = &B[offset].
VEDA therefore does not need to allocate memory from the host but can do so directly from the device; the host only needs to create an empty VEDAdeviceptr (see the first sketch after this list).
2. The power consumption (in W) and the temperature (in °C) can be fetched via the functions vedaDeviceGetPower(float* power, VEDAdevice dev) and vedaDeviceGetTemp(float* tempC, const int coreIdx, VEDAdevice dev) (see the second sketch after this list).
3. The kernel API vedaLaunchKernelEx(func, stream, args, 1, checkResult=0) can automatically destroy the VEDAargs object, i.e. the VEDAargs object can then not be reused for other calls.
4. vedaLaunchKernelEx also takes checkResult as its fifth parameter. It is optional (default value 0) and can be set to 1 when the return value of the device function is needed; the value can be obtained by calling vedaCtxSynchronize() or vedaStreamSynchronize() afterwards (see the third sketch after this list).
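A sketch of the delayed-allocation pattern from item 1: the host creates an empty (size 0) VEDAdeviceptr and the device allocates it later. func and args are assumed to be set up as in the earlier launch sketch, and the exact device-side vedaMemAlloc signature shown here is an assumption:

    // Host code -----------------------------------------------------------
    VEDAdeviceptr vptr;
    vedaMemAllocAsync(&vptr, 0, 0);  // size 0: only a virtual pointer is created
    vedaArgsSetVPtr(args, 0, vptr);
    vedaArgsSetU64(args, 1, 128);
    vedaLaunchKernel(func, 0, args); // no host/device synchronization needed
    vedaMemFreeAsync(vptr, 0);

    // Device code ----------------------------------------------------------
    #include <veda_device.h>

    extern "C" void my_alloc_kernel(VEDAdeviceptr vptr, size_t cnt) {
        vedaMemAlloc(vptr, cnt * sizeof(float)); // physical allocation on device
        float* ptr;
        vedaMemPtr((void**)&ptr, vptr);
        for(size_t i = 0; i < cnt; i++)
            ptr[i] = 0.0f;
    }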
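A short sketch of the sensor queries from item 2, using the signatures given above:

    #include <cstdio>
    #include <veda.h>

    int main() {
        vedaInit(0);
        VEDAdevice dev;
        vedaDeviceGet(&dev, 0);

        float power, tempC;
        vedaDeviceGetPower(&power, dev);   // power consumption in W
        vedaDeviceGetTemp(&tempC, 0, dev); // temperature of core 0 in °C
        printf("power: %fW, core 0: %fC\n", power, tempC);

        vedaExit();
        return 0;
    }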
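A sketch of items 3 and 4 combined, assuming func and args from the earlier launch sketch and assuming that the device function's return value surfaces as the VEDAresult of the following synchronization call:

    // destroyArgs = 1: VEDA destroys args after the launch; it must not be reused.
    // checkResult = 1: the device function's return value is checked on sync.
    vedaLaunchKernelEx(func, 0, args, 1, 1);
    VEDAresult res = vedaStreamSynchronize(0);
    if(res != VEDA_SUCCESS) {
        // handle the error code returned by the device function
    }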
The context can be created in two different modes: VEDA_CONTEXT_MODE_OMP (default) and VEDA_CONTEXT_MODE_SCALAR. The first mode creates one stream (execution queue) that controls all threads through OpenMP. The second mode creates one stream per core, allowing each core to be addressed directly from within the host.
In CUDA, streams can be used to create different execution queues, e.g. to overlap compute with memory copies. VEDA instead supports these two stream modes, which differ from the CUDA behavior; the mode is selected via the vedaCtxCreate(&ctx, MODE, device) API (see the sketch below).
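A minimal sketch of creating a context in SCALAR mode; using vedaCtxStreamCnt to query the resulting stream count is an assumption:

    #include <cstdio>
    #include <veda.h>

    int main() {
        vedaInit(0);

        // SCALAR mode: one stream (execution queue) per VE core, so individual
        // cores can be addressed directly; the default OMP mode creates a single
        // stream that drives all cores through OpenMP.
        VEDAcontext ctx;
        vedaCtxCreate(&ctx, VEDA_CONTEXT_MODE_SCALAR, 0);

        int streams;
        vedaCtxStreamCnt(&streams); // assumption: queries the number of streams
        printf("streams: %i\n", streams);

        vedaCtxDestroy(ctx);
        vedaExit();
        return 0;
    }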
VEDA uses VEDAdeviceptr as its pointer variable type; in C++, VEDAptr<typename> can be used instead, which gives more direct control over the VEDAdeviceptr, e.g. via vptr.size(), vptr.device(), ... . The typename is used to automatically determine the correct offsets when executing vptr += offset; (see the sketch below).
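A minimal C++ sketch of VEDAptr<typename>; constructing it directly from a VEDAdeviceptr and the exact return types of size() and device() are assumptions based on the description above:

    #include <cstdio>
    #include <veda.h>

    int main() {
        vedaInit(0);
        VEDAcontext ctx;
        vedaCtxCreate(&ctx, VEDA_CONTEXT_MODE_OMP, 0);

        VEDAdeviceptr raw;
        vedaMemAlloc(&raw, 128 * sizeof(float));

        VEDAptr<float> vptr(raw); // typed view on the same allocation
        printf("size: %zu on device %i\n", (size_t)vptr.size(), (int)vptr.device());
        vptr += 16;               // advances by 16 * sizeof(float) bytes

        vedaMemFree(raw);
        vedaExit();
        return 0;
    }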