CUDA Notes

We dive into some of cuRobo’s CUDA kernels and compute bottlenecks in our GTC 2024 talk.

Developing CUDA kernels

  1. Global variables can get mangled during compilation. Move global variables into template parameters or function input arguments.

  2. PyTorch creates a new CUDA context (https://github.com/pytorch/pytorch/issues/75025), so any other CUDA libraries must use this context instead of creating their own. So far, we have found NVIDIA Warp and NVIDIA Omniverse to work well with PyTorch as long as we create a tensor with PyTorch before calling these libraries, since both try to use an existing CUDA context when one is available instead of blindly creating a new one.
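The initialization order described above can be sketched as a small helper. This is a hedged illustration, not cuRobo code: it assumes `torch` and `warp` are installed on a CUDA-capable machine, and the `torch.zeros` call is just one arbitrary way to force context creation.

```python
def init_torch_then_warp():
    """Create PyTorch's CUDA context before initializing warp (order matters)."""
    import torch
    import warp as wp

    torch.zeros(1, device="cuda")  # first: PyTorch creates/owns the CUDA context
    wp.init()                      # warp then reuses the existing context
```

Calling any CUDA-touching PyTorch op before `wp.init()` is the key point; the specific op does not matter.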

  3. Compiling CUDA kernels for every compute capability can significantly increase this library’s install time. It’s recommended to target only your device by setting export TORCH_CUDA_ARCH_LIST=8.0+PTX, where 8.0 should be replaced by your GPU’s compute capability.
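The same setting can be applied from Python before triggering a build; `8.0` here is a placeholder for your GPU’s compute capability, as in the note above.

```python
import os

# Restrict kernel compilation to one architecture (+PTX for forward
# compatibility). Replace "8.0" with your GPU's compute capability.
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0+PTX"

# On a CUDA machine with torch installed, the capability can be queried with:
#   major, minor = torch.cuda.get_device_capability()
```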

Debugging CUDA Errors

  1. Recompile kernels with additional nvcc flags: ["--prec-sqrt=false", "-g", "-G", "--generate-line-info"]

  2. Set export CUDA_LAUNCH_BLOCKING=1 to make sure that the kernels are launched synchronously.

  3. Set the following environment variables:
    • export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 to enable core dumps on CUDA exceptions.

    • export CUDA_ENABLE_LIGHTWEIGHT_COREDUMP=1 to enable lightweight core dumps on CUDA exceptions.

    • export CUDA_COREDUMP_SHOW_PROGRESS=1 to show progress of core dump generation.

  4. On a crash, a CUDA core dump will be generated. It can be loaded into cuda-gdb by following https://docs.nvidia.com/cuda/cuda-gdb/index.html#gpu-core-dump-support

  5. As an alternative to steps (3) and (4), you can run compute-sanitizer --tool memcheck python script.py.
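Steps (2) and (3) can also be done from Python, as long as the variables are set before the first CUDA call in the process (i.e., before any kernel launch or CUDA-touching torch op). A minimal sketch:

```python
import os

# Debug environment variables from steps (2) and (3); these must be set
# before the CUDA context is created, so do this at the top of your script.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"               # synchronous launches
os.environ["CUDA_ENABLE_COREDUMP_ON_EXCEPTION"] = "1"  # core dump on exception
os.environ["CUDA_ENABLE_LIGHTWEIGHT_COREDUMP"] = "1"   # smaller core dumps
os.environ["CUDA_COREDUMP_SHOW_PROGRESS"] = "1"        # show dump progress
```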

Tensor IO Buffers

We implement some indexing buffers in our kernels to reduce memory bottlenecks and also to enable goals/worlds to be different per batch item in a batched call. Some rough notes on this indexing are given below.

Batched Goal Costs: Assuming that all pose costs will be written at a kernel level, we avoid repeating goal tensors and instead store a batch index tensor that maps each batch item to a location in memory.
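The index-instead-of-repeat idea can be illustrated with plain Python lists (the variable names and values here are hypothetical, not cuRobo’s):

```python
# Store unique goal positions once, plus a per-batch-item index into them,
# instead of materializing a repeated (n_batch x 3) goal tensor.
unique_goals = [[0.0, 0.0, 0.5],   # goal 0
                [0.2, 0.1, 0.4]]   # goal 1
batch_idx = [0, 0, 1, 0]           # one entry per batch item

# A kernel would read unique_goals[batch_idx[b]] for batch item b:
goals_per_item = [unique_goals[i] for i in batch_idx]
```

Memory for goals scales with the number of unique goals rather than the batch size, and a goal can be updated in one place for every batch item that references it.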

Batched Environment Collision Checking

To optimize across different environments, we keep track of the mapping between an index in the query tensor and the environment. Every distance query should now have two mapping tensors: batch_env_idx (n_batch) and enable_obs_env (n_env, n_obs).

batch_env_idx: This stores the environment index per query sphere.

enable_obs_env: This contains a 1 if the obstacle is enabled in a given environment.

If you want to have an obstacle enabled for some problems and disabled for others, duplicate the environment.
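The two mapping tensors can be sketched with plain Python lists (values are hypothetical; in the library these would be tensors):

```python
batch_env_idx = [0, 0, 1]        # environment index per query sphere (n_batch,)
enable_obs_env = [[1, 1, 0],     # env 0: obstacles 0 and 1 enabled
                  [1, 0, 1]]     # env 1: obstacles 0 and 2 enabled (n_env, n_obs)

# For each query sphere q, a distance kernel only checks obstacles that are
# enabled in that sphere's environment:
active = [[o for o, on in enumerate(enable_obs_env[batch_env_idx[q]]) if on]
          for q in range(len(batch_env_idx))]
```

Note how the duplication rule above shows up here: per-problem obstacle toggling is expressed by adding a row to enable_obs_env rather than by a third mapping tensor.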

To pass buffers to kernels, we use a custom dataclass CollisionBuffer, which provides functions to create new buffers given the query_sphere buffer. This dataclass allows chaining different collision checkers, as each collision checker type looks for a specific buffer (e.g., the primitive collision checker looks for out_buffer_prim, while the mesh collision checker looks for out_buffer_mesh).
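A hedged sketch of the idea, in the spirit of the CollisionBuffer dataclass described above: only the field names out_buffer_prim and out_buffer_mesh come from the text; the constructor, list-based storage, and everything else are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CollisionBuffer:
    out_buffer_prim: Optional[List[float]] = None  # read by primitive checker
    out_buffer_mesh: Optional[List[float]] = None  # read by mesh checker

    @classmethod
    def from_query_spheres(cls, n_spheres: int) -> "CollisionBuffer":
        # Allocate one output slot per query sphere for each checker type,
        # so checkers can be chained on the same buffer object.
        return cls(out_buffer_prim=[0.0] * n_spheres,
                   out_buffer_mesh=[0.0] * n_spheres)
```

Because each checker writes only to its own field, a primitive checker and a mesh checker can run back to back on one CollisionBuffer without clobbering each other’s results.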