Thrust, CUB, TBB, AVX2, AVX-512, CUDA, OpenCL, OpenMP, Metal, and Rust - all it takes to sum a lot of numbers fast!

avx512 cuda gpgpu gpu gpu-acceleration gpu-computing hpc metal neon nvidia opencl openmp parallel simd stl sve tbb thread-pool thrust
1 open issue (help needed). Last updated: Jul 22, 2025

Open Issues Need Help


AI Summary: Add Python benchmarks that use the new CCCL v3 for efficient parallel reductions. The work consists of a new benchmark file (`reduce_bench.py`) that measures the performance of JIT-compiled CUDA kernels for parallel reduction under varying hyperparameters and compares the results against the existing C++ and Rust benchmarks.

Complexity: 4/5
Labels: help wanted, good first issue
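
As a rough starting point for the `reduce_bench.py` file the issue asks for, the sketch below shows one possible timing harness. It uses only the Python standard library and the built-in `sum` as a stand-in reduction; it does not call CCCL, and the names `bench_reduce`, `reduce_fn`, `count`, and `repeats` are hypothetical, not taken from the repository. A CCCL-backed reduction would be passed in as `reduce_fn` once written.

```python
import time
from array import array
from typing import Callable, Sequence


def bench_reduce(
    reduce_fn: Callable[[Sequence[float]], float],
    count: int,
    repeats: int = 5,
) -> dict:
    """Time `reduce_fn` over `count` 32-bit floats and keep the fastest run.

    Hypothetical helper: `reduce_fn` is any callable that sums a sequence.
    """
    data = array("f", [1.0]) * count  # 4-byte floats, every element is 1.0
    best = float("inf")
    result = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        result = reduce_fn(data)
        best = min(best, time.perf_counter() - start)
    # Throughput in GB/s; guard against a zero reading from a coarse timer.
    throughput = count * data.itemsize / best / 1e9 if best > 0 else float("inf")
    return {
        "count": count,
        "result": result,
        "seconds": best,
        "gb_per_s": throughput,
    }


if __name__ == "__main__":
    # Sweep a few input sizes, the simplest "hyperparameter" to vary.
    for count in (1 << 10, 1 << 16, 1 << 20):
        stats = bench_reduce(sum, count)
        print(f"{count:>10} elements: {stats['gb_per_s']:.3f} GB/s")
```

Taking the reduction as a callable keeps the timing loop independent of the backend, so the same sweep could be reused for a GPU implementation. A GPU version would also need to synchronize the device before reading the timer, which this CPU-only sketch does not cover.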

C++