Open Issues Need Help
Add Python benchmarks for the new CUDA DSL/JIT
about 1 month ago
AI Summary: Add Python benchmarks for efficient parallel reductions using the new CCCL v3. The work involves creating a new benchmark file (`reduce_bench.py`) that measures the performance of JIT-compiled CUDA reduction kernels across varying hyperparameters, and comparing the results against the existing C++ and Rust benchmarks.
Complexity: 4/5
help wanted, good first issue
Thrust, CUB, TBB, AVX2, AVX-512, CUDA, OpenCL, OpenMP, Metal, and Rust - all it takes to sum a lot of numbers fast!
C++
#avx512 #cuda #gpgpu #gpu #gpu-acceleration #gpu-computing #hpc #metal #neon #nvidia #opencl #openmp #parallel #simd #stl #sve #tbb #thread-pool #thrust
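
As a rough illustration of what a file like `reduce_bench.py` might start from, here is a minimal timing-harness sketch. It is not the project's actual benchmark and does not call CCCL: the function name `bench_reduce`, its parameters, and the all-ones input are assumptions made for this example, and Python's built-in `sum` is only a stand-in for the slot where a JIT-compiled CUDA reduction would be plugged in.

```python
import time
from statistics import median
from typing import Callable, Dict, Iterable, List, Sequence


def bench_reduce(
    reduce_fn: Callable[[Sequence[float]], float],
    sizes: Iterable[int],
    repeats: int = 5,
) -> List[Dict[str, float]]:
    """Time `reduce_fn` on inputs of each size and report median throughput.

    `bench_reduce` is a hypothetical helper for this sketch, not an API of
    the repository or of CCCL.
    """
    results = []
    for n in sizes:
        data = [1.0] * n  # all-ones input, so the expected sum is exactly n
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            total = reduce_fn(data)
            timings.append(time.perf_counter() - start)
        # Check correctness before trusting the timing.
        assert total == float(n), f"wrong sum for n={n}: {total}"
        seconds = median(timings)
        results.append({
            "n": n,
            "seconds": seconds,
            "elements_per_second": n / seconds if seconds > 0 else float("inf"),
        })
    return results


if __name__ == "__main__":
    # Assumption: the built-in `sum` stands in for a GPU reduction here.
    for row in bench_reduce(sum, sizes=[1_000, 100_000, 1_000_000]):
        print(f"n={row['n']:>9,}  {row['elements_per_second']:,.0f} elements/s")
```

Taking the reduction as a callable keeps the timing loop independent of the backend, so the same harness could in principle wrap a CPU baseline and a GPU kernel. A real GPU benchmark would additionally need to synchronize the device before stopping the timer and to account for host-to-device transfers, which this sketch does not attempt.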