ashvardanian/ParallelReductionsBenchmark

Thrust, CUB, TBB, AVX2, AVX-512, CUDA, OpenCL, OpenMP, Metal, and Rust - all it takes to sum a lot of numbers fast!

avx512 cuda gpgpu gpu gpu-acceleration gpu-computing hpc metal neon nvidia opencl openmp parallel simd stl sve tbb thread-pool thrust

1 open issue needs help. Last updated: Jul 22, 2025

Open Issues Need Help

Add Python benchmarks for the new CUDA DSL/JIT (11 months ago)

AI Summary: The task is to add Python benchmarks for parallel reductions built on the new CCCL v3. This means creating a new benchmark file (`reduce_bench.py`) that measures JIT-compiled CUDA reduction kernels across a range of hyperparameters, and comparing the results against the existing C++ and Rust benchmarks.

Complexity: 4/5

Labels: help wanted, good first issue
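As a rough illustration of what the requested `reduce_bench.py` might look like, here is a minimal CPU-only sketch of a benchmark harness that sweeps input sizes and compares reduction backends. Everything in it is an assumption rather than the repository's actual code: the function names (`time_reduction`, `run_benchmarks`, `pairwise_sum`) and the `BACKENDS` table are hypothetical, and no CCCL call is shown. A JIT-compiled CUDA reduction would be registered as one more entry in `BACKENDS`.

```python
import time
from typing import Callable, Dict, List, Sequence


def pairwise_sum(values: Sequence[float]) -> float:
    """Tree-style reduction: repeatedly fold the upper half onto the lower half."""
    partial = list(values)
    if not partial:
        return 0.0
    while len(partial) > 1:
        half = (len(partial) + 1) // 2
        # One reduction step: element i absorbs element half + i.
        for i in range(len(partial) - half):
            partial[i] += partial[half + i]
        del partial[half:]
    return partial[0]


def time_reduction(reduce_fn: Callable, data: Sequence[float], repeats: int = 5):
    """Run reduce_fn several times; return its result and the best wall time."""
    best = float("inf")
    result = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        result = reduce_fn(data)
        best = min(best, time.perf_counter() - start)
    return result, best


def run_benchmarks(backends: Dict[str, Callable], sizes: Sequence[int],
                   repeats: int = 5) -> List[dict]:
    """Time every backend on every input size (the swept hyperparameter here)."""
    rows = []
    for size in sizes:
        data = [1.0] * size  # all ones, so the expected sum equals the size
        for name, fn in backends.items():
            result, seconds = time_reduction(fn, data, repeats)
            rows.append({"backend": name, "size": size,
                         "sum": result, "seconds": seconds})
    return rows


# Hypothetical backend table; a GPU-backed reduction would be added here.
BACKENDS = {"builtin_sum": sum, "pairwise_sum": pairwise_sum}

if __name__ == "__main__":
    for row in run_benchmarks(BACKENDS, sizes=[1_000, 100_000]):
        rate = row["size"] / max(row["seconds"], 1e-12)
        print(f'{row["backend"]:>14} n={row["size"]:>7} {rate:,.0f} elements/s')
```

Keeping the timing loop separate from the backends means the same harness could report numbers comparable to the C++ and Rust benchmarks once a GPU backend is plugged in. A real GPU measurement would also need device synchronization before the timer stops, which this sketch does not cover.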

102

C++