|
| 1 | +# SYCL Hashing Algorithms |
| 2 | + |
| 3 | +This repository contains hashing algorithms implemented in [SYCL](https://www.khronos.org/sycl/), a heterogeneous programming model based on standard C++. |
| 4 | + |
| 5 | +The following hashing methods are currently available: |
| 6 | + |
| 7 | +- sha256 |
| 8 | +- sha1 (insecure) |
| 9 | +- md2 (insecure) |
| 10 | +- md5 (insecure) |
| 11 | +- keccak (128 224 256 288 384 512) |
| 12 | +- sha3 (224 256 384 512) |
| 13 | +- blake2b |
| 14 | + |
| 15 | +## Benchmarks |
| 16 | + |
| 17 | +Some functions were ported from a CUDA implementation. The same SYCL code was run, unchanged, on each of the implementations and hardware targets listed below. Throughput is given in GB/s: |
| 18 | + |
| 19 | +| Function | Native CUDA | SYCL on DPC++ CUDA (optimised) | SYCL on ComputeCPP CPU (spir64/spirv64) | SYCL on DPC++ CPU (spir64_x86_64) | SYCL on hipSYCL (omp/cuda) | |
| 20 | +| -------- | ----------- | ------------------------------------------- | --------------------------------------- | --------------------------------- | -------------------------- | |
| 21 | +| keccak | 15.7 | 23.0 | 4.14 / 3.89 | 4.98 | 4.32 / 23.2 | |
| 22 | +| md5 | 14.6 | 20.3 | 6.26 / 5.89 | 9.93 | 9.27 / 20.2 | |
| 23 | +| blake2b | 14.7 | 21.6 | 9.46 / 10.0 | 12.4 | 7.71 / 17.9 | |
| 24 | +| sha1 | 14.7 | 19.34 | 3.61 / 3.35 | 3.30 | 4.39 / 19.2 | |
| 25 | +| sha256 | 13.5 | 19.15 | 2.23 / 2.00 | 2.91 | 2.93 / 19.1 | |
| 26 | +| md2 | 4.18 | 4.23 | 0.22 / 0.25 | 0.176 | 0.25 / 2.33 | |
| 27 | + |
| 28 | +### Note |
| 29 | + |
| 30 | +Something broke the spir64 backend of DPC++, and it now produces very slow code. |
| 31 | + |
| 32 | +Benchmark configuration: |
| 33 | + |
| 34 | +- block_size: 512 KiB |
| 35 | +- n_blocks: 4\*1536 |
| 36 | +- n_outbit: 128 |
| 37 | +- GPU: GTX 1660 Ti |
| 38 | +- OS: rhel8.4 |
| 39 | +- CPU: 2x E5-2670 v2 |
| 40 | + |
| 41 | +### Remark |
| 42 | + |
| 43 | +These are not the "best" settings, as the optimum changes with the algorithm. The benchmarks measure the time to run 40 iterations, excluding memory copies between the device and the host. In a real application, you |
| 44 | +could be memory bound. |
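
As a rough sketch of how a GB/s figure can be derived from such a run: this assumes throughput is the total number of bytes hashed over all iterations divided by the elapsed time, that `block_size` is the size of each hashed item, and that 1 GB means 10^9 bytes. The helper names `bytes_per_iteration` and `throughput_gb_per_s` are hypothetical and are not taken from the benchmark code.

```cpp
#include <cstdint>

// Hypothetical helper: bytes hashed in one iteration (block size times number of blocks).
constexpr std::uint64_t bytes_per_iteration(std::uint64_t block_size_bytes,
                                            std::uint64_t n_blocks) {
    return block_size_bytes * n_blocks;
}

// Hypothetical helper: converts a whole run into GB/s.
// Assumption: 1 GB = 1e9 bytes, and transfers are not part of elapsed_seconds.
constexpr double throughput_gb_per_s(std::uint64_t block_size_bytes,
                                     std::uint64_t n_blocks,
                                     std::uint64_t iterations,
                                     double elapsed_seconds) {
    const double total_bytes =
        static_cast<double>(bytes_per_iteration(block_size_bytes, n_blocks)) *
        static_cast<double>(iterations);
    return total_bytes / elapsed_seconds / 1e9;
}
```

With the configuration listed above (512 KiB blocks, 4\*1536 blocks), `bytes_per_iteration(512 * 1024, 4 * 1536)` is 3 GiB, so 40 iterations cover roughly 128.8 GB of input under these assumptions.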
| 45 | + |
| 46 | +## How to build |
| 47 | + |
| 48 | +```bash |
| 49 | +git clone https://github.com/Michoumichmich/SYCL-Hashing-Algorithms.git ; cd SYCL-Hashing-Algorithms; |
| 50 | +mkdir build; cd build |
| 51 | +CXX=<sycl_compiler> cmake .. -DCMAKE_BUILD_TYPE=Release |
| 52 | +make |
| 53 | +``` |
| 54 | + |
| 55 | +This will build the library and a demo executable. Running the demo performs a benchmark on your CPU and CUDA device (if available). |
| 56 | + |
| 57 | +You do not necessarily need to pass `<sycl_compiler>` to cmake; whether it is required depends on the implementation you're using and its toolchain. |
| 58 | + |
| 59 | +## How to use |
| 60 | + |
| 61 | +Let's assume you used this [script](https://github.com/Michoumichmich/oneAPI-setup-script) to set up the toolchain with CUDA support. |
| 62 | + |
| 63 | +Here's a minimal example: |
| 64 | + |
| 65 | +```C++ |
| 66 | +#include <sycl/sycl.hpp> // SYCL headers |
| 67 | +#include "sycl_hash.hpp" // Hashing library header |
| 68 | +#include "tools/sycl_queue_helpers.hpp" // Helpers to create a SYCL queue |
| 69 | +using namespace hash; |
| 70 | + |
| 71 | +int main(){ |
| 72 | +    auto cuda_q = try_get_queue(cuda_selector{}); // Create a queue on a CUDA device and attach an exception handler |
| 73 | + |
| 74 | +    constexpr int hash_size = get_block_size<method::sha256>(); |
| 75 | +    constexpr int n_blocks = 20; // Number of hashes to compute in parallel |
| 76 | +    constexpr int item_size = 1024; // Size of each data item |
| 77 | + |
| 78 | +    byte input[n_blocks * item_size] = {}; // 20 same-sized data items to hash (zero-filled here as a placeholder) |
| 79 | +    byte output[n_blocks * hash_size]; // Reserve space for the output |
| 80 | + |
| 81 | +    compute<method::sha256>(cuda_q, input, item_size, output, n_blocks); // Do the computing |
| 82 | +    compute_sha256(cuda_q, input, item_size, output, n_blocks); // Identical to the call above |
| 83 | + |
| 84 | +    /** |
| 85 | +     * For SHA3 one could write: |
| 86 | +     * compute_sha3<512>(cuda_q, input, item_size, output, n_blocks); |
| 87 | +     */ |
| 88 | + |
| 89 | +    return 0; |
| 90 | +} |
| 91 | +``` |
| 92 | + |
| 93 | +With clang, build with: |
| 94 | + |
| 95 | +``` |
| 96 | +-fsycl -fsycl-targets=spir64_x86_64,nvptx64-nvidia-cuda--sm_50 -I<include_dir> <build_dir>/libsycl_hash.a |
| 97 | +``` |
| 98 | + |
| 99 | +Your hashes will then run on the GPU. |
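
To inspect the results, the digests can be read back from `output` once the computation has finished. The following is a minimal sketch, assuming the digests are stored back to back, `hash_size` bytes per item, in the same order as the inputs, and that `byte` is an 8-bit unsigned type compatible with `std::uint8_t`. Both points are assumptions about the library, and `digest_to_hex` is a hypothetical helper that is not part of it.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of the library): returns the lowercase hex
// string of the `index`-th digest, assuming digests are packed contiguously.
inline std::string digest_to_hex(const std::uint8_t* output,
                                 std::size_t hash_size,
                                 std::size_t index) {
    static const char digits[] = "0123456789abcdef";
    const std::uint8_t* digest = output + index * hash_size;
    std::string hex;
    hex.reserve(hash_size * 2);
    for (std::size_t i = 0; i < hash_size; ++i) {
        hex.push_back(digits[digest[i] >> 4]);   // high nibble
        hex.push_back(digits[digest[i] & 0x0F]); // low nibble
    }
    return hex;
}
```

In the example above, `digest_to_hex(output, hash_size, 0)` would give the hex form of the first item's hash under these assumptions.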
| 100 | + |
| 101 | +## Sources |
| 102 | + |
| 103 | +The fork of the original CUDA implementations, with the benchmarks added, is available [here](https://github.com/Michoumichmich/cuda-hashing-algos-with-benchmark). |
| 104 | + |
| 105 | +## Tested implementations |
| 106 | + |
| 107 | +- [Intel's clang](https://github.com/intel/llvm) with OpenCL on CPU (using Intel's driver) and [Codeplay's CUDA backend](https://www.codeplay.com/solutions/oneapi/for-cuda/) |
| 108 | +- [hipSYCL](https://github.com/illuhad/hipSYCL) on macOS with the OpenMP backend (set `hipSYCL_DIR` then `cmake .. -DHIPSYCL_TARGETS="..."`) |
| 109 | +- [ComputeCPP](https://developer.codeplay.com/products/computecpp/ce/home): build with `cmake .. -DComputeCpp_DIR=/path_to_computecpp -DCOMPUTECPP_BITCODE=spir64 -DCMAKE_BUILD_TYPE=Release`. Tested on the host |
| 110 | + device, `spir64` and `spirv64`. See the [ComputeCpp SDK](https://github.com/codeplaysoftware/computecpp-sdk). |
| 111 | + |
| 112 | +## Acknowledgements |
| 113 | + |
| 114 | +This repository contains code written by Matt Zweil & The Mochimo Core Contributor Team. Please see the [files](https://github.com/mochimodev/cuda-hashing-algos) for their respective licences. |