Commit 6664a79

Clarified how to add new instruction set capabilities
1 parent 4c15cf4 commit 6664a79

1 file changed: +9 -7 lines changed

README.rst

Lines changed: 9 additions & 7 deletions
@@ -146,7 +146,7 @@ Mocks
 enabled. Same numerical issues as ``LINK_IN_DEC``
 
 3. ``FAST_DIVIDE`` -- Divisions are slow but required
-:math:`DD(r_p,\pi)`. This Makefile option (in mocks.options) replaces
+:math:`DD(r_p,\pi)`. This ``Makefile`` option (in ``mocks.options``) replaces
 the divisions to a reciprocal followed by a Newton-Raphson. The code
 will run ~20% faster at the expense of some numerical precision.
 Please check that the loss of precision is not important for your
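
Review note on ``FAST_DIVIDE``: the hunk above describes replacing a division with a reciprocal followed by a Newton-Raphson step. The scalar C sketch below illustrates that idea only; it is not the repository's code. The names `refine_reciprocal`, `fast_divide` and `relative_error` are hypothetical, and the single-precision starting guess is an assumption standing in for whatever low-accuracy estimate the real kernels use.

```c
#include <assert.h>

/* One Newton-Raphson refinement of an estimate x0 of 1/d:
 *   x1 = x0 * (2 - d * x0)
 * The relative error of x1 is roughly the square of the error of x0. */
double refine_reciprocal(double d, double x0)
{
    return x0 * (2.0 - d * x0);
}

/* Hypothetical helper: approximate a / d without a double-precision divide.
 * Assumes d is non-zero and representable as a normal float. */
double fast_divide(double a, double d)
{
    assert(d != 0.0);
    /* Cheap, low-accuracy starting guess (an assumption for illustration;
     * SIMD code would more likely use a hardware reciprocal estimate). */
    double x0 = (double)(1.0f / (float)d);
    return a * refine_reciprocal(d, x0);
}

/* Relative difference between an approximation and a reference value,
 * useful for checking whether the lost precision matters. */
double relative_error(double approx, double exact)
{
    double diff = approx - exact;
    double mag = exact < 0.0 ? -exact : exact;
    return (diff < 0.0 ? -diff : diff) / mag;
}
```

One way to act on the "check that the loss of precision is not important" advice is to compare `relative_error(fast_divide(a, d), a / d)` against the tolerance your science case needs, over a representative range of inputs.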
@@ -243,25 +243,27 @@ Common Code options for both Mocks and Cosmological Boxes
 2. ``USE_AVX`` -- uses the AVX instruction set found in Intel/AMD CPUs
 >= 2011 (Intel: Sandy Bridge or later; AMD: Bulldozer or later).
 Enabled by default - code will run much slower if the CPU does not
-support AVX instructions. On Linux, check for "avx" in /proc/cpuinfo
-under flags. If you do not have AVX, but have a SSE4 system instead,
-then check the ``develop`` branch for the SSE4 code.
+support AVX instructions. The ``Makefile`` will automatically check
+for "AVX" support and disable this option for unsupported CPUs.
 
 3. ``USE_OMP`` -- uses OpenMP parallelization. Scaling is great for DD
 (perfect scaling up to 12 threads in my tests) and okay (runtime
 becomes constant ~6-8 threads in my tests) for ``DDrppi`` and ``wp``.
+Enabled by default. The ``Makefile`` will compare the `CC` variable with
+known OpenMP enabled compilers and set compile options accordingly.
 
 *Optimization for your architecture*
 
 1. The values of ``bin_refine_factor`` and/or ``zbin_refine_factor`` in
-the countpairs\_\*.c files control the cache-misses, and
+the ``countpairs\_\*.c`` files control the cache-misses, and
 consequently, the runtime. In my trial-and-error methods, I have seen
 any values larger than 3 are always slower. But some different
 combination of 1/2 for ``(z)bin_refine_factor`` might be faster on
 your platform.
 
-2. If you have AVX2/AVX-512/KNC, you will need to rewrite the entire AVX
-section.
+2. If you have AVX2/AVX-512/KNC, you will need to add a new kernel within
+the ``*_kernels.c`` and edit the runtime dispatch code to call this new
+kernel.
 
 Author
 ======
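
Review note on the new instruction-set guidance: the updated text says to add a kernel in ``*_kernels.c`` and edit the runtime dispatch code to call it. The C sketch below shows one common shape for such a dispatch, assuming a GCC- or Clang-compatible compiler on x86. Every name in it (`pair_kernel`, `kernel_fallback`, `kernel_avx2`, `select_kernel`) is hypothetical and does not reflect the repository's actual functions or signatures. The "AVX2" kernel is a placeholder that reuses the scalar loop so the example compiles without special flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature shared by every kernel: count pairs (i, j), i < j,
 * of 1-D points whose squared separation is below rmax_sq. */
typedef size_t (*pair_kernel)(const double *x, size_t n, double rmax_sq);

/* Portable scalar kernel, always available. */
size_t kernel_fallback(const double *x, size_t n, double rmax_sq)
{
    assert(x != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            const double dx = x[i] - x[j];
            if (dx * dx < rmax_sq) {
                count++;
            }
        }
    }
    return count;
}

/* Placeholder for a newly added kernel. A real version would use AVX2
 * intrinsics; here it simply forwards to the scalar loop. */
size_t kernel_avx2(const double *x, size_t n, double rmax_sq)
{
    return kernel_fallback(x, n, rmax_sq);
}

/* Runtime dispatch: pick the best kernel the running CPU supports. */
pair_kernel select_kernel(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) {
        return kernel_avx2;
    }
#endif
    return kernel_fallback;
}
```

With this layout, supporting another instruction set means writing one more function with the same signature and adding one more branch to `select_kernel`, ordered from most to least capable.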
