
Conversation

@bosilca (Member) commented Feb 18, 2020

Add support for AVX instructions for all MPI_Op.

@hppritcha (Member) commented Feb 18, 2020

The Cray compile warnings/errors look legit. Not sure why a pile of warnings and notes led to GCC returning failure, however.

@bosilca (Member Author) commented Feb 18, 2020

Because my .m4 screwed up the flags. Fix on the way.

@bosilca bosilca force-pushed the topic/avx512 branch 2 times, most recently from 93f23d9 to 84e8a49 Compare February 21, 2020 00:23
@bosilca bosilca changed the title from "Add optimizations for AVX512" to "Add support for AVX" Feb 21, 2020
@bosilca bosilca force-pushed the topic/avx512 branch 2 times, most recently from 0aa9492 to 1fbeead Compare February 21, 2020 07:47
@bosilca bosilca force-pushed the topic/avx512 branch 4 times, most recently from 9595ad0 to 075549f Compare March 3, 2020 17:02
@bosilca bosilca force-pushed the topic/avx512 branch 2 times, most recently from 3438d2d to 136b6da Compare March 23, 2020 20:21
@ggouaillardet (Contributor) commented

@bosilca I just pushed two commits to fix some typos.

FWIW, I am also writing the op/neon component for ARM NEON (128-bit vectors).

For SVE, would you rather have it in its own module (e.g. op/sve), or a single module for ARM (e.g. both NEON and SVE)?

@bosilca (Member Author) commented Mar 25, 2020

Thanks @ggouaillardet. Regarding the test, I have a much more extensive version in the pipeline; I should be able to push it in the next few days.

For the ARM extensions, having all ARM-v* versions in a single place can be beneficial, as we will be able to reuse parts of the code. It might, however, make the compilation stage more complex, as we will need to provide specialized flags for some of the .c files (to enable specific architecture options). What's your take on this?
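
For reference, one way to handle per-file specialized flags with Automake is one convenience library per flavor, each compiling the same .c file with its own CFLAGS. The names below mirror the op/avx build output that appears later in this thread, but treat this as a sketch rather than the component's actual Makefile.am:

# Sketch: build op_avx_functions.c several times, once per flavor,
# each with its own flags; the component DSO links them via _LIBADD.
noinst_LTLIBRARIES = liblocal_ops_avx.la liblocal_ops_avx2.la
liblocal_ops_avx_la_SOURCES  = op_avx_functions.c
liblocal_ops_avx_la_CFLAGS   = $(MCA_BUILD_OP_AVX_FLAGS) -DGENERATE_AVX_CODE
liblocal_ops_avx2_la_SOURCES = op_avx_functions.c
liblocal_ops_avx2_la_CFLAGS  = $(MCA_BUILD_OP_AVX2_FLAGS) -DGENERATE_AVX2_CODE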

@ggouaillardet (Contributor) commented

@bosilca I pushed a new fix to your PR.

I added support for NEON and SVE in my own branch (based on top of this one) at https://github.com/ggouaillardet/ompi/tree/topic/op_arm (I still need to finalize the SVE flags and implement SVE runtime detection).

For now, I made two distinct components, and though there is some overlap, there is not that much, IMHO.

If I understand the op/avx component correctly, and let's say we run on the latest Xeon and have to MPI_SUM 64n+63 int8_t elements, that would be

  • 3*n AVX512 instructions (n iterations), this is the main part
  • 3 AVX256 instructions (1 iteration)
  • 3 SSE (AVX128?) instructions (1 iteration)
  • 15*3 scalar instructions (2 iterations, 8+7 unrolling)

Before each step the flags are tested, and draining the main loop (e.g. the last 63 int8_t) will invoke two loops with 0 or 1 iterations, which is quite some overhead that could be avoided. Also, unless this has changed, I remember Intel discouraging mixing AVX and SSE instructions in the same program.
My gut feeling is this could be faster if:

  • we use one subroutine per AVX type
  • flags are only tested once, and function pointers are set accordingly
  • draining the loop is done in a single scalar loop

We could also do loop peeling to be faster when data is not optimally aligned.

In theory we could have several components (e.g. one per AVX type), but in order to avoid code duplication it can be left in a single module (and since subroutine selection is performed at selection time, there is no runtime overhead).
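
For illustration, here is a minimal sketch of the structure being proposed: cascade from the widest vectors down, then drain in a single scalar loop. This is not the component's actual code; the function name and the has_avx512/has_avx2 flags are hypothetical, though the GENERATE_*_CODE guards match the build log later in this thread. Whether the vector add should wrap or saturate is a separate question, also discussed later.

#include <immintrin.h>
#include <stdint.h>

static void sum_int8(const int8_t *in, int8_t *inout, int count,
                     int has_avx512, int has_avx2)
{
    int i = 0;
#if defined(GENERATE_AVX512_CODE)
    if (has_avx512) {
        for (; i + 64 <= count; i += 64) {  /* main loop: 64 int8_t per iteration */
            __m512i a = _mm512_loadu_si512((const void *)(in + i));
            __m512i b = _mm512_loadu_si512((const void *)(inout + i));
            _mm512_storeu_si512((void *)(inout + i), _mm512_add_epi8(a, b));
        }
    }
#endif
#if defined(GENERATE_AVX2_CODE)
    if (has_avx2) {
        for (; i + 32 <= count; i += 32) {  /* at most one iteration after AVX512 */
            __m256i a = _mm256_loadu_si256((const __m256i *)(in + i));
            __m256i b = _mm256_loadu_si256((const __m256i *)(inout + i));
            _mm256_storeu_si256((__m256i *)(inout + i), _mm256_add_epi8(a, b));
        }
    }
#endif
    for (; i < count; i++) {                /* single scalar drain loop */
        inout[i] += in[i];
    }
}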

@jjhursey (Member) commented

bot:ibm:xl:retest

@open-mpi open-mpi deleted a comment from ibm-ompi Mar 26, 2020
@bosilca (Member Author) commented Mar 26, 2020

@ggouaillardet I have not heard about anything but performance issues when mixing AVX and SSE, and these are not a concern for us because, in the worst case, we do such a transition once. We could follow Intel's suggestion and pass the code through VTune to make sure.

The code can certainly be improved, but I don't know by how much. I assumed that most MPI_Op invocations execute on data large enough to completely hide the few cycles lost in branch mismanagement, so reordering and loop peeling might not be such a deal breaker. But we might want to investigate how to handle the case where the user data alignment matches the AVX requirements, as this would allow us to use AVX instructions for aligned memory accesses (load instead of loadu), though I don't know how much we can gain. Some hints can be found in McCalpin's answer here.
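
A minimal sketch of the aligned vs. unaligned load choice being discussed, assuming the 32-byte alignment AVX requires (on recent cores loadu on an already-aligned address reportedly costs the same as load, which is exactly why the gain is uncertain):

#include <immintrin.h>
#include <stdint.h>

/* Sketch: use the aligned load only when the pointer is 32-byte aligned. */
static inline __m256i load_si256_maybe_aligned(const void *p)
{
    if (((uintptr_t)p & 31) == 0) {
        return _mm256_load_si256((const __m256i *)p);   /* aligned: load */
    }
    return _mm256_loadu_si256((const __m256i *)p);      /* unaligned: loadu */
}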

For the SVE code, @dong0321 and I have been working on a branch in my repo, which you can access via a PR. It is mostly based on the AVX code, and has no support for NEON. We should merge our efforts on this, and reach out to Fujitsu (@Shinji-Sumimoto) and ARM (@shamisp) for final validation.

@bosilca (Member Author) commented Mar 26, 2020

I don't understand the AZR issue here. It seems that the instruction is not supported, yet configure did find it while looking for it. So I wonder if somehow the additional flags detected during configure are not passed to the compiler for the op_avx_functions.c file. To check this I would need to see the compile arguments used. Can someone from Mellanox send me this info, please?

@jsquyres (Member) commented Jul 4, 2020

bot:aws:retest

Same real-but-unrelated-to-this-PR timeout we've been seeing for a while (see #7847).

@jsquyres (Member) commented Jul 4, 2020

bot:aws:retest

@jsquyres (Member) left a comment

I'm running into problems when testing this PR locally. More information coming shortly.

@jsquyres (Member) commented Jul 4, 2020

Intel icc 19.04

When I compile with icc 19.04, I get the following (which is probably unsurprising):

# After fixing the `configure.m4` errors noted in the PR review
3274 --- MCA component op:avx (m4 configuration macro)
3275 checking for MCA component op:avx compile mode... dso
3276 checking for AVX512 support (no additional flags)... yes
3277 checking for AVX2 support (no additional flags)... yes
3278 checking for AVX support (no additional flags)... yes
3279 checking for SSE4.1 support... yes
3280 checking for SSE3 support... yes
3281 checking if MCA component op:avx can compile... yes

But then I get this when compiling nearly every file in Open MPI:

  CC       opal_datatype_resize.lo
In file included from ../../../opal/include/opal/sys/atomic.h(64),
                 from ../../../opal/mca/threads/thread_usage.h(32),
                 from ../../../opal/class/opal_object.h(126),
                 from ../../../opal/datatype/opal_datatype.h(44),
                 from ../../../opal/datatype/opal_convertor.h(35),
                 from ../../../opal/datatype/opal_convertor_internal.h(21),
                 from ../../../opal/datatype/opal_datatype_pack.c(28):
../../../opal/include/opal/sys/atomic_stdc.h(119): warning #2330: argument of type "opal_atomic_int32_t={_Atomic(int32_t={int})} *" is incompatible with parameter of type "volatile void *" (dropping qualifiers)
  OPAL_ATOMIC_STDC_DEFINE_FETCH_OP(add, 32, int32_t, +)
  ^

These are just warnings, but they turn the output from make into 60+ MB of this kind of text.

GNU GCC 7.3.0, 8.2.0, and 9.3.0

With all 3 of these versions of GCC, I get the following:

# After fixing configure.m4 as noted in the review:
3287 --- MCA component op:avx (m4 configuration macro)
3288 checking for MCA component op:avx compile mode... dso
3289 checking for AVX512 support (no additional flags)... no
3290 checking for AVX512 support (with -march=skylake-avx512)... yes
3291 checking for AVX2 support (no additional flags)... no
3292 checking for AVX2 support (with -mavx2)... yes
3293 checking for AVX support (no additional flags)... yes
3294 checking for SSE4.1 support... no
3295 checking for SSE3 support... no
3296 checking for AVX support (with -mavx)... yes
3297 checking for SSE4.1 support... yes
3298 checking for SSE3 support... yes
3299 checking if MCA component op:avx can compile... yes

And then:

$ cd ompi/mca/op/avx
$ make
  CC       op_avx_component.lo
  CC       liblocal_ops_avx_la-op_avx_functions.lo
  CCLD     liblocal_ops_avx.la
  CC       liblocal_ops_avx2_la-op_avx_functions.lo
/tmp/ccN7HHy9.s: Assembler messages:
/tmp/ccN7HHy9.s:1583: Error: no such instruction: `vinserti128 $0x1,16(%rdi,%rax),%ymm3,%ymm1'
/tmp/ccN7HHy9.s:1584: Error: no such instruction: `vinserti128 $0x1,16(%rsi,%rax),%ymm2,%ymm0'
/tmp/ccN7HHy9.s:1585: Error: suffix or operands invalid for `vpaddd'
/tmp/ccN7HHy9.s:1587: Error: no such instruction: `vextracti128 $0x1,%ymm0,16(%rsi,%rax)'
/tmp/ccN7HHy9.s:1730: Error: no such instruction: `vinserti128 $0x1,16(%rdi,%rax),%ymm3,%ymm1'
/tmp/ccN7HHy9.s:1731: Error: no such instruction: `vinserti128 $0x1,16(%rsi,%rax),%ymm2,%ymm0'
/tmp/ccN7HHy9.s:1732: Error: suffix or operands invalid for `vpaddd'
/tmp/ccN7HHy9.s:1734: Error: no such instruction: `vextracti128 $0x1,%ymm0,16(%rsi,%rax)'
/tmp/ccN7HHy9.s:1877: Error: no such instruction: `vinserti128 $0x1,16(%rdi,%rax),%ymm3,%ymm0'
/tmp/ccN7HHy9.s:1878: Error: no such instruction: `vinserti128 $0x1,16(%rsi,%rax),%ymm2,%ymm1'
...etc.

This is on an older Sandy Bridge architecture machine, so it could be that this machine predates AVX2...?

My full configure output from gcc 9.3.0 and the corresponding config.log are attached.

configure-output.txt
config.log

@bosilca (Member Author) commented Jul 4, 2020

The issue reported by the Intel compiler has nothing to do with this patch. It's the compiler being vocal about an _Atomic being passed as a volatile.
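
For context, a minimal hypothetical reproducer of that class of warning; the function and variable names below are illustrative, not OPAL's actual code:

#include <stdint.h>

/* Passing a pointer to an _Atomic object where a 'volatile void *'
 * parameter is expected drops the _Atomic qualifier; icc reports this
 * as warning #2330, and other compilers may warn similarly. */
static int32_t fetch32(volatile void *addr)
{
    return *(volatile int32_t *)addr;
}

int main(void)
{
    _Atomic int32_t v = 42;
    return (int)fetch32(&v);   /* triggers the "dropping qualifiers" warning */
}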

@bosilca (Member Author) commented Jul 4, 2020

For the second case, can you please check the flags used to compile the AVX2 library? All these instructions are part of AVX2 support; it looks as if the -mavx2 flag was not passed to the compiler.

@jsquyres (Member) commented Jul 5, 2020

Looks like -mavx2 is there (I left it wrapped in the display below for readability) -- this is the output from make V=1:

Here's the full output: avx-make-V=1.txt

13 libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I../../../../../ompi/mca/op/avx -I../../../\
   ../opal/include -I../../../../ompi/include -I../../../../oshmem/include -I../../../../o\
   pal/mca/hwloc/hwloc2/hwloc/include/private/autogen -I../../../../opal/mca/hwloc/hwloc2/\
   hwloc/include/hwloc/autogen -I../../../../ompi/mpiext/cuda/c -iquote../../../../.. -iqu\
   ote../../../.. -iquote../../../../../opal/include -iquote../../../../../ompi/include -i\
   quote../../../../../oshmem/include -I/home/jsquyres/git/ompi/build-gcc930/opal/mca/pmix\
   /pmix4x/openpmix/include -I/home/jsquyres/git/ompi/opal/mca/pmix/pmix4x/openpmix/includ\
   e -I/home/jsquyres/git/ompi/build-gcc930/opal/mca/event/libevent2022/libevent/include -\
   I/home/jsquyres/git/ompi/opal/mca/event/libevent2022/libevent -I/home/jsquyres/git/ompi\
   /opal/mca/event/libevent2022/libevent/include -I/home/jsquyres/git/ompi/build-gcc930/op\
   al/mca/hwloc/hwloc2/hwloc/include -I/home/jsquyres/git/ompi/opal/mca/hwloc/hwloc2/hwloc\
   /include -I/usr/local/include -I/usr/local/include -mavx2 -DGENERATE_SSE3_CODE -DGENERA\
   TE_SSE41_CODE -DGENERATE_AVX_CODE -DGENERATE_AVX2_CODE -O3 -DNDEBUG -finline-functions \
   -fno-strict-aliasing -mcx16 -MT liblocal_ops_avx2_la-op_avx_functions.lo -MD -MP -MF .d\
   eps/liblocal_ops_avx2_la-op_avx_functions.Tpo -c ../../../../../ompi/mca/op/avx/op_avx_\
   functions.c  -fPIC -DPIC -o .libs/liblocal_ops_avx2_la-op_avx_functions.o
14 /tmp/ccmZoSSt.s: Assembler messages:
15 /tmp/ccmZoSSt.s:1593: Error: no such instruction: `vinserti128 $0x1,-16(%rdi),%ymm0,%ymm1'
16 /tmp/ccmZoSSt.s:1595: Error: no such instruction: `vinserti128 $0x1,-16(%rax),%ymm0,%ymm0'
...etc.

@jsquyres (Member) commented Jul 5, 2020

Filed #7909 about the intel compiler atomic warnings.

@jsquyres (Member) commented Jul 6, 2020

@bosilca After a bunch of trial and error, I found that this program fails to compile on my systems:

$ cat avx.c
#include <immintrin.h>

int main()
{
    void *in1;
    __m256i vA = _mm256_loadu_si256((__m256i*) in1);

    return 0;
}

$ gcc --version
gcc (GCC) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ gcc -mavx2 avx.c -o avx
/tmp/ccLxzMsm.s: Assembler messages:
/tmp/ccLxzMsm.s:18: Error: no such instruction: `vinserti128 $0x1,16(%rax),%ymm0,%ymm0'

Calling functions like _mm256_add_epi32() seems to be fine -- it's _mm256_loadu_si256() that appears to be the problem.

@bosilca (Member Author) commented Jul 6, 2020

We figured it out: Jeff's assembler (v2.20.51.0.2) was from an older gcc installation. With a newer version everything worked as expected. I will add a test to make sure I catch such corner cases, and disable AVX2 support so the build process can reach completion.

@jsquyres (Member) commented Jul 6, 2020

Just to clarify:

  • I was testing on RHEL 6
  • I had manually installed (i.e., compiled from source) several versions of GCC:
    • GCC 7.3.0
    • GCC 8.2.0
    • GCC 9.3.0
  • The assembler (as) is not part of GCC; it is part of the binutils package. So all of these compilers were using the as that shipped with RHEL 6:
$ as --version
GNU assembler version 2.20.51.0.2-5.36.el6 20100205
Copyright 2009 Free Software Foundation, Inc.
...

Case in point:

  • If I build/install the newest binutils (2.34), and
  • Re-install GCC 9.3.0 from source -- forcing it to use the as v2.34
  • Then everything works fine.
  • Indeed, if you look closely at the error message, it's the assembler that is complaining about a non-existent instruction. That's because the RHEL 6-default as is ancient (from 2009), and simply does not recognize the AVX2 instruction. Upgrading to a newer as solves the problem.

The fix is to strengthen OMPI's avx configure.m4 test to make sure that the problematic instruction can, indeed, compile and link.
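
A sketch of what that strengthened check could look like in configure.m4, modeled on the "checking if _mm256_loadu_si256 generates code that can be compiled" line that appears in the configure output later in this thread (the result handling here is hypothetical, and the test must run with the detected AVX2 flags in CFLAGS):

AC_MSG_CHECKING([if _mm256_loadu_si256 generates code that can be compiled])
AC_LINK_IFELSE([AC_LANG_PROGRAM([[#include <immintrin.h>]],
        [[__m256i vA;
          vA = _mm256_loadu_si256((__m256i*)&vA);
          (void)vA;]])],
    [AC_MSG_RESULT([yes])],
    [AC_MSG_RESULT([no])
     op_avx2_support=0])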

@jsquyres (Member) commented Jul 7, 2020

@bosilca Can you apply the same kind of fix for the 512-sized vectors that you did for the 256-sized vectors? I'm now getting errors with _mm512_loadu_si512() similar to what we saw yesterday (with the older as on RHEL 6):

$ cat avx.c
#include <immintrin.h>

int main()
{
    int in1[9] = {0, 1, 2, 3, 4, 5, 6, 7, 8};
    __m512i vA = _mm512_loadu_si512((__m256i*) &(in1[1]));

    return 0;
}

$ gcc -march=skylake-avx512 avx.c -o avx
/tmp/ccYemRMl.s: Assembler messages:
/tmp/ccYemRMl.s:28: Error: no such instruction: `vmovdqu64 (%rax),%zmm0'
/tmp/ccYemRMl.s:29: Error: no such instruction: `vmovdqa64 %zmm0,-56(%rsp)'

@jsquyres (Member) commented Jul 7, 2020

More bad news, I'm afraid. ☹️

I ran the reduce_local test with gcc 9 + the newest as (i.e., all the AVX support did compile) on my Sandy Bridge platform, and got near-complete failures when AVX was actually used. Here's one result from running it manually:

$ mpirun -np 1 reduce_local 1048583 i 16 min
MPI_SUM MPI_UINT8_T 8  count  1  time 0.000005 seconds
MPI_SUM MPI_UINT8_T 8  count  2  time 0.000000 seconds
MPI_SUM MPI_UINT8_T 8  count  4  time 0.000000 seconds
MPI_SUM MPI_UINT8_T 8  count  8  time 0.000000 seconds
First error at position 0 (5 sum 253 != 255)
MPI_SUM MPI_UINT8_T [fail] count  16  time 0.000000 seconds
First error at position 0 (5 sum 253 != 255)
MPI_SUM MPI_UINT8_T [fail] count  32  time 0.000000 seconds
First error at position 0 (5 sum 253 != 255)
MPI_SUM MPI_UINT8_T [fail] count  64  time 0.000000 seconds
First error at position 0 (5 sum 253 != 255)
MPI_SUM MPI_UINT8_T [fail] count  128  time 0.000000 seconds
First error at position 0 (5 sum 253 != 255)
MPI_SUM MPI_UINT8_T [fail] count  256  time 0.000000 seconds
First error at position 0 (5 sum 253 != 255)
MPI_SUM MPI_UINT8_T [fail] count  512  time 0.000000 seconds
First error at position 0 (5 sum 253 != 255)
MPI_SUM MPI_UINT8_T [fail] count  1024  time 0.000000 seconds
First error at position 0 (5 sum 253 != 255)
MPI_SUM MPI_UINT8_T [fail] count  2048  time 0.000000 seconds
First error at position 0 (5 sum 253 != 255)
MPI_SUM MPI_UINT8_T [fail] count  4096  time 0.000000 seconds
First error at position 0 (5 sum 253 != 255)
MPI_SUM MPI_UINT8_T [fail] count  8192  time 0.000001 seconds
First error at position 0 (5 sum 253 != 255)
MPI_SUM MPI_UINT8_T [fail] count  16384  time 0.000001 seconds
First error at position 0 (5 sum 253 != 255)
MPI_SUM MPI_UINT8_T [fail] count  32768  time 0.000003 seconds
First error at position 0 (5 sum 253 != 255)
MPI_SUM MPI_UINT8_T [fail] count  65536  time 0.000005 seconds
First error at position 0 (5 sum 253 != 255)
MPI_SUM MPI_UINT8_T [fail] count  131072  time 0.000012 seconds
First error at position 0 (5 sum 253 != 255)
MPI_SUM MPI_UINT8_T [fail] count  262144  time 0.000025 seconds
First error at position 0 (5 sum 253 != 255)
MPI_SUM MPI_UINT8_T [fail] count  524288  time 0.000066 seconds

Here's the stdout from configure in this environment:

3291 --- MCA component op:avx (m4 configuration macro)
3292 checking for MCA component op:avx compile mode... dso
3293 checking for AVX512 support (no additional flags)... no
3294 checking for AVX512 support (with -march=skylake-avx512)... yes
3295 checking for AVX2 support (no additional flags)... no
3296 checking for AVX2 support (with -mavx2)... yes
3297 checking if _mm256_loadu_si256 generates code that can be compiled... yes
3298 checking for AVX support (no additional flags)... yes
3299 checking for SSE4.1 support... no
3300 checking for SSE3 support... no
3301 checking for AVX support (with -mavx)... yes
3302 checking for SSE4.1 support... yes
3303 checking for SSE3 support... yes
3304 checking if MCA component op:avx can compile... yes

And you can see the propagation of those flags in a rendered Makefile:

628 MCA_BUILD_OP_AVX2_FLAGS = -mavx2
629 MCA_BUILD_OP_AVX512_FLAGS = -march=skylake-avx512
630 MCA_BUILD_OP_AVX_FLAGS = -mavx

Is building with -march=skylake-avx512 advisable on an older platform (i.e., Sandy Bridge)?

@rhc54 (Contributor) commented Jul 7, 2020

Why don't you just detect the platform and turn it "off" for anything older than Skylake? It isn't worth all the trouble for these corner cases.

@bosilca (Member Author) commented Jul 7, 2020

I will modify the 512 case as well, not because I think it's the right approach, but because I want to get this done. While in the 256 case I understand that -mavx2 might not be enough to cover all the instructions in the generated code, in the 512 case I am using -march=, and this should generate exactly the same code everywhere. We can keep adding band-aids to OMPI as we like, but the real issue is still there.

@bosilca (Member Author) commented Jul 7, 2020

@jsquyres the test is correct. As I mentioned, the AVX instructions use a different overflow policy: the resulting value is saturated to the min/max representable value for the type. Thus, for unsigned 8-bit values, 5 + 253 will be 2, except with AVX, where the result will be 255. Good thing that MPI does not mandate how these operations behave with regard to overflow. I will change the test to avoid printing this information.
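
A minimal illustration of the two overflow policies, assuming AVX2's saturating _mm256_adds_epu8 (the intrinsics the component actually uses may differ):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t a = 5, b = 253;
    uint8_t wrap = (uint8_t)(a + b);            /* scalar add wraps around: 2 */

    __m256i va = _mm256_set1_epi8((char)a);
    __m256i vb = _mm256_set1_epi8((char)b);
    __m256i vs = _mm256_adds_epu8(va, vb);      /* saturating add clamps: 255 */
    uint8_t sat = (uint8_t)_mm256_extract_epi8(vs, 0);

    printf("wraparound: %d, saturated: %d\n", wrap, sat);
    return 0;
}

Compiled with -mavx2, this prints "wraparound: 2, saturated: 255", which matches the mismatch the test reported.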

@jsquyres (Member) commented Jul 10, 2020

Sorry for the delay; I finally got to testing this PR again this morning. It looks good!

I pushed a commit to this PR that fixes up the check_ops.sh script -- feel free to squash if you like it.

I found that on my Sandy Bridge machines:

  • gcc 9 + new as: compiles fine
  • gcc 8 + old as: compiles fine
  • icc 19: compiles fine

Yay!

But after I fixed up the check_ops script, I noticed that all three cases are failing MIN and MAX with 64-bit values:

$ ./check_op.sh |& tee out.txt
...lots of output...
$ grep fail out.txt
MPI_MAX    MPI_UINT64_T [fail] count  1048576     time 0.003099 seconds
MPI_MAX    MPI_UINT64_T [fail] count  1048577     time 0.002463 seconds
MPI_MAX    MPI_UINT64_T [fail] count  1048583     time 0.002330 seconds
MPI_MAX    MPI_UINT64_T [fail] count  1048591     time 0.002797 seconds
MPI_MAX    MPI_UINT64_T [fail] count  1048607     time 0.003085 seconds
MPI_MAX    MPI_UINT64_T [fail] count  1048639     time 0.002938 seconds
MPI_MAX    MPI_UINT64_T [fail] count  1048703     time 0.002454 seconds
MPI_MAX    MPI_UINT64_T [fail] count  1048706     time 0.002121 seconds
MPI_MIN    MPI_UINT64_T [fail] count  1048576     time 0.001572 seconds
MPI_MIN    MPI_UINT64_T [fail] count  1048577     time 0.001552 seconds
MPI_MIN    MPI_UINT64_T [fail] count  1048583     time 0.001639 seconds
MPI_MIN    MPI_UINT64_T [fail] count  1048591     time 0.002461 seconds
MPI_MIN    MPI_UINT64_T [fail] count  1048607     time 0.001633 seconds
MPI_MIN    MPI_UINT64_T [fail] count  1048639     time 0.001597 seconds
MPI_MIN    MPI_UINT64_T [fail] count  1048703     time 0.001584 seconds
MPI_MIN    MPI_UINT64_T [fail] count  1048706     time 0.001635 seconds

Add logic to handle different architectural capabilities

Detect the compiler flags necessary to build specialized versions of the MPI_Op. Once the different flavors (AVX512, AVX2, AVX) are built, detect at runtime which is the best match for the current processor capabilities.

Add validation checks for loadu 256 and 512 bits.
Add validation tests for MPI_Op.

Signed-off-by: Jeff Squyres <[email protected]>
Signed-off-by: Gilles Gouaillardet <[email protected]>
Signed-off-by: dongzhong <[email protected]>
Signed-off-by: George Bosilca <[email protected]>
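
The runtime selection the commit message describes might look roughly like this sketch, assuming GCC's __builtin_cpu_supports; the per-flavor kernels are hypothetical stand-ins for the functions built from op_avx_functions.c:

#include <stddef.h>

typedef void (*sum_fn)(const void *in, void *inout, size_t count);

/* Hypothetical per-flavor kernels; in the component these come from
 * op_avx_functions.c compiled several times with different flags. */
static void sum_scalar(const void *in, void *inout, size_t n) { (void)in; (void)inout; (void)n; }
static void sum_avx   (const void *in, void *inout, size_t n) { (void)in; (void)inout; (void)n; }
static void sum_avx2  (const void *in, void *inout, size_t n) { (void)in; (void)inout; (void)n; }
static void sum_avx512(const void *in, void *inout, size_t n) { (void)in; (void)inout; (void)n; }

/* Pick the widest flavor the current processor supports, once, at init;
 * after that, dispatch through the function pointer with no runtime checks. */
static sum_fn select_sum(void)
{
    if (__builtin_cpu_supports("avx512f")) return sum_avx512;
    if (__builtin_cpu_supports("avx2"))    return sum_avx2;
    if (__builtin_cpu_supports("avx"))     return sum_avx;
    return sum_scalar;
}
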
@bosilca (Member Author) commented Jul 11, 2020

A typo in the tester (max and min were not correctly computed). I merged everything into a single commit; this is ready to go.

@jsquyres (Member) commented

bot:aws:retest

@jsquyres (Member) left a comment

I confirm -- good to go!

@bosilca bosilca merged commit 1f237f5 into open-mpi:master Jul 13, 2020
@jsquyres (Member) commented

🎉

@bosilca bosilca deleted the topic/avx512 branch September 26, 2020 16:28