- 
                Notifications
    You must be signed in to change notification settings 
- Fork 929
Add support for AVX #7419
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for AVX #7419
Conversation
0199b6c    to
    26680e1      
    Compare
  
    | The cray compile warnings/errors look legit. Not sure why lots of warns and notes led to GCC returning failure however. | 
| Because my .m4 screwed up the flags. Fix on the way. | 
93f23d9    to
    84e8a49      
    Compare
  
    0aa9492    to
    1fbeead      
    Compare
  
    1fbeead    to
    6f2c04b      
    Compare
  
    9595ad0    to
    075549f      
    Compare
  
    3438d2d    to
    136b6da      
    Compare
  
    | @bosilca I just pushed two commits to fix some typos FWIW, I am also writing the  For SVE, would you rather have it in its own module (e.g.  | 
| Thanks @ggouaillardet . Regarding the test I have a way more extensive version in the pipeline, should be able to push in the next few days. For the ARM extensions having all ARM-V* versions in a single place can be beneficial, as we will be able to reuse parts of the code. It might however make the compiling stage more complex as we will need to provide specialized flags for some of the .c file (to enable specific architecture options). What's you take on this ? | 
| @bosilca I pushed a new fix in your PR I added support for NEON and SVE in my own branch (based on top of this one) at https://github.com/ggouaillardet/ompi/tree/topic/op_arm For now, I did two distinct components, and though there is some overlap, there is not that much imho. If I understand correctly the  
 before each step, the flags are tested, and draining the main loop (e.g. the last 63  
 we could also do loop peeling to be faster when data is not optimally aligned. In theory, we could have several components (e.g. one per avx type) but in order to avoid code duplication, it can be left in a single module (and since subroutine selection is performed at selection time, there is no runtime overhead) | 
| bot:ibm:xl:retest | 
| @ggouaillardet I did not hear about anything but performance issues while mixing AVX and SSE, and these are not a concern for us because in the worst case we do such a transition once. We could follow Intel's suggestions and pass the code through vtune to make sure. The code can certainly be improved, but I don't know by how much. I assumed that most of the MPI_Op executes on data large enough to completely hide the few cycles lost in the branch mismanagement, so that reordering and loop peeling might not be the such a deal breaker. But we might want to investigate how to deal with the case where the user data alignment matches the AVX needs, as this should allow us to use AVX instructions for aligned memory (load instead of loadu) accesses but I don't how much we can gain. Some hits can be found here on McCalpin answer. For the SVE code, @dong0321 and myself were working on a branch in my repo, that you can access via a PR. It is mostly based on the AVX code, and has no support for Neon. We should merge our efforts on this, and reach out to Fujitsu (@Shinji-Sumimoto) and ARM (@shamisp) for final validation. | 
| I don't understand the AZR issue here. It seems that the instruction is not supported, but it did find it while looking for it during configure. So I wonder of somehow the additional flags detected during configure are not passed to the compiler for the avx_functions.c file. But for this I would need to see the compile arguments used. Can someone form Mellanox send me this info please. | 
| bot:aws:retest Same real-but-unrelated-to-this-PR timeout we've been seeing for a while (see #7847). | 
| bot:aws:retest | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm running into problems when testing this PR locally. More information coming shortly.
| Intel icc 19.04When I compile with icc 19.04, I get the following (which is probably unsurprising): But then I get this when compiling nearly every file in Open MPI: These are just warnings, but it makes the output from  GNU Gcc 7.3.0, 8.2.0, and 9.3.0With all 3 of these versions of GCC, I get the following: And then: This is on an older Sandy Bridge architecture machine, so it could be before AVX2 was available...? My full configure output from gcc 9.3.0 and corresponding config.log is attached. | 
| The issue reported by the Intel compiler has nothing to do with this patch. It's about the compiler being vocal about a volatile being passed as an Atomic. | 
| for the second case can you please check the flags used to compile the AVX2 library. All these instructions are part of the AVX2 support, it looks as if the -mavx2 flag was not passed to the compiler. | 
| Looks like  Here's the full output: avx-make-V=1.txt  | 
| Filed #7909 about the intel compiler atomic warnings. | 
| @bosilca With a bunch of trial and error, this program fails to compile for me on my systems: Calling functions like  | 
| We figured it out, Jeff's assembler (v2.20.51.0.2) was from an older gcc version. With a newer version everything worked as expected. I will add a test to make sure I catch such corner cases, and disable AVX2 support to allow the build process to reach completion. | 
| Just to clarify: 
 Case in point: 
 The fix is to strengthen OMPI's avx  | 
| @bosilca Can you apply the same kind of fix for the 512-sized vectors that you did for the 256-sized vectors?  I'm now getting errors with   | 
| More bad news, I'm afraid.   I ran the reduce_local test with gcc 9+newest  Here's the stdout from configure in this environment: And you can see the propagation of those flags in a rendered Makefile: Is building with  | 
| Why don't you just detect the platform and turn it "off" for anything older than skylake?? It isn't worth all the trouble for these corner cases. | 
| I will modify the 512 case as well, not because I think it's the right approach, but because I want to get this done. While on the 256 case I understand that  | 
| @jsquyres the test is correct. As I mentionned the AVX instructions use a different overflow policy, the resulting value is set to the min/max of the representative value for the type. Thus, unsigned 8 bits 5 + 253 will be 2, except with AVX where the result will be 255. Good thing that MPI does not mandate how these operations behave with regard to overflow. I will change the test to avoid printing this information. | 
| Sorry for the delay; I finally got to testing this PR again this morning. It looks good! I pushed a commit to this PR that fixes up the check_ops.sh script -- feel free to squash if you like it. I found that on my Sandy Bridge machines: 
 Yay! But after I fixed up the check_ops script, I noticed that all three cases are failing MIN and MAX with 64-bit values:  | 
Add logic to handle different architectural capabilities Detect the compiler flags necessary to build specialized versions of the MPI_OP. Once the different flavors (AVX512, AVX2, AVX) are built, detect at runtime which is the best match with the current processor capabilities. Add validation checks for loadu 256 and 512 bits. Add validation tests for MPI_Op. Signed-off-by: Jeff Squyres <[email protected]> Signed-off-by: Gilles Gouaillardet <[email protected]> Signed-off-by: dongzhong <[email protected]> Signed-off-by: George Bosilca <[email protected]>
| A typo in the tester (max and min were not correctly computed). I merged everything into a single commit, this is ready to go. | 
| bot:aws:retest | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I confirm -- good to go!
| 🎉 | 
Add support for AVX instructions for all MPI_Op.