From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-help-return-54877-listarch-gcc-help=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 21142 invoked by alias); 21 Feb 2014 14:11:22 -0000
Mailing-List: contact gcc-help-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-help.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-help/>
List-Post: <mailto:gcc-help@gcc.gnu.org>
List-Help: <mailto:gcc-help-help@gcc.gnu.org>
Sender: gcc-help-owner@gcc.gnu.org
Received: (qmail 21111 invoked by uid 89); 21 Feb 2014 14:11:21 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-1.7 required=5.0 tests=AWL,BAYES_00,FREEMAIL_ENVFROM_END_DIGIT,FREEMAIL_FROM,RCVD_IN_DNSWL_LOW,SPF_PASS autolearn=ham version=3.3.2
X-HELO: mail-pb0-f46.google.com
Received: from mail-pb0-f46.google.com (HELO mail-pb0-f46.google.com) (209.85.160.46) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-SHA encrypted) ESMTPS; Fri, 21 Feb 2014 14:11:20 +0000
Received: by mail-pb0-f46.google.com with SMTP id um1so3512162pbc.33        for <gcc-help@gcc.gnu.org>; Fri, 21 Feb 2014 06:11:18 -0800 (PST)
MIME-Version: 1.0
X-Received: by 10.66.254.3 with SMTP id ae3mr9310788pad.107.1392991876735; Fri, 21 Feb 2014 06:11:16 -0800 (PST)
Received: by 10.70.18.193 with HTTP; Fri, 21 Feb 2014 06:11:16 -0800 (PST)
In-Reply-To: <5307271E.5000508@westcontrol.com>
References: <CA+1=iYaWg6OyzNjM9K2Qb1fn40ei0Ls+3AhVyXcg-h2Pm3xQaw@mail.gmail.com>	<5305D0D4.6080105@westcontrol.com>	<CA+1=iYbuXJpMAZh3Nxoe2S+at2mwMr=ueLP7-Pa2GV0WrtOUtw@mail.gmail.com>	<5307271E.5000508@westcontrol.com>
Date: Fri, 21 Feb 2014 14:11:00 -0000
Message-ID: <CA+1=iYa+GhnQeWt1m5PRg_GHXuY5J-NcwF01bDboSk0zWb8rEQ@mail.gmail.com>
Subject: Re: Compiler optimizing variables in inline assembly
From: Cody Rigney <codyrigney92@gmail.com>
To: David Brown <david@westcontrol.com>
Cc: gcc-help@gcc.gnu.org
Content-Type: text/plain; charset=ISO-8859-1
X-SW-Source: 2014-02/txt/msg00137.txt.bz2

Thanks for the example code snippet.

As far as the intrinsics go, I looked a little deeper.  Apparently,
gcc has some bugs that keep it from producing optimal NEON assembly;
see the link below.  Some of these bugs have been resolved, others
have not.  I guess I'll stick with assembly for now, until I know they
have been resolved.  I'm not sure, though, how the OpenCV developers
would feel about merging a pull request that contains inline assembly.
If that won't work for them, I'll just switch to intrinsics.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47562


On Fri, Feb 21, 2014 at 5:14 AM, David Brown <david@westcontrol.com> wrote:
> On 20/02/14 20:39, Cody Rigney wrote:
>> Thanks for the advice.  I didn't realize before that volatile was
>> actually hiding the problem.
>>
>> Do you mind providing an example of what you mean by using a "static
>> inline" function?  That sounds like a better way of managing the
>> assembly.  I know what you mean, but I would like to see an example of
>> the details (like passing parameters, etc).
>
> Suppose you have an assembly instruction "foo <dest> <src>".  You would
> write something like this:
>
> #include <stdint.h>
>
> /* "foo" is a placeholder mnemonic; substitute a real instruction. */
> static inline uint32_t foo(uint32_t x) {
>         uint32_t y;
>         asm (" foo %[dest], %[src] " : [dest] "=r" (y) : [src] "r" (x));
>         return y;
> }
>
> Then you would use it in code as "y = foo(x);" and the compiler would
> put in the single assembly line (plus any code needed to put x into a
> register - let the compiler handle that sort of thing).
>
> This lets you keep the messy inline assembly stuff separate from the
> algorithm code that actually uses it.
>
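A minimal sketch of the same wrapper pattern with a real instruction in
place of the placeholder "foo".  The function name byte_reverse32 and
the choice of a byte-reverse instruction (rev on ARM, bswap on x86) are
assumptions made for illustration and are not from the original code.
Targets without a matching branch fall back to plain C.

```c
#include <stdint.h>

/* Hypothetical example: reverse the byte order of a 32-bit value.
 * Inputs and outputs are declared as operands, so the compiler is
 * free to choose the registers. */
static inline uint32_t byte_reverse32(uint32_t x)
{
    uint32_t y;
#if defined(__arm__) && defined(__ARM_ARCH) && (__ARM_ARCH >= 7)
    /* 32-bit ARM: rev <dest>, <src> */
    __asm__ ("rev %[dest], %[src]" : [dest] "=r" (y) : [src] "r" (x));
#elif defined(__aarch64__)
    /* AArch64: the %w modifier selects the 32-bit register name */
    __asm__ ("rev %w[dest], %w[src]" : [dest] "=r" (y) : [src] "r" (x));
#elif defined(__x86_64__) || defined(__i386__)
    /* x86: bswap works in place, so use one read-write operand */
    y = x;
    __asm__ ("bswap %[val]" : [val] "+r" (y));
#else
    /* Portable fallback for any other target */
    y = (x >> 24) | ((x >> 8) & 0x0000FF00u)
      | ((x << 8) & 0x00FF0000u) | (x << 24);
#endif
    return y;
}
```

It is then called like any other C function, e.g.
"y = byte_reverse32(x);", and the algorithm code never sees the asm.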
>
>>
>> Initially, I began writing the NEON acceleration in intrinsics.
>> Then, I read more and more about NEON intrinsics being much slower
>> when compiled with gcc, due to some stack pops and pushes that fill it
>> up.  Apparently, the Microsoft ARM compiler and Apple's ARM compiler
>> do well with NEON intrinsics, but GCC does not. So I switched to
>> inline assembly.  I haven't actually tested this myself, but since
>> OpenCV is cross-platform, I wanted to make the acceleration work
>> cross-platform in the fastest way.
>
> Don't believe random stuff you read on the internet about compiler
> speeds - test it yourself.  One key reason is that the internet never
> forgets, and information is seldom dated - perhaps the intrinsics /were/
> slow when first introduced in gcc 4.3 (or whatever), but they could be
> much faster with 4.8.  The other issue is that you have to have
> optimisation enabled (sometimes even -O3 or extra specific optimisation
> flags, and often -ffast-math) to get the kind of scheduling, loop
> unrolling, and other optimisations needed to get the best out of NEON.
> The internet is full of people compiling without optimisation and then
> complaining about the slow code.  It is even conceivable that the latest
> gcc advances in auto-vectorisation can generate good enough NEON code
> without using intrinsics or inline assembly (I don't know if the
> auto-vectorisation stuff supports ARM/NEON yet).
>
> So make /small/ test cases that let you see exactly what is happening.
> Use the intrinsics, pull things out into small and clear functions (if
> they are "static" then the compiler will be able to inline them) so that
> you can separate your logic from the low-level mechanics, and examine
> the generated assembly for the critical parts.  Only go for inline
> assembly if it will really make a difference.
>
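A small test case of the kind described above could look like the
following sketch.  The function name add_f32 and the choice of a plain
element-wise add are assumptions made for illustration; the intrinsics
(vld1q_f32, vaddq_f32, vst1q_f32) come from arm_neon.h.  When NEON is
not available the scalar loop does all the work, so the file still
builds on other targets.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical test function: dst[i] = a[i] + b[i] for i in [0, n). */
void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Process four floats per iteration with NEON intrinsics. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));
    }
#endif
    /* Scalar loop: handles the tail, or everything without NEON. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Compiling such a file with optimisation and -S (on 32-bit ARM probably
also with -mfpu=neon) gives an assembly listing short enough to read in
full, which makes it easy to see what the compiler did with the
intrinsics.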
> mvh.,
>
> David
>
>
>>
>> Thanks,
>>
>> Cody
>>
>> On Thu, Feb 20, 2014 at 4:54 AM, David Brown <david@westcontrol.com> wrote:
>>> Hi,
>>>
>>> I haven't read through the code at all, but I will give you a little
>>> general advice.
>>>
>>> Try to cut the code to the absolute minimum that shows the problem.  It
>>> makes it easier for you to work with and check, and it makes it easier
>>> for other people to examine.  Also make sure that the code has no other
>>> dependencies such as extra headers - ideally people should be able to
>>> compile the code themselves and test it (I realise this is difficult for
>>> those who don't have an ARM handy).
>>>
>>> Code that works without optimisation but fails with optimisation, or
>>> that works when you make a variable volatile, always indicates a bug.
>>> Occasionally, it is a bug in the compiler - but most often it is a bug
>>> in the code.  Either way, it is important to figure out the root cause,
>>> and not try to hide it by making things volatile (though that might be a
>>> good temporary fix for a compiler bug).
>>>
>>> I am not familiar with Neon (and not as good as I should be at ARM
>>> assembly in general), but it looks to me that you have used specific
>>> registers in your inline assembly, and assumed specific registers for
>>> compiler use (such as variables).  Don't do that.  When you have turned
>>> off all optimisation, the compiler is consistent about which registers
>>> it uses for different purposes - when optimising, it changes register
>>> usage in a very unpredictable way.  You must be explicit - all data
>>> going into your assembly must be declared, as must all data coming out
>>> of the assembly.  And if you use specific registers, you need to tell
>>> the compiler about them (as "clobbers") - and be aware that the compiler
>>> might be using those registers for the input or output values.
>>>
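To make the point about declaring everything concrete, here is a
minimal sketch, not taken from the code under discussion.  The helper
name add_via_scratch and the particular scratch registers (r3, x9, ecx)
are assumptions chosen for illustration.  The asm uses one hard-coded
register, so that register is listed as a clobber, while the inputs and
the output are ordinary operands the compiler may place anywhere else.

```c
#include <stdint.h>

/* Hypothetical helper: adds a and b through a fixed scratch register.
 * The scratch register is named in the clobber list so the compiler
 * will not keep an input, the output, or a live variable in it. */
static inline uint32_t add_via_scratch(uint32_t a, uint32_t b)
{
    uint32_t out;
#if defined(__arm__) && defined(__ARM_ARCH) && (__ARM_ARCH >= 7)
    __asm__ ("mov r3, %[a]\n\t"
             "add %[out], r3, %[b]"
             : [out] "=r" (out)
             : [a] "r" (a), [b] "r" (b)
             : "r3");
#elif defined(__aarch64__)
    __asm__ ("mov w9, %w[a]\n\t"
             "add %w[out], w9, %w[b]"
             : [out] "=r" (out)
             : [a] "r" (a), [b] "r" (b)
             : "x9");
#elif defined(__x86_64__)
    /* The add also changes the flags, hence the "cc" clobber. */
    __asm__ ("movl %[a], %%ecx\n\t"
             "addl %[b], %%ecx\n\t"
             "movl %%ecx, %[out]"
             : [out] "=r" (out)
             : [a] "r" (a), [b] "r" (b)
             : "ecx", "cc");
#else
    /* Portable fallback for any other target */
    out = a + b;
#endif
    return out;
}
```

Without the clobber entry, an optimising build would be free to keep
one of the operands (or an unrelated variable) in the scratch register,
which is the kind of failure that shows up only in release mode.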
>>> Getting inline assembly right is not easy, and it is often best to work
>>> with several small assembly statements rather than large ones - I
>>> usually make a "static inline" function around a line or two of inline
>>> assembly and then use that function in the code as needed.  It can make
>>> the result a lot clearer, and makes it easier to mix the C and assembly
>>> - the end result is often better than I would make in pure assembly.
>>>
>>> Finally, is there a good reason why you need inline assembly rather than
>>> the NEON intrinsics provided by gcc?
>>>
>>> <http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html>
>>>
>>>
>>> mvh.,
>>>
>>> David
>>>
>>>
>>>
>>>
>>> On 19/02/14 20:04, Cody Rigney wrote:
>>>> Hi,
>>>>
>>>> I'm trying to add NEON optimizations to OpenCV's LK optical flow.  See
>>>> link below.
>>>> https://github.com/Itseez/opencv/blob/2.4/modules/video/src/lkpyramid.cpp
>>>>
>>>> The gcc version could vary since this is an open source project, but
>>>> the one I'm currently using is 4.8.1. The target architecture is ARMv7
>>>> w/ NEON. The processor I'm testing on is an ARM
>>>> Cortex-A15 (big.LITTLE).
>>>>
>>>> The problem is, in release mode (where optimizations are set) it does
>>>> not work properly. However, in debug mode, it works fine. I tracked
>>>> down a specific variable (FLT_SCALE) that was being optimized out and
>>>> made it volatile and that part worked fine after that. However, I'm
>>>> still having incorrect behavior from some other optimization.  I'm new
>>>> to inline assembly, so I thought maybe I'm doing something wrong
>>>> that's not telling the compiler that I'm using a certain variable.
>>>>
>>>> Below is the code at its current state. Ignore all the comments and
>>>> volatiles (for testing this problem) everywhere. It's WIP. I removed
>>>> unnecessary functions and code so it would be easier to see. I think
>>>> the problem is in the bottom-most asm block because if I do if(false)
>>>> to skip it, I don't run into the problem. Thanks.
>>>>
>>>
>>> <snip>
>>>
>>>
>