public inbox for gcc-help@gcc.gnu.org
* Compiler optimizing variables in inline assembly
@ 2014-02-19 19:05 Cody Rigney
  2014-02-20  9:14 ` Andrew Haley
  2014-02-20  9:54 ` David Brown
  0 siblings, 2 replies; 16+ messages in thread
From: Cody Rigney @ 2014-02-19 19:05 UTC (permalink / raw)
  To: gcc-help

Hi,

I'm trying to add NEON optimizations to OpenCV's LK optical flow.  See
link below.
https://github.com/Itseez/opencv/blob/2.4/modules/video/src/lkpyramid.cpp

The gcc version could vary since this is an open source project, but
the one I'm currently using is 4.8.1. The target architecture is ARMv7
w/ NEON. The processor I'm testing on is an ARM
Cortex-A15(big.LITTLE).

The problem is, in release mode (where optimizations are set) it does
not work properly. However, in debug mode, it works fine. I tracked
down a specific variable(FLT_SCALE) that was being optimized out and
made it volatile and that part worked fine after that. However, I'm
still having incorrect behavior from some other optimization.  I'm new
to inline assembly, so I thought maybe I'm doing something wrong
that's not telling the compiler that I'm using a certain variable.

Below is the code at its current state. Ignore all the comments and
volatiles(for testing this problem) everywhere. It's WIP. I removed
unnecessary functions and code so it would be easier to see. I think
the problem is in the bottom-most asm block because if I do if(false)
to skip it, I don't run into the problem. Thanks.

#include "precomp.hpp"
#include <float.h>
#include <stdio.h>
#include "lkpyramid.hpp"

namespace
{
static void calcSharrDeriv(const cv::Mat& src, cv::Mat& dst)
{
    using namespace cv;
    using cv::detail::deriv_type;
    int rows = src.rows, cols = src.cols, cn = src.channels(),
        colsn = cols*cn, depth = src.depth();
    CV_Assert(depth == CV_8U);
    dst.create(rows, cols, CV_MAKETYPE(DataType<deriv_type>::depth, cn*2));

#ifdef HAVE_TEGRA_OPTIMIZATION
    if (tegra::calcSharrDeriv(src, dst))
        return;
#endif

    int x, y, delta = (int)alignSize((cols + 2)*cn, 16);
    AutoBuffer<deriv_type> _tempBuf(delta*2 + 64);
    deriv_type *trow0 = alignPtr(_tempBuf + cn, 16),
               *trow1 = alignPtr(trow0 + delta, 16);


    int three = 3, ten = 10;


    for( y = 0; y < rows; y++ )
    {
        const uchar* srow0 = src.ptr<uchar>(y > 0 ? y-1 : rows > 1 ? 1 : 0);
        const uchar* srow1 = src.ptr<uchar>(y);
        const uchar* srow2 = src.ptr<uchar>(y < rows-1 ? y+1 : rows > 1 ? rows-2 : 0);
        deriv_type* drow = dst.ptr<deriv_type>(y);

        // do vertical convolution
        x = 0;

#ifdef CV_NEON
        //assumes deriv_type is 16 bits
        if(sizeof(deriv_type) == 2 && colsn >= 16)
        {

            __asm__ volatile ( "vdup.16 q8, %0\n\t"
                               "vdup.8 d18, %1\n\t"
                               :
                               : "r" (three), "r" (ten)
                               : );

            for( ; x <= colsn - 8; x += 8)
            {

                __asm__ volatile ( "vld1.8 {d0}, [%0]\n\t"
                                   "vld1.8 {d1}, [%1]\n\t"
                                   "vld1.8 {d2}, [%2]\n\t"
                                   "vaddl.u8 q4, d0, d2\n\t"
                                   "vsubl.u8 q11, d2, d0\n\t"
                                   "vmul.u16 q5, q4, q8\n\t"
                                   "vmull.u8 q6, d1, d18\n\t"
                                   "vadd.u16 q10, q6, q5\n\t"
                                   "vst1.16 {q10}, [%3]\n\t"
                                   "vst1.16 {q11}, [%4]\n\t"
                                   :
                                   : "r" (srow0 + x),
                                     "r" (srow1 + x),
                                     "r" (srow2 + x),
                                     "r" (trow0 + x),
                                     "r" (trow1 + x)
                                   :
                                   );

            }
        }
#endif
        for( ; x < colsn; x++ )
        {
            int t0 = (srow0[x] + srow2[x])*3 + srow1[x]*10;
            int t1 = srow2[x] - srow0[x];
            trow0[x] = (deriv_type)t0;
            trow1[x] = (deriv_type)t1;
        }

        // make border
        int x0 = (cols > 1 ? 1 : 0)*cn, x1 = (cols > 1 ? cols-2 : 0)*cn;
        for( int k = 0; k < cn; k++ )
        {
            trow0[-cn + k] = trow0[x0 + k]; trow0[colsn + k] = trow0[x1 + k];
            trow1[-cn + k] = trow1[x0 + k]; trow1[colsn + k] = trow1[x1 + k];
        }

#ifdef CV_NEON
    __asm__ volatile ( "vdup.16 q8, %0\n\t"
                       "vdup.16 q9, %1\n\t"
                       :
                       : "r" (three), "r" (ten)
                       : );
#endif

        // do horizontal convolution, interleave the results and store them to dst
        x = 0;

#ifdef CV_NEON
        //assumes size of deriv_type is 16 bits
        if(sizeof(deriv_type) == 2 && colsn >= 16)
        {
            for( ; x <= colsn - 8; x += 8 )
            {
                __asm__ volatile (
                                  "vld1.16 {q0}, [%0]\n\t" //trow0[x + cn]
                                  "vld1.16 {q1}, [%1]\n\t" //trow0[x - cn]
                                  "vsub.i16 q5, q0, q1\n\t" //this is t0
                                  "vld1.16 {q2}, [%2]\n\t" //trow1[x + cn]
                                  "vld1.16 {q3}, [%3]\n\t" //trow1[x - cn]
                                  "vadd.i16 q6, q2, q3\n\t" //this
needs mult by 3
                                  "vld1.16 {q4}, [%4]\n\t" //trow1[x]
                                  "vmul.i16 q7, q6, q8\n\t" //this
needs to add to trow1[x]*10
                                  "vmul.i16 q10, q4, q9\n\t" //this is
trow1[x]*10
                                  "vadd.i16 q11, q7, q10\n\t" //this is t1
                                  "vswp d22, d11\n\t"
                                  "vst2.16 {q5}, [%5]\n\t" //interleave
                                  "vst2.16 {q11}, [%6]\n\t" //interleave
                                  :
                                  : "r" (trow0 + x + cn),  //0
                                    "r" (trow0 + x - cn),  //1
                                    "r" (trow1 + x + cn),  //2
                                    "r" (trow1 + x - cn),  //3
                                    "r" (trow1 + x),       //4
                                    "r" (drow + (x*2)),     //5
                                    "r" (drow + (x*2)+8)   //6
                                  :
                                  );

            }

        }
#endif

        for( ; x < colsn; x++ )
        {
            deriv_type t0 = (deriv_type)(trow0[x+cn] - trow0[x-cn]);
            deriv_type t1 = (deriv_type)((trow1[x+cn] + trow1[x-cn])*3 + trow1[x]*10);
            drow[x*2] = t0; drow[x*2+1] = t1;
        }
    }
}

}//namespace

cv::detail::LKTrackerInvoker::LKTrackerInvoker(
                      const Mat& _prevImg, const Mat& _prevDeriv, const Mat& _nextImg,
                      const Point2f* _prevPts, Point2f* _nextPts,
                      uchar* _status, float* _err,
                      Size _winSize, TermCriteria _criteria,
                      int _level, int _maxLevel, int _flags, float _minEigThreshold )
{
    prevImg = &_prevImg;
    prevDeriv = &_prevDeriv;
    nextImg = &_nextImg;
    prevPts = _prevPts;
    nextPts = _nextPts;
    status = _status;
    err = _err;
    winSize = _winSize;
    criteria = _criteria;
    level = _level;
    maxLevel = _maxLevel;
    flags = _flags;
    minEigThreshold = _minEigThreshold;
}

void cv::detail::LKTrackerInvoker::operator()(const Range& range) const
{
    Point2f halfWin((winSize.width-1)*0.5f, (winSize.height-1)*0.5f);
    const Mat& I = *prevImg;
    const Mat& J = *nextImg;
    const Mat& derivI = *prevDeriv;

    int j, cn = I.channels(), cn2 = cn*2;
    cv::AutoBuffer<deriv_type> _buf(winSize.area()*(cn + cn2));
    int derivDepth = DataType<deriv_type>::depth;

    Mat IWinBuf(winSize, CV_MAKETYPE(derivDepth, cn), (deriv_type*)_buf);
    Mat derivIWinBuf(winSize, CV_MAKETYPE(derivDepth, cn2), (deriv_type*)_buf + winSize.area()*cn);

    for( int ptidx = range.start; ptidx < range.end; ptidx++ )
    {
        Point2f prevPt = prevPts[ptidx]*(float)(1./(1 << level));
        Point2f nextPt;
        if( level == maxLevel )
        {
            if( flags & OPTFLOW_USE_INITIAL_FLOW )
                nextPt = nextPts[ptidx]*(float)(1./(1 << level));
            else
                nextPt = prevPt;
        }
        else
            nextPt = nextPts[ptidx]*2.f;
        nextPts[ptidx] = nextPt;

        Point2i iprevPt, inextPt;
        prevPt -= halfWin;
        iprevPt.x = cvFloor(prevPt.x);
        iprevPt.y = cvFloor(prevPt.y);

        if( iprevPt.x < -winSize.width || iprevPt.x >= derivI.cols ||
            iprevPt.y < -winSize.height || iprevPt.y >= derivI.rows )
        {
            if( level == 0 )
            {
                if( status )
                    status[ptidx] = false;
                if( err )
                    err[ptidx] = 0;
            }
            continue;
        }

        volatile float a = prevPt.x - iprevPt.x;
        volatile float b = prevPt.y - iprevPt.y;
        volatile const int W_BITS = 14, W_BITS1 = 14;
        volatile const float FLT_SCALE = 1.f/(1 << 20); //volatile is needed because compiler will optimize this out for NEON
        volatile int iw00 = cvRound((1.f - a)*(1.f - b)*(1 << W_BITS));
        volatile int iw01 = cvRound(a*(1.f - b)*(1 << W_BITS));
        volatile int iw10 = cvRound((1.f - a)*b*(1 << W_BITS));
        volatile int iw11 = (1 << W_BITS) - iw00 - iw01 - iw10;

        volatile int dstep = (int)(derivI.step/derivI.elemSize1());
        volatile int stepI = (int)(I.step/I.elemSize1());
        volatile int stepJ = (int)(J.step/J.elemSize1());
        volatile float A11 = 0, A12 = 0, A22 = 0;

#ifdef CV_NEON

        volatile int CV_DECL_ALIGNED(16) nA11[] = {0, 0, 0, 0}, nA12[] = {0, 0, 0, 0}, nA22[] = {0, 0, 0, 0};
        volatile const int shifter1 = -(W_BITS - 5); //negative so it shifts right
        volatile const int shifter2 = -(W_BITS);

        if(sizeof(deriv_type) == 2)
        {



            __asm__ volatile ( "vdup.16 d26, %0\n\t"
                               "vdup.16 d27, %1\n\t"
                               "vdup.16 d28, %2\n\t"
                               "vdup.16 d29, %3\n\t"
                               "vdup.32 q11, %4\n\t"
                               "vdup.32 q12, %5\n\t"
                               :
                               : "r" ((short)iw00),
                                 "r" ((short)iw01),
                                 "r" ((short)iw10),
                                 "r" ((short)iw11),
                                 "r" (shifter1),
                                 "r" (shifter2)
                               : );

        }

#endif

        // extract the patch from the first image, compute covariation matrix of derivatives
        volatile int x, y;
        for( y = 0; y < winSize.height; y++ )
        {
            volatile const uchar* src = (const uchar*)I.data + (y + iprevPt.y)*stepI + iprevPt.x*cn;
            volatile const deriv_type* dsrc = (const deriv_type*)derivI.data + (y + iprevPt.y)*dstep + iprevPt.x*cn2;

            volatile deriv_type* Iptr = (deriv_type*)(IWinBuf.data + y*IWinBuf.step);
            volatile deriv_type* dIptr = (deriv_type*)(derivIWinBuf.data + y*derivIWinBuf.step);

            x = 0;

#ifdef CV_NEON

        if(sizeof(deriv_type) == 2 && winSize.width*cn >= 12)
        {

            for( ; x <= winSize.width*cn - 4; x += 4, dsrc += 4*2, dIptr += 4*2 )
            {

                __asm__ volatile (
                                   "vld1.8 {d0}, [%0]\n\t" //ignores
last 4 bytes
                                   "vmovl.u8 q0, d0\n\t" //expand to 16-bit
                                   "vld1.8 {d2}, [%1]\n\t"
                                   "vmovl.u8 q1, d2\n\t"

                                   "vmull.s16 q5, d0, d26\n\t"
                                   "vmull.s16 q6, d2, d27\n\t"

                                   "vld1.8 {d4}, [%2]\n\t"
                                   "vmovl.u8 q2, d4\n\t" //expand
                                   "vld1.8 {d6}, [%3]\n\t"
                                   "vmovl.u8 q3, d6\n\t"

                                   "vmull.s16 q7, d4, d28\n\t"
                                   "vmull.s16 q8, d6, d29\n\t"

                                   "vadd.i32 q5, q5, q6\n\t"
                                   "vadd.i32 q7, q7, q8\n\t"
                                   "vadd.i32 q5, q5, q7\n\t"

                                   "vld2.16 {d0, d1}, [%4]\n\t"
//evens in d0 and d2
                                   "vld2.16 {d2, d3}, [%5]\n\t"

                                   "vqrshl.s32 q5, q5, q11\n\t"

                                   "vmull.s16 q4, d0, d26\n\t" //q4 is
mult of even 1
                                   "vmull.s16 q6, d1, d26\n\t" //q6 is
mult of odd 1

                                   "vmovn.s32 d0, q5\n\t"

                                   "vmull.s16 q7, d2, d27\n\t" //q7 is
mult of even 2
                                   "vmull.s16 q8, d3, d27\n\t" //q8 is
mult of odd 2

                                   "vst1.16 {d0}, [%8]\n\t"


                                   "vld2.16 {d4, d5}, [%6]\n\t"
//evens in d4 and d6
                                   "vld2.16 {d6, d7}, [%7]\n\t"

                                   "vadd.i32 q4, q4, q7\n\t" //this
frees up q7 and q8
                                   "vadd.i32 q6, q6, q8\n\t" //q4 is
added even 1 and 2
                                                             //q6 is
added odd 1 and 2

                                   "vmull.s16 q7, d4, d28\n\t" //q7 is
mult of even 3
                                   "vmull.s16 q0, d5, d28\n\t" //q0 is
mult of odd 3
                                   "vmull.s16 q8, d6, d29\n\t" //q8 is
mult of even 4
                                   "vmull.s16 q15, d7, d29\n\t" //q15
is mult of odd 4


                                   "vadd.i32 q7, q7, q8\n\t" //q7 is
added even 3 and 4
                                   "vadd.i32 q0, q0, q15\n\t" //q0 is
added odd 3 and 4

                                   "vadd.i32 q4, q4, q7\n\t" //q4 is
added even 1,2,3,4 -- will be ixval
                                   "vadd.i32 q6, q6, q0\n\t" //q6 is
added odd 1,2,3,4 -- will be iyval

                                   "vld1.32 {q1}, [%11]\n\t"
                                   "vld1.32 {q2}, [%12]\n\t"
                                   "vld1.32 {q0}, [%10]\n\t" //get the
loads prepared

                                   "vqrshl.s32 q4, q4, q12\n\t" //q4
is descaled evens added
                                   "vqrshl.s32 q6, q6, q12\n\t" //q6
is descaled odds added

                                   //now ixval is stored in q4 and iyval is stored in q6 and ival is in q5

                                   "vmul.s32 q7, q4, q4\n\t"
                                   "vmul.s32 q8, q4, q6\n\t"
                                   "vmul.s32 q15, q6, q6\n\t"

                                   "vadd.i32 q0, q0, q7\n\t"
                                   "vadd.i32 q1, q1, q8\n\t"
                                   "vadd.i32 q2, q2, q15\n\t"

                                   "vst1.32 {q0}, [%10]\n\t"
                                   "vst1.32 {q1}, [%11]\n\t"
                                   "vst1.32 {q2}, [%12]\n\t"

                                   "vmovn.i32 d8, q4\n\t" //bring ixval to short
                                   "vmovn.i32 d12, q6\n\t" //bring
iyval to short
                                   "vswp d9, d12\n\t" //now d8 is
ixval and d9 is iyval
                                   "vst2.16 {d8, d9}, [%9]\n\t"

                                   :
                                   : "r" (src + x), //0
                                     "r" (src + x + cn), //1
                                     "r" (src + x + stepI), //2
                                     "r" (src + x + stepI + cn),  //3
                                     "r" (dsrc),  //4
                                     "r" (dsrc + cn2),  //5
                                     "r" (dsrc + dstep),  //6
                                     "r" (dsrc + dstep + cn2),  //7
                                     "r" (Iptr + x),  //8
                                     "r" (dIptr), //9
                                     "r" (nA11), //10
                                     "r" (nA12), //11
                                     "r" (nA22) //12
                                   : );

            }

        }

#endif

            for( ; x < winSize.width*cn; x++, dsrc += 2, dIptr += 2 )
            {
                int ival = CV_DESCALE(src[x]*iw00 + src[x+cn]*iw01 +
                                      src[x+stepI]*iw10 + src[x+stepI+cn]*iw11, W_BITS1-5);

                int ixval = CV_DESCALE(dsrc[0]*iw00 + dsrc[cn2]*iw01 +
                                       dsrc[dstep]*iw10 + dsrc[dstep+cn2]*iw11, W_BITS1);
                int iyval = CV_DESCALE(dsrc[1]*iw00 + dsrc[cn2+1]*iw01 +
                                       dsrc[dstep+1]*iw10 + dsrc[dstep+cn2+1]*iw11, W_BITS1);

                Iptr[x] = (short)ival;
                dIptr[0] = (short)ixval;
                dIptr[1] = (short)iyval;

                A11 += (float)(ixval*ixval);
                A12 += (float)(ixval*iyval);
                A22 += (float)(iyval*iyval);
            }
        }

#ifdef CV_NEON
        A11 += (float)(nA11[0] + nA11[1] + nA11[2] + nA11[3]);
        A12 += (float)(nA12[0] + nA12[1] + nA12[2] + nA12[3]);
        A22 += (float)(nA22[0] + nA22[1] + nA22[2] + nA22[3]);
#endif

        A11 *= FLT_SCALE;
        A12 *= FLT_SCALE;
        A22 *= FLT_SCALE;


        volatile  float D = A11*A22 - A12*A12;
        float minEig = (A22 + A11 - std::sqrt((A11-A22)*(A11-A22) +
                        4.f*A12*A12))/(2*winSize.width*winSize.height);

        if( err && (flags & CV_LKFLOW_GET_MIN_EIGENVALS) != 0 )
            err[ptidx] = (float)minEig;

        if( minEig < minEigThreshold || D < FLT_EPSILON )
        {
            if( level == 0 && status )
                status[ptidx] = false;
            continue;
        }

        D = 1.f/D;

        nextPt -= halfWin;
        Point2f prevDelta;

        for( j = 0; j < criteria.maxCount; j++ )
        {
            inextPt.x = cvFloor(nextPt.x);
            inextPt.y = cvFloor(nextPt.y);

            if( inextPt.x < -winSize.width || inextPt.x >= J.cols ||
               inextPt.y < -winSize.height || inextPt.y >= J.rows )
            {
                if( level == 0 && status )
                    status[ptidx] = false;
                break;
            }

            a = nextPt.x - inextPt.x;
            b = nextPt.y - inextPt.y;
            iw00 = cvRound((1.f - a)*(1.f - b)*(1 << W_BITS));
            iw01 = cvRound(a*(1.f - b)*(1 << W_BITS));
            iw10 = cvRound((1.f - a)*b*(1 << W_BITS));
            iw11 = (1 << W_BITS) - iw00 - iw01 - iw10;
            float b1 = 0, b2 = 0;

            for( y = 0; y < winSize.height; y++ )
            {
                const uchar* Jptr = (const uchar*)J.data + (y + inextPt.y)*stepJ + inextPt.x*cn;
                const deriv_type* Iptr = (const deriv_type*)(IWinBuf.data + y*IWinBuf.step);
                const deriv_type* dIptr = (const deriv_type*)(derivIWinBuf.data + y*derivIWinBuf.step);

                x = 0;


                for( ; x < winSize.width*cn; x++, dIptr += 2 )
                {
                    int diff = CV_DESCALE(Jptr[x]*iw00 + Jptr[x+cn]*iw01 +
                                          Jptr[x+stepJ]*iw10 + Jptr[x+stepJ+cn]*iw11,
                                          W_BITS1-5) - Iptr[x];
                    b1 += (float)(diff*dIptr[0]);
                    b2 += (float)(diff*dIptr[1]);
                }
            }


            b1 *= FLT_SCALE;
            b2 *= FLT_SCALE;

            Point2f delta( (float)((A12*b2 - A22*b1) * D),
                          (float)((A12*b1 - A11*b2) * D));
            //delta = -delta;

            nextPt += delta;
            nextPts[ptidx] = nextPt + halfWin;

            if( delta.ddot(delta) <= criteria.epsilon )
                break;

            if( j > 0 && std::abs(delta.x + prevDelta.x) < 0.01 &&
               std::abs(delta.y + prevDelta.y) < 0.01 )
            {
                nextPts[ptidx] -= delta*0.5f;
                break;
            }
            prevDelta = delta;
        }

        if( status[ptidx] && err && level == 0 && (flags & CV_LKFLOW_GET_MIN_EIGENVALS) == 0 )
        {
            Point2f nextPoint = nextPts[ptidx] - halfWin;
            Point inextPoint;

            inextPoint.x = cvFloor(nextPoint.x);
            inextPoint.y = cvFloor(nextPoint.y);

            if( inextPoint.x < -winSize.width || inextPoint.x >= J.cols ||
                inextPoint.y < -winSize.height || inextPoint.y >= J.rows )
            {
                if( status )
                    status[ptidx] = false;
                continue;
            }

            float aa = nextPoint.x - inextPoint.x;
            float bb = nextPoint.y - inextPoint.y;
            iw00 = cvRound((1.f - aa)*(1.f - bb)*(1 << W_BITS));
            iw01 = cvRound(aa*(1.f - bb)*(1 << W_BITS));
            iw10 = cvRound((1.f - aa)*bb*(1 << W_BITS));
            iw11 = (1 << W_BITS) - iw00 - iw01 - iw10;
            float errval = 0.f;

            for( y = 0; y < winSize.height; y++ )
            {
                const uchar* Jptr = (const uchar*)J.data + (y + inextPoint.y)*stepJ + inextPoint.x*cn;
                const deriv_type* Iptr = (const deriv_type*)(IWinBuf.data + y*IWinBuf.step);

                for( x = 0; x < winSize.width*cn; x++ )
                {
                    int diff = CV_DESCALE(Jptr[x]*iw00 + Jptr[x+cn]*iw01 +
                                          Jptr[x+stepJ]*iw10 + Jptr[x+stepJ+cn]*iw11,
                                          W_BITS1-5) - Iptr[x];
                    errval += std::abs((float)diff);
                }
            }
            err[ptidx] = errval * 1.f/(32*winSize.width*cn*winSize.height);
        }
    }
}


* Re: Compiler optimizing variables in inline assembly
  2014-02-19 19:05 Compiler optimizing variables in inline assembly Cody Rigney
@ 2014-02-20  9:14 ` Andrew Haley
  2014-02-20 19:30   ` Cody Rigney
  2014-02-20  9:54 ` David Brown
  1 sibling, 1 reply; 16+ messages in thread
From: Andrew Haley @ 2014-02-20  9:14 UTC (permalink / raw)
  To: Cody Rigney, gcc-help

Hi,

On 02/19/2014 07:04 PM, Cody Rigney wrote:
> I'm trying to add NEON optimizations to OpenCV's LK optical flow.  See
> link below.
> https://github.com/Itseez/opencv/blob/2.4/modules/video/src/lkpyramid.cpp
> 
> The gcc version could vary since this is an open source project, but
> the one I'm currently using is 4.8.1. The target architecture is ARMv7
> w/ NEON. The processor I'm testing on is an ARM
> Cortex-A15(big.LITTLE).
> 
> The problem is, in release mode (where optimizations are set) it does
> not work properly. However, in debug mode, it works fine. I tracked
> down a specific variable(FLT_SCALE) that was being optimized out and
> made it volatile and that part worked fine after that. However, I'm
> still having incorrect behavior from some other optimization.

Forget about using volatile here.  That's just wrong.

You have to mark your inputs, outputs, and clobbers correctly.

Look at this asm:

               __asm__ volatile (
                                  "vld1.16 {q0}, [%0]\n\t" //trow0[x + cn]
                                  "vld1.16 {q1}, [%1]\n\t" //trow0[x - cn]
                                  "vsub.i16 q5, q0, q1\n\t" //this is t0
                                  "vld1.16 {q2}, [%2]\n\t" //trow1[x + cn]
                                  "vld1.16 {q3}, [%3]\n\t" //trow1[x - cn]
                                  "vadd.i16 q6, q2, q3\n\t" //this
needs mult by 3
                                  "vld1.16 {q4}, [%4]\n\t" //trow1[x]
                                  "vmul.i16 q7, q6, q8\n\t" //this
needs to add to trow1[x]*10
                                  "vmul.i16 q10, q4, q9\n\t" //this is
trow1[x]*10
                                  "vadd.i16 q11, q7, q10\n\t" //this is t1
                                  "vswp d22, d11\n\t"
                                  "vst2.16 {q5}, [%5]\n\t" //interleave
                                  "vst2.16 {q11}, [%6]\n\t" //interleave
                                  :
                                  : "r" (trow0 + x + cn),  //0
                                    "r" (trow0 + x - cn),  //1
                                    "r" (trow1 + x + cn),  //2
                                    "r" (trow1 + x - cn),  //3
                                    "r" (trow1 + x),       //4
                                    "r" (drow + (x*2)),     //5
                                    "r" (drow + (x*2)+8)   //6
                                  :
                                  );

It has no outputs.  How is this possible?  It does a lot of work, so it
must have some outputs.  I think they are memory outputs.

Go through all the asm blocks and mark the inputs, outputs, and clobbers.
Then it should work.
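
As a rough illustration (a hypothetical helper, not a drop-in fix for the
block above -- the 8-lane width and int16_t element type are assumptions),
the shape you want is: store targets as memory outputs, load sources as
memory inputs, and every register the template touches in the clobber list:

#include <stdint.h>

static inline void vadd8_s16(const int16_t *a, const int16_t *b, int16_t *out)
{
    __asm__ ("vld1.16 {q0}, [%[pa]]\n\t"
             "vld1.16 {q1}, [%[pb]]\n\t"
             "vadd.i16 q0, q0, q1\n\t"
             "vst1.16 {q0}, [%[po]]\n\t"
             : "=m" (*(int16_t (*)[8]) out)        /* bytes the asm writes */
             : [pa] "r" (a), [pb] "r" (b), [po] "r" (out),
               "m" (*(const int16_t (*)[8]) a),    /* bytes the asm reads  */
               "m" (*(const int16_t (*)[8]) b)
             : "q0", "q1");                        /* registers it touches */
}

Once the operands are described like that, the compiler knows exactly what
the asm consumes and produces, and no volatile tricks are needed.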

Remember one basic thing: you must tell GCC about everything that an
asm does.  If it affects memory, you must tell GCC.  If it reads memory,
you must tell GCC.  Do not lie to the compiler: it will bite you.

Andrew.


* Re: Compiler optimizing variables in inline assembly
  2014-02-19 19:05 Compiler optimizing variables in inline assembly Cody Rigney
  2014-02-20  9:14 ` Andrew Haley
@ 2014-02-20  9:54 ` David Brown
  2014-02-20 19:39   ` Cody Rigney
  1 sibling, 1 reply; 16+ messages in thread
From: David Brown @ 2014-02-20  9:54 UTC (permalink / raw)
  To: Cody Rigney, gcc-help

Hi,

I haven't read through the code at all, but I will give you a little
general advice.

Try to cut the code to the absolute minimum that shows the problem.  It
makes it easier for you to work with and check, and it makes it easier
for other people to examine.  Also make sure that the code has no other
dependencies such as extra headers - ideally people should be able to
compile the code themselves and test it (I realise this is difficult for
those who don't have an ARM handy).

Code that works without optimisation but fails with optimisation, or
that works when you make a variable volatile, is always a bug.
Occasionally, it is a bug in the compiler - but most often it is a bug
in the code.  Either way, it is important to figure out the root cause,
and not try to hide it by making things volatile (though that might be a
good temporary fix for a compiler bug).

I am not familiar with Neon (and not as good as I should be at ARM
assembly in general), but it looks to me that you have used specific
registers in your inline assembly, and assumed specific registers for
compiler use (such as variables).  Don't do that.  When you have turned
off all optimisation, the compiler is consistent about which registers
it uses for different purposes - when optimising, it changes register
usage in a very unpredictable way.  You must be explicit - all data
going into your assembly must be declared, as must all data coming out
of the assembly.  And if you use specific registers, you need to tell
the compiler about them (as "clobbers") - and be aware that the compiler
might be using those registers for the input or output values.

Getting inline assembly right is not easy, and it is often best to work
with several small assembly statements rather than large ones - I
usually make a "static inline" function around a line or two of inline
assembly and then use that function in the code as needed.  It can make
the result a lot clearer, and makes it easier to mix the C and assembly
- the end result is often better than I would make in pure assembly.

Finally, is there a good reason why you need inline assembly rather than
the neon intrinsics provided by gcc?

<http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html>


mvh.,

David




On 19/02/14 20:04, Cody Rigney wrote:
> Hi,
> 
> I'm trying to add NEON optimizations to OpenCV's LK optical flow.  See
> link below.
> https://github.com/Itseez/opencv/blob/2.4/modules/video/src/lkpyramid.cpp
> 
> The gcc version could vary since this is an open source project, but
> the one I'm currently using is 4.8.1. The target architecture is ARMv7
> w/ NEON. The processor I'm testing on is an ARM
> Cortex-A15(big.LITTLE).
> 
> The problem is, in release mode (where optimizations are set) it does
> not work properly. However, in debug mode, it works fine. I tracked
> down a specific variable(FLT_SCALE) that was being optimized out and
> made it volatile and that part worked fine after that. However, I'm
> still having incorrect behavior from some other optimization.  I'm new
> to inline assembly, so I thought maybe I'm doing something wrong
> that's not telling the compiler that I'm using a certain variable.
> 
> Below is the code at its current state. Ignore all the comments and
> volatiles(for testing this problem) everywhere. It's WIP. I removed
> unnecessary functions and code so it would be easier to see. I think
> the problem is in the bottom-most asm block because if I do if(false)
> to skip it, I don't run into the problem. Thanks.
> 

<snip>



* Re: Compiler optimizing variables in inline assembly
  2014-02-20  9:14 ` Andrew Haley
@ 2014-02-20 19:30   ` Cody Rigney
  2014-02-21  9:53     ` Andrew Haley
                       ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Cody Rigney @ 2014-02-20 19:30 UTC (permalink / raw)
  To: Andrew Haley; +Cc: gcc-help

That makes sense.  In this case, the input parameters are actually
memory addresses.  So how would I do an output or clobber that would
tell the compiler that the memory at those addresses will change?

Thanks for your time,

Cody

On Thu, Feb 20, 2014 at 4:14 AM, Andrew Haley <aph@redhat.com> wrote:
> Hi,
>
> On 02/19/2014 07:04 PM, Cody Rigney wrote:
>> I'm trying to add NEON optimizations to OpenCV's LK optical flow.  See
>> link below.
>> https://github.com/Itseez/opencv/blob/2.4/modules/video/src/lkpyramid.cpp
>>
>> The gcc version could vary since this is an open source project, but
>> the one I'm currently using is 4.8.1. The target architecture is ARMv7
>> w/ NEON. The processor I'm testing on is an ARM
>> Cortex-A15(big.LITTLE).
>>
>> The problem is, in release mode (where optimizations are set) it does
>> not work properly. However, in debug mode, it works fine. I tracked
>> down a specific variable(FLT_SCALE) that was being optimized out and
>> made it volatile and that part worked fine after that. However, I'm
>> still having incorrect behavior from some other optimization.
>
> Forget about using volatile here.  That's just wrong.
>
> You have to mark your inputs, outputs, and clobbers correctly.
>
> Look at this asm:
>
>                __asm__ volatile (
>                                   "vld1.16 {q0}, [%0]\n\t" //trow0[x + cn]
>                                   "vld1.16 {q1}, [%1]\n\t" //trow0[x - cn]
>                                   "vsub.i16 q5, q0, q1\n\t" //this is t0
>                                   "vld1.16 {q2}, [%2]\n\t" //trow1[x + cn]
>                                   "vld1.16 {q3}, [%3]\n\t" //trow1[x - cn]
>                                   "vadd.i16 q6, q2, q3\n\t" //this needs mult by 3
>                                   "vld1.16 {q4}, [%4]\n\t" //trow1[x]
>                                   "vmul.i16 q7, q6, q8\n\t" //this needs to add to trow1[x]*10
>                                   "vmul.i16 q10, q4, q9\n\t" //this is trow1[x]*10
>                                   "vadd.i16 q11, q7, q10\n\t" //this is t1
>                                   "vswp d22, d11\n\t"
>                                   "vst2.16 {q5}, [%5]\n\t" //interleave
>                                   "vst2.16 {q11}, [%6]\n\t" //interleave
>                                   :
>                                   : "r" (trow0 + x + cn),  //0
>                                     "r" (trow0 + x - cn),  //1
>                                     "r" (trow1 + x + cn),  //2
>                                     "r" (trow1 + x - cn),  //3
>                                     "r" (trow1 + x),       //4
>                                     "r" (drow + (x*2)),     //5
>                                     "r" (drow + (x*2)+8)   //6
>                                   :
>                                   );
>
> It has no outputs.  How is this possible?  It does a lot of work, so it
> must have some outputs.  I think they are memory outputs.
>
> Go through all the asm blocks and mark the inputs, outputs, and clobbers.
> Then it should work.
>
> Remember one basic thing: you must tell GCC about everything that an
> asm does.  If it affects memory, you must tell GCC.  If it reads memory,
> you must tell GCC.  Do not lie to the compiler: it will bite you.
>
> Andrew.
>


* Re: Compiler optimizing variables in inline assembly
  2014-02-20  9:54 ` David Brown
@ 2014-02-20 19:39   ` Cody Rigney
  2014-02-21 10:15     ` David Brown
  0 siblings, 1 reply; 16+ messages in thread
From: Cody Rigney @ 2014-02-20 19:39 UTC (permalink / raw)
  To: David Brown; +Cc: gcc-help

Thanks for the advice.  I didn't realize before that volatile was
actually hiding the problem.

Do you mind providing an example of what you mean by using a "static
inline" function?  That sounds like a better way of managing the
assembly.  I know what you mean, but I would like to see an example of
the details (like passing parameters, etc).

Initially, I began writing the NEON acceleration in intrinsics.
Then, I read more and more about NEON intrinsics being much slower
when compiled with gcc, due to some stack pops and pushes that fill it
up.  Apparently, the Microsoft ARM compiler and Apple's ARM compiler
do well with NEON intrinsics, but GCC does not. So I switched to
inline assembly.  I haven't actually tested this myself, but since
OpenCV is cross-platform, I wanted to make the acceleration work
cross-platform in the fastest way.

Thanks,

Cody

On Thu, Feb 20, 2014 at 4:54 AM, David Brown <david@westcontrol.com> wrote:
> Hi,
>
> I haven't read through the code at all, but I will give you a little
> general advice.
>
> Try to cut the code to the absolute minimum that shows the problem.  It
> makes it easier for you to work with and check, and it makes it easier
> for other people to examine.  Also make sure that the code has no other
> dependencies such as extra headers - ideally people should be able to
> compile the code themselves and test it (I realise this is difficult for
> those who don't have an ARM handy).
>
> Code that works without optimisation but fails with optimisation, or
> that works when you make a variable volatile, is always a bug.
> Occasionally, it is a bug in the compiler - but most often it is a bug
> in the code.  Either way, it is important to figure out the root cause,
> and not try to hide it by making things volatile (though that might be a
> good temporary fix for a compiler bug).
>
> I am not familiar with Neon (and not as good as I should be at ARM
> assembly in general), but it looks to me that you have used specific
> registers in your inline assembly, and assumed specific registers for
> compiler use (such as variables).  Don't do that.  When you have turned
> off all optimisation, the compiler is consistent about which registers
> it uses for different purposes - when optimising, it changes register
> usage in a very unpredictable way.  You must be explicit - all data
> going into your assembly must be declared, as must all data coming out
> of the assembly.  And if you use specific registers, you need to tell
> the compiler about them (as "clobbers") - and be aware that the compiler
> might be using those registers for the input or output values.
>
> Getting inline assembly right is not easy, and it is often best to work
> with several small assembly statements rather than large ones - I
> usually make a "static inline" function around a line or two of inline
> assembly and then use that function in the code as needed.  It can make
> the result a lot clearer, and makes it easier to mix the C and assembly
> - the end result is often better than I would make in pure assembly.
>
> Finally, is there a good reason why you need inline assembly rather than
> the neon intrinsics provided by gcc?
>
> <http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html>
>
>
> mvh.,
>
> David
>
>
>
>
> On 19/02/14 20:04, Cody Rigney wrote:
>> Hi,
>>
>> I'm trying to add NEON optimizations to OpenCV's LK optical flow.  See
>> link below.
>> https://github.com/Itseez/opencv/blob/2.4/modules/video/src/lkpyramid.cpp
>>
>> The gcc version could vary since this is an open source project, but
>> the one I'm currently using is 4.8.1. The target architecture is ARMv7
>> w/ NEON. The processor I'm testing on is an ARM
>> Cortex-A15(big.LITTLE).
>>
>> The problem is, in release mode (where optimizations are set) it does
>> not work properly. However, in debug mode, it works fine. I tracked
>> down a specific variable(FLT_SCALE) that was being optimized out and
>> made it volatile and that part worked fine after that. However, I'm
>> still having incorrect behavior from some other optimization.  I'm new
>> to inline assembly, so I thought maybe I'm doing something wrong
>> that's not telling the compiler that I'm using a certain variable.
>>
>> Below is the code at its current state. Ignore all the comments and
>> volatiles(for testing this problem) everywhere. It's WIP. I removed
>> unnecessary functions and code so it would be easier to see. I think
>> the problem is in the bottom-most asm block because if I do if(false)
>> to skip it, I don't run into the problem. Thanks.
>>
>
> <snip>
>
>


* Re: Compiler optimizing variables in inline assembly
  2014-02-20 19:30   ` Cody Rigney
@ 2014-02-21  9:53     ` Andrew Haley
  2014-02-21 14:06       ` Cody Rigney
  2014-02-21  9:54     ` David Brown
  2014-02-21  9:55     ` David Brown
  2 siblings, 1 reply; 16+ messages in thread
From: Andrew Haley @ 2014-02-21  9:53 UTC (permalink / raw)
  To: Cody Rigney; +Cc: gcc-help

On 02/20/2014 07:29 PM, Cody Rigney wrote:
> On Thu, Feb 20, 2014 at 4:14 AM, Andrew Haley <aph@redhat.com> wrote:
>> Hi,
>>
>> On 02/19/2014 07:04 PM, Cody Rigney wrote:
>>> I'm trying to add NEON optimizations to OpenCV's LK optical flow.  See
>>> link below.
>>> https://github.com/Itseez/opencv/blob/2.4/modules/video/src/lkpyramid.cpp
>>>
>>> The gcc version could vary since this is an open source project, but
>>> the one I'm currently using is 4.8.1. The target architecture is ARMv7
>>> w/ NEON. The processor I'm testing on is an ARM
>>> Cortex-A15(big.LITTLE).
>>>
>>> The problem is, in release mode (where optimizations are set) it does
>>> not work properly. However, in debug mode, it works fine. I tracked
>>> down a specific variable(FLT_SCALE) that was being optimized out and
>>> made it volatile and that part worked fine after that. However, I'm
>>> still having incorrect behavior from some other optimization.
>>
>> Forget about using volatile here.  That's just wrong.
>>
>> You have to mark your inputs, outputs, and clobbers correctly.
>>
> That makes sense.  In this case, the input parameters are actually
> memory addresses.  So how would I do an output or clobber that would
> tell the compiler that the memory at those addresses will change?

You can use a memory operand as an output, as in "=m"(*a) or simply
add "memory" to the clobber list.  And you must add all clobbered
registers to the clobber list.  Then it should work.
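
The "memory" clobber route looks like this (a sketch only, with a made-up
helper; it is correct but pessimistic, since GCC must then assume any object
in memory may have been read or written by the asm):

#include <stdint.h>

static inline void copy8_s16(int16_t *dst, const int16_t *src)
{
    __asm__ volatile ("vld1.16 {q0}, [%[s]]\n\t"
                      "vst1.16 {q0}, [%[d]]\n\t"
                      :
                      : [s] "r" (src), [d] "r" (dst)
                      : "q0", "memory");
}

The "=m" form is more precise and usually lets the optimizer do a better
job around the asm.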

Andrew.



* Re: Compiler optimizing variables in inline assembly
  2014-02-20 19:30   ` Cody Rigney
  2014-02-21  9:53     ` Andrew Haley
@ 2014-02-21  9:54     ` David Brown
  2014-02-21  9:55     ` David Brown
  2 siblings, 0 replies; 16+ messages in thread
From: David Brown @ 2014-02-21  9:54 UTC (permalink / raw)
  To: gcc-help

On 20/02/14 20:29, Cody Rigney wrote:
> That makes sense.  In this case, the input parameters are actually
> memory addresses.  So how would I do an output or clobber that would
> tell the compiler that the memory at those addresses will change?
> 

Easy - you specify "memory" in your clobber list!

> Thanks for your time,
> 
> Cody
> 



* Re: Compiler optimizing variables in inline assembly
  2014-02-20 19:30   ` Cody Rigney
  2014-02-21  9:53     ` Andrew Haley
  2014-02-21  9:54     ` David Brown
@ 2014-02-21  9:55     ` David Brown
  2 siblings, 0 replies; 16+ messages in thread
From: David Brown @ 2014-02-21  9:55 UTC (permalink / raw)
  To: Cody Rigney, Andrew Haley; +Cc: gcc-help

On 20/02/14 20:29, Cody Rigney wrote:
> That makes sense.  In this case, the input parameters are actually
> memory addresses.  So how would I do an output or clobber that would
> tell the compiler that the memory at those addresses will change?
> 

Easy - you specify "memory" in your clobber list!

> Thanks for your time,
> 
> Cody
> 

(I forgot the "reply all" to the list - sorry for the duplicate.)


* Re: Compiler optimizing variables in inline assembly
  2014-02-20 19:39   ` Cody Rigney
@ 2014-02-21 10:15     ` David Brown
  2014-02-21 14:11       ` Cody Rigney
  0 siblings, 1 reply; 16+ messages in thread
From: David Brown @ 2014-02-21 10:15 UTC (permalink / raw)
  To: Cody Rigney; +Cc: gcc-help

On 20/02/14 20:39, Cody Rigney wrote:
> Thanks for the advice.  I didn't realize before that volatile was
> actually hiding the problem.
> 
> Do you mind providing an example of what you mean by using a "static
> inline" function?  That sounds like a better way of managing the
> assembly.  I know what you mean, but I would like to see an example of
> the details (like passing parameters, etc).

Suppose you have an assembly instruction "foo <dest> <src>".  You would
write something like this:

static inline uint32_t foo(uint32_t x) {
	uint32_t y;
	asm (" foo %[dest], %[src] " : [dest] "=r" (y) : [src] "r" (x));
	return y;
}

Then you would use it in code as "y = foo(x);" and the compiler would
put in the single assembly line (plus any code needed to put x into a
register - let the compiler handle that sort of thing).

This lets you keep the messy inline assembly stuff separate from the
algorithm code that actually uses it.
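
For example (a sketch; assumes <stdint.h>/<stddef.h> for the types), the
algorithm code then stays plain C and never mentions registers or
constraints:

void scale_all(uint32_t *v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        v[i] = foo(v[i]);   /* foo() is the static inline wrapper above */
}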


> 
> Initially, I began writing the NEON acceleration in intrinsics.
> Then, I read more and more about NEON intrinsics being much slower
> when compiled with gcc, due to some stack pops and pushes that fill it
> up.  Apparently, the Microsoft ARM compiler and Apple's ARM compiler
> do well with NEON intrinsics, but GCC does not. So I switched to
> inline assembly.  I haven't actually tested this myself, but since
> OpenCV is cross-platform, I wanted to make the acceleration work
> cross-platform in the fastest way.

Don't believe random stuff you read on the internet about compiler
speeds - test it yourself.  One key reason is that the internet never
forgets, and information is seldom dated - perhaps the intrinsics /were/
slow when first introduced in gcc 4.3 (or whatever), but they could be
much faster with 4.8.  The other issue is that you have to have
optimisation enabled (sometimes even -O3 or extra specific optimisation
flags, and often -ffast-math) to get the kind of scheduling, loop
unrolling, and other optimisations needed to get the best out of NEON.
The internet is full of people compiling without optimisation and then
complaining about the slow code.  It is even conceivable that the latest
gcc advances in auto-vectorisation can generate good enough neon code
without using intrinsics or inline assembly (I don't know if the
auto-vectorisation stuff supports ARM/NEON yet).

So make /small/ test cases that let you see exactly what is happening.
Use the intrinsics, pull things out into small and clear functions (if
they are "static" then the compiler will be able to inline them) so that
you can separate your logic from the low-level mechanics, and examine
the generated assembly for the critical parts.  Only go for inline
assembly if it will really make a difference.
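
For instance, the vertical-convolution step from the original post could be
tried with intrinsics along these lines (an untested sketch; it assumes
deriv_type is a 16-bit int, and the helper name is made up):

#include <arm_neon.h>
#include <stdint.h>

/* One 8-pixel step of t0 = (srow0 + srow2)*3 + srow1*10, t1 = srow2 - srow0. */
static inline void vert_conv8(const uint8_t *s0, const uint8_t *s1,
                              const uint8_t *s2, int16_t *t0, int16_t *t1)
{
    uint8x8_t  r0 = vld1_u8(s0), r1 = vld1_u8(s1), r2 = vld1_u8(s2);
    uint16x8_t v0 = vmlal_u8(vmulq_n_u16(vaddl_u8(r0, r2), 3),
                             r1, vdup_n_u8(10));
    vst1q_s16(t0, vreinterpretq_s16_u16(v0));
    vst1q_s16(t1, vreinterpretq_s16_u16(vsubl_u8(r2, r0)));
}

The compiler then tracks every input and output itself, so none of the
constraint or clobber questions arise in the first place.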

mvh.,

David


> 
> Thanks,
> 
> Cody
> 
> On Thu, Feb 20, 2014 at 4:54 AM, David Brown <david@westcontrol.com> wrote:
>> Hi,
>>
>> I haven't read through the code at all, but I will give you a little
>> general advice.
>>
>> Try to cut the code to the absolute minimum that shows the problem.  It
>> makes it easier for you to work with and check, and it makes it easier
>> for other people to examine.  Also make sure that the code has no other
>> dependencies such as extra headers - ideally people should be able to
>> compile the code themselves and test it (I realise this is difficult for
>> those who don't have an ARM handy).
>>
>> Code that works without optimisation but fails with optimisation, or
>> that works when you make a variable volatile, is always a bug.
>> Occasionally, it is a bug in the compiler - but most often it is a bug
>> in the code.  Either way, it is important to figure out the root cause,
>> and not try to hide it by making things volatile (though that might be a
>> good temporary fix for a compiler bug).
>>
>> I am not familiar with Neon (and not as good as I should be at ARM
>> assembly in general), but it looks to me that you have used specific
>> registers in your inline assembly, and assumed specific registers for
>> compiler use (such as variables).  Don't do that.  When you have turned
>> off all optimisation, the compiler is consistent about which registers
>> it uses for different purposes - when optimising, it changes register
>> usage in a very unpredictable way.  You must be explicit - all data
>> going into your assembly must be declared, as must all data coming out
>> of the assembly.  And if you use specific registers, you need to tell
>> the compiler about them (as "clobbers") - and be aware that the compiler
>> might be using those registers for the input or output values.
>>
>> Getting inline assembly right is not easy, and it is often best to work
>> with several small assembly statements rather than large ones - I
>> usually make a "static inline" function around a line or two of inline
>> assembly and then use that function in the code as needed.  It can make
>> the result a lot clearer, and makes it easier to mix the C and assembly
>> - the end result is often better than I would make in pure assembly.
>>
>> Finally, is there a good reason why you need inline assembly rather than
>> the neon intrinsics provided by gcc?
>>
>> <http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html>
>>
>>
>> mvh.,
>>
>> David
>>
>>
>>
>>
>> On 19/02/14 20:04, Cody Rigney wrote:
>>> Hi,
>>>
>>> I'm trying to add NEON optimizations to OpenCV's LK optical flow.  See
>>> link below.
>>> https://github.com/Itseez/opencv/blob/2.4/modules/video/src/lkpyramid.cpp
>>>
>>> The gcc version could vary since this is an open source project, but
>>> the one I'm currently using is 4.8.1. The target architecture is ARMv7
>>> w/ NEON. The processor I'm testing on is an ARM
>>> Cortex-A15(big.LITTLE).
>>>
>>> The problem is, in release mode (where optimizations are set) it does
>>> not work properly. However, in debug mode, it works fine. I tracked
>>> down a specific variable(FLT_SCALE) that was being optimized out and
>>> made it volatile and that part worked fine after that. However, I'm
>>> still having incorrect behavior from some other optimization.  I'm new
>>> to inline assembly, so I thought maybe I'm doing something wrong
>>> that's not telling the compiler that I'm using a certain variable.
>>>
>>> Below is the code at its current state. Ignore all the comments and
>>> volatiles(for testing this problem) everywhere. It's WIP. I removed
>>> unnecessary functions and code so it would be easier to see. I think
>>> the problem is in the bottom-most asm block because if I do if(false)
>>> to skip it, I don't run into the problem. Thanks.
>>>
>>
>> <snip>
>>
>>


* Re: Compiler optimizing variables in inline assembly
  2014-02-21  9:53     ` Andrew Haley
@ 2014-02-21 14:06       ` Cody Rigney
  2014-02-21 15:02         ` Andrew Haley
  0 siblings, 1 reply; 16+ messages in thread
From: Cody Rigney @ 2014-02-21 14:06 UTC (permalink / raw)
  To: Andrew Haley; +Cc: gcc-help

I added all the registers and "memory" to clobber and it worked!
Since I think it's more efficient to specify which memory is changing,
I'm going to give the "=m" (*a) a shot.  So even though you
dereference the pointer, gcc will know the length of the memory(say 2
bytes) that changes?  Or do you have to specify each index of the
memory(e.g. char a[2]; ..... asm ... "=m" (*a), "=m" (*(a+1)))?

On Fri, Feb 21, 2014 at 4:53 AM, Andrew Haley <aph@redhat.com> wrote:
> On 02/20/2014 07:29 PM, Cody Rigney wrote:
>> On Thu, Feb 20, 2014 at 4:14 AM, Andrew Haley <aph@redhat.com> wrote:
>>> Hi,
>>>
>>> On 02/19/2014 07:04 PM, Cody Rigney wrote:
>>>> I'm trying to add NEON optimizations to OpenCV's LK optical flow.  See
>>>> link below.
>>>> https://github.com/Itseez/opencv/blob/2.4/modules/video/src/lkpyramid.cpp
>>>>
>>>> The gcc version could vary since this is an open source project, but
>>>> the one I'm currently using is 4.8.1. The target architecture is ARMv7
>>>> w/ NEON. The processor I'm testing on is an ARM
>>>> Cortex-A15(big.LITTLE).
>>>>
>>>> The problem is, in release mode (where optimizations are set) it does
>>>> not work properly. However, in debug mode, it works fine. I tracked
>>>> down a specific variable(FLT_SCALE) that was being optimized out and
>>>> made it volatile and that part worked fine after that. However, I'm
>>>> still having incorrect behavior from some other optimization.
>>>
>>> Forget about using volatile here.  That's just wrong.
>>>
>>> You have to mark your inputs, outputs, and clobbers correctly.
>>>
>> That makes sense.  In this case, the input parameters are actually
>> memory addresses.  So how would I do an output or clobber that would
>> tell the compiler that the memory at those addresses will change?
>
> You can use a memory operand as an output, as in "=m"(*a) or simply
> add "memory" to the clobber list.  And you must add all clobbered
> registers to the clobber list.  Then it should work.
>
> Andrew.
>
>


* Re: Compiler optimizing variables in inline assembly
  2014-02-21 10:15     ` David Brown
@ 2014-02-21 14:11       ` Cody Rigney
  0 siblings, 0 replies; 16+ messages in thread
From: Cody Rigney @ 2014-02-21 14:11 UTC (permalink / raw)
  To: David Brown; +Cc: gcc-help

Thanks for the example code snippet.

As far as the intrinsics go, I looked a little deeper.  Apparently,
gcc has some bugs with producing optimal NEON assembly per the link
below.  Some of these bugs have been resolved, others not.  I guess
for now I'll stick with assembly until I know these have been
resolved.  Although I'm not sure how the developers of OpenCV would
feel about inline assembly in order to merge a pull request.  If it
won't work for them, I'll just switch to intrinsics.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47562


On Fri, Feb 21, 2014 at 5:14 AM, David Brown <david@westcontrol.com> wrote:
> On 20/02/14 20:39, Cody Rigney wrote:
>> Thanks for the advice.  I didn't realize before that volatile was
>> actually hiding the problem.
>>
>> Do you mind providing an example of what you mean by using a "static
>> inline" function?  That sounds like a better way of managing the
>> assembly.  I know what you mean, but I would like to see an example of
>> the details (like passing parameters, etc).
>
> Suppose you have an assembly instruction "foo <dest> <src>".  You would
> write something like this:
>
> static inline uint32_t foo(uint32_t x) {
>         uint32_t y;
>         asm (" foo %[dest], %[src] " : [dest] "=r" (y) : [src] "r" (x));
>         return y;
> }
>
> Then you would use it in code as "y = foo(x);" and the compiler would
> put in the single assembly line (plus any code needed to put x into a
> register - let the compiler handle that sort of thing).
>
> This lets you keep the messy inline assembly stuff separate from the
> algorithm code that actually uses it.
>
>
>>
>> Initially, I began writing the NEON acceleration in intrinsics.
>> Then, I read more and more about NEON intrinsics being much slower
>> when compiled with gcc, due to some stack pops and pushes that fill it
>> up.  Apparently, the Microsoft ARM compiler and Apple's ARM compiler
>> do well with NEON intrinsics, but GCC does not. So I switched to
>> inline assembly.  I haven't actually tested this myself, but since
>> OpenCV is cross-platform, I wanted to make the acceleration work
>> cross-platform in the fastest way.
>
> Don't believe random stuff you read on the internet about compiler
> speeds - test it yourself.  One key reason is that the internet never
> forgets, and information is seldom dated - perhaps the intrinsics /were/
> slow when first introduced in gcc 4.3 (or whatever), but they could be
> much faster with 4.8.  The other issue is that you have to have
> optimisation enabled (sometimes even -O3 or extra specific optimisation
> flags, and often -ffast-math) to get the kind of scheduling, loop
> unrolling, and other optimisations needed to get the best out of NEON.
> The internet is full of people compiling without optimisation and then
> complaining about the slow code.  It is even conceivable that the latest
> gcc advances in auto-vectorisation can generate good enough neon code
> without using intrinsics or inline assembly (I don't know if the
> auto-vectorisation stuff supports ARM/NEON yet).
>
> So make /small/ test cases that let you see exactly what is happening.
> Use the intrinsics, pull things out into small and clear functions (if
> they are "static" then the compiler will be able to inline them) so that
> you can separate your logic from the low-level mechanics, and examine
> the generated assembly for the critical parts.  Only go for inline
> assembly if it will really make a difference.
>
> mvh.,
>
> David
>
>
>>
>> Thanks,
>>
>> Cody
>>
>> On Thu, Feb 20, 2014 at 4:54 AM, David Brown <david@westcontrol.com> wrote:
>>> Hi,
>>>
>>> I haven't read through the code at all, but I will give you a little
>>> general advice.
>>>
>>> Try to cut the code to the absolute minimum that shows the problem.  It
>>> makes it easier for you to work with and check, and it makes it easier
>>> for other people to examine.  Also make sure that the code has no other
>>> dependencies such as extra headers - ideally people should be able to
>>> compile the code themselves and test it (I realise this is difficult for
>>> those who don't have an ARM handy).
>>>
>>> Code that works without optimisation but fails with optimisation, or
>>> that works when you make a variable volatile, is always a bug.
>>> Occasionally, it is a bug in the compiler - but most often it is a bug
>>> in the code.  Either way, it is important to figure out the root cause,
>>> and not try to hide it by making things volatile (though that might be a
>>> good temporary fix for a compiler bug).
>>>
>>> I am not familiar with Neon (and not as good as I should be at ARM
>>> assembly in general), but it looks to me that you have used specific
>>> registers in your inline assembly, and assumed specific registers for
>>> compiler use (such as variables).  Don't do that.  When you have turned
>>> off all optimisation, the compiler is consistent about which registers
>>> it uses for different purposes - when optimising, it changes register
>>> usage in a very unpredictable way.  You must be explicit - all data
>>> going into your assembly must be declared, as must all data coming out
>>> of the assembly.  And if you use specific registers, you need to tell
>>> the compiler about them (as "clobbers") - and be aware that the compiler
>>> might be using those registers for the input or output values.
>>>
>>> Getting inline assembly right is not easy, and it is often best to work
>>> with several small assembly statements rather than large ones - I
>>> usually make a "static inline" function around a line or two of inline
>>> assembly and then use that function in the code as needed.  It can make
>>> the result a lot clearer, and makes it easier to mix the C and assembly
>>> - the end result is often better than I would make in pure assembly.
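
A hedged sketch of that pattern (names and the chosen instructions are illustrative,
not the poster's code): the same widening add as in the intrinsics sketch earlier, but
written as a short asm statement inside a static inline function, with every value the
asm reads declared as an input and every register or memory location it writes declared
as an output or clobber:

    #include <stdint.h>

    static inline void widen_add_row_asm(const uint8_t *a, const uint8_t *b,
                                         uint16_t *out)
    {
        __asm__ ("vld1.8   {d0}, [%[a]]\n\t"
                 "vld1.8   {d1}, [%[b]]\n\t"
                 "vaddl.u8 q1, d0, d1\n\t"
                 "vst1.16  {q1}, [%[out]]\n\t"
                 : /* no register outputs */
                 : [a] "r" (a), [b] "r" (b), [out] "r" (out)
                 : "d0", "d1", "d2", "d3",  /* q1 is d2/d3            */
                   "memory");               /* the store through out  */
    }

Because the pointers arrive through named "r" operands and the hard-coded d0-d3 are in
the clobber list, the optimiser is free to choose whatever registers it likes for a, b
and out without silently colliding with the assembly.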
>>>
>>> Finally, is there a good reason why you need inline assembly rather than
>>> the neon intrinsics provided by gcc?
>>>
>>> <http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html>
>>>
>>>
>>> mvh.,
>>>
>>> David
>>>
>>>
>>>
>>>
>>> On 19/02/14 20:04, Cody Rigney wrote:
>>>> Hi,
>>>>
>>>> I'm trying to add NEON optimizations to OpenCV's LK optical flow.  See
>>>> link below.
>>>> https://github.com/Itseez/opencv/blob/2.4/modules/video/src/lkpyramid.cpp
>>>>
>>>> The gcc version could vary since this is an open source project, but
>>>> the one I'm currently using is 4.8.1. The target architecture is ARMv7
>>>> w/ NEON. The processor I'm testing on is an ARM
>>>> Cortex-A15(big.LITTLE).
>>>>
>>>> The problem is, in release mode (where optimizations are set) it does
>>>> not work properly. However, in debug mode, it works fine. I tracked
>>>> down a specific variable(FLT_SCALE) that was being optimized out and
>>>> made it volatile and that part worked fine after that. However, I'm
>>>> still having incorrect behavior from some other optimization.  I'm new
>>>> to inline assembly, so I thought maybe I'm doing something wrong
>>>> that's not telling the compiler that I'm using a certain variable.
>>>>
>>>> Below is the code at its current state. Ignore all the comments and
>>>> volatiles(for testing this problem) everywhere. It's WIP. I removed
>>>> unnecessary functions and code so it would be easier to see. I think
>>>> the problem is in the bottom-most asm block because if I do if(false)
>>>> to skip it, I don't run into the problem. Thanks.
>>>>
>>>
>>> <snip>
>>>
>>>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Compiler optimizing variables in inline assembly
  2014-02-21 14:06       ` Cody Rigney
@ 2014-02-21 15:02         ` Andrew Haley
  2014-02-21 15:20           ` Cody Rigney
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Haley @ 2014-02-21 15:02 UTC (permalink / raw)
  To: Cody Rigney; +Cc: gcc-help

On 02/21/2014 02:06 PM, Cody Rigney wrote:

> On Fri, Feb 21, 2014 at 4:53 AM, Andrew Haley <aph@redhat.com> wrote:
>>
>> You can use a memory operand as an output, as in "=m"(*a) or simply
>> add "memory" to the clobber list.  And you must add all clobbered
>> registers to the clobber list.  Then it should work.
>>
> I added all the registers and "memory" to clobber and it worked!
> Since I think it's more efficient to specify which memory is changing,
> I'm going to give the "=m" (*a) a shot.  So even though you
> dereference the pointer, gcc will know the length of the memory (say 2
> bytes) that changes?  Or do you have to specify each index of the
> memory (e.g. char a[2]; ..... asm ... "=m" (*a), "=m" (*(a+1)))?

It's hard for me to give a 100% answer to that one, but GCC has an
idea what memory is reachable from every pointer.  So, this won't
clobber memory that's unreachable or has a different type from that
pointer.  It probably doesn't matter.
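
A small sketch of the two alternatives being discussed, using a plain ARM store and
assumed names purely for illustration:

    #include <stdint.h>

    static inline void store_word(uint32_t *p, uint32_t v)
    {
        /* Option 1: "memory" clobber - GCC must assume any memory may change. */
        __asm__ ("str %[v], [%[p]]"
                 :
                 : [p] "r" (p), [v] "r" (v)
                 : "memory");

        /* Option 2: a specific "=m" output - GCC knows that exactly the
           uint32_t that p points to is written, and nothing else.          */
        __asm__ ("str %[v], %[dst]"
                 : [dst] "=m" (*p)
                 : [v] "r" (v));
    }

With option 2 the compiler may keep unrelated values cached in registers across the
asm; with option 1 it must discard anything it had cached from memory, which is safe
but can cost performance.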

Andrew.


P.S.  Please don't top-post on GCC lists.  :-)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Compiler optimizing variables in inline assembly
  2014-02-21 15:02         ` Andrew Haley
@ 2014-02-21 15:20           ` Cody Rigney
  2014-02-27 13:18             ` Cody Rigney
  0 siblings, 1 reply; 16+ messages in thread
From: Cody Rigney @ 2014-02-21 15:20 UTC (permalink / raw)
  To: Andrew Haley; +Cc: gcc-help

On Fri, Feb 21, 2014 at 10:02 AM, Andrew Haley <aph@redhat.com> wrote:
> It's hard for me to give a 100% answer to that one, but GCC has an
> idea what memory is reachable from every pointer.  So, this won't
> clobber memory that's unreachable or has a different type from that
> pointer.  It probably doesn't matter.
>

I see.  Good to know.


> P.S.  Please don't top-post on GCC lists.  :-)

I'm new to mailing lists.  Thanks for the heads up!

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Compiler optimizing variables in inline assembly
  2014-02-21 15:20           ` Cody Rigney
@ 2014-02-27 13:18             ` Cody Rigney
  2014-02-27 14:03               ` Andrew Haley
  0 siblings, 1 reply; 16+ messages in thread
From: Cody Rigney @ 2014-02-27 13:18 UTC (permalink / raw)
  To: Andrew Haley; +Cc: gcc-help

> On Fri, Feb 21, 2014 at 10:02 AM, Andrew Haley <aph@redhat.com> wrote:
>> It's hard for me to give a 100% answer to that one, but GCC has an
>> idea what memory is reachable from every pointer.  So, this won't
>> clobber memory that's unreachable or has a different type from that
>> pointer.  It probably doesn't matter.
>>

Doing "=m" (*a) in the output operand worked! And so did "memory" to
the clobber list. Thanks!

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Compiler optimizing variables in inline assembly
  2014-02-27 13:18             ` Cody Rigney
@ 2014-02-27 14:03               ` Andrew Haley
  2014-02-27 18:34                 ` Cody Rigney
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Haley @ 2014-02-27 14:03 UTC (permalink / raw)
  To: Cody Rigney; +Cc: gcc-help

On 02/27/2014 01:18 PM, Cody Rigney wrote:
>> On Fri, Feb 21, 2014 at 10:02 AM, Andrew Haley <aph@redhat.com> wrote:
>>> It's hard for me to give a 100% answer to that one, but GCC has an
>>> idea what memory is reachable from every pointer.  So, this won't
>>> clobber memory that's unreachable or has a different type from that
>>> pointer.  It probably doesn't matter.
>>>
> 
> Doing "=m" (*a) in the output operand worked! And so did "memory" to
> the clobber list. Thanks!

I have been informed that this is wrong, sorry.

The correct way to do it is like so:

---------------------------------------
If you know how large the accessed memory is, you can add it as input or
output but if this is not known, you should add `memory'.  As an example,
if you access ten bytes of a string, you can use a memory input like:

     {"m"( ({ struct { char x[10]; } *p = (void *)ptr ; *p; }) )}.

---------------------------------------

If you use a zero-length array in the struct (i.e. char x[0]) that will
effectively do the job if you don't know how large the array is, but there
seems to be some doubt whether this is guaranteed.  A memory clobber really
is guaranteed, but seems a bit like overkill.
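
One way to apply that idea to a 16-byte NEON store, sketched with assumed names (the
struct type stands in for the known access size; it is not taken from the GCC manual):

    #include <stdint.h>

    typedef struct { char x[16]; } Block16;   /* exactly the bytes touched */

    static inline void copy16(void *dst, const void *src)
    {
        __asm__ ("vld1.8 {q0}, [%[s]]\n\t"
                 "vst1.8 {q0}, [%[d]]\n\t"
                 : "=m" (*(Block16 *)dst)           /* 16 bytes written */
                 : [s] "r" (src), [d] "r" (dst),
                   "m" (*(const Block16 *)src)      /* 16 bytes read    */
                 : "d0", "d1");                     /* q0 is d0/d1      */
    }

The sized "m"/"=m" operands tell GCC exactly which 16 bytes the asm reads and writes,
so no blanket "memory" clobber is needed, while the plain "r" operands are still what
the load and store instructions use for addressing.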

Andrew.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Compiler optimizing variables in inline assembly
  2014-02-27 14:03               ` Andrew Haley
@ 2014-02-27 18:34                 ` Cody Rigney
  0 siblings, 0 replies; 16+ messages in thread
From: Cody Rigney @ 2014-02-27 18:34 UTC (permalink / raw)
  To: Andrew Haley; +Cc: gcc-help

On Thu, Feb 27, 2014 at 9:03 AM, Andrew Haley <aph@redhat.com> wrote:
>
> I have been informed that this is wrong, sorry.
>
> The correct way to do it is like so:
>
> ---------------------------------------
> If you know how large the accessed memory is, you can add it as input or
> output but if this is not known, you should add `memory'.  As an example,
> if you access ten bytes of a string, you can use a memory input like:
>
>      {"m"( ({ struct { char x[10]; } *p = (void *)ptr ; *p; }) )}.
>
> ---------------------------------------
>
> If you use a zero-length array in the struct (i.e. char x[0]) that will
> effectively do the job if you don't know how large the array is, but there
> seems to be some doubt whether this is guaranteed.  A memory clobber really
> is guaranteed, but seems a bit like overkill.
>
> Andrew.
>

I gave that a shot but it didn't work for me.  I tried making small
modifications and it would either not work right (perhaps due to
optimizations) or the instruction wouldn't allow offsets like [r0, #16].

I did this instead:

typedef struct { char x[16]; } MemContainer128;

....
output param is "=m" ( *((MemContainer128*)ptr))


Cody
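
A self-contained sketch of that approach (the loop, the shift operation and all names
are assumptions for illustration; this is not the actual lkpyramid.cpp code):

    #include <stdint.h>

    typedef struct { char x[16]; } MemContainer128;

    /* Halve each 16-bit element, 8 at a time, describing every 16-byte
       store to GCC through a "=m" (*(MemContainer128*)...) output operand.
       Tail elements (n not a multiple of 8) are left to the caller.        */
    static inline void halve_row(const int16_t *src, int16_t *dst, int n)
    {
        int i;
        for (i = 0; i + 8 <= n; i += 8)
        {
            __asm__ ("vld1.16  {q0}, [%[s]]\n\t"
                     "vshr.s16 q0, q0, #1\n\t"
                     "vst1.16  {q0}, [%[d]]\n\t"
                     : "=m" (*(MemContainer128 *)(dst + i))
                     : [s] "r" (src + i), [d] "r" (dst + i),
                       "m" (*(const MemContainer128 *)(src + i))
                     : "d0", "d1");
        }
    }

Because the address is recomputed each iteration and passed through an "r" operand,
the asm template itself never needs offsets like [r0, #16], which sidesteps the
addressing-mode limitation mentioned above.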

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2014-02-27 18:34 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-19 19:05 Compiler optimizing variables in inline assembly Cody Rigney
2014-02-20  9:14 ` Andrew Haley
2014-02-20 19:30   ` Cody Rigney
2014-02-21  9:53     ` Andrew Haley
2014-02-21 14:06       ` Cody Rigney
2014-02-21 15:02         ` Andrew Haley
2014-02-21 15:20           ` Cody Rigney
2014-02-27 13:18             ` Cody Rigney
2014-02-27 14:03               ` Andrew Haley
2014-02-27 18:34                 ` Cody Rigney
2014-02-21  9:54     ` David Brown
2014-02-21  9:55     ` David Brown
2014-02-20  9:54 ` David Brown
2014-02-20 19:39   ` Cody Rigney
2014-02-21 10:15     ` David Brown
2014-02-21 14:11       ` Cody Rigney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).