Hi,

I've been messing around with the gcc vector extensions (SSE) and the assembly produced seems somewhat suboptimal. I'm not sure what "optimal" is, so I'm inquiring here first before filing a bug report.

This concerns inlined functions that return vectors using the struct/union return convention, that is, the address where the result is to be stored is passed as a hidden first argument to the callee. When the function returns a 'raw' vector type (such as "double foo __attribute__((vector_size(16)))") that fits in a single xmm register, the code generated for a call to the inlined function is the same as for manual inlining. However, if a union is returned (such as "union { double a[2]; double v __attribute__((vector_size(16))); }") or if the vector type is too big for a register (such as "double foo __attribute__((vector_size(32)))"), then excessive stack shuffling occurs relative to manual inlining.

This is C, btw, so I understand that, in general, stack space has to be reserved for the arguments (as opposed to const&), but I would expect that after inlining the optimizer could see that the arguments are not modified and avoid bouncing them through the stack, as it does for things like int and double.

Let's say there are functions f(a,b) = a+b, g(a,b) = a*b, and h(a,b) = g(f(a,a),f(a,b)). Functions f and g are inlined into h, but the body of h looks like this:

  1. reserve stack space for what would have been calls to f and g
  2. copy arguments into that space
  3. load from that space into xmm registers
  4. operate
  5. copy from xmm registers into stack space
  6. copy from stack space into the space pointed to by the hidden "return here" argument

If h is instead written out by hand as (a+a)*(a+b), this stack shuffling does not happen.

Is this asking too much? Is there some fundamental reason why the arguments to the inlined function need to be bounced through the stack?

This is with gcc 4.1.3. I've attached a test file and the resulting assembly. The difference is pretty striking, though I've not benchmarked it.

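A minimal sketch of the pattern described above, for the union case. This is not the attached test file: the names v2df, vec2, h_inlined and h_manual are made up for illustration, and whether the union is actually returned through a hidden pointer depends on the target ABI.

```c
/* Hypothetical reduction of the case described above; not the attached file. */

/* Two doubles in one 16-byte SSE vector (GCC vector extension). */
typedef double v2df __attribute__((vector_size(16)));

/* Union wrapper; depending on the ABI this may be returned in memory. */
typedef union {
    double a[2];
    v2df   v;
} vec2;

/* f(a,b) = a+b, elementwise */
static inline vec2 f(vec2 a, vec2 b)
{
    vec2 r;
    r.v = a.v + b.v;
    return r;
}

/* g(a,b) = a*b, elementwise */
static inline vec2 g(vec2 a, vec2 b)
{
    vec2 r;
    r.v = a.v * b.v;
    return r;
}

/* h built from the inline helpers: g(f(a,a), f(a,b)) */
vec2 h_inlined(vec2 a, vec2 b)
{
    return g(f(a, a), f(a, b));
}

/* The same expression inlined by hand: (a+a)*(a+b) */
vec2 h_manual(vec2 a, vec2 b)
{
    vec2 r;
    r.v = (a.v + a.v) * (a.v + b.v);
    return r;
}
```

Compiling this with something like "gcc -O2 -msse2 -S" and comparing the bodies of h_inlined and h_manual should show whether the extra stack traffic appears; the result will likely vary with gcc version and target.
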
I'm also aware this is not really the best way to use SSE (better to put each vector component in a separate array and vectorize the loop), but I think the issue may be with inlined functions that return structs/unions in general.

Thanks,
Scott