<< >>
justin = { main feed , music , code , askjf , social , pubkey };
[ <<< last (older) article | view in index | next (newer) article >>> ]

July 15, 2014
checking assumptions

Very often in computer code integers are divided by powers of two, such as 2, 4, 8, 16, 32, 64, and so on). These divisions are much faster than dividing by other numbers (since computers represent the underlying number in binary). In C/C++, there are two ways this type of division is typically expressed: normal division (x/256), or a shift (x<<8). These two methods produce the same results for non-negative values, and depending on the meaning of the code in question, (x/256) is often more readable, and thus I use it regularly.

If x is signed and is negative, division rounds towards 0, whereas the shift rounds towards negative infinity, but in situations where rounding of negative values is not important, I had generally assumed that modern compilers would generate similarly efficient code (reducing x/256 into a single x86 sar instruction).

It turns out, testing with a two modern compilers (and one slightly out of date compiler), this is not the case.

Here is the C code:

  void b(int r);
  void f(int v)
  {
    int i;
    for (i=0;i<v/256;i++) b(v > 0 ? v/65536 : 0);
  }

  void f2(int v)
  {
    int i;
    for (i=0;i<(v>>8);i++) b(v > 0 ? v >> 16 : 0);
  }

Ideally, a compiler should generate identical code for each of these functions. In the case of the loop counter, if v is less than 0, how it is rounded makes no difference. In the case of the parameter to b(), the code v >> 16 is only evaluated if v is known to be above 0.

Let's look at the output of some compilers (removing decoration and unrelated code). I've marked some code as bold to signify instructions that could be eliminated (with slight changes to the surrounding instructions):

  • Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
    Xcode 5.1.1 on OSX 10.9.4
    Target: x86_64-apple-darwin13.3.0
    Command line flags: -O2 -fomit-frame-pointer

    f():                              # using divides
    	cmpl	$256, %edi
    	jl	LBB0_3            # if v is less than 256, skip loop
    
            # v is known to be 256 or greater. 
    
    	movl	%edi, %ebx
    	sarl	$31, %ebx         # ebx=0xffffffff if negative, 0 if non-negative
    	movl	%ebx, %r14d    
    	shrl	$24, %r14d        # r14d=0xff if negative, 0 if non-negative
    	addl	%edi, %r14d       # r14d = (v+255) if v negative, v if v non-negative
    	sarl	$8, %r14d         # r14d = v/256
    
    	shrl	$16, %ebx         # this will make ebx 65535 if v negative, 0 if v non-negative
    	addl	%edi, %ebx        # ebx = (v+65535) if v negative, v if v non-negative
    	sarl	$16, %ebx         # ebx = v/65536
    
            # interestingly, the optimizer used its knowledge of v being greater than 0 to remove the ternary conditional expression completely.
    
    	xorl	%ebp, %ebp
    LBB0_2:
    	movl	%ebx, %edi
    	callq	_b
    	incl	%ebp
    	cmpl	%r14d, %ebp
    	jl	LBB0_2
    LBB0_3:
    
    
    
    f2():                             # using shifts
    	movl	%edi, %ebp
    	sarl	$8, %ebp          # ebp = v>>8
    	testl	%ebp, %ebp
    	jle	LBB1_3            # if less than or equal to 0, skip
    
    	movl	%edi, %eax
    	sarl	$16, %eax         # eax = v>>16
    	xorl	%ebx, %ebx
    	testl	%edi, %edi
    	cmovgl	%eax, %ebx        # if v is greater than 0, set ebx to eax
    
            # the optimizer could also have removed the xorl/testl/cmovgl sequence as well
    
    LBB1_2:
    	movl	%ebx, %edi
    	callq	_b
    	decl	%ebp
    	jne	LBB1_2
    LBB1_3:
    
    
    In the first function (division), the LLVM optimizer appears to have removed the ternary expression (checking to see if v was greater than 0), likely because it knew that if the loop was running, v was greater than 0. Unfortunately, it didn't apply this knowledge to the integer divisions of v, which would have allowed it to not generate (substantial) rounding code.

    In the second function (shifts), LLVM wasn't required to generate rounding code (as C's >> maps to x86 sar directly), but it also didn't use the knowledge that v would be greater than 0.

  • Microsoft (R) C/C++ Optimizing Compiler Version 18.00.30501 for x64
    Visual Studio Express 2013 for Windows Desktop on Windows 7
    Command line flags: /O2 (/Os produced different results, but nothing related to rounding)
    
    f():                              # using divides
    
            mov     eax, ecx
            mov     ebx, ecx
            cdq                       # set edx to 0xffffffff if v negative, 0 otherwise
            movzx   edx, dl           # set edx to 0xff if v negative, 0 otherwise
            add     eax, edx          # eax = v+255 if v negative, v otherwise
            sar     eax, 8            # eax = v/256
            test    eax, eax
            jle     SHORT $LN1@f      # skip loop if v/256 is less than or equal to 0
            mov     QWORD PTR [rsp+48], rdi
    
            mov     edi, eax          # edi is loop counter
    $LL3@f:
            test    ebx, ebx
            jle     SHORT $LN6@f      # if v is less than or equal to 0, jump to set eax to 0
            mov     eax, ebx
            cdq                       # set edx to 0xffffffff if v negative, 0 otherwise
            movzx   edx, dx           # set edx to 0xffff if v negative, 0 otherwise
            add     eax, edx          # eax = v+65535 if v negative, v otherwise
            sar     eax, 16           # eax = v/65536
            jmp     SHORT $LN7@f
    $LN6@f:
            xor     eax, eax
    $LN7@f:
            mov     ecx, eax
            call    b
            dec     rdi
            jne     SHORT $LL3@f
    
            mov     rdi, QWORD PTR [rsp+48]
    $LN1@f:
    
    f2():                             # using shifts
            mov     eax, ecx
            mov     ebx, ecx
            sar     eax, 8            # eax = v>>8
            test    eax, eax
            jle     SHORT $LN1@f2     # skip loop if v>>8 is less than or equal to 0
    
            mov     QWORD PTR [rsp+48], rdi
            mov     edi, eax
    $LL3@f2:
            test    ebx, ebx
            jle     SHORT $LN6@f2     # if v is less than or equal to 0, jump to set ecx to 0
            mov     ecx, ebx
            sar     ecx, 16           # ecx = v>>16
            jmp     SHORT $LN7@f2
    $LN6@f2:
            xor     ecx, ecx
    $LN7@f2:
            call    b
            dec     rdi
            jne     SHORT $LL3@f2
            mov     rdi, QWORD PTR [rsp+48]
    $LN1@f2:
    
    
    VS 2013 generates different rounding code for the division, using cdq/movzx (or cdq/and if shifting by something other than 8 or 16 bits).

    Also worth noting is that VS 2013 doesn't even bother moving the invariant ternary operator and (v/65536) or (v>>16) out of the loop. Ideally, it could move that calculation out of the loop, or remove the ternary operator completely. Ouch. I have to say, VS 2013 does seem to produce pretty good code overall, but I guess most of ours is heavily in floating point these days.

  • gcc 4.4.5
    Linux x86_64
    Command line flags: -O2 -fomit-frame-pointer

    f():                              # using divides
            testl   %edi, %edi
            movl    %edi, %ebp        # ebp = edi = v
            leal    255(%rbp), %r12d  # r12d = v+255
    
            cmovns  %edi, %r12d       # set r12d to v, if v is non-negative (otherwise r12d was v+255)
            sarl    $8, %r12d         # r12d = v/256
            testl   %r12d, %r12d
            jle     .L14              # if r12d is less than or equal to 0, skip
            movl    %edi, %r14d
            xorl    %ebx, %ebx        # ebx is loop counter
            xorl    %r13d, %r13d
            sarl    $16, %r14d        # r14d = v>>16
    .L13:
            testl   %ebp, %ebp
            movl    %r13d, %edi
            cmovg   %r14d, %edi       # if v is greater than 0, use v>>16 instead of 0
            addl    $1, %ebx
            call    b
            cmpl    %r12d, %ebx
            jl      .L13
    .L14:
    
    f2():                             # using shifts
            movl    %edi, %r12d
            sarl    $8, %r12d
            testl   %r12d, %r12d
            movl    %edi, %ebp
            jle     .L6               # skip loop if (v>>8) is less than or equal to 0
            movl    %edi, %r14d
            xorl    %ebx, %ebx        # ebx is loop counter
            xorl    %r13d, %r13d
            sarl    $16, %r14d        # r14d = (v>>16)
    .L5:
            testl   %ebp, %ebp
            movl    %r13d, %edi
            cmovg   %r14d, %edi       # if v is greater than 0, use v>>16 instead of 0
            addl    $1, %ebx
            call    b
            cmpl    %r12d, %ebx
            jl      .L5
    .L6:
    
    
    
    gcc 4.4 does an interesting job, using lea to generate v+255, and then cmovns to replace it with v if v is non-negative. It doesn't bother generating rounding code for v/65536, but it does still generate rounding code for v/256, even though any non-positive result for v/256 is treated the same way throughout. Also, gcc doesn't eliminate the non-varying ternary expression, nor put the constant v/65536 or v>>16 outside of the loop.

Conclusions?

I'm not sure what to say here -- modern compilers can generate a lot of really good code, especially looking at floating point and SSE, but this makes me feel as though some of the basics have been neglected. If I were a better programmer I'd go dig into LLVM and GCC and submit patches.

I should have also tested ICC, but I've spent enough time on this, and the only ICC version we use is old enough that I would just regret not using the latest.

For comparison, here is what I would like to see LLVM generate for f():


f():                              # using divides
	cmpl	$256, %edi
	jl	LBB0_3            # if v is less than 256, skip loop

	movl	%edi, %ebx
	movl	%edi, %ebp
	shrl	$8, %ebx          # ebx = v/256, since v is non-negative
	shrl	$16, %ebp         # ebp = v/65536, since v is non-negative

LBB0_2:
	movl	%ebp, %edi
	callq	_b
	decl    %ebx
	jnz	LBB0_2
LBB0_3:

Performance-wise, I'm sure they wouldn't differ in any meaningful way, but the decrease in size would be nice.

Finally: write for the compiler you have, not the compiler you wish you had. When performance is important, use shifts instead of divides, or use unsigned types (which really should generate the same code for (x/256) vs (x>>8)). Move as much logic out of the loop as you can -- yes, the compiler might be able to do it for you, but why depend on that? But most important of all: test your assumptions.






5 Comments:
Posted by Eugene on Wed 16 Jul 2014 at 10:30 from 95.31.43.x
I've found out with clang that if you use unsigned integer type, it will generate code close to what you expect.

You can use the site
gcc.gnubolt.org to experiment.


Posted by Eugene on Wed 16 Jul 2014 at 10:31 from 95.31.43.x
sorry, gcc.godbolt.org


Posted by Eugene on Wed 16 Jul 2014 at 10:33 from 95.31.43.x
Here's permalink to the example I'm talking about:

goo.gl/iYu8dw


Posted by Justin on Wed 16 Jul 2014 at 13:08 from 198.228.200.x
Yeah, in my conclusion I mentioned unsigned types as being a solution. It would be nice if it were smarter when operating on signed integers that are known to be non-negative.


Posted by Justin on Wed 16 Jul 2014 at 13:09 from 198.228.200.x
Here's a function for which LLVM generates the code I requested:

void f3(int v)
{
  if (v>=256) {
    int i=v>>8;
    const int p = v>>16;
    do {
      b(p);
    } while (--i);
  }
}


Add comment:
Name:
Human?: (no or yes, patented anti crap stuff here)
Comment:
search : rss : recent comments : Copyright © 2017 Justin Frankel