A small 64-bit conundrum

Traditionally, 64-bit computing is a mixed bag. You gain additional precision and the ability to address more memory, but in many cases those advantages are outweighed by the fact that 64-bit code needs more memory and cache space, and the individual operations can be slower.

The x86-64 architecture is somewhat different in this respect. When AMD took it upon themselves to add 64-bit capabilities to x86, they also made other enhancements, most notably doubling the number of general-purpose registers from a paltry 8 to a more manageable 16. The caveat is that only 64-bit code can access these new registers.

The simple prime benchmark I used earlier is a case in point: it uses basically every available register, avoiding any tedious pushing and popping on the stack, but doesn’t actually do any major 64-bit arithmetic of its own (i.e., the numbers being tested and the possible divisors are all treated as 32-bit quantities).

Out of curiosity, I converted that program both to pure 32-bit assembly (everything 32-bit, only 8 registers) and to pure 64-bit assembly (all arithmetic and data 64-bit). I also slightly modified the C version to produce a variant that uses 64-bit quantities throughout. All versions of the program were tested on a 2.53 GHz Intel Core 2 Duo system (Linux, gcc 4.4.3) and a 2.8 GHz Intel Core i7 system (Mac OS X, gcc 4.2).

x86 assembly – 64-bit version with 32-bit arithmetic

x86 assembly (32) – 32-bit version with 32-bit arithmetic

x86 assembly (64) – 64-bit version with 64-bit arithmetic

C code (64) – 64-bit version with 64-bit arithmetic (if on 64-bit platform)

C code – 32-bit arithmetic version (on any platform)
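The difference between the two C variants can be sketched roughly as follows. This is only an illustration, not the benchmark source: the names `num_t`, `USE_64BIT`, `PRIME_DEMO_MAIN`, `is_prime`, and `count_primes` are made up for the example, and the trial-division loop is an assumption about what a simple prime benchmark looks like. The width of the arithmetic is chosen by the typedef, while the code model would be chosen separately by building with `gcc -m32` or `gcc -m64`.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical switch: build with -DUSE_64BIT so every quantity is 64-bit. */
#ifdef USE_64BIT
typedef uint64_t num_t;
#else
typedef uint32_t num_t;
#endif

/* Trial division: returns 1 if n is prime, 0 otherwise. */
int is_prime(num_t n)
{
    if (n < 2)
        return 0;
    if (n < 4)
        return 1;               /* 2 and 3 are prime */
    if (n % 2 == 0)
        return 0;
    /* d <= n / d is the same test as d * d <= n, but cannot overflow. */
    for (num_t d = 3; d <= n / d; d += 2) {
        if (n % d == 0)
            return 0;
    }
    return 1;
}

/* Count the primes in the half-open range [2, limit). */
num_t count_primes(num_t limit)
{
    num_t count = 0;
    for (num_t n = 2; n < limit; n++)
        count += (num_t)is_prime(n);
    return count;
}

#ifdef PRIME_DEMO_MAIN
/* Optional demo driver: build with -DPRIME_DEMO_MAIN to get an executable. */
int main(void)
{
    printf("%llu\n", (unsigned long long)count_primes(1000000));
    return 0;
}
#endif
```

Under these assumptions, defining `USE_64BIT` would correspond roughly to the "pure 64-bit" C version, while leaving it undefined and building in 64-bit mode would correspond to the version with 32-bit arithmetic in 64-bit code.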

Results (and oddities).

First and foremost, the compiler (gcc) is not doing a particularly good job on the 64-bit arithmetic code. My pure 64-bit assembly code runs roughly 30-45% faster than the C code, while the gap between the other C and assembly versions is at most about 16%.

Second, there is a big hit in performance from performing 64-bit arithmetic. The pure 64-bit assembly version is much slower than either the assembly or C versions that use 32-bit arithmetic. That said, the gap seems to have narrowed somewhat with the newer Intel i7 chips – the 32-bit assembly code is 18% faster on the i7 compared to the Core 2 while the 64-bit assembly runs 41% faster.

Thirdly, for this benchmark it’s clear that there are benefits to using 64-bit mode even when the arithmetic is best done with 32-bit quantities: on both machines, the mixed assembly version (64-bit code with 32-bit arithmetic) proved fastest.

Lastly, while the benchmark shows the compiler generally does a good job for 32-bit arithmetic code, it does seem that hand-coded assembly maintains an edge in all cases.
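The percentages quoted above follow directly from the measured runtimes: the normalized speed of a run is the baseline runtime divided by that run's runtime, and "N% faster" is that ratio minus one. A small sketch of the arithmetic is below; the function names `relative_speed` and `percent_faster` and the `SPEED_DEMO_MAIN` macro are made up for illustration, and the runtimes in the demo are the ones reported in the result tables.

```c
#include <stdio.h>

/* Speed of a run relative to a baseline: baseline time / run time. */
double relative_speed(double baseline_seconds, double run_seconds)
{
    return baseline_seconds / run_seconds;
}

/* How much faster than the baseline a run is, in percent. */
double percent_faster(double baseline_seconds, double run_seconds)
{
    return (relative_speed(baseline_seconds, run_seconds) - 1.0) * 100.0;
}

#ifdef SPEED_DEMO_MAIN
/* Optional demo driver: build with -DSPEED_DEMO_MAIN to get an executable. */
int main(void)
{
    /* Core 2 Duo: pure 64-bit C (5.60 s) versus pure 64-bit asm (3.90 s). */
    printf("asm vs C, 64-bit, Core 2: %.2fx\n", relative_speed(5.60, 3.90));
    /* Pure 32-bit asm: Core 2 (1.92 s) versus Core i7 (1.63 s). */
    printf("32-bit asm, i7 vs Core 2: %.0f%% faster\n", percent_faster(1.92, 1.63));
    /* Pure 64-bit asm: Core 2 (3.90 s) versus Core i7 (2.77 s). */
    printf("64-bit asm, i7 vs Core 2: %.0f%% faster\n", percent_faster(3.90, 2.77));
    return 0;
}
#endif
```

Note that the two machines run at different clock speeds, so the cross-machine figures compare whole systems rather than per-clock efficiency.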

Results from 2.53 GHz Core 2 Duo:

| Program version (type) | Runtime (seconds) | Normalized speed (pure 64-bit C = 1.00) |
|---|---|---|
| C pure 64-bit (all operations 64-bit) | 5.60 | 1.00 |
| asm pure 64-bit (all operations 64-bit) | 3.90 | 1.44 |
| C pure 32-bit (all operations 32-bit) | 1.97 | 2.84 |
| asm pure 32-bit (all operations 32-bit) | 1.92 | 2.92 |
| C mixed (64-bit) (32-bit data, 64-bit pointers) | 1.98 | 2.83 |
| asm mixed (64-bit) (32-bit data, 64-bit pointers) | 1.87 | 2.99 |

Results from 2.8 GHz Core i7:

| Program version (type) | Runtime (seconds) | Normalized speed (pure 64-bit C = 1.00) |
|---|---|---|
| C pure 64-bit (all operations 64-bit) | 3.58 | 1.00 |
| asm pure 64-bit (all operations 64-bit) | 2.77 | 1.29 |
| C pure 32-bit (all operations 32-bit) | 1.89 | 1.89 |
| asm pure 32-bit (all operations 32-bit) | 1.63 | 2.20 |
| C mixed (64-bit) (32-bit data, 64-bit pointers) | 1.68 | 2.13 |
| asm mixed (64-bit) (32-bit data, 64-bit pointers) | 1.60 | 2.24 |
