Traditionally, 64-bit computing is a mixed bag. You gain additional precision and the ability to address more memory, but in many cases those advantages are outweighed by the fact that the 64-bit code requires more memory and cache and the actual individual operations are slower.
X86-64 is somewhat different in this respect, because when AMD took it upon themselves to add 64-bit capabilities to x86, they also made other enhancements including, notably, doubling the number of general purpose registers from a paltry 8 to a more manageable 16. The caveat is that only 64-bit code can have access to these new registers.
The simple prime benchmark I used earlier is a case in point – it uses basically every register available avoiding any tedious pushing and popping of the stack, but doesn’t actually do any major 64-bit arithmetic of its own (i.e. the numbers being tested and the possible divisors are all treated as 32-bit quantities).
Out of curiosity, I converted that program both to pure 32-bit assembly (everything 32-bit, only 8 registers), and to pure 64-bit (all arithmetic and data is 64-bit). I also slightly modified the C version so as to have a version that would use all 64-bit quantities as well. All versions of the program were tested on a 2.53 GHz Intel Core 2 Duo system (Linux, gcc 4.4.3) and a 2.8 GHz Intel Core i7 system (Mac OS X, gcc 4.2).
x86 assembly – 64-bit version with 32-bit arithmetic
x86 assembly (32) – 32-bit version with 32-bit arithmetic
x86 assembly (64) – 64-bit version with 64-bit arithmetic
C code (64) – 64-bit version with 64-bit arithmetic (if on 64-bit platform)
C code – 32-bit arithmetic version (on any platform)
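To give a feel for what changes between the 32-bit and 64-bit arithmetic variants, here is a minimal sketch in C of a trial-division prime counter whose integer width is chosen at compile time. It is not the benchmark program itself: `num_t`, `USE_64BIT`, `is_prime` and `count_primes` are hypothetical names made up for illustration, and the only assumption taken from the description above is that candidates are tested against possible divisors.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical switch: define USE_64BIT to make every quantity 64-bit. */
#ifdef USE_64BIT
typedef uint64_t num_t;
#else
typedef uint32_t num_t;
#endif

/* Trial division: n is prime if no d in [2, sqrt(n)] divides it. */
int is_prime(num_t n)
{
    if (n < 2)
        return 0;
    /* "d <= n / d" is the same test as "d * d <= n" but cannot overflow. */
    for (num_t d = 2; d <= n / d; d++) {
        if (n % d == 0)
            return 0;
    }
    return 1;
}

/* Count the primes in [2, limit]. */
num_t count_primes(num_t limit)
{
    num_t count = 0;

    /* Keep n++ from wrapping around when limit is the largest value. */
    assert(limit < (num_t)-1);

    for (num_t n = 2; n <= limit; n++)
        count += (num_t)is_prime(n);
    return count;
}
```

Compiled as is, every quantity is 32-bit; adding `-DUSE_64BIT` to the gcc command line widens them all to 64 bits, so one source file can stand in for both C variants. Nearly all the time goes into the division and modulo in the inner loop, which is plausibly where operand width matters most.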
Results (and oddities).
First and foremost, the compiler (gcc) is not doing a particularly good job on the 64-bit arithmetic code. My pure 64-bit assembly code is running 30-45% faster than the C code, while the gap between the other C and assembly versions is at most 15%.
Second, there is a big hit in performance from performing 64-bit arithmetic. The pure 64-bit assembly version is much slower than either the assembly or C versions that use 32-bit arithmetic. That said, the gap seems to have narrowed somewhat with the newer Intel i7 chips – the 32-bit assembly code is 18% faster on the i7 compared to the Core 2 while the 64-bit assembly runs 41% faster.
Third, for this benchmark it’s clear that there are benefits to using 64-bit mode even when the arithmetic is best done with 32-bit quantities: on both machines, the mixed assembly version (64-bit code with 32-bit arithmetic) proved fastest.
Lastly, while the benchmark shows the compiler generally does a good job for 32-bit arithmetic code, it does seem that hand-coded assembly maintains an edge in all cases.
Results from 2.53 GHz Core 2 Duo:
Program version (type) | Runtime (seconds) | Normalized speed (pure 64-bit C = 1.00) |
C pure 64-bit (all operations 64-bit) | 5.60 | 1.00 |
asm pure 64-bit (all operations 64-bit) | 3.90 | 1.44 |
C pure 32-bit (all operations 32-bit) | 1.97 | 2.84 |
asm pure 32-bit (all operations 32-bit) | 1.92 | 2.92 |
C mixed (64-bit) (32-bit data, 64-bit pointers) | 1.98 | 2.83 |
asm mixed (64-bit) (32-bit data, 64-bit pointers) | 1.87 | 2.99 |
Results from 2.8 GHz Core i7:
Program version (type) | Runtime (seconds) | Normalized speed (pure 64-bit C = 1.00) |
C pure 64-bit (all operations 64-bit) | 3.58 | 1.00 |
asm pure 64-bit (all operations 64-bit) | 2.77 | 1.29 |
C pure 32-bit (all operations 32-bit) | 1.89 | 1.89 |
asm pure 32-bit (all operations 32-bit) | 1.63 | 2.20 |
C mixed (64-bit) (32-bit data, 64-bit pointers) | 1.68 | 2.13 |
asm mixed (64-bit) (32-bit data, 64-bit pointers) | 1.60 | 2.24 |
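For anyone who wants to check the arithmetic behind the normalized speeds and the percentages quoted above, the two calculations can be sketched in C as follows. The function names `normalized_speed` and `percent_faster` are hypothetical helpers, not part of the benchmark code; the formulas are simply the baseline runtime divided by a version's runtime, and the same ratio expressed as a percentage gain.

```c
#include <assert.h>

/* Normalized speed: baseline runtime divided by this version's runtime.
 * The baseline itself therefore scores exactly 1.00. */
double normalized_speed(double baseline_seconds, double runtime_seconds)
{
    assert(runtime_seconds > 0.0);
    return baseline_seconds / runtime_seconds;
}

/* Percentage by which the faster run beats the slower one,
 * e.g. 2.0 s versus 1.0 s is 100% faster. */
double percent_faster(double slow_seconds, double fast_seconds)
{
    assert(fast_seconds > 0.0);
    return (slow_seconds / fast_seconds - 1.0) * 100.0;
}
```

With the Core 2 Duo runtimes, `normalized_speed(5.60, 3.90)` comes to about 1.44, matching the entry for the pure 64-bit assembly version. Comparing the two machines, `percent_faster(1.92, 1.63)` is about 18 for the 32-bit assembly and `percent_faster(3.90, 2.77)` is about 41 for the 64-bit assembly.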