64bit vs 32bit benchmarks & integer maths & PT8



David (PassMark)
10-24-2011, 06:08 AM
Six years ago we took a look at 64bit benchmarking (http://www.passmark.com/forum/showthread.php?t=261) and provided some examples of why 64bit can give better performance than 32bit.

What we found at the time was that a 64bit CPU, running a 64bit O/S, executing 64bit code could in some cases be twice as quick as 32bit code.

We are now at the point where we are doing research into PerformanceTest V8, and since the initial study many new CPUs have been released and the difference between 32bit/64bit performance has grown. In PerformanceTest V7 some CPUs now get up to 4 - 6 times the performance in 64bit integer maths (http://www.passmark.com/forum/showpost.php?p=11395&postcount=7), compared to 32bit.

Six times the performance is an enormous difference. So we decided to dig a bit deeper to see what was going on.

In PerformanceTest V7 the integer maths test is made up of 8 individual mathematical operations performed in equal numbers. These are:

Addition of two 32bit numbers
Subtraction of two 32bit numbers
Multiplication of two 32bit numbers
Division of two 32bit numbers
Addition of two 64bit numbers
Subtraction of two 64bit numbers
Multiplication of two 64bit numbers
Division of two 64bit numbers


The first four operations are, unsurprisingly, executed in the same way and at the same speed on a 32bit machine and a 64bit machine.

The second four 64bit operations are executed at a much quicker speed on a 64bit machine. See the above referenced post for details.

What we have found in more recent testing however is also interesting.

The first interesting point is that division of 32bit numbers is pretty much always around four times slower than Add, Subtract or Multiply. This isn't news, as it is well known that division is a harder operation to do.

What was more interesting was that 64bit division was far slower than 32bit division. And doing 64bit division on a 32bit system was extremely expensive, showing a fourteen-fold performance drop going from 32bit to 64bit numbers. This more than anything else accounts for why the PerformanceTest V7 integer maths test does so well on 64bit compared to 32bit.

The second interesting point is that some of the newest CPUs have become significantly faster at 64bit division, for example the AMD A8-3850 & A6-3650. This has really lifted their results in this test.

The lesson in this (for us) is that the V7 integer maths test places too much weight on how fast 64bit division can be performed. The same is also true for 32bit division and multiplication, to a lesser extent. More weight should be given to the other operations, Add, Subtract, etc. This would moderate the differences between CPU types and also reduce the large differences between 32bit and 64bit.
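
To illustrate why equal operation counts let the slowest operation dominate the result, here is a rough Python sketch. The relative costs are illustrative only, not measurements: the 4x and 14x ratios are taken from the observations above, and the cost of 2 for 64bit add, subtract and multiply in 32bit code is purely an assumption. The function name time_share is made up for this example.

```python
# Relative time per operation when running 32bit code (add32 = 1 unit).
# div32 = 4 and div64 = 4 * 14 follow the rough ratios quoted above.
# The value 2.0 for 64bit add/sub/mul is an assumed placeholder.
COST_32BIT_CODE = {
    "add32": 1.0, "sub32": 1.0, "mul32": 1.0, "div32": 4.0,
    "add64": 2.0, "sub64": 2.0, "mul64": 2.0, "div64": 4.0 * 14,
}


def time_share(costs, counts=None):
    """Return each operation's fraction of the total run time."""
    if counts is None:
        # Equal numbers of every operation, as in the V7 test.
        counts = {op: 1 for op in costs}
    total = sum(costs[op] * counts[op] for op in costs)
    return {op: costs[op] * counts[op] / total for op in costs}


if __name__ == "__main__":
    for op, share in time_share(COST_32BIT_CODE).items():
        print(f"{op}: {share:.1%}")
```

With these assumed costs, 64bit division accounts for about 81% of the run time of the 32bit build, even though it is only one eighth of the operations. Passing a counts dictionary with fewer divisions shows how the share falls.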

So in V8 we plan to reduce the number of division operations performed and also introduce some additional variety into the test in the form of logic operations like bit shifting & increment instructions.

Here are the actual V7 numbers from an Intel X9650 CPU, running both 32bit code and 64bit code. Higher numbers are better.
http://www.passmark.com/images/forumimages/64bit_vs_32bit_benchmark_V7.png

You might be wondering just why doing 64bit division on a 32bit system is so slow. The reasons are that A) there is no native machine code instruction for operating directly on 64bit numbers on a 32bit system, and B) the calculations needed to perform a 64bit division are rather complex on a 32bit system. Each division takes the CPU dozens of steps to complete.
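
As a rough illustration of where those steps come from, here is a Python sketch of a 64bit unsigned division built only from 32bit words, using plain shift-and-subtract long division. This is not the code a compiler actually emits (real runtime libraries use faster methods), and all the function names are made up for this example. It assumes both inputs are in the range 0 to 2^64 - 1.

```python
MASK32 = 0xFFFFFFFF  # every working value below fits in one 32bit word


def split(x):
    """Split a 64bit value into (high word, low word)."""
    return (x >> 32) & MASK32, x & MASK32


def shl1(hi, lo):
    """Shift a two-word value left by one bit, carrying between words."""
    return ((hi << 1) | (lo >> 31)) & MASK32, (lo << 1) & MASK32


def greater_equal(a_hi, a_lo, b_hi, b_lo):
    """Compare two two-word values: is a >= b?"""
    return a_hi > b_hi or (a_hi == b_hi and a_lo >= b_lo)


def subtract(a_hi, a_lo, b_hi, b_lo):
    """Two-word subtraction with a borrow from the high word (assumes a >= b)."""
    borrow = 1 if a_lo < b_lo else 0
    return (a_hi - b_hi - borrow) & MASK32, (a_lo - b_lo) & MASK32


def udiv64(n, d):
    """Return (quotient, remainder) of n / d using only 32bit word steps."""
    if d == 0:
        raise ZeroDivisionError("division by zero")
    n_hi, n_lo = split(n)
    d_hi, d_lo = split(d)
    q_hi = q_lo = r_hi = r_lo = 0
    for bit in range(63, -1, -1):  # one pass per dividend bit, top bit first
        # Bring the next dividend bit into the remainder.
        r_hi, r_lo = shl1(r_hi, r_lo)
        word = n_hi if bit >= 32 else n_lo
        r_lo |= (word >> (bit % 32)) & 1
        # Make room for the next quotient bit.
        q_hi, q_lo = shl1(q_hi, q_lo)
        if greater_equal(r_hi, r_lo, d_hi, d_lo):
            r_hi, r_lo = subtract(r_hi, r_lo, d_hi, d_lo)
            q_lo |= 1
    return (q_hi << 32) | q_lo, (r_hi << 32) | r_lo
```

The loop runs 64 times and each pass needs several word-sized shifts, compares and subtracts, whereas a 64bit CPU does the whole thing with a single divide instruction.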

Update: Here is a link to the PT8 development thread (http://www.passmark.com/forum/showthread.php?t=3499).

wonderwrench
11-15-2011, 09:34 PM
How much 64 bit Division is done in the real world? If it's close to equaling the amount of 64 bit Addition, Subtraction and Multiplication, should 64 bit Division be weighted down?

David (PassMark)
11-16-2011, 01:53 AM
We don't have hard stats, and it would vary from one application to the next. But we are thinking that multiply and add are significantly more common.

Some research on Google turned up the "Gibson mix", which was based on research done by Jack C. Gibson in 1959 on an IBM 704 system running scientific applications. (Yes, from 1959!!!)

Instruction type, Percentage of use.
Load and store, 31.2%
Indexing, 18.0%
Branches, 16.6%
Floating Add and Subtract, 6.9%
Fixed-Point Add and Subtract, 6.1%
Instructions not using registers, 5.3%
Shifting, 4.4%
Compares, 3.8%
Floating Multiply, 3.8%
Logical, And, Or, 1.6%
Floating Divide, 1.5%
Fixed-Point Multiply, 0.6%
Fixed-Point Divide , 0.2%

See also, [Jain91] R. Jain, "The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling", Wiley-Interscience, New York, NY, April 1991.
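
A mix like this is typically used as a set of weights for an average instruction time. Here is a minimal Python sketch of that calculation. The percentages are the Gibson mix figures listed above, but the per-instruction costs passed in are hypothetical inputs, not measured values, and the function name average_cost is made up for this example.

```python
# Gibson mix: instruction type -> percentage of use (figures from the list above).
GIBSON_MIX = {
    "Load and store": 31.2,
    "Indexing": 18.0,
    "Branches": 16.6,
    "Floating Add and Subtract": 6.9,
    "Fixed-Point Add and Subtract": 6.1,
    "Instructions not using registers": 5.3,
    "Shifting": 4.4,
    "Compares": 3.8,
    "Floating Multiply": 3.8,
    "Logical, And, Or": 1.6,
    "Floating Divide": 1.5,
    "Fixed-Point Multiply": 0.6,
    "Fixed-Point Divide": 0.2,
}


def average_cost(mix, costs, default_cost=1.0):
    """Weighted average cost per instruction.

    costs maps instruction type -> relative cost (hypothetical values);
    any type missing from costs is charged default_cost.
    """
    total_weight = sum(mix.values())
    weighted = sum(pct * costs.get(name, default_cost) for name, pct in mix.items())
    return weighted / total_weight


if __name__ == "__main__":
    # Assumed example: a fixed-point divide costing 40 units, everything else 1.
    print(average_cost(GIBSON_MIX, {"Fixed-Point Divide": 40.0}))
```

With this mix, even a divide that costs 40 times as much as everything else raises the average by less than 8%, which is very different from a test where division is one operation in every four.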

Besides being very old, the workload mix is only half the story. There are different addressing modes, different caching scenarios, differences in the data being used (e.g. all zeros), different options in the CPU for floating point precision, where the data is in RAM, whether it is accessed sequentially, whether the data is aligned, and many other factors.

More research turned up this work from the 60's and 70's, known as the "ADP Mix". It was produced by the UK's Treasury Technical Support Unit (TSU). See, http://www.roylongbottom.org.uk/cpumix.htm

Instruction type, Percentage of use.
Fixed Point Add/Subtract, 31%
Fixed Point Multiply, 1.3%
Fixed Point Divide, 0.6%
Branch, 35%
Compare, 6.2%
Transfer 8 characters, 20.5%
Logical, 5.4%

I am guessing there is no floating point % mentioned because many years ago, all the floating point work was done in a separate FPU chip and not in the CPU.

One would think there must be something more recent. But I haven't found it yet.

wonderwrench
11-17-2011, 07:24 PM
Would it be possible to write a program to monitor the type of math used over time and save the results?

David (PassMark)
11-17-2011, 08:17 PM
Probably nearly impossible to do in real time. Modern CPUs might execute 12,000,000,000 instructions per second. So no program is going to be able to accumulate this much data in real time. Sampling or static analysis is maybe a better option. But either way it is a lot of work. Would also be highly task dependent. So an algorithm to search for prime numbers might use a lot of division. But sorting strings might use none at all.
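
To show how task dependent it is, here is a small Python sketch that counts operations by hand for the two examples above. This is manual instrumentation of toy algorithms, not sampling or static analysis of real machine code, and the function names are made up for this example.

```python
def primes_below(limit):
    """Trial-division prime search. Returns (primes, number of divisions)."""
    primes, divisions = [], 0
    for n in range(2, limit):
        is_prime = True
        d = 2
        while d * d <= n:
            divisions += 1  # one remainder (division) operation
            if n % d == 0:
                is_prime = False
                break
            d += 1
        if is_prime:
            primes.append(n)
    return primes, divisions


def sort_strings(words):
    """Insertion sort on strings. Returns (sorted list, number of comparisons).

    Only string comparisons are needed; there is no division anywhere.
    """
    out, comparisons = [], 0
    for w in words:
        i = len(out)
        while i > 0:
            comparisons += 1
            if out[i - 1] <= w:
                break
            i -= 1
        out.insert(i, w)
    return out, comparisons


if __name__ == "__main__":
    primes, divisions = primes_below(1000)
    print(len(primes), "primes found using", divisions, "divisions")
    ordered, comparisons = sort_strings(["pear", "apple", "fig"])
    print(ordered, "sorted using", comparisons, "comparisons and 0 divisions")
```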