
Thread: 64bit vs 32bit benchmarks & integer maths & PT8

  1. #1
    Join Date
    Jan 2003
    Location
    Sydney Australia
    Posts
    4,967

    Lightbulb 64bit vs 32bit benchmarks & integer maths & PT8

    Six years ago we took a look at 64bit benchmarking and provided some examples of why 64bit can give better performance than 32bit.

    What we found at the time was that a 64bit CPU, running a 64bit O/S, executing 64bit code could in some cases be twice as quick as 32bit code.

    We are now at the point where we are doing research into PerformanceTest V8, and since the initial study many new CPUs have been released and the difference between 32bit/64bit performance has grown. In PerformanceTest V7 some CPUs now get up to 4 - 6 times the performance in 64bit integer maths, compared to 32bit.

    Six times the performance is an enormous difference. So we decided to dig a bit deeper to see what was going on.

    In PerformanceTest V7 the integer maths test is made up of 8 individual mathematical operations performed in equal numbers. These are:
    1. Addition of two 32bit numbers
    2. Subtraction of two 32bit numbers
    3. Multiplication of two 32bit numbers
    4. Division of two 32bit numbers
    5. Addition of two 64bit numbers
    6. Subtraction of two 64bit numbers
    7. Multiplication of two 64bit numbers
    8. Division of two 64bit numbers


    The first four operations are, unsurprisingly, executed in the same way and at the same speed on a 32bit machine and a 64bit machine.

    The second four 64bit operations are executed at a much quicker speed on a 64bit machine. See the above referenced post for details.

    What we have found in more recent testing however is also interesting.

    The first interesting point is that division of 32bit numbers is pretty much always around four times slower than Add, Subtract or Multiply. This isn't news, as it is well known that division is a harder operation to do.

    What was more interesting was that 64bit division was far slower than 32bit division. And doing 64bit division on a 32bit system was extremely expensive, showing a fourteen-fold performance drop going to 64bit numbers. This, more than anything else, accounts for why the PerformanceTest V7 integer maths test does so well on 64bit compared to 32bit.

    The second interesting point is that some of the newest CPUs have become significantly faster at 64bit division. For example the AMD A8-3850 & A6-3650. This has really lifted their results in this test.

    The lessons in this (for us) are that the V7 integer maths test places too much weight on the speed at which 64bit division can be performed. The same is also true, to a lesser extent, for 32bit division and multiplication. More weight should be given to the other operations, Add, Subtract, etc. This would moderate the differences between CPU types and also reduce the large differences between 32bit and 64bit.

    Because the division operation is so slow compared to the other operations, and the weighting of each operation was equal, the V7 integer maths test has become largely a test of how fast the CPU can do division, making it a rather narrow and unrealistic test of a CPU.

    So in V8 we plan to reduce the number of division operations performed and also introduce some additional variety into the test in the form of logic operations like bit shifting & increment instructions.

    Here are the actual V7 numbers from an Intel X9650 CPU, running both 32bit code and 64bit code. Higher numbers are better.


    You might be wondering just why doing 64bit division on a 32bit system is so slow. The reason is twofold: (a) there is no native machine code instruction for dealing with 64bit numbers on a 32bit system, and (b) the calculations to perform a 64bit division are rather complex on a 32bit system. Each division takes the CPU dozens of steps to complete.

    Update: Here is a link to the PT8 development thread.

  2. #2
    Join Date
    Oct 2007
    Posts
    237

    Default

    How much 64 bit Division is done in the real world? If it's close to equaling the amount of 64 bit Addition, Subtraction and Multiplication, should 64 bit Division be weighted down?
    Main Box*i7 4790K*Z97 Extreme 6*2x8 gigs Crucial BS DDR3-1600*Samsung 850 Pro 128 gig*WD Black WD2002FAEX 2 TB*ASUS DRW-24B3LT*LG HL-DT-ST BD-RE*Corsair AX750 PSU*Fractal Design Define R4 case*Windows 7 Pro 64 bit SP1

  3. #3
    Join Date
    Jan 2003
    Location
    Sydney Australia
    Posts
    4,967

    Default

    We don't have hard stats, and it would vary from one application to the next. But we are thinking that multiply and add are significantly more common.

    Some research on Google turned up the "Gibson mix", which was based on research done by Jack C. Gibson in 1959 on an IBM 704 system running scientific applications. (Yes, from 1959!!!)

    Instruction type, Percentage of use.
    Load and store, 31.2%
    Indexing, 18.0%
    Branches, 16.6%
    Floating Add and Subtract, 6.9%
    Fixed-Point Add and Subtract, 6.1%
    Instructions not using registers, 5.3%
    Shifting, 4.4%
    Compares, 3.8%
    Floating Multiply, 3.8%
    Logical, And, Or, 1.6%
    Floating Divide, 1.5%
    Fixed-Point Multiply, 0.6%
    Fixed-Point Divide, 0.2%

    See also, [Jain91] R. Jain, "The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling", Wiley-Interscience, New York, NY, April 1991.

    Besides being very old, the workload mix is only half the story. There are different addressing modes, different caching scenarios, differences in the data being used (e.g. all zeros), different options in the CPU for floating point precision, where the data is in RAM, whether it is accessed sequentially, whether the data is aligned, and many other factors.

    More research turned up this work from the 60's and 70's, known as the "ADP Mix". It was produced by the UK Treasury's Technical Support Unit (TSU). See, http://www.roylongbottom.org.uk/cpumix.htm

    Instruction type, Percentage of use.
    Fixed Point Add/Subtract 31%
    Fixed Point Multiply 1.3%
    Fixed Point Divide 0.6%
    Branch 35%
    Compare 6.2%
    Transfer 8 characters 20.5%
    Logical 5.4%

    I am guessing there is no floating point % mentioned because many years ago, all the floating point work was done in a separate FPU chip and not in the CPU.

    One would think there must be something more recent. But I haven't found it yet.

  4. #4
    Join Date
    Oct 2007
    Posts
    237

    Default

    Would it be possible to write a program to monitor the type of math used over time and save the results?

  5. #5
    Join Date
    Jan 2003
    Location
    Sydney Australia
    Posts
    4,967

    Default

    Probably nearly impossible to do in real time. Modern CPUs might execute 12,000,000,000 instructions per second, so no program is going to be able to accumulate this much data in real time. Sampling or static analysis might be a better option, but either way it is a lot of work. It would also be highly task dependent: an algorithm to search for prime numbers might use a lot of division, but sorting strings might use none at all.

  6. #6
    Join Date
    Jan 2003
    Location
    Sydney Australia
    Posts
    4,967

    Default

    Now that PerformanceTest V8 has been launched we have some stats about how the rebalancing of results has impacted the integer maths score.

    PerformanceTest V7
    ===============
    AMD A10-5800K: 4,050 Millions of integer maths operations per sec (MOps).
    AMD Phenom II x4 965: 680 MOps
    Intel i7-2600: 2,780 MOps

    PerformanceTest V8
    ===============
    AMD A10-5800K: 11,750 MOps.
    AMD Phenom II x4 965: 6,350 MOps
    Intel i7-2600: 16,670 MOps

    % Difference between V7 and V8
    =========================
    AMD A10-5800K: 190% increase
    AMD Phenom II x4 965: 833% increase
    Intel i7-2600: 500% increase

    Conclusion
    ========
    All CPUs did more operations per second in the new PT8 test. This makes sense as the instruction mix is now weighted to faster instructions. Bitwise operations are especially quick compared to division. Some CPUs will benefit a lot from this change, others only slightly. This will move the relative rankings around in the PT8 charts compared to the PT7 charts. The Phenom should move up a bit, the A10-5800K down a bit.

    The old PT7 prime number test also used the square root function very heavily. Internally we think the square root function did a lot of division as well. In PT8 this function has been replaced with the Sieve of Atkin, which is a lot more efficient and uses a broader range of CPU instructions. This has caused further reshuffling of the relative CPU rankings.


    What's in the new integer maths test
    ============================
    The new test uses a lot less division and a broader set of CPU instructions. The test is a mix of 32bit and 64bit instructions and performs the following operations: Addition, Subtraction, Multiplication, Division, Bitwise Shift, Bitwise boolean AND, Bitwise boolean OR, Bitwise boolean XOR, Bitwise boolean NOT. Weightings have also changed so that division only makes up ~1.5% of the new test.

  7. #7
    Join Date
    Oct 2012
    Posts
    1

    Default

    Hi Dave,

    Isn't the V8 relative ranking quite weird when compared to V7?

    The A10-5800 is not down a bit, it has gone from 45% more MOps versus i7-2600 to 30% less. On the overall CPU score, it has gone from 7500 to 5200, a 30% reduction.

    In comparison, the i5-3550 score has gone from 7470 to 7080, a 5% reduction.

    Bottom line: A10-5800 was equivalent to i5-3550, now it is equivalent to i3-3225.

    Is this supposed to happen? V7 seems to be more related to real world performance than V8.


    Best regards,
    Eduardo

  8. #8
    Join Date
    Jan 2003
    Location
    Sydney Australia
    Posts
    4,967

    Default

    We can compare the V8 results to other 3rd party benchmarks.

    Here is the comparison over at Anandtech between the A10-5800K and the i3-3220 with ~20 different applications.
    http://www.anandtech.com/bench/Product/675?vs=677#
    Broadly speaking these two CPUs are similar in most of the selected applications.

    Here is the comparison between the A10-5800K and the i5-2500K.
    http://www.anandtech.com/bench/Product/675?vs=288#
    The i5 is clearly a better chip in this selection of applications.

    Unfortunately Anandtech has only benchmarked a small selection of the available CPUs, so I couldn't use the exact model numbers you quoted. You could also make a valid argument that the selection of applications used at Anandtech is too GPU dependent for a pure CPU test. But I think the point still stands.

    • Our old V7 tests ended up being too weighted toward doing division (which the A10/A8 excelled at doing, for reasons we don't fully understand).
    • Our old V7 tests were all fully threaded, so single threaded performance wasn't factored in at all. This has now changed in V8. The A10-5800K doesn't do all that well in single threaded applications. It gets thrashed by the i5-3550, for example.
    • The AMD chips gave inconsistent performance due to a CPU bug. Further messing up our results.
    • The A10 needed to drop a bit in our V7 charts, and this has happened in the new charts.

  9. #9
    Join Date
    Jan 2013
    Posts
    4

    Default

    Do the 32- and 64-bit versions of the CPU test weigh the results of 32- and 64-bit operations differently? Because theoretically, a 32-bit version of a program would process more 32-bit numbers in places where a 64-bit version of the same program would be using 64-bit numbers (e.g. integers, memory addresses, etc.).

    If not, wouldn't the 64-bit results be somewhat skewed from real-world CPU performance, because the faster processing of those 64-bit numbers won't increase the speed of program execution vs. its 32-bit counterpart?

  10. #10
    Join Date
    Jan 2003
    Location
    Sydney Australia
    Posts
    4,967

    Default

    The weighting is the same.

    Both 32bit and 64bit applications process 32bit and 64bit variables.
    So 64bit applications commonly use 32bit values, and 32bit applications commonly use 64bit values.

    For program variables like integers, characters, floats, etc., the number of bits used is not determined by the CPU or the operating system, but by the programmer. So (in C/C++ on mainstream compilers) an int is 32bits, even in a 64bit application. By contrast an int64 is always 64bits, even in a 32bit application.

    So if the programmer needs to use a 64bit variable, it will result in a 2 to 5 fold speed penalty if the code is run on a 32bit system. This is what happens in real life and in our benchmark. The only difference is that some real applications use a lot of 64bit variables and others not so many, so the effect will vary from one application to the next. There is a more detailed examination of this 64bit speed difference in this old post.

