[Eng][help needed] Calculator benchmarks

→ **MyCalcs** profile · by **pier4r** » 11 Sep 2013, 16:22

I recently moved the data for the "Calculator add loop" benchmark to a wiki page, here: http://www.wiki4hp.com/doku.php?id=benchmarks:addloop .

Since the original benchmark had not been updated after 2011, I searched for new results and found that:
- no one has run it on an Nspire.
- Previous results for TI calculators (like the ti89) are of limited use, since the for loop was not used. For example, the ti89 has a score of 9400 while the hp50g has a score of 31000 (using a for loop).

So, is anyone willing to do this benchmark and report the results?

The format is:

Code: Select all
- Calculator used and firmware/software
- The count after 60 seconds of execution
- The program code used.

For further comparisons there is another benchmark (just designed): http://www.wiki4hp.com/doku.php?id=benc ... ddlesquare . Even for this any result will be appreciated.

Thanks a lot, and sorry if this is not the "right" section; I don't know this forum, but it appears to be the only one still "alive" for TI calculators.

edit: the community has just asked for a simpler benchmark (the middle square one seems not so clear). Would you mind also running this one: http://www.wiki4hp.com/doku.php?id=benchmarks:ultranaiveprimes ?

The code is:

Code: Select all
input: n
--
for k:=3 to n do {
  for j:=2 to k-1 do {
    if ( k mod j == 0 ) then {
      j:= k-1 //so we exit from the inner for
    }
  }
}

The result format is:

Code: Select all
A result is composed of the following list:
- the device used plus the language used, any overclock, any custom firmware and so on.
- time elapsed for a given n in seconds (see below)
- the code used.
If the calculator is too slow, or too limited, to compute a given n, then report "for n the computation takes too much time".
Conversely, if the calculator is too fast to compute a given n, then report "for n the computation takes too little time, i skipped it"

The options are

Code: Select all
n:= 100
n:= 1000
For very fast implementations:
n:= 10000
n:= 100000
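For anyone porting the ultra-naive primes pseudocode to a native toolchain, here is a minimal C sketch of the same two loops. The function name `ultra_naive_primes` and the returned count are assumptions added for illustration: the pseudocode only loops and reports nothing. Returning a count gives the loops an observable result, so an optimizing compiler cannot simply discard them.

```c
#include <assert.h>
#include <stdint.h>

/* Ultra-naive prime scan: for each k in 3..n, trial-divide by 2..k-1 and
 * stop at the first divisor found.
 * The return value (how many k had no divisor) is not part of the
 * original pseudocode; it is added here so the result can be checked. */
uint32_t ultra_naive_primes(uint32_t n)
{
    assert(n >= 3);

    uint32_t primes_found = 0;
    for (uint32_t k = 3; k <= n; k++) {
        int has_divisor = 0;
        for (uint32_t j = 2; j < k; j++) {
            if (k % j == 0) {
                has_divisor = 1;
                break;  /* same effect as "j := k-1" in the pseudocode */
            }
        }
        if (!has_divisor) {
            primes_found++;
        }
    }
    return primes_found;
}
```

Timing is left to the caller, since every platform has its own clock source; one would read the clock before and after a call such as `ultra_naive_primes(1000)` and report the difference in seconds.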

→ **MyCalcs** profile · by **Lionel Debroux** » 12 Sep 2013, 05:53

Hi Pier

We'll see what we can do, on TI-68k/AMS, in BASIC and C with GCC4TI, and on Nspire in BASIC, Lua and Ndless.

→ **MyCalcs** profile · by **pier4r** » 12 Sep 2013, 07:46

thanks!

Anyway, for further comparisons, all the benchmarks gathered on the wiki are here:
http://www.wiki4hp.com/doku.php?id=benc ... p&do=index

→ **MyCalcs** profile · by **pier4r** » 12 Sep 2013, 15:34

A little "up": I have added the code for a simpler benchmark (along with the "addloop" one). The Casio guys on the Casio forum are on a rampage; the Prizm is really fast.

→ **MyCalcs** profile · by **Lionel Debroux** » 12 Sep 2013, 20:57

For addloop, I think that I'm going to make something along the lines of the HP-50g C benchmark, currently ranked #3 on your page, because that one seems reasonably fair.

I have suggestions for the benchmark as it currently stands. To sum up, it's not your fault, but I find it generally lacking a number of basic rules and information. In more detail, I mean:
1) there's already some information about OS versions, toolchains and clock speeds, but for native code programs, no information about the compiler flags, which matter a lot, as you're aware;
2) no telling whether the native code programs are forced to store and re-read the iteration variable from memory (or increment directly in memory, for the processors which can) upon each iteration, instead of keeping it in a register, which will usually make them at least 3x slower. The HP-50g benchmark doesn't store to / re-read memory (no "volatile" qualifier), but the vast, vast majority of implementations of the benchmark do, due to being written in interpreted languages which store to actual language-level variables. In fact, for native code programs, having both benchmark types would make sense;
3) no explicit telling whether all tricks are allowed to paint one's favorite calculator under a more favorable light. For instance, native code programs could skew results by disabling interrupts, which interpreted programs cannot do. Usage of interrupts, which belong to this category, is untold as well: for instance, on the TI-68k series, through interrupts, the incrementation loop could do entirely without checking any stop condition, whether a time-based condition (OS software timers, though nobody would do that because it would skew results by more than an order of magnitude, or changing the rate of the programmable timer + using one's own interrupt handler) or pressing ON (which has an interrupt on the TI-68k series), and it would therefore be faster. Here again, for native code programs, it would probably be desirable to have both standard versions, and "all tricks" versions.

I'm fully aware that centralizing information and discussing on message boards is time-consuming work, and I don't want to sound discouraging, but I felt I'd submit some of my thoughts for improvement

Also, the fx-9860g benchmark, ranked #1, is suspicious. There's no conceivable reason for the fx-9860g to be faster than the HP-50g is. IMO, chances are good that it's made artificially (though probably involuntarily) fast due to compiler optimization. Indeed, when optimization is enabled, any well-behaved recent compiler will not only compute at compile time the loop which increments the "counter" variable, but also, simply erase it from the generated code because its result is used nowhere...

→ **MyCalcs** profile · by **pier4r** » 12 Sep 2013, 21:35

Lionel Debroux wrote:For addloop, I think that I'm going to make something along the lines of the HP-50g C benchmark, currently ranked #3 on your page, because that one seems reasonably fair.

Thanks!

I have suggestions for the benchmark as it currently stands. To sum up, it's not your fault, but I find it generally lacking a number of basic rules and information.

I agree! Nevertheless, at least for "not so tricky" submissions, they give a general idea of the rough power of a device using a specific language.
For example, pick the addloop bench and the ultranaiveprimes one. Both are simple, but addloop is extremely simple, while ultranaive uses some "complex" operations.
Now, the hp50g with Saturn ASM scores about as much as the HP Prime in the addloop test and, surprisingly, they score similarly in the ultranaive test as well (same order of magnitude and so on). So a "general" idea can be extracted, IMO, from these simple tests.

In more detail, I mean:
1) there's already some information about OS versions, toolchains and clock speeds, but for native code programs, no information about the compiler flags, which matter a lot, as you're aware;
2) no telling whether the native code programs are forced to store and re-read the iteration variable from memory (or increment directly in memory, for the processors which can) upon each iteration, instead of keeping it in a register, which will usually make them at least 3x slower. The HP-50g benchmark doesn't store to / re-read memory (no "volatile" qualifier), but the vast, vast majority of implementations of the benchmark do, due to being written in interpreted languages which store to actual language-level variables. In fact, for native code programs, having both benchmark types would make sense;
3) no explicit telling whether all tricks are allowed to paint one's favorite calculator under a more favorable light. For instance, native code programs could skew results by disabling interrupts, which interpreted programs cannot do. Usage of interrupts, which belong to this category, is untold as well: for instance, on the TI-68k series, through interrupts, the incrementation loop could do entirely without checking any stop condition, whether a time-based condition (OS software timers, though nobody would do that because it would skew results by more than an order of magnitude, or changing the rate of the programmable timer + using one's own interrupt handler) or pressing ON (which has an interrupt on the TI-68k series), and it would therefore be faster. Here again, for native code programs, it would probably be desirable to have both standard versions, and "all tricks" versions.

1. agree
2. agree
3. agree

But users have limited time, so the motto here is "it's better than nothing" (because we assume that, in general, these tests are consistent, as I said above).

I'm fully aware that centralizing information and discussing on message boards is time-consuming work, and I don't want to sound discouraging, but I felt I'd submit some of my thoughts for improvement

Don't worry; on the contrary, it is really important to point this information out.

Also, the fx-9860g benchmark, ranked #1, is suspicious. There's no conceivable reason for the fx-9860g to be faster than the HP-50g is. IMO, chances are good that it's made artificially (though probably involuntarily) fast due to compiler optimization. Indeed, when optimization is enabled, any well-behaved recent compiler will not only compute at compile time the loop which increments the "counter" variable, but also, simply erase it from the generated code because its result is used nowhere...

I know that but... how on the earth the compiler will know the value after 60 seconds? Anyway yes, it looks suspicious, but there is a simple solution: who looks suspicious, for the reader, isn't counted by the reader himself.

Now....unleash your Texas instruments! (I'm asking the whole forum.) I'm still stunned by the performance of the Casio Prizm. It looks so "simple", yet it is a beast (it is way faster than a 600 MHz phone, even if the latter used a scripting language) with really simple C code!

→ **MyCalcs** profile · by **pier4r** » 13 Sep 2013, 07:33

A small update: one kind user on the Cemetech forum has run the summation test on the TI-89 (TI-BASIC only).

I expected to see values comparable to the hp50g with plain userRPL; instead it is comparable to the old 48gx.

→ **MyCalcs** profile · by **Lionel Debroux** » 13 Sep 2013, 08:03

I know that but... how on the earth the compiler will know the value after 60 seconds?

It doesn't know the value after 60 seconds, but it knows the value at the end of the loop, which is written in the code. For years, optimizing compilers have been able to recognize a number of loop idioms, especially such simple ones as

Code: Select all: do { counter++; } while (counter < 349700000);

Such code is turned into

Code: Select all: counter = 349700000;

by optimizing compilers; then, Dead Store Elimination will erase this assignment and the counter variable, since it's not used later.
Unless the compiler used for the fx-9860g absolutely stinks, or the benchmark is compiled without optimization, the program should print "end" immediately.
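To make the difference concrete, here is a minimal, portable C sketch of the two loop flavours discussed in this thread. The function names `count_plain` and `count_volatile` are hypothetical, chosen for illustration; they are not taken from any of the benchmark submissions. With optimization enabled, a compiler is free to reduce the first function to "return limit", while the `volatile` qualifier in the second one forces a real load and store on every iteration.

```c
#include <assert.h>
#include <stdint.h>

/* Counter kept in an ordinary local variable: an optimizing compiler may
 * recognize the idiom and replace the whole loop with "counter = limit". */
uint32_t count_plain(uint32_t limit)
{
    assert(limit >= 1);  /* the do/while body always runs at least once */

    uint32_t counter = 0;
    do {
        counter++;
    } while (counter < limit);
    return counter;
}

/* Counter declared volatile: each iteration must read it from memory and
 * write it back, so the loop survives optimization. */
uint32_t count_volatile(uint32_t limit)
{
    assert(limit >= 1);

    volatile uint32_t counter = 0;
    do {
        counter++;
    } while (counter < limit);
    return counter;
}
```

Both functions return the same value; only the time they take differs. Comparing the generated assembly of the two (for instance with the compiler's `-S` option) is a quick way to check whether a given benchmark build still contains a loop at all.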

Anyway yes, it looks suspicious, but there is a simple solution: who looks suspicious, for the reader, isn't counted by the reader himself.

If the #1 spot in the benchmark is a fluke (which remains to be confirmed), it would reduce the benchmark's credibility.

Now....unleash your Texas instruments!

I wrote I would, so here are a couple of TI-68k/AMS C programs, made yesterday evening and this morning.

NOTE: building them requires GCC4TI, they won't compile with the older, unmaintained and much harder to install TIGCC:

1) File addloop_register_polling.c:

Code: Select all

    // addloop_register_polling.c: optimize counting to the maximum, through keeping the value in a register and writing the main loop in ASM, so as to avoid compiler pessimizations.

    #define MIN_AMS 101
    #define USE_TI89
    #define USE_TI92P
    #define USE_V200
    #define USE_TI89T
    #define NO_CALC_DETECT
    #define OPTIMIZE_ROM_CALLS
    #define RETURN_VALUE

    #include <stdint.h>
    #include <system.h>
    #include <args.h>
    #include <estack.h>
    #include <peekpoke.h>
    #include <intr.h>

    #define TIMER_START_VAL (100000UL)

    void _main(void) {
        uint32_t i = 0; // We don't want to
        short orig_rate = PRG_getRate();
        unsigned short orig_start = PRG_getStart();
        unsigned char * ON_key_status = (unsigned char *)0x60001A;
        unsigned long val = 0;

        // Make the system timer an order of magnitude more precise;
        // NOTE: this code assumes a HW2+ TI-68k, i.e. anything since 1999.
        PRG_setRate(1); // Increment counter at a rate of 2^19/2^9 Hz
        PRG_setStart(0xCE); // Trigger the interrupt every 257 - 0xCE = 51 increments ~ 20.07 Hz.

        // The PRG_getStart() above effectively waited for the interrupt to trigger, so we don't need another wait.
        /*OSRegisterTimer(USER_TIMER, 1);
        while (!OSTimerExpired(USER_TIMER));
        OSFreeTimer(USER_TIMER);*/
        OSRegisterTimer(USER_TIMER, TIMER_START_VAL);

        // Main loop :)
        // The assembly snippet is the equivalent of
        /*
        do {
            i++;
        } while (*(volatile unsigned char *)ON_key_status & 2);
        */
        // but it lets no compiler pessimization, such as constant-propagating the ON_key_status variable away (sigh), occur.
        // "+d": i is both read and written by the snippet, so it is a read-write data register operand.
        asm volatile("lloop:\n"
                     " addq.l #1, %0\n"
                     " btst.b #1, (%1)\n"
                     " bne.s lloop\n"
                     : "+d"(i)
                     : "a"(ON_key_status));

        // Retrieve timer value.
        val = TIMER_START_VAL - OSTimerCurVal(USER_TIMER);
        OSFreeTimer(USER_TIMER);

        // Give some time for the ON key to come back up.
        OSRegisterTimer(USER_TIMER, 4);
        while (!OSTimerExpired(USER_TIMER));
        OSFreeTimer(USER_TIMER);
        OSClearBreak();

        // Push arguments onto the RPN stack: clean arguments up, then create a list.
        while (GetArgType (top_estack) != END_TAG) {
            top_estack = next_expression_index (top_estack);
        }
        top_estack--;
        push_END_TAG();
        push_longint(i);
        push_longint(val);
        push_LIST_TAG();

        // Restore old system state.
        PRG_setRate(orig_rate);
        PRG_setStart(orig_start);
    }

2) File addloop_memory_polling.c:

Code: Select all

    // addloop_memory_polling.c: don't optimize counting that much, through "volatile" which triggers three instructions instead of just one for dealing with memory and an address which gets constant-propagated instead of being kept in a register.

    #define MIN_AMS 101
    #define USE_TI89
    #define USE_TI92P
    #define USE_V200
    #define USE_TI89T
    #define NO_CALC_DETECT
    #define OPTIMIZE_ROM_CALLS
    #define RETURN_VALUE

    #include <stdint.h>
    #include <system.h>
    #include <args.h>
    #include <estack.h>
    #include <peekpoke.h>
    #include <intr.h>

    #define TIMER_START_VAL (100000UL)

    void _main(void) {
        volatile uint32_t i = 0;
        short orig_rate = PRG_getRate();
        unsigned short orig_start = PRG_getStart();
        volatile unsigned char * ON_key_status = (volatile unsigned char *)0x60001A;
        unsigned long val = 0;

        // Make the system timer an order of magnitude more precise;
        // NOTE: this code assumes a HW2+ TI-68k, i.e. anything since 1999.
        PRG_setRate(1); // Increment counter at a rate of 2^19/2^9 Hz
        PRG_setStart(0xCE); // Trigger the interrupt every 257 - 0xCE = 51 increments ~ 20.07 Hz.

        // The PRG_getStart() above effectively waited for the interrupt to trigger, so we don't need another wait.
        /*OSRegisterTimer(USER_TIMER, 1);
        while (!OSTimerExpired(USER_TIMER));
        OSFreeTimer(USER_TIMER);*/
        OSRegisterTimer(USER_TIMER, TIMER_START_VAL);

        // Main loop :)
        // Let compiler pessimizations inherent to "volatile", such as:
        // * reading and writing i in memory instead of incrementing it directly;
        // * constant-propagating the ON_key_status variable away.
        // occur.
        do {
            i++;
        } while (*ON_key_status & 2);

        // Retrieve timer value.
        val = TIMER_START_VAL - OSTimerCurVal(USER_TIMER);
        OSFreeTimer(USER_TIMER);

        // Give some time for the ON key to come back up.
        OSRegisterTimer(USER_TIMER, 4);
        while (!OSTimerExpired(USER_TIMER));
        OSFreeTimer(USER_TIMER);
        OSClearBreak();

        // Push arguments onto the RPN stack: clean arguments up, then create a list.
        while (GetArgType (top_estack) != END_TAG) {
            top_estack = next_expression_index (top_estack);
        }
        top_estack--;
        push_END_TAG();
        push_longint(i);
        push_longint(val);
        push_LIST_TAG();

        // Restore old system state.
        PRG_setRate(orig_rate);
        PRG_setStart(orig_start);
    }

3) Build script - all flags but -O3 reduce size but have no effect on code generation for the main loop:

Code: Select all

    tigcc -v -O3 -Wall -W -mpcrel --optimize-code --cut-ranges --reorder-sections --remove-unused --merge-constants -fmerge-all-constants -Wa,--all-relocs -Wa,-l -fverbose-asm -save-temps -o addloop1 addloop_register_polling.c
    tigcc -v -O3 -Wall -W -mpcrel --optimize-code --cut-ranges --reorder-sections --remove-unused --merge-constants -fmerge-all-constants -Wa,--all-relocs -Wa,-l -fverbose-asm -save-temps -o addloop2 addloop_memory_polling.c

4) Results on 89T HW4 running AMS 3.10 patched with my tiosmod+amspatch, the first element of each list being the number of timer ticks at (2^19/2^9)/51 ~ 20.07 Hz and the second element being the value of the counter when ON is pressed:
* addloop1 (addloop_register_polling): {1203, 24700949} {1237, 25423732} {1211, 24846885} (very coherent with each other)
* addloop2 (addloop_memory_polling): {1206, 9769092} {1214, 9827570} (again, coherent with each other)

Comments:
* the main loop is a tiny code snippet buried into the rest of accuracy-increasing measures and dealing with the consequences of pressing the ON key;
* the main loop in addloop1 is a 1:1 copy of that of the HP-50g benchmark, and shows the 89T is between 6x and 7x slower than the HP-50g, which is easily explained, as I posted on Cemetech;
* the main loop in addloop2 is closer to interpreted languages, since at least, the variable is read from + written to memory, and it shows ~2.5x slowdown.
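As a cross-check of the figures above, the tick counts can be converted into seconds and loop iterations per second. This is a minimal sketch in plain C, assuming the timer settings stated in the code comments (a 2^19/2^9 = 1024 Hz base rate and an interrupt every 257 - 0xCE = 51 counts); the macro and function names are hypothetical helpers and are not part of the GCC4TI library.

```c
#include <assert.h>

/* Timer settings taken from the comments in the programs above:
 * base rate 2^19 / 2^9 = 1024 Hz, one tick every 257 - 0xCE = 51 counts. */
#define PRG_BASE_HZ (524288.0 / 512.0)
#define PRG_PERIOD  (257 - 0xCE)

/* Convert a tick count reported by the benchmark into seconds. */
double ticks_to_seconds(unsigned long ticks)
{
    return (double)ticks * PRG_PERIOD / PRG_BASE_HZ;
}

/* Loop iterations per second for one {ticks, count} result pair. */
double iterations_per_second(unsigned long ticks, unsigned long count)
{
    assert(ticks > 0);
    return (double)count / ticks_to_seconds(ticks);
}
```

With these helpers, `ticks_to_seconds(1203)` comes out just under 60 seconds, and dividing `iterations_per_second(1203, 24700949)` by `iterations_per_second(1206, 9769092)` gives a ratio of roughly 2.5, which matches the slowdown quoted for addloop2.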

→ **MyCalcs** profile · by **pier4r** » 13 Sep 2013, 09:28

Just a quick reply, then i'll add your result.

Lionel Debroux wrote:
I know that but... how on the earth the compiler will know the value after 60 seconds?

It doesn't know the value after 60 seconds, but it knows the value at the end of the loop, which is written in the code. For years, optimizing compilers have been able to recognize a number of loop idioms, especially such simple ones as
Code: Select all
do { counter++; } while (counter < 349700000);

Such code is turned into
Code: Select all
counter = 349700000;

by optimizing compilers; then, Dead Store Elimination will erase this assignment and the counter variable, since it's not used later.
Unless the compiler used for the fx-9860g absolutely stinks, or the benchmark is compiled without optimization, the program should print "end" immediately.

That's right! I didn't see the while (counter < 349700000)! I simply skipped over it, thinking it was a "while until getkey something" loop.
I'll report your observations as well.

Added!

→ **MyCalcs** profile · by **Lionel Debroux** » 13 Sep 2013, 10:52

Thanks.

Another odd benchmark result is "4. Casio fx-CG 10 PRIZM, OS version 01.04.3200, C PrizmSDK". The speed of the Prizm C benchmark should be close enough to the speed of the HP-50g C benchmark, significantly faster than the TI-68k C benchmarks. Looking at the code, it's due to the keyboard checking code. Declaring keyupdate(), keydownlast() (keydownhold() is unused) "static inline" should provide a performance boost.

EDIT: Savage benchmark, for TI-68k/AMS/GCC4TI.

1) File savage.c

Code: Select all

    // savage.c: Savage benchmark

    #define MIN_AMS 101
    #define USE_TI89
    #define USE_TI92P
    #define USE_V200
    #define USE_TI89T
    #define NO_CALC_DETECT
    #define OPTIMIZE_ROM_CALLS
    #define RETURN_VALUE

    #include <stdint.h>
    #include <system.h>
    #include <args.h>
    #include <estack.h>
    #include <intr.h>
    #include <timath.h>

    #define TIMER_START_VAL (100000UL)

    /*
    5 RADIANS
    10 A=1
    20 FOR I=1 TO 2499
    30 A=TAN(ATN(EXP(LOG(SQR(A*A)))))+1
    40 NEXT I
    50 PRINT A
    */

    void _main(void) {
        uint16_t i;
        short orig_rate = PRG_getRate();
        unsigned short orig_start = PRG_getStart();
        unsigned long val = 0;
        double a = 1;

        // Make the system timer an order of magnitude more precise;
        // NOTE: this code assumes a HW2+ TI-68k, i.e. anything since 1999.
        PRG_setRate(1); // Increment counter at a rate of 2^19/2^9 Hz
        PRG_setStart(0xCE); // Trigger the interrupt every 257 - 0xCE = 51 increments ~ 20.07 Hz.

        // The PRG_getStart() above effectively waited for the interrupt to trigger, so we don't need another wait.
        /*OSRegisterTimer(USER_TIMER, 1);
        while (!OSTimerExpired(USER_TIMER));
        OSFreeTimer(USER_TIMER);*/
        OSRegisterTimer(USER_TIMER, TIMER_START_VAL);

        // Main loop :)
        for (i = 1; i < 2500; i++) {
            a = tan(atan(exp(log(sqrt(a * a))))) + 1;
        }

        // Retrieve timer value.
        val = TIMER_START_VAL - OSTimerCurVal(USER_TIMER);
        OSFreeTimer(USER_TIMER);

        // Push arguments onto the RPN stack: clean arguments up, then create a list.
        while (GetArgType (top_estack) != END_TAG) {
            top_estack = next_expression_index (top_estack);
        }
        top_estack--;
        push_END_TAG();
        push_Float(a); // Note: rounds to 14 digits.
        push_longint(val);
        push_LIST_TAG();

        // Restore old system state.
        PRG_setRate(orig_rate);
        PRG_setStart(orig_start);
    }

2) Compiler invocation

Code: Select all: tigcc -v -O3 -Wall -W -mpcrel --optimize-code --cut-ranges --reorder-sections --remove-unused --merge-constants -fmerge-all-constants -Wa,--all-relocs -Wa,-l -fverbose-asm -save-temps -o savage savage.c

3) Results on 89T HW4 AMS 3.10 patched with tiosmod+amspatch: {1952, 2500.0000025271}, {1951, 2500.0000025271}, i.e. ~1'37".
Examining the full 16-digit precision of the BCD floats in the debugger shows 2500.000002527092.