August 12, 2008
NOTE: The benchmarks and timings given below are a little outdated because this article was written back in 2008. Today’s CPUs are faster and memory bandwidth is also higher.
AmiBroker Formula Language (AFL), thanks to its array processing model, is able to run at the same speed as code written in assembler (i.e. machine code). The following article explains how.
AFL runs with native assembly speed when using array operations.
A simple array multiplication like this:
X = Close * H; // array multiplication (each array element is multiplied)
gets compiled by AmiBroker to just 8 assembly instructions:
mov edx,dword ptr [esp+58h]
inc esi ; increase counters
fld dword ptr [edx+esi*4-4] ; get element of close array
fmul dword ptr [eax+ecx-4] ; multiply by element of high array
fstp dword ptr [eax-4] ; store result
jl loop ; continue until all elements are processed
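To make the memory traffic of this loop easier to see, here is a rough C sketch of the same element-wise multiplication. It is only an illustration, not AmiBroker's actual source: the function name `multiply_arrays` and the parameter names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical C rendering of the element-wise loop; not AmiBroker's source.
   Each iteration performs two 4-byte reads and one 4-byte write. */
void multiply_arrays(const float *close_prices, const float *high_prices,
                     float *x, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* load one element, multiply by the other, store the result
           (the fld / fmul / fstp steps in the listing) */
        x[i] = close_prices[i] * high_prices[i];
    }
}
```

The loop body does almost no arithmetic per element, so its speed depends mostly on how quickly the three arrays can be streamed through memory.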
As you can see, there are three 4-byte memory accesses per loop iteration (two 4-byte reads and one 4-byte write).
With such a tight loop, a single processor core running an AFL formula is able to saturate memory bandwidth in the majority of common operations/functions if the total size of the arrays used in a given formula exceeds the data cache size.
On my (2-year-old) 2 GHz Athlon 64 X2, a single iteration of this loop takes 6 nanoseconds (see benchmark code below). That is 6 nanoseconds to process one bar of data, or about 166 million bars per second. So, during those 6 nanoseconds we have 8 bytes read and 4 bytes stored. That’s 8/(6e-9) bytes per second = 1333 MB per second read and 4/(6e-9) bytes per second = 667 MB per second write simultaneously, i.e. 2 GB/sec combined!
Now if you look at memory benchmarks you will see that 2 GB/s is the limit of system memory speed on the Athlon 64 (dual-channel DDR2).
And that’s considering the fact that the Athlon has an on-die integrated memory controller that is superior to Intel’s (HyperTransport).
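The throughput arithmetic above can be checked with a small C sketch. The helper names `mb_per_second` and `bars_per_second` are hypothetical, and MB here means 10^6 bytes, matching the figures in the text.

```c
/* Bytes moved per iteration divided by nanoseconds per iteration gives GB/s
   (1 byte/ns = 1e9 bytes/s); multiplying by 1000 converts that to MB/s. */
double mb_per_second(double bytes_per_iteration, double ns_per_iteration)
{
    return bytes_per_iteration / ns_per_iteration * 1000.0;
}

/* Number of loop iterations (bars) processed in one second. */
double bars_per_second(double ns_per_iteration)
{
    return 1e9 / ns_per_iteration;
}
```

With the 6 ns measurement, `mb_per_second(8, 6)` gives roughly 1333 MB/s of reads, `mb_per_second(4, 6)` roughly 667 MB/s of writes, and `bars_per_second(6)` roughly 166.7 million bars per second.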
// benchmark code
// for accurate results run it on LARGE arrays
// (intraday database, 1-minute interval, 50K bars or more)
GetPerformanceCounter( True ); // reset the counter before timing
for( k = 0; k < 1000; k++ ) X = C * H;
"Time per single iteration [s]=" + 1e-3 * GetPerformanceCounter() / ( 1000 * BarCount );
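The last line of the benchmark converts the counter reading, which GetPerformanceCounter reports in milliseconds, into seconds per processed bar. A small C sketch of that conversion follows. The function name `seconds_per_bar` is hypothetical, and the 300 ms reading in the note after it is an assumed figure for illustration, not a measured one.

```c
/* Mirrors the expression in the last line of the AFL benchmark:
   1e-3 * elapsed_ms / (iterations * bar_count).
   elapsed_ms is the counter reading in milliseconds. */
double seconds_per_bar(double elapsed_ms, double iterations, double bar_count)
{
    return 1e-3 * elapsed_ms / (iterations * bar_count);
}
```

For example, an assumed reading of 300 ms for 1000 passes over 50,000 bars works out to `seconds_per_bar(300, 1000, 50000)`, about 6e-9 s, i.e. 6 nanoseconds per bar.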
Only really complex operations that use *lots* of FPU (floating point) cycles, such as trigonometric (sin/cos/tan) functions, are slow enough for the memory to keep up.