[Flash] Benchmarking haXe

[Flash] Benchmarking haXe

Joe Dohn
Hi!

I've been trying to time tiny things like Std.int, or the use of modulo. The problem is that the haXe compiler is so good that it simplifies away what I put in my for loop, and my test becomes irrelevant.

What do you guys do when you want to see how fast some instruction is?

for (i in 0...5000000)
{
    i % 123;
    // or "Std.int(i / 123)"
}

This takes as much time as an empty loop. I doubt such operations as % and / are that fast, so I suppose the compiler is a smartypants and notices that I'm doing nothing real here?

Re: [Flash] Benchmarking haXe

MarcWeber
Those ops are fast (CPU-wise). Try assigning the result to a var:

var sum = 0;
for (i in 0...5000000) {
    sum += i % 123; // keep the result live so the optimizer can't drop the loop
}

then try this instead:

for (i in 0...5000000) {
    sum += i; // the same loop without the modulo, as a baseline
}

Take the times and diff them. It may not be perfectly accurate, but it should at least give you an idea.
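A minimal sketch of that take-the-times-and-diff approach, assuming the Flash target and flash.Lib.getTimer() (which Joe mentions using later in the thread); the loop bound and the divisor are just the numbers from the original post:

var sum = 0;

var t0 = flash.Lib.getTimer();      // milliseconds since the player started
for (i in 0...5000000) {
    sum += i % 123;                 // loop with the operation under test
}
var t1 = flash.Lib.getTimer();
for (i in 0...5000000) {
    sum += i;                       // baseline loop without the modulo
}
var t2 = flash.Lib.getTimer();

trace("modulo loop: " + (t1 - t0) + " ms");
trace("baseline loop: " + (t2 - t1) + " ms");
trace(sum);                         // use the result so neither loop can be optimized away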

Or go down to the opcode level ..

I'm sure some gurus on the mailing list can provide better ideas.

Marc Weber

Re: [Flash] Benchmarking haXe

Justin Donaldson-3
There's also the --no-opt flag if you want to disable compiler optimizations.

You can use haxe.Timer.stamp() cross-platform, I believe. Using a wall clock on a big loop is pretty much the only way I know of to compare performance across the platforms. Make each test run independently... all the haXe platforms have some form of garbage collection, and you don't want a GC pass triggered by one test to slow down an unrelated one.
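A sketch of that kind of cross-platform harness using haxe.Timer.stamp(); the class name, the helper, and the test bodies are just made up for illustration:

class Bench {
    // Time one test on its own, so allocations from one test don't affect the next.
    static function time(label:String, test:Void -> Void):Void {
        var start = haxe.Timer.stamp();            // seconds, as a Float
        test();
        trace(label + ": " + (haxe.Timer.stamp() - start) + " s");
    }

    static function main() {
        time("modulo", function() {
            var sum = 0;
            for (i in 0...5000000) sum += i % 123;
            trace(sum);                            // keep the result live
        });
        time("baseline", function() {
            var sum = 0;
            for (i in 0...5000000) sum += i;
            trace(sum);
        });
    }
}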

-Justin

Re: [Flash] Benchmarking haXe

Joe Dohn
In reply to this post by Joe Dohn
Yay, accumulating into a variable as Marc suggested works (unless it's still fooling me). It shows that * is about as fast as +, / is about 16 times slower, and % is in another league, around 530% slower than /. Std.int could be about 150% slower than /, too.

If these results are wrong, then I guess the compiler still biases the result and I'm not using the right method for benchmarks.

Otherwise, as I expected, it sounds like it's worth going out of my way and spending time to get rid of %, / and Std.int in CPU-intensive tasks.

I use Lib.getTimer to retrieve time but I'll have a look at haxe.Timer.stamp.
And while disabling compiler optimizations would work, I'm not sure how meaningful the results would be ;)

Thanks for the replies!

Re: [Flash] Benchmarking haXe

Matthew Spencer-2
In AVM2 (Flash), division and modulo only have float (double) based instructions. So in addition to being slow instructions by nature, they also incur casting overhead in your example, since 'i' is an integer.

If you have a very performance-critical section of code that must rely heavily on divides and modulos, it can in some cases be better to work with floats, though it's best to avoid them completely if possible.
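One small illustration of trimming that cost (this snippet is mine, not from the thread): if you need both the quotient and the remainder of the same division, one divide plus a multiply can replace a / and % pair:

var i = 4567;               // stands in for the loop variable from the earlier examples
var q = Std.int(i / 123);   // quotient: one float-based division plus the int conversion
var r = i - q * 123;        // remainder derived from the quotient, no second division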

To go completely overboard, it is possible to implement an integer divide algorithm, but unfortunately that requires handwritten bytecode and must be inlined; I think macros make this possible now. Still, I doubt it would give much of a performance gain.

Function calls and property lookups are by far the most detrimental to performance in Flash, closely followed by casting.

Again, this is all based on Flash; I've never tested the other targets.

Re: [Flash] Benchmarking haXe

Nicolas Cannasse
In reply to this post by Joe Dohn
On 19/05/2011 10:45, Joe Dohn wrote:
> Otherwise, as I expected, it sounds like it's worth going out of my way
> and spending time to get rid of %, / and Std.int in CPU-intensive tasks.

Well, that's something you would expect on any CPU :) Divisions are slow, and converting between float and int is not fast either.

But on a modern CPU, you can't really measure performance on a cycles-per-operation basis: most of the time is spent on memory cache misses and branch mispredictions, which is a different class of problem, one you optimize by avoiding both memory fragmentation and random jumps.

Nicolas

Re: [Flash] Benchmarking haXe

Joe Dohn
I noticed too how function calls slow everything down. Gotta thank haXe inlining for that.
Casting is really slow as well, though it's harder to control when dealing with numbers.
Property lookups, I don't know how slow they can get. I mean sure, "this.damn.thing.is.nested.deep" is snail-scale slow, but sprite.width doesn't seem to slow me down particularly.
Also, to my surprise, I *think* using stuff like "color >> 16 & 0xFF" many times is faster than storing the value in a short-lived variable. I guess bitwise operators and +/- really are that fast.
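For concreteness, the two forms being compared (my own illustration; the colour value and the conditions are made up):

var color = 0xFFCC33;                // some packed RGB value

// Form 1: repeat the shift-and-mask inline each time it is needed
if (((color >> 16) & 0xFF) > 128 && ((color >> 16) & 0xFF) < 250) {
    // ...
}

// Form 2: cache the channel in a short-lived local
var red = (color >> 16) & 0xFF;
if (red > 128 && red < 250) {
    // ...
}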

@Nicolas: Memory cache misses and branch mispredictions, avoiding memory fragmentation and random jumps... I only know how to avoid memory fragmentation in a ByteArray. Oh, and cache misses probably refers to empty indexes in Arrays, stuff like that. Branch mispredictions and random jumps go way above my head :D

Re: [Flash] Benchmarking haXe

Matthew Spencer-2

> I noticed too how function calls slow everything down. Gotta thank haXe inlining for that.

> Casting is really slow as well, though it's harder to control when dealing with numbers.
Are you working in fixed point? If not, this is something to look into.

> Property lookups, I don't know how slow they can get. I mean sure, "this.damn.thing.is.nested.deep" is snail-scale slow, but sprite.width doesn't seem to slow me down particularly.
sprite.width still requires a property lookup by all appearances, but I very much doubt you're using it enough times for it to matter.

> Also, to my surprise, I *think* using stuff like "color >> 16 & 0xFF" many times is faster than storing the value in a short-lived variable. I guess bitwise operators and +/- really are that fast.
Storing the value in a short-lived variable requires a local register set/get versus using the stack, unless haXe compiles that away.

> @Nicolas: Memory cache misses and branch mispredictions, avoiding memory fragmentation and random jumps... I only know how to avoid memory fragmentation in a ByteArray. Oh, and cache misses probably refers to empty indexes in Arrays, stuff like that. Branch mispredictions and random jumps go way above my head :D
A cache miss means that the data the CPU needs is not currently in its cache and must be fetched from the next level down (L2/L3/... main memory). Data is fetched in cache lines (typically 64 or 128 bytes) to mitigate the painfully slow speed of memory fetches, and linear access patterns trigger prefetching. Arrays, for example, benefit heavily from this when read sequentially. Linked lists usually do not, since the nodes are most often scattered around memory. There are many good articles around the net that explain this in much more depth. However, you don't have much control over it outside of using flash.Memory and maybe arrays (does Flash store arrays of pointers to objects, or arrays of objects? I wonder).



Re: [Flash] Benchmarking haXe

Joe Dohn
--- "Are you working in fixed point? If not, this is something to look in to."
I only use integers so far except in the alpha blending algorithm, which uses Float with no fixed point. I guess fixed point would allow me to work with integers, if I get what you're saying. Not sure that my alpha blending thing does enough operations for this to be worth it but that could change when smoothing will kick in.
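A rough sketch of what fixed-point blending of one 8-bit channel could look like (my own illustration, not code from the thread); the >> 8 is a cheap approximation of dividing by 255, so the result is very slightly darker than the exact Float version:

var alpha = 128;                                        // 0..255, roughly 50% opacity
var src = 200;                                          // one 8-bit channel of the source pixel
var dst = 50;                                           // the same channel of the destination pixel
var blended = (src * alpha + dst * (255 - alpha)) >> 8; // integer-only: no Float maths, no Std.int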


--- "Arrays for example benefit heavily from this if read in a sequential pattern. Linked Lists usually do not, since the nodes are most often scattered around memory in random places"
Does that mean Linked Lists  in Flash aren't as faster than Array/Vector as advertised in many articles?


--- "Does flash store arrays of pointers to objects? Or arrays of objects.. I wonder"
Arrays of references to objects I believe. (Since there is a difference between pointers and references... I don't think Flash has pointers)

Re: [Flash] Benchmarking haXe

Matthew Spencer-2

> I only use integers so far, except in the alpha blending algorithm, which uses Float with no fixed point. I guess fixed point would let me work with integers, if I get what you're saying. I'm not sure my alpha blending does enough operations for this to be worth it, but that could change when smoothing kicks in.
If you are doing a pass over a large buffer along the lines of:

for (i in 0...pixels.length) {
    if (needsAlphaBlend(pixels[i])) {
        // do alpha blend stuff
    }
}

then the branch will be eating up most of your performance by far.
 
> Does that mean linked lists in Flash aren't faster than Array/Vector, as many articles advertise?
Walking over an array will always be faster than walking over a linked list, always. Keep in mind, though, that Flash is a VM and the code is recompiled at runtime, so as the coder you have limited control over how things end up at the processor. A proper linked list is going to be faster for removals and insertions, while arrays will be faster for iteration. (flash.Memory will always be the fastest, since you have absolute control over the memory layout.)
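A rough way to check that claim for yourself (a sketch using haxe.Timer.stamp() with the standard Array and List types; the element count is arbitrary):

var n = 1000000;
var arr = new Array<Int>();
var lst = new List<Int>();
for (i in 0...n) {
    arr.push(i);
    lst.add(i);
}

var sum = 0;
var t0 = haxe.Timer.stamp();
for (v in arr) sum += v;             // sequential walk over the array
var t1 = haxe.Timer.stamp();
for (v in lst) sum += v;             // walk over the linked list
var t2 = haxe.Timer.stamp();

trace("array: " + (t1 - t0) + " s, list: " + (t2 - t1) + " s");
trace(sum);                          // keep the result live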
 

> Arrays of references to objects, I believe. (Since there is a difference between pointers and references... I don't think Flash has pointers.)
ActionScript does not give you control over pointers and references, but AVM2 does compile to code that uses them.
My best guess would be that it stores arrays of pointers for anything that is not a basic type, but it most likely also uses cache prefetching to make reads more efficient.