Performance Improvements

Feb 27, 2011 at 11:27 PM
Edited Feb 28, 2011 at 12:21 AM

The developer of another block engine http://www.youtube.com/user/Slaihne12 suggested some possible improvements to me when i was actively working on this engine, which could possibly improve the time it takes to rebuild a dirty region.

Basically if you use multidimensional arrays in .net then behind the scenes it does all sorts of bounds checking. Apparently for an multidimensional array [x,y,z] you should instead store it as a 1 dimensional array [x*y*z] when accessing the array you calculate the correct index into the 1 dimensional array.

Given how cool his engine is it's probably worth implementing this change into TechCraft.

Feb 28, 2011 at 12:41 AM

I was surprised to see this so I created a project with the following code and I came to the result that multidimensional arrays were 100%~ faster.  Is this true on your machine?  Was this what he ment?

 class Program
    {
        static void Main(string[] args)
        {
            int[, ,] arrayone = new int[256, 256, 256];
            int[] arraytwo = new int[256 * 256 * 256];
            Random rand = new Random();
            for (int x = 0; x < 256; x++)
            {
                for (int y = 0; y< 256; y++)
                {
                    for (int z = 0; z < 256; z++)
                    {
                        int item = rand.Next();
                        arrayone[x, y, z] = item;
                        arraytwo[x * y * z] = item;
                    }
                }
            }
            while (true)
            {
                int i = 0;
                Stopwatch sw = new Stopwatch();
                sw.Start();
                for (int x = 0; x < 256; x++)
                {
                    for (int y = 0; y < 256; y++)
                    {
                        for (int z = 0; z < 256; z++)
                        {

                            i = arrayone[x, y, z];
                        }
                    }
                }
                sw.Stop();
                Console.WriteLine(sw.ElapsedMilliseconds);
                sw = new Stopwatch();
                sw.Start();
                for (int x = 0; x < 256; x++)
                {
                    for (int y = 0; y < 256; y++)
                    {
                        for (int z = 0; z < 256; z++)
                        {

                            i = arraytwo[x * y * z];
                        }
                    }
                }
                sw.Stop();
                Console.WriteLine(sw.ElapsedMilliseconds);
                Console.ReadLine();
            }



        }
    } class Program
    {
        static void Main(string[] args)
        {
            int[, ,] arrayone = new int[256, 256, 256];
            int[] arraytwo = new int[256 * 256 * 256];
            Random rand = new Random();
            for (int x = 0; x < 256; x++)
            {
                for (int y = 0; y< 256; y++)
                {
                    for (int z = 0; z < 256; z++)
                    {
                        int item = rand.Next();
                        arrayone[x, y, z] = item;
                        arraytwo[x * y * z] = item;
                    }
                }
            }
            while (true)
            {
                int i = 0;
                Stopwatch sw = new Stopwatch();
                sw.Start();
                for (int x = 0; x < 256; x++)
                {
                    for (int y = 0; y < 256; y++)
                    {
                        for (int z = 0; z < 256; z++)
                        {

                            i = arrayone[x, y, z];
                        }
                    }
                }
                sw.Stop();
                Console.WriteLine(sw.ElapsedMilliseconds);
                sw = new Stopwatch();
                sw.Start();
                for (int x = 0; x < 256; x++)
                {
                    for (int y = 0; y < 256; y++)
                    {
                        for (int z = 0; z < 256; z++)
                        {

                            i = arraytwo[x * y * z];
                        }
                    }
                }
                sw.Stop();
                Console.WriteLine(sw.ElapsedMilliseconds);
                Console.ReadLine();
            }



        }
    }


Feb 28, 2011 at 2:42 AM

if i run it without checking "optimize code" in project properties I get 179 -   250

but when i check it, i get  151  -  59  !!

=> Lets always use a method getblock(x,y,z) and we will have the ability to change the implementation !

Sparse arrays implementations may be interesting too, those are hashmap based arrays that do not use memory for null/ zero values (so no memory allocated for BlockType.none, or maybe even for blocktype.rock under sea level  ! )

side note : In java 7 beta they found a way to remove the bound checking internally, using this gave me +10 FPS (70 to 80) in my previous java engine, and maybe +3 fps (10 to 13) in  minecraft ( i5 laptop with intel hd graphics ... )  

 

 

Feb 28, 2011 at 2:52 AM

I forgot : you should read this blog and try the attached xna4 project. This is where I took the FPS counter from and there is a whole draw / update statistics system on screen + infinite terrain with a 2d representation of the chunks unloading. Worth reading, and maybe cannibalizing some things ;)

http://blog.eckish.net/2011/01/10/perfecting-a-cube/

( this is really a R&D tech demo not basis for a game : works on xbox but very few block types, the terrain is only a heightmap, there is no digging, no mouse controls  )

Feb 28, 2011 at 3:44 AM

Really great article.  Yeah good catch on the optimize code option and it defiantly looks like single dimensional array is much faster.

Feb 28, 2011 at 3:53 AM

may be worth a try to remove the 60 fps cap and see if changing to single dimensional has an effect on the fps. I ll try tomorrow.

Feb 28, 2011 at 12:35 PM

This may be obvious, but i initially got it wrong so just in case the correct calculation for an index into the array is

x + (y * _my)  + (z * _mz)

where 

_my = (int) _size.X;            

_mz = (int) (_size.X * _size.Y);

and

_size is the region size.

One other optimisation he suggested is to swap the order so its (x*z*y) - that way as you build regions from the ground up you only need to do the index calculation for the first block, then moving up the stack is simply adding 1.

Feb 28, 2011 at 11:04 PM
Edited Mar 1, 2011 at 1:36 AM
The problem with single dimensional arrays using the staggered index is that non-sequential reads start to take a shit on you.
The code below (ignore the mess, half hour of typing while on lunch) gives these results for me very consistently.
I started to do some memory testing but ran out of time during my lunch. By the way, thanks for introducing me to the stopwatch. I'd always used ticks before (I'm a c/c++ guy).

Multidimensional Incremental: 160
Jagged Incremental: 104
Single Incremental: 89

Multidimensional Jumpy: 169
Jagged Jumpy: 147
Single Jumpy: 386

using
System; using System.Collections.Generic; using System.ComponentModel; using System.Data; using System.Drawing; using System.Text; using System.Windows.Forms; namespace ArrayTest { public partial class Form1 : Form { public Form1() { InitializeComponent(); } int[, ,] multidimensional = new int[256, 256, 256]; int[][][] jagged = new int[256][][]; int[] single = new int[256 * 256 * 256]; System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch(); private void initJagged() { for (int i = 0; i < 256; i++) { jagged[i] = new int[256][]; for (int j = 0; j < 256; j++) { jagged[i][j] = new int[256]; } } } private void Form1_Load(object sender, EventArgs e) { initJagged(); string sResults = ""; long memory = GC.GetTotalMemory(true); long multiincremental = SetArrayIncremental(multidimensional); long jaggedincremental = SetArrayIncremental(jagged); long singleincremental = SetArrayIncremental(single); long multijumpy = SetArrayJumpy(multidimensional); long jaggedjumpy = SetArrayJumpy(jagged); long singlejumpy = SetArrayJumpy(single); sResults += "\n\n"; sResults += "Multidimensional Incremental: " + multiincremental + "\n"; sResults += "Jagged Incremental: " + jaggedincremental + "\n"; sResults += "Single Incremental: " + singleincremental + "\n"; sResults += "\n\n"; sResults += "Multidimensional Jumpy: " + multijumpy + "\n"; sResults += "Jagged Jumpy: " + jaggedjumpy + "\n"; sResults += "Single Jumpy: " + singlejumpy + "\n"; Console.Write(sResults); MessageBox.Show(sResults); Application.ExitThread(); } private long SetArrayIncremental(int[, ,] array) { int counter = 0; sw.Reset();sw.Start(); for (int x = 0; x < 256; x++) { for (int y = 0; y < 256; y++) { for (int z = 0; z < 256; z++) { array[x, y, z] = counter; counter++; } } } return sw.ElapsedMilliseconds; } private long SetArrayIncremental(int[][][] array) { int counter = 0; sw.Reset();sw.Start(); for (int x = 0; x < 256; x++) { for (int y = 0; y < 256; y++) { for (int z = 0; z < 256; z++) { array[x][y][z] = counter; counter++; } } } return sw.ElapsedMilliseconds; } private long SetArrayIncremental(int[] array) { int counter = 0; sw.Reset();sw.Start(); for (int x = 0; x < 256; x++) { for (int y = 0; y < 256; y++) { for (int z = 0; z < 256; z++) { array[x * y * x] = counter; counter++; } } } return sw.ElapsedMilliseconds; } private long SetArrayJumpy(int[, ,] array) { int counter = 0; bool bSwitch = false; sw.Reset();sw.Start(); for(int x = 0; x < 256; x++) { for(int y = 0; y < 256; y++) { for (int z = 0; z < 256; z++) { if (bSwitch) { array[x, y, z] = counter; } else { array[255 - x, 255 - y, 255 - z] = counter; } counter++; bSwitch = !bSwitch; } } } return sw.ElapsedMilliseconds; } private long SetArrayJumpy(int[][][] array) { int counter = 0; bool bSwitch = false; sw.Reset();sw.Start(); for (int x = 0; x < 256; x++) { for (int y = 0; y < 256; y++) { for (int z = 0; z < 256; z++) { if (bSwitch) { array[x][y][z] = counter; } else { array[255 - x][255 - y][255 - z] = counter; } counter++; bSwitch = !bSwitch; } } } return sw.ElapsedMilliseconds; } private long SetArrayJumpy(int[] array) { int counter = 0; bool bSwitch = false; sw.Reset();sw.Start(); long start = DateTime.Now.Ticks; for (int x = 0; x < 256; x++) { for (int y = 0; y < 256; y++) { for (int z = 0; z < 256; z++) { if (bSwitch) { array[x*y*z] = counter; } else { array[(255 - x)*(255 - y)*(255 - z)] = counter; } counter++; bSwitch = !bSwitch; } } } return sw.ElapsedMilliseconds; } } }
Mar 1, 2011 at 1:36 AM

as always, "premature optimization is the root of all evil"

Lets implement the regions/chunks properly, clean up the code / keep it simple and take real use mesurements (with optimized release code, not in debug ! )

I ll now try to integrate the profiling from here http://blog.eckish.net/2011/01/10/perfecting-a-cube/ 

Mar 1, 2011 at 3:50 AM
Edited Mar 1, 2011 at 4:03 AM


100 FPS when looking at 13*13 regions without fog, 92 FPS when trying to spam the region rebuilding with the bazooka :) On an intel i5 laptop with integrated gpu !

( Press F3 to enter statistics mode without SynchronizeWithVerticalRetrace & fixedTimeStep)

 

Image and video hosting by TinyPic

 The source is up to date, profiler is very easy to use just put  

  Profiler.profiler.Start("Something");

---

SomeMethodToProfile();  

 Profiler.profiler.Stop("Something");

I commited this WorldSettings too :

public const int MAPWIDTH = 16*13;
public const int MAPHEIGHT = 128;
public const int MAPLENGTH = 16*13;
public const int FOGNEAR = 90 * 4;
public const int FOGFAR = 140 * 4;
public const int FARPLANE = 140 * 4;

public const int REGIONWIDTH = 16;
public const int REGIONHEIGHT = 128;
public const int REGIONLENGTH = 16;
Mar 3, 2011 at 7:09 AM
Edited Mar 3, 2011 at 7:13 AM
bamyazi wrote:

This may be obvious, but i initially got it wrong so just in case the correct calculation for an index into the array is

x + (y * _my)  + (z * _mz)

where 

_my = (int) _size.X;            

_mz = (int) (_size.X * _size.Y);

and

_size is the region size.

One other optimisation he suggested is to swap the order so its (x*z*y) - that way as you build regions from the ground up you only need to do the index calculation for the first block, then moving up the stack is simply adding 1.


I've seen this swap order in other engines. It is an improvement worth making. Although, treat it as a multidimensional array (x,z,y) and not a single dimensional array (x*z*y) as ycatsce mentions.

Mar 9, 2011 at 11:14 PM

Hi there,

I found this thread after Bamyazi pointed me at this project.

I had a quick read through Ycatsce’s benchmark program and noticed a couple of things…

The line

array[x * y * x] = counter;

in SetArrayIncremental for the 1d array has X in twice, but it really should read

array[x * 65536 + y * 256 + z] = counter;

or else it won’t be a sequential access with the way the loop is organised; Z is the internal loop, so when it increments then the overall reference needs to increment too.

 

Also

The lines in SetArrayJumpy (again for the 1d array) also lead to completely non-sequential access…

They should both be, in order…

array[x * 65536 + y * 256 + z] = counter;

array[(255 - x) * 65536 + (255 - y) * 256 + (255 - z)] = counter;

 

This leads to the same type of access as the 3d and jagged array tests.


In the benchmark you have actually implemented a single sequential access pattern working upwards through the array, at least for the 3d and jagged versions. The 1d array version was a completely random access pattern.

The ‘jumpy’ section is simply two sequential access patterns interleaved, one working from the bottom of the array and the other from the top, again only for the 3d and jagged array. Again, the 1d array test was a completely random access pattern.

The results you were getting were purely because of cache misses on the 1d array since both benchmarks for this were actually random whereas the others were sequential.

 

When compiled and run on my machine looped multiple times I get figures of 718, 682, 279 milliseconds for the sequential accesses for 3d, jagged, and 1d respectively.

For the ‘jumpy’ benchmark I get 766, 600, 450 milliseconds, again for 3d, jagged, and 1d respectively.

As you can see, the 1d array benchmarks have a couple of extra multiplies and additions that the other two don’t but still come out faster, more so in the non-jumpy test.

 

In my engine I actually never use an access pattern that is similar to the benchmark. I tend to calculate the ‘pointer’ outside of the loop and then just add (or subtract) to it.

The following is a simple lighting clear I use…

Public Shared Sub ClearLightChunk(ByVal WorldX As Int32, ByVal Worldz As Int32)

    Dim x, y, z As Integer
    Dim lp As Integer
    Dim ZL As StructLightBlock

    ZL.Sun = 0
    ZL.R = 0
    ZL.G = 0
    ZL.B = 0

    WorldX = WorldX Mod BlockBuffSize
    Worldz = Worldz Mod BlockBuffSize

    For x = WorldX To WorldX + ChunkSize - 1
      For z = Worldz To Worldz + ChunkSize - 1
        lp = x * clsBlocks.BMX + z * clsBlocks.BMZ
        For y = lp To lp + (BlockBuffheight + 2)
          clsBlocks.LIGHTS(y) = ZL
        Next
      Next
    Next

 End Sub

 

As you can see, the inner loop has the barest minimum inside it. I could probably optimise this even further but it’s already fairly limited by cache misses and so wouldn’t get much faster.

Oh, I also saw someone mention using a call to retrieve a block. I used this myself initially but found that it really slowed stuff down.

 

Keep up the good work.

 

rgds

 

Slaihne

Mar 10, 2011 at 4:16 PM

Hey there, thanks for the comments ... short response from me tho' i'm afraid i've been working over in france for the last week just catching up on my digital life since i managed to take the broken laptop with me from work and have had no internet..completely knackered today from driving back - couple of beers and i'm set to crash :)

Apr 13, 2011 at 9:09 AM

I've improved ycatse's codes output based on slaihne12's fixes and added iteration support;

http://techcraft.codeplex.com/SourceControl/network/Forks/raistlinthewiz/voxlrmerge/changeset/changes/4e0aec0d0e51

A sample output: http://pastebin.com/Hmx1rTjc

It seems flatten arrays are always faster and it's even 2x faster then multidimensional pars.