Optimisation: a look at sprite drawing

How fast can you draw a sprite on the screen?

Part 1

Gavin Smyth

Published in EXE, July 1996


The activity of drawing sprites on the screen is fundamental to just about all computer games, and forms a major part of an awful lot of them. The mechanism for performing this activity therefore has to be flexible and as fast as possible. This article and the one next month discuss implementation optimisation for this particular process as well as dependencies on compiler and hardware capabilities.

What's involved in painting a sprite on to the screen? Obviously we need the sprite data - an array of pixel values with associated width and height will suffice. I'll encode the image as a C++ class with a method to draw the sprite on the screen. And next, we need somewhere to put it - with a PC VGA display adapter operating in mode 13h (320 × 200 pixels, with one byte per pixel), it's the screen memory block starting at location A0000h. The reason that I chose this resolution is that mode 13h is one of the easiest to work with on the PC: the pixel at (0, 0) - top left - is the byte at address A0000h, pixel (0, 1) is at A0001h, (1, 0) is at A0140h, ie., A0000h + 320, etc. (Strictly speaking, I should also include colour palette information with the sprite definition, but I'll ignore that topic for this article and use the default palette for mode 13h.)

Figure 1: Type definitions
typedef unsigned short Coord Position on screen or screen dimension.
typedef unsigned int Count Any counter (as long as it never goes negative).
typedef unsigned char Pixel Value put into the screen memory (strictly,
the index into the colour palette).

Figure 1 contains a few type definitions used for brevity and sprites are managed as objects of class Image, declared in Figure 2. Note the double const in the _pixels declaration, stating that both the pointer and the data to which it points are to be treated as constant. As you can see, there are only two functions in this class: the constructor and the drawing function. As a matter of course, class definitions should really include a copy constructor, assignment operator and destructor: however, the defaults supplied by the language are sufficient in this case because the class is so simple.

Figure 2: Sprite class declaration

class Image
{
public:
  Image( const Pixel* data,
         Coord width, Coord height );
  void plot( Coord x, Coord y );
private:
  const Pixel* const _pixels;
  const Coord  _width, _height;
};

Constructor

Figure 3a: Constructor definition, using member initialisation

Image::Image( const Pixel* data, Coord w, Coord h ) :
          _pixels( data ),
          _width( w ),
          _height( h )
{}

The constructor (Figure 3a) simply initialises the internal variables with the values passed in. The member variables are initialised rather than assigned: it is this mechanism that allows me to have them marked as constant within the class declaration for added safety, as noted earlier. If instead I had coded the constructor as shown in Figure 3b, I would have had to omit the const keyword in the member declarations - this would mean that any subsequent accidental assignments to these variables would not be spotted as errors by the compiler. The presence of const does not guarantee correctness, but certainly helps trap a lot of careless mistakes. In general, the const keyword should be used wherever possible to reduce the likelihood of accidentally altering something that ought to be invariant.

Figure 3b: Constructor definition, using member assignment

Image::Image( const Pixel* data, Coord w, Coord h )
{
  _pixels = data;
  _width  = w;
  _height = h;
}

The pixel data vector passed into the constructor is not copied - instead, the Image class simply references it. I did it this way because I intend the data to originate from a C++ vector defined within the program, so there is no need to copy it. Alternatives include reading it from a file, in which case a valid constructor might be one that takes the name of a file, or a handle to a file, and reads the data into a piece of storage allocated within the class - for example, something like Figure 3c. This example assumes the file stores the height and width as single bytes, and then the sprite data as a stream of bytes: not a particularly clever format, but it shows the basic technique for reading the data from a file. In this case too, the member variables cannot be declared constant since they are assigned at run time. For the current implementation, nothing is gained by permitting the sprite data to be read from a file: in fact, it's marginally less efficient - extra time has to be taken to read in the data, and extra space used to store the dynamic memory header block.

Figure 3c: Constructor definition, reading sprite data from a file

Image::Image( const istream& is )
{
  unsigned char temp;
  is.get( temp );  _width = temp;
  is.get( temp );  _height = temp;
  _pixels = new Pixel[ _width * _height ];

  for( Count i = 0; i < _width * _height; i++ )
    is.get( _pixels[ i ] );
}

Back to the real topic of this discussion: the vector defines the pixel values of the sprite, starting at top left, moving along and down the sprite. I could not define the data as a two dimensional array in C ++ (or C for that matter) because different sprites will have different sizes, and C++ does not store information about dimension sizes with the array. I could have defined an array type for the largest sprite I want to use, but that would be wasteful of storage for smaller ones; or I could have used a clever C++ matrix class, but that would add unnecessary inefficiency to what has to be a very quick operation, copying the sprite to the screen. Instead, the sprite pixels are arranged as a simple one dimensional array, with the pixel for row x (counting from 0 at the top) column y (0 at the left) being the value at _pixels[ x × _width + y ] - although there is a multiplication in that, it doesn't really matter because we can very easily optimise that away, as will be shown later.

The sprite I'm going to use for the later speed comparisons is one of the little alien things shown in Figure 4a (OK, so I'm no artist), along with part of its data definition. Thus, constructing an alien sprite is as simple as the single line in Figure 4b.

Figure 4a: Example sprite data
Alien sprite

const int alienWidth  = 27;
const int alienHeight = 31;
static const Pixel alienData[] = {
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, ...
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, ...
    0x00, 0x00, 0x00, 0x00, 0x20, 0x00, ...
...
    0x00, 0x28, 0x28, 0x28, 0x28, 0x28, ...
    0x28, 0x28, 0x28, 0x28, 0x28, 0x28, ...
    0x28, 0x28, 0x28, 0x28, 0x28, 0x0F, ...
    0x28, 0x28, 0x28, 0x28, 0x28, 0x0F, ...
...
};
Figure 4b: Defining a sprite

const Image alien( alienData, alienWidth, alienHeight );

The drawing function

The plot() function has the job of putting the sprite on the screen, with its top left corner at (x, y) on the screen. I'm going to take this opportunity to show how optimising the code helps - the function definitions that follow have long names to indicate what algorithm is being used, but they are really all just variants of the plot() member function declared in Figure 2. My first attempt at putting the sprite on the screen looked something like Figure 5a, where the Screen_XXX values are defined as:


     const Coord Screen_width = 320;
     const Coord Screen_height = 200;
     Pixel* const Screen_base = (Pixel*)MK_FP( 0xA000, 0 );

The first two are the width and height of the mode 13h VGA display screen, and the final one defines a pointer: because of the standard PC 16-bit operation, location A0000h cannot be referenced directly as such and, instead, it has to be accessed via segment and offset - fortunately, the whole of the mode 13h screen memory fits within one segment so once the segment base is set, it need not be changed. The macro MK_FP( 0xA000, 0 ) creates a far pointer from the specified segment and offset. Incidentally, I built the code with the large memory model, which means all my pointers are far pointers - the code could be made more efficient by using near pointers instead, but the optimisations below remove most of the effects of this inefficiency anyway.

Figure 5a: Drawing function, simple C variant

void Image::plotSlowC( Coord x, Coord y )
{
  for( Coord row = 0; row < _height; row++ )
    for( Coord col = 0; col < _width; col++ )
      Screen_base[ ( y + row ) * Screen_width + x + col ] =
                     _pixels[ row * _width + col ];
}

Now this first effort is hideously slow, and I'm almost ashamed that I made it public. Nevertheless, it does work, and does show the logic of the operation. How can we make it faster? First, notice that it always steps through the sprite data in sequence-when it has completed one row, it immediately starts on the next: for example, the last pixel on the third line is _pixels[ 2 × _width + (_width - 1) ] and the first on the fourth line is _pixels[ 3 × _width + 0 ] - the very next location in the vector. This means that the multiplication and addition on the right side of the assignment can be eliminated: we create a pointer to the start of the sprite data, and just step the pointer along for each pixel, regardless of whether it's in the middle of a line or at the end.

Can we do the same for the left? Almost: we can increment a pointer in exactly the same way as we scan along a line, but at the end of each row, we have to skip a lot of pixels to get to the equivalent position on the screen, Screen_width - _width to be precise. Putting these two together gives the function in Figure 5b, without any multiplications in the loop. This is much faster, even though it introduces a bundle of local variables - this is the usual trade-off of time vs space.

Figure 5b: Drawing function, faster C variant

void Image::plotFasterC( Coord x, Coord y )
{
  const Pixel* p = _pixels;
  const Pixel* s = Screen_base + y * Screen_width + x;
  const Coord offset = Screen_width - _width;
  for( Coord row = 0; row < _height; row++ )
  {
    for( Coord col = 0; col < _width; col++ )
      *s++ = *p++;
    s += offset;  // Back to the right place for the next row
  }
}

There are still a few tweaks to make the C a little faster. Counting down, and comparing with zero is marginally faster than counting up (at least, if the compiler is vaguely intelligent); and placing heavily used variables in registers can speed things up-but remember that the processor has only a limited number of registers, so the register keyword is really only a hint to the compiler that the associated variable is heavily used: some compilers ignore the keyword, particularly when compiler optimisation is disabled. Another minor optimisation has to do with the use of for loops: these check the test condition before processing the loop body (so that, say, for( i = 0; i < 0; i++ ) correctly executes zero times) but we already know that our sprites have non-zero height and width, so we can move the test to the end of the loop with a do...while structure, and perform the loop termination test once less. The best general sprite copying C++ routine I could come up with is given in Figure 5c. As I'll show towards the end of this article, this is roughly twice as fast as the first example. The only thing that can be done to improve this in C++ is to turn on all compiler optimisations and fiddle with near pointers! However, if we delve into assembler, we can make it a lot faster again (by another factor of almost four).

Figure 5c: Drawing function, fastest C variant

void Image::plotFastC( Coord x, Coord y )
{
  const register Pixel* p = _pixels;
  register Pixel* s = Screen_base + y * Screen_width + x;
  const register Coord offset = Screen_width - _width;
  register Coord row = _height;
  do
  {
    register Coord col = _width;
    do
      *s++ = *p++;
    while( --col > 0 );
    s += offset;  // Back to the right place for the next row
  } while( --row > 0 );
}

Assembly language

Assembly language programming is not easy or fun in my opinion, and I try to steer clear of it if I can, only using it where I really have to - here in sprite drawing, speed is very important, and it would be a mistake not to take advantage of the detailed low level control possible within assembly language. Instead of writing assembler from scratch, we can be lazy: what I usually do is get the compiler to produce the assembly output file (-S option on many compilers) and examine the file to get the general structure of the code. If nothing else, this shows where variables are on the stack. Most of my assembly writing is in translating the compiler's rather dumb register and variable assignments into much more efficient ones. In fact, even if you don't intend to write any assembly code, it is still worthwhile to peruse the compiler generated assembler occasionally as it can suggest improvements to the C code, mainly to conpensate for the compiler's lack of intelligence. Figure 6 shows what the Borland 4.53 C++ compiler did with the loops of the last C++ function above with all speed optimisations enabled.

Figure 6: Assembler listing for part of the fast C drawing function

1 ;   const register Pixel* p = _pixels;
      les   bx,dword ptr [bp+6]
      mov   dx,word ptr es:[bx+2]
      mov   ax,word ptr es:[bx]
      mov   word ptr [bp-2],dx
      mov   word ptr [bp-4],ax
2 ;   register Pixel* s = Screen_base + y * Screen_width + x;
      mov   ax,word ptr [bp+12]
      imul  ax,ax,320
      mov   dx,word ptr DGROUP:Screen_base+2
      mov   bx,word ptr DGROUP:Screen_base
      add   bx,ax
      add   bx,word ptr [bp+10]
      mov   word ptr [bp-6],dx
      mov   word ptr [bp-8],bx
3 ;   const register Coord offset = Screen_width - _width;
      mov   bx,word ptr [bp+6]
      mov   ax,320
      sub   ax,word ptr es:[bx+4]
      mov   word ptr [bp-10],ax
4 ;   register Coord row = _height;
      mov   cx,word ptr es:[bx+6]
  @4@226:
  ;   do
  ;   {
5 ;     register Coord col = _width;
      les   bx,dword ptr [bp+6]
      mov   dx,word ptr es:[bx+4]
  @4@254:
  ;     do
6 ;       *s++ = *p++;
      les   bx,dword ptr [bp-4]
      mov   al,byte ptr es:[bx]
      les   bx,dword ptr [bp-8]
      mov   byte ptr es:[bx],al
      inc   word ptr [bp-4]
      inc   word ptr [bp-8]
7a ;    while( --col > 0 );
      dec   dx
      mov   ax,dx
      or    ax,ax
      ja    short @4@254
  ;     s += offset;
      mov   ax,word ptr [bp-10]
      add   word ptr [bp-8],ax
7b ;  } while( --row > 0 );
      dec   cx
      mov   ax,cx
      or    ax,ax
      ja    short @4@226
  ;   }

The source code appears as comments in the assembler, and the odd looking label names are the ones automatically generated by the compiler. The numbers at the left of the listing refer to notes below:

  1. First of all, the compiler is using ES:BX as the C++ this pointer: if you look at the assembler produced for a number of C++ member functions, you'll see that they all use BX for this purpose (unless you use the fastthis optimisation available in this compiler), but remember that this is implementation dependent and other compilers may use different register conventions. According to the declaration of the Image class (Figure 2), the first member variable is the _pixels pointer, as is obvious here in the assignment from the long word at address ES:[BX]. Next, you can see that the compiler has ignored my register hint since it immediately copies the pointer into memory on the stack - the long word at address BP-4 is pointer p.
  2. Once again, register has been ignored, and BP-8 is used as the pointer s.
  3. offset is stored in BP-10.
  4. ES:[BX+6] contains the _height field in the Image object, and so we can see that the CX register is used as the counter for the outer loop.
  5. The DX register is used as the counter for the inner loop.
  6. This rather convoluted bit of code loads BX with the source pointer (from the memory located at [BP-4]), loads AL with the byte addressed by this pointer, reloads BX with the destination pointer (from [BP-8]), shoves AL into the destination location, and then increments both of the pointers in memory.
  7. That move into AX and OR statement are totally unnecessary operations since the decrement instruction directly above them sets the zero flag anyway!
Some of the faults are obvious: Figure 7a is my modified, faster and considerably smaller, variant. Notice how I'm too lazy to write a complete assembler routine: I let C++ handle the function stack frame and allocation of variables on the stack, and then I take over the critical bits with assembler. The code would of course be a little faster if I did it all in assembler, but I would not find writing it as enjoyable! A good policy is to get it working, and then get it working fast: I try to write a straightforward routine and then find out where it can be speeded up with most gain for least effort, and concentrate on those areas - profilers can help show particular hot spots, but for a small application like this one, commensense will suffice. There is a trade-off between development time and efficiency, and I prefer to expend effort on the areas that matter most: this means speed up the tight loops and worry about the rest later (or never, if I'm lucky). Of course, this particular brand of coding trickery is very compiler, and even compiler option, sensitive. For example, if I changed memory model, this code would almost certainly break. I used a Borland compiler, and I'm sure similar techniques would work with a Microsoft, or any other 16-bit, compiler, and the syntax for embedded assembler should be similar.

Figure 7a: Drawing function, slow assembler variant

void Image::plotSlowAsm( Coord x, Coord y ) const
{
  assert( x + _width <= Screen_width );
  assert( y + _height <= Screen_height );

  Pixel* s = Screen_base + y * Screen_width + x;
  const Pixel* p = _pixels;

  _asm {
    push  ds;                    // Preserve!

    les   bx,this;               // Use ES:BX as "this" pointer
    mov   ax,es:[bx]._width;     // AX contains "width"
    mov   dx,es:[bx]._height;    // DX contains "height"
    mov   bx,Screen_width;
    sub   bx,ax;                 // BX contains "offset"

    les   di,s;                  // ES:DI is "s"
    lds   si,p;                  // DS:SI id "p"

    cld;
  }
rowloop:       // Have to come out to C(++) to define label
  _asm mov   cx,ax;                // Get the row width
movint:
  _asm {
       rep   movsb;                // Copy a single row
       jcxz  short movdone;        // Done, or interrupted?
       loop  movint;
       }
movdone:
  _asm {
    add   di,bx;                   // Move to the next row
    dec   dx;
    jnz   rowloop;                 // Finished all the rows?
    pop   ds;
  }
}

Look carefully at that inner loop - it's not just coded as:

            mov cx,limit
            rem movsb

Instead, I went to the effort of writing it as:

            mov  cx,limit
   restart: rep  movsb
            jcxz short done
            loop restart
   done:

Why? It happens that when programming anything lower than a 386 (ie., from an 8086 to an 80286), you have to be aware of an interesting feature of the processor: if a repeated string instruction is interrupted, it may not complete properly! REP is not an instruction in its own right - it's a prefix byte for string operations. On the 80386 and above, if a string operation is interrupted before the count hits zero, the address pushed on to the stack when the interrupt is processed is that of the complete instruction including prefixes: however, with the older processors, prefix bytes were not taken into account. The result of that on the older processors is that, upon return from interrupt, the REP is ignored, the processor executes a single string operation and then continues. In the longer form, the CX register is checked at the end of the REP loop, and if it's not zero, the loop must have been interrupted, so re-enter it. If you know you're only coding for 80386 and above, you can dispense with that check and use the simpler form.

Can we do any more? Yes, but not much. The code above use the MOVSB instruction to copy a byte at a time. All of the 80x86 family members have the MOVSW instruction, to move two bytes at a time. Figure 7b makes use of that and note the check to see if we have an odd number of bytes to move (For brevity, I've omitted the REP loop restart mentioned above). If you're writing for 80386 and above only, you have access to MOVSD, to move 4 bytes at a time - I'll leave that bit as an exercise for you! I think the only way to speed up this particular bit of code would be to replace the C++ entry and exit code with more assembler and juggle with instruction ordering to assist the processor pipelines in the higher end processors, but that may be more effort than it's worth.

Figure 7b: Drawing function, fast assembler variant

void Image::plotFastAsm( Coord x, Coord y ) const
{
  Pixel* s = Screen_base + y * Screen_width + x;
  const Pixel* p = _pixels;
  _asm {
    push  ds;                    // Preserve!

    les   bx,this;               // Use ES:BX as "this" pointer
    mov   ax,es:[bx]._width;     // AX contains "width"
    mov   dx,es:[bx]._height;    // DX contains "height"
    mov   bx,Screen_width;
    sub   bx,ax;                 // BX contains "offset"

    les   di,s;                  // ES:DI is "s"
    lds   si,p;                  // DS:SI id "p"

    cld;
    shr   ax,1;                  // CX is width / 2
    jc    oddrowloop;            // Was width odd?
  }
rowloop:          // Have to come out to C(++) to define label
  _asm {
    mov   cx,ax;                 // Get the row width
    rep   movsw;                 // Copy a single row
    add   di,bx;                 // Move to the next row
    dec   dx;
    jnz   rowloop;               // Finished all the rows?
    jmp   endroutine;
  }
oddrowloop:       // Have to come out to C(++) to define label
  _asm {
    mov   cx,ax;                 // Get the row width
    rep   movsw;                 // Copy a single row
    movsb;                       // ... and the last pixel
    add   di,bx;                 // Move to the next row
    dec   dx;
    jnz   oddrowloop;            // Finished all the rows?
  }
endroutine:
  _asm pop   ds;
}

Pulling it together

Figure 8 is a short program to simply blast lots of Images on to the screen in random positions and time the operations, using the definition of the sprite above. Drawing a single sprite is much too quick to measure accurately, so I draw a large number, numIterations to be exact, giving a much more accurate result. As will be seen below, the assembly code versions are much faster than the C ones, so I add a further multiplication factor to the number of those I draw. The PC clock operates at about 18.2 ticks per second, so the timings will not be particularly precise, and they also include the overhead of the loops and random number generation, but the results will give a good idea of the relative performance of the different implementations of the drawing routine.

Figure 10: Benchmarking application

const int vidInt    = 0x10;  // Interrupt num for video fns
const int setMode   = 0x00;  // Function code to set video mode
const int getMode   = 0x0F;  // Function to read video mode

int vidInterrupt( int fn, int al = 0 )
{
  union REGS r;
  r.h.ah = fn;
  r.h.al = al;
  int86( vidInt, &r, &r );
  return r.h.al;
}

const long numIterations = 10000;
const long asmScale      = 10;

int main()
{
  int oldMode = vidInterrupt( getMode );

  vidInterrupt( setMode, 0x13 );

  long i;

  clock_t start = clock();
  for( i = 0; i < numIterations; i++ )
    alien.plotSlowC( random( 200 ), random( 150 ) );

  clock_t first = clock();
  for( i = 0; i < numIterations; i++ )
    alien.plotFastC( random( 200 ), random( 150 ) );

  clock_t second = clock();
  for( i = 0; i < numIterations * asmScale; i++ )
    alien.plotSlowAsm( random( 200 ), random( 150 ) );

  clock_t third = clock();
  for( i = 0; i < numIterations * asmScale; i++ )
    alien.plotFastAsm( random( 200 ), random( 150 ) );

  clock_t last = clock();

  vidInterrupt( setMode, oldMode );

  const long mult = CLOCKS_PER_SEC * numIterations;
  cout << "Slow C: " << mult / ( first - start ) << " per second\n"
          "Fast C: " << mult / ( second - first ) << " per second\n"
          "Slow assembler: " << mult * asmScale / ( third - second ) << " per second\n"
          "Fast assembler: " << mult * asmScale / ( last - third ) << " per second\n"
  return 0;
}

The vidInterrupt() routine is a short, and not particularly efficient, way to change video modes, and perform other screen based operations via the BIOS. It's only used three times in the program (to get the current mode, to set mode 13h and finally, to restore the original), so I am not too concerned about its speed.

Timings

I ran the code on a few machines, and the results (to two significant figures) are given in Figure 9. These numbers are affected by things other than the processor-the memory speed and video card making most difference, so the inter-machine comparisons may not actually be very valid.

Figure 9: Timing results in sprites per second for a number of conditions
Slow CFast CSlow AssemblerFast Assembler
25MHz 386 SX, compiler optimisation enabled120180720890
33MHz 486 DX, no compiler optimisation380110044005200
33MHz 486 DX, compiler optimisation enabled600120044005200
90MHz Pentium, compiler optimisation enabled2000340062008700

The compiler optimisations do not make anywhere near as much a difference as using a good algorithm in the first place (and the largest difference is where the starting point is a very sloppy implementation). Having a faster processor helps, but once again, the change is swamped by the effect of selecting a better algorithm - for example, the tuned version on the slow 386 outperforms the sloppy version on the higher speed 486 and the tuned 486 code outperforms even the best C on the Pentium! I feel that these days programmers are becoming too lazy and just assume that throwing more money (in other words, more memory or a faster processor) at a problem solves it, whereas the application of common sense and programming skills is much more beneficial.

Does the code work?

Before we get too excited, let's consider this fairly fundamental question. The answer is yes and no since it only does partly what we want. I'll explain next month.
Sprite optimisation article start page | Gavin's home page | BeesKnees home page

Last modified on 4th January 2001