Optimisation: a look at sprite drawing

How fast can you draw a sprite on the screen?

Part 2

Gavin Smyth

Published in EXE, August 1996


Last month I presented a number of implementations of basic sprite drawing routines, ranging from initially obvious, but very inefficient, C++ code to very efficient assembler. However, the code did not really do the job required of a sprite drawing routine... This month, I'll first address the flaw in the algorithm, and then present a pot pourri of ideas for speeding up sprite drawing still further. The final section will be a brief discussion of an excellent DOS port of the GNU C/C++ compiler.

Last month's routines certainly draw the sprite on the screen, but if you ran any of the code so far, you'll have noticed that the background of the sprite is drawn as black, chipping holes out of ones already drawn. This is not very useful in video games, so we need to have some way of specifying that only some of the pixels in the sprite are drawn. There are two obvious ways to do this:

Figure 1: Masking sprite pixels
(a) mask array(b) transparency value

for( Coord col = _width; col > 0; col-- )
{
  if( *mask++ )
    *s = *p;
  s++;
  p++;
}

for( Coord col = _width; col > 0; col-- )
{
  if( *p )
    *s = *p;
  s++;
  p++;
}

I'm going to use the latter because it does not require a second data vector, and because we can easily spare one colour out of the 256 available to us - in fact, it's not even as bad as that, because this is only ruling out the use of this particular value in sprites, and it can still be used for a background colour. An efficient assembler version of plot() is shown in Figure 2.

Figure 2: Masked drawing routine

void Image::plot( Coord x, Coord y ) const
{
  Pixel* s = Screen_base + y * Screen_width + x;
  const Pixel* p = _pixels;

  _asm {
      push  ds;                  // Preserve!

      les   bx,this;             // Use ES:BX as "this" pointer
      mov   ax,es:[bx]._width;   // AX is "width" (temporarily)
      mov   dx,es:[bx]._height;  // DX contains "height"
      mov   bx,Screen_width;
      sub   bx,ax;               // BX contains "offset"

      les   di,s;                // ES:DI is "s"
      lds   si,p;                // DS:SI id "p"

      push  bp;                  // Preserve!
      mov   bp,ax;               // BP contains "width"
      cld;
  }
rowloop:       // Have to come out to C(++) to define label
  _asm mov  cx,bp;               // Get the row width
colloop:
  _asm {
      lodsb;                     // Get the pixel
      and   al,al;               // Set Z flag
      jz    short skipit;
      mov   es:[di],al;          // Wasn't blank, so copy it
      }
skipit:
  _asm {
      inc   di;
      loop  colloop;
      add   di,bx;               // Move to the next row
      dec   dx;
      jnz   short rowloop;       // Finished all the rows?
      pop   bp;
      pop   ds;
  }
}

Unfortunately, because we have to examine each pixel individually, we can't use double and quad move instructions. Incidentally, I'm using the BP register to store the width for each row - I used AX in last month's code, but obviously can't here because LODS uses AL. This code runs at about half the speed of the slow assembler non-masked copy. How could we speed it up? Here are outlines for a couple of mechanisms which you might want to consider:

Run length encoding

Instead of storing the transparent pixels individually, we could just store the number of them in a run, so there would be less time spent scanning past the transparent bytes. Instead of a rectangular array of pixel values, for each line we'd only store the pixels between the first and last non-zero: a first attempt at this might be the data structure in Figure 3a.

Figure 3a: Run length encoded data structure
number of lines

column of the first non-zero pixel in the line, column of the last non-zero pixel

...pixel values (including embedded zeros)...

ditto for other lines...

If we assume that a sprite will have transparent regions no longer than 255 pixels, the data can all be stored in a byte string. To show what this data structure looks like, consider the very small, single coloured, sprite in Figure 3b along with its new style data definition.

Figure 3b: Example sprite data
SpriteStandard data structureRun length encoded structure
RLE sprite example

const int spriteWidth  = 5;
const int spriteHeight = 4;

static const Pixel spriteData[] = {
  {  0, 17, 17, 17,  0 },
  { 17, 17,  0, 17, 17 },
  {  0,  0, 17,  0,  0 },
  {  0,  0, 17,  0,  0 } };

static const unsigned char sprite[] = {
  4,              // number of lines
  1, 3,           // first and last non-zero pixels in row 0
  17, 17, 17,     // row 0 data
  0, 4,           // first and last non-zero pixels in row 1
  17, 17, 0, 17, 17, // row 1 data
  2, 2,           // first and last non-zero pixels in row 2
  17,             // row 2 data
  2, 2,           // first and last non-zero pixels in row 3
  17 };           // row 3 data

Because the lines may vary in length, we can't define a C++ structure, but instead just have to parse the bytes as they come. The code to plot the sprites could be as illustrated in Figure 3c (rather verbose version to show how I pick up the row and column offsets), with sprite pointing to the start of the data vector defined in Figure 3b. (Of course, we should use the more optimal way of referencing screen memory explained last time, but I'm not interested in that at the moment.) It could be made marginally faster by storing a count of the length of a row instead of the column number of the last pixel. If the sprites also have long runs of the same pixel value, we could run length encode the visible pixels too: however, many sprites are multi-coloured and too small to take much advantage of this.

Figure 3c: Drawing function using the encoded data

register unsigned char* p = sprite;
unsigned char numRows = *p++;
for( unsigned char row = 0; row < numRows; row++ )
{
  unsigned char first = *p++;
  unsigned char last = *p++;
  for( unsigned char col = first; col <= last; col++ )
  {
    if( *p )
      Screen_base[ ( y + row ) * Screen_width + x + col ] = *p;
    p++;
  }
}

Another mechanism would be to create specific assembler for each sprite: the program fragments so far have all been based on one piece of code working on separate data structures for each sprite. Consider the case where we don't make this distinction - if this seems odd, be assured that I've found some experienced software engineers have difficulties with this concept! This is something that Lisp and Prolog programmers have been doing for years-just treating program as data and vice versa. When you look at what's in the computer's memory, they're both identical looking bit patterns, just interpreted differently. This would obviously give the fastest sprite drawing, since there are no loops, tests, or even variables apart from a video memory pointer: just an instruction for each individual non-zero pixel. Figure 4a shows an example in pseudo code.

Figure 4a: Pseudo-code to draw a sprite directly

s  screen position
*s++ =  red
*s++ =  red
*s++ =  blue
s += skip to next non-transparent pixel
*s++ = blue
etc.

The routine could be optimised further by splitting the sprite into groups of the same coloured pixels and using register to memory instead of immediate operands, as shown in Figure 4b, or by using clever indexed addressing modes. The costs are in code size and effort in creating the code in the first place - I wouldn't suggest doing it by hand (unless you really enjoy mindlessly tedious tasks), but it would be fairly straightforward to create a little utility to scan sprite data and produce some assembler code for a function to draw the sprite.

Figure 4b: Optimised direct drawing routine

s = screen position
a = red (where a is a register)
*s++ = a
*s++ = a
etc. for the rest of the red pixels
s = screen position + offset to first blue pixel
a = blue
*s++ = a
s += skip to next non-transparent pixel
*s++ = a
etc.

Instead of modifying the algorithm, what else can be examined to speed up sprite drawing: well, just hardware or compiler.

Changing the hardware

The first hardware option is simply to buy a faster processor, but as I showed last month, that does not make as much of a difference as magic numbers like MIPS or iCOMP (Intel's measure of processor power) ratings might suggest.

What about the graphics accelerator: some graphics adapters include masked sprite drawing in hardware - such as the Cirrus 5428 device which I just happen to have in my PC (Some information on this and similar devices may be found at the Cirrus site.). Even in the event that hardware drawing proves to be no faster than high speed assembler, it takes the load off the processor. However, the advanced capabilities of video cards and their usage are far from standardised, and it is very difficult to support a number of cards within one program. Since you'd have to implement sprite drawing in software for some boards, you might as well stick with that - only high powered machines are likely to have clever video hardware, and those machines can certainly cope with running the sprite drawing software.

Nevertheless, as I have a device which can perform hardware sprite drawing, I should at least try to characterise its performance. First of all, with my hardware, it is much easier to perform video memory to video memory copies than system (non-video) memory to video memory ones, so I cheated in the next bit of code and drew a sprite on the screen, and then used the hardware accelerator to copy it elsewhere on the screen. Second, the hardware block transfer mechanism does not work very well in mode 13h - this is because, although mode 13h video memory looks very logically laid out, internally it is somewhat of a mess. For mainly historic reasons, each pixel addressed by the processor it actually stored in parts of four separate bytes of video memory. The hardware copy mechanism works best in linear memory modes and so copes very poorly with mode 13h. The recent VESA standard (recent compared with VGA, that is) defines a number of linear graphics modes, but my particular BIOS does not support a linear 320 × mode so I changed to 800 × 600 for this particular experiment (which was much easier to do than try to create a non-standard 320 × 200 linear memory mode - the Scitech Display Doctor could have cured that!). The three extra pieces of code I used appear in Figure 5.

Figure 5: Using Cirrus hardware sprite drawing
(a) Setting a VESA screen mode

union REGS r;
r.x.ax = 0x4F02;               // VESA set mode command
r.x.bx = 0x103;                // 800 x 600 x 256 mode
int86( vidInt, &r, &r );
(b) Initialising the block copy hardware

void Image::initialiseBlt() const
{
  // Make source image
  plotSlowC( 0, 0 );

  // Source and destination pitch - screen width
  outport( 0x3CE, ( Screen_width << 8     ) | 0x24 );
  outport( 0x3CE, ( Screen_width & 0xFF00 ) | 0x25 );
  outport( 0x3CE, ( Screen_width << 8     ) | 0x26 );
  outport( 0x3CE, ( Screen_width & 0xFF00 ) | 0x27 );

  // Write mode - copy source
  outport( 0x3CE, 0x0D32 );

  // Copy mode - transparency compare
  outport( 0x3CE, 0x0830 );

  // Set transparency colour = 0
  outport( 0x3CE, 0x0034 );
  outport( 0x3CE, 0x0035 );

  // Set transparency mask = 0
  outport( 0x3CE, 0x0038 );
  outport( 0x3CE, 0x0039 );
}
(c) Triggering the hardware copy

void Image::plotHw( Coord x, Coord y ) const
{
  register const unsigned short port = 0x3CE;

  // Wait until blitter is free
  outportb( port, 0x31 );
  while( inportb( 0x3CF ) & 1 );

  // Set up width of blt
  outport( port, ( ( _width - 1 ) << 8 ) | 0x20 );
  outport( port, 0x0021 );

  // Height of blt
  outport( port, ( ( _height - 1 ) << 8 ) | 0x22 );
  outport( port, 0x0023 );

  // Set up destination address
  register const unsigned long dest =
                 (unsigned long)y * Screen_width + x;
  outport( port, (unsigned short)( dest << 8                ) | 0x28 );
  outport( port, (unsigned short)( ( dest & 0xFF00 )        ) | 0x29 );
  outport( port, (unsigned short)( ( dest & 0xF0000L ) << 8 ) | 0x2A );

  // Set up source address - ( 0, 0 )
  outport( port, 0x002C );
  outport( port, 0x002D );
  outport( port, 0x002E );

  // Finally, start the copy
  outport( port, 0x0231 );
}

Figure 5a changes to the 800 × 600 video mode, similar to the vidInterrupt() mode change routine given last time, but for VESA modes. (This is the old style VESA mode set operation-the current standard, 2.0, removes the emphasis on mode numbers and defines modes by characteristics such as pixel width, height and colour depth: see http://www.vesa.org for more information.)

Figure 5b initialises the hardware engine - this engine is controlled by a large number of registers, appearing at I/O port 3CEh and 3CFh: any access uses a write to the first to specify the register offset and the second to read or write the value. Figure 6 contains a list of the relevant registers-things like the source and destination addresses, pitch (number of bytes between the starts of successive lines, in this case, the screen width), number of rows and columns to copy, and the mode (transparent background copy here). As can be seen from the table, many of these registers are multi-byte, and accessing them requires several accesses. Some of the more sophisticated members of the Cirrus VGA controllers have the ability to map the registers into normal memory space which permits direct, say, long word access, as well as being faster then I/O operations - alas, mine does not. Because I/O operations are fairly expensive, I limit them somewhat by setting the constants once at the start, and then updating the working registers for each sprite draw. The initialisation routine in Figure 5b also paints the sprite on the screen, using one of the routines presented last month.

Finally, the copy itself is performed by the code in Figure 5c, which waits for the hardware copy engine to be free, and then loads it with the source, destination and size values before starting the copy. Remember, this will only work on Cirrus chips (and even then, probably not across the whole family).

Figure 6: Cirrus block copy registers
AddressUsageNumber of bitsConstant?
20hCopy region width11no
22hCopy region height10yes
24hDestination pitch12yes
26hSource pitch12yes
28hDestination start address21no
2ChSource start address21no
30hCopy mode8yes
31hStatus and control4no
32hRaster operation8yes
34hTransparency colour16yes
38hTransparency mask16yes

Rather disappointingly, this routine runs not much faster than the software masked copy given earlier (running in the same video mode of course, and copying from video memory to video memory - timings for different modes are not directly comparable since the larger the visible area, the more time the video circuitry needs to read it to stream data out to the screen, and therefore the less time the processor can access it), and the time taken scales almost linearly with the number of pixels to transfer (in other words, the software overhead of all those I/O operations is negligible). A detailed examination of code timing showed that we are wasting about 90% of the time at the wait loop at the start of plotHw(): this time could be used in a real application to perform non-video related processing in parallel with the screen copy.

GNU compiler

On the software side, recently I have come across a truly superb compiler. D.J. Delorie and some colleagues have done an excellent job in porting a lot of the GNU code development toolset to DOS, including of course the C compiler - which also compiles assembler, C++ and Objective C, and it's even possible to get a DOS port of Ada based on the same system. This is a protected mode 32-bit compiler for the 80386 and above which as a plus supports a form of virtual memory. You can find the package, known as DJGPP, in the SimtelNet collection - a good location for those of us in the U.K. is in the directory ftp://sunsite.doc.ic.ac.uk/packages/simtelnet/gnu/djgpp/. The compiler also has a dedicated news group.

I recoded last month's programs for this compiler - the changes were in the way that video memory was accessed and the use of embedded assembly code. The former changes were necessary because of the nature of protected mode: without going into the gory details, the 80386 in protected mode keeps code and data segments well apart from each other, and we have to select a special segment to be able to access any of what is conventionally referred to as DOS memory, which includes the video RAM. As is explained in the DJGPP FAQ, probably the best way to do this for our pixel by pixel manipulation is to select the DOS segment once using _farsetsel() and then referencing DOS memory via _farnspokeb(). Now, these functions are short inlined assembler functions which compile to one instruction apiece so they do not cost very much. Last month's slow C code was modified to that in Figure 7a.

Figure 7a: Drawing routine, simple C implementation in 32-bit mode

const Coord Screen_width = 320;
const Coord Screen_height = 200;
const unsigned long Screen_base = 0xA0000;

void Image::plotSlowC( Coord x, Coord y ) const
{
  _farsetsel( _dos_ds );

  for( Coord row = 0; row < _height; row++ )
    for( Coord col = 0; col < _width; col++ )
        _farnspokeb( Screen_base +
                     ( y + row ) * Screen_width + x + col,
                       _pixels[ row * _width + col ] );
}

As I showed last time, you can get impressive gains in speed by recoding the core code in assembler. However, embedding assembler within GNU C/C++ code is rather tricky, for three main reasons:

Figure 7b contains a port of last month's slower assembly routine, with the numbers referring to the notes below. (The faster variants' code can be found in the source archive - it didn't seem worth repeating it here.)

Figure 7b: Drawing function, assembler implementation in 32-bits

   void Image::plotAsm( Coord x, Coord y ) const
   {
1    register unsigned long s asm( "%edi" ) =
                     Screen_base + y * Screen_width + x;
2    register /*const*/ Coord offset asm( "%ebx" ) =
                     Screen_width - _width;

3    asm volatile( " cld                \n"
4                  " push %%es          \n"
5                  " movw %w0,%%es      \n"
                   "rowloop:            \n"
                   " movl %%eax,%%ecx   \n"
                   " rep; movsb         \n"
                   " addl %%ebx,%%edi   \n"
                   " decl %%edx         \n"
                   " jnz  rowloop       \n"
                   " pop  %%es"
                   : /* No outputs */
6                  : "rm" (_dos_ds), "a" (_width), "d" (_height),
                     "b" (offset), "S" (_pixels), "D" (s)
7                  : "%eax", "%ebx", "%ecx", "%edx", "%esi",
                     "%edi", /*"%es",*/ "cc" );
   }
  1. When we define (register) variables, we can add an attribute which specifies which register to use for those variables-much easier than trying to determine which register the compiler has chosen for the variable and using that in embedded assembler. (I think the compiler could probably have worked it out in this case, given the assembler insert below, but I did not take the chance!) Register names begin with percent symbols but otherwise follow the normal 80386 conventions.
  2. I had wanted to label this variable as constant as well as assigning it to a register for later use, but that causes a compiler error, so I just commented it out.
  3. An assembler insert has five parts: the asm directive; optional attributes, in this case volatile which prevents the compiler from reordering instructions; the assembly code itself, with statements separated by line breaks or semi-colons; a definition of outputs of the code, in this case none; a definition of the inputs (see note 7); and a definition of the registers touched by the code (see note 7).
  4. Within the assembler insert, registers are referenced with two percent signs. This is something to do with the compiler swallowing up one percent during its processing leaving just one for the assembler to process: I'm not really too worried, and just have to remember to type two.
  5. As mentioned earlier, operands are in reverse order to normal 80x86 assembler. The %w0 refers to the first argument specified in the input list.
  6. In a similar way to the variable definition attributes mentioned above, you can specify arguments to pass into the assembler insert. Here, these most usefully take the form of specifying the contents of a number of registers: for example, "a" (_width) tells the compiler to load this->_width into the EAX register - I don't have to write code to do that myself. These definitions are very flexible and I'll direct you towards the GNU compiler documentation for the plethora of options instead of trying to explain them in the space available here.
  7. Finally, there is a list of the registers touched by the assembly insert, so that the compiler can determine which ones should be preserved. In this case, I expect that few will be preserved since the insert ends the routine anyway. If the compiler were to add a few unnecessary register saves, I hope that the assembler would almost certainly optimise them away. I had wanted to specify ES as a register in this list: however, the compiler objected to that, so I removed it from the list and explicitly saved it on the stack within the assembly code.
Initially, this embedded assembler looks very daunting. Most of that is due to unfamiliarity, and you do get used to its idiosyncratic nature: the argument specification is very useful in preloading registers.

So how does DJGPP compare with the 16-bit compiler (Borland 4.53 in my case)? Figure 8 contains some speed comparisons (to two significant figures) of optimised code on a 486DX33 (there are a lot of optimisation options on the GNU compiler - I haven't played with them all, but just specified -O2 in the code used for the table; and I have not enabled Borland's fastthis optimisation).

Figure 8: Timing comparison (sprites per second)
Slow CFast CSlow AssemblerFast AssemblerMasked assembler
16-bit compiler6001200440052002000
DJGPP compiler18003700620080002300

There's a very dramatic speed-up for C, and a smaller but still measurable increase for assembler - I suspect that the masked assembler differences are due to the treatment of the C parts of the code as the assembler portions are almost identical. There is however a size penalty: the DJGPP executable is about three times as big as the Borland one, or twice as big when the symbolic debug information is removed. It must be said though that I do not have any experience about how these figures relate to much larger applications.

So, how fast can you draw sprites?

In last month's article and this, I have described a number of ways to speed up sprite drawing. The single most significant way to speed up this, or in fact any, operation is to apply skill as a programmer, which includes both optimising the algorithm (selecting data structures and operations on them) and optimising the implementation (possibly involving a rewrite in a more appropriate language, such as assembler).

Another notable improvement can be achieved by switching to a better compiler: in this case, from 16-bit to 32-bit. In fact, a programmer's life is generally simpler in the flat 32-bit world than the segmented 16-bit one.

Finally, by allowing the processor to perform other functions while drawing takes place, hardware drawing has the potential for being the fastest mechanism, particularly with the newer accelerator boards. Unfortunately, standards are somewhat lacking in this area, and it will be difficult to support all possibilities.


Sprite optimisation article start page | Gavin's home page | BeesKnees home page

Last modified on 28th December 1998