Part 2
Published in EXE, August 1996
Last month I presented a number of implementations of basic sprite drawing routines, ranging from initially obvious, but very inefficient, C++ code to very efficient assembler. However, the code did not really do the job required of a sprite drawing routine... This month, I'll first address the flaw in the algorithm, and then present a pot pourri of ideas for speeding up sprite drawing still further. The final section will be a brief discussion of an excellent DOS port of the GNU C/C++ compiler.
Last month's routines certainly draw the sprite on the screen, but if you ran any of the code so far, you'll have noticed that the background of the sprite is drawn as black, chipping holes out of ones already drawn. This is not very useful in video games, so we need to have some way of specifying that only some of the pixels in the sprite are drawn. There are two obvious ways to do this:
plot() looks like Figure 1a in rather inefficient C++ (remember, get it working, then optimise
or rewrite in assembler); or| (a) mask array | (b) transparency value |
|---|---|
for( Coord col = _width; col > 0; col-- )
{
if( *mask++ )
*s = *p;
s++;
p++;
}
|
for( Coord col = _width; col > 0; col-- )
{
if( *p )
*s = *p;
s++;
p++;
}
|
I'm going to use the latter because it does not require a second data vector, and because
we can easily spare one colour out of the 256 available to us - in fact, it's not even as bad as
that, because this is only ruling out the use of this particular value in sprites, and it can still
be used for a background colour. An efficient assembler version of plot() is shown in
Figure 2.
void Image::plot( Coord x, Coord y ) const
{
Pixel* s = Screen_base + y * Screen_width + x;
const Pixel* p = _pixels;
_asm {
push ds; // Preserve!
les bx,this; // Use ES:BX as "this" pointer
mov ax,es:[bx]._width; // AX is "width" (temporarily)
mov dx,es:[bx]._height; // DX contains "height"
mov bx,Screen_width;
sub bx,ax; // BX contains "offset"
les di,s; // ES:DI is "s"
lds si,p; // DS:SI id "p"
push bp; // Preserve!
mov bp,ax; // BP contains "width"
cld;
}
rowloop: // Have to come out to C(++) to define label
_asm mov cx,bp; // Get the row width
colloop:
_asm {
lodsb; // Get the pixel
and al,al; // Set Z flag
jz short skipit;
mov es:[di],al; // Wasn't blank, so copy it
}
skipit:
_asm {
inc di;
loop colloop;
add di,bx; // Move to the next row
dec dx;
jnz short rowloop; // Finished all the rows?
pop bp;
pop ds;
}
}
|
Unfortunately, because we have to examine each pixel individually, we can't use double and quad move instructions. Incidentally, I'm using the BP register to store the width for each row - I used AX in last month's code, but obviously can't here because LODS uses AL. This code runs at about half the speed of the slow assembler non-masked copy. How could we speed it up? Here are outlines for a couple of mechanisms which you might want to consider:
|
number of lines column of the first non-zero pixel in the line, column of the last non-zero pixel ...pixel values (including embedded zeros)... ditto for other lines... |
If we assume that a sprite will have transparent regions no longer than 255 pixels, the data can all be stored in a byte string. To show what this data structure looks like, consider the very small, single coloured, sprite in Figure 3b along with its new style data definition.
| Sprite | Standard data structure | Run length encoded structure |
|---|---|---|
![]() |
const int spriteWidth = 5;
const int spriteHeight = 4;
static const Pixel spriteData[] = {
{ 0, 17, 17, 17, 0 },
{ 17, 17, 0, 17, 17 },
{ 0, 0, 17, 0, 0 },
{ 0, 0, 17, 0, 0 } }; |
static const unsigned char sprite[] = {
4, // number of lines
1, 3, // first and last non-zero pixels in row 0
17, 17, 17, // row 0 data
0, 4, // first and last non-zero pixels in row 1
17, 17, 0, 17, 17, // row 1 data
2, 2, // first and last non-zero pixels in row 2
17, // row 2 data
2, 2, // first and last non-zero pixels in row 3
17 }; // row 3 data |
Because the lines may vary in length, we can't define a C++ structure, but instead just have to parse the bytes as they come. The code to plot the sprites could be as illustrated in Figure 3c (rather verbose version to show how I pick up the row and column offsets), with sprite pointing to the start of the data vector defined in Figure 3b. (Of course, we should use the more optimal way of referencing screen memory explained last time, but I'm not interested in that at the moment.) It could be made marginally faster by storing a count of the length of a row instead of the column number of the last pixel. If the sprites also have long runs of the same pixel value, we could run length encode the visible pixels too: however, many sprites are multi-coloured and too small to take much advantage of this.
register unsigned char* p = sprite;
unsigned char numRows = *p++;
for( unsigned char row = 0; row < numRows; row++ )
{
unsigned char first = *p++;
unsigned char last = *p++;
for( unsigned char col = first; col <= last; col++ )
{
if( *p )
Screen_base[ ( y + row ) * Screen_width + x + col ] = *p;
p++;
}
} |
Another mechanism would be to create specific assembler for each sprite: the program fragments so far have all been based on one piece of code working on separate data structures for each sprite. Consider the case where we don't make this distinction - if this seems odd, be assured that I've found some experienced software engineers have difficulties with this concept! This is something that Lisp and Prolog programmers have been doing for years-just treating program as data and vice versa. When you look at what's in the computer's memory, they're both identical looking bit patterns, just interpreted differently. This would obviously give the fastest sprite drawing, since there are no loops, tests, or even variables apart from a video memory pointer: just an instruction for each individual non-zero pixel. Figure 4a shows an example in pseudo code.
s screen position *s++ = red *s++ = red *s++ = blue s += skip to next non-transparent pixel *s++ = blue etc. |
The routine could be optimised further by splitting the sprite into groups of the same coloured pixels and using register to memory instead of immediate operands, as shown in Figure 4b, or by using clever indexed addressing modes. The costs are in code size and effort in creating the code in the first place - I wouldn't suggest doing it by hand (unless you really enjoy mindlessly tedious tasks), but it would be fairly straightforward to create a little utility to scan sprite data and produce some assembler code for a function to draw the sprite.
s = screen position a = red (where a is a register) *s++ = a *s++ = a etc. for the rest of the red pixels s = screen position + offset to first blue pixel a = blue *s++ = a s += skip to next non-transparent pixel *s++ = a etc. |
Instead of modifying the algorithm, what else can be examined to speed up sprite drawing: well, just hardware or compiler.
What about the graphics accelerator: some graphics adapters include masked sprite drawing in hardware - such as the Cirrus 5428 device which I just happen to have in my PC (Some information on this and similar devices may be found at the Cirrus site.). Even in the event that hardware drawing proves to be no faster than high speed assembler, it takes the load off the processor. However, the advanced capabilities of video cards and their usage are far from standardised, and it is very difficult to support a number of cards within one program. Since you'd have to implement sprite drawing in software for some boards, you might as well stick with that - only high powered machines are likely to have clever video hardware, and those machines can certainly cope with running the sprite drawing software.
Nevertheless, as I have a device which can perform hardware sprite drawing, I should at least try to characterise its performance. First of all, with my hardware, it is much easier to perform video memory to video memory copies than system (non-video) memory to video memory ones, so I cheated in the next bit of code and drew a sprite on the screen, and then used the hardware accelerator to copy it elsewhere on the screen. Second, the hardware block transfer mechanism does not work very well in mode 13h - this is because, although mode 13h video memory looks very logically laid out, internally it is somewhat of a mess. For mainly historic reasons, each pixel addressed by the processor it actually stored in parts of four separate bytes of video memory. The hardware copy mechanism works best in linear memory modes and so copes very poorly with mode 13h. The recent VESA standard (recent compared with VGA, that is) defines a number of linear graphics modes, but my particular BIOS does not support a linear 320 × mode so I changed to 800 × 600 for this particular experiment (which was much easier to do than try to create a non-standard 320 × 200 linear memory mode - the Scitech Display Doctor could have cured that!). The three extra pieces of code I used appear in Figure 5.
| (a) Setting a VESA screen mode |
|---|
union REGS r; r.x.ax = 0x4F02; // VESA set mode command r.x.bx = 0x103; // 800 x 600 x 256 mode int86( vidInt, &r, &r ); |
| (b) Initialising the block copy hardware |
void Image::initialiseBlt() const
{
// Make source image
plotSlowC( 0, 0 );
// Source and destination pitch - screen width
outport( 0x3CE, ( Screen_width << 8 ) | 0x24 );
outport( 0x3CE, ( Screen_width & 0xFF00 ) | 0x25 );
outport( 0x3CE, ( Screen_width << 8 ) | 0x26 );
outport( 0x3CE, ( Screen_width & 0xFF00 ) | 0x27 );
// Write mode - copy source
outport( 0x3CE, 0x0D32 );
// Copy mode - transparency compare
outport( 0x3CE, 0x0830 );
// Set transparency colour = 0
outport( 0x3CE, 0x0034 );
outport( 0x3CE, 0x0035 );
// Set transparency mask = 0
outport( 0x3CE, 0x0038 );
outport( 0x3CE, 0x0039 );
} |
| (c) Triggering the hardware copy |
void Image::plotHw( Coord x, Coord y ) const
{
register const unsigned short port = 0x3CE;
// Wait until blitter is free
outportb( port, 0x31 );
while( inportb( 0x3CF ) & 1 );
// Set up width of blt
outport( port, ( ( _width - 1 ) << 8 ) | 0x20 );
outport( port, 0x0021 );
// Height of blt
outport( port, ( ( _height - 1 ) << 8 ) | 0x22 );
outport( port, 0x0023 );
// Set up destination address
register const unsigned long dest =
(unsigned long)y * Screen_width + x;
outport( port, (unsigned short)( dest << 8 ) | 0x28 );
outport( port, (unsigned short)( ( dest & 0xFF00 ) ) | 0x29 );
outport( port, (unsigned short)( ( dest & 0xF0000L ) << 8 ) | 0x2A );
// Set up source address - ( 0, 0 )
outport( port, 0x002C );
outport( port, 0x002D );
outport( port, 0x002E );
// Finally, start the copy
outport( port, 0x0231 );
} |
Figure 5a changes to the 800 × 600 video mode, similar to the vidInterrupt() mode
change routine given last time, but for VESA modes.
(This is the old style VESA mode set
operation-the current standard, 2.0, removes the emphasis on mode numbers and defines
modes by characteristics such as pixel width, height and colour depth: see
http://www.vesa.org for more information.)
Figure 5b initialises the hardware engine - this engine is controlled by a large number of registers, appearing at I/O port 3CEh and 3CFh: any access uses a write to the first to specify the register offset and the second to read or write the value. Figure 6 contains a list of the relevant registers-things like the source and destination addresses, pitch (number of bytes between the starts of successive lines, in this case, the screen width), number of rows and columns to copy, and the mode (transparent background copy here). As can be seen from the table, many of these registers are multi-byte, and accessing them requires several accesses. Some of the more sophisticated members of the Cirrus VGA controllers have the ability to map the registers into normal memory space which permits direct, say, long word access, as well as being faster then I/O operations - alas, mine does not. Because I/O operations are fairly expensive, I limit them somewhat by setting the constants once at the start, and then updating the working registers for each sprite draw. The initialisation routine in Figure 5b also paints the sprite on the screen, using one of the routines presented last month.
Finally, the copy itself is performed by the code in Figure 5c, which waits for the hardware copy engine to be free, and then loads it with the source, destination and size values before starting the copy. Remember, this will only work on Cirrus chips (and even then, probably not across the whole family).
| Address | Usage | Number of bits | Constant? |
|---|---|---|---|
| 20h | Copy region width | 11 | no |
| 22h | Copy region height | 10 | yes |
| 24h | Destination pitch | 12 | yes |
| 26h | Source pitch | 12 | yes |
| 28h | Destination start address | 21 | no |
| 2Ch | Source start address | 21 | no |
| 30h | Copy mode | 8 | yes |
| 31h | Status and control | 4 | no |
| 32h | Raster operation | 8 | yes |
| 34h | Transparency colour | 16 | yes |
| 38h | Transparency mask | 16 | yes |
Rather disappointingly, this routine runs not much faster than the software masked copy
given earlier (running in the same video mode of course, and copying from video memory to
video memory - timings for different modes are not directly comparable since the larger the
visible area, the more time the video circuitry needs to read it to stream data out to the
screen, and therefore the less time the processor can access it), and the time taken scales
almost linearly with the number of pixels to transfer (in other words, the software overhead of
all those I/O operations is negligible). A detailed examination of code timing showed that we
are wasting about 90% of the time at the wait loop at the start of plotHw(): this time could be
used in a real application to perform non-video related processing in parallel with the screen
copy.
I recoded last month's programs for this compiler - the changes were in the way that
video memory was accessed and the use of embedded assembly code. The former changes
were necessary because of the nature of protected mode: without going into the gory details,
the 80386 in protected mode keeps code and data segments well apart from each other, and
we have to select a special segment to be able to access any of what is conventionally
referred to as DOS memory, which includes the video RAM. As is explained in the
DJGPP FAQ,
probably the best way to do this for our pixel by pixel manipulation is to select the
DOS segment once using _farsetsel() and then referencing DOS memory via
_farnspokeb(). Now, these functions are short inlined assembler functions which compile to
one instruction apiece so they do not cost very much.
Last month's slow C code was
modified to that in Figure 7a.
const Coord Screen_width = 320;
const Coord Screen_height = 200;
const unsigned long Screen_base = 0xA0000;
void Image::plotSlowC( Coord x, Coord y ) const
{
_farsetsel( _dos_ds );
for( Coord row = 0; row < _height; row++ )
for( Coord col = 0; col < _width; col++ )
_farnspokeb( Screen_base +
( y + row ) * Screen_width + x + col,
_pixels[ row * _width + col ] );
} |
As I showed last time, you can get impressive gains in speed by recoding the core code in assembler. However, embedding assembler within GNU C/C++ code is rather tricky, for three main reasons:
void Image::plotAsm( Coord x, Coord y ) const
{
1 register unsigned long s asm( "%edi" ) =
Screen_base + y * Screen_width + x;
2 register /*const*/ Coord offset asm( "%ebx" ) =
Screen_width - _width;
3 asm volatile( " cld \n"
4 " push %%es \n"
5 " movw %w0,%%es \n"
"rowloop: \n"
" movl %%eax,%%ecx \n"
" rep; movsb \n"
" addl %%ebx,%%edi \n"
" decl %%edx \n"
" jnz rowloop \n"
" pop %%es"
: /* No outputs */
6 : "rm" (_dos_ds), "a" (_width), "d" (_height),
"b" (offset), "S" (_pixels), "D" (s)
7 : "%eax", "%ebx", "%ecx", "%edx", "%esi",
"%edi", /*"%es",*/ "cc" );
} |
asm directive; optional attributes, in this case
volatile which prevents the compiler from reordering instructions; the assembly code itself,
with statements separated by line breaks or semi-colons; a definition of outputs of the code,
in this case none; a definition of the inputs (see note 7); and a definition of the registers
touched by the code (see note 7).%w0 refers to the first argument specified in the input list."a" (_width) tells the
compiler to load this->_width into the EAX register - I don't have to write code to do that
myself. These definitions are very flexible and I'll direct you towards the GNU compiler
documentation for the plethora of options instead of trying to explain them in the space
available here.
So how does DJGPP compare with the 16-bit compiler (Borland 4.53 in my case)?
Figure 8 contains some speed comparisons (to two significant figures) of optimised code on a
486DX33 (there are a lot of optimisation options on the GNU compiler - I haven't played with
them all, but just specified -O2 in the code used for the table; and I have not enabled
Borland's fastthis optimisation).
| Slow C | Fast C | Slow Assembler | Fast Assembler | Masked assembler | |
|---|---|---|---|---|---|
| 16-bit compiler | 600 | 1200 | 4400 | 5200 | 2000 |
| DJGPP compiler | 1800 | 3700 | 6200 | 8000 | 2300 |
There's a very dramatic speed-up for C, and a smaller but still measurable increase for assembler - I suspect that the masked assembler differences are due to the treatment of the C parts of the code as the assembler portions are almost identical. There is however a size penalty: the DJGPP executable is about three times as big as the Borland one, or twice as big when the symbolic debug information is removed. It must be said though that I do not have any experience about how these figures relate to much larger applications.
Another notable improvement can be achieved by switching to a better compiler: in this case, from 16-bit to 32-bit. In fact, a programmer's life is generally simpler in the flat 32-bit world than the segmented 16-bit one.
Finally, by allowing the processor to perform other functions while drawing takes place, hardware drawing has the potential for being the fastest mechanism, particularly with the newer accelerator boards. Unfortunately, standards are somewhat lacking in this area, and it will be difficult to support all possibilities.
Last modified on 28th December 1998