Part 1
Published in EXE, July 1996
The activity of drawing sprites on the screen is fundamental to just about all computer games, and forms a major part of an awful lot of them. The mechanism for performing this activity therefore has to be flexible and as fast as possible. This article and the one next month discuss implementation optimisation for this particular process as well as dependencies on compiler and hardware capabilities.
What's involved in painting a sprite on to the screen? Obviously we need the sprite data - an array of pixel values with associated width and height will suffice. I'll encode the image as a C++ class with a method to draw the sprite on the screen. And next, we need somewhere to put it - with a PC VGA display adapter operating in mode 13h (320 × 200 pixels, with one byte per pixel), it's the screen memory block starting at location A0000h. The reason that I chose this resolution is that mode 13h is one of the easiest to work with on the PC: the pixel at (0, 0) - top left - is the byte at address A0000h, pixel (0, 1) is at A0001h, (1, 0) is at A0140h, ie., A0000h + 320, etc. (Strictly speaking, I should also include colour palette information with the sprite definition, but I'll ignore that topic for this article and use the default palette for mode 13h.)
typedef unsigned short Coord | Position on screen or screen dimension. |
typedef unsigned int Count | Any counter (as long as it never goes negative). |
typedef unsigned char Pixel | Value put into the screen memory (strictly, the index into the colour palette). |
Figure 1 contains a few type definitions used for brevity and sprites are managed as
objects of class Image, declared in Figure 2. Note the double const in the _pixels
declaration, stating that both the pointer and the data to which it points are to be treated as
constant. As you can see, there are only two functions in this class: the constructor and the
drawing function. As a matter of course, class definitions should really include a copy
constructor, assignment operator and destructor: however, the defaults supplied by the
language are sufficient in this case because the class is so simple.
class Image
{
public:
Image( const Pixel* data,
Coord width, Coord height );
void plot( Coord x, Coord y );
private:
const Pixel* const _pixels;
const Coord _width, _height;
};
|
Image::Image( const Pixel* data, Coord w, Coord h ) :
_pixels( data ),
_width( w ),
_height( h )
{}
|
The constructor (Figure 3a) simply initialises the internal variables with the values passed
in. The member variables are initialised rather than assigned: it is this mechanism that allows
me to have them marked as constant within the class declaration for added safety, as noted
earlier. If instead I had coded the constructor as shown in Figure 3b, I would have had to omit
the const keyword in the member declarations - this would mean that any subsequent
accidental assignments to these variables would not be spotted as errors by the compiler.
The presence of const does not guarantee correctness, but certainly helps trap a lot of
careless mistakes. In general, the const keyword should be used wherever possible to
reduce the likelihood of accidentally altering something that ought to
be invariant.
Image::Image( const Pixel* data, Coord w, Coord h )
{
_pixels = data;
_width = w;
_height = h;
}
|
The pixel data vector passed into the constructor is not copied - instead, the Image class
simply references it. I did it this way because I intend the data to originate from a C++ vector
defined within the program, so there is no need to copy it. Alternatives include reading it from
a file, in which case a valid constructor might be one that takes the name of a file, or a
handle to a file, and reads the data into a piece of storage allocated within the class - for
example, something like Figure 3c. This example assumes the file stores the height and
width as single bytes, and then the sprite data as a stream of bytes: not a particularly clever
format, but it shows the basic technique for reading the data from a file. In this case too, the
member variables cannot be declared constant since they are assigned at run time. For the
current implementation, nothing is gained by permitting the sprite data to be read from a file:
in fact, it's marginally less efficient - extra time has to be taken to read in the data, and extra
space used to store the dynamic memory header block.
Image::Image( const istream& is )
{
unsigned char temp;
is.get( temp ); _width = temp;
is.get( temp ); _height = temp;
_pixels = new Pixel[ _width * _height ];
for( Count i = 0; i < _width * _height; i++ )
is.get( _pixels[ i ] );
}
|
Back to the real topic of this discussion: the vector defines the pixel values of the sprite,
starting at top left, moving along and down the sprite. I could not define the data as a two
dimensional array in C ++ (or C for that matter) because different sprites will have different
sizes, and C++ does not store information about dimension sizes with the array. I could have
defined an array type for the largest sprite I want to use, but that would be wasteful of storage
for smaller ones; or I could have used a clever C++ matrix class, but that would add
unnecessary inefficiency to what has to be a very quick operation, copying the sprite to the
screen. Instead, the sprite pixels are arranged as a simple one dimensional array, with the
pixel for row x (counting from 0 at the top) column y (0 at the left) being the value at
_pixels[ x × _width + y ] - although there is a multiplication in that, it doesn't really matter
because we can very easily optimise that away, as will be shown later.
The sprite I'm going to use for the later speed comparisons is one of the little alien things shown in Figure 4a (OK, so I'm no artist), along with part of its data definition. Thus, constructing an alien sprite is as simple as the single line in Figure 4b.
![]() |
const int alienWidth = 27;
const int alienHeight = 31;
static const Pixel alienData[] = {
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, ...
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, ...
0x00, 0x00, 0x00, 0x00, 0x20, 0x00, ...
...
0x00, 0x28, 0x28, 0x28, 0x28, 0x28, ...
0x28, 0x28, 0x28, 0x28, 0x28, 0x28, ...
0x28, 0x28, 0x28, 0x28, 0x28, 0x0F, ...
0x28, 0x28, 0x28, 0x28, 0x28, 0x0F, ...
...
}; |
const Image alien( alienData, alienWidth, alienHeight ); |
plot() function has the job of putting the sprite on the screen, with its top left corner
at (x, y) on the screen. I'm going to take this opportunity to show how optimising the code
helps - the function definitions that follow have long names to indicate what algorithm is
being used, but they are really all just variants of the plot() member function declared in
Figure 2. My first attempt at putting the sprite on the screen looked something like Figure 5a,
where the Screen_XXX values are defined as:
const Coord Screen_width = 320;
const Coord Screen_height = 200;
Pixel* const Screen_base = (Pixel*)MK_FP( 0xA000, 0 );
The first two are the width and height of
the mode 13h VGA display screen, and the final one defines a pointer: because of the
standard PC 16-bit operation, location A0000h cannot be referenced directly as such and,
instead, it has to be accessed via segment and offset - fortunately, the whole of the mode
13h screen memory fits within one segment so once the segment base is set, it need not be
changed. The macro MK_FP( 0xA000, 0 ) creates a far pointer from the specified segment
and offset. Incidentally, I built the code with the large memory model, which means all my
pointers are far pointers - the code could be made more efficient by using near pointers
instead, but the optimisations below remove most of the effects of this inefficiency anyway.
void Image::plotSlowC( Coord x, Coord y )
{
for( Coord row = 0; row < _height; row++ )
for( Coord col = 0; col < _width; col++ )
Screen_base[ ( y + row ) * Screen_width + x + col ] =
_pixels[ row * _width + col ];
}
|
Now this first effort is hideously slow, and I'm almost ashamed that I made it public.
Nevertheless, it does work, and does show the logic of the operation. How can we make it
faster? First, notice that it always steps through the sprite data in sequence-when it has
completed one row, it immediately starts on the next: for example, the last pixel on the third
line is _pixels[ 2 × _width + (_width - 1) ] and the first
on the fourth line is
_pixels[ 3 × _width + 0 ] - the very next location in the vector. This means that the multiplication and
addition on the right side of the assignment can be eliminated: we create a pointer to the start
of the sprite data, and just step the pointer along for each pixel, regardless of whether it's in
the middle of a line or at the end.
Can we do the same for the left? Almost: we can increment a pointer in exactly the same
way as we scan along a line, but at the end of each row, we have to skip a lot of pixels to get
to the equivalent position on the screen, Screen_width - _width to be precise. Putting these
two together gives the function in Figure 5b, without any multiplications in the loop. This is
much faster, even though it introduces a bundle of local variables - this is the usual trade-off
of time vs space.
void Image::plotFasterC( Coord x, Coord y )
{
const Pixel* p = _pixels;
const Pixel* s = Screen_base + y * Screen_width + x;
const Coord offset = Screen_width - _width;
for( Coord row = 0; row < _height; row++ )
{
for( Coord col = 0; col < _width; col++ )
*s++ = *p++;
s += offset; // Back to the right place for the next row
}
}
|
There are still a few tweaks to make the C a little faster. Counting down, and comparing
with zero is marginally faster than counting up (at least, if the compiler is vaguely intelligent);
and placing heavily used variables in registers can speed things up-but remember that the
processor has only a limited number of registers, so the register keyword is really only a
hint to the compiler that the associated variable is heavily used: some compilers ignore the
keyword, particularly when compiler optimisation is disabled. Another minor optimisation has
to do with the use of for loops: these check the test condition before processing the loop
body (so that, say, for( i = 0; i < 0; i++ ) correctly executes zero times) but we already
know that our sprites have non-zero height and width, so we can move the test to the end of
the loop with a do...while structure, and perform the loop termination test once less. The best
general sprite copying C++ routine I could come up with is given in Figure 5c. As I'll show
towards the end of this article, this is roughly twice as fast as the first example. The only
thing that can be done to improve this in C++ is to turn on all compiler optimisations and
fiddle with near pointers! However, if we delve into assembler, we can make it a lot faster
again (by another factor of almost four).
void Image::plotFastC( Coord x, Coord y )
{
const register Pixel* p = _pixels;
register Pixel* s = Screen_base + y * Screen_width + x;
const register Coord offset = Screen_width - _width;
register Coord row = _height;
do
{
register Coord col = _width;
do
*s++ = *p++;
while( --col > 0 );
s += offset; // Back to the right place for the next row
} while( --row > 0 );
}
|
-S option on
many compilers) and examine the file to get the general structure of the code. If nothing else,
this shows where variables are on the stack. Most of my assembly writing is in translating the
compiler's rather dumb register and variable assignments into much more efficient ones. In
fact, even if you don't intend to write any assembly code, it is still worthwhile to peruse the
compiler generated assembler occasionally as it can suggest improvements to the C code,
mainly to conpensate for the compiler's lack of intelligence. Figure 6 shows what the
Borland 4.53 C++
compiler did with the loops of the last C++ function above with all speed optimisations
enabled.
1 ; const register Pixel* p = _pixels;
les bx,dword ptr [bp+6]
mov dx,word ptr es:[bx+2]
mov ax,word ptr es:[bx]
mov word ptr [bp-2],dx
mov word ptr [bp-4],ax
2 ; register Pixel* s = Screen_base + y * Screen_width + x;
mov ax,word ptr [bp+12]
imul ax,ax,320
mov dx,word ptr DGROUP:Screen_base+2
mov bx,word ptr DGROUP:Screen_base
add bx,ax
add bx,word ptr [bp+10]
mov word ptr [bp-6],dx
mov word ptr [bp-8],bx
3 ; const register Coord offset = Screen_width - _width;
mov bx,word ptr [bp+6]
mov ax,320
sub ax,word ptr es:[bx+4]
mov word ptr [bp-10],ax
4 ; register Coord row = _height;
mov cx,word ptr es:[bx+6]
@4@226:
; do
; {
5 ; register Coord col = _width;
les bx,dword ptr [bp+6]
mov dx,word ptr es:[bx+4]
@4@254:
; do
6 ; *s++ = *p++;
les bx,dword ptr [bp-4]
mov al,byte ptr es:[bx]
les bx,dword ptr [bp-8]
mov byte ptr es:[bx],al
inc word ptr [bp-4]
inc word ptr [bp-8]
7a ; while( --col > 0 );
dec dx
mov ax,dx
or ax,ax
ja short @4@254
; s += offset;
mov ax,word ptr [bp-10]
add word ptr [bp-8],ax
7b ; } while( --row > 0 );
dec cx
mov ax,cx
or ax,ax
ja short @4@226
; }
|
The source code appears as comments in the assembler, and the odd looking label names are the ones automatically generated by the compiler. The numbers at the left of the listing refer to notes below:
this pointer: if you look at the
assembler produced for a number of C++ member functions, you'll see that they all use BX
for this purpose (unless you use the fastthis optimisation available in this
compiler), but remember that this is implementation dependent and other compilers may use
different register conventions. According to the declaration of the Image class (Figure 2), the
first member variable is the _pixels pointer, as is obvious here in the assignment from the
long word at address ES:[BX]. Next, you can see that the compiler has ignored my register
hint since it immediately copies the pointer into memory on the stack - the long word at
address BP-4 is pointer p. register has been ignored, and BP-8 is used as the pointer s.offset is stored in BP-10._height field in the Image object, and so we can see that the
CX register is used as the counter for the outer loop.
void Image::plotSlowAsm( Coord x, Coord y ) const
{
assert( x + _width <= Screen_width );
assert( y + _height <= Screen_height );
Pixel* s = Screen_base + y * Screen_width + x;
const Pixel* p = _pixels;
_asm {
push ds; // Preserve!
les bx,this; // Use ES:BX as "this" pointer
mov ax,es:[bx]._width; // AX contains "width"
mov dx,es:[bx]._height; // DX contains "height"
mov bx,Screen_width;
sub bx,ax; // BX contains "offset"
les di,s; // ES:DI is "s"
lds si,p; // DS:SI id "p"
cld;
}
rowloop: // Have to come out to C(++) to define label
_asm mov cx,ax; // Get the row width
movint:
_asm {
rep movsb; // Copy a single row
jcxz short movdone; // Done, or interrupted?
loop movint;
}
movdone:
_asm {
add di,bx; // Move to the next row
dec dx;
jnz rowloop; // Finished all the rows?
pop ds;
}
}
|
Look carefully at that inner loop - it's not just coded as:
mov cx,limit
rem movsb
Instead, I went to the effort of writing it as:
mov cx,limit
restart: rep movsb
jcxz short done
loop restart
done:
Why? It happens that when programming anything lower than a 386 (ie., from an 8086 to an 80286), you have to be aware of an interesting feature of the processor: if a repeated string instruction is interrupted, it may not complete properly! REP is not an instruction in its own right - it's a prefix byte for string operations. On the 80386 and above, if a string operation is interrupted before the count hits zero, the address pushed on to the stack when the interrupt is processed is that of the complete instruction including prefixes: however, with the older processors, prefix bytes were not taken into account. The result of that on the older processors is that, upon return from interrupt, the REP is ignored, the processor executes a single string operation and then continues. In the longer form, the CX register is checked at the end of the REP loop, and if it's not zero, the loop must have been interrupted, so re-enter it. If you know you're only coding for 80386 and above, you can dispense with that check and use the simpler form.
Can we do any more? Yes, but not much. The code above use the MOVSB instruction to copy a byte at a time. All of the 80x86 family members have the MOVSW instruction, to move two bytes at a time. Figure 7b makes use of that and note the check to see if we have an odd number of bytes to move (For brevity, I've omitted the REP loop restart mentioned above). If you're writing for 80386 and above only, you have access to MOVSD, to move 4 bytes at a time - I'll leave that bit as an exercise for you! I think the only way to speed up this particular bit of code would be to replace the C++ entry and exit code with more assembler and juggle with instruction ordering to assist the processor pipelines in the higher end processors, but that may be more effort than it's worth.
void Image::plotFastAsm( Coord x, Coord y ) const
{
Pixel* s = Screen_base + y * Screen_width + x;
const Pixel* p = _pixels;
_asm {
push ds; // Preserve!
les bx,this; // Use ES:BX as "this" pointer
mov ax,es:[bx]._width; // AX contains "width"
mov dx,es:[bx]._height; // DX contains "height"
mov bx,Screen_width;
sub bx,ax; // BX contains "offset"
les di,s; // ES:DI is "s"
lds si,p; // DS:SI id "p"
cld;
shr ax,1; // CX is width / 2
jc oddrowloop; // Was width odd?
}
rowloop: // Have to come out to C(++) to define label
_asm {
mov cx,ax; // Get the row width
rep movsw; // Copy a single row
add di,bx; // Move to the next row
dec dx;
jnz rowloop; // Finished all the rows?
jmp endroutine;
}
oddrowloop: // Have to come out to C(++) to define label
_asm {
mov cx,ax; // Get the row width
rep movsw; // Copy a single row
movsb; // ... and the last pixel
add di,bx; // Move to the next row
dec dx;
jnz oddrowloop; // Finished all the rows?
}
endroutine:
_asm pop ds;
}
|
Images on to the screen in random
positions and time the operations, using the definition of the sprite above. Drawing a single
sprite is much too quick to measure accurately, so I draw a large number, numIterations to
be exact, giving a much more accurate result. As will be seen below, the assembly code
versions are much faster than the C ones, so I add a further multiplication factor to the
number of those I draw. The PC clock operates at about 18.2 ticks per second, so the timings
will not be particularly precise, and they also include the overhead of the loops and random
number generation, but the results will give a good idea of the relative performance of the
different implementations of the drawing routine.
const int vidInt = 0x10; // Interrupt num for video fns const int setMode = 0x00; // Function code to set video mode const int getMode = 0x0F; // Function to read video mode int vidInterrupt( int fn, int al = 0 ) { union REGS r; r.h.ah = fn; r.h.al = al; int86( vidInt, &r, &r ); return r.h.al; } const long numIterations = 10000; const long asmScale = 10; int main() { int oldMode = vidInterrupt( getMode ); vidInterrupt( setMode, 0x13 ); long i; clock_t start = clock(); for( i = 0; i < numIterations; i++ ) alien.plotSlowC( random( 200 ), random( 150 ) ); clock_t first = clock(); for( i = 0; i < numIterations; i++ ) alien.plotFastC( random( 200 ), random( 150 ) ); clock_t second = clock(); for( i = 0; i < numIterations * asmScale; i++ ) alien.plotSlowAsm( random( 200 ), random( 150 ) ); clock_t third = clock(); for( i = 0; i < numIterations * asmScale; i++ ) alien.plotFastAsm( random( 200 ), random( 150 ) ); clock_t last = clock(); vidInterrupt( setMode, oldMode ); const long mult = CLOCKS_PER_SEC * numIterations; cout << "Slow C: " << mult / ( first - start ) << " per second\n" "Fast C: " << mult / ( second - first ) << " per second\n" "Slow assembler: " << mult * asmScale / ( third - second ) << " per second\n" "Fast assembler: " << mult * asmScale / ( last - third ) << " per second\n" return 0; } |
The vidInterrupt() routine is a short, and not particularly efficient, way to change video
modes, and perform other screen based operations via the BIOS. It's only used three times
in the program (to get the current mode, to set mode 13h and finally, to restore the original),
so I am not too concerned about its speed.
| Slow C | Fast C | Slow Assembler | Fast Assembler | |
|---|---|---|---|---|
| 25MHz 386 SX, compiler optimisation enabled | 120 | 180 | 720 | 890 |
| 33MHz 486 DX, no compiler optimisation | 380 | 1100 | 4400 | 5200 |
| 33MHz 486 DX, compiler optimisation enabled | 600 | 1200 | 4400 | 5200 |
| 90MHz Pentium, compiler optimisation enabled | 2000 | 3400 | 6200 | 8700 |
The compiler optimisations do not make anywhere near as much a difference as using a good algorithm in the first place (and the largest difference is where the starting point is a very sloppy implementation). Having a faster processor helps, but once again, the change is swamped by the effect of selecting a better algorithm - for example, the tuned version on the slow 386 outperforms the sloppy version on the higher speed 486 and the tuned 486 code outperforms even the best C on the Pentium! I feel that these days programmers are becoming too lazy and just assume that throwing more money (in other words, more memory or a faster processor) at a problem solves it, whereas the application of common sense and programming skills is much more beneficial.
Last modified on 4th January 2001