Friday, May 23, 2014

String Theory

While I do have an interest in theoretical physics, this column is not about that. When it comes to programming, strings are taken for granted. Pretty much every language comes with a comprehensive string library. In modern-day programming, it really is not necessary to write a string library. After all, most string libraries are really well written so there is nothing to be gained from writing your own.

When it comes to early game consoles, however, writing your own library may make sense. Early consoles don't have that much memory so you only want functionality that is actually required. It is certainly possible to find general purpose 6502 string libraries, but basic strings are simple enough that it is fairly easy to write a few routines for dealing with them.

Ultimately strings are simply arrays of characters. The type of characters is based on the platform. For the NES, we use bytes that indicate which pattern table character to display. This does not have to correspond to ASCII, but my pattern tables tend to at least partially incorporate the ASCII character set. For modern string libraries unicode may be used,quite possibly using UTF-8 formatting. There are numerous ways of implementing strings. The three most common methods are fixed length, Pascal-style and C-style strings.

Fixed length strings have a set length with any unused characters being spaces or null characters. This technique is often used with databases to help ensure that every row of the database table is the same length as that makes manipulating the database much more efficient. As the length of the string is known, no additional information about the string is needed, though if you are going to be frequently comparing strings you may want to keep a hash-value. The downside to this method of representing strings is that every string must be at least the length of the longest allowed string so this method is not memory efficient.

Many dialects of the Pascal language represent strings by having the first byte be the length of the string followed by the array of characters. This length variable didn't need to be a byte so later versions of this method would use 16 or 32-bit integers for the length. Variable-length strings were more efficient, and dealing with the length is simple enough but as nice as Pascal was as a programming language, the C language and it's numerous variants would dominate programming and C had a different method of handling strings.

C-style strings, also known as null-terminating strings, simply continue until a 0 is reached. The lack of an explicit length has resulted in buffer-overflow bugs that still haunt us today. The idea here is that if a string is larger than the size allocated for the string, copying the string will result in data beyond the allocated size of the string being overwritten. This can result in data corruption or worse. If you are writing in C, strncpy should always be used in place of strcpy.

As memory is at a premium, the choice of string format for my home-brew projects really comes down to a choice between C strings and Pascal strings. As C strings are more common, that is the format that will be used. String copy, compare, concatenation and some printing functions should be sufficient for our needs.

No comments: