C For Smarties: Are Pointers Numbers?

Pointers: Are They Just Numbers?

Pointers Versus Numbers

Numbers and their representations, both integer and floating-point, are discussed on a separate page. If you have not read it, you should at least skim it now, with particular attention to the multiple possible integer representations and the IEEE floating-point representation.

Every C pointer includes a type, as described elsewhere. At a sufficiently low machine level, most systems have just a few underlying hardware pointer types: sometimes as few as two (one for data and one for code), or even just one. In C, however, every pointer value is at least associated with the type of the object(s) to which it can point. A machine is allowed (but of course not required) to give every different pointer type a different underlying representation, a different size in bytes, or both. This complicates the issue, at least conceptually; but we can still ask the question once for each pointer type: is a value stored in an object of type ‘pointer to T’, for some type T, just a number, or is it something more complicated?
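To see whether your own implementation uses different sizes for different pointer types, a small sketch like the following can help. The function name `print_pointer_sizes` is made up for this illustration, and the numbers it prints are implementation-defined. On most current machines they will all be equal, which proves nothing about the next machine.

```c
#include <assert.h>
#include <stdio.h>

/* Print the size, in bytes, of several pointer types on this implementation.
 * The results are implementation-defined and need not all be equal. */
void print_pointer_sizes(void)
{
    printf("char *         : %zu\n", sizeof(char *));
    printf("int *          : %zu\n", sizeof(int *));
    printf("double *       : %zu\n", sizeof(double *));
    printf("void *         : %zu\n", sizeof(void *));
    printf("void (*)(void) : %zu\n", sizeof(void (*)(void)));

    /* One equality C does promise: void * and char * share a representation. */
    assert(sizeof(void *) == sizeof(char *));
}
```

Size is only part of the story: two pointer types with the same size may still use different internal representations.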

Fundamentally, in C, a valid pointer gives you two separate pieces of information: the type of the object to which it points, and the location of that object. In that sense, the type alone makes it ‘something more complicated’. But let us pretend that this is not an issue. As mentioned elsewhere, any object can be decomposed into bytes. We can use this trick to store a pointer value in a variable (i.e., an object) with the correct type, then decompose that object into bytes. This allows us to inspect the actual representation of any given pointer value.
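The decomposition described above can be sketched without assuming any particular pointer size. The helper name `bytes_to_hex` is invented for this example, and the two-hex-digits-per-byte format assumes 8-bit bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Write the bytes of the object at `obj`, in memory order, as hex text.
 * `out` must have room for 2 * size + 1 characters.
 * Assumption: bytes are 8 bits wide, so each one fits in two hex digits. */
void bytes_to_hex(const void *obj, size_t size, char *out)
{
    const unsigned char *p = obj;   /* any object may be read as unsigned char */
    size_t i;

    assert(obj != NULL && out != NULL);
    for (i = 0; i < size; i++)
        sprintf(out + 2 * i, "%02x", (unsigned)p[i]);
    out[2 * size] = '\0';
}
```

For a pointer variable `ip`, calling `bytes_to_hex(&ip, sizeof ip, buf)` fills `buf` with the representation of `ip` itself, not of the object it points to. What those bytes mean is a separate question.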

In this sense, then, pointers are numbers: a pointer value, once stored in memory, has some underlying bit pattern. This bit pattern can be interpreted as some sort of integer, producing some sort of number. But this is where things get tricky: that is merely an interpretation, not necessarily ‘the’ interpretation to use.

Suppose, for instance, that pointers happen to be exactly 32 bits long on a given machine, and further, that the machine has 8-bit bytes and is little-endian. We can stuff a pointer value into memory, then extract the four bytes and compose a 32-bit integer:

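/* Needs <stdio.h> and <stdlib.h>; panic() stands for any fatal-error routine. */
int *ip = malloc(20 * sizeof *ip);
unsigned char *ucp;
unsigned long l;

if (ip == NULL)
    panic("out of memory");
/* Look at the bytes of the pointer object itself, not of what it points to. */
ucp = (unsigned char *)&ip;
/* Little-endian: byte 0 is the least significant. */
l = ucp[0];
l |= (unsigned long)ucp[1] << 8;
l |= (unsigned long)ucp[2] << 16;
l |= (unsigned long)ucp[3] << 24;
printf("ip = %p; as an integer, %lu\n", (void *)ip, l);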

This code will show you the ‘integer value’ of the 32-bit pointer stored in ip. But is this the actual value of ip, or just some interpretation? What if the bits making up the pointer are actually a 32-bit IEEE single-precision floating-point number? In this case, to find the ‘true’ value, we have to look at the sign, mantissa, and exponent bits. A value that we printed out as (say) 1081081856 might be more correctly printed as 3.75.
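The floating-point reading of the same bits can be checked with a short sketch. The names `ieee_fields`, `split_ieee`, and `bits_to_float` are invented here, and `bits_to_float` assumes that `float` is a 32-bit IEEE single with the same byte order as `uint32_t`, which C does not promise.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The three fields of a 32-bit IEEE single-precision bit pattern. */
struct ieee_fields {
    unsigned sign;       /* 1 bit */
    unsigned exponent;   /* 8 bits, stored with a bias of 127 */
    uint32_t fraction;   /* 23 bits; normal numbers have an implicit leading 1 */
};

/* Pull the fields out with shifts and masks; no floating point involved. */
struct ieee_fields split_ieee(uint32_t bits)
{
    struct ieee_fields f;

    f.sign = (unsigned)(bits >> 31);
    f.exponent = (unsigned)((bits >> 23) & 0xFF);
    f.fraction = bits & UINT32_C(0x7FFFFF);
    return f;
}

/* Reinterpret the pattern as a float by copying its bytes.
 * Assumption: float is a 32-bit IEEE single, byte-ordered like uint32_t. */
float bits_to_float(uint32_t bits)
{
    float f;

    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For 1081081856 the fields are sign 0, exponent 128, and fraction 0x700000, so the value is (1 + 0x700000/0x800000) × 2^(128 − 127) = 1.875 × 2 = 3.75.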

Of course, on this particular machine, it probably really is 1081081856—or 0x40700000. But what if pointer values are in fact structured, similar to (but not the same way as) floating-point values?

In particular, suppose that this machine has ‘segments’ that can be mapped in and out, and pointers consist of a 12-bit segment number plus 20 bits of offset-within-segment. In this case, the segment number is 0x407, and the offset is 0. Putting these two together gives us the 32-bit number 0x40700000. Here is where things get particularly sneaky.
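The arithmetic for this imaginary layout is just shifts and masks. In the sketch below the names `seg_of`, `off_of`, and `make_rep` are made up, and the layout (segment in the top 12 bits, offset in the low 20) is the hypothetical one described above, not that of any real machine.

```c
#include <assert.h>
#include <stdint.h>

#define OFFSET_BITS 20
#define OFFSET_MASK ((UINT32_C(1) << OFFSET_BITS) - 1)   /* 0xFFFFF */
#define SEG_LIMIT   (UINT32_C(1) << 12)                  /* 4096 segments */

/* Top 12 bits: the segment number. */
uint32_t seg_of(uint32_t rep)
{
    return rep >> OFFSET_BITS;
}

/* Low 20 bits: the offset within that segment. */
uint32_t off_of(uint32_t rep)
{
    return rep & OFFSET_MASK;
}

/* Combine a segment number and an offset into one 32-bit representation. */
uint32_t make_rep(uint32_t seg, uint32_t off)
{
    assert(seg < SEG_LIMIT && off <= OFFSET_MASK);
    return (seg << OFFSET_BITS) | off;
}
```

With these definitions, `seg_of(0x40700000)` is 0x407 and `off_of(0x40700000)` is 0, matching the numbers in the text.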

On this machine, each segment can be marked invalid. If we call free(ip), the underlying system will mark segment 0x407 invalid. An attempt to use the pointer value will then trap (because this machine, unlike most, is actually designed to catch errors, instead of producing the wrong answer as fast as possible). But now that segment 0x407 is invalid, what is the value of this pointer? If we ask the C compiler to print it directly, it may load it into a pointer register on the machine, and this may look up the segment number, see that it is invalid, and trap at runtime:

free(ip);
printf("ip = %p\n", (void *)ip);

The output never occurs, because the attempt to load the value to send to printf() traps. A later call to malloc() may make segment 0x407 valid again, but map it to a different part of RAM, so that the same pointer (0x40700000) now points to different memory.

On this machine, in other words, the representation stored in the pointer—the bit pattern in memory—never changes. What changes is the value thus represented. The value is constructed, at least in part, by looking up (part of) the representation in a separate table. Whenever the table changes, so does the value stored in the pointer.
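A toy simulation may make the distinction between representation and value concrete. Everything here is hypothetical: the table, the names `map_segment`, `unmap_segment`, and `translate`, and the use of a NULL return to stand in for a hardware trap. It is a sketch of the idea, not a model of any real memory-management unit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NSEGS 4096   /* 12-bit segment number */

struct segment {
    bool valid;            /* may this segment be used right now? */
    unsigned char *base;   /* where the segment currently lives */
};

static struct segment seg_table[NSEGS];

/* Mark a segment valid and attach it to some memory. */
void map_segment(unsigned seg, unsigned char *base)
{
    assert(seg < NSEGS);
    seg_table[seg].valid = true;
    seg_table[seg].base = base;
}

/* Mark a segment invalid, as free() does on the imaginary machine. */
void unmap_segment(unsigned seg)
{
    assert(seg < NSEGS);
    seg_table[seg].valid = false;
    seg_table[seg].base = NULL;
}

/* Turn a 32-bit representation into a usable address.
 * Returns NULL where the imaginary hardware would trap. */
unsigned char *translate(uint32_t rep)
{
    const struct segment *s = &seg_table[rep >> 20];

    if (!s->valid)
        return NULL;
    return s->base + (rep & UINT32_C(0xFFFFF));
}
```

The same argument, 0x40700000, yields one address while segment 0x407 is mapped to one buffer, nothing at all after `unmap_segment(0x407)`, and a different address once the segment is mapped to another buffer. The bits passed in never change.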

Most machines today do not do this sort of thing [1]. It was more common on older machines. C does, however, allow it; so if you want to write strictly portable C code, code that is guaranteed to work on a future machine, you should avoid inspecting the values of invalid pointers. They may not mean what you expect, and in some cases, the value itself may not even exist.
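One common defensive habit, offered here as a suggestion rather than anything C requires, is to overwrite a pointer variable the moment its target is freed, so that no later code can look at the stale value. The helper name `free_and_clear` is invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Free the memory *pp points to, then store NULL in *pp.
 * *pp is read only while it is still valid; afterwards it is
 * overwritten without ever being inspected. */
void free_and_clear(int **pp)
{
    assert(pp != NULL);
    free(*pp);    /* free(NULL) is harmless, so calling this twice is safe */
    *pp = NULL;
}
```

This protects only the one variable passed in. Any other copy of the old pointer value is still invalid and still must not be examined.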

[1] Well, not visibly anyway: paging techniques do all of this, but keep it hidden from ordinary programmers.