Programming tips

Students in the junior- and senior-level CS classes are expected to know some background information about C, Unix, and software-development tools. This page details some of this background and suggests some exercises.

C

global variables, strings, buffers, dynamic allocation, integers, layout of structs, pointers, output, command-line parameters, language features, multiple source files, linking multiple object files, debugging

C allows you to declare variables outside of any procedure. These variables are called global variables.
1. A global variable is allocated once when the program starts and remains in memory until the program terminates.
2. A global variable is visible to all procedures in the same file.
3. You can make a global variable declared in file A.c visible to all procedures in some other files B.c, C.c, D.c, ... by declaring it with the modifier extern in files B.c, C.c, D.c, ... as in this example:
```
extern int theVariable;
```
  If you have many files sharing the variable, you should declare it extern in a header file foo.h (described below) and use #include "foo.h" in files B.c, C.c, D.c, .... You must also declare the variable in exactly one file without the extern modifier, or it never gets allocated at all.
4. It is not a good idea to use too many global variables, because you can't localize the places in which they are accessed. But there are situations where global variables allow you to avoid passing lots of parameters to functions.
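As a minimal sketch of the extern pattern described above, the following listing marks which lines would go in a shared header and which in a single source file. The names callCount and recordCall are hypothetical, chosen only for illustration, and the two parts are shown in one listing so that it compiles on its own.

```c
/* Would live in a shared header such as foo.h (hypothetical contents):
 * declarations only, so no storage is allocated here. */
extern int callCount;
void recordCall(void);

/* Would live in exactly one source file, such as A.c:
 * the definition without extern, which allocates the variable. */
int callCount = 0;

/* Any procedure in any file that includes the header can use the variable. */
void recordCall(void) {
	callCount += 1;
}
```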
Strings are pointers to null-terminated character arrays.
1. You declare a string as char *variableName.
2. If your string is constant, you can just assign a string literal to it:
```
char *myString = "This is a sample string"; 
```
3. An empty string has just the null terminator:
```
myString = ""; // empty string, length 0, containing null
```
4. A null pointer is not a valid string value:
```
myString = NULL; // invalid string
```
  You might use such a value to indicate the end of an array of strings:
```
argv[0] = "progName";
argv[1] = "firstParam";
argv[2] = "secondParam";
argv[3] = NULL; // terminator
```
5. If your string is computed at runtime, you need to reserve enough space to hold it. The room must be enough to hold the null at the end:
```
char *myString;
myString = (char *) malloc(strlen(someString) + 1); // allocate space
strcpy(myString, someString); // copy someString to myString 
```
6. To avoid memory leaks, you should eventually return space that you allocate with malloc by using free. Its parameter must be the beginning of the space returned by malloc:
```
free((void *) myString);
```
  To keep your code clean and readable, you should call free() in the same procedure in which you call malloc(); you can call other procedures between those two points to manipulate your string.
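7. If you are copying strings, you should be very careful never to copy more bytes than the destination data structure can hold. Buffer overflows are among the most common causes of security flaws in programs. In particular, consider using strncpy() and strncat() instead of strcpy() and strcat(). Be aware that strncpy() does not add a null terminator when the source is at least as long as the limit you give it, so you must add one yourself.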
8. If you are using C++, you need to convert your string objects into C-style strings before passing them in a system call.
```
string myString; // This declaration only works in C++
...
someCall(myString.c_str());
```
  Unfortunately, c_str() returns an immutable string. If you need a mutable string, you can either copy the data using strcpy() (as above) or you can cast the type:
```
someCall(const_cast<char *>(myString.c_str()));
```
  Casting is not as safe as copying, because someCall() might actually modify the string, which would confuse any part of the program that assumes that myString is constant, which is the usual behavior of C++ strings.
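The points above about empty strings, null pointers, and bounded copying can be sketched as follows. The helper names isEmptyString and boundedCopy are hypothetical and are not part of the C library; this is one possible way to wrap strncpy(), not the only one.

```c
#include <stddef.h>
#include <string.h>

/* Distinguishes the empty string from a null pointer, which is not a string. */
int isEmptyString(const char *s) {
	return s != NULL && s[0] == '\0';
}

/* Copies at most dstSize - 1 characters of src into dst and always
 * null-terminates; strncpy() alone does not terminate a truncated copy. */
void boundedCopy(char *dst, size_t dstSize, const char *src) {
	if (dstSize == 0) {
		return; /* no room even for the terminator */
	}
	strncpy(dst, src, dstSize - 1);
	dst[dstSize - 1] = '\0';
}
```

With a 4-byte destination, boundedCopy keeps only the first three characters of a longer source and leaves room for the terminator.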
A buffer is a region of memory acting as a container for data. Even though the data might have an interpretation (such as an array of structs with many fields), programs that read and write buffers often treat them as arrays of bytes. An array of bytes is not the same as a string, even though they are both declared char * or char [].
1. They might not contain ASCII characters and they may not be null-terminated.
2. You cannot use strlen() to find the length of data in a buffer (because the buffer may contain null bytes). Instead, you need to figure out the length of data by the return value from the system call (typically read) that generated the data.
3. You cannot use strcpy(), strcat(), or related routines on byte buffers; instead, you need to use memcpy() or memmove(). The bcopy() routine is an older equivalent.
4. You write a buffer of 123 bytes to a file using code like this:
```
char *fileName = "/tmp/foo";
#define BUFSIZE 4096
char buf[BUFSIZE]; // buffer containing at most BUFSIZE bytes
...
int outFile; // file descriptor, a small integer
int bytesToWrite; // number of bytes still to be written
int bytesWritten; // number of bytes written by the most recent write()
char *outPtr = buf;
...
if ((outFile = creat(fileName, 0660)) < 0) { // failure
	// see file permissions to understand 0660
	perror(fileName); // print cause
	exit(1); // and exit
}
bytesToWrite = 123; // initialization; 123 is just an example
while ((bytesWritten = write(outFile, outPtr, bytesToWrite)) < bytesToWrite) {
	// not all bytes have been written yet
	if (bytesWritten < 0) { // failure
		perror("write");
		exit(1);
	}
	outPtr += bytesWritten;
	bytesToWrite -= bytesWritten;
}
```
5. To get the compiler to allocate space for buffers, you must declare the buffer with a size that the compiler can compute, as in
```
#define BUFSIZE 1024
char buf[BUFSIZE];
```
  If you just declare the buffer with no size:
```
char buf[];
```
  then it has unknown size and C does not allocate any space. That's acceptable if buf is a formal parameter (that is, it appears in a procedure header); the actual parameter (provided by the caller) has a size. But it is not acceptable if buf is a variable. If you don't know the size of the buffer at compile time, you should use code like this:
```
char *buf = (char *) malloc(bufferSize);
```
  where bufferSize is the runtime result of some computation.
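The reading side of buffer handling can be sketched in the same spirit as the writing loop above: the length of the data comes from the return value of read(), never from strlen(). The function name readFully is hypothetical.

```c
#include <sys/types.h>
#include <unistd.h>

/* Reads from fd until end of file or until buf is full.
 * Returns the number of bytes stored, or -1 if read() fails. */
ssize_t readFully(int fd, char *buf, size_t capacity) {
	size_t total = 0;
	while (total < capacity) {
		ssize_t got = read(fd, buf + total, capacity - total);
		if (got < 0) {
			return -1; /* failure; errno says why */
		}
		if (got == 0) {
			break; /* end of file */
		}
		total += (size_t) got;
	}
	return (ssize_t) total;
}
```

Because the count is tracked explicitly, the data may contain null bytes without confusing the caller.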

You can dynamically allocate and deallocate memory.
1. Individual instances of any type:
```
typedef ... myType;
myType *myVariable = (myType *) malloc(sizeof(myType));
// you can now access *myVariable.
...
free((void *) myVariable);
```
  Again, it is good programming practice to invoke free() in the same routine in which you call malloc().
2. One-dimensional arrays of any type:
```
myType *myArray = (myType *) malloc(arrayLength * sizeof(myType));
// myArray[0] .. myArray[arrayLength - 1] are now allocated.
...
free((void *) myArray);
```
3. Two-dimensional arrays are represented by an array of pointers, each pointing to an array:
```
myType **myArray = (myType **) malloc(numRows * sizeof(myType *));
int rowIndex;
for (rowIndex = 0; rowIndex < numRows; rowIndex += 1) {
	myArray[rowIndex] = (myType *) malloc(numColumns * sizeof(myType));
}
// myArray[0][0] .. myArray[0][numColumns-1] .. myArray[numRows-1][numColumns-1]
// are now allocated.  You might want to initialize them.
...
for (rowIndex = 0; rowIndex < numRows; rowIndex += 1) {
	free((void *) myArray[rowIndex]);
}
free((void *) myArray);
```
4. If you are using C++, don't mix new/delete with malloc/free for the same data structure. The advantage of new/delete for class instances is that they automatically call constructors, which might initialize data, and destructors, which can finalize data. When you use malloc/free, you must explicitly initialize and finalize.
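The two-dimensional allocation pattern above does not check whether malloc() succeeded. A minimal sketch that adds those checks, assuming int elements and positive dimensions, might look like this. The names allocMatrix and freeMatrix are hypothetical.

```c
#include <stdlib.h>

/* Allocates numRows rows of numColumns ints, all set to zero.
 * Assumes both dimensions are positive. Returns NULL (after releasing
 * any partial allocation) if memory runs out. */
int **allocMatrix(size_t numRows, size_t numColumns) {
	int **rows = malloc(numRows * sizeof(int *));
	if (rows == NULL) {
		return NULL;
	}
	for (size_t r = 0; r < numRows; r += 1) {
		rows[r] = calloc(numColumns, sizeof(int)); /* zero-initialized */
		if (rows[r] == NULL) {
			while (r > 0) { /* undo the rows already allocated */
				r -= 1;
				free(rows[r]);
			}
			free(rows);
			return NULL;
		}
	}
	return rows;
}

/* Releases a matrix built by allocMatrix: each row, then the row array. */
void freeMatrix(int **rows, size_t numRows) {
	for (size_t r = 0; r < numRows; r += 1) {
		free(rows[r]);
	}
	free(rows);
}
```

Pairing the allocator with a matching release routine keeps the order of the free() calls (rows first, then the array of row pointers) in one place.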

Integers
1. C usually represents integers in 4 bytes. For example, the number 254235 is represented as the binary number 00000000,00000011,11100001,00011011.
2. On the other hand, ASCII text represents numbers like any other character, with one byte per digit using a standard encoding. In ASCII, the number 254235 is represented as 00110010, 00110101, 00110100, 00110010, 00110011, 00110101.
3. If you need to write a file of integers, it is generally more efficient in both space and time to write the 4-byte versions than to convert them to ASCII strings and write those. Here is how to write a single integer to an open file:
```
write(outFile, &myInteger, sizeof(myInteger));
```
4. You can look at the individual bytes of an integer by casting it as a structure of four bytes:
```
int IPAddress; // stored as an integer, understood as 4 bytes
typedef struct {
	unsigned char byte1, byte2, byte3, byte4; // unsigned, so values above 0177 are not sign-extended when printed
} IPDetails_t;
IPDetails_t *details = (IPDetails_t *) (&IPAddress);
printf("byte 1 is %o, byte 2 is %o, byte 3 is %o, byte 4 is %o\n",
	details->byte1, details->byte2, details->byte3, details->byte4);
```
5. Multi-byte integers may be represented differently on different machines. Some (like the Sun SparcStation) put the most significant byte first; others (like the Intel i80x86 and its descendants) put the least significant byte first. If you are writing integer data that might be read on other machines, convert the data to "network" byte order by htons() or htonl(). If you are reading integer data that might have been written on other machines, convert the data from "network" order to your local byte order by ntohs() or ntohl().
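The byte-order conversion in the last point can be sketched as follows, assuming a POSIX system that provides htonl() and ntohl() in arpa/inet.h. The helper names toNetworkBytes and fromNetworkBytes are hypothetical.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Stores value in out[0..3] in network byte order (most significant first),
 * regardless of the byte order of the machine running the code. */
void toNetworkBytes(uint32_t value, unsigned char out[4]) {
	uint32_t networkOrder = htonl(value);
	memcpy(out, &networkOrder, sizeof(networkOrder));
}

/* Reverses toNetworkBytes: rebuilds the host-order integer. */
uint32_t fromNetworkBytes(const unsigned char in[4]) {
	uint32_t networkOrder;
	memcpy(&networkOrder, in, sizeof(networkOrder));
	return ntohl(networkOrder);
}
```

For the number 254235 used above, the four stored bytes are 0x00, 0x03, 0xE1, 0x1B on any machine, which matches the binary form shown in the first point.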
You can predict the memory layout of structs and the value that sizeof() will return. For instance,
```
struct foo {
	char a; // uses 1 byte
		// C inserts a 3-byte pad here so b can start on a 4-byte boundary
	int b; // uses 4 bytes
	unsigned short c; // uses 2 bytes
	unsigned char d[2]; // uses 2 bytes
};
```
Therefore, sizeof(struct foo) returns 12. This predictability (for a given architecture) is why some call C a "portable assembler language". You need to predict struct layout when generating data that must follow a specific format, such as a header on a network packet.
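Rather than predicting the layout by hand, you can ask the compiler. The sketch below uses the standard offsetof macro from stddef.h on the same struct; the function names offsetOfB and paddingBytes are hypothetical.

```c
#include <stddef.h>

struct foo {
	char a;
	int b;
	unsigned short c;
	unsigned char d[2];
};

/* Offset of field b from the start of the struct; the bytes between
 * the end of a and this offset are padding inserted by the compiler. */
size_t offsetOfB(void) {
	return offsetof(struct foo, b);
}

/* Total padding: the struct's size minus the sizes of its fields. */
size_t paddingBytes(void) {
	return sizeof(struct foo)
		- (sizeof(char) + sizeof(int) + sizeof(unsigned short) + 2 * sizeof(unsigned char));
}
```

On a machine where int is 4 bytes aligned on 4-byte boundaries and unsigned short is 2 bytes, as the example above assumes, offsetOfB() returns 4 and paddingBytes() returns 3. Other architectures may give different values.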

You can declare pointers in C to any type and assign them values that point to objects of that type.
1. In particular, C allows you to build pointers to integers:
```
int someInteger;
int *intPtr = &someInteger; // declares a pointer-valued variable and assigns an appropriate pointer value
someCall(intPtr); // passes a pointer as an actual parameter
someCall(&someInteger); // has the same effect as above
```
  A C library procedure that takes a pointer to a value most likely modifies that value (it becomes an "out" or an "in out" parameter). In the example above, it is very likely that someCall modifies the value of the integer someInteger.
2. You can build a pointer to an array of integers and use it to step through that array.
```
#define ARRAY_LENGTH 100
int intArray[ARRAY_LENGTH];
int *intArrayPtr;
...
int sum = 0;
for (intArrayPtr = intArray; intArrayPtr < intArray+ARRAY_LENGTH; intArrayPtr += 1) {
	sum += *intArrayPtr;
}
```
3. You can build a pointer to an array of structs and use it to step through that array.
```
#define ARRAY_LENGTH 100
typedef struct {int foo, bar;} pair_t; // pair_t is a new type
pair_t structArray[ARRAY_LENGTH]; // structArray is an array of ARRAY_LENGTH pair_t elements
pair_t *structArrayPtr; // structArrayPtr points to a pair_t element
...
int sum = 0;
for (structArrayPtr = structArray; structArrayPtr < structArray+ARRAY_LENGTH; structArrayPtr += 1) {
	sum += structArrayPtr->foo + structArrayPtr->bar;
}
```
4. When you add an integer to a pointer, the pointer is advanced by that many elements, no matter how big the elements are. The compiler knows the size and does the right thing.
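The two uses of pointers described above, as "out" parameters and as cursors that step through an array, can be combined in one small sketch. The function name findRange is hypothetical.

```c
#include <stddef.h>

/* Writes the smallest and largest elements of values[0..length-1] through
 * the "out" pointers. Returns 0 on success, -1 if the array is empty. */
int findRange(const int *values, size_t length, int *smallest, int *largest) {
	if (length == 0) {
		return -1;
	}
	*smallest = *largest = values[0];
	/* Step a pointer through the array; += 1 advances by one int. */
	for (const int *p = values + 1; p < values + length; p += 1) {
		if (*p < *smallest) {
			*smallest = *p;
		}
		if (*p > *largest) {
			*largest = *p;
		}
	}
	return 0;
}
```

A caller passes the addresses of two of its own variables, as in findRange(data, 4, &lo, &hi), and reads the results from lo and hi afterward.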

Output
1. You format output with printf or its variant, fprintf.
2. The format string uses %d, %s, %f to indicate that an integer, string, or real is to be placed in the output.
3. The format string uses \t and \n to indicate tab and newline.
4. Example:
```
printf("I think that the number %d is %s\n", 13, "lucky");
```
5. Mixing printf(), fprintf(), and cout may not print elements in the order you expect. Output streams such as stdout, stderr, and cout can use separate staging areas ("buffers") that they print only when they are full or explicitly flushed, for example with fflush().
The main() routine takes function parameters that represent command-line parameters.
1. One common way to write the main routine is this:
```
int main(int argc, char *argv[]);
```
  Here, argc is the number of parameters, counting the program name itself, and argv is an array of strings, that is, an array of pointers to null-terminated character arrays.
2. By convention, the first element of argv is the name of the program itself.
```
int main(int argc, char *argv[]) {
	// assumes the program was given at least one parameter
	printf("I have %d parameters; my name is %s, and my first parameter is %s\n",
		argc, argv[0], argv[1]);
	return 0;
}
```
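Because argv ends with a null pointer and argc counts the program name, a program can walk its parameters either way. The sketch below shows both; the function names countArguments and printParameters are hypothetical helpers that a main() routine might call.

```c
#include <stdio.h>

/* Counts the entries of a NULL-terminated argv-style array. */
int countArguments(char *argv[]) {
	int count = 0;
	while (argv[count] != NULL) {
		count += 1;
	}
	return count;
}

/* Prints each parameter after the program name, one per line.
 * Checking argc first avoids reading argv[1] when none was given. */
void printParameters(int argc, char *argv[]) {
	if (argc < 2) {
		printf("%s: no parameters\n", argv[0]);
		return;
	}
	for (int i = 1; i < argc; i += 1) {
		printf("parameter %d is %s\n", i, argv[i]);
	}
}
```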
Handy language features
1. You can increment an integer or have a pointer point to the next object by using the ++ operator. It is usually best to place this operator after the variable: myInt++. If you put the ++ before the variable, then the variable is incremented before it is evaluated, which is seldom what you want.
2. You can build an assignment where the left-hand side variable participates as the first part of the expression on the right-hand side:
```
myInt -= 3; // equivalent to myInt = myInt - 3
myInt *= 42; // equivalent to myInt = myInt * 42
myInt += 1;  // equivalent to and maybe preferable to myInt++
```
3. You can express numbers in decimal, octal (by prefixing with the digit 0, as in 0453), or hex (by prefixing with 0x, as in 0xffaa).
4. You can treat an integer as a set of bits and perform bitwise operations:
```
myInt = myInt | 0444; // bitwise OR; 0444 is in octal
myInt &= 0444; // bitwise AND with an assignment shorthand
myInt = something ^ whatever; // bitwise XOR
```
5. C and C++ have conditional expressions. Instead of writing
```
if (a < 7)
	a = someValue;
else
	a = someOtherValue;
```
  you can write
```
a = a < 7 ? someValue : someOtherValue;
```
6. Assignments return the value of the left-hand side, so you can include an assignment in larger expressions such as conditionals. But you should follow the convention that such assignments are always surrounded by parentheses to indicate both to someone reading your code and to the compiler that you really mean an assignment, not an equality test. For instance, write
```
if ((s = socket(...)) == -1)
```
  not
```
if (s = socket(...) == -1)
```
  The second version is both harder to read and, in this case, incorrect, because the equality operator == has higher precedence than the assignment operator =.
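Several of these features fit in a pair of short functions. This is only an illustrative sketch; the names clampToByte and setThenMask are hypothetical.

```c
/* Conditional expression: limits value to the range 0..255. */
int clampToByte(int value) {
	return value < 0 ? 0 : (value > 255 ? 255 : value);
}

/* Bitwise operators with assignment shorthands; the constants are octal. */
int setThenMask(int bits) {
	bits |= 0444; /* turn on the three "read" bits */
	bits &= 0644; /* keep only bits that are also set in 0644 */
	return bits;
}
```

For example, setThenMask(0200) yields 0644: the OR adds the read bits to the owner's write bit, and the AND leaves all four bits in place.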
Programs that are not trivially short should usually be decomposed into multiple source files, each with a name ending in .c (for C programs) or .cpp (for C++ programs).
1. Try to group functions that manipulate the same data structures or have related purposes into the same file.
2. All types, functions, global variables, and manifest constants that are needed by more than one source file should also be declared in a header file, with a name ending in .h.
3. Except for inline functions, don't declare function bodies (or anything that causes the compiler to generate code or allocate space) in the header file.
4. Each source file should refer to those header files it needs with an #include line.
5. Never #include a .c file.
When you have multiple source files, you need to link together all the compiled object files along with any libraries that your program needs.
1. The easiest method is to use the C compiler, which knows about the C libraries:
```
gcc *.o -o myProgram
```
  This command asks the compiler to link all the object files with the C library (which is implicitly included) and place the result in file myProgram, which becomes executable.
2. If your program needs other libraries, you should specify them after your object files, because the linker only collects routines from libraries that it already knows it requires, and it links files in the order you specify. So if you need a library such as libxml2, your linking command should be something like this:
```
gcc *.o -lxml2 -o myProgram
```
  The compiler knows how to search various standard directories for the current version of libxml2.
Debugging C programs
1. If you get a segmentation fault, you most likely have an index out of range, an uninitialized pointer, or a null pointer.
2. You can put print statements in your program to help you localize an error.
3. Debugging is likely to be most successful if you use gdb (described below) to figure out where your error is.
4. Programs that run for a long time must be careful to free all memory they allocate, or eventually they run out of memory. To debug memory leaks you might consider these articles on debugging C memory leaks and C++ memory leaks.

Unix

standard files, commands, system calls, file permissions

By convention, every process starts with three standard files open: standard input, standard output, and standard error, associated with file descriptors 0, 1, and 2.
1. Standard input is usually connected to your keyboard. Whatever you type goes to the program.
2. Standard output is usually connected to your screen. Whatever the program outputs becomes visible.
3. Standard error is also usually connected to your screen.
4. You can use the shell to invoke programs so that the standard output of one program is directly linked ("piped") to the standard input of another program:
```
ls | wc
```
5. You can use the shell to invoke programs so the standard input and/or output is linked to a file:
```
ls > lsOutFile
wc < lsOutFile
sort -u < largeFile > sortedFile 
```
6. In general, programs do not know or care if the shell has rearranged the meaning of their standard files.
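From inside a C program, the standard files are simply descriptors 0, 1, and 2, which unistd.h names STDIN_FILENO, STDOUT_FILENO, and STDERR_FILENO. The following sketch sends normal output and a diagnostic to different descriptors; the function names writeAll and reportBoth, and the messages, are hypothetical.

```c
#include <string.h>
#include <unistd.h>

/* Writes all of text to descriptor fd, retrying after partial writes.
 * Returns 0 on success or -1 if write() fails. */
int writeAll(int fd, const char *text) {
	size_t remaining = strlen(text);
	while (remaining > 0) {
		ssize_t written = write(fd, text, remaining);
		if (written < 0) {
			return -1;
		}
		text += written;
		remaining -= (size_t) written;
	}
	return 0;
}

/* Sends normal output to descriptor 1 and a diagnostic to descriptor 2. */
int reportBoth(void) {
	if (writeAll(STDOUT_FILENO, "result: 42\n") < 0) {
		return -1;
	}
	return writeAll(STDERR_FILENO, "warning: example diagnostic\n");
}
```

If a program calling reportBoth() is run with its standard output redirected to a file, the result line goes to the file while the warning still appears on the screen.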
Unix commands
1. Commands are just the names of executable files. The PATH environment variable tells the shell where to look for them. Typically, this variable has a value like /bin:/usr/bin:/usr/local/bin:..
2. To see where the shell finds a particular program, for instance, vim, say which vim (or, in csh and tcsh, where vim).
System calls and library calls follow some important conventions.
1. The return value of the call usually indicates whether the call succeeded (typically the value is 0 or positive) or failed (typically the value is -1).
2. Always check the return value of library calls. When a system call fails, the perror() function can print what the error was (to standard error):
```
int fd;
char *filename = "myfile";
if ((fd = open(filename, O_RDONLY)) < 0) {
	perror(filename); // might print "myfile: No such file or directory"
}
```
3. A manual page for a system call or library routine might mention a data type or constant that it doesn't define, such as size_t, time_t, or O_RDONLY. These names are typically defined in header files mentioned in the manual page; you need to include all those header files in your C program.
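When you want to format the error message yourself instead of calling perror(), the standard errno variable and strerror() function give the same information. The sketch below is one way to wrap open() along the lines of the conventions above; the function name openForReading is hypothetical, and the headers listed are the ones the open, errno, and strerror manual pages name.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Opens path for reading. On failure, prints the cause to standard error
 * and returns -1; errno still holds the error code for the caller. */
int openForReading(const char *path) {
	int fd = open(path, O_RDONLY);
	if (fd < 0) {
		int savedErrno = errno; /* fprintf might change errno */
		fprintf(stderr, "%s: %s\n", path, strerror(savedErrno));
		errno = savedErrno;
		return -1;
	}
	return fd;
}
```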
File permissions in Unix are usually expressed with octal numbers.
1. In the example of creat() above, 0660 is an octal number (that's what the leading 0 means), representing binary 110,110,000. This octal number grants read and write permissions, but not execute permissions, to the file's owner and the file's group, but no permissions to other users.
2. You set permissions when you create a file by the parameter to the creat() call.
3. The command ls -l shows you permissions of files.
4. You can change permissions of a file you own by using the chmod program.
5. All your processes have a characteristic called umask, usually represented as an octal number. When a process creates a file, the bits in the umask are removed from the permissions specified in the creat() call. So if your umask is 066, then others cannot read or write files you create, because 066 represents read and write permissions for your group and for other people. You can inspect and modify your umask by using the umask program, which you typically invoke in your shell startup script (depending on your shell, ~/.login or ~/.profile).
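The arithmetic behind permissions and umask can be sketched in a few lines. The function names effectiveMode and permissionString are hypothetical; the sketch handles only the nine read, write, and execute bits and ignores any others.

```c
#include <sys/types.h>

/* Permissions a new file receives: the requested bits minus the umask bits. */
mode_t effectiveMode(mode_t requested, mode_t mask) {
	return requested & ~mask;
}

/* Fills out (at least 10 bytes) with the nine-character form that
 * ls -l shows after the file-type character, such as "rw-rw----". */
void permissionString(mode_t mode, char out[10]) {
	const char letters[] = "rwxrwxrwx";
	for (int i = 0; i < 9; i += 1) {
		/* Bit 8 is the owner's read bit; bit 0 is others' execute bit. */
		out[i] = (mode & (1u << (8 - i))) ? letters[i] : '-';
	}
	out[9] = '\0';
}
```

For instance, effectiveMode(0666, 066) is 0600, and permissionString applied to 0660 produces "rw-rw----".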

Software-development tools

text editor, debugger, compiler, manual pages, make, search

Use a text editor to create, modify, and inspect your program. There are several reasonable text editors available.
1. The vim editor and its graphical interface, gvim, take some effort to learn, but they provide a very high quality set of tools for editing program files, including syntax highlighting, parenthesis matching, word completion, automatic indentation, searching by tag (which moves you quickly from a place where the program calls a function to the place where the function is defined), and integrated manual-page search. Vim is designed for keyboard use; you don't ever need to use the mouse if you don't want to. It is freely available for Unix, Win32, and Microsoft operating systems. It is the most highly developed version of the editor series that includes ed, ex, vi, and elvis. You can read online documentation for vim and get immediate assistance through vim's :help command.
2. The emacs editor is, if anything, more feature-laden than vim. It also takes significant effort to learn. It is also freely available for both Unix and Microsoft operating systems. You can find documentation here.
3. There are many other text editors available, but generally they do not give you the two most useful features you need for creating programs: automatic indentation and syntax highlighting. However, these text editors often have the advantage of being easier to learn, in keeping with their limited abilities. Among these lower-quality text editors are (for Unix) pico, gedit, and joe and (for Microsoft) notepad and word.
4. You might be familiar with an integrated development environment (IDE) such as Eclipse, Code Warrior, or .NET. These environments generally have text editors that are integrated with debuggers and compilers. If you are using such an IDE, it makes sense to use the associated text editors.
gdb is a debugger that understands your variables and program structure.
1. You can find documentation here.
2. To use gdb effectively, you need to pass the -g flag to the C or C++ compiler.
3. If your program myProgram has failed leaving a file called core, then try gdb myProgram core.
4. You can also run your program from the beginning under the control of gdb: gdb myProgram.
5. All commands to gdb may be abbreviated to a unique prefix.
6. The help command is very useful.
7. The where command shows the call stack, including line numbers showing where each routine is. This is the first command you should try when you debug a core file.
8. To print the value of some expression (you may include your variables and the usual C operators), type print expression, as in
```
print (myInt + 59) & 0444
```
9. To see your program, try list myFunction or list myFile.c:38.
10. To set a different activation record as current, use the up (toward the caller, for less recent) or down (toward the callee, for more recent) command.
11. You can set a breakpoint at any line of any file. For example, you can say break foo.c:38 to set a breakpoint at line 38 in file foo.c. Every time your program hits that line while it executes, it will stop and gdb will prompt you for commands. You can look at variables, for example, or step forward through the program.
12. The next command steps forward one statement (calling and returning from any procedures if necessary).
13. The step command steps forward one statement, but if the statement involves a procedure call, it steps into the procedure and stops at the first statement there.
14. If you enter the command set follow-fork-mode child, then when your program executes the fork() call, gdb will continue to debug the child, not the parent.
15. Leave gdb by entering the quit command.
16. You might prefer to use the ddd graphical front end to gdb.
Always give the compilers gcc and g++ the -Wall flag to turn on a high level of warnings. Similarly, give javac the -Xlint:all flag. Don't turn in a program that generates any compile-time warnings.
You can read the manual to get details about programs, C library routines, and Unix system calls by using the man program, as in man printf or man gcc.
1. Sometimes the function you want is located in a specific section of the Unix manual and you must explicitly request it: man 2 open or man 3 printf. Section 1 covers programs, section 2 covers system calls, section 3 covers the C library, and section 8 covers system administration. You most likely don't need the other sections.
2. You can find if any program, C library routine, or Unix system call is relevant to some subject by using the -k flag, as in man -k print.
Use the make program to organize recipes for recompiling and relinking your program when you change a source file.
1. See this tutorial or this manual for details.
2. If your program is composed of several files, you can compile them separately and then link them together. You compile with the flag -c, and use the -o flag to indicate the output file. A reasonable makefile might look like this:
```
SOURCES = driver.c input.c output.c
OBJECTS = driver.o input.o output.o
HEADERS = common.h
CFLAGS = -g -Wall

program: $(OBJECTS)
	$(CC) $(CFLAGS) $(OBJECTS) -o program

$(OBJECTS): $(HEADERS)

testRun: program
	./program < testData
```
  This makefile uses a built-in definition of CC and a built-in rule to convert C source files like driver.c into their object file. If you modify just input.c, then make testRun will cause the compiler to rebuild input.o, then cause the compiler to relink the objects, creating program, and then run program with standard input redirected from the file testData.
3. If you have many source files and many header files, you might want to use the makedepend program to automatically build the Makefile rules that specify how source files depend on header files. The example above assumes that all source files depend on all header files, which is often not the case.
The grep program can quickly search for a definition or variable, particularly in include files:
```
grep "struct timeval {" /usr/include/*/*.h
```

Exercises

Do these exercises in C.

Write a program called atoi that opens a data file named on the command line and reads from it a single input line, which should contain an integer represented in ASCII characters. The program converts that string to an integer, multiplies the integer by 3, and prints the result to standard out. The program must not use the atoi() function. You should use the make program. Your Makefile should have three rules: atoi, run (which runs your program on your standard test data and redirects the output to a new file), and clean (which removes temporary files). Make sure your program runs correctly on bad data and exits with a helpful message if the data file is missing or unreadable. Step through your program by starting it with gdb, placing a breakpoint on main(), and using the step command repeatedly.
Look up the manual page for the cat program. Code your own version of cat. Your version must accept multiple (or no) file-name parameters. It need not accept any option parameters.
Write a program removeSuffix that takes a single parameter: the name of a suffix file. The suffix file has one line per entry. An entry is a non-empty string, which we call a suffix, followed by the > sign, followed by another (possibly empty) string, which we call a replacement. Your program should store all the suffixes and their replacements in a hash table. Use external chaining. Your program should then read standard input. For every space-delimited word w in the input, find the longest suffix s that appears in w, and modify w by stripping off s and inserting s's replacement, creating w'. Output one line per modified word, in the form w>w'. Do not output any word that is not modified.
