C Programming Language Tips

General Documentation C Documentation and tutorials C++ Documentation and tutorials VMR: My Vector/Matrix/Readfile Routines Debugging Memory Problems Scientific & Math libraries (IMSL, LAPACK, GSL) Using Fortran IMSL functions in C Special Issues in Statistical Computing in C Using C mode in Emacs

General Documentation

Various tutorials including creating C libraries, makefiles, working with files and directories in Unix, Unix signal programming, graphics programming, parallel programming and multi-processing
Writing larger programs (headers, scope, multiple source files, makefiles)
Intro to makefiles: BU Rutgers man page
A self-documented Makefile that works for most programs
Operator Precedence
Data Structures and Algorithms
Recommended C Style and Coding Standards
The C Book
Common System Functions
Compiling within Emacs to get error finding Emacs c-mode commands Customizing C indentation
C mode compiling from within Emacs
cout.cpp Demo of redirection of cout to a file

C++ Documentation and tutorials

My Vector/Matrix/Readfile Routines

Downloads: vmr.h libvmr.a
(New 11/24/03: readfileRN() reads a data file with row names.)
(New 1/27/06: matrowsubset() matIrowsubset() makes a matrix subset based on a list of rows.)

The vmr C/C++ functions (or library for Linux/Intel systems) are a free, easy to use convenience for writing Statistical programs in C or C++. They hide all of the details of reading numeric data files and creating and using vectors and matrices. Note that the built in C matrices (e.g. double mymat[10][20]) are incompatible with all standard numeric and statistical libraries, and that pre-set vector and matrix sizing is both dangerous and inelegant. The vmr library provides
1. easy dynamic allocation of integer, double, arbitrary structure, and string vectors and matrices with graceful stopping when memory is exhausted (choice of error return value or error message)
2. dual use matrix format, so matrices can be passed to standard C or Fortran libraries as *mymat or mymat[0], but can also be accessed in your program as mymat[row][col] (or mymat[col][row] when using Fortran's column-major convention)
3. easy freeing of space allocated to vectors or matrices
4. utility functions for zeroing, printing, and writing to a file
5. a function to re-dimension matrices
6. functions for reading data in a loose rectangular format
7. a single include file (vmr.h) with all prototypes and constants (usable in C or C++)
8. ease of use: just add "-Lxx -lvmr" to your gcc link command (where xx is the location of libvmr.a, e.g., "." for in the current directory) (This is for Linux/Intel only unless you create your own library; on other systems, add "vmr.c" to the compile command.)
The functions are:
1. Allocation of double/int/string vectors: mallocvec(), mallocIvec(), mallocSvec()
2. Allocation of double/int/string matrices: mallocmat(), mallocImat(), mallocSmat()
3. Safer functions for beginners: mallocvec1(), mallocIvec1(), mallocmat1(), mallocImat1()
4. Allocation of vectors or matrices of arbitrary structures: mallocASvec(), mallocASmat()
5. Zeroing: zerovec(), zeroIvec(), zeromat(), zeroImat(), zeroSvec(), zeroSmat()
6. Copying: copyvec(), copyIvec(), copymat(), copyImat(), matrowsubset(), matIrowsubset()
7. Printing: printvec(), printIvec(), printmat(), printimat()
8. Writing to a file: fprintvec(), fprintIvec(), fprintmat(), fprintImat()
9. Re-dimensioning: redimmat(), redimImat()
10. Data file reading: readfile(), readIfile(), readSfile()
11. Data file reading with column names: readfileCN()
12. Data file reading in column-major format: readfileCM()
13. Data file reading in freeform input into a vector: scanfile()
14. Data file reading when the number of rows and columns is known in advance: readfileFixed()
Here are the source files (if you don't want to use the library version, or if you want to check details) and a test/demonstration program:
vmr.c vmrtest.c (vmrtest.dat)
Notes:
1. Normally a vector or matrix is defined using something like these: "int *myvec", "int *mymat", "double *myvec", or "double *mymat". This allocates just enough space for a single pointer, which points to nothing. When you want to allocate the space, e.g., after figuring out its needed dimension(s), then use one of the vmr allocation functions, which will allocate the space and assure that the variable defined above points to the vector or matrix.
2. The first argument of the many of the allocation functions is VMERR_QUIT for "quit on insufficient memory error" or VMERR_CONTINUE for "continue with "success vs. failure return value".
3. The xxx1() versions of the functions are safer for beginners because the allow more complete argument checking by the compiler. These functions allocate a single vector or matrix. The VMERR argument is not used. For vector allocation, the first and second arguments are "&myvec, mylength". For matrix allocation, the first through third arguments are a "&mymat, myrows, mycols" triplet.
4. For the non xxx1() versions of the functions, the second argument of the allocation routines is the number of vectors or matrices to allocate. This is followed by "&myvec, mylength" pairs, or by "&mymat, myrows, mycols" triples. Due to the nature of variable argument lists, these pairs or triples are not checked by the compiler for appropriate argument type. For mallocASmat() you need "&mymat, myrows, mycols, structsize" quadruples, where structsize is usually of the form sizeof(struct mystruct). For mallocSmat() you need "&mymat, myrows, mycols, maxchars" quadruples, where maxchars is the maximum length for any string, including the terminating zero. Forgetting the ampersand or otherwise getting the wrong arguments will cause major errors.
5. As an argument to a C function, most libraries want you to use the (default) row-major order for matrices, and to enter the argument as mymat[0] or as *mymat. (Don't add the [0] or * for vectors; they should just be passed as, e.g., myvec.) The matrices can also be accessed as "mymat[row][col]" in your code for easy readability.
6. To use column-major order of matrices (for Fortran libraries), read your data with readfileCM() and allocate other matrices using, e.g., "mallocmat1(VM_ERRQUIT,&mymat,mycols,myrows)" where cols and rows are entered in reverse order, then write all of your code in reverse order, e.g., "mymat[col][row]". As usual, pass the matrix to the library as mymat[0] or *mymat.
7. As an error detection service, the readfile() functions will attempt to tell you which line of your data file does not have the same number of fields as the first line. The readSfile() functions has an extra argument with the maximum string length. It will give an error message if any datum is too wide.
8. You may re-use already allocated matrices with the aid of the redimmat() functions. DANGER: It is up to you to assure that, when re-dimensioning from r1 x c1 to r2 x c2, the following are true: r2*c2<=r1*c1 and r2<=r1.
9. Remember to have vmr.h in your current directory or include path.
10. The library libvmr.a available here is for a Linux/Intel system. For other systems, compile vmr.c according to the directions in its comments, then execute the Linux commands ar rc libvmr.a vmr.o and ranlib libvmr.a to create your own libvmr.a library.
11. To read a rectangular array with different data types in different columns, try reading it as strings using readSfile(). Then use for loops with atoi() or atof() to pull out integers or double respectively from columns of the string array and put them into separate vectors created with mallocIvec() or mallocvec(). Use strcpy() to copy strings into their own vectors.
12. All of the readfile routines have a hardcoded maximum line length of 1024. If your input file has more characters per line, increase MAX_INPUT in vmr.c and follow the three line directions at the top of the file (Compile and Library) to create your own custom version of the library.

Debugging and Profiling

Debugging in Unix (ddd) Quick start Memory Layout Article
Electric Fence from Bruce Perens, when used in conjunction with ddd is an excellent tool to find array over-runs, a major cause of errors in statistical uses of c and c++. To find an array over-run, link the efence library ( PC/Linux version of libefence ) into your program. Then run the program under ddd. Now the debugger is (usually) able to pinpoint the line causing the "Segmentation Fault". If you are using my vmr library, you should use the evmr version instead. For Linux/PC, just download libevmr.a ; otherwise create your own evmr library compiling vmr.c with the "-g" compile option. A sample c file, efencetest.c , demonstrates use of Electric Fence. [Note: the current (2.0.5) version of efence does not "make" correctly on our system. If you are not running PC/Linux and need to "make" the library from Perens' files, you will need to manually make a few changes to use syserror() instead of the deprecated alternative and to add an "#include" of "efence.h" to the eftest.c file.]
Other Public domain memory debugging tools
Numeric problems: Two numeric problems that you should know about are the limited size of integers and the limited mantissa of doubles (or, worse, floats). Compile and run numeric.c to see the problems.
For a 32 bit signed integer, the maximum number it can hold is 2,147,483,647 (2^31-1). If you use "unsigned int", the maximum is 4,294,967,295 (2^32-1). These numbers are accessable in a C program as INT_MAX and UINT_MAX if you include "limits.h". By the way, the "-1" is because zero is one of the 2^32 different numbers that can be represented in 32 bits. As you can see by running the program, if you add one to the maximum integer, you get a large negative number (-2^31) which gets larger as you keep adding one. The reason for this is the "twos-complement" number representation method used by most computers. For more information, see Wikipedia or chapter 1 of James E. Gentle's "Numeric Linear Algebra for Applications in Statistics" (Springer, 1998).
The second problem is the limitation in the size of the mantissa for a (double) floating point number. First you must realize that floating point numbers are represented as a mantissa and an exponent. Typically there is 1 sign bit, 52 mantissa bits, and 11 exponent bits. To add two numbers the computer first re-expresses the smaller number so that it has the same exponent as the larger number. In the example, the number 1.0e0 is added to the number 1.0e-16, which gets re-expressed as 0.0000000000000001e0. But because the mantissa of a double does not quite have enough bits to represent that number, the low order bits are dropped and the mantissa becomes zero. The smallest number that can be added to 1.0 without resulting in 1.0 is called epsilon, and for a "double" in C on most machines epsilon approximately equals 2.220446e-16. In the above example two values of 1e-16 can be added together easily and accurately because both have 1.000000000000000 as their mantissas (and -16 as their exponents). So the moral of the story is that, if you need to add a long list of numbers of very different sizes, you may get big rounding errors if you end up adding small numbers to large numbers. The Gentle book is again a good reference.
Profiling lets you check which parts of your program are slowest; then you can direct your speedup recoding effort most profitably to those functions.
With the gcc compiler use the -pg compilation option, then run your program, e.g. MyProg, as usual. This will cause a file called gmon.out to be created. Run gprof MyProg > profile.txt to analyze gmon.out and send the results to profile.txt. Examine profile.txt to see which functions are eating up most of your run time.
TenDRA public domain C/C++ checker

Memory Problems

Here is a good memory tutorial.
The key memory problems are:

Using an uninitialized pointer. This results in "Segmentation fault", "Bus error", or crazy values when you printf() from the vector or matrix, and "Segmentation fault" or "Bus error" or memory trashing when you assign something to the vector. It is also true that most cases of "Segmentation fault" are due to uninitialized pointers. Remember that an uninitialized "static" pointer is initialized to zero by the compiler, guaranteeing a Seg Fault when used, while an uninitialized "automatic" pointer has random data in it. If that random data happens to be incorrectly aligned for the data type a "bus error" will occur. If the random data is the address of any valid, correctly aligned memory location belonging to your program, it will read incorrect data and write trash on some other variable(s). If the random data is correctly aligned but outside of your program's valid memory locations, it will cause a Seg Fault.
Accessing beyond the end of a vector or matrix. This often results in overwriting other data. If a pointer is overwritten, a "Segmentation fault" may result when that pointer is used later in the program; otherwise other variables are inadvertently changed. These can be hard to detect.
Out of memory. This can be a tricky problem! Most modern compilers take all available memory (other than the program commands themselves and the global static memory) and grow the "stack" (which holds all "automatic" variables declared in C/C++) from one end of the memory and the "heap" (which holds all dynamically allocated memory) from the other end. (If you don't know what automatic means, you should definitely read about it in Kernighan and Ritchie's "The C Programming Language".) If you allocate a lot of large vectors or matrices dynamically or as automatic variables you may get a "Segmentation fault" with very little indication of where the problem lies due to running out of memory (some systems give a "stack overflow" error if the problem is on the stack). To avoid the problem with dynamic memory, always check the malloc() return value to assure that the memory was successfully allocated; or use the vmr routines which do that for you.
Fragmented memory. If you use mallocvec(), mallocmat() and friends, fragmentation of memory (fully analogous to a framented hard drive) will eventually be detected as a "Can't allocate xxx bytes" message. If you use malloc()'s instead, you must check the return value every time if you want to be able to catch the problem. Otherwise you will have strange results, perhaps a "Segmentation fault". For statistical programs, you usually can and should reuse vectors and matrices rather than freeing and reallocating them. I say this because most statistical programs use the same sized vectors and matrices without danger of two functions attempting to use the same vector at the same time. In addition to causing fragmentation, repeated allocating and freeing of memory is very slow, so with a little thought you can often avoid thousands of allocate/free cycles by re-using data structures. In some cases in makes sense to allocate the biggest needed vector for some purpose and also use it for cases that only need a portion of it. For the corresponding case for matrices, you can do the same trick, but if you access the memory in mat[row][col] format, you will need to use redimmat(). You can also solve the fragmentation problem if you use the gc garbage collector with or without the garbage collection version of the vmr library, called gcvmr. This automatically does defragmentation as needed. If you #include "gc.h" in your program, then you can also call gc_gcollect() to explicitly defragment memory at any time.
Memory leaks. Memory leaks are due to forgetting to free memory that you have allocated. Not freeing memory needed till the end of the program before return()ing from main() is benign. At the other extreme, not freeing memory allocated at each iteration of a loop is generally disastrous. One common form is forgetting to free the row pointers for a (manually created) matrix. The vmr and gcvmr libraries make freeing memory particularly easy: for vectors or matrices, just free the first vector or matrix listed in any mallocvec() or mallocmat(), and all memory associated with all vectors or matrices allocated in the same call are correctly and totally freed.
The simplest leak detection scheme, suitable for when you are using the vmr library or your own malloc()'s, but not for gcvmr, is to put the macro
```
  #define MEMCHK(x) {printf("===%s: ",x); \
    myinfo=mallinfo(); printf("%12d %12d\n",myinfo.hblkhd,myinfo.uordblks);}
```
or, for C++,
```
  #define MEMCHK(x) {cerr << "==="<<x<<  ": "; \
    struct mallinfo myinfo=mallinfo(); cout << myinfo.hblkhd << " " << myinfo.uordblks << endl;}
```
at the top of your source file after all of the #include's but before main(). This requires that you have "#include <malloc.h>" and "#include <stdio.h>" among your includes. Also, for C, put "struct mallinfo myinfo;" somewhere in scope, e.g. above main. Then, you can sprinkle MEMCHK("My message") throughout your program to get a report of used memory in two regions (basically large and small allocations). When placed before and after any function call, a difference of greater than 4 for the second number and 0 for the first indicates that that function has a memory leak.
When using the gcvmr library, or gc directly, you can perform leak detection as described in the README below.

gcvmr: libgcvmr libgc.a gc.h leak_detector.h README

Mathematical, Statistical and Linear Algebra Libraries: IMSL, LAPACK and GSL

Warning: IMSL requires an expensive per-site license. If you plan to share your program consider GSL (or Numeric Recipies in C/C++ ) as an alternative.
Using C-IMSL on CMU computers running Linux with C or C++ (Note: you must use $LINK_CNL_STATIC instead of $LINK_CNL as stated in Sin Tips link.)
Calling IMSL Fortran libraries from C (if CNL is unavailable)
IMSL documentation: Function Catalog C math functions (pdf) C stat functions (pdf)
Tip: At CMU Stats, add the following to your .cshrc file, then use imslmath& or imslstat& to quickly see the documentation.
```
    alias imslmath acroread /usr/statlocal/vni/CTT4.0/help/cmath.pdf
    alias imslstat acroread /usr/statlocal/vni/CTT4.0/help/cstat.pdf
    
```
A C program to demonstrate random number generation and simulation with C-IMSL: LRSim.c
A C program to demonstrate linear algebra with C-IMSL: LR.c (Test data: LRtest.dat )
A C program to demonstrate reading a file using the vmr library and performing matrix operations using LAPACK: matmult.c with test data file X.dat
A C++ program to demonstrate linear algebra with LAPACK: lrlapack.cc
A C program to demonstrate statistical function calling with C-IMSL: LRnative.c (LRtest.dat)
A C program to demonstrate Gibbs sampleing with C-IMSL: Gibbs.c Gibbs.h (Test data: table91.txt )
LAPACK documentation: LAPACK User's Guide List of LAPACK routines List of Double-Precision routines Quick-Reference for BLAS routines
A great newer choice for statistical and scientific functions is The Gnu Scientific Library .

Using Fortran IMSL functions in C

The Fortran IMSL library is a collection of mathematical and statistical functions written in Fortran and pre-compiled. They are licensed and available to the CMU community and available on Andrew and throughout the Statistics department on HP and SUN computers. By using these routines, you will get well-implemented versions of functions which can perform a wide range of tasks, from generating random numbers to performing regression, but without any coding on your part! The functions are described in books labelled IMSL Stat Library and IMSL Math Library and via the links listed below. They are currently unavailable at CMU for Windows. For Linux, IMSL C libraries are available, but unfortunately these do not closely parallel the Fortran libraries, so constructing cross-platform programs is difficult. An alternative matrix library is MTL.

The home page for IMSL is http://www.vni.com/products/imsl/ . A pdf document from Visual Numerics, the distributor of IMSL, on calling IMSL from C is given http://www.vni.com/tech/imsl/8709.pdf. Pdf math and stat guides are http://www.vni.com/books/docs/. The main points of using these in C are to include the correct libraries when linking, to declare the functions you will use, and to pass only pointers to the routines as arguments. (I use #define's to add the & pointer operator for me.)

Complex data is stored in a double-length vector, e.g. float cmplx[6] can hold 3 complex numbers stored as real/imaginary pairs. Double complex arguments are used for the d versions of functions.

In general, you need to use some judgment to guess the argument types. If the argument must always refer to a whole number, it is probably "int". If the function has two versions, plain and d, e.g. rnnor and nrnnor, then non-integer values are either "float" or "double" depending on the version. If there is only one version, non-integers are float. Most functions return void, but some, like chidf/dchidf return float/double.

If you need to pass a matrix to a function, it must be declared as a single float or double memory block with the first "col" elements corresponding to the first row. A trick for allocating a matrix that can be accessed as mat[r][c] in your program but passed to imsl as imslmat is to use something like this C++ generalized inverse example:

int rows=3, cols=3;
double *imslmat = new double[rows*cols];
double **mat = new double*[rows];
for (int i=0; i<rows; i++) mat[i]=imslmat+i*cols;

for (int row=0; row<rows; row++)
  for (int col=0; col<cols; col++)
    mat[row][col]=row+col+row*col+1.0;

double *rslt = new double[rows*cols];
int rank;
double tol=2e-16;
dlsgrr(&rows, &cols, imslmat, &cols, &tol, &rank, rslt, &cols);
for (int r=0; r<rows; r++) {
  for (int c=0; c<cols; c++)
    cout << rslt[r*cols+c] << "  ";
  cout << endl;
}

Place these three files in a directory, type "make", and then "Demo 15" to see an example sufficient to get you started.
Demo.c IMSLrand.h Unix makefile Linux makefile

Fortran-IMSL Function Summaries Fortran to C IMSL function Rosetta stone (xls format)

A C program to demonstrate reading files, allocating matrices, and using Fortran-IMSL matrix routines: stat.c

Special Issues in Statistical Computing in C

Likelihoods out of range
Especially in MCMC, you may get errors or get stuck in a bad region of the parameter space if your likelihood involves exponentiating a very large or small number. Here is a good solution:

#include<stdlib.h>
#include<stdio.h>
#include<values.h>        /* has needed constants */
double minexpable, maxexpable;  /* largest and smallest numbers to which exp() can be applied */
void doMCMC(double *data, double *params);

/* first argument to program is number of simulations */
int main(int argc, char **argv) {
  int i, nsim;
  ...
  if (argc<2) {
    fprintf(stderr, "Program needs at least one argument, which is the simulation count.\n");
    exit(EXIT_FAILURE);
  }
  nsim=atoi(argv[1]);
  ...
  /* Calculate largest and smallest numbers to which exp() can be applied (one time) */
  minexpable=log(MINDOUBLE);  /* defined in values.h */
  maxexpable=log(MAXDOUBLE);  /* defined in values.h */
  ...
 
  for (i=0; i<nsim; i++)
    doMCMC(data, params);

  ...

  return(EXIT_SUCCESS);
}


void doMCMC(double *data, double *params) {
  ...
  /* avoid taking exp() on too large or too small a number */
  if (loglik>=minexpable && loglik<=maxexpable) {
    like=exp(loglik);
  } else if (log<minexpable) {
    like=MINDOUBLE;
  } else {
    like=MAXDOUBLE;
  }
  ...
  return;
}

Archiving old versions of your programs (version control)
The best way to write most statistical programs is to build them up in stages, adding more features only after the first ones work. For example, in an MCMC, the stages might be reading in the data, generating starting values, and then in the main MCMC loop you could consider updating each parameter as a separate stage.
The problem that often arises is that you "break" your program when trying to go to the next stage. A related problem is that you remove a feature or function, only later to discover that your really need it again.
The solution here is to run a simple-to-use script file that keeps various versions of your program (and all related files) intact, that you can refer to later if needed. The script creates a directory with all of the files of the working program. Because this is so easy to do, it is worth doing for every project. (If you want a more sophisticated solution, look at the svn program.)
The one-time setup is to put archive in your /bin directory. (Ask for help if you don't have a /bin directory and corresponding PATH entry.) Check that the file is executable or just run the Linux command chmod u+x archive to set it as executable. You will need to run rehash right after you install it.
Now, each time you start a new project (which must be in different directory from every other project), you setup for archiving with these two simple steps:
1. Run the Linux command archive new MyProjectName.
2. Run the Linux command archive add myFile1 [myFile2 [...]] one or more times to add the names of all pertinent files to the list of files to be archived. Pertinent files include all .c, .h, and .cpp files, your makefile, your documentation file(s), and any data files that are fairly small (e.g., less than 100KB).
Now each time you come to a point in your programming experience where either your program appears to be working or you are about to make major changes that you might want to undo, run the Linux command archive save "My comments". The comments are optional, but a brief statement of the state of your program is useful. Be sure to use quotes around the comments.
Other features of the archive script are use of archive alone to show the names of all of the files on the archive list, and archive remove myfile to stop archiving some file (remove it from the list). Also you can add files at any time. Use archive help to show all options.
To see a summary of all archives run archive show. The directory names encode the date and time of the archives in an obvious way. Then you can list the files in any archive, and also view or copy any file(s) you need.
Note: I found that it was better for my working style to have any re-running of archive save with the same hour to overwrite the old archive rather than creating a new one. This is because if I am saving shortly after a previous save, its probably because I thought of one little thing and this is not really a new "version". In this case, the script warns and prompts before overwriting. If you don't like this, change the script!

Using C mode in Emacs

When you edit a file named *.c in Emacs, you are automatically in c-mode as indicated by (C) on the status line. Emacs will automatically do certain formatting and add some new capabilities. The two most important are:

Emacs finds the lines that have the errors: To compile your program you must first have file called makefile in your current directory that contains instructions for compiling your program (and linking if you like). For the usual simple case, just modify this simple make file makefile.
Each time you want to compile, type Alt-x compile and press Enter twice. You will now see a second window with the compiler results. If you click with the middle mouse button on an error message, the cursor will jump to the error in your source. You can then use ^x` (where the character after the control x is the one under the tilde on HP keyboard) to jump to subsequent errors.
Source for example Output for example
You may find it useful to know that ^x1 returns you to a single window, and ^x^ (where the second ^ really means caret, not control) increases the size of a window at the expense of the other window.
Check preprocessing: highlight some code, then use Alt-x c-mac followed by Tab and Enter to pop up a new emacs window where preprocessor directives such as #if and #define are expanded. (Unfortunately this feature is not smart enough to expand macros and defines that are found inside include files.)

All links active 4/22/2005. Report broken links to

Back to my Home Page