C Programming Language Tips

General Documentation     C Documentation and tutorials     C++ Documentation and tutorials     VMR: My Vector/Matrix/Readfile Routines     Debugging     Memory Problems     Scientific & Math libraries (IMSL, LAPACK, GSL)     Using Fortran IMSL functions in C     Special Issues in Statistical Computing in C     Using C mode in Emacs


General Documentation


C Documentation and tutorials


C++ Documentation and tutorials


My Vector/Matrix/Readfile Routines

Downloads: vmr.h     libvmr.a
(New 11/24/03: readfileRN() reads a data file with row names.)
(New 1/27/06: matrowsubset() matIrowsubset() makes a matrix subset based on a list of rows.)


Debugging and Profiling


Memory Problems

Here is a good memory tutorial.
The key memory problems are:
  1. Using an uninitialized pointer. This results in "Segmentation fault", "Bus error", or crazy values when you printf() from the vector or matrix, and "Segmentation fault" or "Bus error" or memory trashing when you assign something to the vector. It is also true that most cases of "Segmentation fault" are due to uninitialized pointers. Remember that an uninitialized "static" pointer is initialized to zero by the compiler, guaranteeing a Seg Fault when used, while an uninitialized "automatic" pointer has random data in it. If that random data happens to be incorrectly aligned for the data type a "bus error" will occur. If the random data is the address of any valid, correctly aligned memory location belonging to your program, it will read incorrect data and write trash on some other variable(s). If the random data is correctly aligned but outside of your program's valid memory locations, it will cause a Seg Fault.
  2. Accessing beyond the end of a vector or matrix. This often results in overwriting other data. If a pointer is overwritten, a "Segmentation fault" may result when that pointer is used later in the program; otherwise other variables are inadvertently changed. These can be hard to detect.
  3. Out of memory. This can be a tricky problem! Most modern compilers take all available memory (other than the program commands themselves and the global static memory) and grow the "stack" (which holds all "automatic" variables declared in C/C++) from one end of the memory and the "heap" (which holds all dynamically allocated memory) from the other end. (If you don't know what automatic means, you should definitely read about it in Kernighan and Ritchie's "The C Programming Language".) If you allocate a lot of large vectors or matrices dynamically or as automatic variables you may get a "Segmentation fault" with very little indication of where the problem lies due to running out of memory (some systems give a "stack overflow" error if the problem is on the stack). To avoid the problem with dynamic memory, always check the malloc() return value to assure that the memory was successfully allocated; or use the vmr routines which do that for you.
  4. Fragmented memory. If you use mallocvec(), mallocmat() and friends, fragmentation of memory (fully analogous to a framented hard drive) will eventually be detected as a "Can't allocate xxx bytes" message. If you use malloc()'s instead, you must check the return value every time if you want to be able to catch the problem. Otherwise you will have strange results, perhaps a "Segmentation fault". For statistical programs, you usually can and should reuse vectors and matrices rather than freeing and reallocating them. I say this because most statistical programs use the same sized vectors and matrices without danger of two functions attempting to use the same vector at the same time. In addition to causing fragmentation, repeated allocating and freeing of memory is very slow, so with a little thought you can often avoid thousands of allocate/free cycles by re-using data structures. In some cases in makes sense to allocate the biggest needed vector for some purpose and also use it for cases that only need a portion of it. For the corresponding case for matrices, you can do the same trick, but if you access the memory in mat[row][col] format, you will need to use redimmat(). You can also solve the fragmentation problem if you use the gc garbage collector with or without the garbage collection version of the vmr library, called gcvmr. This automatically does defragmentation as needed. If you #include "gc.h" in your program, then you can also call gc_gcollect() to explicitly defragment memory at any time.
  5. Memory leaks. Memory leaks are due to forgetting to free memory that you have allocated. Not freeing memory needed till the end of the program before return()ing from main() is benign. At the other extreme, not freeing memory allocated at each iteration of a loop is generally disastrous. One common form is forgetting to free the row pointers for a (manually created) matrix. The vmr and gcvmr libraries make freeing memory particularly easy: for vectors or matrices, just free the first vector or matrix listed in any mallocvec() or mallocmat(), and all memory associated with all vectors or matrices allocated in the same call are correctly and totally freed.

    The simplest leak detection scheme, suitable for when you are using the vmr library or your own malloc()'s, but not for gcvmr, is to put the macro

      #define MEMCHK(x) {printf("===%s: ",x); \
        myinfo=mallinfo(); printf("%12d %12d\n",myinfo.hblkhd,myinfo.uordblks);}
    
    or, for C++,
      #define MEMCHK(x) {cerr << "==="<<x<<  ": "; \
        struct mallinfo myinfo=mallinfo(); cout << myinfo.hblkhd << " " << myinfo.uordblks << endl;}
    
    at the top of your source file after all of the #include's but before main(). This requires that you have "#include <malloc.h>" and "#include <stdio.h>" among your includes. Also, for C, put "struct mallinfo myinfo;" somewhere in scope, e.g. above main. Then, you can sprinkle MEMCHK("My message") throughout your program to get a report of used memory in two regions (basically large and small allocations). When placed before and after any function call, a difference of greater than 4 for the second number and 0 for the first indicates that that function has a memory leak.

    When using the gcvmr library, or gc directly, you can perform leak detection as described in the README below.

gcvmr: libgcvmr     libgc.a     gc.h     leak_detector.h     README


Mathematical, Statistical and Linear Algebra Libraries: IMSL, LAPACK and GSL


Using Fortran IMSL functions in C

The Fortran IMSL library is a collection of mathematical and statistical functions written in Fortran and pre-compiled. They are licensed and available to the CMU community and available on Andrew and throughout the Statistics department on HP and SUN computers. By using these routines, you will get well-implemented versions of functions which can perform a wide range of tasks, from generating random numbers to performing regression, but without any coding on your part! The functions are described in books labelled IMSL Stat Library and IMSL Math Library and via the links listed below. They are currently unavailable at CMU for Windows. For Linux, IMSL C libraries are available, but unfortunately these do not closely parallel the Fortran libraries, so constructing cross-platform programs is difficult. An alternative matrix library is MTL.

The home page for IMSL is http://www.vni.com/products/imsl/ . A pdf document from Visual Numerics, the distributor of IMSL, on calling IMSL from C is given http://www.vni.com/tech/imsl/8709.pdf. Pdf math and stat guides are http://www.vni.com/books/docs/. The main points of using these in C are to include the correct libraries when linking, to declare the functions you will use, and to pass only pointers to the routines as arguments. (I use #define's to add the & pointer operator for me.)

Complex data is stored in a double-length vector, e.g. float cmplx[6] can hold 3 complex numbers stored as real/imaginary pairs. Double complex arguments are used for the d versions of functions.

In general, you need to use some judgment to guess the argument types. If the argument must always refer to a whole number, it is probably "int". If the function has two versions, plain and d, e.g. rnnor and nrnnor, then non-integer values are either "float" or "double" depending on the version. If there is only one version, non-integers are float. Most functions return void, but some, like chidf/dchidf return float/double.

If you need to pass a matrix to a function, it must be declared as a single float or double memory block with the first "col" elements corresponding to the first row. A trick for allocating a matrix that can be accessed as mat[r][c] in your program but passed to imsl as imslmat is to use something like this C++ generalized inverse example:

int rows=3, cols=3;
double *imslmat = new double[rows*cols];
double **mat = new double*[rows];
for (int i=0; i<rows; i++) mat[i]=imslmat+i*cols;

for (int row=0; row<rows; row++)
  for (int col=0; col<cols; col++)
    mat[row][col]=row+col+row*col+1.0;

double *rslt = new double[rows*cols];
int rank;
double tol=2e-16;
dlsgrr(&rows, &cols, imslmat, &cols, &tol, &rank, rslt, &cols);
for (int r=0; r<rows; r++) {
  for (int c=0; c<cols; c++)
    cout << rslt[r*cols+c] << "  ";
  cout << endl;
}
Place these three files in a directory, type "make", and then "Demo 15" to see an example sufficient to get you started.
Demo.c   IMSLrand.h   Unix makefile   Linux makefile

Fortran-IMSL Function Summaries     Fortran to C IMSL function Rosetta stone (xls format)

A C program to demonstrate reading files, allocating matrices, and using Fortran-IMSL matrix routines: stat.c


Special Issues in Statistical Computing in C

  1. Likelihoods out of range
    Especially in MCMC, you may get errors or get stuck in a bad region of the parameter space if your likelihood involves exponentiating a very large or small number. Here is a good solution:
    #include<stdlib.h>
    #include<stdio.h>
    #include<values.h>        /* has needed constants */
    double minexpable, maxexpable;  /* largest and smallest numbers to which exp() can be applied */
    void doMCMC(double *data, double *params);
    
    /* first argument to program is number of simulations */
    int main(int argc, char **argv) {
      int i, nsim;
      ...
      if (argc<2) {
        fprintf(stderr, "Program needs at least one argument, which is the simulation count.\n");
        exit(EXIT_FAILURE);
      }
      nsim=atoi(argv[1]);
      ...
      /* Calculate largest and smallest numbers to which exp() can be applied (one time) */
      minexpable=log(MINDOUBLE);  /* defined in values.h */
      maxexpable=log(MAXDOUBLE);  /* defined in values.h */
      ...
     
      for (i=0; i<nsim; i++)
        doMCMC(data, params);
    
      ...
    
      return(EXIT_SUCCESS);
    }
    
    
    void doMCMC(double *data, double *params) {
      ...
      /* avoid taking exp() on too large or too small a number */
      if (loglik>=minexpable && loglik<=maxexpable) {
        like=exp(loglik);
      } else if (log<minexpable) {
        like=MINDOUBLE;
      } else {
        like=MAXDOUBLE;
      }
      ...
      return;
    }
    
  2. Archiving old versions of your programs (version control)
    The best way to write most statistical programs is to build them up in stages, adding more features only after the first ones work. For example, in an MCMC, the stages might be reading in the data, generating starting values, and then in the main MCMC loop you could consider updating each parameter as a separate stage.

    The problem that often arises is that you "break" your program when trying to go to the next stage. A related problem is that you remove a feature or function, only later to discover that your really need it again.

    The solution here is to run a simple-to-use script file that keeps various versions of your program (and all related files) intact, that you can refer to later if needed. The script creates a directory with all of the files of the working program. Because this is so easy to do, it is worth doing for every project. (If you want a more sophisticated solution, look at the svn program.)

    The one-time setup is to put archive in your /bin directory. (Ask for help if you don't have a /bin directory and corresponding PATH entry.) Check that the file is executable or just run the Linux command chmod u+x archive to set it as executable. You will need to run rehash right after you install it.

    Now, each time you start a new project (which must be in different directory from every other project), you setup for archiving with these two simple steps:

    1. Run the Linux command archive new MyProjectName.
    2. Run the Linux command archive add myFile1 [myFile2 [...]] one or more times to add the names of all pertinent files to the list of files to be archived. Pertinent files include all .c, .h, and .cpp files, your makefile, your documentation file(s), and any data files that are fairly small (e.g., less than 100KB).

    Now each time you come to a point in your programming experience where either your program appears to be working or you are about to make major changes that you might want to undo, run the Linux command archive save "My comments". The comments are optional, but a brief statement of the state of your program is useful. Be sure to use quotes around the comments.

    Other features of the archive script are use of archive alone to show the names of all of the files on the archive list, and archive remove myfile to stop archiving some file (remove it from the list). Also you can add files at any time. Use archive help to show all options.

    To see a summary of all archives run archive show. The directory names encode the date and time of the archives in an obvious way. Then you can list the files in any archive, and also view or copy any file(s) you need.

    Note: I found that it was better for my working style to have any re-running of archive save with the same hour to overwrite the old archive rather than creating a new one. This is because if I am saving shortly after a previous save, its probably because I thought of one little thing and this is not really a new "version". In this case, the script warns and prompts before overwriting. If you don't like this, change the script!

Using C mode in Emacs

When you edit a file named *.c in Emacs, you are automatically in c-mode as indicated by (C) on the status line. Emacs will automatically do certain formatting and add some new capabilities. The two most important are:
  1. Emacs finds the lines that have the errors: To compile your program you must first have file called makefile in your current directory that contains instructions for compiling your program (and linking if you like). For the usual simple case, just modify this simple make file makefile.
    Each time you want to compile, type Alt-x compile and press Enter twice. You will now see a second window with the compiler results. If you click with the middle mouse button on an error message, the cursor will jump to the error in your source. You can then use ^x` (where the character after the control x is the one under the tilde on HP keyboard) to jump to subsequent errors.
    Source for example   Output for example

    You may find it useful to know that ^x1 returns you to a single window, and ^x^ (where the second ^ really means caret, not control) increases the size of a window at the expense of the other window.

  2. Check preprocessing: highlight some code, then use Alt-x c-mac followed by Tab and Enter to pop up a new emacs window where preprocessor directives such as #if and #define are expanded. (Unfortunately this feature is not smart enough to expand macros and defines that are found inside include files.)


All links active 4/22/2005. Report broken links to


up Back to my Home Page