C Programming Language Tips
General Documentation
C Documentation and tutorials
C++ Documentation and tutorials
VMR: My Vector/Matrix/Readfile Routines
Debugging
Memory Problems
Scientific & Math libraries (IMSL, LAPACK, GSL)
Using Fortran IMSL functions in C
Special Issues in Statistical Computing in C
Using C mode in Emacs
Downloads: vmr.h
libvmr.a
(New 11/24/03: readfileRN() reads a data file with row names.)
(New 1/27/06: matrowsubset() matIrowsubset() makes a matrix subset based on a list of rows.)
- Debugging in Unix (ddd)
Quick start
Memory Layout Article
- Electric Fence from Bruce Perens, when used in conjunction
with ddd is an excellent tool to find array over-runs, a major cause of errors in statistical uses of c
and c++. To find an array over-run, link the efence library ( PC/Linux version of libefence ) into
your program.
Then run the program under ddd. Now the debugger is (usually) able to pinpoint the line causing the "Segmentation Fault".
If you are using my vmr library, you should use the evmr version instead. For Linux/PC, just download
libevmr.a ; otherwise create your own evmr library compiling vmr.c with the "-g" compile option. A sample
c file, efencetest.c , demonstrates use of Electric Fence. [Note: the current (2.0.5) version of efence
does not "make" correctly on our system. If you are not running PC/Linux and need to "make" the library from Perens' files, you
will need to manually make a few changes to use syserror() instead of the deprecated alternative and to add an "#include" of "efence.h" to the
eftest.c file.]
Other Public domain memory debugging tools
- Numeric problems: Two numeric problems that you should know about are the limited size of integers and the limited mantissa
of doubles (or, worse, floats). Compile and run numeric.c to see the problems.
For a 32 bit signed integer, the maximum number it can hold is 2,147,483,647 (2^31-1). If you use "unsigned int", the maximum
is 4,294,967,295 (2^32-1). These numbers are accessable in a C program as INT_MAX and UINT_MAX if you include "limits.h".
By the way, the "-1" is because zero is one of the 2^32 different numbers that can be represented in 32 bits.
As you can see by running the program, if you add one to the maximum integer, you get a large negative number (-2^31) which gets larger
as you keep adding one. The reason for this is the "twos-complement" number representation method used by most computers. For
more information, see Wikipedia
or chapter 1 of James E. Gentle's "Numeric Linear Algebra for Applications in Statistics" (Springer, 1998).
The second problem is the limitation in the size of the mantissa for a (double) floating point number. First you must realize
that floating point numbers are represented as a mantissa and an exponent. Typically there is 1 sign bit, 52 mantissa bits,
and 11 exponent bits. To add two numbers the computer first re-expresses the smaller number so that it has the same exponent
as the larger number. In the example, the number 1.0e0 is added to the number 1.0e-16, which gets re-expressed
as 0.0000000000000001e0. But because the mantissa of a double does not quite have enough bits to represent that number,
the low order bits are dropped and the mantissa becomes zero. The smallest number that can be added to 1.0 without resulting
in 1.0 is called epsilon, and for a "double" in C on most machines epsilon approximately equals 2.220446e-16. In the above example
two values of 1e-16 can be added together easily and accurately because both have 1.000000000000000 as their mantissas (and -16
as their exponents). So the moral of the story is that, if you need to add a long list of numbers of very different sizes,
you may get big rounding errors if you end up adding small numbers to large numbers. The Gentle book is again a good reference.
- Profiling lets you check which parts of your program are slowest; then you can direct your speedup recoding effort most
profitably to those functions.
With the gcc compiler use the -pg compilation option, then run your program, e.g. MyProg, as usual.
This will cause a file called gmon.out to be
created. Run gprof MyProg > profile.txt to analyze gmon.out and send the results to profile.txt.
Examine profile.txt to see which functions are eating up most of your run time.
- TenDRA public domain C/C++ checker
Here is a good memory tutorial.
The key memory problems are:
- Using an uninitialized pointer. This results in "Segmentation fault", "Bus error", or
crazy values when you
printf() from the vector or matrix, and "Segmentation fault" or "Bus error" or
memory trashing when you
assign something to the vector. It is also true that most cases of
"Segmentation fault" are due to uninitialized pointers. Remember that an uninitialized
"static" pointer is initialized to zero by the compiler, guaranteeing a Seg Fault when used, while
an uninitialized "automatic" pointer has random data in it. If that random data happens to be
incorrectly aligned for the data type a "bus error" will occur. If the random data is the
address of any valid, correctly aligned memory location belonging to your program, it will
read incorrect data and write trash on some other variable(s). If the random data is
correctly aligned but outside of your program's valid memory locations, it will cause a Seg Fault.
- Accessing beyond the end of a vector or matrix. This often results
in overwriting other data. If a pointer is overwritten, a "Segmentation fault"
may result when that pointer is used later in the program; otherwise other variables are inadvertently changed. These
can be hard to detect.
- Out of memory. This can be a tricky problem! Most modern compilers take
all available memory (other than the program commands themselves and the global static memory)
and grow the "stack" (which holds all "automatic" variables declared in C/C++) from one end of
the memory and the "heap" (which holds all dynamically allocated memory) from the other end.
(If you don't know what automatic means, you should definitely
read about it in Kernighan and Ritchie's "The C Programming Language".)
If you allocate a lot of large vectors or matrices dynamically or as automatic variables you may get
a "Segmentation fault" with very little indication of where the problem lies due to running out of memory
(some systems give a "stack overflow" error if the problem is on the stack).
To avoid the problem with dynamic memory, always check the malloc() return value to assure
that the memory was successfully allocated; or use the vmr routines which do that for you.
- Fragmented memory. If you use mallocvec(), mallocmat() and friends,
fragmentation of memory (fully analogous to a framented hard drive)
will eventually be detected as a "Can't allocate xxx bytes" message. If you use
malloc()'s instead, you must check the return value every time if you want to be
able to catch the problem. Otherwise
you will have strange results, perhaps a "Segmentation fault".
For statistical programs, you usually can and should reuse vectors and matrices
rather than freeing and reallocating them. I say this because most statistical
programs use the same sized vectors and matrices without danger of two functions
attempting to use the same vector at the same time. In addition to causing
fragmentation, repeated allocating and freeing of
memory is very slow, so with a little thought you can often avoid thousands of
allocate/free cycles by re-using data structures. In some cases in makes sense
to allocate the biggest needed vector for some purpose and also use it for cases
that only need a portion of it. For the corresponding case for matrices, you
can do the same trick, but if you access the memory in mat[row][col] format,
you will need to use redimmat().
You can also solve the fragmentation problem if you use
the gc garbage
collector with or without the garbage collection version
of the vmr library, called gcvmr. This automatically does defragmentation
as needed. If you #include "gc.h" in your program, then you can also
call gc_gcollect() to explicitly defragment memory at any time.
- Memory leaks. Memory leaks are due to forgetting to free memory that
you have allocated. Not freeing memory needed till the end of the program
before return()ing from main() is benign. At the other extreme, not
freeing memory allocated at each iteration of a loop is generally disastrous.
One common form is forgetting to free the row pointers for a (manually created) matrix. The
vmr and gcvmr libraries make freeing memory particularly easy: for vectors
or matrices, just free the first vector or matrix listed in any mallocvec()
or mallocmat(), and all memory associated with all vectors or matrices
allocated in the same call are correctly and totally freed.
The simplest leak detection scheme, suitable for when you are using the vmr
library or your own malloc()'s, but not for gcvmr, is to put the macro
#define MEMCHK(x) {printf("===%s: ",x); \
myinfo=mallinfo(); printf("%12d %12d\n",myinfo.hblkhd,myinfo.uordblks);}
or, for C++,
#define MEMCHK(x) {cerr << "==="<<x<< ": "; \
struct mallinfo myinfo=mallinfo(); cout << myinfo.hblkhd << " " << myinfo.uordblks << endl;}
at the top of your source file after all of the #include's but before
main(). This requires that you have "#include <malloc.h>" and
"#include <stdio.h>" among your includes.
Also, for C, put "struct mallinfo myinfo;" somewhere in scope, e.g. above main.
Then, you can sprinkle MEMCHK("My message")
throughout your program to get a report of used memory in two regions
(basically large and small allocations). When placed before and after
any function call, a difference of greater than 4 for the second number
and 0 for the first indicates that that function has a memory leak.
When using the gcvmr library, or gc directly, you can perform leak
detection as described in the README below.
gcvmr: libgcvmr
libgc.a
gc.h
leak_detector.h
README
The Fortran IMSL library is a collection of mathematical and statistical functions written
in Fortran and pre-compiled. They are licensed and available to the CMU
community and available on Andrew and throughout the Statistics department on HP and
SUN computers.
By using these routines, you will get well-implemented versions of functions
which can perform a wide range of tasks, from generating random numbers to
performing regression, but without any coding on your part! The functions are
described in books labelled IMSL Stat Library and IMSL Math Library and
via the links listed below. They are currently unavailable at CMU
for Windows. For Linux, IMSL C libraries are available, but unfortunately
these do not closely parallel the Fortran libraries, so constructing
cross-platform programs is difficult.
An alternative matrix library is
MTL.
The home page for IMSL is http://www.vni.com/products/imsl/ .
A pdf document from Visual Numerics, the distributor of IMSL, on calling
IMSL from C is given http://www.vni.com/tech/imsl/8709.pdf.
Pdf math and stat guides are
http://www.vni.com/books/docs/.
The main points of using these in C are to include the correct libraries
when linking, to declare the functions you will use, and to pass only
pointers to the routines as arguments. (I use #define's to add the &
pointer operator for me.)
Complex data is stored in a double-length vector, e.g. float cmplx[6]
can hold 3 complex numbers stored as real/imaginary pairs. Double complex
arguments are used for the d versions of functions.
In general, you need to use some judgment to guess the argument types.
If the argument must always refer to a whole number, it is probably "int".
If the function has two versions, plain and d, e.g. rnnor and nrnnor,
then non-integer values are either "float" or "double" depending on the version.
If there is only one version, non-integers are float. Most functions
return void, but some, like chidf/dchidf return float/double.
If you need to pass a matrix to a function, it must be declared as
a single float or double memory block with the first "col" elements corresponding
to the first row. A trick for allocating a matrix that can be accessed
as mat[r][c] in your program but passed to imsl as imslmat is
to use something like this C++ generalized inverse example:
int rows=3, cols=3;
double *imslmat = new double[rows*cols];
double **mat = new double*[rows];
for (int i=0; i<rows; i++) mat[i]=imslmat+i*cols;
for (int row=0; row<rows; row++)
for (int col=0; col<cols; col++)
mat[row][col]=row+col+row*col+1.0;
double *rslt = new double[rows*cols];
int rank;
double tol=2e-16;
dlsgrr(&rows, &cols, imslmat, &cols, &tol, &rank, rslt, &cols);
for (int r=0; r<rows; r++) {
for (int c=0; c<cols; c++)
cout << rslt[r*cols+c] << " ";
cout << endl;
}
Place these three files in a directory, type "make", and then "Demo 15" to see
an example sufficient to get you started.
Demo.c
IMSLrand.h
Unix makefile
Linux makefile
Fortran-IMSL Function Summaries
Fortran to C IMSL function Rosetta stone
(xls format)
A C program to demonstrate reading files, allocating matrices, and
using Fortran-IMSL matrix routines: stat.c
- Likelihoods out of range
Especially in MCMC, you may get errors or get stuck
in a bad region of the parameter space if your likelihood involves exponentiating
a very large or small number. Here is a good solution:
#include<stdlib.h>
#include<stdio.h>
#include<values.h> /* has needed constants */
double minexpable, maxexpable; /* largest and smallest numbers to which exp() can be applied */
void doMCMC(double *data, double *params);
/* first argument to program is number of simulations */
int main(int argc, char **argv) {
int i, nsim;
...
if (argc<2) {
fprintf(stderr, "Program needs at least one argument, which is the simulation count.\n");
exit(EXIT_FAILURE);
}
nsim=atoi(argv[1]);
...
/* Calculate largest and smallest numbers to which exp() can be applied (one time) */
minexpable=log(MINDOUBLE); /* defined in values.h */
maxexpable=log(MAXDOUBLE); /* defined in values.h */
...
for (i=0; i<nsim; i++)
doMCMC(data, params);
...
return(EXIT_SUCCESS);
}
void doMCMC(double *data, double *params) {
...
/* avoid taking exp() on too large or too small a number */
if (loglik>=minexpable && loglik<=maxexpable) {
like=exp(loglik);
} else if (log<minexpable) {
like=MINDOUBLE;
} else {
like=MAXDOUBLE;
}
...
return;
}
- Archiving old versions of your programs (version control)
The best way to write most statistical programs is to build them up in stages,
adding more features only after the first ones work.
For example, in an MCMC, the stages might be reading in the data,
generating starting values, and then in the main MCMC loop you could
consider updating each parameter as a separate stage.
The problem that often arises is that you "break" your program when
trying to go to the next stage. A related problem is that you remove
a feature or function, only later to discover that your really need it
again.
The solution here is to run a simple-to-use script file that keeps various
versions of your program (and all related files) intact, that you can refer
to later if needed. The script creates a directory with all of the files
of the working program. Because this is so easy to do, it is worth doing for
every project. (If you want a more sophisticated solution, look at the
svn program.)
The one-time setup is to put archive
in your /bin directory. (Ask for help if you don't have a /bin directory
and corresponding PATH entry.) Check that the file is executable or just run the Linux command
chmod u+x archive to set it as executable. You will need to run rehash
right after you install it.
Now, each time you start a new project (which must be in different directory from every other
project), you setup for archiving with these two simple steps:
- Run the Linux command archive new MyProjectName.
- Run the Linux command archive add myFile1 [myFile2 [...]] one or more times to
add the names of all pertinent files to the list of files to be archived. Pertinent
files include all .c, .h, and .cpp files, your makefile, your documentation file(s),
and any data files that are fairly small (e.g., less than 100KB).
Now each time you come to a point in your programming experience where either your
program appears to be working or you are about to make major changes that you might
want to undo, run the Linux command archive save "My comments". The comments
are optional, but a brief statement of the state of your program is useful. Be sure
to use quotes around the comments.
Other features of the archive script are use of archive alone to show the names
of all of the files on the archive list, and archive remove myfile to stop
archiving some file (remove it from the list). Also you can add files at any time.
Use archive help to show all options.
To see a summary of all archives run archive show. The directory names
encode the date and time of the archives in an obvious way. Then you can list the files
in any archive, and also view or copy any file(s) you need.
Note: I found that it was better for my working style to have any re-running
of archive save with the same hour to overwrite the old archive
rather than creating a new one. This is because if I am saving shortly after
a previous save, its probably because I thought of one little thing and this
is not really a new "version". In this case, the script warns and prompts
before overwriting. If you don't like this, change the script!
When you edit a file named *.c in Emacs, you are automatically in
c-mode as indicated by (C) on the status line. Emacs will automatically
do certain formatting and add some new capabilities. The two most
important are:
- Emacs finds the lines that have the errors: To compile your program
you must first have file called makefile in your
current directory that contains instructions for compiling your
program (and linking if you like). For the usual simple case,
just modify this simple make file makefile.
Each time you want to compile, type Alt-x compile and press
Enter twice. You will now see a second window with the
compiler results. If you click with the middle mouse button on
an error message, the cursor will jump to the error in your source.
You can then use ^x` (where the character after the control
x is the one under the tilde on HP keyboard) to jump to subsequent errors.
Source for example
Output for example
You may find it useful to know that ^x1 returns you to
a single window, and ^x^ (where the second ^ really means
caret, not control) increases the size of a window
at the expense of the other window.
- Check preprocessing: highlight some code, then use Alt-x c-mac
followed by Tab and Enter to pop up a new emacs window
where preprocessor directives such as #if and #define are expanded.
(Unfortunately this feature is not smart enough to expand macros and
defines that are found inside include files.)
All links active 4/22/2005. Report broken links to
Back to my Home Page