Here is the text of a book that I began writing many years ago. I never found the time to finish it. However, you may find some useful tips.

Faster C and C++ Code

Copyright © 2000 by Rex A. Barzee. All rights reserved.

Contents

Introduction

This book documents programming techniques and tips that cause the compiled version of a C or C++ function to contain fewer assembly instructions or execute fewer assembly instructions or use faster assembly instructions, causing the function to run faster. In a few cases, these techniques may cause the C code to be less intuitive (as if any code is) or less readable. However, in many cases these techniques will make code more concise and more readable; to paraphrase, “brevity is the essence of [programming].”

The techniques discussed in this book apply to both C and C++ because C++ is a superset of C and despite C++ classes, methods, overloaded operators, virtual functions, templates, exceptions, runtime type checking, etc., the C++ programmer must still write the body of functions using loops, arrays, pointers, primitive types, and other C expressions. I assume the reader of this book is a C and/or C++ programmer, and I make no attempt to explain C or C++ language constructs or concepts. There are already many excellent books that do this.

The two most important points about writing faster code are:

  1. Write correct code before writing fast code.
  2. Use a profiler or other technique to choose where to write fast code.

I cannot emphasize these two points enough. What good is fast code if it gives incorrect results? Also, it does no good to optimize the code for a function that executes for a small amount of the overall time that your program executes. If you work hard and rewrite a function making it run twice as fast as it was previously, but it was taking only 1% of the total execution time, you will have sped up the entire program by only 0.5% (one half of one percent).

Coding Techniques

1. Move statements outside of loops

The simplest and best way to make any code run faster is to reduce the amount of code or reduce the number of times the computer must execute that code. I often see code that has statements that don’t need to be repeated placed inside a loop where they are repeated. Consider this simple function to compute the average of an array of numbers.

Example 1.1

/** Computes and returns the average of the values in a. */
double average(const double *a, int n) {
    double sum = 0;
    double avg = 0;
    for (int i = 0;  i < n;  ++i) {
        sum += a[i];
        avg = sum / n;
    }
    return avg;
}

Of course, the average doesn’t need to be and shouldn’t be calculated each time through the loop. Instead, it should be calculated only once after the loop is finished.

Example 1.2

/** Computes and returns the average of the values in a. */
double average(const double *a, int n) {
    double sum = 0;
    for (int i = 0;  i < n;  ++i) {
        sum += a[i];
    }
    double avg = sum / n;
    return avg;
}

Consider this C++ example that counts the number of uppercase letters in a string. It contains a more subtle example of code that is repeated that doesn’t need to be.

Example 1.3

/** Counts and returns the number uppercase letters in a string. */
int countUppers(const string &s) {
    int count = 0;
    for (int i = 0;  i < s.length();  ++i) {
        if (isupper(s[i])) {
            count++;
        }
    }
    return count;
}

In example 1.3, the comparison i < s.length() causes the computer to make an extra function call each time through the loop. Also, for a C++ string, the subscript operator ([ ]) is a function call. Example 1.3 should be rewritten to move the call to the length function outside the loop and to eliminate the subscript operator function call. These two optimizations are shown in example 1.4.

Example 1.4

/** Counts and returns the number uppercase letters in a string. */
int countUppers(const string &s) {
    int count = 0;
    int length = s.length();
    const char *data = s.data();
    for (int i = 0;  i < length;  ++i) {
        if (isupper(data[i])) {
            count++;
        }
    }
    return count;
}

2. Write loops to count backwards (decrement) to zero

When writing a loop, instead of writing it to count from 0 to n,

Example 2.1

    for (int i = 0;  i != n;  ++i) {
    }

write the loop to count backwards from n to zero.

Example 2.2

    for (int i = n;  i != 0;  --i) {
    }

Many CPUs have a single instruction for decrement and compare to zero, for example, Hewlett-Packard Company’s PA-RISC instruction addib,znv -1,r,r, but have no such single instruction for increment and compare. So after being compiled, the end of loop 2.1 will consist of two or three assembly instructions:

or

The end of loop 2.2 will likely consist of a single assembly instruction because the optimizer will combine i != 0 and --i into a single instruction.

3. Use pointer dereference instead of array subscript

On many computer architectures, the array subscript operator requires two instructions:

  1. adding the index value (i in example 3.1) to the base of the array (samples in example 3.1) giving a pointer into the array
  2. dereferencing the resultant pointer

Example 3.1

    double sum = 0;
    for (int i = 0;  i < length;  ++i) {
        sum += samples[i];
    }

However, many CPUs have a single dereference and increment instruction, so *p++ in example 3.2 can be done in a single instruction. Also, because example 3.2 uses a pointer dereference instead of an array subscript, we can rewrite the counting loop to count down to 0 as explained in tip 2.

Example 3.2

    double sum = 0;
    double *p = samples;
    for (int i = n;  i != 0;  --i) {
        sum += *p++;
    }

4. Use as few variables as possible in a function

Space for local variables in a function is allocated from the CPU’s registers or on the call stack or both. Since a variable stored on the stack must first be loaded into a register before its value can be manipulated, we want to write code that maintains as many variables as possible in the CPU registers. In modern computer architectures, some CPU registers are designated as parameter passing registers, some as return value registers, and some as scratch or temporary registers. These are all available to be used as local variables. The compiler and optimizer decide which registers are allocated to which variables. Of course, the compiler tries to allocate registers in a way to make the function run as fast as possible, but it doesn’t always find the optimal mapping of variables to registers. This is one of the reasons for the existence of the C keyword register. Using register a programmer can hint to the compiler which variables should be allocated in registers.

If a function uses more variables than are available for local registers, the compiler must output code to spill the contents of the registers specified as non-local to the stack and restore those contents from the stack. Also, the compiler must output code to swap the contents of registers specified as local to and from the stack.

Often, programmers write code that uses more variables than necessary. In other words, some variables are synonyms for others as in example 4.1. Also, many variables are used during only a small part of a function. A good optimizer combines synonyms and allocates a register for a variable only while that variable is in use. Also, many modern CPUs perform register renaming which helps use registers more efficiently. However, not all optimizers are good ones, and even good ones don’t find the optimal variable to register mapping every time. Therefore, you should help the compiler by eliminating synonyms and other unnecessary variables. Also, don’t initialize local variables long before they are actually needed as ____ is in example 4.1. Instead, initialize them immediately before they are needed. C++ programmers and C programmers using C99 can go a step further and not declare variables until they are needed as ____ is in example 4.2.

I’m not suggesting that you name all your variables tmp1, tmp2, etc. and use the variables for anything that is needed. Give variables meaningful names and use as many as necessary but no more.

5. Use the formal parameters of a function like local variables

Many programmers forget that the formal parameters of a C function are passed by value from the calling function to the called function, so they are copies of the values used in the calling function. In other words, they are local variables and can be modified like local variables. The functions in both examples 5.1 and 5.2 compute and return the sum of the numbers in an array. Notice that example 5.1 has two more variables (i and p) than example 5.2 has. Example 5.2 uses both of its parameters as local variables and modifies them within the while loop. For all the reasons given in tip 4, it is a good idea to use fewer variables in a function, and using the formal parameters as local variables often reduces the number of variables that a function needs.

Example 5.1

/** Calculates and returns the sum of the numbers in array. */
double sum(const double *array, int n) {
    const double *p = array;
    double s = 0;
    for (int i = n;  i != 0;  --i) {
        s += *p++;
    }
    return s;
}

Example 5.2

/** Calculates and returns the sum of the numbers in array. */
double sum(const double *array, int n) {
    double s = 0;
    while (n != 0) {
        s += *array++;
        --n;
    }
    return s;
}

6. Use memset when setting more than N bytes to the same value.

When setting more than small amounts of memory to the same value, don’t write a loop to do it as shown in example 6.1. Instead, use the memset function as shown in example 6.2. memset is often written in assembly code and has been optimized and tested by the developers of the standard C library. Calling a function requires some time or overhead, so calling memset to set small amounts of memory is slower than setting the memory inside a loop. You can measure on your architecture at what number of bytes calling memset becomes faster than setting the bytes in a loop.

Example 6.1

    int n = 1024;
    uint8_t array[n];
    for (int i = 0;  i < n;  ++i) {
        array[i] = 0;
    }

Example 6.2

    int n = 1024;
    uint8_t array[n];
    memset(array, 0, n);

Unfortunately, the memset function works for bytes only. In other words, you can’t use memset to fill an array of integers with a number unless all the bytes of the number are the same. Example 6.3 shows code that attempts to fill an array of integers with the number 26, but it doesn’t work. Instead, of printing the number 26, the printf function prints the number 437918234. It prints 437918234 because the memset function filled the array with the byte 26 which is hexadecimal 1a and binary 0001 1010. When using the gcc compiler, an int occupies four bytes, so the printf function combines four bytes that were set to 0x1a1a1a1a by the memset function and prints the corresponding decimal number which is 437918234.

Example 6.3

    int n = 1024;
    int array[n];
    memset(array, 26, sizeof(array));
    printf("%d\n", array[0]);

The memset function can be used to set values in arrays of short, int, long, and even float and double if all the bytes of the desired number are the same. This works especially well for zero because when the number zero is stored as a short, int, long, float, or double, the number zero is simply a group of bytes all with the value 0. You can see that this is true by compiling and running the code in example 6.4. When you run the code you will see the output shown below example 6.4.

Example 6.4

short s = 0;
    int i = 0;
    long n = 0;
    float f = 0;
    double d = 0;

    uint8_t *p = (void *)(&s);
    printf("short: %d %d\n", p[0], p[1]);

    p = (void *)(&i);
    printf("int: %d %d %d %d\n", p[0], p[1], p[2], p[3]);

    p = (void *)(&n);
    printf("long: %d %d %d %d %d %d %d %d\n",
            p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7]);

    p = (void *)(&f);
    printf("float: %d %d %d %d\n", p[0], p[1], p[2], p[3]);

    p = (void *)(&d);
    printf("double: %d %d %d %d %d %d %d %d\n",
            p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7]);

Output

    short: 0 0
    int: 0 0 0 0
    long: 0 0 0 0 0 0 0 0
    float: 0 0 0 0
    double: 0 0 0 0 0 0 0 0

In summary, memset works great for setting all the bytes in an array to a specified byte value. This means memset will work great for setting all the numbers in an array of integers to zero as shown in example 6.5.

Example 6.5

    int n = 1024;
    int array[n];
    memset(array, 0, sizeof(array));

7. When copying more than N bytes use memcpy or memmove

When copying more than a small amount of memory, instead of writing a loop to copy the data, call memcpy or memmove. Both memcpy and memmove are often written in assembly code and use special instructions such as cache prefetching or double word reads and writes to speed up memory copying. Calling a function requires some time, so calling memcpy or memmove to copy small amounts of memory is slow. You can measure on your architecture at what number of bytes calling memcpy or memmove becomes faster than copying the memory in a loop.

Example 7.1 contains code to copy data from a source array to a destination array by using a loop. Example 7.2 shows code to copy the same data by using memcpy instead of a loop. Example 7.2 is almost certainly faster than example 7.1.

Example 7.1

    int n = 1024;
    int source[n];
    int destination[n];
    for (int i = 0;  i < n;  ++i) {
        destination[i] = source[i];
    }

Example 7.2

    int n = 1024;
    int source[n];
    int destination[n];
    memcpy(destination, source, sizeof(destination));

The difference between memcpy and memmove is that memcpy cannot be used to copy data from a source block of memory to a destination block of memory if the two blocks overlap. However, memmove can be used to copy overlapping blocks. For example, if you need code to insert text into a large string, you could use memmove to move the existing text and make a spot for the new text to be inserted as demonstrated in example 7.3.

Example 7.3

    char longText[1024] = "For unto us a is born, unto us a son is"
            " given, and the government shall be upon his shoulder.";
    char *missingText = "child ";

    int longLength = strlen(longText);
    int missingLength = strlen(missingText);
    char *insert = strstr(longText, "is");
    char *move = insert + missingLength;
    int beforeLength = insert - longText;
    int afterLength = longLength - beforeLength + 1;  // + 1 to copy '\0'

    /* Move "is born, unto us a..." to make space to insert "child " */
    memmove(move, insert, afterLength);

    /* Insert "child " into the longText. */
    memcpy(insert, missingText, missingLength);

    /* Print longText to see that "child " was correctly inserted. */
    printf("%s\n", longText);

8. Use integer arithmetic instead of floating point arithmetic

9. Use look up tables

10. Declare functions and global variables as static

11. Declare functions as inline

12. Use combinatorial algorithms

Example 12.1

/** Modifies an image by masking each pixel. */
void bitAnd(uint8_t *img, uint32_t len, uint8_t mask) {
    while (len != 0) {
        *img++ &= mask;
        len--;
    }
}

Example 12.2

/** Modifies an image by masking each pixel. */
void bitAnd(uint8_t *img, uint32_t len, uint8_t mask) {
    uint32_t *combo;

    /* Process any bytes that are not aligned on a
     * 32-bit boundary. This will be at most 3 bytes.*/
    uintptr_t align = ((uintptr_t)img) & (sizeof(*combo) - 1);
    while (align && len) {
        *img++ &= mask;
        align = ((uintptr_t)img) & (sizeof(*combo) - 1);
        len--;
    }

    /* Process most of the image 4 bytes at a time. */
    uint32_t bigMask = mask;
    bigMask = (bigMask << 8) | bigMask;
    bigMask = (bigMask << 16) | bigMask;
    combo = (uint32_t *)img;
    for (int i = len / sizeof(*combo);  i != 0;  --i) {
        *combo++ &= bigMask;
    }

    /* Process any remaining bytes that are not
     * aligned. This will be at most 3 bytes. */
    len &= (sizeof(*combo) - 1);
    img = (uint8_t *)combo;
    while (len) {
        *img++ &= mask;
        --len;
    }
}

13. Understand and use the sizeof operator

Using the sizeof operator does not really make C code faster but instead more portable and maintainable. Not using the sizeof operator is a mistake I see so often that I had to include this item here. Recall that the sizeof operator returns the number of bytes that a data type or variable requires in memory. The sizeof operator can be applied to classes, structures, arrays, types, pointers, and probably to other things I’m not aware of. Example 13.1 shows the sizeof operator applied to different variable types.

Example 13.1

    uint8_t u8;
    uint16_t u16;
    uint32_t u32;
    uint64_t u64;

    char c;
    short s;
    int i;
    long n;

    float f;
    double d;

    double *ptr;
    double array[20];

    printf(" uint8_t %u   u8 %u\n", sizeof(uint8_t), sizeof(u8));
    printf("uint16_t %u  u16 %u\n", sizeof(uint16_t), sizeof(u16));
    printf("uint32_t %u  u32 %u\n", sizeof(uint32_t), sizeof(u32));
    printf("uint64_t %u  u64 %u\n", sizeof(uint64_t), sizeof(u64));

    printf("  char %u  c %u\n", sizeof(char), sizeof(c));
    printf(" short %u  s %u\n", sizeof(short), sizeof(s));
    printf("   int %u  i %u\n", sizeof(int), sizeof(i));
    printf("  long %u  n %u\n", sizeof(long), sizeof(n));
    printf(" float %u  f %u\n", sizeof(float), sizeof(f));
    printf("double %u  d %u\n", sizeof(double), sizeof(d));

    printf("ptr %u  *ptr %u\n", sizeof(ptr), sizeof(*ptr));
    printf("array %u  array[0] %u  *array %u\n",
            sizeof(array), sizeof(array[0]), sizeof(*array));
    printf("number of elements in array: %u\n",
            sizeof(array) / sizeof(array[0]));

Output

     uint8_t 1   u8 1
    uint16_t 2  u16 2
    uint32_t 4  u32 4
    uint64_t 8  u64 8
      char 1  c 1
     short 2  s 2
       int 4  i 4
      long 4  n 4
     float 4  f 4
    double 8  d 8
    ptr 4  *ptr 8
    array 160  array[0] 8  *array 8
    number of elements in array: 20

Most of the output from example 13.1 is not surprising. For example, the size of the data type uint8_t is 1 byte, and the size of the variable u8 is also 1 byte. From this, we learn that we can apply the sizeof operator to a data type or a variable. Notice that the size of the float data type and the variable f is 4 bytes, and the size of the double data type and the variable d is 8 bytes. This was expected because float in C is an IEEE 754 single-precision number, and double is an IEEE 754 double-precision number.

The last three lines of output may be surprising to you. Notice that the size of a pointer on my computer is 4 bytes. However, the pointer points to a double so the size of the dereferenced pointer (*ptr) is the size of the data type that it points to which is 8 bytes because ptr points to a double. Notice that the size of array is 160 bytes and not 4 bytes as a pointer is. This is because the length of the array (20) is declared in the same scope where I used the sizeof operator, and 8 * 20 is 160.

Image Processing Example: A Mode Filter

static uint32_t Mode(const uint32_t *hist, uint32_t n) {
    const uint32_t *t = hist;
    uint32_t i, first, last, num, max = *t++;

    first = last = num = 0;
    for (i = n - 1;  i;  --i, ++t)
        if (*t > max) {
            max = *t;
            first = n - i;
            num = 0;
        }
        else if (*t == max) {
            last = n - i;
            ++num;
        }

    /* If several entries in the histogram tie for
     * the mode, then choose the middle one. */
    if (num > 1) {
        num >>= 1;
        for (t = &hist[first + 1];  1;  ++t)
            if (*t == max)
                if (--num == 0) {
                    first = t - hist;
                    break;
                }
    }

    return first;
}


// real    0m7.89s - pgmoil C code
// Time: 10.250628 secs
// Time:  9.714068 secs
const ImgU8 ImgU8::OrigFilt(int kerW, int kerH) const {
    ImgU8 dst(cols, rows, true);
    uint8_t *d = dst.rc->data;
    const uint8_t *s = Data();
    const int halfKerW = ((kerW + 1) >> 1) - 1,
    halfKerH = ((kerH + 1) >> 1) - 1;
    int i, j, x, y;
    uint32_t hist[256], mode;

    for (y = 0;  y < rows;  ++y) {
        for (x = 0;  x < cols;  ++x) {
            for (i = 0;  i <= NUMBER(hist);  ++i)
                hist[i] = 0;

            for (j = y - halfKerH;  j <= y + halfKerH;  ++j) {
                if (j >= 0 && j < rows)
                    for (i = x - halfKerW;  i <= x + halfKerW;  ++i)
                        if (i >= 0 && i < cols)
                            ++hist[(int)s[j * cols + i]];
            }

            for (j = i = 0;  i < NUMBER(hist);  ++i)
                if (hist[i] > j) {
                    j = hist[i];
                    mode = i;
                }

            *d++ = mode;
        }
    }
    return dst;
}


// Time: 8.371114 secs
// Time: 8.330760 secs
const ImgU8 ImgU8::OutFilt(int kerW, int kerH) const {
    ImgU8 dst(cols, rows, true);
    uint8_t *d = dst.rc->data;
    const uint8_t *s = Data();
    const int halfKerW = ((kerW + 1) >> 1) - 1,
    halfKerH = ((kerH + 1) >> 1) - 1;
    int i, i1, i2, j, j1, j2, x, y;
    uint32_t hist[256], mode;

    for (y = 0;  y < rows;  ++y) {
        j1 = y - halfKerH;
        if (j1 < 0) {
            j1 = 0;
        }
        j2 = y + halfKerH + 1;
        if (j2 > rows) {
            j2 = rows;
        }

        for (x = 0;  x < cols;  ++x) {
            for (i = 0;  i <= NUMBER(hist);  ++i)
                hist[i] = 0;

            i1 = x - halfKerW;
            if (i1 < 0) {
                i1 = 0;
            }
            i2 = x + halfKerW + 1;
            if (i2 > cols) {
                i2 = cols;
            }
            for (j = j1;  j < j2;  ++j)
                for (i = i1;  i < i2;  ++i)
                    ++hist[(int)s[j * cols + i]];

            for (j = i = 0;  i < NUMBER(hist);  ++i)
                if (hist[i] > j) {
                    j = hist[i];
                    mode = i;
                }

            *d++ = mode;
        }
    }
    return dst;
}


// Time: 8.341486 secs
// Time: 8.735447 secs
const ImgU8 ImgU8::MemsetFilt(int kerW, int kerH) const {
    ImgU8 dst(cols, rows, true);
    uint8_t *d = dst.rc->data;
    const uint8_t *s = Data();
    const int halfKerW = ((kerW + 1) >> 1) - 1,
    halfKerH = ((kerH + 1) >> 1) - 1;
    int i, i1, i2, j, j1, j2, x, y;
    uint32_t hist[256], mode;

    for (y = 0;  y < rows;  ++y) {
        j1 = y - halfKerH;
        if (j1 < 0) {
            j1 = 0;
        }
        j2 = y + halfKerH + 1;
        if (j2 > rows) {
            j2 = rows;
        }
        for (x = 0;  x < cols;  ++x) {
            memset(hist, 0, sizeof(*hist) * (NUMBER(hist) + 1));

            i1 = x - halfKerW;
            if (i1 < 0) {
                i1 = 0;
            }
            i2 = x + halfKerW + 1;
            if (i2 > cols) {
                i2 = cols;
            }
            for (j = j1;  j < j2;  ++j) {
                for (i = i1;  i < i2;  ++i)
                    ++hist[(int)s[j * cols + i]];
            }

            for (j = i = 0;  i < NUMBER(hist);  ++i)
                if (hist[i] > j) {
                    j = hist[i];
                    mode = i;
                }

            *d++ = mode;
        }
    }
    return dst;
}


// Time: 7.331546 secs
// Time: 6.884807 secs
const ImgU8 ImgU8::PointFilt(int kerW, int kerH) const {
    ImgU8 dst(cols, rows, true);
    uint8_t *d = dst.rc->data;
    const uint8_t *pix, *row, *src = Data();
    const int halfKerW = ((kerW + 1) >> 1) - 1,
    halfKerH = ((kerH + 1) >> 1) - 1;
    int i, i1, i2, j, j1, j2, x, y;
    uint32_t hist[256], mode;

    for (y = 0;  y < rows;  ++y) {
        j1 = y - halfKerH;      if (j1 < 0)    j1 = 0;
        j2 = y + halfKerH + 1;  if (j2 > rows) j2 = rows;
        j2 -= j1;
        row = &src[j1 * cols];
        for (x = 0;  x < cols;  ++x) {
            memset(hist, 0, sizeof(*hist) * (NUMBER(hist) + 1));

            i1 = x - halfKerW;
            if (i1 < 0) {
                i1 = 0;
            }
            i2 = x + halfKerW + 1;
            if (i2 > cols) {
                i2 = cols;
            }
            i2 -= i1;
            pix = &row[i1];
            for (j = j2;  j;  --j, pix += cols - (i2 - i1)) {
                for (i = i2;  i;  --i)
                    ++hist[(int)*pix++];
            }

            for (j = i = 0;  i < NUMBER(hist);  ++i) {
                if (hist[i] > j) {
                    j = hist[i];
                    mode = i;
                }
            }

            *d++ = mode;
        }
    }
    return dst;
}


// Time: 5.411607 secs
// Time: 5.481288 secs
const ImgU8 ImgU8::ModeFilt(int kerW, int kerH) const {
    ImgU8 dst(cols, rows, true);
    uint8_t *d = dst.rc->data;
    const uint8_t *t, *s = Data();
    const int halfKerW = ((kerW + 1) >> 1) - 1,
    halfKerH = ((kerH + 1) >> 1) - 1;
    int i, i1, i2, j, j2, x, y;
    uint32_t hist[256];

    /* Process first halfKerH rows */
    for (y = 0;  y < halfKerH;  ++y) {
        j2 = y + halfKerH + 1;
        for (x = 0;  x < cols;  ++x) {
            memset(hist, 0, sizeof(hist));
            i1 = x - halfKerW;
            if (i1 < 0) {
                i1 = 0;
            }
            i2 = x + halfKerW + 1;
            if (i2 > cols) {
                i2 = cols;
            }

            for (j = 0;  j < j2;  ++j) {
                t = &s[j * cols + i1];
                for (i = i1;  i < i2;  ++i, ++t)
                    ++hist[*t];
            }

            *d++ = Mode(hist, NUMBER(hist));
        }
    }

    /* Process middle rows */
    const int kwm1 = kerW - 1;  /* kernel width minus 1 */
    for (y = rows - kerH + 1;  y;  --y) {
        memset(hist, 0, sizeof(hist));

        /* Compute first halfKerW pixels in row */
        for (j = 0;  j < kerH;  ++j) {
            t = &s[j * cols];
            for (i = halfKerW;  i;  --i, ++t)
                ++hist[*t];
        }

        for (x = 0;  x < halfKerW;  ++x) {
            t = &s[halfKerW + x];
            for (j = kerH;  j;  --j, t += cols)
                ++hist[*t];
            *d++ = Mode(hist, NUMBER(hist));
        }

        /* Compute fully filtered pixels in row */
        for (x = cols - kwm1;  x;  --x) {
            /* Add to hist pixels in next right source column */
            t = &s[kwm1];
            for (j = kerH;  j;  --j, t += cols)
                ++hist[*t];
            *d++ = Mode(hist, NUMBER(hist));

            /* Remove from hist pixels in left source column */
            t = s++;
            for (j = kerH;  j;  --j, t += cols)
                --hist[*t];
        }

        /* Compute last halfKerW pixels in row */
        *d++ = Mode(hist, NUMBER(hist));
        for (x = halfKerW - 1;  x;  --x) {
            t = s++;
            for (j = kerH;  j;  --j, t += cols)
                --hist[*t];
            *d++ = Mode(hist, NUMBER(hist));
        }

        s += halfKerW + 1;
    }

    /* Process last halfKerH rows */
    s = Data();
    for (y = rows - halfKerH;  y < rows;  ++y) {
        j2 = y - halfKerH;
        for (x = 0;  x < cols;  ++x) {
            memset(hist, 0, sizeof(hist));
            i1 = x - halfKerW;
            if (i1 < 0) {
                i1 = 0;
            }
            i2 = x + halfKerW + 1;
            if (i2 > cols) {
                i2 = cols;
            }
            for (j = j2;  j < rows;  ++j) {
                t = &s[j * cols + i1];
                for (i = i1;  i < i2;  ++i, ++t)
                    ++hist[*t];
            }
            *d++ = Mode(hist, NUMBER(hist));
        }
    }

    return dst;
}