Counting to a Billion

[This blog post is reconstructed from the Internet Archive's copy of my original article posted on April 7th, 2007. The original post was lost.]

Let’s count to a billion.

#include <iostream>

void Count(const unsigned long limit) {
  DWORD ticStart = ::GetTickCount();
  unsigned long counter = 0;
  for (unsigned long i = 0; i < limit; ++i) {
    ++counter;
  }
  DWORD ticStop = ::GetTickCount();

  std::cout << "Counting took "
            << ticStop − ticStart
            << "milliseconds."
            << std::endl;
}

int main() {
  Count(1000 * 1000 * 1000);
  return 0;
}

Computers are pretty fast. With optimization turned off, my computer reports that it took a mere 2547 milliseconds (or about 2.5 seconds) to count to a billion. With optimization turned on, the compiler realizes there are no side effects in the loop, so it replaces the loop with an assignment: counter = limit;, taking virtually no time.

In the ray tracer, I want to keep statistics. How many intersection tests were performed? How many rays per seconds were traced? How many of those were shadow rays? Essentially, this boils down to counting how many times certain operations are performed. From this experiment, I can see that incrementing a few counters won’t be a significant performance hit.

But the ray tracer is multithreaded. To keep n cores cranking, I’ve got n threads rendering different parts of the picture. So if I want an accurate count of how many times something happens, I’m going to have to put some sort of synchronization on the accesses to the counter.

void CountWithCriticalSection(const unsigned long limit) {
  CRITICAL_SECTION cs;
  ::InitializeCriticalSection(&cs);

  DWORD ticStart = ::GetTickCount();
  unsigned long counter = 0;
  for (unsigned long i = 0; i < limit; ++i) {
    ::EnterCriticalSection(&cs);
    ++counter;
    ::LeaveCriticalSection(&cs);
  }
  DWORD ticStop = ::GetTickCount();
  ::DeleteCriticalSection(&cs);

  std::cout << "Critical section counting took "
            << ticStop − ticStart
            << " milliseconds."
            << std::endl;
}

The example is still single threaded, but the code is now thread safe. The critical section prevents two threads from corrupting the counter by trying to modify it at the same time.

The bad news is that this example, optimized or not, took 73,719 milliseconds. Ouch. Thread safety has a price. And this is without any contention for the critical section. It would be worse with two threads incrementing the counter. Imagine how bad it would have been if we had used a mutex and had to make a billion round trips to kernel mode!

Fortunately, Windows provides a set of interlocked functions which can help us do simple, thread-safe operations like counting without the overhead of a critical section. One more version:

void CountInterlocked(const unsigned long limit) {
  DWORD ticStart = ::GetTickCount();
  volatile long counter = 0;
  for (unsigned long i = 0; i < limit; ++i) {
    ::InterlockedIncrement(&counter);
  }
  DWORD ticStop = ::GetTickCount();

  std::cout << "Interlocked counting took "
            << ticStop − ticStart
            << " milliseconds."
            << endl;
}

This version scores significantly better at 37,390 milliseconds, just over half the time of the uncontested critical section. But that’s still a lot of overhead to take on while rendering.

And a little back-of-the-envelope calculation delivers even more bad news. Suppose one of the counters is supposed to be the number of intersection tests performed in the rendering job. If were computing images at hi-def resolution (1920 by 1080), that’s about 2 million pixels. If, for anti-aliasing, we oversample with say 16 rays per pixel, we’re up to about 32 million primary rays. Each primary might spawn a secondary reflection or refraction ray. Each of those 64 million rays might require, say, 10 shadow rays for shading. With a good acceleration structure, we might get away with as few as 4 intersections per ray. Next thing you know, we’ve exceeded the capacity of a 32-bit unsigned integer. If not for a single frame, then certainly once we start animating at 24 or 30 frames per second. So it looks like we’ll need 64 bits for at least some of our counters.

But, unless you’re on a 64-bit architecture, there aren’t 64-bit versions of the interlocked APIs. That means we’re back to critical sections for making the counters thread-safe. Obviously, we need to rethink this.

The key to good performance with multiple threads is to eliminate shared data as much as possible. Suppose instead of one counter, each thread had its own. Each thread would be able to increment its own counter without any synchronization concerns. Then, when the render is done, the independent counters could be summed to determine the total count. This adds a small amount of complexity and coordination, but not much, and the benefit should be worth it.

And this brings me back to my singleton post. I’m now paying the price for not taking my own advice. The ray tracer’s statistics are based on a set of counters kept in a singleton, so now I have to do some extra refactoring to distribute the counters among thread-specific objects.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>