SOPA and PIPA are dangerous

The SOPA and PIPA bills are dangerous.

They’re not really about piracy, either. Piracy is just the excuse. These bills, like most copyright legislation in the past two decades, are about the entertainment industry holding back innovations that might disrupt their business models. It’s yet another power grab.

Worse, these bills could actually make the net less secure, while providing an infrastructure for censorship that will inevitably be abused.

And by blocking innovation, these bill will actually prevent new jobs and slow our economic recovery.

Call or write your legislators. Let them know that handing our freedom and security over to the content industry will cost them votes.

Something Changed!

Yes, the blog has been neglected for a quite a while as I was focused on my novel. Now that the manuscript is being reviewed by my beta readers, I’m starting to spiff up the domain, so stay tuned for evem more changes.

If you don’t already have an ereader, I recommend you put one on your wish list for the holidays.

Dear Tech Support

Dear CBS/TVCity Entertainment Panel/Sony/Nielsen/ReelResearch/ReelSurvey tech support:

I am unable to get the promotional videos that are part of the latest entertainment survey to play.

I ran the survey using Internet Explorer, despite the fact that it is a far less secure browser. I’m running Internet Explorer 8.0.6001.18702 on Windows XP SP4. I have Windows Media Player 11.0.5721.5280 because I got tricked into upgrading to the crappy, locked-down version a couple years ago. I even happen to have Flash installed this week. (I often uninstall Flash because Adobe is so bad at security, Flash-based ads are abusive of my bandwidth, and Flash-based web sites may as well be content-free.)

When I clicked the link to play the video, I got the following message:


You do not have the rights to play this file. Go to the content provider[']s Web
site to find out how to obtain the necessary play rights.

Address:

http://drm1.reelsurvey.com/RR/Video/V7LicPage.asp

Web pages can contain elements that could be harmful to your computer. It is
important to be certain that the content is from a trustworthy source before
continuing.

So far, I’ve been called a cheater and told to be very suspicious of you.

Just who are you anyway? The survey invitations are sent to an email I gave only to CBS, the emails claim to be from TVCity Entertainment Panel, which appears to be affiliated with Sony—the criminals who got away with compromising the security of hundreds of thousands of PCs by distributing a rootkit on audio CDs. The page titles say Nielsen, and the survey URLs vary between reelresearch.com and reelsurvey.com. You could at least use https and give me a certificate to check out.

Against my better judgment, I clicked through. A poorly-drawn, non-resizeable window popped up, and an error popped up on top of that. The error dialog said:


An error has occurred in the script on this page.

Line: 8
Char: 2
Error: 'netobj' is undefined
Code: 0
URL: http://drm1.reelsurvey.com/RR/Video/V7LicPage.asp

Do you want to continue running scripts on this page?

Curious as to just how much farther this train wreck could go I clicked ‘Yes’. The error dialog vanished and returned me to the non-resizeable window which was mostly black, with the text: “A license for the media file has been downloaded to your system. Please click play.” Unfortunately, the Play button was disabled.

At the top of the window was an IE gold message bar. The message bar said: “This web site wants to run the following add-on: ‘DRM ActiveX Network Object’ from ‘Microsoft Corporation’. If you trust the web site and the add-on and want to allow it to run, …” [The ellipsis is the original text, not an edit on my part.]

Well, I was pretty sure I didn’t trust the web site, and even if I was morbidly curious about what kinds of atrocities would be waged against my computer, I couldn’t continue because the instruction was cut off because somebody thought it would be a good idea to keep the user from resizing the window.

Why—oh why!—do you put DRM on a frakkin’ promotional video? Don’t you want promotional videos to be seen?

When will the entertainment industry realize that Digital Restrictions Management makes their products less valuable by hindering only legitimate consumers? If you want to give your clients some valuable feedback, tell them to wise up about treating their customers like criminals.

P.S. Submitting this message through the web form resulted in a server error.

Reconcile This

One of the reasons I’ve used Quicken so consistently to track our finances is that it so often saves me money. In a typical year, I used to catch $75 to $150 worth of bank errors, usually with credit card charges. These erroneous transactions are things that I would have missed with a visual inspection of the monthly statement, but are easy to nail when reconciling my records against the bank’s.

Common errors I catch:

  • Transposed digits. That $34 dollar meal is mistakenly entered as $43 by the hurried waiter. Small enough that I’d never notice, but easy to catch in Quicken. I got pretty good at contesting these.
  • Double charges. When you charge a meal at a restaurant, the waiter enters the pre-tip total and brings you the receipt. On the receipt, you enter the tip amount and sign. The waiter then enters a second charge for the total with the tip, and reverses the original charge. But sometimes, the reversal doesn’t work, and you’ll end up paying for your meal twice.
  • Charges at the statement cutoff. Twice I’ve had a charge show up as the last transaction of my monthly statement, and again as the first charge of the next monthly statement.

Mistakes like these are one of the main reasons I’ve stuck with an older version of Quicken that emphasizes reconciling over downloading.

Over the past few years, I’ve noticed a change in the types of errors I catch. There are fewer and fewer of those common charge errors in the vendor’s favor and more and more in my favor. In fact, the most common type of error I see now is the transaction that never posts to the account.

For example, I have a charge slip (actual ink on a physical piece of paper) showing that I paid our veterinarian $77.64 with our Visa card last December, but that charge has never appeared on our credit card statement. And it’s not just small vendors like the local vet clinic. TJ Maxx, Crate & Barrel, H&M, Target and other big name retailers sometimes fail to get a charge through.

I wonder who is losing out. Are these vendors not getting paid or is the bank failing to bill me? The first few times this happened, I called the bank. I gave specific information about the time, place, and amount of receipts that I had that didn’t post to my account. The rep said they didn’t have a record of any of those charges, and suggested that I called the vendors—if I cared. I called a couple of smaller ones, like the local vet clinic. They were uncertain how to verify payment for a specific credit card charge. In general, they seemed unconcerned—they didn’t think the credit card companies were underpaying them.

Ignorance is bliss, I guess.

In the past, the paper trail was king. A printed receipt was the final arbiter of the transaction. If the charge on the statement didn’t match the receipt, you sent a copy of the receipt to the bank, and the bank fixed their records. But today, it seems, the bank’s database is the final authority—reality be damned.

If the bank is happy and the vendors are satisfied, I guess I’m the only one with a problem. Sure, I’m ahead a couple hundred dollars, but my statements don’t reconcile. To a hardcore Quicken user, that’s torture.

Counting to a Billion

[This blog post is reconstructed from the Internet Archive's copy of my original article posted on April 7th, 2007. The original post was lost.]

Let’s count to a billion.

#include <iostream>

void Count(const unsigned long limit) {
  DWORD ticStart = ::GetTickCount();
  unsigned long counter = 0;
  for (unsigned long i = 0; i < limit; ++i) {
    ++counter;
  }
  DWORD ticStop = ::GetTickCount();

  std::cout << "Counting took "
            << ticStop − ticStart
            << "milliseconds."
            << std::endl;
}

int main() {
  Count(1000 * 1000 * 1000);
  return 0;
}

Computers are pretty fast. With optimization turned off, my computer reports that it took a mere 2547 milliseconds (or about 2.5 seconds) to count to a billion. With optimization turned on, the compiler realizes there are no side effects in the loop, so it replaces the loop with an assignment: counter = limit;, taking virtually no time.

In the ray tracer, I want to keep statistics. How many intersection tests were performed? How many rays per seconds were traced? How many of those were shadow rays? Essentially, this boils down to counting how many times certain operations are performed. From this experiment, I can see that incrementing a few counters won’t be a significant performance hit.

But the ray tracer is multithreaded. To keep n cores cranking, I’ve got n threads rendering different parts of the picture. So if I want an accurate count of how many times something happens, I’m going to have to put some sort of synchronization on the accesses to the counter.

void CountWithCriticalSection(const unsigned long limit) {
  CRITICAL_SECTION cs;
  ::InitializeCriticalSection(&cs);

  DWORD ticStart = ::GetTickCount();
  unsigned long counter = 0;
  for (unsigned long i = 0; i < limit; ++i) {
    ::EnterCriticalSection(&cs);
    ++counter;
    ::LeaveCriticalSection(&cs);
  }
  DWORD ticStop = ::GetTickCount();
  ::DeleteCriticalSection(&cs);

  std::cout << "Critical section counting took "
            << ticStop − ticStart
            << " milliseconds."
            << std::endl;
}

The example is still single threaded, but the code is now thread safe. The critical section prevents two threads from corrupting the counter by trying to modify it at the same time.

The bad news is that this example, optimized or not, took 73,719 milliseconds. Ouch. Thread safety has a price. And this is without any contention for the critical section. It would be worse with two threads incrementing the counter. Imagine how bad it would have been if we had used a mutex and had to make a billion round trips to kernel mode!

Fortunately, Windows provides a set of interlocked functions which can help us do simple, thread-safe operations like counting without the overhead of a critical section. One more version:

void CountInterlocked(const unsigned long limit) {
  DWORD ticStart = ::GetTickCount();
  volatile long counter = 0;
  for (unsigned long i = 0; i < limit; ++i) {
    ::InterlockedIncrement(&counter);
  }
  DWORD ticStop = ::GetTickCount();

  std::cout << "Interlocked counting took "
            << ticStop − ticStart
            << " milliseconds."
            << endl;
}

This version scores significantly better at 37,390 milliseconds, just over half the time of the uncontested critical section. But that’s still a lot of overhead to take on while rendering.

And a little back-of-the-envelope calculation delivers even more bad news. Suppose one of the counters is supposed to be the number of intersection tests performed in the rendering job. If were computing images at hi-def resolution (1920 by 1080), that’s about 2 million pixels. If, for anti-aliasing, we oversample with say 16 rays per pixel, we’re up to about 32 million primary rays. Each primary might spawn a secondary reflection or refraction ray. Each of those 64 million rays might require, say, 10 shadow rays for shading. With a good acceleration structure, we might get away with as few as 4 intersections per ray. Next thing you know, we’ve exceeded the capacity of a 32-bit unsigned integer. If not for a single frame, then certainly once we start animating at 24 or 30 frames per second. So it looks like we’ll need 64 bits for at least some of our counters.

But, unless you’re on a 64-bit architecture, there aren’t 64-bit versions of the interlocked APIs. That means we’re back to critical sections for making the counters thread-safe. Obviously, we need to rethink this.

The key to good performance with multiple threads is to eliminate shared data as much as possible. Suppose instead of one counter, each thread had its own. Each thread would be able to increment its own counter without any synchronization concerns. Then, when the render is done, the independent counters could be summed to determine the total count. This adds a small amount of complexity and coordination, but not much, and the benefit should be worth it.

And this brings me back to my singleton post. I’m now paying the price for not taking my own advice. The ray tracer’s statistics are based on a set of counters kept in a singleton, so now I have to do some extra refactoring to distribute the counters among thread-specific objects.

105% Complete

One of my UI pet peeves is misleading progress indicators.

This morning, I’m running a disk drive diagnostic program. The first test claimed it would take 200 seconds. After only 10 seconds, the display said “10% done”. Huh? Last time I checked, 10 ÷ 200 was 0.05, or 5%. Near the end of the test, the display said “100% done” about 10 seconds before it actually completed.

To a user, incorrect progress bars are annoying and useless. I don’t know about you, but when something is “100% done”, I think of it as, well, done. Not close to done. Those eternal seconds between “100% done” and actually done are immensely frustrating. People have even done research papers on designing progress bars to improve the user’s perception of a program’s progress. But this kind of research is light years ahead of real life, where we can’t even get a decent linear progress bar to work.

From my programmer’s perspective, misleading progress indicators are especially perplexing. You have to write extra code to get it wrong. Since programmers are lazy, nobody should be writing extra code, and all of our progress bars should just work.

Consider the disk drive diagnostic. You’d expect a plain jane linear progress indicator to be computed something like this:

indicator_value = work_done / total_work

Pretty simple, right? Ten seconds into a 200 second test should give us 0.05.

Now it’s likely that the calculation is done with integer arithmetic, and the final result will be presented as a percentage. So we have to scale up the numbers like so:

indicator_value = 100 * work_done / total_work

Now, 100 * 10 seconds ÷ 200 seconds gives us 5. So why in the world would the disk diagnostic claim 10%? Because somebody decided to do more work than was necessary.

Somebody apparently decided to have the indicator display progress in 10 percentage point increments (10%, 20%, 30%, etc.). Fine, you have to draw the line somewhere. So we revise our code to something like this:

indicator_value = 100 * work_done / total_work / 10 * 10

Mathematicians unfamiliar with integer arithmetic in most programming languages are now scratching their heads. The salient detail is that there’s an implicit floor function on the result of each division operation when working with integers. Conceptually, the above is equivalent to:

indicator_value = floor(floor(100 * work_done / total_work) / 10) * 10

Any programmer who find him- or herself at this point should stop, take a breath, and check off the progress indicator feature as complete.

But many programmers don’t stop there. Something bothers them. They instinctively worry about the “rounding down” that the integer division does. They want to fix it. They want to round up. Next thing you know, they’ve written an expression like this:

indicator_value = (100 * work_done / total_work + 5) / 10 * 10

Adding 5 (which is half of the 10 percentage point interval) will bias the number up. In many cases, this is the right thing to do. If you’re writing code to credit frequent flier miles to my account, I want you to round up. But we’re talking about code for a progress indicator. Don’t over-promise. Manage the user’s expectations.

It’s true that, 10 seconds into a 200 second operation, we’re closer to 10% than we are to 0%. But we’re not 10% done. We’re not even close. And, at the other end, you’ll show 100% done 10 seconds before we’re actually done. That’s an outright lie. User interfaces shouldn’t lie to users.

Don’t round up on progress indicators. Ever!

“But now users will panic,” some of you are sure to complain, “because it’ll seem like nothing is happening for 20 whole seconds!”

Perhaps so. Maybe it’s time to revisit the decision to show progress in 10 percentage point increments. You could have done even less work.

[Happy Pi Day!]

Review: Writing and Selling Your Mystery Novel

I’m at a stage where the best thing I could do to improve my writing would simply be to practice more. But every now and then, when I’m looking for some motivation to get back to my work in progress, I pick up one of those “how-to-write” books. There are some excellent ones (like Self-Editing for Fiction Writers and Writing the Breakout Novel) and some not-so-excellent ones. There are also some that are probably pretty good but whose advice is too intangible for me to incorporate into my skillset (e.g., Creating Character Emotions). Nowadays, I don’t have high expectations for these books. I just use them as a way to get back into the writing mood.

Cover of Writing and Selling Your Mystery Novel by Hallie Ephron

Writing and Selling Your Mystery Novel by Hallie Ephron

It was with this mindset that I picked up Writing and Selling Your Mystery Novel by Hallie Ephron, and boy was I pleasantly surprised. It’s easy for these books to overreach by trying to touch on all aspects of becoming a published author. Covering both writing and selling in a single book is ambitious, so my hopes for depth were low. Nevertheless, Ephron manages to pack a lot of great ideas into less than 250 pages. I found the portions specific to mysteries (e.g., tips on plotting, building suspense, and pacing) to be especially helpful.

Many of her specific suggestions I’d not read anywhere else before (and I’ve read a lot of these books). And while she doesn’t plunge into any of the ideas in great depth, in many cases depth wasn’t necessary. Many of her tools will go into my toolbox. Other tidbits immediately sparked several ideas that I am already applying to my work in progress. I look forward to revisiting her chapters on revision—if I ever get this first draft done.

The book is well organized. You can read your way through it pretty quickly (even if you do the exercises along the away). You can also use it as a reference by returning to individual chapters for a refresher as you move through the stages of writing your own story. Most of the chapters are short and cover very specific topics with thought-provoking discussion, concrete examples, worksheets, and checklists.

Ephron also interviewed several well-known mystery writers and let them contribute some of their own tips. For example, Lee Child offers tricks on how to avoid getting bogged down in description while writing what should be a fast-paced action sequence.

If you’re working on a mystery, or even a thriller, I’d easily recommend Writing and Selling your Mystery Novel. It won’t replace Self-Editing or Breakout Novel—you still need those (and fantastic blogs like Edittorrent). But you will get tangible, actionable ideas from Ephron’s work that you won’t find in the more general writing advisers.

Claiming My Chase Account

I had a long-term CD account at Washington Mutual, which was bought by Chase during the econopocalypse. In my latest quarterly statement (which Chase has managed to bloat into two pages), there was a note that I had to “claim” my account or they would be forced to turn it over to the FDIC. I called the number and here is roughly what transpired.

Recording: Thanks for calling Chase to claim your account. You can claim your account by executing a transaction on it.

Me: No, I can’t, not without incurring an interest penalty. It’s a long term CD that doesn’t mature until 2010.

Recording: Or by visiting a branch.

Me: I haven’t been inside a bank branch in years.

Recording: Or by logging into your Washington Mutual online banking account.

Me: I don’t have one of those.

Recording: Or by remaining on the line to talk to a banker.

Me: Now we’re getting somewhere.

Recording: If you’d like to participate in a customer satisfaction survey, please stay on the line after the banker disconnects, and you’ll be automatically connected to the survey.

Banker: Thanks for calling Chase. How can I help you?

Me: I got a note in my statement that I have to claim the CD account I had with Washington Mutual.

Banker: No problem. What’s the account number?

Me: [reads number from the statement]

Banker: Hmm. I don’t see an account with that number.

Me: It’s the number right from the statement.

Banker: Did you open this account in California?

Me: Yes. I opened it with Great Western Savings in California. After the savings and loan melt-down, Great Western Savings became Great Western Bank, which was bought by Washington Mutual, which closed my branch without notice, and now they’ve been bought by you.

Banker: Wow, you’ve had that CD for quite a while.

Me: Yes, and this is the most work I’ve ever had to put into maintaining it.

Banker: Maybe I can look it up by your name.

Me: [gives name, address, date-of-birth, and full Social Security number, and hopes that this isn't some scam]

Banker: Oh, OK, here it is. I see it was opened in 2002.

Me: Quicken says I’ve had it since at least 1992. I think that was after Great Western switch from an S&L to a bank.

Banker: Let me confirm the balance with you then. [tells me the balance]

Me: Yeah, that’s right.

Banker: OK, you’ve now claimed your account.

Me: Gee, that’s a relief, because it seems like you didn’t have it there for a minute. And I don’t really understand why I had to claim it. Every bank account I’ve ever had has changed hands at least once before, and I’ve never had to claim any of those.

Banker: It’s just a regulation, sir.

Me: Then it’s a regulation no other bank follows.

Banker: Is there anything else I can do for you?

Me: No, just keep my money safe, thanks.

Banker: Thanks for calling Chase. Goodbye.

Me: [waits on the line for the promised customer satisfaction survey]

Me: [… and waits]

Me: [… and waits]

Click. Dial tone.

Me: Note to self, move the CD when it matures next year.

Test-Driven Development

On a C++ programming forum, somebody asked a relatively simple question: How do you write a function that counts the number of lines in a text file? Most of the answers were friendly and helpful, but flawed.

As programmers, we often think of a problem as so trivial that we don’t take the time to think through the answers or to appreciate the subtlety of corner cases. In one of my favorite books, Programming Pearls, Jon Bentley talks about how hard it is for programmers to get fairly simple and routine problems correct on the first try. He writes:

I’ve assigned this [binary search] problem in courses at Bell Labs and IBM. Professional programmers had a couple hours to [implement a binary search] in the language of their choice; a high-level pseudocode was fine. At the end of the specified time, almost all the programmers reported that they had the correct code for the task. We would then take thirty minutes to examine their code, which the programmers did with test cases. In several classes and with over a hundred programmers, the results were varied little; ninety percent of the programmers found bugs in their programs (and I wasn’t always convinced of the correctness of the code in which no bugs were found). [p. 36]

Remember, these are professional programmers, in the language of their choice, with hours of time, messing up a binary search.

Notice the oh-so-common pattern here:

  1. Write code.
  2. Declare that it works.
  3. Test it and discover it’s broken.

It’s enough to make you wonder if there’s a better way. Many people in the extreme programming and agile crowds would say that we’re doing it all out of order. Not only should “declare it works” come after testing, but probably so should writing code. So let’s take this seemingly trivial line-counting problem and apply a little test-driven development in C++.

First, let’s agree on an interface. I propose:

template <typename InputIt>
size_t CountLines(InputIt begin, InputIt end);

That is, we’ll give the function a pair of iterators pointing to the beginning and end of the text we want examined, and the function will return the number of lines. (If you want to count the lines in a file, read the file into a std::string and call the function.) I’ve used size_t for the return type, as that’s typically unsigned (after all, the count will be 0 or more—it can’t be negative), and it’s generally as large an integer type as your platform will comfortably handle, so we shouldn’t have to worry about text consisting of too many lines. If you’re not a fan of the template parameters for the iterator types, feel free to re-write it with const char pointers.

Now that we know the interface, we will write the test function—even before we discuss how we’ll count lines. This function will iterate through a set of test cases, feed the input to CountLines, and check the return.

bool Test_CountLines() {
  struct TestCase {
    std::string input;
    size_t output;
  };

  const TestCase tests[] = {
    {"Hello\nWorld\n", 2}
  };

  const size_t cTests = sizeof(tests) / sizeof(tests[0]);

  size_t cPassed = 0;
  size_t cFailed = 0;
  for (size_t i = 0; i < cTests; ++i) {
    const TestCase &test = tests[i];
    size_t cLines = CountLines(test.input.begin(), test.input.end());
    if (cLines == test.output) {
      ++cPassed;
    } else {
      std::cerr << "FAILURE" << std::endl
                << "Input \"" << test.input << '"' << std::endl
                << "Output " << cLines << " instead of " << test.output << std::endl;
      ++cFailed;
    }
  }

  return cPassed == cTests && cFailed == 0;
}

I’ve started us with one test case, and we can add more as we develop the function. Just to make sure our testing framework is solid, let’s stub out CountLines and a little main function to make sure it all compiles and runs.

template <typename InputIt>
size_t CountLines(InputIt /* begin */, InputIt /* end */) {
  size_t cLines = 0;
  return cLines;
}

I commented out the parameter names to avoid the “unused parameter” warnings. These will come into play shortly.

int main() {
  assert(Test_CountLines());
  return 0;
}

If you put all this together and run it, you’ll see that the test fails as expected. This is good, as we want to make sure we exercise the failure case of our test function to know that it works.

Some of you are probably saying that you could have already written a correct CountLines function by now. Maybe. Remember all those “helpful” answers on the programming forum? Remember all those professional developers attempting a binary search? Besides, it really didn’t take long to build this test infrastructure. If a bug ever does crop up in the function, we can easily debug it. When we fix a bug, we can immediately verify that we didn’t create a regression by re-checking every test case we’ve ever come up with. And you can add that new test case so that it never breaks again. If you change something fundamental, like the compiler or OS you’re using one, you can know with certainty whether or not it affects the behavior of your function.

Next we need test cases. Off the top of my head, here are some interesting cases we’ll have to design for:

  • An empty string should have zero lines.
  • The count of a string with blank lines or empty lines (e.g., “Hello\n\nWorld!\n”) should include those blank lines.

Doesn’t seem too hard yet, so let’s add those test cases.

const TestCase tests[] = {
  {"", 0},
  {"Hello\nWorld\n", 2},
  {"Hello\n\nWorld\n", 3}
};

It seems pretty “obvious” that the way to count the lines is to count the number of newline (‘\n’) characters in the string:

template <typename InputIt>
size_t CountLines(InputIt begin, InputIt end) {
  return std::count(begin, end, '\n');
}

Ta da! Test-driven development has led us directly to one of the common—but incorrect—answers given on the programming forum.

Consider the string “Hello”. That has zero newline characters, but most people would consider it to be one line long. Too bad we didn’t think of that while coming up with our test cases. Now that we have thought of it, the first thing to do is to add it to our test code.

const TestCase tests[] = {
  …
  {"Hello", 1},
  …
};

We might be tempted to treat this new string as a special case.

// Ugly -- don't do this!
template <typename InputIt>
size_t CountLines(InputIt begin, InputIt end) {
  size_t cLines = std::count(begin, end, '\n');
  if (cLines == 0 && begin != end) {
    cLines = 1;
  }
  return cLines;
}

While this hack will pass our tests, it’s ugly and still wrong. We’ve failed to realize that the “Hello” string is just a special case of not having a newline character on the last line in the string. (This is an extremely common bug in the wild. Many parsers choke if the last line of the input doesn’t end with a newline.) So we add another test case:

const TestCase tests[] = {
  …
  {"Hello\nWorld", 2},
  …
};

When adding a test case, it’s good to think of counter-examples. For instance, if the last line of the string does end with a newline, that shouldn’t trigger an extra line. Fortunately, we’ve already got a test case for that.

Now our original approach is starting to look bad. Simply counting the newline characters, even with some hacky fix-ups for “corner cases”, isn’t going to work.

All is not lost. Yes, we have to completely re-think how we’re going to implement CountLines, but none of that invalidates the test framework we’ve built. When we come up with our new solution, we can immediately check to see if it solves the new test cases as well as the others we’ve already made. Suddenly it looks like the time we’ve put into the test framework is going to pay off.

There are a few ways to solve the line counting problem, but they all boil down to a state machine. Here’s my solution:

template <typename InputIt>
size_t CountLines(InputIt begin, InputIt end) {
  size_t cLines = 0;
  bool bStartOfLine = true;
  for ( ; begin != end; ++begin) {
    if (bStartOfLine) {
      ++cLines;
    }
    bStartOfLine = *begin == '\n';
  }
  return cLines;
}

In this case, there’s not much state—just a boolean. If we’re at the beginning of a line, and there’s anything on it (even a ‘\n’), then we need to count the line. After that, we simply need to figure out if the next character will be at the start of a line or not. This solution passes all five of our test cases.

Some say that you shouldn’t “waste time” on the corner cases, since, well, they’re corner cases. In my experience, though, if you get the corner cases right, then the mainstream cases generally just work. After all, the corner cases should just be degenerate versions of the mainstream case—or vice versa.

Uh oh, another bug. We’ve been considering ‘\n’ to be the newline character. Text files on Multics-derived operating systems use a single ASCII line feed character (LF) to indicate a line break, but that’s far from universal. CP/M, DOS, Windows and a lot of Internet protocols require a carriage return (CR) followed by LF to mark a line break. C and C++ file i/o routines will translate whichever convention your platform uses to and from ‘\n’. Most compilers give ‘\n’ the value 0x0A, which happens to be the same as an ASCII LF. On Unix systems, the compiler’s i/o routines have to do no translation. On others, when you write ‘\n’, the output routine actually writes CR-LF, and when the input routine reads CR-LF, it returns ‘\n’. (Read all about newlines on Wikipedia.)

But if we were to read in the data in binary mode (which could be faster), then we’ll see the actual bytes used in the file. So we have to add test cases. First, we’ll add the CR+LF cases. Then we’ll conditionally add ones with explicit LFs in case our compiler doesn’t use LF for ‘\n’.

const TestCase tests[] = {
  …
  {"Hello\x0D\x0AWorld", 2},
  {"Hello\x0D\x0AWorld\x0D\x0A", 2},
  {"Hello\x0D\x0A\x0D\x0AWorld\x0D\x0A", 3},
#if '\n' != '\x0A'
  {"Hello\x0AWorld", 2},
  {"Hello\x0AWorld\x0A", 2},
  {"Hello\x0A\x0AWorld\x0A", 3}
#endif
};

Happily, our latest version of CountLines works with these new examples. Essentially, the CR characters aren’t important to the algorithm, and our compiler does assign LF for ‘\n’.

If you did read the Wikipedia article, you know that there are several more newline conventions that we might want to support. Old Mac and Amiga files use just a CR, old IBM mainframes used NEL from the EBCDIC character set, Unicode introduced LINE SEPARATOR and PARAGRAPH SEPARATOR code points, etc. I’m sure you could imagine refining CountLines to handle these additional cases. Essentially, the state machine grows in complexity.

Conclusion

OK, this has been a long blog post, mostly because I’ve shown the iterative process necessary to solve the problem. What we’ve seen is that even the simplest problems might be more complex than we anticipate. The more thought we put into coming up with test cases, the more robust an implementation we were able to write. And when we had to start the implementation over from scratch, having a unit test framework helped us quickly ensure that the new implementation was at least as correct as the previous one.