Monday, May 16, 2016

Topcoder problems - Tokenizing

One of the most common tasks in topcoder problems is tokenizing a string into individual elements. Almost always the entries are separated by spaces.

The naive and obvious version

Seen surprisingly often in various avatars, it is a C mode of thinking goes something like this:

Very verbose and tedious!
Of course on topcoder they usually write tersely equivalent code on a single line #defined as a macro or an inline function.

Can we do better?

Here is a more C++ like version

Still smells like C

How about

Think in functional terms, get the loop mentality out of the way - The STL is your friend :

Nice and neat!

Can this be a one-liner?

Let's try putting everything together...

Unfortunately this is not correct. The temporary istringstream dies after the istream iterator is constructed, so bad things will happen. The constructor of the istream_iterator takes a istream& rather than a const istream& otherwise it would be workable - See this
Actually should have been obvious...
You cannot read stuff out of a const reference to a stream since it changes state when you do.

What about other delimiters?

C++ streams are tied to locales, so they read stuff based on the definition of isspace() for the current locale. There seems to be no way to manufacture our own locales for splitting strings on things other than whitespace, so we need a different approach:

Neat huh?
If that looks like Greek and Latin to you, don't worry, it looks that way to me too, three days after writing it.

Let's analyze what the heck is going on here...

The intent is as follows - we want to maintain two pointers/iterators - one points to the start of a word, and the other points to a separator (or end of string)

The for loop starts by initializing ps(pointer to start), pe(pointer to end-of-string).
Since we use auto we have to assign something to pd(pointer to delimiter), so might as well assign ps to it. We are going to loop until ps reaches pe.

Next we place the next available token into the result vector - emplace_back() is a function that can push an element onto a vector without copying the value passed into it - it uses move semantics to achieve this. Whenever you push a temporary object onto a vector, use emplace_back()
Remember never to reuse the emplaced object again, or else...

OK, but what the hell is "the next available token" mentioned above?
It is a string consisting of an iterator range - the start of the range is obviously ps. The end of the range is the next delimiter in the string, or the end of the string if no such delimiter exists.
Well that's exactly what find_first_of() does - returns an iterator in a range that has any one of the values you give it (here the characters in the string seps). If not found, it returns the end of the range, which is what we want.

So all we need to do is set ps to the point after the delimiter, so we can grab the next token - this is done in the increment part of the for loop. If it goes past the end, were done. Assigning pd on the fly inside the expression is a neat way to avoid one extra line of code. Remember that everything in C/C++ except control statements return a value (In fact and BTW, in GNU C, there is an extension that lets every statement be a value).

Not too bad at all! I can't think of a more neat version than this. I have some vague idea about using adjacent_find(), any_of() and so on, but I suspect it won't be as neat as this.

One thing you might notice is that I use pointers instead of iterators. The point is that pointers are the best kind of iterator, and if you know that the underlying object is contiguous, one might as well use a pointer than an iterator (the compiler will actually generate almost the same code for an iterator in this case). There is nothing evil about pointers, as long as you use them sparingly in a type safe way.

Here's how it looks in the brief form using the macros from my topcoder template:

Try writing a similar split in your favorite language for kicks!

No comments:

Post a Comment