Tom's Programming Blog: February 2012

So I recently wrote a post about good old reliable Apache Commons StringUtils, which provoked a couple of comments, one of which was that Google Guava provides better mechanisms for joining and splitting Strings. I have to admit, this is a corner of Guava I've yet to explore. So I thought I ought to take a closer look and compare it with StringUtils, and I was surprised at what I found.

Splitting strings eh? There can't be many different ways of doing this surely?

Well, Guava and StringUtils do take stylistically different approaches. Let's start with the basic usage.

// Apache StringUtils...
String[] tokens1 = StringUtils.split("one,two,three",',');

// Guava splitter...
Iterable<String> tokens2 = Splitter.on(',').split("one,two,three");

So, my first observation is that Splitter is more object-oriented. You have to create a splitter object, which you then use to do the splitting. The StringUtils split methods, on the other hand, use a more functional style, with static methods.

Here I much prefer Splitter. Need a reusable splitter that splits comma separated lists? A splitter that also trims leading and trailing white space, and ignores empty elements? Not a problem:

Splitter niceCommaSplitter = Splitter.on(',')
                              .omitEmptyStrings()
                              .trimResults();

niceCommaSplitter.split("one,, two,  three"); //"one","two","three"
niceCommaSplitter.split("  four  ,  five  "); //"four","five"
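To make it clear what those two options are doing, here's a rough sketch of the same behaviour written against the plain JDK. The class and method names (CommaSplitSketch, splitTrimOmitEmpty) are made up for illustration, and this is not how Guava implements it, just one way of getting the same tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Guava or Commons Lang.
class CommaSplitSketch {

    // Splits on ',', trims each token, and drops tokens that are empty after trimming.
    static List<String> splitTrimOmitEmpty(String input) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start <= input.length()) {
            int comma = input.indexOf(',', start);
            int end = (comma == -1) ? input.length() : comma;
            String token = input.substring(start, end).trim();
            if (!token.isEmpty()) {
                tokens.add(token);
            }
            start = end + 1;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(splitTrimOmitEmpty("one,, two,  three")); // [one, two, three]
        System.out.println(splitTrimOmitEmpty("  four  ,  five  ")); // [four, five]
    }
}
```

That's a fair bit of fiddly index handling to write and test yourself, which is a big part of why the configurable Splitter is appealing.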

That looks really useful, any other differences?

The other thing to notice is that Splitter returns an Iterable<String>, whereas StringUtils.split returns a String array.

Don't really see that making much of a difference, most of the time I just want to loop through the tokens in order anyway!

I also didn't think it was a big deal, until I examined the performance of the two approaches. To do this I tried running the following code:

final String numberList = "One,Two,Three,Four,Five,Six,Seven,Eight,Nine,Ten";

long start = System.currentTimeMillis();
for (int i = 0; i < 1000000; i++) {
    StringUtils.split(numberList, ',');
}
System.out.println(System.currentTimeMillis() - start);

start = System.currentTimeMillis();
for (int i = 0; i < 1000000; i++) {
    Splitter.on(',').split(numberList);
}
System.out.println(System.currentTimeMillis() - start);

On my machine this output the following times:

594
31

Guava's Splitter is almost 20 times faster!

Now this is a much bigger difference than I was expecting: 594ms against 31ms makes Splitter roughly 19 times faster than StringUtils. How can this be? Well, I suspect it's something to do with the return type. Splitter returns an Iterable<String>, whereas StringUtils.split gives you an array of Strings! So Splitter doesn't actually need to create any new String objects up front.

It's also worth noting you can cache your Splitter object, which results in an even faster runtime.

Blimey, end of argument? Guava's Splitter wins every time?

Hold on a second. This isn't quite the full story. Notice we're not actually doing anything with the resulting Strings? Like I mentioned, it looks like the Splitter isn't actually creating any new Strings. I suspect it's deferring this work to the Iterator object it returns.
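To show what I mean by deferring, here's a minimal sketch of a lazy splitter written against the plain JDK. To be clear, this is my own illustration and not Guava's source code: LazySplitSketch and its substringsCreated counter are hypothetical names, there only to make it visible when the work actually happens.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical illustration of a lazy splitter; this is NOT Guava's source code.
class LazySplitSketch {

    // Counts how many substrings have been created, so the deferral is visible.
    static int substringsCreated = 0;

    // Returns immediately; no scanning or substring creation happens here.
    static Iterable<String> split(final String input, final char separator) {
        return () -> new Iterator<String>() {
            private int start = 0; // index where the next token begins; -1 when done

            @Override
            public boolean hasNext() {
                return start != -1;
            }

            @Override
            public String next() {
                if (start == -1) {
                    throw new NoSuchElementException();
                }
                int end = input.indexOf(separator, start);
                String token = (end == -1) ? input.substring(start) : input.substring(start, end);
                start = (end == -1) ? -1 : end + 1;
                substringsCreated++;
                return token;
            }
        };
    }

    public static void main(String[] args) {
        Iterable<String> tokens = split("One,Two,Three", ',');
        System.out.println(substringsCreated); // 0: nothing has been split yet

        for (String token : tokens) {
            token.length();
        }
        System.out.println(substringsCreated); // 3: the work happened during iteration
    }
}
```

If Splitter works along these lines, a benchmark that never iterates over the result is only timing the cost of creating a small wrapper object.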

So can we test this?

Sure thing. Here's some code to repeatedly check the lengths of the generated substrings:

final String numberList = "One,Two,Three,Four,Five,Six,Seven,Eight,Nine,Ten";

long start = System.currentTimeMillis();
for (int i = 0; i < 1000000; i++) {
    final String[] numbers = StringUtils.split(numberList, ',');
    for (String number : numbers) {
        number.length();
    }
}
System.out.println(System.currentTimeMillis() - start);

Splitter splitter = Splitter.on(',');
start = System.currentTimeMillis();
for (int i = 0; i < 1000000; i++) {
    Iterable<String> numbers = splitter.split(numberList);
    for (String number : numbers) {
        number.length();
    }
}
System.out.println(System.currentTimeMillis() - start);

On my machine this outputs:

609
2048

Guava's Splitter is over 3 times slower!

Indeed. I was expecting them to be about the same, or maybe Guava slightly faster, so this is another surprising result. It looks like by returning an Iterable, Splitter is trading immediate gains for longer-term pain. There's also a moral here about making sure performance tests are actually testing something useful.
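On that last point, here's a rough sketch of a timing loop that at least uses every token it produces. The names (SplitTimingSketch, totalLength) are hypothetical, and the splitter shown is a JDK-only stand-in built on String.split, so the sketch runs without either library on the classpath. Treat it as a starting point under those assumptions rather than a rigorous benchmark.

```java
import java.util.Arrays;
import java.util.function.Function;

// Hypothetical timing harness; the names here are illustrative, not from any library.
class SplitTimingSketch {

    // Runs the splitter repeatedly and sums token lengths so every token is actually used.
    static long totalLength(Function<String, Iterable<String>> splitter, String input, int iterations) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            for (String token : splitter.apply(input)) {
                total += token.length();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String numberList = "One,Two,Three,Four,Five,Six,Seven,Eight,Nine,Ten";
        // Stand-in splitter using only the JDK; swap in StringUtils or Splitter to compare them.
        Function<String, Iterable<String>> jdkSplit = s -> Arrays.asList(s.split(","));

        totalLength(jdkSplit, numberList, 100_000); // warm-up pass, result discarded

        long start = System.nanoTime();
        long total = totalLength(jdkSplit, numberList, 1_000_000);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;

        // Printing the total keeps the result of the loop observable.
        System.out.println(total + " characters in " + elapsedMillis + " ms");
    }
}
```

Summing the lengths and printing the total means the loop has a visible result, and the warm-up pass gives the JIT a chance to compile the code before the clock starts.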

In conclusion I think I'll still use Splitter most of the time. On small lists the difference in performance is going to be negligible, and Splitter just feels much nicer to use. Still I was surprised by the result, and if you're splitting lots of Strings and performance is an issue, it might be worth considering switching back to Commons StringUtils.

Useful SVN commands

I've been increasingly using command line SVN these days. I find it's just a bit quicker and more reliable than the GUI clients I had been using. I've mainly written this for myself, as a quick reference for the commands I use most frequently.

svn help COMMAND

Displays help for a particular svn command.
Lists all available commands if none is specified

svn co URL[@REV]

Check out: Creates a local working copy of the repository found at the URL

svn log

Provides a log of commit messages
Use with -l 10 to limit to 10 most recent messages
Use -v for verbose output (lists changed files)
Use -r 12345 to get info on particular revision

svn info

Provides useful information about the current working copy, such as repository URL, and current revision

svn status

Provides a list of differences between the working copy and the repository. Take a look at the help (svn help status) to find out what the different column values mean.

svn diff

Displays local modifications
Use -r N:M to display differences between two revisions
Use -c N to see the changes made in revision N

svn revert PATH

Reverts the specified path to the contents of the repository. You will lose any local changes!
Not recursive by default, use -R to make recursive

svn add PATH

Adds the specified path to version control.
Note, this doesn't add the file to the repository yet; that doesn't happen till you commit.

svn copy SRC[@REV] DEST

Copies something from SRC to DEST.
SRC and DEST can each be either a working copy path or a URL.
Usual usage would be WC -> WC, or URL -> URL.
URL -> URL is used for branching and tagging

svn delete TARGET

Deletes a file.
If TARGET is a working copy path, the file is scheduled for deletion on the next commit
If TARGET is a repository URL, it immediately deletes the file from the repository.

svn move SRC DST

Moves the specified target.
Equivalent of a copy then a delete
Maintains history on the moved file.

svn commit -m MESSAGE

Commits the working copy changes to the repository

svn list TARGET

Provides a directory listing for the specified folder in the repository

svn mkdir TARGET

Creates a directory
TARGET can be working copy path or repository URL

svn merge -r N:M SOURCE@REV

Merges the range of revisions starting at N and ending at M from SOURCE into the current working copy
If N > M then it is a reverse merge, and can be used to undo the differences between N and M
Typically used to catch up a feature branch
Can use -c option to pick a single revision
Can supply multiple -c and -r options to cherry pick revisions

svn merge --reintegrate SOURCE@REV

Used to reintegrate a branch into its parent branch
Working copy should be the parent branch (often trunk)
A branch cannot be reintegrated twice, so it's good practice to delete the branch afterwards

svn blame TARGET@REV

Outputs target, with author names, and revision numbers attached to changes
Use to see who broke what.


Monday, 27 February 2012

Guava Splitter vs StringUtils

Monday, 6 February 2012

Useful SVN commands
