Converting Text into a Sorted List

Ever had a list that's sorta broken up into different lines but is also mostly just uses spaces to delimit items? Ever wanted each item in that list on its own line? Ever want that one-line-per-item list sorted? It's shocking how often I need to do this. Actually it's probably more unfortunate than shocking.

If you find yourself needing to do all that too then you're in luck! It turns out there's plenty of easy ways to turn a bunch of words into a sorted list!

First, a few notes...

Since I'm on OSX I'm accessing my clipboard with pbcopy and pbpaste. If you're on Linux with X11 you can use xclip or xsel instead. Obviously you can replace the clipboard paste with some other output command and you can omit the clipboard copy or replace it with some other command.

All of these examples use grep and sort. grep is used here to remove blank lines, I'll just leave it at that since a book could be (and has been apparently) written about grep. sort does exactly what you'd expect, it sorts a file or input. The -f option makes the sorting case insensitive. If you do really want words that start with capital letters to go first then omit that option.

sed

% pbpaste | sed 's/[[:space:]]/\'$'\n'/g | grep -v '^[[:space:]]*$' | sort -f | pbcopy

sed is ancient. Despite its age it remains incredibly powerful and versatile. If you're on OSX or some other BSD variant then your sed will function somewhat differently from GNU sed. I won't waste a bunch of space explaining the details here, but this Unix & Linux Stack Exchange question explains it nicely. Basically BSD sed doesn't do escape sequences in output. The best solution I've seen to the problem is in this Stack Overflow comment. If you're on Linux and using GNU sed then this is what you'd do:

% xclip -o -selection clipboard | sed 's/[[:space:]]/\n/g' | grep -v '^[[:space:]]*$' | sort -f | xclip -i -selection clipboard

The s command takes a regular expression, a replacement string, and optionally one or more flags in the form "s/regular expression/replacement string/flags". The g flag, like it does most places, makes the substitution global.

tr

% pbpaste | tr -s '[:space:]' '\n' | grep -v '^[[:space:]]*$' | sort -f | pbcopy

tr is similar to sed but much simpler. So simple there isn't much to say. The first argument is a set of characters to replace and the second argument is a corresponding set of characters to replace the first with one-to-one. The -s option squeezes consecutive occurrences of the replacement characters to into a single character.

awk

% pbpaste | awk '{gsub(/[[:space:]]+/, "\n"); print}' | grep -v '^[[:space:]]*$' | sort -f | pbcopy

awk reads each line and executes the action inside the curly braces for each line. In our case we're using gsub to do a global substitution and then unconditionally printing the line. awk does far more than simple substitution and printing so there's probably a million different ways to accomplish this task. I've met several people who swear by awk, and I can understand why. Personally, I find it to be too awkward (pun sorta intended) for serious use given that alternatives with far fewer rough edges and more extensibility exist.

Ruby

% pbpaste | ruby -ne 'puts $_.gsub(/[[:space:]]+/, "\n")' | grep -v '^[[:space:]]*$' | sort -f | pbcopy

This right here is actually the biggest reason why I'm writing this post. Whenever I'm faced with a task involving transforming text my natural inclination is to write a small throwaway script in Ruby to get the job done. Usually those scripts end up being fairly elaborate and proper, in the sense that they could easily be part of an actual program. I like to make it a habit to not write overly terse code. Even when I know I'm going to throw it all away I like my code to be readable with nice descriptive variable names and no magical short cuts. That being said, this article inspired me to venture forth and try my hand at something arcane and nigh unreadable. I try and avoid writing Ruby that looks like 1990's Perl, but the -n option coupled with -e is just too cool to ignore. I will, however, choose to ignore that the Ruby example looks almost exactly like the awk example. Personally I don't think that's a very flattering comparison.

If all of this seems familiar, it's probably because you've seen Avdi Grimm's excellent post on solving almost the same problem in several different languages.