String manipulation in bash

I spend a lot of my time working with, moving, modifying and renaming large numbers of files. Managing this much data is a pain if you do it manually, so I always end up automating as much of it as possible. Although this can be fairly easy if you write a simple script, e.g. in Python, reaching for a general-purpose language is often overkill. Most, if not all, tasks that can be handled manually in a shell can also be automated there. Here I’ll show some tricks for automating string manipulation tasks in my favorite shell, bash.

Why bash?

Simply put, it’s a matter of taste. If you like another shell more, stick with it. There are a few reasons I use bash:

    • It’s almost always available. Bash is the default shell on many UNIX-like systems. I have rarely experienced it not being the default shell on a given machine, and even when it isn’t, it’s usually installed anyway.
    • It’s POSIX-conformant, at least for the most part. That means that if you write a script for bash, it will likely run in other POSIX shells as well. That is, unless you do esoteric, bash-specific things (some of which I will show in this post).
    • It’s got a huge community. Everybody’s drinking the Kool-Aid! That means, if you have a question, somebody else has already probably asked it, and also found the answer, and it’s probably posted somewhere where you can find it. Also, the shell continues to be developed – as of this writing, the latest release was only 6 months old.
    • It’s got tons of usability features. There are tons of keyboard shortcuts that will make your life easier – when working interactively, there’s absolutely no reason why your hands should have to leave the home keys. The collection of productivity shortcuts is so large that I still continue to learn new ones, even though I’ve been working with bash on a daily basis for several years.

In short, bash is omnipresent and highly compatible with other shells. You know it’ll be there and you know that if it executes your script, other shells probably will too. I would not recommend scripting esoteric, bash-specific stuff, and of course the keyboard shortcuts don’t help you in a script, but in interactive sessions they’re a great help.

Show me the money!

In bash, everything is a string: variables, variable names, etc. When you enter a command on the command line, the shell expands it before executing it. That means that if you use a ‘*’, for example, it will be replaced with the file names that match the pattern before the command ever sees it. Keep that in mind when working with the shell. Any time you manipulate something in bash, you’re actually manipulating a string.
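To make the expansion step visible, here is a small sketch. The scratch directory and the file names in it are made up for the demonstration; the point is only that the unquoted pattern is replaced before echo runs, while the quoted one is passed through untouched.

```bash
# Hypothetical scratch directory, so the example does not depend on your own files
demo_dir=$(mktemp -d)
touch "$demo_dir/a.txt" "$demo_dir/b.txt" "$demo_dir/notes.md"

# Unquoted: the shell substitutes the matching file names before echo runs
unquoted=$(cd "$demo_dir" && echo *.txt)
# Quoted: the pattern reaches echo as a literal string
quoted=$(cd "$demo_dir" && echo "*.txt")

echo "$unquoted"
# a.txt b.txt
echo "$quoted"
# *.txt

rm -r "$demo_dir"
```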

Let’s see some cool tricks. Keep in mind that these are now bash-specific and are meant to be executed interactively, not in a script – if you do so in a script you won’t have the guarantee that other shells will give you the same results – even if this often is the case.

Brace expansion

# Form: $prefix{comma,separated,list}$suffix
# $prefix and $suffix can be used together or alone.

You can use brace expansion to permute through string combinations, like this:

echo {Daniel,Bruce}" Lee"
# Daniel Lee Bruce Lee

This permutes through each comma-separated string inside the curly braces with the adjacent string(s). In this case, I permute through all combinations of “Daniel” and “Bruce” followed by “ Lee”. You can, of course, also permute through combinations with a prefixed string, or a combination of prefixes and suffixes:

echo "Daniel the "{Great,Terrible}
# Daniel the Great Daniel the Terrible
echo {"Daniel ","Danny "}{Lee,Glover}
# Daniel Lee Daniel Glover Danny Lee Danny Glover
echo "Something "{good,bad}" is gonna happen."
# Something good is gonna happen. Something bad is gonna happen.

Note that the quotes “protect” the spaces in the strings so that the shell doesn’t interpret them to be meaningful.
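Brace expansion is purely textual, so it also works for names of files that don't exist yet. The sketch below uses file names modelled on the example data later in this post (the heights are picked arbitrarily) and a made-up data.txt to show the common "file plus backup name" idiom with an empty alternative.

```bash
# Generate a list of file names; no such files need to exist
names=$(echo FINO1_Windrichtung_{33,40,50}m.dat)
echo "$names"
# FINO1_Windrichtung_33m.dat FINO1_Windrichtung_40m.dat FINO1_Windrichtung_50m.dat

# An empty alternative is allowed: this yields the name and the name plus .bak
backup_args=$(echo data.txt{,.bak})
echo "$backup_args"
# data.txt data.txt.bak
```

The second form is handy as an argument list, e.g. for cp, because it spares you typing the file name twice.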

Reporting a string’s length

# Form: ${#string_to_find_length_of}

In order to do this, you need to save the string in a variable – passing a string straight into the expression will result in an error.

myString="This is a string."
echo ${#myString}
# 17
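One place where the length comes in handy is a quick sanity check. The following is only a sketch; the variable name date_field and the expected width of 8 characters (a YYYYMMDD date) are assumptions for illustration.

```bash
# Hypothetical field that should hold a date in YYYYMMDD form
date_field=20130802

if [ "${#date_field}" -eq 8 ]; then
  length_check="looks like YYYYMMDD"
else
  length_check="unexpected length ${#date_field}"
fi
echo "$length_check"
# looks like YYYYMMDD
```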

Simple arithmetic

# Form: $(($variable $operator $variable ...))

Simple arithmetic can be done like this:

echo $((1+2))
# 3
echo $((10-5))
# 5
echo $((2*3*4))
# 24

Since bash expands all expressions first, from the inside to the outside, you can do arithmetic with an expression that produces a number. Say you want to find the index position of the third-from-last character in a string:

echo $((${#myString}-3))
# 14
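Two details are worth knowing here: inside $(( )) variables can be written without the leading $, and the arithmetic is integer-only, so division truncates. A short sketch, with arbitrary values for a and b:

```bash
a=7
b=2

# Variables may be referenced by bare name inside $(( ))
quotient=$((a / b))    # integer division, the fractional part is dropped
remainder=$((a % b))
echo "$quotient"
# 3
echo "$remainder"
# 1

# A typical use: incrementing a counter
count=0
count=$((count + 1))
echo "$count"
# 1
```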

String slicing

# Form: ${variable:start_position:number_of_characters}
# number_of_characters is optional

You can slice a string from a given position. If you specify the number of characters to slice, you’ll get that many characters, otherwise you’ll get the slice to the end. Bash works with a 0-based index.

numbers=0123456789
echo ${numbers:5}
# 56789
echo ${numbers:2:3}
# 234

If you’re interested in slicing from the end of the string, you can do something like this (although it’s not very readable, so never write this into a script):

echo ${numbers:$((${#numbers}-6))}
# 456789

That reports the characters starting at the sixth-from-last position in the variable $numbers.
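Bash also accepts a negative start position, counted from the end of the string, which avoids the nested arithmetic. The space before the minus sign matters: without it, bash reads ":-" as the "use a default value" expansion. A sketch, reusing the numbers variable and a file name from the example later in this post:

```bash
numbers=0123456789

# Negative offset: start six characters from the end
last_six=${numbers: -6}
echo "$last_six"
# 456789

# Combined with a length: the 8-character end date sits 12 characters from the end
my_file=FINO1_Windrichtung_50m_20130802_20140630.dat
end_date=${my_file: -12:8}
echo "$end_date"
# 20140630
```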

Expression-based substring deletion

# Form: ${variable#delete_shortest_from_front}
# Form: ${variable##delete_longest_from_front}
# Form: ${variable%delete_shortest_from_back}
# Form: ${variable%%delete_longest_from_back}

These expressions delete the shortest/longest matching substring from the front/back of a string:

myString=asdfASDFasdf
echo ${myString#*s}
# dfASDFasdf
echo ${myString##*s}
# df
echo ${myString%s*}
# asdfASDFa
echo ${myString%%s*}
# a
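A very common use of these four forms is taking a path apart. The following sketch assumes a made-up absolute path; only the file name itself comes from the example data in this post.

```bash
# Hypothetical path for illustration
path=/home/me/original/FINO1_Windrichtung_50m_20130802_20140630.dat

file_name=${path##*/}       # delete longest match of */ from the front
dir_name=${path%/*}         # delete shortest match of /* from the back
stem=${file_name%.*}        # drop the extension
extension=${file_name##*.}  # keep only the extension

echo "$file_name"
# FINO1_Windrichtung_50m_20130802_20140630.dat
echo "$dir_name"
# /home/me/original
echo "$stem"
# FINO1_Windrichtung_50m_20130802_20140630
echo "$extension"
# dat
```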

Substring replacement

# Form: ${variable/search_term/replace_first}
# Form: ${variable//search_term/replace_all}
# Form: ${variable/#prefix_pattern/replace}
# Form: ${variable/%suffix_pattern/replace}

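The syntax resembles the s/search/replace/ command of tools like sed, except that the search term is a shell pattern (glob) rather than a regular expression.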

myString=asdfgASDFGasdfg
echo ${myString/sdf/123}
# a123gASDFGasdfg
echo ${myString//sdf/123}
# a123gASDFGa123g
echo ${myString/#as/beginning_matches}
# beginning_matchesdfgASDFGasdfg
echo ${myString/#doesnt_match/beginning_matches}
# asdfgASDFGasdfg
echo ${myString/%fg/suffix_matches}
# asdfgASDFGasdsuffix_matches
echo ${myString/%doesnt_match/suffix_matches}
# asdfgASDFGasdfg
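Because the search term is a shell pattern, character classes work too, and an empty replacement deletes the matches. A sketch using a file name from the example later in this post; the replacement strings are arbitrary.

```bash
my_file=FINO1_Windrichtung_50m_20130802_20140630.dat

# Replace a literal substring once
renamed=${my_file/Windrichtung/wind_direction}
echo "$renamed"
# FINO1_wind_direction_50m_20130802_20140630.dat

# Replace every digit, using a character class as the pattern
masked=${my_file//[0-9]/N}
echo "$masked"
# FINON_Windrichtung_NNm_NNNNNNNN_NNNNNNNN.dat

# Empty replacement: delete every underscore
no_underscores=${my_file//_/}
echo "$no_underscores"
# FINO1Windrichtung50m2013080220140630.dat
```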

Tying it together: A concrete example

So how does this help you in real life? Here’s a concrete example of something I had to do the other day. I had a number of files in a large directory. Here is a subset of the data:

me@localhost:~/original> ls FINO1_Windrichtung_*
FINO1_Windrichtung_33m_20130802_20140630.dat  FINO1_Windrichtung_70m_20130802_20140630.dat
FINO1_Windrichtung_40m_20130802_20140630.dat  FINO1_Windrichtung_80m_20130802_20140630.dat
FINO1_Windrichtung_50m_20130802_20140630.dat  FINO1_Windrichtung_90m_20130802_20140630.dat
FINO1_Windrichtung_60m_20130802_20140630.dat

The prefix “FINO1_Windrichtung_” stood for wind direction observations recorded at the FINO1 station. I had a number of such observations. Each file was composed of a header that was 6 lines long, followed by data. The file name gave the station where the observations were recorded, followed by the variable measured, the measurement height, and the time span covered.

I’d already downloaded data for these sensors at that station previously, and it was formatted the same. My goal was to merge them. Because the data in the header was no longer relevant, all I was interested in doing was taking the observations recorded for each sensor at each height and extending the original file with the new observations. So I started building the command from the inside out.

Since the header in the new file didn’t matter, I printed the rest of the file’s contents to stdout, using tail, whose -n +7 option starts the output at line 7 and so skips the six header lines:

tail -n +7 "$my_file"
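When dropping a header it is worth checking the command on a throwaway file first, since head and tail count from opposite ends: tail -n +7 prints from line 7 onward, whereas GNU head -n -6 prints everything except the last six lines. A sketch with a made-up eight-line file (six header lines, two data lines):

```bash
# Hypothetical sample file: 6 header lines followed by 2 data lines
sample=$(mktemp)
printf 'h1\nh2\nh3\nh4\nh5\nh6\nd1\nd2\n' > "$sample"

# Start output at line 7, i.e. skip the 6-line header
without_header=$(tail -n +7 "$sample")
echo "$without_header"
# d1
# d2

rm -f "$sample"
```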

That worked, so I knew that I could redirect the stream from stdout to the file of my choice. Now the trick was to find the file containing the older data. This would, of course, be no problem to do manually, but I had several hundred files so I automated it. The files were in another folder, “combined”, and looked the same, except that the dates were different. So the first thing I did was strip off the dates:

my_file=FINO1_Windrichtung_50m_20130802_20140630.dat
shortened=${my_file%_*}
shortened=${shortened%_*}
echo $shortened
# FINO1_Windrichtung_50m
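If you need all the fields rather than just the root name, an alternative sketch is to let the read builtin split the name on underscores. This is a different technique from the suffix deletion above, and the variable names are my own choice for illustration.

```bash
my_file=FINO1_Windrichtung_50m_20130802_20140630.dat

# Strip the extension, then split the rest on "_" into named fields.
# IFS is changed only for this one read command.
IFS=_ read -r station variable height start_date end_date <<< "${my_file%.dat}"

echo "$station $variable $height $start_date $end_date"
# FINO1 Windrichtung 50m 20130802 20140630
```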

With those two building blocks, I could loop over all the files in my folder, removing the header and appending the rest of the contents onto the appropriate file in the combined folder:

for f in *.dat
do
  shortened=${f%_*}
  root_name=${shortened%_*}
  # The glob must match exactly one file in combined/, or bash reports an ambiguous redirect
  tail -n +7 "$f" >> combined/"$root_name"*
done

That would have been fine, but I would have had a slight problem in the combined folder: the dates in the filename would have been incorrect. Of course, storing that type of information only in the filename is a horrible thing to do, and I don’t do that, but it does make it easier to find stuff if it’s named the way you’re expecting it to be. So I modified the loop to become this:

for f in *.dat
do
  shortened=${f%_*}
  root_name=${shortened%_*}
  # A plain assignment does not expand globs, so expand into an array first
  matches=(combined/"$root_name"*)
  new_file=${matches[0]}
  tail -n +7 "$f" >> "$new_file"
  name_first_half=${new_file%_*}
  name_second_half=${f##*_}
  new_name=$name_first_half"_"$name_second_half
  mv "$new_file" "$new_name"
done

That kept the first half of the filename, but changed the second half to match its new end date.
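Before running a loop like this on real data, it can be tried on throwaway files. The sketch below builds a scratch directory with one made-up new file and one made-up older file (the contents and the older file's dates are invented) and then runs a condensed variant of the loop. It expands the glob through an array, because a plain variable assignment does not expand wildcards, and it uses tail -n +7 to skip the six header lines.

```bash
# Scratch layout: new data in the working directory, older data in combined/
work=$(mktemp -d)
(
  cd "$work" || exit 1
  mkdir combined
  printf 'h1\nh2\nh3\nh4\nh5\nh6\nnew1\nnew2\n' > FINO1_Windrichtung_50m_20130802_20140630.dat
  printf 'old1\nold2\n' > combined/FINO1_Windrichtung_50m_20120101_20130801.dat

  for f in *.dat
  do
    shortened=${f%_*}
    root_name=${shortened%_*}
    matches=(combined/"$root_name"_*)  # array assignment expands the glob
    old_file=${matches[0]}
    [ -e "$old_file" ] || continue     # skip files without an older counterpart
    tail -n +7 "$f" >> "$old_file"     # append everything after the header
    new_name=${old_file%_*}_${f##*_}   # old start date plus new end date
    mv "$old_file" "$new_name"
  done
)

ls "$work/combined"
# FINO1_Windrichtung_50m_20120101_20140630.dat
```

The subshell keeps the cd from changing your interactive working directory; remove the scratch directory with rm -r "$work" when you are done.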

Of course, if you’re just hacking this on the command line, you might do it all in one line like this – you’re trying to just get things done once, not make reusable code:

for f in *.dat; do shortened=${f%_*}; root_name=${shortened%_*}; matches=(combined/"$root_name"*); new_file=${matches[0]}; tail -n +7 "$f" >> "$new_file"; name_first_half=${new_file%_*}; name_second_half=${f##*_}; new_name=$name_first_half"_"$name_second_half; mv "$new_file" "$new_name"; done

That’s no way to store code for later, but for hacking on the command line it’s a fast way of getting things done. Have fun with these string tricks – I find them indispensable and don’t understand how I used to get things done without them! Warning: If you start using them, you might feel the same way.

About

My name’s Daniel Lee. I’m an enthusiast for open source and sharing. I grew up in the United States and did my doctorate in Germany. I've founded a company for planning solar power. I've worked on analog space suit interfaces, drones and a bunch of other things in my free time. I'm also involved in standards work for meteorological data. Now I work for the German Weather Service on improving forecasts for weather and renewable power production.
