Bash scripting is a way to automate tasks in the Unix/Linux command line environment. Bash is a command-line shell and scripting language that allows users to interact with and manipulate the Unix/Linux operating system. Bash scripts are sets of commands and statements that are executed in sequence, allowing users to automate repetitive tasks and perform complex operations.
Bash scripts can be used for a wide range of tasks, including data processing and analysis, system administration, and software development. They are particularly useful in bioinformatics, where large amounts of data need to be processed and analyzed quickly and efficiently.
Some of the key features of Bash scripting include:
Variables: Bash scripts use variables to store and manipulate data. Variables can be used to store values, such as file paths or input parameters, and can be manipulated using arithmetic and string operations.
Conditional statements: Bash scripts use conditional statements to control the flow of execution. Conditional statements allow users to test for specific conditions and execute different commands based on the results.
Loops: Bash scripts use loops to repeat commands and statements. Loops can be used to iterate over lists of files, perform repetitive tasks, and process large amounts of data.
Functions: Bash scripts use functions to organize and modularize code. Functions allow users to reuse code and make their scripts more efficient and easier to maintain.
In this markdown we will work our way up from zero to learn the basics of scripting. With these core skills, you will be able to approach many bioinformatics tasks with confidence.
With the advent of high-throughput technologies such as next-generation sequencing, there is a vast amount of biological data that needs to be processed and analyzed. Bash scripting is a powerful tool that can be used to automate many of the tasks involved in bioinformatics analysis.
Some of the bioinformatics tasks that can be automated using Bash scripts include:
Data preprocessing: Before analyzing biological data, it often needs to be preprocessed to remove noise, filter out low-quality reads, and perform quality control checks. Bash scripts can be used to automate these preprocessing steps, such as trimming reads, removing adapter sequences, and filtering out low-quality reads.
Sequence alignment: Sequence alignment is the process of aligning two or more DNA or protein sequences to identify similarities and differences. Bash scripts can be used to automate sequence alignment tasks, such as aligning reads to a reference genome, or aligning protein sequences to a database of known proteins.
Variant calling: Variant calling is the process of identifying genetic variations, such as single nucleotide polymorphisms (SNPs) or insertions/deletions (indels), in DNA sequences. Bash scripts can be used to automate variant calling tasks, such as calling SNPs and indels from aligned reads, and filtering and annotating variants.
Gene expression analysis: Gene expression analysis is the process of quantifying the expression levels of genes in a sample. Bash scripts can be used to automate gene expression analysis tasks, such as mapping RNA-seq reads to a reference genome, quantifying gene expression levels, and identifying differentially expressed genes.
In our lab, we routinely use bash scripts to automate tasks such as these. UCE data is not unique or special, and the same core scripting approaches we will cover here will also translate to other bioinformatic chores, making it easier for you to analyze your data!
Note that there are many shells available, and which one you encounter varies by the system you are using.
Here’s an overview of some of the most common shells and the systems on which they are commonly found:
Bash (Bourne-Again SHell): This is the default shell on most Linux distributions, and it was the default on macOS prior to macOS Catalina (which switched to Zsh).
Zsh (Z Shell): This is an alternative to Bash that is also available on most Unix-like systems. It has some additional features and improvements over Bash, such as more advanced tab completion and spelling correction. It has been the default shell on macOS since Catalina, so you will often see Mac users running it.
Ksh (KornShell): This shell is also available on most Unix-like systems and actually predates Bash. The number of users is dwindling, but you will still see it used. A notable difference is that ksh provides a print builtin that is commonly used instead of echo to print messages in the terminal (we will see this in the next tab).
PowerShell: This is the default shell on Windows systems. It's also available on Linux and macOS, and it treats everything as objects. I have not seen it in widespread use relative to Bash or Zsh, but that may be because most students in our department run Linux.
It's worth noting that there are many other shells available, and the choice of which shell to use often comes down to personal preference and your specific needs. Additionally, many shells are highly customizable, allowing users to modify the shell's behavior and appearance to suit their needs. For UCE assembly, Zsh and Bash are fine.
If you are ever unsure which shell you are using, simply type this command into your terminal (it prints your default login shell):
echo $SHELL
I use Mac or Linux and have little experience with Windows. Windows users can use the Windows Subsystem for Linux (WSL) to run bash scripts and other command line tools, or set up a dual-boot machine to have both Linux and Windows on the same computer.
There are also several online bash shells for testing code. Repl.it supports multiple environments and code sharing. There are others I have not tried, such as JSLinux and Shellbox, that you can use while you configure your computer.
‘ls’ stands for “list”. It is used to list the files and directories in the current directory.
ls
#Lists your files (long format, gives file sizes, permissions, owner, last modification)
ls -l
#Lists all files, including hidden files
ls -a
#List all files and folders (adds / after directory names and other markers, like * after executables)
ls -F
#Recursively list Sub-Directories
ls -R
#Best ls command
ls -thor
#-t sort by modification time, newest first
#-h --human-readable file size [mb, etc]
#-o like -l, but do not list group information
#-r --reverse, reverse order while sorting, with oldest files listed first
cd stands for “change directory”. It is used to change the current directory.
# To change directory
cd directory_name
# Change to the previous directory (note this is wherever you were last)
cd -
# Change to the home directory
cd ~
#Back a directory in your current path
cd ..
There are two types of paths in Bash: absolute and relative. An absolute path is the complete path from the root directory to a file or directory. A relative path is the path from the current directory to a file or directory.
# Absolute path example
cd /home/user/Desktop
# Relative path example
cd Documents
mkdir stands for “make directory”. It is used to create a new directory.
mkdir directory_name
mv stands for “move”. It is used to move files and directories.
# Move a file
mv file_name directory_name/new_file_name
# Move a directory
mv <source_directory_path> <destination_directory_path>
#For example
mv ~/my_dir ~/Documents/
#Rename file
mv <filename> <newfilename>
#Rename directory
mv old_dir new_dir
Note that it is good practice to put quotes ("") around paths. If the path has a space or special characters, the command may not be properly interpreted. For example:
cd my folder
#versus
cd "my folder"
In the above, the second option will allow you to change directory to a folder called my folder, while the first will fail because cd treats "my" and "folder" as separate arguments (for those who are more advanced: yes, you can get around some of this with escaping, but we will keep it simple here).
touch is one way we can create a new file. We will explore multiple ways to do this and discuss the pros and cons on the “cat” tab
touch new_file_name
You can also use touch to update the time stamps of existing files
touch -t 202303102300 file_name
This will update the time stamp to March 10th, 11 pm, 2023 (the format is YYYYMMDDhhmm).
rm stands for “remove”. It is used to remove files and directories.
# Remove a file
rm file_name
# Remove a directory
rm -r directory_name
To run a command in Bash, simply type the name of the command followed by any necessary arguments.
command_name argument1 argument2
Commands in Bash can output text to the screen. This output can be redirected to a file using the > operator.
# Redirect output to a file
command_name > output_file
Commands in Bash can also take input from a file using the < operator.
# Redirect input from a file
command_name < input_file
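You can also combine both redirections in one command. A minimal sketch using the standard sort command (input_file and output_file are placeholder names):
#read lines from input_file, sort them, and write the result to output_file
sort < input_file > output_file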
cp stands for copy and is used for copying files and directories within the same system.
#Copies a file; here file1 gets copied into file2
cp <filename1> <filename2>
#Copying directories recursively
cp -R /directory/ /directory1/
To understand recursion, consider this example:
#you have a directory that looks like this
mydir/
|-- file1.txt
|-- subdir1/
| |-- file2.txt
| |-- file3.txt
|-- subdir2/
| |-- file4.txt
| |-- file5.txt
#you run this command
cp -R mydir/ mydir_copy/
#creating this
mydir_copy/
|-- file1.txt
|-- subdir1/
| |-- file2.txt
| |-- file3.txt
|-- subdir2/
| |-- file4.txt
| |-- file5.txt
This command creates a copy of the “mydir” directory and all its contents in a new directory named “mydir_copy” that will be identical to “mydir”.
We can also cp in different ways:
#Copy multiple files into another directory
cp file* /directory/subdirectory
#Copy all files in new directory unless they already exist
cp -u *.fasta newdir/
The cp -u command copies files from one location to another, but only if the source file is newer than the destination file or if the destination file does not exist. The -u option stands for “update”. In this case we are copying all fasta files unless they already exist
This is handy if you want to play it safe while copying:
#Backs up files
cp --backup <origfile> <newfile>
If a file with the same name already exists in the destination directory, cp will make a backup copy of that file before overwriting it with the contents of the source file. Note that --backup is a GNU cp option; the BSD cp that ships with macOS does not support it.
This can be useful if you want to make sure you have a backup of any files that might be overwritten during the copy process.
If you need multiple versions you can:
#Backs up files with numbering
cp --backup=numbered <origfile> <newfile>
scp [secure copy] is used to copy files securely between remote systems. It encrypts the data during transfer and requires authentication with username and password or SSH key.
This is a common command for getting data to or from a cluster
#general form
scp <source_file_path> <destination_file_path>
#example
scp mydata.txt user@physalia.edu:/remote/directory/
rsync is used for synchronizing files between local and remote systems or between two remote systems with advanced options like compression, bandwidth throttling, and only copying changes.
#general form
rsync [options] source destination
#example
rsync -avzP /path/to/files/ user@remotehost:/path/to/destination/
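Breaking down the flags in that example (all standard rsync options):
#-a archive mode (recursive, preserves permissions, timestamps, and links)
#-v verbose output
#-z compress data during transfer
#-P show progress and keep partially transferred files so interrupted copies can resume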
If you are like most programmers, you probably don’t want to type the same options over and over. You can use the alias command to shorten commands
#general form
alias command='command option'
#examples
alias ls="ls -l"
alias cp="cp -i"
#this second example makes copying interactive, asking you if you truly want to copy something. This can be safer since you can check before accidentally overwriting files. You confirm by typing 'y' and hitting enter
If you want to remove your alias, simply
unalias ls
unalias cp
Note that these are stored in memory. Once you exit the shell, any aliases that you have defined will be lost.
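If you want an alias to survive between sessions, append it to your shell's startup file and reload it. A minimal sketch, assuming bash and an alias name of ll (on zsh, use ~/.zshrc instead):
#append the alias to your startup file, then reload it into the current session
echo 'alias ll="ls -thor"' >> ~/.bashrc
source ~/.bashrc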
One last note on command history. Your terminal will remember the history of your commands, which is useful; you can view it with the history command. However, if you mess up and forget to close a quote or something else is causing commands to fail, you can always clear your history. This is rare, but I wanted to include it here since it differs between Mac and Linux.
#view your history
history
#on linux you use -c to clear
history -c
#on mac you use -p to purge
history -p
Sequence files are essentially just large text files, and Bash happens to be a powerful tool for working with these file types. Bash has a variety of built-in commands and utilities that allow you to easily manipulate, search, and analyze text data. This lets you use simple commands to perform complex operations quickly and efficiently, without the need for specialized software or programming knowledge.
In this section we will begin looking at core text manipulation operations.
Before we dive into text files, I just want to point out the way that bash prints output using the echo command. Echo is used to display a string of text. You will see this command again later.
# Display a string of text
echo "Scripting is easy!"
The cat command is a very useful utility that can be used to concatenate and display the contents of one or more files.
The basic syntax for using cat is as follows:
cat [OPTIONS] FILE...
where FILE is the name of one or more files that you want to concatenate and display.
For example, to display the contents of a file called sharkTracker.txt, you can use the following command:
cat sharkTracker.txt
This will display the contents of the file sharkTracker.txt file on your terminal.
-n or --number
This option adds line numbers to the output. For example, if you want to display the contents of sharkTracker.txt with line numbers, you can use the following command:
cat -n sharkTracker.txt
If for some reason you only need the non-blank lines numbered (this comes up more often than you would expect as you get handed datasets), you can use -b or --number-nonblank:
cat -b sharkTracker.txt
If you need to see line ends (these will be marked by a $), use -E or --show-ends:
cat -E sharkTracker.txt
Likewise, -T or --show-tabs will show tabs. This can come in handy:
cat -T sharkTracker.txt
We are going to start putting pieces together here to see how we can use cat to accomplish core tasks. To begin we will make some files using cat.
You can use the cat command with the output redirection operator > to create a new file and write text to it. For example, to create a new file called sharkTracker2.txt and write some lines of text to it, use the following command:
cat > sharkTracker2.txt
#first enter the above command, then paste this in
Shark Species,Weight (lbs),Length (ft),Coastal Town,Date
Sandbar Shark,120,6.3,Wilmington,2022-06-01
Tiger Shark,430,10.2,Morehead City,2022-05-15
Dusky Shark,250,8.5,Nags Head,2022-07-23
Blacktip Shark,80,5.4,Hatteras,2022-04-12
Great White Shark,1600,18.2,Atlantic Beach,2022-08-05
Bull Shark,350,9.1,Beaufort,2022-06-18
As you can see, you can keep entering text as you hit enter. To save this output, hold Control and press D (Ctrl-D).
You can also use cat to append text to an existing file. To do this, you can use the >> operator. For example, to append the text “Hammerhead Shark,600,12.3,Swansboro,2022-05-01” to the sharkTracker2.txt file created in the previous example, you can use the following command:
cat >> sharkTracker2.txt
Hammerhead Shark,600,12.3,Swansboro,2022-05-01
After you run this command, you will again see a blank line waiting for you to input text. Paste in your text (without quotes) and press Enter. Then press Ctrl-D to append the text to the file.
You can use cat to create a new file from the contents of an existing file. We will zoom in on this in a second, but just realize that by simply using the > operator with cat, you can accomplish the same thing as cp. For example, to create a new file called sharkcopy.txt with the contents of sharkTracker2.txt, you simply:
cat sharkTracker2.txt > sharkcopy.txt
This seems silly to do, but in the next section we will begin building on this concept to manipulate files. From there things will just get more and more powerful and you will rapidly feel like an informatics wizard!
Here we will continue to use cat to explore ways we can subsample files, introducing a few more core functions along the way.
This is a common task: you have a big metadata file, but you just need one or a few columns of it. Here is how to use cat with the cut command and the pipe operator | to isolate the parts of a file you need and write them into a new file.
cut -d',' -f 1,4 sharkTracker2.txt | cat > sharkMap.txt
There is a lot going on here, so let's break this down: -d',' tells cut that the fields are delimited by commas, -f 1,4 selects the first and fourth fields (the species and town columns), and the pipe operator | sends that output to cat, which the > operator then redirects into sharkMap.txt. Strictly speaking, you could drop the cat and redirect the cut output directly; it is included here to practice piping.
To view this file you can use cat
#view the file
cat sharkMap.txt
If this file were thousands or millions of rows, you probably wouldn't want to display the whole output. You can use the head command to isolate the number of rows of your choosing and display those. For example, if we want to spot check the first 3 rows, we can:
head -n 3 sharkMap.txt
We can also use the tail command to do the same for the bottom x rows
tail -n 3 sharkMap.txt
It's a silly example here, but quite handy for checking log files from runs on clusters.
If you are feeling creative at the moment, you may have thought of something: can we combine cat, head, tail, and the pipe operator to isolate rows of text? The answer is yes! Let's see this in action.
#cut out rows 2-5
cat sharkTracker2.txt | head -n 5 | tail -n +2 > rows2Through5.txt
Let's break down what this command does: cat streams the whole file into the pipe, head -n 5 keeps only the first 5 lines, tail -n +2 then outputs from line 2 onward (the + means "starting at this line" rather than "the last n lines"), and > writes the result, rows 2 through 5, into rows2Through5.txt.
#View the file
cat rows2Through5.txt
In the next section we will build on this to look for specific patterns to isolate data. This is where things start to get powerful.
Grep is a powerful command-line tool used for searching text files for specific patterns. It allows users to search for regular expressions or strings in one or more files at once. With a variety of options and commands, grep is a versatile tool that can be used for a wide range of text search tasks.
In bioinformatics, grep is often used to search for specific DNA or protein sequences in large text files such as genome sequence files or sequence alignment results. It can also be used to search for specific patterns or motifs within sequences, or to filter out specific sequences based on certain criteria. Additionally, grep is often used in conjunction with other command-line tools to perform more complex analyses and tasks in bioinformatics.
The basic syntax for using grep is as follows:
#The general pattern
grep [options] pattern [file(s)]
Let’s try an example of a simple search using the sharkTracker2.txt file we made
#The general pattern
grep "search string" filename
#with our shark file
grep "Hammerhead" sharkTracker2.txt
Running this second line retrieves the line that contains the word “Hammerhead”
You can also search for a pattern in multiple files by separating them with a space.
#The general pattern
grep pattern file1.txt file2.txt
#example with sharks
grep "Hammerhead" sharkTracker2.txt sharkMap.txt
There are numerous options that grep can use. Here are some of the most common:
Ignoring case is handy, especially if you or your collaborators are prone to bumping the caps lock key
grep -i HAmmerHEAD sharkTracker2.txt
You can also retrieve everything but the pattern you are looking for. For example if we wanted all sharks that are not hammerheads
grep -v Hammerhead sharkTracker2.txt
You can get line numbers with the search
grep -n Hammerhead sharkTracker2.txt
This one is handy: print only the name(s) of the file(s) that contain a specific string.
grep -l Hammerhead *
You can also count matches (this is useful if you need a quick count of reads in a fasta, more on that tomorrow)
grep -c Hammerhead sharkTracker2.txt
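As a small preview of that: every sequence in a fasta file begins with a > header line, so counting headers counts sequences. A sketch, assuming a file named reads.fasta:
#count sequences in a fasta by counting header lines
grep -c '^>' reads.fasta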
A couple more handy things
#To search for multiple patterns, list them separated by an escaped pipe character (\|)
#note that \| in a basic regex is a GNU grep extension; on macOS use grep -E 'Hammerhead|Great'
grep 'Hammerhead\|Great' sharkTracker2.txt
#use regular expressions; this will return any shark at 12 lbs with any single digit after the decimal (the \. escapes the dot so it matches a literal period)
grep '12\.[0-9]' sharkTracker2.txt
#search for whole words
grep -w 'Great White Shark' sharkTracker2.txt
#search in all files except
grep Great * --exclude=sharkTracker2.txt
As you can see, the list of things you can do with grep can go on and on. Rather than provide an exhaustive list of arguments, I’m going to focus the next section on common tasks you can use grep for that can come in pretty handy.
#To count instances of a word, note that -o extracts only the search pattern, not the entire line
grep -o 'word' input.txt | wc -l
#To count instances of a word separated by boundaries (for example, red versus hired)
grep -o '\bword\b' input.txt | wc -l
tr is a command in Bash that translates or deletes characters. It reads standard input and performs a set of translations based on the command line arguments, and then outputs the results to standard output.
The syntax for tr is as follows:
tr [OPTION]... SET1 [SET2]
SET1 specifies the set of characters to be translated, and SET2 specifies the replacement characters. To delete the characters in SET1 instead of translating them, use the -d option (tr with a single set and no option will complain about a missing operand).
Here’s an example of how to use tr to replace all occurrences of the character a with the character b:
echo "banana" | tr 'a' 'b'
This will output bbnbnb, which is the original string with all occurrences of the character a replaced with the character b.
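tr can also delete characters outright with -d, or squeeze runs of repeated characters with -s. Two quick sketches:
#delete every 'a', outputs bnn
echo "banana" | tr -d 'a'
#squeeze repeated a's and b's into single characters, outputs ab
echo "aaabbb" | tr -s 'ab'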
Sed (stream editor) is a powerful command-line tool for performing text processing tasks on large datasets. Sed operates by reading in a stream of text data, applying a series of text manipulation commands, and then outputting the modified data. It is particularly useful for batch processing of text files, such as those commonly encountered in bioinformatics.
In bioinformatics, sed is often used to manipulate large text files containing genomic or proteomic data. For example, sed can be used to extract specific fields from a tab-delimited file, to remove or replace certain characters from a text file, or to convert between different file formats. Sed can also be combined with other command-line tools such as awk and grep to perform more complex text processing tasks. Due to its speed and versatility, sed is an essential tool in the bioinformatics toolkit for working with large datasets.
#Basic usage
sed [options] 'command' filename
This is a little different from what we have seen so far, we now have options mixed with commands that will do something to a file. Let’s zoom into some core functions starting with s (substitute)
#Basic usage
sed s/pattern/replacement/flags filename
In this line we invoke sed and ask it to find a pattern and replace it in a file. There are optional flags we can include to modify the behavior further. Let's look at this.
#change shark to dolphin
sed 's/Shark/dolphin/' sharkTracker2.txt
#use regular expressions
sed 's/2022.*5/No data/' sharkTracker2.txt
A quick note on substituting versus deleting when ignoring case
#change shark to dolphin regardless of case using the i flag
sed 's/shark/dolphin/i' sharkTracker2.txt
#delete lines containing hammerheads, insensitive to case
sed '/hammer/Id' sharkTracker2.txt
Notice the case change above? A lowercase i after an address is sed's insert command; if you use it without any text to insert, sed doesn't know what text to insert and therefore throws an error.
If we want to delete lines containing the word “hammer”, we need to use the I flag (uppercase), which tells sed to ignore case when searching for the pattern, rather than the i flag (lowercase) which is used to insert text.
We can also use the substitute command to remove leading whitespace.
#replace leading white space
sed 's/^[ \t]*//' filename.txt
Let’s unpack this
* s/ indicates a substitution command (i.e., replace one string with another)
* the ^ symbol matches the start of a line (this is very useful for working with fasta files)
* [ \t]* matches any number of spaces or tabs at the beginning of the line
* the // indicates the replacement string is empty (i.e., delete the matched string, or replace with nothing depending on how you look at it)
* and the final ' ends the substitution command
Cool! Let's do one last task.
#replace shark species with just species in just first line by specifying the line number in front of the s
sed '1s/Shark Species/Species/g' sharkTracker2.txt
In the above example, I added the g flag. The g (global) flag is used to replace all occurrences of a pattern within a line, rather than just the first occurrence.
By default, sed replaces only the first occurrence of a pattern in each line it processes. The g flag is useful when you want to replace all occurrences of a pattern within a line. We don’t have this case here, but you see this often and it is useful to point out. I also want to issue a warning here:
if you use the g flag without specifying a pattern, sed will replace all occurrences of the empty string with the replacement text. This can cause unexpected changes to your text.
Sed’s ability to search for patterns and perform modifications on targeted lines of text makes it an efficient tool for bioinformaticians looking to extract or modify specific fields within a file. This can be particularly useful for tasks such as parsing data from large sequencing files or formatting data for downstream analyses.
Let’s look at this with some simple examples
#print the third line
sed -n '3p' sharkTracker2.txt
This command prints the third line of the sharkTracker2.txt file. The -n option suppresses sed's automatic printing of every line, and the p command prints the specified line. The logic here is that you are limiting output to just the line designated.
#delete the third line
sed '3d' sharkTracker2.txt
This command prints the file with its third line deleted. The d command tells sed to delete the specified line. Note that we drop the -n here; with -n the deletion would happen but nothing would be printed.
We can also replace individual lines. Note that linux and mac do this slightly differently
#linux version
sed -i '2s/.*/This is the new second line/' sharkTracker2.txt
#mac version
sed -i '' '2s/.*/This is the new second line/' sharkTracker2.txt
The -i option is for in-place editing and requires an argument on a Mac, even if it's just an empty string.
This command is handy and we can use it to insert lines above or below targets.
#linux for above
sed -i '3i\Spiny Dogfish,5,3,Avon,2022-07-14' sharkTracker2.txt
#mac for above
sed -i '' '3i\
Basking Shark,40,2000,Topsail,2022-07-14\
' sharkTracker2.txt
#linux for below
sed -i '3a\Basking Shark,40,2000,Topsail,2022-07-14' sharkTracker2.txt
#mac for below
sed -i '' '3a\
Basking Shark,40,2000,Topsail,2022-07-14\
' sharkTracker2.txt
Note that on a mac you need the backslash followed by an enter. This is because the macOS version of sed is based on BSD (Berkeley Software Distribution) sed, which requires the newline character to be escaped with a backslash in order to continue the command on the next line.
In contrast, the GNU version of sed (found on most Linux systems) allows you to continue a command onto the next line by simply placing a backslash at the end of the line, without requiring a newline character.
We can also find lines similar to grep
#find the dogfish, regardless of case with the I flag
sed -n '/doGfIsh/I =' sharkTracker2.txt
How many lines are in this file anyway?
lines=$(sed -n '$=' sharkTracker2.txt)
echo $lines
I warned you echo would start to come back. This is a soft introduction to content we will see later. When you enclose a command within $() in bash, it runs that command and captures its output. In this case, the command being run is sed -n ‘$=’ sharkTracker2.txt, which outputs the number of lines in the file sharkTracker2.txt.
One last note, we can use the -e (expression) to chain commands
#on linux
sed -e '1i My Shark Data' -e 's/$/,/' sharkTracker2.txt
#on a mac
sed -i '' -e '1i\
My Shark Data' -e 's/$/,/' sharkTracker2.txt
This allows you to specify multiple sed commands to be executed on the same input file, with each command separated by the -e option. This option is useful when you want to execute multiple sed commands on the same input file without having to create multiple temporary files. In this example we added a terrible header and also added a comma to the end of each line
To remove annoying things like commas at the end of lines, simply
#on linux
sed -i 's/,$//' sharkTracker2.txt
#on mac
sed -i '' 's/,$//' sharkTracker2.txt
Variables are an essential part of command-line computing in bioinformatics. They allow users to store and manipulate data efficiently, automate repetitive tasks, and avoid the need to retype lengthy commands repeatedly. Variables are often used to store file paths, program parameters, or other data that need to be passed to a program or script.
Using variables in the command line requires defining the variable and assigning it a value. We will work with variables here to keep building our foundation as proper use of variables can streamline workflows and make data analysis more efficient.
In Bash, variables are defined using the syntax "variable_name=value", where variable_name is the name of the variable and value is the value to be assigned to it (note there are no spaces around the =). This should be familiar if you code in other languages and is the equivalent of "variable_name<-value" in R (in case that helps).
Once a variable is defined, it can be used in subsequent commands by enclosing the variable name in “$” and using it as an argument or parameter in the command. Let’s explore this now
Variables are used to store values that can be used later on. These values can be used in commands, scripts, or even in other variables. Here’s an example of a simple variable in the command line
#make a variable and print the output
MY_VARIABLE="Watch out for titan triggerfish\!"
echo $MY_VARIABLE
In this example, we create a variable called MY_VARIABLE and assign it the value “Watch out for titan triggerfish!”. We then use the echo command to display the value of the variable. The $ symbol is used to reference the value of the variable.
Note that in the above, we need to escape the ! by adding a backslash. Remember this if you have variables with special characters and suddenly you are prompted by dquote after trying to define them.
Here are a few more examples of how variables can be used in the command line
# Create a variable with a number
MY_NUMBER=10
# Use the variable in a command
echo "The number is $MY_NUMBER"
# Create a variable with a filename
MY_FILE="emptyfile.txt"
# Use the variable to create a new file
touch $MY_FILE
# Create a variable with a directory path
MY_DIR="/Users/alexdornburg/Documents/UCE_Workshop/Day1/Roadwork"
# Use the variable to navigate to the directory
cd $MY_DIR
Search for a pattern using grep and a variable:
SEARCH_PATTERN="Shark"
grep $SEARCH_PATTERN sharkTracker2.txt
#Search for a pattern using a variable and ignore case
SEARCH_PATTERN="sHark"
grep -i $SEARCH_PATTERN sharkTracker2.txt
Replace a pattern using sed and a variable:
OLD_PATTERN="Spiny"
NEW_PATTERN="Smooth"
sed "s/$OLD_PATTERN/$NEW_PATTERN/g" sharkTracker2.txt
#Replace a pattern using a variable and only change the first occurrence
OLD_PATTERN="Shark"
NEW_PATTERN="Dolphin"
#on linux
sed "0,/$OLD_PATTERN/s//$NEW_PATTERN/" sharkTracker2.txt
#on mac
sed -i '' "1s/$OLD_PATTERN/$NEW_PATTERN/" sharkTracker2.txt
In this example, the sed command replaces only the first occurrence of the value of the OLD_PATTERN variable with the value of the NEW_PATTERN variable in the sharkTracker2.txt file. The 0,/old_pattern/ range restricts the substitution to everything up to and including the line with the first occurrence, and the s// syntax is shorthand for reusing the most recently matched pattern, i.e., s/old_pattern/NEW_PATTERN/.
Note that the 0,/pattern/ address range is not supported by all versions of sed. On a mac, you can use the second command. In this command, we are specifying only the first instance of the word at the beginning of the sed pattern. In general, if you are on a mac and things aren’t working with sed, be prepared to look on stack overflow for the equivalent command on osx. As a note of caution, chatgpt has trouble with sed at the time of this writing, and will confidently return commands that do not work.
You can use variables in commands by enclosing them in curly braces and prefixing them with a dollar sign
#Using variables in commands
file_name="emptyfile.txt"
touch ${file_name}
This will create a new file called emptyfile.txt using the touch command. You can ls to see it in your directory.
You can use the output of a command as the value of a variable by enclosing the command in $()
#Using variables to capture command output
date_today=$(date +"%Y-%m-%d")
echo "Today's date is ${date_today}"
This will output today's date in the format YYYY-MM-DD. To break this down:
* The date command prints the current date and time in the default format: Wed Mar 1 15:20:26 EST 2023
* The + option allows you to specify a custom output format for the date command. In this case, we’re using the format string %Y-%m-%d, which represents the year, month, and day in the format YYYY-MM-DD.
* The $(…) syntax is command substitution. This means that the output of the date command is captured and used as the value of the date_today variable.
So, when you run the command date_today=$(date +“%Y-%m-%d”), the current date is captured and stored as a string in the date_today variable, which we then display with echo. Neat!
Command line arguments are a way to pass inputs to a command or script when it is executed. When you run a command or script with arguments, they are automatically assigned to variables called positional parameters. The first argument is assigned to $1, the second to $2, and so on.
We haven't made official scripts yet (we will shortly), but let's make one using cat.
cat > hello.sh
#!/bin/bash
echo "Hello, $1! How are you today?"
The #!/bin/bash is called a “shebang” or “hashbang,” and it’s the first line in many Bash scripts. It’s a directive to the shell to use the Bash interpreter to execute the script. The #! characters tell the system that what follows is the interpreter to use, and /bin/bash specifies the path to the Bash executable. Essentially, it ensures that the script is interpreted correctly by the right shell and that it’s executed as a Bash script. Without the shebang, the script might not run, or it might run with the wrong interpreter. The shebang line is an essential component of most Bash scripts and is often used to make sure that the script runs consistently across different systems.
The rest of the script looks familiar except the $1. Let's see what happens when we execute it.
#we first set the permission for the script to execute (more on this later)
chmod +x hello.sh
#then run it
./hello.sh Bob
In this example, the $1 variable is used to access the first command line argument, which is the name of the person being greeted. If you wanted to include more arguments, you would use $2,$3, and so on to access them.
Let’s do one more example to combine some concepts
cat > lameCalculator.sh
#!/bin/bash
sum=$(expr $1 + $2)
echo "The sum of $1 and $2 is $sum"
now set permission and run
#we first set the permission for the script to execute (more on this later)
chmod +x lameCalculator.sh
#then run it
./lameCalculator.sh 5633426784 47246738246
Pretty easy! However, there are a few things to remember about reserved parameters and arguments numbered 10 and up:
echo $0 # prints the script name
echo $1 # prints the first argument
echo $2 # prints the second argument
echo $9 # prints the ninth argument
echo $10 # prints the first argument, followed by 0
echo ${10} # prints the tenth argument
echo $# # prints the number of arguments
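Two more special parameters you will meet constantly (standard in bash, shown in the same style as above):
echo $@ # prints all of the arguments
echo $? # prints the exit status of the last command (more on exit statuses when we reach conditionals)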
We will level up when we play with sequence read data and use these concepts to accomplish more complex tasks. You will hopefully notice that many of these tasks collapse to simple examples like this with a few more steps and commands layered on top.
An array is a variable that can hold multiple values. Each value in the array is assigned a unique index, starting from 0.
Arrays in Bash can be of two types: indexed arrays and associative arrays.
An indexed array is a simple list of values, where each value is associated with an index number. To declare an indexed array in Bash, you can use the following syntax:
my_array=(value1 value2 value3 ...)
so an array of sharks may look like
sharks=('Bull Shark' 'Tiger Shark' 'Blue Shark')
Note the quotes here since these names have spaces in them. Another important note for everything I am about to cover:
bash array indexing starts at 0 (always)
zsh array indexing starts at 1
I am writing this while using a zsh. Therefore my examples may not match your shell. For zsh my first element is at position one. If you are on bash, you will access the same element at position 0. That means, if these examples don’t match, simply subtract 1.
Working with index positions
#reference an index
echo "${sharks[1]}"
#count elements in an array
echo ${#sharks[@]}
# the [@] part tells Bash to treat the array as a whole, rather than a single variable, and the # operator returns the length of the array. In zsh you can omit this
#copy a subset of an array in bash (note this counts from zero)
miniShark=("${sharks[@]:1:2}")
echo "${miniShark[@]}"
In this example, we’re using the “${array_name[@]:offset:length}” syntax to create a slice of the sharks array.
Note that array slicing in Bash uses a different syntax than ranges. In Bash, you can use the ${array_name[@]:offset:length} syntax to create a slice of an array, where offset is the starting index and length is the number of elements to include.
#copy a subset of an array in zsh
miniShark=(${sharks[from,to]})
# example
miniShark=(${sharks[2,3]})
echo "${miniShark[@]}"
#add to an array
miniShark+=('Hammerhead Shark' 'Goblin Shark')
echo "${miniShark[@]}"
#note above in zsh you can just echo $miniShark to see everything
#slice from an index onward (in bash you would write ${miniShark[@]:2})
miniMiniShark=${miniShark:2}
echo $miniMiniShark
We can also remove an element from an array (in zsh)
sharks=('Bull Shark' 'Tiger Shark' 'Blue Shark')
sharks[1]=()
echo $sharks
An associative array is a type of array where the index is not limited to numbers, but can be any string. To declare an associative array in Bash, you can use the following syntax:
declare -A my_array
my_array[key1]=value1
my_array[key2]=value2
my_array[key3]=value3
For example, to declare an associative array that maps pokemon names to their main colors, you can use the following code:
declare -A pokemon_colors=(
["pikachu"]="#F7D02C"
["snorlax"]="#8BBE8A"
["charmander"]="#F7786B"
)
This creates an array where the keys are the names of the pokemon and the values are their colors (if you use Python dictionaries, this will be familiar).
To access a value of an associative array you can use this
#basic syntax
${array_name[key]}
#example
echo ${pokemon_colors[pikachu]}
# Add a new element to the array
pokemon_colors["Squirtle"]="#7FC8D1"
Conditional statements are a fundamental concept in programming and are used to create branching logic in code. They allow a program to make decisions based on certain conditions and execute different code paths accordingly.
In simple terms, a conditional statement evaluates a Boolean expression, which is a statement that is either true or false. If the Boolean expression is true, the program executes a specific block of code, and if it is false, the program may execute a different block of code, or continue with the rest of the program.
Conditional statements are essential for creating dynamic programs that can respond to user input or changing conditions in the program environment. They enable programmers to create complex decision-making structures and automate repetitive tasks.
The most basic conditional statement in Bash is the if statement, which allows you to execute a block of code if a condition is true. The general form works like this
if [ condition ]
then
# code to run if condition is true
fi
The if statement starts with the if keyword, followed by an opening square bracket ([), which is shorthand for the test command. The test command evaluates the condition enclosed in the brackets and returns either a true or false value. The condition is closed with a matching bracket (]), followed by the then keyword. If the condition is true, the code block following then will be executed.
The code block to execute if the condition is true is enclosed between then and fi. The fi keyword marks the end of the if statement and closes the code block.
It’s important to note that the condition inside the square brackets can be a comparison between variables, an expression, or the result of a command. Bash supports a wide range of operators to build the condition, such as == for string comparison, -eq for integer comparison, and -f to check if a file exists. We will meet some of these in a moment, first an example with our budding scripting skills
cat > ifStatement.sh
#!/bin/bash
sum=$(expr $1 + $2)
if [ $sum -gt 10 ]
then
echo "$sum is greater than 10."
fi
Let’s run this, then talk about it
#we first set the permission for the script to execute
chmod +x ifStatement.sh
#then run it
./ifStatement.sh 2 1
#nothing happened, how about now
./ifStatement.sh 9 7
This rehashes our lameCalculator from the previous page and calculates the sum of the two arguments passed to the script and stores the result in a variable called sum. The expr command is used to evaluate the arithmetic expression, and $1 and $2 are the first and second positional parameters (i.e., the two arguments passed to the script).
The if statement checks if the value of sum is greater than 10. If the condition is true, the script prints a message to the console using the echo command. If the condition is false, nothing happens.
Not returning anything the first time is not desirable. Luckily we can use an else clause that allows us to do something if the condition being tested is not true. Let's see this in action.
#can you figure out what this does based on the previous page?
sed -i '' '/echo/a\
else\
echo "The $sum is less than or equal to 10."\
' ifStatement.sh
The above is just modifying our ifStatement.sh script to get a new line. If you don't understand this, go back and review the sed section. Note this command is for a Mac; you may need to modify it slightly if you are on a Linux machine. This is more for review, so use the listing below rather than troubleshooting for your specific sed. What we are making is a file that looks like this:
#!/bin/bash
sum=$(expr $1 + $2)
if [ $sum -gt 10 ]
then
echo "$sum is greater than 10."
else
echo "The argument is less than or equal to 10."
fi
The above statement builds on our previous example. Here, the condition is the addition that is performed. If the sum is greater than 10, it returns true, and the first command is executed. Otherwise, the second command is executed. Let’s see it in action
./ifStatement.sh 2 1
Neat, now we can make a basic comparison!
Often we need to make decisions based on a value. Here are arithmetic comparisons you will commonly see:
#equal to
if [ 10 -eq 10 ]
then
echo "10 is equal to 10"
fi
#not equal to
if [ 20 -ne 10 ]
then
echo "20 is not equal to 10"
fi
#less than
if [ 5 -lt 10 ]
then
echo "5 is less than 10"
fi
#less than or equal to
if [ 10 -le 10 ]
then
echo "10 is less than or equal to 10"
fi
#greater than
if [ 20 -gt 10 ]
then
echo "20 is greater than 10"
fi
#greater than or equal to
if [ 10 -ge 10 ]
then
echo "10 is greater than or equal to 10"
fi
We can also use a lot of other comparisons
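For instance, string comparisons use = and != inside single brackets. A quick sketch (the shark variable is just for illustration):
shark="Tiger Shark"
if [ "$shark" = "Tiger Shark" ]
then
echo "Found the tiger"
fi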
You can use regular expressions in an if statement to perform more complex string tests.
statement="hello world"
if [[ $statement =~ ^hello ]]
then
echo "String starts with hello"
fi
You can use command substitution in an if statement to execute a command and use its output in the test.
if [ $(whoami) = "root" ]
then
echo "You are logged in as root"
else
echo "You are not logged in as root"
fi
In this example, the script uses the $(whoami) command to get the current user’s username, and checks if it’s equal to “root”. If it is, it prints “You are logged in as root”. Otherwise, it prints “You are not logged in as root”.
We can also use an If statement with exit status
file=sharkTracker2.txt
string="hello"
if grep -q $string $file
then
echo "$string is in $file"
else
echo "$string is not in $file"
fi
In this example, the script uses the grep command to search for a string in a given file. -q suppresses the output and only returns the exit status. The exit status of 0 means that grep found the pattern. If the exit status is 1, it means that grep did not find the pattern.
We use the exit status of grep in shell scripts to make decisions based on whether a pattern was found or not, in our case to check if a file contains a certain string (you may also sometimes see non-zero exit status as an error, now you know what that means!)
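You can also inspect the exit status yourself through the special variable $?. A quick sketch reusing the grep -q search from above:
grep -q "hello" sharkTracker2.txt
#prints 0 if the pattern was found, 1 if it was not
echo $?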
It’s important to note that while if statements are a common way to perform conditional checks in Bash, they are not the only way, and sometimes other constructs may be more appropriate or more readable for a particular situation.
The test command (also known as [) can be used to perform simple conditional checks. Here’s an example that checks if a file named sharkTracker2.txt exists
test -e sharkTracker2.txt && echo "File exists"
The test command can be combined with logical operators to perform more complex conditional checks. Here’s an example that checks if a file named file.txt exists and is readable (-r):
test -r sharkTracker2.txt && echo "File exists and is readable"
We can also test if a file is executable
test -x lameCalculator.sh && echo "File exists and is executable"
The case statement can be used to perform multiple conditional checks in a more structured way than using multiple if statements. Here’s an example that checks if a variable named color is set to “red”, “green”, or “blue”:
color="blue"
case $color in
red)
echo "The color is red"
;;
green)
echo "The color is green"
;;
blue)
echo "The color is blue"
;;
*)
echo "The color is not red, green, or blue"
;;
esac
Here’s how the syntax works:
case variable in: This line starts the case block and specifies the variable to match against patterns.
red): This line specifies the first pattern to match against. If the value of the variable matches red, the code block below it will be executed.
code to execute if variable is red: echo “The color is red”.
;;: This double semicolon tells Bash to exit the case block and continue with the next line of code in the script.
The pattern repeats for the other colors.
*): This line specifies a default case to handle if the value of the variable doesn’t match any of the patterns above (captures any other pattern), and returns the next statement
esac: This line ends the case block. It took me forever to notice this, but esac is just case spelled backwards. If you forget to close the case block you will get an error.
Just like with other things we can also use command substitution. Here’s an example that checks if a file named emptyfile.txt exists and has a size greater than zero:
test $(wc -c < emptyfile.txt) -gt 0 && echo "File exists and is not empty" || echo "This file either does not exist or is empty"
|| is a logical operator that represents the “OR” operation. It’s used to execute a command or a block of commands only if the previous command failed (i.e., returned a non-zero exit status). Pretty cool if you think about this, let’s unpack this for a second
$(wc -c < emptyfile.txt) calculates the size of the “emptyfile.txt” in bytes. The wc command with the -c option counts the number of bytes in the input, and < emptyfile.txt redirects the input to come from the file “emptyfile.txt”.
test $(wc -c < emptyfile.txt) -gt 0 tests whether the output of the previous command (the size of “emptyfile.txt”) is greater than zero. The test command with the -gt (greater than) operator returns a true (0) exit status if the left-hand side is greater than the right-hand side.
If the previous command returns a true exit status (i.e., the file exists and is not empty), && echo “File exists and is not empty” executes the echo command, which prints the message “File exists and is not empty” to the console.
If the previous command returns a false exit status (i.e., the file does not exist or is empty), || echo “This file either does not exist or is empty” executes the echo command, which prints the message “This file either does not exist or is empty” to the console.
We can also use regular expressions with conditionals like this:
string="I saw a shark swimming in the ocean."
if [[ $string =~ ^"shark" ]]; then
echo "Match found"
else
echo "No match found"
fi
In this script, we’re using the [[ ]] operator to test whether the variable $string contains the substring “shark” at the beginning. If it does, the script prints “Match found”. Otherwise, it prints “No match found”.
the [[ … ]] operator is used for conditional expressions, which are used to test whether a particular condition is true or false.
The =~ operator is a regular expression matching operator that allows us to test whether a string matches a given regular expression.
One last bit, a few conditionals you can use with arrays
Check if an array is empty
if [[ -z $VARNAME ]]; then
echo "Empty"
else
echo "Not Empty"
fi
Check if a value is contained in an array (this uses a zsh subscript flag and will not work in bash)
if (( $VARNAME[(Ie)value] )); then
echo "Value Present"
else
echo "Value Not Present"
fi
Check if an array is not empty
if [[ ! -z $VARNAME ]]; then
echo "Not Empty"
else
echo "Empty"
fi
We're getting close to putting it all together. Let's look at loops next.
In this section, we'll be covering the basics of loops in Bash, including for, while, and until loops.
Loops are an essential part of any programming language, and Bash is no exception. Loops allow us to execute a block of code repeatedly, based on a particular condition.
A for loop is used to execute a block of code for a fixed number of times, or for each item in a list. The basic syntax of a for loop is as follows:
#basic syntax
for item in list; do
# code to be executed
done
In this syntax, item is a variable that takes on each value in list, and the code inside the loop is executed for each value.
Here’s an example of a for loop that prints the numbers 1 to 5
for i in {1..5}; do
echo $i
done
Let's break down this code: {1..5} is a brace expansion that generates the list 1 2 3 4 5, the variable i takes on each value in turn, and echo $i prints it. When you run this code, it will output the numbers 1 to 5, each on a separate line.
We can also use a for loop to iterate over an array (a variable that can hold multiple values in bash) like this:
sharks=("tiger" "bull" "sandbar" "blue")
for shark in "${sharks[@]}"; do
echo $shark
done
In this example, sharks is an array that contains four sharks. The for loop iterates over each shark in the array, and the code inside the loop (echo $shark) is executed for each one. When you run this code, it will output each shark on a separate line.
My favorite use of for loops is iterating over all the files in a directory. This is a very common task when dealing with sequence files. Here is a basic example:
#This returns the full path
for file in /Users/alexdornburg/Documents/UCE_Workshop/Day1/*; do
echo $file
done
#You can do this for your current directory
for file in *; do
echo $file
done
This seems basic, but if you start thinking about all the previous examples this is like getting the keys to the kingdom. You now have the core foundation to begin manipulating thousands of files with just a few lines of code!
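To make that concrete, here is a sketch that loops over every .fasta file in the current directory (assuming such files exist) and reports how many sequences each contains, combining the loop with grep -c and command substitution from earlier sections:
for file in *.fasta; do
echo "$file: $(grep -c '^>' "$file") sequences"
done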
You can also use a for loop to iterate over a range of numbers. Here’s an example:
for (( i=1; i<=5; i++ )); do
echo $i
done
In this example, the for loop uses a C-style loop to iterate over the numbers 1 to 5. The ((…)) syntax is used to evaluate arithmetic expressions. The i variable is initialized to 1, and the loop continues as long as i is less than or equal to 5. The i++ statement increments i by 1 at the end of each iteration. The code inside the loop (echo $i) is executed for each value of i. When you run this code, it will output the numbers 1 to 5, each on a separate line.
We can also loop over key value pairs like this
#bring back our pokemon example
declare -A pokemon_colors=(
["Pikachu"]="#F7D02C"
["Snorlax"]="#8BBE8A"
["Charmander"]="#F7786B"
)
#loop over it
for key val in "${(@kv)pokemon_colors}"; do
echo "$key : $val"
done
Here's what each part of the loop does: the (@kv) parameter flags are zsh syntax that expand the associative array as alternating key-value pairs, and for key val consumes those items two at a time, assigning the key to key and the value to val on each pass. So the entire loop will iterate over the pokemon_colors associative array and print out each key-value pair in the format key : value.
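For reference, bash has no (@kv) flag; the equivalent loop in bash expands the keys with ${!array[@]} and looks each value up by key. A sketch:
for key in "${!pokemon_colors[@]}"; do
echo "$key : ${pokemon_colors[$key]}"
done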
A while loop is a control flow statement in Bash that allows you to execute a block of code repeatedly as long as a certain condition is true. It is useful when you need to repeat a task multiple times until a specific condition is met. The while loop checks the condition at the beginning of each iteration and will continue to loop as long as the condition remains true.
The basic syntax of a while loop is:
while [ condition ]
do
# code to be executed
done
Let’s start with a simple example that uses a while loop to count from 1 to 5 and print each number on a new line.
counter=1
while [ $counter -le 5 ]
do
echo $counter
((counter++))
done
In this example, we set the initial value of the counter variable to 1. Then we use a while loop to check if the value of counter is less than or equal to 5. If it is, we print the value of counter and increment it by 1. The loop will repeat until counter is no longer less than or equal to 5.
IMPORTANT: If you do not increment the counter, the loop will be infinite.
It is not hard to accidentally generate an infinite loop while working on informatic tasks. Be careful or else you can fill your hard drive with erroneous multi-TB files!
In this example, we'll use a while loop to read a file line by line and print each line:
file="sharkTracker2.txt"
while read line
do
echo "$line"
done < $file
In this example, we use the read command inside a while loop to read a file line by line. We store each line in the line variable and print it out using the echo command. The < $file syntax redirects the input of the loop to come from the sharkTracker2.txt file.
This is unusual compared to other languages like R, but the done keyword is used to indicate the end of the loop body. The < character is used for input redirection, which means that it takes input from a file instead of from standard input.
So, when we use the syntax done < $file, it tells Bash to take the input for the loop from the file specified by $file instead of from standard input.
We can also evaluate based on multiple conditions. For example:
counter=1
sum=0
while [ $counter -le 10 ] && [ $sum -lt 50 ]
do
echo "Adding $counter to the sum"
((sum += counter))
((counter++))
done
echo "The final sum is: $sum"
In this example, we have two conditions for the while loop. The first condition checks if counter is less than or equal to 10, and the second condition checks if sum is less than 50. If both conditions are true, we print a message indicating that we’re adding the current value of counter to sum, and then we increment both counter and sum. If either condition is false, the loop exits. Once the loop is finished, we print the final value of sum.
The continue keyword, which skips the rest of the current iteration and moves on to the next one, is also handy. Here it is used to skip even numbers and print only the odd ones:
counter=1
while [ $counter -le 10 ]
do
if [ $((counter % 2)) -eq 0 ]
then
((counter++))
continue
fi
echo $counter
((counter++))
done
An until loop is similar to a while loop in Bash, except that it continues looping until a specified condition is true, rather than looping while a condition is true. In other words, the loop body will continue to execute as long as the condition is false. Once the condition becomes true, the loop will terminate.
The syntax of an until loop is as follows:
until [ condition ]
do
# statements
done
Here, condition is the condition that is checked at the beginning of each iteration of the loop. If condition is false, then the loop body is executed. Once the loop body has executed, condition is checked again, and the loop continues until condition becomes true.
Now let’s move on to the examples:
# a basic example
count=0
until [ $count -eq 5 ]
do
echo "Count is $count"
((count++))
done
In this example, the loop continues to execute until the $count variable is equal to 5. Inside the loop body, we print out the current value of $count and then increment it by 1 using the ((count++)) syntax.
Here's a variation on our earlier even/odd example:
count=0
until [ $count -eq 10 ]
do
if [ $((count % 2)) -eq 0 ]
then
echo "$count is even"
else
echo "$count is odd"
fi
((count++))
done
In this example, we print out whether each number from 0 to 9 is even or odd. We use an if statement inside the loop body to check if the current value of $count is even or odd, and print out the appropriate message.
Here is a variation of the example that involves evaluating an arithmetic expression:
count=0
sum=0
until [ $count -eq 10 ]
do
((sum+=count))
((count++))
done
echo "The sum of the first 10 numbers is: $sum"
In this example, we use an until loop to calculate the sum of the first 10 numbers. Inside the loop body, we add the current value of $count to the running total in $sum, and then increment $count by 1. Once the loop is finished, we print out the final value of $sum.
We can even count up and down using until
count=1
until [ $count -eq 11 ]
do
if [ $count -lt 6 ]
then
echo "Count is $count"
else
echo "Count is $(expr 11 - $count)"
fi
((count++))
done
In this example, we print out a sequence of numbers that goes from 1 to 5 and then back down to 1. We use an if statement inside the loop body to print out the current value of $count if it is less than 6, and the value of 11 - $count if it is greater than or equal to 6. This creates a sequence that starts at 1, goes up to 5, and then counts back down to 1.
As with while loops, remember that a loop can continue to execute until the specified condition becomes true, so be sure to use an appropriate condition to prevent the loop from becoming infinite!
That is it for loops, let’s move on to functions!
In this section, we’ll look at how to create functions. A function is a set of instructions that performs a specific task, and you can define and use functions just as you would in any other programming language. Functions are particularly useful when you need to perform a task repeatedly or as part of a larger script. We will start with basic functions and then, as a review, build functions around the core commands we have learned so far.
The syntax for defining a function in Bash is as follows:
function_name () {
# statements
}
Here, function_name is the name of the function, and statements are the instructions that make up the function body.
To call a function, you simply use its name followed by any arguments you want to pass to the function. For example:
function_name arg1 arg2
Let’s create a simple function that prints a message, and then run it:
# Define a function that prints a message
cheer () {
echo "You are doing great\!"
}
# Call the function
cheer
This function is simple and takes no arguments; if we type cheer at the command line, we get our message.
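Functions can also receive arguments, which show up inside the function body as the positional parameters $1, $2, and so on ($# holds the count). Here is a tiny sketch; the function name and message are just made up for illustration:
# Define a function that greets whoever is named in its first argument
greet () {
    echo "Hello, $1! You passed $# argument(s)."
}
# Call the function with one argument
greet "shark fan"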
We can also make another function that prints 5 random numbers
randomNumbers () {
for i in {1..5}
do
echo $RANDOM
done
}
randomNumbers
We haven’t seen this yet, but $RANDOM returns a random integer between 0 and 32767 (which feels oddly specific; it is the largest 16-bit signed integer). RANDOM is quite useful for setting a seed when randomly subsampling reads and for similar tasks.
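For example, here is a sketch of reproducible read subsampling, assuming the seqtk tool is installed and treating reads.fastq and the count of 10000 as stand-ins for your own data:
# pick and record a seed so the subsample can be reproduced later
seed=$RANDOM
echo "Subsampling with seed $seed"
# seqtk sample -s sets the random seed
seqtk sample -s"$seed" reads.fastq 10000 > subsampled.fastq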
The funniest use of RANDOM I have seen was in a function for an infinite loop that prints random numbers and text snippets on your terminal to look like a 90s movie hacker’s screen.
Here is a function with a conditional that evaluates whether a number is even or odd:
# Define a function that checks if a number is even
is_even () {
if [ $(($1 % 2)) -eq 0 ]
then
echo "$1 is even"
else
echo "$1 is odd"
fi
}
# Call the function with an even number
is_even 4
# Call the function with an odd number
is_even 7
Here we define a function called is_even that checks if a given number is even or odd using an if statement. We then call the function with an even number (4) and again with an odd number (7).
Let’s make a function with three arguments that uses sed to replace words in a file
# Define a function (for osx) that replaces one word with another in a file using sed. On linux you need to modify this
replacer () {
echo "Replacing $2 with $3 in the file $1"
sed -i '' "s/$2/$3/g" "$1"
}
# Call the function to replace Shark with Dolphin
replacer sharkTracker2.txt Shark Dolphin
#view the file
cat sharkTracker2.txt
How cool is that!
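For reference, GNU sed on Linux does not take the empty string after -i, so the Linux version of the same function would be:
# Linux (GNU sed) version of replacer
replacer () {
    echo "Replacing $2 with $3 in the file $1"
    sed -i "s/$2/$3/g" "$1"
}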
We can also use grep inside of functions. Let’s look for a pattern inside a text file
# Define a function that searches for a pattern in a file and returns the line number and line while also ignoring case
searchFile () {
grep -ni "$1" "$2"
}
# Call the function to search for "great White" (ignoring case) in sharkTracker2.txt
searchFile "great White" "sharkTracker2.txt"
We can also place functions in functions, just like with any other coding language. We will also throw in a command substitution.
# Define a function that searches for a pattern in a file and returns the line number and line while also ignoring case
searchFile () {
grep -ni "$1" "$2"
}
# create a function that uses the function
seeker() {
output=$(searchFile "$1" "$2")
if [[ -n "$output" ]]; then
echo "The match is $output"
else
echo "There are no matches"
fi
}
# Call the function to search for the word "Great White" in our sharkTracker2.txt file
seeker "great White" "sharkTracker2.txt"```
This is pretty neat, we now have feedback when we get no matches, and return output if we do have matches. If you start to think about what these last two examples are doing, you might be starting to connect the dots to how to expand these concepts to parsing informatics files and assembling outputs…
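As a small step in that direction, here is a sketch of a function that simply reports how many lines in a file match a pattern, using grep -c to count matching lines (the pattern and file below are from our running example):
# Define a function that counts case-insensitive matches in a file
countMatches () {
    count=$(grep -ci "$1" "$2")
    echo "Found $count line(s) matching '$1' in $2"
}
# Call the function on our tracker file
countMatches "shark" "sharkTracker2.txt"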
We can also work with arrays.
# Define a function that creates an array of numbers
# and prints the sum of the array
myArray=(1 2 4 5 3 7)
printSum () {
numbers=("$@")
sum=0
for i in "${numbers[@]}"
do
sum=$((sum + i))
done
echo "The sum is $sum"
}
printSum "${myArray[@]}"
The parentheses in numbers=("$@") are used to assign the positional parameters to an array named numbers.
By enclosing $@ in parentheses, we ensure that each positional parameter is treated as a separate element in the array. This is important because the elements of the array may contain spaces or other special characters that need to be preserved.
Without the parentheses, the positional parameters would be treated as a single string and any special characters within them would be interpreted by the shell.
In this case, we want to ensure that each element of the array is processed separately by the for loop that follows, so we use parentheses to create an array containing the positional parameters. Note that when calling the function you should pass "${myArray[@]}" rather than $myArray; the latter expands to only the first element of the array, not the whole thing.
Awk was developed at Bell Labs in the 1970s and is available on most Unix-like systems. The name awk comes from the initials of its developers – Alfred Aho, Peter Weinberger, and Brian Kernighan.
At its core, awk operates on records and fields within a text file. A record is typically a line of text, and fields are the individual units within a record that are separated by a delimiter (e.g., a comma, space, or tab). awk allows you to manipulate and extract data from text files by specifying patterns and actions that should be performed on records and fields that match those patterns.
Awk reads text files line by line and splits each line into fields. By default, fields are separated by whitespace characters (spaces, tabs, or newlines), but you can specify a different delimiter using the -F option. Awk then processes each line based on the rules you provide, which can include patterns and actions. In many cases this allows you to accomplish otherwise extremely cumbersome tasks in a single line of code!
Awk uses patterns that are used to match specific lines, and actions are used to specify what to do with the matched lines. For example, you might use a pattern to match lines that contain a specific word, and then use an action to print out the line. There is a vast diversity of commands that use awk, we will cover some basic ones as well as a few more advanced ones in this section.
Let’s start by printing the first, second, or last field of each line in our sharkTracker2.txt file
#This prints the first column
awk -F',' '{print $1}' sharkTracker2.txt
#This prints the second column
awk -F',' '{print $2}' sharkTracker2.txt
#This prints the last column
awk -F',' '{print $NF}' sharkTracker2.txt
NF is a special variable that holds the number of fields on the current line, so $NF always refers to the last field. Note that since our file is a csv file, we need to specify the delimiter as a comma using the -F flag.
Note that in awk, the block of commands enclosed in the {} specifies what to do when the condition is true.
We can also use awk for pattern matching.
#general form
awk '/pattern/ {print}' file.txt
#example from our file
awk '/Great White/ {print}' sharkTracker2.txt
#print all lines that DON'T contain a pattern
awk '!/Great White/ {print}' sharkTracker2.txt
#print all lines where the fourth field contains a pattern
awk -F',' '$4 ~ /Nags/ {print}' sharkTracker2.txt
On this last one, the ~ operator in the command is used to perform a regular expression match on the fourth field of the input file sharkTracker2.txt.
Specifically, the pattern /Nags/ between the forward slashes is a regular expression that is being matched against the value of the fourth field. This pattern matches any string that contains the substring “Nags”. The ~ operator applies the regular expression match to the value of the fourth field, and returns true if it matches the pattern, and false otherwise.
So, this command will print all lines in the sharkTracker2.txt file where the fourth field contains the substring “Nags”.
Just to remind you, the -F’,’ option sets the field separator to a comma, since the fields in the input file are comma-separated.
We can also use awk for finding fields based on conditions. Print lines where a numeric field is greater than a specific value:
#general form
awk '$3 > 10 {print}' file.txt
#example for sharks over 10 feet
awk -F',' '$3 > 10 {print}' sharkTracker2.txt
We can also print lines where a numeric field is between two specific values
#general form
awk '$3 >= MIN && $3 <= MAX {print}' file.txt
#example for sharks between 7 and 15 feet
awk -F',' '$3 >= 7 && $3 <= 15 {print}' sharkTracker2.txt
We can also search for field values, either numeric or string
#simple example
awk -F',' '$1 == "Tiger Shark" {print}' sharkTracker2.txt
#A more complicated example using a variable
shark="Tiger Shark"
awk -F',' -v pattern="$shark" '$1 == pattern {print}' sharkTracker2.txt
Let’s break this down
shark="Tiger Shark" sets a shell variable named shark to the value "Tiger Shark".
-F',' specifies the field separator as a comma.
-v pattern="$shark" defines a variable named pattern in the awk program, which is set to the value of the shell variable shark. The -v option is used to pass a variable to the awk program from the shell.
$1 == pattern checks whether the value of the first field is equal to the value of the pattern variable.
{print} specifies the command to print the entire line if the condition $1 == pattern is true.
sharkTracker2.txt is the input file that awk operates on.
To summarize, the command sets the shell variable shark to “Tiger Shark”, then runs awk with the specified options and program on the sharkTracker2.txt file. The program compares the value of the first field of each row in the input file to the value of the pattern variable (i.e., “Tiger Shark”), and if they are equal, the entire row is printed to the standard output.
Note that the use of the -v option to pass shell variables to awk is a useful technique for creating flexible awk programs that can be easily customized based on the input and the specific task at hand.
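For instance, you could combine -v with a loop from earlier to run the same awk program for several species in turn; the species names below are just examples, so swap in values that actually occur in your file:
for shark in "Tiger Shark" "Great White Shark"
do
    echo "Records for $shark:"
    awk -F',' -v pattern="$shark" '$1 == pattern {print}' sharkTracker2.txt
done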
We can also do this with a number instead of a string
#example with a number
length=9.1
awk -F',' -v pattern="$length" '$3 == pattern {print}' sharkTracker2.txt
Let’s look at a few more commands. As you will see in this section, awk is vast….
FNR is a built-in variable in awk that holds the number of the current record in the current file being processed. In other words, it keeps track of the line number within the current file.
This variable is useful when you want to perform different actions on different lines or records of the file. For example, you may want to print a header line at the beginning of the file or a summary line at the end of the file.
Here is a simple example that demonstrates the use of FNR. The following awk command prints each line of sharkTracker2.txt along with its line number:
awk '{print "Line " FNR ": " $0}' sharkTracker2.txt
print: This is the awk command that prints output.
"Line ": This is a string literal that will be printed before each line of input.
FNR: This is the awk variable that holds the record number (line number) of the current input file being processed.
": ": This is a string literal that separates the line number from the actual line of text.
$0: This is the awk variable that holds the entire input line.
When awk starts processing sharkTracker2.txt, it reads the first line of input, assigns it to $0, and assigns the record number (line number) to FNR.
awk then executes the program '{print "Line " FNR ": " $0}' for that line, printing the text "Line ", the current value of FNR, the text ": ", and finally the entire input line ($0).
awk reads the next line of input and repeats the same steps for each subsequent line.
Finally, awk exits after processing the last line of input.
You may be wondering why we are focusing on this. Well, let’s consider what happens when you are working with multiple files simultaneously and need to merge them.
Suppose we have two FASTQ files, sample1_R1.fastq and sample1_R2.fastq, which contain paired-end reads from a sequencing experiment. We want to extract the read sequences from these files and merge them into a single file for downstream analysis.
The reads look like this
@read1_R1
AGCTGATCGATCGTACG
+
IIIIIIIIIIIIIIII
@read2_R1
TACGTACGTACGTACGT
+
IIIIIIIIIIIIIIII
...
@read1_R2
CGTACGATCGATCGTAC
+
IIIIIIIIIIIIIIII
@read2_R2
ACGTACGTACGTACGTA
+
IIIIIIIIIIIIIIII
...
It turns out that we can use awk and FNR to create a merged file of all reads, just the sequences, to map back to a genome!
Check this out:
awk 'FNR%4 == 2 {print $0 > "merged_reads.fastq"}' sample1_R1.fastq sample1_R2.fastq
In a FASTQ file, each sequence record consists of four lines: a header line that starts with the @ symbol, a sequence line, a separator line that starts with the + symbol, and a quality score line. Because every record spans exactly four lines, the line numbers of a given line type repeat with a period of four, so the remainder of FNR%4 identifies which type of line is being processed:
When FNR%4 is 1, the line being processed is the header line of a new sequence record.
When FNR%4 is 2, the line being processed is the sequence line of a sequence record.
When FNR%4 is 3, the line being processed is the separator line of a sequence record.
When FNR%4 is 0, the line being processed is the quality score line of a sequence record.
So, in our command the condition FNR%4 == 2 selects only the second line of each record, which holds the actual sequence data, and the print action writes just those lines to merged_reads.fastq. Because FNR (unlike NR) resets to 1 at the start of each input file, the same test works for both sample1_R1.fastq and sample1_R2.fastq. One line is all it took!!
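The same trick powers quick sanity checks. As a sketch on a single file, this counts the reads in sample1_R1.fastq and reports their average length (length returns the number of characters in a string; we will meet it properly a little later):
awk 'FNR%4 == 2 {n++; total += length($0)} END {print n " reads, average length " total/n}' sample1_R1.fastq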
The NR variable in awk stands for “number of records”, and it keeps track of the total number of input records that have been processed so far, across all input files. This variable can be used in a variety of ways to perform computations or output certain information based on the total number of records processed.
awk -F, '{if(NR%2==0) print $0}' sharkTracker2.txt
This line prints every other line of sharkTracker2.txt by checking whether the line number NR is divisible by 2.
Now let’s apply this idea to a pretend fasta file.
cat > pretend.fasta
>seq1
atatatataatatatatatatata
>seq2
ACTCGATGTATCGCTAGATCTATA
>seq3
TCGCTAGATCTATTGATCGATGCT
#save with control d
#count the number of sequences with NR
awk '{if(NR%2==0) count++} END {print count}' pretend.fasta
Let’s count the number of sequences in the fasta file with a regular expression instead:
awk '/^>/ {count++} END {print count}' pretend.fasta
In this command, /^>/ is a regular expression that matches any line starting with the > character, which indicates a header line in a FASTA file. When such a line is encountered, the count variable is incremented. At the end of processing, the END block is executed, which simply prints out the final value of count. Therefore, this command prints out the total number of sequences in the input file.
Here is an example of using this for a gff3 file to get gene coordinates!
cat > pretend.gff3
##gff-version 3
##sequence-region chr1 1 1000000
chr1 . gene 1000 9000 . + . ID=gene00001;Name=Gene 1
chr1 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001;Name=MRNA 1;Note=This is mRNA 1
chr1 . exon 1050 1500 . + . ID=exon00001;Parent=mRNA00001;Name=Exon 1
chr1 . exon 3000 3902 . + . ID=exon00002;Parent=mRNA00001;Name=Exon 2
chr1 . exon 5000 5500 . + . ID=exon00003;Parent=mRNA00001;Name=Exon 3
chr1 . CDS 1201 1500 . + 0 ID=cds00001;Parent=mRNA00001;Name=CDS 1
chr1 . CDS 3000 3902 . + 0 ID=cds00001;Parent=mRNA00001;Name=CDS 1
chr1 . CDS 5000 5500 . + 0 ID=cds00001;Parent=mRNA00001;Name=CDS 1
#control d to save
In this example, the first line starts with “##gff-version 3”, which is a comment indicating that this is a GFF3 file and which version of the GFF3 format it follows. The second line starts with “##sequence-region” and provides information about the sequence region covered by this file.
The remaining lines represent features on the sequence, such as genes, exons, and CDSs. Each feature is described using nine columns separated by tabs: seqid, source, type, start, end, score, strand, phase, and attributes.
Now check out this command
awk '$3=="gene" {print NR, $9}' pretend.gff3 > index.txt
In this command, the condition $3=="gene" is used to select only lines that correspond to gene features in the GFF3 file. When such a line is encountered, the print action outputs the record number NR, which corresponds to the line number of the gene feature in the input file, as well as the ninth field ($9) of the GFF3 line, which contains the gene name or ID. The output of this command is redirected to a file index.txt, which can be used to quickly look up the position of a gene in the input file based on its name or ID. Easy!
BEGIN: This is a special pattern that is executed before any records are read from the input file(s). It can be used to initialize variables, set options, or perform other setup tasks. Here is an example in Bash:
awk 'BEGIN{FS=",";OFS="\t"} {print $1,$2,$3}' sharkTracker2.txt > sharkTracker2.tsv
If you followed what just happened, we converted a file from csv to tsv there! This command sets the input and output field separators to , and \t (tab), respectively, before printing the first three fields of each line in sharkTracker2.txt.
END: This is another special pattern that is executed after all records have been processed. It can be used to print summary information, perform cleanup tasks, or take other actions based on the input data. Here is an example:
awk -F, '{totalWeight += $2} END {print "Total Weight:",totalWeight}' sharkTracker2.txt
This command calculates the sum of the 2nd field of each line in sharkTracker2.txt, and then prints the total at the end of processing all the records.
length: This command returns the length of a string.
awk -F, '{if(length($2) > 3) print $0}' sharkTracker2.txt
This command prints each line of sharkTracker2.txt where the length of the 2nd field is greater than 3 characters (in our case, weights of 1000 or more). This is a handy way to filter data.
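Since length works on any string, we can combine it with our pretend.fasta file to print each sequence name next to its length; the next statement here tells awk to skip straight to the next input line:
awk '/^>/ {name=$0; next} {print name, length($0)}' pretend.fasta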
awk is actually pretty handy for handling strings. For example let’s take a substring of a string using substr. Here is the syntax
#the basic idea
substr(string, start, length)
Here’s a breakdown of each of the arguments:
string: The input string from which you want to extract the substring.
start: The starting position of the substring. This can be either a number indicating the character position (starting from 1) or a variable that contains a number.
length: The length of the substring to extract. This can also be a variable that contains a number.
Here’s an example of how to use substr in an awk script:
awk '{ print substr($0, 5, 10) }' sharkTracker2.txt
In this example, we’re using substr to extract a 10-character substring starting from the 5th character of each line in sharkTracker2.txt. The $0 variable represents the entire line, so we’re applying substr to every line in the file. The output will be the extracted substrings.
##looking for a text fragment
awk -F, '{ if (index(substr($4, 1, 10), "ead") > 0) print $0 }' sharkTracker2.txt
In this command, we use substr($4, 1, 10) to extract a 10-character substring of the fourth field, starting at position 1 (i.e. the first character). Then, we use index to check if “ead” is present in this substring. If the index of “ead” is greater than 0, then “ead” is present in the substring, and we print the entire record ($0).
We can also change the case of text in our file using toupper or tolower.
#command
awk '{print tolower($0)}' pretend.fasta > pretend2.fasta
#see the difference
cat pretend.fasta
cat pretend2.fasta
#to modify the existing file in place
#requires GNU awk (gawk); the default macOS awk does not support -i inplace
awk -i inplace '{print toupper($0)}' pretend2.fasta
In the first command, we use the tolower function to convert each line of the input file to lowercase. The print statement outputs the converted line, and $0 refers to the entire input line. Finally, the output is redirected to a new file, pretend2.fasta.
In the second command, the -i inplace option tells awk to modify the input file in place. We use toupper to convert each line to uppercase, and the print statement outputs the converted line.
We can also use this in a search
awk -F, '{if(tolower($1) == "great white shark") print $0}' sharkTracker2.txt
This command prints each line of sharkTracker2.txt where the first field is "great white shark" in any combination of upper- and lowercase.
We can also split a field into an array
awk '{split($9, a, ";"); print a[1]}' pretend.gff3
This command splits the ninth field on semicolons and prints the first element of the resulting array for each line.
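Taking this one step further, awk’s sub function replaces the first match of a pattern in a string, so we can strip off the ID= prefix and keep just the identifier (combined with the gene filter we used above):
awk '$3=="gene" {split($9, a, ";"); sub("ID=", "", a[1]); print a[1]}' pretend.gff3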
We can also compare consecutive lines using getline, which reads the next line of input into a variable. This is handy if you have a huge data file with a few inconsistent cases you may need to remove.
#this is just sample code so you have it
awk '{getline nextline; if($1 != nextline) print "Mismatch: "$1" != "nextline}' file.txt
We can also make associative (key:value) arrays
awk -F ',' '{ array[$1] = $2 } END { for (key in array) print key, array[key] }' sharkTracker2.txt
In this command, we create an associative array array where the keys are the values in the first column ($1) and the values are the values in the second column ($2). The END statement is used to print out the contents of the array after all input has been processed.
The for (key in array) loop iterates over each key in the array, with key holding each key in turn. We then print each key-value pair using print key, array[key]. Remember that in an associative array each key is associated with a value; array[key] accesses the value associated with the given key.
Note that if there are duplicate keys in the first column, the last value in the second column will be the one stored in the array.
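Associative arrays are especially handy for counting. Here is a minimal sketch that tallies how many times each value in the first column appears, e.g. sightings per species in our tracker file:
awk -F',' '{count[$1]++} END {for (k in count) print k, count[k]}' sharkTracker2.txt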
This just barely scratches the surface of using awk. However, it should be enough to get you started and able to approach bioinformatic tasks.
File compression is common, so here is a cheat sheet by Rick White from his introduction to scripting course at UNC Charlotte.
#compress a file
gzip <filename>
# results in <filename>.gz
#Uncompress files compressed by gzip (.gz)
gunzip <filename>.gz
# results in <filename>
#Compress a folder into a tar archive (tar.gz)
tar -cvzf <foldername.tar.gz> <foldername>
#Arguments are
- -c: Create a new .tar archive file.
- -v: Verbosely show progress.
- -z: Compress the archive with gzip.
- -f: Use the next argument as the archive file name.
# List contents of tar.gz
tar -tvf <foldername.tar.gz>
# Print a gzipped file without uncompressing it on disk (gzcat on macOS; use zcat on Linux)
gzcat <filename.gz> | more
gzcat <filename.gz> | less
#Uncompress files compressed by tar (tar.gz)
tar -zxvf <foldername.tar.gz>
#arguments
- -z: Filter the archive through gzip.
- -x: Extract files from the archive.
- -v: Verbosely show progress.
- -f: Use the next argument as the archive file name.
# Compress a folder into a bzip2 tar archive (tar.bz2, more compression)
tar -cvjf <foldername.tar.bz2> <foldername>
#argument explanation
- -j: Compress the archive with bzip2.
- -c: Create a new .tar archive file.
- -v: Verbosely show progress.
- -f: Use the next argument as the archive file name.
#Uncompress files compressed by tar (tar.bz2)
tar -xjvf <foldername.tar.bz2>
#argument explanation
- -j: Filter the archive through bzip2.
- -x: Extract files from the archive.
- -v: Verbosely show progress.
- -f: Use the next argument as the archive file name.
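To tie this back to our FASTQ work: compressed sequence files rarely need to be unpacked on disk. As a sketch (the file name is hypothetical; use zcat instead of gzcat on Linux), you can count the reads in a gzipped FASTQ by counting lines and dividing by four:
# count reads in a gzipped FASTQ without writing an uncompressed copy
gzcat reads.fastq.gz | wc -l | awk '{print $1/4 " reads"}'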