Today’s Goals:

  • To understand the basic concepts and commands of Bash scripting
  • To learn how to use Bash to automate bioinformatics tasks
  • To develop skills to write Bash scripts for specific bioinformatics tasks
  • To be able to understand and modify existing Bash scripts for bioinformatics tasks

Why learn bash?

Bash scripting is a way to automate tasks in the Unix/Linux command line environment. Bash is a command-line shell and scripting language that allows users to interact with and manipulate the Unix/Linux operating system. Bash scripts are sets of commands and statements that are executed in sequence, allowing users to automate repetitive tasks and perform complex operations.

Bash scripts can be used for a wide range of tasks, including data processing and analysis, system administration, and software development. They are particularly useful in bioinformatics, where large amounts of data need to be processed and analyzed quickly and efficiently.

Some of the key features of Bash scripting include:

  • Variables: Bash scripts use variables to store and manipulate data. Variables can be used to store values, such as file paths or input parameters, and can be manipulated using arithmetic and string operations.

  • Conditional statements: Bash scripts use conditional statements to control the flow of execution. Conditional statements allow users to test for specific conditions and execute different commands based on the results.

  • Loops: Bash scripts use loops to repeat commands and statements. Loops can be used to iterate over lists of files, perform repetitive tasks, and process large amounts of data.

  • Functions: Bash scripts use functions to organize and modularize code. Functions allow users to reuse code and make their scripts more efficient and easier to maintain.
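
Functions get the least attention in the pages that follow, so here is a minimal sketch of one up front (the function name and the file it is called on are placeholders):

#define a function that reports how many lines a file contains
count_lines() {
    wc -l < "$1"
}

#call the function with a (hypothetical) file name as its argument
count_lines myfile.txt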

In this markdown we will work our way up from zero to learn the basics of scripting. With these core skills, you will be able to approach many bioinformatics tasks with confidence.


Who uses bash scripts

With the advent of high-throughput technologies such as next-generation sequencing, there is a vast amount of biological data that needs to be processed and analyzed. Bash scripting is a powerful tool that can be used to automate many of the tasks involved in bioinformatics analysis.

Some of the bioinformatics tasks that can be automated using Bash scripts include:

  • Data preprocessing: Before analyzing biological data, it often needs to be preprocessed to remove noise, filter out low-quality reads, and perform quality control checks. Bash scripts can be used to automate these preprocessing steps, such as trimming reads, removing adapter sequences, and filtering out low-quality reads.

  • Sequence alignment: Sequence alignment is the process of aligning two or more DNA or protein sequences to identify similarities and differences. Bash scripts can be used to automate sequence alignment tasks, such as aligning reads to a reference genome, or aligning protein sequences to a database of known proteins.

  • Variant calling: Variant calling is the process of identifying genetic variations, such as single nucleotide polymorphisms (SNPs) or insertions/deletions (indels), in DNA sequences. Bash scripts can be used to automate variant calling tasks, such as calling SNPs and indels from aligned reads, and filtering and annotating variants.

  • Gene expression analysis: Gene expression analysis is the process of quantifying the expression levels of genes in a sample. Bash scripts can be used to automate gene expression analysis tasks, such as mapping RNA-seq reads to a reference genome, quantifying gene expression levels, and identifying differentially expressed genes.

In our lab, we routinely include bash scripts to automate tasks such as these. UCE data is not unique or special and the same core scripting approaches we will cover here will also translate to other bioinformatic chores, making it easier for you to analyze your data!

The diversity of shells

Note that there are many shells available, and which one you have varies by the system you are using.
Here’s an overview of some of the most common shells and the systems on which they are commonly found:

  • Bash (Bourne-Again SHell): This is the default shell on most Linux distributions, and it was the default on macOS prior to Catalina.

  • Zsh (Z Shell): This is an alternative to Bash that is also available on most Unix-like systems, and it has been the default shell on macOS since Catalina. It has some additional features and improvements over Bash, such as more advanced tab completion and spelling correction. You will often see Mac users on this shell.

  • Ksh (Korn SHell): This shell is also available on most Unix-like systems and is slightly older than Bash. The number of users is dwindling, but you will still see it used. A notable difference is that Korn scripts typically use print instead of echo to print messages in the terminal (we will see this in the next tab).

  • PowerShell: This is the default shell on Windows systems. It’s also available on Linux and macOS. It treats everything as objects. I have not seen this in widespread use relative to Bash or Zsh, but that may be due to most students running Linux in our department.

It’s worth noting that there are many other shells available, and the choice of which shell to use often comes down to personal preference and your specific needs. Additionally, many shells are highly customizable, allowing users to modify the shell’s behavior and appearance to suit their needs. For UCE assembly, Zsh and Bash are fine.

If you are ever unsure of which shell you are using, simply type this command into your terminal (note that this reports your default login shell):

echo $SHELL

I use Mac or Linux and have little experience with Windows. Windows users can use the Windows Subsystem for Linux (WSL) to run Bash scripts and other command-line tools, or set up a dual-boot computer to have both Linux and Windows on the same machine.

There are also several online Bash shells for testing code. Repl.it supports multiple environments and code sharing. There are others I have not tried, such as JSLinux and Shellbox, that you can use while you configure your computer.

Basic Commands

ls

‘ls’ stands for “list”. It is used to list the files and directories in the current directory.

ls

#Lists your files (long format, gives file sizes, permissions, owner, last modification)
ls -l

#Lists all files, including hidden files
ls -a

#List all files and folders (appends / to directory names and * to executables)
ls -F

#Recursively list Sub-Directories
ls -R

#Best ls command 
ls -thor

#-t sort by modification time, newest first
#-h --human-readable file size [mb, etc]
#-o like -l, but do not list group information
#-r --reverse, reverse order while sorting, with oldest files listed first


cd

cd stands for “change directory”. It is used to change the current directory.

# To change directory
cd directory_name

# Change to the previous directory (note this is wherever you were last)
cd -

# Change to the home directory
cd ~

#Back a directory in your current path
cd ..


Absolute and Relative Paths

There are two types of paths in Bash: absolute and relative. An absolute path is the complete path from the root directory to a file or directory. A relative path is the path from the current directory to a file or directory.

# Absolute path example
cd /home/user/Desktop

# Relative path example
cd Documents


Creating Directories

mkdir stands for “make directory”. It is used to create a new directory.

mkdir directory_name


mv

mv stands for “move”. It is used to move files and directories.

# Move a file
mv file_name directory_name/new_file_name

# Move a directory
mv <source_directory_path> <destination_directory_path>
#For example
mv ~/my_dir ~/Documents/

#Rename file
mv <filename> <newfilename>

#Rename directory
mv old_dir new_dir

Note that it is good practice to put “” around paths. If the path has a space or special characters, the command may not be properly interpreted. For example

cd my folder
#versus
cd "my folder"

In the above, the second option will allow you to change directory to a folder called my folder, while the first will tell you the string is not in the path (for those who are more advanced: yes, you can get around some of this, but we will keep it simple here)

Creating Files

touch is one way we can create a new file. We will explore multiple ways to do this and discuss the pros and cons on the “cat” tab

touch new_file_name

You can also use touch to update the time stamps of existing files

touch -t 202303102300 file_name

This will update the time stamp to 11 pm on March 10th, 2023 (the format is YYYYMMDDhhmm)

Removing Files and Directories

rm stands for “remove”. It is used to remove files and directories.

# Remove a file
rm file_name

# Remove a directory
rm -r directory_name


Command Execution

Running Commands

To run a command in Bash, simply type the name of the command followed by any necessary arguments.

command_name argument1 argument2


Command Output

Commands in Bash can output text to the screen. This output can be redirected to a file using the > operator.

# Redirect output to a file
command_name > output_file


Command Input

Commands in Bash can also take input from a file using the < operator.

# Redirect input from a file
command_name < input_file


Copying files and directories

cp

cp stands for copy and is used for copying files and directories within the same system.

#Copies a file, here file 1 gets copied into file2
cp <filename1> <filename2>

#Copying directories recursively
cp -R /directory/ /directory1/

To understand recursion, consider this example:

#you have a directory that looks like this
mydir/
|-- file1.txt
|-- subdir1/
|   |-- file2.txt
|   |-- file3.txt
|-- subdir2/
    |-- file4.txt
    |-- file5.txt

#you run this command
cp -R mydir/ mydir_copy/

#creating this
mydir_copy/
|-- file1.txt
|-- subdir1/
|   |-- file2.txt
|   |-- file3.txt
|-- subdir2/
    |-- file4.txt
    |-- file5.txt

This command creates a copy of the “mydir” directory and all its contents in a new directory named “mydir_copy” that will be identical to “mydir”.

We can also cp in different ways:

#Copy multiple files into another directory
cp file* /directory/subdirectory

#Copy all files in new directory unless they already exist
cp -u *.fasta newdir/

The cp -u command copies files from one location to another, but only if the source file is newer than the destination file or if the destination file does not exist. The -u option stands for “update”. In this case we are copying all fasta files unless they already exist

This is handy if you want to play it safe while copying:

#Backs up files
cp --backup <origfile> <newfile>

If a file with the same name already exists in the destination directory, cp will make a backup copy of that file before overwriting it with the contents of the source file. The backup copy will have the same name as the original file but with a tilde ~ character added to the end of the filename.

This can be useful if you want to make sure you have a backup of any files that might be overwritten during the copy process.

If you need multiple versions you can:

#Backs up files with numbering
cp --backup=numbered <origfile> <newfile>

scp

scp [secure copy] is used to copy files securely between remote systems. It encrypts the data during transfer and requires authentication with username and password or SSH key.

This is a common command for getting data to or from a cluster

#general form
scp <source_file_path> <destination_file_path>

#example
scp mydata.txt user@physalia.edu:/remote/directory/
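
The same form works in reverse when you need results back from the cluster; here is a quick sketch (the remote path and file name are hypothetical):

#example, copying from the cluster to your current directory
scp user@physalia.edu:/remote/directory/results.txt .
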
rsync

rsync is used for synchronizing files between local and remote systems or between two remote systems with advanced options like compression, bandwidth throttling, and only copying changes.

#general form
rsync [options] source destination

#example
rsync -avzP /path/to/files/ user@remotehost:/path/to/destination/

  • -v : verbose [print a line for each file that it transfers, including the name of the file, the size, the transfer speed, and other information]
  • -r : copies data recursively (but does not preserve timestamps and permissions while transferring data)
  • -a : archive mode. Archive mode copies files recursively and also preserves symbolic links, file permissions, user & group ownerships, and timestamps. This is useful for making a complete and exact copy of the source directory, including all its subdirectories and files.
  • -z : compress file data during transfer (gzip compression by default)
  • -h : human-readable, output numbers in a human-readable format
  • -P : enhances the progress display and enables the resumption of interrupted transfers

Alias commands to shorten typing

If you are like most programmers, you probably don’t want to type the same options over and over. You can use the alias command to shorten commands

#general form
alias command='command option'

#examples 
alias ls="ls -l"
alias cp="cp -i"
#this second example makes copying interactive, asking you if you truly want to copy something. This can be safer since you can check before accidentally overwriting files. You confirm by typing 'y' and hitting enter

If you want to remove your alias, simply

unalias ls
unalias cp

Note that these are stored in memory. Once you exit the shell, any aliases that you have defined will be lost. To make an alias permanent, add it to your shell configuration file (e.g., ~/.bashrc for Bash or ~/.zshrc for Zsh).
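
Here is a minimal sketch of making an alias permanent, assuming you are on Bash (zsh users would target ~/.zshrc instead):

#append the alias to your Bash configuration file, then reload it
echo "alias ll='ls -thor'" >> ~/.bashrc
source ~/.bashrc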

One last note on command history. Your terminal will remember the history of your commands, which is useful. You can view it with the history command. However, if you mess up and forget to close a quote or do something else that causes commands to fail, you can always clear your history. This is rare, but I wanted to include it here since it differs between Mac and Linux.

#view your history
history

#on linux you use -c to clear
history -c

#on mac you use -p to purge
history -p

Sequence files are essentially just large text files, and Bash happens to be a powerful tool for working with these file types. Bash has a variety of built-in commands and utilities that allow you to easily manipulate, search, and analyze text data. This allows you to use simple commands to perform complex operations quickly and efficiently, without the need for specialized software or programming knowledge.

In this section we will begin looking at core text manipulation operations.

Writing output to text files

Before we dive into text files, I just want to point out the way that bash prints output using the command echo. Echo is used to display a string of text. You will see this command again later.

# Display a string of text
echo "Scripting is easy!"

Using cat

The cat command is a very useful utility that can be used to concatenate and display the contents of one or more files.

Basic Usage

The basic syntax for using cat is as follows:

cat [OPTIONS] FILE...

where FILE is the name of one or more files that you want to concatenate and display.

For example, to display the contents of a file called sharkTracker.txt, you can use the following command

cat sharkTracker.txt

This will display the contents of the sharkTracker.txt file in your terminal.

Useful Options:

-n or --number
This option adds line numbers to the output. For example, if you want to display the contents of sharkTracker.txt with line numbers, you can use the following command:

cat -n sharkTracker.txt

If for some reason you only need non-blank lines (this comes up more often than you would expect as you get handed datasets), you can use -b or --number-nonblank

cat -b sharkTracker.txt

If you need to see line ends (these will be marked by a $), use -E or --show-ends

cat -E sharkTracker.txt

Likewise, -T or --show-tabs will show tabs. This can come in handy.

cat -T sharkTracker.txt

Using cat to create files

We are going to start putting pieces together here to see how we can use cat to accomplish core tasks. To begin we will make some files using cat.

You can use the cat command with the output redirection operator > to create a new file and write text to it. For example, you can create a new file called sharkTracker2.txt and write some lines of text to it with the following command:

cat > sharkTracker2.txt
#first enter the above command, then paste this in
Shark Species,Weight (lbs),Length (ft),Coastal Town,Date
Sandbar Shark,120,6.3,Wilmington,2022-06-01
Tiger Shark,430,10.2,Morehead City,2022-05-15
Dusky Shark,250,8.5,Nags Head,2022-07-23
Blacktip Shark,80,5.4,Hatteras,2022-04-12
Great White Shark,1600,18.2,Atlantic Beach,2022-08-05
Bull Shark,350,9.1,Beaufort,2022-06-18

As you can see, you can keep entering text as you hit enter. To save this output you need to hold Control and press D (Ctrl-D)

Using cat to append files

You can also use cat to append text to an existing file. To do this, you can use the >> operator. For example, to append the text “Hammerhead Shark,600,12.3,Swansboro,2022-05-01” to the sharkTracker2.txt file created in the previous example, you can use the following command:

cat >> sharkTracker2.txt
Hammerhead Shark,600,12.3,Swansboro,2022-05-01

After you run this command, you will again see a blank line waiting for you to input text. Paste in your text (without quotes) and press Enter. Then press Ctrl-D to append the text to the file.

Using cat to copy files

You can use cat to create a new file from the contents of an existing file. We will zoom in on this in a second, but just realize that by simply using the > operator with cat, you can accomplish the same thing as cp. For example, to create a new file called sharkcopy.txt with the contents of sharkTracker2.txt, you simply:

cat sharkTracker2.txt > sharkcopy.txt

This seems silly to do, but in the next section we will begin building on this concept to manipulate files. From there things will just get more and more powerful and you will rapidly feel like an informatics wizard!

Using cat to manipulate files

Here we will continue to use cat to explore ways we can subsample files, introducing a few more core functions along the way.

Isolating columns into new files

This is a common task. You have a big meta data file, but you just need one or a few columns of it. Here is how to use cat with the cut command and the pipe operator | to isolate parts of a file you need and write those into a new file

cut -d',' -f 1,4 sharkTracker2.txt | cat > sharkMap.txt

There is a lot going on here, let’s break this down.

  • cut -d',' -f 1,4 sharkTracker2.txt uses the cut command to extract the first and fourth columns of sharkTracker2.txt. The -d option specifies the delimiter (a comma in this case), and the -f option specifies which fields to extract (the first and fourth fields).
  • | is the pipe operator; when you see one of these, say the word “then”. It takes the first command (cut) THEN uses cat to make a new file with the results of that command. If you are familiar with R and the tidyverse, this is the same idea as the %>% operator.
  • cat > sharkMap.txt uses the output redirection operator > to create a new file called sharkMap.txt and write the output of the cut command to it.
  • So this code can be read as a sentence: cut out the first and fourth columns delimited by commas, THEN write those columns into a new file called sharkMap.txt

To view this file you can use cat

#view the file
cat sharkMap.txt

Heads or Tails?

If this file was thousands or millions of rows, you probably wouldn’t want to display the output. You can use the head command to isolate the number of rows of your choosing and display those. For example, if we want to spot check the first 3 rows we can

head -n 3 sharkMap.txt

We can also use the tail command to do the same for the bottom x rows

tail -n 3 sharkMap.txt

It’s a silly example here, but quite handy for checking log files from runs on clusters.

Putting it all together

If you are feeling creative at the moment, you may have thought of something. Can we combine cat, head, tail, and the pipe operator to isolate rows of text? The answer is yes! Let’s see this in action.

#cut out rows 2-5
cat sharkTracker2.txt | head -n 5 | tail -n +2 > rows2Through5.txt

Let’s break down what this command does:

  • cat sharkTracker2.txt reads the contents of sharkTracker2.txt and passes them as input to the next command using the | pipe operator. In human, this means take sharkTracker2.txt, THEN
  • head -n 5 selects the first 5 lines of the input (i.e., lines 1-5), THEN
  • tail -n +2 selects lines starting from line 2 of the input (giving us lines 2-5).
  • > rows2Through5.txt redirects the output of the pipeline to a new file called rows2Through5.txt. So when you run this command, it will create a new file called rows2Through5.txt that contains rows 2-5 of sharkTracker2.txt.

#View the file
cat rows2Through5.txt

In the next section we will build on this to look for specific patterns to isolate data. This is where things start to get powerful.

Grep is a powerful command-line tool used for searching text files for specific patterns. It allows users to search for regular expressions or strings in one or more files at once. With a variety of options and commands, grep is a versatile tool that can be used for a wide range of text search tasks.

In bioinformatics, grep is often used to search for specific DNA or protein sequences in large text files such as genome sequence files or sequence alignment results. It can also be used to search for specific patterns or motifs within sequences, or to filter out specific sequences based on certain criteria. Additionally, grep is often used in conjunction with other command-line tools to perform more complex analyses and tasks in bioinformatics.
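
As a small preview, here is a hedged sketch of one such task; sequences.fasta is a hypothetical file, and fasta header lines start with >:

#count the number of sequences in a fasta file by counting its header lines
grep -c '^>' sequences.fasta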

Basic Usage

The basic syntax for using grep is as follows:

#general syntax
grep [options] pattern [file(s)]

  • grep is the command to search for a pattern
  • [options] are optional flags that modify the behavior of grep
  • pattern is the regular expression or string to search for
  • [file(s)] are the file(s) to search in. If no file is specified, grep will read from standard input.

Let’s try an example of a simple search using the sharkTracker2.txt file we made

#The general pattern
grep "search string" filename

#with our shark file
grep "Hammerhead" sharkTracker2.txt

Running this second line retrieves the line that contains the word “Hammerhead”

You can also search for a pattern in multiple files by separating them with a space

#The general pattern
grep pattern file1.txt file2.txt

#example with sharks
grep "Hammerhead" sharkTracker2.txt sharkMap.txt

Options with grep

There are numerous options that grep can use. Here are some of the most common:

  • -i: Ignore case when matching
  • -v: Invert the match (print lines that don’t match the pattern)
  • -n: Print the line number of the matched line
  • -r: Recursively search subdirectories
  • -l: Print only the names of files that contain the pattern
  • -c: Print the count of matching lines

Ignoring case is handy, especially if you or your collaborators are prone to bumping the caps lock key

grep -i HAmmerHEAD sharkTracker2.txt

You can also retrieve everything but the pattern you are looking for. For example if we wanted all sharks that are not hammerheads

grep -v Hammerhead sharkTracker2.txt

You can get line numbers with the search

grep -n Hammerhead sharkTracker2.txt

This one is handy: print only the name(s) of the file(s) that contain a specific string

grep -l Hammerhead *

You can also count matches (this is useful if you need a quick count of reads in a fasta, more on that tomorrow)

grep -c Hammerhead sharkTracker2.txt

A couple more handy things

#To search for multiple patterns, list them separated by a pipe character (|)
grep  'Hammerhead\|Great' sharkTracker2.txt

#use regular expressions, this will return any shark at 12 lbs and any single decimal after
grep  12.'[0-9]' sharkTracker2.txt

#search for whole words
grep  -w 'Great White Shark' sharkTracker2.txt

#search in all files except 
grep  Great * --exclude=sharkTracker2.txt

As you can see, the list of things you can do with grep can go on and on. Rather than provide an exhaustive list of arguments, I’m going to focus the next section on common tasks you can use grep for that can come in pretty handy.

Working with grep

#To count instances of a word, note that -o extracts only the search pattern, not the entire line
grep -o 'word' input.txt | wc -l
#To count instances of a word separated by boundaries (example red verses hired)
grep -o '\bword\b' input.txt | wc -l

A quick note on tr

tr is a command in Bash that translates or deletes characters. It reads standard input and performs a set of translations based on the command line arguments, and then outputs the results to standard output.

The syntax for tr is as follows:

tr [OPTION]... SET1 [SET2]

SET1 specifies the set of characters to be translated, and SET2 specifies the replacement characters. To delete the characters in SET1 rather than translate them, use the -d option.

Here’s an example of how to use tr to replace all occurrences of the character a with the character b:

echo "banana" | tr 'a' 'b'

This will output bbnbnb, which is the original string with all occurrences of the character a replaced with the character b.
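
To delete characters rather than translate them, here is a quick sketch with the -d option:

#delete every occurrence of the character a
echo "banana" | tr -d 'a'

This outputs bnn.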

Sed (stream editor) is a powerful command-line tool for performing text processing tasks on large datasets. Sed operates by reading in a stream of text data, applying a series of text manipulation commands, and then outputting the modified data. It is particularly useful for batch processing of text files, such as those commonly encountered in bioinformatics.

In bioinformatics, sed is often used to manipulate large text files containing genomic or proteomic data. For example, sed can be used to extract specific fields from a tab-delimited file, to remove or replace certain characters from a text file, or to convert between different file formats. Sed can also be combined with other command-line tools such as awk and grep to perform more complex text processing tasks. Due to its speed and versatility, sed is an essential tool in the bioinformatics toolkit for working with large datasets.

Substituting with sed

#Basic usage
sed [options] 'command' filename

This is a little different from what we have seen so far, we now have options mixed with commands that will do something to a file. Let’s zoom into some core functions starting with s (substitute)

#Basic usage
sed s/pattern/replacement/flags filename

In this line we invoke sed, and ask it to find a pattern and replace it in a file. There are optional flags we can include to modify the behavior further. Let’s look at this.

#change shark to dolphin
sed 's/Shark/dolphin/' sharkTracker2.txt

#use regular expressions
sed 's/2022.*5/No data/' sharkTracker2.txt

A quick note on substituting versus deleting when ignoring case

#change shark to dolphin regardless of case using the i flag
sed 's/shark/dolphin/i' sharkTracker2.txt

#delete lines containing hammerheads, insensitive to case
sed '/hammer/Id' sharkTracker2.txt

Notice the case change above? If you use the i flag without any text to insert, sed doesn’t know what text to insert and therefore throws an error.

If we want to delete lines containing the word “hammer”, we need to use the I flag (uppercase), which tells sed to ignore case when searching for the pattern, rather than the i flag (lowercase) which is used to insert text.

We can also use substitute to remove leading whitespace

#replace leading white space
sed 's/^[ \t]*//' filename.txt

Let’s unpack this

  • s/ indicates a substitution command (i.e., replace one string with another)
  • the ^ symbol matches the start of a line (this is very useful for working with fasta files)
  • [ \t]* matches any number of spaces or tabs at the beginning of the line
  • the // indicates the replacement string is empty (i.e., delete the matched string, or replace with nothing depending on how you look at it)
  • the final ' ends the substitution command

Cool! Let’s do one last task

#replace shark species with just species in just first line by specifying the line number in front of the s
sed '1s/Shark Species/Species/g' sharkTracker2.txt

in the above example, I added the g flag. The g (global) flag is used to replace all occurrences of a pattern within a line, rather than just the first occurrence.

By default, sed replaces only the first occurrence of a pattern in each line it processes. The g flag is useful when you want to replace all occurrences of a pattern within a line. We don’t have that case here, but you will see this often and it is useful to point out. I also want to issue a warning here:
if you use an empty pattern (e.g., s//replacement/g), sed silently reuses the most recent regular expression rather than matching nothing. This can cause unexpected changes to your text.
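
Here is a quick sketch of the difference, using echo so we don't touch our data file:

#without g, only the first match on each line is replaced
echo "shark shark shark" | sed 's/shark/dolphin/'
#returns: dolphin shark shark

#with g, every match on each line is replaced
echo "shark shark shark" | sed 's/shark/dolphin/g'
#returns: dolphin dolphin dolphin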

Using sed to work with lines

Sed’s ability to search for patterns and perform modifications on targeted lines of text makes it an efficient tool for bioinformaticians looking to extract or modify specific fields within a file. This can be particularly useful for tasks such as parsing data from large sequencing files or formatting data for downstream analyses.

Let’s look at this with some simple examples

#print the third line
sed -n '3p' sharkTracker2.txt

This command prints the third line of the sharkTracker2.txt file. The -n option tells sed not to print anything by default, and the p command prints the specified line. The logic here is that you are limiting output to just the designated line.

#delete the third line
sed '3d' sharkTracker2.txt

This command prints the contents of sharkTracker2.txt with the third line deleted. The d command tells sed to delete the specified line (note that we drop the -n option here, since it would suppress all output).

We can also replace individual lines. Note that linux and mac do this slightly differently

#linux version
sed -i '2s/.*/This is the new second line/' sharkTracker2.txt
#mac version
sed -i '' '2s/.*/This is the new second line/' sharkTracker2.txt

The i command is for in place editing and requires an argument on a Mac, even if it’s just an empty string.

This command is handy and we can use it to insert lines above or below targets.

#linux for above
sed -i '3i\Spiny Dogfish,5,3,Avon,2022-07-14' sharkTracker2.txt

#mac  for above
sed -i '' '3i\
Basking Shark,40,2000,Topsail,2022-07-14\
' sharkTracker2.txt

#linux for below
sed -i '3a\Basking Shark,40,2000,Topsail,2022-07-14' sharkTracker2.txt

#mac  for below
sed -i '' '3a\
Basking Shark,40,2000,Topsail,2022-07-14\
' sharkTracker2.txt

Note that on a mac you need the backslash followed by an enter. This is because the macOS version of sed is based on BSD (Berkeley Software Distribution) sed, which requires the newline character to be escaped with a backslash in order to continue the command on the next line.

In contrast, the GNU version of sed (found on most Linux systems) allows you to continue a command onto the next line by simply placing a backslash at the end of the line, without requiring a newline character.

We can also find lines similar to grep

#find the dogfish, regardless of case with the I flag
sed -n '/doGfIsh/I =' sharkTracker2.txt

How many lines are in this file anyway?

lines=$(sed -n '$=' sharkTracker2.txt)
echo $lines

I warned you echo would start to come back. This is a soft introduction to content we will see later. When you enclose a command within $() in bash, it runs that command and captures its output. In this case, the command being run is sed -n ‘$=’ sharkTracker2.txt, which outputs the number of lines in the file sharkTracker2.txt.

One last note, we can use the -e (expression) to chain commands

#on linux
sed -e '1i My Shark Data' -e 's/$/,/' sharkTracker2.txt

#on a mac
sed -i '' -e '1i\
My Shark Data' -e 's/$/,/' sharkTracker2.txt

This allows you to specify multiple sed commands to be executed on the same input file, with each command separated by the -e option. This option is useful when you want to execute multiple sed commands on the same input file without having to create multiple temporary files. In this example we added a terrible header and also added a comma to the end of each line

To remove annoying things like commas at the end of lines, simply

#on linux
sed -i 's/,$//' sharkTracker2.txt

#on mac
sed -i '' 's/,$//' sharkTracker2.txt

Variables are an essential part of command-line computing in bioinformatics. They allow users to store and manipulate data efficiently, automate repetitive tasks, and avoid the need to retype lengthy commands repeatedly. Variables are often used to store file paths, program parameters, or other data that need to be passed to a program or script.

Using variables in the command line requires defining the variable and assigning it a value. We will work with variables here to keep building our foundation as proper use of variables can streamline workflows and make data analysis more efficient.

In Bash, variables are defined using the syntax “variable_name=value”, where variable_name is the name of the variable, and value is the value to be assigned to the variable (note that there can be no spaces around the =). This should be familiar if you code in other languages and is the equivalent of “variable_name<-value” in R (in case that helps).

Once a variable is defined, it can be used in subsequent commands by enclosing the variable name in “$” and using it as an argument or parameter in the command. Let’s explore this now

Working with variables

Variables are used to store values that can be used later on. These values can be used in commands, scripts, or even in other variables. Here’s an example of a simple variable in the command line

#make a variable and print the output
MY_VARIABLE="Watch out for titan triggerfish\!"
echo $MY_VARIABLE

In this example, we create a variable called MY_VARIABLE and assign it the value “Watch out for titan triggerfish!”. We then use the echo command to display the value of the variable. The $ symbol is used to reference the value of the variable.

Note that in the above, we need to escape the ! by adding a backslash. Remember this if you have variables with special characters and suddenly you are prompted by dquote after trying to define them.

Here are a few more examples of how variables can be used in the command line

# Create a variable with a number
MY_NUMBER=10

# Use the variable in a command
echo "The number is $MY_NUMBER"

# Create a variable with a filename
MY_FILE="emptyfile.txt"

# Use the variable to create a new file
touch $MY_FILE

# Create a variable with a directory path
MY_DIR="/Users/alexdornburg/Documents/UCE_Workshop/Day1/Roadwork"

# Use the variable to navigate to the directory
cd $MY_DIR

Using variables with grep and sed

Search for a pattern using grep and a variable:

SEARCH_PATTERN="Shark"
grep $SEARCH_PATTERN sharkTracker2.txt

#Search for a pattern using a variable and ignore case
SEARCH_PATTERN="sHark"
grep -i $SEARCH_PATTERN sharkTracker2.txt

Replace a pattern using sed and a variable:

OLD_PATTERN="Spiny"
NEW_PATTERN="Smooth"
sed "s/$OLD_PATTERN/$NEW_PATTERN/g" sharkTracker2.txt

#Replace a pattern using a variable and only change the first occurrence
OLD_PATTERN="Shark"
NEW_PATTERN="Dolphin"
#on linux
sed "0,/$OLD_PATTERN/s//$NEW_PATTERN/" sharkTracker2.txt

#on mac
sed -i '' "1s/$OLD_PATTERN/$NEW_PATTERN/" sharkTracker2.txt

In this example, the sed command replaces only the first occurrence of the value of the OLD_PATTERN variable with the value of the NEW_PATTERN variable in sharkTracker2.txt. The 0,/old_pattern/ range specifies the first occurrence, and the s// syntax is shorthand for reusing the same pattern.

Note that the 0,/pattern/ address range is not supported by all versions of sed. On a mac, you can use the second command. In this command, we are specifying only the first instance of the word at the beginning of the sed pattern. In general, if you are on a mac and things aren’t working with sed, be prepared to look on stack overflow for the equivalent command on osx. As a note of caution, chatgpt has trouble with sed at the time of this writing, and will confidently return commands that do not work.

Using variables with commands

You can use variables in commands by enclosing them in curly braces and prefixing them with a dollar sign

#Using variables in commands
file_name="emptyfile.txt"
touch ${file_name}

This will create a new file called emptyfile.txt using the touch command. You can ls to see it in your directory

You can use the output of a command as the value of a variable by enclosing the command in $()

#Using variables to capture command output
date_today=$(date +"%Y-%m-%d")
echo "Today's date is ${date_today}"

This will output today’s date in the format YYYY-MM-DD. To break this down:

  • The date command prints the current date and time in the default format: Wed Mar 1 15:20:26 EST 2023
  • The + option allows you to specify a custom output format for the date command. In this case, we’re using the format string %Y-%m-%d, which represents the year, month, and day in the format YYYY-MM-DD.
  • The $() syntax is command substitution. This means that the output of the date command is captured and used as the value of the date_today variable.

So, when you run the command date_today=$(date +"%Y-%m-%d"), the current date is captured and stored as a string in the date_today variable, which we then display with echo. Neat!

Using variables as command line arguments

Command line arguments are a way to pass inputs to a command or script when it is executed. When you run a command or script with arguments, they are automatically assigned to variables called positional parameters. The first argument is assigned to $1, the second to $2, and so on.

We haven’t made official scripts yet (we will shortly), but let’s make one using cat (remember to press Ctrl-D when you are done entering text)

cat > hello.sh
#!/bin/bash

echo "Hello, $1! How are you today?"

The #!/bin/bash is called a “shebang” or “hashbang,” and it’s the first line in many Bash scripts. It’s a directive to the shell to use the Bash interpreter to execute the script. The #! characters tell the system that what follows is the interpreter to use, and /bin/bash specifies the path to the Bash executable. Essentially, it ensures that the script is interpreted correctly by the right shell and that it’s executed as a Bash script. Without the shebang, the script might not run, or it might run with the wrong interpreter. The shebang line is an essential component of most Bash scripts and is often used to make sure that the script runs consistently across different systems.

The rest of the script looks familiar except the $1. Lets see what happens when we execute it.

#we first set the permission for the script to execute (more on this later)
chmod +x hello.sh

#then run it
./hello.sh Bob

In this example, the $1 variable is used to access the first command line argument, which is the name of the person being greeted. If you wanted to include more arguments, you would use $2,$3, and so on to access them.

Let’s do one more example to combine some concepts

cat > lameCalculator.sh
#!/bin/bash

sum=$(expr $1 + $2)
echo "The sum of $1 and $2 is $sum"

now set permission and run

#we first set the permission for the script to execute (more on this later)
chmod +x lameCalculator.sh

#then run it
./lameCalculator.sh 5633426784  47246738246

Pretty easy. However, there are a few things to remember about reserved numbers and arguments numbered 10 and up

echo $0      # prints the script name
echo $1      # prints the first argument
echo $2      # prints the second argument
echo $9      # prints the ninth argument
echo $10     # prints the first argument, followed by 0 
echo ${10}   # prints the tenth argument
echo $#      # prints the number of arguments

We will level up when we play with sequence read data and use these concepts to accomplish more complex tasks. You will hopefully notice that many of these tasks collapse to simple examples like this with a few more steps and commands layered on top.

Arrays, a way to store multiple elements

An array is a variable that can hold multiple values. Each value in the array is assigned a unique index, starting from 0.

Arrays in Bash can be of two types: indexed arrays and associative arrays.

An indexed array is a simple list of values, where each value is associated with an index number starting from 0. To declare an indexed array in Bash, you can use the following syntax:

my_array=(value1 value2 value3 ...)

so an array of sharks may look like

sharks=('Bull Shark' 'Tiger Shark' 'Blue Shark')

Note the quotes here since these names have spaces in them. Another important note for everything I am about to cover:

bash array indexing starts at 0 (always)
zsh array indexing starts at 1

I am writing this while using zsh. Therefore my examples may not match your shell. For zsh my first element is at position one. If you are on Bash, you will access the same element at position 0. That means, if these examples don’t match, simply subtract 1.

Working with index positions

#reference  an index
echo "${sharks[1]}"

#count elements in an array
echo ${#sharks[@]} 
# the [@] part tells Bash to treat the array as a whole, rather than a single variable, and the # operator returns the length of the array. In zsh you can omit this

#copy a subset of an array in bash (note this counts from zero)
miniShark=("${sharks[@]:1:2}")
echo "${miniShark[@]}"

In this example, we’re using the “${array_name[@]:offset:length}” syntax to create a slice of the sharks array.

  • offset is the starting index of the slice (in this case, 1)
  • length is the number of elements to include in the slice (in this case, 2)
  • The resulting slice is then used to populate the new miniShark array.

Note that array slicing in Bash uses a different syntax than ranges. In Bash, you can use the ${array_name[@]:offset:length} syntax to create a slice of an array, where offset is the starting index and length is the number of elements to include.

#copy a subset of an array in zsh
miniShark=(${sharks[from,to]})

# example
miniShark=(${sharks[2,3]})
echo "${miniShark[@]}"

#add to an array
miniShark+=('Hammerhead Shark' 'Goblin Shark')
echo "${miniShark[@]}" 
#note above in zsh you can just echo $miniShark to see everything

#slice after an index
miniMiniShark=${miniShark:2}
echo $miniMiniShark

We can also remove an element from an array (in zsh)

sharks=('Bull Shark' 'Tiger Shark' 'Blue Shark')
sharks[1]=()  
echo $sharks

An associative array is a type of array where the index is not limited to numbers, but can be any string. To declare an associative array in Bash, you can use the following syntax:

declare -A my_array
my_array[key1]=value1
my_array[key2]=value2
my_array[key3]=value3

For example, to declare an associative array that maps pokemon names to their main colors, you can use the following code:

declare -A pokemon_colors=(
["pikachu"]="#F7D02C"
["snorlax"]="#8BBE8A"
["charmander"]="#F7786B"
)

This creates an array where the keys are the names of the pokemon and the values are their colors. (if you use python this will be familiar)

To access a value of an associative array you can use this

#basic syntax
${array_name[key]}

#example
echo ${pokemon_colors[pikachu]}

# Add a new element to the array
pokemon_colors["Squirtle"]="#7FC8D1"

Conditional statements are a fundamental concept in programming and are used to create branching logic in code. They allow a program to make decisions based on certain conditions and execute different code paths accordingly.

In simple terms, a conditional statement evaluates a Boolean expression, which is a statement that is either true or false. If the Boolean expression is true, the program executes a specific block of code, and if it is false, the program may execute a different block of code, or continue with the rest of the program.

Conditional statements are essential for creating dynamic programs that can respond to user input or changing conditions in the program environment. They enable programmers to create complex decision-making structures and automate repetitive tasks.

Basics

The most basic conditional statement in Bash is the if statement, which allows you to execute a block of code if a condition is true. The general form works like this

if [ condition ]
then
    # code to run if condition is true
fi

The if statement starts with the if keyword, followed by an opening square bracket ([), which is shorthand for the test command. The test command evaluates the condition enclosed in the brackets and returns either a true or false value. The condition is closed with a matching bracket (]), followed by the then keyword. If the condition is true, the code block following then will be executed.

The code block to execute if the condition is true is enclosed between then and fi. The fi keyword marks the end of the if statement and closes the code block.

It’s important to note that the condition inside the square brackets can be a comparison between variables, an expression, or the result of a command. Bash supports a wide range of operators to build the condition, such as == for string comparison, -eq for integer comparison, and -f to check if a file exists. We will meet some of these in a moment, first an example with our budding scripting skills

cat > ifStatement.sh

#!/bin/bash
sum=$(expr $1 + $2)

if [ $sum -gt 10 ]
then
    echo "$sum is greater than 10."
fi

Let’s run this, then talk about it

#we first set the permission for the script to execute
chmod +x ifStatement.sh

#then run it
./ifStatement.sh 2  1

#nothing happened, how about now
./ifStatement.sh 9  7

This rehashes our lameCalculator from the previous page and calculates the sum of the two arguments passed to the script and stores the result in a variable called sum. The expr command is used to evaluate the arithmetic expression, and $1 and $2 are the first and second positional parameters (i.e., the two arguments passed to the script).

The if statement checks if the value of sum is greater than 10. If the condition is true, the script prints a message to the console using the echo command. If the condition is false, nothing happens.

Not returning anything the first time is not desirable. Luckily we can use an else command that allows us to do something if the condition being tested is not true. Let’s see this in action.

#can you figure out what this does based on the previous page?
sed -i '' '/echo/a\
  else\
  echo "The $sum is less than or equal to 10."\
  ' ifStatement.sh

The above is just modifying our ifStatement.sh script to add a new line. If you don’t understand this, go back to review the sed section. Note this command is for a Mac; you may need to modify it slightly if you are on a Linux machine. This is more for review, so use the file below rather than troubleshooting for your specific sed. What we are making is a file that looks like this:

#!/bin/bash
sum=$(expr $1 + $2)

if [ $sum -gt 10 ]
then
    echo "$sum is greater than 10."
  else
  echo  "The argument is less than or equal to 10."
  fi

The above statement builds on our previous example. Here, the condition tests the sum that was calculated. If the sum is greater than 10, the condition returns true, and the first command is executed. Otherwise, the second command is executed. Let’s see it in action

./ifStatement.sh 2  1

Neat, now we can make a basic comparison!

Arithmetic comparisons

Often we need to make decisions based on a value. Here are arithmetic comparisons you will commonly see

#equal to
if [ 10 -eq 10 ]
then
    echo "10 is equal to 10"
fi

#not equal to
if [ 20 -ne 10 ]
then
    echo "20 is not equal to 10"
fi

#less than
if [ 5 -lt 10 ]
then
    echo "5 is less than 10"
fi

#less than or equal to
if [ 10 -le 10 ]
then
    echo "10 is less than or equal to 10"
fi

#greater than 
if [ 20 -gt 10 ]
then
    echo "20 is greater than 10"
fi

#greater than or equal to
if [ 10 -ge 10 ]
then
    echo "10 is greater than or equal to 10"
fi

We can also use a lot of other comparisons.

More advanced if statements

You can use regular expressions in an if statement to perform more complex string tests.

statement="hello world"
if [[ $statement =~ ^hello ]]
then
    echo "String starts with hello"
fi

You can use command substitution in an if statement to execute a command and use its output in the test.

if [ $(whoami) = "root" ]
then
    echo "You are logged in as root"
else
    echo "You are not logged in as root"
fi

In this example, the script uses the $(whoami) command to get the current user’s username, and checks if it’s equal to “root”. If it is, it prints “You are logged in as root”. Otherwise, it prints “You are not logged in as root”.

We can also use an If statement with exit status

file=sharkTracker2.txt
string="hello"
if grep -q $string $file
then
    echo "$string is in $file"
else
    echo "$string is not in $file"
fi

In this example, the script uses the grep command to search for a string in a given file. -q suppresses the output and only returns the exit status. The exit status of 0 means that grep found the pattern. If the exit status is 1, it means that grep did not find the pattern.

We use the exit status of grep in shell scripts to make decisions based on whether a pattern was found or not, in our case to check if a file contains a certain string (you may also sometimes see non-zero exit status as an error, now you know what that means!)

Test and &&

It’s important to note that while if statements are a common way to perform conditional checks in Bash, they are not the only way, and sometimes other constructs may be more appropriate or more readable for a particular situation.

The test command (also known as [) can be used to perform simple conditional checks. Here’s an example that checks if a file named sharkTracker2.txt exists

test -e sharkTracker2.txt && echo "File exists"

  • The -e option to the test command checks if the specified file exists. If it does, the test command returns a status code of 0, which means the check succeeded. If the file does not exist, test returns a status code of 1, which means the check failed.
  • When you use the && operator, it checks if the exit status of the previous command was 0. If the exit status was 0 (which means the previous command succeeded), then the next command after && is executed. If the exit status was not 0 (which means the previous command failed), the next command is not executed.
  • you can use the && operator to run a command only if the previous command succeeded. This is useful for creating simple conditional checks in Bash scripts.

The test command can be combined with logical operators to perform more complex conditional checks. Here’s an example that checks if the file sharkTracker2.txt exists and is readable (-r):

test -r sharkTracker2.txt && echo "File exists and is readable"

We can also test if a file is executable

test -x lameCalculator.sh && echo "File exists and is executable"
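
Along the same lines, -f checks that a path exists and is a regular file (rather than, say, a directory):

test -f sharkTracker2.txt && echo "File exists and is a regular file"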

Case

The case statement can be used to perform multiple conditional checks in a more structured way than using multiple if statements. Here’s an example that checks if a variable named color is set to “red”, “green”, or “blue”:

color="blue"

case $color in
  red)
    echo "The color is red"
    ;;
  green)
    echo "The color is green"
    ;;
  blue)
    echo "The color is blue"
    ;;
  *)
    echo "The color is not red, green, or blue"
    ;;
esac

Here’s how the syntax works:

  • case variable in: This line starts the case block and specifies the variable to match against patterns.

  • red): This line specifies the first pattern to match against. If the value of the variable matches red, the code block below it will be executed.

  • code to execute if variable is red: echo “The color is red”.

  • ;;: This double semicolon tells Bash to exit the case block and continue with the next line of code in the script.

  • the pattern repeats for the others

  • *): This line specifies a default case to handle if the value of the variable doesn’t match any of the patterns above (captures any other pattern), and returns the next statement

  • esac: This line ends the case block. It took me forever to notice this, but esac is just case spelled backwards. If you forget to close the case block you will get an error.

Command substitution and regular expressions

Just like with other things we can also use command substitution. Here’s an example that checks if a file named emptyfile.txt exists and has a size greater than zero:

test $(wc -c < emptyfile.txt) -gt 0 && echo "File exists and is not empty" || echo "This file either does not exist or is empty"

|| is a logical operator that represents the “OR” operation. It’s used to execute a command or a block of commands only if the previous command failed (i.e., returned a non-zero exit status). Pretty cool if you think about it; let’s unpack this for a second

  • $(wc -c < emptyfile.txt) calculates the size of the “emptyfile.txt” in bytes. The wc command with the -c option counts the number of bytes in the input, and < emptyfile.txt redirects the input to come from the file “emptyfile.txt”.

  • test $(wc -c < emptyfile.txt) -gt 0 tests whether the output of the previous command (the size of “emptyfile.txt”) is greater than zero. The test command with the -gt (greater than) operator returns a true (0) exit status if the left-hand side is greater than the right-hand side.

  • If the previous command returns a true exit status (i.e., the file exists and is not empty), && echo “File exists and is not empty” executes the echo command, which prints the message “File exists and is not empty” to the console.

  • If the previous command returns a false exit status (i.e., the file does not exist or is empty), || echo “This file either does not exist or is empty” executes the echo command, which prints the message “This file either does not exist or is empty” to the console.

We can also use regular expressions with conditionals like this:

string="I saw a shark swimming in the ocean."

if [[ $string =~ ^"shark" ]]; then
    echo "Match found"
else
    echo "No match found"
fi  

In this script, we’re using the [[ ]] operator to test whether the variable $string starts with the substring “shark”. If it does, the script prints “Match found”. Otherwise, it prints “No match found” (which is what happens here, since our sentence starts with “I saw”).

  • the [[ … ]] operator is used for conditional expressions, which are used to test whether a particular condition is true or false.

  • The =~ operator is a regular expression matching operator that allows us to test whether a string matches a given regular expression.

One last bit, a few conditionals you can use with arrays

Check if an array is empty

if [[ -z $VARNAME ]]; then
    echo "Empty"
else
    echo "Not Empty"
fi  

Check if a value is contained in an array (note that the (Ie) subscript flag here is zsh syntax)

if (( $VARNAME[(Ie)value] )); then
    echo "Value Present"
else
    echo "Value Not Present"
fi  
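
The (Ie) flag will not work in Bash; here is a minimal sketch of the same check using a loop instead (loops are covered on the next page):

value="Tiger Shark"
found=0
for s in "${sharks[@]}"; do
    if [[ $s == "$value" ]]; then
        found=1
    fi
done
if [ $found -eq 1 ]; then
    echo "Value Present"
else
    echo "Value Not Present"
fi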

Check if an array is not empty

if [[ ! -z $VARNAME ]]; then
    echo "Not Empty"
else
    echo "Empty"
fi  

We’re getting close to putting it all together. Let’s look at loops next.

In this section, we’ll be covering the basics of loops in Bash, including for, while, and until loops.

Loops are an essential part of any programming language, and Bash is no exception. Loops allow us to execute a block of code repeatedly, based on a particular condition.

For Loops

A for loop is used to execute a block of code for a fixed number of times, or for each item in a list. The basic syntax of a for loop is as follows:

#basic syntax
for item in list; do
    # code to be executed
done

In this syntax, item is a variable that takes on each value in list, and the code inside the loop is executed for each value.

Here’s an example of a for loop that prints the numbers 1 to 5

for i in {1..5}; do
    echo $i
done

Let’s break down this code:

  • The for keyword indicates the start of a for loop.
  • i is the variable that takes on each value in the list.
  • {1..5} is a Bash shorthand for a list of numbers from 1 to 5. Curly braces can make ranges using this format. For example {1..10} is the range from 1 to 10. There are other uses for curly braces too, but we will see one in the next example.
  • do indicates the start of the code block to be executed.
  • echo $i is the code to be executed for each value of i.
  • done indicates the end of the loop.

When you run this code, it will output the numbers 1 to 5, each on a separate line.

We can also use a for loop to iterate over an array a variable that can hold multiple values in bash like this:

shark=("tiger" "bull" "sandbar" "blue")
for shark in "${shark[@]}"; do
    echo $shark
done

In this example, sharks is an array that contains four sharks. The for loop iterates over each shark in the array, and the code inside the loop (echo $shark) is executed for each one. When you run this code, it will output each shark on a separate line.

My favorite use of for loops is iterating over all the files in a directory. This is a very common task when dealing with sequence files. Here is a basic example

#This returns the full path
for file in /Users/alexdornburg/Documents/UCE_Workshop/Day1/*; do
    echo $file
done

#You can do this for your current directory
for file in *; do
    echo $file
done

This seems basic, but if you start thinking about all the previous examples this is like getting the keys to the kingdom. You now have the core foundation to begin manipulating thousands of files with just a few lines of code!
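For instance, here is a hypothetical sketch that reports the number of reads in every FASTQ file in the current directory (it assumes such files exist, and relies on each read occupying exactly four lines):

for file in *.fastq; do
    # wc -l counts lines; dividing by 4 gives the read count
    reads=$(( $(wc -l < "$file") / 4 ))
    echo "$file contains $reads reads"
done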

You can also use a for loop to iterate over a range of numbers. Here’s an example:

for (( i=1; i<=5; i++ )); do
    echo $i
done

In this example, the for loop uses a C-style loop to iterate over the numbers 1 to 5. The ((…)) syntax is used to evaluate arithmetic expressions. The i variable is initialized to 1, and the loop continues as long as i is less than or equal to 5. The i++ statement increments i by 1 at the end of each iteration. The code inside the loop (echo $i) is executed for each value of i. When you run this code, it will output the numbers 1 to 5, each on a separate line.

We can also loop over key value pairs like this

#bring back our pokemon example 
declare -A pokemon_colors=(
["Pikachu"]="#F7D02C"
["Snorlax"]="#8BBE8A"
["Charmander"]="#F7786B"
)
#loop over it
for key in "${!pokemon_colors[@]}"; do
    echo "$key : ${pokemon_colors[$key]}"
done

Here’s what each part of the loop does:

  • for key in “${!pokemon_colors[@]}”; starts the loop and defines the variable key, which will hold each key of the pokemon_colors array in turn. The “${!pokemon_colors[@]}” syntax (note the !) expands to the list of keys in an associative array, which is how Bash lets you iterate over one.
  • do marks the beginning of the loop body.
  • echo “$key : ${pokemon_colors[$key]}” looks up the value stored under the current key and prints the key-value pair, separated by a colon and a space.
  • done marks the end of the loop body.

So the entire loop will iterate over the pokemon_colors associative array and print out each key-value pair in the format key : value. (Note that Bash does not guarantee the order in which the keys come back.)

While loops

A while loop is a control flow statement in Bash that allows you to execute a block of code repeatedly as long as a certain condition is true. It is useful when you need to repeat a task multiple times until a specific condition is met. The while loop checks the condition at the beginning of each iteration and will continue to loop as long as the condition remains true.

The basic syntax of a while loop is:

while [ condition ]
do
    # code to be executed
done
  • while is the keyword that starts the loop.
  • [ condition ] is the expression that is evaluated at the beginning of each loop iteration. If the condition is true, the loop body will be executed. If it is false, the loop will exit.
  • do marks the beginning of the loop body.
  • code to be executed is the block of code that will be executed repeatedly as long as the condition is true.
  • done marks the end of the loop.

Let’s start with a simple example that uses a while loop to count from 1 to 5 and print each number on a new line.

counter=1
while [ $counter -le 5 ]
do
    echo $counter
    ((counter++))
done

In this example, we set the initial value of the counter variable to 1. Then we use a while loop to check if the value of counter is less than or equal to 5. If it is, we print the value of counter and increment it by 1. The loop will repeat until counter is no longer less than or equal to 5.

IMPORTANT: If you do not increment the counter, the loop will be infinite

It is not hard to accidentally generate an infinite loop while working on informatics tasks. Be careful, or you can fill your hard drive with erroneous multi-TB files!
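One defensive habit worth adopting: give long-running while loops an explicit upper bound, so a logic mistake cannot run forever. A minimal sketch:

counter=1
max_iterations=1000000   # safety net: the loop cannot run past this

while [ $counter -le 5 ] && [ $counter -le $max_iterations ]
do
    echo $counter
    ((counter++))
done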

In this next example, we’ll use a while loop to read a file line by line and print each line.

file="sharkTracker2.txt"
while read line
do
    echo "$line"
done < $file

In this example, we use the read command inside a while loop to read a file line by line. We store each line in the line variable and print it out using the echo command. The < $file syntax redirects the input of the loop to come from the file named in $file (here, sharkTracker2.txt).
This is unusual compared to other languages like R: the done keyword indicates the end of the loop body, and the < character performs input redirection, which means the loop takes its input from a file instead of from standard input.
So, when we use the syntax done < $file, it tells Bash to take the input for the loop from the file specified by $file instead of from standard input.
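As an aside, a slightly more defensive form of this loop is common when lines may contain leading whitespace or backslashes; here is a sketch of that variant:

file="sharkTracker2.txt"
count=0
while IFS= read -r line
do
    ((count++))
    # IFS= preserves leading/trailing whitespace; -r keeps backslashes literal
    echo "Line $count: $line"
done < "$file"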

We can also evaluate based on multiple conditions. For example:

counter=1
sum=0
while [ $counter -le 10 ] && [ $sum -lt 50 ]
do
    echo "Adding $counter to the sum"
    ((sum += counter))
    ((counter++))
done
echo "The final sum is: $sum"

In this example, we have two conditions for the while loop. The first condition checks if counter is less than or equal to 10, and the second checks if sum is less than 50. If both conditions are true, we print a message indicating that we’re adding the current value of counter to the sum, add it to sum, and then increment counter. As soon as either condition is false, the loop exits. Once the loop is finished, we print the final value of sum.

The continue keyword, which skips the rest of the current iteration, is also handy:

counter=1
while [ $counter -le 10 ]
do
    if [ $((counter % 2)) -eq 0 ]
    then
        ((counter++))
        continue
    fi
    echo $counter
    ((counter++))
done
  • In this example, we use an if statement inside the loop to check if the current value of counter is even by using the modulo operator % to get the remainder when dividing by 2. If the remainder is 0, then the number is even. -eq is the equality comparison operator.
  • If the current value of $counter is even, then we increment the value of $counter by 1 and use the continue keyword to skip the rest of the loop body and start the next iteration of the loop.
  • This is accomplished by marking the end of the if statement with fi (yes, that is just if backwards; if you facepalmed here, I don’t blame you).
  • If the value of counter is odd, we print it out using the echo command and update the counter.
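continue has a sibling keyword, break, which exits the loop entirely rather than skipping to the next iteration. A minimal sketch:

counter=1
while [ $counter -le 10 ]
do
    if [ $counter -eq 4 ]
    then
        break   # leave the loop completely once counter reaches 4
    fi
    echo $counter
    ((counter++))
done
# prints 1 2 3 and then stops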

Until loops

An until loop is similar to a while loop in Bash, except that it continues looping until a specified condition is true, rather than looping while a condition is true. In other words, the loop body will continue to execute as long as the condition is false. Once the condition becomes true, the loop will terminate.

The syntax of an until loop is as follows:

until [ condition ]
do
    # statements
done

Here, condition is the condition that is checked at the beginning of each iteration of the loop. If condition is false, then the loop body is executed. Once the loop body has executed, condition is checked again, and the loop continues until condition becomes true.

Now let’s move on to the examples:

# a basic example
count=0

until [ $count -eq 5 ]
do
    echo "Count is $count"
    ((count++))
done

In this example, the loop continues to execute until the $count variable is equal to 5. Inside the loop body, we print out the current value of $count and then increment it by 1 using the ((count++)) syntax.

Here’s a variation of our even/odd example, rewritten with until

count=0

until [ $count -eq 10 ]
do
    if [ $((count % 2)) -eq 0 ]
    then
        echo "$count is even"
    else
        echo "$count is odd"
    fi

    ((count++))
done

In this example, we print out whether each number from 0 to 9 is even or odd. We use an if statement inside the loop body to check if the current value of $count is even or odd, and print out the appropriate message.

Here is a variation of the example that accumulates a running sum

count=0
sum=0

until [ $count -eq 10 ]
do
    ((sum+=count))
    ((count++))
done

echo "The sum of the first 10 numbers is: $sum"

In this example, we use an until loop to calculate the sum of the first 10 numbers (0 through 9). Inside the loop body, we add the current value of $count to the running total in $sum, and then increment $count by 1. Once the loop is finished, we print out the final value of $sum.

We can even count up and down using until

count=1

until [ $count -eq 11 ]
do
    if [ $count -lt 6 ]
    then
        echo "Count is $count"
    else
        echo "Count is $(expr 11 - $count)"
    fi

    ((count++))
done

In this example, we print out a sequence of numbers that goes from 1 up to 5 and then back down to 1. We use an if statement inside the loop body to print out the current value of $count if it is less than 6, and the value of 11 - $count if it is greater than or equal to 6. This creates a sequence that starts at 1, climbs to 5, and then counts back down to 1 (note that 5 is printed twice, once on the way up and once on the way down).

As with while loops, remember that a loop can continue to execute until the specified condition becomes true, so be sure to use an appropriate condition to prevent the loop from becoming infinite!

That is it for loops, let’s move on to functions!

In this section, we’ll be looking at how to create functions. A function is a set of instructions that perform a specific task, and you can define and use functions just like you would in any other programming language. Functions can be particularly useful when you need to perform a task repeatedly or as part of a larger script. We will start with basic functions and then use this as a chance to build functions that do the core commands we have learned so far as a chance to review.

The syntax for defining a function in Bash is as follows:

function_name () {
    # statements
}

Here, function_name is the name of the function, and statements are the instructions that make up the function body.

To call a function, you simply use its name followed by any arguments you want to pass to the function. For example:

function_name arg1 arg2  

Let’s create a simple function that prints a message and run it

# Define a function that prints a message
cheer () {
    echo "You are doing great\!"
}

# Call the function
cheer  

This function is simple and takes no arguments; if we type cheer on the command line, we get the message.

We can also make another function that prints 5 random numbers

randomNumbers () {
    for i in {1..5}
    do
        echo $RANDOM
    done
}

randomNumbers

We haven’t seen this yet, but $RANDOM returns a random number between 0 and 32767 (which feels oddly specific until you notice that 32767 is 2^15 - 1, the largest signed 16-bit integer). RANDOM is actually quite useful for setting a seed for randomly subsampling reads and other tasks.
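Two tricks with RANDOM are worth knowing: the modulo operator squeezes it into a range, and assigning a value to RANDOM seeds the generator so the stream is reproducible. A quick sketch:

# a random integer from 1 to 100
echo $(( RANDOM % 100 + 1 ))

# assigning to RANDOM seeds it, making the sequence reproducible
RANDOM=42
echo $RANDOM $RANDOM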

The funniest use of RANDOM I have seen was in a function for an infinite loop that prints random numbers and text snippets on your terminal to look like a 90s movie hacker’s screen.

Here is a function with a conditional to evaluate if something is even or odd

# Define a function that checks if a number is even
is_even () {
    if [ $(($1 % 2)) -eq 0 ]
    then
        echo "$1 is even"
    else
        echo "$1 is odd"
    fi
}

# Call the function with an even number
is_even 4

# Call the function with an odd number
is_even 7

Here we define a function called is_even that checks if a given number is even or odd using an if statement. We then call the function with an even number (4) and again with an odd number (7).

Let’s make a function with three arguments that uses sed to replace words in a file

# Define a function (for osx) that replaces one word with another in a file using sed. On linux you need to modify this
replacer () {
    echo "Replacing $2 with $3 in the file $1" 
    sed -i '' "s/$2/$3/g" "$1"
}

# Call the function to replace Shark with Dolphin in sharkTracker2.txt
replacer sharkTracker2.txt Shark Dolphin

#view the file
cat sharkTracker2.txt

How cool is that!
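Since the comment above mentions Linux: GNU sed’s -i flag does not take the empty-string argument that BSD/macOS sed requires, so a Linux version of the same function would look like this (a sketch under that assumption):

# Linux (GNU sed) version of the same function
replacer () {
    echo "Replacing $2 with $3 in the file $1"
    sed -i "s/$2/$3/g" "$1"
}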

We can also use grep inside of functions. Let’s look for a pattern inside a text file

# Define a function that searches for a pattern in a file and returns the line number and line while also ignoring case
searchFile () {
    grep -ni "$1" "$2"
}

# Call the function to search for "great White" (case-insensitive) in sharkTracker2.txt
searchFile "great White" "sharkTracker2.txt"

We can also place functions in functions, just like with any other coding language. We will also throw in a command substitution.

# Define a function that searches for a pattern in a file and returns the line number and line while also ignoring case
searchFile () {
    grep -ni "$1" "$2"
}

# create a function that uses the function 
seeker() {
  output=$(searchFile "$1" "$2")
  if [[ ! -z "$output" ]]; then
    echo "The match is $output"
  else
    echo "There are no matches"
  fi
}

# Call the function to search for the word "Great White" in our sharkTracker2.txt file
seeker "great White" "sharkTracker2.txt"

This is pretty neat, we now have feedback when we get no matches, and return output if we do have matches. If you start to think about what these last two examples are doing, you might be starting to connect the dots to how to expand these concepts to parsing informatics files and assembling outputs…

We can also work with arrays

# Define a function that creates an array of numbers
# and prints the sum of the array
myArray=(1 2 4 5 3 7)

printSum () {
    numbers=("$@")
    sum=0
    for i in "${numbers[@]}"
    do
        sum=$((sum + i))
    done
    echo "The sum is $sum"
}

printSum "${myArray[@]}"

The parentheses in numbers=(“$@”) are used to assign the positional parameters to an array named numbers.

By enclosing $@ in parentheses, we ensure that each positional parameter is treated as a separate element in the array. This is important because the elements of the array may contain spaces or other special characters that need to be preserved.

Without the parentheses, the positional parameters would be treated as a single string and any special characters within them would be interpreted by the shell.

In this case, we want to ensure that each element of the array is processed separately by the for loop that follows, so we use parentheses to create an array containing the positional parameters. Note that you need to use “${myArray[@]}” rather than $myArray when calling the function: in bash, plain $myArray expands to only the first element of the array, so the quoted [@] form is what passes the array elements as separate arguments rather than as a single string.

Introduction to Awk:

Awk was developed at Bell Labs in the 1970s and is available on most Unix-like systems. The name awk comes from the initials of its developers – Alfred Aho, Peter Weinberger, and Brian Kernighan.

At its core, awk operates on records and fields within a text file. A record is typically a line of text, and fields are the individual units within a record that are separated by a delimiter (e.g., a comma, space, or tab). awk allows you to manipulate and extract data from text files by specifying patterns and actions that should be performed on records and fields that match those patterns.

Awk reads text files line by line and splits each line into fields. By default, fields are separated by whitespace characters (spaces, tabs, or newlines), but you can specify a different delimiter using the -F option. Awk then processes each line based on the rules you provide, which can include patterns and actions. In many cases this allows you to accomplish otherwise extremely cumbersome tasks in a single line of code!

Awk uses patterns that are used to match specific lines, and actions are used to specify what to do with the matched lines. For example, you might use a pattern to match lines that contain a specific word, and then use an action to print out the line. There is a vast diversity of commands that use awk, we will cover some basic ones as well as a few more advanced ones in this section.

Getting started with awk

Let’s start by printing the first, second, or last field of each line in our sharkTracker2.txt file

#This prints the first column
awk -F',' '{print $1}' sharkTracker2.txt

#This prints the second column
awk -F',' '{print $2}' sharkTracker2.txt

#This prints the last column
awk -F',' '{print $NF}' sharkTracker2.txt

NF is a special variable that holds the number of fields on the current line, so $NF (the field whose number is NF) is always the last field. Note that since our file is a csv file, we need to specify the delimiter as a comma using the -F flag.

Note that in awk, the block of commands enclosed in the {} specifies what to do when the condition is true.
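To see the difference between NF and $NF directly, you can print both side by side:

#print the number of fields, then the last field, for each line
awk -F',' '{print NF, $NF}' sharkTracker2.txt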

Pattern matching with awk

We can also use awk for pattern matching.

#general form
awk '/pattern/ {print}' file.txt

#example from our file
awk '/Great White/ {print}' sharkTracker2.txt

#print all lines that DON'T contain a pattern
awk '!/Great White/ {print}' sharkTracker2.txt

#print all lines where the fourth field contains a pattern
awk -F',' '$4 ~ /Nags/ {print}' sharkTracker2.txt

On this last one, the ~ operator in the command is used to perform a regular expression match on the fourth field of the input file sharkTracker2.txt.

Specifically, the pattern /Nags/ between the forward slashes is a regular expression that is being matched against the value of the fourth field. This pattern matches any string that contains the substring “Nags”. The ~ operator applies the regular expression match to the value of the fourth field, and returns true if it matches the pattern, and false otherwise.

So, this command will print all lines in the sharkTracker2.txt file where the fourth field contains the substring “Nags”.

Just to remind you, the -F’,’ option sets the field separator to a comma, since the fields in the input file are comma-separated.

Finding data meeting criteria with awk

We can also use awk for finding fields based on conditions. Print lines where a numeric field is greater than a specific value:

#general form
awk '$3 > 10 {print}' file.txt

#example for sharks over 10 feet
awk -F',' '$3 > 10 {print}' sharkTracker2.txt

We can also print lines where a numeric field is between two specific values

#general form
awk '$3 >= min && $3 <= max {print}' file.txt

#example for sharks between 7 and 15 feet
awk -F',' '$3 >= 7 && $3 <= 15 {print}' sharkTracker2.txt

We can also search for field values, either numeric or string

#simple example
awk -F',' '$1 == "Tiger Shark" {print}' sharkTracker2.txt

#A more complicated example using a variable
shark="Tiger Shark"
awk -F',' -v pattern="$shark" '$1 == pattern {print}' sharkTracker2.txt

Let’s break this down

  • shark=“Tiger Shark” sets a shell variable named shark to the value “Tiger Shark”.

  • -F’,’ specifies the field separator as a comma.

  • -v pattern=“$shark” defines a variable named pattern in the awk program, which is set to the value of the shell variable shark. The -v option is used to pass a variable to the awk program from the shell.

  • $1 == pattern checks whether the value of the first field is equal to the value of the pattern variable.

  • {print} specifies the command to print the entire line if the condition $1 == pattern is true.

  • sharkTracker2.txt is the input file that awk operates on.

To summarize, the command sets the shell variable shark to “Tiger Shark”, then runs awk with the specified options and program on the sharkTracker2.txt file. The program compares the value of the first field of each row in the input file to the value of the pattern variable (i.e., “Tiger Shark”), and if they are equal, the entire row is printed to the standard output.

Note that the use of the -v option to pass shell variables to awk is a useful technique for creating flexible awk programs that can be easily customized based on the input and the specific task at hand.
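To connect this back to the functions section, you could wrap that one-liner in a small bash helper (a hypothetical sketch; the function name is just for illustration):

# print every row whose first field matches the shark name you pass in
findShark () {
    awk -F',' -v pattern="$1" '$1 == pattern {print}' sharkTracker2.txt
}

findShark "Tiger Shark"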

We can also do this with a number instead of a string

#example with a number
length=9.1
awk -F',' -v pattern="$length" '$3 == pattern {print}' sharkTracker2.txt

Let’s look at a few more commands. As you will see in this section, awk is vast….

FNR and awk

FNR (the “file record number”) is a built-in variable in awk that holds the number of the current record in the current file being processed. In other words, it keeps track of the line number within the current file, and it resets to 1 each time awk starts reading a new input file.

This variable is useful when you want to perform different actions on different lines or records of the file. For example, you may want to print a header line at the beginning of the file or a summary line at the end of the file.

Here is a simple example that demonstrates the use of FNR. The following awk command prints each line of sharkTracker2.txt along with its line number:

awk '{print "Line " FNR ": " $0}' sharkTracker2.txt

  • print: This is the awk command that prints output.

  • “Line ”: This is a string literal that will be printed before each line number.

  • FNR: This is an awk variable that holds the record number (line number) of the current input file being processed.

  • “: ”: This is a string literal that separates the line number from the actual line of text.

  • $0: This is an awk variable that holds the entire input line.

When awk starts processing sharkTracker2.txt, it reads the first line of input, assigns it to $0, and assigns the record number (line number) to FNR. It then executes the program ‘{print “Line ” FNR “: ” $0}’ for that line, which prints the text “Line ”, followed by the current value of FNR, followed by “: ”, followed by the entire input line ($0). awk then reads the next line of input and repeats the same steps for each subsequent line.

Finally, awk exits after processing the last line of input.

You may be wondering why we are focusing on this. Well, let’s consider what happens when you are working with multiple files simultaneously and need to merge them.

Suppose we have two FASTQ files, sample1_R1.fastq and sample1_R2.fastq, which contain paired-end reads from a sequencing experiment. We want to extract the read sequences from these files and merge them into a single file for downstream analysis.

The reads look like this

@read1_R1
AGCTGATCGATCGTACG
+
IIIIIIIIIIIIIIII
@read2_R1
TACGTACGTACGTACGT
+
IIIIIIIIIIIIIIII
...
@read1_R2
CGTACGATCGATCGTAC
+
IIIIIIIIIIIIIIII
@read2_R2
ACGTACGTACGTACGTA
+
IIIIIIIIIIIIIIII
...

It turns out that we can use awk and FNR to create a merged file of all reads, just the sequences, to map back to a genome!

Check this out:

awk 'FNR%4 == 2 {print $0 > "merged_reads.fastq"}' sample1_R1.fastq sample1_R2.fastq

In a FASTQ file, each sequence record consists of four lines: a header line that starts with the @ symbol, a sequence line, a separator line that starts with the + symbol, and a quality score line. Because every record occupies exactly four lines, the remainder of FNR%4 can be used to identify the type of line being processed:

  • When FNR%4 is 1, the line being processed is the header line of a new sequence record.

  • When FNR%4 is 2, the line being processed is the sequence line of a sequence record.

  • When FNR%4 is 3, the line being processed is the separator line of a sequence record.

  • When FNR%4 is 0, the line being processed is the quality score line of a sequence record.

So, in the command above, the condition FNR%4 == 2 selects only the second line of each sequence record, which corresponds to the actual sequence data, and the print action writes just those lines to merged_reads.fastq. Because FNR resets at the start of each input file, the same test works across both files, effectively extracting all of the sequence data in one pass. One line is all it took!!
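A small variation on the same idea, in case you also want to keep the headers alongside the sequences (a sketch):

#keep the header line AND the sequence line of every record
awk 'FNR%4 == 1 || FNR%4 == 2 {print}' sample1_R1.fastq sample1_R2.fastq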

NR and awk

The NR variable in awk stands for “number of records”, and it keeps track of the total number of input records that have been processed so far, across all input files. This variable can be used in a variety of ways to perform computations or output certain information based on the total number of records processed.

awk -F, '{if(NR%2==0) print $0}' sharkTracker2.txt 

This line prints every other line of sharkTracker2.txt by checking whether the record number NR is divisible by 2.

Now let’s apply this to a pretend fasta file.

cat > pretend.fasta
>seq1
atatatataatatatatatatata
>seq2
ACTCGATGTATCGCTAGATCTATA
>seq3
TCGCTAGATCTATTGATCGATGCT

#save with control d

#count the number of sequences with NR
awk '{if(NR%2==0) count++} END {print count}' pretend.fasta

That works only because each record in this file spans exactly two lines. A more robust way is to count the number of sequences in the fasta file with a regular expression

awk '/^>/ {count++} END {print count}' pretend.fasta

In this command, /^>/ is a regular expression that matches any line starting with the > character, which indicates a header line in a FASTA file. When such a line is encountered, the count variable is incremented. At the end of processing, the END block is executed, which simply prints out the final value of count. Therefore, this command prints out the total number of sequences in the input file.
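The same pattern/END combination can compute other summaries. For example, here is a sketch that totals the number of bases by summing the lengths of every non-header line (length, which we will meet again shortly, returns the length of a string):

awk '!/^>/ {total += length($0)} END {print "Total bases:", total}' pretend.fasta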

Here is an example of using this for a gff3 file to get gene coordinates!

cat > pretend.gff3
##gff-version 3
##sequence-region chr1 1 1000000
chr1    .   gene    1000    9000    .   +   .   ID=gene00001;Name=Gene 1
chr1    .   mRNA    1050    9000    .   +   .   ID=mRNA00001;Parent=gene00001;Name=MRNA 1;Note=This is mRNA 1
chr1    .   exon    1050    1500    .   +   .   ID=exon00001;Parent=mRNA00001;Name=Exon 1
chr1    .   exon    3000    3902    .   +   .   ID=exon00002;Parent=mRNA00001;Name=Exon 2
chr1    .   exon    5000    5500    .   +   .   ID=exon00003;Parent=mRNA00001;Name=Exon 3
chr1    .   CDS     1201    1500    .   +   0   ID=cds00001;Parent=mRNA00001;Name=CDS 1
chr1    .   CDS     3000    3902    .   +   0   ID=cds00001;Parent=mRNA00001;Name=CDS 1
chr1    .   CDS     5000    5500    .   +   0   ID=cds00001;Parent=mRNA00001;Name=CDS 1

#control d to save

In this example, the first line starts with “##gff-version 3”, which is a comment indicating that this is a GFF3 file and which version of the GFF3 format it follows. The second line starts with “##sequence-region” and provides information about the sequence region covered by this file.

The remaining lines represent features on the sequence, such as genes, exons, and CDSs. Each feature is described using nine columns separated by tabs. The columns represent:

  1. Sequence ID (e.g. “chr1”)
  2. Source of the feature (e.g. “.”, indicating unknown)
  3. Type of feature (e.g. “gene”, “mRNA”, “exon”, “CDS”)
  4. Start position of the feature (inclusive)
  5. End position of the feature (inclusive)
  6. Score (e.g. “.”, indicating unknown)
  7. Strand (e.g. “+”, indicating the feature is on the forward strand)
  8. Phase (e.g. “0”, indicating that the first base of the feature is the first base of a codon)
  9. Attributes (e.g. “ID=gene00001;Name=Gene 1”, providing additional information about the feature)

Now check out this command

awk '$3=="gene" {print NR, $9}' pretend.gff3 > index.txt   

In this command, the condition $3==“gene” is used to select only lines that correspond to gene features in the GFF3 file. When such a line is encountered, the print action outputs the record number NR, which corresponds to the line number of the gene feature in the input file, as well as the ninth field ($9) of the GFF3 line, which contains the gene name or ID. The output of this command is redirected to a file index.txt, which can be used to quickly look up the position of a gene in the input file based on its name or ID. Easy!
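Because awk can do arithmetic on fields, computing feature lengths from the coordinate columns is also a one-liner. A sketch on our pretend file:

#print the sequence ID, coordinates, and length of each gene
awk '$3=="gene" {print $1, $4, $5, $5-$4+1}' pretend.gff3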

Special patterns

BEGIN: This is a special pattern that is executed before any records are read from the input file(s). It can be used to initialize variables, set options, or perform other setup tasks. Here is an example in Bash:

awk 'BEGIN{FS=",";OFS="\t"} {print $1,$2,$3}' sharkTracker2.txt > sharkTracker2.tsv

If you followed what just happened, we converted a file from csv to tsv there! This command sets the input and output field separators to , and \t (tab), respectively, before printing the first three fields of each line in sharkTracker2.txt.

END: This is another special pattern that is executed after all records have been processed. It can be used to print summary information, perform cleanup tasks, or take other actions based on the input data. Here is an example:

awk -F, '{totalWeight += $2} END {print "Total Weight:",totalWeight}' sharkTracker2.txt

This command calculates the sum of the 2nd field of each line in sharkTracker2.txt, and then prints the total at the end of processing all the records.

length: This function returns the length of a string.

awk -F, '{if(length($2) > 3) print $0}' sharkTracker2.txt

This command prints each line of sharkTracker2.txt where the length of the 2nd field is greater than 3 characters (weights of at least 1000 in our case). This is a handy way to filter data.

Playing with strings

awk is actually pretty handy for handling strings. For example, let’s take a substring of a string using substr. Here is the syntax

#the basic idea
substr(string, start, length)

Here’s a breakdown of each of the arguments:

  • string: The input string from which you want to extract the substring.

  • start: The starting position of the substring. This can be either a number indicating the character position (starting from 1) or a variable that contains a number.

  • length: The length of the substring to extract. This can also be a variable that contains a number.

Here’s an example of how to use substr in an awk script:

awk '{ print substr($0, 5, 10) }' sharkTracker2.txt

In this example, we’re using substr to extract a 10-character substring starting from the 5th character in each line of sharkTracker2.txt. The $0 variable represents the entire line, so we’re applying substr to each line in the file. The output will be the extracted substrings.

##looking for a text fragment
awk -F, '{ if (index(substr($4, 1, 10), "ead") > 0) print $0 }' sharkTracker2.txt

In this command, we use substr($4, 1, 10) to extract a 10-character substring of the fourth field, starting at position 1 (i.e. the first character). Then, we use index to check if “ead” is present in this substring. If the index of “ead” is greater than 0, then “ead” is present in the substring, and we print the entire record ($0).

We can also change the case of text in our file using toupper or tolower.

#command
awk '{print tolower($0)}' pretend.fasta > pretend2.fasta

#see the difference
cat pretend.fasta
cat pretend2.fasta

#to modify the existing file in place (requires GNU awk, i.e. gawk)
awk -i inplace '{print toupper($0)}' pretend2.fasta

In the first command, we use the tolower function to convert each line of the input file to lowercase. The print statement outputs the converted line, and $0 refers to the entire input line. Finally, the output is redirected to a new file, pretend2.fasta.

In the second command, the -i inplace option tells gawk to modify the input file in place. We use toupper to convert each line back to uppercase, and the print statement outputs the converted line.

We can also use this in a search

awk -F, '{if(tolower($1) == "great white shark") print $0}' sharkTracker2.txt

This command prints each line of sharkTracker2.txt where the first field is “great white shark” in any case.

We can also split a field into an array

awk '{split($9, a, ";"); print a[1]}' pretend.gff3

This command splits the ninth field on semicolons and prints the first element of the resulting array for each line.

We can also look for mismatches in a file using getline, which reads the next line of input into a variable on demand. This is handy if you have a huge datafile that has a few cases you may need to remove. In this sketch, each line’s first field is compared against the entire line that follows it.

#this is just sample code so you have it
awk '{getline nextline; if($1 != nextline) print "Mismatch: "$1" != "nextline}' file.txt

We can also make associative (key:value) arrays

awk -F ',' '{ array[$1] = $2 } END { for (key in array) print key, array[key] }' sharkTracker2.txt

In this command, we create an associative array array where the keys are the values in the first column ($1) and the values are the values in the second column ($2). The END statement is used to print out the contents of the array after all input has been processed.

The for (key in array) loop iterates over each key in the array array, and key is the variable that contains each key in turn. We then print out each key-value pair using print key, array[key]. Remember that in an associative array, each key is associated with a value. When you use array[key], you are accessing the value that is associated with the given key.

Note that if there are duplicate keys in the first column, the last value in the second column will be the one stored in the array.
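Associative arrays are also the standard awk idiom for tallying categories: increment count[$1] on every row, then dump the totals in the END block. A sketch using our file:

#count how many rows mention each shark in the first column
awk -F',' '{count[$1]++} END {for (key in count) print key, count[key]}' sharkTracker2.txt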

This just barely scratches the surface of using awk. However, it should be enough to get you started and able to approach bioinformatic tasks.

File compression

File compression is common so here is a cheat sheet by Rick White from his introduction to scripting course at UNC Charlotte

#compress a file
gzip <filename> 
# results in <filename>.gz
  
#Uncompresses files compressed by gzip (.gz)
gunzip <filename>.gz
# results in <filename>

#Compresses a folder into a tar archive with gzip compression (tar.gz)
tar -cvzf <foldername.tar.gz> <foldername>
#Arguments are
- -z: gzip compression
- -c: Creates a new .tar archive file.
- -v: Verbosely show the .tar file progress.
- -f: File name of the archive file.

# List contents of tar.gz 
tar -tvf <foldername.tar.gz>

# Prints a gzipped file without uncompressing it to disk (on Linux, use zcat instead)
gzcat <filename.gz> | more
gzcat <filename.gz> | less

#Uncompresses files compressed by tar (tar.gz)
tar -zxvf <foldername.tar.gz>
#arguments
- -z: gzip compression
- -x: Extracts files from the archive.
- -v: Verbosely show the .tar file progress.
- -f: File name of the archive file.

# Compresses a folder into a tar archive with bzip2 compression (tar.bz2, more compression)
tar -cvjf <foldername.tar.bz2> <foldername>
#argument explanation
- -j: bzip2 compression
- -c: Creates a new .tar archive file.
- -v: Verbosely show the .tar file progress.
- -f: File name of the archive file.

#Uncompresses files compressed by tar (tar.bz2)
tar -xjvf <foldername.tar.bz2>
#argument explanation
- -j: bzip2 compression
- -x: Extracts files from the archive.
- -v: Verbosely show the .tar file progress.
- -f: File name of the archive file.
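Compression also combines naturally with the for loops from earlier. A closing sketch, assuming the current directory holds .fastq files you want to compress one by one:

#compress every FASTQ file in the current directory
for f in *.fastq; do
    gzip "$f"
done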