The purpose of this markdown is to provide problems that challenge you to apply what you have learned from the tutorial sheet.
Each section will have multiple problems and solutions for each. Note that there is often more than one solution to a problem, so if you solve something another way, that is great!
Click each tab to go to exercises corresponding to that section. Note that the exercises increase in difficulty with the tabs, as these build upon each other
Throughout the markdown there are prompts followed by bottons like the one shown below this text. The idea here is that you try to solve the problem first, then look at the solution.
For example, how would you print “Gengar is one of my favorite Pokémon” to the console?
# a code chunk will be revealed
echo "Gengar is one of my favorite Pokémon"
The echo command in a terminal is used to print text to the console. In this case, running echo “Gengar is one of my favorite Pokémon” would print the text “Gengar is one of my favorite Pokémon” to the console.
So this is just an example of a command you could run in a terminal to print some text.As you go through the other sections, do your best to solve the challenges!
Problem 1: Create a new file
pwd
cd /Users/alexdornburg/Documents/UCE_Workshop/
touch sharkObservations.txt
ls
pwd displays the current working directory cd plus a path changes the directory The touch command creates a new file with the specified name. the ls command will display this file along with other files in your directory
Problem 2: Making directories and moving files
mkdir sharkData
mv sharkObservations.txt sharkData/
cd sharkData/
ls
The mkdir command creates a new directory with the specified name. The mv command moves the sharkObservations.txt file into the sharkData directory.
Change your current working directory to sharkData and use the ls command to display this fileProblem 3: Removing and renaming files
#check where you are
pwd
#create the file
touch moreSharkObservations.txt
#remove the old one
rm SharkObservations.txt
#rename the file you made
mv moreSharkObservations.txt SharkObservations_v2.txt
# check it is there
ls
The ls -thor command is a combination of several options of the ls command
Here is what each option does:
-t sorts the files by modification time, with the most recently modified files first.
-h makes the file sizes human-readable, so that they are easier to understand. For example, instead of displaying a file size in bytes, it will be displayed in a more understandable format like “2.5K” for 2.5 kilobytes.
-o lists the files with long format, but omits the group information.
-r lists the files in reverse order.
#change directory
cd /Users/alexdornburg/Documents/Nototheniod_opsin
#use ls -thor
ls -thor
#In my directory the oldest file is from 2020 and is 256B
#go back to the previous directory
cd -
Remember that the command whoami will give you your username. Can you write a line of code to return your user name and your directory in a sentence?
#use echo!
echo "My username is $(whoami) and I am currently in the directory $(pwd)"
#come back to the above if this doesn't make sense yet after going through later sections. The $() is capturing the output of the commands whoami and pwd.
Your workshop folder has a file called elementarySchoolShark.txt. We will use that for this exercise. You should check that you are in the folder that contains it before starting.
move the elementarySchoolShark.txt file from your workshop examples download to your working directory then
Display only the first 5 lines of the elementarySchoolShark.txt file
#your directory paths will be unique but something like
cd /Users/alexdornburg/Downloads/workshopExamples
#move to your desired location
mv elementarySchoolShark.txt your/path/
#change back
cd -
#look at the file
cat elementarySchoolShark.txt | head -n 5
The cat command is used to concatenate and display the contents of a file. Here, it is used to display the entire content of “elementarySchoolShark.txt”.
The vertical bar | is called a pipe and it takes the output of the cat command and “pipes” it as input to the head command.
The head command is used to display the beginning of a file, by default the first 10 lines. The -n option is used to specify the number of lines to display, so in this case -n 5 tells head to only display the first five lines of the input passed to it via the pipe.
get only the names of the sharks from the elementarySchoolShark.txt file write them into another file called sharkNames.txt
Then view the first two lines of the new file
cut -d'-' -f 1 elementarySchoolShark.txt | cat > sharkNames.txt
head -n 2 sharkNames.txt
cut -d’-’ -f 1 elementarySchoolShark.txt uses the cut command to extract the first column of the file. The -d option specifies the delimiter (a hyphen in this case), and the -f option specifies which fields to extract (the first).
The vertical bar | is called a pipe and it takes the output of the cat command and “pipes” it as input to the cat command which is writing the new file.
Use head to view the first two names
Cut out rows 2-4 of the elementarySchoolShark.txt file
save the output as sharkFactSubset.txt
Then view the new file
cat elementarySchoolShark.txt | head -n 4 | tail -n +2 > sharkFactSubset.txt
cat sharkFactSubset.txt
cat elementarySchoolShark.txt reads the contents of elementarySchoolShark.txt and passes them as input to the next command using the | pipe operator. In human, this means take elementarySchoolShark.txt, THEN
head -n 4 selects the first 4 lines of the input (i.e., lines 1-4), THEN
tail -n +2 selects lines starting from line 2 of the input (giving us lines 2-4).
“>sharkFactSubset.txt” redirects the output of the pipeline to a new file called sharkFactSubset.txt. So when you run this command, it will create a new file called sharkFactSubset.txt that contains rows 2-5 of elementarySchoolShark.txt
cat sharkFactSubset.txt views the output
You often need to mix things you just learned with things you know. For this problem, you just learned that wc -l will count the lines of something. How could you use this with cat to count the number of lines in elementarySchoolShark.txt?
After you figure it out, also count the number of lines in sharkFactSubset.txt
#cat THEN count
cat elementarySchoolShark.txt | wc -l
#rinse and repeat
cat sharkFactSubset.txt | wc -l
To give more detail: The w and c options in the wc command are used to specify what type of count you want to perform on the input file.
Here’s what they do:
The w option stands for “word count”. It counts the number of words in the input file. A word is defined as a string of characters that is delimited by whitespace (spaces, tabs, or line breaks).
The c option stands for “character count”. It counts the number of characters in the input file, including whitespace characters.
When you run the wc command with no options specified, it defaults to counting the number of lines, words, and characters in the input file.
For example, the command wc example.txt would output the number of lines, words, and characters in the file “example.txt”.
If you want to count only a specific type of entity, you can specify the appropriate option. For example, the command wc -w example.txt would output only the number of words in the file “example.txt”, and the command wc -c example.txt would output only the number of characters. In our example -l gives the line.
The reason for combining this with cat is that using this on the file directly also returns the filename, which may not be desirable depending on what you are trying to do (for example count the number of sequences, etc)
Append a new line of data to elementarySchoolShark.txt that includes the name Oceanic whitetip shark in the name field and the fact Oceanic whitetip sharks are known for their distinctive long, pointed fins, which can extend up to 1.5 meters (4.9 feet) in length.
Then count the number of lines
also view just the names of the shark species
#you can just use echo for this
echo "Oceanic whitetip shark - Oceanic whitetip sharks are known for their distinctive long, pointed fins, which can extend up to 1.5 meters (4.9 feet) in length." >> elementarySchoolShark.txt
#count lines
cat elementarySchoolShark.txt | wc -l
#view names
cut -d'-' -f 1 elementarySchoolShark.txt
Here are some practice problems for working with grep
#simple grep, ignoring case
grep -i "mako" elementarySchoolShark.txt
Search for lines that start with the letter “t”:
#simple grep, ignoring case, using ^ for start of line
grep -i "^t" elementarySchoolShark.txt
Let’s pretend we are searching for a sequence motif, search for lines ending with “gaagcatt” in the example.fasta file
note that the -B [number] argument allows you to include a line above a match#simple grep, ignoring case, using $ for end of line with -B 1 for one line above included
grep -i -B 1 "gaagcatt$" example.fasta
Similar to problem 3 above we can use -A [number] to include 1 line below. Armed with that knowledge, can you return all the octopus sequences from the example.fasta file?
#simple grep, ignoring case, using \b for boundary with -A 1 for one line below included
grep -i -A 1 "\boctopus" example.fasta
Similar to problem 4 above we can build on a more complex grep. -v allows us to exclude specific matches. Here are all the genera containing the stem teuthis in our fasta: Architeuthis, Bathyteuthis, Brachioteuthis, Chiroteuthis, Histioteuthis, Joubiniteuthis, Mastigoteuthis, Octopoteuthis, Onychoteuthis, Opisthoteuthis, Pholidoteuthis, Pholidoteuthis, Pterygioteuthis, Selenoteuthis, Sepioteuthis, Thysanoteuthis
Can you return all sequences from the example.fasta file except those from Octopoteuthis, Onychoteuthis, and Opisthoteuthis, ?
#simple grep, ignoring case, with -A 1 for one line below included finding teuthis. THEN ignore case, ignore any names starting with o, and be sure to keep the sequence below (if you forget that part, it makes a mess)
grep -i -A 1 "teuthis" example.fasta | grep -i -v "\bo" -A 1
#-c returns! Simply count the > at the start of line
grep -c "^>" example.fasta
Can you find all the lines in a file that contain a word mouth, any two words, and then the word plankton?
To do this you will need to use the -E option with grep to enable extended regular expressions (also known as extended regex or ERE). By default, grep uses basic regular expressions (BRE), which have a more limited set of pattern matching capabilities.With extended regular expressions, you can use additional metacharacters and quantifiers to specify more complex patterns.
In this case you can use .+ to denote a single word.
#-c returns! Simply count the > at the start of line
grep -E '\bmouths\b .+ .+ plankton\b' elementarySchoolShark.txt
we can use sort with grep to return a sorted output.
Can you return a sorted elementaryShark.txt output that includes only hammerhead, tiger, or mako sharks?
#pipe and sort!
grep -i "hammer\|mako\|tiger" elementarySchoolShark.txt| sort
This one is a bit more of a challenge. Building on the above, can you count the number of A, C, T, G, in a fasta file with just one line and a few piped grep or other commands? You will need to combine your experience from the above examples AND
use one more command uniq, that returns unique entries. Note that for uniq to work, entries need to be sorted. You can supply arguments like -c to uniq to count….
#solution
grep -v "^>" example.fasta | grep -o -i "[ACTG]" | sort | uniq -c
Write a sed command that deletes lines that include the word plates from the elementarySchoolShark.txt file: hint: the d argument deletes.
#solution
sed '/plates/d' elementarySchoolShark.txt
Let’s build on the above. /N keeps the line below. How might I delete all the Octopus sequences from example.fasta using sed?
#solution, combine N and d!
sed '/octopus_.*/{N;d;}' example.fasta
I just discovered my Sepia latimanus samples were misidentified and in fact Sepia mestus! How could I rename the sequences in example.fasta using sed?
#solution, pretty easy right?
sed 's/latimanus/mestus/' example.fasta
When you use sed, y/acgtu/ACGTU/ replaces all lower case letters with upper case. Can you write a sed command that makes all the sequences uppercase, but not the sequence headers?
#remember to find the lines that DON'T start with a > by using /^>/!
sed '/^>/!y/acgtu/ACGTU/' example.fasta
Replace all the thymine in example.fasta with uracil.
sed '/^>/!s/t/u/g' example.fasta
Declare a variable called hobby and assign a hobby you like to it. Then, print out a message that says “Hello, I have many hobbies including [your hobby]”.
hobby="reading manga"
echo "Hello, I have many hobbies including $hobby"
Declare a variable called age and assign your age to it. Then, print out a message that says “In 5 years I will be [age+5] years old”.
age=42
echo "In 5 years I will be $(($age+5)) years old"
Declare a variable called PI and assign the value of pi (3.14159) to it. Then, calculate the circumference of a circle with a radius of 5 using the formula C = 2 * PI * r.
PI=3.14159
r=5
C=$((2 * $PI * $r))
echo "The circumference of the circle is $C"
make a variable that points to another folder and then list all the files in that folder, including the file sizes, in order from oldest to newest
directory=/Users/alexdornburg/Documents/Nototheniod_opsin
ls -thor $directory
create a variable called sequence and another variable called fasta. In sequence, assign a sequence “gaag” and in fasta give the name of the example.fasta file. search the fasta using the variables and return the sequences and headers
#assign variables
fasta=example.fasta
sequence="gaag"
#return sequences
grep -i -B 1 $sequence $fasta
Create an array named fishes that contains five different fish Then, print the length of the array.
fishes=("salmon" "tuna" "swordfish" "halibut" "cod")
echo "The length of the fishes array is ${#fishes[@]}"
Create an array named fishes that contains five different fish Then, print the first fish in the array.
#note I am using zsh which counts from 1, you may need to use 0 instead
fishes=("salmon" "tuna" "swordfish" "halibut" "cod")
echo "My favorite fish is ${fishes[1]}"
Create an array named fishes that contains five different fish Then, sort it alphabetically.
fishes=("salmon" "tuna" "swordfish" "halibut" "cod")
sorted_fishes=$(echo "${fishes[@]}" | tr ' ' '\n' | sort -d | tr '\n' ' ')
echo "Sorted array: ${sorted_fishes[@]}"
This is confusing to look at, but pretty easy if you remember tr.
This code first uses tr to replace the space character with a newline character, so that each element of the array is on a separate line. Then it pipes the resulting string to sort -d to sort the lines based on the whole line. Finally, it uses tr again to replace the newline character with a space character, so that the elements of the sorted array are separated by spaces.Write a script that takes an integer input and determines whether it is even or odd.
To take user input, the line is: read -p “Enter an integer:” num
#create
cat > evenOdd.sh
#!/bin/bash
read -p "Enter an integer: " num
if (( num % 2 == 0 )); then
echo "$num is even."
else
echo "$num is odd."
fi
#control d to save
#make executable
chmod +x evenOdd.sh
./evenOdd.sh
7
Write a condition statement that checks whether elementarySchoolShark.txt exists and is readable. If yes, print “The file exists and is readable.” if not print “The file does not exist or is not readable.”
if [[ -r elementarySchoolShark.txt ]]; then
echo "The file exists and is readable."
else
echo "The file does not exist or is not readable."
fi
Write a condition statement that checks whether evenOdd.sh that we created in problem 1 is executable. If yes, print “The file is executable.” if not print “The file is not executable.”
if [[ -x evenOdd.sh ]]; then
echo "The file is executable."
else
echo "The file is not executable."
fi
Write a condition statement that checks whether a fasta has sequences and returns a message that includes the number of sequences if it does, or a message that it is empty if it does not. You can use example.fasta for this
num_seqs=$(grep -c "^>" example.fasta)
if [ $num_seqs -gt 0 ]; then
echo "The file example.fasta contains $num_seqs sequences."
else
echo "The file example.fasta does not contain any sequences."
fi
Create a variable that contains a filename.
Then write a condition statement that checks whether the file has the extension .fasta
If yes, it returns a message that includes the filename indicating it does
If not it again includes a message with the filename that it does not. If not, also fix the extension to be .fasta.
You can create a file for this or use an existing file if you don’t mind the extension being potentially changed.
#create a file for an example
cat > squid
>squid
actgctagctagcta
#control d to save
filename="squid"
if [[ "${filename##*.}" == "fasta" ]]; then
echo "The file $filename has the correct extension."
else
echo "The file $filename does not have the correct extension. Renaming the file..."
extension=".fasta"
mv "$filename" "${filename%.*}$extension"
echo "The file extension has been corrected to have $extension as the extension"
fi
#view it
ls
This one kicks it up a notch create a file called squid.seq.s1.r.fa (that’s a lot of dots!) with some sequence data Then write a condition statement that checks whether the file has the extension .fasta put does this in a way that looks for the last “.”
To do this we will learn two new things “%” and “##”.
The % character signals that we want to remove a suffix pattern from the variable value. It will remove the shortest possible suffix.
In contrast, the ## characters signal that we want to remove the longest matching suffix pattern from the variable value.
Note this only works with shell variables
You normally use syntax like this to use these characters
filename="example.file.txt"
extension="${filename##*.}"
echo "The file extension is $extension"
With that massive hint, see if you can use this to check if a file is a .fasta and rename if necessary and print out a message. This is nice since you don’t need to worry about how many possible “.” there are in a file to do this task.
#create a file for an example
cat > squid.seq.s1.r.fa
>squid
atcgctcgat
#save with cntrl-d
filename="squid.seq.s1.r.fa"
extension="${filename##*.}"
base="${filename%.*}"
new_extension=".fasta"
if [[ "$extension" == "fasta" ]]; then
echo "The file $filename has the correct extension."
else
echo "The file $filename does not have the correct extension. Renaming the file..."
mv "$filename" "$base$new_extension"
echo "The file extension has been corrected to $new_extension"
fi
#view it
ls
filename=“squid.seq.s1.r.fa”: This line sets the value of the filename variable to squid.seq.s1.r.fa, which is the name of the file we want to check and potentially rename.
extension=“${filename##*.}”: This line extracts the file extension from the filename variable using the ${filename##*.} parameter expansion we discussed earlier. The resulting value is assigned to the extension variable.
base=“${filename%.*}”: This line extracts the base name of the file (i.e., the file name without the extension) using the ${filename%.*} parameter expansion. The resulting value is assigned to the base variable.
new_extension=“.fasta”: This line sets the new file extension we want to use if the original extension is not .fasta.
if [[ “$extension” == “fasta” ]]; then: This line starts an if statement to check if the original file extension is .fasta.
echo “The file $filename has the correct extension.”: If the original file extension is .fasta, this line prints a message indicating that the file has the correct extension.
else: If the original file extension is not .fasta, the if statement branches to this line.
echo “The file $filename does not have the correct extension. Renaming the file…”: This line prints a message indicating that the file needs to be renamed.
mv “$filename” “$base$new_extension”: This line renames the file by using the mv command to move the file from its original name to a new name that consists of the base name ($base) and the new extension ($new_extension).
echo “The file extension has been corrected to $new_extension”: This line prints a message indicating that the file extension has been corrected to the new extension.
Overall, this script checks if a file has the correct file extension (.fasta). If it does, nothing happens. If it doesn’t, the script renames the file to have a .fasta extension instead.
numbers=(10 20 30 40 50)
sum=0
for number in "${numbers[@]}"
do
sum=$((sum + number))
done
length=${#numbers[@]}
echo $(($sum/$length))
numbers=(10 20 30 40 50) - creates an array called numbers with 5 elements: 10, 20, 30, 40, and 50.
sum=0 - initializes a variable called sum with a value of 0.
for number in “${numbers[@]}” - starts a loop that will iterate through each element of the numbers array.
do - starts the loop body.
sum=$((sum + number)) - adds the current element of the numbers array to the sum variable.
done - ends the loop body.
length=${#numbers[@]} - calculates the length of the numbers array and stores it in a variable called length.
echo \(((\)sum/$length)) - divides the sum variable by the length variable and prints the result to the console.
Create an array called fruits that includes: “apple” “banana” “orange” “kiwi” “mango”. Loop through this array of strings, count the length of each string, and print the length of each string to the terminal.
fruits=("apple" "banana" "orange" "kiwi" "mango")
for fruit in "${fruits[@]}"
do
length=${#fruit}
echo "The length of $fruit is $length"
done
Use a loop to return the name of each sequence, and the total of each base (how many A, C, T, and G). Your output for each species should look like this:
Sequence: “tremoctopus_violaceus”
Number of A’s: 47
Number of T’s: 38
Number of A’s: 22
Number of T’s: 23
This one is challenging, try doing bits of the loop at a time and building out. For example return all the headers or all the sequence values first. Then put it together.
headers=($(grep ">" example.fasta | tr ' ' '"' | tr '\n' '"'| tr '>' ' '))
sequences=($(grep -v ">" example.fasta | tr '[:lower:]' '[:upper:]'))
counter=0
# Loop through the sequences and count the number of A's and T's
for seq in "${sequences[@]}"
do
counter=$((counter+1))
a_count=$(echo $seq | grep -o "A" | wc -l)
t_count=$(echo $seq | grep -o "T" | wc -l)
g_count=$(echo $seq | grep -o "G" | wc -l)
c_count=$(echo $seq | grep -o "C" | wc -l)
echo "Sequence: $headers[$counter]"
echo "Number of A's: $a_count"
echo "Number of T's: $t_count"
echo "Number of A's: $g_count"
echo "Number of T's: $c_count"
done
Create an array of three directories on your computer. Loop through these and print the number of files in each along with the name.
# Declare an array of directories
dirs=("/Users/alexdornburg/Documents/UCE_Workshop/TutorialFiles" "/Users/alexdornburg/Documents/UCE_Workshop" "/Users/alexdornburg/Documents")
# Loop through the array and print the number of files in each directory
for dir in "${dirs[@]}"
do
count=$(ls $dir | wc -l)
echo "There are $count files in $dir"
done
You can use $RANDOM to generate random numbers. Generate an array of random numbers, then loop through the array and print the numbers in descending order
# Generate an array of random numbers
numbers=()
i=0
while [ $i -lt 10 ]
do
numbers+=($RANDOM)
i=$((i + 1))
done
# Loop through the array and print the numbers in descending order
sorted=($(echo "${numbers[@]}" | tr ' ' '\n' | sort -nr))
for i in "${sorted[@]}"
do
echo $i
done
Most of this we have seen in the other examples, the only new thing here is sort -nr * -n: specifies a numerical sort. By default, sort sorts alphabetically, but using the -n option, it sorts numerically.
-r: specifies a reverse sort, meaning the output will be in descending order instead of ascending order.
So sort -nr means “sort numerically in reverse order” - or in other words, sort the numbers in the input in descending order.
Make an array of three directories that contain files with the same extension and then print files with that extension within those and all their subdirectories. I’ll use .pdf for this
To do this we can use the find command. The find command to search for .pdf files in each directory and its subdirectories. The find command takes several options, including -type f to search for only files (not directories), -name “*.pdf” to match files with a .pdf extension, and -print to print the names of the matching files to the console.
dirs=("/Users/alexdornburg/Documents/Italki_Essay" "/Users/alexdornburg/Documents/NITR-Review_paper" "/Users/alexdornburg/Documents/Triggerfish_Allometry_papers")
# Loop through the array and print the files in each directory
for dir in "${dirs[@]}"
do
# Use find command to search for .pdf files in the directory and its subdirectories
find "$dir" -type f -name "*.pdf" -print
done
We need to loop through two arrays and print only colors that start with “r” and sizes that contain the letter “m”.
Your arrays are: colors=(“red” “red” “red” “red” “ruby” “ruby” “ruby” “ruby” “blue” “green” “yellow” “orange” “purple” “pink” “teal” “cyan” “magenta” “navy” “olive” “maroon”)
sizes=(“small” “small” “medium” “large” “large” “medium” “small” “medium” “large” “small” “medium” “large” “small” “large” “medium” “medium” “small” “large” “large” “medium”)
# Declare two arrays
colors=("red" "blue" "green" "yellow" "orange" "rainbow" "purple" "pink" "teal" "ruby" "cyan" "magenta" "rosa" "navy" "olive" "maroon" "rotten cabbage")
sizes=("small" "medium" "large")
# Loop through both arrays and print only colors that start with "r" and sizes that contain the letter "m"
for i in "${colors[@]}"
do
if [[ $i == r* ]]
then
for j in "${sizes[@]}"
do
if [[ $j == *m* ]]
then
echo "Color: $i, Size: $j"
fi
done
fi
done
Write a function “add_numbers” that takes two numbers as arguments and returns their sum. Solution:
add_numbers() {
num1=$1
num2=$2
echo $(($num1 + $num2))
}
Write a function “reverse_string” that takes a string as an argument and prints it in reverse order. hint the rev function can reverse a printed string Solution:
reverse_string() {
string=$1
echo $string | rev
}
reverse_string "actcatatattt"
Write a function “is_palindrom” that takes a string as an argument and checks if it is a palindrome and returns a message if yes or no. Solution:
is_palindrome() {
string=$1
reverse_string=$(echo $string | rev)
if [[ $string == $reverse_string ]]; then
echo "$string is a palindrome."
else
echo "$string is not a palindrome."
fi
}
#use it
is_palindrome "otto"
Write a function “is_palindrom” that takes a string as an argument and checks if it is a palindrome and returns a message if yes or no. Solution:
is_palindrome() {
string=$1
reverse_string=$(echo $string | rev)
if [[ $string == $reverse_string ]]; then
echo "$string is a palindrome."
else
echo "$string is not a palindrome."
fi
}
#use it
is_palindrome "otto"
Write a function that takes a string as an argument and converts all the characters to uppercase. Solution:
to_uppercase() {
string=$1
echo $string | tr "[:lower:]" "[:upper:]"
}
#use it
to_uppercase "atagatagatacacacagatac"
Write a function that calculates the GC content of a fasta file.
calculate_gc() {
#rehashing from previous section
filename="$1"
if [[ "${filename##*.}" != "fasta" ]]; then
echo "Error: $filename is not a fasta file."
return 1
fi
total_bases=0
gc_bases=0
#first part if from our tutorial
while read line; do
if [[ ${line:0:1} == ">" ]]; then
continue
fi
total_bases=$(($total_bases + ${#line}))
gc_bases=$((($gc_bases + $(echo "$line" | grep -i -o "GC" | wc -l)))
done < "$filename"
gc_content=$(echo (("$gc_bases / $total_bases * 100")) )
echo "The GC content of $filename is $gc_content%."
}
The top of this we have already seen. if [[ ${line:0:1} == “>” ]]; then continue fi: checks whether the current line is a header line (i.e., starts with “>”). If it is, the loop skips to the next line.
total_bases=$(($total_bases + ${#line})) : This line adds the length of the current line to the total_bases variable.
gc_bases=$(($gc_bases + $(echo “$line” | grep -i -o “GC” | wc -l))) : This line uses grep to search for occurrences of the “GC” string in the current line (case-insensitively), counts the number of occurrences with wc -l, and adds the count to the gc_bases variable.
done < “$filename” This line ends the loop that reads each line of the input file.
gc_content=$((100 * $gc_bases / $total_bases)) : This line calculates the GC content as a percentage (i.e., the number of GC bases divided by the total number of bases, multiplied by 100) and assigns the result to a variable named gc_content. Note that the calculation is done with integer division using the (( )) syntax.Write a function or functions that reverse complements a fasta file. There are multiple solutions to this. I will supply a two function solution, you are welcome to try alternate approaches
Try to remember everything you have seen:
reverse_complement() {
seq=$1
rev_comp=$(echo "$seq" | tr 'ATCGatcg' 'TAGCtagc' | rev)
echo "$rev_comp"
}
reverse_complement_fasta() {
input_file="$1"
output_file="${input_file%.*}_revcomp.${input_file##*.}"
while read -r line; do
if [[ "${line:0:1}" == ">" ]]; then
echo "$line" >> "$output_file"
else
sequence="$line"
rev_comp=$(reverse_complement "$sequence")
echo "$rev_comp" >> "$output_file"
fi
done < "$input_file"
}
reverse_complement_fasta squid.fasta
cat squid_revcomp.fasta
“${line:0:1}” is a parameter expansion that extracts the first character of the string stored in the variable line. The syntax is ${parameter:offset:length}, where parameter is the name of the variable, offset is the position of the first character to extract (starting from 0), and length is the number of characters to extract. In this case, offset is 0 and length is 1, so the expansion returns the first character of the string.
Write an awk command that prints the first column of a file separated by a hyphen using the file elementarySchoolShark.txt
awk -F- '{print $1}' elementarySchoolShark.txt
Write an awk command that counts the number of lines in a file. elementarySchoolShark.txt
awk 'END {print NR}' elementarySchoolShark.txt
This is a clever trick, by using the END pattern, the awk command waits until it has processed the entire file before executing the associated action. In this case, the action is to print the value of NR, which gives the total number of records in the file. Remember, the NR variable in awk stands for the current record number being processed, so NR will be incremented for every record in the file.
Write an awk command that prints the lines of elementarySchoolShark.txt that contain both “shark” and “goblin”.
awk 'tolower($0) ~ /shark/ && tolower($0) ~ /goblin/ {print}' elementarySchoolShark.txt
In this command, tolower($0) converts the entire line to lowercase, and the tilde operator (~) is used to perform a case-insensitive pattern match against the regular expressions /shark/ and /great/. The && operator combines the two pattern matches with a logical AND, so the entire expression matches only if both regular expressions match. Finally, the print statement is executed for any line that matches both regular expressions.
Use awk to print the number of sequences in a fasta file
awk '/^>/ {i++} END {print i}' example.fasta
wait a minute, that is the same as for problem number 2! :-)
Print the length of each sequence in a FASTA file.
awk '/^>/ {if (seq) print length(seq); seq=""; next} {seq = seq $0} END {print length(seq)}' example.fasta
This is an odd way to code if you are used to python or R, but take a moment to reflect on this
/^>/ {if (seq) print length(seq); seq=““; next}: This pattern matches any line that starts with a > character, which indicates the start of a new sequence in a fasta file. The associated action is to first check if seq (a variable that stores the sequence) is not empty - this is to ensure that we don’t print the length of the first sequence in the file (which doesn’t have a > character before it). If seq is not empty, we print the length of the previous sequence using the length function, and then reset seq to an empty string. Finally, the next statement is used to skip to the next line of the file without executing any further actions on the current line.
{seq = seq $0}: This pattern matches any line that is not a header line (i.e. doesn’t start with >), and adds the sequence to the seq variable. Note that $0 refers to the entire line.
END {print length(seq)}: This is the END block, which is executed after all lines in the input file have been processed. It prints the length of the last sequence in the file.This builds on the last example. Print the sequence ID and length of each sequence in a FASTA file.
awk '/^>/ {if (seq) print id,length(seq); id=$0; seq=""; next} {seq = seq $0} END {print id,length(seq)}' example.fasta
This is the same logic as above, only we are now also storing the seq id. pretty cool! If this doesn’t make sense, go to the explanation for problem 5 to see if you can figure it out, since this is one additional step.
Print the GC content for each line in a fasta file. To do this, use printf with these options in logic like the above: “%s\t%.2f\n”
%s is a placeholder for a string value
\t inserts a tab character
%.2f is a placeholder for a floating-point number with two decimal places
inserts a newline character.
So, %s\t%.2f\n formats a string followed by a tab character and a floating-point number with two decimal places, and ends with a newline character.
awk '/^>/ {if (seq) printf "%s\t%.2f\n",id,100*gc/length(seq); id=$0; gc=0; seq=""; next} {seq = seq $0; for (i=1;i<=length($0);i++) if (substr($0,i,1)~/[GCgc]/) gc++} END {printf "%s\t%.2f\n",id,100*gc/length(seq)}' example.fasta
ok that is unreal! Let’s break it down
awk '/^>/ { # If the line starts with ">", then:
if (seq) # If there is a sequence already in the buffer, then:
printf "%s\t%.2f\n",id,100*gc/length(seq) # Calculate the GC content of the previous sequence and print it
id=$0; # Set the id to the current line
gc=0; # Reset the gc count
seq=""; # Clear the sequence buffer
next # Move on to the next line
}
{
seq = seq $0; # Add the current line to the sequence buffer
for (i=1;i<=length($0);i++) # Loop through the characters in the current line
if (substr($0,i,1)~/[GCgc]/) gc++ # If the character is a G or C (case insensitive), then increment the GC count
}
END { # After all the lines have been processed, do this:
printf "%s\t%.2f\n",id,100*gc/length(seq) # Calculate the GC content of the last sequence and print it
}' example.fasta
File compression is common. There are no example problems but review this cheat sheet by Rick White from his introduction to scripting course at UNC Charlotte if you ever get stuck.
#compress a file
gzip <filename>
# results in <filename>.gz
#Uncompresses files compressed by gunzip (.gz)
gunzip <filename>.gz
# results in <filename>
#Compresses files compressed by tar (tar.gz)
tar -cvzf <foldername.tar.gz>
#Arguments are
- -z: gzip compression
- -c: Creates a new .tar archive file.
- -v: Verbosely show the .tar file progress.
- -f: File name type of the archive file.
# List contents of tar.gz
tar -tvf <foldername.tar.gz>
# Prints a zipped file without opening it
gzcat <filename.gz> | more
gzcat <filename.gz> | less
#Uncompresses files compressed by tar (tar.gz)
tar -zxvf <foldername.tar.gz>
#arguments
- -z: many files
- -x: extract gzip
- -v: Verbosely show the .tar file progress.
- -f: File name type of the archive file.
# Compresses files compressed by tar (tar.bz2, more compression)
tar -cvjf <foldername.tar.gz>
#argument explanation
- -j: bz2 compression
- -c: Creates a new .tar archive file.
- -v: Verbosely show the .tar file progress.
- -f: File name type of the archive file.
#Uncompresses files compressed by tar (tar.bz2)
tar -xjvf <foldername.tar.bz2>
#argument explanation
- -j: bz2 file
- -x: extract gzip
- -v: Verbosely show the .tar file progress.
- -f: File name type of the archive file.