1. Introducing the Shell
Introduce Socrative
- You can import my quiz with SOC-24313213
- Start the quiz in teacher mode, so you can step through one question at a time
- Advise learners to go to https://b.socrative.com/login/student/
- Type in MICKLEY for the room name
QUESTION: How many of you have used the shell before?
Computers do 4 things:
- Run programs
- Store data
- Communicate with each other
- Interact with us
Interfaces
- Nowadays we mostly interact using a GUI (graphical user interface)
- But we can also use a CLI (command-line interface) with the shell
- We type something in, the computer evaluates it, and prints out something for us to read
- This is what the shell does: it's a program that lets us run other programs
- The shell has been around a long time, since the 1960s. It's survived because it's simple and powerful
- Useful for repetitive tasks and constructing a workflow
- Comes bundled with lots of little simple programs
- Important if you're using something like an HPC cluster, which may not have a GUI
Our hypothetical example: Nelle the marine biologist
- Nelle has collected 1520 samples of marine life from the North Pacific Gyre.
- She wants to run protein assays on each of these samples to measure the relative abundances of 300 different proteins.
- The assay machine she's using spits out a text file with one line for each protein.
- She's on a deadline and has to:
- Run each sample through the machine, which will take a few weeks
- Calculate statistics for each of the proteins separately using a program her supervisor wrote called goostat.
- Compare the statistics for the proteins to each other using goodiff, which her labmate wrote.
- Write up her results, ideally within a month.
- If she has to run goostat and goodiff by hand, she'll have to type in the filenames and click OK 46,370 times.
- Even if she can do this quickly, it will take weeks, and she'll certainly make typos.
- We're going to explore the shell to get at what she could do instead.
- We want to automate the repetitive steps in her workflow so that her computer can do all the work while she writes her paper.
- Plus, once she puts a workflow together, she can use it again if she collects more data.
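The notes don't break down the 46,370 figure. One way to reproduce it, assuming one goostat run per sample and one goodiff run per pair of proteins (this split is an assumption, not stated above), is a quick check with shell arithmetic:

```shell
# Assumption: 1520 goostat runs (one per sample) plus one goodiff run per pair of proteins
samples=1520
proteins=300
pairs=$(( proteins * (proteins - 1) / 2 ))   # 300 * 299 / 2 = 44850
total=$(( samples + pairs ))
echo "$total"                                # prints 46370
```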
2. Navigating Files and Directories
Moving around in the filesystem and seeing what's there
- Setup prompt
PS1='$ ' # (Or PS1="\n\w: \$ ")
- Explain that whoami is a program that is being run and displaying output
- Print working directory: find out where we are. Directory = folder
- Diagram a Unix filesystem and explain the leading / (root directory). There are 2 meanings for /: at the start of a path it is the root directory, anywhere else it separates directory names
- Listing the contents of current directory with ls
- Kind of hard to tell what's what (which are folders?). So we'll add a "flag" (explain flags)
ls -F # Puts a slash after folders
- ls has lots of other options, let's look at the help. man stands for manual
man ls # Windows users: ls --help
- You can move around in man using the mouse or the up/down arrows. Use b and spacebar to move a page at a time, and q to quit
- We can also tell ls to look in a different directory than the current one.
- -F and Desktop are parameters or arguments we're giving to the ls program to tell it how to run
- Hopefully you see the data-shell directory. If not, put up a red sticky
- Now we can see what's inside our data-shell directory
- Instead of just looking, we can change our current directory.
- Say we want to go to that data directory
cd Desktop # cd stands for change directory
cd data-shell
cd data
- cd doesn't tell us anything back, but we can check using our tools from before to see that we are in the data directory
- So now let's go back to the data-shell directory. We can try this
- Oops, that didn't work. cd can only see subdirectories inside the one we're in
- This takes us back. .. stands for the directory containing the current one, that is, its parent
cd ..
- We're back again
- We can see this parent directory if we use the -a (all) flag for ls
- We can also stack multiple flags together like this instead of writing them separately
- Explain hidden files, eg .bash_profile, .DS_Store, ./ (current directory).
- . and .. don't just belong to cd. Any program can use them, eg this shows my Desktop:
- So ls, cd, and pwd are how we navigate around our filesystem
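A minimal sketch of the ls flags discussed above. It runs in a scratch directory created with mktemp, and the file names (data, notes.txt, .hidden) are made up for illustration:

```shell
# Scratch directory with invented contents
scratch=$(mktemp -d)
cd "$scratch"
mkdir data
touch notes.txt .hidden

ls -F        # data/ gets a trailing slash, notes.txt does not
ls -a        # also lists ., .., and the hidden file .hidden
ls -F -a     # two separate flags...
ls -Fa       # ...or the same flags stacked together
```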
- What happens if you just type cd and press enter? Where do you go? Let's figure it out,
- and put up a green sticky when you think you have the answer
---------- Socrative #2 ----------
- Takes us back to the home directory
- Let's go back to the data directory. This time, we don't need to use 3 separate commands, we can string together
cd Desktop/data-shell/data
- So far, we've been using "relative paths" (explain). ls and cd are trying to find where we mean
- We could also use "absolute paths". These will work no matter where we are.
- Remember that leading / means root?
pwd
cd /Volumes/mickles/Desktop/data-shell
- Some shortcuts:
- Tilde (~) is the same as the user's home directory (/Volumes/mickles for me)
- cd - takes you to the previous directory you were in, useful for switching back and forth
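The path ideas above can be tried safely in a scratch tree. This is only a sketch: the Desktop/data-shell/data layout is recreated under a temporary directory rather than in anyone's real home directory.

```shell
# Scratch tree standing in for Desktop/data-shell/data (location is hypothetical)
scratch=$(mktemp -d)
mkdir -p "$scratch/Desktop/data-shell/data"

cd "$scratch/Desktop"     # absolute path: works no matter where we start
cd data-shell/data        # relative path: resolved from the current directory
cd ..                     # up to the parent, data-shell
cd -                      # back to the previous directory, data
here=$(basename "$(pwd)")
echo "$here"              # prints data

echo ~                    # the shell expands ~ to the home directory
```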
---------- Socrative #3 ----------
---------- Socrative #4 ----------
QUESTION: What does the command ls do when used with the -l and -h arguments? Use your stickies
Back to Nelle
- Nelle makes a directory called north-pacific-gyre for where the data came from
- Inside of it is a folder named by the date
- Notice how this is named.
- The filesystem sorts things by name, so if she makes more directories, they'll automatically be sorted
- This becomes more important later on if we want to run something that works with each directory in sequence
- Let's see what's inside the dated folder.
- Lots to type, introduce tab completion: folder 1, folder 2, and contents
3. Working With Files and Directories
Creating, copying, deleting, and editing
- Ok, we know how to explore files and directories, but how do we create them?
cd ~/Desktop/data-shell
ls -F
- Let's create a new directory called thesis using mkdir
- Make directory. Relative path, so in current working directory
mkdir thesis
- Check that it's there. We can also check with our GUI
- Good naming conventions
- Don't use spaces. It's possible to (and I do), but the shell treats them as separators between arguments
- Don't begin the name with -, since that means "flag"
- Stick with letters, numbers, ., -, and _. Many other characters have special meanings
- Now let's make a new file, using an editor in the shell called nano
- You could use any other text editor instead, nano is just convenient
- Notepad++ (Windows) or Sublime are good ones
- Explore nano commands using the control key (^)
- Write something like:
- Now let's write this out using Ctrl+O
- Our file exists!
- We can remove this file using rm
- Careful though! In shell, deleting is forever! No "are you sure?". No trash bin.
- Let's recreate the file, and move back one directory to data-shell
- Now let's try to remove the whole thesis directory
- Doesn't work. rm only works with files by default, not directories. This is a good thing.
- If we add the recursive (-r) flag, we can delete everything inside thesis, and it too
- But this is super powerful, and again you need to be careful!! We might forget what else is inside.
- A better way is to add the interactive flag to rm too
- We answer the prompt with y or n (yes and no also work)
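A sketch of the rm behaviour described above, run in a scratch directory so nothing real is deleted. The answer to the -i prompt is piped in here only so the example runs unattended; in class we would type it.

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir thesis
echo "It's not publish or perish any more" > thesis/draft.txt

# Plain rm refuses to remove a directory
rm thesis || echo "rm refused to delete a directory"

# -i asks before deleting; the answer n is piped in, so the file survives
echo n | rm -i thesis/draft.txt || true
kept=$(ls thesis)
echo "$kept"              # prints draft.txt

# -r removes the directory and everything inside it, without asking
rm -r thesis
ls                        # prints nothing: thesis is gone
```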
---------- Socrative #5 ----------
- Let's recreate it again. This time we don't need to be in the thesis directory
pwd
mkdir thesis
nano thesis/draft.txt
ls thesis
- Let's change the name to something more appropriate
- mv stands for move. First argument is the file we're moving, 2nd is where to go
mv thesis/draft.txt thesis/quotes.txt
ls thesis
- Be careful with mv too. It overwrites files without telling you.
- You can avoid this with -i
- Let's move quotes.txt into the current working directory. Remember, . is the current working directory (what pwd prints)
mv thesis/quotes.txt .
ls thesis
- If we give a filename to ls, it only looks for that file
- We also have a copy command called cp
cp quotes.txt thesis/quotations.txt
ls quotes.txt thesis/quotations.txt
- Just to prove we have a copy, let's delete the original
rm quotes.txt
ls quotes.txt thesis/quotations.txt
- Names and extensions
- The files we're working with have names that are something dot something
- This extension at the end isn't required. We could have just called this quotes
- But extensions are helpful for us (and programs) to know how to interpret them
---------- Socrative #6 ----------
---------- Socrative #7 ----------
4. Pipes and Filters
Combining commands to do novel things
- Combining commands or programs together is where we really get into the shell's power
- Let's look in the molecules directory.
- This has some files describing some organic molecules in protein data bank (pdb) format
pwd
ls molecules
- Let's go into that directory and run wordcount
- This shows the # of lines, words, and characters
- The * character is a "wildcard". It matches 0 or more characters, so *.pdb matches all pdb files
- We could also use p*.pdb for only pentane and propane
- Another wildcard is ?, but it only matches a single character.
- So p?.pdb wouldn't match pentane.pdb, only something like pi.pdb
- We could use multiple wildcards at once
wc ??hane.p* # This will only match ethane
- One note: If nothing matches, our wildcard match gets passed as-is, eg:
- The shell is creating a list of matching files BEFORE running wc
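A sketch of wildcard expansion using echo, which simply prints whatever the shell hands it. The empty files below are stand-ins for the molecules directory, and the unmatched-pattern behaviour shown is bash's default.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb

echo *.pdb        # all six files
echo p*.pdb       # pentane.pdb propane.pdb
echo ??hane.p*    # ethane.pdb
echo p?.pdb       # nothing matches, so the pattern is passed as-is: p?.pdb
```

Because the shell expands the pattern first, echo (like wc) never sees the wildcard unless nothing matched.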
---------- Socrative #8 ----------
- If we add the -l flag to wc we only get the # of lines
- We could also use -w and -c for the # of words or characters
- Say we wanted to know which file was the shortest.
- Easy with only 6 (methane), but what if there were thousands of files?
- First step, save the lengths to a file
- The > symbol REDIRECTS the output to the filename (and we don't see it!)
- Creates or overwrites the file
wc -l *.pdb > lengths.txt
ls lengths.txt
- Now we want to see what's in the file
- We can print it using cat = concatenate (can be used with multiple files)
- One disadvantage of cat is it dumps the entire file. Not so good if file is long
- Can use less instead to just show a screen at a time
less lengths.txt # Press q to quit
- Now that we have a file, we can use the sort command to sort its contents
- We also have to use the -n flag for numeric sort (otherwise sort compares text, so 10 and 100 come before 9)
- We can save these sorted lengths
sort -n lengths.txt > sorted-lengths.txt
cat lengths.txt
cat sorted-lengths.txt
- We can also use a command called head to just get the first line. -n is the # of lines to get
head -n 1 sorted-lengths.txt
- Things are getting confusing with all these intermediate files
- Fortunately we can avoid those intermediates by running everything together
sort -n lengths.txt | head -n 1
- The | is called a pipe, and it's very useful!!!
- It means take the output of the left side and use it as input for the right side
- We could also do this for wordcount and sort
- And for everything all at once, no intermediate files!
- Go over what this is doing, basically reading backwards: head of sort of wordcount
wc -l *.pdb | sort -n | head -n 1
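Why the -n flag matters in these pipelines can be seen with three made-up numbers fed straight into sort:

```shell
printf '10\n9\n100\n' | sort       # text order: 10, 100, 9
printf '10\n9\n100\n' | sort -n    # numeric order: 9, 10, 100
```

Without -n, head -n 1 at the end of the pipeline could report the wrong file as the shortest.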
---------- Socrative #9 ----------
- So we end up with lots of little tools that do one job well and can be strung together
- Keeps things from getting too complicated
- Programs like wc and sort act as filters, and pipes connect them to each other.
- They take input, transform and give us output
- We used > to redirect output to a file. We can also redirect a file to input using <
- Same as wc methane.pdb, but there's no filename to open, it's redirected
wc < methane.pdb
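A small sketch of the difference between giving wc a filename and redirecting a file into it. The three-line sample.txt is invented for the example.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'one\ntwo\nthree\n' > sample.txt

wc -l sample.txt      # wc opens the file itself, so it can print the name
wc -l < sample.txt    # the shell feeds the file to wc's input, so there is no name to print
```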
Back to our biologist Nelle
- Nelle decides to check the length of her data files
cd north-pacific-gyre/2012-07-03
wc -l *.txt
- Things like this can be good for error checking. There's a file that's too short, missing data
- Still a lot of work to go through though if she's got thousands of files
- So we do this instead to look at the shortest 5
wc -l *.txt | sort -n | head -n 5
- We could also look for files that are too big using tail (similar to head, but last lines)
- That looks ok size-wise, but what's that Z doing in the 2nd line?
- Should just be A or B for 2 different depths
wc -l *.txt | sort -n | tail -n 5
- Let's see if there are any others like it
- Turns out there are two, where depth wasn't recorded
- We could delete these files using rm, but we might want to use them later
- Instead we can just exclude them from all analyses with wildcards
- [AB] means match one character that is either an A or a B
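A sketch of the [AB] wildcard. The file names below are made up in the style of Nelle's data; only the final letter before .txt matters here.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch NENE01729A.txt NENE01729B.txt NENE01971Z.txt NENE02040A.txt NENE02040Z.txt

echo *Z.txt        # the files where depth wasn't recorded
echo *[AB].txt     # names ending in A.txt or B.txt, so the Z files are left out
```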
---------- Socrative #10 ----------
---------- Socrative #11 ----------
5. Loops
- Also reduces amount of typing and mistakes
- We're going to work in the creatures directory
- Here we have two files, let's assume they're genome data files and we have a lot more than 2
- We can inspect one to see
- Say we wanted to modify these files, but we wanted to save a backup first
- We could try this, but it won't work
- That would expand to the following, and try to copy 2 files to a directory that doesn't exist
cp basilisk.dat unicorn.dat original-*.dat
- Instead, we'll have to use a loop. We'll come back to this example
- A simple example of a for loop
- Note that the > character here means that our command isn't finished yet.
- We need the done to finish it
- The for loop does something for each thing in a list. In this case, the list is the two filenames
- Each time through the loop, the filename we're working on is saved in a variable named filename
- Inside the loop, we can get and substitute the variable's value by putting a $ in front of it
- Finally, the thing we're actually doing each time is just head
- Note that the > now has multiple meanings. It can mean "redirect to a file" if we type it in our command. Or the shell prints it when it's expecting us to type more, because the command isn't finished.
- > and $ are two different "prompts"
- We could use x as a variable name instead
- Indenting the things we're doing inside the loop makes the code easier to read
- Best to pick variable names that make sense with what you're doing, filename is better than x
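A minimal for loop of the kind described above. The two .dat files are created here with invented contents so the sketch runs on its own.

```shell
scratch=$(mktemp -d)
cd "$scratch"
# Stand-in data files (contents are invented)
printf 'COMMON NAME: basilisk\nUPDATED: 1745-05-02\nCCCCAACGAG\n' > basilisk.dat
printf 'COMMON NAME: unicorn\nUPDATED: 1738-11-24\nAGCCGGGTCG\n' > unicorn.dat

# For each name in the list, store it in filename and run head on it
for filename in basilisk.dat unicorn.dat
do
    head -n 2 "$filename"
done
```

Typed interactively, the shell shows the > prompt on every line after the first until it sees done.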
---------- Socrative #12 ----------
- A slightly more complicated loop
- We could also use curly braces to get our variable ${filename} is the same as $filename
- We use a wildcard for the filenames instead of listing them ourselves
- This time we run two commands. The first is echo, which just echoes/prints the filename
- We couldn't just put $filename there.
- Then the shell would expand it to basilisk.dat and try to run that
- Finally, we take the first 100 rows, and then the last 20 of those, = rows 81-100
- Testing echo
- Say we had some filenames with spaces, eg red dragon.dat. We'd have to quote them. Otherwise, the shell would treat them as separate files
- Again, it's often easier to just avoid spaces
for filename in "red dragon.dat" "purple unicorn.dat"; do head -n 100 "$filename" | tail -n 20; done
- Back to our file copying problem, we can solve it with this loop
- Each time through the loop, it works on a different file, as if we had typed a separate cp command for each one
- Check for copies
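One way to write the backup loop described above, as a sketch in a scratch directory with empty stand-in files:

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch basilisk.dat unicorn.dat      # empty stand-in files

# *.dat is expanded once, before the loop starts, so the new copies are not looped over
for filename in *.dat
do
    cp "$filename" "original-$filename"
done

ls    # now shows the two data files plus original-basilisk.dat and original-unicorn.dat
```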
Back to our friend Nelle, building her pipeline
- First she wants to make sure she can select the right files
- Now she wants to run her goostats program on them and write the results to files
- To be safe, we're still using echo here
- All this typing is increasing our chance of mistakes though. Fortunately, we can reuse some of our typing
- Hitting the up arrow key gives us the last command. Note the semicolons, these separate different lines. We can then move around and change echo to bash goostats to run the program
- Now it's running the stats, but we have no idea as to progress! We can stop the for loop with CTRL+C
- Let's add echo $datafile; back in so we can see which file we're working on. If you know how many files you have, you can estimate how long this will take to run
- Editing the previous command still takes a while using the arrow keys. CTRL+A takes us to the beginning of the line, and CTRL+E to the end.
- Also, we could keep hitting up arrow to go through our history, eg find the ls command. Alternatively we could use the history command and pipe it through tail to get the last 15
- Notice that the history entries are numbered. We can run any of them with an exclamation point
!132 # Run the ls command
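A sketch of the dry-run idea above: echo prints each command the loop would run, so goostats itself is never executed here. The data file names and the stats- output prefix are assumptions made for illustration.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch NENE01729A.txt NENE01729B.txt NENE01971Z.txt   # made-up data files

# Dry run: print the commands instead of running them
for datafile in *[AB].txt
do
    echo bash goostats "$datafile" "stats-$datafile"
done

# The same loop on one line, the way the up arrow shows it; semicolons replace the line breaks
for datafile in *[AB].txt; do echo bash goostats "$datafile" "stats-$datafile"; done
```

Once the printed commands look right, removing the first echo would run them for real.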
---------- Socrative #13 ----------
---------- Socrative #14 ----------
6. Shell Scripts
How we save and reuse groups of commands
- Now we're going to save our whole workflow in a file, so that we can just run the file
- First let's go back to the molecules directory
cd ~/Desktop/data-shell/molecules
- Now let's edit a new file
- Put in the following, which selects lines 11-15
- We're not running this as a command, we're just putting it in a file
head -n 15 octane.pdb | tail -n 5
- Save and exit: CTRL+O, CTRL+X
- Now we can run the file, which in turn runs the commands inside of it
- Compare to running the command directly: they're the same
head -n 15 octane.pdb | tail -n 5
- What if we want to select lines of an arbitrary file?
- Editing middle.sh isn't a great solution
- Instead, we can use a special variable called $1
- This will be replaced by whatever argument we give our middle.sh
nano middle.sh
head -n 15 "$1" | tail -n 5
- Now we can run the following
bash middle.sh octane.pdb
bash middle.sh pentane.pdb
- We put $1 in quotes in case the filename has spaces (better safe than sorry)
- What if we wanted to change the range of lines though?
- We can add more special variables for more arguments
nano middle.sh
head -n "$2" "$1" | tail -n "$3"
- Now we can run:
bash middle.sh pentane.pdb 15 5
bash middle.sh pentane.pdb 20 5
- Works great, but what if someone else needs to use this, or we want to use it 6 months later?
- Add Comments!!! They start with a #, and the computer ignores these lines when parsing them.
- What if we want to process many files in one pipeline?
- We could put something like this in a file, but it'd only work for .pdb
- $1 and $2 won't work either, we don't know how many files there will be
- Luckily, there's a special variable $@ which means "all the arguments"
nano sorted.sh
wc -l "$@" | sort -n
- And it works:
bash sorted.sh *.pdb
bash sorted.sh *.pdb ../creatures/*.dat
- What if we don't give it any arguments? Now $@ expands to nothing
- wc just waits for input, since it didn't get a filename
- We could save our history, to avoid typos, but it'll take some editing
history | tail -n 5 > redo-figure-3.sh
cat redo-figure-3.sh
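Putting the middle.sh pieces together, here is a sketch of a commented version taking three arguments. The script is written with a heredoc only so the example runs without opening nano, and numbers.txt is an invented input file.

```shell
scratch=$(mktemp -d)
cd "$scratch"
seq 1 30 > numbers.txt     # stand-in input: 30 numbered lines

# The quoted 'EOF' keeps $1, $2, $3 literal while the script is written
cat > middle.sh <<'EOF'
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
EOF

bash middle.sh numbers.txt 15 5    # prints 11 through 15
bash middle.sh numbers.txt 20 5    # prints 16 through 20
```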
Nelle's script
- Nelle forgot some arguments for goostats. Luckily, it's easy to re-run, and she can make a script
- Now she can run it, specifying which files to run on
bash do-stats.sh *[AB].txt
- She could have put the *[AB].txt inside her script.
- This might be safer, but it's less flexible
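The notes don't show Nelle's script itself, so this is only a sketch of what do-stats.sh might look like. The goostats file created here is a hypothetical stand-in that writes a line count, and the data files and stats- prefix are also assumptions.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf '1\n2\n3\n' > NENE01729A.txt     # made-up data files
printf '4\n5\n' > NENE01729B.txt
printf '6\n' > NENE01971Z.txt

# Hypothetical stand-in for goostats: writes the input's line count to the output file
cat > goostats <<'EOF'
wc -l < "$1" > "$2"
EOF

cat > do-stats.sh <<'EOF'
# Calculate stats for each data file given as an argument.
for datafile in "$@"
do
    echo "$datafile"
    bash goostats "$datafile" "stats-$datafile"
done
EOF

bash do-stats.sh *[AB].txt    # the Z file is skipped by the wildcard
ls stats-*
```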
---------- Socrative #15 ----------
7. Finding Things
cd ~/Desktop/data-shell/writing
cat haiku.txt
- Look for lines that have "not" in them
- Or day. But this includes words containing day
- To just get day, we can use -w (for word)
- We can also search for a phrase with quotes
grep -w "is not" haiku.txt
- It's also useful to get the line numbers of the lines that match
- Flags can be combined: line numbers and words
grep -n -w "the" haiku.txt
- Line numbers, words, and case-insensitive
grep -n -w -i "the" haiku.txt
- Or we can invert our search, show the lines that do NOT contain "the"
grep -n -w -v "the" haiku.txt
- Grep has lots of other options
- Grep supports something called regular expressions, which are like our wildcards but more powerful
- -E extended regex, Quotes to prevent shell expansion, ^ = beginning, . = single character
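A sketch of the grep flags above on a small made-up file (not the lesson's haiku.txt), so the expected matches are easy to check by eye:

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'Today is not\nthe day to play.\nThe software is here.\nYesterday it worked.\n' > sample.txt

grep day sample.txt            # lines 1, 2, and 4: also matches Today and Yesterday
grep -w day sample.txt         # line 2 only: day as a whole word
grep -n -w -i the sample.txt   # lines 2 and 3, each prefixed with its line number
grep -E '^.o' sample.txt       # line 1 only: the second character is o
```

The single quotes around '^.o' keep the shell from trying to interpret the pattern before grep sees it.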
- grep finds lines in files, but the find command finds files themselves
- This finds all the files in the current directory (and it's recursive)
- Only show directories
- Only show files
- We can also match by name
- This doesn't actually find all of them, remember the shell expands BEFORE command runs
find . -name *.txt
find . -name haiku.txt # Ends up same as this
- Putting in single quotes prevents shell from expanding
- Pretty similar to ls right? But find lets us restrict our search
- Shell runs whatever is inside $() first
wc -l $(find . -name '*.txt')
- Same as this
wc -l ./data/one.txt ./data/two.txt ./haiku.txt
- We often string grep and find together
- Find all the .pdb files contained in the parent directory of this one, then look for FE in them.
grep "FE" $(find .. -name '*.pdb')
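A sketch of the find commands described above, in a scratch directory with made-up files so the results are predictable:

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir data
printf 'a\nb\n' > data/one.txt      # made-up files
printf 'c\n' > haiku.txt

find . -type d            # directories only: . and ./data
find . -type f            # files only
find . -name '*.txt'      # quoted, so find (not the shell) matches the pattern

# $() runs find first, then hands the resulting file list to wc
wc -l $(find . -name '*.txt')
```

Note that $(find ...) splits its output on spaces, so this approach assumes file names without spaces.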
---------- Socrative #16 ----------
Challenge
- Write a short explanatory comment for the following shell script:
wc -l $(find . -name '*.dat') | sort -n