04_regex.Rmd

\newpage

# Regular Expressions

```{r, echo=F}
library(knitr)
opts_chunk$set(tidy.opts=list(width.cutoff=60),tidy=TRUE)
options(width = 60)
```

## Overview

**Regular Expressions**, also known as **regex** and **regexp** is a syntax that allows for coders to run complex find-and-replace functions. I didn't learn regular expressions until I was a postdoc working at Duke University, but I wish I had learned this a lot earlier! This remains one of the most useful programming tools I have ever used. It is absolutely essential for working with any kind of large text files or large datasets.

A lot of programming tools in biology use input text files that require very specific formatting (e.g. .txt, .csv, .fasta, .nex). Sometimes, you might need to reorganize or recode data in a large text file. This can be a big pain that can lead to errors if you try to do everything manually. But regular expressions can automate the process.

Here's one example. As a PhD student I co-founded a project called the **Global Garlic Mustard Field Survey (GGMFS)** with collaborator Dr. Oliver Bossdorf at the University of Tuebingen -- yes the same Dr. Bossdorf mentioned in the [qplot Tutorial](https://colauttilab.github.io/RCrashCourse/2_qplot.html). We were fortunate to have over 100 collaborators across Europe and North America who helped to collect samples for the project. Details of the project were published in the Journal **Neobiota**: https://neobiota.pensoft.net/article/1270/ but one BIG problem is the way that each of these 100+ collaborators entered their data online. For example, latitudes and longitudes were entered in a variety of different formats. Regular expressions allowed me to write a small program to automatically convert all of these different formats to a common, decimal format that we could use for the analysis. This saved a huge amount of time and prevented errors that could have been introduced if we tried to edit these values by hand.

Often when you work with large datasets, you will need to automate some of your error correction, and regular expressions can be a big help here. For example, imagine a simple field where people were simply asked a simple yes or no question. You might find a variety of inputs such as: "YES, Y, yes, and Yes". These all mean the same thing but if you try to analyze it, R will treat these as different categories. Here again, regular expressions can be used to quickly change all the different examples to a common "Y" or even a boolean variable `TRUE`.

One final example, is pattern searching. This is common for the analysis of DNA, proteins or other large strings of data. You may want to find a particular sequence of data, possibly with a few variable sites: e.g. TCTA or TCAA or TCGA. This is another area where regular expressions can help.

### Universal

Regular expressions are a universal language that extends to many other programming languages, including **C/C#/C++**, **Python**, **Unix/Linux**, and **Perl**. We focus here on R but most of the syntax is mantained across programming languages.

> WARNING!

There is a very steep learning curve here, and the only way to really learn this is to drown yourself in examples. There are lots of exercises you can do for practice online. You should also try to apply these whenever you can.

## Functions

There are four main functions that use regular expressions in R. 

`grep()` and `grepl()` are equivalent to 'find' in your favorite word processor

They have the general form: `gsub("find this pattern", in.this.object)`

`grep()` outputs a vector with all the address locations (i.e. numbers) that match. Thus the output length is equal to the number of matches.

`grepl()` outputs a vector of `TRUE` (match) and `FALSE` (no match). Thus, the output length is equal to the length of the input object.

`sub()` and `gsub()` are equivalent to 'find and replace'

They have the general form: `grep("find this pattern", "and replace with this", in.this.object)`

`sub()` replaces only the first match, whereas `gsub()` replaces all of the matches.

Some specific examples are provided below to help you understand these similarities and differences.

There are two other more advanced functions in R. These aren't covered in this tutorial, but may be of use once you are more comortable with the above functions.

`regexpr()` provides more detailed info about the first match

`gregexpr()` provides more detailed results about all matches

> See ?regexpr and ?gregexpr for more info

### Examples

Some examples can help to understand the differences among the four main functions. Let's start with a simple data frame of species names.

```{r}
Species<-c("petiolata", "verticillatus", "salicaria", "minor")
print(Species)
```

### `grep()` 

This returns cell addresses matching query. Note the vector length compared to the input vector.

```{r}
grep("a",Species)
```

We can also see the specific values with the `value=T` parameter

```{r}
grep("a",Species, value=T)
```


### `grepl()` 

This returns a vector of `TRUE` (match) and `FALSE` (no match). Compare this output with `grep()`.

```{r}
grepl("a",Species)
```

### `sub()` 

This replaces the first match (in each cell)

```{r}
sub("l","L",Species)
```

### `gsub()`  

This replaces all matches (in each cell). Compare this output to `sub()`.

```{r}
gsub("l","L",Species)
```


## Wildcards

### `\` escape character

The backslash is a special character. It's called the 'escape' character because it is used to 'escapes' the interpretation of the next character that occurs after it. This is easier to understand by example, as shown below.

### `\\` in R

In the introduction, we discussed the universality of **regular expressions** in the sense that a similar syntax is used by many different programming langagues. But now here is one exception. In R, the double-escape is usually needed, whereas other programming languages typically use just one. The reason is a bit meta -- it's because we are running regular expressions within R object. So the first `\` is used to escape special characters in R, applying it to the second `\`, which is itself the special character that needs to be escaped to pass through the function. The second slash is followed by the 'escaped' character. Some examples are provided below.

If that isn't clear. Just remember that you need two backslashes instead of one.

### `\\w`

Instead of finding the letter `w`, the `\\w` is a **wildcard** character that represents any letter or digit. It also includes underscore `_` for some reason.

```{r}
gsub("w","X","...which 1-100 words get replaced?")
gsub("\\w","X","...which 1-100 words get replaced?")
```

### `\\W` 

This is the inverse of `\\w` find a character that is NOT a letter or number.

```{r}
gsub("\\W","X","...which 1-100  words get replaced?")
```

### `\\s`

This represents a space

```{r}
gsub("\\s","X","...which 1-100  words get replaced?")
```

### `\\t` 

This is a tab character. A lot of data files stored as text are tab-delimited (.tsv) as well as comma-delimited (.csv)

```{r}
gsub("\\t","X","...which 1-100 \t words get replaced?")
```

> Remember \t is a tab character:

```{r}
cat("A\t\t\tB C")
```


### `\\d` 

Digit characters

```{r}
gsub("\\d","X","...which 1-100  words get replaced?")
```

### `\\D` 

Non-digit characters

```{r}
gsub("\\D","X","...which 1-100  words get replaced?")
```

## New Lines

There are two special characters that indicate new lines in a text file. 

### `\\r`

This is the 'carriage return' special character

### `\\n` 

This is the 'newline' special character

### Big Problem

One or two of these is/are generated when you press the 'enter' key while writing a text file. These also add a source of headache and confusion when working with text files because:

> Macs/Unix and PC/Windows use different standards!

Unix/Mac files -- lines usually end with `\\n` only

Windows/DOS files -- lines usually end with `\\r\\n`

This can cause problems when moving text files between Windows/DOS and Mac/Unix machines, particularly with older operating systems or when working on remote computers that use very basic Linux software.


## Special characters

In addition to special characters that use the escape `\\`, there are a number of other special characters that don't use the escape, but have a special meaning.

Note that if you want to search for the characters below you would have to use the escape character. E.g. `\\.` if you wanted to search for a period `.`.

### `|`

This is sometimes called the **pipe** character, and it simply means 'or'. For example, we can search for `w or e`.

```{r}
gsub("w|e","X","...which 1-100  words get replaced?")
```

### `.` 

The period is a wild card that means 'anything'. This includes all of the `\\w` characters but also other characters like puncutation marks.

```{r}
gsub(".","X","...which 1-100  words get replaced?")
```

So how to search for a period `.`? As noted above, we have to use the escape character

```{r}
gsub("\\.","X","...which 1-100  words get replaced?")
```

### `*`, `?`, `+`, `{}` 

These special characters refer to details about the kind of search that we are trying to conduct. Look at these examples carefully, and remember that `sub` replaces the first match while `gsub` replaces all of the matches.

```{r}
sub("\\w","X","...which 1-100 words get replaced?")
gsub("\\w","X","...which 1-100 words get replaced?")
```

Now let's apply some of these special characters to see how they work.

#### `+`

Finds 'one or more' matches (i.e. at least one match)

```{r}
sub("\\w+","X","...which 1-100 words get replaced?")
gsub("\\w+","X","...which 1-100 words get replaced?")
```

Compare this match to the one above. Notice how we have replaced groups of letters instead of single letters. The algorithm works like this: 

  1. Start at the left and move to the right, one character at a time
  2. Check if the character is a letter or number (`\\w`). 
  3. If NO, move to the next character
  4. If YES, check the next character. If it is also a `\\w` then go to the next character. Repeat until the next character is not `\\w`, and replace the entire string of characters.

When run in the `sub` command, it does the above and then stops. When run with the `gsub` command, it continues to the next character, and then starts over.

### `*` 

This is a 'greedy' search (match the largest possible)

```{r}
sub("\\w*","X","...which 1-100 words get replaced?")
gsub("\\w*","X","...which 1-100 words get replaced?")
```

In the `sub` command, it detects a `.` as the first character, indicating no match. It replaces the 'null' or `0` match at the beginning, which has the effect of adding a character. In the `gsub` command it repeats this before each `.` until it finds the letter `w`. Then it finds a group of `\\w` matches, replacing with a single `X`. Then a space, which is skipped, then a `-`, which is another null match, prompting another insert.


### `?` 

This means 'match zero or one time'

```{r}
sub("\\w?","X","...which 1-100 words get replaced?")
gsub("\\w?","X","...which 1-100 words get replaced?")
```
Compare this to the `*` above. It behaves in a similar way, except it is not 'greedy' -- in the second example, each letter is replaced instead of entire words.

###  `+?` 

This is the 'lazy' version of `+` -- note in particular the difference in `sub` which replaces on the the first letter here but the whole word when `+` is used alone. In the `gsub` example we end up replacing every letter instead of whole words. Remember, `sub` runs the algorithm once and then stops, while `gsub` repeats until it reaches the end of the line.

```{r}
sub("\\w+?","X","...which 1-100 words get replaced?")
gsub("\\w+?","X","...which 1-100 words get replaced?")
```

### `*?`

Similarly, we can combine these characters for the 'lazy' version of `*`

```{r}
sub("\\w*?","X","...which 1-100 words get replaced?")
gsub("\\w*?","X","...which 1-100 words get replaced?")
```

> Try using `+*`. Why do you get an error message? 

### `{}`

Curly brackets are used to specify a number of matches, expanding on the options even further.

### `{n,m}` 

Find between $n$ to $m$ matches

```{r}
gsub("\\w{3,4}","X","...which 1-100 words get replaced?")
```

### `{n}` 

Find exactly $n$ matches

```{r}
gsub("\\w{3}","X","...which 1-100 words get replaced?")
```

### `{n,}`

Find $n$ or more matches

```{r}
gsub("\\w{4,}","X","...which 1-100 words get replaced?")
```

### {}?

As above, we can use `?` for the 'lazy' versions of these searches

```{r}
gsub("\\w{4,}?","X","...which 1-100 words get replaced?")
```


## `[]`

Square brackets allow us to find 'any' of the values listed within them. We can also use the dash `-` to specify a range of numbers or letters.

```{r}
gsub("[aceihw-z]","X","...which 1-100 words get replaced?")
```

In the above example, we search for 1 of any of the listed letters: a, c, e, i h, w, x, y, z. Note that x and y are included in the `x-z` statement.

What if we want to find 1 or more of these characters in a row?

```{r}
gsub("[aceihw-z]+","X","...which 1-100 words get replaced?")
```

## `^` and `$`

Use these characters to specify searches at the start `^` or end `$` of the input string.

### `^`

Find species starting with the letter *a*

```{r}
grep("^a",Species)
```

> IMPORTANT: `^` Also means 'negate' when used within `[]`

Find species containing any letter other than *a*

```{r}
grep("[^a]",Species)
```

Replace every letter except *a*

```{r}
gsub("[^a]","X",Species)
```

### `$`

Find species ending with *a*

```{r}
grep("a$",Species)
```


## `()`

Regular parentheses are used to 'capture' text, which can then be specified in the replacement string using `\\1`. Or you can capture multiple pieces of text and reorganize them by using the corresponding number -- `\\1` for the first set of`()`, `\\2` for the second set of `()`, etc. Some examples should help. 

Replace each word with its first letter

```{r}
gsub("(\\w)\\w+","\\1",
     "...which 1-100 words get replaced?")
```

Pull out only the numbers and reverse their order

```{r}
gsub(".*([0-9]+)-([0-9]+).*","\\2-\\1",
     "...which 1-100 words get replaced?")
```

Reverse first two letters of each word

```{r}
gsub("(\\w)(\\w)(\\w+)","\\2\\1\\3",
     "...which 1-100 words get replaced?")
```

## Scraping

Scraping is a method for collecting data from online sources. In R, we can use the functions `readLines` and `curl()`, both from the `curl` library, to 'scrape' data from websites. Websites with the `.html` extension are a special kind of text file. 

We can use regular expressions to pull out text from the website. Here's an example where we will scrape a record for the Green Fluorescent Protein (GFP) from the Protein Data Bank (PDB). Note that this is a file with the extension `.pdb` but this is a human-readable text file that can be opened in any text editor

First, we'll import the text into an R object.

```{r}
library(curl) 
```

You will have to use `install.packages("curl")` to download this package to your computer. You only need to do this once but you will have to use `library(curl)` whenever you want to use the commands, as explained in the [R Fundamentals Tutorial](https://colauttilab.github.io/RCrashCourse/1_fundamentals.html)

Now we can download a file to play with.

``` {r}
Prot<-readLines(curl("http://www.rcsb.org/pdb/files/1ema.pdb"))
```

> HINT: Download this link to your computer and open with a text file. 

This hint is a simple trick to understand what kind of file(s) you are working with.

This is a tab-delimited file, which we could import as a data frame using `read.delim` but we'll keep it this way to see how we can use regular expressions.

The `Prot` object we have made is a simple vector of strings, with each cell corresponding to a different row of text:

```{r}
length(Prot)
Prot[grep("TITLE",Prot)]
```

We can pull out the amino acid sequences, which are rows that start with the word 'ATOM'

```{r}
AAseq<-Prot[grep("^ATOM",Prot)]
length(AAseq)
AAseq[1]
```
> Try to isolate the 3-letter amino acid code

There are lots of possibilities. Take the time to try a few.

Here's one good option, since we know it's a tab-delimited file with the amino acid in the 4th column:

```{r}
gsub("ATOM\\t\\w+\\t\\w+\\t(\\w+).*","\\1",AAseq[1])
```

That didn't work. Sometimes the 'tabs' are actually just multiple 'spaces'

```{r}
AAchain<-gsub("ATOM\\s+\\w+\\s+\\w+\\s+(\\w+).*","\\1",AAseq)
AAchain[1:100]
```

## Examples 

Let's try practicing with a couple of examples

## Transect Data

Regular expressions are also useful with data objects

Imagine you have a repeated measures design. 3 transects (A-C) and 3 positions along each transect (1-3)

```{r}
Transect<-data.frame(Species=letters[1:20],
                     A1=rnorm(20), A2=rnorm(20), A3=rnorm(20),
                     B1=rnorm(20), B2=rnorm(20) ,B3=rnorm(20),
                     C1=rnorm(20), C2=rnorm(20), C3=rnorm(20))
head(Transect)
```

> TIP: the object `letters` contains lower-case letter, while `LETTERS` contains upper case.

### Challenge 

Use your knowledged gained above with substting data outlined in the [R Fundamentals Tutorial](https://colauttilab.github.io/RCrashCourse/1_fundamentals.html) to do the following:

  1. Subset only the columns that have an "A" in their name
  2. Subset the data for species "D"

Take the time to do this on your own. It will take you a while and you will make a lot of mistakes. That's all part of the learning process. The longer you struggle, the faster you will learn. 

Now here is a more challenging example:


## Genbank

Here is a line of code to import DNA from genbank. (The one line is broken up into three physical lines to make it easier to read)

```{r}
Lythrum_18S<-scan(
  "https://colauttilab.github.io/RCrashCourse/sequence.gb",
  what="character",sep="\n")
```

This is the sequence of the 18S subunit from the ribosome gene of *Lythrum salicaria* (from Genbank)

```{r}
print(Lythrum_18S)
```

Notice that each line is read in as a separate cell in a vector, with sequences beginning with a number ending with 1. We can take advantage of this to extract just the sequence data

### Challenge

**Before proceeding** try to do the following:

  1. Isolate only the rows containing DNA sequences. This should include
    a. Removing all of the characters that are not a, t, g, or c.
    b. Combining separate cells/lines into a single string. You can do this with using the `paste()` function with the `collapse=""` parameter 
  2. Convert lower-case to upper-case. To do this, you can use:
```{r, eval=F}
gsub("([actg])","\\U\\1",Seq,perl=T)
```
  The `\\U\\\1` means 'paste brackets as upper-case, and is only available as a **Perl** command, which is accessible in gsub with the `perl=T` parameter.
  3. Replace start codons (ATG) with "-->START-->ATG"
  4. Replace stop codons (TAA or TAG or TGA) with TAA or TAG or TGA followed by ">--STOP--|"


Take the time to stuggle with this and try different combinations until you find a way through. The more you struggle, the faster you will learn.

A cool thing about regular expressions is that there is rarely a single right answer, especially for complicated problems. When you are ready, Continue on to see one possible solution.


## Solutions

## Transects

You want to look at only transect A for the first 3 species:

```{r}
Transect[1:3,grep("A",names(Transect))]
```

Subset the data for the first plot of each transect:

```{r}
Transect[grep("1",names(Transect)),]
```

## Genbank

Use `.*` with `()` to delete everything before the DNA sequence

```{r, include=F}
Seq<-gsub(".*(1 [gatc])","",Lythrum_18S)
paste(Seq,collapse="")
```

Use the `.*` and space with `+` to eliminate all text before the sequence :

```{r, include=F}
Seq<-gsub(".*ORIGIN +","",paste(Seq,collapse=""))
Seq
```

Elimminate spaces and the two `//` at the end

```{r, include=F}
Seq<-gsub(" |//","",Seq)
Seq
```

Capital letters look nicer, but requires a PERL qualifier `\\U` that is not standard in R

```{r, include=F}
Seq<-gsub("([actg])","\\U\\1",Seq,perl=T)
print(Seq)
```

Look for start codons?

```{r, include=F}
gsub("ATG","-->START-->ATG",Seq)
```

Look for stop codons? 

```{r, include=F}
gsub("(TAA|TAG|TGA)","\\1>--STOP--|",Seq)
```


## More Exercises

Here are some more exercises to practice your skils. No solutions are given for these, you will have to solve them on your own. Note that they may find their way onto a test, assignment or quiz.

### 1. Consider a vector of email addresses scraped from the internet:

  * robert 'dot' colautti 'at' queensu 'dot' ca
  * chris.eckert[at]queensu.ca
  * lonnie.aarssen at queensu.ca

Use regular expressions to convert all email addresses to the standard format: name@queensu.ca

### 2. Create a random sequence of DNA:

```{r,results="hide"}
My.Seq<-sample(c("A","T","G","C"),1000,replace=T)
```

  * Replace T with U
  * Find all start codons (AUG) and stop codons (UAA, UAG, UGA)
  * Find all open reading frames (hint: consider each sequence beginning with AUG and ending with a stop codon; how do you know if both sequences are in the same reading frame?)
  * Count the length of bp for all open reading frames

### 3. More online examples

[http://regex.sketchengine.co.uk/extra_regexps.html](http://regex.sketchengine.co.uk/extra_regexps.html)

### 4. Regex Golf

Have fun! [LINK](https://alf.nu/RegexGolf)