-
-
Notifications
You must be signed in to change notification settings - Fork 26
/
Copy path04_regex.Rmd
630 lines (387 loc) · 21.6 KB
/
04_regex.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
\newpage
# Regular Expressions
```{r, echo=F}
library(knitr)
opts_chunk$set(tidy.opts=list(width.cutoff=60),tidy=TRUE)
options(width = 60)
```
## Overview
**Regular Expressions**, also known as **regex** and **regexp** is a syntax that allows for coders to run complex find-and-replace functions. I didn't learn regular expressions until I was a postdoc working at Duke University, but I wish I had learned this a lot earlier! This remains one of the most useful programming tools I have ever used. It is absolutely essential for working with any kind of large text files or large datasets.
A lot of programming tools in biology use input text files that require very specific formatting (e.g. .txt, .csv, .fasta, .nex). Sometimes, you might need to reorganize or recode data in a large text file. This can be a big pain that can lead to errors if you try to do everything manually. But regular expressions can automate the process.
Here's one example. As a PhD student I co-founded a project called the **Global Garlic Mustard Field Survey (GGMFS)** with collaborator Dr. Oliver Bossdorf at the University of Tuebingen -- yes the same Dr. Bossdorf mentioned in the [qplot Tutorial](https://colauttilab.github.io/RCrashCourse/2_qplot.html). We were fortunate to have over 100 collaborators across Europe and North America who helped to collect samples for the project. Details of the project were published in the Journal **Neobiota**: https://neobiota.pensoft.net/article/1270/ but one BIG problem is the way that each of these 100+ collaborators entered their data online. For example, latitudes and longitudes were entered in a variety of different formats. Regular expressions allowed me to write a small program to automatically convert all of these different formats to a common, decimal format that we could use for the analysis. This saved a huge amount of time and prevented errors that could have been introduced if we tried to edit these values by hand.
Often when you work with large datasets, you will need to automate some of your error correction, and regular expressions can be a big help here. For example, imagine a simple field where people were simply asked a simple yes or no question. You might find a variety of inputs such as: "YES, Y, yes, and Yes". These all mean the same thing but if you try to analyze it, R will treat these as different categories. Here again, regular expressions can be used to quickly change all the different examples to a common "Y" or even a boolean variable `TRUE`.
One final example, is pattern searching. This is common for the analysis of DNA, proteins or other large strings of data. You may want to find a particular sequence of data, possibly with a few variable sites: e.g. TCTA or TCAA or TCGA. This is another area where regular expressions can help.
### Universal
Regular expressions are a universal language that extends to many other programming languages, including **C/C#/C++**, **Python**, **Unix/Linux**, and **Perl**. We focus here on R but most of the syntax is mantained across programming languages.
> WARNING!
There is a very steep learning curve here, and the only way to really learn this is to drown yourself in examples. There are lots of exercises you can do for practice online. You should also try to apply these whenever you can.
## Functions
There are four main functions that use regular expressions in R.
`grep()` and `grepl()` are equivalent to 'find' in your favorite word processor
They have the general form: `gsub("find this pattern", in.this.object)`
`grep()` outputs a vector with all the address locations (i.e. numbers) that match. Thus the output length is equal to the number of matches.
`grepl()` outputs a vector of `TRUE` (match) and `FALSE` (no match). Thus, the output length is equal to the length of the input object.
`sub()` and `gsub()` are equivalent to 'find and replace'
They have the general form: `grep("find this pattern", "and replace with this", in.this.object)`
`sub()` replaces only the first match, whereas `gsub()` replaces all of the matches.
Some specific examples are provided below to help you understand these similarities and differences.
There are two other more advanced functions in R. These aren't covered in this tutorial, but may be of use once you are more comortable with the above functions.
`regexpr()` provides more detailed info about the first match
`gregexpr()` provides more detailed results about all matches
> See ?regexpr and ?gregexpr for more info
### Examples
Some examples can help to understand the differences among the four main functions. Let's start with a simple data frame of species names.
```{r}
Species<-c("petiolata", "verticillatus", "salicaria", "minor")
print(Species)
```
### `grep()`
This returns cell addresses matching query. Note the vector length compared to the input vector.
```{r}
grep("a",Species)
```
We can also see the specific values with the `value=T` parameter
```{r}
grep("a",Species, value=T)
```
### `grepl()`
This returns a vector of `TRUE` (match) and `FALSE` (no match). Compare this output with `grep()`.
```{r}
grepl("a",Species)
```
### `sub()`
This replaces the first match (in each cell)
```{r}
sub("l","L",Species)
```
### `gsub()`
This replaces all matches (in each cell). Compare this output to `sub()`.
```{r}
gsub("l","L",Species)
```
## Wildcards
### `\` escape character
The backslash is a special character. It's called the 'escape' character because it is used to 'escapes' the interpretation of the next character that occurs after it. This is easier to understand by example, as shown below.
### `\\` in R
In the introduction, we discussed the universality of **regular expressions** in the sense that a similar syntax is used by many different programming langagues. But now here is one exception. In R, the double-escape is usually needed, whereas other programming languages typically use just one. The reason is a bit meta -- it's because we are running regular expressions within R object. So the first `\` is used to escape special characters in R, applying it to the second `\`, which is itself the special character that needs to be escaped to pass through the function. The second slash is followed by the 'escaped' character. Some examples are provided below.
If that isn't clear. Just remember that you need two backslashes instead of one.
### `\\w`
Instead of finding the letter `w`, the `\\w` is a **wildcard** character that represents any letter or digit. It also includes underscore `_` for some reason.
```{r}
gsub("w","X","...which 1-100 words get replaced?")
gsub("\\w","X","...which 1-100 words get replaced?")
```
### `\\W`
This is the inverse of `\\w` find a character that is NOT a letter or number.
```{r}
gsub("\\W","X","...which 1-100 words get replaced?")
```
### `\\s`
This represents a space
```{r}
gsub("\\s","X","...which 1-100 words get replaced?")
```
### `\\t`
This is a tab character. A lot of data files stored as text are tab-delimited (.tsv) as well as comma-delimited (.csv)
```{r}
gsub("\\t","X","...which 1-100 \t words get replaced?")
```
> Remember \t is a tab character:
```{r}
cat("A\t\t\tB C")
```
### `\\d`
Digit characters
```{r}
gsub("\\d","X","...which 1-100 words get replaced?")
```
### `\\D`
Non-digit characters
```{r}
gsub("\\D","X","...which 1-100 words get replaced?")
```
## New Lines
There are two special characters that indicate new lines in a text file.
### `\\r`
This is the 'carriage return' special character
### `\\n`
This is the 'newline' special character
### Big Problem
One or two of these is/are generated when you press the 'enter' key while writing a text file. These also add a source of headache and confusion when working with text files because:
> Macs/Unix and PC/Windows use different standards!
Unix/Mac files -- lines usually end with `\\n` only
Windows/DOS files -- lines usually end with `\\r\\n`
This can cause problems when moving text files between Windows/DOS and Mac/Unix machines, particularly with older operating systems or when working on remote computers that use very basic Linux software.
## Special characters
In addition to special characters that use the escape `\\`, there are a number of other special characters that don't use the escape, but have a special meaning.
Note that if you want to search for the characters below you would have to use the escape character. E.g. `\\.` if you wanted to search for a period `.`.
### `|`
This is sometimes called the **pipe** character, and it simply means 'or'. For example, we can search for `w or e`.
```{r}
gsub("w|e","X","...which 1-100 words get replaced?")
```
### `.`
The period is a wild card that means 'anything'. This includes all of the `\\w` characters but also other characters like puncutation marks.
```{r}
gsub(".","X","...which 1-100 words get replaced?")
```
So how to search for a period `.`? As noted above, we have to use the escape character
```{r}
gsub("\\.","X","...which 1-100 words get replaced?")
```
### `*`, `?`, `+`, `{}`
These special characters refer to details about the kind of search that we are trying to conduct. Look at these examples carefully, and remember that `sub` replaces the first match while `gsub` replaces all of the matches.
```{r}
sub("\\w","X","...which 1-100 words get replaced?")
gsub("\\w","X","...which 1-100 words get replaced?")
```
Now let's apply some of these special characters to see how they work.
#### `+`
Finds 'one or more' matches (i.e. at least one match)
```{r}
sub("\\w+","X","...which 1-100 words get replaced?")
gsub("\\w+","X","...which 1-100 words get replaced?")
```
Compare this match to the one above. Notice how we have replaced groups of letters instead of single letters. The algorithm works like this:
1. Start at the left and move to the right, one character at a time
2. Check if the character is a letter or number (`\\w`).
3. If NO, move to the next character
4. If YES, check the next character. If it is also a `\\w` then go to the next character. Repeat until the next character is not `\\w`, and replace the entire string of characters.
When run in the `sub` command, it does the above and then stops. When run with the `gsub` command, it continues to the next character, and then starts over.
### `*`
This is a 'greedy' search (match the largest possible)
```{r}
sub("\\w*","X","...which 1-100 words get replaced?")
gsub("\\w*","X","...which 1-100 words get replaced?")
```
In the `sub` command, it detects a `.` as the first character, indicating no match. It replaces the 'null' or `0` match at the beginning, which has the effect of adding a character. In the `gsub` command it repeats this before each `.` until it finds the letter `w`. Then it finds a group of `\\w` matches, replacing with a single `X`. Then a space, which is skipped, then a `-`, which is another null match, prompting another insert.
### `?`
This means 'match zero or one time'
```{r}
sub("\\w?","X","...which 1-100 words get replaced?")
gsub("\\w?","X","...which 1-100 words get replaced?")
```
Compare this to the `*` above. It behaves in a similar way, except it is not 'greedy' -- in the second example, each letter is replaced instead of entire words.
### `+?`
This is the 'lazy' version of `+` -- note in particular the difference in `sub` which replaces on the the first letter here but the whole word when `+` is used alone. In the `gsub` example we end up replacing every letter instead of whole words. Remember, `sub` runs the algorithm once and then stops, while `gsub` repeats until it reaches the end of the line.
```{r}
sub("\\w+?","X","...which 1-100 words get replaced?")
gsub("\\w+?","X","...which 1-100 words get replaced?")
```
### `*?`
Similarly, we can combine these characters for the 'lazy' version of `*`
```{r}
sub("\\w*?","X","...which 1-100 words get replaced?")
gsub("\\w*?","X","...which 1-100 words get replaced?")
```
> Try using `+*`. Why do you get an error message?
### `{}`
Curly brackets are used to specify a number of matches, expanding on the options even further.
### `{n,m}`
Find between $n$ to $m$ matches
```{r}
gsub("\\w{3,4}","X","...which 1-100 words get replaced?")
```
### `{n}`
Find exactly $n$ matches
```{r}
gsub("\\w{3}","X","...which 1-100 words get replaced?")
```
### `{n,}`
Find $n$ or more matches
```{r}
gsub("\\w{4,}","X","...which 1-100 words get replaced?")
```
### {}?
As above, we can use `?` for the 'lazy' versions of these searches
```{r}
gsub("\\w{4,}?","X","...which 1-100 words get replaced?")
```
## `[]`
Square brackets allow us to find 'any' of the values listed within them. We can also use the dash `-` to specify a range of numbers or letters.
```{r}
gsub("[aceihw-z]","X","...which 1-100 words get replaced?")
```
In the above example, we search for 1 of any of the listed letters: a, c, e, i h, w, x, y, z. Note that x and y are included in the `x-z` statement.
What if we want to find 1 or more of these characters in a row?
```{r}
gsub("[aceihw-z]+","X","...which 1-100 words get replaced?")
```
## `^` and `$`
Use these characters to specify searches at the start `^` or end `$` of the input string.
### `^`
Find species starting with the letter *a*
```{r}
grep("^a",Species)
```
> IMPORTANT: `^` Also means 'negate' when used within `[]`
Find species containing any letter other than *a*
```{r}
grep("[^a]",Species)
```
Replace every letter except *a*
```{r}
gsub("[^a]","X",Species)
```
### `$`
Find species ending with *a*
```{r}
grep("a$",Species)
```
## `()`
Regular parentheses are used to 'capture' text, which can then be specified in the replacement string using `\\1`. Or you can capture multiple pieces of text and reorganize them by using the corresponding number -- `\\1` for the first set of`()`, `\\2` for the second set of `()`, etc. Some examples should help.
Replace each word with its first letter
```{r}
gsub("(\\w)\\w+","\\1",
"...which 1-100 words get replaced?")
```
Pull out only the numbers and reverse their order
```{r}
gsub(".*([0-9]+)-([0-9]+).*","\\2-\\1",
"...which 1-100 words get replaced?")
```
Reverse first two letters of each word
```{r}
gsub("(\\w)(\\w)(\\w+)","\\2\\1\\3",
"...which 1-100 words get replaced?")
```
## Scraping
Scraping is a method for collecting data from online sources. In R, we can use the functions `readLines` and `curl()`, both from the `curl` library, to 'scrape' data from websites. Websites with the `.html` extension are a special kind of text file.
We can use regular expressions to pull out text from the website. Here's an example where we will scrape a record for the Green Fluorescent Protein (GFP) from the Protein Data Bank (PDB). Note that this is a file with the extension `.pdb` but this is a human-readable text file that can be opened in any text editor
First, we'll import the text into an R object.
```{r}
library(curl)
```
You will have to use `install.packages("curl")` to download this package to your computer. You only need to do this once but you will have to use `library(curl)` whenever you want to use the commands, as explained in the [R Fundamentals Tutorial](https://colauttilab.github.io/RCrashCourse/1_fundamentals.html)
Now we can download a file to play with.
``` {r}
Prot<-readLines(curl("http://www.rcsb.org/pdb/files/1ema.pdb"))
```
> HINT: Download this link to your computer and open with a text file.
This hint is a simple trick to understand what kind of file(s) you are working with.
This is a tab-delimited file, which we could import as a data frame using `read.delim` but we'll keep it this way to see how we can use regular expressions.
The `Prot` object we have made is a simple vector of strings, with each cell corresponding to a different row of text:
```{r}
length(Prot)
Prot[grep("TITLE",Prot)]
```
We can pull out the amino acid sequences, which are rows that start with the word 'ATOM'
```{r}
AAseq<-Prot[grep("^ATOM",Prot)]
length(AAseq)
AAseq[1]
```
> Try to isolate the 3-letter amino acid code
There are lots of possibilities. Take the time to try a few.
Here's one good option, since we know it's a tab-delimited file with the amino acid in the 4th column:
```{r}
gsub("ATOM\\t\\w+\\t\\w+\\t(\\w+).*","\\1",AAseq[1])
```
That didn't work. Sometimes the 'tabs' are actually just multiple 'spaces'
```{r}
AAchain<-gsub("ATOM\\s+\\w+\\s+\\w+\\s+(\\w+).*","\\1",AAseq)
AAchain[1:100]
```
## Examples
Let's try practicing with a couple of examples
## Transect Data
Regular expressions are also useful with data objects
Imagine you have a repeated measures design. 3 transects (A-C) and 3 positions along each transect (1-3)
```{r}
Transect<-data.frame(Species=letters[1:20],
A1=rnorm(20), A2=rnorm(20), A3=rnorm(20),
B1=rnorm(20), B2=rnorm(20) ,B3=rnorm(20),
C1=rnorm(20), C2=rnorm(20), C3=rnorm(20))
head(Transect)
```
> TIP: the object `letters` contains lower-case letter, while `LETTERS` contains upper case.
### Challenge
Use your knowledged gained above with substting data outlined in the [R Fundamentals Tutorial](https://colauttilab.github.io/RCrashCourse/1_fundamentals.html) to do the following:
1. Subset only the columns that have an "A" in their name
2. Subset the data for species "D"
Take the time to do this on your own. It will take you a while and you will make a lot of mistakes. That's all part of the learning process. The longer you struggle, the faster you will learn.
Now here is a more challenging example:
## Genbank
Here is a line of code to import DNA from genbank. (The one line is broken up into three physical lines to make it easier to read)
```{r}
Lythrum_18S<-scan(
"https://colauttilab.github.io/RCrashCourse/sequence.gb",
what="character",sep="\n")
```
This is the sequence of the 18S subunit from the ribosome gene of *Lythrum salicaria* (from Genbank)
```{r}
print(Lythrum_18S)
```
Notice that each line is read in as a separate cell in a vector, with sequences beginning with a number ending with 1. We can take advantage of this to extract just the sequence data
### Challenge
**Before proceeding** try to do the following:
1. Isolate only the rows containing DNA sequences. This should include
a. Removing all of the characters that are not a, t, g, or c.
b. Combining separate cells/lines into a single string. You can do this with using the `paste()` function with the `collapse=""` parameter
2. Convert lower-case to upper-case. To do this, you can use:
```{r, eval=F}
gsub("([actg])","\\U\\1",Seq,perl=T)
```
The `\\U\\\1` means 'paste brackets as upper-case, and is only available as a **Perl** command, which is accessible in gsub with the `perl=T` parameter.
3. Replace start codons (ATG) with "-->START-->ATG"
4. Replace stop codons (TAA or TAG or TGA) with TAA or TAG or TGA followed by ">--STOP--|"
Take the time to stuggle with this and try different combinations until you find a way through. The more you struggle, the faster you will learn.
A cool thing about regular expressions is that there is rarely a single right answer, especially for complicated problems. When you are ready, Continue on to see one possible solution.
## Solutions
## Transects
You want to look at only transect A for the first 3 species:
```{r}
Transect[1:3,grep("A",names(Transect))]
```
Subset the data for the first plot of each transect:
```{r}
Transect[grep("1",names(Transect)),]
```
## Genbank
Use `.*` with `()` to delete everything before the DNA sequence
```{r, include=F}
Seq<-gsub(".*(1 [gatc])","",Lythrum_18S)
paste(Seq,collapse="")
```
Use the `.*` and space with `+` to eliminate all text before the sequence :
```{r, include=F}
Seq<-gsub(".*ORIGIN +","",paste(Seq,collapse=""))
Seq
```
Elimminate spaces and the two `//` at the end
```{r, include=F}
Seq<-gsub(" |//","",Seq)
Seq
```
Capital letters look nicer, but requires a PERL qualifier `\\U` that is not standard in R
```{r, include=F}
Seq<-gsub("([actg])","\\U\\1",Seq,perl=T)
print(Seq)
```
Look for start codons?
```{r, include=F}
gsub("ATG","-->START-->ATG",Seq)
```
Look for stop codons?
```{r, include=F}
gsub("(TAA|TAG|TGA)","\\1>--STOP--|",Seq)
```
## More Exercises
Here are some more exercises to practice your skils. No solutions are given for these, you will have to solve them on your own. Note that they may find their way onto a test, assignment or quiz.
### 1. Consider a vector of email addresses scraped from the internet:
* robert 'dot' colautti 'at' queensu 'dot' ca
* chris.eckert[at]queensu.ca
* lonnie.aarssen at queensu.ca
Use regular expressions to convert all email addresses to the standard format: [email protected]
### 2. Create a random sequence of DNA:
```{r,results="hide"}
My.Seq<-sample(c("A","T","G","C"),1000,replace=T)
```
* Replace T with U
* Find all start codons (AUG) and stop codons (UAA, UAG, UGA)
* Find all open reading frames (hint: consider each sequence beginning with AUG and ending with a stop codon; how do you know if both sequences are in the same reading frame?)
* Count the length of bp for all open reading frames
### 3. More online examples
[http://regex.sketchengine.co.uk/extra_regexps.html](http://regex.sketchengine.co.uk/extra_regexps.html)
### 4. Regex Golf
Have fun! [LINK](https://alf.nu/RegexGolf)