10-blast.Rmd

<!-- Maybe switch to the paired end reads for next year (no enrichment)/ -->

# Ancient Pandemic

Scientists recently obtained and sequence DNA from 3 individuals who died about 4,000 years ago. They do this by drilling into the teeth and obtain a sample of dental plaque. The plaque is known to preserve DNA from pathogens. They started off with 34 individuals (30 from the southern site and 4 from the northern site, both described below) and then focused on pathogens recovered from 3 of them.

They focused on two indiviuals from a mass burial site in Charterhouse Warren, Somerset (southern England). This type of mass burial was unusual in the region and, therefore, might represent an event where a lot of people died, such as an epidemic or a war. Signs of fatal trauma were seen in some individuals buried at the site suggesting that the people in this gravesite were killed in all together. Both sequences were from children around ages 10 and 12 (plus or minus 3 years). They used radiocarbon dating of the mandible bone in one individual, dating the grave to ~4000 years ago.

A third individual was buried under a ring cairn monument in Levens, Cumbria (northern England). This person was buried on their own. This individual has also been dated to ~4000 years ago. This individual is a female who is 35-45 years and has pottery shards alongside her as well as planks, which might have been a wooden coffin that she was crouched in.

We'll start with data from one of the individuals at the southern site (C10098). The scientists sequenced all the DNA in the sample, then they enriched for the pathogen DNA and sequenced again. We'll use the enriched sample.


Log onto logrus, enter a screen, create a directory called ancientpathogen in your home directory and navigate into it.

<details>
  <summary>Click for Answer</summary>
```
mkdir ~/ancientpathogen
cd ~/ancientpathogen
```
</details>
\

Link to the data. You can find it in /home/data/nise/ancientpathogen/*q.gz. This is single end Illumina data. What does that mean?

<!-- The paper says they should be paired end but the data they uploaded to the archive is single end. -->

<details>
  <summary>Click for Answer</summary>
```
ln -s /home/data/nise/ancientpathogen/*q.gz .
```
</details>
\

Take a look at the filenames. Look at the file with ERR11262426 in the filename. This is a sample called C10098 from southern England. Note what the headers start with. Go back to the linux section if you need a refresher on fastq format.

<details>
  <summary>Click for Answer</summary>
```
ls -l

zcat ERR11262426.fastq.gz | head
```
</details>
\

How many reads are there? 


<details>
  <summary>Click for Answer</summary>
```
zgrep -c "^@ERR" ERR11262426.fastq.gz
```
</details>
\


Let's run FastQC, a tool that runs quality control metrics on fastq files. We can leave the fastq files zipped.

activate the fastqc-env environment.

```
conda activate fastqc-env
```

Make a directory called "fastqc" in your current directory. This will be our output directory.

<details>
  <summary>Click for Answer</summary>
```
mkdir fastqc
```
</details>
\


Now run FastQC.

```
fastqc -o fastqc ERR11262426.fastq.gz
```

Copy the html and zip files over to your desktop, then unzip the .zip file and open up the html file. Go through it as a group.


Let's look to see if there are any sequences from bacterial species included among the human sequences. I grabbed all of the complete bacterial genomes from Genbank as of July 21, 2023. We'll align our sequences to them using minimap2 (you could also use blast). Soft link to the fasta file with the bacterial genomes in it. It is in /home/data/nise/GbBac.fasta

<details>
  <summary>Click for Answer</summary>
```
ln -s /home/data/nise/GbBac.fasta .
```
</details>
\

Create a database of that file (note that minimap2 can create a database on the fly but it is faster to do it beforehand, especially if you will use it for more than one alignment). We will name the reference database bacteria.mmi. We'll use the short read preset (-x sr) since our fastq reads are short illumina reads.

```
minimap2 -x sr -d bacteria.mmi GbBac.fasta
```

We don't need to align all of the reads to the bacterial genomes, just a subset to get a sense for which bacterial reads are present. See if you can figure out how to get the first 1000 sequence reads from our fastq file.

<details>
  <summary>Click for Answer</summary>
```
zcat ERR11262438.fastq.gz |head -4000 > ERR11262438.1000.fastq
```
</details>
\

Run the alignment.

```
minimap2 -x sr -o ERR112624261000xbacteria.paf bacteria.mmi ERR11262426.1000.fastq
```


We can quickly figure out what the pathogen is by seeing which species was hit most often. There will be lots of species in our hit list, some present in very low quantities in the DNA sample and some being false positives because something from another species aligned to it incorrectly. But the true pathogen should have the most hits.

Hint: You can get the genus and species using the cut command and grabbing the 2nd and 3rd columns separated by underscores. Then you will want to count the instances of each genus_species then sort by the count (column 1).

<details>
  <summary>Click for Answer</summary>
```
cut -f 2-3 -d '_' ERR112624261000xbacteria.paf | sort |uniq -c | sort -n
```
</details>
\

What pathogen likely killed this person and what disease does it cause?

<details>
  <summary>Click for Answer</summary>
```
Yersinia pestis, which causes the plague.
```
</details>
\

Now see if you can do the same analysis for sample ERR11262438. This is sample C10928, the one from Northern England.

What pathogen likely killed this person and what disease does it cause?

<details>
  <summary>Click for Answer</summary>
```
Yersinia pestis, which causes the plague.
```
</details>
\


Citation (There is a spoiler here so don't open it until you have done the exercise.)

<details>
<summary>Click for Answer</summary>
```
Swali, P., Schulting, R., Gilardet, A. et al. Yersinia pestis genomes reveal plague in Britain 4000 years ago. Nat Commun 14, 2930 (2023). https://doi.org/10.1038/s41467-023-38393-w
```
</details>