
Commit

Add files via upload
RamiyapriyaS authored Dec 23, 2024
1 parent 0989c2e commit 609eff7
Showing 1 changed file with 61 additions and 44 deletions.
105 changes: 61 additions & 44 deletions notebooks/AWS-ParallelCluster.ipynb
@@ -34,8 +34,8 @@
"\n",
"Please follow the installation instructions for the ParallelCluster UI provided [here](https://github.com/STRIDES/NIHCloudLabAWS/blob/main/docs/Install_AWSParallelCluster.md). These instructions will guide you through the necessary steps to create a CloudFormation Stack through which you can access the AWS ParallelCluster UI. \n",
"\n",
"Additionally, we urge you to check out the documents within the `docs/` folder of the repository for more bioinformatics and Gen AI tutorials.",
"\n",
"Additionally, we urge you to check out the documents within the `docs/` folder of the repository for more bioinformatics and Gen AI tutorials.\n",
"\n",
"Once you have created the CloudFormation Stack for the PCUI, navigate to the user interface URL. It will look like this:"
]
},
@@ -63,13 +63,13 @@
"metadata": {},
"source": [
"### Create a Cluster \n",
"Let's create a cluster within the ParallelCluster environment.",
"Let's create a cluster within the ParallelCluster environment.\n",
"\n",
"![create-cluster.png](attachment:create-cluster.png)\n",
"\n",
"1. In the PCUI Clusters view, choose **Create cluster** > **Step by step**.\n",
"2. In **Cluster**, under **Name**, enter a name for your cluster.\n",
"3. Choose a **VPC** with a public subnet for your cluster, and choose Next.\n",
"3. Choose a **VPC** from the available options, then choose **Next**. CloudLab users will have access to pre-configured VPC networks.\n",
"4. In **Head node**, choose Add **SSM session**. This will allow you to access the head node through the **`Shell`** button. Change the instance type of your head node to **t2.xlarge**. \n",
"5. In **Queues**, provide a name and subnet for your queue.\n",
"6. In **Compute resources**, choose 1 for **Static nodes** and select **c5n.large** as the instance type for your compute resources. \n",
@@ -110,7 +110,7 @@
" AdditionalIamPolicies:\n",
" - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore\n",
" Ssh:\n",
" KeyName: snakemake-cluster-key-pair\n",
" KeyName: Snakemake-cluster-key-pair\n",
"Scheduling:\n",
" Scheduler: slurm\n",
" SlurmQueues:\n",
@@ -171,7 +171,7 @@
"id": "73692fbe",
"metadata": {},
"source": [
"3. Install conda. We will be executing snakemake using conda. "
"3. Install conda. We will be executing Snakemake using conda. "
]
},
{
@@ -196,7 +196,7 @@
"source": [
"4. Install Snakemake and the Snakemake ParallelCluster plugin. \n",
"\n",
"Note: the PCluster plugin requires snakemake > 8.0.0"
"Note: the PCluster plugin requires Snakemake > 8.0.0"
]
},
{
@@ -205,8 +205,8 @@
"metadata": {},
"source": [
"```bash\n",
"pip3 install snakemake==8.25.5\n",
"pip3 install snakemake-executor-plugin-pcluster-slurm\n",
"pip3 install snakemake==8.25.5\n",
"pip3 install snakemake-executor-plugin-pcluster-slurm\n",
"```"
]
},
@@ -215,7 +215,7 @@
"id": "5302b6ba",
"metadata": {},
"source": [
"Alternatively, you may use conda to install snakemake using the following command: "
"Alternatively, you may use conda to install Snakemake using the following command: "
]
},
{
@@ -224,7 +224,7 @@
"metadata": {},
"source": [
"```bash\n",
"conda install bioconda::snakemake==8.25.5\n",
"conda install bioconda::snakemake==8.25.5\n",
"```"
]
},
@@ -275,7 +275,7 @@
"id": "907420c9",
"metadata": {},
"source": [
"2. Submit the job using an sbatch command "
"2. Submit the job using an `sbatch` command "
]
},
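The submission step above can be sketched with a minimal job script. The job name, output pattern, and the echoed message below are illustrative assumptions, not the notebook's actual script:

```shell
# Write a minimal Slurm job script (contents are illustrative).
cat > hello.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --output=hello_%j.out
echo "Hello from $(hostname)"
EOF

# On the cluster head node you would submit and monitor it with:
#   sbatch hello.sbatch
#   squeue
```

`sbatch` prints the assigned job ID, and the job's stdout lands in `hello_<jobid>.out` on the head node's filesystem.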
{
@@ -308,7 +308,7 @@
"metadata": {},
"source": [
"```bash \n",
"mkdir hello-world-snakemake\n",
"mkdir hello-world-Snakemake\n",
"vim Snakefile\n",
"```"
]
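The `Snakefile` created above is not reproduced here in full; a minimal hello-world version might look like the following sketch (the rule names and output filename are assumptions):

```shell
# Create a minimal hello-world Snakefile (contents are illustrative).
cat > Snakefile <<'EOF'
rule all:
    input:
        "hello.txt"

rule hello:
    output:
        "hello.txt"
    shell:
        "echo 'Hello, world!' > {output}"
EOF
```

The target rule `all` lists the final outputs; Snakemake works backwards from it to decide which rules must run.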
@@ -339,7 +339,7 @@
"id": "4590ade3",
"metadata": {},
"source": [
"2. Execute the workflow using the snakemake command, specifying `pcluster-slurm` as the executor."
"2. Execute the workflow using the `snakemake` command, specifying `pcluster-slurm` as the executor."
]
},
{
@@ -348,7 +348,7 @@
"metadata": {},
"source": [
"```bash\n",
"snakemake --executor pcluster-slurm \n",
"snakemake --executor pcluster-slurm \n",
"```"
]
},
@@ -390,20 +390,27 @@
" \"output_2.txt\" \n",
"```\n",
"\n",
"**Shell Command:** The shell keyword is used to specify the shell command that will be executed to produce the output files.\n",
"**Shell Command:** \n",
"\n",
"The `shell` keyword is used to specify the shell command that will be executed to produce the output files.\n",
"\n",
"#### Command Breakdown: \n",
"\n",
"**snakemake:** Invokes the snakemake tool. This tool will look for a Snakefile in the current working directory \n",
"**--executor pcluster-slurm:** The flag enables the workflow to be executed through the slurm cluster connected to the head node"
"**snakemake:** \n",
"\n",
"Invokes the Snakemake tool, which looks for a `Snakefile` in the current working directory. \n",
"\n",
"**--executor pcluster-slurm:** \n",
"\n",
"This flag runs the workflow through the Slurm cluster connected to the head node."
]
},
{
"cell_type": "markdown",
"id": "fbf47ae7",
"metadata": {},
"source": [
"## Submitting a Bioinformatics Snakemake workflow to the Slurm cluster\n",
"## Submitting a bioinformatics Snakemake workflow to the Slurm cluster\n",
"\n",
"In this example, we will use Snakemake and the pcluster-slurm plugin to run a bioinformatics pipeline. \n",
"\n",
@@ -612,17 +619,18 @@
"* `SAMPLES = [\"A\", \"B\"]` defines the samples to be processed.\n",
"\n",
"**Workflow:**\n",
"* all: Specifies the final output files required to complete the workflow.\n",
"* **all:** Specifies the final output files required to complete the workflow.\n",
"* Bioinformatics rules \n",
" - bwa_index: Indexes the reference genome file (data/genome.fa) for alignment.\n",
" - bwa_map: Maps the sequencing reads (data/samples/{sample}.fastq) to the indexed genome and converts the output to BAM format.\n",
" - samtools_sort: Sorts the BAM files generated from the mapping step.\n",
" - samtools_index: Indexes the sorted BAM files for faster access.\n",
" - bcftools_call: Calls genetic variants from the sorted and indexed BAM files.\n",
" - plot_quals: Generates a plot of the quality of the called variants.\n",
"* The conda environment required for each rule is derived from the `conda_env` variable found within the config file\n",
"* Each rule uses shell commands to perform the required bioinformatics tasks (e.g., bwa index, bwa mem, samtools sort, samtools index, bcftools mpileup, bcftools call).\n",
"* The order in which each rule must be run, is defined from the input and output parameters. For example, as the `bcf_tools` rule requires a sorted and indexed bam file as the input, it will be executed after the `samtools_index` rule. \n",
" * **bwa_index:** Indexes the reference genome file (data/genome.fa) for alignment.\n",
" * **bwa_map:** Maps the sequencing reads (data/samples/{sample}.fastq) to the indexed genome and converts the output to BAM format.\n",
" * **samtools_sort:** Sorts the BAM files generated from the mapping step.\n",
" * **samtools_index:** Indexes the sorted BAM files for faster access.\n",
" * **bcftools_call:** Calls genetic variants from the sorted and indexed BAM files.\n",
" * **plot_quals:** Generates a plot of the quality of the called variants.\n",
" \n",
"* The **conda environment** required for each rule is derived from the `conda_env` variable found within the config file.\n",
"* Each rule uses **shell commands** to perform the required bioinformatics tasks (e.g., bwa index, bwa mem, samtools sort, samtools index, bcftools mpileup, bcftools call).\n",
"* The **order** in which the rules run is determined by their **input and output parameters**. For example, because the `bcftools_call` rule requires a sorted and indexed BAM file as input, it is executed after the `samtools_index` rule. \n",
"\n"
]
},
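As a small illustration of the ordering behavior described above, the sketch below writes a two-rule workflow in which the second rule consumes the first rule's output. The file and rule names are hypothetical; Snakemake would schedule `sort_it` after `make_it` purely from the input/output match:

```shell
# Two chained rules: 'sort_it' depends on 'make_it' via numbers.txt.
cat > Snakefile.demo <<'EOF'
rule all:
    input:
        "sorted.txt"

rule make_it:
    output:
        "numbers.txt"
    shell:
        "printf '3\n1\n2\n' > {output}"

rule sort_it:
    input:
        "numbers.txt"
    output:
        "sorted.txt"
    shell:
        "sort -n {input} > {output}"
EOF

# With snakemake installed, a dry run would show the planned order
# (make_it, then sort_it):
#   snakemake -n -s Snakefile.demo
```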
@@ -635,13 +643,32 @@
}
},
"source": [
"### Execute the workflow using the **snakemake** command, specifying **pcluster-slurm** as the executor and **conda** as the environment management system\n",
"### Execute the workflow \n",
"\n",
"Execute the workflow using the **snakemake** command, specifying **pcluster-slurm** as the executor and **conda** as the environment management system.\n",
"\n",
"\n",
"```bash\n",
"snakemake --executor pcluster-slurm --use-conda -j 5\n",
"snakemake --executor pcluster-slurm --use-conda -j 5\n",
"```\n",
"### Command Breakdown \n"
"#### Command Breakdown: \n",
"\n",
"**snakemake:** \n",
"\n",
"Invokes the Snakemake tool. \n",
"\n",
"**--executor pcluster-slurm:** \n",
"\n",
"Specifies `pcluster-slurm` as the executor plugin, so jobs are submitted to the cluster's Slurm scheduler. \n",
"\n",
"**--use-conda:** \n",
"\n",
"This flag tells Snakemake to use conda environments for managing dependencies. When this flag is used, Snakemake looks for the `environment.yaml` files specified in the workflow rules and creates conda environments accordingly. \n",
"\n",
"**-j:** \n",
"\n",
"This flag specifies the maximum number of jobs to run in parallel (here, 5).\n",
"\n"
]
},
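For `--use-conda`, each rule points at a conda environment file; a hypothetical per-rule file might look like this sketch (the directory name, channels, and tool versions are assumptions):

```shell
# Create a hypothetical per-rule conda environment file.
mkdir -p envs
cat > envs/mapping.yaml <<'EOF'
channels:
  - bioconda
  - conda-forge
dependencies:
  - bwa=0.7.17
  - samtools=1.17
EOF
```

A rule references it with a `conda: "envs/mapping.yaml"` directive, and Snakemake builds the environment the first time the rule runs.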
{
@@ -655,18 +682,8 @@
"\n",
"## References: \n",
"* [AWS ParallelCluster](https://docs.aws.amazon.com/parallelcluster/)\n",
"* [Snakemake Documentation](https://snakemake.readthedocs.io/en/stable/)\n",
"* [Snakemake `pcluster-slurm` plugin](https://snakemake.github.io/snakemake-plugin-catalog/plugins/executor/pcluster-slurm.html)\n"
]
},
{
"cell_type": "markdown",
"id": "cf514196",
"metadata": {},
"source": [
"Graphics \n",
"- parallelcluster graphic \n",
"- snakemake files graphic"
"* [Snakemake Documentation](https://snakemake.readthedocs.io/en/stable/)\n",
"* [Snakemake `pcluster-slurm` plugin](https://snakemake.github.io/snakemake-plugin-catalog/plugins/executor/pcluster-slurm.html)\n"
]
}
],
