Snakemake Tutorial

Outline

Starting point: we have fastq files that have been run through Kraken to identify the taxonomy of the reads.

The output files from Kraken are found in the Kraken folder.

  1. Run our fastq files through fastqc.
  2. Filter our fastq files to retain only sequences that belong to Cryptosporidium.
  3. Rerun fastqc to see how removing “contamination” affects the quality of our files.

I like to use Atom as my text editor when writing snakemake workflows, since it provides out-of-the-box syntax highlighting for Snakemake.

How Snakemake Works

Snakemake documentation can be found at its readthedocs website.

The basics of snakemake:

  • Snakemake is a rule-based workflow management system.
    • Rules generally specify:
      • Inputs: Files that the rule operates on (i.e. dependencies)
      • Outputs: Files that the rule creates
      • An action: Some command to run. This can be either:
        • BASH commands
        • Python script (*.py)
        • Inline python code
        • R script (*.R)
        • R markdown file (*.Rmd)
      • Inputs and outputs are used by snakemake to determine the order in which rules run. If rule B has an input that is produced as an output of rule A, then rule B will run after rule A.
      • To determine whether output files have to be re-created, Snakemake checks whether the file modification date (i.e. the timestamp) of any input file of the job is newer than the timestamp of the output file.
        • This can be overridden by marking an input file with the ancient() function. An example of ignoring timestamps is found here.
      • All the arguments that can be used in a rule can be found here.
    • There can be multiple inputs/outputs in rules
      • Inputs/outputs can be named (using the = syntax), or just listed in order.
      • These files can be referred to in the shell command (or python/R scripts)
      • You can refer to the inputs/outputs of other rules like this: rules.rule_name.output (see the sketch at the end of this list)
  • Snakemake will fill in wildcards based on what it finds in the output first. We will see an example of this later.
  • For relative paths, don’t use ./PATH; just write PATH.
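
To make the rule ordering and the rules.rule_name.output syntax above concrete, here is a minimal two-rule sketch (the file names and commands are made up for illustration):

rule write_message:
	output:
		'data/message.txt'
	shell:
		'echo hello > {output}'

rule copy_message:
	input:
		rules.write_message.output  # same as writing 'data/message.txt'
	output:
		'data/message_copy.txt'
	shell:
		'cp {input} {output}'

Because copy_message takes its input from write_message’s output, snakemake will always run write_message first. Wrapping that input in ancient(...) would make snakemake ignore its timestamp when deciding whether copy_message needs to be re-run.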

Running Snakemake

To run snakemake you will make a file called Snakefile:

rule fastqc:
	input:
		file = "files/Sample_123_L001_R1.fastq"
	output:
		"fastqc/Sample_123_L001_R1_fastqc.zip",
		"fastqc/Sample_123_L001_R1_fastqc.html"
	shell:
		'''
		mkdir -p fastqc
		module load fastqc/0.11.5
		fastqc files/Sample_123_L001_R1.fastq -o fastqc
		'''

Note here that we are using a path relative to the location of our Snakefile.

We can run snakemake by requesting the file we want. For this to run you will also need to tell snakemake how many cores to use.

snakemake fastqc/Sample_123_L001_R1_fastqc.html --cores 1  

Or just tell snakemake to run and it will go through the entire pipeline, which is only one rule right now, so we get the same output.

snakemake --cores 1

You should see the output of fastqc at this point, something like this…

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       fastqc
        1

[Thu Apr  1 14:42:26 2021]
rule fastqc:
    input: files/Sample_123_L001_R1.fastq
    output: fastqc/Sample_123_L001_R1_fastqc.zip, fastqc/Sample_123_L001_R1_fastqc.html
    jobid: 0

Started analysis of Sample_123_L001_R1.fastq
Approx 5% complete for Sample_123_L001_R1.fastq
...
Approx 95% complete for Sample_123_L001_R1.fastq
Analysis complete for Sample_123_L001_R1.fastq
[Thu Apr  1 14:43:30 2021]
Finished job 0.
1 of 1 steps (100%) done

If you have already run snakemake, it will just tell you there is nothing to do and you will see this output:

Building DAG of jobs...
Nothing to be done.
Complete log: $PATH/.snakemake/log/2021-02-17T144540.917872.snakemake.log

Thanks snakemake! You can keep running this and nothing “bad” will happen. How refreshing…

We can update some of this code to make it a bit more universal. This will become more useful as we go on.

rule fastqc:
	input:
		file = "files/Sample_123_L001_R1.fastq"
	params:
		outdir = 'fastqc'
	output:
		"fastqc/Sample_123_L001_R1_fastqc.zip",
		"fastqc/Sample_123_L001_R1_fastqc.html"
	shell:
		'''
		mkdir -p {params.outdir}
		module load fastqc/0.11.5
		fastqc {input.file} -o {params.outdir}
		'''

Scaling Our Pipeline with Wildcards

We can scale this pipeline by using wildcards, written in curly brackets, rather than making one rule per sample.

rule fastqc:
	input:
		file = 'files/{sample}_R1.fastq'
	params:
		outdir = 'fastqc'
	output:
		"fastqc/{sample}_R1_fastqc.zip",
		"fastqc/{sample}_R1_fastqc.html"
	shell:
		'''
		mkdir -p {params.outdir}
		module load fastqc/0.11.5
		fastqc {input.file} -o {params.outdir}
		'''

However, if we have already run snakemake with snakemake fastqc/Sample_123_L001_R1_fastqc.html --cores 1, it will tell us there is nothing to be done, because it fills in the wildcards based on the requested output files, and both fastqc/Sample_123_L001_R1_fastqc.zip and fastqc/Sample_123_L001_R1_fastqc.html already exist. However, if we delete these files and ask for the rule by name with snakemake fastqc --cores 1, we will get the following error.

WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.

This same error would have happened if you had run snakemake --cores 1, because snakemake doesn’t know which file you want or whether you want all of them. We can either ask for just one file, e.g. snakemake fastqc/Sample_456_L001_R1_fastqc.zip --cores 1, or create a new rule called rule all: that lets us tell snakemake we want all the files (or whichever ones we want). By convention, rule all goes first in the Snakefile so it is easier to keep track of.

rule all:
	input:
		'fastqc/Sample_123_L001_R1_fastqc.zip',
		'fastqc/Sample_456_L001_R1_fastqc.zip',
		'fastqc/Sample_123_L001_R1_fastqc.html',
		'fastqc/Sample_456_L001_R1_fastqc.html'

Cool, that’s a bit better, but it’s still a lot of typing and we don’t like that. We can use Python syntax and the snakemake expand() function to make the list of file names we want.

SAMPLES = ['Sample_123_L001', 'Sample_456_L001']

rule all:
	input:
		expand('fastqc/{sample}_R1_fastqc.zip', sample=SAMPLES),
		expand('fastqc/{sample}_R1_fastqc.html', sample=SAMPLES)

But again, this is still a lot of typing. This can be cleaned up more by using snakemake’s glob_wildcards() to generate the file names for us. We give it a path and file name pattern, and it collects the wildcard portion of each matching file name.

SAMPLES, = glob_wildcards('files/{sample}_R1.fastq')

rule all:
	input:
		expand('fastqc/{sample}_R1_fastqc.zip', sample=SAMPLES),
		expand('fastqc/{sample}_R1_fastqc.html', sample=SAMPLES)

Whoops, this has only generated file names for the R1 files and not the R2 files. We can add an additional wildcard to collect this information as well.

SAMPLES, READS, = glob_wildcards('files/{sample}_R{read}.fastq')

rule all:
	input:
		expand('fastqc/{sample}_R{read}_fastqc.zip', sample=SAMPLES, read=READS),
		expand('fastqc/{sample}_R{read}_fastqc.html', sample=SAMPLES, read=READS)

rule fastqc:
	input:
		file = 'files/{sample}_R{read}.fastq'
	params:
		outdir = 'fastqc'
	output:
		"fastqc/{sample}_R{read}_fastqc.zip",
		"fastqc/{sample}_R{read}_fastqc.html"
	shell:
		'''
		mkdir -p {params.outdir}
		module load fastqc/0.11.5
		fastqc {input.file} -o {params.outdir}
		'''

So now our DAG is:

[figure: DAG of the fastqc jobs]

You could also explicitly state what the wildcard values will be instead:

SAMPLES, = glob_wildcards('files/{sample}_R1.fastq')

rule all:
	input:
		expand('fastqc/{sample}_R{read}_fastqc.zip', sample=SAMPLES, read=["1", "2"]),
		expand('fastqc/{sample}_R{read}_fastqc.html', sample=SAMPLES, read=["1", "2"])

By default the expand() function uses itertools.product to create every combination of the supplied wildcards. expand() takes an optional second positional argument that customizes how the wildcards are combined. The way we currently wrote the expand() calls would give duplicates, so we add zip to clean this up (see the sketch below).
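
As a quick sketch of the difference (the sample names are made up; expand() can be tried interactively by importing it from snakemake.io):

from snakemake.io import expand

# Default: itertools.product over all wildcard values (4 combinations).
expand('fastqc/{sample}_R{read}_fastqc.zip',
	sample=['Sample_123_L001', 'Sample_456_L001'], read=['1', '2'])
# ['fastqc/Sample_123_L001_R1_fastqc.zip', 'fastqc/Sample_123_L001_R2_fastqc.zip',
#  'fastqc/Sample_456_L001_R1_fastqc.zip', 'fastqc/Sample_456_L001_R2_fastqc.zip']

# With zip: the i-th sample is paired with the i-th read (2 files).
expand('fastqc/{sample}_R{read}_fastqc.zip', zip,
	sample=['Sample_123_L001', 'Sample_456_L001'], read=['1', '2'])
# ['fastqc/Sample_123_L001_R1_fastqc.zip', 'fastqc/Sample_456_L001_R2_fastqc.zip']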

We can also use a simplified variant of expand(), called multiext(), that allows us to define a set of output or input files that differ only by their extension. This will clean things up a bit more.

Now our full baby snakemake file looks like this:

SAMPLES, READS, = glob_wildcards('files/{sample}_R{read}.fastq')

rule all:
	input:
		expand('fastqc/{sample}_R{read}_fastqc.zip', zip, sample=SAMPLES, read=READS),
		expand('fastqc/{sample}_R{read}_fastqc.html', zip, sample=SAMPLES, read=READS),

rule fastqc:
	input:
		file = 'files/{sample}_R{read}.fastq'
	params:
		outdir = 'fastqc'
	output:
		multiext("fastqc/{sample}_R{read}_fastqc", ".zip", ".html")
	shell:
		'''
		mkdir -p {params.outdir}
		module load fastqc/0.11.5
		fastqc {input.file} -o {params.outdir}
		'''

Notes on Wildcards

  • Note: There is nothing special about sample; you could put snake or blabla in the brackets and get the same result. That being said, it’s usually good to pick a meaningful name, because someone else might need to read your code at some point.

  • You can also constrain wildcards, which can be useful depending on what your file naming system is. As an example, you can add the following lines to the top of the Snakefile. With this constraint, the sample wildcard can only contain letters, numbers, or underscores (not strictly true, but close enough most of the time).

wildcard_constraints:
   sample = '\w+'

More details on wildcards can be found in the snakemake documentation.

Adding in Some Python

We have seen how to run programs that are already installed on the cluster, but what about our own scripts? We can add Python directly into a rule by swapping out the shell argument for run.

  • A shell action executes a command-line instruction.
  • A run action executes Python code.

Here is a little script that opens each fastq file and counts the number of sequences and the average read length. A little silly, but you get the idea. We can add this to our Snakefile, but don’t forget to add the output to rule all, or snakemake won’t know you want this rule run as part of the pipeline. This isn’t snakemake’s fault; computers are dumb and will only do what you tell them, even if it makes no sense.

# import packages for python in rule python_practice
import os
import pandas as pd
import glob
import re

rule python_practice:
	input:
		expand('files/{sample}_R{read}.fastq', zip, sample=SAMPLES, read=READS)
	output:
		'read_lengths.csv'
	run:
		fastq_files = glob.glob("files/*.fastq")
		rows = []  # collect one row per file, then build the dataframe at the end
		for file in fastq_files:
			print("Analyzing {}".format(file))
			seqID = re.sub(r'_R[12]\.fastq$', '', os.path.basename(file))  # capture the sample ID from the file name
			num_seqs = 0
			lengths = []
			with open(file, "r") as f:
				for line in f:
					line = line.strip('\n')
					if line.startswith(('A', 'G', 'C', 'T')):  # crude test for a sequence line
						num_seqs += 1
						lengths.append(len(line))
			ave_seq_length = int(sum(lengths) / len(lengths))
			rows.append({'SeqID': seqID, 'ave_seq_length': ave_seq_length, 'num_seqs': num_seqs})
		df = pd.DataFrame(rows, columns=['SeqID', 'ave_seq_length', 'num_seqs'])
		df.to_csv('read_lengths.csv', sep=',', index=False)

Notice that you can write Python directly in the Snakefile without it being part of a rule, as we did with the imports above.
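
For instance, here is a tiny sketch of top-level Python (the values are hypothetical) that runs when the Snakefile is parsed:

import datetime

# Plain Python, evaluated when snakemake parses the Snakefile.
RUN_DATE = datetime.date.today().isoformat()
NUM_SAMPLES = len(SAMPLES)  # SAMPLES from glob_wildcards() behaves like a Python list

Anything defined this way can be used later in rules, e.g. in params. (A print() at this level also runs at parse time, which is why it can break snakemake --dag, as noted later.)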

Running Python Scripts

If you have a Python script you want to run, you can call it from shell: just as you would on the command line.

rule clean_kraken:
	'''This rule will identify the sequences that belong to crypto and return fastq files with only these sequences'''
	input:
		file_1 = 'files/{sample}_R1.fastq',
		file_2 = 'files/{sample}_R2.fastq',
		kraken = 'Kraken/{sample}.kraken',
		report = 'Kraken/{sample}_report.txt'
	params:
		outdir = 'Kraken_Cleaned'
	output:
		k_file_1 = 'Kraken_Cleaned/{sample}_Kclean_R1.fastq',
		k_file_2 = 'Kraken_Cleaned/{sample}_Kclean_R2.fastq'
	message: '''--- Running extract_kraken_reads.py to get fastq files without contamination. ---'''
	shell:
		'''
		mkdir -p {params.outdir}
		python extract_kraken_reads.py --include-children --fastq-output --taxid 5806 -s {input.file_1} -s2 {input.file_2} -o {output.k_file_1} -o2 {output.k_file_2} --report {input.report} -k {input.kraken}
		'''

Notice here that we added the message: argument. When executing snakemake, a short summary for each running rule is printed to the console; specifying a message for a rule, as we did here, overrides that summary.

Another way to run a Python script (or any type of script) is to use the script: directive rather than shell:, and have the script make the output directory for us. Note that script: takes only the path to a script; no command-line arguments are passed. The script instead reads the rule’s inputs and outputs from a snakemake object, so extract_kraken_reads.py would need to be adapted into such a script (here a hypothetical scripts/clean_kraken.py; a sketch follows the rule).

rule clean_kraken:
	'''This rule will identify the sequences that belong to crypto and return fastq files with only these sequences'''
	input:
		file_1 = 'files/{sample}_R1.fastq',
		file_2 = 'files/{sample}_R2.fastq',
		kraken = 'Kraken/{sample}.kraken',
		report = 'Kraken/{sample}_report.txt'
	params:
		outdir = 'Kraken_Cleaned'
	output:
		k_file_1 = 'Kraken_Cleaned/{sample}_Kclean_R1.fastq',
		k_file_2 = 'Kraken_Cleaned/{sample}_Kclean_R2.fastq'
	message: '''--- Running extract_kraken_reads.py to get fastq files without contamination. ---'''
	script:
		"scripts/clean_kraken.py"

So currently our full Snakemake file looks like this:

# import packages for python in rule python_practice
import os
import pandas as pd
import glob
import re

SAMPLES, READS, = glob_wildcards('files/{sample}_R{read}.fastq')

rule all:
	input:
		expand('fastqc/{sample}_R{read}_fastqc.zip', zip, sample=SAMPLES, read=READS),
		expand('fastqc/{sample}_R{read}_fastqc.html', zip, sample=SAMPLES, read=READS),
		'read_lengths.csv',
		expand('Kraken_Cleaned/{sample}_Kclean_R{read}.fastq', zip, sample=SAMPLES, read=READS)

rule fastqc:
	input:
		file = 'files/{sample}_R{read}.fastq'
	params:
		outdir = 'fastqc'
	output:
		multiext("fastqc/{sample}_R{read}_fastqc", ".zip", ".html")
	shell:
		'''
		mkdir -p {params.outdir}
		module load fastqc/0.11.5
		fastqc {input.file} -o {params.outdir}
		'''

rule python_practice:
	input:
		expand('files/{sample}_R{read}.fastq', zip, sample=SAMPLES, read=READS)
	output:
		'read_lengths.csv'
	run:
		fastq_files = glob.glob("files/*.fastq")
		rows = []  # collect one row per file, then build the dataframe at the end
		for file in fastq_files:
			print("Analyzing {}".format(file))
			seqID = re.sub(r'_R[12]\.fastq$', '', os.path.basename(file))  # capture the sample ID from the file name
			num_seqs = 0
			lengths = []
			with open(file, "r") as f:
				for line in f:
					line = line.strip('\n')
					if line.startswith(('A', 'G', 'C', 'T')):  # crude test for a sequence line
						num_seqs += 1
						lengths.append(len(line))
			ave_seq_length = int(sum(lengths) / len(lengths))
			rows.append({'SeqID': seqID, 'ave_seq_length': ave_seq_length, 'num_seqs': num_seqs})
		df = pd.DataFrame(rows, columns=['SeqID', 'ave_seq_length', 'num_seqs'])
		df.to_csv('read_lengths.csv', sep=',', index=False)

rule clean_kraken:
	'''This rule will identify the sequences that belong to crypto and return fastq files with only these sequences'''
	input:
		file_1 = 'files/{sample}_R1.fastq',
		file_2 = 'files/{sample}_R2.fastq',
		kraken = 'Kraken/{sample}.kraken',
		report = 'Kraken/{sample}_report.txt'
	params:
		outdir = 'Kraken_Cleaned'
	output:
		k_file_1 = 'Kraken_Cleaned/{sample}_Kclean_R1.fastq',
		k_file_2 = 'Kraken_Cleaned/{sample}_Kclean_R2.fastq'
	message: '''--- Running extract_kraken_reads.py to get fastq files without contamination. ---'''
	shell:
		'''
		mkdir -p {params.outdir}
		python extract_kraken_reads.py --include-children --fastq-output --taxid 5806 -s {input.file_1} -s2 {input.file_2} -o {output.k_file_1} -o2 {output.k_file_2} --report {input.report} -k {input.kraken}
		'''

Your Turn

Now that you have a bit of experience with snakemake, go ahead and write a new rule to run fastqc on the new files that have been created. When you have given it a shot, click below to see my solution.

Answer

Your code should look something like this:

# import packages for python in rule python_practice
import os
import pandas as pd
import glob
import re

SAMPLES, READS, = glob_wildcards('files/{sample}_R{read}.fastq')  

rule all:  
	input:  
		expand('fastqc/{sample}_R{read}_fastqc.zip', zip, sample=SAMPLES, read=READS),  
		expand('fastqc/{sample}_R{read}_fastqc.html', zip, sample=SAMPLES, read=READS),  
		expand('Kraken_Cleaned/{sample}_Kclean_R{read}.fastq', zip, sample=SAMPLES, read=READS),  
		expand('fastqc_Cleaned/{sample}_Kclean_R{read}_fastqc.zip', zip, sample=SAMPLES, read=READS),  
		expand('fastqc_Cleaned/{sample}_Kclean_R{read}_fastqc.html', zip, sample=SAMPLES, read=READS),  
		"read_lengths.csv"  

rule fastqc:  
	input:  
		file = 'files/{sample}_R{read}.fastq'  
	params:  
		outdir = 'fastqc'  
	output:  
		multiext("fastqc/{sample}_R{read}_fastqc", ".zip", ".html")  
	shell:  
		'''  
		mkdir -p {params.outdir}  
		module load fastqc/0.11.5  
		fastqc {input.file} -o {params.outdir}  
		''' 

rule python_practice:
	input:
		expand('files/{sample}_R{read}.fastq', zip, sample=SAMPLES, read=READS)
	output:
		'read_lengths.csv'
	run:
		fastq_files = glob.glob("files/*.fastq")
		rows = []  # collect one row per file, then build the dataframe at the end
		for file in fastq_files:
			print("Analyzing {}".format(file))
			seqID = re.sub(r'_R[12]\.fastq$', '', os.path.basename(file))  # capture the sample ID from the file name
			num_seqs = 0
			lengths = []
			with open(file, "r") as f:
				for line in f:
					line = line.strip('\n')
					if line.startswith(('A', 'G', 'C', 'T')):  # crude test for a sequence line
						num_seqs += 1
						lengths.append(len(line))
			ave_seq_length = int(sum(lengths) / len(lengths))
			rows.append({'SeqID': seqID, 'ave_seq_length': ave_seq_length, 'num_seqs': num_seqs})
		df = pd.DataFrame(rows, columns=['SeqID', 'ave_seq_length', 'num_seqs'])
		df.to_csv('read_lengths.csv', sep=',', index=False)

rule clean_kraken:  
	'''This rule will identify the sequences that belong to crypto and return fastq files with only these sequences'''  
	input:  
		file_1 = 'files/{sample}_R1.fastq',  
		file_2 = 'files/{sample}_R2.fastq',  
		kraken = 'Kraken/{sample}.kraken',  
		report = 'Kraken/{sample}_report.txt'  
	params:  
		outdir = 'Kraken_Cleaned'  
	output:  
		k_file_1 = 'Kraken_Cleaned/{sample}_Kclean_R1.fastq',  
		k_file_2 = 'Kraken_Cleaned/{sample}_Kclean_R2.fastq'  
	message: '''--- Running extract_kraken_reads.py to get fastq files without contamination. ---'''  
	shell:  
		'''  
		echo 'I am running on node:' `hostname`  
		mkdir -p {params.outdir}  
		python extract_kraken_reads.py --include-children --fastq-output --taxid 5806 -s {input.file_1} -s2 {input.file_2} -o {output.k_file_1} -o2 {output.k_file_2} --report {input.report} -k {input.kraken}  
		'''
 
rule fastqc_2:  
	input:  
		file = 'Kraken_Cleaned/{sample}_Kclean_R{read}.fastq'  
	params:  
		outdir = 'fastqc_Cleaned'  
	output:  
		multiext("fastqc_Cleaned/{sample}_Kclean_R{read}_fastqc", ".zip", ".html")  
	shell:  
		'''  
		mkdir -p {params.outdir}  
		module load fastqc/0.11.5  
		fastqc {input.file} -o {params.outdir}  
		'''


Re-Running Rules

If you make a change and need to re-run a rule, there are a few options:

  1. If you modify any file that an output depends on, and then rerun snakemake, everything downstream from that file is re-run.

For example, if we update an input file (for example Sample_123_L001_R2.fastq) by running touch files/Sample_123_L001_R2.fastq at the command line, the timestamp on this file is updated, which triggers a rerun of the fastqc and python_practice rules the next time snakemake --cores 1 is run.

  2. If you modify the code behind a rule, you can force the rule to re-run by using the -f flag.

snakemake -f python_practice --cores 1 # forces this step to run
snakemake all --cores 1 # runs everything needed to get the files listed in rule all:

  3. If you just want to re-run everything, you can use the -F flag. This forces the target rule to run as well as every rule it depends on.

“Checkpoints”

There have been times when I want all my files to finish up to a certain rule before proceeding to the next rule. I don’t know if this is the best way to do it, but I have created a rule that requires all my files as input using expand() and has it create a blank file that the next rule downstream requires as input: (see the sketch below). I could foresee errors with this, so be careful…
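
Here is a hedged sketch of that pattern (the rule and sentinel file names are made up); snakemake’s touch() flag creates the (empty) output file when the job finishes successfully:

rule wait_for_cleaning:
	input:
		expand('Kraken_Cleaned/{sample}_Kclean_R{read}.fastq', zip, sample=SAMPLES, read=READS)
	output:
		touch('flags/all_cleaned.done')  # empty sentinel file

A downstream rule then lists 'flags/all_cleaned.done' in its input, so it cannot start until every sample has been cleaned.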

Dynamic/Checkpoints

Snakemake provides experimental support for dynamic files using the dynamic() function. Dynamic files can be used whenever you have a rule for which the number of output files is unknown before the rule is executed.

Here is a blog post on checkpoints/dynamic files. Snakemake’s documentation also covers this.
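
For reference, the dynamic() pattern looks roughly like this (a sketch only; clusterer is a hypothetical tool, and newer snakemake versions replace dynamic() with checkpoints):

rule cluster:
	input:
		'all_reads.fastq'
	output:
		dynamic('clusters/{clusterid}.fastq')  # number of output files unknown in advance
	shell:
		'clusterer {input} --outdir clusters'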

Some Helpful Commands

Snakemake has lots of command-line flags; you can find a list here.

To list all the rules in the Snakefile, run one of the following.

snakemake --list

or

snakemake -l

To understand why each rule is being run, use the --reason flag.

snakemake --reason

or

snakemake -r

To have snakemake go through a dry run of its workflow, so you can make sure it’s doing what you want before you submit 100 jobs, use the --dryrun flag. This will only show what would be run, without running any commands.

snakemake --dryrun 

To visualize your workflow you can generate a .png image with:

snakemake --dag | dot -Tpng > dag.png

In this case the DAG looks like this: [figure: DAG of the full workflow]

It should be noted that DAG drawing won’t work if you have print statements directly in the Snakefile rather than inside a rule, since their output gets mixed into the input for dot.

Keeping Things Tidy

If you have a lot of paths that might change, you can simplify your Snakefile by creating a config.json file. For example, we can create one that has a working directory defined.

{
    "workdir": "/scicomp/home-pure/USER_ID/Snakemake_Tutorial/"
}

This can then be used within your Snakefile like this (using rule all as an example):

configfile: '$PATH_TO_CONFIG/config.json'

rule all:
	input:
		expand(config['workdir'] + 'fastqc/{sample}_R{read}_fastqc.zip', zip, sample=SAMPLES, read=READS),
		expand(config['workdir'] + 'fastqc/{sample}_R{read}_fastqc.html', zip, sample=SAMPLES, read=READS),
		expand(config['workdir'] + 'Kraken_Cleaned/{sample}_Kclean_R{read}.fastq', zip, sample=SAMPLES, read=READS),
		expand(config['workdir'] + 'fastqc_Cleaned/{sample}_Kclean_R{read}_fastqc.zip', zip, sample=SAMPLES, read=READS),
		expand(config['workdir'] + 'fastqc_Cleaned/{sample}_Kclean_R{read}_fastqc.html', zip, sample=SAMPLES, read=READS),
		config['workdir'] + "read_lengths.csv"

Protected and Temporary Files

Output files can be marked as protected in the Snakefile, and they will be ‘locked’ (write permissions removed) after creation so that they are harder to accidentally delete.

Alternatively, you can mark a file as temp and it will be deleted as soon as all rules that depend on it have run. This is a good way to automatically remove intermediate files that take up lots of hard disk space.

rule example_rule:
    input:
        "some/file/input.txt",
    output:
        protected("my_protected_file.txt"),
        temp("my_temporary_file.txt"),
    script: "do_things.py"

Troubleshooting (AKA mistakes I made)

  • Snakemake will get mad if a rule’s output contains a directory or file with no wildcards while other output files do contain wildcards. If any output has wildcards, snakemake expects all outputs to have the same wildcards; otherwise two or more jobs of the rule would produce the same file, for example trying to make the same directory multiple times.

Snakemake will give this error:

SyntaxError:
Not all output, log and benchmark files of rule xxx contain the same wildcards. This is crucial though, in order to avoid that two or more jobs write to the same file.

Solution: put directories under params: rather than output.

  • If you get this error:
SyntaxError in line XX of /$PATH/Snakefile:
EOF in multi-line statement (Snakefile, line XX)

or

SyntaxError in line 307 of /$PATH/Snakefile:
invalid syntax

The most common cause is a missing comma at the end of a line in your inputs or outputs.

  • Snakemake will be annoyed if you have a positional (unnamed) file after a named one in the input or output. Snakemake will complain with this error:
SyntaxError in line XXXX of /$PATH/Snakefile:
positional argument follows keyword argument

DON’T do this:

rule example:
	input:
		script = 'runme.sh',
		'examplefile.txt'

DO this instead:

rule example:
	input:
		'examplefile.txt',
		script = 'runme.sh'
  • If you see this error
Building DAG of jobs...
CyclicGraphException in line XXX of /$PATH/Snakefile:
Cyclic dependency on rule examplerule.

There is a file that appears in both your input and output arguments, so double-check the rule that snakemake complains about. A file can only appear in one of the two.

  • Running awk in a shell command of snakemake

If your shell commands contain literal {} (as awk programs do), snakemake will complain, because {} is how arguments are passed into snakemake’s shell strings. You just need to escape the braces by doubling them, {{ and }}, as in the sketch below. Yay, an easy fix!
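
For example, here is a minimal sketch that pulls the first column out of our read_lengths.csv with awk (the output file name is made up); the doubled braces reach awk as literal { and }:

rule first_column:
	input:
		'read_lengths.csv'
	output:
		'seq_ids.txt'
	shell:
		'''
		awk -F',' '{{print $1}}' {input} > {output}
		'''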