flexible bioinformatics pipelines with snakemake

Written: Friday, May 10, 2013

Building bioinformatics pipelines can be a somewhat tedious process that seemingly never ends and often results in significant refactoring when assumptions or workflows change. The snakemake project is a small framework that facilitates building flexible, reuseable pipelines.

I am not going to go through the installation process in this entry, but since snakemake requires python3, suffice it to say that virtualenvs almost certainly will be involved since most systems default to python2.X.

I had a small revelation last night about how snakemake can be used to run the GATK pipeline. Like many folks, I use filename “tags” to track individual transformations and ordering of the tags in the filename to capture the ordering of the transformations.

Tool Added filename extension
mark duplicates md
quality recalibration recal
indel realignment realigned

So, a file with the name “stuff.realigned.recal.md.bam” has been realigned, then recalibrated, then had duplicates marked. Different orderings of tags like “md”, “realigned”, and “recal” represent different processing orders. A simple snakefile with rules for these transformations might look like this:

rule qualrecal:
    input: bam="{base}.bam"
    output: bam="{base}.recal.bam"
    shell: "touch {output}"

rule indelrealigner:
    input: bam="{base}.bam",intervals="{base}.intervals"
    output: bam="{base}.realigned.bam"
    shell: "touch {output}"

rule realignertargetcreator:
    input: "{base}.bam"
    output: "{base}.intervals"
    shell: 'touch {output}'

rule indexbam:
    input: '{base}.bam'
    output: '{base}.bam.bai'
    shell: 'touch {output}'

rule markdups:
    input: '{base}.bam'
    output: bam='{base}.md.bam',metrics='{base}.dupmetrics'
    shell: "touch {output.bam}; touch {output.metrics}"

I have not put any shell commands in this file just to keep things simple. Obviously, this Snakefile will not produce any meaningful output, but it is enough to be instructive.

If I add a rule, “final” that contains the output filename(s) of interest and then run snakemake -n, we can see how the finename(s) dictate the run order for the rules. This allows one to change the pipeline by simply adjusting output filenames. Quite cool!

rule final:
    input: "stuff.realigned.md.bam", "stuff.realigned.md.bam.bai"

Running snakemake -n with this rule included as the first rule in the snakefile yields:

rule realignertargetcreator:
        input: stuff.bam
        output: stuff.intervals
rule indelrealigner:
        input: stuff.bam, stuff.intervals
        output: stuff.realigned.bam
rule markdups:
        input: stuff.realigned.bam
        output: stuff.realigned.md.bam, stuff.realigned.dupmetrics
rule indexbam:
        input: stuff.realigned.md.bam
        output: stuff.realigned.md.bam.bai
rule final:
        input: stuff.realigned.md.bam, stuff.realigned.md.bam.bai
Job counts:
        count	jobs
        1	final
        1	indelrealigner
        1	realignertargetcreator
        1	indexbam
        1	markdups
        5

Changing the rule “final” to this:

rule final:
    input: "stuff.md.realigned.bam", "stuff.md.realigned.bam.bai"

and running snakemake -n again gives a different ordering of steps.

rule markdups:
        input: stuff.bam
        output: stuff.md.bam, stuff.dupmetrics
rule realignertargetcreator:
        input: stuff.md.bam
        output: stuff.md.intervals
rule indelrealigner:
        input: stuff.md.bam, stuff.md.intervals
        output: stuff.md.realigned.bam
rule indexbam:
        input: stuff.md.realigned.bam
        output: stuff.md.realigned.bam.bai
rule final:
        input: stuff.md.realigned.bam, stuff.md.realigned.bam.bai
Job counts:
        count	jobs
        1	final
        1	realignertargetcreator
        1	markdups
        1	indelrealigner
        1	indexbam
        5

Adding recalibration to the process is as simple as including “.recal” in the filename in the correct order. This works with one bam file or 1000 bam files. Nice….



blog comments powered by Disqus