Installation

This guide covers installing and configuring the GDCGenomicsQC pipeline.

Automatic Software Installation

The pipeline uses Snakemake’s built-in conda support to automatically install software dependencies defined in rule-level conda: directives. This means:

  • No manual installation of PLINK, bcftools, GATK, shapeit4, rfmix, etc.

  • Each rule can specify its own conda environment

  • Singularity containers are pulled automatically when using --use-singularity

Choose the installation method that matches your environment. The module-based setup described first uses pre-installed modules and pre-cached Singularity images, and is ideal for standard HPC environments such as MSI at UMN.

Prerequisites:

  • Access to MSI HPC with SLURM scheduler

  • Module system available

Setup:

module use /projects/standard/gdc/public/GDCGenomicsQC/envs
module load gdcgenomicsqc

# Verify environment is set up
echo $SINGULARITY_CACHEDIR
echo $SNAKEMAKE_SINGULARITY_PREFIX
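
The two echo commands print an empty line when a variable is unset, which is easy to miss. A minimal sketch of a stricter check follows. check_gdc_env is a hypothetical helper name, not something the module provides, and it assumes only the two variable names shown above.

```shell
# Hypothetical helper (not part of the gdcgenomicsqc module):
# report any of the expected variables that are unset or empty.
check_gdc_env() {
    status=0
    for name in SINGULARITY_CACHEDIR SNAKEMAKE_SINGULARITY_PREFIX; do
        # Indirect variable lookup that works in any POSIX shell.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set; did 'module load gdcgenomicsqc' succeed?" >&2
            status=1
        else
            echo "$name=$value"
        fi
    done
    return "$status"
}
```

Running check_gdc_env after loading the module should print both paths and return a zero exit status.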

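What the module provides:

The gdcgenomicsqc module sets up the pipeline environment, including the SINGULARITY_CACHEDIR and SNAKEMAKE_SINGULARITY_PREFIX variables checked above and the gdcgenomicsqc wrapper script.

Snakemake availability:

The module does NOT provide Snakemake. You must make Snakemake available separately, for example by activating a conda environment that contains it (on MSI: module load miniconda && conda activate snakemake).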

Clone the repository (if not already available):

git clone https://github.com/UMN-GDC/GDCGenomicsQC.git
cd GDCGenomicsQC

Run (from the repository root):

cd workflow
snakemake --profile ../profiles/sandbox --configfile /path/to/your/config.yaml

Or using the wrapper script (after loading the module):

gdcgenomicsqc --configfile /path/to/your/config.yaml
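
Before submitting real jobs, it can help to preview what Snakemake plans to run. The sketch below wraps the snakemake command above with Snakemake's --dry-run flag. dry_run_pipeline is a hypothetical helper name, and the relative profile path assumes the working directory is workflow, as in the Run step.

```shell
# Hypothetical helper: preview the jobs Snakemake would run, without executing them.
dry_run_pipeline() {
    config="$1"
    if [ ! -f "$config" ]; then
        echo "config file not found: $config" >&2
        return 1
    fi
    # --dry-run (-n) only prints the planned jobs.
    snakemake --dry-run --profile ../profiles/sandbox --configfile "$config"
}
```

Usage: dry_run_pipeline /path/to/your/config.yaml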

Skip to Usage

Prerequisite Software

Regardless of loading method, the following software is required:

Software                         Required By             Notes
-------------------------------  ----------------------  -------------------------------------------
Snakemake (8+)                   Pipeline execution      Conda recipe provided in envs/snakemake.yml
snakemake-executor-plugin-slurm  HPC job submission      Required for SLURM profiles
Singularity/Apptainer            Containerized tools     MSI module: module load apptainer
SLURM scheduler                  HPC job scheduling      For profiles/hpc and profiles/sandbox
Conda or Mamba                   Environment management  Mamba recommended for faster solving
Git                              Repository access       For cloning the repository
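
The command-line prerequisites listed above can be checked in one pass. The following is a minimal sketch, not part of the pipeline: missing_tools is a hypothetical helper, the command names are assumptions about how each tool is exposed on your system, and the check does not cover the snakemake-executor-plugin-slurm Python package.

```shell
# Hypothetical helper: print each named command that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Command names are assumptions; adjust them to your site.
missing=$(missing_tools snakemake sbatch conda git)
if [ -n "$missing" ]; then
    echo "Missing prerequisites:" $missing
fi

# The container runtime may be installed under either name.
command -v apptainer >/dev/null 2>&1 || command -v singularity >/dev/null 2>&1 \
    || echo "Neither apptainer nor singularity found"
```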

Software Loading Methods

The pipeline supports multiple ways to access its dependencies. Choose the method that matches your HPC environment:

Method                   Best For              Setup Required
-----------------------  --------------------  ---------------------------
Module System (MSI)      MSI HPC clusters      module load gdcgenomicsqc
Module System (Sandbox)  Sandbox environments  module load gdcgenomicsqc
Conda Environment        Custom HPC or local   conda env create
Singularity/Apptainer    Container-based HPC   Pull images manually
System-wide Install      Local development     pip install / conda install
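
For the Conda Environment method, a minimal sketch of the setup is shown below. It assumes the repository root is the working directory and that envs/snakemake.yml defines an environment named snakemake (implied by the conda activate snakemake command used in this guide). pick_frontend and create_snakemake_env are hypothetical helper names.

```shell
# Hypothetical helper: print "mamba" if it is installed, otherwise "conda".
pick_frontend() {
    for tool in mamba conda; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    echo "neither mamba nor conda found on PATH" >&2
    return 1
}

# Create the environment from the recipe shipped with the repository.
create_snakemake_env() {
    frontend=$(pick_frontend) || return 1
    "$frontend" env create -f envs/snakemake.yml
}
```

After create_snakemake_env finishes, conda activate snakemake should make the snakemake command available.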

Software Environment Summary

Quick Reference: How to Load Software

Software      Conda Command             Module Command (MSI)                               Module Command (Sandbox)
------------  ------------------------  -------------------------------------------------  -------------------------------------------------
Snakemake     conda activate snakemake  module load miniconda && conda activate snakemake  module load miniconda && conda activate snakemake
GDC Pipeline  (via containers)          module load gdcgenomicsqc                          module load gdcgenomicsqc
Apptainer     N/A                       module load apptainer                              module load apptainer
SLURM         N/A                       (Usually default on HPC)                           (Usually default on HPC)

External Dependencies

All software dependencies are handled automatically through conda environments and Singularity containers. The pipeline is otherwise self-contained; you only need:

  • Snakemake (installed via conda as shown above)

  • Access to reference data (e.g., 1000 Genomes Project)

  • Sufficient storage for intermediate and output files

  • Appropriate HPC resources (see profile configurations)

No manual installation of external tools (PLINK, bcftools, GATK, etc.) is required.

Software Environment Files

The pipeline includes the following environment definitions in envs/:

Environment Files

File              Purpose
----------------  -------------------------------------------------
snakemake.yml     Snakemake and SLURM executor plugin
genomeUtils.yml   General genomic utilities (PLINK, bcftools, etc.)
rfmix.yml         RFMix for local ancestry inference
phenotypeSim.yml  Phenotype simulation tools
ancNreport.yml    Ancestry reporting and visualization
mash.yml          Mash distance estimation
karyoploteR.yml   Karyotype visualization

Container Images

The pipeline uses Singularity/Apptainer containers for reproducibility. Images are automatically pulled based on rule-level container: directives.

Image                                                      Contains
---------------------------------------------------------  ------------------------------
oras://ghcr.io/coffm049/gdcgenomicsqc/ancnreport:latest    Ancestry reporting environment
oras://ghcr.io/coffm049/gdcgenomicsqc/rfmix:latest         RFMix local ancestry inference
oras://ghcr.io/coffm049/gdcgenomicsqc/mash:latest          Mash distance estimation
oras://ghcr.io/coffm049/gdcgenomicsqc/phenotypesim:latest  Phenotype simulation tools
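
If automatic pulls are not possible, for example on compute nodes without network access, the images listed above can be fetched ahead of time from a node that does have access. The sketch below loops over the image URIs with apptainer pull. sif_name and pull_gdc_images are hypothetical helpers, and the output file naming is an assumption rather than a convention of the pipeline.

```shell
# Hypothetical helper: turn an image URI into a local file name,
# e.g. oras://ghcr.io/org/repo/rfmix:latest -> rfmix_latest.sif
sif_name() {
    base=${1##*/}    # strip everything up to the last slash
    echo "$(echo "$base" | tr ':' '_').sif"
}

# Pull every image used by the pipeline into the current directory.
pull_gdc_images() {
    for uri in \
        oras://ghcr.io/coffm049/gdcgenomicsqc/ancnreport:latest \
        oras://ghcr.io/coffm049/gdcgenomicsqc/rfmix:latest \
        oras://ghcr.io/coffm049/gdcgenomicsqc/mash:latest \
        oras://ghcr.io/coffm049/gdcgenomicsqc/phenotypesim:latest
    do
        apptainer pull "$(sif_name "$uri")" "$uri" || return 1
    done
}
```

Pulling once populates the Apptainer cache, so later pulls of the same image can typically reuse it. Whether Snakemake picks up the cached copy depends on your cache and prefix settings.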

Troubleshooting

If jobs fail to start:

  • Verify SLURM is available: sbatch --version

  • Verify Snakemake is available: snakemake --version

  • Check that your config paths are correct

  • Ensure output directories are writable
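
The last check in this list can be scripted. A minimal sketch, assuming the output directory is known in advance; check_writable_dir is a hypothetical helper and the path in the usage line is a placeholder.

```shell
# Hypothetical helper: succeed only if the argument is an existing, writable directory.
check_writable_dir() {
    if [ -d "$1" ] && [ -w "$1" ]; then
        echo "ok: $1"
    else
        echo "not a writable directory: $1" >&2
        return 1
    fi
}
```

Usage: check_writable_dir /path/to/output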

If conda environments fail to resolve:

  • Use mamba instead of conda for faster solving

  • Set in config: conda-frontend: mamba

If containers fail to pull:

  • Check network connectivity

  • Set the cache directory to a disk with enough space: export SINGULARITY_CACHEDIR=/path/to/large/disk
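
A minimal sketch of the cache setup, assuming your cluster may use either Singularity or Apptainer: it creates the directory and sets both SINGULARITY_CACHEDIR and APPTAINER_CACHEDIR, the variable name Apptainer itself reads. use_container_cache is a hypothetical helper name.

```shell
# Hypothetical helper: create a cache directory and point both
# Singularity- and Apptainer-style variables at it.
use_container_cache() {
    mkdir -p "$1" || return 1
    export SINGULARITY_CACHEDIR="$1"
    export APPTAINER_CACHEDIR="$1"    # Apptainer's own variable name
}
```

Usage: use_container_cache /path/to/large/disk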

For additional help, see the Usage guide or open an issue on GitHub.

Important

Every time you start a new session, you must rerun the environment setup steps:

  • Load the GDC module (if using module system)

  • Activate the snakemake conda environment

Example for a new session:

# For MSI HPC:
module use /projects/standard/gdc/public/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake

# For Sandbox:
module use /scratch.global/GDC/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake

# For other HPCs:
module use /path/to/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake
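
Because these steps repeat in every session, they can be wrapped in a small shell function, for example in ~/.bashrc. This is a sketch, not something the pipeline ships: gdc_session is a hypothetical name, and the argument is the envs directory for your site, as in the examples above.

```shell
# Hypothetical convenience function: run the per-session setup in one call.
gdc_session() {
    if [ -z "${1:-}" ]; then
        echo "usage: gdc_session /path/to/GDCGenomicsQC/envs" >&2
        return 2
    fi
    module use "$1" || return 1
    module load gdcgenomicsqc || return 1
    conda activate snakemake
}
```

Usage on MSI: gdc_session /projects/standard/gdc/public/GDCGenomicsQC/envs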