Usage

The GDCGenomicsQC pipeline performs quality control on genomic data. It combines standard QC procedures with ancestry estimation and optional features such as local ancestry inference with RFMix.

Software Environment Setup 

Before running the pipeline, you need to set up your software environment. Choose the method that matches your HPC setup:

If your HPC has the GDC module pre-configured:

Step 1: Add module path and load the GDC module

module use /projects/standard/gdc/public/GDCGenomicsQC/envs
module load gdcgenomicsqc

Step 2: Activate snakemake environment

conda activate snakemake

Step 3: Verify installation

cd GDCGenomicsQC
snakemake --version

Running the pipeline:

cd GDCGenomicsQC/workflow
gdcgenomicsqc --configfile ../config/config.yaml

Or with snakemake directly:

cd GDCGenomicsQC/workflow
snakemake --profile ../profiles/hpc --configfile ../config/config.yaml

If your sandbox environment has the GDC module pre-configured:

Step 1: Add module path and load the GDC module

module use /scratch.global/GDC/GDCGenomicsQC/envs
module load gdcgenomicsqc

Step 2: Activate snakemake environment

conda activate snakemake

Step 3: Verify installation

cd GDCGenomicsQC
snakemake --version

Running the pipeline:

cd GDCGenomicsQC/workflow
gdcgenomicsqc --configfile ../config/config.yaml

Or with snakemake directly:

cd GDCGenomicsQC/workflow
snakemake --profile ../profiles/sandbox --configfile ../config/config.yaml

If you’re using your own Snakemake installation:

Step 1: Create the conda environment

# Clone the repository
git clone https://github.com/UMN-GDC/GDCGenomicsQC.git
cd GDCGenomicsQC

# Create the snakemake environment
conda env create -f envs/snakemake.yml
conda activate snakemake

Step 2: Verify installation

snakemake --version

Running the pipeline:

cd GDCGenomicsQC/workflow
snakemake --profile ../profiles/hpc --configfile ../config/config.yaml

See also: Installation for detailed setup options including Singularity-only environments.
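Whichever setup method you use, it can help to confirm that the expected commands are on your PATH before launching anything. The following is a minimal sketch of such a preflight check; `check_tools` is a hypothetical helper written for this illustration and is not part of GDCGenomicsQC.

```shell
#!/bin/sh
# Sketch: report which of the named commands are available on PATH.
# check_tools is a hypothetical helper, not something the pipeline ships.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    # Non-zero exit status if at least one command was not found
    return "$missing"
}

# Example use after the setup steps:
# check_tools snakemake || echo "Activate the snakemake environment first."
```

If you use the module-based setup, you could also pass `gdcgenomicsqc` to the same check.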

Workflow Overview 

The pipeline processes input data through a structured sequence of stages:

Figure: Overview of the GDC Genomics QC Pipeline stages.

Initial QC: Sample and SNP filtering using PLINK
Relatedness: KING/PC-AiR/PC-Relate for kinship estimation
Standard QC: GWAS-level filters (MAF, HWE, missingness)
Phasing: Haplotype estimation via shapeit4
Global Ancestry: PCA/UMAP/VAE with Random Forest classification
Local Ancestry: RFMix for segment-level ancestry inference
Per-Ancestry QC: Ancestry-specific quality control

For more details on each module, see Genomics.

Configuration 

All pipeline options are configured via the config/config.yaml file. This replaces the older command-line flag approach.

Basic Configuration 

The INPUT parameter specifies your input genomic data. The pipeline automatically detects the format based on the file extension and whether {CHR} is present:

# Input genomic data template. Supports:
# - Per-chromosome VCF: "/path/to/vcf/chr{CHR}.vcf.gz" (use {CHR} placeholder)
# - Whole genome BED: "/path/to/data/merged.bed"
# - Whole genome PGEN: "/path/to/data/merged.pgen"
INPUT: "/path/to/vcf/chr{CHR}.vcf.gz"

# Alternative VCF template for ABCD-style paths (optional)
vcf_template: null

# Output directory for pipeline results
OUT_DIR: "/path/to/output/directory"

# Reference data directory
REF: "/path/to/reference/data"

# Local snakemake storage cache
local-storage-prefix: "/path/to/.snakemake/storage"

# Chromosomes to process
chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

# Relatedness estimation
relatedness:
    method: "king"  # Options: "0", "king"
    king_cutoff: 0.0884

SEX_CHECK: false
GRM: true
thin: false

# Ancestry analysis
ancestry:
    threshold: 0.8
    model: "pca"  # Options: pca, umap, vae, rfmix

# Local ancestry (RFMix)
localAncestry:
    RFMIX: false
    test: false
    thin_subjects: 0.1
    figures: "figures"
    chromosomes: null

# Internal PCA
internalPCA:
    plot: true
    color_by: null
    phenotype_file: null

See Genomics for detailed descriptions of all configuration options.
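To make the INPUT format detection described above concrete, here is a rough shell sketch of how a path template could be classified and how the {CHR} placeholder could be expanded. Both `detect_input_format` and `expand_chr_template` are hypothetical helpers written for this illustration; the pipeline's own detection logic may differ.

```shell
#!/bin/sh
# Sketch only: classify an INPUT value by its extension and by whether
# it contains the {CHR} placeholder. Hypothetical helper, not pipeline code.
detect_input_format() {
    case "$1" in
        *"{CHR}"*.vcf.gz|*"{CHR}"*.vcf) echo "per-chromosome VCF" ;;
        *.bed)  echo "whole-genome BED" ;;
        *.pgen) echo "whole-genome PGEN" ;;
        *)      echo "unknown" ;;
    esac
}

# Substitute one chromosome number for {CHR} in a template path.
# In a basic regular expression, the braces are matched literally.
expand_chr_template() {
    printf '%s\n' "$1" | sed "s/{CHR}/$2/g"
}

# Example: list the per-chromosome files a template would refer to.
# for chr in 1 2 3; do expand_chr_template "/path/to/vcf/chr{CHR}.vcf.gz" "$chr"; done
```

A check like this can be run against your own INPUT value before starting a long job, for example to confirm that every expanded per-chromosome path actually exists.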

Running the Pipeline 

Choose your execution method based on your setup:

If you loaded the HPC module, use the gdcgenomicsqc wrapper script:

cd GDCGenomicsQC/workflow
gdcgenomicsqc --configfile ../config/config.yaml

Or use snakemake directly with the HPC profile:

cd GDCGenomicsQC/workflow
snakemake --profile=../profiles/hpc --configfile ../config/config.yaml

If you loaded the sandbox module, use the gdcgenomicsqc wrapper script:

cd GDCGenomicsQC/workflow
gdcgenomicsqc --configfile ../config/config.yaml

Or use snakemake directly with the sandbox profile:

cd GDCGenomicsQC/workflow
snakemake --profile=../profiles/sandbox --configfile ../config/config.yaml

If you are using your own Snakemake installation, for HPC execution:

cd GDCGenomicsQC/workflow
snakemake --profile=../profiles/hpc --configfile ../config/config.yaml

Interactive/Testing:

cd GDCGenomicsQC/workflow
snakemake --profile=../profiles/interactive --configfile ../config/config.yaml

Running Specific Rules 

Run only specific parts of the pipeline by specifying the rule name:

# Run only ancestry classification
snakemake --profile=../profiles/hpc --configfile ../config/config.yaml classifyAncestry

# Run only initial QC
snakemake --profile=../profiles/hpc --configfile ../config/config.yaml initialFilter

# Run only RFMix
snakemake --profile=../profiles/hpc --configfile ../config/config.yaml RFMIX

Common rule targets include:

initialFilter - Initial sample/SNP quality control
convertPlinkPerChromosome - Per-chromosome conversion and filtering
convertPlinkSingleFile - Single file conversion and filtering
king - Relatedness estimation
estimateAncestry - Global ancestry classification
classifyAncestry - Generate ancestry classifications and plots
RFMIX - Local ancestry inference
phase - Phasing with shapeit4
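When running a single rule target, one cautious pattern is to dry-run the target first and only launch it if the dry run succeeds. The sketch below wraps that pattern in a hypothetical `run_target` function. The SNAKEMAKE, PROFILE, and CONFIG variables are assumptions introduced here for illustration, with defaults taken from the commands above.

```shell
#!/bin/sh
# Sketch: dry-run a rule target, then run it for real if the dry run passes.
# SNAKEMAKE, PROFILE and CONFIG are illustrative variables, not pipeline settings.
SNAKEMAKE=${SNAKEMAKE:-snakemake}
PROFILE=${PROFILE:-../profiles/hpc}
CONFIG=${CONFIG:-../config/config.yaml}

run_target() {
    target=$1
    # -n previews the jobs without executing them; stop here on failure
    "$SNAKEMAKE" -n --profile="$PROFILE" --configfile "$CONFIG" "$target" || return 1
    # Same command without -n performs the actual run
    "$SNAKEMAKE" --profile="$PROFILE" --configfile "$CONFIG" "$target"
}

# Example, run from the workflow directory:
# run_target king
```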

Generating Reports 

Create an HTML report summarizing the workflow:

snakemake --profile=../profiles/hpc \
    --configfile ../config/config.yaml \
    --report --report-stylesheet ../report/stylesheet.css

The report will be generated at workflow/report.html.

Advanced Options 

Parallel Jobs 

Control the number of parallel SLURM jobs:

snakemake --profile=../profiles/hpc --configfile ../config/config.yaml -j 20

Dry Run 

Preview what will be executed without running:

snakemake -n --configfile ../config/config.yaml

Debugging 

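Restrict rerun decisions to file modification times, so that changes to code, parameters, or software environments alone do not trigger re-execution. Jobs left incomplete by a failed run can be rerun with Snakemake's --rerun-incomplete flag.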

snakemake --profile=../profiles/hpc --configfile ../config/config.yaml --rerun-triggers mtime

Master SLURM Job 

The pipeline includes a master SLURM script at workflow/snakemake.SLURM that coordinates all jobs. This is the recommended way to run the full pipeline on HPC.

The master job stays running and dispatches individual jobs to SLURM as needed:

# From the workflow directory
sbatch snakemake.SLURM

Or with a custom config, assuming the script reads the CONFIG environment variable:

sbatch --export=CONFIG=config_custom.yaml snakemake.SLURM

The master script:

workflow/snakemake.SLURM

#!/bin/bash
#SBATCH --job-name=smk_master
#SBATCH --output=snakemake_%j.log
#SBATCH --mem=4G
#SBATCH --time=72:00:00  # Enough time for the whole pipeline

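# Adjust this path to point at your own conda installation
source /users/4/coffm049/miniconda3/etc/profile.d/conda.sh
conda activate snakemake

# The hpc profile is expected to select the SLURM executor (--executor slurm)
snakemake --profile=../profiles/hpc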
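The listing above does not show how a CONFIG value passed through sbatch would be picked up. One possible way to handle it, sketched here under the assumption that the script should fall back to ../config/config.yaml when CONFIG is unset, is a small `resolve_config` helper. The helper name and the fallback path are illustrative and not taken from the actual script.

```shell
#!/bin/sh
# Sketch: pick the config file from the CONFIG environment variable,
# falling back to an assumed default, and fail early if it is missing.
resolve_config() {
    config=${CONFIG:-../config/config.yaml}
    if [ ! -f "$config" ]; then
        echo "config file not found: $config" >&2
        return 1
    fi
    printf '%s\n' "$config"
}

# Possible use inside a SLURM script:
# config=$(resolve_config) || exit 1
# snakemake --profile=../profiles/hpc --configfile "$config"
```

Note that when sbatch is given --export with only a variable assignment, the job may not inherit the rest of your environment; using --export=ALL,CONFIG=config_custom.yaml keeps it.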

Custom SLURM Script 

For more control, create your own SLURM script:

#!/bin/bash
#SBATCH --job-name=gdc_qc
#SBATCH --output=logs/%x_%j.log
#SBATCH --error=logs/%x_%j.err
#SBATCH --time=7-00:00
#SBATCH --mem=64G
#SBATCH --cpus-per-task=8

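# The logs/ directory must exist before submission; SLURM does not create it.
# Load your environment here (module and conda steps) if the job does not inherit it.
cd "$SLURM_SUBMIT_DIR/GDCGenomicsQC/workflow" || exit 1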

snakemake --profile=../profiles/hpc \
    --configfile ../config/config.yaml \
    --jobs 20

Save the script as run_pipeline.sh, then submit it with:

sbatch run_pipeline.sh

Important

Every time you start a new session, you must rerun the environment setup steps:

Load the GDC module (if using module system)
Activate the snakemake conda environment

Example for a new session:

# For MSI HPC:
module use /projects/standard/gdc/public/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake

# For Sandbox:
module use /scratch.global/GDC/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake

# For other HPCs:
module use /path/to/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake
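To avoid retyping these steps, they could be wrapped in a shell function, for example in your ~/.bashrc. The sketch below is one way to do that. The function names and the site labels `msi` and `sandbox` are hypothetical, while the module paths are the ones listed above.

```shell
#!/bin/sh
# Sketch: map a site label to the module path used in this guide.
# The labels "msi" and "sandbox" are illustrative names chosen here.
gdc_module_path() {
    case "$1" in
        msi)     echo "/projects/standard/gdc/public/GDCGenomicsQC/envs" ;;
        sandbox) echo "/scratch.global/GDC/GDCGenomicsQC/envs" ;;
        *)       echo "unknown site: $1" >&2; return 1 ;;
    esac
}

# Run the per-session setup steps for the given site.
# Requires the module and conda commands to be available in your shell.
gdc_setup() {
    path=$(gdc_module_path "$1") || return 1
    module use "$path"
    module load gdcgenomicsqc
    conda activate snakemake
}

# Example at the start of a session:
# gdc_setup msi
```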