Usage
The GDCGenomicsQC pipeline performs quality control on genomic data. It combines standard QC procedures with ancestry estimation and optional features such as local ancestry inference with RFMix.
Software Environment Setup
Before running the pipeline, you need to set up your software environment. Choose the method that matches your HPC setup:
If your HPC has the GDC module pre-configured:
Step 1: Add module path and load the GDC module
module use /projects/standard/gdc/public/GDCGenomicsQC/envs
module load gdcgenomicsqc
Step 2: Activate snakemake environment
conda activate snakemake
Step 3: Verify installation
cd GDCGenomicsQC
snakemake --version
What the module provides:
Running the pipeline:
cd GDCGenomicsQC/workflow
gdcgenomicsqc --configfile ../config/config.yaml
Or with snakemake directly:
cd GDCGenomicsQC/workflow
snakemake --profile ../profiles/hpc --configfile ../config/config.yaml
If your sandbox environment has the GDC module pre-configured:
Step 1: Add module path and load the GDC module
module use /scratch.global/GDC/GDCGenomicsQC/envs
module load gdcgenomicsqc
Step 2: Activate snakemake environment
conda activate snakemake
Step 3: Verify installation
cd GDCGenomicsQC
snakemake --version
What the module provides:
Running the pipeline:
cd GDCGenomicsQC/workflow
gdcgenomicsqc --configfile ../config/config.yaml
Or with snakemake directly:
cd GDCGenomicsQC/workflow
snakemake --profile ../profiles/sandbox --configfile ../config/config.yaml
If you’re using your own Snakemake installation:
Step 1: Create the conda environment
# Clone the repository
git clone https://github.com/UMN-GDC/GDCGenomicsQC.git
cd GDCGenomicsQC
# Create the snakemake environment
conda env create -f envs/snakemake.yml
conda activate snakemake
Step 2: Verify installation
snakemake --version
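Beyond printing the version, it can help to confirm that the installed Snakemake is recent enough. The sketch below assumes a minimum of 8.0.0, chosen only because the `--executor` option mentioned later in this guide was introduced in Snakemake 8. The pipeline's actual minimum version is not stated here, and `version_at_least` and `check_snakemake` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if version $1 is >= version $2.
# Relies on `sort -V` (version sort, available in GNU coreutils).
version_at_least() {
    local have="$1" need="$2"
    [ "$(printf '%s\n%s\n' "$need" "$have" | sort -V | head -n 1)" = "$need" ]
}

# Assumed minimum; adjust to whatever the pipeline actually requires.
MIN_SNAKEMAKE="8.0.0"

# Report whether snakemake is on PATH and new enough.
check_snakemake() {
    if ! command -v snakemake >/dev/null 2>&1; then
        echo "snakemake not found on PATH; activate the conda environment first" >&2
        return 1
    fi
    local v
    v="$(snakemake --version)"
    if version_at_least "$v" "$MIN_SNAKEMAKE"; then
        echo "snakemake $v OK"
    else
        echo "snakemake $v is older than $MIN_SNAKEMAKE" >&2
        return 1
    fi
}
```

After sourcing the file, `check_snakemake` can be run once per session, right after `conda activate snakemake`.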
Running the pipeline:
cd GDCGenomicsQC/workflow
snakemake --profile ../profiles/hpc --configfile ../config/config.yaml
See also: Installation for detailed setup options including Singularity-only environments.
Workflow Overview
The pipeline processes input data through a structured sequence of stages:
Overview of the GDC Genomics QC Pipeline stages.
Initial QC: Sample and SNP filtering using PLINK
Relatedness: KING/PC-AiR/PC-Relate for kinship estimation
Standard QC: GWAS-level filters (MAF, HWE, missingness)
Phasing: Haplotype estimation via shapeit4
Global Ancestry: PCA/UMAP/VAE with Random Forest classification
Local Ancestry: RFMix for segment-level ancestry inference
Per-Ancestry QC: Ancestry-specific quality control
For more details on each module, see Genomics.
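To make the Standard QC stage more concrete, the sketch below prints a PLINK 2 command that applies the three kinds of filter named above (MAF, HWE, missingness). The thresholds are commonly used illustrative values, not the ones this pipeline applies, and `standard_qc_cmd` is a hypothetical helper. The PLINK flags themselves (`--maf`, `--hwe`, `--geno`, `--mind`) are standard.

```shell
#!/usr/bin/env bash
# Illustrative thresholds only; they are NOT read from this pipeline's config.
MAF_MIN="0.01"    # drop variants with minor allele frequency below this
HWE_P="1e-6"      # drop variants failing Hardy-Weinberg at this p-value
GENO_MAX="0.05"   # drop variants with more than 5% missing calls
MIND_MAX="0.05"   # drop samples with more than 5% missing calls

# Hypothetical helper: print (not run) a plink2 command for the given
# input and output file prefixes.
standard_qc_cmd() {
    local in_prefix="$1" out_prefix="$2"
    printf 'plink2 --bfile %s --maf %s --hwe %s --geno %s --mind %s --make-bed --out %s\n' \
        "$in_prefix" "$MAF_MIN" "$HWE_P" "$GENO_MAX" "$MIND_MAX" "$out_prefix"
}
```

Printing the command rather than running it keeps the example safe to try on a login node. The pipeline itself runs its filters through Snakemake rules.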
Configuration
All pipeline options are configured via the config/config.yaml file. This replaces
the older command-line flag approach.
Basic Configuration
The INPUT parameter specifies your input genomic data. The pipeline automatically
detects the format based on the file extension and whether {CHR} is present:
# Input genomic data template. Supports:
# - Per-chromosome VCF: "/path/to/vcf/chr{CHR}.vcf.gz" (use {CHR} placeholder)
# - Whole genome BED: "/path/to/data/merged.bed"
# - Whole genome PGEN: "/path/to/data/merged.pgen"
INPUT: "/path/to/vcf/chr{CHR}.vcf.gz"
# Alternative VCF template for ABCD-style paths (optional)
vcf_template: null
# Output directory for pipeline results
OUT_DIR: "/path/to/output/directory"
# Reference data directory
REF: "/path/to/reference/data"
# Local snakemake storage cache
local-storage-prefix: "/path/to/.snakemake/storage"
# Chromosomes to process
chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
# Relatedness estimation
relatedness:
  method: "king" # Options: "0", "king"
  king_cutoff: 0.0884
SEX_CHECK: false
GRM: true
thin: false
# Ancestry analysis
ancestry:
  threshold: 0.8
  model: "pca" # Options: pca, umap, vae, rfmix
# Local ancestry (RFMix)
localAncestry:
  RFMIX: false
  test: false
  thin_subjects: 0.1
  figures: "figures"
  chromosomes: null
# Internal PCA
internalPCA:
  plot: true
  color_by: null
  phenotype_file: null
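The format detection described for INPUT can be pictured with a small shell sketch. It classifies a template by extension and by the presence of `{CHR}`, then expands the placeholder for a list of chromosomes. This is an illustration of the rule as described, not the pipeline's own detection code, and both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify an INPUT template by extension and {CHR}.
detect_input_format() {
    local template="$1"
    case "$template" in
        *'{CHR}'*.vcf.gz|*'{CHR}'*.vcf) echo "per-chromosome VCF" ;;
        *.bed)  echo "whole-genome BED" ;;
        *.pgen) echo "whole-genome PGEN" ;;
        *)      echo "unknown"; return 1 ;;
    esac
}

# Hypothetical helper: print one path per chromosome by replacing {CHR}.
# Usage: expand_template TEMPLATE CHR [CHR ...]
expand_template() {
    local template="$1"; shift
    local chr
    for chr in "$@"; do
        echo "${template//\{CHR\}/$chr}"
    done
}
```

For example, `expand_template "/path/to/vcf/chr{CHR}.vcf.gz" 1 2 | xargs ls -l` is a quick way to confirm that the per-chromosome files exist before starting a run.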
See Genomics for detailed descriptions of all configuration options.
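The default `king_cutoff` of 0.0884 matches the lower bound of the second-degree range in KING's published kinship thresholds. The sketch below maps a kinship coefficient to those ranges and tests it against a cutoff. It assumes the cutoff is applied as "kinship greater than this value counts as related" (the convention of PLINK 2's `--king-cutoff`); how the pipeline applies it is not shown here, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: name the KING relationship range for a kinship value.
# Boundaries follow the KING documentation: 0.354, 0.177, 0.0884, 0.0442.
kinship_degree() {
    awk -v k="$1" 'BEGIN {
        k += 0
        if      (k > 0.354)   print "duplicate/MZ twin"
        else if (k >= 0.177)  print "first-degree"
        else if (k >= 0.0884) print "second-degree"
        else if (k >= 0.0442) print "third-degree"
        else                  print "unrelated"
    }'
}

# Hypothetical helper: succeed if kinship $1 is strictly above cutoff $2.
exceeds_cutoff() {
    awk -v k="$1" -v c="$2" 'BEGIN { exit !((k + 0) > (c + 0)) }'
}
```

Under that assumption, a pair with kinship 0.1 would be treated as related at the default cutoff, while a pair at 0.05 (third-degree range) would not.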
Running the Pipeline
Choose your execution method based on your setup:
On an HPC with the GDC module, use the gdcgenomicsqc wrapper script:
cd GDCGenomicsQC/workflow
gdcgenomicsqc --configfile ../config/config.yaml
Or use snakemake directly with the HPC profile:
cd GDCGenomicsQC/workflow
snakemake --profile=../profiles/hpc --configfile ../config/config.yaml
In the sandbox environment, use the gdcgenomicsqc wrapper script:
cd GDCGenomicsQC/workflow
gdcgenomicsqc --configfile ../config/config.yaml
Or use snakemake directly with the sandbox profile:
cd GDCGenomicsQC/workflow
snakemake --profile=../profiles/sandbox --configfile ../config/config.yaml
With your own Snakemake installation, for HPC execution:
cd GDCGenomicsQC/workflow
snakemake --profile=../profiles/hpc --configfile ../config/config.yaml
Interactive/Testing:
cd GDCGenomicsQC/workflow
snakemake --profile=../profiles/interactive --configfile ../config/config.yaml
Running Specific Rules
Run only specific parts of the pipeline by specifying the rule name:
# Run only ancestry classification
snakemake --profile=../profiles/hpc --configfile ../config/config.yaml classifyAncestry
# Run only initial QC
snakemake --profile=../profiles/hpc --configfile ../config/config.yaml initialFilter
# Run only RFMix
snakemake --profile=../profiles/hpc --configfile ../config/config.yaml RFMIX
Common rule targets include:
initialFilter - Initial sample/SNP quality control
convertPlinkPerChromosome - Per-chromosome conversion and filtering
convertPlinkSingleFile - Single file conversion and filtering
king - Relatedness estimation
estimateAncestry - Global ancestry classification
classifyAncestry - Generate ancestry classifications and plots
RFMIX - Local ancestry inference
phase - Phasing with shapeit4
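Rule names are case-sensitive, and Snakemake accepts several targets in one invocation. The sketch below checks requested targets against the names listed above before calling snakemake, so a typo such as `rfmix` is caught immediately. The helper names are hypothetical, and the list is copied from this page rather than queried from the workflow (`snakemake --list-rules` shows the authoritative set).

```shell
#!/usr/bin/env bash
# Rule names copied from the list of common targets; may be incomplete.
KNOWN_TARGETS="initialFilter convertPlinkPerChromosome convertPlinkSingleFile king estimateAncestry classifyAncestry RFMIX phase"

# Hypothetical helper: succeed if $1 is one of the known target names.
is_known_target() {
    local t
    for t in $KNOWN_TARGETS; do
        [ "$t" = "$1" ] && return 0
    done
    return 1
}

# Hypothetical helper: validate every target, then run them together.
# Returns 2 without calling snakemake if any name is unknown.
run_targets() {
    local t
    for t in "$@"; do
        is_known_target "$t" || { echo "unknown target: $t" >&2; return 2; }
    done
    snakemake --profile=../profiles/hpc --configfile ../config/config.yaml "$@"
}
```

For example, `run_targets initialFilter king` would request both rules in a single run from the workflow directory.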
Generating Reports
Create an HTML report summarizing the workflow:
snakemake --profile=../profiles/hpc \
--configfile ../config/config.yaml \
--report --report-stylesheet ../report/stylesheet.css
The report will be generated at workflow/report.html.
Advanced Options
Parallel Jobs
Control the number of parallel SLURM jobs:
snakemake --profile=../profiles/hpc --configfile ../config/config.yaml -j 20
Dry Run
Preview what will be executed without running:
snakemake -n --configfile ../config/config.yaml
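A dry run is most useful as a gate in front of the real run. The sketch below checks that the config file exists, performs a dry run, and only then launches the pipeline with the HPC profile. `safe_run` is a hypothetical wrapper, not part of the pipeline, and the profile path assumes you are in the workflow directory.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: dry-run first, then run for real if the dry run succeeds.
# Usage: safe_run CONFIG [TARGET ...]
safe_run() {
    local config="$1"
    if [ -z "$config" ] || [ ! -f "$config" ]; then
        echo "config not found: $config" >&2
        return 1
    fi
    shift
    # Stop here if the dry run reports a problem (bad config, missing input, ...).
    snakemake -n --configfile "$config" "$@" || return
    snakemake --profile=../profiles/hpc --configfile "$config" "$@"
}
```

A typical call would be `safe_run ../config/config.yaml`.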
Debugging
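After fixing a failure, re-run the pipeline using only file modification times to decide which jobs are out of date (changes to code or parameters alone will not trigger a rerun):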
snakemake --profile=../profiles/hpc --configfile ../config/config.yaml --rerun-triggers mtime
Master SLURM Job
The pipeline includes a master SLURM script at workflow/snakemake.SLURM that
coordinates all jobs. This is the recommended way to run the full pipeline on HPC.
The master job stays running and dispatches individual jobs to SLURM as needed:
# From the workflow directory
sbatch snakemake.SLURM
Or with a custom config:
sbatch --export=CONFIG=config_custom.yaml snakemake.SLURM
The master script:
#!/bin/bash
#SBATCH --job-name=smk_master
#SBATCH --output=snakemake_%j.log
#SBATCH --mem=4G
#SBATCH --time=72:00:00 # Enough time for the whole pipeline
source /users/4/coffm049/miniconda3/etc/profile.d/conda.sh
conda activate snakemake
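# Jobs are dispatched through Snakemake's SLURM executor (--executor slurm),
# presumably configured in the hpc profile since it is not passed here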
snakemake --profile=../profiles/hpc
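Because the master job runs for the whole pipeline, it is convenient to capture its job ID at submission and know which log file to watch. The sketch below uses `sbatch --parsable`, which prints only the job ID, and derives the log name from the `--output=snakemake_%j.log` directive in the script above. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: log file name produced by "--output=snakemake_%j.log"
# for a given job ID.
master_log_name() {
    echo "snakemake_$1.log"
}

# Hypothetical helper: submit the master job and print how to follow it.
# `sbatch --parsable` prints the job ID, optionally followed by ";cluster".
submit_master() {
    local out jobid
    out="$(sbatch --parsable snakemake.SLURM)" || return 1
    jobid="${out%%;*}"
    echo "Submitted master job $jobid"
    echo "Follow it with: tail -f $(master_log_name "$jobid")"
}
```

Run `submit_master` from the workflow directory. `squeue -u "$USER"` then shows the master job alongside the jobs it dispatches.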
Custom SLURM Script
For more control, create your own SLURM script:
#!/bin/bash
#SBATCH --job-name=gdc_qc
#SBATCH --output=logs/%x_%j.log
#SBATCH --error=logs/%x_%j.err
#SBATCH --time=7-00:00
#SBATCH --mem=64G
#SBATCH --cpus-per-task=8
# Quote the path so a submit directory containing spaces still works
cd "$SLURM_SUBMIT_DIR/GDCGenomicsQC/workflow"
snakemake --profile=../profiles/hpc \
--configfile ../config/config.yaml \
--jobs 20
Save the script (here as run_pipeline.sh) and submit it with:
sbatch run_pipeline.sh
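One detail is easy to miss with the script above: SLURM does not create missing directories for `--output` and `--error` paths, so `logs/` has to exist before the job starts. A minimal sketch of a submit helper that takes care of this follows. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: make sure the log directory exists (default: logs).
prepare_logs_dir() {
    local dir="${1:-logs}"
    mkdir -p "$dir"
}

# Hypothetical helper: create logs/ in the current directory, then submit.
# Usage: submit_pipeline [SCRIPT]   (default: run_pipeline.sh)
submit_pipeline() {
    local script="${1:-run_pipeline.sh}"
    if [ ! -f "$script" ]; then
        echo "script not found: $script" >&2
        return 1
    fi
    prepare_logs_dir logs && sbatch "$script"
}
```

Call `submit_pipeline` from the directory you would normally run `sbatch` in, since the `logs/%x_%j.log` path is resolved relative to the submission directory.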
Important
Every time you start a new session, you must rerun the environment setup steps:
Load the GDC module (if using module system)
Activate the snakemake conda environment
Example for a new session:
# For MSI HPC:
module use /projects/standard/gdc/public/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake
# For Sandbox:
module use /scratch.global/GDC/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake
# For other HPCs:
module use /path/to/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake
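Since these three commands are needed in every session, they can be wrapped in a small shell function. The sketch below maps a site label to the module path used in the examples above and then runs the setup. The labels `msi` and `sandbox` and the function names are hypothetical; on other HPCs, add a case for your own path.

```shell
#!/usr/bin/env bash
# Hypothetical helper: module path for a site label (paths copied from this page).
gdc_module_path() {
    case "$1" in
        msi)     echo "/projects/standard/gdc/public/GDCGenomicsQC/envs" ;;
        sandbox) echo "/scratch.global/GDC/GDCGenomicsQC/envs" ;;
        *)       echo "unknown site: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: run the per-session setup for one site.
# Define it in the current shell (for example via ~/.bashrc) so that
# `module` and `conda activate` affect your session.
gdc_setup() {
    local path
    path="$(gdc_module_path "$1")" || return 1
    module use "$path" && module load gdcgenomicsqc && conda activate snakemake
}
```

With the functions defined in `~/.bashrc`, a new session only needs `gdc_setup msi` or `gdc_setup sandbox`.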