Installation
This guide covers installing and configuring the GDCGenomicsQC pipeline.
Automatic Software Installation
The pipeline uses Snakemake’s built-in conda support to automatically install
software dependencies defined in rule-level conda: directives. This means:
No manual installation of PLINK, bcftools, GATK, shapeit4, rfmix, etc.
Each rule can specify its own conda environment
Singularity containers are pulled automatically when using --use-singularity
Choose the installation method that matches your environment:
This scenario uses pre-installed modules and pre-cached Singularity images. Ideal for standard HPC environments like MSI at UMN.
Prerequisites:
Access to MSI HPC with SLURM scheduler
Module system available
Setup:
module use /projects/standard/gdc/public/GDCGenomicsQC/envs
module load gdcgenomicsqc
# Verify environment is set up
echo $SINGULARITY_CACHEDIR
echo $SNAKEMAKE_SINGULARITY_PREFIX
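The two echo commands above only print the variables. As a minimal sketch, the same check can be made to fail loudly when either one is unset. The helper name check_gdc_env is hypothetical and is not provided by the module.

```shell
# Hypothetical helper (not part of the gdcgenomicsqc module):
# returns 0 only if both variables exported by the module are non-empty.
check_gdc_env() {
    status=0
    if [ -z "${SINGULARITY_CACHEDIR:-}" ]; then
        echo "SINGULARITY_CACHEDIR is not set" >&2
        status=1
    fi
    if [ -z "${SNAKEMAKE_SINGULARITY_PREFIX:-}" ]; then
        echo "SNAKEMAKE_SINGULARITY_PREFIX is not set" >&2
        status=1
    fi
    return "$status"
}
```

Running check_gdc_env right after module load gdcgenomicsqc gives a non-zero exit status if the module did not set up the environment as expected.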
What the module provides:
The gdcgenomicsqc module sets up:
Snakemake availability:
The module does NOT provide Snakemake. You must have Snakemake available through one of these methods:
Clone the repository (if not already available):
git clone https://github.com/UMN-GDC/GDCGenomicsQC.git
cd GDCGenomicsQC
Run:
cd GDCGenomicsQC/workflow
snakemake --profile ../profiles/sandbox --configfile /path/to/your/config.yaml
Or using the wrapper script (after loading the module):
gdcgenomicsqc --configfile /path/to/your/config.yaml
This scenario uses pre-installed modules and pre-cached Singularity images. Ideal for sandbox or testing environments.
Prerequisites:
Access to sandbox environment with SLURM scheduler
Module system available
Setup:
module use /scratch.global/GDC/GDCGenomicsQC/envs
module load gdcgenomicsqc
# Verify environment is set up
echo $SINGULARITY_CACHEDIR
echo $SNAKEMAKE_SINGULARITY_PREFIX
What the module provides:
The gdcgenomicsqc module sets up:
Snakemake availability:
The module does NOT provide Snakemake. You must have Snakemake available through one of these methods:
Clone the repository (if not already available):
git clone https://github.com/UMN-GDC/GDCGenomicsQC.git
cd GDCGenomicsQC
Run:
cd GDCGenomicsQC/workflow
snakemake --profile ../profiles/sandbox --configfile /path/to/your/config.yaml
Or using the wrapper script (after loading the module):
gdcgenomicsqc --configfile /path/to/your/config.yaml
If you’re on an HPC system without the GDCGenomicsQC module, set up manually.
Prerequisites:
Access to HPC with SLURM scheduler
Git
Conda or Mamba
Singularity/Apptainer (check with which apptainer or which singularity)
1. Clone the Repository
git clone https://github.com/UMN-GDC/GDCGenomicsQC.git
cd GDCGenomicsQC
2. Set Up Snakemake Environment
# Create a conda/mamba environment
conda env create -n snakemake -f envs/snakemake.yml
conda activate snakemake
3. Configure Apptainer/Cachedir (Optional)
If you want to pre-pull container images for offline use:
export APPTAINER_CACHEDIR=/path/to/container/cache
export SNAKEMAKE_APPTAINER_PREFIX=/path/to/container/cache
4. Configure Your Run
Edit the configuration file at config/config.yaml to specify:
Input and output paths
Reference data locations
Pipeline options (relatedness, ancestry methods, etc.)
Example configuration:
INPUT: "/path/to/your/vcf/chr{CHR}.vcf.gz"
OUT_DIR: "/path/to/output/directory"
REF: "/path/to/reference/data"
local-storage-prefix: "/path/to/.snakemake/storage"
chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
relatedness:
  method: "king"
  king_cutoff: 0.0884
localAncestry:
  RFMIX: true
test: true
thin_subjects: 0.1
figures: "figures"
thin: false
5. Run
cd GDCGenomicsQC/workflow
snakemake --profile ../profiles/hpc --configfile /path/to/your/config.yaml
Requesting module installation: Contact your HPC administrators with:
The path to the repository: /path/to/GDCGenomicsQC
The module location: /path/to/GDCGenomicsQC/envs/gdcgenomicsMSI
The wrapper script: /path/to/GDCGenomicsQC/envs/gdcgenomicsMSI/bin/gdcgenomicsqc
For local execution without SLURM. Useful for testing and small datasets.
Prerequisites:
Git
Conda or Mamba
4+ CPU cores recommended
16GB+ RAM for typical analyses
1. Clone the Repository
git clone https://github.com/UMN-GDC/GDCGenomicsQC.git
cd GDCGenomicsQC
2. Set Up Snakemake Environment
# Create a conda/mamba environment
conda env create -n snakemake -f envs/snakemake.yml
conda activate snakemake
3. Configure Your Run
Edit the configuration file at config/config.yaml to specify:
Input and output paths
Reference data locations
Pipeline options (relatedness, ancestry methods, etc.)
Example configuration:
INPUT: "/path/to/your/vcf/chr{CHR}.vcf.gz"
OUT_DIR: "/path/to/output/directory"
REF: "/path/to/reference/data"
local-storage-prefix: "/path/to/.snakemake/storage"
chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
relatedness:
  method: "king"
  king_cutoff: 0.0884
localAncestry:
  RFMIX: true
test: true
thin_subjects: 0.1
figures: "figures"
thin: false
4. Run
cd GDCGenomicsQC/workflow
snakemake --profile ../profiles/interactive --configfile /path/to/your/config.yaml
Or for simple local execution (no profile):
snakemake --cores=4 --use-conda \
    --configfile /path/to/config.yaml \
    --directory /path/to/GDCGenomicsQC/workflow \
    --snakefile /path/to/GDCGenomicsQC/workflow/Snakefile
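Before launching a run, it can help to confirm that the INPUT pattern from the example configuration points at real files. The sketch below assumes that {CHR} is replaced by each entry of the chromosomes list, which the configuration suggests but this guide does not state outright. The function name missing_input_files is hypothetical.

```shell
# Hypothetical helper: print every per-chromosome path that does not exist.
# $1 is the INPUT pattern containing the literal placeholder {CHR};
# the remaining arguments are the chromosome labels to substitute.
missing_input_files() {
    pattern=$1
    shift
    for chr in "$@"; do
        # Substitute the placeholder with the current chromosome label.
        path=$(printf '%s' "$pattern" | sed "s/{CHR}/$chr/g")
        [ -e "$path" ] || printf '%s\n' "$path"
    done
}

# Example call (paths are placeholders, as in the configuration above):
# missing_input_files "/path/to/your/vcf/chr{CHR}.vcf.gz" $(seq 1 22)
```

An empty result means every expected file was found. Any printed path is a file the pipeline would presumably fail to locate.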
If your HPC provides Singularity/Apptainer but you prefer not to use conda for Snakemake, you can install Snakemake via pip:
Prerequisites:
Singularity/Apptainer
Python 3.8+
pip
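The prerequisites above can be checked from the shell before installing anything. This is a rough sketch under the assumption that the container runtime is on PATH as either apptainer or singularity. Both function names are hypothetical.

```shell
# Hypothetical helper: print the first container runtime found on PATH.
detect_container_runtime() {
    for candidate in apptainer singularity; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Hypothetical helper: succeed if version $1 is at least version $2,
# comparing only the major and minor components (e.g. 3.11.2 vs 3.8).
version_at_least() {
    have_major=${1%%.*}
    rest=${1#*.}
    have_minor=${rest%%.*}
    want_major=${2%%.*}
    rest=${2#*.}
    want_minor=${rest%%.*}
    if [ "$have_major" -ne "$want_major" ]; then
        [ "$have_major" -gt "$want_major" ]
        return
    fi
    [ "$have_minor" -ge "$want_minor" ]
}

# Example use (not executed here):
# detect_container_runtime || echo "no container runtime found" >&2
# py=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
# version_at_least "$py" 3.8 || echo "Python 3.8+ is required" >&2
```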
1. Install Snakemake via pip
pip install snakemake snakemake-executor-plugin-slurm
2. Clone the Repository
git clone https://github.com/UMN-GDC/GDCGenomicsQC.git
cd GDCGenomicsQC
3. Configure Container Cachedir
export SINGULARITY_CACHEDIR=/path/to/container/cache
export APPTAINER_CACHEDIR=/path/to/container/cache
4. Run with Singularity
cd GDCGenomicsQC/workflow
snakemake --use-singularity --profile ../profiles/hpc \
--configfile /path/to/your/config.yaml
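The cache configuration in step 3 can be wrapped in a small function so that the directory is created before the variables point at it. This is only a sketch. The function name setup_container_cache is hypothetical, while the two variable names are the ones used in step 3.

```shell
# Hypothetical helper: create the cache directory if needed, then export
# both cache variables so either runtime name picks up the same location.
setup_container_cache() {
    cache_dir=$1
    mkdir -p "$cache_dir" || return 1
    export SINGULARITY_CACHEDIR="$cache_dir"
    export APPTAINER_CACHEDIR="$cache_dir"
}

# Example call (path is a placeholder):
# setup_container_cache /path/to/container/cache
```

Pointing both variables at one directory avoids keeping two copies of the same images when a system exposes the runtime under both names.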
Prerequisite Software
Regardless of loading method, the following software is required:
| Software | Required By | Notes |
|---|---|---|
| Snakemake (8+) | Pipeline execution | Conda recipe provided in |
| snakemake-executor-plugin-slurm | HPC job submission | Required for SLURM profiles |
| Singularity/Apptainer | Containerized tools | MSI module: |
| SLURM scheduler | HPC job scheduling | For profiles/hpc and profiles/sandbox |
| Conda or Mamba | Environment management | Mamba recommended for faster solving |
| Git | Repository access | For cloning the repository |
Software Loading Methods
The pipeline supports multiple ways to access its dependencies. Choose the method that matches your HPC environment:
| Method | Best For | Setup Required |
|---|---|---|
| Module System (MSI) | MSI HPC clusters | |
| Module System (Sandbox) | Sandbox environments | |
| Conda Environment | Custom HPC or local | |
| Singularity/Apptainer | Container-based HPC | Pull images manually |
| System-wide Install | Local development | |
Software Environment Summary
| Software | Conda Command | Module Command (MSI) | Module Command (Sandbox) |
|---|---|---|---|
| Snakemake | | | |
| GDC Pipeline | (via containers) | | |
| Apptainer | N/A | | |
| SLURM | N/A | (Usually default on HPC) | (Usually default on HPC) |
External Dependencies
All software dependencies are handled automatically through conda environments and Singularity containers. You only need:
Snakemake (installed via conda as shown above)
Access to reference data (e.g., 1000 Genomes Project)
Sufficient storage for intermediate and output files
Appropriate HPC resources (see profile configurations)
No manual installation of external tools (PLINK, bcftools, GATK, etc.) is required.
Software Environment Files
The pipeline includes the following environment definitions in envs/:
| File | Purpose |
|---|---|
| | Snakemake and SLURM executor plugin |
| | General genomic utilities (PLINK, bcftools, etc.) |
| | RFMix for local ancestry inference |
| | Phenotype simulation tools |
| | Ancestry reporting and visualization |
| | Mash distance estimation |
| | Karyotype visualization |
Container Images
The pipeline uses Singularity/Apptainer containers for reproducibility. Images
are automatically pulled based on rule-level container: directives.
| Image | Contains |
|---|---|
| | Ancestry reporting environment |
| | RFMix local ancestry inference |
| | Mash distance estimation |
| | Phenotype simulation tools |
Troubleshooting
If jobs fail to start:
Verify SLURM is available:
sbatch --version
Verify Snakemake is available:
snakemake --version
Check that your config paths are correct
Ensure output directories are writable
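The checks in this list can be bundled into one preflight function. The sketch below only tests that sbatch and snakemake are on PATH and that the output directory exists and is writable. It does not validate the configuration file itself. The name preflight is hypothetical.

```shell
# Hypothetical helper: report missing tools and an unusable output
# directory; returns non-zero if any check fails.
preflight() {
    out_dir=$1
    status=0
    for tool in sbatch snakemake; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not found on PATH: $tool" >&2
            status=1
        fi
    done
    if [ ! -d "$out_dir" ] || [ ! -w "$out_dir" ]; then
        echo "output directory missing or not writable: $out_dir" >&2
        status=1
    fi
    return "$status"
}

# Example call (path is a placeholder):
# preflight /path/to/output/directory
```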
If conda environments fail to resolve:
Use mamba instead of conda for faster solving
Set in config:
conda-frontend: mamba
If containers fail to pull:
Check network connectivity
Configure cachedir:
export SINGULARITY_CACHEDIR=/path/to/large/disk
For additional help, see the Usage guide or open an issue on GitHub.
Important
Every time you start a new session, you must rerun the environment setup steps:
Load the GDC module (if using module system)
Activate the snakemake conda environment
Example for a new session:
# For MSI HPC:
module use /projects/standard/gdc/public/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake
# For Sandbox:
module use /scratch.global/GDC/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake
# For other HPCs:
module use /path/to/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake
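Because these three commands are repeated in every session, they can be collected in a shell function, for example in a startup file. This is a sketch, assuming module and conda are already initialised in the shell. The function name gdc_session_setup is hypothetical.

```shell
# Hypothetical helper: run the per-session setup for a given module
# directory. Stops early if the module cannot be found or loaded.
gdc_session_setup() {
    module_dir=$1
    module use "$module_dir" || return 1
    module load gdcgenomicsqc || return 1
    conda activate snakemake
}

# Example calls, using the paths listed above:
# gdc_session_setup /projects/standard/gdc/public/GDCGenomicsQC/envs   # MSI HPC
# gdc_session_setup /scratch.global/GDC/GDCGenomicsQC/envs             # Sandbox
```

The function has to be defined in the current shell (sourced, not run as a separate script), since module load and conda activate change the environment of the shell that calls them.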