Tutorial: Quality Control Pipeline in Practice

This tutorial provides hands-on experience running the quality control (QC) pipeline in GDCGenomicsQC. The pipeline implements a two-stage approach: Initial QC (sample and variant missingness filtering) followed by Standard QC (MAF, HWE, heterozygosity, and optional sex checking).

Estimated completion time: 20-30 minutes

Learning objectives:

Run the Initial QC and Standard QC pipelines
Interpret output plots and data files
Configure QC thresholds for different study designs
Apply ancestry-specific QC workflows

Prerequisites

Setup:

Before starting, ensure you have access to Snakemake and the GDCGenomicsQC workflow. For detailed installation instructions, see:

Installation - Software setup (module, conda, or other methods)
Usage - Running the pipeline

If you’re using the MSI HPC cluster:

module use /projects/standard/gdc/public/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake

Verify installation:

cd GDCGenomicsQC
snakemake --version

If you’re using the Sandbox environment:

module use /scratch.global/GDC/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake

Verify installation:

cd GDCGenomicsQC
snakemake --version

If your HPC has the GDC module pre-configured:

# Replace with your HPC's module path:
module use /path/to/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake

Verify installation:

cd GDCGenomicsQC
snakemake --version

If you’re using your own Snakemake installation:

conda activate snakemake
cd GDCGenomicsQC

Verify installation:

snakemake --version

Data Requirements:

Reference data configured (see Tutorial: Assembling 1000 Genomes Reference Data)
Genotype data in VCF, BED, or PGEN format

DAG Visualization

The pipeline DAG up to the run_initialQC rule shows the Initial QC workflow:

        ---
title: DAG
---
flowchart TB
	id0[run_initialQC]
	id1[kgMeta]
	id2[Standard_QC]
	id3[convertNfilt - CHR: 1 - subset: full]
	id4[convertNfilt - CHR: 2 - subset: full]
	id5[convertNfilt - CHR: 3 - subset: full]
	id6[convertNfilt - CHR: 4 - subset: full]
	id7[convertNfilt - CHR: 5 - subset: full]
	id8[convertNfilt - CHR: 6 - subset: full]
	id9[convertNfilt - CHR: 7 - subset: full]
	id10[convertNfilt - CHR: 8 - subset: full]
	id11[convertNfilt - CHR: 9 - subset: full]
	id12[convertNfilt - CHR: 10 - subset: full]
	id13[convertNfilt - CHR: 11 - subset: full]
	id14[convertNfilt - CHR: 12 - subset: full]
	id15[convertNfilt - CHR: 13 - subset: full]
	id16[convertNfilt - CHR: 14 - subset: full]
	id17[convertNfilt - CHR: 15 - subset: full]
	id18[convertNfilt - CHR: 16 - subset: full]
	id19[convertNfilt - CHR: 17 - subset: full]
	id20[convertNfilt - CHR: 18 - subset: full]
	id21[convertNfilt - CHR: 19 - subset: full]
	id22[convertNfilt - CHR: 20 - subset: full]
	id23[convertNfilt - CHR: 21 - subset: full]
	id24[convertNfilt - CHR: 22 - subset: full]
	id25[initialFilter]
	style id0 fill:#57D5D9,stroke-width:2px,color:#333333
	style id1 fill:#57D95F,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id2 fill:#D9A657,stroke-width:2px,color:#333333
	style id3 fill:#ADD957,stroke-width:2px,color:#333333
	style id4 fill:#ADD957,stroke-width:2px,color:#333333
	style id5 fill:#ADD957,stroke-width:2px,color:#333333
	style id6 fill:#ADD957,stroke-width:2px,color:#333333
	style id7 fill:#ADD957,stroke-width:2px,color:#333333
	style id8 fill:#ADD957,stroke-width:2px,color:#333333
	style id9 fill:#ADD957,stroke-width:2px,color:#333333
	style id10 fill:#ADD957,stroke-width:2px,color:#333333
	style id11 fill:#ADD957,stroke-width:2px,color:#333333
	style id12 fill:#ADD957,stroke-width:2px,color:#333333
	style id13 fill:#ADD957,stroke-width:2px,color:#333333
	style id14 fill:#ADD957,stroke-width:2px,color:#333333
	style id15 fill:#ADD957,stroke-width:2px,color:#333333
	style id16 fill:#ADD957,stroke-width:2px,color:#333333
	style id17 fill:#ADD957,stroke-width:2px,color:#333333
	style id18 fill:#ADD957,stroke-width:2px,color:#333333
	style id19 fill:#ADD957,stroke-width:2px,color:#333333
	style id20 fill:#ADD957,stroke-width:2px,color:#333333
	style id21 fill:#ADD957,stroke-width:2px,color:#333333
	style id22 fill:#ADD957,stroke-width:2px,color:#333333
	style id23 fill:#ADD957,stroke-width:2px,color:#333333
	style id24 fill:#ADD957,stroke-width:2px,color:#333333
	style id25 fill:#7ED957,stroke-width:2px,color:#333333
	id2 --> id0
	id3 --> id0
	id4 --> id0
	id5 --> id0
	id6 --> id0
	id7 --> id0
	id8 --> id0
	id9 --> id0
	id10 --> id0
	id11 --> id0
	id12 --> id0
	id13 --> id0
	id14 --> id0
	id15 --> id0
	id16 --> id0
	id17 --> id0
	id18 --> id0
	id19 --> id0
	id20 --> id0
	id21 --> id0
	id22 --> id0
	id23 --> id0
	id24 --> id0
	id25 --> id0
	id25 --> id2
	id1 --> id3
	id1 --> id4
	id1 --> id5
	id1 --> id6
	id1 --> id7
	id1 --> id8
	id1 --> id9
	id1 --> id10
	id1 --> id11
	id1 --> id12
	id1 --> id13
	id1 --> id14
	id1 --> id15
	id1 --> id16
	id1 --> id17
	id1 --> id18
	id1 --> id19
	id1 --> id20
	id1 --> id21
	id1 --> id22
	id1 --> id23
	id1 --> id24
	id1 --> id25
	id3 --> id25
	id4 --> id25
	id5 --> id25
	id6 --> id25
	id7 --> id25
	id8 --> id25
	id9 --> id25
	id10 --> id25
	id11 --> id25
	id12 --> id25
	id13 --> id25
	id14 --> id25
	id15 --> id25
	id16 --> id25
	id17 --> id25
	id18 --> id25
	id19 --> id25
	id20 --> id25
	id21 --> id25
	id22 --> id25
	id23 --> id25
	id24 --> id25

The rule graph provides a cleaner view of rule dependencies:

        ---
title: DAG
---
flowchart TB
	id0[run_initialQC]
	id1[kgMeta]
	id2[Standard_QC]
	id3[convertNfilt]
	id4[initialFilter]
	style id0 fill:#57D5D9,stroke-width:2px,color:#333333
	style id1 fill:#57D95F,stroke-width:2px,color:#333333
	style id2 fill:#D9A657,stroke-width:2px,color:#333333
	style id3 fill:#ADD957,stroke-width:2px,color:#333333
	style id4 fill:#7ED957,stroke-width:2px,color:#333333
	id2 --> id0
	id3 --> id0
	id4 --> id0
	id4 --> id2
	id1 --> id3
	id1 --> id4
	id3 --> id4

Required Input Files

This step requires the following input files:

QC Pipeline Input Files
Input File	Description
`INPUT: "chr{CHR}.vcf.gz"` (or .bed/.pgen)	Per-chromosome VCF, BED, or PGEN files with genotype data
`REF/1000G_highcoverage/population.txt`	Reference panel population labels (for ancestry QC subsets)
`REF/Homo_sapiens.GRCh38.dna.primary_assembly.fa`	Reference genome FASTA (if using reference allele correction)

Input Formats Supported:

The pipeline automatically detects format based on file extension:

Config Parameters for QC:

INPUT: "/path/to/data/chr{CHR}.vcf.gz"  # Per-chromosome VCF
OUT_DIR: "/path/to/output"
REF: "/path/to/reference"
local-storage-prefix: "/path/to/.snakemake/storage"

chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

# QC thresholds
relatedness:
    method: "king"  # "0" for none, "king" for removal
    king_cutoff: 0.0884

SEX_CHECK: true  # Enable/disable sex verification
GRM: true  # Compute genetic relationship matrix
thin: false

See also: Usage for configuration options, Installation for software setup.

Lab Exercise: Running QC Pipeline

Step 1: Create Configuration File

Create a configuration file for QC:

mkdir -p ~/qc_lab
cd ~/qc_lab
cat > config_qc.yaml << 'EOF'
INPUT: "/path/to/data/chr{CHR}.vcf.gz"
OUT_DIR: "/path/to/output/directory"
REF: "/path/to/reference/data"
local-storage-prefix: "/path/to/.snakemake/storage"

chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

relatedness:
    method: "0"
    king_cutoff: 0.0884

SEX_CHECK: true
thin: false
conda-frontend: mamba
EOF

Key parameters:

SEX_CHECK: Enable/disable sex verification (default: true)
relatedness.method: Relatedness filtering method (“0” for none, “king” or “primus” for removal)

Step 2: Run Initial QC

cd GDCGenomicsQC/workflow
gdcgenomicsqc --configfile ../config_qc.yaml full/initialFilter.pgen -j 10

cd GDCGenomicsQC/workflow
gdcgenomicsqc --configfile ../config_qc.yaml full/initialFilter.pgen -j 10

cd GDCGenomicsQC/workflow
gdcgenomicsqc --configfile ../config_qc.yaml full/initialFilter.pgen -j 10

cd GDCGenomicsQC/workflow
snakemake --profile=../profiles/hpc \
    --configfile ../config_qc.yaml \
    full/initialFilter.pgen \
    -j 10

The Initial QC stage performs:

Sample missingness (initial): Removes samples with >10% missing genotypes (--mind 0.1)
Variant missingness: Removes variants with >2% missingness (--geno 0.02)
Sample missingness (final): Removes samples with >2% missingness (--mind 0.02)
LD pruning: Creates pruned dataset for downstream analyses (--indep-pairwise 500 10 0.1)

Step 3: Run Standard QC

gdcgenomicsqc --configfile ../config_qc.yaml full/standardFilter.pgen -j 10

gdcgenomicsqc --configfile ../config_qc.yaml full/standardFilter.pgen -j 10

gdcgenomicsqc --configfile ../config_qc.yaml full/standardFilter.pgen -j 10

snakemake --profile=../profiles/hpc \
    --configfile ../config_qc.yaml \
    full/standardFilter.pgen \
    -j 10

The Standard QC stage applies additional filters:

Minor Allele Frequency (MAF): Removes variants with MAF < 1% (--maf 0.01)
Hardy-Weinberg Equilibrium (HWE): Removes variants failing HWE at p < 1×10⁻⁶ (discovery) and p < 1×10⁻¹⁰ (validation)
Heterozygosity check: Identifies samples with FWER > 3 standard deviations from mean
Sex check: Optionally verifies reported sex matches genetic sex

Step 4: Run Ancestry-Specific QC

After ancestry classification, run QC on specific ancestry groups:

gdcgenomicsqc --configfile ../config_qc.yaml EUR/standardFilter.pgen -j 10

gdcgenomicsqc --configfile ../config_qc.yaml EUR/standardFilter.pgen -j 10

gdcgenomicsqc --configfile ../config_qc.yaml EUR/standardFilter.pgen -j 10

snakemake --profile=../profiles/hpc \
    --configfile ../config_qc.yaml \
    EUR/standardFilter.pgen \
    -j 10

Available subsets are dynamically determined from classification results.

Visualizations

Reference Space (PCA): images/PC_reference_space.svg

Reference panel samples in PC space with density contours
Shows how target samples map to known ancestry groups

Reference Space (UMAP): images/UMAP_reference_space.svg

Reference panel samples in UMAP embedding with density contours
Nonlinear visualization of ancestry structure

Interpreting Pipeline Outputs

Sample Missingness Plot

File: {subset}/figures/smiss.svg

The sample missingness histogram shows the distribution of missing data per individual.

X-axis: Percentage of genotype calls missing per sample
Red vertical lines: Threshold cutoffs (10% initial, 2% final)
Samples to the right of the rightmost line are removed

Note: Since we used synthetic and (in case of the R25 data) imputed data, we don’t expect to see any missingness in this exercise.

Standard interpretation:

Sharp peak at low values: Good quality data
Long right tail: Problematic samples requiring investigation
Bimodal distribution: Possible batch effects or technology issues

Variant Missingness Plot

File: {subset}/figures/vmiss.svg

The variant missingness histogram shows the distribution of missing data per SNP.

X-axis: Percentage of samples missing genotype call per variant
Red vertical lines: Threshold cutoffs
Variants to the right of the line are removed

Note: Since we used synthetic and (in case of the R25 data) imputed data, we don’t expect to see any missingness in this exercise.

Standard interpretation:

Concentrated at low values: Clean variant calling
Long right tail: Possible strand flip issues, poor quality regions, or structural variants

Unplotted Output Files

The QC pipeline generates many intermediate files for detailed analysis:

Initial QC Outputs:

Standard QC Outputs:

Sample contents of initial.smiss:

IID	FID	F_MISS	N_MISS
S001	FAM001	0.001	150
S002	FAM001	0.008	1200

Sample contents of R_check.het:

The heterozygosity rate is calculated as: (N.NM. - O.HOM.) / N.NM.

Discussion Points: Multi-Ancestry and Admixed Study Designs

These questions explore QC considerations for diverse and admixed populations:

Ancestry-specific allele frequencies: The MAF filter (default 1%) may remove informative variants in population-specific contexts. How should MAF thresholds differ between ancestry groups? Should multi-ancestry studies use uniform or group-specific thresholds?
HWE assumptions: HWE testing assumes a randomly mating population. For admixed individuals, this assumption is violated. Should HWE filters be applied before or after ancestry classification? How do systematic departures from HWE in admixed populations affect downstream analysis?
Heterozygosity in admixed samples: Admixed individuals have higher heterozygosity than homogeneous populations. Does the 3-SD threshold appropriately capture excess heterozygosity as an outlier versus natural admixture? How does this affect false positive rates?
Missingness patterns: Samples with high global ancestry proportions may have higher missingness if reference panels poorly represent their ancestry. How should missingness thresholds account for reference panel coverage across diverse populations?
Sex chromosome handling: The pseudoautosomal regions and sex chromosome ploidy differ between ancestries. How should X-chromosome heterozygosity filters be adjusted for multi-ancestry studies?
Relatedness in family-structured populations: For studies with family structure across ancestries, should KING or PRIMUS be used? How does population structure affect kinship coefficient estimates?
Differential QC power: Some ancestry groups may have more variants removed due to technology bias (e.g., array density). How does differential QC success affect downstream GWAS power and potential for bias?
Strand alignment: Poorly aligned variants show as missing in specific ancestry groups. How do you distinguish true missingness from strand issues in multi-ancestry data?

For theoretical foundations—including population genetics principles, statistical tests for QC metrics, and best practices for diverse populations—refer to the accompanying lecture materials.

Next Steps

After completing this tutorial, proceed to:

Tutorial: Ancestry Classification in Practice - Classify ancestry using the QC-filtered data
Tutorial: Heritability Estimation with Multi-Ancestry Simulation - Estimate heritability using QC-filtered genotypes

See also:

Installation - Software setup (if not already done)
Usage - Running the full pipeline
Genomics - Technical details on QC methods