.. _tutorial_qc:

Tutorial: Quality Control Pipeline in Practice
==============================================

This tutorial provides hands-on experience running the quality control (QC)
pipeline in GDCGenomicsQC. The pipeline implements a two-stage approach:
Initial QC (sample and variant missingness filtering) followed by Standard QC
(MAF, HWE, heterozygosity, and optional sex checking).

**Estimated completion time**: 20-30 minutes

**Learning objectives**:

1. Run the Initial QC and Standard QC pipelines
2. Interpret output plots and data files
3. Configure QC thresholds for different study designs
4. Apply ancestry-specific QC workflows

----

Prerequisites
-------------

**Setup:**

Before starting, ensure you have access to Snakemake and the GDCGenomicsQC workflow.
For detailed installation instructions, see:

- :doc:`installation` - Software setup (module, conda, or other methods)
- :doc:`usage` - Running the pipeline

.. tabs::

   .. tab:: MSI HPC

      If you're using the MSI HPC cluster:

      .. code-block:: bash

          module use /projects/standard/gdc/public/GDCGenomicsQC/envs
          module load gdcgenomicsqc
          conda activate snakemake

      Verify installation:

      .. code-block:: bash

          cd GDCGenomicsQC
          snakemake --version

   .. tab:: Sandbox

      If you're using the Sandbox environment:

      .. code-block:: bash

          module use /scratch.global/GDC/GDCGenomicsQC/envs
          module load gdcgenomicsqc
          conda activate snakemake

      Verify installation:

      .. code-block:: bash

          cd GDCGenomicsQC
          snakemake --version

   .. tab:: Other HPCs

      If your HPC has the GDC module pre-configured:

      .. code-block:: bash

          # Replace with your HPC's module path:
          module use /path/to/GDCGenomicsQC/envs
          module load gdcgenomicsqc
          conda activate snakemake

      Verify installation:

      .. code-block:: bash

          cd GDCGenomicsQC
          snakemake --version

   .. tab:: Local Snakemake

      If you're using your own Snakemake installation:

      .. code-block:: bash

          conda activate snakemake
          cd GDCGenomicsQC

      Verify installation:

      .. code-block:: bash

          snakemake --version

**Data Requirements:**

- Reference data configured (see :doc:`tutorial_1kg_assembly`)
- Genotype data in VCF, BED, or PGEN format

.. _dag-visualization:

DAG Visualization
~~~~~~~~~~~~~~~~

The pipeline DAG up to the ``run_initialQC`` rule shows the Initial QC workflow:

.. mermaid:: dag_initialQC.mmd

The rule graph provides a cleaner view of rule dependencies:

.. mermaid:: rulegraph_initialQC.mmd

----

Required Input Files
~~~~~~~~~~~~~~~~~~~~

This step requires the following input files:

.. list-table:: QC Pipeline Input Files
   :widths: 35 65
   :header-rows: 1

   * - Input File
     - Description
   * - ``INPUT: "chr{CHR}.vcf.gz"`` (or .bed/.pgen)
     - Per-chromosome VCF, BED, or PGEN files with genotype data
   * - ``REF/1000G_highcoverage/population.txt``
     - Reference panel population labels (for ancestry QC subsets)
   * - ``REF/Homo_sapiens.GRCh38.dna.primary_assembly.fa``
     - Reference genome FASTA (if using reference allele correction)

**Input Formats Supported:**

The pipeline automatically detects format based on file extension:

+----------+------------------------------------------+
| Format   | Example Path                             |
+==========+==========================================+
| VCF      | ``/data/chr{CHR}.vcf.gz``                |
| PLINK BED| ``/data/chr{CHR}.bed``                  |
| PLINK PGEN| ``/data/chr{CHR}.pgen``                |
| Single file| ``/data/merged.bed`` (no ``{CHR}``)   |
+----------+------------------------------------------+

**Config Parameters for QC:**

.. code-block:: yaml

    INPUT: "/path/to/data/chr{CHR}.vcf.gz"  # Per-chromosome VCF
    OUT_DIR: "/path/to/output"
    REF: "/path/to/reference"
    local-storage-prefix: "/path/to/.snakemake/storage"

    chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

    # QC thresholds
    relatedness:
        method: "king"  # "0" for none, "king" for removal
        king_cutoff: 0.0884

    SEX_CHECK: true  # Enable/disable sex verification
    GRM: true  # Compute genetic relationship matrix
    thin: false

**See also:** :doc:`usage` for configuration options, :doc:`installation` for software setup.

----

Lab Exercise: Running QC Pipeline

Step 1: Create Configuration File
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Create a configuration file for QC:

.. code-block:: bash

    mkdir -p ~/qc_lab
    cd ~/qc_lab
    cat > config_qc.yaml << 'EOF'
    INPUT: "/path/to/data/chr{CHR}.vcf.gz"
    OUT_DIR: "/path/to/output/directory"
    REF: "/path/to/reference/data"
    local-storage-prefix: "/path/to/.snakemake/storage"

    chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

    relatedness:
        method: "0"
        king_cutoff: 0.0884

    SEX_CHECK: true
    thin: false
    conda-frontend: mamba
    EOF

Key parameters:

- ``SEX_CHECK``: Enable/disable sex verification (default: true)
- ``relatedness.method``: Relatedness filtering method ("0" for none, "king" or "primus" for removal)

Step 2: Run Initial QC
~~~~~~~~~~~~~~~~~~~~~~~

.. tabs::

   .. tab:: MSI HPC

      .. code-block:: bash

          cd GDCGenomicsQC/workflow
          gdcgenomicsqc --configfile ../config_qc.yaml full/initialFilter.pgen -j 10

   .. tab:: Sandbox

      .. code-block:: bash

          cd GDCGenomicsQC/workflow
          gdcgenomicsqc --configfile ../config_qc.yaml full/initialFilter.pgen -j 10

   .. tab:: Other HPCs

      .. code-block:: bash

          cd GDCGenomicsQC/workflow
          gdcgenomicsqc --configfile ../config_qc.yaml full/initialFilter.pgen -j 10

   .. tab:: Local Snakemake

      .. code-block:: bash

          cd GDCGenomicsQC/workflow
          snakemake --profile=../profiles/hpc \
              --configfile ../config_qc.yaml \
              full/initialFilter.pgen \
              -j 10

The Initial QC stage performs:

1. **Sample missingness (initial)**: Removes samples with >10% missing genotypes (``--mind 0.1``)
2. **Variant missingness**: Removes variants with >2% missingness (``--geno 0.02``)
3. **Sample missingness (final)**: Removes samples with >2% missingness (``--mind 0.02``)
4. **LD pruning**: Creates pruned dataset for downstream analyses (``--indep-pairwise 500 10 0.1``)

Step 3: Run Standard QC
~~~~~~~~~~~~~~~~~~~~~~~

.. tabs::

   .. tab:: MSI HPC

      .. code-block:: bash

          gdcgenomicsqc --configfile ../config_qc.yaml full/standardFilter.pgen -j 10

   .. tab:: Sandbox

      .. code-block:: bash

          gdcgenomicsqc --configfile ../config_qc.yaml full/standardFilter.pgen -j 10

   .. tab:: Other HPCs

      .. code-block:: bash

          gdcgenomicsqc --configfile ../config_qc.yaml full/standardFilter.pgen -j 10

   .. tab:: Local Snakemake

      .. code-block:: bash

          snakemake --profile=../profiles/hpc \
              --configfile ../config_qc.yaml \
              full/standardFilter.pgen \
              -j 10

The Standard QC stage applies additional filters:

1. **Minor Allele Frequency (MAF)**: Removes variants with MAF < 1% (``--maf 0.01``)
2. **Hardy-Weinberg Equilibrium (HWE)**: Removes variants failing HWE at p < 1×10⁻⁶ (discovery) and p < 1×10⁻¹⁰ (validation)
3. **Heterozygosity check**: Identifies samples with FWER > 3 standard deviations from mean
4. **Sex check**: Optionally verifies reported sex matches genetic sex

Step 4: Run Ancestry-Specific QC
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After ancestry classification, run QC on specific ancestry groups:

.. tabs::

   .. tab:: MSI HPC

      .. code-block:: bash

          gdcgenomicsqc --configfile ../config_qc.yaml EUR/standardFilter.pgen -j 10

   .. tab:: Sandbox

      .. code-block:: bash

          gdcgenomicsqc --configfile ../config_qc.yaml EUR/standardFilter.pgen -j 10

   .. tab:: Other HPCs

      .. code-block:: bash

          gdcgenomicsqc --configfile ../config_qc.yaml EUR/standardFilter.pgen -j 10

   .. tab:: Local Snakemake

      .. code-block:: bash

          snakemake --profile=../profiles/hpc \
              --configfile ../config_qc.yaml \
              EUR/standardFilter.pgen \
              -j 10

Available subsets are dynamically determined from classification results.

Visualizations
~~~~~~~~~~~~~~

**Reference Space (PCA)**: ``images/PC_reference_space.svg``

- Reference panel samples in PC space with density contours
- Shows how target samples map to known ancestry groups

**Reference Space (UMAP)**: ``images/UMAP_reference_space.svg``

- Reference panel samples in UMAP embedding with density contours
- Nonlinear visualization of ancestry structure

----

Interpreting Pipeline Outputs

Sample Missingness Plot
~~~~~~~~~~~~~~~~~~~~~~~

**File**: ``{subset}/figures/smiss.svg``

.. image:: images/smiss.svg
   :width: 600

The sample missingness histogram shows the distribution of missing data per individual.

- X-axis: Percentage of genotype calls missing per sample
- Red vertical lines: Threshold cutoffs (10% initial, 2% final)
- Samples to the right of the rightmost line are removed

**Note**: Since we used synthetic and (in case of the R25 data) imputed data, we don't expect to see any missingness in this exercise.

Standard interpretation:

- **Sharp peak at low values**: Good quality data
- **Long right tail**: Problematic samples requiring investigation
- **Bimodal distribution**: Possible batch effects or technology issues

Variant Missingness Plot
~~~~~~~~~~~~~~~~~~~~~~~~

**File**: ``{subset}/figures/vmiss.svg``

.. image:: images/vmiss.svg
   :width: 600

The variant missingness histogram shows the distribution of missing data per SNP.

- X-axis: Percentage of samples missing genotype call per variant
- Red vertical lines: Threshold cutoffs
- Variants to the right of the line are removed

**Note**: Since we used synthetic and (in case of the R25 data) imputed data, we don't expect to see any missingness in this exercise.

Standard interpretation:

- **Concentrated at low values**: Clean variant calling
- **Long right tail**: Possible strand flip issues, poor quality regions, or structural variants

Unplotted Output Files
~~~~~~~~~~~~~~~~~~~~~~~

The QC pipeline generates many intermediate files for detailed analysis:

**Initial QC Outputs**:

+-----------------------------------+----------------------------------------+
| File                              | Description                            |
+===================================+========================================+
| ``initialFilter.pgen/.pvar/.psam``| Merged, filtered dataset               |
+-----------------------------------+----------------------------------------+
| ``initialFilter.LDpruned.*``      | LD-pruned dataset for PCA/relatedness |
+-----------------------------------+----------------------------------------+
| ``initial.smiss``                 | Sample missingness table              |
+-----------------------------------+----------------------------------------+
| ``initial.vmiss``                 | Variant missingness table             |
+-----------------------------------+----------------------------------------+

**Standard QC Outputs**:

+-----------------------------------+----------------------------------------+
| File                              | Description                            |
+===================================+========================================+
| ``standardFilter.pgen/.pvar/.psam``| Final filtered dataset                |
+-----------------------------------+----------------------------------------+
| ``standardFilter.LDpruned.*``     | Final LD-pruned dataset               |
+-----------------------------------+----------------------------------------+
| ``MAF_check.afreq``               | Allele frequency table                |
+-----------------------------------+----------------------------------------+
| ``zoomhwe.hwe``                   | Variants failing HWE p < 1×10⁻⁵       |
+-----------------------------------+----------------------------------------+
| ``indepSNP.prune.in``             | Independent SNPs for heterozygosity   |
+-----------------------------------+----------------------------------------+
| ``R_check.het``                   | Heterozygosity rate per sample        |
+-----------------------------------+----------------------------------------+
| ``fail-het-qc.txt``               | Samples failing heterozygosity filter |
+-----------------------------------+----------------------------------------+
| ``sex_discrepancy.txt``           | Samples with sex check problems       |
+-----------------------------------+----------------------------------------+

**Sample contents of** ``initial.smiss``:

+--------+----------+----------+--------+
| IID    | FID      | F_MISS   | N_MISS |
+========+==========+==========+========+
| S001   | FAM001   | 0.001    | 150    |
+--------+----------+----------+--------+
| S002   | FAM001   | 0.008    | 1200   |
+--------+----------+----------+--------+

**Sample contents of** ``R_check.het``:

+--------+----------+--------+--------+
| IID    | FID      | O.HOM. | N.NM.  |
+========+==========+========+========+
| S001   | FAM001   | 2500   | 3000   |
+--------+----------+----------+--------+
| S002   | FAM001   | 2450   | 3000   |
+--------+----------+----------+--------+

The heterozygosity rate is calculated as: ``(N.NM. - O.HOM.) / N.NM.``

----

Discussion Points: Multi-Ancestry and Admixed Study Designs
-------------------------------------------------------------

These questions explore QC considerations for diverse and admixed populations:

1. **Ancestry-specific allele frequencies**: The MAF filter (default 1%) may remove
   informative variants in population-specific contexts. How should MAF thresholds
   differ between ancestry groups? Should multi-ancestry studies use uniform or
   group-specific thresholds?

2. **HWE assumptions**: HWE testing assumes a randomly mating population. For
   admixed individuals, this assumption is violated. Should HWE filters be applied
   before or after ancestry classification? How do systematic departures from HWE
   in admixed populations affect downstream analysis?

3. **Heterozygosity in admixed samples**: Admixed individuals have higher
   heterozygosity than homogeneous populations. Does the 3-SD threshold appropriately
   capture excess heterozygosity as an outlier versus natural admixture? How does
   this affect false positive rates?

4. **Missingness patterns**: Samples with high global ancestry proportions may
   have higher missingness if reference panels poorly represent their ancestry.
   How should missingness thresholds account for reference panel coverage across
   diverse populations?

5. **Sex chromosome handling**: The pseudoautosomal regions and sex chromosome
   ploidy differ between ancestries. How should X-chromosome heterozygosity filters
   be adjusted for multi-ancestry studies?

6. **Relatedness in family-structured populations**: For studies with family
   structure across ancestries, should KING or PRIMUS be used? How does population
   structure affect kinship coefficient estimates?

7. **Differential QC power**: Some ancestry groups may have more variants removed
   due to technology bias (e.g., array density). How does differential QC success
   affect downstream GWAS power and potential for bias?

8. **Strand alignment**: Poorly aligned variants show as missing in specific
   ancestry groups. How do you distinguish true missingness from strand issues
   in multi-ancestry data?

For theoretical foundations—including population genetics principles, statistical
tests for QC metrics, and best practices for diverse populations—refer to the
accompanying lecture materials.

----

Next Steps
---------

After completing this tutorial, proceed to:

- :doc:`tutorial_ancestry_classification` - Classify ancestry using the QC-filtered data
- :doc:`tutorial_heritability` - Estimate heritability using QC-filtered genotypes

**See also:**

- :doc:`installation` - Software setup (if not already done)
- :doc:`usage` - Running the full pipeline
- :doc:`genomics` - Technical details on QC methods