=====
Usage
=====

The **GDCGenomicsQC** pipeline is a comprehensive quality control pipeline for genomic data.
It integrates standard QC procedures with ancestry estimation and optional advanced features.

.. contents:: Table of Contents
   :depth: 2
   :local:

Software Environment Setup
--------------------------

Before running the pipeline, you need to set up your software environment.
Choose the method that matches your HPC setup:

.. tabs::

   .. tab:: MSI HPC

       If your HPC has the GDC module pre-configured:

       **Step 1: Add module path and load the GDC module**

       .. code-block:: bash

           module use /projects/standard/gdc/public/GDCGenomicsQC/envs
           module load gdcgenomicsqc

       **Step 2: Activate snakemake environment**

       .. code-block:: bash

           conda activate snakemake

       **Step 3: Verify installation**

       .. code-block:: bash

           cd GDCGenomicsQC
           snakemake --version

       **What the module provides:**

       +--------------------------------+--------------------------------------------+
       | Setting                        | Value                                      |
       +================================+============================================+
       | ``PATH``                       | Adds ``gdcgenomicsMSI/bin`` to ``PATH``    |
       +--------------------------------+--------------------------------------------+
       | ``APPTAINER_CACHEDIR``         | ``/scratch.global/GDC/singularityimages``  |
       +--------------------------------+--------------------------------------------+
       | ``SNAKEMAKE_APPTAINER_PREFIX`` | ``/scratch.global/GDC/singularityimages``  |
       +--------------------------------+--------------------------------------------+

       **Running the pipeline:**

       .. code-block:: bash

           cd GDCGenomicsQC/workflow
           gdcgenomicsqc --configfile ../config/config.yaml

       Or with snakemake directly:

       .. code-block:: bash

           cd GDCGenomicsQC/workflow
           snakemake --profile ../profiles/hpc --configfile ../config/config.yaml

   .. tab:: Sandbox

       If your sandbox environment has the GDC module pre-configured:

       **Step 1: Add module path and load the GDC module**

       .. code-block:: bash

           module use /scratch.global/GDC/GDCGenomicsQC/envs
           module load gdcgenomicsqc

       **Step 2: Activate snakemake environment**

       .. code-block:: bash

           conda activate snakemake

       **Step 3: Verify installation**

       .. code-block:: bash

           cd GDCGenomicsQC
           snakemake --version

       **What the module provides:**

       +--------------------------------+--------------------------------------------+
       | Setting                        | Value                                      |
       +================================+============================================+
       | ``PATH``                       | Adds ``gdcgenomicsMSI/bin`` to ``PATH``    |
       +--------------------------------+--------------------------------------------+
       | ``APPTAINER_CACHEDIR``         | ``/scratch.global/GDC/singularityimages``  |
       +--------------------------------+--------------------------------------------+
       | ``SNAKEMAKE_APPTAINER_PREFIX`` | ``/scratch.global/GDC/singularityimages``  |
       +--------------------------------+--------------------------------------------+

       **Running the pipeline:**

       .. code-block:: bash

           cd GDCGenomicsQC/workflow
           gdcgenomicsqc --configfile ../config/config.yaml

       Or with snakemake directly:

       .. code-block:: bash

           cd GDCGenomicsQC/workflow
           snakemake --profile ../profiles/sandbox --configfile ../config/config.yaml

   .. tab:: Local Snakemake (Conda)

      If you're using your own Snakemake installation:

      **Step 1: Create the conda environment**

      .. code-block:: bash

          # Clone the repository
          git clone https://github.com/UMN-GDC/GDCGenomicsQC.git
          cd GDCGenomicsQC

          # Create the snakemake environment
          conda env create -f envs/snakemake.yml
          conda activate snakemake

      **Step 2: Verify installation**

      .. code-block:: bash

          snakemake --version

      **Running the pipeline:**

      .. code-block:: bash

          cd GDCGenomicsQC/workflow
          snakemake --profile ../profiles/hpc --configfile ../config/config.yaml

**See also:** :doc:`installation` for detailed setup options including Singularity-only environments.

Workflow Overview
-----------------

The pipeline processes input data through a structured sequence of stages:

.. figure:: images/workflow_diagram.jpg
   :alt: GDC Genomics QC Workflow Diagram
   :align: center
   :width: 600px

   Overview of the GDC Genomics QC Pipeline stages.

1.  **Initial QC**: Sample and SNP filtering using PLINK
2.  **Relatedness**: KING/PC-AiR/PC-Relate for kinship estimation
3.  **Standard QC**: GWAS-level filters (MAF, HWE, missingness)
4.  **Phasing**: Haplotype estimation via shapeit4
5.  **Global Ancestry**: PCA/UMAP/VAE with Random Forest classification
6.  **Local Ancestry**: RFMix for segment-level ancestry inference
7.  **Per-Ancestry QC**: Ancestry-specific quality control

For more details on each module, see :doc:`genomics`.

Configuration
-------------

All pipeline options are configured via the ``config/config.yaml`` file. This replaces
the older command-line flag approach.

Basic Configuration
~~~~~~~~~~~~~~~~~~~

The ``INPUT`` parameter specifies your input genomic data. The pipeline automatically
detects the format based on the file extension and whether ``{CHR}`` is present:

.. code-block:: yaml

    # Input genomic data template. Supports:
    # - Per-chromosome VCF: "/path/to/vcf/chr{CHR}.vcf.gz" (use {CHR} placeholder)
    # - Whole genome BED: "/path/to/data/merged.bed"
    # - Whole genome PGEN: "/path/to/data/merged.pgen"
    INPUT: "/path/to/vcf/chr{CHR}.vcf.gz"

    # Alternative VCF template for ABCD-style paths (optional)
    vcf_template: null

    # Output directory for pipeline results
    OUT_DIR: "/path/to/output/directory"

    # Reference data directory
    REF: "/path/to/reference/data"

    # Local snakemake storage cache
    local-storage-prefix: "/path/to/.snakemake/storage"

    # Chromosomes to process
    chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

    # Relatedness estimation
    relatedness:
        method: "king"  # Options: "0", "king"
        king_cutoff: 0.0884

    SEX_CHECK: false
    GRM: true
    thin: false

    # Ancestry analysis
    ancestry:
        threshold: 0.8
        model: "pca"  # Options: pca, umap, vae, rfmix

    # Local ancestry (RFMix)
    localAncestry:
        RFMIX: false
        test: false
        thin_subjects: 0.1
        figures: "figures"
        chromosomes: null

    # Internal PCA
    internalPCA:
        plot: true
        color_by: null
        phenotype_file: null

See :doc:`genomics` for detailed descriptions of all configuration options.
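
Before launching a long run, it can help to confirm that a per-chromosome
``INPUT`` template resolves to files that exist. The sketch below expands the
``{CHR}`` placeholder in the shell and lists any missing files. The helper names
``expand_template`` and ``check_inputs`` are hypothetical and not part of the
pipeline, and the plain string substitution only approximates how the workflow
resolves the template.

.. code-block:: bash

    # Hypothetical helpers (not part of GDCGenomicsQC).

    # Print the template with every literal {CHR} replaced by the chromosome.
    expand_template() {
        local template="$1" chr="$2"
        local placeholder='{CHR}'
        local expanded=${template//"$placeholder"/$chr}
        printf '%s\n' "$expanded"
    }

    # Report each missing file on stderr; return non-zero if any is missing.
    check_inputs() {
        local template="$1"
        shift
        local chr path missing=0
        for chr in "$@"; do
            path="$(expand_template "$template" "$chr")"
            if [[ ! -e "$path" ]]; then
                echo "missing: $path" >&2
                missing=1
            fi
        done
        return "$missing"
    }

    # Example: check autosomes 1-22 against the template from config.yaml.
    check_inputs "/path/to/vcf/chr{CHR}.vcf.gz" {1..22} || echo "fix INPUT before running"

A template without ``{CHR}``, such as a merged BED or PGEN path, passes through
``expand_template`` unchanged, so the same check covers whole-genome inputs.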

Running the Pipeline
--------------------

Choose your execution method based on your setup:

.. tabs::

   .. tab:: MSI HPC

      Use the ``gdcgenomicsqc`` wrapper script:

      .. code-block:: bash

          cd GDCGenomicsQC/workflow
          gdcgenomicsqc --configfile ../config/config.yaml

      Or use snakemake directly with the HPC profile:

      .. code-block:: bash

          cd GDCGenomicsQC/workflow
          snakemake --profile=../profiles/hpc --configfile ../config/config.yaml

   .. tab:: Sandbox

      Use the ``gdcgenomicsqc`` wrapper script:

      .. code-block:: bash

          cd GDCGenomicsQC/workflow
          gdcgenomicsqc --configfile ../config/config.yaml

      Or use snakemake directly with the sandbox profile:

      .. code-block:: bash

          cd GDCGenomicsQC/workflow
          snakemake --profile=../profiles/sandbox --configfile ../config/config.yaml

   .. tab:: Local Snakemake

      **HPC execution:**

      .. code-block:: bash

          cd GDCGenomicsQC/workflow
          snakemake --profile=../profiles/hpc --configfile ../config/config.yaml

      **Interactive/Testing:**

      .. code-block:: bash

          cd GDCGenomicsQC/workflow
          snakemake --profile=../profiles/interactive --configfile ../config/config.yaml

Running Specific Rules
~~~~~~~~~~~~~~~~~~~~

Run only specific parts of the pipeline by specifying the rule name:

.. code-block:: bash

    # Run only ancestry classification
    snakemake --profile=../profiles/hpc --configfile ../config/config.yaml classifyAncestry

    # Run only initial QC
    snakemake --profile=../profiles/hpc --configfile ../config/config.yaml initialFilter

    # Run only RFMix
    snakemake --profile=../profiles/hpc --configfile ../config/config.yaml RFMIX

Common rule targets include:

- ``initialFilter`` - Initial sample/SNP quality control
- ``convertPlinkPerChromosome`` - Per-chromosome conversion and filtering
- ``convertPlinkSingleFile`` - Single file conversion and filtering
- ``king`` - Relatedness estimation
- ``estimateAncestry`` - Global ancestry classification
- ``classifyAncestry`` - Generate ancestry classifications and plots
- ``RFMIX`` - Local ancestry inference
- ``phase`` - Phasing with shapeit4
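
A target name can be combined with the ``-n`` dry-run flag to preview the jobs a
rule would trigger before submitting them. The sketch below wraps that two-step
pattern in a hypothetical ``run_target`` shell function (not provided by the
pipeline). It assumes the current directory is ``workflow`` and reuses the
``hpc`` profile and config path from the examples above; ``PROFILE`` and
``CONFIGFILE`` are illustrative override variables.

.. code-block:: bash

    # Hypothetical helper: preview a rule with a dry run, then run it for real.
    run_target() {
        local target="$1"
        local profile="${PROFILE:-../profiles/hpc}"
        local configfile="${CONFIGFILE:-../config/config.yaml}"
        local args=(--profile="$profile" --configfile "$configfile")

        # Step 1: the dry run (-n) lists the jobs without executing anything.
        snakemake -n "${args[@]}" "$target" || return 1

        # Step 2: only reached if the dry run succeeded.
        snakemake "${args[@]}" "$target"
    }

    # Usage (from GDCGenomicsQC/workflow):
    #   run_target classifyAncestry

Stopping after a failed dry run keeps configuration mistakes, such as a missing
input file, from reaching the scheduler.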

Generating Reports
~~~~~~~~~~~~~~~~~~

Create an HTML report summarizing the workflow:

.. code-block:: bash

    snakemake --profile=../profiles/hpc \
        --configfile ../config/config.yaml \
        --report --report-stylesheet ../report/stylesheet.css

The report will be generated at ``workflow/report.html``.

Advanced Options
----------------

Parallel Jobs
~~~~~~~~~~~~~

Cap the number of SLURM jobs running at once with ``-j`` (``--jobs``):

.. code-block:: bash

    snakemake --profile=../profiles/hpc --configfile ../config/config.yaml -j 20

Dry Run
~~~~~~~

Preview what will be executed without running:

.. code-block:: bash

    snakemake -n --configfile ../config/config.yaml

Debugging
~~~~~~~~~

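Jobs left incomplete by an interrupted run can be rerun with ``--rerun-incomplete``.
To rerun jobs only when input files are newer than their outputs, ignoring changes
to code, parameters, or software environments, restrict the rerun triggers to ``mtime``: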

.. code-block:: bash

    snakemake --profile=../profiles/hpc --configfile ../config/config.yaml --rerun-triggers mtime

Master SLURM Job
----------------

The pipeline includes a master SLURM script at ``workflow/snakemake.SLURM`` that
coordinates all jobs. This is the recommended way to run the full pipeline on HPC.

The master job stays running and dispatches individual jobs to SLURM as needed:

.. code-block:: bash

    # From the workflow directory
    sbatch snakemake.SLURM

Or with a custom config, assuming your copy of ``snakemake.SLURM`` reads the ``CONFIG`` environment variable:

.. code-block:: bash

    sbatch --export=CONFIG=config_custom.yaml snakemake.SLURM

The master script:

.. code-block:: bash
    :caption: workflow/snakemake.SLURM

    #!/bin/bash
    #SBATCH --job-name=smk_master
    #SBATCH --output=snakemake_%j.log
    #SBATCH --mem=4G
    #SBATCH --time=72:00:00  # Enough time for the whole pipeline

    # Adjust this path to your own conda installation
    source /users/4/coffm049/miniconda3/etc/profile.d/conda.sh
    conda activate snakemake

    # The hpc profile is expected to select the SLURM executor (--executor slurm)
    snakemake --profile=../profiles/hpc

Custom SLURM Script
-------------------

For more control, write your own SLURM script, saved here as ``run_pipeline.sh``:

.. code-block:: bash

    #!/bin/bash
    #SBATCH --job-name=gdc_qc
    #SBATCH --output=logs/%x_%j.log
    #SBATCH --error=logs/%x_%j.err
    #SBATCH --time=7-00:00
    #SBATCH --mem=64G
    #SBATCH --cpus-per-task=8

    # Assumes the job is submitted from the directory that contains GDCGenomicsQC
    cd "$SLURM_SUBMIT_DIR/GDCGenomicsQC/workflow"

    snakemake --profile=../profiles/hpc \
        --configfile ../config/config.yaml \
        --jobs 20

Submit with:

.. code-block:: bash

    sbatch run_pipeline.sh
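
The ``--output`` and ``--error`` paths in the script point into ``logs/``, and
SLURM does not create missing directories for them. A small wrapper can create
the directory before submitting. ``submit_pipeline`` is a hypothetical helper
name used here for illustration; it is not part of the pipeline.

.. code-block:: bash

    # Hypothetical helper: create the log directory, then submit the script.
    submit_pipeline() {
        local script="${1:-run_pipeline.sh}"
        # The logs/ path is relative to the directory sbatch is called from.
        mkdir -p logs || return 1
        sbatch "$script"
    }

    # Usage (from the directory containing run_pipeline.sh):
    #   submit_pipeline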

.. important::

   **Every time you start a new session**, you must rerun the environment setup steps:

   - Load the GDC module (if using module system)
   - Activate the snakemake conda environment

   Example for a new session:

   .. code-block:: bash

       # For MSI HPC:
       module use /projects/standard/gdc/public/GDCGenomicsQC/envs
       module load gdcgenomicsqc
       conda activate snakemake

       # For Sandbox:
       module use /scratch.global/GDC/GDCGenomicsQC/envs
       module load gdcgenomicsqc
       conda activate snakemake

       # For other HPCs:
       module use /path/to/GDCGenomicsQC/envs
       module load gdcgenomicsqc
       conda activate snakemake
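
To avoid retyping these commands, the per-site module paths listed above can be
collected in a small shell function, for example in ``~/.bashrc``. This is a
sketch rather than something shipped with the pipeline: ``gdc_module_path`` and
``gdc_setup`` are hypothetical names, and the ``other`` branch reads the path
from a hypothetical ``GDC_ENVS`` variable.

.. code-block:: bash

    # Hypothetical helpers, not part of GDCGenomicsQC.

    # Print the module directory for a site: msi, sandbox, or other.
    gdc_module_path() {
        case "$1" in
            msi)     echo "/projects/standard/gdc/public/GDCGenomicsQC/envs" ;;
            sandbox) echo "/scratch.global/GDC/GDCGenomicsQC/envs" ;;
            other)
                if [[ -z "${GDC_ENVS:-}" ]]; then
                    echo "set GDC_ENVS to your GDCGenomicsQC/envs path" >&2
                    return 1
                fi
                echo "$GDC_ENVS" ;;
            *)
                echo "unknown site: $1" >&2
                return 1 ;;
        esac
    }

    # Load the module and activate the environment for the given site.
    gdc_setup() {
        local envs
        envs="$(gdc_module_path "${1:-msi}")" || return 1
        module use "$envs" &&
            module load gdcgenomicsqc &&
            conda activate snakemake
    }

    # Usage in a new session:
    #   gdc_setup msi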