.. _tutorial_ancestry:

Tutorial: Ancestry Classification in Practice
============================================

This tutorial provides hands-on experience running the ancestry classification
pipeline in GDCGenomicsQC. For the theoretical background on dimension
reduction methods and classification techniques, see the accompanying lecture
slides.

**Estimated completion time**: 30-45 minutes

**Learning objectives**:

1. Run the ancestry classification pipeline using Snakemake
2. Configure different models and thresholds
3. Interpret pipeline outputs
4. Apply ancestry-specific subsetting

----

Prerequisites
-------------

**Setup:**

Before starting, ensure you have access to Snakemake and the GDCGenomicsQC workflow.
For detailed installation instructions, see:

- :doc:`installation` - Software setup (module, conda, or other methods)
- :doc:`usage` - Running the pipeline

.. tabs::

   .. tab:: MSI HPC

      If you're using the MSI HPC cluster:

      .. code-block:: bash

          module use /projects/standard/gdc/public/GDCGenomicsQC/envs
          module load gdcgenomicsqc
          conda activate snakemake

      Verify installation:

      .. code-block:: bash

          cd GDCGenomicsQC
          snakemake --version

   .. tab:: Sandbox

      If you're using the Sandbox environment:

      .. code-block:: bash

          module use /scratch.global/GDC/GDCGenomicsQC/envs
          module load gdcgenomicsqc
          conda activate snakemake

      Verify installation:

      .. code-block:: bash

          cd GDCGenomicsQC
          snakemake --version

   .. tab:: Other HPCs

      If your HPC has the GDC module pre-configured:

      .. code-block:: bash

          # Replace with your HPC's module path:
          module use /path/to/GDCGenomicsQC/envs
          module load gdcgenomicsqc
          conda activate snakemake

      Verify installation:

      .. code-block:: bash

          cd GDCGenomicsQC
          snakemake --version

   .. tab:: Local Snakemake

      If you're using your own Snakemake installation:

      .. code-block:: bash

          conda activate snakemake
          cd GDCGenomicsQC

      Verify installation:

      .. code-block:: bash

          snakemake --version

**Data Requirements:**

- Reference panel with population labels (see :doc:`tutorial_1kg_assembly`)
- QC-filtered genotype data (see :doc:`tutorial_qc_pipeline`)

.. _dag-visualization:

DAG Visualization
~~~~~~~~~~~~~~~~

The pipeline DAG up to the ``run_pca`` rule shows the workflow for preparing
the PCA reference and running ancestry classification:

.. mermaid:: dag_pca.mmd

The rule graph provides a cleaner view of rule dependencies:

.. mermaid:: rulegraph_pca.mmd

----

Required Input Files
~~~~~~~~~~~~~~~~~~~~

This step requires the following input files:

.. list-table:: Ancestry Classification Input Files
   :widths: 35 65
   :header-rows: 1

   * - Input File
     - Description
   * - ``INPUT: "chr{CHR}.vcf.gz"`` (or .bed/.pgen)
     - Per-chromosome genotype data (QC-filtered recommended)
   * - ``REF/1000G_highcoverage/population.txt``
     - Reference panel with population labels (IID, pop, superpop columns)
   * - ``REF/1000G_highcoverage/1000G_highCoveragephased.pruned.pgen``
     - LD-pruned, unrelated reference genotypes for PCA projection
   * - ``OUT_DIR/full/initialFilter.pgen`` (or ``_CHR.pgen``)
     - Initial QC-filtered sample genotypes

**Input from Previous Steps:**

The ancestry classification pipeline depends on:

1. **QC Pipeline** (tutorial_qc_pipeline): Produces filtered genotype files
2. **Reference Assembly** (tutorial_1kg_assembly): Provides reference panel

**Config Parameters for Ancestry:**

.. code-block:: yaml

    ancestry:
        threshold: 0.8  # Minimum posterior probability for classification
        model: "pca"    # Options: pca, umap, rfmix (vae not yet implemented)
        # Optional: reported_race: "/path/to/reported_race.tsv"

    INPUT: "/path/to/data/chr{CHR}.vcf.gz"
    OUT_DIR: "/path/to/output"
    REF: "/path/to/reference"
    local-storage-prefix: "/path/to/.snakemake/storage"

    chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

**See also:** :doc:`tutorial_qc_pipeline` for QC preprocessing, :doc:`tutorial_1kg_assembly` for reference data.

Lab Exercise: Running Ancestry Classification
----------------------------------------------

Step 1: Create Configuration File
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For this tutorial, we will classify ancestry using an admix-free resampling
of the 1000 Genomes reference panel. This reference was generated using the
CT-Sleb tool and is available on Harvard Dataverse. The resampling ensures
that only genetically unrelated, ancestry-appropriate samples are included,
providing a cleaner reference for classification.

Create a configuration file for ancestry classification:

.. code-block:: bash

    mkdir -p ~/ancestry_lab
    cd ~/ancestry_lab
    cat > config_ancestry.yaml << 'EOF'
    INPUT: "/path/to/data/chr{CHR}.vcf.gz"
    OUT_DIR: "/path/to/output/directory"
    REF: "/path/to/reference/data"
    local-storage-prefix: "/path/to/.snakemake/storage"

    chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

    ancestry:
        threshold: 0.8
        model: "pca"  # Options: pca, umap, rfmix (vae not yet implemented)

    relatedness:
        method: "0"
        king_cutoff: 0.0884

    localAncestry:
        RFMIX: true
        test: true
        thin_subjects: 0.1
        figures: "figures"
        chromosomes: null

    thin: false
    conda-frontend: mamba
    EOF

Key parameters:

- ``threshold``: Minimum posterior probability for confident classification (default: 0.8)
- ``model``: Embedding used for classification—``pca``, ``umap``, or ``rfmix``
  (Note: VAE is not yet implemented)

Step 2: Run Classification Pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Before classification, the pipeline runs KING to identify related samples.
For this tutorial using simulated data, you should not find any related
individuals (KING kinship coefficient ≈ 0), which serves as a good sanity
check that the simulated data is properly independent.

.. tabs::

   .. tab:: MSI HPC

      .. code-block:: bash

          cd GDCGenomicsQC/workflow
          gdcgenomicsqc --configfile ../config_ancestry.yaml classifyAncestry -j 10

   .. tab:: Sandbox

      .. code-block:: bash

          cd GDCGenomicsQC/workflow
          gdcgenomicsqc --configfile ../config_ancestry.yaml classifyAncestry -j 10

   .. tab:: Other HPCs

      .. code-block:: bash

          cd GDCGenomicsQC/workflow
          gdcgenomicsqc --configfile ../config_ancestry.yaml classifyAncestry -j 10

   .. tab:: Local Snakemake

      .. code-block:: bash

          cd GDCGenomicsQC/workflow
          snakemake --profile=../profiles/hpc \
              --configfile ../config_ancestry.yaml \
              classifyAncestry \
              -j 10

This trains Random Forest models on reference coordinates and predicts ancestry
probabilities for your samples.

Step 3: Compare Models (PCA vs UMAP)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Modify ``model`` in your config to compare embeddings:

- **PCA** (default): Linear projection, strongest baseline
- **UMAP**: Nonlinear, good for visualization
- **VAE**: Not yet implemented

.. tabs::

   .. tab:: MSI HPC

      First edit your config to set ``model: "umap"``, then:

      .. code-block:: bash

          gdcgenomicsqc --configfile ../config_ancestry.yaml classifyAncestry

   .. tab:: Sandbox

      First edit your config to set ``model: "umap"``, then:

      .. code-block:: bash

          gdcgenomicsqc --configfile ../config_ancestry.yaml classifyAncestry

   .. tab:: Other HPCs

      First edit your config to set ``model: "umap"``, then:

      .. code-block:: bash

          gdcgenomicsqc --configfile ../config_ancestry.yaml classifyAncestry

   .. tab:: Local Snakemake

      First edit your config to set ``model: "umap"``, then:

      .. code-block:: bash

          snakemake --profile=../profiles/hpc \
              --configfile ../config_ancestry.yaml \
              classifyAncestry

Step 4: Ancestry-Specific Subsetting
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The pipeline creates keep files for each predicted ancestry:

.. tabs::

   .. tab:: MSI HPC

      .. code-block:: bash

          gdcgenomicsqc --configfile ../config_ancestry.yaml convertNfilt/CHR=20/subset=EUR

   .. tab:: Sandbox

      .. code-block:: bash

          gdcgenomicsqc --configfile ../config_ancestry.yaml convertNfilt/CHR=20/subset=EUR

   .. tab:: Other HPCs

      .. code-block:: bash

          gdcgenomicsqc --configfile ../config_ancestry.yaml convertNfilt/CHR=20/subset=EUR

   .. tab:: Local Snakemake

      .. code-block:: bash

          snakemake --profile=../profiles/hpc \
              --configfile ../config_ancestry.yaml \
              convertNfilt/CHR=20/subset=EUR

Available subsets are dynamically determined from classification results.

----

Interpreting Pipeline Outputs
----------------------------

Posterior Probabilities
~~~~~~~~~~~~~~~~~~~~~~~

**File**: ``01-globalAncestry/posterior_probabilities.tsv``

Sample output:

+----------+--------+--------+--------+--------+
| IID      | pca_AFR| pca_AMR| pca_EUR| pca_SAS|
+==========+========+========+========+========+
| Sample1  | 0.95   | 0.02   | 0.02   | 0.01   |
+----------+--------+--------+--------+--------+
| Sample2  | 0.05   | 0.10   | 0.83   | 0.02   |
+----------+--------+--------+--------+--------+
| Sample3  | 0.40   | 0.30   | 0.15   | 0.15   |
+----------+--------+--------+--------+--------+

Classifications
~~~~~~~~~~~~~~

**File**: ``01-globalAncestry/ancestry_classifications.tsv``

+----------+------------------+------------------+
| IID      | pca_predicted    | pca_confidence   |
+==========+==================+==================+
| Sample1  | AFR              | 0.95             |
+----------+------------------+------------------+
| Sample2  | EUR              | 0.83             |
+----------+------------------+------------------+
| Sample3  | uncertain        | 0.40             |
+----------+------------------+------------------+

Samples below threshold are labeled "uncertain" or grouped as "Other".

Keep Files
~~~~~~~~~~

PLINK-style files for ancestry-specific analyses:

- ``keep_AFR.txt``, ``keep_EUR.txt``, etc.
- ``keep_Other.txt`` (below threshold)

Visualizations
~~~~~~~~~~~~~~

**Stacked Area Plot**: ``posterior_probability_stacked_pca.svg``

- X-axis: Samples sorted by ancestry proportions
- Y-axis: Stacked posterior probabilities
- Identifies homogeneous and admixed individuals

**Classification Space**: ``images/ancestry_classification_space.svg``

- A visualization of classification using PCA
- Samples in PC space with reference density contours
- Color indicates predicted ancestry

Creating a Confusion Matrix with Reported Race Labels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To evaluate classification performance, you can compare predicted ancestry
labels against reported race/ethnicity data. Provide a tab-separated file
with sample IDs and reported labels:

**Input format** (``reported_race.tsv``):

+----------+-----------+
| IID      | reported  |
+==========+===========+
| Sample1  | AFR       |
+----------+-----------+
| Sample2  | EUR       |
+----------+-----------+
| Sample3  | unknown   |
+----------+-----------+

To generate the confusion matrix, add to your config:

.. code-block:: bash

    ancestry:
        threshold: 0.8
        model: "pca"
        reported_race: "/path/to/reported_race.tsv"

The pipeline will output:

- ``ancestry_confusion_matrix.tsv``: Contingency table of predicted vs. reported
- ``ancestry_confusion_matrix.svg``: Heatmap visualization

**Interpretation notes**:

- Self-reported race is a social construct, not a genetic one—expect imperfect
  concordance due to genetic ancestry not aligning with social categorization
- Admixed individuals may not map cleanly to discrete categories
- Discrepancies can reveal both classification errors and limitations of
  self-reported labels

----

Discussion Points

These questions extend the practical exercise into deeper methodological considerations:

1. **Model comparison**: How do posterior probability distributions differ between
   PCA and UMAP? Does this align with the simulation findings that PCA
   remains the strongest baseline? (VAE not yet available for comparison)

2. **Threshold selection**: What happens to the number of "uncertain" classifications
   as you vary the threshold from 0.6 to 0.95? How does this affect downstream
   sample sizes?

3. **Admixed samples**: Examine samples with mixed ancestry proportions in the
   stacked area plot. Should these be forced into discrete categories, or would
   soft probabilities be more appropriate for covariate adjustment?

4. **Reference panel bias**: How do classifications change if your target
   population differs from the reference panel? What are the implications for
   fairness and validity?

5. **Classification vs. covariates**: For GWAS adjustment, compare results using
   hard ancestry labels versus PCs as continuous covariates. Which approach is
   more appropriate and why?

6. **Confusion and error**: Which ancestry pairs are most frequently confused
   in your data? Is this consistent with the simulation results showing PCA as
   nearly perfect on pure-like samples?

7. **Uncertainty quantification**: The pipeline provides probability estimates.
   How should these be incorporated into downstream analyses? Should low-confidence
   samples be excluded or modeled differently?

For the theoretical foundations behind these methods—including PCA decomposition,
Random Forest ensemble learning, and evaluation metrics—refer to the accompanying
lecture materials.

----

Next Steps
---------

After completing this tutorial, you can:

- :doc:`tutorial_heritability` - Estimate heritability using ancestry-classified samples
- Return to :doc:`tutorial_qc_pipeline` - Run ancestry-specific QC using the keep files

**The ancestry classification outputs enable:**

- Ancestry-specific QC filters (``EUR/standardFilter.pgen``, ``AFR/standardFilter.pgen``, etc.)
- Per-ancestry heritability estimation
- Stratified GWAS analyses

**See also:**

- :doc:`installation` - Software setup (if not already done)
- :doc:`usage` - Running the full pipeline
- :doc:`genomics` - Technical details on ancestry methods