Tutorial: Ancestry Classification in Practice

This tutorial provides hands-on experience running the ancestry classification pipeline in GDCGenomicsQC. For the theoretical background on dimension reduction methods and classification techniques, see the accompanying lecture slides.

Estimated completion time: 30-45 minutes

Learning objectives:

  1. Run the ancestry classification pipeline using Snakemake

  2. Configure different models and thresholds

  3. Interpret pipeline outputs

  4. Apply ancestry-specific subsetting


Prerequisites

Setup:

Before starting, ensure you have access to Snakemake and the GDCGenomicsQC workflow. For detailed installation instructions, see:

  • Installation - Software setup (module, conda, or other methods)

  • Usage - Running the pipeline

If you’re using the MSI HPC cluster:

module use /projects/standard/gdc/public/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake

Verify installation:

cd GDCGenomicsQC
snakemake --version

Data Requirements:

DAG Visualization

The pipeline DAG up to the run_pca rule shows the workflow for preparing the PCA reference and running ancestry classification:

        ---
title: DAG
---
flowchart TB
	id0[kgMeta]
	id1[PCAreference]
	id2[kgData - CHR: 1]
	id3[kgData - CHR: 2]
	id4[kgData - CHR: 3]
	id5[kgData - CHR: 4]
	id6[kgData - CHR: 5]
	id7[kgData - CHR: 6]
	id8[kgData - CHR: 7]
	id9[kgData - CHR: 8]
	id10[kgData - CHR: 9]
	id11[kgData - CHR: 10]
	id12[kgData - CHR: 11]
	id13[kgData - CHR: 12]
	id14[kgData - CHR: 13]
	id15[kgData - CHR: 14]
	id16[kgData - CHR: 15]
	id17[kgData - CHR: 16]
	id18[kgData - CHR: 17]
	id19[kgData - CHR: 18]
	id20[kgData - CHR: 19]
	id21[kgData - CHR: 20]
	id22[kgData - CHR: 21]
	id23[kgData - CHR: 22]
	id24[convertNfilt - CHR: 1 - subset: full]
	id25[convertNfilt - CHR: 2 - subset: full]
	id26[convertNfilt - CHR: 3 - subset: full]
	id27[convertNfilt - CHR: 4 - subset: full]
	id28[convertNfilt - CHR: 5 - subset: full]
	id29[convertNfilt - CHR: 6 - subset: full]
	id30[convertNfilt - CHR: 7 - subset: full]
	id31[convertNfilt - CHR: 8 - subset: full]
	id32[convertNfilt - CHR: 9 - subset: full]
	id33[convertNfilt - CHR: 10 - subset: full]
	id34[convertNfilt - CHR: 11 - subset: full]
	id35[convertNfilt - CHR: 12 - subset: full]
	id36[convertNfilt - CHR: 13 - subset: full]
	id37[convertNfilt - CHR: 14 - subset: full]
	id38[convertNfilt - CHR: 15 - subset: full]
	id39[convertNfilt - CHR: 16 - subset: full]
	id40[convertNfilt - CHR: 17 - subset: full]
	id41[convertNfilt - CHR: 18 - subset: full]
	id42[convertNfilt - CHR: 19 - subset: full]
	id43[convertNfilt - CHR: 20 - subset: full]
	id44[convertNfilt - CHR: 21 - subset: full]
	id45[convertNfilt - CHR: 22 - subset: full]
	id46[initialFilter]
	id47[kgAssemble]
	id48[run_pca]
	style id0 fill:#57D95F,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id1 fill:#D97657,stroke-width:2px,color:#333333
	style id2 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id3 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id4 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id5 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id6 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id7 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id8 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id9 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id10 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id11 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id12 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id13 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id14 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id15 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id16 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id17 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id18 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id19 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id20 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id21 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id22 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id23 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id24 fill:#ADD957,stroke-width:2px,color:#333333
	style id25 fill:#ADD957,stroke-width:2px,color:#333333
	style id26 fill:#ADD957,stroke-width:2px,color:#333333
	style id27 fill:#ADD957,stroke-width:2px,color:#333333
	style id28 fill:#ADD957,stroke-width:2px,color:#333333
	style id29 fill:#ADD957,stroke-width:2px,color:#333333
	style id30 fill:#ADD957,stroke-width:2px,color:#333333
	style id31 fill:#ADD957,stroke-width:2px,color:#333333
	style id32 fill:#ADD957,stroke-width:2px,color:#333333
	style id33 fill:#ADD957,stroke-width:2px,color:#333333
	style id34 fill:#ADD957,stroke-width:2px,color:#333333
	style id35 fill:#ADD957,stroke-width:2px,color:#333333
	style id36 fill:#ADD957,stroke-width:2px,color:#333333
	style id37 fill:#ADD957,stroke-width:2px,color:#333333
	style id38 fill:#ADD957,stroke-width:2px,color:#333333
	style id39 fill:#ADD957,stroke-width:2px,color:#333333
	style id40 fill:#ADD957,stroke-width:2px,color:#333333
	style id41 fill:#ADD957,stroke-width:2px,color:#333333
	style id42 fill:#ADD957,stroke-width:2px,color:#333333
	style id43 fill:#ADD957,stroke-width:2px,color:#333333
	style id44 fill:#ADD957,stroke-width:2px,color:#333333
	style id45 fill:#ADD957,stroke-width:2px,color:#333333
	style id46 fill:#7ED957,stroke-width:2px,color:#333333
	style id47 fill:#6ED957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id48 fill:#57C5D9,stroke-width:2px,color:#333333
	id46 --> id1
	id47 --> id1
	id0 --> id24
	id0 --> id25
	id0 --> id26
	id0 --> id27
	id0 --> id28
	id0 --> id29
	id0 --> id30
	id0 --> id31
	id0 --> id32
	id0 --> id33
	id0 --> id34
	id0 --> id35
	id0 --> id36
	id0 --> id37
	id0 --> id38
	id0 --> id39
	id0 --> id40
	id0 --> id41
	id0 --> id42
	id0 --> id43
	id0 --> id44
	id0 --> id45
	id0 --> id46
	id24 --> id46
	id25 --> id46
	id26 --> id46
	id27 --> id46
	id28 --> id46
	id29 --> id46
	id30 --> id46
	id31 --> id46
	id32 --> id46
	id33 --> id46
	id34 --> id46
	id35 --> id46
	id36 --> id46
	id37 --> id46
	id38 --> id46
	id39 --> id46
	id40 --> id46
	id41 --> id46
	id42 --> id46
	id43 --> id46
	id44 --> id46
	id45 --> id46
	id2 --> id47
	id3 --> id47
	id4 --> id47
	id5 --> id47
	id6 --> id47
	id7 --> id47
	id8 --> id47
	id9 --> id47
	id10 --> id47
	id11 --> id47
	id12 --> id47
	id13 --> id47
	id14 --> id47
	id15 --> id47
	id16 --> id47
	id17 --> id47
	id18 --> id47
	id19 --> id47
	id20 --> id47
	id21 --> id47
	id22 --> id47
	id23 --> id47
	id0 --> id47
	id0 --> id48
	id1 --> id48

    

The rule graph provides a cleaner view of rule dependencies:

        ---
title: DAG
---
flowchart TB
	id0[kgMeta]
	id1[PCAreference]
	id2[kgData]
	id3[convertNfilt]
	id4[initialFilter]
	id5[kgAssemble]
	id6[run_pca]
	style id0 fill:#57D95F,stroke-width:2px,color:#333333
	style id1 fill:#D97657,stroke-width:2px,color:#333333
	style id2 fill:#5FD957,stroke-width:2px,color:#333333
	style id3 fill:#ADD957,stroke-width:2px,color:#333333
	style id4 fill:#7ED957,stroke-width:2px,color:#333333
	style id5 fill:#6ED957,stroke-width:2px,color:#333333
	style id6 fill:#57C5D9,stroke-width:2px,color:#333333
	id5 --> id1
	id4 --> id1
	id0 --> id3
	id0 --> id4
	id3 --> id4
	id2 --> id5
	id0 --> id5
	id1 --> id6
	id0 --> id6

    

Required Input Files

This step requires the following input files:

Ancestry Classification Input Files

Input File

Description

INPUT: "chr{CHR}.vcf.gz" (or .bed/.pgen)

Per-chromosome genotype data (QC-filtered recommended)

REF/1000G_highcoverage/population.txt

Reference panel with population labels (IID, pop, superpop columns)

REF/1000G_highcoverage/1000G_highCoveragephased.pruned.pgen

LD-pruned, unrelated reference genotypes for PCA projection

OUT_DIR/full/initialFilter.pgen (or _CHR.pgen)

Initial QC-filtered sample genotypes

Input from Previous Steps:

The ancestry classification pipeline depends on:

  1. QC Pipeline (tutorial_qc_pipeline): Produces filtered genotype files

  2. Reference Assembly (tutorial_1kg_assembly): Provides reference panel

Config Parameters for Ancestry:

ancestry:
    threshold: 0.8  # Minimum posterior probability for classification
    model: "pca"    # Options: pca, umap, rfmix (vae not yet implemented)
    # Optional: reported_race: "/path/to/reported_race.tsv"

INPUT: "/path/to/data/chr{CHR}.vcf.gz"
OUT_DIR: "/path/to/output"
REF: "/path/to/reference"
local-storage-prefix: "/path/to/.snakemake/storage"

chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

See also: Tutorial: Quality Control Pipeline in Practice for QC preprocessing, Tutorial: Assembling 1000 Genomes Reference Data for reference data.

Lab Exercise: Running Ancestry Classification

Step 1: Create Configuration File

For this tutorial, we will classify ancestry using an admix-free resampling of the 1000 Genomes reference panel. This reference was generated using the CT-Sleb tool and is available on Harvard Dataverse. The resampling ensures that only genetically unrelated, ancestry-appropriate samples are included, providing a cleaner reference for classification.

Create a configuration file for ancestry classification:

mkdir -p ~/ancestry_lab
cd ~/ancestry_lab
cat > config_ancestry.yaml << 'EOF'
INPUT: "/path/to/data/chr{CHR}.vcf.gz"
OUT_DIR: "/path/to/output/directory"
REF: "/path/to/reference/data"
local-storage-prefix: "/path/to/.snakemake/storage"

chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

ancestry:
    threshold: 0.8
    model: "pca"  # Options: pca, umap, rfmix (vae not yet implemented)

relatedness:
    method: "0"
    king_cutoff: 0.0884

localAncestry:
    RFMIX: true
    test: true
    thin_subjects: 0.1
    figures: "figures"
    chromosomes: null

thin: false
conda-frontend: mamba
EOF

Key parameters:

  • threshold: Minimum posterior probability for confident classification (default: 0.8)

  • model: Embedding used for classification—pca, umap, or rfmix (Note: VAE is not yet implemented)

Step 2: Run Classification Pipeline

Before classification, the pipeline runs KING to identify related samples. For this tutorial using simulated data, you should not find any related individuals (KING kinship coefficient ≈ 0), which serves as a good sanity check that the simulated data is properly independent.

cd GDCGenomicsQC/workflow
gdcgenomicsqc --configfile ../config_ancestry.yaml classifyAncestry -j 10

This trains Random Forest models on reference coordinates and predicts ancestry probabilities for your samples.

Step 3: Compare Models (PCA vs UMAP)

Modify model in your config to compare embeddings:

  • PCA (default): Linear projection, strongest baseline

  • UMAP: Nonlinear, good for visualization

  • VAE: Not yet implemented

First edit your config to set model: "umap", then:

gdcgenomicsqc --configfile ../config_ancestry.yaml classifyAncestry

Step 4: Ancestry-Specific Subsetting

The pipeline creates keep files for each predicted ancestry:

gdcgenomicsqc --configfile ../config_ancestry.yaml convertNfilt/CHR=20/subset=EUR

Available subsets are dynamically determined from classification results.


Interpreting Pipeline Outputs

Posterior Probabilities

File: 01-globalAncestry/posterior_probabilities.tsv

Sample output:

IID

pca_AFR

pca_AMR

pca_EUR

pca_SAS

Sample1

0.95

0.02

0.02

0.01

Sample2

0.05

0.10

0.83

0.02

Sample3

0.40

0.30

0.15

0.15

Classifications

File: 01-globalAncestry/ancestry_classifications.tsv

IID

pca_predicted

pca_confidence

Sample1

AFR

0.95

Sample2

EUR

0.83

Sample3

uncertain

0.40

Samples below threshold are labeled “uncertain” or grouped as “Other”.

Keep Files

PLINK-style files for ancestry-specific analyses:

  • keep_AFR.txt, keep_EUR.txt, etc.

  • keep_Other.txt (below threshold)

Visualizations

Stacked Area Plot: posterior_probability_stacked_pca.svg

  • X-axis: Samples sorted by ancestry proportions

  • Y-axis: Stacked posterior probabilities

  • Identifies homogeneous and admixed individuals

Classification Space: images/ancestry_classification_space.svg

  • A visualization of classification using PCA

  • Samples in PC space with reference density contours

  • Color indicates predicted ancestry

Creating a Confusion Matrix with Reported Race Labels

To evaluate classification performance, you can compare predicted ancestry labels against reported race/ethnicity data. Provide a tab-separated file with sample IDs and reported labels:

Input format (reported_race.tsv):

IID

reported

Sample1

AFR

Sample2

EUR

Sample3

unknown

To generate the confusion matrix, add to your config:

ancestry:
    threshold: 0.8
    model: "pca"
    reported_race: "/path/to/reported_race.tsv"

The pipeline will output:

  • ancestry_confusion_matrix.tsv: Contingency table of predicted vs. reported

  • ancestry_confusion_matrix.svg: Heatmap visualization

Interpretation notes:

  • Self-reported race is a social construct, not a genetic one—expect imperfect concordance due to genetic ancestry not aligning with social categorization

  • Admixed individuals may not map cleanly to discrete categories

  • Discrepancies can reveal both classification errors and limitations of self-reported labels


Discussion Points

These questions extend the practical exercise into deeper methodological considerations:

  1. Model comparison: How do posterior probability distributions differ between PCA and UMAP? Does this align with the simulation findings that PCA remains the strongest baseline? (VAE not yet available for comparison)

  2. Threshold selection: What happens to the number of “uncertain” classifications as you vary the threshold from 0.6 to 0.95? How does this affect downstream sample sizes?

  3. Admixed samples: Examine samples with mixed ancestry proportions in the stacked area plot. Should these be forced into discrete categories, or would soft probabilities be more appropriate for covariate adjustment?

  4. Reference panel bias: How do classifications change if your target population differs from the reference panel? What are the implications for fairness and validity?

  5. Classification vs. covariates: For GWAS adjustment, compare results using hard ancestry labels versus PCs as continuous covariates. Which approach is more appropriate and why?

  6. Confusion and error: Which ancestry pairs are most frequently confused in your data? Is this consistent with the simulation results showing PCA as nearly perfect on pure-like samples?

  7. Uncertainty quantification: The pipeline provides probability estimates. How should these be incorporated into downstream analyses? Should low-confidence samples be excluded or modeled differently?

For the theoretical foundations behind these methods—including PCA decomposition, Random Forest ensemble learning, and evaluation metrics—refer to the accompanying lecture materials.


Next Steps

After completing this tutorial, you can:

The ancestry classification outputs enable:

  • Ancestry-specific QC filters (EUR/standardFilter.pgen, AFR/standardFilter.pgen, etc.)

  • Per-ancestry heritability estimation

  • Stratified GWAS analyses

See also:

  • Installation - Software setup (if not already done)

  • Usage - Running the full pipeline

  • Genomics - Technical details on ancestry methods