Tutorial: Ancestry Classification in Practice

This tutorial provides hands-on experience running the ancestry classification pipeline in GDCGenomicsQC. For the theoretical background on dimension reduction methods and classification techniques, see the accompanying lecture slides.

Estimated completion time: 30-45 minutes

Learning objectives:

Run the ancestry classification pipeline using Snakemake
Configure different models and thresholds
Interpret pipeline outputs
Apply ancestry-specific subsetting

Prerequisites

Setup:

Before starting, ensure you have access to Snakemake and the GDCGenomicsQC workflow. For detailed installation instructions, see:

Installation - Software setup (module, conda, or other methods)
Usage - Running the pipeline

If you’re using the MSI HPC cluster:

module use /projects/standard/gdc/public/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake

Verify installation:

cd GDCGenomicsQC
snakemake --version

If you’re using the Sandbox environment:

module use /scratch.global/GDC/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake

Verify installation:

cd GDCGenomicsQC
snakemake --version

If your HPC has the GDC module pre-configured:

# Replace with your HPC's module path:
module use /path/to/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake

Verify installation:

cd GDCGenomicsQC
snakemake --version

If you’re using your own Snakemake installation:

conda activate snakemake
cd GDCGenomicsQC

Verify installation:

snakemake --version

Data Requirements:

Reference panel with population labels (see Tutorial: Assembling 1000 Genomes Reference Data)
QC-filtered genotype data (see Tutorial: Quality Control Pipeline in Practice)

DAG Visualization

The pipeline DAG up to the run_pca rule shows the workflow for preparing the PCA reference and running ancestry classification:

        ---
title: DAG
---
flowchart TB
	id0[kgMeta]
	id1[PCAreference]
	id2[kgData - CHR: 1]
	id3[kgData - CHR: 2]
	id4[kgData - CHR: 3]
	id5[kgData - CHR: 4]
	id6[kgData - CHR: 5]
	id7[kgData - CHR: 6]
	id8[kgData - CHR: 7]
	id9[kgData - CHR: 8]
	id10[kgData - CHR: 9]
	id11[kgData - CHR: 10]
	id12[kgData - CHR: 11]
	id13[kgData - CHR: 12]
	id14[kgData - CHR: 13]
	id15[kgData - CHR: 14]
	id16[kgData - CHR: 15]
	id17[kgData - CHR: 16]
	id18[kgData - CHR: 17]
	id19[kgData - CHR: 18]
	id20[kgData - CHR: 19]
	id21[kgData - CHR: 20]
	id22[kgData - CHR: 21]
	id23[kgData - CHR: 22]
	id24[convertNfilt - CHR: 1 - subset: full]
	id25[convertNfilt - CHR: 2 - subset: full]
	id26[convertNfilt - CHR: 3 - subset: full]
	id27[convertNfilt - CHR: 4 - subset: full]
	id28[convertNfilt - CHR: 5 - subset: full]
	id29[convertNfilt - CHR: 6 - subset: full]
	id30[convertNfilt - CHR: 7 - subset: full]
	id31[convertNfilt - CHR: 8 - subset: full]
	id32[convertNfilt - CHR: 9 - subset: full]
	id33[convertNfilt - CHR: 10 - subset: full]
	id34[convertNfilt - CHR: 11 - subset: full]
	id35[convertNfilt - CHR: 12 - subset: full]
	id36[convertNfilt - CHR: 13 - subset: full]
	id37[convertNfilt - CHR: 14 - subset: full]
	id38[convertNfilt - CHR: 15 - subset: full]
	id39[convertNfilt - CHR: 16 - subset: full]
	id40[convertNfilt - CHR: 17 - subset: full]
	id41[convertNfilt - CHR: 18 - subset: full]
	id42[convertNfilt - CHR: 19 - subset: full]
	id43[convertNfilt - CHR: 20 - subset: full]
	id44[convertNfilt - CHR: 21 - subset: full]
	id45[convertNfilt - CHR: 22 - subset: full]
	id46[initialFilter]
	id47[kgAssemble]
	id48[run_pca]
	style id0 fill:#57D95F,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id1 fill:#D97657,stroke-width:2px,color:#333333
	style id2 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id3 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id4 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id5 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id6 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id7 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id8 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id9 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id10 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id11 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id12 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id13 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id14 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id15 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id16 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id17 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id18 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id19 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id20 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id21 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id22 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id23 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id24 fill:#ADD957,stroke-width:2px,color:#333333
	style id25 fill:#ADD957,stroke-width:2px,color:#333333
	style id26 fill:#ADD957,stroke-width:2px,color:#333333
	style id27 fill:#ADD957,stroke-width:2px,color:#333333
	style id28 fill:#ADD957,stroke-width:2px,color:#333333
	style id29 fill:#ADD957,stroke-width:2px,color:#333333
	style id30 fill:#ADD957,stroke-width:2px,color:#333333
	style id31 fill:#ADD957,stroke-width:2px,color:#333333
	style id32 fill:#ADD957,stroke-width:2px,color:#333333
	style id33 fill:#ADD957,stroke-width:2px,color:#333333
	style id34 fill:#ADD957,stroke-width:2px,color:#333333
	style id35 fill:#ADD957,stroke-width:2px,color:#333333
	style id36 fill:#ADD957,stroke-width:2px,color:#333333
	style id37 fill:#ADD957,stroke-width:2px,color:#333333
	style id38 fill:#ADD957,stroke-width:2px,color:#333333
	style id39 fill:#ADD957,stroke-width:2px,color:#333333
	style id40 fill:#ADD957,stroke-width:2px,color:#333333
	style id41 fill:#ADD957,stroke-width:2px,color:#333333
	style id42 fill:#ADD957,stroke-width:2px,color:#333333
	style id43 fill:#ADD957,stroke-width:2px,color:#333333
	style id44 fill:#ADD957,stroke-width:2px,color:#333333
	style id45 fill:#ADD957,stroke-width:2px,color:#333333
	style id46 fill:#7ED957,stroke-width:2px,color:#333333
	style id47 fill:#6ED957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
	style id48 fill:#57C5D9,stroke-width:2px,color:#333333
	id46 --> id1
	id47 --> id1
	id0 --> id24
	id0 --> id25
	id0 --> id26
	id0 --> id27
	id0 --> id28
	id0 --> id29
	id0 --> id30
	id0 --> id31
	id0 --> id32
	id0 --> id33
	id0 --> id34
	id0 --> id35
	id0 --> id36
	id0 --> id37
	id0 --> id38
	id0 --> id39
	id0 --> id40
	id0 --> id41
	id0 --> id42
	id0 --> id43
	id0 --> id44
	id0 --> id45
	id0 --> id46
	id24 --> id46
	id25 --> id46
	id26 --> id46
	id27 --> id46
	id28 --> id46
	id29 --> id46
	id30 --> id46
	id31 --> id46
	id32 --> id46
	id33 --> id46
	id34 --> id46
	id35 --> id46
	id36 --> id46
	id37 --> id46
	id38 --> id46
	id39 --> id46
	id40 --> id46
	id41 --> id46
	id42 --> id46
	id43 --> id46
	id44 --> id46
	id45 --> id46
	id2 --> id47
	id3 --> id47
	id4 --> id47
	id5 --> id47
	id6 --> id47
	id7 --> id47
	id8 --> id47
	id9 --> id47
	id10 --> id47
	id11 --> id47
	id12 --> id47
	id13 --> id47
	id14 --> id47
	id15 --> id47
	id16 --> id47
	id17 --> id47
	id18 --> id47
	id19 --> id47
	id20 --> id47
	id21 --> id47
	id22 --> id47
	id23 --> id47
	id0 --> id47
	id0 --> id48
	id1 --> id48

The rule graph provides a cleaner view of rule dependencies:

        ---
title: DAG
---
flowchart TB
	id0[kgMeta]
	id1[PCAreference]
	id2[kgData]
	id3[convertNfilt]
	id4[initialFilter]
	id5[kgAssemble]
	id6[run_pca]
	style id0 fill:#57D95F,stroke-width:2px,color:#333333
	style id1 fill:#D97657,stroke-width:2px,color:#333333
	style id2 fill:#5FD957,stroke-width:2px,color:#333333
	style id3 fill:#ADD957,stroke-width:2px,color:#333333
	style id4 fill:#7ED957,stroke-width:2px,color:#333333
	style id5 fill:#6ED957,stroke-width:2px,color:#333333
	style id6 fill:#57C5D9,stroke-width:2px,color:#333333
	id5 --> id1
	id4 --> id1
	id0 --> id3
	id0 --> id4
	id3 --> id4
	id2 --> id5
	id0 --> id5
	id1 --> id6
	id0 --> id6

Required Input Files

This step requires the following input files:

Ancestry Classification Input Files
Input File	Description
`INPUT: "chr{CHR}.vcf.gz"` (or .bed/.pgen)	Per-chromosome genotype data (QC-filtered recommended)
`REF/1000G_highcoverage/population.txt`	Reference panel with population labels (IID, pop, superpop columns)
`REF/1000G_highcoverage/1000G_highCoveragephased.pruned.pgen`	LD-pruned, unrelated reference genotypes for PCA projection
`OUT_DIR/full/initialFilter.pgen` (or `_CHR.pgen`)	Initial QC-filtered sample genotypes

Input from Previous Steps:

The ancestry classification pipeline depends on:

QC Pipeline (tutorial_qc_pipeline): Produces filtered genotype files
Reference Assembly (tutorial_1kg_assembly): Provides reference panel

Config Parameters for Ancestry:

ancestry:
    threshold: 0.8  # Minimum posterior probability for classification
    model: "pca"    # Options: pca, umap, rfmix (vae not yet implemented)
    # Optional: reported_race: "/path/to/reported_race.tsv"

INPUT: "/path/to/data/chr{CHR}.vcf.gz"
OUT_DIR: "/path/to/output"
REF: "/path/to/reference"
local-storage-prefix: "/path/to/.snakemake/storage"

chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

See also: Tutorial: Quality Control Pipeline in Practice for QC preprocessing, Tutorial: Assembling 1000 Genomes Reference Data for reference data.

Lab Exercise: Running Ancestry Classification

Step 1: Create Configuration File

For this tutorial, we will classify ancestry using an admix-free resampling of the 1000 Genomes reference panel. This reference was generated using the CT-Sleb tool and is available on Harvard Dataverse. The resampling ensures that only genetically unrelated, ancestry-appropriate samples are included, providing a cleaner reference for classification.

Create a configuration file for ancestry classification:

mkdir -p ~/ancestry_lab
cd ~/ancestry_lab
cat > config_ancestry.yaml << 'EOF'
INPUT: "/path/to/data/chr{CHR}.vcf.gz"
OUT_DIR: "/path/to/output/directory"
REF: "/path/to/reference/data"
local-storage-prefix: "/path/to/.snakemake/storage"

chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

ancestry:
    threshold: 0.8
    model: "pca"  # Options: pca, umap, rfmix (vae not yet implemented)

relatedness:
    method: "0"
    king_cutoff: 0.0884

localAncestry:
    RFMIX: true
    test: true
    thin_subjects: 0.1
    figures: "figures"
    chromosomes: null

thin: false
conda-frontend: mamba
EOF

Key parameters:

threshold: Minimum posterior probability for confident classification (default: 0.8)
model: Embedding used for classification—pca, umap, or rfmix (Note: VAE is not yet implemented)

Step 2: Run Classification Pipeline

Before classification, the pipeline runs KING to identify related samples. For this tutorial using simulated data, you should not find any related individuals (KING kinship coefficient ≈ 0), which serves as a good sanity check that the simulated data is properly independent.

cd GDCGenomicsQC/workflow
gdcgenomicsqc --configfile ../config_ancestry.yaml classifyAncestry -j 10

cd GDCGenomicsQC/workflow
gdcgenomicsqc --configfile ../config_ancestry.yaml classifyAncestry -j 10

cd GDCGenomicsQC/workflow
gdcgenomicsqc --configfile ../config_ancestry.yaml classifyAncestry -j 10

cd GDCGenomicsQC/workflow
snakemake --profile=../profiles/hpc \
    --configfile ../config_ancestry.yaml \
    classifyAncestry \
    -j 10

This trains Random Forest models on reference coordinates and predicts ancestry probabilities for your samples.

Step 3: Compare Models (PCA vs UMAP)

Modify model in your config to compare embeddings:

PCA (default): Linear projection, strongest baseline
UMAP: Nonlinear, good for visualization
VAE: Not yet implemented

First edit your config to set model: "umap", then:

gdcgenomicsqc --configfile ../config_ancestry.yaml classifyAncestry

First edit your config to set model: "umap", then:

gdcgenomicsqc --configfile ../config_ancestry.yaml classifyAncestry

First edit your config to set model: "umap", then:

gdcgenomicsqc --configfile ../config_ancestry.yaml classifyAncestry

First edit your config to set model: "umap", then:

snakemake --profile=../profiles/hpc \
    --configfile ../config_ancestry.yaml \
    classifyAncestry

Step 4: Ancestry-Specific Subsetting

The pipeline creates keep files for each predicted ancestry:

gdcgenomicsqc --configfile ../config_ancestry.yaml convertNfilt/CHR=20/subset=EUR

gdcgenomicsqc --configfile ../config_ancestry.yaml convertNfilt/CHR=20/subset=EUR

gdcgenomicsqc --configfile ../config_ancestry.yaml convertNfilt/CHR=20/subset=EUR

snakemake --profile=../profiles/hpc \
    --configfile ../config_ancestry.yaml \
    convertNfilt/CHR=20/subset=EUR

Available subsets are dynamically determined from classification results.

Interpreting Pipeline Outputs

Posterior Probabilities

File: 01-globalAncestry/posterior_probabilities.tsv

Sample output:

IID	pca_AFR	pca_AMR	pca_EUR	pca_SAS
Sample1	0.95	0.02	0.02	0.01
Sample2	0.05	0.10	0.83	0.02
Sample3	0.40	0.30	0.15	0.15

Classifications

File: 01-globalAncestry/ancestry_classifications.tsv

IID	pca_predicted	pca_confidence
Sample1	AFR	0.95
Sample2	EUR	0.83
Sample3	uncertain	0.40

Samples below threshold are labeled “uncertain” or grouped as “Other”.

Keep Files

PLINK-style files for ancestry-specific analyses:

keep_AFR.txt, keep_EUR.txt, etc.
keep_Other.txt (below threshold)

Visualizations

Stacked Area Plot: posterior_probability_stacked_pca.svg

X-axis: Samples sorted by ancestry proportions
Y-axis: Stacked posterior probabilities
Identifies homogeneous and admixed individuals

Classification Space: images/ancestry_classification_space.svg

A visualization of classification using PCA
Samples in PC space with reference density contours
Color indicates predicted ancestry

Creating a Confusion Matrix with Reported Race Labels

To evaluate classification performance, you can compare predicted ancestry labels against reported race/ethnicity data. Provide a tab-separated file with sample IDs and reported labels:

Input format (reported_race.tsv):

IID	reported
Sample1	AFR
Sample2	EUR
Sample3	unknown

To generate the confusion matrix, add to your config:

ancestry:
    threshold: 0.8
    model: "pca"
    reported_race: "/path/to/reported_race.tsv"

The pipeline will output:

ancestry_confusion_matrix.tsv: Contingency table of predicted vs. reported
ancestry_confusion_matrix.svg: Heatmap visualization

Interpretation notes:

Self-reported race is a social construct, not a genetic one—expect imperfect concordance due to genetic ancestry not aligning with social categorization
Admixed individuals may not map cleanly to discrete categories
Discrepancies can reveal both classification errors and limitations of self-reported labels

Discussion Points

These questions extend the practical exercise into deeper methodological considerations:

Model comparison: How do posterior probability distributions differ between PCA and UMAP? Does this align with the simulation findings that PCA remains the strongest baseline? (VAE not yet available for comparison)
Threshold selection: What happens to the number of “uncertain” classifications as you vary the threshold from 0.6 to 0.95? How does this affect downstream sample sizes?
Admixed samples: Examine samples with mixed ancestry proportions in the stacked area plot. Should these be forced into discrete categories, or would soft probabilities be more appropriate for covariate adjustment?
Reference panel bias: How do classifications change if your target population differs from the reference panel? What are the implications for fairness and validity?
Classification vs. covariates: For GWAS adjustment, compare results using hard ancestry labels versus PCs as continuous covariates. Which approach is more appropriate and why?
Confusion and error: Which ancestry pairs are most frequently confused in your data? Is this consistent with the simulation results showing PCA as nearly perfect on pure-like samples?
Uncertainty quantification: The pipeline provides probability estimates. How should these be incorporated into downstream analyses? Should low-confidence samples be excluded or modeled differently?

For the theoretical foundations behind these methods—including PCA decomposition, Random Forest ensemble learning, and evaluation metrics—refer to the accompanying lecture materials.

Next Steps

After completing this tutorial, you can:

Tutorial: Heritability Estimation with Multi-Ancestry Simulation - Estimate heritability using ancestry-classified samples
Return to Tutorial: Quality Control Pipeline in Practice - Run ancestry-specific QC using the keep files

The ancestry classification outputs enable:

Ancestry-specific QC filters (EUR/standardFilter.pgen, AFR/standardFilter.pgen, etc.)
Per-ancestry heritability estimation
Stratified GWAS analyses

See also:

Installation - Software setup (if not already done)
Usage - Running the full pipeline
Genomics - Technical details on ancestry methods