Tutorial: Ancestry Classification in Practice
This tutorial provides hands-on experience running the ancestry classification pipeline in GDCGenomicsQC. For the theoretical background on dimension reduction methods and classification techniques, see the accompanying lecture slides.
Estimated completion time: 30-45 minutes
Learning objectives:
Run the ancestry classification pipeline using Snakemake
Configure different models and thresholds
Interpret pipeline outputs
Apply ancestry-specific subsetting
Prerequisites
Setup:
Before starting, ensure you have access to Snakemake and the GDCGenomicsQC workflow. For detailed installation instructions, see:
Installation - Software setup (module, conda, or other methods)
Usage - Running the pipeline
If you’re using the MSI HPC cluster:
module use /projects/standard/gdc/public/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake
Verify installation:
cd GDCGenomicsQC
snakemake --version
If you’re using the Sandbox environment:
module use /scratch.global/GDC/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake
Verify installation:
cd GDCGenomicsQC
snakemake --version
If your HPC has the GDC module pre-configured:
# Replace with your HPC's module path:
module use /path/to/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake
Verify installation:
cd GDCGenomicsQC
snakemake --version
If you’re using your own Snakemake installation:
conda activate snakemake
cd GDCGenomicsQC
Verify installation:
snakemake --version
Data Requirements:
Reference panel with population labels (see Tutorial: Assembling 1000 Genomes Reference Data)
QC-filtered genotype data (see Tutorial: Quality Control Pipeline in Practice)
DAG Visualization
The pipeline DAG up to the run_pca rule shows the workflow for preparing
the PCA reference and running ancestry classification:
---
title: DAG
---
flowchart TB
id0[kgMeta]
id1[PCAreference]
id2[kgData - CHR: 1]
id3[kgData - CHR: 2]
id4[kgData - CHR: 3]
id5[kgData - CHR: 4]
id6[kgData - CHR: 5]
id7[kgData - CHR: 6]
id8[kgData - CHR: 7]
id9[kgData - CHR: 8]
id10[kgData - CHR: 9]
id11[kgData - CHR: 10]
id12[kgData - CHR: 11]
id13[kgData - CHR: 12]
id14[kgData - CHR: 13]
id15[kgData - CHR: 14]
id16[kgData - CHR: 15]
id17[kgData - CHR: 16]
id18[kgData - CHR: 17]
id19[kgData - CHR: 18]
id20[kgData - CHR: 19]
id21[kgData - CHR: 20]
id22[kgData - CHR: 21]
id23[kgData - CHR: 22]
id24[convertNfilt - CHR: 1 - subset: full]
id25[convertNfilt - CHR: 2 - subset: full]
id26[convertNfilt - CHR: 3 - subset: full]
id27[convertNfilt - CHR: 4 - subset: full]
id28[convertNfilt - CHR: 5 - subset: full]
id29[convertNfilt - CHR: 6 - subset: full]
id30[convertNfilt - CHR: 7 - subset: full]
id31[convertNfilt - CHR: 8 - subset: full]
id32[convertNfilt - CHR: 9 - subset: full]
id33[convertNfilt - CHR: 10 - subset: full]
id34[convertNfilt - CHR: 11 - subset: full]
id35[convertNfilt - CHR: 12 - subset: full]
id36[convertNfilt - CHR: 13 - subset: full]
id37[convertNfilt - CHR: 14 - subset: full]
id38[convertNfilt - CHR: 15 - subset: full]
id39[convertNfilt - CHR: 16 - subset: full]
id40[convertNfilt - CHR: 17 - subset: full]
id41[convertNfilt - CHR: 18 - subset: full]
id42[convertNfilt - CHR: 19 - subset: full]
id43[convertNfilt - CHR: 20 - subset: full]
id44[convertNfilt - CHR: 21 - subset: full]
id45[convertNfilt - CHR: 22 - subset: full]
id46[initialFilter]
id47[kgAssemble]
id48[run_pca]
style id0 fill:#57D95F,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
style id1 fill:#D97657,stroke-width:2px,color:#333333
style id2 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
style id3 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
style id4 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
style id5 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
style id6 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
style id7 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
style id8 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
style id9 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
style id10 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
style id11 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
style id12 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
style id13 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
style id14 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
style id15 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
style id16 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
style id17 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
style id18 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
style id19 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
style id20 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
style id21 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
style id22 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
style id23 fill:#5FD957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
style id24 fill:#ADD957,stroke-width:2px,color:#333333
style id25 fill:#ADD957,stroke-width:2px,color:#333333
style id26 fill:#ADD957,stroke-width:2px,color:#333333
style id27 fill:#ADD957,stroke-width:2px,color:#333333
style id28 fill:#ADD957,stroke-width:2px,color:#333333
style id29 fill:#ADD957,stroke-width:2px,color:#333333
style id30 fill:#ADD957,stroke-width:2px,color:#333333
style id31 fill:#ADD957,stroke-width:2px,color:#333333
style id32 fill:#ADD957,stroke-width:2px,color:#333333
style id33 fill:#ADD957,stroke-width:2px,color:#333333
style id34 fill:#ADD957,stroke-width:2px,color:#333333
style id35 fill:#ADD957,stroke-width:2px,color:#333333
style id36 fill:#ADD957,stroke-width:2px,color:#333333
style id37 fill:#ADD957,stroke-width:2px,color:#333333
style id38 fill:#ADD957,stroke-width:2px,color:#333333
style id39 fill:#ADD957,stroke-width:2px,color:#333333
style id40 fill:#ADD957,stroke-width:2px,color:#333333
style id41 fill:#ADD957,stroke-width:2px,color:#333333
style id42 fill:#ADD957,stroke-width:2px,color:#333333
style id43 fill:#ADD957,stroke-width:2px,color:#333333
style id44 fill:#ADD957,stroke-width:2px,color:#333333
style id45 fill:#ADD957,stroke-width:2px,color:#333333
style id46 fill:#7ED957,stroke-width:2px,color:#333333
style id47 fill:#6ED957,stroke-width:2px,color:#333333,stroke-dasharray: 5 5
style id48 fill:#57C5D9,stroke-width:2px,color:#333333
id46 --> id1
id47 --> id1
id0 --> id24
id0 --> id25
id0 --> id26
id0 --> id27
id0 --> id28
id0 --> id29
id0 --> id30
id0 --> id31
id0 --> id32
id0 --> id33
id0 --> id34
id0 --> id35
id0 --> id36
id0 --> id37
id0 --> id38
id0 --> id39
id0 --> id40
id0 --> id41
id0 --> id42
id0 --> id43
id0 --> id44
id0 --> id45
id0 --> id46
id24 --> id46
id25 --> id46
id26 --> id46
id27 --> id46
id28 --> id46
id29 --> id46
id30 --> id46
id31 --> id46
id32 --> id46
id33 --> id46
id34 --> id46
id35 --> id46
id36 --> id46
id37 --> id46
id38 --> id46
id39 --> id46
id40 --> id46
id41 --> id46
id42 --> id46
id43 --> id46
id44 --> id46
id45 --> id46
id2 --> id47
id3 --> id47
id4 --> id47
id5 --> id47
id6 --> id47
id7 --> id47
id8 --> id47
id9 --> id47
id10 --> id47
id11 --> id47
id12 --> id47
id13 --> id47
id14 --> id47
id15 --> id47
id16 --> id47
id17 --> id47
id18 --> id47
id19 --> id47
id20 --> id47
id21 --> id47
id22 --> id47
id23 --> id47
id0 --> id47
id0 --> id48
id1 --> id48
The rule graph provides a cleaner view of rule dependencies:
---
title: DAG
---
flowchart TB
id0[kgMeta]
id1[PCAreference]
id2[kgData]
id3[convertNfilt]
id4[initialFilter]
id5[kgAssemble]
id6[run_pca]
style id0 fill:#57D95F,stroke-width:2px,color:#333333
style id1 fill:#D97657,stroke-width:2px,color:#333333
style id2 fill:#5FD957,stroke-width:2px,color:#333333
style id3 fill:#ADD957,stroke-width:2px,color:#333333
style id4 fill:#7ED957,stroke-width:2px,color:#333333
style id5 fill:#6ED957,stroke-width:2px,color:#333333
style id6 fill:#57C5D9,stroke-width:2px,color:#333333
id5 --> id1
id4 --> id1
id0 --> id3
id0 --> id4
id3 --> id4
id2 --> id5
id0 --> id5
id1 --> id6
id0 --> id6
Required Input Files
This step requires the following input files:
Input File |
Description |
|---|---|
|
Per-chromosome genotype data (QC-filtered recommended) |
|
Reference panel with population labels (IID, pop, superpop columns) |
|
LD-pruned, unrelated reference genotypes for PCA projection |
|
Initial QC-filtered sample genotypes |
Input from Previous Steps:
The ancestry classification pipeline depends on:
QC Pipeline (tutorial_qc_pipeline): Produces filtered genotype files
Reference Assembly (tutorial_1kg_assembly): Provides reference panel
Config Parameters for Ancestry:
ancestry:
threshold: 0.8 # Minimum posterior probability for classification
model: "pca" # Options: pca, umap, rfmix (vae not yet implemented)
# Optional: reported_race: "/path/to/reported_race.tsv"
INPUT: "/path/to/data/chr{CHR}.vcf.gz"
OUT_DIR: "/path/to/output"
REF: "/path/to/reference"
local-storage-prefix: "/path/to/.snakemake/storage"
chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
See also: Tutorial: Quality Control Pipeline in Practice for QC preprocessing, Tutorial: Assembling 1000 Genomes Reference Data for reference data.
Lab Exercise: Running Ancestry Classification
Step 1: Create Configuration File
For this tutorial, we will classify ancestry using an admix-free resampling of the 1000 Genomes reference panel. This reference was generated using the CT-Sleb tool and is available on Harvard Dataverse. The resampling ensures that only genetically unrelated, ancestry-appropriate samples are included, providing a cleaner reference for classification.
Create a configuration file for ancestry classification:
mkdir -p ~/ancestry_lab
cd ~/ancestry_lab
cat > config_ancestry.yaml << 'EOF'
INPUT: "/path/to/data/chr{CHR}.vcf.gz"
OUT_DIR: "/path/to/output/directory"
REF: "/path/to/reference/data"
local-storage-prefix: "/path/to/.snakemake/storage"
chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
ancestry:
threshold: 0.8
model: "pca" # Options: pca, umap, rfmix (vae not yet implemented)
relatedness:
method: "0"
king_cutoff: 0.0884
localAncestry:
RFMIX: true
test: true
thin_subjects: 0.1
figures: "figures"
chromosomes: null
thin: false
conda-frontend: mamba
EOF
Key parameters:
threshold: Minimum posterior probability for confident classification (default: 0.8)model: Embedding used for classification—pca,umap, orrfmix(Note: VAE is not yet implemented)
Step 2: Run Classification Pipeline
Before classification, the pipeline runs KING to identify related samples. For this tutorial using simulated data, you should not find any related individuals (KING kinship coefficient ≈ 0), which serves as a good sanity check that the simulated data is properly independent.
cd GDCGenomicsQC/workflow
gdcgenomicsqc --configfile ../config_ancestry.yaml classifyAncestry -j 10
cd GDCGenomicsQC/workflow
gdcgenomicsqc --configfile ../config_ancestry.yaml classifyAncestry -j 10
cd GDCGenomicsQC/workflow
gdcgenomicsqc --configfile ../config_ancestry.yaml classifyAncestry -j 10
cd GDCGenomicsQC/workflow
snakemake --profile=../profiles/hpc \
--configfile ../config_ancestry.yaml \
classifyAncestry \
-j 10
This trains Random Forest models on reference coordinates and predicts ancestry probabilities for your samples.
Step 3: Compare Models (PCA vs UMAP)
Modify model in your config to compare embeddings:
PCA (default): Linear projection, strongest baseline
UMAP: Nonlinear, good for visualization
VAE: Not yet implemented
First edit your config to set model: "umap", then:
gdcgenomicsqc --configfile ../config_ancestry.yaml classifyAncestry
First edit your config to set model: "umap", then:
gdcgenomicsqc --configfile ../config_ancestry.yaml classifyAncestry
First edit your config to set model: "umap", then:
gdcgenomicsqc --configfile ../config_ancestry.yaml classifyAncestry
First edit your config to set model: "umap", then:
snakemake --profile=../profiles/hpc \
--configfile ../config_ancestry.yaml \
classifyAncestry
Step 4: Ancestry-Specific Subsetting
The pipeline creates keep files for each predicted ancestry:
gdcgenomicsqc --configfile ../config_ancestry.yaml convertNfilt/CHR=20/subset=EUR
gdcgenomicsqc --configfile ../config_ancestry.yaml convertNfilt/CHR=20/subset=EUR
gdcgenomicsqc --configfile ../config_ancestry.yaml convertNfilt/CHR=20/subset=EUR
snakemake --profile=../profiles/hpc \
--configfile ../config_ancestry.yaml \
convertNfilt/CHR=20/subset=EUR
Available subsets are dynamically determined from classification results.
Interpreting Pipeline Outputs
Posterior Probabilities
File: 01-globalAncestry/posterior_probabilities.tsv
Sample output:
IID |
pca_AFR |
pca_AMR |
pca_EUR |
pca_SAS |
|---|---|---|---|---|
Sample1 |
0.95 |
0.02 |
0.02 |
0.01 |
Sample2 |
0.05 |
0.10 |
0.83 |
0.02 |
Sample3 |
0.40 |
0.30 |
0.15 |
0.15 |
Classifications
File: 01-globalAncestry/ancestry_classifications.tsv
IID |
pca_predicted |
pca_confidence |
|---|---|---|
Sample1 |
AFR |
0.95 |
Sample2 |
EUR |
0.83 |
Sample3 |
uncertain |
0.40 |
Samples below threshold are labeled “uncertain” or grouped as “Other”.
Keep Files
PLINK-style files for ancestry-specific analyses:
keep_AFR.txt,keep_EUR.txt, etc.keep_Other.txt(below threshold)
Visualizations
Stacked Area Plot: posterior_probability_stacked_pca.svg
X-axis: Samples sorted by ancestry proportions
Y-axis: Stacked posterior probabilities
Identifies homogeneous and admixed individuals
Classification Space: images/ancestry_classification_space.svg
A visualization of classification using PCA
Samples in PC space with reference density contours
Color indicates predicted ancestry
Creating a Confusion Matrix with Reported Race Labels
To evaluate classification performance, you can compare predicted ancestry labels against reported race/ethnicity data. Provide a tab-separated file with sample IDs and reported labels:
Input format (reported_race.tsv):
IID |
reported |
|---|---|
Sample1 |
AFR |
Sample2 |
EUR |
Sample3 |
unknown |
To generate the confusion matrix, add to your config:
ancestry:
threshold: 0.8
model: "pca"
reported_race: "/path/to/reported_race.tsv"
The pipeline will output:
ancestry_confusion_matrix.tsv: Contingency table of predicted vs. reportedancestry_confusion_matrix.svg: Heatmap visualization
Interpretation notes:
Self-reported race is a social construct, not a genetic one—expect imperfect concordance due to genetic ancestry not aligning with social categorization
Admixed individuals may not map cleanly to discrete categories
Discrepancies can reveal both classification errors and limitations of self-reported labels
Discussion Points
These questions extend the practical exercise into deeper methodological considerations:
Model comparison: How do posterior probability distributions differ between PCA and UMAP? Does this align with the simulation findings that PCA remains the strongest baseline? (VAE not yet available for comparison)
Threshold selection: What happens to the number of “uncertain” classifications as you vary the threshold from 0.6 to 0.95? How does this affect downstream sample sizes?
Admixed samples: Examine samples with mixed ancestry proportions in the stacked area plot. Should these be forced into discrete categories, or would soft probabilities be more appropriate for covariate adjustment?
Reference panel bias: How do classifications change if your target population differs from the reference panel? What are the implications for fairness and validity?
Classification vs. covariates: For GWAS adjustment, compare results using hard ancestry labels versus PCs as continuous covariates. Which approach is more appropriate and why?
Confusion and error: Which ancestry pairs are most frequently confused in your data? Is this consistent with the simulation results showing PCA as nearly perfect on pure-like samples?
Uncertainty quantification: The pipeline provides probability estimates. How should these be incorporated into downstream analyses? Should low-confidence samples be excluded or modeled differently?
For the theoretical foundations behind these methods—including PCA decomposition, Random Forest ensemble learning, and evaluation metrics—refer to the accompanying lecture materials.
Next Steps
After completing this tutorial, you can:
Tutorial: Heritability Estimation with Multi-Ancestry Simulation - Estimate heritability using ancestry-classified samples
Return to Tutorial: Quality Control Pipeline in Practice - Run ancestry-specific QC using the keep files
The ancestry classification outputs enable:
Ancestry-specific QC filters (
EUR/standardFilter.pgen,AFR/standardFilter.pgen, etc.)Per-ancestry heritability estimation
Stratified GWAS analyses
See also:
Installation - Software setup (if not already done)
Usage - Running the full pipeline
Genomics - Technical details on ancestry methods