Tutorial: Heritability Estimation with Multi-Ancestry Simulation
This tutorial covers simulating phenotypes across multiple ancestries and estimating SNP heritability using the GDCGenomicsQC pipeline.
Estimated completion time: 2-3 hours
Learning objectives:
Configure phenotype simulation for two ancestry groups
Simulate phenotypes with controlled heritability and cross-ancestry genetic correlation
Run SNP heritability estimation using PC-relate
Compare heritability estimates across ancestries
Prerequisites
Setup:
Before starting, ensure you have access to Snakemake and the GDCGenomicsQC workflow. For detailed installation instructions, see:
Installation - Software setup (module, conda, or other methods)
Usage - Running the pipeline
If you’re using the MSI HPC cluster:
module use /projects/standard/gdc/public/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake
Verify installation:
cd GDCGenomicsQC
snakemake --version
If you’re using the Sandbox environment:
module use /scratch.global/GDC/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake
Verify installation:
cd GDCGenomicsQC
snakemake --version
If your HPC has the GDC module pre-configured:
# Replace with your HPC's module path:
module use /path/to/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake
Verify installation:
cd GDCGenomicsQC
snakemake --version
If you’re using your own Snakemake installation:
conda activate snakemake
cd GDCGenomicsQC
Verify installation:
snakemake --version
Data Requirements:
Completed Tutorial: Assembling 1000 Genomes Reference Data (reference data assembled)
Completed Tutorial: Ancestry Classification in Practice (samples classified)
Access to an HPC cluster with sufficient memory (32GB+ recommended)
Required Input Files
This step requires the following input files:
Input File |
Description |
|---|---|
|
Reference panel genotypes (from Tutorial: Assembling 1000 Genomes Reference Data) |
|
Reference population labels |
|
Sample genotypes per ancestry (from Tutorial: Quality Control Pipeline in Practice) |
|
Phenotype values (format: IID, pheno1, pheno2…) |
|
Covariates (format: IID, cov1, cov2…) |
Input from Previous Tutorials:
tutorial_1kg_assembly: Provides reference panel genotypes
tutorial_qc_pipeline: Provides QC-filtered sample genotypes
tutorial_ancestry_classification: Provides ancestry labels for samples
Simulation Input Parameters:
phenotypeSimulation:
ancestries: ["AFR", "EUR"] # Two ancestry groups to simulate
n_sims: 10 # Number of phenotype simulations
heritability: 0.4 # SNP heritability (h²)
rho: 0.8 # Cross-ancestry genetic correlation
maf: 0.05 # Minor allele frequency threshold
seed: 42 # Random seed for reproducibility
skip_thinning: true # Skip SNP thinning
thin_count_snps: 1000000 # SNPs to thin to
thin_count_inds: 10000 # Individuals to thin to
Heritability Estimation Config:
snpHerit:
pheno: "/path/to/phenotype.tsv" # Phenotype file (IID, pheno)
covar: "/path/to/covariates.tsv" # Optional covariates
method: "AdjHE" # Estimation method
npc: 10 # Number of PCs to include
out: "heritability_estimates.txt" # Output file
mpheno: 1 # Phenotype column number or name
fixed_effects: null # List of fixed effect covariate names
qcovar: null # Quantitative covariate names (for GCTA)
covar_discrete: null # Discrete covariate names (for GCTA)
Output Files:
File |
Description |
|---|---|
|
Heritability estimates per ancestry |
|
Simulation results per ancestry |
See also: Genomics for heritability methodology, Tutorial: Quality Control Pipeline in Practice for QC pipeline.
Lab Exercise: Multi-Ancestry Heritability Estimation
Step 1: Configure Phenotype Simulation
The simulation uses the phenotypeSimulation config section to define:
Two ancestry groups to simulate
Number of simulations
Heritability (h²) for each ancestry
Cross-ancestry genetic correlation (ρ)
Minor allele frequency threshold
mkdir -p ~/heritability_lab
cd ~/heritability_lab
cat > config_heritability.yaml << 'EOF'
INPUT: "/path/to/data/chr{CHR}.vcf.gz"
REF: "/path/to/reference/storage"
OUT_DIR: "/path/to/output/directory"
local-storage-prefix: "/path/to/.snakemake/storage"
chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
phenotypeSimulation:
ancestries: ["AFR", "EUR"]
n_sims: 10
heritability: 0.4
rho: 0.8
maf: 0.05
seed: 42
skip_thinning: true
thin_count_snps: 1000000
thin_count_inds: 10000
snpHerit:
pheno: "/path/to/phenotype.tsv"
covar: "/path/to/covariates.tsv"
method: "AdjHE"
npc: 10
out: "heritability_estimates.txt"
mpheno: 1
fixed_effects: null
qcovar: null
covar_discrete: null
conda-frontend: mamba
EOF
Key parameters:
Step 2: Run Phenotype Simulation
The simulation rule:
Loads genotype data for both ancestries
Samples SNPs and individuals for simulation
Generates phenotypes with specified heritability
Creates PLINK .bed/.bim/.fam files for each ancestry
cd GDCGenomicsQC/workflow
gdcgenomicsqc --configfile ../config_heritability.yaml simulatePhenotype -j 4
cd GDCGenomicsQC/workflow
gdcgenomicsqc --configfile ../config_heritability.yaml simulatePhenotype -j 4
cd GDCGenomicsQC/workflow
gdcgenomicsqc --configfile ../config_heritability.yaml simulatePhenotype -j 4
cd GDCGenomicsQC/workflow
snakemake --profile=../profiles/hpc \
--configfile ../config_heritability.yaml \
simulatePhenotype \
-j 4
Output directory structure:
simulations/AFR_EUR/
├── AFR_simulation.bed
├── AFR_simulation.bim
├── AFR_simulation.fam
├── EUR_simulation.bed
├── EUR_simulation.bim
└── EUR_simulation.fam
Step 3: Run SNP Heritability Estimation
The snpHerit rule uses PC-relate to estimate SNP heritability:
Compute PCA on unrelated samples
Estimate kinship matrix using PC-relate
Fit mixed model using REML or method of moments
gdcgenomicsqc --configfile ../config_heritability.yaml snpHerit -j 4
gdcgenomicsqc --configfile ../config_heritability.yaml snpHerit -j 4
gdcgenomicsqc --configfile ../config_heritability.yaml snpHerit -j 4
snakemake --profile=../profiles/hpc \
--configfile ../config_heritability.yaml \
snpHerit \
-j 4
Heritability Configuration Options
Interpreting Results
Heritability Estimates Output
File: 03-snpHeritability/heritability_estimates.txt
Sample output:
Ancestry |
h2 |
SE |
|---|---|---|
AFR |
0.38 |
0.05 |
EUR |
0.42 |
0.04 |
Expected Results
Given simulation parameters:
True h² = 0.4 (specified)
Expected estimates: 0.35-0.45 (within sampling error)
Cross-ancestry ρ = 0.8
Differences between ancestries may reflect:
LD score differences
Sample size variation
Genetic architecture heterogeneity
Comparison Across Simulations
With multiple simulations (n_sims: 10), you can analyze:
Distribution of heritability estimates
Standard error of estimates
Bias in estimation method
Simulation Parameters: What to Explore
Vary these parameters to understand the methods:
Heritability: Test h² = 0.1, 0.3, 0.5, 0.7 - How does estimation accuracy change?
Cross-ancestry correlation: Test ρ = 0.3, 0.5, 0.8, 1.0 - What happens when ρ = 1 (identical genetic architecture)?
Sample size: Vary
thin_count_inds- How does precision improve with more samples?SNP density: Vary
thin_count_snps- Effect of SNP count on heritability estimatesMAF threshold: Test maf = 0.01, 0.05, 0.10 - Impact of rare variant inclusion
Discussion Points
Estimation bias: How close are the estimated h² values to the true simulated value (0.4)? What factors contribute to bias?
Method comparison: Compare REML vs. method of moments estimates. Which is more accurate? More precise?
Cross-ancestry correlation: When ρ < 1, what does this imply about genetic architecture differences? How does this affect meta-analysis?
Sample size effects: How do standard errors change with sample size? Is there a point of diminishing returns?
PC correction: How many PCs are optimal for controlling population structure? What happens with too few or too many?
Heritability heterogeneity: Why might h² differ between AFR and EUR even when simulated with the same true value?
For the theoretical foundations of SNP heritability estimation—including PC-relate methodology, REML estimation, and genetic correlation—refer to the accompanying lecture materials. For using the reference panel, see Tutorial: Assembling 1000 Genomes Reference Data.
Next Steps
After completing this tutorial, you have explored:
Phenotype simulation with controlled heritability
SNP heritability estimation using PC-relate
Cross-ancestry genetic correlation
Further analyses to consider:
GWAS on simulated phenotypes with known true effects
Compare heritability estimates across different ancestry groups
Test different REML/MOM methods and PC covariates
See also:
Installation - Software setup (if not already done)
Tutorial: Quality Control Pipeline in Practice - QC pipeline
Tutorial: Ancestry Classification in Practice - Ancestry classification
Genomics - Technical details on heritability methods