Tutorial: Assembling 1000 Genomes Reference Data

This tutorial covers the process of downloading, processing, and assembling the 1000 Genomes (1kG) high-coverage reference panel for use in ancestry classification and genetic quality control pipelines.

Estimated completion time: 1-2 hours

Learning objectives:

  1. Understand the data sources and download process for 1kG reference data

  2. Run the meta-data download checkpoint

  3. Execute the VCF download rule for chromosome-level data

  4. Assemble and merge VCF files into PLINK2 format

  5. Apply LD pruning and relatedness filtering


Prerequisites

Setup:

Before starting, ensure you have access to Snakemake and the GDCGenomicsQC workflow. For detailed installation instructions, see:

  • Installation - Software setup (module, conda, or other methods)

  • Usage - Running the pipeline

If you’re using the MSI HPC cluster:

module use /projects/standard/gdc/public/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake

Verify installation:

cd GDCGenomicsQC
snakemake --version

Data Requirements:

  • Sufficient storage (approximately 100GB for reference data)

  • Network access to 1000 Genomes FTP server


Required Input Files

This step downloads data from external sources:

1kG Assembly Input Files

Input Source

Description

https://ftp.1000genomes.ebi.ac.uk/

1000 Genomes FTP server (downloaded by pipeline)

https://ftp.ncbi.nlm.nih.gov/genomes/all/

NCBI reference genome repository

REF/ (output directory)

Local storage for downloaded reference data

Downloaded Files:

The kgMeta checkpoint downloads:

Metadata Files (kgMeta)

File

Description

population.txt

Sample population assignments (2504 samples)

pedigree.txt

Family relationships and phasing information

hg38map.txt

Genetic map for Eagle phasing

hg19ToHg38.over.chain.gz

Chain file for coordinate liftover

Homo_sapiens.GRCh38.dna.primary_assembly.fa

Reference genome FASTA

The kgData rule downloads:

VCF Files (kgData)

File Pattern

Description

1kGP_high_coverage_Illumina.chr{1-22}.filtered.SNV_INDEL_SV_phased_panel.vcf.gz

High-coverage phased VCF per chromosome

.vcf.gz.tbi

Tabix index files

Config Parameters:

REF: "/path/to/reference/storage"  # Output directory for reference data
OUT_DIR: "/path/to/output"
local-storage-prefix: "/path/to/.snakemake/storage"

chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

Output Files:

After assembly, these files are used by other tutorials:

Assembly Output Files

File

Used By

1000G_highCoveragephased.pgen

Ancestry classification, all tutorials

1000G_highCoveragephased.pruned.pgen

PCA projection, ancestry classification

population.txt

All ancestry analysis steps

See also: Tutorial: Ancestry Classification in Practice for using reference data.

Lab Exercise: Assembling 1kG Reference Panel

Step 1: Configure Reference Paths

The reference data pipeline requires a base reference directory. Set this in your configuration file:

mkdir -p ~/reference_lab
cd ~/reference_lab
cat > config_reference.yaml << 'EOF'
REF: "/path/to/reference/storage"
OUT_DIR: "/path/to/output/directory"
local-storage-prefix: "/path/to/.snakemake/storage"

chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

conda-frontend: mamba
EOF

Key paths created by the pipeline:

  • {REF}/1000G_highcoverage/ - Main reference directory

  • {REF}/Homo_sapiens.GRCh38.dna.primary_assembly.fa - Reference genome

Step 2: Download Metadata (kgMeta Checkpoint)

The kgMeta checkpoint downloads essential reference files:

  • population.txt: Sample population assignments (2504 samples)

  • pedigree.txt: Family relationships and phasing information

  • hg38map.txt: Genetic map for Eagle phasing

  • hg19ToHg38.over.chain.gz: Chain file for coordinate liftover

  • Homo_sapiens.GRCh38.dna.primary_assembly.fa: Reference genome FASTA

cd GDCGenomicsQC/workflow
gdcgenomicsqc --configfile ../config_reference.yaml kgMeta -j 4

This checkpoint only needs to run once. The output files are cached for subsequent runs.

Step 3: Download VCF Data (kgData Rule)

The kgData rule downloads phased VCF files for each chromosome from the 1000 Genomes FTP server:

  • Source: https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/

  • Files: 1kGP_high_coverage_Illumina.chr{1-22}.filtered.SNV_INDEL_SV_phased_panel.vcf.gz

  • Index: .vcf.gz.tbi files

gdcgenomicsqc --configfile ../config_reference.yaml kgData -j 22

This rule is parallelized by chromosome. Using -j 22 allows downloading all chromosomes concurrently.

Step 4: Assemble into PLINK2 Format (kgAssemble Rule)

The kgAssemble rule performs the core processing:

  1. Convert VCF to PGEN: Each chromosome VCF is converted to PLINK2 binary genotype format (pgen)

  2. Merge chromosomes: All chromosome files are merged into a single dataset

  3. Reference allele correction: Aligns alleles to the reference genome FASTA

  4. Variant ID standardization: Sets variant IDs to chr#:pos:ref:alt format

  5. LD pruning: Removes linked variants (window: 1000kb, step: 1, r²: 0.1)

  6. Relatedness filtering: Removes related samples (KING cutoff: 0.0884)

gdcgenomicsqc --configfile ../config_reference.yaml kgAssemble -j 8

Output files:


Understanding the Processing Steps

VCF to PGEN Conversion

The pipeline uses PLINK2 for format conversion with several quality filters:

  • --maf 0.05: Remove rare variants (minor allele frequency < 5%)

  • --snps-only just-acgt: Remove indels and non-standard variants

  • --rm-dup force-first: Handle duplicate SNPs by keeping first occurrence

Merging Strategy

Chromosomes are processed independently then merged using plink2 --pmerge-list. This approach:

  • Reduces memory requirements during processing

  • Allows parallel chromosome conversion

  • Creates a single merged dataset for downstream analysis

Relatedness Filtering

The KING kinship coefficient cutoff of 0.0884 corresponds to second-degree relationships (equivalent to grandparent-grandchild or half-siblings). This ensures:

  • Reference panel contains only unrelated individuals

  • PCA and classification are not biased by family structure

  • Downstream analyses assume sample independence


Pipeline Outputs

Population Assignments

File: 1000G_highcoverage/population.txt

SampleID

Population

SuperPop

HG00096

GBR

EUR

NA18498

YRI

AFR

NA12878

CEU

EUR

Sample file includes both population (pop) and super-population (superpop) labels for flexible grouping.

LD-Pruned Reference

The pruned dataset contains:

  • ~500,000 independent variants (after LD pruning)

  • ~2,500 samples (after relatedness filtering)

  • Standardized variant IDs for compatibility with downstream pipelines


Discussion Points

  1. Data source selection: Why use the high-coverage 2022 release rather than the original 1000 Genomes Phase 3? What are the trade-offs in sample size vs. coverage depth?

  2. Alternative reference panels: How would you adapt this pipeline for the TOPMed reference or the Human Genome Diversity Project (HGDP)? What preprocessing steps would change?

  3. Updating the reference: The 1000 Genomes Project is periodically updated. How would you modify this pipeline to incorporate new releases while maintaining backwards compatibility?

  4. Storage considerations: The reference data requires ~100GB. What strategies could reduce storage requirements (e.g., compression, selective chromosome downloading)?

  5. Computational resources: The kgAssemble rule uses significant memory (32GB) and CPU (8 threads). How do these requirements scale with additional chromosomes or sample sizes?

  6. Quality control: What additional QC steps could be added to the assembly process? Consider variant-level filters (missingness, HWE) and sample-level filters (call rate, heterozygosity).

For more information on using this reference data for ancestry classification, see Tutorial: Ancestry Classification in Practice.


Next Steps

After completing this tutorial, proceed to:

The reference panel enables:

  • PCA projection of study samples onto reference space

  • Random Forest ancestry classification

  • Ancestry-specific QC filtering

See also:

  • Installation - Software setup (if not already done)

  • Usage - Running the full pipeline

  • Genomics - Technical details on reference-based methods