Tutorial: Assembling 1000 Genomes Reference Data

This tutorial covers the process of downloading, processing, and assembling the 1000 Genomes (1kG) high-coverage reference panel for use in ancestry classification and genetic quality control pipelines.

Estimated completion time: 1-2 hours

Learning objectives:

Understand the data sources and download process for 1kG reference data
Run the metadata download checkpoint
Execute the VCF download rule for chromosome-level data
Assemble and merge VCF files into PLINK2 format
Apply LD pruning and relatedness filtering

Prerequisites

Setup:

Before starting, ensure you have access to Snakemake and the GDCGenomicsQC workflow. For detailed installation instructions, see:

Installation - Software setup (module, conda, or other methods)
Usage - Running the pipeline

If you’re using the MSI HPC cluster:

module use /projects/standard/gdc/public/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake

Verify installation:

cd GDCGenomicsQC
snakemake --version

If you’re using the Sandbox environment:

module use /scratch.global/GDC/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake

Verify installation:

cd GDCGenomicsQC
snakemake --version

If your HPC has the GDC module pre-configured:

# Replace with your HPC's module path:
module use /path/to/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake

Verify installation:

cd GDCGenomicsQC
snakemake --version

If you’re using your own Snakemake installation:

conda activate snakemake
cd GDCGenomicsQC

Verify installation:

snakemake --version

Data Requirements:

Sufficient storage (approximately 100GB for reference data)
Network access to 1000 Genomes FTP server
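
Both requirements can be checked before starting any download. The following is a minimal sketch, not part of the GDCGenomicsQC workflow: the helper names `free_gb`, `has_free_gb`, and `can_reach` are hypothetical, and the 100 GB figure is the approximate requirement stated above.

```shell
#!/bin/sh
# Preflight sketch: check free space and server reachability before downloading.

# free_gb DIR: print the whole gigabytes available on the filesystem holding DIR.
free_gb() {
    # df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is "Available".
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# has_free_gb DIR GB: succeed if DIR has at least GB gigabytes free.
has_free_gb() {
    [ "$(free_gb "$1")" -ge "$2" ]
}

# can_reach URL: succeed if an HTTP HEAD request to URL completes within 20 seconds.
can_reach() {
    curl --silent --head --fail --max-time 20 "$1" > /dev/null
}

# Example (REF_DIR is a placeholder for the REF path in your config):
# REF_DIR=/path/to/reference/storage
# has_free_gb "$REF_DIR" 100 || echo "Less than 100 GB free in $REF_DIR"
# can_reach https://ftp.1000genomes.ebi.ac.uk/ || echo "Cannot reach the 1000 Genomes server"
```

On a cluster, run the reachability check from the node type that will execute the download jobs, since login and compute nodes may have different network access.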

Required Input Files

This step downloads data from external sources:

1kG Assembly Input Files
Input Source	Description
`https://ftp.1000genomes.ebi.ac.uk/`	1000 Genomes FTP server (downloaded by pipeline)
`https://ftp.ncbi.nlm.nih.gov/genomes/all/`	NCBI reference genome repository
`REF/` (output directory)	Local storage for downloaded reference data

Downloaded Files:

The kgMeta checkpoint downloads:

Metadata Files (kgMeta)
File	Description
`population.txt`	Sample population assignments (2504 samples)
`pedigree.txt`	Family relationships and phasing information
`hg38map.txt`	Genetic map for Eagle phasing
`hg19ToHg38.over.chain.gz`	Chain file for coordinate liftover
`Homo_sapiens.GRCh38.dna.primary_assembly.fa`	Reference genome FASTA

The kgData rule downloads:

VCF Files (kgData)
File Pattern	Description
`1kGP_high_coverage_Illumina.chr{1-22}.filtered.SNV_INDEL_SV_phased_panel.vcf.gz`	High-coverage phased VCF per chromosome
`.vcf.gz.tbi`	Tabix index files

Config Parameters:

REF: "/path/to/reference/storage"  # Output directory for reference data
OUT_DIR: "/path/to/output"
local-storage-prefix: "/path/to/.snakemake/storage"

chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

Output Files:

After assembly, these files are used by other tutorials:

Assembly Output Files
File	Used By
`1000G_highCoveragephased.pgen`	Ancestry classification, all tutorials
`1000G_highCoveragephased.pruned.pgen`	PCA projection, ancestry classification
`population.txt`	All ancestry analysis steps

See also: Tutorial: Ancestry Classification in Practice for using reference data.

Lab Exercise: Assembling 1kG Reference Panel

Step 1: Configure Reference Paths

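The reference data pipeline requires a base reference directory. Set it in a configuration file. The commands in the following steps pass --configfile ../config_reference.yaml from GDCGenomicsQC/workflow, so either save the file in the GDCGenomicsQC directory or adjust that path to wherever you create it (the example below writes ~/reference_lab/config_reference.yaml):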

mkdir -p ~/reference_lab
cd ~/reference_lab
cat > config_reference.yaml << 'EOF'
REF: "/path/to/reference/storage"
OUT_DIR: "/path/to/output/directory"
local-storage-prefix: "/path/to/.snakemake/storage"

chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

conda-frontend: mamba
EOF

Key paths created by the pipeline:

{REF}/1000G_highcoverage/ - Main reference directory
{REF}/Homo_sapiens.GRCh38.dna.primary_assembly.fa - Reference genome
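
A quick check that the configuration file defines the expected keys can catch typos before Snakemake starts. This is a sketch under the assumption that the four keys shown under Config Parameters are all needed at the top level of the YAML file; `missing_config_keys` is a hypothetical helper, and a plain `grep` is used rather than a YAML parser.

```shell
#!/bin/sh
# missing_config_keys FILE: print each expected top-level key that is absent from FILE.
missing_config_keys() {
    for key in REF OUT_DIR local-storage-prefix chromosomes; do
        # A top-level YAML key starts in column 1 and is followed by a colon.
        grep -q "^${key}:" "$1" || echo "$key"
    done
}

# Example:
# missing_config_keys ~/reference_lab/config_reference.yaml
```

No output means every key was found. The check does not validate the values, so placeholder paths such as /path/to/reference/storage still need to be replaced by hand.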

Step 2: Download Metadata (kgMeta Checkpoint)

The kgMeta checkpoint downloads essential reference files:

population.txt: Sample population assignments (2504 samples)
pedigree.txt: Family relationships and phasing information
hg38map.txt: Genetic map for Eagle phasing
hg19ToHg38.over.chain.gz: Chain file for coordinate liftover
Homo_sapiens.GRCh38.dna.primary_assembly.fa: Reference genome FASTA

With the gdcgenomicsqc module (MSI, Sandbox, or a pre-configured HPC):

cd GDCGenomicsQC/workflow
gdcgenomicsqc --configfile ../config_reference.yaml kgMeta -j 4

With your own Snakemake installation:

cd GDCGenomicsQC/workflow
snakemake --profile=../profiles/hpc \
    --configfile ../config_reference.yaml \
    kgMeta \
    -j 4

This checkpoint only needs to run once. The output files are cached for subsequent runs.
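
To confirm that the checkpoint produced all five files, one option is to search the reference directory for each name. The sketch below assumes only that the files land somewhere under the REF directory; it does not rely on the exact subdirectory layout, and `missing_meta_files` is a hypothetical helper.

```shell
#!/bin/sh
# missing_meta_files REF_DIR: print each kgMeta file not found anywhere under REF_DIR.
missing_meta_files() {
    for name in population.txt pedigree.txt hg38map.txt \
                hg19ToHg38.over.chain.gz \
                Homo_sapiens.GRCh38.dna.primary_assembly.fa; do
        # find searches recursively, so the subdirectory does not matter here.
        if [ -z "$(find "$1" -name "$name" -print 2>/dev/null | head -n 1)" ]; then
            echo "$name"
        fi
    done
}

# Example (replace the path with the REF value from your config):
# missing_meta_files /path/to/reference/storage
```

An empty result means all files were found. The check looks at names only, so it will not detect a truncated download.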

Step 3: Download VCF Data (kgData Rule)

The kgData rule downloads phased VCF files for each chromosome from the 1000 Genomes FTP server:

Source: https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/
Files: 1kGP_high_coverage_Illumina.chr{1-22}.filtered.SNV_INDEL_SV_phased_panel.vcf.gz
Index: .vcf.gz.tbi files

With the gdcgenomicsqc module (MSI, Sandbox, or a pre-configured HPC):

gdcgenomicsqc --configfile ../config_reference.yaml kgData -j 22

With your own Snakemake installation:

snakemake --profile=../profiles/hpc \
    --configfile ../config_reference.yaml \
    kgData \
    -j 22

This rule is parallelized by chromosome. Using -j 22 allows downloading all chromosomes concurrently.
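
Because 22 downloads run in parallel, it is worth confirming afterwards that every chromosome has both a VCF and an index. The sketch below builds the expected file names from the pattern given above and reports any that are missing or empty. The download directory is passed as an argument because its exact location is an assumption here; `kg_vcf_name` and `missing_vcfs` are hypothetical helpers.

```shell
#!/bin/sh
# kg_vcf_name CHR: print the expected VCF file name for chromosome CHR.
kg_vcf_name() {
    echo "1kGP_high_coverage_Illumina.chr$1.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"
}

# missing_vcfs DIR: print every VCF or .tbi index for chr1..chr22
# that is absent from DIR or has zero size.
missing_vcfs() {
    chr=1
    while [ "$chr" -le 22 ]; do
        vcf=$(kg_vcf_name "$chr")
        [ -s "$1/$vcf" ] || echo "$vcf"
        [ -s "$1/$vcf.tbi" ] || echo "$vcf.tbi"
        chr=$((chr + 1))
    done
}

# Example (the directory is a placeholder for wherever kgData stores the VCFs):
# missing_vcfs /path/to/reference/storage/1000G_highcoverage
```

If anything is listed, rerunning the kgData target should fetch only the missing files, since Snakemake skips outputs that already exist.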

Step 4: Assemble into PLINK2 Format (kgAssemble Rule)

The kgAssemble rule performs the core processing:

Convert VCF to PGEN: Each chromosome VCF is converted to PLINK2 binary genotype format (pgen)
Merge chromosomes: All chromosome files are merged into a single dataset
Reference allele correction: Aligns alleles to the reference genome FASTA
Variant ID standardization: Sets variant IDs to chr#:pos:ref:alt format
LD pruning: Removes linked variants (window: 1000kb, step: 1, r²: 0.1)
Relatedness filtering: Removes related samples (KING cutoff: 0.0884)
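
For orientation, the last four steps in this list could look roughly like the PLINK2 calls below. This is a sketch under stated assumptions, not the rule's actual implementation: the function names and file prefixes are hypothetical, the ID template assumes the chromosome codes already carry the chr prefix (as in the GRCh38 1kG VCFs), and the order of operations inside the real rule may differ. Commands are printed rather than executed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# run CMD...: print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# fix_ref IN FASTA OUT: set REF alleles from the reference genome FASTA.
fix_ref() {
    run plink2 --pfile "$1" --fa "$2" --ref-from-fa force --make-pgen --out "$3"
}

# set_ids IN OUT: rename variants to chromosome:position:ref:alt.
# In the template, @ is the chromosome code, # the position, $r and $a the alleles.
set_ids() {
    run plink2 --pfile "$1" --set-all-var-ids '@:#:$r:$a' --make-pgen --out "$2"
}

# ld_prune IN OUT: select variants with a 1000 kb window, step 1, r^2 threshold 0.1,
# then write a fileset restricted to the retained variants.
ld_prune() {
    run plink2 --pfile "$1" --indep-pairwise 1000kb 1 0.1 --out "$2"
    run plink2 --pfile "$1" --extract "$2.prune.in" --make-pgen --out "$2"
}

# drop_related IN OUT: remove one sample from each pair with KING kinship above 0.0884.
drop_related() {
    run plink2 --pfile "$1" --king-cutoff 0.0884 --make-pgen --out "$2"
}

# Example (prints four plink2 command lines; prefixes are placeholders):
# ld_prune merged merged.pruned
# drop_related merged.pruned merged.pruned.unrelated
```

Printing the commands first makes it easy to compare them against the log of the actual kgAssemble run before executing anything by hand.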

With the gdcgenomicsqc module (MSI, Sandbox, or a pre-configured HPC):

gdcgenomicsqc --configfile ../config_reference.yaml kgAssemble -j 8

With your own Snakemake installation:

snakemake --profile=../profiles/hpc \
    --configfile ../config_reference.yaml \
    kgAssemble \
    -j 8

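Output files:

1000G_highCoveragephased.pgen - Merged reference panel
1000G_highCoveragephased.pruned.pgen - LD-pruned reference panel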

Understanding the Processing Steps

VCF to PGEN Conversion

The pipeline uses PLINK2 for format conversion with several quality filters:

--maf 0.05: Remove variants with a minor allele frequency below 5%
--snps-only just-acgt: Keep only SNPs whose alleles are A, C, G, or T (drops indels and other non-standard variants)
--rm-dup force-first: When several variants share the same ID, keep only the first occurrence
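
Put together, a per-chromosome conversion with these three filters might be invoked as in the sketch below. The wrapper `convert_chr` and the file names are hypothetical, and the actual rule may pass additional options. The command is printed rather than executed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# run CMD...: print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# convert_chr VCF OUT_PREFIX: convert one chromosome VCF to a filtered PGEN fileset
# (OUT_PREFIX.pgen, OUT_PREFIX.pvar, OUT_PREFIX.psam).
convert_chr() {
    run plink2 --vcf "$1" \
        --maf 0.05 \
        --snps-only just-acgt \
        --rm-dup force-first \
        --make-pgen \
        --out "$2"
}

# Example (paths are placeholders):
# convert_chr 1kGP_high_coverage_Illumina.chr22.filtered.SNV_INDEL_SV_phased_panel.vcf.gz pgen/chr22
```

Applying the filters during conversion keeps each per-chromosome fileset small, which is what makes the later merge inexpensive.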

Merging Strategy

Chromosomes are processed independently then merged using plink2 --pmerge-list. This approach:

Reduces memory requirements during processing
Allows parallel chromosome conversion
Creates a single merged dataset for downstream analysis
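
plink2 --pmerge-list reads a text file naming one fileset per line. A minimal sketch of building that list for 22 per-chromosome filesets follows. The chrN prefix naming and the helper `write_merge_list` are assumptions made for illustration, and the merge command in the comment should be checked against your PLINK2 build, since merge support has changed between releases.

```shell
#!/bin/sh
# write_merge_list DIR LIST: write the fileset prefixes DIR/chr1 .. DIR/chr22,
# one per line, to the file LIST.
write_merge_list() {
    : > "$2"                      # create or truncate the list file
    chr=1
    while [ "$chr" -le 22 ]; do
        echo "$1/chr$chr" >> "$2"
        chr=$((chr + 1))
    done
}

# Example (paths are placeholders):
# write_merge_list pgen merge_list.txt
# plink2 --pmerge-list merge_list.txt pfile --out merged
```

Writing the list in chromosome order is not required by the merge, but it makes the file easier to inspect when a chromosome is missing.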

Relatedness Filtering

The KING kinship coefficient cutoff of 0.0884 is the lower bound of the second-degree range (for example grandparent-grandchild or half-siblings), so one sample from every pair related at second degree or closer is removed. This ensures:

Reference panel contains only unrelated individuals
PCA and classification are not biased by family structure
Downstream analyses assume sample independence
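
The value 0.0884 is not arbitrary. The expected kinship coefficient for relatives of degree d is 2^-(d+1) (0.25 for first degree, 0.125 for second, 0.0625 for third), and the conventional KING boundary between degree d and degree d+1 is the geometric mean of the two neighbouring values, 2^-(d+1.5). The short sketch below reproduces the boundaries; `king_lower_bound` is a hypothetical helper name.

```shell
#!/bin/sh
# king_lower_bound DEGREE: print the kinship value separating DEGREE from DEGREE+1,
# computed as 2^-(DEGREE + 1.5) and rounded to four decimals.
king_lower_bound() {
    awk -v d="$1" 'BEGIN { printf "%.4f\n", exp(-(d + 1.5) * log(2)) }'
}

king_lower_bound 1   # 0.1768: first-degree vs second-degree
king_lower_bound 2   # 0.0884: second-degree vs third-degree
king_lower_bound 3   # 0.0442: third-degree vs more distant
```

Choosing 0.1768 instead would keep second-degree relatives in the panel, while 0.0442 would also remove third-degree relatives such as first cousins.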

Pipeline Outputs

Population Assignments

File: 1000G_highcoverage/population.txt

SampleID	Population	SuperPop
HG00096	GBR	EUR
NA18498	YRI	AFR
NA12878	CEU	EUR

The sample file includes both population (Population column) and super-population (SuperPop column) labels for flexible grouping.
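
A quick tally per super-population is a useful sanity check on this file. The sketch below assumes the layout shown in the excerpt: one header line, tab-separated columns, and the super-population label in the third column. `count_by_superpop` is a hypothetical helper.

```shell
#!/bin/sh
# count_by_superpop FILE: print "SuperPop<TAB>count" lines, sorted by label,
# for a tab-separated file with a header row and the label in column 3.
count_by_superpop() {
    awk -F '\t' 'NR > 1 { n[$3]++ }
                 END { for (s in n) printf "%s\t%d\n", s, n[s] }' "$1" | sort
}

# Example:
# count_by_superpop 1000G_highcoverage/population.txt
```

Applied to the three example rows above, this would report one AFR sample and two EUR samples.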

LD-Pruned Reference

The pruned dataset contains:

~500,000 independent variants (after LD pruning)
~2,500 samples (after relatedness filtering)
Standardized variant IDs for compatibility with downstream pipelines
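
The approximate counts above can be compared with the actual files. In a PLINK2 fileset the .pvar file holds one line per variant and the .psam file one line per sample, with header lines starting with #. The sketch assumes both files are uncompressed plain text and share the prefix of the .pgen file; `count_records` is a hypothetical helper.

```shell
#!/bin/sh
# count_records FILE: count the non-header lines (those not starting with '#')
# in a plain-text .pvar or .psam file.
count_records() {
    grep -vc '^#' "$1"
}

# Example (prefix taken from the output table; the directory is a placeholder):
# count_records /path/to/1000G_highCoveragephased.pruned.pvar   # variants
# count_records /path/to/1000G_highCoveragephased.pruned.psam   # samples
```

Counts far from the expected order of magnitude usually point to a chromosome that failed to download or convert.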

Discussion Points

Data source selection: Why use the high-coverage 2022 release rather than the original 1000 Genomes Phase 3? What are the trade-offs in sample size vs. coverage depth?
Alternative reference panels: How would you adapt this pipeline for the TOPMed reference or the Human Genome Diversity Project (HGDP)? What preprocessing steps would change?
Updating the reference: The 1000 Genomes Project is periodically updated. How would you modify this pipeline to incorporate new releases while maintaining backwards compatibility?
Storage considerations: The reference data requires ~100GB. What strategies could reduce storage requirements (e.g., compression, selective chromosome downloading)?
Computational resources: The kgAssemble rule uses significant memory (32GB) and CPU (8 threads). How do these requirements scale with additional chromosomes or sample sizes?
Quality control: What additional QC steps could be added to the assembly process? Consider variant-level filters (missingness, HWE) and sample-level filters (call rate, heterozygosity).

For more information on using this reference data for ancestry classification, see Tutorial: Ancestry Classification in Practice.

Next Steps

After completing this tutorial, proceed to:

Tutorial: Quality Control Pipeline in Practice - Run QC on your samples (uses REF for ancestry QC)
Tutorial: Ancestry Classification in Practice - Classify ancestry using the reference panel

The reference panel enables:

PCA projection of study samples onto reference space
Random Forest ancestry classification
Ancestry-specific QC filtering

See also:

Installation - Software setup (if not already done)
Usage - Running the full pipeline
Genomics - Technical details on reference-based methods