Tutorial: Assembling 1000 Genomes Reference Data
This tutorial covers the process of downloading, processing, and assembling the 1000 Genomes (1kG) high-coverage reference panel for use in ancestry classification and genetic quality control pipelines.
Estimated completion time: 1-2 hours
Learning objectives:
Understand the data sources and download process for 1kG reference data
Run the metadata download checkpoint
Execute the VCF download rule for chromosome-level data
Assemble and merge VCF files into PLINK2 format
Apply LD pruning and relatedness filtering
Prerequisites
Setup:
Before starting, ensure you have access to Snakemake and the GDCGenomicsQC workflow. For detailed installation instructions, see:
Installation - Software setup (module, conda, or other methods)
Usage - Running the pipeline
If you’re using the MSI HPC cluster:
module use /projects/standard/gdc/public/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake
Verify installation:
cd GDCGenomicsQC
snakemake --version
If you’re using the Sandbox environment:
module use /scratch.global/GDC/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake
Verify installation:
cd GDCGenomicsQC
snakemake --version
If your HPC has the GDC module pre-configured:
# Replace with your HPC's module path:
module use /path/to/GDCGenomicsQC/envs
module load gdcgenomicsqc
conda activate snakemake
Verify installation:
cd GDCGenomicsQC
snakemake --version
If you’re using your own Snakemake installation:
conda activate snakemake
cd GDCGenomicsQC
Verify installation:
snakemake --version
Data Requirements:
Sufficient storage (approximately 100GB for reference data)
Network access to 1000 Genomes FTP server
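A quick pre-flight check of the storage requirement can save a failed download. The sketch below assumes a POSIX df; has_free_space is a hypothetical helper name, not part of the GDCGenomicsQC workflow, and the 100 GB figure is simply the estimate quoted above.

```shell
# has_free_space DIR MIN_GB
# Succeeds if the filesystem holding DIR has at least MIN_GB gigabytes free.
# (Hypothetical helper for illustration; not part of the pipeline.)
has_free_space() {
    local dir="$1" min_gb="$2" avail_kb
    # df -Pk prints one POSIX-format line per filesystem in 1024-byte blocks;
    # column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( min_gb * 1024 * 1024 )) ]
}

# Example: warn if the intended reference location has under 100 GB free.
# has_free_space /path/to/reference/storage 100 || echo "less than 100 GB free" >&2
```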
Required Input Files
This step downloads data from external sources:
| Input Source | Description |
|---|---|
| | 1000 Genomes FTP server (downloaded by pipeline) |
| | NCBI reference genome repository |
| | Local storage for downloaded reference data |
Downloaded Files:
The kgMeta checkpoint downloads:
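| File | Description |
|---|---|
| population.txt | Sample population assignments (2504 samples) |
| pedigree.txt | Family relationships and phasing information |
| hg38map.txt | Genetic map for Eagle phasing |
| hg19ToHg38.over.chain.gz | Chain file for coordinate liftover |
| Homo_sapiens.GRCh38.dna.primary_assembly.fa | Reference genome FASTA |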
The kgData rule downloads:
| File Pattern | Description |
|---|---|
| 1kGP_high_coverage_Illumina.chr{1-22}.filtered.SNV_INDEL_SV_phased_panel.vcf.gz | High-coverage phased VCF per chromosome |
| .vcf.gz.tbi | Tabix index files |
Config Parameters:
REF: "/path/to/reference/storage" # Output directory for reference data
OUT_DIR: "/path/to/output"
local-storage-prefix: "/path/to/.snakemake/storage"
chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
Output Files:
After assembly, these files are used by other tutorials:
| File | Used By |
|---|---|
| | Ancestry classification, all tutorials |
| | PCA projection, ancestry classification |
| | All ancestry analysis steps |
See also: Tutorial: Ancestry Classification in Practice for using reference data.
Lab Exercise: Assembling 1kG Reference Panel
Step 1: Configure Reference Paths
The reference data pipeline requires a base reference directory. Set this in your configuration file:
mkdir -p ~/reference_lab
cd ~/reference_lab
cat > config_reference.yaml << 'EOF'
REF: "/path/to/reference/storage"
OUT_DIR: "/path/to/output/directory"
local-storage-prefix: "/path/to/.snakemake/storage"
chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
conda-frontend: mamba
EOF
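Before launching any rule, it can help to confirm that the configuration file contains the keys used in this tutorial. The following is a minimal sketch: check_config_keys is a hypothetical helper, and it only looks for top-level keys at the start of a line rather than parsing YAML properly.

```shell
# check_config_keys FILE
# Reports any of the four keys used in this tutorial that are missing from a
# YAML config file; returns non-zero if at least one is absent.
# (Hypothetical helper; a plain text match, not a YAML parser.)
check_config_keys() {
    local file="$1" key missing=0
    for key in REF OUT_DIR local-storage-prefix chromosomes; do
        # Match the key at the start of a line, followed by a colon.
        if ! grep -q "^${key}:" "$file"; then
            echo "missing key: ${key}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example:
# check_config_keys config_reference.yaml && echo "config looks complete"
```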
Key paths created by the pipeline:
{REF}/1000G_highcoverage/ - Main reference directory
{REF}/Homo_sapiens.GRCh38.dna.primary_assembly.fa - Reference genome
Step 2: Download Metadata (kgMeta Checkpoint)
The kgMeta checkpoint downloads essential reference files:
population.txt: Sample population assignments (2504 samples)
pedigree.txt: Family relationships and phasing information
hg38map.txt: Genetic map for Eagle phasing
hg19ToHg38.over.chain.gz: Chain file for coordinate liftover
Homo_sapiens.GRCh38.dna.primary_assembly.fa: Reference genome FASTA
Run with the gdcgenomicsqc module (MSI, Sandbox, or another pre-configured HPC):
cd GDCGenomicsQC/workflow
gdcgenomicsqc --configfile ../config_reference.yaml kgMeta -j 4
Run with your own Snakemake installation:
cd GDCGenomicsQC/workflow
snakemake --profile=../profiles/hpc \
--configfile ../config_reference.yaml \
kgMeta \
-j 4
This checkpoint only needs to run once. The output files are cached for subsequent runs.
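Once the checkpoint has finished, a short check can confirm that the expected metadata files are present and non-empty. This is a sketch under assumptions: report_missing is a hypothetical helper, and placing pedigree.txt, hg38map.txt, and the chain file under {REF}/1000G_highcoverage/ is a guess based on where population.txt lives, so adjust the directories to match your run.

```shell
# report_missing DIR NAME...
# Prints each NAME that is absent or empty in DIR; returns non-zero if any is.
# (Hypothetical helper for illustration.)
report_missing() {
    local dir="$1" name status=0
    shift
    for name in "$@"; do
        # -s is true only for files that exist and have a size greater than zero.
        if [ ! -s "${dir}/${name}" ]; then
            echo "missing or empty: ${dir}/${name}"
            status=1
        fi
    done
    return "$status"
}

# Example (REF is the value from your config; file locations are assumed):
# report_missing "$REF/1000G_highcoverage" population.txt pedigree.txt \
#     hg38map.txt hg19ToHg38.over.chain.gz
# report_missing "$REF" Homo_sapiens.GRCh38.dna.primary_assembly.fa
```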
Step 3: Download VCF Data (kgData Rule)
The kgData rule downloads phased VCF files for each chromosome from
the 1000 Genomes FTP server:
Source: https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/
Files: 1kGP_high_coverage_Illumina.chr{1-22}.filtered.SNV_INDEL_SV_phased_panel.vcf.gz
Index: .vcf.gz.tbi files
Run with the gdcgenomicsqc module:
gdcgenomicsqc --configfile ../config_reference.yaml kgData -j 22
Run with your own Snakemake installation:
snakemake --profile=../profiles/hpc \
--configfile ../config_reference.yaml \
kgData \
-j 22
This rule is parallelized by chromosome. Using -j 22 allows downloading
all chromosomes concurrently.
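To see which files the rule is expected to fetch, the per-chromosome URLs can be enumerated from the source directory and file pattern given above. The function names here are hypothetical, and this sketch only prints URLs; the pipeline itself handles the actual download.

```shell
# Source directory and file pattern as listed in this step.
BASE_URL="https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV"

# kg_vcf_url CHR: print the VCF URL for one autosome (1-22).
kg_vcf_url() {
    printf '%s/1kGP_high_coverage_Illumina.chr%s.filtered.SNV_INDEL_SV_phased_panel.vcf.gz\n' \
        "$BASE_URL" "$1"
}

# list_kg_urls: print the VCF URL and its .tbi index URL for chromosomes 1-22.
list_kg_urls() {
    local chr url
    for chr in $(seq 1 22); do
        url=$(kg_vcf_url "$chr")
        echo "$url"
        echo "${url}.tbi"
    done
}
```

Running list_kg_urls prints 44 lines, one VCF and one index per chromosome.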
Step 4: Assemble into PLINK2 Format (kgAssemble Rule)
The kgAssemble rule performs the core processing:
Convert VCF to PGEN: Each chromosome VCF is converted to PLINK2 binary genotype format (pgen)
Merge chromosomes: All chromosome files are merged into a single dataset
Reference allele correction: Aligns alleles to the reference genome FASTA
Variant ID standardization: Sets variant IDs to chr#:pos:ref:alt format
LD pruning: Removes linked variants (window: 1000kb, step: 1, r²: 0.1)
Relatedness filtering: Removes related samples (KING cutoff: 0.0884)
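For readers who want to see what the last two steps could look like outside the workflow, the sketch below prints (but does not run) two plink2 commands using the parameters listed above. It is an illustration, not the rule's actual implementation: print_prune_cmds and the file prefixes are hypothetical, and the real rule may order or combine the steps differently. The 0.0884 cutoff is the conventional KING kinship boundary between second- and third-degree relatives.

```shell
# print_prune_cmds SRC_PREFIX OUT_PREFIX
# Prints two plink2 commands: LD pruning, then relatedness filtering on the
# pruned variant set. Nothing is executed. (Hypothetical helper and prefixes.)
print_prune_cmds() {
    local src="$1" out="$2"
    # Window 1000 kb, step 1 variant, r^2 threshold 0.1; writes OUT.prune.in.
    echo "plink2 --pfile ${src} --indep-pairwise 1000kb 1 0.1 --out ${out}"
    # Keep the pruned variants and drop one sample from each pair whose
    # kinship coefficient exceeds 0.0884.
    echo "plink2 --pfile ${src} --extract ${out}.prune.in --king-cutoff 0.0884 --make-pgen --out ${out}.unrelated"
}

# Example:
# print_prune_cmds ref/merged ref/pruned
```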
Run with the gdcgenomicsqc module:
gdcgenomicsqc --configfile ../config_reference.yaml kgAssemble -j 8
Run with your own Snakemake installation:
snakemake --profile=../profiles/hpc \
--configfile ../config_reference.yaml \
kgAssemble \
-j 8
Output files:
Understanding the Processing Steps
VCF to PGEN Conversion
The pipeline uses PLINK2 for format conversion with several quality filters:
--maf 0.05: Remove rare variants (minor allele frequency < 5%)
--snps-only just-acgt: Remove indels and non-standard variants
--rm-dup force-first: Handle duplicate SNPs by keeping first occurrence
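Put together, a single-chromosome conversion with the three filters listed above might be invoked roughly as follows. This sketch only prints the command; print_convert_cmd and the output prefix are hypothetical, and the pipeline's rule may pass additional options.

```shell
# print_convert_cmd VCF OUT_PREFIX
# Prints (does not run) a plink2 call that converts one chromosome VCF to
# PGEN while applying the MAF, SNP-only, and duplicate-removal filters.
print_convert_cmd() {
    local vcf="$1" out="$2"
    echo "plink2 --vcf ${vcf} --maf 0.05 --snps-only just-acgt --rm-dup force-first --make-pgen --out ${out}"
}

# Example:
# print_convert_cmd 1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz pgen/chr21
```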
Merging Strategy
Chromosomes are processed independently then merged using plink2 --pmerge-list.
This approach:
Reduces memory requirements during processing
Allows parallel chromosome conversion
Creates a single merged dataset for downstream analysis
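The merge step reads a text file naming the per-chromosome filesets. A minimal sketch of building that list is shown below; write_merge_list and the chr{N} prefix naming are assumptions for illustration, since the pipeline's own intermediate file names are not shown here.

```shell
# write_merge_list DIR OUT_FILE
# Writes one fileset prefix per line (DIR/chr1 ... DIR/chr22) to OUT_FILE.
# (Hypothetical helper; the chrN prefix naming is an assumption.)
write_merge_list() {
    local dir="$1" out="$2" chr
    : > "$out"    # create or truncate the list file
    for chr in $(seq 1 22); do
        echo "${dir}/chr${chr}" >> "$out"
    done
}

# Example:
# write_merge_list pgen merge_list.txt
```

The resulting file can then be passed to plink2 --pmerge-list.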
Pipeline Outputs
Population Assignments
File: 1000G_highcoverage/population.txt
| SampleID | Population | SuperPop |
|---|---|---|
| HG00096 | GBR | EUR |
| NA18498 | YRI | AFR |
| NA12878 | CEU | EUR |
The sample file includes both population (pop) and super-population (superpop)
labels for flexible grouping.
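A quick tally per super-population is a useful sanity check on this file. The sketch below assumes a whitespace-delimited file with one header line and the super-population label in the third column, as in the table above; count_superpops is a hypothetical helper, and the column position may differ in your copy.

```shell
# count_superpops FILE
# Prints "<SuperPop> <count>" for each super-population, sorted by label.
# Assumes one header line and the super-population label in column 3.
count_superpops() {
    awk 'NR > 1 { n[$3]++ } END { for (s in n) print s, n[s] }' "$1" | sort
}

# Example:
# count_superpops "$REF/1000G_highcoverage/population.txt"
```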
LD-Pruned Reference
The pruned dataset contains:
~500,000 independent variants (after LD pruning)
~2,500 samples (after relatedness filtering)
Standardized variant IDs for compatibility with downstream pipelines
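The approximate counts above can be checked against the output by counting data lines in the variant and sample files. This sketch assumes uncompressed .pvar and .psam files; count_records is a hypothetical helper and the prefix in the example is a placeholder, not the pipeline's actual output name.

```shell
# count_records FILE
# Counts data lines in a .pvar or .psam file, skipping header lines that
# begin with '#'. (Hypothetical helper; assumes an uncompressed text file.)
count_records() {
    grep -vc '^#' "$1"
}

# Example (placeholder prefix):
# echo "variants: $(count_records /path/to/pruned_prefix.pvar)"
# echo "samples:  $(count_records /path/to/pruned_prefix.psam)"
```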
Discussion Points
Data source selection: Why use the high-coverage 2022 release rather than the original 1000 Genomes Phase 3? What are the trade-offs in sample size vs. coverage depth?
Alternative reference panels: How would you adapt this pipeline for the TOPMed reference or the Human Genome Diversity Project (HGDP)? What preprocessing steps would change?
Updating the reference: The 1000 Genomes Project is periodically updated. How would you modify this pipeline to incorporate new releases while maintaining backwards compatibility?
Storage considerations: The reference data requires ~100GB. What strategies could reduce storage requirements (e.g., compression, selective chromosome downloading)?
Computational resources: The kgAssemble rule uses significant memory (32GB) and CPU (8 threads). How do these requirements scale with additional chromosomes or sample sizes?
Quality control: What additional QC steps could be added to the assembly process? Consider variant-level filters (missingness, HWE) and sample-level filters (call rate, heterozygosity).
For more information on using this reference data for ancestry classification, see Tutorial: Ancestry Classification in Practice.
Next Steps
After completing this tutorial, proceed to:
Tutorial: Quality Control Pipeline in Practice - Run QC on your samples (uses REF for ancestry QC)
Tutorial: Ancestry Classification in Practice - Classify ancestry using the reference panel
The reference panel enables:
PCA projection of study samples onto reference space
Random Forest ancestry classification
Ancestry-specific QC filtering
See also:
Installation - Software setup (if not already done)
Usage - Running the full pipeline
Genomics - Technical details on reference-based methods