.. _tutorial_1kg_assembly:

Tutorial: Assembling 1000 Genomes Reference Data
================================================

This tutorial covers the process of downloading, processing, and assembling
the 1000 Genomes (1kG) high-coverage reference panel for use in ancestry
classification and genetic quality control pipelines.

**Estimated completion time**: 1-2 hours

**Learning objectives**:

1. Understand the data sources and download process for 1kG reference data
2. Run the meta-data download checkpoint
3. Execute the VCF download rule for chromosome-level data
4. Assemble and merge VCF files into PLINK2 format
5. Apply LD pruning and relatedness filtering

----

Prerequisites
-------------

**Setup:**

Before starting, ensure you have access to Snakemake and the GDCGenomicsQC workflow.
For detailed installation instructions, see:

- :doc:`installation` - Software setup (module, conda, or other methods)
- :doc:`usage` - Running the pipeline

.. tabs::

   .. tab:: MSI HPC

      If you're using the MSI HPC cluster:

      .. code-block:: bash

          module use /projects/standard/gdc/public/GDCGenomicsQC/envs
          module load gdcgenomicsqc
          conda activate snakemake

      Verify installation:

      .. code-block:: bash

          cd GDCGenomicsQC
          snakemake --version

   .. tab:: Sandbox

      If you're using the Sandbox environment:

      .. code-block:: bash

          module use /scratch.global/GDC/GDCGenomicsQC/envs
          module load gdcgenomicsqc
          conda activate snakemake

      Verify installation:

      .. code-block:: bash

          cd GDCGenomicsQC
          snakemake --version

   .. tab:: Other HPCs

      If your HPC has the GDC module pre-configured:

      .. code-block:: bash

          # Replace with your HPC's module path:
          module use /path/to/GDCGenomicsQC/envs
          module load gdcgenomicsqc
          conda activate snakemake

      Verify installation:

      .. code-block:: bash

          cd GDCGenomicsQC
          snakemake --version

   .. tab:: Local Snakemake

      If you're using your own Snakemake installation:

      .. code-block:: bash

          conda activate snakemake
          cd GDCGenomicsQC

      Verify installation:

      .. code-block:: bash

          snakemake --version

**Data Requirements:**

- Sufficient storage (approximately 100GB for reference data)
- Network access to 1000 Genomes FTP server

----

Required Input Files
~~~~~~~~~~~~~~~~~~~~

This step downloads data from external sources:

.. list-table:: 1kG Assembly Input Files
   :widths: 35 65
   :header-rows: 1

   * - Input Source
     - Description
   * - ``https://ftp.1000genomes.ebi.ac.uk/``
     - 1000 Genomes FTP server (downloaded by pipeline)
   * - ``https://ftp.ncbi.nlm.nih.gov/genomes/all/``
     - NCBI reference genome repository
   * - ``REF/`` (output directory)
     - Local storage for downloaded reference data

**Downloaded Files:**

The ``kgMeta`` checkpoint downloads:

.. list-table:: Metadata Files (kgMeta)
   :widths: 40 60
   :header-rows: 1

   * - File
     - Description
   * - ``population.txt``
     - Sample population assignments (2504 samples)
   * - ``pedigree.txt``
     - Family relationships and phasing information
   * - ``hg38map.txt``
     - Genetic map for Eagle phasing
   * - ``hg19ToHg38.over.chain.gz``
     - Chain file for coordinate liftover
   * - ``Homo_sapiens.GRCh38.dna.primary_assembly.fa``
     - Reference genome FASTA

The ``kgData`` rule downloads:

.. list-table:: VCF Files (kgData)
   :widths: 40 60
   :header-rows: 1

   * - File Pattern
     - Description
   * - ``1kGP_high_coverage_Illumina.chr{1-22}.filtered.SNV_INDEL_SV_phased_panel.vcf.gz``
     - High-coverage phased VCF per chromosome
   * - ``.vcf.gz.tbi``
     - Tabix index files

**Config Parameters:**

.. code-block:: yaml

    REF: "/path/to/reference/storage"  # Output directory for reference data
    OUT_DIR: "/path/to/output"
    local-storage-prefix: "/path/to/.snakemake/storage"

    chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

**Output Files:**

After assembly, these files are used by other tutorials:

.. list-table:: Assembly Output Files
   :widths: 40 60
   :header-rows: 1

   * - File
     - Used By
   * - ``1000G_highCoveragephased.pgen``
     - Ancestry classification, all tutorials
   * - ``1000G_highCoveragephased.pruned.pgen``
     - PCA projection, ancestry classification
   * - ``population.txt``
     - All ancestry analysis steps

**See also:** :doc:`tutorial_ancestry_classification` for using reference data.

Lab Exercise: Assembling 1kG Reference Panel
---------------------------------------------

Step 1: Configure Reference Paths
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The reference data pipeline requires a base reference directory. Set this
in your configuration file:

.. code-block:: bash

    mkdir -p ~/reference_lab
    cd ~/reference_lab
    cat > config_reference.yaml << 'EOF'
    REF: "/path/to/reference/storage"
    OUT_DIR: "/path/to/output/directory"
    local-storage-prefix: "/path/to/.snakemake/storage"

    chromosomes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

    conda-frontend: mamba
    EOF

Key paths created by the pipeline:

- ``{REF}/1000G_highcoverage/`` - Main reference directory
- ``{REF}/Homo_sapiens.GRCh38.dna.primary_assembly.fa`` - Reference genome

Step 2: Download Metadata (kgMeta Checkpoint)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``kgMeta`` checkpoint downloads essential reference files:

- ``population.txt``: Sample population assignments (2504 samples)
- ``pedigree.txt``: Family relationships and phasing information
- ``hg38map.txt``: Genetic map for Eagle phasing
- ``hg19ToHg38.over.chain.gz``: Chain file for coordinate liftover
- ``Homo_sapiens.GRCh38.dna.primary_assembly.fa``: Reference genome FASTA

.. tabs::

   .. tab:: MSI HPC

      .. code-block:: bash

          cd GDCGenomicsQC/workflow
          gdcgenomicsqc --configfile ../config_reference.yaml kgMeta -j 4

   .. tab:: Sandbox

      .. code-block:: bash

          cd GDCGenomicsQC/workflow
          gdcgenomicsqc --configfile ../config_reference.yaml kgMeta -j 4

   .. tab:: Other HPCs

      .. code-block:: bash

          cd GDCGenomicsQC/workflow
          gdcgenomicsqc --configfile ../config_reference.yaml kgMeta -j 4

   .. tab:: Local Snakemake

      .. code-block:: bash

          cd GDCGenomicsQC/workflow
          snakemake --profile=../profiles/hpc \
              --configfile ../config_reference.yaml \
              kgMeta \
              -j 4

This checkpoint only needs to run once. The output files are cached for
subsequent runs.

Step 3: Download VCF Data (kgData Rule)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``kgData`` rule downloads phased VCF files for each chromosome from
the 1000 Genomes FTP server:

- Source: ``https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/``
- Files: ``1kGP_high_coverage_Illumina.chr{1-22}.filtered.SNV_INDEL_SV_phased_panel.vcf.gz``
- Index: ``.vcf.gz.tbi`` files

.. tabs::

   .. tab:: MSI HPC

      .. code-block:: bash

          gdcgenomicsqc --configfile ../config_reference.yaml kgData -j 22

   .. tab:: Sandbox

      .. code-block:: bash

          gdcgenomicsqc --configfile ../config_reference.yaml kgData -j 22

   .. tab:: Other HPCs

      .. code-block:: bash

          gdcgenomicsqc --configfile ../config_reference.yaml kgData -j 22

   .. tab:: Local Snakemake

      .. code-block:: bash

          snakemake --profile=../profiles/hpc \
              --configfile ../config_reference.yaml \
              kgData \
              -j 22

This rule is parallelized by chromosome. Using ``-j 22`` allows downloading
all chromosomes concurrently.

Step 4: Assemble into PLINK2 Format (kgAssemble Rule)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``kgAssemble`` rule performs the core processing:

1. **Convert VCF to PGEN**: Each chromosome VCF is converted to PLINK2
   binary genotype format (``pgen``)
2. **Merge chromosomes**: All chromosome files are merged into a single
   dataset
3. **Reference allele correction**: Aligns alleles to the reference genome
   FASTA
4. **Variant ID standardization**: Sets variant IDs to ``chr#:pos:ref:alt``
   format
5. **LD pruning**: Removes linked variants (window: 1000kb, step: 1, r²: 0.1)
6. **Relatedness filtering**: Removes related samples (KING cutoff: 0.0884)

.. tabs::

   .. tab:: MSI HPC

      .. code-block:: bash

          gdcgenomicsqc --configfile ../config_reference.yaml kgAssemble -j 8

   .. tab:: Sandbox

      .. code-block:: bash

          gdcgenomicsqc --configfile ../config_reference.yaml kgAssemble -j 8

   .. tab:: Other HPCs

      .. code-block:: bash

          gdcgenomicsqc --configfile ../config_reference.yaml kgAssemble -j 8

   .. tab:: Local Snakemake

      .. code-block:: bash

          snakemake --profile=../profiles/hpc \
              --configfile ../config_reference.yaml \
              kgAssemble \
              -j 8

Output files:

+------------------------------------------+----------------------------------+
| File                                     | Description                      |
+==========================================+==================================+
| ``1000G_highCoveragephased.pgen``        | Full merged genotype file        |
+------------------------------------------+----------------------------------+
| ``1000G_highCoveragephased.pvar``       | Variant information              |
+------------------------------------------+----------------------------------+
| ``1000G_highCoveragephased.psam``        | Sample information with ancestry |
+------------------------------------------+----------------------------------+
| ``1000G_highCoveragephased.pruned.pgen`` | LD-pruned, unrelated samples     |
+------------------------------------------+----------------------------------+

----

Understanding the Processing Steps
-----------------------------------

VCF to PGEN Conversion
~~~~~~~~~~~~~~~~~~~~~~

The pipeline uses PLINK2 for format conversion with several quality filters:

- ``--maf 0.05``: Remove rare variants (minor allele frequency < 5%)
- ``--snps-only just-acgt``: Remove indels and non-standard variants
- ``--rm-dup force-first``: Handle duplicate SNPs by keeping first occurrence

Merging Strategy
~~~~~~~~~~~~~~~~

Chromosomes are processed independently then merged using ``plink2 --pmerge-list``.
This approach:

- Reduces memory requirements during processing
- Allows parallel chromosome conversion
- Creates a single merged dataset for downstream analysis

Relatedness Filtering
~~~~~~~~~~~~~~~~~~~~~

The KING kinship coefficient cutoff of 0.0884 corresponds to second-degree
relationships (equivalent to grandparent-grandchild or half-siblings).
This ensures:

- Reference panel contains only unrelated individuals
- PCA and classification are not biased by family structure
- Downstream analyses assume sample independence

----

Pipeline Outputs
----------------

Population Assignments
~~~~~~~~~~~~~~~~~~~~~

**File**: ``1000G_highcoverage/population.txt``

+----------+-----------+-------------+
| SampleID | Population| SuperPop    |
+==========+===========+=============+
| HG00096  | GBR       | EUR         |
+----------+-----------+-------------+
| NA18498  | YRI       | AFR         |
+----------+-----------+-------------+
| NA12878  | CEU       | EUR         |
+----------+-----------+-------------+

Sample file includes both population (``pop``) and super-population (``superpop``)
labels for flexible grouping.

LD-Pruned Reference
~~~~~~~~~~~~~~~~~~~

The pruned dataset contains:

- ~500,000 independent variants (after LD pruning)
- ~2,500 samples (after relatedness filtering)
- Standardized variant IDs for compatibility with downstream pipelines

----

Discussion Points
-----------------

1. **Data source selection**: Why use the high-coverage 2022 release rather
   than the original 1000 Genomes Phase 3? What are the trade-offs in sample
   size vs. coverage depth?

2. **Alternative reference panels**: How would you adapt this pipeline for
   the TOPMed reference or the Human Genome Diversity Project (HGDP)?
   What preprocessing steps would change?

3. **Updating the reference**: The 1000 Genomes Project is periodically
   updated. How would you modify this pipeline to incorporate new releases
   while maintaining backwards compatibility?

4. **Storage considerations**: The reference data requires ~100GB. What
   strategies could reduce storage requirements (e.g., compression, selective
   chromosome downloading)?

5. **Computational resources**: The ``kgAssemble`` rule uses significant
   memory (32GB) and CPU (8 threads). How do these requirements scale with
   additional chromosomes or sample sizes?

6. **Quality control**: What additional QC steps could be added to the
   assembly process? Consider variant-level filters (missingness, HWE) and
   sample-level filters (call rate, heterozygosity).

For more information on using this reference data for ancestry classification,
see :doc:`tutorial_ancestry_classification`.

----

Next Steps
---------

After completing this tutorial, proceed to:

- :doc:`tutorial_qc_pipeline` - Run QC on your samples (uses REF for ancestry QC)
- :doc:`tutorial_ancestry_classification` - Classify ancestry using the reference panel

**The reference panel enables:**

- PCA projection of study samples onto reference space
- Random Forest ancestry classification
- Ancestry-specific QC filtering

**See also:**

- :doc:`installation` - Software setup (if not already done)
- :doc:`usage` - Running the full pipeline
- :doc:`genomics` - Technical details on reference-based methods