VCF Processing Pipeline ======================== The VCF (Variant Call Format) processing pipeline in IDEAL-GENOM provides tools for handling post-imputation VCF files. It automates the complete workflow from imputed VCF files to analysis-ready PLINK binary format, including quality filtering, normalization, annotation, and format conversion. Overview -------- The VCF pipeline consists of two main components: 1. **VCF Processing**: Filter, normalize, annotate, and concatenate imputed VCF files 2. **PLINK Conversion**: Convert processed VCF files to PLINK binary format **Key Features:** - Parallel processing of multiple chromosome VCF files - Quality filtering based on imputation R² score - Reference genome normalization - Variant ID annotation with dbSNP - Automatic handling of multiallelic variants - Efficient conversion to PLINK format - Resource-aware parallel execution Use Cases --------- This pipeline is ideal for: - **Post-Imputation Processing**: Clean and prepare imputed genotype data - **TOPMed/HRC Imputation**: Process results from imputation servers - **Data Harmonization**: Standardize variant representation and IDs - **GWAS Preparation**: Convert VCF to PLINK for association analysis Prerequisites ------------- **Required Software:** - BCFtools (v1.10+) for VCF manipulation - PLINK 2.0 for format conversion - Python 3.11+ with required packages **Required Data:** - Imputed VCF files (can be gzipped) - Reference genome FASTA (auto-downloaded if not provided) - dbSNP VCF for annotation (optional but recommended) **Input File Formats:** Supported input formats: - ``.vcf.gz`` - Gzipped VCF files (recommended) - ``.vcf`` - Uncompressed VCF files - ``.zip`` - Zipped VCF files (automatically extracted) Quick Start ----------- **1. Get the VCF Configuration Template** .. code-block:: bash # Copy the VCF processing template cp yaml_configs/vcf_config_template.yaml my_vcf_pipeline.yaml **2. Edit Configuration** Update paths to your imputed data: .. code-block:: yaml pipeline: name: "vcf_processing" base_output_dir: "/data/output" steps: - name: "process_vcf" enabled: true init_params: input_path: "/data/imputed_vcfs" output_path: "${base_output_dir}" output_name: "processed_data.vcf.gz" execute_params: r2_threshold: 0.3 build: "38" **3. Run the Pipeline** .. code-block:: bash ideal-genom run --config my_vcf_pipeline.yaml Pipeline Steps -------------- Step 1: VCF Processing ^^^^^^^^^^^^^^^^^^^^^^^ Processes imputed VCF files through multiple quality control and normalization steps. **What it does:** 1. **Unzip Files**: Extract compressed VCF files if needed 2. **Filter by R²**: Remove poorly imputed variants (R² < threshold) 3. **Normalize**: Split multiallelic variants and left-align indels 4. **Reference Normalization**: Ensure variants match reference genome 5. **Annotate IDs**: Add or update variant IDs with dbSNP rs numbers 6. **Concatenate**: Merge all chromosome VCFs into a single file 7. **Index**: Create tabix index for the final VCF **Configuration:** .. code-block:: yaml - name: "process_vcf" enabled: true module: "ideal_genom.post_imputation.vcf_process" class: "ProcessVCF" init_params: input_path: "/data/imputed_vcfs" output_path: "${base_output_dir}" input_name: "placeholder" output_name: "processed_data.vcf.gz" execute_params: password: null # For password-protected zip files r2_threshold: 0.3 # Minimum imputation quality build: "38" # Genome build ("37" or "38") ref_genome: null # Custom reference (auto-downloaded if null) ref_annotation: null # dbSNP VCF for annotation max_threads: null # Thread count (auto-detect if null) **Parameters Explained:** **init_params:** - **input_path** (string): Directory containing imputed VCF files - Can contain ``.vcf.gz``, ``.vcf``, or ``.zip`` files - Automatically detects chromosome-specific files - **output_path** (string): Where to save processed files - **input_name** (string): Placeholder for compatibility (can be any value) - **output_name** (string): Name for final concatenated VCF file **execute_params:** - **password** (string or null): Password for encrypted zip files - **r2_threshold** (float, 0-1): Minimum R² (imputation quality) score - 0.3: Liberal (includes more variants, some lower quality) - 0.5: Moderate (balance quality and quantity) - 0.8: Conservative (high quality, fewer variants) - Variants with R² < threshold are removed - **build** (string): Genome build version - "37": GRCh37/hg19 - "38": GRCh38/hg38 - **ref_genome** (string or null): Path to reference genome FASTA - ``null``: Auto-download from Ensembl based on build - Custom path: Use your own reference FASTA file - **ref_annotation** (string or null): Path to dbSNP VCF for variant annotation - ``null``: Skip annotation step - Path to dbSNP VCF: Annotate variants with rs IDs - **max_threads** (int or null): Maximum threads for parallel processing - ``null``: Auto-detect (uses available cores) - Specific number: Limit thread usage **Processing Pipeline Details:** .. code-block:: text Input VCF files (per chromosome) ↓ [1. Unzip if needed] ↓ [2. Filter: R² >= threshold] ↓ [3. Normalize: split multiallelic, left-align] ↓ [4. Reference normalization] ↓ [5. Annotate with dbSNP IDs] ↓ [6. Concatenate all chromosomes] ↓ [7. Index final VCF] ↓ Output: processed_data.vcf.gz + .tbi **Output Files:** .. code-block:: text process_vcf/ ├── unzipped/ # Extracted VCF files │ ├── chr1.vcf.gz │ ├── chr2.vcf.gz │ └── ... ├── filtered/ # After R² filtering │ ├── chr1_filtered.vcf.gz │ ├── chr2_filtered.vcf.gz │ └── ... ├── normalized/ # After normalization │ ├── chr1_norm.vcf.gz │ ├── chr2_norm.vcf.gz │ └── ... ├── ref_normalized/ # After reference normalization │ ├── chr1_refnorm.vcf.gz │ └── ... ├── annotated/ # After dbSNP annotation │ ├── chr1_annotated.vcf.gz │ └── ... └── processed_data.vcf.gz # Final concatenated file └── processed_data.vcf.gz.tbi # Tabix index Step 2: PLINK Conversion ^^^^^^^^^^^^^^^^^^^^^^^^^ Converts processed VCF files to PLINK binary format for GWAS and other analyses. **What it does:** 1. Converts VCF to PLINK1.9 binary format (.bed/.bim/.fam) 2. Handles sample ID formatting 3. Updates family information if provided 4. Optimizes for downstream analysis **Configuration:** .. code-block:: yaml - name: "plink_conversion" enabled: true module: "ideal_genom.post_imputation.vcf_to_plink" class: "GetPLINK" init_params: input_path: "${steps.process_vcf.output_path}" input_name: "${steps.process_vcf.concatenated_file}" output_path: "${base_output_dir}" output_name: "imputed_data_plink" execute_params: double_id: true # Use sample ID for both FID and IID for_fam_update_file: null # FAM file for updating family info threads: null # Thread count (auto-detect) memory: null # Memory in MB (auto-detect) **Parameters Explained:** **init_params:** - **input_path** (string): Directory containing processed VCF - **input_name** (string): Name of input VCF file (without extension) - **output_path** (string): Where to save PLINK files - **output_name** (string): Prefix for output PLINK files **execute_params:** - **double_id** (bool): Sample ID handling - ``true``: Set FID = IID (recommended for most cases) - ``false``: Use VCF sample IDs as-is - **for_fam_update_file** (string or null): Path to FAM file for updating - ``null``: No update, use VCF phenotype information - Path: Update .fam with information from this file (FID, IID, phenotype, sex) - **threads** (int or null): Number of threads for conversion - **memory** (int or null): Memory allocation in MB **Output Files:** .. code-block:: text plink_conversion/ ├── imputed_data_plink.bed # Binary genotype file ├── imputed_data_plink.bim # Variant information ├── imputed_data_plink.fam # Sample information └── imputed_data_plink.log # Conversion log Complete Workflow Example -------------------------- **Full VCF Processing Configuration:** .. code-block:: yaml pipeline: name: "imputed_data_processing" base_output_dir: "/data/processed_imputation" steps: # Step 1: Process VCF files - name: "process_vcf" enabled: true module: "ideal_genom.post_imputation.vcf_process" class: "ProcessVCF" init_params: input_path: "/data/imputation_results" output_path: "${base_output_dir}" input_name: "placeholder" output_name: "imputed_clean.vcf.gz" execute_params: password: null r2_threshold: 0.3 build: "38" ref_genome: null ref_annotation: "/data/references/dbSNP156_GRCh38.vcf.gz" max_threads: null # Step 2: Convert to PLINK - name: "plink_conversion" enabled: true module: "ideal_genom.post_imputation.vcf_to_plink" class: "GetPLINK" init_params: input_path: "${steps.process_vcf.output_path}" input_name: "${steps.process_vcf.concatenated_file}" output_path: "${base_output_dir}" output_name: "imputed_plink" execute_params: double_id: true for_fam_update_file: "/data/original_study.fam" threads: null memory: null settings: logging: level: "INFO" file_logging: true resources: max_memory: null max_threads: null files: keep_intermediate: true **Running the Pipeline:** .. code-block:: bash # Validate ideal-genom validate --config vcf_pipeline.yaml # Execute ideal-genom run --config vcf_pipeline.yaml Input Data Organization ----------------------- Expected Directory Structure ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ For chromosome-specific VCF files: .. code-block:: text /data/imputation_results/ ├── chr1.dose.vcf.gz ├── chr2.dose.vcf.gz ├── chr3.dose.vcf.gz ├── ... ├── chr22.dose.vcf.gz └── chrX.dose.vcf.gz Or as zip files: .. code-block:: text /data/imputation_results/ ├── chr1.zip ├── chr2.zip ├── ... └── chr22.zip **File Naming Patterns:** The pipeline automatically detects files with these patterns: - ``chr*.vcf.gz`` - ``chr*.vcf`` - ``*.dose.vcf.gz`` - ``chromosome_*.vcf.gz`` - ``chr*.zip`` VCF File Requirements ^^^^^^^^^^^^^^^^^^^^^ **Required INFO Fields:** - **R2** or **INFO**: Imputation quality score (R²) - Used for quality filtering - Typically ranges from 0 to 1 Quality Control Considerations ------------------------------- R² Threshold Selection ^^^^^^^^^^^^^^^^^^^^^^ **Recommended Thresholds:** .. code-block:: text R² Threshold | Quality | Use Case --------------|--------------|-------------------------------- 0.3 | Liberal | Discovery, maximize variants 0.5 | Moderate | Standard analysis 0.8 | Conservative| High-confidence variants only 0.9 | Very strict | Fine-mapping, functional studies **Trade-offs:** - **Lower threshold (0.3)**: More variants, some lower quality - **Higher threshold (0.8)**: Fewer variants, higher confidence **Check Imputation Quality:** .. code-block:: bash # Before filtering bcftools query -f '%INFO/R2\n' chr1.vcf.gz | \\ awk '{sum+=$1; count++} END {print "Mean R2:", sum/count}' # Distribution bcftools query -f '%INFO/R2\n' chr1.vcf.gz | \\ awk '{if($1<0.3) low++; else if($1<0.8) med++; else high++} END {print "R2<0.3:", low, "0.3-0.8:", med, "R2>=0.8:", high}' Reference Genome Selection ^^^^^^^^^^^^^^^^^^^^^^^^^^ **Auto-Download (Recommended):** Set ``ref_genome: null`` and specify build: - Build "37": Downloads GRCh37 from Ensembl - Build "38": Downloads GRCh38 from Ensembl **Custom Reference:** Use your own reference if: - Using non-standard reference assembly - Working offline - Need specific reference version Ensure reference matches imputation reference! dbSNP Annotation ^^^^^^^^^^^^^^^^ **Benefits of Annotation:** - Standardized variant IDs (rs numbers) - Easier result interpretation - Facilitates meta-analysis - Improves result sharing **Download dbSNP:** .. code-block:: bash # GRCh38 wget https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.40.gz wget https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.40.gz.tbi # GRCh37 wget https://ftp.ncbi.nih.gov/snp/archive/b151/VCF/GCF_000001405.25.gz Best Practices -------------- Pre-Processing Checklist ^^^^^^^^^^^^^^^^^^^^^^^^^ Before running the VCF pipeline: 1. **Verify File Integrity:** .. code-block:: bash # Check VCF files for vcf in chr*.vcf.gz; do bcftools view -H $vcf | head -1 || echo "Error in $vcf" done 2. **Check Sample Counts:** .. code-block:: bash # Ensure consistent sample numbers for vcf in chr*.vcf.gz; do echo "$vcf: $(bcftools query -l $vcf | wc -l) samples" done 3. **Inspect Imputation Quality:** .. code-block:: bash # Check R² distribution bcftools query -f '%INFO/R2\n' chr1.vcf.gz | \\ python -c "import sys; import numpy as np; \\ r2 = [float(x) for x in sys.stdin]; \\ print(f'Mean: {np.mean(r2):.3f}'); \\ print(f'Median: {np.median(r2):.3f}')" Workflow Optimization ^^^^^^^^^^^^^^^^^^^^^ **For Large Datasets (>500K variants per chromosome):** .. code-block:: yaml execute_params: r2_threshold: 0.5 # More stringent max_threads: 16 # Use more cores settings: resources: max_memory: 64000 # 64GB max_threads: 16 files: keep_intermediate: false # Save disk space **For Small Datasets:** .. code-block:: yaml execute_params: r2_threshold: 0.3 # More lenient max_threads: 4 settings: files: keep_intermediate: true # Keep for inspection **For Repeated Analysis:** Set ``keep_intermediate: true`` and ``recompute: false`` in init_params to avoid reprocessing. Common Workflows ---------------- Workflow 1: Basic VCF Processing ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Process and convert imputed VCF files: .. code-block:: yaml pipeline: steps: - name: "process_vcf" enabled: true execute_params: r2_threshold: 0.3 build: "38" - name: "plink_conversion" enabled: true execute_params: double_id: true Workflow 2: High-Quality Variant Subset ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Extract only high-confidence variants: .. code-block:: yaml pipeline: steps: - name: "process_vcf" enabled: true execute_params: r2_threshold: 0.8 # Strict filtering build: "38" ref_annotation: "/data/dbSNP.vcf.gz" Workflow 3: Update Phenotype Information ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Merge with existing phenotype data: .. code-block:: yaml pipeline: steps: - name: "process_vcf" enabled: true - name: "plink_conversion" enabled: true execute_params: for_fam_update_file: "/data/phenotypes.fam" double_id: true Troubleshooting --------------- Common Issues ^^^^^^^^^^^^^ **Issue: "R2 field not found in VCF"** Solutions: 1. Check INFO field name (might be INFO, IMP, IMPINFO) 2. Verify VCF format with ``bcftools view -h file.vcf.gz`` 3. Update code to match your VCF's R² field name **Issue: "Reference genome download fails"** Solutions: 1. Check internet connection 2. Download manually and provide path with ``ref_genome`` 3. Use alternative Ensembl mirror **Issue: "Memory error during concatenation"** Solutions: 1. Set ``max_memory`` explicitly in settings 2. Reduce ``max_threads`` (more memory per thread) 3. Process subsets of chromosomes separately 4. Close other applications **Issue: "Sample IDs don't match phenotype file"** Solutions: 1. Verify sample ID format in VCF 2. Check FAM file sample ID format (FID and IID) 3. Use ``double_id: true`` if VCF has only one ID per sample 4. Manually create matching FAM file **Issue: "Missing variants after filtering"** Solutions: 1. Lower ``r2_threshold`` (e.g., from 0.8 to 0.3) 2. Check imputation quality of input files 3. Verify R² field is correctly parsed Performance Issues ^^^^^^^^^^^^^^^^^^ **Slow processing:** 1. Increase ``max_threads`` in execute_params 2. Use SSD storage for temporary files 3. Reduce ``r2_threshold`` to filter earlier 4. Process chromosomes in parallel manually **High memory usage:** 1. Reduce ``max_threads`` (each thread uses memory) 2. Process chromosomes separately 3. Set explicit ``max_memory`` limit 4. Enable ``keep_intermediate: false`` Output Validation ----------------- Verify Pipeline Success ^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: bash # Check file integrity plink2 --bfile imputed_plink --validate # Verify sample count wc -l imputed_plink.fam # Check variant info head imputed_plink.bim Integration with QC Pipeline ----------------------------- After VCF processing, run the QC pipeline: .. code-block:: bash # Step 1: Process VCF ideal-genom run --config vcf_pipeline.yaml # Step 2: Run QC on converted PLINK files # Update qc_pipeline.yaml to use processed data ideal-genom run --config qc_pipeline.yaml **QC Configuration Update:** .. code-block:: yaml pipeline: steps: - name: "sample_qc" init_params: input_path: "/data/processed_imputation/plink_conversion" input_name: "imputed_plink" See Also -------- - :doc:`configuration` - Full parameter reference - :doc:`qc_pipeline` - Quality control after VCF processing - :doc:`gwas_pipeline` - Association analysis workflow - :doc:`troubleshooting` - Detailed problem-solving External Resources ------------------ **BCFtools Documentation:** - Main page: http://samtools.github.io/bcftools/ - VCF filtering: http://samtools.github.io/bcftools/bcftools.html#view - Normalization: http://samtools.github.io/bcftools/bcftools.html#norm **PLINK Documentation:** - PLINK 2.0: https://www.cog-genomics.org/plink/2.0/ - VCF import: https://www.cog-genomics.org/plink/2.0/input#vcf **Imputation Services:** - TOPMed: https://imputation.biodatacatalyst.nhlbi.nih.gov/ - Michigan Imputation Server: https://imputationserver.sph.umich.edu/ - HRC: http://www.haplotype-reference-consortium.org/ **VCF Format:** - Specification: https://samtools.github.io/hts-specs/VCFv4.2.pdf - Best practices: https://www.internationalgenome.org/wiki/Analysis/vcf4.0/