Examples
========

This page provides practical examples of using IDEAL-GENOM for different types of genomic studies. Each example includes a complete YAML configuration file and step-by-step instructions.

Example 1: Basic QC Pipeline
-----------------------------

This example demonstrates a standard quality control pipeline for a case-control GWAS study.

**Study Setup:**

- 2,000 samples (1,000 cases, 1,000 controls)
- 500,000 SNPs genotyped on an Illumina array
- European population
- Standard QC thresholds

**Complete Configuration (qc_basic.yaml):**

.. code-block:: yaml

   pipeline:
     name: "basic_qc_pipeline"
     base_output_dir: "/data/gwas_study/qc_output"

     steps:
       # Step 1: Sample QC
       - name: "sample_qc"
         enabled: true
         module: "ideal_genom.qc.sample_qc"
         class: "SampleQC"
         init_params:
           input_path: "/data/gwas_study/raw_data"
           input_name: "gwas_data"
           output_path: "${base_output_dir}/sample_qc"
           output_name: "sample_clean"
           reference_path: "data/1000genomes_build_38"
           reference_name: "1kG_phase3_GRCh38"
           built: "38"
           recompute: false
         execute_params:
           rename_snp: true
           hh_to_missing: true
           use_kinship: true
           ind_pair: [50, 5, 0.2]
           mind: 0.1
           sex_check: [0.2, 0.8]
           maf: 0.01
           het_deviation: 3
           kinship: 0.354
           ibd_threshold: 0.185

       # Step 2: Ancestry QC
       - name: "ancestry_qc"
         enabled: true
         module: "ideal_genom.qc.ancestry_qc"
         class: "AncestryQC"
         init_params:
           input_path: "${steps.sample_qc.output_path}"
           input_name: "${steps.sample_qc.output_name}"
           output_path: "${base_output_dir}/ancestry_qc"
           output_name: "ancestry_clean"
           reference_path: "data/1000genomes_build_38"
           reference_name: "1kG_phase3_GRCh38"
           built: "38"
         execute_params:
           ind_pair: [50, 5, 0.2]
           pca: 10
           maf: 0.05
           ref_threshold: 3
           stu_threshold: 3
           reference_pop: "EUR"
           num_pcs: 10

       # Step 3: Variant QC
       - name: "variant_qc"
         enabled: true
         module: "ideal_genom.qc.variant_qc"
         class: "VariantQC"
         init_params:
           input_path: "${steps.ancestry_qc.output_path}"
           input_name: "${steps.ancestry_qc.output_name}"
           output_path: "${base_output_dir}/variant_qc"
           output_name: "final_clean"
           high_ld_file: "data/ld_regions_files/high-LD-regions_GRCH38.txt"
         execute_params:
           chr_y: 24
           miss_data_rate: 0.1
           diff_genotype_rate: 0.0001
           geno: 0.05
           maf: 0.01
           hwe: 0.000001

   settings:
     logging:
       level: "INFO"
       file_logging: true
     resources:
       max_memory: null
       max_threads: null
     files:
       keep_intermediate: true

**Execution:**

.. code-block:: bash

   # Validate configuration
   ideal-genom validate --config qc_basic.yaml

   # Preview pipeline steps
   ideal-genom run --config qc_basic.yaml --dry-run

   # Execute pipeline
   ideal-genom run --config qc_basic.yaml

**Output Structure:**

.. code-block:: text

   qc_output/
   ├── sample_qc/
   │   ├── sample_clean.bed/bim/fam
   │   ├── excluded_samples.txt
   │   └── qc_report.html
   ├── ancestry_qc/
   │   ├── ancestry_clean.bed/bim/fam
   │   ├── pca_results.txt
   │   └── ancestry_plot.png
   └── variant_qc/
       ├── final_clean.bed/bim/fam
       ├── excluded_variants.txt
       └── qc_summary.txt

Example 2: Complete GWAS Workflow
----------------------------------

This example continues from the QC output of Example 1 and runs a GWAS analysis using linear mixed models.

**Study Setup:**

- Post-QC dataset: 1,800 samples, 450,000 SNPs
- Qualitative trait (e.g., Parkinson's disease status)
- Account for population structure with PCA
- Control for relatedness with GRM

**Configuration (gwas_complete.yaml):**

.. code-block:: yaml

   pipeline:
     name: "complete_gwas"
     base_output_dir: "/data/gwas_study/gwas_results"

     steps:
       # Step 1: Preparatory analysis
       - name: "gwas_prep"
         enabled: true
         module: "ideal_genom.gwas.preparatory"
         class: "Preparatory"
         init_params:
           input_path: "/data/gwas_study/qc_output/variant_qc"
           input_name: "final_clean"
           output_path: "${base_output_dir}/prep"
           output_name: "gwas_ready"
           high_ld_file: "data/ld_regions_files/high-LD-regions_GRCH38.txt"
         execute_params:
           ind_pair: [50, 5, 0.2]
           pca: 10
           maf: 0.05

       # Step 2: Linear Mixed Model
       - name: "gwas_glmm"
         enabled: true
         module: "ideal_genom.gwas.gen_linear_mix_model"
         class: "GWAS_GLMM"
         init_params:
           input_path: "${steps.gwas_prep.output_path}"
           input_name: "${steps.gwas_prep.output_name}"
           output_path: "${base_output_dir}/glmm"
           output_name: "glmm_results"
         execute_params:
           maf: 0.01

   settings:
     logging:
       level: "INFO"
       file_logging: true
     resources:
       max_threads: 8
       max_memory: 32000

**Execution:**

.. code-block:: bash

   # Run complete GWAS pipeline
   ideal-genom run --config gwas_complete.yaml

Example 3: VCF Post-Imputation Processing
------------------------------------------

This example demonstrates processing imputed VCF files from the TOPMed or Michigan Imputation Server.

**Study Setup:**

- Imputed VCF files for chromosomes 1-22
- R² quality scores from imputation
- Convert to PLINK for downstream analysis
- GRCh38 genome build

**Configuration (vcf_process.yaml):**

.. code-block:: yaml

   pipeline:
     name: "imputed_data_processing"
     base_output_dir: "/data/imputation_study/processed"

     steps:
       # Step 1: Process VCF files
       - name: "process_vcf"
         enabled: true
         module: "ideal_genom.post_imputation.vcf_process"
         class: "ProcessVCF"
         init_params:
           input_path: "/data/imputation_study/imputed_vcfs"
           output_path: "${base_output_dir}/vcf"
           input_name: "placeholder"
           output_name: "imputed_filtered.vcf.gz"
         execute_params:
           password: null
           r2_threshold: 0.3
           build: "38"
           ref_genome: null
           ref_annotation: "/data/references/dbSNP156_GRCh38.vcf.gz"
           max_threads: null

       # Step 2: Convert to PLINK
       - name: "plink_conversion"
         enabled: true
         module: "ideal_genom.post_imputation.vcf_to_plink"
         class: "GetPLINK"
         init_params:
           input_path: "${steps.process_vcf.output_path}"
           input_name: "imputed_filtered"
           output_path: "${base_output_dir}/plink"
           output_name: "imputed_plink"
         execute_params:
           double_id: true
           for_fam_update_file: null
           threads: null
           memory: null

   settings:
     logging:
       level: "INFO"
       file_logging: true
     files:
       keep_intermediate: true

**Execution:**

.. code-block:: bash

   # Process imputed data
   ideal-genom run --config vcf_process.yaml

Example 4: Population Structure Analysis
-----------------------------------------

This example focuses on detailed population structure analysis with Fst statistics and projection.

**Study Setup:**

- Post-QC dataset with known population labels
- Calculate Fst statistics between populations
- Project samples onto reference PCA space

**Configuration (population_analysis.yaml):**

.. code-block:: yaml

   pipeline:
     name: "population_structure"
     base_output_dir: "/data/pop_structure/output"

     steps:
       # Ancestry QC with PCA
       - name: "ancestry_analysis"
         enabled: true
         module: "ideal_genom.qc.ancestry_qc"
         class: "AncestryQC"
         init_params:
           input_path: "/data/pop_structure/clean_data"
           input_name: "qc_passed"
           output_path: "${base_output_dir}/ancestry"
           output_name: "ancestry_results"
           reference_path: "data/1000genomes_build_38"
           reference_name: "1kG_phase3_GRCh38"
           built: "38"
         execute_params:
           ind_pair: [50, 5, 0.2]
           pca: 20
           maf: 0.05
           ref_threshold: 6
           stu_threshold: 6
           reference_pop: "ALL"
           num_pcs: 20

       # Fst calculation
       - name: "fst_calculation"
         enabled: true
         module: "ideal_genom.population.fst_stats"
         class: "FstSummary"
         init_params:
           input_path: "${steps.ancestry_analysis.output_path}"
           input_name: "${steps.ancestry_analysis.output_name}"
           output_path: "${base_output_dir}/fst"
           population_file: "/data/pop_structure/populations.txt"
         execute_params:
           pairwise: true
           window_size: 50000

       # Dimensionality reduction
       - name: "dimensionality_reduction"
         enabled: true
         module: "ideal_genom.population.projection"
         class: "DimensionalityReductionPipeline"
         init_params:
           input_path: "${steps.ancestry_analysis.output_path}"
           input_name: "${steps.ancestry_analysis.output_name}"
           output_path: "${base_output_dir}/projection"
           reference_pca: "${steps.ancestry_analysis.pca_file}"
         execute_params:
           num_components: 10

**Execution:**

.. code-block:: bash

   ideal-genom run --config population_analysis.yaml

Python API Examples
-------------------

Using IDEAL-GENOM Programmatically
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Example 1: Running QC Steps Individually**

.. code-block:: python

   from pathlib import Path

   from ideal_genom.qc.sample_qc import SampleQC
   from ideal_genom.qc.ancestry_qc import AncestryQC
   from ideal_genom.qc.variant_qc import VariantQC

   # Step 1: Sample QC
   sample_qc = SampleQC(
       input_path=Path("/data/raw_data"),
       input_name="genotype_data",
       output_path=Path("/data/output/sample_qc"),
       output_name="sample_clean",
       build="38"
   )

   sample_qc.execute_sample_qc_pipeline({
       "rename_snp": True,
       "hh_to_missing": True,
       "use_kinship": True,
       "ind_pair": [50, 5, 0.2],
       "mind": 0.1,
       "sex_check": [0.2, 0.8],
       "maf": 0.01,
       "het_deviation": 3,
       "kinship": 0.354
   })

   # Step 2: Ancestry QC
   ancestry_qc = AncestryQC(
       input_path=Path("/data/output/sample_qc"),
       input_name="sample_clean",
       output_path=Path("/data/output/ancestry_qc"),
       output_name="ancestry_clean",
       reference_path=Path("data/1000genomes_build_38"),
       build="38"
   )

   ancestry_qc.execute_ancestry_qc_pipeline({
       "ind_pair": [50, 5, 0.2],
       "pca": 10,
       "maf": 0.05,
       "ref_threshold": 3,
       "stu_threshold": 3,
       "reference_pop": "EUR",
       "num_pcs": 10
   })

   # Step 3: Variant QC
   variant_qc = VariantQC(
       input_path=Path("/data/output/ancestry_qc"),
       input_name="ancestry_clean",
       output_path=Path("/data/output/variant_qc"),
       output_name="final_clean"
   )

   variant_qc.execute_variant_qc_pipeline({
       "chr_y": 24,
       "miss_data_rate": 0.1,
       "diff_genotype_rate": 0.0001,
       "geno": 0.05,
       "maf": 0.01,
       "hwe": 0.000001
   })

   print("QC pipeline completed successfully!")

**Example 2: Custom GWAS Analysis**

.. code-block:: python

   from pathlib import Path

   import pandas as pd

   from ideal_genom.gwas.preparatory import Preparatory
   from ideal_genom.gwas.gen_linear_mix_model import GWAS_GLMM

   # Prepare data for GWAS
   prep = Preparatory(
       input_path=Path("/data/qc_output/variant_qc"),
       input_name="final_clean",
       output_path=Path("/data/gwas/prep"),
       output_name="gwas_ready",
       high_ld_file=Path("data/ld_regions_files/high-LD-regions_GRCH38.txt")
   )

   prep.execute_preparatory_pipeline({
       "ind_pair": [50, 5, 0.2],
       "pca": 10,
       "maf": 0.05
   })

   # Run GLMM
   glmm = GWAS_GLMM(
       input_path=Path("/data/gwas/prep"),
       input_name="gwas_ready",
       output_path=Path("/data/gwas/results"),
       output_name="glmm_results"
   )

   glmm.execute_gwas_glmm_pipeline({
       "maf": 0.01,
       "pruned_file": Path("/data/gwas/prep/pruned_data")
   })

   # Load and inspect results
   results = pd.read_csv("/data/gwas/results/glmm_results.assoc.txt", sep='\t')
   significant = results[results['p'] < 5e-8]
   print(f"Found {len(significant)} genome-wide significant variants")

**Example 3: VCF Processing Pipeline**

.. code-block:: python

   from pathlib import Path

   from ideal_genom.post_imputation.vcf_process import ProcessVCF
   from ideal_genom.post_imputation.vcf_to_plink import GetPLINK

   # Process VCF files
   vcf_processor = ProcessVCF(
       input_path=Path("/data/imputed_vcfs"),
       output_path=Path("/data/processed"),
       input_name="placeholder",
       output_name="imputed_clean.vcf.gz"
   )

   vcf_processor.execute_process_vcf_pipeline({
       "password": None,
       "r2_threshold": 0.3,
       "build": "38",
       "ref_genome": None,
       "ref_annotation": "/data/references/dbSNP.vcf.gz",
       "max_threads": 16
   })

   # Convert to PLINK
   plink_converter = GetPLINK(
       input_path=Path("/data/processed"),
       input_name="imputed_clean",
       output_path=Path("/data/plink_output"),
       output_name="imputed_plink"
   )

   plink_converter.execute_plink_conversion_pipeline({
       "double_id": True,
       "for_fam_update_file": None,
       "threads": 8,
       "memory": 32000
   })

   print("VCF processing and conversion completed!")

Jupyter Notebook Examples
--------------------------

The package includes interactive Jupyter notebooks in the ``notebooks/`` directory:

**Available Notebooks:**

- ``01-sample_qc.ipynb``: Interactive sample QC with live plotting
- ``02-ancestry_qc.ipynb``: Population structure analysis with visualizations
- ``03-variant_qc.ipynb``: Variant-level quality control
- ``04-population.ipynb``: Population genetics analysis

**Notebook Features:**

- Step-by-step explanations
- Interactive parameter tuning
- Real-time visualizations
- Result interpretation guides
- Export-ready plots

Common Patterns
---------------

Pattern 1: Sequential Pipeline Execution
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Run pipelines in sequence so that each one consumes the output of the previous one:

.. code-block:: bash

   # Step 1: QC Pipeline
   ideal-genom run --config qc_pipeline.yaml

   # Step 2: GWAS Pipeline (uses QC output)
   ideal-genom run --config gwas_pipeline.yaml

   # Step 3: Population Analysis
   ideal-genom run --config population_analysis.yaml

Pattern 2: Conditional Step Execution
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Enable or disable steps as needed:

.. code-block:: yaml

   steps:
     - name: "sample_qc"
       enabled: true  # Always run

     - name: "ancestry_qc"
       enabled: true  # Run if population structure is a concern

     - name: "variant_qc"
       enabled: false  # Skip if already done

Best Practices
--------------

Configuration Management
^^^^^^^^^^^^^^^^^^^^^^^^

1. **Use Templates**: Start with templates from ``yaml_configs/``
2. **Version Control**: Track your YAML configurations in git
3. **Comment Parameters**: Add comments explaining non-standard values
4. **Validate First**: Always run ``ideal-genom validate`` before execution

.. code-block:: yaml

   # Good: Well-documented configuration
   execute_params:
     maf: 0.05          # Higher MAF for small sample size
     hwe: 0.000001      # Standard threshold
     het_deviation: 4   # Lenient for diverse population

Data Organization
^^^^^^^^^^^^^^^^^

Organize your project directory:

.. code-block:: text

   project/
   ├── configs/
   │   ├── qc_pipeline.yaml
   │   ├── gwas_pipeline.yaml
   │   └── vcf_pipeline.yaml
   ├── data/
   │   ├── raw/
   │   ├── processed/
   │   └── results/
   ├── scripts/
   │   ├── run_analysis.sh
   │   └── visualize_results.py
   └── notebooks/
       └── exploratory_analysis.ipynb

Next Steps
----------

- Explore the :doc:`configuration` guide for detailed parameter explanations
- Check the :doc:`troubleshooting` guide for common issues
- Review pipeline-specific documentation:

  - :doc:`getting_started` - Quick start guide
  - :doc:`gwas_pipeline` - GWAS analysis
  - :doc:`vcf_pipeline` - VCF processing
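
The sequential pattern described under Common Patterns can also be scripted from Python. The following is a minimal sketch and not part of IDEAL-GENOM itself: ``build_commands`` and ``run_pipelines`` are hypothetical helper names, and the sketch assumes that the ``ideal-genom validate`` and ``ideal-genom run`` commands shown on this page exit with a non-zero status on failure.

.. code-block:: python

   from __future__ import annotations

   import subprocess
   from typing import Callable, Sequence


   def build_commands(config: str) -> list[list[str]]:
       """Return the validate and run commands for one configuration file."""
       return [
           ["ideal-genom", "validate", "--config", config],
           ["ideal-genom", "run", "--config", config],
       ]


   def run_pipelines(
       configs: Sequence[str],
       runner: Callable[[list[str]], int] = subprocess.call,
   ) -> list[list[str]]:
       """Validate and run each pipeline in order.

       Stops at the first command whose exit code is non-zero (assumed to
       signal failure) and returns the commands that completed before that.
       """
       completed: list[list[str]] = []
       for config in configs:
           for command in build_commands(config):
               exit_code = runner(command)
               if exit_code != 0:
                   raise RuntimeError(
                       f"{' '.join(command)} failed with exit code {exit_code}"
                   )
               completed.append(command)
       return completed

Calling ``run_pipelines(["qc_pipeline.yaml", "gwas_pipeline.yaml", "population_analysis.yaml"])`` would validate and then run the three configurations in order, so a later pipeline never starts from incomplete output. Passing a different ``runner`` makes it possible to check the ordering logic without invoking the CLI.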