Configuration Guide

This comprehensive guide explains the YAML-based configuration system in IDEAL-GENOM v1.1.0. The configuration file controls all aspects of your genomic analysis pipeline, from data paths to QC thresholds.

Overview

IDEAL-GENOM uses a single YAML configuration file that defines:

  • Pipeline metadata (name, output directory)

  • Analysis steps to execute (QC, GWAS, VCF processing)

  • Parameters for each step (thresholds, options)

  • Global settings (logging, resources, file handling)

Benefits of YAML Configuration:

  • Single Source of Truth: All settings in one file

  • Hierarchical Structure: Clear organization of related parameters

  • Variable Substitution: Reference values dynamically (e.g., ${base_output_dir})

  • Step Control: Enable/disable steps without editing code

  • Self-Documenting: Comments explain parameters inline

Configuration File Structure

A configuration file has two top-level sections, pipeline and settings:

pipeline:
  # Pipeline metadata and steps
  name: "my_analysis"
  base_output_dir: "/path/to/output"
  steps:
    - name: "step_name"
      # Step configuration...

settings:
  # Global settings
  logging: { ... }
  resources: { ... }
  files: { ... }

Getting Started with Configuration

1. Start from a Template

Copy a template from the repository:

cp yaml_configs/qc_pipeline_config_template.yaml my_config.yaml

2. Edit Required Fields

At minimum, update these paths:

pipeline:
  base_output_dir: "/your/output/path"  # Where results will be saved
  steps:
    - name: "sample_qc"
      init_params:
        input_path: "/your/input/path"  # Where your PLINK files are
        input_name: "your_dataset"      # PLINK file prefix (without .bed/.bim/.fam)

3. Validate Your Configuration

ideal-genom validate --config my_config.yaml

Pipeline Section

The pipeline section defines your analysis workflow.

Pipeline Metadata

pipeline:
  name: "my_study_qc"           # Descriptive name for this analysis
  base_output_dir: "/data/output"  # Root directory for all outputs

name (string, required)

A descriptive identifier for your pipeline. Used in logging and output organization.

base_output_dir (string, required)

Absolute path where all pipeline outputs will be saved. Each step creates subdirectories here.

Pipeline Steps

Steps are executed in the order listed:

pipeline:
  steps:
    - name: "sample_qc"
      enabled: true
      module: "ideal_genom.qc.sample_qc"
      class: "SampleQC"
      init_params:
        # Parameters passed to class __init__
      execute_params:
        # Parameters passed to execute() method

name (string, required)

Unique identifier for this step. Used for variable substitution and logging.

enabled (boolean, required)

Set to true to run this step, false to skip it.

module (string, required)

Python module path containing the step’s class.

class (string, required)

Class name to instantiate for this step.

init_params (mapping, required)

Parameters passed to the class constructor (__init__).

execute_params (mapping, optional)

Parameters passed to the execute() method when running the step.
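
The fields above suggest a loader that imports the module, instantiates the class with init_params, and calls execute() with execute_params. The sketch below illustrates that pattern under stated assumptions: run_step, DemoQC, and the demo_steps module are hypothetical stand-ins so the example runs without IDEAL-GENOM installed, and the real pipeline runner may behave differently.

```python
import importlib
import sys
import types


def run_step(step):
    """Instantiate and execute one step; return None when the step is disabled."""
    if not step.get("enabled", False):
        return None
    module = importlib.import_module(step["module"])
    step_class = getattr(module, step["class"])
    instance = step_class(**step.get("init_params", {}))
    return instance.execute(**step.get("execute_params", {}))


# Hypothetical stand-in class; it is not an IDEAL-GENOM class.
class DemoQC:
    def __init__(self, input_name):
        self.input_name = input_name

    def execute(self, mind=0.02):
        return f"{self.input_name}: mind={mind}"


# Register the stand-in under a made-up module name so import_module finds it.
demo_module = types.ModuleType("demo_steps")
demo_module.DemoQC = DemoQC
sys.modules["demo_steps"] = demo_module

step = {
    "name": "sample_qc",
    "enabled": True,
    "module": "demo_steps",
    "class": "DemoQC",
    "init_params": {"input_name": "mydata"},
    "execute_params": {"mind": 0.05},
}
result = run_step(step)
print(result)
```

Keeping constructor arguments and execute() arguments in separate mappings mirrors the split between init_params and execute_params in the configuration.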

Variable Substitution

Reference values from elsewhere in the configuration:

pipeline:
  base_output_dir: "/data/output"
  steps:
    - name: "sample_qc"
      init_params:
        output_path: "${base_output_dir}"  # Expands to /data/output

    - name: "variant_qc"
      init_params:
        # Use output from previous step
        input_path: "${steps.sample_qc.clean_dir}"
        input_name: "${steps.sample_qc.output_name}"

Available substitutions:

  • ${base_output_dir} - Pipeline’s base output directory

  • ${steps.STEP_NAME.ATTRIBUTE} - Attributes from previous steps

    • .clean_dir - Path to clean output files

    • .output_name - Output file prefix

    • .output_path - Output directory path
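
To make the substitution rules concrete, here is a minimal sketch of how ${...} placeholders could be resolved against a nested dictionary. The substitute function and the context values are assumptions made for illustration; IDEAL-GENOM's own resolver is not shown in this guide and may differ.

```python
import re

# Matches ${anything.without.closing.brace}
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def substitute(value, context):
    """Replace each ${dotted.key} in value by walking context along the dots."""

    def lookup(match):
        node = context
        for part in match.group(1).split("."):
            node = node[part]  # a KeyError here signals an unknown reference
        return str(node)

    return PLACEHOLDER.sub(lookup, value)


context = {
    "base_output_dir": "/data/output",
    "steps": {
        "sample_qc": {
            # Illustrative values only; real paths depend on the pipeline.
            "clean_dir": "/data/output/sample_qc/clean_files",
            "output_name": "mydata_sampleQCed",
        }
    },
}

print(substitute("${base_output_dir}/qc_results", context))
print(substitute("${steps.sample_qc.output_name}", context))
```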

QC Pipeline Configuration

Sample QC Step

Performs individual-level quality control:

- name: "sample_qc"
  enabled: true
  module: "ideal_genom.qc.sample_qc"
  class: "SampleQC"
  init_params:
    input_path: "/data/input"           # Directory containing PLINK files
    input_name: "mydata"                # PLINK file prefix
    output_path: "${base_output_dir}"   # Output directory
    output_name: "mydata_sampleQCed"    # Output file prefix
    high_ld_regions_file: "auto"        # LD regions file (or "auto" for built-in)
    build: "38"                         # Genome build: "37" or "38"
  execute_params:
    rename_snp: true                    # Rename SNPs to chr:pos format
    hh_to_missing: true                 # Convert heterozygous haploid to missing
    use_kinship: true                   # Use kinship instead of IBD
    ind_pair: [50, 5, 0.2]              # LD pruning [window, step, r²]
    mind: 0.02                          # Max missing rate per individual
    sex_check: [0.2, 0.8]               # F coefficient [female_max, male_min]
    maf: 0.01                           # Minor allele frequency threshold
    het_deviation: 3                    # Heterozygosity SD threshold
    kinship: 0.354                      # Kinship coefficient threshold
    ibd_threshold: 0.185                # IBD threshold for duplicates

init_params:

  • input_path (string): Directory containing input .bed/.bim/.fam files

  • input_name (string): PLINK file prefix (e.g., “mydata” for mydata.bed)

  • output_path (string): Where to save QC results

  • output_name (string): Prefix for output files

  • high_ld_regions_file (string): Path to high-LD regions file, or “auto” to use built-in

  • build (string): Genome build version - “37” (GRCh37/hg19) or “38” (GRCh38/hg38)

execute_params:

  • rename_snp (bool): Rename SNPs to chr:pos format for consistency

  • hh_to_missing (bool): Convert heterozygous haploid calls to missing

  • use_kinship (bool): Use KING kinship estimation (recommended over IBD)

  • ind_pair (list): LD pruning parameters [window_size, step_size, r²_threshold]

    • window_size: SNP window in variant count (default: 50)

    • step_size: Step size in variant count (default: 5)

    • r² threshold: Correlation threshold (default: 0.2)

  • mind (float, 0-1): Maximum missing genotype rate per individual (default: 0.02 = 2%)

  • sex_check (list[float]): F coefficient thresholds [female_max, male_min]

    • female_max: Maximum F for females (default: 0.2)

    • male_min: Minimum F for males (default: 0.8)

    • Samples outside these ranges fail sex check

  • maf (float, 0-0.5): Minor allele frequency threshold for LD pruning

  • het_deviation (float): Standard deviations from mean heterozygosity (default: 3)

  • kinship (float): Kinship coefficient threshold for relatedness

    • 0.354: Lower bound for duplicates/monozygotic twins (KING inference ranges)

    • 0.177: Lower bound for 1st degree relatives

    • 0.088: Lower bound for 2nd degree relatives

  • ibd_threshold (float): IBD threshold for identifying duplicates/monozygotic twins
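
The sex_check and het_deviation thresholds can be read as simple decision rules. The sketch below is an illustration only: the function names, the "female"/"male" labels, and the heterozygosity values are assumptions, and the actual SampleQC implementation may compute these differently.

```python
from statistics import mean, stdev


def fails_sex_check(f_coeff, reported_sex, sex_check=(0.2, 0.8)):
    """Return True when the F coefficient contradicts the reported sex."""
    female_max, male_min = sex_check
    if reported_sex == "female":
        return f_coeff > female_max
    if reported_sex == "male":
        return f_coeff < male_min
    raise ValueError(f"unknown sex label: {reported_sex!r}")


def het_outliers(het_rates, het_deviation=3):
    """Return sample IDs whose heterozygosity lies beyond mean +/- het_deviation SD."""
    mu = mean(het_rates.values())
    sd = stdev(het_rates.values())
    return [s for s, r in het_rates.items() if abs(r - mu) > het_deviation * sd]


# Made-up values for illustration: 19 typical samples and one outlier.
het_rates = {f"S{i}": 0.30 for i in range(1, 20)}
het_rates["S20"] = 0.50

print(fails_sex_check(0.5, "female"))
print(fails_sex_check(0.95, "male"))
print(het_outliers(het_rates))
```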

Ancestry QC Step

Detects population structure and removes ancestry outliers:

- name: "ancestry_qc"
  enabled: true
  module: "ideal_genom.qc.ancestry_qc"
  class: "AncestryQC"
  init_params:
    input_path: "${steps.sample_qc.clean_dir}"
    input_name: "${steps.sample_qc.output_name}"
    output_path: "${base_output_dir}"
    output_name: "mydata_ancestryQCed"
    high_ld_regions_file: "auto"
    build: "38"
  execute_params:
    ind_pair: [50, 5, 0.2]        # LD pruning for PCA
    pca: 10                       # Number of principal components
    maf: 0.01                     # MAF threshold for PCA
    ref_threshold: 4              # SD threshold for reference outliers
    stu_threshold: 4              # SD threshold for study outliers
    reference_pop: "EUR"          # Expected population
    num_pcs: 10                   # PCs for ancestry assignment
    distance_metric: "infinity"   # Distance metric for outlier detection

execute_params:

  • ind_pair (list[int]): LD pruning parameters for PCA variants

  • pca (int): Number of principal components to compute

  • maf (float): MAF threshold for variants included in PCA

  • ref_threshold (float): Standard deviations for reference population outliers

  • stu_threshold (float): Standard deviations for study population outliers

  • reference_pop (string): Expected population ancestry

    • “EUR”: European

    • “AFR”: African

    • “AMR”: Admixed American

    • “EAS”: East Asian

    • “SAS”: South Asian

  • num_pcs (int): Number of PCs used for ancestry classification

  • distance_metric (string): “euclidean”, “manhattan”, or “infinity” (Chebyshev)
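
The three distance_metric options correspond to standard vector norms. The sketch below shows how each would measure the distance between a sample and a reference centroid in PC space; the distance function is a hypothetical helper, and how IDEAL-GENOM combines the distance with ref_threshold and stu_threshold is not covered here.

```python
import math


def distance(point, center, metric="infinity"):
    """Distance between two equal-length coordinate sequences."""
    diffs = [abs(p - c) for p, c in zip(point, center)]
    if metric == "euclidean":
        return math.sqrt(sum(d * d for d in diffs))
    if metric == "manhattan":
        return sum(diffs)
    if metric == "infinity":
        # Chebyshev distance: the largest difference along any single PC
        return max(diffs)
    raise ValueError(f"unsupported metric: {metric!r}")


sample = (3.0, -4.0)    # made-up PC1 and PC2 coordinates
centroid = (0.0, 0.0)

for name in ("euclidean", "manhattan", "infinity"):
    print(name, distance(sample, centroid, name))
```

With the "infinity" metric a sample is flagged by its single worst PC, whereas "euclidean" and "manhattan" accumulate deviations across all PCs.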

Variant QC Step

Performs variant-level quality control:

- name: "variant_qc"
  enabled: true
  module: "ideal_genom.qc.variant_qc"
  class: "VariantQC"
  init_params:
    input_path: "${steps.ancestry_qc.clean_dir}"
    input_name: "${steps.ancestry_qc.output_name}"
    output_path: "${base_output_dir}"
    output_name: "mydata_variantQCed"
  execute_params:
    miss_data_rate: 0.02          # Max missing rate across samples
    diff_genotype_rate: 1.0e-5    # Differential missingness p-value
    geno: 0.02                    # Max missing rate per variant
    maf: 0.01                     # Minor allele frequency
    hwe: 1.0e-6                   # Hardy-Weinberg equilibrium p-value
    chr_y: 24                     # Y chromosome identifier

execute_params:

  • miss_data_rate (float, 0-1): Maximum overall missing data rate threshold

  • diff_genotype_rate (float): P-value threshold for differential missingness between cases/controls

  • geno (float, 0-1): Maximum missing genotype rate per variant

  • maf (float, 0-0.5): Minor allele frequency threshold

    • Standard GWAS: 0.01-0.05

    • Rare variant analysis: 0.001-0.01

    • Very strict: 0.001

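  • hwe (float, 0-1): Hardy-Weinberg equilibrium p-value threshold. Under PLINK's convention, variants with an HWE p-value below the threshold are removed, so a smaller threshold removes fewer variants.

    • Standard: 1e-6

    • Removes only extreme deviations: 1e-10 (for genotyping array data)

    • Removes more variants: 1e-4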

  • chr_y (int): Numeric code for the Y chromosome. PLINK's default human coding uses 24 for Y (23 is X) regardless of genome build.
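
As a rough illustration of how the geno, maf, and hwe thresholds interact, the sketch below applies them to per-variant summary statistics. It follows PLINK's filtering conventions (drop variants with missingness above geno, MAF below maf, or HWE p-value below hwe); the VariantStats class and passes_variant_qc function are hypothetical, and the VariantQC class may apply additional filters.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    variant_id: str
    missing_rate: float  # fraction of samples with a missing genotype
    maf: float           # minor allele frequency
    hwe_p: float         # Hardy-Weinberg equilibrium test p-value


def passes_variant_qc(v, geno=0.02, maf=0.01, hwe=1e-6):
    """Return True when the variant survives all three filters."""
    return v.missing_rate <= geno and v.maf >= maf and v.hwe_p >= hwe


# Made-up statistics for illustration.
variants = [
    VariantStats("rs1", 0.01, 0.20, 0.5),    # passes every filter
    VariantStats("rs2", 0.05, 0.20, 0.5),    # too much missing data
    VariantStats("rs3", 0.01, 0.005, 0.5),   # too rare
    VariantStats("rs4", 0.01, 0.20, 1e-8),   # deviates from HWE
]

kept = [v.variant_id for v in variants if passes_variant_qc(v)]
print(kept)
```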

Population Analysis Step

Performs dimensionality reduction and population visualization:

- name: "dimensionality_reduction"
  enabled: true
  module: "ideal_genom.population.projection"
  class: "DimensionalityReductionPipeline"
  init_params:
    input_path: "${steps.variant_qc.clean_dir}"
    input_name: "${steps.variant_qc.output_name}"
    output_path: "${base_output_dir}"
    build: "38"
    high_ld_regions_file: "auto"
    generate_plot: true
  execute_params:
    # PCA parameters
    pca_params:
      pca: 10
    force_pca_recompute: false

    # UMAP parameters
    run_umap: true
    umap_params:
      n_neighbors: 15
      min_dist: 0.1
      n_components: 2

    # t-SNE parameters
    run_tsne: true
    tsne_params:
      perplexity: 30

    # Plotting options
    case_control_markers: true
    plot_format: "png"
    dpi: 600

execute_params:

  • pca_params (mapping): PCA configuration

    • pca (int): Number of components to compute

  • force_pca_recompute (bool): Recompute PCA even if results exist

  • run_umap (bool): Enable UMAP analysis

  • umap_params (mapping): UMAP configuration

    • n_neighbors (int): Number of neighbors (5-50, default: 15)

    • min_dist (float): Minimum distance (0.0-1.0, default: 0.1)

    • n_components (int): Output dimensions (typically 2 or 3)

  • run_tsne (bool): Enable t-SNE analysis

  • tsne_params (mapping): t-SNE configuration

    • perplexity (int): Perplexity value (5-50, default: 30)

  • case_control_markers (bool): Color by case/control status

  • plot_format (string): “png”, “svg”, or “pdf”

  • dpi (int): Plot resolution (default: 600)

Settings Section

Global settings that apply to the entire pipeline:

Logging Settings

settings:
  logging:
    level: "INFO"              # Logging verbosity
    file_logging: true         # Write to log file
    console_logging: true      # Print to console

level (string): Log message detail level

  • “DEBUG”: Very detailed, for troubleshooting

  • “INFO”: Standard informational messages (recommended)

  • “WARNING”: Only warnings and errors

  • “ERROR”: Only errors

file_logging (bool): Save logs to pipeline.log in output directory

console_logging (bool): Print log messages to terminal

Resource Settings

settings:
  resources:
    max_memory: null           # Maximum memory in MB
    max_threads: null          # Maximum CPU threads

max_memory (int or null): Maximum memory allocation in MB

  • null: Auto-detect (uses 2/3 of available RAM)

  • Explicit value: Set specific limit (e.g., 32000 for 32GB)

max_threads (int or null): Maximum CPU threads to use

  • null: Auto-detect (uses the number of available cores minus 2)

  • Explicit value: Set specific number
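
The auto-detect rules described above can be sketched as follows. The function names are hypothetical, and the amount of available RAM is passed in as a parameter because the guide does not say how IDEAL-GENOM measures it.

```python
import os


def resolve_threads(max_threads=None):
    """Use the explicit value, or the available core count minus 2 (at least 1)."""
    if max_threads is not None:
        return max_threads
    cores = os.cpu_count() or 1
    return max(1, cores - 2)


def resolve_memory_mb(max_memory=None, available_mb=0):
    """Use the explicit value in MB, or two thirds of the available RAM."""
    if max_memory is not None:
        return max_memory
    return (2 * available_mb) // 3


print(resolve_threads(8))
print(resolve_memory_mb(None, available_mb=48000))
```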

File Management Settings

settings:
  files:
    keep_intermediate: true    # Preserve temporary files
    compress_outputs: false    # Compress output files
    overwrite_existing: false  # Overwrite existing results

keep_intermediate (bool): Keep temporary intermediate files

  • true: Keep all files (useful for debugging)

  • false: Clean up after each step (saves disk space)

compress_outputs (bool): Compress output files with gzip

overwrite_existing (bool): Overwrite existing output files

  • true: Overwrite without asking

  • false: Fail if outputs exist (safer)

Report Generation Settings

settings:
  reports:
    generate_reports: true     # Generate visualization reports
    plot_format: "png"         # Plot file format

generate_reports (bool): Automatically generate QC plots and reports

plot_format (string): Output format for plots

  • “png”: Standard format, good quality

  • “svg”: Vector format, scalable

  • “pdf”: Publication-ready format

Advanced Configuration Patterns

Conditional Step Execution

Skip steps based on your needs:

pipeline:
  steps:
    - name: "sample_qc"
      enabled: true
    - name: "ancestry_qc"
      enabled: false  # Skip for homogeneous population
    - name: "variant_qc"
      enabled: true
      init_params:
        # Connect directly to sample QC
        input_path: "${steps.sample_qc.clean_dir}"

Using Pre-existing Results

Resume the pipeline from an intermediate step:

pipeline:
  steps:
    - name: "sample_qc"
      enabled: false  # Already completed
    - name: "variant_qc"
      enabled: true
      init_params:
        # Use existing sample QC output
        input_path: "/data/output/my_study/sample_qc/clean_files"
        input_name: "mydata_sampleQCed"

Multiple Output Directories

Organize outputs by analysis type:

pipeline:
  base_output_dir: "/data/project"
  steps:
    - name: "sample_qc"
      init_params:
        output_path: "${base_output_dir}/qc_results"
    - name: "gwas_prep"
      init_params:
        output_path: "${base_output_dir}/gwas_analysis"

Parameter Tuning Guidelines

Sample QC Thresholds

For Standard Case-Control GWAS:

  • mind: 0.02 (2% missing)

  • maf: 0.01 (1% MAF)

  • het_deviation: 3 SD

  • kinship: 0.354 (KING cutoff for duplicates/monozygotic twins; 0.177 also excludes 1st degree relatives)

For Rare Variant Analysis:

  • mind: 0.01 (stricter)

  • maf: 0.001 (include rare variants)

  • het_deviation: 4 SD (more lenient)

For Family-Based Studies:

  • kinship: 0.088 (allow up to 3rd degree relatives)

  • Adjust sex_check if samples include children

Ancestry QC Thresholds

For Homogeneous Populations:

  • ref_threshold: 6 SD (softer)

  • stu_threshold: 6 SD (softer)

  • Consider disabling ancestry QC entirely

Variant QC Thresholds

For Array-Based Data:

  • geno: 0.02 (2% missing)

  • hwe: 1e-10 (removes only variants with extreme deviation from HWE)

  • maf: 0.01

For Sequencing Data:

  • geno: 0.05 (more lenient)

  • hwe: 1e-6 (standard)

  • maf: 0.001 (include rare variants)

Common Configuration Examples

Minimal QC Pipeline

pipeline:
  name: "minimal_qc"
  base_output_dir: "/data/output"
  steps:
    - name: "sample_qc"
      enabled: true
      module: "ideal_genom.qc.sample_qc"
      class: "SampleQC"
      init_params:
        input_path: "/data/input"
        input_name: "mydata"
        output_path: "${base_output_dir}"
        output_name: "mydata_clean"
        high_ld_regions_file: "auto"
        build: "38"
      execute_params:
        mind: 0.02
        maf: 0.01

settings:
  logging:
    level: "INFO"

Complete QC with Ancestry

pipeline:
  name: "full_qc"
  base_output_dir: "/data/output"
  steps:
    - name: "sample_qc"
      enabled: true
      module: "ideal_genom.qc.sample_qc"
      class: "SampleQC"
      init_params:
        input_path: "/data/input"
        input_name: "mydata"
        output_path: "${base_output_dir}"
        output_name: "mydata_sampleQCed"
        high_ld_regions_file: "auto"
        build: "38"
      execute_params:
        mind: 0.02
        sex_check: [0.2, 0.8]
        maf: 0.01
        het_deviation: 3
        kinship: 0.354

    - name: "ancestry_qc"
      enabled: true
      module: "ideal_genom.qc.ancestry_qc"
      class: "AncestryQC"
      init_params:
        input_path: "${steps.sample_qc.clean_dir}"
        input_name: "${steps.sample_qc.output_name}"
        output_path: "${base_output_dir}"
        output_name: "mydata_ancestryQCed"
        high_ld_regions_file: "auto"
        build: "38"
      execute_params:
        pca: 10
        ref_threshold: 4
        stu_threshold: 4
        reference_pop: "EUR"

    - name: "variant_qc"
      enabled: true
      module: "ideal_genom.qc.variant_qc"
      class: "VariantQC"
      init_params:
        input_path: "${steps.ancestry_qc.clean_dir}"
        input_name: "${steps.ancestry_qc.output_name}"
        output_path: "${base_output_dir}"
        output_name: "mydata_final"
      execute_params:
        geno: 0.02
        maf: 0.01
        hwe: 1.0e-6

Troubleshooting Configuration

Configuration validation fails:

  1. Check YAML syntax (indentation, colons, quotes)

  2. Verify all required fields are present

  3. Ensure paths exist and are accessible

  4. Check module and class names are correct
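
The required-field check in the list above can be partly automated before running the validate command. The sketch below assumes the YAML has already been parsed into a Python dictionary (for example with PyYAML's yaml.safe_load); missing_step_fields is a hypothetical helper, and the field list is taken from the fields marked "required" in the Pipeline Steps section.

```python
# Fields marked "required" in the Pipeline Steps section of this guide.
REQUIRED_STEP_FIELDS = ("name", "enabled", "module", "class", "init_params")


def missing_step_fields(config):
    """Map each incomplete step to the list of required fields it lacks."""
    problems = {}
    steps = config.get("pipeline", {}).get("steps", [])
    for index, step in enumerate(steps):
        missing = [f for f in REQUIRED_STEP_FIELDS if f not in step]
        if missing:
            problems[step.get("name", f"step #{index}")] = missing
    return problems


# A dictionary shaped like a parsed configuration; values are illustrative.
config = {
    "pipeline": {
        "name": "minimal_qc",
        "base_output_dir": "/data/output",
        "steps": [
            {
                "name": "sample_qc",
                "enabled": True,
                "module": "ideal_genom.qc.sample_qc",
                "class": "SampleQC",
                "init_params": {"input_path": "/data/input"},
            },
            {"name": "variant_qc", "enabled": True, "init_params": {}},
        ],
    }
}

print(missing_step_fields(config))
```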

Pipeline runs but produces no output:

  1. Verify enabled: true for desired steps

  2. Check input file paths are correct

  3. Review pipeline.log for errors

  4. Ensure output directory is writable

Memory errors:

  1. Set max_memory explicitly

  2. Reduce max_threads to free memory

  3. Process datasets in batches

  4. Set keep_intermediate: false to free disk space

Variable substitution not working:

  1. Ensure correct syntax: ${variable_name}

  2. Check referenced step names match exactly

  3. Verify step order (can’t reference future steps)
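
Checks 2 and 3 above can be approximated with a small script that scans init_params for ${steps.NAME...} references and reports any that point to a step not defined earlier in the list. This is a sketch under assumptions: bad_step_references is a hypothetical helper operating on the parsed steps list, and it does not replicate the validate command.

```python
import re

# Captures STEP_NAME from ${steps.STEP_NAME.ATTRIBUTE}
STEP_REF = re.compile(r"\$\{steps\.([^.}]+)\.[^}]+\}")


def bad_step_references(steps):
    """Return (step, referenced_step) pairs where the reference is not an earlier step."""
    seen = set()
    problems = []
    for step in steps:
        for value in step.get("init_params", {}).values():
            if isinstance(value, str):
                for ref in STEP_REF.findall(value):
                    if ref not in seen:
                        problems.append((step["name"], ref))
        seen.add(step["name"])
    return problems


steps = [
    {"name": "sample_qc", "init_params": {"output_path": "${base_output_dir}"}},
    {
        "name": "variant_qc",
        "init_params": {
            "input_path": "${steps.ancestry_qc.clean_dir}",   # step not defined
            "input_name": "${steps.sample_qc.output_name}",   # earlier step, fine
        },
    },
]

print(bad_step_references(steps))
```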

See Also

Docker Paths:

When using Docker, paths should point to locations inside the container’s /data directory:

{
    "input_directory": "/data/inputData",
    "input_prefix": "mydata",
    "output_directory": "/data/outputData",
    "output_prefix": "clean_data",
    "high_ld_file": "/data/dependables/high-LD-regions.txt"
}

Steps Configuration

The steps.json file controls which pipeline steps to execute:

{
    "ancestry": true,
    "sample": true,
    "variant": true,
    "umap": true,
    "fst": true
}

Step Dependencies:

  • sample → ancestry → variant → dim reduction → fst

  • You can skip steps, but maintain dependencies

  • Results from previous steps are required for subsequent steps

Advanced Configuration

Custom LD Regions

Provide your own high-LD regions file:

# high-LD-regions.txt format
1   48000000    52000000    # Chromosome, start, end
2   85000000    100000000
6   25000000    35000000
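
A file in this layout can be sanity-checked with a few lines of Python before it is passed to the pipeline. The parser below is a sketch: it assumes whitespace-separated columns (chromosome, start, end) and treats "#" as the start of a comment, which is inferred from the example above rather than from a format specification.

```python
def parse_ld_regions(text):
    """Parse 'chrom start end' lines into (chrom, start, end) tuples."""
    regions = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        if start >= end:
            raise ValueError(f"region start must be below end: {line!r}")
        regions.append((chrom, start, end))
    return regions


example = """\
# high-LD-regions.txt format
1   48000000    52000000    # Chromosome, start, end
2   85000000    100000000
6   25000000    35000000
"""

for region in parse_ld_regions(example):
    print(region)
```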

Performance Tuning

Memory Optimization:

  • Increase ind_pair window size for large datasets

  • Reduce pca components if memory is limited

  • Process chromosomes separately for very large datasets

Speed Optimization:

  • Use SSD storage for temporary files

  • Increase available CPU cores

  • Consider splitting large datasets

Disk Space Management:

  • Monitor intermediate file sizes

  • Clean up temporary files regularly

  • Use compression for archival storage

Best Practices

  1. Version Control: Keep configuration files under version control

  2. Documentation: Document parameter choices and rationale

  3. Validation: Always validate results visually

  4. Backup: Keep copies of successful configurations

  5. Testing: Test parameter changes on small datasets first

Troubleshooting

Common Configuration Issues:

  • Path not found: Check absolute paths and permissions

  • Parameter out of range: Verify threshold values are reasonable

  • JSON syntax errors: Validate JSON format

  • Memory errors: Reduce dataset size or adjust parameters

See the Troubleshooting Guide for more detailed solutions.