.. IDEAL-GENOM documentation master file

IDEAL-GENOM Documentation
==========================

.. image:: https://readthedocs.org/projects/verus-ideal-genom/badge/?version=latest
   :target: https://verus-ideal-genom.readthedocs.io/en/latest/
   :alt: Documentation Status

.. image:: https://img.shields.io/pypi/v/ideal-genom.svg
   :target: https://pypi.org/project/ideal-genom/
   :alt: PyPI version

**IDEAL-GENOM** is a Python package for automated, reproducible analysis of human genotype data. It provides end-to-end pipelines for genomic quality control (QC), post-imputation VCF processing, and genome-wide association studies (GWAS). The package builds on workflows developed over years of research at CGE Tübingen and integrates PLINK 1.9/2.0, GCTA, and BCFtools with detailed reporting and visualizations.

Version: **1.1.0**

🎯 Key Features
---------------

**Comprehensive Pipelines**
   - **Genomic QC**: Sample QC, Ancestry QC, and Variant QC for case-control studies
   - **GWAS Analysis**: Generalized Linear Models (GLM) and Generalized Linear Mixed Models (GLMM)
   - **VCF Processing**: Post-imputation filtering, normalization, and conversion to PLINK
   - **Population Structure**: FST statistics, PCA, UMAP visualization, and ancestry projection

**Advanced Analytics**
   - **Sample Quality Control**: Missingness, sex verification, heterozygosity, relatedness (kinship/IBD)
   - **Ancestry Analysis**: Population stratification detection with 1000 Genomes reference
   - **Variant Filtering**: Hardy-Weinberg equilibrium, MAF, genotype rate, differential missingness
   - **GWAS Tools**: Association testing, top-hits extraction, gene annotation (Ensembl/RefSeq)
   - **Dimensionality Reduction**: PCA and UMAP for population structure visualization

**Modern Design**
   - **YAML Configuration**: Single configuration file with clear, hierarchical structure
   - **Flexible Pipeline System**: Enable/disable steps, customize parameters per analysis
   - **Multiple Interfaces**: Command-line tool, Python API, Jupyter notebooks
   - **Docker Support**: Pre-configured container with all genomic tools installed
   - **Automated Workflows**: Pipeline executor handles dependencies and data flow
   - **Rich Reporting**: Publication-ready plots and comprehensive QC metrics

**Developer Friendly**
   - **Reproducible**: All steps, parameters, and outputs logged
   - **Extensible**: Modular architecture for adding custom analysis steps
   - **Well Documented**: Comprehensive guides, API reference, and examples
   - **Type Hints**: Full type annotations for better IDE support

Quick Start
-----------

**Installation**

.. code-block:: bash

   pip install ideal-genom

**Basic Usage**

.. code-block:: bash

   # Generate a configuration template
   ideal-genom template --output my_pipeline.yaml
   
   # Edit the configuration file to match your data
   nano my_pipeline.yaml
   
   # Validate your configuration
   ideal-genom validate --config my_pipeline.yaml
   
   # Run the pipeline
   ideal-genom run --config my_pipeline.yaml

**Python API**

.. code-block:: python

   from ideal_genom.core.config import load_config
   from ideal_genom.core.pipeline import PipelineExecutor
   
   # Load configuration
   config = load_config("my_pipeline.yaml")
   
   # Create and execute pipeline
   executor = PipelineExecutor(config)
   executor.execute()
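
The command-line steps from Basic Usage can also be chained from a script. The following is a minimal sketch rather than part of the package: it relies only on the ``validate`` and ``run`` subcommands shown above, and it assumes that the ``ideal-genom`` executable is on ``PATH`` and exits with a non-zero status when validation fails. The helper names ``build_commands`` and ``validate_then_run`` are hypothetical.

.. code-block:: python

   import subprocess


   def build_commands(config_path):
       """Return the validate and run commands for one configuration file."""
       config = str(config_path)
       return [
           ["ideal-genom", "validate", "--config", config],
           ["ideal-genom", "run", "--config", config],
       ]


   def validate_then_run(config_path):
       """Validate first; start the pipeline only if validation succeeds."""
       for command in build_commands(config_path):
           # check=True raises CalledProcessError on a non-zero exit status,
           # so a failed validation stops before the run command is reached.
           # Assumption: the CLI signals validation errors via its exit status.
           subprocess.run(command, check=True)


   # Example call (requires ideal-genom to be installed):
   # validate_then_run("my_pipeline.yaml")

Passing the command as a list avoids shell quoting problems when the configuration path contains spaces.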

Available Pipelines
-------------------

**QC Pipeline** - Quality control for case-control studies
   1. Sample QC: Individual-level quality control
   2. Ancestry QC: Population structure and outlier detection
   3. Variant QC: SNP-level quality control
   4. Population Visualization: UMAP/t-SNE plots

**GWAS Pipeline** - Genome-wide association analysis
   1. Preparatory: LD pruning and PCA decomposition
   2. GLM Analysis: Fixed effects association testing
   3. GLMM Analysis: Mixed model with genetic relationship matrix
   4. Annotation: Gene mapping and functional annotation

**VCF Pipeline** - Post-imputation processing
   1. VCF Processing: Filter, normalize, annotate, concatenate
   2. PLINK Conversion: Convert to PLINK binary format
   3. Quality Filtering: R² threshold, multiallelic handling

Documentation Contents
----------------------

.. toctree::
   :maxdepth: 2
   :caption: User Guide

   installation
   getting_started
   configuration
   examples
   
.. toctree::
   :maxdepth: 2
   :caption: Pipelines

   qc_pipeline
   gwas_pipeline
   vcf_pipeline

.. toctree::
   :maxdepth: 2
   :caption: API Reference
   
   api_overview

.. toctree::
   :maxdepth: 1
   :caption: Additional Resources

   faq
   troubleshooting
   contributing
   changelog

Supported Tools
---------------

IDEAL-GENOM integrates the following genomic analysis tools:

- **PLINK 1.9**: Classic PLINK for QC and association analysis
- **PLINK 2.0**: Modern version with improved performance (AVX2 optimized)
- **GCTA**: Genetic relationship matrix and mixed model analysis
- **BCFtools**: VCF manipulation and quality filtering

The pipeline invokes these tools automatically, but they must be installed separately. Alternatively, use the provided Docker image, which has them pre-installed.
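
When the tools are installed outside Docker, a quick check that they are reachable can save a failed run. The sketch below is illustrative and not part of IDEAL-GENOM. The executable names (``plink``, ``plink2``, ``gcta64``, ``bcftools``) are assumptions based on common installations and may differ on your system, and ``find_missing_tools`` is a hypothetical helper.

.. code-block:: python

   import shutil

   # Assumed executable names; adjust them to match your installation.
   ASSUMED_EXECUTABLES = {
       "PLINK 1.9": "plink",
       "PLINK 2.0": "plink2",
       "GCTA": "gcta64",
       "BCFtools": "bcftools",
   }


   def find_missing_tools(executables=ASSUMED_EXECUTABLES):
       """Return the tool names whose executable cannot be found on PATH."""
       return [
           tool
           for tool, executable in executables.items()
           if shutil.which(executable) is None
       ]


   missing = find_missing_tools()
   if missing:
       print("Not found on PATH: " + ", ".join(missing))
   else:
       print("All external tools were found on PATH.")

This only confirms that an executable with the given name exists on ``PATH``. It does not verify the tool's version.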

Citation
--------

If you use IDEAL-GENOM in your research, please cite:

.. code-block:: bibtex

   @software{ideal_genom_2026,
     title = {IDEAL-GENOM: Comprehensive Genomic Analysis Pipeline},
     author = {Giraldo González, Luis and Tenghe, Amabel},
     year = {2026},
     version = {1.1.0},
     url = {https://github.com/cge-tubingens/ideal-genom-qc}
   }

Getting Help
------------

- **Documentation**: https://ideal-genom-qc.readthedocs.io/
- **Issues**: https://github.com/cge-tubingens/cge-comrare-pipeline/issues
- **Examples**: See the :doc:`examples` page for complete workflows

License
-------

IDEAL-GENOM is released under the MIT License. See the LICENSE file in the repository for details.

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`