GWAS Modules
Modules for genome-wide association studies including preparatory steps, linear models, and mixed models.
Preparatory Analysis
Class for preparatory steps before GWAS.
This class handles the pruning of high linkage disequilibrium (LD) regions and performs Principal Component Analysis (PCA) on the pruned data. It uses PLINK software for the pruning and PCA operations, ensuring that the input data is in the correct format and that necessary files are present. It also manages the fetching of high LD regions if they are not provided, using the FetcherLDRegions class from the ideal_genom package. It is designed to be flexible with parameters such as missing rate, minor allele frequency, and number of principal components to compute. It also allows for memory and thread management during PLINK execution.
- class ideal_genom.gwas.preparatory.Preparatory(input_path: str | Path, input_name: str, output_path: str | Path, output_name: str, high_ld_regions_file: str | Path, build: str = '38')[source]
Bases:
object
A class for preprocessing genomic data in preparation for analysis.
This class handles the preparatory steps needed for genomic data analysis, including input validation, LD (Linkage Disequilibrium) pruning, and PCA (Principal Component Analysis) decomposition.
- high_ld_file
Path to the high LD regions file. If not found, will be fetched automatically
- Type:
str or Path
- Raises:
ValueError – If input_path or output_path is None, or if input_name or output_name is None
TypeError – If input_path or output_path is not of type str or Path, or if input_name or output_name is not of type str, or if build is not of type str
FileNotFoundError – If the specified input_path or output_path does not exist, or if the required PLINK files (.bed, .bim, .fam) are not found, or if the high LD file is not found and cannot be fetched.
Notes
This class uses PLINK software for genomic data processing operations.
Note
The class assumes that PLINK is installed and available in the system PATH.
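The file checks listed under Raises can be approximated with a small helper. This is a minimal sketch, not the class's actual code: `missing_plink_files` is a hypothetical name, and it only covers the directory and PLINK binary fileset checks.

```python
from pathlib import Path


def missing_plink_files(input_path, input_name):
    """Return the PLINK binary files (.bed, .bim, .fam) that are absent.

    Hypothetical helper for illustration; not part of ideal_genom.
    """
    base = Path(input_path)
    if not base.is_dir():
        raise FileNotFoundError(f"Input directory not found: {base}")
    # Append the extension to the name instead of using with_suffix(),
    # so that input names containing dots are left intact.
    expected = [base / f"{input_name}{ext}" for ext in (".bed", ".bim", ".fam")]
    return [path for path in expected if not path.is_file()]
```

Running such a check before constructing the class gives one message naming every missing file, rather than a failure on the first one.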
- __init__(input_path: str | Path, input_name: str, output_path: str | Path, output_name: str, high_ld_regions_file: str | Path, build: str = '38') None[source]
- execute_ld_prunning(mind: float = 0.2, maf: float = 0.01, geno: float = 0.1, hwe: float = 5e-06, ind_pair: list = [50, 5, 0.2], memory: int | None = None, threads: int | None = None) None[source]
Execute LD (Linkage Disequilibrium) pruning on genetic data using PLINK.
This method performs LD pruning in two steps:
1. Excludes high LD regions and identifies independent SNPs.
2. Extracts the identified independent SNPs.
- Parameters:
mind (float, optional (default=0.2)) – Missing rate per individual threshold. Excludes individuals with missing rate higher than threshold.
maf (float, optional (default=0.01)) – Minor allele frequency threshold. Must be between 0 and 0.5.
geno (float, optional (default=0.1)) – Missing rate per SNP threshold. Must be between 0 and 1.
hwe (float, optional (default=5e-6)) – Hardy-Weinberg equilibrium exact test p-value threshold. Must be between 0 and 1.
ind_pair (list, optional (default=[50, 5, 0.2])) – Parameters for pairwise pruning: [window size(variants), step size(variants), r^2 threshold]
memory (int, optional (default=None)) – Memory in MB to allocate. If None, uses 2/3 of available system memory.
- Returns:
The results are saved to disk and the pruned file path is stored in self.pruned_file
- Return type:
None
- Raises:
TypeError – If mind, maf, geno, or hwe are not float
ValueError – If maf is not between 0 and 0.5, if geno is not between 0 and 1, or if hwe is not between 0 and 1.
Notes
Uses PLINK software for the pruning operations. Operates on chromosomes 1-22 only. Automatically determines optimal thread count based on system CPU cores.
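The two pruning steps map naturally onto two PLINK invocations. The sketch below builds them as argument lists using PLINK 1.9-style flags. The exact command the class runs is not shown in this reference, so the flag selection, the `-pruned` output suffix, and the function name `ld_pruning_commands` are assumptions.

```python
def ld_pruning_commands(bfile, out_prefix, high_ld_file, maf=0.01, geno=0.1,
                        mind=0.2, hwe=5e-6, ind_pair=(50, 5, 0.2),
                        memory=None, threads=None):
    """Build the two PLINK calls as argument lists (nothing is executed)."""
    window, step, r2 = ind_pair
    resources = []
    if memory is not None:
        resources += ["--memory", str(memory)]
    if threads is not None:
        resources += ["--threads", str(threads)]

    # Step 1: drop high-LD regions, apply QC filters, list independent SNPs.
    find_independent = [
        "plink", "--bfile", str(bfile), "--chr", "1-22",
        "--exclude", "range", str(high_ld_file),
        "--maf", str(maf), "--geno", str(geno),
        "--mind", str(mind), "--hwe", str(hwe),
        "--indep-pairwise", str(window), str(step), str(r2),
        "--out", str(out_prefix),
    ] + resources

    # Step 2: keep only the SNPs PLINK wrote to <out_prefix>.prune.in.
    # The "-pruned" suffix is an assumed naming choice.
    extract = [
        "plink", "--bfile", str(bfile), "--chr", "1-22",
        "--extract", f"{out_prefix}.prune.in",
        "--make-bed", "--out", f"{out_prefix}-pruned",
    ] + resources
    return find_independent, extract
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Keeping the commands as lists avoids shell quoting problems with paths that contain spaces.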
- execute_pc_decomposition(pca: int = 10, threads: int | None = None, memory: int | None = None) None[source]
Execute PCA decomposition on pruned PLINK binary files.
This method performs Principal Component Analysis (PCA) on the pruned genotype data using PLINK software. It requires the existence of pruned binary PLINK files (.bed, .bim, .fam) and generates PCA eigenvectors and eigenvalues.
- Parameters:
pca (int, default=10) – Number of principal components to compute. Must be greater than 0.
- Return type:
None
- Raises:
TypeError – If pca parameter is not an integer.
ValueError – If pca parameter is less than 1.
FileNotFoundError – If any of the required pruned PLINK files (.bed, .bim, .fam) are not found.
Notes
The method automatically determines the optimal number of threads to use based on CPU count, reserving 2 cores for other processes. If CPU count cannot be determined, it defaults to 10 threads.
The output files will be created in the same directory as the input files, using the input name as prefix with extensions .eigenvec and .eigenval.
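The thread rule and output naming described in these notes can be sketched as follows. Both function names are hypothetical, and the floor of one thread on machines with two or fewer cores is an assumption, since the notes do not cover that case.

```python
import os
from pathlib import Path


def threads_from_cpu_count(cpu_count):
    """Apply the rule above: CPUs minus 2, or 10 when the count is unknown."""
    if cpu_count is None:
        return 10
    # Assumption: never request fewer than one thread on small machines.
    return max(1, cpu_count - 2)


def pca_output_files(directory, prefix):
    """Return the .eigenvec and .eigenval paths for a given output prefix."""
    base = Path(directory) / prefix
    return Path(f"{base}.eigenvec"), Path(f"{base}.eigenval")


# os.cpu_count() returns None when the count cannot be determined.
threads = threads_from_cpu_count(os.cpu_count())
```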
- execute_preparatory_pipeline(preparatory_params: dict) None[source]
Execute the full preparatory pipeline including LD pruning and PCA decomposition.
This method combines the LD pruning and PCA decomposition steps into a single pipeline for ease of use. It first performs LD pruning on the input genotype data, followed by PCA decomposition on the pruned data.
- Parameters:
mind (float, optional (default=0.2)) – Missing rate per individual threshold for LD pruning.
maf (float, optional (default=0.01)) – Minor allele frequency threshold for LD pruning.
geno (float, optional (default=0.1)) – Missing rate per SNP threshold for LD pruning.
hwe (float, optional (default=5e-6)) – Hardy-Weinberg equilibrium exact test p-value threshold for LD pruning.
ind_pair (list, optional (default=[50, 5, 0.2])) – Parameters for pairwise pruning during LD pruning.
pca (int, optional (default=10)) – Number of principal components to compute during PCA decomposition.
memory (int, optional (default=None)) – Memory in MB to allocate for PLINK operations.
- Return type:
None
Notes
This method sequentially calls execute_ld_prunning and execute_pc_decomposition.
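Because the pipeline takes a single dictionary, it can help to see how such a dictionary might be divided between the two steps. The sketch below is an assumption about that division, not the library's code. The defaults are copied from the parameter list above, and `split_preparatory_params` is a hypothetical name.

```python
LD_DEFAULTS = {"mind": 0.2, "maf": 0.01, "geno": 0.1, "hwe": 5e-6,
               "ind_pair": [50, 5, 0.2], "memory": None}
PCA_DEFAULTS = {"pca": 10, "memory": None}


def split_preparatory_params(preparatory_params):
    """Split one dict into keyword arguments for the two steps."""
    known = set(LD_DEFAULTS) | set(PCA_DEFAULTS)
    unknown = set(preparatory_params) - known
    if unknown:
        # Rejecting unknown keys catches typos such as "mafs" early.
        raise KeyError(f"Unknown parameter(s): {sorted(unknown)}")
    ld = {k: preparatory_params.get(k, v) for k, v in LD_DEFAULTS.items()}
    pca = {k: preparatory_params.get(k, v) for k, v in PCA_DEFAULTS.items()}
    return ld, pca


ld_kwargs, pca_kwargs = split_preparatory_params({"maf": 0.05, "pca": 5})
# With a Preparatory instance `prep`, the two steps could then be called as
# prep.execute_ld_prunning(**ld_kwargs); prep.execute_pc_decomposition(**pca_kwargs)
```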
Generalized Linear Model (GLM)
This module provides a class for performing Genome-Wide Association Studies (GWAS) using a Generalized Linear Model (GLM) with PLINK2.
It includes methods for association analysis, obtaining top hits, and annotating SNPs with gene information.
- class ideal_genom.gwas.gen_linear_model.GWAS_GLM(input_path: str | Path, input_name: str, output_path: str | Path, output_name: str, recompute: bool = True)[source]
Bases:
object
Class for performing Genome-Wide Association Studies (GWAS) using a Generalized Linear Model (GLM) with PLINK2.
This class provides methods to perform association analysis, obtain top hits, and annotate SNPs with gene information.
- input_path
Path to the input directory.
- Type:
Path
- output_path
Path to the output directory.
- Type:
Path
- results_dir
Directory where the results will be saved.
- Type:
Path
- Raises:
ValueError – If input_path, output_path, input_name, or output_name are not provided.
FileNotFoundError – If the specified input_path or output_path does not exist.
FileNotFoundError – If the required PLINK files (.bed, .bim, .fam) are not found in the input_path.
TypeError – If input_name or output_name are not strings, or if recompute is not a boolean.
- __init__(input_path: str | Path, input_name: str, output_path: str | Path, output_name: str, recompute: bool = True) None[source]
- glm_association_analysis(maf: float = 0.01, mind: float = 0.1, hwe: float = 5e-06, ci: float = 0.95) None[source]
Perform fixed model association analysis using PLINK2.
This method performs a fixed model association analysis on genomic data using PLINK2. It checks the validity of the input parameters, ensures necessary files exist, and executes the PLINK2 command to perform the analysis.
- Parameters:
- Returns:
A dictionary containing the status of the process, the step name, and the output directory.
- Return type:
- Raises:
TypeError – If any of the input parameters are not of type float.
ValueError – If any of the input parameters are out of their respective valid ranges.
FileNotFoundError – If the required PCA file is not found.
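The type and range checks behind the TypeError and ValueError entries might look like the sketch below. The upper bounds follow the ranges documented for the same parameters in the pipeline method of this class. Treating the bounds as inclusive is an assumption, as is the function name `validate_glm_params`.

```python
def validate_glm_params(maf=0.01, mind=0.1, hwe=5e-6, ci=0.95):
    """Raise TypeError or ValueError for arguments outside the documented ranges."""
    values = {"maf": maf, "mind": mind, "hwe": hwe, "ci": ci}
    upper = {"maf": 0.5, "mind": 1.0, "hwe": 1.0, "ci": 1.0}
    for name, value in values.items():
        # Integers such as 1 are rejected: the reference requires float.
        if not isinstance(value, float):
            raise TypeError(f"{name} must be a float, got {type(value).__name__}")
        if not 0.0 <= value <= upper[name]:
            raise ValueError(f"{name} must be between 0 and {upper[name]}")
```

Note that an integer argument such as `maf=1` fails the type check before the range check is reached.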
- get_top_hits(maf: float = 0.01) None[source]
Get the top hits from the GWAS results.
- Parameters:
maf (float) – Minor allele frequency threshold. Must be a float between 0 and 0.5.
- Returns:
A dictionary containing the process status, step name, and output directory.
- Return type:
- Raises:
TypeError – If maf is not of type float.
ValueError – If maf is not between 0 and 0.5.
Notes
- The function performs the following steps:
Validates the type and range of the maf parameter.
Computes the number of threads to use based on the available CPU cores.
Loads the results of the association analysis and renames columns according to GCTA requirements.
Prepares a .ma file with the necessary columns.
If recompute is True, constructs and executes a GCTA command to perform conditional and joint analysis.
Returns a dictionary with the process status, step name, and output directory.
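The renaming step can be illustrated with the GCTA-COJO summary format, whose columns are SNP, A1, A2, freq, b, se, p and N. The input column names below (ID, REF, ALT, A1, A1_FREQ, BETA, SE, P, OBS_CT) are PLINK2-style names assumed for this sketch. The columns the class actually reads are not listed in this reference, and both function names are hypothetical.

```python
import csv
import io

COJO_COLUMNS = ["SNP", "A1", "A2", "freq", "b", "se", "p", "N"]


def to_cojo_row(row):
    """Map one PLINK2-style result row (a dict) to GCTA-COJO columns."""
    a1 = row["A1"]
    # The other allele is whichever of REF/ALT is not the tested allele.
    a2 = row["ALT"] if a1 == row["REF"] else row["REF"]
    return {"SNP": row["ID"], "A1": a1, "A2": a2, "freq": row["A1_FREQ"],
            "b": row["BETA"], "se": row["SE"], "p": row["P"], "N": row["OBS_CT"]}


def write_ma(rows, handle):
    """Write rows as a tab-delimited .ma table with a header line."""
    writer = csv.DictWriter(handle, fieldnames=COJO_COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(to_cojo_row(row))


# Example with a single made-up record, written to an in-memory buffer.
buffer = io.StringIO()
write_ma([{"ID": "rs1", "REF": "A", "ALT": "G", "A1": "G", "A1_FREQ": "0.12",
           "BETA": "0.3", "SE": "0.05", "P": "1e-9", "OBS_CT": "1000"}], buffer)
```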
- annotate_top_hits(gtf_path: str | None = None, build: str = '38', anno_source: str = 'ensembl') None[source]
Annotate top SNP hits from COJO analysis with gene information.
This method reads the COJO joint analysis results, extracts the top SNPs, and annotates them with gene information using the specified genome build and annotation source. The annotated results are saved to a TSV file.
- Parameters:
gtf_path (Optional[str], default=None) – Path to the GTF (Gene Transfer Format) file for custom annotation. If None, the annotation will use default resources.
build (str, default='38') – Genome build version to use for annotation (‘38’ for GRCh38, etc.).
anno_source (str, default="ensembl") – Source of annotations to use (e.g., “ensembl”, “refseq”).
- Returns:
A dictionary containing:
‘pass’: Boolean indicating if the process completed successfully
‘step’: The name of the step (‘annotate_hits’)
‘output’: Dictionary with output file paths
- Return type:
- Raises:
FileExistsError – If the COJO results file is not found in the results directory.
Notes
The annotated results are saved to ‘top_hits_annotated.tsv’ in the results directory.
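This reference does not say how a SNP is matched to a gene. One common approach, shown here purely as an assumption, is to report the gene that contains the SNP or, failing that, the nearest gene on the same chromosome. The function name and the tuple layout of `genes` are hypothetical. In practice the intervals would come from the GTF file.

```python
def nearest_gene(chrom, pos, genes):
    """Return (gene_name, distance) for the gene closest to a SNP.

    `genes` is an iterable of (chrom, start, end, name) tuples. The distance
    is 0 when the position falls inside the gene. Returns (None, None) when
    the chromosome has no genes.
    """
    best_name, best_dist = None, None
    for g_chrom, start, end, name in genes:
        if g_chrom != chrom:
            continue
        if start <= pos <= end:
            dist = 0
        else:
            dist = min(abs(pos - start), abs(pos - end))
        if best_dist is None or dist < best_dist:
            best_name, best_dist = name, dist
    return best_name, best_dist
```

A linear scan is adequate for a handful of top hits. Annotating many SNPs would call for sorted intervals or an interval tree.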
- execute_gwas_glm_pipeline(glm_params: dict) None[source]
Execute the complete GWAS fixed effects pipeline.
This method orchestrates the full GWAS analysis workflow using a generalized linear model (GLM). It sequentially executes all necessary steps: performing the association analysis, extracting top hits, and annotating them with gene information.
- Parameters:
maf (float, optional (default=0.01)) – Minor allele frequency threshold for filtering SNPs. Must be between 0 and 0.5.
mind (float, optional (default=0.1)) – Individual missingness threshold. Must be between 0 and 1.
hwe (float, optional (default=5e-6)) – Hardy-Weinberg equilibrium threshold. Must be between 0 and 1.
ci (float, optional (default=0.95)) – Confidence interval threshold. Must be between 0 and 1.
gtf_path (Optional[str], optional (default=None)) – Path to the GTF file for custom annotation. If None, uses default annotation resources.
build (str, optional (default='38')) – Genome build version to use for annotation (‘38’ for GRCh38, ‘37’ for GRCh37).
anno_source (str, optional (default='ensembl')) – Source of annotations to use (e.g., ‘ensembl’, ‘refseq’).
- Returns:
Results are saved to the results directory specified during initialization.
- Return type:
None
- Raises:
TypeError – If any of the numeric parameters are not of the correct type.
ValueError – If any of the parameters are out of their respective valid ranges.
FileNotFoundError – If required input files (e.g., PCA file) are not found.
Notes
This method sequentially calls:
1. glm_association_analysis() - Performs the GLM-based GWAS analysis
2. get_top_hits() - Extracts top significant hits using conditional analysis
3. annotate_top_hits() - Annotates hits with gene information
The pipeline expects that preparatory steps (LD pruning and PCA) have already been performed, as it requires the existence of .eigenvec files.
Examples
>>> gwas = GWAS_GLM(input_path='data/', input_name='genotypes',
...                 output_path='results/', output_name='gwas_results')
>>> gwas.execute_gwas_glm_pipeline({'maf': 0.05, 'hwe': 1e-6, 'build': '38'})
Generalized Linear Mixed Model (GLMM)
This module provides a class for performing Genome-Wide Association Studies (GWAS) using a Generalized Linear Mixed Model (GLMM) with GCTA.
It includes methods for association analysis, obtaining top hits, and annotating SNPs with gene information.
- class ideal_genom.gwas.gen_linear_mix_model.GWAS_GLMM(input_path: str | Path, input_name: str, output_path: str | Path, output_name: str, recompute: bool = True)[source]
Bases:
object
Class for performing Genome-Wide Association Studies (GWAS) using a Generalized Linear Mixed Model (GLMM) with GCTA.
This class provides methods to perform association analysis, obtain top hits, and annotate SNPs with gene information.
- Parameters:
input_path (str) – Path to the input directory containing PLINK files.
input_name (str) – Base name of the input PLINK files (without extensions).
output_path (str) – Path to the output directory where results will be saved.
output_name (str) – Base name for the output files.
recompute (bool) – Flag indicating whether to recompute the analysis if results already exist. Default is True.
- Raises:
ValueError – If input_path, output_path, input_name, or output_name are not provided.
FileNotFoundError – If the specified input_path or output_path does not exist.
FileNotFoundError – If the required PLINK files (.bed, .bim, .fam) are not found in the input_path.
TypeError – If input_name or output_name are not strings, or if recompute is not a boolean.
- __init__(input_path: str | Path, input_name: str, output_path: str | Path, output_name: str, recompute: bool = True) None[source]
- prepare_aux_files() None[source]
Prepares auxiliary files for GWAS analysis by processing phenotype and sex data.
This function reads a .fam file, extracts and recodes phenotype and sex information, and writes the processed data to new files in the specified results directory.
- Returns:
A dictionary containing the status of the process, the step name, and the output directory.
- Return type:
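A PLINK .fam file has six whitespace-separated columns: family ID, individual ID, father ID, mother ID, sex (1 = male, 2 = female, 0 = unknown) and phenotype (1 = control, 2 = case, 0 or -9 = missing). The recoding this method applies is not spelled out here, so the mapping in the sketch below is an assumption, and `recode_fam_line` is a hypothetical name.

```python
def recode_fam_line(line):
    """Split one .fam line into a phenotype record and a sex record.

    Assumed recoding (not confirmed by the source): phenotype 2 -> "1" (case),
    1 -> "0" (control), anything else -> "NA"; sex 1 or 2 kept, 0 -> "NA".
    """
    fid, iid, _father, _mother, sex, pheno = line.split()[:6]
    pheno_out = {"1": "0", "2": "1"}.get(pheno, "NA")
    sex_out = sex if sex in ("1", "2") else "NA"
    return (fid, iid, pheno_out), (fid, iid, sex_out)
```

Writing each record as a space-separated line gives a phenotype file and a covariate file keyed by family and individual ID.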
- compute_grm(pruned_file: Path, max_threads: int | None = None) None[source]
Compute the Genetic Relationship Matrix (GRM) using GCTA software.
This method computes the GRM for the given input data using the GCTA software. It allows for multi-threaded execution and can optionally recompute the GRM if specified.
- Parameters:
max_threads (int, optional) – The maximum number of threads to use for computation. If not specified, it defaults to the number of available CPU cores minus two. If the number of CPU cores cannot be determined, it defaults to 10.
- Returns:
- A dictionary containing the following keys:
’pass’ (bool): Indicates whether the process completed successfully.
’step’ (str): The name of the step performed (‘compute_grm’).
’output’ (dict): A dictionary containing the output file paths with the key ‘gcta_out’.
- Return type:
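A GRM computation with GCTA is typically a single call using `--make-grm`. The sketch below builds that call as an argument list. The executable name `gcta64`, the function names, and the choice of flags are assumptions, since the command the method constructs is not shown in this reference.

```python
def grm_command(pruned_file, out_prefix, max_threads, gcta="gcta64"):
    """Build a GCTA --make-grm call as an argument list (nothing is executed)."""
    return [gcta, "--bfile", str(pruned_file), "--make-grm",
            "--thread-num", str(max_threads), "--out", str(out_prefix)]


def grm_outputs(out_prefix):
    """Files GCTA writes for a binary GRM with the given output prefix."""
    return [f"{out_prefix}{ext}" for ext in (".grm.bin", ".grm.N.bin", ".grm.id")]
```

Checking whether the files from `grm_outputs` already exist is one way to decide if the GRM needs to be recomputed.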
- run_gwas_glmm(maf: float = 0.01) None[source]
Runs a Genome-Wide Association Study (GWAS) using a generalized linear mixed model (GLMM).
- Parameters:
maf (float) – Minor allele frequency threshold for filtering SNPs. Default is 0.01.
- Returns:
A dictionary containing the status of the process, the step name, and the output directory.
- Return type:
- Raises:
TypeError – If maf is not of type float.
ValueError – If maf is not between 0 and 1.
FileExistsError – If required input files are not found in the results directory.
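For orientation, a mixed-model association run in GCTA can be expressed with the `--mlma` option, which takes the genotypes, a precomputed GRM, a phenotype file and optional covariates. Whether this method uses `--mlma` or another GCTA mode is not stated here, so the sketch below is an assumption. The function name and the `gcta64` executable name are likewise hypothetical.

```python
def mlma_command(bfile, grm_prefix, pheno_file, covar_file, out_prefix,
                 maf=0.01, threads=1, gcta="gcta64"):
    """Build a GCTA --mlma call as an argument list (nothing is executed)."""
    # Mirror the documented checks: float type, then the 0 to 1 range.
    if not isinstance(maf, float):
        raise TypeError("maf must be a float")
    if not 0.0 <= maf <= 1.0:
        raise ValueError("maf must be between 0 and 1")
    return [gcta, "--mlma", "--bfile", str(bfile), "--grm", str(grm_prefix),
            "--pheno", str(pheno_file), "--covar", str(covar_file),
            "--maf", str(maf), "--thread-num", str(threads),
            "--out", str(out_prefix)]
```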
- get_top_hits(maf: float = 0.01) None[source]
Get the top hits from the GWAS results.
This function processes the results of a genome-wide association study (GWAS) to identify the top hits. It prepares the necessary files and optionally recomputes the results using GCTA.
- Parameters:
maf (float, optional) – Minor allele frequency threshold. Default is 0.01. Must be between 0 and 0.5.
- Returns:
A dictionary containing the status of the process, the step name, and the output directory.
- Return type:
- Raises:
TypeError – If maf is not of type float.
ValueError – If maf is not between 0 and 0.5.
- annotate_top_hits(gtf_path: str | None = None, build: str = '38', anno_source: str = 'ensembl') None[source]
Annotate top genetic hits from GWAS analysis with gene information.
This method loads top hits from COJO analysis results, annotates them with gene information using the specified genome build and annotation source, and saves the annotated results to a TSV file.
- Parameters:
gtf_path (Optional[str], default=None) – Path to a GTF file for custom annotation. If None, will use built-in annotation resources.
build (str, default='38') – Genome build version to use for annotation (e.g., ‘38’, ‘37’).
anno_source (str, default='ensembl') – Source of the annotation data (e.g., ‘ensembl’).
- Returns:
A dictionary containing:
‘pass’ (bool): Whether the process completed successfully
‘step’ (str): The name of the processing step
‘output’ (dict): Dictionary of output file paths
- Return type:
- Raises:
FileExistsError – If the COJO file is not found in the results directory.
Notes
The annotated results are saved to ‘top_hits_annotated.tsv’ in the results directory.
- execute_gwas_glmm_pipeline(glmm_params: dict) None[source]
Execute the complete GWAS random effects pipeline.
This method orchestrates the full GWAS analysis workflow using a generalized linear mixed model (GLMM). It sequentially executes all necessary steps: preparing auxiliary files, computing the genetic relationship matrix (GRM), running the GWAS analysis, extracting top hits, and annotating them with gene information.
- Parameters:
maf (float, optional (default=0.01)) – Minor allele frequency threshold for filtering SNPs. Must be between 0 and 1.
max_threads (int, optional (default=None)) – Maximum number of threads to use for GRM computation. If None, uses optimal thread count based on available CPU cores.
gtf_path (Optional[str], optional (default=None)) – Path to the GTF file for custom annotation. If None, uses default annotation resources.
build (str, optional (default='38')) – Genome build version to use for annotation (‘38’ for GRCh38, ‘37’ for GRCh37).
anno_source (str, optional (default='ensembl')) – Source of annotations to use (e.g., ‘ensembl’, ‘refseq’).
- Returns:
Results are saved to the results directory specified during initialization.
- Return type:
None
- Raises:
TypeError – If maf is not of type float.
ValueError – If maf is not between 0 and 1.
FileExistsError – If required intermediate files are not found during the pipeline execution.
Notes
This method sequentially calls:
1. prepare_aux_files() - Prepares phenotype and covariate files
2. compute_grm() - Computes the genetic relationship matrix
3. run_gwas_glmm() - Performs the GWAS analysis
4. get_top_hits() - Extracts top significant hits
5. annotate_top_hits() - Annotates hits with gene information
Examples
>>> gwas = GWAS_GLMM(input_path='data/', input_name='genotypes',
...                  output_path='results/', output_name='gwas_results')
>>> gwas.execute_gwas_glmm_pipeline({'maf': 0.05, 'build': '38'})
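The sequential structure of the pipeline, with each step receiving only the parameters it understands, can be sketched with a small generic runner. This is an illustration of the pattern under stated assumptions, not the library's implementation. `run_steps` and the step tuples are hypothetical, and the stand-in callables replace the real methods.

```python
def run_steps(steps, params):
    """Call each (name, function, accepted_keys) step in order.

    Each step receives only the keys it accepts. Execution stops at the
    first exception, so later steps never run on missing inputs.
    """
    completed = []
    for name, func, keys in steps:
        kwargs = {k: params[k] for k in keys if k in params}
        func(**kwargs)
        completed.append(name)
    return completed


# Stand-in callables record what they were given instead of running GCTA.
calls = []
steps = [
    ("prepare_aux_files", lambda: calls.append(("aux", {})), ()),
    ("run_gwas_glmm", lambda **kw: calls.append(("gwas", kw)), ("maf",)),
    ("annotate_top_hits", lambda **kw: calls.append(("anno", kw)),
     ("gtf_path", "build", "anno_source")),
]
done = run_steps(steps, {"maf": 0.05, "build": "38"})
```

Keys absent from the dictionary are simply not passed, so each step falls back to its own defaults.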