GWAS Modules

Modules for genome-wide association studies including preparatory steps, linear models, and mixed models.

Preparatory Analysis

Class for preparatory steps before GWAS.

This class prunes regions of high linkage disequilibrium (LD) and runs Principal Component Analysis (PCA) on the pruned data. Both operations are delegated to PLINK, after the class has checked that the input data is in the expected format and that the required files are present. If no high LD regions file is provided, one is fetched with the FetcherLDRegions class from the ideal_genom package. The missing rate, minor allele frequency, and number of principal components are configurable, as are the memory and thread count used when PLINK runs.

class ideal_genom.gwas.preparatory.Preparatory(input_path: str | Path, input_name: str, output_path: str | Path, output_name: str, high_ld_regions_file: str | Path, build: str = '38')[source]

Bases: object

A class for preprocessing genomic data in preparation for analysis.

This class handles the preparatory steps needed for genomic data analysis, including input validation, LD (Linkage Disequilibrium) pruning, and PCA (Principal Component Analysis) decomposition.

input_path

Path to the directory containing input PLINK files (.bed, .bim, .fam)

Type: str or Path

input_name

Base name of the input PLINK files (without extension)

Type: str

output_path

Path to the directory where output files will be saved

Type: str or Path

output_name

Base name for the output files

Type: str

high_ld_file

Path to the high LD regions file. If not found, will be fetched automatically

Type: str or Path

build

Genome build version, either ‘38’ or ‘37’

Type: str, default=’38’

Raises:

ValueError – If input_path or output_path is None, or if input_name or output_name is None
TypeError – If input_path or output_path is not of type str or Path, or if input_name or output_name is not of type str, or if build is not of type str
FileNotFoundError – If the specified input_path or output_path does not exist, or if the required PLINK files (.bed, .bim, .fam) are not found, or if the high LD file is not found and cannot be fetched.
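The checks listed above can be illustrated with a short standalone sketch. It is not the class's actual implementation: the helper name `check_plink_fileset` and the order of the checks are assumptions made for illustration, and only the exception types follow the list above.

```python
from pathlib import Path

# The three files that make up a PLINK binary fileset.
PLINK_EXTENSIONS = (".bed", ".bim", ".fam")


def check_plink_fileset(input_path, input_name):
    """Hypothetical helper: return the fileset paths or raise a documented error."""
    if input_path is None or input_name is None:
        raise ValueError("input_path and input_name must be provided")
    if not isinstance(input_path, (str, Path)):
        raise TypeError("input_path must be of type str or Path")
    if not isinstance(input_name, str):
        raise TypeError("input_name must be of type str")

    directory = Path(input_path)
    if not directory.is_dir():
        raise FileNotFoundError(f"Directory not found: {directory}")

    # Append the extension to the base name instead of using with_suffix(),
    # so that base names containing dots are left intact.
    files = [directory / f"{input_name}{ext}" for ext in PLINK_EXTENSIONS]
    missing = [f.name for f in files if not f.is_file()]
    if missing:
        raise FileNotFoundError(f"Missing PLINK files: {', '.join(missing)}")
    return files
```

Calling `check_plink_fileset('data/', 'genotypes')` would then return the three paths only when `genotypes.bed`, `genotypes.bim`, and `genotypes.fam` all exist in `data/`.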

Notes

This class uses PLINK software for genomic data processing operations.

Note

The class assumes that PLINK is installed and available in the system PATH.

__init__(input_path: str | Path, input_name: str, output_path: str | Path, output_name: str, high_ld_regions_file: str | Path, build: str = '38') → None[source]

execute_ld_prunning(mind: float = 0.2, maf: float = 0.01, geno: float = 0.1, hwe: float = 5e-06, ind_pair: list = [50, 5, 0.2], memory: int | None = None, threads: int | None = None) → None[source]

Execute LD (Linkage Disequilibrium) pruning on genetic data using PLINK.

This method performs LD pruning in two steps:

1. Excludes high LD regions and identifies independent SNPs.
2. Extracts the identified independent SNPs.

Parameters:

mind (float, optional (default=0.2)) – Missing rate per individual threshold. Excludes individuals with missing rate higher than threshold.
maf (float, optional (default=0.01)) – Minor allele frequency threshold. Must be between 0 and 0.5.
geno (float, optional (default=0.1)) – Missing rate per SNP threshold. Must be between 0 and 1.
hwe (float, optional (default=5e-6)) – Hardy-Weinberg equilibrium exact test p-value threshold. Must be between 0 and 1.
ind_pair (list, optional (default=[50, 5, 0.2])) – Parameters for pairwise pruning: [window size(variants), step size(variants), r^2 threshold]
memory (int, optional (default=None)) – Memory in MB to allocate. If None, uses 2/3 of available system memory.

Returns:

The results are saved to disk and the pruned file path is stored in self.pruned_file

Return type:

None

Raises:

TypeError – If mind, maf, geno, or hwe are not float
ValueError – If maf is not between 0 and 0.5, if geno is not between 0 and 1, or if hwe is not between 0 and 1.

Notes

Uses PLINK software for the pruning operations. Operates on chromosomes 1-22 only. Automatically determines optimal thread count based on system CPU cores.
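As a rough illustration of the two steps, the sketch below assembles PLINK 1.9-style argument lists without running them. The flags (`--indep-pairwise`, `--exclude range`, `--extract`, `--chr`, `--memory`, `--threads`) are standard PLINK options, but the function name, the `-pruned` output suffix, and the exact set of flags the method passes are assumptions rather than a copy of the package's code.

```python
def build_ld_pruning_commands(bfile, out_prefix, high_ld_file,
                              mind=0.2, maf=0.01, geno=0.1, hwe=5e-6,
                              ind_pair=(50, 5, 0.2), memory=None, threads=None):
    """Hypothetical helper: return the argument lists for both pruning steps."""
    window, step, r2 = ind_pair

    # Optional resource limits, appended to both commands.
    resources = []
    if memory is not None:
        resources += ["--memory", str(memory)]
    if threads is not None:
        resources += ["--threads", str(threads)]

    # Step 1: apply the QC filters, drop high LD regions, and list
    # approximately independent SNPs in <out_prefix>.prune.in.
    find_independent = [
        "plink", "--bfile", str(bfile),
        "--chr", "1-22",
        "--mind", str(mind), "--maf", str(maf),
        "--geno", str(geno), "--hwe", str(hwe),
        "--exclude", "range", str(high_ld_file),
        "--indep-pairwise", str(window), str(step), str(r2),
        "--out", str(out_prefix),
    ] + resources

    # Step 2: keep only the SNPs retained in step 1 and write a new fileset.
    # The "-pruned" suffix is an illustrative naming choice.
    extract_independent = [
        "plink", "--bfile", str(bfile),
        "--extract", f"{out_prefix}.prune.in",
        "--make-bed",
        "--out", f"{out_prefix}-pruned",
    ] + resources

    return find_independent, extract_independent
```

Each list can be handed to `subprocess.run(cmd, check=True)`; building the commands as lists rather than shell strings avoids quoting problems with paths that contain spaces.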

execute_pc_decomposition(pca: int = 10, threads: int | None = None, memory: int | None = None) → None[source]

Execute PCA decomposition on pruned PLINK binary files.

This method performs Principal Component Analysis (PCA) on the pruned genotype data using PLINK software. It requires the existence of pruned binary PLINK files (.bed, .bim, .fam) and generates PCA eigenvectors and eigenvalues.

Parameters:

pca (int, default=10) – Number of principal components to compute. Must be greater than 0.

Return type:

None

Raises:

TypeError – If pca parameter is not an integer.
ValueError – If pca parameter is less than 1.
FileNotFoundError – If any of the required pruned PLINK files (.bed, .bim, .fam) are not found.

Notes

The method automatically determines the optimal number of threads to use based on CPU count, reserving 2 cores for other processes. If CPU count cannot be determined, it defaults to 10 threads.

The output files will be created in the same directory as the input files, using the input name as prefix with extensions .eigenvec and .eigenval.
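The thread-selection rule described in these notes can be sketched as follows. The function name and the injectable `cpu_count_fn` argument are illustrative, and clamping the result to at least one thread on machines with two or fewer cores is an added assumption that the notes do not mention.

```python
import os


def resolve_threads(threads=None, reserve=2, fallback=10, cpu_count_fn=os.cpu_count):
    """Hypothetical helper: choose a thread count for an external tool."""
    if threads is not None:
        # An explicit request always wins.
        return threads
    cpu_count = cpu_count_fn()
    if cpu_count is None:
        # os.cpu_count() returns None when the count cannot be determined.
        return fallback
    # Leave `reserve` cores free, but never go below one thread (assumption).
    return max(1, cpu_count - reserve)
```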

execute_preparatory_pipeline(preparatory_params: dict) → None[source]

Execute the full preparatory pipeline including LD pruning and PCA decomposition.

This method combines the LD pruning and PCA decomposition steps into a single pipeline for ease of use. It first performs LD pruning on the input genotype data, followed by PCA decomposition on the pruned data.

Parameters:

mind (float, optional (default=0.2)) – Missing rate per individual threshold for LD pruning.
maf (float, optional (default=0.01)) – Minor allele frequency threshold for LD pruning.
geno (float, optional (default=0.1)) – Missing rate per SNP threshold for LD pruning.
hwe (float, optional (default=5e-6)) – Hardy-Weinberg equilibrium exact test p-value threshold for LD pruning.
ind_pair (list, optional (default=[50, 5, 0.2])) – Parameters for pairwise pruning during LD pruning.
pca (int, optional (default=10)) – Number of principal components to compute during PCA decomposition.
memory (int, optional (default=None)) – Memory in MB to allocate for PLINK operations.

Return type:

None

Notes

This method sequentially calls execute_ld_prunning and execute_pc_decomposition.
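Because the signature takes a single `preparatory_params` dictionary, the parameters listed above are presumably supplied as its keys. The documentation does not say whether omitted keys fall back to the listed defaults, so the sketch below shows one way a caller could build a complete dictionary before passing it in. `PREPARATORY_DEFAULTS` and `resolve_preparatory_params` are hypothetical names, not part of ideal_genom.

```python
import copy

# Defaults copied from the parameter list above.
PREPARATORY_DEFAULTS = {
    "mind": 0.2,
    "maf": 0.01,
    "geno": 0.1,
    "hwe": 5e-6,
    "ind_pair": [50, 5, 0.2],
    "pca": 10,
    "memory": None,
}


def resolve_preparatory_params(user_params):
    """Hypothetical helper: merge user overrides into the documented defaults."""
    unknown = set(user_params) - set(PREPARATORY_DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown preparatory parameters: {sorted(unknown)}")
    # Deep-copy so the mutable ind_pair default is never shared between calls.
    params = copy.deepcopy(PREPARATORY_DEFAULTS)
    params.update(user_params)
    return params
```

The resulting dictionary, for example `resolve_preparatory_params({'maf': 0.05, 'pca': 20})`, contains every documented key and could then be passed to `execute_preparatory_pipeline`.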

Generalized Linear Model (GLM)

This module provides a class for performing Genome-Wide Association Studies (GWAS) using a Generalized Linear Model (GLM) with PLINK2.

It includes methods for association analysis, obtaining top hits, and annotating SNPs with gene information.

class ideal_genom.gwas.gen_linear_model.GWAS_GLM(input_path: str | Path, input_name: str, output_path: str | Path, output_name: str, recompute: bool = True)[source]

Bases: object

Class for performing Genome-Wide Association Studies (GWAS) using a Generalized Linear Model (GLM) with PLINK2.

This class provides methods to perform association analysis, obtain top hits, and annotate SNPs with gene information.

input_path

Path to the input directory.

Type: Path

output_path

Path to the output directory.

Type: Path

input_name

Base name of the input PLINK files.

Type: str

output_name

Base name for the output files.

Type: str

recompute

Flag indicating whether to recompute the analysis.

Type: bool

results_dir

Directory where the results will be saved.

Type: Path

Raises:

ValueError – If input_path, output_path, input_name, or output_name are not provided.
FileNotFoundError – If the specified input_path or output_path does not exist.
FileNotFoundError – If the required PLINK files (.bed, .bim, .fam) are not found in the input_path.
TypeError – If input_name or output_name are not strings, or if recompute is not a boolean.

__init__(input_path: str | Path, input_name: str, output_path: str | Path, output_name: str, recompute: bool = True) → None[source]

glm_association_analysis(maf: float = 0.01, mind: float = 0.1, hwe: float = 5e-06, ci: float = 0.95) → None[source]

Perform fixed-effects association analysis using PLINK2.

This method runs a fixed-effects association analysis on genomic data using PLINK2. It validates the input parameters, checks that the necessary files exist, and executes the PLINK2 command that performs the analysis.

Parameters:

maf (float) – Minor allele frequency threshold. Must be between 0 and 0.5.
mind (float) – Individual missingness threshold. Must be between 0 and 1.
hwe (float) – Hardy-Weinberg equilibrium threshold. Must be between 0 and 1.
ci (float) – Confidence interval threshold. Must be between 0 and 1.

Returns:

A dictionary containing the status of the process, the step name, and the output directory.

Return type:

dict

Raises:

TypeError – If any of the input parameters are not of type float.
ValueError – If any of the input parameters are out of their respective valid ranges.
FileNotFoundError – If the required PCA file is not found.
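The type and range checks documented for this method follow a common pattern, sketched below in isolation. The helper names are hypothetical, and treating the bounds as inclusive is an assumption, since the documentation only says "between".

```python
def validate_threshold(name, value, low, high):
    """Hypothetical helper: require a float within [low, high]."""
    if not isinstance(value, float):
        raise TypeError(f"{name} must be of type float, got {type(value).__name__}")
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")
    return value


def validate_glm_params(maf=0.01, mind=0.1, hwe=5e-6, ci=0.95):
    """Check all four thresholds against the documented ranges."""
    return {
        "maf": validate_threshold("maf", maf, 0.0, 0.5),
        "mind": validate_threshold("mind", mind, 0.0, 1.0),
        "hwe": validate_threshold("hwe", hwe, 0.0, 1.0),
        "ci": validate_threshold("ci", ci, 0.0, 1.0),
    }
```

Note that a strict `isinstance(value, float)` check rejects integers, so a caller passing `mind=0` or `ci=1` under this scheme would get a `TypeError` rather than a `ValueError`.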

get_top_hits(maf: float = 0.01) → None[source]

Get the top hits from the GWAS results.

Parameters:

maf (float) – Minor allele frequency threshold. Must be a float between 0 and 0.5.

Returns:

A dictionary containing the process status, step name, and output directory.

Return type:

dict

Raises:

TypeError – If maf is not of type float.
ValueError – If maf is not between 0 and 0.5.

Notes

The function performs the following steps:

Validates the type and range of the maf parameter.
Computes the number of threads to use based on the available CPU cores.
Loads the results of the association analysis and renames columns according to GCTA requirements.
Prepares a .ma file with the necessary columns.
If recompute is True, constructs and executes a GCTA command to perform conditional and joint analysis.
Returns a dictionary with the process status, step name, and output directory.
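The .ma preparation and the conditional and joint analysis call can be sketched as below. GCTA-COJO expects a whitespace-delimited summary file with the columns SNP, A1, A2, freq, b, se, p, and N, and `--cojo-file`, `--cojo-slct`, and `--thread-num` are documented GCTA options. The source column names in `SOURCE_COLUMN`, the `gcta64` executable name, and both function names are assumptions about what the association results and the environment look like.

```python
import csv

# Column order required by GCTA-COJO summary-statistics (.ma) files.
MA_COLUMNS = ["SNP", "A1", "A2", "freq", "b", "se", "p", "N"]

# Assumed names of the matching columns in the association results.
SOURCE_COLUMN = {
    "SNP": "ID", "A1": "A1", "A2": "OMITTED", "freq": "A1_FREQ",
    "b": "BETA", "se": "SE", "p": "P", "N": "OBS_CT",
}


def write_ma_file(results, ma_path):
    """Write an iterable of result rows (dicts) as a tab-separated .ma file."""
    with open(ma_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(MA_COLUMNS)
        for row in results:
            writer.writerow([row[SOURCE_COLUMN[col]] for col in MA_COLUMNS])


def build_cojo_command(bfile, ma_path, out_prefix, maf=0.01, threads=1):
    """Return the argument list for a stepwise COJO model selection run."""
    return [
        "gcta64", "--bfile", str(bfile),
        "--maf", str(maf),
        "--cojo-file", str(ma_path),
        "--cojo-slct",
        "--thread-num", str(threads),
        "--out", str(out_prefix),
    ]
```

With `--cojo-slct`, GCTA writes the jointly significant SNPs to a `.jma.cojo` file under the chosen output prefix.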

annotate_top_hits(gtf_path: str | None = None, build: str = '38', anno_source: str = 'ensembl') → None[source]

Annotate top SNP hits from COJO analysis with gene information.

This method reads the COJO joint analysis results, extracts the top SNPs, and annotates them with gene information using the specified genome build and annotation source. The annotated results are saved to a TSV file.

Parameters:

gtf_path (Optional[str], default=None) – Path to the GTF (Gene Transfer Format) file for custom annotation. If None, the annotation will use default resources.
build (str, default='38') – Genome build version to use for annotation (‘38’ for GRCh38, etc.).
anno_source (str, default="ensembl") – Source of annotations to use (e.g., “ensembl”, “refseq”).

Returns:

A dictionary containing:

‘pass’: Boolean indicating if the process completed successfully
‘step’: The name of the step (‘annotate_hits’)
‘output’: Dictionary with output file paths

Return type:

dict

Raises:

FileExistsError – If the COJO results file is not found in the results directory.

Notes

The annotated results are saved to ‘top_hits_annotated.tsv’ in the results directory.
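To make the annotation step concrete, the sketch below parses gene records from GTF lines (nine tab-separated columns, with `gene_name` inside the attributes column) and assigns each SNP the gene that contains it or, failing that, the nearest gene on the same chromosome. The nearest-gene rule and both function names are assumptions for illustration. The package may rely on a different strategy or an external annotation tool.

```python
import re

# gene_name "XYZ"; inside the ninth (attributes) GTF column.
GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def parse_gtf_genes(lines):
    """Return (chrom, start, end, gene_name) tuples for GTF 'gene' features."""
    genes = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        match = GENE_NAME.search(fields[8])
        if match:
            genes.append((fields[0], int(fields[3]), int(fields[4]), match.group(1)))
    return genes


def annotate_snp(chrom, pos, genes):
    """Return the containing gene, else the nearest gene on chrom, else None."""
    best_name, best_dist = None, None
    for gene_chrom, start, end, name in genes:
        if gene_chrom != chrom:
            continue
        # Distance is zero when the SNP lies inside the gene interval.
        dist = 0 if start <= pos <= end else min(abs(pos - start), abs(pos - end))
        if best_dist is None or dist < best_dist:
            best_name, best_dist = name, dist
    return best_name
```

Chromosome labels must use the same convention in both inputs ('1' versus 'chr1'); a mismatch would silently leave every SNP unannotated.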

execute_gwas_glm_pipeline(glm_params: dict) → None[source]

Execute the complete GWAS fixed effects pipeline.

This method orchestrates the full GWAS analysis workflow using a generalized linear model (GLM). It sequentially executes all necessary steps: performing the association analysis, extracting top hits, and annotating them with gene information.

Parameters:

maf (float, optional (default=0.01)) – Minor allele frequency threshold for filtering SNPs. Must be between 0 and 0.5.
mind (float, optional (default=0.1)) – Individual missingness threshold. Must be between 0 and 1.
hwe (float, optional (default=5e-6)) – Hardy-Weinberg equilibrium threshold. Must be between 0 and 1.
ci (float, optional (default=0.95)) – Confidence interval threshold. Must be between 0 and 1.
gtf_path (Optional[str], optional (default=None)) – Path to the GTF file for custom annotation. If None, uses default annotation resources.
build (str, optional (default='38')) – Genome build version to use for annotation (‘38’ for GRCh38, ‘37’ for GRCh37).
anno_source (str, optional (default='ensembl')) – Source of annotations to use (e.g., ‘ensembl’, ‘refseq’).

Returns:

Results are saved to the results directory specified during initialization.

Return type:

None

Raises:

TypeError – If any of the numeric parameters are not of the correct type.
ValueError – If any of the parameters are out of their respective valid ranges.
FileNotFoundError – If required input files (e.g., PCA file) are not found.

Notes

This method sequentially calls:

1. glm_association_analysis() - Performs the GLM-based GWAS analysis
2. get_top_hits() - Extracts top significant hits using conditional analysis
3. annotate_top_hits() - Annotates hits with gene information

The pipeline expects that preparatory steps (LD pruning and PCA) have already been performed, as it requires the existence of .eigenvec files.

Examples

>>> gwas = GWAS_GLM(input_path='data/', input_name='genotypes',
...                 output_path='results/', output_name='gwas_results')
>>> gwas.execute_gwas_glm_pipeline({'maf': 0.05, 'hwe': 1e-6, 'build': '38'})

Generalized Linear Mixed Model (GLMM)

This module provides a class for performing Genome-Wide Association Studies (GWAS) using a Generalized Linear Mixed Model (GLMM) with GCTA.

It includes methods for association analysis, obtaining top hits, and annotating SNPs with gene information.

class ideal_genom.gwas.gen_linear_mix_model.GWAS_GLMM(input_path: str | Path, input_name: str, output_path: str | Path, output_name: str, recompute: bool = True)[source]

Bases: object

Class for performing Genome-Wide Association Studies (GWAS) using a Generalized Linear Mixed Model (GLMM) with GCTA.

This class provides methods to perform association analysis, obtain top hits, and annotate SNPs with gene information.

Parameters:

input_path (str) – Path to the input directory containing PLINK files.
input_name (str) – Base name of the input PLINK files (without extensions).
output_path (str) – Path to the output directory where results will be saved.
output_name (str) – Base name for the output files.
recompute (bool) – Flag indicating whether to recompute the analysis if results already exist. Default is True.

Raises:

ValueError – If input_path, output_path, input_name, or output_name are not provided.
FileNotFoundError – If the specified input_path or output_path does not exist.
FileNotFoundError – If the required PLINK files (.bed, .bim, .fam) are not found in the input_path.
TypeError – If input_name or output_name are not strings, or if recompute is not a boolean.

__init__(input_path: str | Path, input_name: str, output_path: str | Path, output_name: str, recompute: bool = True) → None[source]

prepare_aux_files() → None[source]

Prepares auxiliary files for GWAS analysis by processing phenotype and sex data.

This function reads a .fam file, extracts and recodes phenotype and sex information, and writes the processed data to new files in the specified results directory.

Returns: A dictionary containing the status of the process, the step name, and the output directory.
Return type: dict
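The recoding step can be pictured with the sketch below. A .fam file has six whitespace-separated columns (family ID, individual ID, father ID, mother ID, sex, phenotype), with sex coded 1/2 (0 = unknown) and case/control phenotype coded 1/2 (0 or -9 = missing). The target coding used here (controls to 0, cases to 1, anything else to NA) and the function names are assumptions, since the documentation does not spell out the recoding scheme.

```python
def recode_fam_line(line):
    """Return (phenotype_row, sex_row) tuples for one .fam line."""
    fid, iid, _father, _mother, sex, pheno = line.split()[:6]
    # Assumed recoding: 1 (control) -> 0, 2 (case) -> 1, otherwise missing.
    pheno_out = {"1": "0", "2": "1"}.get(pheno, "NA")
    # Keep known sex codes, mark unknown (0) or unexpected values as missing.
    sex_out = sex if sex in ("1", "2") else "NA"
    return (fid, iid, pheno_out), (fid, iid, sex_out)


def write_aux_files(fam_path, pheno_path, sex_path):
    """Write tab-separated phenotype and sex files derived from a .fam file."""
    with open(fam_path) as fam, \
            open(pheno_path, "w") as pheno_out, \
            open(sex_path, "w") as sex_out:
        for line in fam:
            if not line.strip():
                continue  # ignore blank lines
            pheno_row, sex_row = recode_fam_line(line)
            pheno_out.write("\t".join(pheno_row) + "\n")
            sex_out.write("\t".join(sex_row) + "\n")
```

Both output files keep the family and individual IDs in the first two columns, which is the layout GCTA expects for phenotype and covariate files.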

compute_grm(pruned_file: Path, max_threads: int | None = None) → None[source]

Compute the Genetic Relationship Matrix (GRM) using GCTA software.

This method computes the GRM for the given input data using the GCTA software. It allows for multi-threaded execution and can optionally recompute the GRM if specified.

Parameters:

max_threads (int, optional) – The maximum number of threads to use for computation. If not specified, it defaults to the number of available CPU cores minus two. If the number of CPU cores cannot be determined, it defaults to 10.

Returns:

A dictionary containing the following keys:

’pass’ (bool): Indicates whether the process completed successfully.
’step’ (str): The name of the step performed (‘compute_grm’).
’output’ (dict): A dictionary containing the output file paths with the key ‘gcta_out’.

Return type:

dict
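A minimal sketch of how such a step might wrap GCTA and report its status is shown below. `--make-grm`, `--bfile`, `--thread-num`, and `--out` are documented GCTA options, and the returned keys mirror the dictionary described above. The `gcta64` executable name, the function name, and the injectable `runner` argument are assumptions added for illustration and testing.

```python
import subprocess


def compute_grm_sketch(pruned_bfile, out_prefix, threads, runner=subprocess.run):
    """Hypothetical helper: run GCTA --make-grm and return a status dictionary."""
    cmd = [
        "gcta64",
        "--bfile", str(pruned_bfile),
        "--make-grm",
        "--thread-num", str(threads),
        "--out", str(out_prefix),
    ]
    # check=False: report failure through the 'pass' flag instead of raising.
    completed = runner(cmd, check=False)
    return {
        "pass": completed.returncode == 0,
        "step": "compute_grm",
        "output": {"gcta_out": str(out_prefix)},
    }
```

Passing a stub as `runner` makes the function testable on machines where GCTA is not installed.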

run_gwas_glmm(maf: float = 0.01) → None[source]

Runs a Genome-Wide Association Study (GWAS) using a generalized linear mixed model (GLMM).

Parameters:

maf (float) – Minor allele frequency threshold for filtering SNPs. Default is 0.01.

Returns:

A dictionary containing the status of the process, the step name, and the output directory.

Return type:

dict

Raises:

TypeError – If maf is not of type float.
ValueError – If maf is not between 0 and 1.
FileExistsError – If required input files are not found in the results directory.

get_top_hits(maf: float = 0.01) → None[source]

Get the top hits from the GWAS results.

This function processes the results of a genome-wide association study (GWAS) to identify the top hits. It prepares the necessary files and optionally recomputes the results using GCTA.

Parameters:

maf (float, optional) – Minor allele frequency threshold. Default is 0.01. Must be between 0 and 0.5.

Returns:

A dictionary containing the status of the process, the step name, and the output directory.

Return type:

dict

Raises:

TypeError – If maf is not of type float.
ValueError – If maf is not between 0 and 0.5.

annotate_top_hits(gtf_path: str | None = None, build: str = '38', anno_source: str = 'ensembl') → None[source]

Annotate top genetic hits from GWAS analysis with gene information.

This method loads top hits from COJO analysis results, annotates them with gene information using the specified genome build and annotation source, and saves the annotated results to a TSV file.

Parameters:

gtf_path (Optional[str], default=None) – Path to a GTF file for custom annotation. If None, will use built-in annotation resources.
build (str, default='38') – Genome build version to use for annotation (e.g., ‘38’, ‘37’).
anno_source (str, default='ensembl') – Source of the annotation data (e.g., ‘ensembl’).

Returns:

A dictionary containing:

‘pass’ (bool): Whether the process completed successfully
‘step’ (str): The name of the processing step
‘output’ (dict): Dictionary of output file paths

Return type:

dict

Raises:

FileExistsError – If the COJO file is not found in the results directory.

Notes

The annotated results are saved to ‘top_hits_annotated.tsv’ in the results directory.

execute_gwas_glmm_pipeline(glmm_params: dict) → None[source]

Execute the complete GWAS random effects pipeline.

This method orchestrates the full GWAS analysis workflow using a generalized linear mixed model (GLMM). It sequentially executes all necessary steps: preparing auxiliary files, computing the genetic relationship matrix (GRM), running the GWAS analysis, extracting top hits, and annotating them with gene information.

Parameters:

maf (float, optional (default=0.01)) – Minor allele frequency threshold for filtering SNPs. Must be between 0 and 1.
max_threads (int, optional (default=None)) – Maximum number of threads to use for GRM computation. If None, uses optimal thread count based on available CPU cores.
gtf_path (Optional[str], optional (default=None)) – Path to the GTF file for custom annotation. If None, uses default annotation resources.
build (str, optional (default='38')) – Genome build version to use for annotation (‘38’ for GRCh38, ‘37’ for GRCh37).
anno_source (str, optional (default='ensembl')) – Source of annotations to use (e.g., ‘ensembl’, ‘refseq’).

Returns:

Results are saved to the results directory specified during initialization.

Return type:

None

Raises:

TypeError – If maf is not of type float.
ValueError – If maf is not between 0 and 1.
FileExistsError – If required intermediate files are not found during the pipeline execution.

Notes

This method sequentially calls:

1. prepare_aux_files() - Prepares phenotype and covariate files
2. compute_grm() - Computes the genetic relationship matrix
3. run_gwas_glmm() - Performs the GWAS analysis
4. get_top_hits() - Extracts top significant hits
5. annotate_top_hits() - Annotates hits with gene information

Examples

>>> gwas = GWAS_GLMM(input_path='data/', input_name='genotypes',
...                  output_path='results/', output_name='gwas_results')
>>> gwas.execute_gwas_glmm_pipeline({'maf': 0.05, 'build': '38'})