ideal_genom_qc.AncestryQC

class ideal_genom_qc.AncestryQC.AncestryQC(input_path: Path, input_name: str, output_path: Path, output_name: str, high_ld_file: Path, reference_files: dict = {}, recompute_merge: bool = True, built: str = '38', rename_snps: bool = False)

Bases: object

merge_reference_study(ind_pair: list = [50, 5, 0.2]) → None

Merges study data with reference data after a series of quality control steps:

1. Filters problematic SNPs
2. Performs LD pruning
3. Fixes chromosome mismatches
4. Fixes position mismatches
5. Fixes allele flips
6. Removes remaining mismatches
7. Merges the datasets

Parameters:

ind_pair (list, default [50, 5, 0.2]) – Parameters for LD pruning: [window size, step size, r² threshold]

Return type:

None

Notes

If recompute_merge is False, the method will skip the merging process and expect merged data to already exist in the merging directory.

Raises:

TypeError – If ind_pair is not a list

run_pca(ref_population: str, pca: int = 10, maf: float = 0.01, num_pca: int = 10, ref_threshold: float = 4, stu_threshold: float = 4) → None

Performs Principal Component Analysis (PCA) on genetic data and identifies ancestry outliers.

This method executes a complete PCA workflow including:

1. Running the PCA analysis
2. Identifying ancestry outliers
3. Removing identified outliers
4. Generating PCA plots

Parameters:

ref_population (str) – Reference population identifier for ancestry comparison
pca (int, optional) – Number of principal components to calculate (default=10)
maf (float, optional) – Minor allele frequency threshold for filtering (default=0.01)
num_pca (int, optional) – Number of principal components to use in outlier detection (default=10)
ref_threshold (float, optional) – Threshold for reference population outlier detection (default=4)
stu_threshold (float, optional) – Threshold for study population outlier detection (default=4)

Returns:

Results are saved to specified output directories

Return type:

None

Notes

The method uses the GenomicOutlierAnalyzer class to perform the analysis and saves results in the directories specified during class initialization.

class ideal_genom_qc.AncestryQC.GenomicOutlierAnalyzer(input_path: Path, input_name: str, merged_file: Path, reference_tags: Path, output_path: Path, output_name: str)

Bases: object

draw_pca_plot(plot_dir: Path = PosixPath('.'), plot_name: str = 'pca_plot.jpeg') → None

Generate 2D and 3D PCA plots from eigenvector data and population tags. This method creates two PCA visualization plots:

- A 2D scatter plot showing PC1 vs PC2 colored by super-population
- A 3D scatter plot showing PC1 vs PC2 vs PC3 colored by super-population

Parameters:

plot_dir (Path, optional) – Directory path where plots will be saved. Defaults to the current directory. If the directory doesn’t exist, plots are saved in self.output_path.
plot_name (str, optional) – Base name for the plot files. Defaults to ‘pca_plot.jpeg’. Final filenames will be prefixed with ‘2D-’ and ‘3D-’

Return type:

None

Raises:

TypeError – If plot_dir is not a Path object or plot_name is not a string

Notes

Requires the following class attributes to be set:

- self.population_tags : Path to population tags file (tab-separated)
- self.einvectors : Path to eigenvectors file (space-separated)
- self.output_path : Path to output directory (used if plot_dir doesn’t exist)

The population tags file should contain the columns ‘ID1’, ‘ID2’, and ‘SuperPop’. The eigenvectors file should contain the principal components data.
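The naming and fallback rules described above can be illustrated with a minimal sketch. The helper resolve_plot_paths is hypothetical and is not part of ideal_genom_qc; it only mirrors the documented behaviour (type checks, fallback to the output directory, ‘2D-’ and ‘3D-’ prefixes).

```python
from pathlib import Path
from typing import Tuple


def resolve_plot_paths(plot_dir: Path, plot_name: str, output_path: Path) -> Tuple[Path, Path]:
    """Hypothetical helper: return the (2D, 3D) plot file paths."""
    if not isinstance(plot_dir, Path):
        raise TypeError("plot_dir must be a Path object")
    if not isinstance(plot_name, str):
        raise TypeError("plot_name must be a string")
    # Fall back to the output directory when plot_dir does not exist.
    target = plot_dir if plot_dir.is_dir() else output_path
    return target / f"2D-{plot_name}", target / f"3D-{plot_name}"
```

With the defaults, the two files would be named 2D-pca_plot.jpeg and 3D-pca_plot.jpeg.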

execute_drop_ancestry_outliers(output_dir: Path = PosixPath('.')) → None

Removes samples identified as ancestry outliers from the study data using the PLINK command-line tool. This method reads a file listing those samples and creates new binary PLINK files that exclude them.

Parameters:

output_dir (Path, optional) – Directory where the cleaned files will be saved. If not provided or doesn’t exist, files will be saved in self.output_path.

Return type:

None

Raises:

TypeError – If output_dir is not a Path object.

Notes

The method creates new PLINK binary files (.bed, .bim, .fam) with the suffix ‘-ancestry-cleaned’ that exclude the samples listed in the self.ancestry_fails file.
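As a rough illustration, sample removal of this kind can be expressed with PLINK's --remove and --make-bed flags. The function below is a hypothetical sketch: the exact command line assembled by the library is not documented here, and the argument names are assumptions.

```python
from pathlib import Path
from typing import List


def build_remove_command(study_prefix: Path, fails_file: Path,
                         output_dir: Path, output_name: str) -> List[str]:
    """Hypothetical helper: PLINK arguments that drop the listed samples."""
    out_prefix = output_dir / f"{output_name}-ancestry-cleaned"
    return [
        "plink",
        "--bfile", str(study_prefix),   # input .bed/.bim/.fam prefix
        "--remove", str(fails_file),    # file listing FID/IID pairs to drop
        "--make-bed",                   # write a new binary fileset
        "--out", str(out_prefix),
    ]
```

The returned list could be passed to subprocess.run once PLINK is available on the PATH.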

execute_pca(pca: int = 10, maf: float = 0.01) → None

Perform Principal Component Analysis (PCA) on the genetic data using PLINK.

This method executes PCA on the merged genetic data file, calculating the specified number of principal components. It automatically determines the optimal number of threads and memory allocation based on system resources.

Parameters:

pca (int, default=10) – Number of principal components to calculate. Must be a positive integer.
maf (float, default=0.01) – Minor allele frequency threshold for filtering variants. Must be between 0 and 0.5.

Return type:

None

Raises:

TypeError – If pca is not an integer or maf is not a float
ValueError – If pca is not positive or maf is not between 0 and 0.5

Notes

The method creates two output files:

- {output_name}-pca.eigenvec: Contains the eigenvectors (per-sample principal component coordinates)
- {output_name}-pca.eigenval: Contains the eigenvalues

The results are stored in self.einvectors and self.eigenvalues attributes.
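The argument checks and the PLINK call can be sketched as follows. This is an assumption-laden illustration, not the library's code: build_pca_command is a hypothetical name, the maf bounds are treated as inclusive, and thread and memory flags are omitted.

```python
from pathlib import Path
from typing import List


def build_pca_command(merged_prefix: Path, out_prefix: Path,
                      pca: int = 10, maf: float = 0.01) -> List[str]:
    """Hypothetical helper: validate arguments and build a PLINK PCA command."""
    if not isinstance(pca, int) or isinstance(pca, bool):
        raise TypeError("pca must be an integer")
    if not isinstance(maf, float):
        raise TypeError("maf must be a float")
    if pca <= 0:
        raise ValueError("pca must be positive")
    if not 0 <= maf <= 0.5:  # assumption: bounds are inclusive
        raise ValueError("maf must be between 0 and 0.5")
    return [
        "plink",
        "--bfile", str(merged_prefix),
        "--maf", str(maf),   # drop variants below the MAF threshold
        "--pca", str(pca),   # number of principal components to compute
        "--out", str(out_prefix),
    ]
```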

find_ancestry_outliers(ref_threshold: float, stu_threshold: float, reference_pop: str, num_pcs: int = 2, fails_dir: Path = PosixPath('.')) → None

Identifies ancestry outliers in the dataset based on PCA analysis. This method analyzes population structure using principal component analysis (PCA) and identifies samples that are potential ancestry outliers based on their distance from reference populations.

Parameters:

ref_threshold (float) – Distance threshold for reference population samples
stu_threshold (float) – Distance threshold for study population samples
reference_pop (str) – Name of the reference population to compare against
num_pcs (int, optional) – Number of principal components to use in the analysis (default is 2)
fails_dir (Path, optional) – Directory path to save failed samples information (default is the current directory)

Returns:

Results are stored in the ancestry_fails attribute

Return type:

None

Raises:

TypeError – If parameters are not of the expected type
ValueError – If num_pcs is not a positive integer

Notes

The method requires:

- A reference tags file with population information
- An eigenvectors file from PCA analysis

Both files should be previously set in the class instance.

The results are saved in:

- population_tags: CSV file with population assignments
- ancestry_fails: List of samples identified as ancestry outliers
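The exact distance rule is not spelled out here, so the following is only one plausible reading: a study sample is flagged when, on any of the first num_pcs components, it lies more than a threshold number of standard deviations from the mean of the chosen reference population. The function name, the single threshold argument, and the data layout are all assumptions; how the library combines ref_threshold and stu_threshold may differ.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def find_outliers(ref_pcs: Sequence[Sequence[float]],
                  study_pcs: Dict[str, Sequence[float]],
                  threshold: float, num_pcs: int = 2) -> List[str]:
    """Hypothetical sketch: flag study samples far from the reference cluster."""
    if not isinstance(num_pcs, int) or num_pcs <= 0:
        raise ValueError("num_pcs must be a positive integer")
    # Per-component mean and standard deviation of the reference population.
    columns = list(zip(*[row[:num_pcs] for row in ref_pcs]))
    centers = [mean(col) for col in columns]
    spreads = [pstdev(col) for col in columns]
    outliers = []
    for sample_id, pcs in study_pcs.items():
        for value, center, spread in zip(pcs[:num_pcs], centers, spreads):
            if abs(value - center) > threshold * spread:
                outliers.append(sample_id)
                break  # one failing component is enough
    return outliers
```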

class ideal_genom_qc.AncestryQC.ReferenceGenomicMerger(input_path: Path, input_name: str, output_path: Path, output_name: str, high_ld_regions: Path, reference_files: dict, built: str = '38')

Bases: object

execute_filter_prob_snps() → None

Executes the filtering of problematic SNPs (A->T and C->G) from both study and reference data. This method performs the following operations:

1. Identifies and filters A->T and C->G SNPs from study data
2. Identifies and filters A->T and C->G SNPs from reference data
3. Creates new PLINK binary files excluding the identified problematic SNPs
4. Uses maximum available CPU threads (total cores - 2) and 2/3 of available memory

The method handles both renamed and original SNP scenarios, determined by self.renamed_snps.

Returns:

None

Side Effects:

Creates filtered SNP list files in the output directory

Creates new PLINK binary files (.bed, .bim, .fam) in the output directory

Sets self.reference_AC_GT_filtered and self.study_AC_GT_filtered paths

Logs progress and statistics of filtering operations

Requires:

Valid PLINK binary files for both study and reference data

Proper initialization of input_path, output_path, and reference_files
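A minimal sketch of the identification step, assuming standard six-column .bim lines (chromosome, variant ID, genetic distance, position, allele 1, allele 2). The helper name is hypothetical and the library's own implementation may differ.

```python
from typing import Iterable, List

# Allele pairs whose strand cannot be resolved from the alleles alone.
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


def find_ambiguous_snps(bim_lines: Iterable[str]) -> List[str]:
    """Hypothetical helper: return IDs of A/T and C/G SNPs in .bim content."""
    ids = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        snp_id, a1, a2 = fields[1], fields[4].upper(), fields[5].upper()
        if frozenset((a1, a2)) in AMBIGUOUS_PAIRS:
            ids.append(snp_id)
    return ids
```

The resulting IDs could be written one per line and passed to PLINK's --exclude option.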

execute_fix_allele_flip() → None

Executes the allele flipping process between study data and reference panel.

This method performs the following steps:

1. Identifies SNPs requiring allele flipping between study and reference data
2. Creates a list of SNPs to flip
3. Generates a new reference panel with flipped alleles using PLINK

The method uses multi-threading capabilities based on available CPU cores, reserving 2 cores for system processes when possible.

Returns:

None

Side Effects:

Creates a .toFlip file containing SNPs requiring allele flipping

Generates new PLINK binary files (.bed, .bim, .fam) with flipped alleles

Logs the number of SNPs requiring flipping

Updates self.reference_flipped with the path to new flipped reference files

Dependencies:

PLINK must be installed and accessible in system PATH

Requires valid PLINK binary files for both study and reference data

Requires write permissions in output directory
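One common way to detect strand flips is to check whether the reference alleles match the study alleles only after complementing them. The sketch below assumes that criterion and a simple mapping from SNP ID to allele pair; both the function name and the rule are illustrative rather than taken from the library.

```python
from typing import Dict, List, Tuple

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def find_snps_to_flip(study: Dict[str, Tuple[str, str]],
                      reference: Dict[str, Tuple[str, str]]) -> List[str]:
    """Hypothetical helper: SNPs whose reference alleles match only after a flip."""
    to_flip = []
    for snp_id, ref_alleles in reference.items():
        study_alleles = study.get(snp_id)
        if study_alleles is None:
            continue  # SNP not shared between datasets
        if set(ref_alleles) == set(study_alleles):
            continue  # already consistent
        if not all(a in COMPLEMENT for a in ref_alleles):
            continue  # indels or missing alleles cannot be complemented
        flipped = {COMPLEMENT[a] for a in ref_alleles}
        if flipped == set(study_alleles):
            to_flip.append(snp_id)
    return to_flip
```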

execute_fix_chromosome_mismatch() → None

Fix chromosome mismatch between study data and reference panel.

This method executes PLINK commands to correct any chromosome mismatches between the study data and reference panel datasets. It identifies mismatches using internal methods and updates the chromosome assignments in the reference panel to match the study data.

The method performs the following steps:

1. Identifies chromosome mismatches between study and reference BIM files
2. Creates an update file for chromosome reassignment
3. Executes PLINK command to update chromosome assignments in reference panel

Return type:

None

Notes

Creates new PLINK binary files with updated chromosome assignments
The updated files are saved with ‘-updateChr’ suffix
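The mismatch detection can be sketched by comparing the two .bim files on variant ID. The helpers below are hypothetical; they assume six-column .bim lines and produce update lines with the variant ID followed by the study chromosome, a two-column layout that PLINK's --update-chr option can read.

```python
from typing import Dict, Iterable, List


def chromosomes_by_id(bim_lines: Iterable[str]) -> Dict[str, str]:
    """Map variant ID to chromosome code from .bim content."""
    result = {}
    for line in bim_lines:
        fields = line.split()
        if len(fields) >= 2:
            result[fields[1]] = fields[0]
    return result


def chromosome_updates(study_bim: Iterable[str], reference_bim: Iterable[str]) -> List[str]:
    """Hypothetical helper: 'ID<TAB>chromosome' lines for mismatched reference SNPs."""
    study = chromosomes_by_id(study_bim)
    reference = chromosomes_by_id(reference_bim)
    return [
        f"{snp_id}\t{study[snp_id]}"
        for snp_id, ref_chr in reference.items()
        if snp_id in study and study[snp_id] != ref_chr
    ]
```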

execute_fix_possition_mismatch() → None

Fixes position mismatches between study data and reference panel.

This method executes PLINK commands to update the positions of SNPs in the reference panel to match those in the study data. It processes previously identified position mismatches and creates new binary PLINK files with corrected positions.

The method:

1. Determines optimal thread count for processing
2. Identifies position mismatches between study and reference BIM files
3. Updates reference panel positions using PLINK
4. Creates new binary files with corrected positions

Returns:

None

Side Effects:

Creates new PLINK binary files (.bed, .bim, .fam) with updated positions

Logs the number of SNPs being updated

Modifies self.reference_fixed_pos with path to updated files

Dependencies:

Requires PLINK to be installed and accessible

Expects pruned study and reference files to exist

Requires previous chromosome fixing step to be completed
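For illustration, the position comparison reduces to a dictionary lookup once base-pair positions are keyed by variant ID. The names below are hypothetical, and the two-column output (variant ID, new position) is the layout PLINK's --update-map option reads by default; whether the library writes exactly this file is an assumption.

```python
from typing import Dict


def position_updates(study_pos: Dict[str, int], reference_pos: Dict[str, int]) -> Dict[str, int]:
    """Hypothetical helper: reference SNPs whose position differs from the study."""
    return {
        snp_id: study_pos[snp_id]
        for snp_id, ref_bp in reference_pos.items()
        if snp_id in study_pos and study_pos[snp_id] != ref_bp
    }


def format_update_map(updates: Dict[str, int]) -> str:
    """Render the updates as 'ID<TAB>position' lines."""
    return "".join(f"{snp_id}\t{bp}\n" for snp_id, bp in updates.items())
```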

execute_ld_pruning(ind_pair: list) → None

Execute linkage disequilibrium (LD) pruning on study and reference data.

This method performs LD-based pruning using PLINK to remove highly correlated SNPs from both study and reference datasets. The pruning is done using a sliding window approach where SNPs are removed based on their pairwise correlation (r²).

Parameters:

ind_pair (list) –

A list containing three elements:

ind_pair[0] (int): Window size in SNPs
ind_pair[1] (int): Number of SNPs to shift the window at each step
ind_pair[2] (float): r² threshold for pruning

Raises:

TypeError – If ind_pair is not a list.
TypeError – If first two elements of ind_pair are not integers.
TypeError – If third element of ind_pair is not a float.

Return type:

None

Notes

Uses PLINK’s --indep-pairwise command for pruning.
Excludes high LD regions specified in self.high_ld_regions.
Creates pruned datasets for both study and reference data.
Updates self.pruned_reference and self.pruned_study with paths to pruned files.
Uses all available CPU threads except 2 for processing.
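The type checks and the pruning call might look like the sketch below. The function names are hypothetical, the length check is an added assumption, and the high-LD file is assumed to be in PLINK's set-range format (hence --exclude range); the library's actual command may be assembled differently.

```python
import os
from pathlib import Path
from typing import List


def validate_ind_pair(ind_pair: list) -> None:
    """Hypothetical helper mirroring the documented TypeError conditions."""
    if not isinstance(ind_pair, list):
        raise TypeError("ind_pair must be a list")
    if len(ind_pair) != 3:  # assumption: not stated in the documentation
        raise ValueError("ind_pair must have three elements")
    if not isinstance(ind_pair[0], int) or not isinstance(ind_pair[1], int):
        raise TypeError("window size and step size must be integers")
    if not isinstance(ind_pair[2], float):
        raise TypeError("r2 threshold must be a float")


def build_prune_command(bfile: Path, high_ld_regions: Path,
                        out_prefix: Path, ind_pair: list) -> List[str]:
    """Hypothetical helper: PLINK arguments for LD pruning."""
    validate_ind_pair(ind_pair)
    threads = max((os.cpu_count() or 1) - 2, 1)  # leave two cores free
    return [
        "plink",
        "--bfile", str(bfile),
        "--exclude", "range", str(high_ld_regions),
        "--indep-pairwise", str(ind_pair[0]), str(ind_pair[1]), str(ind_pair[2]),
        "--threads", str(threads),
        "--out", str(out_prefix),
    ]
```

PLINK writes the retained variants to a .prune.in file, which a follow-up --extract call can use to create the pruned fileset.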

execute_merge_data() → None

Merge study and reference data using PLINK.

This method merges the pruned study data with the cleaned reference data using PLINK’s --bmerge functionality. It automatically determines the optimal number of threads to use based on available CPU cores.

The method:

1. Calculates optimal thread count (CPU count - 2 or half of available cores)
2. Constructs PLINK command for merging datasets
3. Executes the merge operation via shell command

Returns:

None

Side Effects:

Creates merged PLINK binary files (.bed, .bim, .fam) in the output directory

Logs the merge operation
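The thread rule and the merge call can be sketched as follows. The cut-off between "CPU count - 2" and "half of available cores" is not stated, so the rule in pick_thread_count is an assumption, and both function names are hypothetical.

```python
from pathlib import Path
from typing import List, Optional


def pick_thread_count(cpu_count: Optional[int]) -> int:
    """Hypothetical rule: reserve two cores when possible, else use half."""
    if cpu_count is None or cpu_count < 1:
        return 1
    if cpu_count > 2:
        return cpu_count - 2
    return max(cpu_count // 2, 1)


def build_merge_command(study_prefix: Path, reference_prefix: Path,
                        out_prefix: Path, threads: int) -> List[str]:
    """Hypothetical helper: PLINK arguments for merging two binary filesets."""
    return [
        "plink",
        "--bfile", str(study_prefix),
        "--bmerge", str(reference_prefix),  # second fileset, given by prefix
        "--make-bed",
        "--threads", str(threads),
        "--out", str(out_prefix),
    ]
```

In practice the count would come from os.cpu_count(), for example pick_thread_count(os.cpu_count()).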

execute_remove_mismatches() → None

Removes mismatched SNPs from the reference data based on allele comparisons between study and reference datasets.

This method performs the following steps:

1. Determines optimal thread count for processing
2. Identifies allele mismatches between study and reference BIM files
3. Creates a list of SNPs to remove
4. Generates a cleaned reference dataset excluding mismatched SNPs

The method utilizes PLINK to perform the actual SNP removal while maintaining allele order.

Returns:

None

Side Effects:

Creates a file listing SNPs to be removed at {output_path}/{reference_bim_stem}.toRemove

Generates cleaned reference files at {output_path}/{reference_bim_stem}-cleaned.bed/bim/fam

Logs the number of SNPs being removed
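A minimal sketch of this step, assuming a mismatch means that the allele sets of a shared SNP still differ after flipping has been applied. The function names and data layout are hypothetical; --exclude and --keep-allele-order are existing PLINK options, but the library's exact command is not shown here.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def find_allele_mismatches(study: Dict[str, Tuple[str, str]],
                           reference: Dict[str, Tuple[str, str]]) -> List[str]:
    """Hypothetical helper: shared SNPs whose allele sets differ."""
    return [
        snp_id
        for snp_id, ref_alleles in reference.items()
        if snp_id in study and set(study[snp_id]) != set(ref_alleles)
    ]


def build_exclude_command(reference_prefix: Path, to_remove: Path,
                          output_path: Path) -> List[str]:
    """Hypothetical helper: PLINK arguments that drop the listed SNPs."""
    out_prefix = output_path / f"{reference_prefix.name}-cleaned"
    return [
        "plink",
        "--bfile", str(reference_prefix),
        "--exclude", str(to_remove),   # one variant ID per line
        "--keep-allele-order",         # do not swap A1/A2 on output
        "--make-bed",
        "--out", str(out_prefix),
    ]
```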