ideal_genom_qc.AncestryQC
- class ideal_genom_qc.AncestryQC.AncestryQC(input_path: Path, input_name: str, output_path: Path, output_name: str, high_ld_file: Path, reference_files: dict = {}, recompute_merge: bool = True, built: str = '38', rename_snps: bool = False)
Bases: object
- merge_reference_study(ind_pair: list = [50, 5, 0.2]) None
Merge reference and study data by applying quality control filters and merging steps. This method performs a series of quality control steps to merge study data with reference data:
1. Filters problematic SNPs
2. Performs LD pruning
3. Fixes chromosome mismatches
4. Fixes position mismatches
5. Fixes allele flips
6. Removes remaining mismatches
7. Merges the datasets
- Parameters:
ind_pair (list, default [50, 5, 0.2]) – Parameters for LD pruning: [window size, step size, r2 threshold]
- Return type:
None
Notes
If recompute_merge is False, the method will skip the merging process and expect merged data to already exist in the merging directory.
- Raises:
TypeError – If ind_pair is not a list
- run_pca(ref_population: str, pca: int = 10, maf: float = 0.01, num_pca: int = 10, ref_threshold: float = 4, stu_threshold: float = 4) None
Performs Principal Component Analysis (PCA) on genetic data and identifies ancestry outliers.
This method executes a complete PCA workflow including:
1. Running the PCA analysis
2. Identifying ancestry outliers
3. Removing identified outliers
4. Generating PCA plots
- Parameters:
ref_population (str) – Reference population identifier for ancestry comparison
pca (int, optional) – Number of principal components to calculate (default=10)
maf (float, optional) – Minor allele frequency threshold for filtering (default=0.01)
num_pca (int, optional) – Number of principal components to use in outlier detection (default=10)
ref_threshold (float, optional) – Threshold for reference population outlier detection (default=4)
stu_threshold (float, optional) – Threshold for study population outlier detection (default=4)
- Returns:
Results are saved to specified output directories
- Return type:
None
Notes
The method uses the GenomicOutlierAnalyzer class to perform the analysis and saves results in the directories specified during class initialization.
- class ideal_genom_qc.AncestryQC.GenomicOutlierAnalyzer(input_path: Path, input_name: str, merged_file: Path, reference_tags: Path, output_path: Path, output_name: str)
Bases: object
- draw_pca_plot(plot_dir: Path = PosixPath('.'), plot_name: str = 'pca_plot.jpeg') None
Generate 2D and 3D PCA plots from eigenvector data and population tags. This method creates two PCA visualization plots:
- A 2D scatter plot showing PC1 vs PC2 colored by super-population
- A 3D scatter plot showing PC1 vs PC2 vs PC3 colored by super-population
- Parameters:
plot_dir (Path, optional) – Directory path where plots will be saved. Defaults to current directory. If directory doesn’t exist, plots will be saved in self.output_path
plot_name (str, optional) – Base name for the plot files. Defaults to ‘pca_plot.jpeg’. Final filenames will be prefixed with ‘2D-’ and ‘3D-’
- Return type:
None
- Raises:
TypeError – If plot_dir is not a Path object or plot_name is not a string
Notes
Requires the following class attributes to be set:
- self.population_tags : Path to population tags file (tab-separated)
- self.einvectors : Path to eigenvectors file (space-separated)
- self.output_path : Path to output directory (used if plot_dir doesn’t exist)
The population tags file should contain the columns ‘ID1’, ‘ID2’, and ‘SuperPop’. The eigenvectors file should contain the principal components data.
- execute_drop_ancestry_outliers(output_dir: Path = PosixPath('.')) None
Remove samples identified as ancestry outliers from the study data using the PLINK command-line tool. This method reads a file listing the outlier samples and creates new binary PLINK files that exclude them.
- Parameters:
output_dir (Path, optional) – Directory where the cleaned files will be saved. If not provided or doesn’t exist, files will be saved in self.output_path.
- Return type:
None
- Raises:
TypeError – If output_dir is not a Path object.
Notes
The method creates new PLINK binary files (.bed, .bim, .fam) with the suffix ‘-ancestry-cleaned’ excluding the samples listed in self.ancestry_fails file.
- execute_pca(pca: int = 10, maf: float = 0.01) None
Perform Principal Component Analysis (PCA) on the genetic data using PLINK.
This method executes PCA on the merged genetic data file, calculating the specified number of principal components. It automatically determines the optimal number of threads and memory allocation based on system resources.
- Parameters:
pca (int, default=10) – Number of principal components to calculate. Must be a positive integer.
maf (float, default=0.01) – Minor allele frequency threshold for filtering variants. Must be between 0 and 0.5.
- Return type:
None
- Raises:
TypeError – If pca is not an integer or maf is not a float
ValueError – If pca is not positive or maf is not between 0 and 0.5
Notes
The method creates two output files:
- {output_name}-pca.eigenvec: Contains the eigenvectors (per-sample principal component coordinates)
- {output_name}-pca.eigenval: Contains the eigenvalues
The results are stored in self.einvectors and self.eigenvalues attributes.
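The validation rules and the PLINK call described above could be sketched as follows. This is a minimal illustration rather than the library's own code: `build_pca_command`, its arguments, and the inclusive bounds on maf are assumptions, while `--bfile`, `--maf`, `--pca`, `--threads` and `--out` are standard PLINK 1.9 flags.

```python
from __future__ import annotations

from pathlib import Path


def build_pca_command(merged_prefix: Path, out_prefix: Path,
                      pca: int = 10, maf: float = 0.01,
                      threads: int = 1) -> list[str]:
    """Hypothetical helper (not part of ideal_genom_qc): validate the
    arguments and assemble a PLINK PCA command as an argument list."""
    # bool is a subclass of int, so reject it explicitly
    if not isinstance(pca, int) or isinstance(pca, bool):
        raise TypeError("pca must be an integer")
    if not isinstance(maf, float):
        raise TypeError("maf must be a float")
    if pca <= 0:
        raise ValueError("pca must be positive")
    # Assumption: both bounds are treated as inclusive
    if not 0 <= maf <= 0.5:
        raise ValueError("maf must be between 0 and 0.5")
    return [
        "plink", "--bfile", str(merged_prefix),
        "--maf", str(maf),
        "--pca", str(pca),
        "--threads", str(threads),
        "--out", str(out_prefix),
    ]
```

Returning an argument list instead of a single string keeps the command usable with `subprocess.run` without shell quoting.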
- find_ancestry_outliers(ref_threshold: float, stu_threshold: float, reference_pop: str, num_pcs: int = 2, fails_dir: Path = PosixPath('.')) None
Identifies ancestry outliers in the dataset based on PCA analysis. This method analyzes population structure using principal component analysis (PCA) and identifies samples that are potential ancestry outliers based on their distance from reference populations.
- Parameters:
ref_threshold (float) – Distance threshold for reference population samples
stu_threshold (float) – Distance threshold for study population samples
reference_pop (str) – Name of the reference population to compare against
num_pcs (int, optional) – Number of principal components to use in the analysis (default is 2)
fails_dir (Path, optional) – Directory path to save failed samples information (default is the current directory)
- Returns:
Results are stored in the ancestry_fails attribute
- Return type:
None
- Raises:
TypeError – If parameters are not of the expected type
ValueError – If num_pcs is not a positive integer
Notes
The method requires:
- A reference tags file with population information
- An eigenvectors file from PCA analysis
- Both files should be previously set in the class instance
The results are saved in:
- population_tags: CSV file with population assignments
- ancestry_fails: List of samples identified as ancestry outliers
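The documentation does not state how distance from the reference population is measured. One plausible reading, sketched here purely as an assumption, flags a study sample when any of its first num_pcs coordinates lies more than a threshold number of reference standard deviations from the reference mean. The function name, the data layout, and the use of a single threshold (standing in for stu_threshold) are all hypothetical.

```python
from statistics import mean, stdev


def find_outliers(ref_pcs, study_pcs, threshold=4.0, num_pcs=2):
    """Hypothetical sketch, not the library's algorithm.

    ref_pcs:   list of PC coordinate tuples for the reference population
    study_pcs: dict mapping sample ID -> PC coordinate tuple
    Returns IDs of study samples lying more than `threshold` reference
    standard deviations from the reference mean on any of the first
    `num_pcs` components.
    """
    if not isinstance(num_pcs, int) or num_pcs <= 0:
        raise ValueError("num_pcs must be a positive integer")
    # Per-component mean and standard deviation of the reference samples
    centers = [mean([row[k] for row in ref_pcs]) for k in range(num_pcs)]
    spreads = [stdev([row[k] for row in ref_pcs]) for k in range(num_pcs)]
    outliers = []
    for sample_id, coords in study_pcs.items():
        for k in range(num_pcs):
            if abs(coords[k] - centers[k]) > threshold * spreads[k]:
                outliers.append(sample_id)
                break  # one offending component is enough
    return outliers
```

A per-component rule like this is simple to reason about; a Euclidean or Mahalanobis distance over all components would be an equally reasonable alternative.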
- class ideal_genom_qc.AncestryQC.ReferenceGenomicMerger(input_path: Path, input_name: str, output_path: Path, output_name: str, high_ld_regions: Path, reference_files: dict, built: str = '38')
Bases: object
- execute_filter_prob_snps() None
Executes the filtering of problematic SNPs (A->T and C->G, i.e. strand-ambiguous allele pairs) from both study and reference data. This method performs the following operations:
1. Identifies and filters A->T and C->G SNPs from study data
2. Identifies and filters A->T and C->G SNPs from reference data
3. Creates new PLINK binary files excluding the identified problematic SNPs
4. Uses maximum available CPU threads (total cores - 2) and 2/3 of available memory
The method handles both renamed and original SNP scenarios, determined by self.renamed_snps.
Returns:
None
Side Effects:
Creates filtered SNP list files in the output directory
Creates new PLINK binary files (.bed, .bim, .fam) in the output directory
Sets self.reference_AC_GT_filtered and self.study_AC_GT_filtered paths
Logs progress and statistics of filtering operations
Requires:
Valid PLINK binary files for both study and reference data
Proper initialization of input_path, output_path, and reference_files
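Identifying the problematic SNPs amounts to scanning a BIM file for variants whose two alleles form an A/T or C/G pair. The sketch below assumes the standard six-column PLINK BIM layout (chromosome, variant ID, genetic distance, base-pair position, allele 1, allele 2); the function name is hypothetical and the library's own implementation may differ.

```python
# Allele pairs that read the same on both strands
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


def ambiguous_snp_ids(bim_lines):
    """Hypothetical helper: return IDs of variants whose two alleles
    form an A/T or C/G pair, given the lines of a PLINK BIM file."""
    ids = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        snp_id, a1, a2 = fields[1], fields[4].upper(), fields[5].upper()
        if frozenset((a1, a2)) in AMBIGUOUS_PAIRS:
            ids.append(snp_id)
    return ids
```

Written one ID per line, such a list has the form that PLINK's `--exclude` flag accepts.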
- execute_fix_allele_flip() None
Executes the allele flipping process between study data and reference panel.
This method performs the following steps:
1. Identifies SNPs requiring allele flipping between study and reference data
2. Creates a list of SNPs to flip
3. Generates a new reference panel with flipped alleles using PLINK
The method uses multi-threading capabilities based on available CPU cores, reserving 2 cores for system processes when possible.
Returns:
None
Side Effects:
Creates a .toFlip file containing SNPs requiring allele flipping
Generates new PLINK binary files (.bed, .bim, .fam) with flipped alleles
Logs the number of SNPs requiring flipping
Updates self.reference_flipped with the path to new flipped reference files
Dependencies:
PLINK must be installed and accessible in system PATH
Requires valid PLINK binary files for both study and reference data
Requires write permissions in output directory
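A SNP needs flipping when its reference alleles match the study alleles only after taking the strand complement. The following sketch illustrates that test under assumed inputs: dictionaries mapping variant ID to an allele pair, which the library itself would derive from the BIM files. The names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def snps_to_flip(study_alleles, reference_alleles):
    """Hypothetical helper: return IDs of shared SNPs whose reference
    alleles agree with the study alleles only after strand complement."""
    to_flip = []
    for snp_id, ref_pair in reference_alleles.items():
        study_pair = study_alleles.get(snp_id)
        if study_pair is None:
            continue  # SNP absent from the study data
        if set(ref_pair) == set(study_pair):
            continue  # alleles already agree (possibly in swapped order)
        flipped = {COMPLEMENT.get(a, a) for a in ref_pair}
        if flipped == set(study_pair):
            to_flip.append(snp_id)
    return to_flip
```

The resulting IDs, one per line, are the kind of input PLINK's `--flip` flag expects.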
- execute_fix_chromosome_mismatch() None
Fix chromosome mismatch between study data and reference panel.
This method executes PLINK commands to correct any chromosome mismatches between the study data and reference panel datasets. It identifies mismatches using internal methods and updates the chromosome assignments in the reference panel to match the study data.
The method performs the following steps:
1. Identifies chromosome mismatches between study and reference BIM files
2. Creates an update file for chromosome reassignment
3. Executes PLINK command to update chromosome assignments in reference panel
- Return type:
None
Notes
Creates new PLINK binary files with updated chromosome assignments
The updated files are saved with ‘-updateChr’ suffix
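The first two steps can be sketched as a comparison of the two BIM files by variant ID. The function names and the two-column update layout (variant ID, then new chromosome) are assumptions; that layout matches the default column order read by PLINK's `--update-chr`, but the file the library actually writes is not documented here.

```python
def chromosome_updates(study_bim_lines, reference_bim_lines):
    """Hypothetical helper: return (variant ID, study chromosome) pairs
    for shared variants whose chromosome code differs between the files."""
    study_chr = {}
    for line in study_bim_lines:
        fields = line.split()
        if len(fields) >= 2:
            study_chr[fields[1]] = fields[0]
    updates = []
    for line in reference_bim_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        ref_chrom, snp_id = fields[0], fields[1]
        if snp_id in study_chr and study_chr[snp_id] != ref_chrom:
            updates.append((snp_id, study_chr[snp_id]))
    return updates


def format_update_file(updates):
    """Render the pairs as tab-separated text, one variant per line."""
    return "".join(f"{snp_id}\t{chrom}\n" for snp_id, chrom in updates)
```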
- execute_fix_possition_mismatch() None
Fixes position mismatches between study data and reference panel.
This method executes PLINK commands to update the positions of SNPs in the reference panel to match those in the study data. It processes previously identified position mismatches and creates new binary PLINK files with corrected positions.
The method:
1. Determines optimal thread count for processing
2. Identifies position mismatches between study and reference BIM files
3. Updates reference panel positions using PLINK
4. Creates new binary files with corrected positions
Returns:
None
Side Effects:
Creates new PLINK binary files (.bed, .bim, .fam) with updated positions
Logs the number of SNPs being updated
Modifies self.reference_fixed_pos with path to updated files
Dependencies:
Requires PLINK to be installed and accessible
Expects pruned study and reference files to exist
Requires previous chromosome fixing step to be completed
- execute_ld_pruning(ind_pair: list) None
Execute linkage disequilibrium (LD) pruning on study and reference data.
This method performs LD-based pruning using PLINK to remove highly correlated SNPs from both study and reference datasets. The pruning is done using a sliding window approach where SNPs are removed based on their pairwise correlation (r²).
- Parameters:
ind_pair (list) –
A list containing three elements:
ind_pair[0] (int): Window size in SNPs
ind_pair[1] (int): Number of SNPs to shift the window at each step
ind_pair[2] (float): r² threshold for pruning
- Raises:
TypeError – If ind_pair is not a list.
TypeError – If first two elements of ind_pair are not integers.
TypeError – If third element of ind_pair is not a float.
- Return type:
None
Notes
Uses PLINK’s --indep-pairwise command for pruning.
Excludes high LD regions specified in self.high_ld_regions.
Creates pruned datasets for both study and reference data.
Updates self.pruned_reference and self.pruned_study with paths to pruned files.
Uses all available CPU threads except 2 for processing.
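The type checks listed under Raises, together with the mapping of ind_pair onto PLINK's `--indep-pairwise` arguments, might look like the sketch below. The helper name and the explicit length check are assumptions added for illustration.

```python
def indep_pairwise_args(ind_pair):
    """Hypothetical helper: validate [window, step, r2] and return the
    corresponding PLINK --indep-pairwise arguments."""
    if not isinstance(ind_pair, list):
        raise TypeError("ind_pair must be a list")
    # Assumption: a wrong length is reported as a ValueError
    if len(ind_pair) != 3:
        raise ValueError("ind_pair must have exactly three elements")
    window, step, r2 = ind_pair
    if not isinstance(window, int) or not isinstance(step, int):
        raise TypeError("window size and step size must be integers")
    if not isinstance(r2, float):
        raise TypeError("r2 threshold must be a float")
    return ["--indep-pairwise", str(window), str(step), str(r2)]
```

With the default `[50, 5, 0.2]`, this yields a 50-SNP window shifted 5 SNPs at a time with an r² threshold of 0.2.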
- execute_merge_data() None
Merge study and reference data using PLINK.
This method merges the pruned study data with the cleaned reference data using PLINK’s --bmerge functionality. It automatically determines the optimal number of threads to use based on available CPU cores.
The method:
1. Calculates optimal thread count (CPU count - 2 or half of available cores)
2. Constructs PLINK command for merging datasets
3. Executes the merge operation via shell command
Returns:
None
Side effects:
Creates merged PLINK binary files (.bed, .bim, .fam) in the output directory
Logs the merge operation
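The thread-count rule is described only as "CPU count - 2 or half of available cores". One way to read it, shown here as an assumption, is to reserve two cores when more than two exist and otherwise fall back to half of them, never going below one thread. The function name is hypothetical.

```python
import os


def choose_thread_count(cpu_count=None):
    """Hypothetical helper: reserve two cores when more than two are
    available; otherwise use half of them, with a minimum of one."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    if cpu_count > 2:
        return cpu_count - 2
    return max(1, cpu_count // 2)
```

The result can be passed to PLINK through its `--threads` flag.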
- execute_remove_mismatches() None
Removes mismatched SNPs from the reference data based on allele comparisons between study and reference datasets.
This method performs the following steps:
1. Determines optimal thread count for processing
2. Identifies allele mismatches between study and reference BIM files
3. Creates a list of SNPs to remove
4. Generates a cleaned reference dataset excluding mismatched SNPs
The method utilizes PLINK to perform the actual SNP removal while maintaining allele order.
Returns:
None
Side Effects:
Creates a file listing SNPs to be removed at {output_path}/{reference_bim_stem}.toRemove
Generates cleaned reference files at {output_path}/{reference_bim_stem}-cleaned.bed/bim/fam
Logs the number of SNPs being removed