ideal_genom_qc.UMAPplot
Module to draw plots based on UMAP dimensionality reduction
- class ideal_genom_qc.UMAPplot.UMAPplot(input_path: Path, input_name: str, output_path: Path, output_name: str, high_ld_file: Path, built: str = '38', recompute_pca: bool = True)
Bases: object
- compute_pcas(pca: int = 10) → None
Computes Principal Component Analysis (PCA) using PLINK.
This method performs PCA on the LD-pruned dataset using PLINK’s --pca command. The analysis generates eigenvectors and eigenvalues that can be used for population structure analysis and visualization.
- Parameters:
pca (int, default=10) – Number of principal components to compute. Should be a positive integer. Values below 3 will trigger a warning as they may be insufficient for meaningful analysis.
- Returns:
Results are written to disk in the results directory with the input_name prefix.
- Return type:
None
- Raises:
TypeError – If pca parameter is not an integer.
ValueError – If pca parameter is not positive.
Notes
If recompute_pca is False, the method will skip PCA computation
Uses PLINK’s --pca command on the LD-pruned dataset
Output files are saved in the results directory specified during initialization
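The argument checks and the PLINK call described above can be sketched as follows. This is a minimal illustration, not the library's implementation: validate_pca and build_pca_command are hypothetical helper names, the file prefixes are placeholders, and the command assumes a PLINK 1.9-style executable named plink. Nothing is executed; the sketch only assembles the argument list.

```python
import warnings


def validate_pca(pca):
    """Check `pca` as the documentation describes (hypothetical helper)."""
    # bool is a subclass of int, so it is rejected explicitly
    # (an assumption of this sketch, not documented behaviour).
    if isinstance(pca, bool) or not isinstance(pca, int):
        raise TypeError("pca must be an integer")
    if pca <= 0:
        raise ValueError("pca must be a positive integer")
    if pca < 3:
        warnings.warn("fewer than 3 components may be insufficient", UserWarning)
    return pca


def build_pca_command(pruned_prefix, out_prefix, pca=10):
    """Assemble a PLINK 1.9-style argument list; the command is not run."""
    return [
        "plink",
        "--bfile", str(pruned_prefix),    # LD-pruned binary fileset (placeholder prefix)
        "--pca", str(validate_pca(pca)),  # number of principal components
        "--out", str(out_prefix),         # prefix for the eigenvector/eigenvalue files
    ]


cmd = build_pca_command("results/study-LDpruned", "results/study", pca=10)
print(" ".join(cmd))
```

The resulting list could be passed to subprocess.run if PLINK is installed; whether the library invokes PLINK this way is not stated in this reference.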
- generate_plots(color_hue_file: Path = None, case_control_markers: bool = True, n_neighbors: list = [5], min_dist: list = [0.5], metric: list = ['euclidean'], random_state: int = None, umap_kwargs: dict = {}) → None
Generate UMAP plots for different parameter combinations.
This method generates one UMAP (Uniform Manifold Approximation and Projection) plot for each combination of the supplied parameter values. It can color points using metadata from color_hue_file and can incorporate case-control markers.
- Parameters:
color_hue_file (Path, optional) – Path to a tab-separated file containing color hue information. The file should have at least 3 columns, where the first two are ID1 and ID2, and the third column contains the values for color coding. Default is None.
case_control_markers (bool, optional) – Whether to include case-control markers in the plots. If True, case-control status is read from the .fam file. If color_hue_file is not provided, case-control status is used as the hue. Default is True.
n_neighbors (list of int, optional) – List of values for the n_neighbors parameter in UMAP. Each value must be positive. Default is [5].
min_dist (list of float, optional) – List of values for the min_dist parameter in UMAP. Each value must be non-negative. Default is [0.5].
metric (list of str, optional) – List of distance metrics to use in UMAP. Default is ['euclidean'].
random_state (int, optional) – Random seed for reproducibility. Must be non-negative. Default is None.
umap_kwargs (dict, optional) – Additional keyword arguments to pass to the UMAP constructor. Default is an empty dictionary.
- Returns:
Saves UMAP plots as JPEG files and parameters as a CSV file in the results directory.
- Return type:
None
- Raises:
TypeError – If input parameters are not of the correct type.
ValueError – If input parameters have invalid values.
FileNotFoundError – If color_hue_file is specified but not found.
Notes
The method creates a grid of all possible parameter combinations and generates a UMAP plot for each. Parameters for each plot are saved in ‘plots_parameters.csv’.
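The parameter grid described in the note above can be illustrated with itertools.product. This is a sketch under assumptions, not the library's code: the example values are arbitrary, and the CSV column layout shown here is hypothetical, since the actual layout of plots_parameters.csv is not specified in this reference.

```python
import csv
import io
from itertools import product

# Example values; each list plays the role of the corresponding argument.
n_neighbors = [5, 15]
min_dist = [0.1, 0.5]
metric = ["euclidean"]

# Cartesian product of the three lists: one dict per plot.
grid = [
    {"n_neighbors": n, "min_dist": d, "metric": m}
    for n, d, m in product(n_neighbors, min_dist, metric)
]

# Record the combinations as CSV (written to memory here, not to disk).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["n_neighbors", "min_dist", "metric"])
writer.writeheader()
writer.writerows(grid)

print(len(grid), "combinations")
print(buffer.getvalue())
```

With two values for n_neighbors, two for min_dist, and one metric, the grid holds four combinations, so the number of plots grows multiplicatively with the length of each list.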
- ld_pruning(maf: float = 0.001, geno: float = 0.1, mind: float = 0.2, hwe: float = 5e-08, ind_pair: list = [50, 5, 0.2]) → None
Perform Linkage Disequilibrium (LD) pruning on genetic data using PLINK.
This method filters SNPs based on specified thresholds for various quality control metrics and performs LD-based pruning to remove highly correlated variants.
- Parameters:
maf (float, default=0.001) – Minor allele frequency threshold. Variants with MAF below this value are removed. Must be between 0 and 0.5.
geno (float, default=0.1) – Maximum per-SNP missing rate. Variants with missing rate above this are removed. Must be between 0 and 1.
mind (float, default=0.2) – Maximum per-individual missing rate. Samples with missing rate above this are removed. Must be between 0 and 1. Recommended range is 0.02 to 0.1.
hwe (float, default=5e-8) – Hardy-Weinberg equilibrium exact test p-value threshold. Variants with p-value below this are removed. Must be between 0 and 1.
ind_pair (list, default=[50, 5, 0.2]) – Parameters for pairwise LD pruning: [window size, step size, r² threshold].
- Returns:
Creates pruned PLINK binary files in the results directory.
- Return type:
None
Notes
Skips processing if recompute_pca is False
Uses multithreading with optimal thread count based on system CPU
Generates intermediate files: .prune.in and .prune.out
Creates final LD-pruned dataset with ‘-LDpruned’ suffix
- Raises:
TypeError – If input parameters are not of type float
ValueError – If input parameters are outside their valid ranges
UserWarning – If mind parameter is outside recommended range
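The threshold checks and the two PLINK steps described for ld_pruning can be sketched as follows. This is an illustration under assumptions rather than the library's implementation: check_thresholds and build_ld_pruning_commands are hypothetical names, the range bounds are treated as inclusive, the file prefixes are placeholders, and the flags follow PLINK 1.9 conventions. Whether the library applies the quality-control filters in the same PLINK call as the pruning step is not stated in this reference. The commands are assembled but not run.

```python
import warnings


def check_thresholds(maf, geno, mind, hwe):
    """Validate QC thresholds as documented (bounds assumed inclusive)."""
    for name, value in (("maf", maf), ("geno", geno), ("mind", mind), ("hwe", hwe)):
        if not isinstance(value, float):
            raise TypeError(f"{name} must be of type float")
    if not 0 <= maf <= 0.5:
        raise ValueError("maf must be between 0 and 0.5")
    for name, value in (("geno", geno), ("mind", mind), ("hwe", hwe)):
        if not 0 <= value <= 1:
            raise ValueError(f"{name} must be between 0 and 1")
    if not 0.02 <= mind <= 0.1:
        warnings.warn("mind is outside the recommended range 0.02 to 0.1", UserWarning)


def build_ld_pruning_commands(in_prefix, out_prefix, maf=0.001, geno=0.1,
                              mind=0.2, hwe=5e-08, ind_pair=(50, 5, 0.2)):
    """Return two PLINK 1.9-style argument lists: prune, then extract."""
    check_thresholds(maf, geno, mind, hwe)
    window, step, r2 = ind_pair
    # Step 1: apply QC filters and write <out_prefix>.prune.in / .prune.out
    prune = [
        "plink", "--bfile", in_prefix,
        "--maf", str(maf), "--geno", str(geno),
        "--mind", str(mind), "--hwe", str(hwe),
        "--indep-pairwise", str(window), str(step), str(r2),
        "--out", out_prefix,
    ]
    # Step 2: keep only the retained variants and write the pruned fileset
    extract = [
        "plink", "--bfile", in_prefix,
        "--extract", out_prefix + ".prune.in",
        "--make-bed", "--out", out_prefix + "-LDpruned",
    ]
    return prune, extract


prune, extract = build_ld_pruning_commands("data/study", "results/study", mind=0.05)
print(" ".join(prune))
print(" ".join(extract))
```

Note that under this sketch the documented default mind=0.2 lies outside the recommended range, so calling the helper with defaults would emit the UserWarning.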