ideal_genom_qc.UMAPplot

Module to draw plots based on UMAP dimension reduction

class ideal_genom_qc.UMAPplot.UMAPplot(input_path: Path, input_name: str, output_path: Path, output_name: str, high_ld_file: Path, built: str = '38', recompute_pca: bool = True)

Bases: object

compute_pcas(pca: int = 10) None

Computes Principal Component Analysis (PCA) using PLINK.

This method performs PCA on the LD-pruned dataset using PLINK's --pca option. The analysis generates eigenvectors and eigenvalues that can be used for population structure analysis and visualization.

Parameters:

pca (int, default=10) – Number of principal components to compute. Should be a positive integer. Values below 3 will trigger a warning as they may be insufficient for meaningful analysis.

Returns:

Results are written to disk in the results directory with the input_name prefix.

Return type:

None

Raises:
  • TypeError – If pca parameter is not an integer.

  • ValueError – If pca parameter is not positive.

Notes

  • If recompute_pca is False, the method will skip PCA computation

  • Uses PLINK's --pca option on the LD-pruned dataset

  • Output files are saved in the results directory specified during initialization
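
As a rough illustration of the behaviour described above, the sketch below assembles a PLINK --pca argument list and applies the checks listed under Raises. The helper name `build_pca_command` and the file prefixes are hypothetical and not part of the package; the class may build its command differently.

```python
import warnings
from pathlib import Path


def build_pca_command(pruned_prefix: Path, out_prefix: Path, pca: int = 10) -> list:
    """Return a PLINK argument list for PCA (hypothetical helper)."""
    if not isinstance(pca, int):
        raise TypeError("pca should be an integer")
    if pca <= 0:
        raise ValueError("pca should be a positive integer")
    if pca < 3:
        # Documented behaviour: small values only trigger a warning
        warnings.warn("pca below 3 may be insufficient for meaningful analysis", UserWarning)
    return [
        "plink",
        "--bfile", str(pruned_prefix),  # LD-pruned binary fileset (assumed prefix)
        "--pca", str(pca),              # number of principal components
        "--out", str(out_prefix),       # prefix for .eigenvec / .eigenval output
    ]


cmd = build_pca_command(Path("results/study-LDpruned"), Path("results/study"), pca=10)
print(" ".join(cmd))
```

The list form can be handed to `subprocess.run` without shell quoting, which is one reason to build the command this way.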

generate_plots(color_hue_file: Path = None, case_control_markers: bool = True, n_neighbors: list = [5], min_dist: list = [0.5], metric: list = ['euclidean'], random_state: int = None, umap_kwargs: dict = {}) None

Generate UMAP plots with different parameter combinations.

This method generates UMAP (Uniform Manifold Approximation and Projection) plots for each combination of the supplied parameters. Points can be colored by metadata from a hue file or by case-control status.

Parameters:
  • color_hue_file (Path, optional) – Path to a tab-separated file containing color hue information. The file should have at least 3 columns, where the first two are ID1 and ID2, and the third column contains the values for color coding. Default is None.

  • case_control_markers (bool, optional) – Whether to include case-control markers in the plots. If True, case-control status is read from the .fam file. If color_hue_file is not provided, case-control status is used as the hue. Default is True.

  • n_neighbors (list of int, optional) – List of values for the n_neighbors parameter in UMAP. Each value must be positive. Default is [5].

  • min_dist (list of float, optional) – List of values for the min_dist parameter in UMAP. Each value must be non-negative. Default is [0.5].

  • metric (list of str, optional) – List of distance metrics to use in UMAP. Default is ['euclidean'].

  • random_state (int, optional) – Random seed for reproducibility. Must be non-negative. Default is None.

  • umap_kwargs (dict, optional) – Additional keyword arguments to pass to the UMAP constructor. Default is an empty dictionary.
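
To make the expected layout of color_hue_file concrete, here is a minimal sketch that parses a tab-separated table whose first two columns are ID1 and ID2 and whose third column is the hue value. The sample rows, the absence of a header line, and the helper name `read_hue_table` are assumptions for illustration only; the package may read the file differently.

```python
import csv
import io

# Hypothetical file contents: ID1, ID2, hue value (tab-separated, no header assumed)
sample = "FAM1\tIND1\tEUR\nFAM2\tIND2\tAFR\nFAM3\tIND3\tEUR\n"


def read_hue_table(handle) -> dict:
    """Map (ID1, ID2) to the hue value found in the third column."""
    hues = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            raise ValueError("expected at least 3 tab-separated columns")
        hues[(row[0], row[1])] = row[2]
    return hues


hues = read_hue_table(io.StringIO(sample))
print(hues[("FAM2", "IND2")])
```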

Returns:

Saves UMAP plots as JPEG files and parameters as a CSV file in the results directory.

Return type:

None

Raises:
  • TypeError – If input parameters are not of the correct type.

  • ValueError – If input parameters have invalid values.

  • FileNotFoundError – If color_hue_file is specified but not found.

Notes

The method creates a grid of all possible parameter combinations and generates a UMAP plot for each. Parameters for each plot are saved in 'plots_parameters.csv'.
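
The grid of parameter combinations can be pictured as a Cartesian product of the three lists. The sketch below is one way to build such a grid and write it in CSV form; the column names and the in-memory buffer are assumptions, and the actual layout of the parameters file may differ.

```python
import csv
import io
from itertools import product

n_neighbors = [5, 15]
min_dist = [0.1, 0.5]
metric = ["euclidean"]

# One entry per (n_neighbors, min_dist, metric) combination
grid = [
    {"n_neighbors": n, "min_dist": d, "metric": m}
    for n, d, m in product(n_neighbors, min_dist, metric)
]

# Write the combinations as CSV (to an in-memory buffer in this sketch)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["n_neighbors", "min_dist", "metric"])
writer.writeheader()
writer.writerows(grid)

print(len(grid))  # 2 * 2 * 1 = 4 combinations
print(buffer.getvalue())
```

Since one plot is produced per combination, the number of plots grows multiplicatively with the length of each list.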

ld_pruning(maf: float = 0.001, geno: float = 0.1, mind: float = 0.2, hwe: float = 5e-08, ind_pair: list = [50, 5, 0.2]) None

Perform Linkage Disequilibrium (LD) pruning on genetic data using PLINK.

This method filters SNPs based on specified thresholds for various quality control metrics and performs LD-based pruning to remove highly correlated variants.

Parameters:
  • maf (float, default=0.001) – Minor allele frequency threshold. Variants with MAF below this value are removed. Must be between 0 and 0.5.

  • geno (float, default=0.1) – Maximum per-SNP missing rate. Variants with missing rate above this are removed. Must be between 0 and 1.

  • mind (float, default=0.2) – Maximum per-individual missing rate. Samples with missing rate above this are removed. Must be between 0 and 1. Recommended range is 0.02 to 0.1.

  • hwe (float, default=5e-8) – Hardy-Weinberg equilibrium exact test p-value threshold. Variants with p-value below this are removed. Must be between 0 and 1.

  • ind_pair (list, default=[50, 5, 0.2]) – Parameters for pairwise LD pruning: [window size, step size, r² threshold].

Returns:

Creates pruned PLINK binary files in the results directory.

Return type:

None

Notes

  • Skips processing if recompute_pca is False

  • Uses multithreading with optimal thread count based on system CPU

  • Generates intermediate files: .prune.in and .prune.out

  • Creates final LD-pruned dataset with '-LDpruned' suffix
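
The notes above suggest a two-step PLINK workflow: compute the prune lists, then extract the retained variants. The following sketch assembles argument lists for such a workflow. The helper name `build_ld_pruning_commands`, the file prefixes, and the exact split of flags between the two steps are assumptions rather than the package's actual implementation.

```python
def build_ld_pruning_commands(in_prefix, out_prefix, maf=0.001, geno=0.1,
                              mind=0.2, hwe=5e-8, ind_pair=(50, 5, 0.2),
                              threads=1):
    """Return two PLINK argument lists (hypothetical helper)."""
    window, step, r2 = ind_pair
    # Step 1: apply QC filters and write <out_prefix>.prune.in / .prune.out
    prune = [
        "plink", "--bfile", in_prefix,
        "--maf", str(maf),
        "--geno", str(geno),
        "--mind", str(mind),
        "--hwe", str(hwe),
        "--indep-pairwise", str(window), str(step), str(r2),
        "--threads", str(threads),
        "--out", out_prefix,
    ]
    # Step 2: keep only variants listed in .prune.in and write a new fileset
    extract = [
        "plink", "--bfile", in_prefix,
        "--extract", out_prefix + ".prune.in",
        "--make-bed",
        "--threads", str(threads),
        "--out", out_prefix + "-LDpruned",
    ]
    return prune, extract


prune_cmd, extract_cmd = build_ld_pruning_commands("data/study", "results/study")
print(" ".join(prune_cmd))
print(" ".join(extract_cmd))
```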

Raises:
  • TypeError – If input parameters are not of type float

  • ValueError – If input parameters are outside their valid ranges

  • UserWarning – If mind parameter is outside recommended range
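
The documented checks can be sketched as a small validation routine. This is an illustrative reading of the parameter descriptions, not the package's code: the helper name `check_ld_thresholds` is hypothetical, and treating the range bounds as inclusive is an assumption.

```python
import warnings


def check_ld_thresholds(maf=0.001, geno=0.1, mind=0.2, hwe=5e-8):
    """Validate LD-pruning thresholds (hypothetical helper)."""
    values = {"maf": maf, "geno": geno, "mind": mind, "hwe": hwe}
    for name, value in values.items():
        if not isinstance(value, float):
            raise TypeError(f"{name} should be of type float")
    # Upper bounds taken from the parameter descriptions; inclusive bounds assumed
    upper = {"maf": 0.5, "geno": 1.0, "mind": 1.0, "hwe": 1.0}
    for name, value in values.items():
        if not 0 <= value <= upper[name]:
            raise ValueError(f"{name} should be between 0 and {upper[name]}")
    if not 0.02 <= mind <= 0.1:
        warnings.warn("mind is outside the recommended range 0.02 to 0.1", UserWarning)


# The default mind=0.2 lies outside the recommended range, so a warning is issued
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_ld_thresholds()
print([str(w.message) for w in caught])
```

Note that, read this way, calling the method with its defaults would already trigger the mind warning, since 0.2 exceeds the recommended upper limit of 0.1.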