ideal_genom_qc.SampleQC

Module to perform sample quality control

class ideal_genom_qc.SampleQC.SampleQC(input_path: Path, input_name: str, output_path: Path, output_name: str, high_ld_file: Path, built: str = '38')

Bases: object

clean_input_folder() None

Removes specific files from the input folder.

This method cleans the input folder by removing files that contain ‘missing’ or ‘renamed’ in their names, excluding log files. The cleaning process helps maintain organization by removing temporary or processed files.

Returns:

None

clean_result_folder() None

Clean the results folder by removing specific file types.

This method removes all .bed, .bim, and .fam files from the results directory, while preserving log files. The cleaning process helps maintain a tidy workspace by removing intermediate or temporary files from previous runs.

Returns:

None

execute_drop_samples() None

Execute the removal of samples that failed quality control checks using PLINK. This method performs the following steps: 1. Determines the appropriate binary file name based on previous processing steps 2. Reads the fail_samples.txt file containing samples to be removed 3. Executes PLINK command to create new binary files excluding failed samples

Raises:

FileNotFoundError: If the fail_samples.txt file is not found in the fails directory

Returns:

None

Notes:

  • The output files will be created with suffix ‘-clean-samples’

  • The method preserves allele order during the operation

  • Input files must be in PLINK binary format (.bed, .bim, .fam)

execute_duplicate_relatedness(kingship: float = 0.354, use_king: bool = True) None

Execute duplicate and relatedness analysis on the genotype data. This method performs either IBD (Identity by Descent) or KING kinship coefficient analysis to identify duplicate samples and related individuals in the dataset.

Parameters:
  • kingship (float, optional) – The KING kinship coefficient threshold for identifying related samples. Default is 0.354, which corresponds to duplicates/MZ twins.

  • use_king (bool, optional) – If True, uses KING algorithm for relatedness analysis. If False, uses traditional IBD analysis. Default is True.

Return type:

None

Raises:

TypeError – If kingship is not a float or use_king is not a boolean.

Notes

The method will store the analysis type (KING or IBD) in the use_king attribute.

execute_haploid_to_missing(hh_to_missing: bool = True) None

Convert haploid genotypes to missing values in PLINK binary files. This method uses PLINK’s –set-hh-missing flag to convert haploid genotypes to missing values in the genotype data. This is often useful for quality control of genetic data, particularly for variants on sex chromosomes.

Parameters:

hh_to_missing (bool, default=True) – If True, converts haploid genotypes to missing values. If False, skips the conversion step.

Return type:

None

Raises:

TypeError – If hh_to_missing is not a boolean value.

Notes

The method uses PLINK to process the binary files (.bed, .bim, .fam) and creates new files with suffix ‘-hh-missing’. The input files are determined based on whether SNPs have been previously renamed (checks self.renamed_snps).

execute_heterozygosity_rate(maf: float = 0.01) None

Executes heterozygosity rate analysis on genetic data using PLINK.

This method performs a series of PLINK commands to analyze heterozygosity rates in genetic data, separating SNPs based on minor allele frequency (MAF) threshold and computing heterozygosity for both groups.

Parameters:

maf (float, optional) – Minor allele frequency threshold used to split SNPs into two groups. Must be between 0 and 0.5. Default is 0.01.

Return type:

None

Raises:
  • TypeError – If maf is not a float

  • ValueError – If maf is not between 0 and 0.5

  • FileNotFoundError – If any of the expected output files are not created

Notes

The method: 1. Extracts autosomal SNPs 2. Splits SNPs based on MAF threshold 3. Computes missingness 4. Converts to PED/MAP format 5. Computes heterozygosity for both MAF groups

The computation uses optimized threading based on available CPU cores and memory.

execute_ibd() None

Execute Identity by Descent (IBD) analysis using PLINK.

This method performs duplicate and relatedness checks using IBD analysis. It runs two PLINK commands: 1. Generates genome-wide IBD estimates 2. Calculates missing genotype rates

The method uses optimal thread count based on available CPU cores and validates input/output files.

Returns:

None

Raises:

FileNotFoundError: If required input pruned file is missing or if expected output files are not generated

Required instance attributes:

pruned_file: Path to pruned PLINK binary file results_dir: Directory path for output files output_name: Base name for output files ibd_miss: Path to missing genotype rate file (set by method) genome: Path to IBD estimates file (set by method)

execute_kingship(kingship: float = 0.354) None

Execute kinship analysis to identify and handle sample relatedness. This method performs kinship analysis using PLINK2 to identify duplicate samples and related individuals. It first computes a kinship coefficient matrix for all samples and then prunes samples based on the specified kingship threshold.

Parameters:

kingship (float, optional) – The kinship coefficient threshold used to identify related samples. Must be between 0 and 1. Samples with kinship coefficients above this threshold will be marked for removal. Default is 0.354 (equivalent to first-degree relatives).

Return type:

None

Raises:
  • TypeError – If kingship parameter is not a float.

  • ValueError – If kingship parameter is not between 0 and 1.

  • FileNotFoundError – If the expected output file from PLINK2 is not created.

Notes

  • Uses PLINK2 to compute kinship coefficients and perform sample pruning

  • Automatically determines optimal thread count and memory usage based on system resources

  • Creates output files with kinship coefficient matrix and list of samples to be removed

  • Updates self.kinship_miss with path to file containing samples to be removed

execute_ld_pruning(ind_pair: list = [50, 5, 0.2]) None

Execute LD (Linkage Disequilibrium) pruning on genetic data using PLINK. This method performs LD pruning in three steps: 1. Excludes complex/high LD regions 2. Identifies SNPs for pruning using indep-pairwise test 3. Creates final pruned dataset

Parameters:

ind_pair (list, optional) – List of three elements for LD pruning parameters: - Window size (int): Number of SNPs to analyze in each window - Step size (int): Number of SNPs to shift window at each step - r² threshold (float): Correlation coefficient threshold for pruning Default is [50, 5, 0.2]

Raises:
  • TypeError – If ind_pair is not a list If first two elements of ind_pair are not integers If third element of ind_pair is not float

  • ValueError – If ind_pair does not contain exactly three elements If window size or step size is not positive If r² threshold is not between 0 and 1

  • FileNotFoundError – If required pruning input file is not found

Notes

  • Uses available CPU cores (leaving 2 cores free) and 2/3 of available memory

  • Creates intermediate and final files with suffixes: * ‘-LDregionExcluded’ * ‘-LDregionExcluded-prunning’ * ‘-LDpruned’

  • Updates self.pruned_file with path to final pruned dataset

execute_miss_genotype(mind: float = 0.2) dict

Execute missing genotype analysis using PLINK to identify and filter samples with high missingness rates. This method performs two main operations: 1. Generates missingness statistics for all samples 2. Filters samples based on the specified missingness threshold (mind)

Parameters:

mind (float, optional) – The missingness threshold for sample filtering (default is 0.2). Samples with missingness rates above this threshold will be removed. Recommended range is between 0.02 and 0.1.

Returns:

Dictionary containing missingness analysis results

Return type:

dict

Raises:
  • TypeError – If mind parameter is not a float

  • ValueError – If mind parameter is not between 0 and 1

  • FileNotFoundError – If the output .imiss file is not generated If mind value is outside recommended range (0.02-0.1)

Notes

This function creates two files: - {input_name}-missing.imiss: Contains missingness statistics for all samples - {output_name}-mind.bed: New binary file with filtered samples

execute_rename_snpid(rename: bool = True) None

Executes the SNP ID renaming process using PLINK2. This method renames SNP IDs in the PLINK binary files to a standardized format of ‘chr:pos:a1:a2’. The renaming is performed using PLINK2’s –set-all-var-ids parameter.

Parameter:

rename (bool, optional): Flag to control whether SNP renaming should be performed.

Defaults to True.

Returns:

None

Raises:

TypeError: If rename parameter is not a boolean.

Notes:

  • The renamed files will be saved with ‘-renamed’ suffix

  • Thread count is optimized based on available CPU cores

  • The new SNP ID format will be: chromosome:position:allele1:allele2

  • Sets self.renamed_snps to True if renaming is performed

execute_sex_check(sex_check: list = [0.2, 0.8]) None

Execute sex check using PLINK to identify potential sex discrepancies in genetic data. This method performs sex check analysis by: 1. Running PLINK’s –check-sex command on pruned data 2. Extracting X chromosome SNPs 3. Calculating missingness rates for X chromosome SNPs

Parameters:

sex_check (list of float, default=[0.2, 0.8]) – List containing two float values that define the F-statistic boundaries for sex determination. The values must sum to 1.0. First value is the lower bound, second is the upper bound. Samples with F-statistics below the first value are called female, above the second value are called male.

Return type:

None

Raises:
  • TypeError – If sex_check is not a list or if its elements are not floats

  • ValueError – If sex_check doesn’t contain exactly 2 elements or if they don’t sum to 1

Notes

The method creates the following output files: - {output_name}-sexcheck.sexcheck : Contains sex check results - {output_name}-xchr.bed/bim/fam : X chromosome SNP data - {output_name}-xchr-missing.imiss : X chromosome missingness data The number of threads used is automatically determined based on available CPU cores, using max(available cores - 2, 1) or falling back to half of logical cores if CPU count cannot be determined.

get_fail_samples(call_rate_thres: float, std_deviation_het: float, maf_het: float, ibd_threshold: float) DataFrame

Get samples that failed quality control checks and generate a summary DataFrame. This method performs multiple QC checks on samples: 1. Call rate check 2. Sex check 3. Heterozygosity rate check (for MAF > and < threshold) 4. Duplicates/Relatedness check (using either KING or IBD)

Parameters:
  • call_rate_thres (float) – Threshold for call rate filtering

  • std_deviation_het (float) – Number of standard deviations to use for heterozygosity filtering

  • maf_het (float) – Minor allele frequency threshold for heterozygosity check

  • ibd_threshold (float) – Threshold for IBD filtering (only used if use_king=False)

Returns:

Summary DataFrame containing: - Counts of samples failing each QC check - Number of duplicated sample IDs - Total failures

Return type:

pd.DataFrame

Raises:
  • FileNotFoundError – If any required input files are missing

  • TypeError – If unexpected column types are found in summary DataFrame

Notes

The method saves a detailed fail_samples.txt file with all failed samples and their failure reasons. Samples failing multiple checks are only counted once in the final summary.

report_call_rate(directory: Path, filename: str, threshold: float, plots_dir: Path = None, y_axis_cap: int = 10, color: str = '#1B9E77', line_color: str = '#D95F02') DataFrame

Generate sample call rate analysis plots and identify samples failing the call rate threshold. This method reads a PLINK-format missing rate file, creates visualization plots, and identifies samples that fail the specified call rate threshold. It generates two sets of plots: 1. Histograms showing the distribution of missing SNPs (F_MISS) 2. Scatterplots showing different views of the call rate data

Parameters:
  • directory (Path) – Directory containing the input file

  • filename (str) – Name of the PLINK format missing rate file

  • threshold (float) – Call rate threshold for sample filtering (in terms of F_MISS)

  • plots_dir (Path, optional) – Directory where plots will be saved. If None, uses default plots directory

  • y_axis_cap (int, optional) – Maximum value for y-axis in capped histogram plots. Default is 10

  • color (str, optional) – Color for the main plot elements. Default is ‘#1B9E77’

  • line_color (str, optional) – Color for threshold lines in plots. Default is ‘#D95F02’

Returns:

DataFrame containing samples that failed the call rate threshold with columns: - FID: Family ID - IID: Individual ID - Failure: Always set to ‘Call rate’

Return type:

pd.DataFrame

Notes

The method generates two JPEG files: - call_rate_{threshold}_histogram.jpeg: Contains histogram plots - call_rate_{threshold}_scatterplot.jpeg: Contains scatter plots

report_heterozygosity_rate(directory: str, summary_ped_filename: str, autosomal_filename: str, std_deviation_het: float, maf: float, split: str, plots_dir: str, y_axis_cap: float = 80) DataFrame

Analyze and report heterozygosity rates for samples, creating visualization plots and identifying samples that fail heterozygosity rate checks. This function loads heterozygosity and autosomal call rate data, merges them, identifies samples with deviant heterozygosity rates, and generates visualization plots to aid in quality control analysis.

Parameters:
  • directory (str) – Path to the directory containing input files

  • summary_ped_filename (str) – Filename of the summary PED file containing heterozygosity information

  • autosomal_filename (str) – Filename of the autosomal file containing call rate information

  • std_deviation_het (float) – Number of standard deviations to use as threshold for identifying deviant samples

  • maf (float) – Minor allele frequency threshold used in the analysis

  • split (str) – Direction of MAF comparison (‘>’ or ‘<’)

  • plots_dir (str) – Directory where plot files will be saved

  • y_axis_cap (float, optional) – Maximum value for y-axis in capped histogram plot (default: 80)

Returns:

DataFrame containing samples that failed heterozygosity rate check with columns: - FID: Family ID - IID: Individual ID - Failure: Description of failure type

Return type:

pd.DataFrame

Notes

The function generates two types of plots: 1. Histograms of heterozygosity rates (both uncapped and capped) 2. Scatter plot of heterozygosity rate vs missing SNP proportion Files are saved as JPEG images in the specified plots directory.

report_ibd_analysis(ibd_threshold: float = 0.185, chunk_size: int = 100000) DataFrame

Analyze IBD (Identity By Descent) to identify duplicated or related samples. This method processes IBD analysis results to identify sample pairs with IBD scores above a specified threshold, indicating potential duplicates or related individuals. For identified pairs, it uses missingness data to determine which sample should be removed (keeping the sample with lower missingness rate).

Parameters:
  • ibd_threshold (float, default=0.185) – The PI_HAT threshold above which samples are considered related. Typical values: >0.98 for duplicates, >0.5 for first-degree relatives.

  • chunk_size (int, default=100000) – Number of rows to process at a time when reading the genome file.

Returns:

A DataFrame containing samples to be removed, with columns: - FID: Family ID - IID: Individual ID - Failure: Reason for removal (‘Duplicates and relatedness (IBD)’) Returns empty DataFrame if no related pairs are found or if KING is used.

Return type:

pd.DataFrame

Raises:
  • TypeError – If ibd_threshold is not a float.

  • FileNotFoundError – If required input files (*.imiss or *.genome) are not found.

Notes

The method requires two input files: - {output_name}-ibd-missing.imiss: Contains sample missingness information - {output_name}-ibd.genome: Contains pairwise IBD estimates For each related pair, the sample with higher missingness rate is marked for removal.

report_sex_check(directory: Path, sex_check_filename: str, xchr_imiss_filename: str, plots_dir: Path = None) DataFrame

Creates a sex check report and visualization based on PLINK’s sex check results. This function reads sex check data and X chromosome missingness data, merges them, and generates a scatter plot to visualize potential sex discrepancies. It also identifies samples that fail sex check quality control.

Parameters:
  • directory (Path) – Path to the directory containing input files

  • sex_check_filename (str) – Filename of PLINK’s sex check results (typically .sexcheck file)

  • xchr_imiss_filename (str) – Filename of X chromosome missingness data

  • plots_dir (Path, optional) – Directory where the plot should be saved. If None, uses default plots directory

Returns:

DataFrame containing samples that failed sex check QC with columns: - FID: Family ID - IID: Individual ID - Failure: Type of failure (always ‘Sex check’)

Return type:

pd.DataFrame

Notes

The function creates a scatter plot with: - Blue hollow circles for samples with Male PEDSEX - Green hollow circles for samples with Female PEDSEX - Red filled circles for problematic samples - Dotted red vertical lines at F=0.2 and F=0.8 The plot is saved as ‘sex_check.jpeg’ in the specified plots directory.