ideal_genom_qc.SampleQC
Module to perform sample quality control
- class ideal_genom_qc.SampleQC.SampleQC(input_path: Path, input_name: str, output_path: Path, output_name: str, high_ld_file: Path, built: str = '38')
Bases:
object- clean_input_folder() None
Removes specific files from the input folder.
This method cleans the input folder by removing files that contain ‘missing’ or ‘renamed’ in their names, excluding log files. The cleaning process helps maintain organization by removing temporary or processed files.
- Returns:
None
- clean_result_folder() None
Clean the results folder by removing specific file types.
This method removes all .bed, .bim, and .fam files from the results directory, while preserving log files. The cleaning process helps maintain a tidy workspace by removing intermediate or temporary files from previous runs.
- Returns:
None
- execute_drop_samples() None
Execute the removal of samples that failed quality control checks using PLINK. This method performs the following steps: 1. Determines the appropriate binary file name based on previous processing steps 2. Reads the fail_samples.txt file containing samples to be removed 3. Executes PLINK command to create new binary files excluding failed samples
Raises:
FileNotFoundError: If the fail_samples.txt file is not found in the fails directory
Returns:
None
Notes:
The output files will be created with suffix ‘-clean-samples’
The method preserves allele order during the operation
Input files must be in PLINK binary format (.bed, .bim, .fam)
Execute duplicate and relatedness analysis on the genotype data. This method performs either IBD (Identity by Descent) or KING kinship coefficient analysis to identify duplicate samples and related individuals in the dataset.
- Parameters:
kingship (float, optional) – The KING kinship coefficient threshold for identifying related samples. Default is 0.354, which corresponds to duplicates/MZ twins.
use_king (bool, optional) – If True, uses KING algorithm for relatedness analysis. If False, uses traditional IBD analysis. Default is True.
- Return type:
None
- Raises:
TypeError – If kingship is not a float or use_king is not a boolean.
Notes
The method will store the analysis type (KING or IBD) in the use_king attribute.
- execute_haploid_to_missing(hh_to_missing: bool = True) None
Convert haploid genotypes to missing values in PLINK binary files. This method uses PLINK’s –set-hh-missing flag to convert haploid genotypes to missing values in the genotype data. This is often useful for quality control of genetic data, particularly for variants on sex chromosomes.
- Parameters:
hh_to_missing (bool, default=True) – If True, converts haploid genotypes to missing values. If False, skips the conversion step.
- Return type:
None
- Raises:
TypeError – If hh_to_missing is not a boolean value.
Notes
The method uses PLINK to process the binary files (.bed, .bim, .fam) and creates new files with suffix ‘-hh-missing’. The input files are determined based on whether SNPs have been previously renamed (checks self.renamed_snps).
- execute_heterozygosity_rate(maf: float = 0.01) None
Executes heterozygosity rate analysis on genetic data using PLINK.
This method performs a series of PLINK commands to analyze heterozygosity rates in genetic data, separating SNPs based on minor allele frequency (MAF) threshold and computing heterozygosity for both groups.
- Parameters:
maf (float, optional) – Minor allele frequency threshold used to split SNPs into two groups. Must be between 0 and 0.5. Default is 0.01.
- Return type:
None
- Raises:
TypeError – If maf is not a float
ValueError – If maf is not between 0 and 0.5
FileNotFoundError – If any of the expected output files are not created
Notes
The method: 1. Extracts autosomal SNPs 2. Splits SNPs based on MAF threshold 3. Computes missingness 4. Converts to PED/MAP format 5. Computes heterozygosity for both MAF groups
The computation uses optimized threading based on available CPU cores and memory.
- execute_ibd() None
Execute Identity by Descent (IBD) analysis using PLINK.
This method performs duplicate and relatedness checks using IBD analysis. It runs two PLINK commands: 1. Generates genome-wide IBD estimates 2. Calculates missing genotype rates
The method uses optimal thread count based on available CPU cores and validates input/output files.
Returns:
None
Raises:
FileNotFoundError: If required input pruned file is missing or if expected output files are not generated
- Required instance attributes:
pruned_file: Path to pruned PLINK binary file results_dir: Directory path for output files output_name: Base name for output files ibd_miss: Path to missing genotype rate file (set by method) genome: Path to IBD estimates file (set by method)
- execute_kingship(kingship: float = 0.354) None
Execute kinship analysis to identify and handle sample relatedness. This method performs kinship analysis using PLINK2 to identify duplicate samples and related individuals. It first computes a kinship coefficient matrix for all samples and then prunes samples based on the specified kingship threshold.
- Parameters:
kingship (float, optional) – The kinship coefficient threshold used to identify related samples. Must be between 0 and 1. Samples with kinship coefficients above this threshold will be marked for removal. Default is 0.354 (equivalent to first-degree relatives).
- Return type:
None
- Raises:
TypeError – If kingship parameter is not a float.
ValueError – If kingship parameter is not between 0 and 1.
FileNotFoundError – If the expected output file from PLINK2 is not created.
Notes
Uses PLINK2 to compute kinship coefficients and perform sample pruning
Automatically determines optimal thread count and memory usage based on system resources
Creates output files with kinship coefficient matrix and list of samples to be removed
Updates self.kinship_miss with path to file containing samples to be removed
- execute_ld_pruning(ind_pair: list = [50, 5, 0.2]) None
Execute LD (Linkage Disequilibrium) pruning on genetic data using PLINK. This method performs LD pruning in three steps: 1. Excludes complex/high LD regions 2. Identifies SNPs for pruning using indep-pairwise test 3. Creates final pruned dataset
- Parameters:
ind_pair (list, optional) – List of three elements for LD pruning parameters: - Window size (int): Number of SNPs to analyze in each window - Step size (int): Number of SNPs to shift window at each step - r² threshold (float): Correlation coefficient threshold for pruning Default is [50, 5, 0.2]
- Raises:
TypeError – If ind_pair is not a list If first two elements of ind_pair are not integers If third element of ind_pair is not float
ValueError – If ind_pair does not contain exactly three elements If window size or step size is not positive If r² threshold is not between 0 and 1
FileNotFoundError – If required pruning input file is not found
Notes
Uses available CPU cores (leaving 2 cores free) and 2/3 of available memory
Creates intermediate and final files with suffixes: * ‘-LDregionExcluded’ * ‘-LDregionExcluded-prunning’ * ‘-LDpruned’
Updates self.pruned_file with path to final pruned dataset
- execute_miss_genotype(mind: float = 0.2) dict
Execute missing genotype analysis using PLINK to identify and filter samples with high missingness rates. This method performs two main operations: 1. Generates missingness statistics for all samples 2. Filters samples based on the specified missingness threshold (mind)
- Parameters:
mind (float, optional) – The missingness threshold for sample filtering (default is 0.2). Samples with missingness rates above this threshold will be removed. Recommended range is between 0.02 and 0.1.
- Returns:
Dictionary containing missingness analysis results
- Return type:
dict
- Raises:
TypeError – If mind parameter is not a float
ValueError – If mind parameter is not between 0 and 1
FileNotFoundError – If the output .imiss file is not generated If mind value is outside recommended range (0.02-0.1)
Notes
This function creates two files: - {input_name}-missing.imiss: Contains missingness statistics for all samples - {output_name}-mind.bed: New binary file with filtered samples
- execute_rename_snpid(rename: bool = True) None
Executes the SNP ID renaming process using PLINK2. This method renames SNP IDs in the PLINK binary files to a standardized format of ‘chr:pos:a1:a2’. The renaming is performed using PLINK2’s –set-all-var-ids parameter.
Parameter:
- rename (bool, optional): Flag to control whether SNP renaming should be performed.
Defaults to True.
Returns:
None
Raises:
TypeError: If rename parameter is not a boolean.
Notes:
The renamed files will be saved with ‘-renamed’ suffix
Thread count is optimized based on available CPU cores
The new SNP ID format will be: chromosome:position:allele1:allele2
Sets self.renamed_snps to True if renaming is performed
- execute_sex_check(sex_check: list = [0.2, 0.8]) None
Execute sex check using PLINK to identify potential sex discrepancies in genetic data. This method performs sex check analysis by: 1. Running PLINK’s –check-sex command on pruned data 2. Extracting X chromosome SNPs 3. Calculating missingness rates for X chromosome SNPs
- Parameters:
sex_check (list of float, default=[0.2, 0.8]) – List containing two float values that define the F-statistic boundaries for sex determination. The values must sum to 1.0. First value is the lower bound, second is the upper bound. Samples with F-statistics below the first value are called female, above the second value are called male.
- Return type:
None
- Raises:
TypeError – If sex_check is not a list or if its elements are not floats
ValueError – If sex_check doesn’t contain exactly 2 elements or if they don’t sum to 1
Notes
The method creates the following output files: - {output_name}-sexcheck.sexcheck : Contains sex check results - {output_name}-xchr.bed/bim/fam : X chromosome SNP data - {output_name}-xchr-missing.imiss : X chromosome missingness data The number of threads used is automatically determined based on available CPU cores, using max(available cores - 2, 1) or falling back to half of logical cores if CPU count cannot be determined.
- get_fail_samples(call_rate_thres: float, std_deviation_het: float, maf_het: float, ibd_threshold: float) DataFrame
Get samples that failed quality control checks and generate a summary DataFrame. This method performs multiple QC checks on samples: 1. Call rate check 2. Sex check 3. Heterozygosity rate check (for MAF > and < threshold) 4. Duplicates/Relatedness check (using either KING or IBD)
- Parameters:
call_rate_thres (float) – Threshold for call rate filtering
std_deviation_het (float) – Number of standard deviations to use for heterozygosity filtering
maf_het (float) – Minor allele frequency threshold for heterozygosity check
ibd_threshold (float) – Threshold for IBD filtering (only used if use_king=False)
- Returns:
Summary DataFrame containing: - Counts of samples failing each QC check - Number of duplicated sample IDs - Total failures
- Return type:
pd.DataFrame
- Raises:
FileNotFoundError – If any required input files are missing
TypeError – If unexpected column types are found in summary DataFrame
Notes
The method saves a detailed fail_samples.txt file with all failed samples and their failure reasons. Samples failing multiple checks are only counted once in the final summary.
- report_call_rate(directory: Path, filename: str, threshold: float, plots_dir: Path = None, y_axis_cap: int = 10, color: str = '#1B9E77', line_color: str = '#D95F02') DataFrame
Generate sample call rate analysis plots and identify samples failing the call rate threshold. This method reads a PLINK-format missing rate file, creates visualization plots, and identifies samples that fail the specified call rate threshold. It generates two sets of plots: 1. Histograms showing the distribution of missing SNPs (F_MISS) 2. Scatterplots showing different views of the call rate data
- Parameters:
directory (Path) – Directory containing the input file
filename (str) – Name of the PLINK format missing rate file
threshold (float) – Call rate threshold for sample filtering (in terms of F_MISS)
plots_dir (Path, optional) – Directory where plots will be saved. If None, uses default plots directory
y_axis_cap (int, optional) – Maximum value for y-axis in capped histogram plots. Default is 10
color (str, optional) – Color for the main plot elements. Default is ‘#1B9E77’
line_color (str, optional) – Color for threshold lines in plots. Default is ‘#D95F02’
- Returns:
DataFrame containing samples that failed the call rate threshold with columns: - FID: Family ID - IID: Individual ID - Failure: Always set to ‘Call rate’
- Return type:
pd.DataFrame
Notes
The method generates two JPEG files: - call_rate_{threshold}_histogram.jpeg: Contains histogram plots - call_rate_{threshold}_scatterplot.jpeg: Contains scatter plots
- report_heterozygosity_rate(directory: str, summary_ped_filename: str, autosomal_filename: str, std_deviation_het: float, maf: float, split: str, plots_dir: str, y_axis_cap: float = 80) DataFrame
Analyze and report heterozygosity rates for samples, creating visualization plots and identifying samples that fail heterozygosity rate checks. This function loads heterozygosity and autosomal call rate data, merges them, identifies samples with deviant heterozygosity rates, and generates visualization plots to aid in quality control analysis.
- Parameters:
directory (str) – Path to the directory containing input files
summary_ped_filename (str) – Filename of the summary PED file containing heterozygosity information
autosomal_filename (str) – Filename of the autosomal file containing call rate information
std_deviation_het (float) – Number of standard deviations to use as threshold for identifying deviant samples
maf (float) – Minor allele frequency threshold used in the analysis
split (str) – Direction of MAF comparison (‘>’ or ‘<’)
plots_dir (str) – Directory where plot files will be saved
y_axis_cap (float, optional) – Maximum value for y-axis in capped histogram plot (default: 80)
- Returns:
DataFrame containing samples that failed heterozygosity rate check with columns: - FID: Family ID - IID: Individual ID - Failure: Description of failure type
- Return type:
pd.DataFrame
Notes
The function generates two types of plots: 1. Histograms of heterozygosity rates (both uncapped and capped) 2. Scatter plot of heterozygosity rate vs missing SNP proportion Files are saved as JPEG images in the specified plots directory.
- report_ibd_analysis(ibd_threshold: float = 0.185, chunk_size: int = 100000) DataFrame
Analyze IBD (Identity By Descent) to identify duplicated or related samples. This method processes IBD analysis results to identify sample pairs with IBD scores above a specified threshold, indicating potential duplicates or related individuals. For identified pairs, it uses missingness data to determine which sample should be removed (keeping the sample with lower missingness rate).
- Parameters:
ibd_threshold (float, default=0.185) – The PI_HAT threshold above which samples are considered related. Typical values: >0.98 for duplicates, >0.5 for first-degree relatives.
chunk_size (int, default=100000) – Number of rows to process at a time when reading the genome file.
- Returns:
A DataFrame containing samples to be removed, with columns: - FID: Family ID - IID: Individual ID - Failure: Reason for removal (‘Duplicates and relatedness (IBD)’) Returns empty DataFrame if no related pairs are found or if KING is used.
- Return type:
pd.DataFrame
- Raises:
Notes
The method requires two input files: - {output_name}-ibd-missing.imiss: Contains sample missingness information - {output_name}-ibd.genome: Contains pairwise IBD estimates For each related pair, the sample with higher missingness rate is marked for removal.
- report_sex_check(directory: Path, sex_check_filename: str, xchr_imiss_filename: str, plots_dir: Path = None) DataFrame
Creates a sex check report and visualization based on PLINK’s sex check results. This function reads sex check data and X chromosome missingness data, merges them, and generates a scatter plot to visualize potential sex discrepancies. It also identifies samples that fail sex check quality control.
- Parameters:
directory (Path) – Path to the directory containing input files
sex_check_filename (str) – Filename of PLINK’s sex check results (typically .sexcheck file)
xchr_imiss_filename (str) – Filename of X chromosome missingness data
plots_dir (Path, optional) – Directory where the plot should be saved. If None, uses default plots directory
- Returns:
DataFrame containing samples that failed sex check QC with columns: - FID: Family ID - IID: Individual ID - Failure: Type of failure (always ‘Sex check’)
- Return type:
pd.DataFrame
Notes
The function creates a scatter plot with: - Blue hollow circles for samples with Male PEDSEX - Green hollow circles for samples with Female PEDSEX - Red filled circles for problematic samples - Dotted red vertical lines at F=0.2 and F=0.8 The plot is saved as ‘sex_check.jpeg’ in the specified plots directory.