Sample QC Module

The Sample QC module performs sample-level quality control on genotype data.

Main Class

class ideal_genom.qc.sample_qc.SampleQC(input_path: Path, input_name: str, output_path: Path, output_name: str, high_ld_regions_file: Path, build: str = '38')[source]

Bases: object

__init__(input_path: Path, input_name: str, output_path: Path, output_name: str, high_ld_regions_file: Path, build: str = '38') → None[source]

Initialize SampleQC class for quality control of genetic data.

This class handles quality control procedures for genetic data files in PLINK binary format (bed, bim, fam). It sets up the directory structure and validates input files.

Parameters:

input_path (Path) – Directory path containing the input PLINK files
input_name (str) – Base name of the input PLINK files (without extension)
output_path (Path) – Directory path where output files will be saved
output_name (str) – Base name for output files (without extension)
high_ld_regions_file (Path) – Path to file containing high LD regions. If not found, will be fetched from package
built (str, optional) – Genome build version, either ‘37’ or ‘38’ (default=’38’)

Raises:

TypeError – If input types are incorrect
ValueError – If genome build version is not ‘37’ or ‘38’
FileNotFoundError – If input paths or required PLINK files are not found

renamed_snps

Flag indicating if SNPs should be renamed

Type:: bool

hh_to_missing

Flag indicating if heterozygous haploid genotypes should be set to missing

Type:: bool

pruned_file

Placeholder for pruned file path

Type:: None

results_dir

Directory for all QC results

Type:: Path

fails_dir

Directory for failed samples

Type:: Path

clean_dir

Directory for clean files

Type:: Path

plots_dir

Directory for QC plots

Type:: Path

execute_preprocessing(rename: bool = True, hh_to_missing: bool = True) → None[source]

Executes the SNP ID renaming process and/or Convert haploid genotypes to missing valuesusing PLINK2.

This method renames SNP IDs in the PLINK binary files to a standardized format of ‘chr:pos:a1:a2’. The renaming is performed using PLINK2’s –set-all-var-ids parameter. This method uses PLINK’s –set-hh-missing flag to convert haploid genotypes to missing values in the genotype data. This is often useful for quality control of genetic data, particularly for variants on sex chromosomes.

Parameters:

(bool (hh_to_missing) – Defaults to True.
optional) (Flag to control whether haploid genotypes should be converted to missing values.) – Defaults to True.
(bool – Defaults to True.
optional) – Defaults to True.

Return type:

None

Raises:

TypeError – If rename parameter is not a boolean.:
TypeError – If hh_to_missing parameter is not a boolean.:

Notes

The renamed files will be saved with ‘-renamed’ suffix
Thread count is optimized based on available CPU cores
The new SNP ID format will be: chromosome:position:allele1:allele2
Sets self.renamed_snps to True if renaming is performed

execute_ld_pruning(ind_pair: list = [50, 5, 0.2]) → None[source]

Execute LD (Linkage Disequilibrium) pruning on genetic data using PLINK.

This method performs LD pruning in three steps: 1. Excludes complex/high LD regions 2. Identifies SNPs for pruning using indep-pairwise test 3. Creates final pruned dataset

Parameters:

ind_pair (list, optional) – List of three elements for LD pruning parameters: - Window size (int): Number of SNPs to analyze in each window - Step size (int): Number of SNPs to shift window at each step - r² threshold (float): Correlation coefficient threshold for pruning Default is [50, 5, 0.2]

Raises:

TypeError – If ind_pair is not a list If first two elements of ind_pair are not integers If third element of ind_pair is not float
ValueError – If ind_pair does not contain exactly three elements If window size or step size is not positive If r² threshold is not between 0 and 1
FileNotFoundError – If required pruning input file is not found

Notes

Uses available CPU cores (leaving 2 cores free) and 2/3 of available memory
Creates intermediate and final files with suffixes: * ‘-LDregionExcluded’ * ‘-LDregionExcluded-prunning’ * ‘-LDpruned’
Updates self.pruned_file with path to final pruned dataset

execute_miss_genotype() → None[source]

Execute missing genotype analysis using PLINK to generate sample missingness statistics.

This method generates genome-wide missingness statistics for all samples in the dataset. The statistics are used later to identify samples with high missingness rates during the quality control process.

Parameters:: None
Return type:: None
Raises:: FileNotFoundError – If the output .smiss file is not generated

Notes

This function creates one file: - {input_name}-missing.smiss: Contains missingness statistics for all samples

The method automatically optimizes thread count and memory usage based on available system resources (uses max(CPU cores - 2, 1) threads and 2/3 of available memory).

execute_sex_check(sex_check: list = [0.2, 0.8]) → None[source]

Execute sex check using PLINK to identify potential sex discrepancies in genetic data.

This method performs sex check analysis by: 1. Running PLINK’s –check-sex command on pruned data 2. Extracting X chromosome SNPs 3. Calculating missingness rates for X chromosome SNPs

Parameters:

sex_check (list of float, default=[0.2, 0.8]) – List containing two float values that define the F-statistic boundaries for sex determination. First value is the lower bound (max-female-xf), second is the upper bound (min-male-xf). Samples with F-statistics below the first value are called female, above the second value are called male.

Return type:

None

Raises:

TypeError – If sex_check is not a list or if its elements are not floats
ValueError – If sex_check doesn’t contain exactly 2 elements

Notes

The method creates the following output files: - {output_name}-sexcheck.sexcheck : Contains sex check results - {output_name}-xchr.bed/bim/fam : X chromosome SNP data - {output_name}-xchr-missing.smiss : X chromosome missingness data

The number of threads used is automatically determined based on available CPU cores, using max(available cores - 2, 1) or falling back to half of logical cores if CPU count cannot be determined.

execute_heterozygosity_rate(maf: float = 0.01) → None[source]

Executes heterozygosity rate analysis on genetic data using PLINK.

This method performs a series of PLINK commands to analyze heterozygosity rates in genetic data, separating SNPs based on minor allele frequency (MAF) threshold and computing heterozygosity for both groups.

Parameters:

maf (float, optional) – Minor allele frequency threshold used to split SNPs into two groups. Must be between 0 and 0.5. Default is 0.01.

Return type:

None

Raises:

TypeError – If maf is not a float
ValueError – If maf is not between 0 and 0.5
FileNotFoundError – If any of the expected output files are not created

Notes

The method: 1. Extracts autosomal SNPs 2. Splits SNPs based on MAF threshold 3. Computes missingness 4. Converts to PED/MAP format 5. Computes heterozygosity for both MAF groups

The computation uses optimized threading based on available CPU cores and memory.

execute_ibd() → None[source]

Execute Identity by Descent (IBD) analysis using PLINK.

This method performs duplicate and relatedness checks using IBD analysis. It runs two PLINK commands: 1. Generates genome-wide IBD estimates 2. Calculates missing genotype rates

The method uses optimal thread count based on available CPU cores and validates input/output files.

Returns:

None

Raises:

FileNotFoundError: If required input pruned file is missing or if expected output files are not generated

Required instance attributes:: pruned_file: Path to pruned PLINK binary file results_dir: Directory path for output files output_name: Base name for output files ibd_miss: Path to missing genotype rate file (set by method) genome: Path to IBD estimates file (set by method)

execute_kinship(kinship: float = 0.354) → None[source]

Execute kinship analysis to identify and handle sample relatedness.

This method performs kinship analysis using PLINK2 to identify duplicate samples and related individuals. It first computes a kinship coefficient matrix for all samples and then prunes samples based on the specified kinship threshold.

Parameters:

kinship (float, optional) – The kinship coefficient threshold used to identify related samples. Must be between 0 and 1. Samples with kinship coefficients above this threshold will be marked for removal. Default is 0.354 (equivalent to first-degree relatives).

Return type:

None

Raises:

TypeError – If kinship parameter is not a float.
ValueError – If kinship parameter is not between 0 and 1.
FileNotFoundError – If the expected output file from PLINK2 is not created.

Notes

Uses PLINK2 to compute kinship coefficients and perform sample pruning
Automatically determines optimal thread count and memory usage based on system resources
Creates output files with kinship coefficient matrix and list of samples to be removed
Updates self.kinship_miss with path to file containing samples to be removed

execute_duplicate_relatedness(kinship: float = 0.354, use_kinship: bool = True) → None[source]

Execute duplicate and relatedness analysis on the genotype data. This method performs either IBD (Identity by Descent) or KING kinship coefficient analysis to identify duplicate samples and related individuals in the dataset.

Parameters:

kinship (float, optional) – The KING kinship coefficient threshold for identifying related samples. Default is 0.354, which corresponds to duplicates/MZ twins.
use_kinship (bool, optional) – If True, uses KING algorithm for relatedness analysis. If False, uses traditional IBD analysis. Default is True.

Return type:

None

Raises:

TypeError – If kinship is not a float or use_kinship is not a boolean.

Notes

The method will store the analysis type (KING or IBD) in the use_kinship attribute.

get_fail_samples(call_rate_thres: float, std_deviation_het: float, maf_het: float, ibd_threshold: float) → pandas.DataFrame[source]

Identify and compile all samples that failed quality control checks.

This method aggregates samples that failed various QC checks including call rate, sex check, heterozygosity rate, and duplicate/relatedness checks. It uses chunked reading for memory efficiency with large datasets and provides a summary of failures.

Parameters:

call_rate_thres (float) – Call rate threshold (F_MISS). Samples with missing rate above this value fail. Recommended range: 0.02 to 0.1.
std_deviation_het (float) – Number of standard deviations from mean heterozygosity rate for outlier detection. Typical value: 3.
maf_het (float) – Minor allele frequency threshold used in heterozygosity analysis. Typical value: 0.01.
ibd_threshold (float) – PI_HAT threshold for IBD-based relatedness detection (only used if use_kinship=False). Typical values: >0.185 for 2nd degree relatives, >0.5 for 1st degree.

Returns:

Summary DataFrame with columns: - Failure: Type of QC failure - count: Number of samples failing each check Includes additional rows for duplicated sample IDs and totals.

Return type:

pd.DataFrame

Raises:

FileNotFoundError – If any required input file from previous QC steps is missing.

Notes

The method performs the following:

Reads large files in 10,000-row chunks to manage memory
Identifies samples failing multiple checks (reported as duplicates)
Saves two output files:
- fail_samples.txt: List of unique samples to remove (FID, IID)
- fail_summary.txt: Summary statistics of failures by type
Uses helper methods _analyze_heterozygosity_failures() and _analyze_ibd_failures() for complex analyses

The analysis respects the use_kinship attribute to determine whether to use KING-based or IBD-based relatedness detection.

execute_drop_samples() → None[source]

Execute the removal of samples that failed quality control checks using PLINK. This method performs the following steps: 1. Determines the appropriate binary file name based on previous processing steps 2. Reads the fail_samples.txt file containing samples to be removed 3. Executes PLINK command to create new binary files excluding failed samples

Raises:

FileNotFoundError: If the fail_samples.txt file is not found in the fails directory

Returns:

None

Notes:

The output files will be created with suffix ‘-clean-samples’

The method preserves allele order during the operation

Input files must be in PLINK binary format (.bed, .bim, .fam)

execute_sample_qc_pipeline(sample_params: dict) → None[source]

Execute the complete sample quality control pipeline.

This method runs all sample QC steps in the correct order with proper memory management and logging. It encapsulates the entire workflow that was previously handled in the main script.

Parameters:: sample_params (dict) – Dictionary containing all sample QC parameters with keys: - ‘rename_snp’: bool, whether to rename SNPs - ‘hh_to_missing’: bool, whether to convert haploid to missing - ‘ind_pair’: list, LD pruning parameters [window, step, r2] - ‘mind’: float, missing genotype rate threshold - ‘sex_check’: list, sex check F-statistic thresholds - ‘maf’: float, minor allele frequency for heterozygosity - ‘kinship’: float, kinship coefficient threshold - ‘use_kinship’: bool, whether to use KING vs IBD - ‘het_deviation’: float, standard deviations for het filtering - ‘ibd_threshold’: float, IBD threshold for relatedness
Return type:: None

Notes

The pipeline executes steps in this order: 1. Rename SNPs (optional) 2. Convert haploid to missing (optional) 3. LD pruning 4. Missing genotype analysis 5. Sex check 6. Heterozygosity rate analysis 7. Duplicate/relatedness analysis 8. Identify failed samples 9. Remove failed samples 10. Clean up temporary files

Memory usage is monitored after each step and garbage collection is performed to prevent memory issues with large datasets.

Supporting Classes

class ideal_genom.qc.sample_qc.SampleQCReport(output_path: Path)[source]

Bases: object

__init__(output_path: Path) → None[source]

report_sample_qc(call_rate_smiss: Path, sexcheck_miss: Path, xchr_miss: Path, maf_greater_het: Path, maf_less_het: Path, maf_greater_smiss: Path, maf_less_smiss: Path, genome: Path | None = None, generate_ibd_report: bool = False, f_coeff_thresholds: list = [0.2, 0.8], call_rate_thres: float = 0.2, std_deviation_het: float = 3, maf_het: float = 0.01, ibd_threshold: float | None = None) → None[source]

Generate comprehensive quality control visualization reports for sample-level QC.

This method orchestrates the generation of all sample QC visualization reports by calling individual reporting methods for call rate, sex check, heterozygosity, and optionally IBD analysis. It generates plots and visualizations without performing any QC execution or sample filtering.

Parameters:

call_rate_smiss (Path) – Path to the sample missingness file (.smiss) from PLINK –missing command
sexcheck_miss (Path) – Path to the sex check results file (.sexcheck) from PLINK –check-sex
xchr_miss (Path) – Path to the X chromosome missingness file (.smiss)
maf_greater_het (Path) – Path to heterozygosity file (.het) for SNPs with MAF > threshold
maf_less_het (Path) – Path to heterozygosity file (.het) for SNPs with MAF < threshold
maf_greater_smiss (Path) – Path to missingness file (.smiss) for SNPs with MAF > threshold
maf_less_smiss (Path) – Path to missingness file (.smiss) for SNPs with MAF < threshold
genome (Optional[Path], default=None) – Path to the PLINK .genome file for IBD analysis. Required if generate_ibd_report=True
generate_ibd_report (bool, default=False) – Whether to generate IBD analysis visualization
call_rate_thres (float, default=0.2) – Call rate threshold for visualization reference line
std_deviation_het (float, default=3) – Number of standard deviations for heterozygosity outlier visualization
maf_het (float, default=0.01) – Minor allele frequency threshold used in the analysis
ibd_threshold (Optional[float], default=None) – PI_HAT threshold for IBD visualization reference line. Required if generate_ibd_report=True

Returns:

This method generates plots as side effects and does not return data.

Return type:

None

Raises:

ValueError – If generate_ibd_report=True but ibd_threshold or genome is None

Notes

The method generates the following visualizations in sequence: 1. Call rate distribution plots (histogram and scatterplots) 2. Sex check scatter plot showing F-statistics vs X chr missingness 3. Heterozygosity rate plots for MAF > threshold 4. Heterozygosity rate plots for MAF < threshold 5. IBD PI_HAT distribution histogram (if generate_ibd_report=True)

All plots are saved to the output_path directory specified during initialization. This method is intended to be used after execute_sample_qc_pipeline() has completed and all required QC result files have been generated.

report_call_rate(smiss_file: Path, threshold: float, plots_dir: Path, y_axis_cap: int | float = 10, color: str = '#1B9E77', line_color: str = '#D95F02', format: str = 'png') → None[source]

Generate sample call rate analysis plots. This method reads a PLINK-format missing rate file and creates visualization plots showing the distribution of missing SNPs across samples. It generates two sets of plots: 1. Histograms showing the distribution of missing SNPs (F_MISS) 2. Scatterplots showing different views of the call rate data

Parameters:

smiss_file (Path) – Path to the PLINK format missing rate file (.smiss or .imiss)
threshold (float) – Call rate threshold for visualization reference line (in terms of F_MISS)
plots_dir (Path) – Directory where plots will be saved
y_axis_cap (Union[int, float], optional) – Maximum value for y-axis in capped histogram plots. Default is 10
color (str, optional) – Color for the main plot elements. Default is ‘#1B9E77’
line_color (str, optional) – Color for threshold lines in plots. Default is ‘#D95F02’
format (str, optional) – Format for saving plots (e.g., ‘png’, ‘pdf’, ‘svg’). Default is ‘png’

Returns:

This method generates plots as side effects and does not return data.

Return type:

None

Notes

The method generates two image files: - call_rate_{threshold}_histogram.<format>: Contains histogram plots - call_rate_{threshold}_scatterplot.<format>: Contains scatter plots

report_sex_check(sex_check_filename: Path, xchr_imiss_filename: Path, plots_dir: Path, f_coeff_thresholds: list = [0.2, 0.8], format: str = 'png', fig_size: tuple = (8, 6)) → None[source]

Creates a sex check visualization based on PLINK’s sex check results. This function reads sex check data and X chromosome missingness data, merges them, and generates a scatter plot to visualize potential sex discrepancies.

Parameters:

sex_check_filename (Path) – Path to PLINK’s sex check results file (typically .sexcheck file)
xchr_imiss_filename (Path) – Path to X chromosome missingness data file (.smiss or .imiss)
plots_dir (Path) – Directory where the plot will be saved
f_coeff_thresholds (list, optional) – List of two F coefficient thresholds [lower, upper] for reference lines. Default is [0.2, 0.8]
format (str, optional) – Format for saving the plot (e.g., ‘png’, ‘pdf’, ‘svg’). Default is ‘png’
fig_size (tuple, optional) – Figure size as (width, height). Default is (8, 6)

Returns:

This method generates plots as side effects and does not return data.

Return type:

None

Notes

The function creates a scatter plot with: - Blue hollow circles for samples with Male PEDSEX - Green hollow circles for samples with Female PEDSEX - Red filled circles for problematic samples - Dotted red vertical lines at the specified F coefficient thresholds The plot is saved as ‘sex_check.{format}’ in the specified plots directory.

report_heterozygosity_rate(het_filename: Path, autosomal_filename: Path, std_deviation_het: float | int, maf: float, split: str, plots_dir: Path, y_axis_cap: float | int = 80, format: str = 'png', scatter_fig_size: tuple = (10, 6)) → None[source]

Generate heterozygosity rate visualization plots for quality control analysis. This function loads heterozygosity and autosomal call rate data, merges them, and generates visualization plots showing samples with deviant heterozygosity rates.

Parameters:

het_filename (Path) – Path to the file containing heterozygosity information (.het file)
autosomal_filename (Path) – Path to the autosomal missingness file (.smiss or .imiss)
std_deviation_het (Union[float, int]) – Number of standard deviations to use as threshold for identifying deviant samples
maf (float) – Minor allele frequency threshold used in the analysis
split (str) – Direction of MAF comparison (‘>’ or ‘<’) to indicate which MAF subset is being analyzed
plots_dir (Path) – Directory where plot files will be saved
y_axis_cap (Union[float, int], optional) – Maximum value for y-axis in capped histogram plot. Default is 80
format (str, optional) – Format for saving plots (e.g., ‘png’, ‘pdf’, ‘svg’). Default is ‘png’
scatter_fig_size (tuple, optional) – Figure size for scatter plot as (width, height). Default is (10, 6)

Returns:

This method generates plots as side effects and does not return data.

Return type:

None

Notes

The function generates two types of plots: 1. Histograms of heterozygosity rates (both uncapped and capped) 2. Scatter plot of heterozygosity rate vs missing SNP proportion Files are saved with names: - heterozygosity_rate_greater_{maf}_histogram.{format} (if split=’>’) - heterozygosity_rate_less_{maf}_histogram.{format} (if split=’<’) - heterozygosity_rate_greater_{maf}_scatterplot.{format} (if split=’>’) - heterozygosity_rate_less_{maf}_scatterplot.{format} (if split=’<’)

report_ibd_analysis(genome: Path, ibd_threshold: float = 0.185, chunk_size: int = 100000) → None[source]

Generate visualization of IBD (Identity By Descent) analysis results.

This method processes IBD analysis results and creates a histogram showing the distribution of PI_HAT values for related sample pairs. The visualization includes a reference line at the specified threshold to help assess relatedness in the dataset.

Parameters:

genome (Path) – Path to the PLINK .genome file containing pairwise IBD estimates
ibd_threshold (float, default=0.185) – The PI_HAT threshold for the reference line in the plot. Typical values: >0.98 for duplicates, >0.5 for first-degree relatives, >0.185 for second-degree relatives.
chunk_size (int, default=100000) – Number of rows to process at a time when reading the genome file.

Returns:

This method generates a plot as a side effect and does not return data.

Return type:

None

Raises:

TypeError – If ibd_threshold is not a float or chunk_size is not an integer.
FileNotFoundError – If the genome file is not found.

Notes

The method creates a histogram visualization showing: - Distribution of PI_HAT values for sample pairs with PI_HAT > 0.1 - A vertical line indicating the ibd_threshold The plot is saved as ‘ibd_pihat_distribution.png’ in the output directory.

class ideal_genom.qc.sample_qc.SampleQCCleanUp(output_path: Path, input_path: Path)[source]

Bases: object

__init__(output_path: Path, input_path: Path) → None[source]

clean_input_files() → None[source]

Remove intermediate files from input directory.

This method deletes temporary files created during preprocessing steps: - Files ending with ‘processed’ (.bed, .bim, .fam)

Return type:: None

Notes

Only removes files if they exist. No error is raised if files are not found.

clean_results_files() → None[source]

Remove intermediate files from output directory.

This method deletes temporary files created during sample QC steps: - Files ending with ‘.bed’, ‘.bim’, ‘.fam’, ‘.vmiss’, ‘.smiss’, ‘.nosex’, ‘.sexcheck’, ‘.het’, ‘.genome’ - Files ending with ‘prune.in’, ‘prune.out’ - Files ending with ‘king.cutoff.out.id’, ‘king.cutoff.in.id’, ‘king.id’, ‘king.bin’

Return type:: None

Notes

Only removes files if they exist. No error is raised if files are not found.

clean_all() → None[source]

Remove all intermediate files from input and output directories.

This method calls clean_input_files() and clean_results_files() to remove temporary files created during preprocessing and sample QC steps.

Return type:: None