Variant QC Module
The Variant QC module performs variant-level quality control on genotype data.
Main Class
- class ideal_genom.qc.variant_qc.VariantQC(input_path: Path, input_name: str, output_path: Path, output_name: str)
Bases: object
- __init__(input_path: Path, input_name: str, output_path: Path, output_name: str) -> None
Initialize the VariantQC class. This class handles quality control for genetic variant data stored in PLINK binary format (.bed, .bim, .fam files).
Parameters:
- input_path: Path
Directory path containing input PLINK files
- input_name: str
Base name of input PLINK files (without extension)
- output_path: Path
Directory path where output files will be saved
- output_name: str
Base name for output files
Raises:
- TypeError:
If input_path/output_path are not Path objects or if input_name/output_name are not strings
- FileNotFoundError:
If input_path/output_path don’t exist or required PLINK files are not found
Attributes:
- input_path: Path
Path to input directory
- output_path: Path
Path to output directory
- input_name: str
Base name of input files
- output_name: str
Base name for output files
- hwe_results:
Storage for Hardy-Weinberg equilibrium test results
- results_dir: Path
Directory for all QC results
- fails_dir: Path
Directory for failed samples
- clean_dir: Path
Directory for cleaned files
- plots_dir: Path
Directory for QC plots
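The checks listed under Raises can be pictured with a minimal sketch. The helper name check_plink_inputs is hypothetical and is not part of ideal_genom; it only assumes that the three PLINK binary files share one base name inside input_path.

```python
from pathlib import Path

PLINK_EXTENSIONS = (".bed", ".bim", ".fam")


def check_plink_inputs(input_path, input_name, output_path, output_name):
    """Hypothetical helper mirroring the documented TypeError/FileNotFoundError rules."""
    if not isinstance(input_path, Path) or not isinstance(output_path, Path):
        raise TypeError("input_path and output_path must be Path objects")
    if not isinstance(input_name, str) or not isinstance(output_name, str):
        raise TypeError("input_name and output_name must be strings")

    for directory in (input_path, output_path):
        if not directory.exists():
            raise FileNotFoundError(f"directory not found: {directory}")

    # A PLINK binary fileset needs all three files with the same base name.
    missing = [
        input_name + ext
        for ext in PLINK_EXTENSIONS
        if not (input_path / (input_name + ext)).exists()
    ]
    if missing:
        raise FileNotFoundError(f"missing PLINK files: {', '.join(missing)}")
```

Running such a check before constructing VariantQC gives an early, readable error when a base name is mistyped, although the class is documented to raise the same exception types itself.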
- execute_missing_data_rate(chr_y: int = 24) -> None
Executes missing data rate analysis using PLINK for male and female subjects separately. This method performs two PLINK operations:
1. Generates .vmiss and .smiss files for male subjects on chromosome Y
2. Generates .vmiss and .smiss files for all subjects excluding chromosome Y
- Parameters:
chr_y (int, default=24) – Chromosome Y number in the dataset. Must be between 0 and 26.
- Return type:
None
- Raises:
TypeError – If chr_y is not an integer
ValueError – If chr_y is not between 0 and 26
Notes
The method uses 2/3 of available system memory for PLINK operations. Output files are generated in the results directory with the following naming pattern:
- {output_name}-missing-males-only.lmiss/.imiss : For male subjects
- {output_name}-missing-not-y.lmiss/.imiss : For non-Y chromosome data
The results are stored in self.males_missing_data and self.females_missing_data as Path objects.
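As a rough illustration of the argument rules and the memory budget described above, the sketch below uses two hypothetical helpers, validate_chr_y and plink_memory_mb. It assumes the 0 to 26 range is inclusive, that booleans should be rejected even though Python treats them as integers, and that the memory budget is expressed in mebibytes; none of these details are confirmed by this page.

```python
def validate_chr_y(chr_y=24):
    """Hypothetical check mirroring the documented TypeError/ValueError rules."""
    # bool is a subclass of int in Python, so it is rejected explicitly (assumption).
    if isinstance(chr_y, bool) or not isinstance(chr_y, int):
        raise TypeError("chr_y must be an integer")
    # The range is assumed to be inclusive at both ends.
    if not 0 <= chr_y <= 26:
        raise ValueError("chr_y must be between 0 and 26")
    return chr_y


def plink_memory_mb(available_bytes):
    """Two thirds of the available memory, in whole mebibytes."""
    return available_bytes * 2 // 3 // (1024 * 1024)


# With 3 GiB available, two thirds is 2 GiB, i.e. 2048 MiB.
budget = plink_memory_mb(3 * 1024**3)
```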
- execute_different_genotype_call_rate() -> None
Execute test for different genotype call rates between cases and controls using PLINK.
This method performs the following operations:
1. Calculates available memory for PLINK execution
2. Runs PLINK’s --test-missing command to identify markers with significantly different missing rates between cases and controls
3. Generates a .missing file with the results
The method uses approximately 2/3 of available system memory for PLINK execution.
- Return type:
None
Notes
Creates a .missing file in the results directory
Sets self.case_control_missing path attribute
- get_fail_variants(marker_call_rate_thres: float = 0.2, case_controls_thres: float = 1e-05) -> pandas.DataFrame
Identify and consolidate failing variants based on multiple quality control criteria. This method combines the results of three QC checks:
1. Variants with high missing data rates
2. Variants with significantly different genotype call rates between cases and controls
3. Variants failing Hardy-Weinberg equilibrium test
- Parameters:
marker_call_rate_thres (float, optional) – Threshold for failing variants based on missing data rate (default: 0.2)
case_controls_thres (float, optional) – P-value threshold for differential missingness between cases and controls (default: 1e-5)
hwe_threshold (float, optional) – P-value threshold for Hardy-Weinberg equilibrium test (default: 5e-8)
- Returns:
A summary DataFrame containing:
- Counts of variants failing each QC criterion
- Number of variants failing multiple criteria (duplicates)
- Total number of unique failing variants
- Return type:
pd.DataFrame
Notes
Results are also written to a tab-separated file ‘fail_markers.txt’
Variants failing multiple criteria are only counted once in the final output file
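The consolidation described above (per-criterion counts, duplicates, and a unique total) can be sketched with the standard library alone. The function summarise_failures, the reason labels, and the summary keys "Duplicated SNPs" and "Total" are illustrative assumptions; the library itself returns a pandas DataFrame whose exact layout is not shown on this page.

```python
from collections import Counter


def summarise_failures(failures):
    """failures: iterable of (snp_id, reason) pairs. Returns (summary, unique_ids)."""
    failures = list(failures)
    # How many variants fail each individual criterion.
    summary = dict(Counter(reason for _, reason in failures))

    # Keep each variant once, preserving first-seen order.
    seen = set()
    unique_ids = []
    for snp_id, _ in failures:
        if snp_id not in seen:
            seen.add(snp_id)
            unique_ids.append(snp_id)

    summary["Duplicated SNPs"] = len(failures) - len(unique_ids)
    summary["Total"] = len(unique_ids)
    return summary, unique_ids


summary, unique_ids = summarise_failures([
    ("rs1", "Missing data rate"),
    ("rs2", "Missing data rate"),
    ("rs2", "Different genotype call rate"),
    ("rs3", "HWE"),
])
```

Here rs2 fails two criteria, so it contributes to both per-criterion counts but appears only once in unique_ids, which matches the note that the output file lists each failing variant once.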
- execute_drop_variants(maf: float = 5e-08, geno: float = 0.1, hwe: float = 5e-08) -> None
Execute variant filtering based on quality control parameters using PLINK.
This method removes variants that fail quality control criteria including minor allele frequency (MAF), genotype missingness rate, and Hardy-Weinberg equilibrium (HWE) test.
- Parameters:
maf (float, optional) – Minor allele frequency threshold. Variants with MAF below this value are removed. Default is 5e-8.
geno (float, optional) – Maximum per-variant missing genotype rate. Variants with missing rate above this value are removed. Default is 0.1 (10%).
hwe (float, optional) – Hardy-Weinberg equilibrium test p-value threshold. Variants with HWE p-value below this are removed. Default is 5e-8.
- Returns:
Creates quality controlled PLINK binary files (.bed, .bim, .fam) in the clean directory with suffix ‘-variantQCed’.
- Return type:
None
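For orientation, the three thresholds correspond to the PLINK filters --maf, --geno and --hwe. The sketch below only assembles a command line; build_drop_variants_command, its input_prefix argument and the "plink2" executable name are assumptions, and the flags that VariantQC actually passes may differ.

```python
from pathlib import Path


def build_drop_variants_command(input_prefix, clean_dir, output_name,
                                maf=5e-8, geno=0.1, hwe=5e-8, plink="plink2"):
    """Assemble (but do not run) a PLINK filtering command. Hypothetical helper."""
    # The documented output suffix is '-variantQCed' inside the clean directory.
    out_prefix = Path(clean_dir) / f"{output_name}-variantQCed"
    return [
        plink,
        "--bfile", str(input_prefix),  # prefix of the .bed/.bim/.fam fileset
        "--maf", str(maf),             # drop variants with MAF below this value
        "--geno", str(geno),           # drop variants with missingness above this value
        "--hwe", str(hwe),             # drop variants with HWE p-value below this value
        "--make-bed",
        "--out", str(out_prefix),
    ]


command = build_drop_variants_command("results/study", "clean", "study")
```

The resulting list can be handed to subprocess.run once PLINK is installed; keeping the command as a list avoids shell quoting problems with paths.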
- report_missing_data(filename_male: Path, filename_female: Path, threshold: float) -> pandas.DataFrame
Analyze and report missing data rates for male and female subjects. This method processes missing data information from separate files for male and female subjects, creates visualizations of missing data distributions, and identifies SNPs that fail the missing data threshold for each sex group.
- Parameters:
directory (str) – Path to the directory containing the input files
filename_male (str) – Name of the file containing missing data information for male subjects (.lmiss format)
filename_female (str) – Name of the file containing missing data information for female subjects (.lmiss format)
threshold (float) – Maximum allowed missing data rate (between 0 and 1)
y_axis_cap (int, optional) – Upper limit for y-axis in histogram plots (default is 10)
- Returns:
A DataFrame containing SNPs that fail the missing data threshold for either sex, with columns [‘SNP’, ‘Failure’] where ‘Failure’ indicates the failing category
- Return type:
pd.DataFrame
Notes
The method generates two histogram plots saved as ‘missing_data_male’ and ‘missing_data_female’ showing the distribution of missing data rates for each sex.
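The thresholding step can be illustrated without pandas. This sketch assumes a whitespace-delimited table with SNP and F_MISS columns, as in PLINK 1.9 .lmiss output; the function failing_snps and the failure label are hypothetical, and plotting is left out.

```python
def failing_snps(lmiss_text, threshold, label):
    """Return (SNP, Failure) pairs for variants whose F_MISS exceeds threshold."""
    lines = [line.split() for line in lmiss_text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    # Column names are assumed to follow the PLINK 1.9 .lmiss layout.
    snp_col = header.index("SNP")
    miss_col = header.index("F_MISS")
    return [
        (row[snp_col], label)
        for row in rows
        if float(row[miss_col]) > threshold
    ]


sample = """
 CHR  SNP  N_MISS  N_GENO  F_MISS
   1  rs1       2     100    0.02
   1  rs2      35     100    0.35
"""
fails = failing_snps(sample, 0.2, "Missing data rate on males")
```

Running the same function on the male and female tables with different labels and concatenating the results gives the two-column shape described under Returns.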
- report_different_genotype_call_rate(filename: Path, threshold: float) -> pandas.DataFrame
Report markers whose genotype call rate differs between cases and controls. This method reads a .missing file, keeps the markers whose P-value falls below the specified threshold, and returns them in a DataFrame.
- Parameters:
directory (str) – The directory where the .missing file is located.
filename (str) – The name of the .missing file.
threshold (float) – The threshold for filtering markers based on the P-value.
- Returns:
A DataFrame containing markers with different genotype call rates below the specified threshold. The DataFrame has two columns, ‘SNP’ and ‘Failure’, where ‘Failure’ is set to ‘Different genotype call rate’.
- Return type:
pd.DataFrame
- execute_variant_qc_pipeline(variant_params: dict) -> None
Execute a comprehensive variant quality control pipeline.
This method runs a series of quality control steps on genetic variants, including missing data analysis, genotype calling assessment, Hardy-Weinberg equilibrium testing, and variant filtering based on specified thresholds.
- Parameters:
variant_params (dict) –
Dictionary containing quality control parameters with the following keys:
- ’chr-y’: bool or str
Flag for chromosome Y analysis
- ’miss_data_rate’: float
Threshold for missing data rate filtering
- ’diff_genotype_rate’: float
Threshold for differential genotype call rate between cases/controls
- ’hwe’: float
Hardy-Weinberg equilibrium p-value threshold
- ’maf’: float
Minor allele frequency threshold for variant filtering
- ’geno’: float
Genotype call rate threshold for variant filtering
- Returns:
This method performs quality control operations in-place and does not return any values.
- Return type:
None
Notes
The pipeline executes the following steps in order:
1. Missing data rate computation (sex-stratified analysis)
2. Case/control nonrandom missingness test
3. Hardy-Weinberg equilibrium test
4. Identification of variants failing QC thresholds
5. Removal of variants that failed quality control
Each step prints a colored status message indicating the current operation being performed.
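A parameter dictionary with the six documented keys might look like the sketch below. The numeric values are illustrative and borrowed from defaults quoted elsewhere on this page; the value for 'chr-y' is a guess, since the page only says "bool or str". The missing_keys helper is hypothetical and simply checks that no key has been forgotten before the pipeline is called.

```python
REQUIRED_KEYS = {"chr-y", "miss_data_rate", "diff_genotype_rate", "hwe", "maf", "geno"}

# Illustrative values only; tune them for the study at hand.
variant_params = {
    "chr-y": True,               # assumed: accepted values are not specified here
    "miss_data_rate": 0.2,
    "diff_genotype_rate": 1e-5,
    "hwe": 5e-8,
    "maf": 5e-8,
    "geno": 0.1,
}


def missing_keys(params):
    """Return the documented keys that are absent from params, sorted by name."""
    return sorted(REQUIRED_KEYS - params.keys())


absent = missing_keys(variant_params)
```

An empty result from missing_keys means the dictionary at least has the documented shape; it says nothing about whether the values are sensible.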
Supporting Classes
- class ideal_genom.qc.variant_qc.VariantQCReport(output_path: Path)
Bases: object
Handles visualization and reporting for variant quality control results.
This class is responsible for generating plots and reports from variant QC analyses, following the Single Responsibility Principle by separating visualization concerns from the main VariantQC logic.
Parameters:
- output_path: Path
Directory path where plots and reports will be saved
Raises:
- TypeError:
If output_path is not a Path object
- FileNotFoundError:
If output_path doesn’t exist
Attributes:
- output_path: Path
Path to output directory where visualizations will be saved
- __init__(output_path: Path) -> None
Initialize the VariantQCReport class.
Parameters:
- output_path: Path
Directory path where plots and reports will be saved
Raises:
- TypeError:
If output_path is not a Path object
- FileNotFoundError:
If output_path doesn’t exist
- report_variant_qc(missing_data_rate_male: Path, missing_data_rate_female: Path, y_axis_cap: float | None = 100, missing_data_threshold: float = 0.1) -> None
Generate comprehensive visualization reports for variant quality control results.
This method creates histogram plots for missing data rates (separated by sex) and Hardy-Weinberg equilibrium test results, providing visual assessment of QC metrics.
- Parameters:
missing_data_rate_male (Path) – Path to the .vmiss file containing missing data statistics for male subjects
missing_data_rate_female (Path) – Path to the .vmiss file containing missing data statistics for female subjects
hwe_file (Path) – Path to the .hwe file containing Hardy-Weinberg equilibrium test results
y_axis_cap (float, optional) – Maximum value for y-axis in histograms. If None, automatically determined. Default is 100.
missing_data_threshold (float, optional) – Threshold for missing data rate visualization (vertical line). Default is 0.1.
hwe_threshold (float, optional) – P-value threshold for Hardy-Weinberg equilibrium test (vertical line). Default is 5e-8.
- Return type:
None
- Raises:
FileNotFoundError – If any of the input files don’t exist
pd.errors.EmptyDataError – If input files are empty or malformed
Notes
Creates ‘missing_data_male.svg’ and ‘missing_data_female.svg’ plots
Creates ‘hwe-histogram.svg’ plot
All plots are saved in SVG format with 600 DPI resolution
Threshold lines are displayed in red with dashed style
- class ideal_genom.qc.variant_qc.VariantQCCleanUp(output_path: Path)
Bases: object
- clean_all() -> None
Remove intermediate files from output directory.
This method deletes temporary files created during variant QC steps, namely files ending with ‘.bed’, ‘.bim’, ‘.fam’, ‘.vmiss’, ‘.smiss’, ‘.nosex’, ‘.missing’, or ‘.hwe’.
- Return type:
None
Notes
Only removes files if they exist. No error is raised if files are not found.
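The clean-up behaviour can be approximated with pathlib. This is a sketch rather than the library's implementation: remove_intermediate_files is a hypothetical function, and it assumes the intermediate files sit directly in the given directory rather than in subdirectories.

```python
from pathlib import Path

INTERMEDIATE_SUFFIXES = (
    ".bed", ".bim", ".fam", ".vmiss", ".smiss", ".nosex", ".missing", ".hwe",
)


def remove_intermediate_files(directory):
    """Delete files with an intermediate suffix; return the names that were removed."""
    removed = []
    for path in sorted(Path(directory).iterdir()):
        # str.endswith accepts a tuple, so one call covers every suffix.
        if path.is_file() and path.name.endswith(INTERMEDIATE_SUFFIXES):
            path.unlink()
            removed.append(path.name)
    return removed
```

Because the suffix list includes .bed, .bim and .fam, a function like this should only be pointed at a directory of intermediates, never at the directory holding the cleaned dataset. Calling it on a directory with no matching files returns an empty list without raising.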