Variant QC Module

The Variant QC module performs variant-level quality control on genotype data.

Main Class

class ideal_genom.qc.variant_qc.VariantQC(input_path: Path, input_name: str, output_path: Path, output_name: str)

Bases: object

__init__(input_path: Path, input_name: str, output_path: Path, output_name: str) -> None

Initialize the VariantQC class. This class handles quality control for genetic variants data stored in PLINK binary format (.bed, .bim, .fam files).

Parameters:

input_path: Path

Directory path containing input PLINK files

input_name: str

Base name of input PLINK files (without extension)

output_path: Path

Directory path where output files will be saved

output_name: str

Base name for output files

Raises:

TypeError:

If input_path/output_path are not Path objects or if input_name/output_name are not strings

FileNotFoundError:

If input_path/output_path don’t exist or required PLINK files are not found
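The checks listed under Raises can be sketched as a standalone function. This is only an illustration of the documented behaviour, not the class's actual code: the helper name `check_plink_inputs` and the error messages are hypothetical.

```python
from pathlib import Path


def check_plink_inputs(input_path, input_name, output_path, output_name):
    """Hypothetical helper mirroring the documented TypeError/FileNotFoundError checks."""
    # Type checks: directories must be Path objects, base names must be strings.
    if not isinstance(input_path, Path) or not isinstance(output_path, Path):
        raise TypeError("input_path and output_path must be pathlib.Path objects")
    if not isinstance(input_name, str) or not isinstance(output_name, str):
        raise TypeError("input_name and output_name must be strings")

    # Existence checks: both directories must already exist.
    for directory in (input_path, output_path):
        if not directory.is_dir():
            raise FileNotFoundError(f"Directory not found: {directory}")

    # A PLINK binary fileset is three files sharing one base name.
    for ext in (".bed", ".bim", ".fam"):
        candidate = input_path / f"{input_name}{ext}"
        if not candidate.is_file():
            raise FileNotFoundError(f"Required PLINK file not found: {candidate}")
```

Passing a plain string such as "data/" for input_path would raise TypeError under this sketch, so callers should wrap directory arguments in Path first.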

Attributes:

input_path: Path

Path to input directory

output_path: Path

Path to output directory

input_name: str

Base name of input files

output_name: str

Base name for output files

hwe_results:

Storage for Hardy-Weinberg equilibrium test results

results_dir: Path

Directory for all QC results

fails_dir: Path

Directory for failed variants

clean_dir: Path

Directory for cleaned files

plots_dir: Path

Directory for QC plots

execute_missing_data_rate(chr_y: int = 24) -> None

Executes missing data rate analysis using PLINK for male and female subjects separately. This method performs two PLINK operations:

  1. Generates .vmiss and .smiss files for male subjects on chromosome Y

  2. Generates .vmiss and .smiss files for all subjects excluding chromosome Y

Parameters:

chr_y (int, default=24) – Chromosome Y number in the dataset. Must be between 0 and 26.

Return type:

None

Raises:

Notes

The method uses 2/3 of available system memory for PLINK operations. Output files are generated in the results directory with the following naming pattern:

  • {output_name}-missing-males-only.vmiss/.smiss : For male subjects

  • {output_name}-missing-not-y.vmiss/.smiss : For non-Y chromosome data

The results are stored in self.males_missing_data and self.females_missing_data as Path objects.
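As a rough sketch of the two operations and the 2/3 memory rule, the functions below build (but do not run) two command lines. The executable name `plink2`, the exact flag combination, the ValueError for an out-of-range chr_y, and both function names are assumptions made for illustration; the method's real invocation may differ.

```python
from pathlib import Path


def plink_memory_mb(available_bytes: int) -> int:
    """Two thirds of the available memory, expressed in whole MiB."""
    return available_bytes * 2 // 3 // (1024 ** 2)


def missingness_commands(bfile: Path, out_dir: Path, output_name: str,
                         chr_y: int, memory_mb: int) -> list[list[str]]:
    """Hypothetical builder for the two missingness runs described above."""
    if not 0 <= chr_y <= 26:
        # The error type is an assumption; the documentation only states the range.
        raise ValueError("chr_y must be between 0 and 26")

    common = ["plink2", "--bfile", str(bfile), "--missing",
              "--memory", str(memory_mb)]

    # Run 1: male subjects only, restricted to chromosome Y.
    males_on_y = common + ["--keep-males", "--chr", str(chr_y),
                           "--out", str(out_dir / f"{output_name}-missing-males-only")]

    # Run 2: all subjects, every chromosome except Y.
    all_not_y = common + ["--not-chr", str(chr_y),
                          "--out", str(out_dir / f"{output_name}-missing-not-y")]
    return [males_on_y, all_not_y]
```

With 3 GiB of available memory, `plink_memory_mb(3 * 1024 ** 3)` gives 2048, which would be passed to PLINK as its memory limit in MiB.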

execute_different_genotype_call_rate() -> None

Execute test for different genotype call rates between cases and controls using PLINK.

This method performs the following operations:

  1. Calculates available memory for PLINK execution

  2. Runs PLINK’s --test-missing command to identify markers with significantly different missing rates between cases and controls

  3. Generates a .missing file with the results

The method uses approximately 2/3 of available system memory for PLINK execution.

Return type:

None

Notes

  • Creates a .missing file in the results directory

  • Sets self.case_control_missing path attribute

get_fail_variants(marker_call_rate_thres: float = 0.2, case_controls_thres: float = 1e-05) -> pandas.DataFrame

Identify and consolidate failing variants based on multiple quality control criteria. This method combines the results of three QC checks:

  1. Variants with high missing data rates

  2. Variants with significantly different genotype call rates between cases and controls

  3. Variants failing Hardy-Weinberg equilibrium test

Parameters:
  • marker_call_rate_thres (float, optional) – Threshold for failing variants based on missing data rate (default: 0.2)

  • case_controls_thres (float, optional) – P-value threshold for differential missingness between cases and controls (default: 1e-5)

  • hwe_threshold (float, optional) – P-value threshold for Hardy-Weinberg equilibrium test (default: 5e-8)

Returns:

A summary DataFrame containing:

  • Counts of variants failing each QC criterion

  • Number of variants failing multiple criteria (duplicates)

  • Total number of unique failing variants

Return type:

pd.DataFrame

Notes

  • Results are also written to a tab-separated file ‘fail_markers.txt’

  • Variants failing multiple criteria are only counted once in the final output file
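The consolidation logic can be illustrated with plain Python. The function below is a sketch under assumptions, not the method's implementation: the function name, the criterion labels, and the summary keys "Duplicated SNPs" and "Total" are all hypothetical.

```python
from collections import Counter


def summarise_failures(failures: list[tuple[str, str]]) -> tuple[dict[str, int], list[str]]:
    """Summarise (variant_id, criterion) pairs; names and labels are illustrative."""
    # How many variants each QC criterion flagged.
    summary = dict(Counter(criterion for _, criterion in failures))

    # How often each variant was flagged, across all criteria.
    hits_per_variant = Counter(variant for variant, _ in failures)

    # Variants flagged more than once are the "duplicates".
    summary["Duplicated SNPs"] = sum(1 for n in hits_per_variant.values() if n > 1)

    # Each variant is listed once in the final output, in first-seen order.
    unique_ids = list(hits_per_variant)
    summary["Total"] = len(unique_ids)
    return summary, unique_ids


failures = [
    ("rs1", "Missing data rate"),
    ("rs2", "Missing data rate"),
    ("rs2", "Different genotype call rate"),
    ("rs3", "HWE"),
]
summary, unique_ids = summarise_failures(failures)
```

Here rs2 fails two criteria, so it contributes to two per-criterion counts but appears only once in `unique_ids`, matching the note that multiply-failing variants are written once.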

execute_drop_variants(maf: float = 5e-08, geno: float = 0.1, hwe: float = 5e-08) -> None

Execute variant filtering based on quality control parameters using PLINK.

This method removes variants that fail quality control criteria including minor allele frequency (MAF), genotype missingness rate, and Hardy-Weinberg equilibrium (HWE) test.

Parameters:
  • maf (float, optional) – Minor allele frequency threshold. Variants with MAF below this value are removed. Default is 5e-8.

  • geno (float, optional) – Maximum per-variant missing genotype rate. Variants with missing rate above this value are removed. Default is 0.1 (10%).

  • hwe (float, optional) – Hardy-Weinberg equilibrium test p-value threshold. Variants with HWE p-value below this are removed. Default is 5e-8.

Returns:

Creates quality controlled PLINK binary files (.bed, .bim, .fam) in the clean directory with suffix ‘-variantQCed’.
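To make the direction of each threshold explicit, here is a small pure-Python predicate that applies the same three rules to precomputed statistics. The filtering itself is done by PLINK; the `VariantStats` class and `passes_filters` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    """Illustrative per-variant summary statistics."""
    variant_id: str
    maf: float           # minor allele frequency
    missing_rate: float  # fraction of samples with a missing genotype
    hwe_p: float         # Hardy-Weinberg equilibrium test p-value


def passes_filters(v: VariantStats, maf: float = 5e-8,
                   geno: float = 0.1, hwe: float = 5e-8) -> bool:
    # Removed if MAF is below the threshold, if the missing rate is above
    # the threshold, or if the HWE p-value is below the threshold.
    return v.maf >= maf and v.missing_rate <= geno and v.hwe_p >= hwe


variants = [
    VariantStats("rs1", maf=0.20, missing_rate=0.01, hwe_p=0.5),    # kept
    VariantStats("rs2", maf=0.20, missing_rate=0.30, hwe_p=0.5),    # too much missingness
    VariantStats("rs3", maf=0.00, missing_rate=0.01, hwe_p=0.5),    # monomorphic
    VariantStats("rs4", maf=0.20, missing_rate=0.01, hwe_p=1e-10),  # fails HWE
]
kept = [v.variant_id for v in variants if passes_filters(v)]
```

With the default maf of 5e-8, the MAF rule in practice removes only monomorphic variants; a study-specific value is usually passed instead.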

Return type:

None

report_missing_data(filename_male: Path, filename_female: Path, threshold: float) -> pandas.DataFrame

Analyze and report missing data rates for male and female subjects. This method processes missing data information from separate files for male and female subjects, creates visualizations of missing data distributions, and identifies SNPs that fail the missing data threshold for each sex group.

Parameters:
  • directory (str) – Path to the directory containing the input files

  • filename_male (Path) – Path to the file containing missing data information for male subjects (.lmiss format)

  • filename_female (Path) – Path to the file containing missing data information for female subjects (.lmiss format)

  • threshold (float) – Maximum allowed missing data rate (between 0 and 1)

  • y_axis_cap (int, optional) – Upper limit for y-axis in histogram plots (default is 10)

Returns:

A DataFrame containing SNPs that fail the missing data threshold for either sex, with columns [‘SNP’, ‘Failure’] where ‘Failure’ indicates the failing category

Return type:

pd.DataFrame

Notes

The method generates two histogram plots saved as ‘missing_data_male’ and ‘missing_data_female’ showing the distribution of missing data rates for each sex.

report_different_genotype_call_rate(filename: Path, threshold: float) -> pandas.DataFrame

Reports markers whose genotype call rates differ between cases and controls. This function reads a .missing file, keeps the markers whose P-value is below the specified threshold, and returns a DataFrame containing these markers.

Parameters:

  • directory (str) – The directory where the .missing file is located

  • filename (Path) – Path to the .missing file

  • threshold (float) – The threshold for filtering markers based on the P-value

Returns:

pd.DataFrame: A DataFrame containing the markers whose differential call rate P-value is below the specified threshold. The DataFrame has two columns: ‘SNP’ and ‘Failure’, where ‘Failure’ is set to ‘Different genotype call rate’.
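The filter can be sketched without pandas as follows. It assumes the whitespace-delimited layout that PLINK 1.9 writes for --test-missing (columns CHR, SNP, F_MISS_A, F_MISS_U, P) and returns a list of dictionaries rather than a DataFrame; the function name is hypothetical.

```python
def differential_call_rate_failures(missing_text: str,
                                    threshold: float) -> list[dict[str, str]]:
    """Illustrative filter over the contents of a .missing file."""
    # Split every non-empty line on runs of whitespace.
    rows = [line.split() for line in missing_text.splitlines() if line.strip()]
    header, records = rows[0], rows[1:]
    snp_idx, p_idx = header.index("SNP"), header.index("P")

    return [
        {"SNP": rec[snp_idx], "Failure": "Different genotype call rate"}
        for rec in records
        # Skip rows without a usable P-value, then keep P below the threshold.
        if rec[p_idx] != "NA" and float(rec[p_idx]) < threshold
    ]


example = """\
 CHR   SNP   F_MISS_A   F_MISS_U         P
   1   rs1     0.0500     0.0010   2.1e-09
   1   rs2     0.0100     0.0120      0.74
   1   rs3          0          0        NA
"""
failed = differential_call_rate_failures(example, threshold=1e-5)
```

Only rs1 is reported here: rs2 has a P-value above the threshold and rs3 has no P-value to compare.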

execute_variant_qc_pipeline(variant_params: dict) -> None

Execute a comprehensive variant quality control pipeline.

This method runs a series of quality control steps on genetic variants, including missing data analysis, genotype calling assessment, Hardy-Weinberg equilibrium testing, and variant filtering based on specified thresholds.

Parameters:

variant_params (dict) –

Dictionary containing quality control parameters with the following keys:

  • ’chr-y’ (bool or str)

    Flag for chromosome Y analysis

  • ’miss_data_rate’ (float)

    Threshold for missing data rate filtering

  • ’diff_genotype_rate’ (float)

    Threshold for differential genotype call rate between cases/controls

  • ’hwe’ (float)

    Hardy-Weinberg equilibrium p-value threshold

  • ’maf’ (float)

    Minor allele frequency threshold for variant filtering

  • ’geno’ (float)

    Genotype call rate threshold for variant filtering

Returns:

This method performs quality control operations in-place and does not return any values.

Return type:

None

Notes

The pipeline executes the following steps in order:

  1. Missing data rate computation (sex-stratified analysis)

  2. Case/control nonrandom missingness test

  3. Hardy-Weinberg equilibrium test

  4. Identification of variants failing QC thresholds

  5. Removal of variants that failed quality control

Each step prints a colored status message indicating the current operation being performed.
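A parameter dictionary with the documented keys might look like the sketch below. The numeric values reuse the defaults documented for the individual methods, the ’chr-y’ value is only a placeholder, and the `missing_keys` helper is hypothetical and not part of the module.

```python
# Keys taken from the parameter list above.
REQUIRED_KEYS = ("chr-y", "miss_data_rate", "diff_genotype_rate", "hwe", "maf", "geno")

variant_params = {
    "chr-y": True,               # placeholder; documented type is bool or str
    "miss_data_rate": 0.2,       # missing data rate threshold
    "diff_genotype_rate": 1e-5,  # case/control differential missingness P-value
    "hwe": 5e-8,                 # Hardy-Weinberg equilibrium P-value
    "maf": 5e-8,                 # minor allele frequency
    "geno": 0.1,                 # per-variant missing genotype rate
}


def missing_keys(params: dict) -> list[str]:
    """Return the documented keys that are absent from params."""
    return [key for key in REQUIRED_KEYS if key not in params]
```

Checking `missing_keys(variant_params)` before starting the pipeline gives an early, readable error instead of a KeyError several steps in.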

Supporting Classes

class ideal_genom.qc.variant_qc.VariantQCReport(output_path: Path)

Bases: object

Handles visualization and reporting for variant quality control results.

This class is responsible for generating plots and reports from variant QC analyses, following the Single Responsibility Principle by separating visualization concerns from the main VariantQC logic.

Parameters:

output_path: Path

Directory path where plots and reports will be saved

Raises:

TypeError:

If output_path is not a Path object

FileNotFoundError:

If output_path doesn’t exist

Attributes:

output_path: Path

Path to output directory where visualizations will be saved

__init__(output_path: Path) -> None

Initialize the VariantQCReport class.

Parameters:

output_path: Path

Directory path where plots and reports will be saved

Raises:

TypeError:

If output_path is not a Path object

FileNotFoundError:

If output_path doesn’t exist

report_variant_qc(missing_data_rate_male: Path, missing_data_rate_female: Path, y_axis_cap: float | None = 100, missing_data_threshold: float = 0.1) -> None

Generate comprehensive visualization reports for variant quality control results.

This method creates histogram plots for missing data rates (separated by sex) and Hardy-Weinberg equilibrium test results, providing visual assessment of QC metrics.

Parameters:
  • missing_data_rate_male (Path) – Path to the .vmiss file containing missing data statistics for male subjects

  • missing_data_rate_female (Path) – Path to the .vmiss file containing missing data statistics for female subjects

  • hwe_file (Path) – Path to the .hwe file containing Hardy-Weinberg equilibrium test results

  • y_axis_cap (float, optional) – Maximum value for y-axis in histograms. If None, automatically determined. Default is 100.

  • missing_data_threshold (float, optional) – Threshold for missing data rate visualization (vertical line). Default is 0.1.

  • hwe_threshold (float, optional) – P-value threshold for Hardy-Weinberg equilibrium test (vertical line). Default is 5e-8.

Return type:

None

Raises:
  • FileNotFoundError – If any of the input files don’t exist

  • pd.errors.EmptyDataError – If input files are empty or malformed

Notes

  • Creates ‘missing_data_male.svg’ and ‘missing_data_female.svg’ plots

  • Creates ‘hwe-histogram.svg’ plot

  • All plots are saved in SVG format with 600 DPI resolution

  • Threshold lines are displayed in red with dashed style

class ideal_genom.qc.variant_qc.VariantQCCleanUp(output_path: Path)

Bases: object

__init__(output_path: Path) -> None

clean_all() -> None

Remove intermediate files from output directory.

This method deletes temporary files created during variant QC steps, namely files ending with ‘.bed’, ‘.bim’, ‘.fam’, ‘.vmiss’, ‘.smiss’, ‘.nosex’, ‘.missing’, or ‘.hwe’.

Return type:

None

Notes

Only removes files if they exist. No error is raised if files are not found.
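A suffix-based cleanup of this kind can be sketched with pathlib. The function name is hypothetical, and the sketch assumes a non-recursive scan of a single directory, which the documentation does not specify.

```python
from pathlib import Path

# Suffixes listed in the method description.
INTERMEDIATE_SUFFIXES = (".bed", ".bim", ".fam", ".vmiss", ".smiss",
                         ".nosex", ".missing", ".hwe")


def remove_intermediate_files(directory: Path) -> list[Path]:
    """Delete files with an intermediate suffix and return the removed paths."""
    removed = []
    for path in sorted(directory.iterdir()):
        # str.endswith accepts a tuple, so one call checks every suffix.
        if path.is_file() and path.name.endswith(INTERMEDIATE_SUFFIXES):
            # missing_ok avoids an error if the file vanished in the meantime.
            path.unlink(missing_ok=True)
            removed.append(path)
    return removed
```

Because ‘.bed’, ‘.bim’ and ‘.fam’ are in the list, a cleanup like this should only be pointed at a directory of intermediates, not at the one holding the final cleaned fileset.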