ideal_genom_qc.VariantQC
Python module to perform variant quality control
- class ideal_genom_qc.VariantQC.VariantQC(input_path: Path, input_name: str, output_path: Path, output_name: str)
Bases: object
- execute_different_genotype_call_rate() None
Execute test for different genotype call rates between cases and controls using PLINK.
This method performs the following operations:
1. Calculates available memory for PLINK execution
2. Runs PLINK’s --test-missing command to identify markers with significantly different missing rates between cases and controls
3. Generates a .missing file with the results
The method uses approximately 2/3 of available system memory for PLINK execution.
Returns:
None
Side effects:
Creates a .missing file in the results directory
Sets self.case_control_missing path attribute
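The steps above can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name `build_test_missing_cmd`, the way available memory is passed in, and the example paths are assumptions. The options `--bfile`, `--test-missing`, `--memory` (in MiB) and `--out` are standard PLINK 1.9 options.

```python
from pathlib import Path


def build_test_missing_cmd(bfile: Path, out_prefix: Path, available_bytes: int) -> list[str]:
    """Build a PLINK command line for the case/control missingness test.

    Hypothetical helper: allocates roughly 2/3 of the given available
    memory, converted to MiB (the unit PLINK's --memory expects).
    """
    memory_mib = available_bytes * 2 // 3 // 1024**2
    return [
        "plink",
        "--bfile", str(bfile),
        "--test-missing",
        "--memory", str(memory_mib),
        "--out", str(out_prefix),
    ]


# Illustrative values: 12 GiB of available memory and made-up paths.
cmd = build_test_missing_cmd(
    Path("data/study"), Path("results/study-case-control"), 12 * 1024**3
)
print(" ".join(cmd))
```

The command is returned as a list so that it can be handed to `subprocess.run` without shell quoting issues.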
- execute_drop_variants(maf: float = 5e-08, geno: float = 0.1, hwe: float = 5e-08) None
Execute variant filtering based on quality control parameters using PLINK.
This method removes variants that fail quality control criteria including minor allele frequency (MAF), genotype missingness rate, and Hardy-Weinberg equilibrium (HWE) test.
- Parameters:
maf (float, optional) – Minor allele frequency threshold. Variants with MAF below this value are removed. Default is 5e-8.
geno (float, optional) – Maximum per-variant missing genotype rate. Variants with missing rate above this value are removed. Default is 0.1 (10%).
hwe (float, optional) – Hardy-Weinberg equilibrium test p-value threshold. Variants with HWE p-value below this are removed. Default is 5e-8.
- Returns:
Creates quality controlled PLINK binary files (.bed, .bim, .fam) in the clean directory with suffix ‘-variantQCed’.
- Return type:
None
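To make the filtering rules concrete, the sketch below applies the same three thresholds to a handful of made-up variant records in plain Python. The `VariantStats` type, the `passes_qc` helper and the data are hypothetical; the comparisons mirror how PLINK's `--maf`, `--geno` and `--hwe` filters are documented to behave, not the internals of this method.

```python
from typing import NamedTuple


class VariantStats(NamedTuple):
    snp: str
    maf: float     # minor allele frequency
    f_miss: float  # fraction of missing genotype calls
    hwe_p: float   # HWE test p-value


def passes_qc(v: VariantStats, maf: float = 5e-8, geno: float = 0.1, hwe: float = 5e-8) -> bool:
    """Return True if the variant survives all three filters."""
    return v.maf >= maf and v.f_miss <= geno and v.hwe_p >= hwe


# Made-up example records.
variants = [
    VariantStats("rs1", 0.25, 0.01, 0.40),   # passes every filter
    VariantStats("rs2", 0.00, 0.02, 0.90),   # monomorphic: fails MAF
    VariantStats("rs3", 0.10, 0.15, 0.50),   # 15% missing: fails geno
    VariantStats("rs4", 0.30, 0.00, 1e-12),  # strong HWE deviation: fails hwe
]

kept = [v.snp for v in variants if passes_qc(v)]
print(kept)
```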
- execute_hwe_test() None
Execute Hardy-Weinberg Equilibrium (HWE) test using PLINK.
This method performs the following steps:
1. Calculates available memory and allocates 2/3 for the test
2. Runs PLINK command to compute HWE test on the input binary PLINK files
3. Saves results to a .hwe output file
The HWE test assesses whether the observed genotype frequencies at a variant match those expected under Hardy-Weinberg equilibrium, given its allele frequencies. Strong deviations can indicate genotyping errors.
Returns:
None
Side effects:
Creates a .hwe output file in the results directory
Sets self.hwe_results to the name of the output file
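As a worked illustration of what an HWE test measures, the sketch below compares observed genotype counts with the counts expected from the allele frequencies, using a Pearson chi-square statistic. This is a simplified stand-in chosen for readability: PLINK reports an exact test, and the function name `hwe_chi_square` is hypothetical.

```python
def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Pearson chi-square statistic for deviation from HWE proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p                        # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic variants).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Counts exactly in HWE proportions (p = 0.5): statistic is 0.
in_equilibrium = hwe_chi_square(25, 50, 25)

# No heterozygotes at all despite p = 0.5: a large deviation.
no_hets = hwe_chi_square(50, 0, 50)

print(in_equilibrium, no_hets)
```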
- execute_missing_data_rate(chr_y: int = 24) None
Executes missing data rate analysis using PLINK for male and female subjects separately. This method performs two PLINK operations:
1. Generates .lmiss and .imiss files for male subjects on chromosome Y
2. Generates .lmiss and .imiss files for all subjects excluding chromosome Y
- Parameters:
chr_y (int, default=24) – Chromosome Y number in the dataset. Must be between 0 and 26.
- Return type:
None
- Raises:
TypeError – If chr_y is not an integer
ValueError – If chr_y is not between 0 and 26
Notes
The method uses 2/3 of available system memory for PLINK operations. Output files are generated in the results directory with the following naming pattern:
- {output_name}-missing-males-only.lmiss/.imiss : For male subjects
- {output_name}-missing-not-y.lmiss/.imiss : For non-Y chromosome data
The results are stored in self.males_missing_data and self.females_missing_data as Path objects.
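The documented argument checks and output naming can be sketched as below. The helper names `validate_chr_y` and `missing_output_prefixes` are hypothetical, and treating the bounds 0 and 26 as inclusive is an assumption; only the exception types and the file name pattern come from the description above.

```python
from pathlib import Path


def validate_chr_y(chr_y: int) -> int:
    """Check chr_y as documented; bounds assumed inclusive."""
    if not isinstance(chr_y, int):
        raise TypeError("chr_y must be an integer")
    if not 0 <= chr_y <= 26:
        raise ValueError("chr_y must be between 0 and 26")
    return chr_y


def missing_output_prefixes(results_dir: Path, output_name: str) -> dict[str, Path]:
    """Output prefixes; PLINK appends .lmiss and .imiss to each."""
    return {
        "males": results_dir / f"{output_name}-missing-males-only",
        "not_y": results_dir / f"{output_name}-missing-not-y",
    }


# Illustrative directory and output name.
prefixes = missing_output_prefixes(Path("results"), "study")
print(validate_chr_y(24), prefixes["males"].name, prefixes["not_y"].name)
```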
- get_fail_variants(marker_call_rate_thres: float = 0.2, case_controls_thres: float = 1e-05, hwe_threshold: float = 5e-08, male_female_y_cap: int = None, hwe_y_cap: int = None) DataFrame
Identify and consolidate failing variants based on multiple quality control criteria. This method combines the results of three QC checks:
1. Variants with high missing data rates
2. Variants with significantly different genotype call rates between cases and controls
3. Variants failing Hardy-Weinberg equilibrium test
- Parameters:
marker_call_rate_thres (float, optional) – Threshold for failing variants based on missing data rate (default: 0.2)
case_controls_thres (float, optional) – P-value threshold for differential missingness between cases and controls (default: 1e-5)
hwe_threshold (float, optional) – P-value threshold for Hardy-Weinberg equilibrium test (default: 5e-8)
- Returns:
A summary DataFrame containing:
- Counts of variants failing each QC criterion
- Number of variants failing multiple criteria (duplicates)
- Total number of unique failing variants
- Return type:
pd.DataFrame
Notes
Results are also written to a tab-separated file ‘fail_markers.txt’
Variants failing multiple criteria are only counted once in the final output file
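The consolidation described above can be sketched in plain Python (used here instead of pandas to keep the example dependency-free). The function name `summarise_failures`, the label ‘Missing data rate’, the summary keys and the example SNPs are hypothetical, and counting duplicates as "rows minus unique SNPs" is an assumption about how the summary is built.

```python
from collections import Counter


def summarise_failures(*groups):
    """Combine (snp, failure) pairs from several QC checks.

    Returns a summary dict and the sorted list of unique failing SNPs.
    """
    rows = [row for group in groups for row in group]
    summary = dict(Counter(failure for _, failure in rows))
    unique_snps = sorted({snp for snp, _ in rows})
    # Assumed definition: extra rows contributed by SNPs failing more than once.
    summary["Duplicated SNPs"] = len(rows) - len(unique_snps)
    summary["Total unique"] = len(unique_snps)
    return summary, unique_snps


# Made-up results from the three checks.
missing = [("rs1", "Missing data rate"), ("rs2", "Missing data rate")]
case_control = [("rs2", "Different genotype call rate")]
hwe = [("rs3", "HWE"), ("rs2", "HWE")]

summary, fail_snps = summarise_failures(missing, case_control, hwe)
print(summary)
print(fail_snps)
```

Here rs2 fails all three checks but appears only once in `fail_snps`, matching the note that each variant is listed once in the output file.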
- report_different_genotype_call_rate(directory: str, filename: str, threshold: float) DataFrame
Reports markers whose genotype call rates differ significantly between cases and controls. This function reads a .missing file, selects markers whose differential-missingness P-value is below the specified threshold, and returns a DataFrame containing these markers.
Parameters:
directory (str) – The directory where the .missing file is located.
filename (str) – The name of the .missing file.
threshold (float) – The threshold for filtering markers based on the P-value.
Returns:
- pd.DataFrame: A DataFrame containing markers whose P-value is below the specified threshold. The DataFrame has two columns: ‘SNP’ and ‘Failure’, where ‘Failure’ is set to ‘Different genotype call rate’.
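A minimal sketch of the filtering step follows, using only the standard library and an in-memory sample instead of a file on disk. The column names follow PLINK 1.9's `.missing` output (CHR, SNP, F_MISS_A, F_MISS_U, P); the function name `failing_case_control` and the sample rows are hypothetical.

```python
MISSING_TEXT = """\
 CHR   SNP   F_MISS_A   F_MISS_U         P
   1   rs1     0.0100     0.0120      0.52
   1   rs2     0.0900     0.0050   3.1e-07
   2   rs3     0.0200     0.0210        NA
"""


def failing_case_control(text: str, threshold: float) -> list[dict]:
    """Return SNP/Failure records for markers with P below threshold."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    snp_i, p_i = header.index("SNP"), header.index("P")
    failures = []
    for line in lines[1:]:
        fields = line.split()
        if fields[p_i] == "NA":  # no test result for this marker
            continue
        if float(fields[p_i]) < threshold:
            failures.append(
                {"SNP": fields[snp_i], "Failure": "Different genotype call rate"}
            )
    return failures


failed = failing_case_control(MISSING_TEXT, threshold=1e-5)
print(failed)
```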
- report_hwe(directory: Path, filename: str, hwe_threshold: float = 5e-08, y_lim_cap: int = None) DataFrame
Generate Hardy-Weinberg Equilibrium (HWE) test report and visualization.
This method reads HWE test results from a file, identifies variants that fail HWE, creates a histogram of -log10(P) values, and returns failed variants.
- Parameters:
directory (Path) – Directory path where the HWE test results file is located
filename (str) – Name of the file containing HWE test results
hwe_threshold (float, optional) – P-value threshold for HWE test failure (default: 5e-8)
- Returns:
DataFrame containing variants that failed HWE test with columns:
- SNP: variant identifier
- Failure: reason for failure (always ‘HWE’)
- Return type:
pd.DataFrame
Notes
The method creates a histogram plot saved as ‘hwe-histogram’ showing the distribution of -log10(P) values from HWE tests.
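The transformation behind the histogram can be sketched as follows: P-values are mapped to -log10(P), binned, and compared against the threshold on the same scale. The helper names and P-values are made up for illustration, and unit-width bins are an assumption; the actual plot may bin differently.

```python
import math
from collections import Counter


def neg_log10(p_values):
    """Map P-values to -log10(P), skipping non-positive values."""
    return [-math.log10(p) for p in p_values if p > 0]


def unit_bin_counts(values):
    """Count values per unit-width bin, keyed by the bin's lower edge."""
    return dict(sorted(Counter(math.floor(v) for v in values).items()))


# Made-up HWE P-values keyed by SNP.
hwe_p = {"rs1": 0.5, "rs2": 0.03, "rs3": 2e-4, "rs4": 3e-9}
hwe_threshold = 5e-8

histogram = unit_bin_counts(neg_log10(hwe_p.values()))
threshold_line = -math.log10(hwe_threshold)  # about 7.3 on the plot's x-axis
failed = [{"SNP": snp, "Failure": "HWE"} for snp, p in hwe_p.items() if p < hwe_threshold]

print(histogram, round(threshold_line, 2), failed)
```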
- report_missing_data(directory: str, filename_male: str, filename_female: str, threshold: float, y_axis_cap: int = None) DataFrame
Analyze and report missing data rates for male and female subjects. This method processes missing data information from separate files for male and female subjects, creates visualizations of missing data distributions, and identifies SNPs that fail the missing data threshold for each sex group.
- Parameters:
directory (str) – Path to the directory containing the input files
filename_male (str) – Name of the file containing missing data information for male subjects (.lmiss format)
filename_female (str) – Name of the file containing missing data information for female subjects (.lmiss format)
threshold (float) – Maximum allowed missing data rate (between 0 and 1)
y_axis_cap (int, optional) – Upper limit for y-axis in histogram plots (default is 10)
- Returns:
A DataFrame containing SNPs that fail the missing data threshold for either sex, with columns [‘SNP’, ‘Failure’] where ‘Failure’ indicates the failing category
- Return type:
pd.DataFrame
Notes
The method generates two histogram plots saved as ‘missing_data_male’ and ‘missing_data_female’ showing the distribution of missing data rates for each sex.
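The per-sex thresholding can be sketched as below, starting from F_MISS values (the per-variant missing rate column of a PLINK `.lmiss` file) that have already been read into dictionaries. The function name, the failure labels and the data are hypothetical, and using a strict "greater than" comparison is an assumption.

```python
def failing_missing_data(male_f_miss: dict, female_f_miss: dict, threshold: float) -> list[dict]:
    """Return SNP/Failure records for variants exceeding the threshold."""
    groups = (
        ("Missing data rate on males", male_f_miss),      # hypothetical label
        ("Missing data rate on females", female_f_miss),  # hypothetical label
    )
    failures = []
    for label, table in groups:
        for snp, f_miss in table.items():
            if f_miss > threshold:
                failures.append({"SNP": snp, "Failure": label})
    return failures


# Made-up missing rates per SNP.
males = {"rsY1": 0.35, "rsY2": 0.02}
females = {"rs1": 0.01, "rs2": 0.25}

failed = failing_missing_data(males, females, threshold=0.2)
print(failed)
```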