VCF Processing Modules

Modules for post-imputation VCF file processing and conversion to PLINK format.

VCF Processing

Module to run the post-imputation processing tasks on VCF files.

This module provides classes for running various post-imputation tasks in parallel, including unzipping VCF files, filtering variants based on imputation quality, normalizing VCF files, and indexing VCF files. It uses the ThreadPoolExecutor for parallel execution and tqdm for progress tracking. The tasks are designed to handle large genomic datasets efficiently by leveraging multi-threading.

It also includes functionality to download and use reference genomes for normalization, and to convert VCF files into PLINK binary files for further analysis.

class ideal_genom.post_imputation.vcf_process.ParallelTaskRunner(input_path: Path, output_path: Path, max_workers: int | None = None)[source]

Bases: object

A base class for running parallel tasks on files.

This class provides the basic infrastructure for parallel processing of files using ThreadPoolExecutor. It handles file collection and parallel task execution while providing progress monitoring and logging.

input_path

Directory path where input files are located.

Type:: Path

output_path

Directory path where output files will be saved.

Type:: Path

max_workers

Maximum number of worker threads to use. Defaults to min(8, CPU count).

Type:: int

files

List of files to be processed.

Type:: List[Path]

Raises:

TypeError – If input_path or output_path are not Path objects.
FileNotFoundError – If input_path or output_path don’t exist.
NotADirectoryError – If input_path or output_path are not directories.

__init__(input_path: Path, output_path: Path, max_workers: int | None = None) → None[source]

execute_task() → None[source]

Execute the specific post-imputation processing task.

This abstract method should be implemented by all subclasses to perform their specific post-imputation processing operations. Implementations should handle the execution logic for the particular task the subclass is designed to perform.

Return type:: None
Raises:: NotImplementedError – If the subclass does not implement this method.
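The base-class pattern described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the library's code: the class names `SketchRunner` and `CountBytes` and the exact signature of `_run_task` are hypothetical, and tqdm progress reporting is left out.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Dict, List, Optional


class SketchRunner:
    """Hypothetical stand-in for a parallel file-task base class."""

    def __init__(self, input_path: Path, output_path: Path,
                 max_workers: Optional[int] = None) -> None:
        for path in (input_path, output_path):
            if not isinstance(path, Path):
                raise TypeError("input_path and output_path must be Path objects")
            if not path.exists():
                raise FileNotFoundError(f"{path} does not exist")
            if not path.is_dir():
                raise NotADirectoryError(f"{path} is not a directory")
        self.input_path = input_path
        self.output_path = output_path
        # Documented default: min(8, CPU count); fall back to 1 if unknown.
        self.max_workers = max_workers or min(8, os.cpu_count() or 1)
        self.files: List[Path] = []

    def _run_task(self, task: Callable[[Path], int]) -> Dict[str, int]:
        """Apply `task` to every collected file on a thread pool."""
        results: Dict[str, int] = {}
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            futures = {pool.submit(task, f): f for f in self.files}
            for future in as_completed(futures):
                results[futures[future].name] = future.result()
        return results

    def execute_task(self) -> None:
        raise NotImplementedError("subclasses must implement execute_task")


class CountBytes(SketchRunner):
    """Toy subclass: records the size in bytes of each *.txt file."""

    def execute_task(self) -> None:
        self.files = sorted(self.input_path.glob("*.txt"))
        self.sizes = self._run_task(lambda f: f.stat().st_size)


workdir = Path(tempfile.mkdtemp())
(workdir / "a.txt").write_text("abc")
(workdir / "b.txt").write_text("hello")
runner = CountBytes(workdir, workdir, max_workers=2)
runner.execute_task()
```

Threads suit this workload because each task mostly waits on an external process or on disk I/O, so the interpreter lock is rarely the bottleneck.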

class ideal_genom.post_imputation.vcf_process.UnzipVCF(input_path: Path, output_path: Path, max_workers: int | None = None, password: str | None = None)[source]

Bases: ParallelTaskRunner

A class for unzipping VCF (Variant Call Format) files after imputation, with support for parallel processing.

This class extends ParallelTaskRunner to efficiently extract VCF files from zip archives, including password-protected ones. It collects all zip files in the working directory and extracts their contents to the output directory.

(See `ParallelTaskRunner` for inherited attributes.)

Notes

VCF files are commonly used in genomics for storing gene sequence variations
The class only extracts files (not directories) from the zip archives
All extracted files are placed directly in the output directory without preserving paths
This class is designed for post-imputation processing in genetic data pipelines

__init__(input_path: Path, output_path: Path, max_workers: int | None = None, password: str | None = None) → None[source]

execute_task() → None[source]

Execute the post-imputation unzipping task on VCF files.

This method performs the following steps:

1. Collects all zip files in the working directory
2. Unzips the VCF files, using the provided password if necessary

Parameters:: password (Optional[str]) – Password to decrypt zip files if they are password-protected. Default is None.
Returns:: This method doesn’t return any value.
Return type:: None

unzip_files(zip_path: Path, password: str | None = None, output_prefix: str = 'unzipped-') → None[source]

Extract files from a zip archive, which may be password-protected.

This method extracts all non-directory files from the specified zip archive to the class’s output_path directory. If the zip file is password-protected, provide the password as a parameter.

Parameters:

zip_path (Path) – Path to the zip file to be extracted
password (Optional[str], optional) – Password for the zip file, None if the file is not password-protected. Defaults to None.
output_prefix (str, optional) – Prefix to add to extracted filenames. Defaults to ‘unzipped-’.

Return type:

None

Raises:

zipfile.BadZipFile – If the zip file is corrupted or password is incorrect
FileNotFoundError – If the zip file does not exist
PermissionError – If there are insufficient permissions to read the zip file or write to output directory

Notes

Files are extracted to the output_path directory of the class instance. Only files (not directories) are extracted from the archive. File paths are not preserved - all files are placed directly in output_path. The output_prefix is added to the beginning of each extracted filename.
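The flattening and prefixing behaviour described in these notes can be sketched with the standard `zipfile` module. The function name `extract_flat` and the demo archive are hypothetical; the library's own implementation may differ in its details.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path
from typing import List, Optional


def extract_flat(zip_path: Path, out_dir: Path,
                 password: Optional[str] = None,
                 output_prefix: str = "unzipped-") -> List[Path]:
    """Extract every file entry into out_dir, dropping archive subdirectories."""
    # zipfile expects the password as bytes, or None for unprotected archives.
    pwd = password.encode() if password is not None else None
    written: List[Path] = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries are skipped
            # Keep only the final path component and add the prefix.
            target = out_dir / (output_prefix + Path(info.filename).name)
            with zf.open(info, pwd=pwd) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written


# Demo with an unprotected archive holding one nested file.
demo_dir = Path(tempfile.mkdtemp())
archive = demo_dir / "chr1.zip"
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("results/", "")                       # directory entry
    zf.writestr("results/chr1.dose.vcf.gz", b"demo")  # nested file
out_dir = demo_dir / "out"
out_dir.mkdir()
extracted = extract_flat(archive, out_dir)
```

Because paths are not preserved, two archive members with the same file name in different folders would overwrite each other in this sketch.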

class ideal_genom.post_imputation.vcf_process.FilterVariants(input_path: Path, output_path: Path, max_workers: int | None = None, r2_threshold: float = 0.3, output_prefix: str = 'filtered-')[source]

Bases: ParallelTaskRunner

A class for filtering genetic variants in VCF/BCF files based on imputation quality (R² statistic). This class extends ParallelTaskRunner to provide parallel processing capabilities for filtering variants across multiple VCF files. It identifies variants with imputation quality below a specified R² threshold and removes them from the output files.

r2_threshold

The threshold value for the R² statistic. Variants with an R² value below this threshold will be filtered out.

Type:: float

output_prefix

The prefix to be added to output filenames. Default is ‘filtered-’.

Type:: str, optional

(See `ParallelTaskRunner` for inherited attributes.)

Notes

The class searches for files matching the pattern *dose.vcf.gz in the input directory and processes them in parallel. The filtered output files will be saved in the output directory with the specified prefix added to their original filenames.

Note

bcftools must be installed and available in the system path

__init__(input_path: Path, output_path: Path, max_workers: int | None = None, r2_threshold: float = 0.3, output_prefix: str = 'filtered-') → None[source]

execute_task() → None[source]

Execute the task of filtering variants based on an R² threshold.

This method collects the necessary files with the pattern *dose.vcf.gz and runs the filtering task with the specified parameters.

Return type:: None
Raises:: TypeError – If r2_threshold is not a float or output_prefix is not a string.

Notes

The method uses internal methods _file_collector and _run_task to perform the filtering operation.

filter_variants(input_file: Path, r2_threshold: float, output_prefix: str = 'filtered-') → None[source]

Filter variants from a VCF/BCF file based on R2 imputation quality threshold.

This method takes an imputed VCF/BCF file and filters out variants with imputation quality (R2) below the specified threshold. The filtered output is saved as a compressed VCF.

Parameters:

input_file (Path) – Path to the input VCF/BCF file to be filtered
r2_threshold (float) – Minimum R2 imputation quality threshold (variants with R2 <= threshold will be removed)
output_prefix (str, optional) – Prefix to add to the output filename. Defaults to ‘filtered-’.

Returns:

The method outputs a filtered VCF file but doesn’t return a value.

Return type:

None

Raises:

FileExistsError – If the input file does not exist
IsADirectoryError – If the input path is a directory, not a file
TypeError – If r2_threshold is not a float or output_prefix is not a string

Notes

The output file will be saved in the instance’s output_path directory with
the name constructed as: output_prefix + input_file.name

Note

This method requires bcftools to be installed and available in the system path.
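As a rough sketch of how such a filter could be expressed, the helper below builds a `bcftools view` command without running it. The function name `build_filter_command`, the use of the `R2` INFO tag, and the exact choice of bcftools options are assumptions made for illustration; only the output naming rule (output_prefix + input_file.name) and the type checks come from the description above.

```python
from pathlib import Path
from typing import List


def build_filter_command(input_file: Path, output_dir: Path,
                         r2_threshold: float = 0.3,
                         output_prefix: str = "filtered-") -> List[str]:
    """Return a bcftools command that drops variants with R2 <= threshold."""
    if not isinstance(r2_threshold, float):
        raise TypeError("r2_threshold must be a float")
    if not isinstance(output_prefix, str):
        raise TypeError("output_prefix must be a string")
    output_file = output_dir / (output_prefix + input_file.name)
    return [
        "bcftools", "view",
        "-e", f"R2<={r2_threshold}",  # exclude records matching the expression
        "-Oz",                         # write compressed VCF
        "-o", str(output_file),
        str(input_file),
    ]


cmd = build_filter_command(Path("in/chr1.dose.vcf.gz"), Path("out"))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where bcftools is installed.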

class ideal_genom.post_imputation.vcf_process.NormalizeVCF(input_path: Path, output_path: Path, max_workers: int | None = None, output_prefix: str = 'uncompressed-')[source]

Bases: ParallelTaskRunner

A class for normalizing VCF files post-imputation in parallel.

This class provides functionality to process VCF files by normalizing them using bcftools. It’s specifically designed to handle post-imputation VCF files and split multiallelic variants into separate entries. The class inherits from ParallelTaskRunner to enable parallel processing of multiple VCF files, which improves performance for large-scale genomic datasets.

output_prefix

Prefix to add to the output files. Defaults to ‘uncompressed-’.

Type:: str, optional

(See `ParallelTaskRunner` for inherited attributes.)

Note

bcftools must be installed and available in the system path

__init__(input_path: Path, output_path: Path, max_workers: int | None = None, output_prefix: str = 'uncompressed-') → None[source]

execute_task() → None[source]

Execute the post-imputation normalization task on VCF files.

This method collects filtered dose VCF files matching the pattern filtered-*dose.vcf.gz and runs the normalization process on them. The normalized files will be prefixed with the provided output_prefix.

Parameters:: output_prefix (str (optional)) – Prefix to add to the output files. Defaults to ‘uncompressed-’.
Raises:: TypeError – If output_prefix is not a string.
Return type:: None

normalize_vcf(input_file: Path, output_prefix: str = 'uncompressed-') → None[source]

Normalizes a VCF file using bcftools norm with the -m -any option.

This method takes a VCF file, performs normalization using bcftools to split multiallelic variants into separate entries, and outputs the normalized file with the specified prefix.

Parameters:

input_file (Path) – Path to the input VCF file to be normalized
output_prefix (str, optional) – Prefix for the output file name. Defaults to ‘uncompressed-’.

Return type:

None

Raises:

FileExistsError – If the input file does not exist
IsADirectoryError – If the input file path points to a directory
TypeError – If output_prefix is not a string

Notes

The output file will be saved in the output_path directory with the naming convention: output_prefix + base_name, where base_name is derived from the input file.
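To make "splitting multiallelic variants into separate entries" concrete, here is a deliberately simplified, pure-Python illustration of the idea. It handles only the site-level fields; bcftools itself also rewrites genotypes and per-allele INFO and FORMAT values, which this sketch ignores. The `Record` tuple and the function name are hypothetical.

```python
from typing import List, Tuple

# (chromosome, position, reference allele, comma-separated alternate alleles)
Record = Tuple[str, int, str, str]


def split_multiallelic(record: Record) -> List[Record]:
    """Return one record per alternate allele."""
    chrom, pos, ref, alt = record
    return [(chrom, pos, ref, allele) for allele in alt.split(",")]


split = split_multiallelic(("1", 1000, "G", "A,T"))
unchanged = split_multiallelic(("1", 2000, "C", "T"))
```

A site with alternates `A,T` becomes two biallelic records at the same position, while a biallelic site passes through as a single record.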

class ideal_genom.post_imputation.vcf_process.ReferenceNormalizeVCF(input_path: Path, output_path: Path, max_workers: int | None = None, build: str = '38', output_prefix: str = 'normalized-', reference_file: Path | None = None)[source]

Bases: ParallelTaskRunner

A class for normalizing VCF files using a reference genome in parallel.

This class extends ParallelTaskRunner to process multiple VCF files concurrently, normalizing them against a reference genome using bcftools. If a reference file is not provided, it will automatically download the appropriate reference genome based on the specified build.

build

Genome build version, either ‘37’ or ‘38’. Defaults to ‘38’.

Type:: str

output_prefix

Prefix to add to the output files. Defaults to ‘normalized-’.

Type:: str

reference_file

Path to the reference genome file used for normalization. Defaults to None. If None or the file does not exist, it will be downloaded automatically based on the build.

Type:: Path, optional

(See `ParallelTaskRunner` for inherited attributes.)

Note

bcftools must be installed and available in the system path

__init__(input_path: Path, output_path: Path, max_workers: int | None = None, build: str = '38', output_prefix: str = 'normalized-', reference_file: Path | None = None) → None[source]

execute_task() → None[source]

Execute the post-imputation normalization task with reference genome.

This method normalizes VCF files using a reference genome. If no reference file is provided, it automatically downloads the appropriate reference genome based on the build parameter.

Return type:

None

Raises:

TypeError – If output_prefix is not a string.
ValueError – If build is not ‘37’ or ‘38’.
FileNotFoundError – If the reference file could not be found or downloaded.

Notes

This method collects uncompressed dose VCF files using a pattern match and normalizes them against the reference genome. The downloaded reference genomes come from the 1000 Genomes Project.
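The reference-selection logic described for this method can be sketched as follows. The function `resolve_reference` and its `fetch` callback are hypothetical stand-ins: the callback represents whatever download routine supplies the genome for a build, which is not documented here.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional


def resolve_reference(build: str, reference_file: Optional[Path],
                      fetch: Callable[[str], Path]) -> Path:
    """Pick the user-supplied reference if usable, otherwise fetch one."""
    if build not in ("37", "38"):
        raise ValueError("build must be '37' or '38'")
    if reference_file is not None and reference_file.exists():
        return reference_file
    downloaded = fetch(build)  # hypothetical downloader supplied by the caller
    if not downloaded.exists():
        raise FileNotFoundError(f"reference for build {build} is unavailable")
    return downloaded


# Demo: a tiny placeholder FASTA plays the role of the downloaded genome.
ref_dir = Path(tempfile.mkdtemp())
fake_ref = ref_dir / "build38.fa"
fake_ref.write_text(">chr1\nACGT\n")
resolved = resolve_reference("38", None, lambda build: fake_ref)
```

Validating the build before touching the file system keeps a typo such as `'19'` from triggering a download attempt.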

normalize_with_reference(input_file: Path, output_prefix: str = 'normalized-') → None[source]

Normalize a VCF file with a reference genome using bcftools.

This method takes an input VCF file and normalizes it against a reference genome using bcftools norm. The normalized output is written as a bgzip-compressed VCF (-Oz).

Parameters:

input_file (Path) – Path to the input VCF file to be normalized.
output_prefix (str, default='normalized-') – Prefix to add to the output filename.

Returns:

The method doesn’t return a value but creates a normalized VCF file at the output_path location.

Return type:

None

Raises:

TypeError – If output_prefix is not a string.
Exception – If the bcftools command fails.
FileNotFoundError – If the input file cannot be found.

Notes

The output filename is constructed from the output_prefix and the base name extracted from the input filename (after the first hyphen).
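The naming rule in this note can be illustrated with a small helper. Both function names are hypothetical, the example input name is an assumption, and the command shown is only one plausible `bcftools norm` invocation with a FASTA reference; nothing is executed.

```python
from pathlib import Path
from typing import List


def normalized_name(input_file: Path, output_prefix: str = "normalized-") -> str:
    """Replace everything up to the first hyphen with output_prefix."""
    if not isinstance(output_prefix, str):
        raise TypeError("output_prefix must be a string")
    _, hyphen, remainder = input_file.name.partition("-")
    # If the name has no hyphen, keep it whole rather than failing.
    base_name = remainder if hyphen else input_file.name
    return output_prefix + base_name


def build_reference_norm_command(input_file: Path, reference: Path,
                                 output_dir: Path,
                                 output_prefix: str = "normalized-") -> List[str]:
    output_file = output_dir / normalized_name(input_file, output_prefix)
    return ["bcftools", "norm", "-f", str(reference),
            "-Oz", "-o", str(output_file), str(input_file)]


name = normalized_name(Path("uncompressed-chr1.dose.vcf.gz"))
norm_cmd = build_reference_norm_command(
    Path("work/uncompressed-chr1.dose.vcf.gz"), Path("ref/genome.fa"), Path("work"))
```

Swapping the prefix instead of stacking prefixes keeps file names short as a file moves through successive steps.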

class ideal_genom.post_imputation.vcf_process.IndexVCF(input_path: Path, output_path: Path, max_workers: int | None = None, pattern: str = 'normalized-*dose.vcf.gz')[source]

Bases: ParallelTaskRunner

A class for indexing VCF (Variant Call Format) files using bcftools in parallel.

This class extends ParallelTaskRunner to enable parallel processing of multiple VCF files. It creates index files that facilitate quick random access to compressed VCF files.

pattern

The glob pattern to match VCF files for indexing. Defaults to normalized-*dose.vcf.gz.

Type:: str, optional

(See `ParallelTaskRunner` for inherited attributes.)

Raises:: TypeError – If pattern is not a string.

Note

bcftools must be installed and available in the system path

__init__(input_path: Path, output_path: Path, max_workers: int | None = None, pattern: str = 'normalized-*dose.vcf.gz') → None[source]

execute_task() → None[source]

Execute the task of indexing VCF files.

This method collects files based on the provided pattern and indexes the VCF files.

Return type:: None

index_vcf(input_file: Path) → None[source]

Index a VCF file using bcftools.

This method creates an index for the specified VCF file using bcftools, which is required for efficient querying and processing of VCF files.

Parameters:: input_file (Path) – Path to the VCF file to be indexed. Must be an existing file.
Return type:: None
Raises:: FileExistsError – If the input file does not exist.
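A short sketch of pattern-based file collection followed by indexing. The helper names are hypothetical and the commands are built but not run; plain `bcftools index` is used here as an assumption about the invocation.

```python
import tempfile
from pathlib import Path
from typing import List


def collect_for_index(directory: Path,
                      pattern: str = "normalized-*dose.vcf.gz") -> List[Path]:
    """Return files in `directory` matching the glob pattern, sorted by name."""
    if not isinstance(pattern, str):
        raise TypeError("pattern must be a string")
    return sorted(directory.glob(pattern))


def build_index_command(input_file: Path) -> List[str]:
    return ["bcftools", "index", str(input_file)]


# Demo directory with two matching files and one that should be ignored.
index_dir = Path(tempfile.mkdtemp())
for file_name in ("normalized-chr1.dose.vcf.gz",
                  "normalized-chr2.dose.vcf.gz",
                  "filtered-chr1.dose.vcf.gz"):
    (index_dir / file_name).touch()

to_index = collect_for_index(index_dir)
index_cmds = [build_index_command(f) for f in to_index]
```

By default `bcftools index` writes a CSI index next to the input; adding `-t` requests a TBI index instead.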

class ideal_genom.post_imputation.vcf_process.AnnotateVCF(input_path: Path, output_path: Path, ref_annotation: Path, max_workers: int | None = None, output_prefix: str = 'annotated-')[source]

Bases: ParallelTaskRunner

A parallel task runner for annotating normalized VCF files using reference annotation.

This class provides functionality to annotate normalized VCF files with identifiers from a reference annotation file using bcftools. It processes multiple VCF files in parallel, making it efficient for large genomic datasets.

The class identifies all normalized VCF files matching a specified pattern and annotates them using the provided reference annotation file. It adds identifiers from the reference file to the VCF entries.

output_prefix

Prefix to add to the output files. Defaults to ‘annotated-’.

Type:: str, optional

ref_annotation

Path to the reference annotation file used for annotating VCF files.

Type:: Path

(See `ParallelTaskRunner` for inherited attributes.)

Raises:

TypeError – If ref_annotation is not a Path object or output_prefix is not a string.
FileNotFoundError – If the reference annotation file does not exist.
IsADirectoryError – If the reference annotation file is not a file.

Note

This class requires bcftools to be installed and available in the system path.

__init__(input_path: Path, output_path: Path, ref_annotation: Path, max_workers: int | None = None, output_prefix: str = 'annotated-') → None[source]

execute_task() → None[source]

Annotates normalized VCF files using a reference annotation file.

This method collects all normalized VCF files matching the pattern normalized-*dose.vcf.gz and annotates them using the provided reference annotation file. The annotated files will be saved with the specified output prefix.

Return type:: None

annotate_vcf(input_file: Path, ref_annotation: Path, output_prefix: str = 'annotated-') → None[source]

Annotates a VCF file with identifiers from a reference annotation file using bcftools. This method takes an input VCF file and annotates it with IDs from a reference annotation file. The annotated VCF is saved to a new file with the specified prefix.

Parameters:

input_file (Path) – Path to the input VCF file to be annotated.
ref_annotation (Path) – Path to the reference annotation file used for annotation.
output_prefix (str (optional)) – Prefix to add to the output filename. Defaults to ‘annotated-’.

Return type:

None

Raises:

FileExistsError – If the input file does not exist.
IsADirectoryError – If the input file is a directory, not a file.
TypeError – If ref_annotation is not a Path object or output_prefix is not a string.
Exception – If the bcftools command fails.
FileNotFoundError – If the input file is not found during execution.
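One way the ID annotation could be expressed with bcftools is sketched below. The helper name, the output naming (prefix plus the full input file name), and the choice of `-c ID` are assumptions for illustration; the command is only constructed, not executed.

```python
from pathlib import Path
from typing import List


def build_annotate_command(input_file: Path, ref_annotation: Path,
                           output_dir: Path,
                           output_prefix: str = "annotated-") -> List[str]:
    """Return a bcftools command copying the ID column from ref_annotation."""
    if not isinstance(ref_annotation, Path):
        raise TypeError("ref_annotation must be a Path object")
    if not isinstance(output_prefix, str):
        raise TypeError("output_prefix must be a string")
    # Assumed naming: the prefix is placed in front of the input file name.
    output_file = output_dir / (output_prefix + input_file.name)
    return [
        "bcftools", "annotate",
        "-a", str(ref_annotation),  # annotation source
        "-c", "ID",                 # transfer the ID column
        "-Oz", "-o", str(output_file),
        str(input_file),
    ]


annotate_cmd = build_annotate_command(
    Path("work/normalized-chr1.dose.vcf.gz"),
    Path("ref/annotation.vcf.gz"),
    Path("work"),
)
```

bcftools needs an indexed annotation file for `-a`, which is one reason the indexing step precedes annotation in the pipeline.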

class ideal_genom.post_imputation.vcf_process.ProcessVCF(input_path: Path, output_path: Path, input_name: str | None = None, output_name: str = 'concatenated.vcf.gz')[source]

Bases: object

ProcessVCF class for post-imputation processing of Variant Call Format (VCF) files.

This class provides a pipeline for processing VCF files through multiple sequential steps:

Unzipping VCF files (if compressed)
Filtering variants based on imputation quality (R²)
Normalizing variant representation
Normalizing against a reference genome
Indexing the normalized VCF files
Annotating variants with additional information
Concatenating multiple VCF files into a single output file

input_path

Path to the directory containing input VCF files.

Type:: Path

output_path

Path to the directory where processed files will be saved.

Type:: Path

Raises:

TypeError – If input_path or output_path is not of type Path.
FileNotFoundError – If input_path or output_path does not exist.
NotADirectoryError – If input_path or output_path is not a directory.

Notes

A subdirectory named process_vcf is created inside the input_path directory for storing intermediate files during processing.
This class is designed to handle multiple sequential steps in VCF file processing, such as unzipping, filtering, normalizing, and annotating.
Unlike other pipeline classes, this class processes multiple files in a directory rather than a single named input file, so the input_name and output_name parameters are optional and are ignored if provided.

Note

This class requires bcftools to be installed and available in the system path.

__init__(input_path: Path, output_path: Path, input_name: str | None = None, output_name: str = 'concatenated.vcf.gz') → None[source]

execute_unzip(password: str | None = None) → None[source]

Unzips a VCF file using the UnzipVCF utility.

This method creates an instance of UnzipVCF with the input and process paths from the current object, then executes the unzipping task. If the VCF file is password-protected, a password can be provided.

Parameters:: password (str, optional) – Password for the protected zip file. Defaults to None.
Return type:: None

execute_filter(r2_threshold: float = 0.3) → None[source]

Execute a filtering operation on VCF data based on R² threshold.

This method filters variants in the processed VCF file by creating and executing a FilterVariants object with the specified R² threshold. Both input and output are set to the same process_vcf file.

Parameters:: r2_threshold (float, optional) – The R² threshold value for filtering variants. Variants with R² value below this threshold will be filtered out. Default is 0.3.
Return type:: None

execute_normalize() → None[source]

Normalizes the VCF file using the NormalizeVCF class.

This method creates a NormalizeVCF object with the current processed VCF file as both input and output, then executes the normalization task. The normalization process updates the VCF file in place.

Return type:: None

execute_reference_normalize(build: str = '38', ref_genome: Path | None = None) → None[source]

Normalize the VCF file against a reference genome.

This method creates a ReferenceNormalizeVCF object and executes the normalization task on the processed VCF file, using the specified genome build or reference file.

Parameters:

build (str, optional) – Genome build version to use. Defaults to ‘38’.
ref_genome (Path, optional) – Path to a custom reference file. If provided, this will be used instead of the default reference for the specified build. Defaults to None.

Return type:

None

execute_index(pattern: str = 'normalized-*dose.vcf.gz') → None[source]

Index VCF files matching a specific pattern.

This method creates an indexer for VCF files and executes the indexing task on files that match the given pattern in the process_vcf directory.

Parameters:: pattern (str, optional) – The glob pattern to match VCF files for indexing. Defaults to normalized-*dose.vcf.gz.
Return type:: None

execute_annotate(ref_annotation: Path, output_prefix: str = 'annotated-') → None[source]

Annotates a VCF file using a reference annotation file.

This method initializes an AnnotateVCF object and executes the annotation process on the current VCF file.

Parameters:

ref_annotation (Path) – Path to the reference annotation file.
output_prefix (str, optional) – Prefix to be added to the output file name. Default is ‘annotated-’.

Return type:

None

execute_concatenate(output_name: str, max_threads: int | None = None) → None[source]

Concatenates annotated VCF files using bcftools concat.

This method finds all annotated VCF files in the process_vcf directory, sorts them, and concatenates them into a single compressed VCF file.

Parameters:

output_name (str) – Name of the output file.
max_threads (int (optional)) – Maximum number of threads to use for concatenation. If None, uses get_optimal_threads(max_threads=8). Defaults to None.

Return type:

None

Raises:

TypeError – If output_name is not a string.
FileNotFoundError – If no annotated VCF files are found in the process_vcf directory.
ValueError – If max_threads is less than 1.

Notes

The output file will be saved in the output_path directory. The method uses the ‘bcftools concat’ command with Oz compression.
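The sort-then-concatenate step can be sketched as below. How the library orders files is not stated here, so the natural-sort key is an assumption: a plain lexicographic sort would place chr10 before chr2. The helper names and the fallback of 8 threads (standing in for get_optimal_threads) are likewise hypothetical, and the command is built without being run.

```python
import re
from pathlib import Path
from typing import List, Optional, Sequence


def natural_key(path: Path) -> List[object]:
    """Sort key that compares digit runs numerically (chr2 before chr10)."""
    parts = re.split(r"(\d+)", path.name)
    return [int(part) if part.isdigit() else part for part in parts]


def build_concat_command(files: Sequence[Path], output_file: Path,
                         max_threads: Optional[int] = None) -> List[str]:
    if not files:
        raise FileNotFoundError("no annotated VCF files found")
    if max_threads is not None and max_threads < 1:
        raise ValueError("max_threads must be at least 1")
    # Assumed fallback when no thread count is given.
    threads = max_threads if max_threads is not None else 8
    ordered = sorted(files, key=natural_key)
    return ["bcftools", "concat", "-Oz", "--threads", str(threads),
            "-o", str(output_file), *[str(f) for f in ordered]]


annotated = [Path("work/annotated-chr10.dose.vcf.gz"),
             Path("work/annotated-chr2.dose.vcf.gz"),
             Path("work/annotated-chr1.dose.vcf.gz")]
concat_cmd = build_concat_command(annotated, Path("out/concatenated.vcf.gz"), max_threads=4)
```

Input order matters for `bcftools concat`, since records are written in the order the files are given.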

excute_intermediate_cleanup() → None[source]

Cleans up intermediate files in the process_vcf directory.

This method removes all files in the process_vcf directory to free up space after processing is complete.

Return type:: None

execute_process_vcf_pipeline(process_vcf_params: dict) → None[source]

Execute the full VCF processing pipeline.

This method runs the complete sequence of VCF processing steps: unzipping, filtering, normalizing, reference normalization, indexing, annotating, and concatenating.

Parameters:

process_vcf_params (dict) – Options for the pipeline steps. The following keys are recognized:

password (str, optional) – Password for the protected zip file. Defaults to None.
r2_threshold (float, optional) – R² threshold for filtering variants. Defaults to 0.3.
build (str, optional) – Genome build version for reference normalization. Defaults to ‘38’.
ref_genome (Path, optional) – Path to a custom reference genome file. Defaults to None.
ref_annotation (Path, optional) – Path to the reference annotation file for annotating VCF files. Defaults to None.
output_name (str, optional) – Name of the final concatenated output file. Defaults to ‘final_output.vcf.gz’.
max_threads (int, optional) – Maximum number of threads for concatenation. Defaults to None.

Return type:

None
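A possible way to assemble the parameter dictionary and call the pipeline is sketched below. The keys mirror the parameter list above; the annotation path and the wrapper function `run_pipeline` are hypothetical, and whether every key is required is not stated in this reference.

```python
from pathlib import Path
from typing import Any, Dict

process_vcf_params: Dict[str, Any] = {
    "password": None,                 # zip archives are not protected
    "r2_threshold": 0.3,
    "build": "38",
    "ref_genome": None,               # let the pipeline fetch the reference
    "ref_annotation": Path("reference/annotation.vcf.gz"),  # hypothetical path
    "output_name": "final_output.vcf.gz",
    "max_threads": 4,
}


def run_pipeline(input_dir: Path, output_dir: Path, params: Dict[str, Any]) -> None:
    """Thin wrapper; requires ideal_genom and bcftools to be installed."""
    # Imported lazily so this sketch can be read without the package present.
    from ideal_genom.post_imputation.vcf_process import ProcessVCF

    pipeline = ProcessVCF(input_path=input_dir, output_path=output_dir)
    pipeline.execute_process_vcf_pipeline(params)
```

Keeping the options in one dictionary makes it easy to store a run's settings alongside its outputs, for example as a JSON or YAML file.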

VCF to PLINK Conversion

Module for converting VCF files to PLINK binary format.

This module provides the GetPLINK class for converting post-imputation VCF files to PLINK binary format for downstream genomic analysis. The conversion process handles resource management, file validation, and supports both initial conversion and family information updates.

class ideal_genom.post_imputation.vcf_to_plink.GetPLINK(input_path: Path, input_name: str, output_path: Path, output_name: str)[source]

Bases: object

A class for converting concatenated VCF files to PLINK binary format.

This class handles the conversion of a concatenated VCF file into PLINK binary files ready for further analysis.

input_path

Directory path where the input VCF file is located.

Type:: Path

output_path

Directory path where the output files will be saved.

Type:: Path

input_name

Name of the input VCF file (must end with .vcf or .vcf.gz).

Type:: str

output_name

Name for the output file. If not provided, it will be derived from input_name.

Type:: str, optional

Raises:

TypeError – If input_path or output_path is not a Path object, or if input_name or output_name is not a string.
FileNotFoundError – If input_path or output_path does not exist.
NotADirectoryError – If input_path or output_path is not a directory.
ValueError – If input_name is not provided or if it doesn’t end with .vcf or .vcf.gz.

__init__(input_path: Path, input_name: str, output_path: Path, output_name: str) → None[source]

convert_vcf_to_plink(double_id: bool = True, threads: int | None = None, memory: int | None = None) → None[source]

Convert a VCF file to PLINK binary format (.bed, .bim, .fam).

This method runs the plink2 command-line tool to convert the input VCF file to PLINK binary format, filtering for SNPs with standard ACGT alleles only.

Parameters:

double_id (bool, optional) – Whether to pass the --double-id flag to the plink2 command, which sets both FID and IID to the sample ID. Defaults to True.
threads (int, optional) – Number of CPU threads to use. If None, defaults to (available CPU cores - 2) or 10 if CPU count can’t be determined.
memory (int, optional) – Memory allocation in MB for plink2. If None, defaults to approximately 2/3 of available system memory.

Return type:

None

Side Effects

Creates PLINK binary files (.bed, .bim, .fam) in the self.analysis_ready directory with the prefix self.output_name + “-nosex”.

Raises:

subprocess.CalledProcessError – If the plink2 command execution fails.
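The conversion call can be approximated by the command builder below. The helper names are hypothetical, the memory default is left to the caller because detecting available memory needs a third-party package, and the clamp to at least one thread is an addition of this sketch rather than documented behaviour.

```python
import os
from pathlib import Path
from typing import List, Optional


def default_threads() -> int:
    """Available CPU cores minus 2, or 10 if the count is unknown."""
    cpu_count = os.cpu_count()
    if cpu_count is None:
        return 10
    return max(1, cpu_count - 2)  # clamp added here for small machines


def build_plink_command(vcf_file: Path, out_dir: Path, output_name: str,
                        double_id: bool = True,
                        threads: Optional[int] = None,
                        memory: Optional[int] = None) -> List[str]:
    cmd = [
        "plink2", "--vcf", str(vcf_file),
        "--snps-only", "just-acgt",   # keep SNPs with A/C/G/T alleles only
        "--make-bed",                 # write .bed/.bim/.fam
        "--out", str(out_dir / (output_name + "-nosex")),
        "--threads", str(threads if threads is not None else default_threads()),
    ]
    if memory is not None:
        cmd += ["--memory", str(memory)]
    if double_id:
        cmd.append("--double-id")
    return cmd


plink_cmd = build_plink_command(Path("out/concatenated.vcf.gz"), Path("ready"),
                                "cohort", threads=4, memory=8000)
```

Running the list through `subprocess.run(plink_cmd, check=True)` would raise `subprocess.CalledProcessError` on a non-zero exit, matching the documented failure mode.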

update_fam(for_fam_update_file: Path, threads: int | None = None, memory: int | None = None) → None[source]

Add family information to the PLINK .fam file.

This method reads a family information file and updates the PLINK .fam file using the provided family information, via PLINK2.

Parameters:

for_fam_update_file (Path) – Path to the family information file (.fam or without suffix).
threads (int, optional) – Number of threads to use for PLINK2 (defaults to available CPUs - 2).
memory (int, optional) – Memory allocation in MB for PLINK2.

Return type:

None

execute_intermediate_cleanup() → None[source]

Cleans up intermediate files generated during the PLINK conversion process.

This method removes temporary files such as the initial .bed, .bim, and .fam files created during the conversion from VCF to PLINK binary format.

Return type:: None

execute_rename_snpid(rename: bool = True) → None[source]

Executes the SNP ID renaming process using PLINK2.

This method renames SNP IDs in the PLINK binary files to a standardized format of ‘chr:pos:a1:a2’. The renaming is performed using PLINK2’s --set-all-var-ids parameter.

Parameters:: rename (bool, optional) – Flag to control whether SNP renaming should be performed. Defaults to True.
Return type:: None
Raises:: TypeError – If rename parameter is not a boolean.

Notes

The renamed files will be saved with a ‘-renamed’ suffix
Thread count is optimized based on available CPU cores
The new SNP ID format will be: chromosome:position:allele1:allele2
Sets self.renamed_snps to True if renaming is performed
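The target ID format and a matching PLINK2 call can be sketched as follows. The helper names and the file prefix are hypothetical, and the template `@:#:$r:$a` (chromosome, position, reference allele, alternate allele) is an assumption about how allele1 and allele2 are chosen; the library may use a different template.

```python
from typing import List


def make_variant_id(chrom: str, pos: int, allele1: str, allele2: str) -> str:
    """Build an ID of the form chromosome:position:allele1:allele2."""
    return f"{chrom}:{pos}:{allele1}:{allele2}"


def build_rename_command(bfile_prefix: str) -> List[str]:
    return [
        "plink2", "--bfile", bfile_prefix,
        "--set-all-var-ids", "@:#:$r:$a",  # assumed template, see lead-in
        "--make-bed",
        "--out", bfile_prefix + "-renamed",
    ]


variant_id = make_variant_id("1", 12345, "A", "G")
rename_cmd = build_rename_command("ready/cohort")
```

When this list is passed to `subprocess.run` without a shell, the `$r` and `$a` placeholders reach PLINK2 unchanged; in a shell script they would need quoting.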

execute_plink_conversion_pipeline(plink_params: dict) → None[source]

Execute the full PLINK conversion pipeline: VCF to PLINK binary and optional family info update.

This method orchestrates the conversion of a VCF file to PLINK binary format and optionally updates the .fam file with family information.

Parameters:

plink_params (dict) – Options for the conversion steps. The following keys are recognized:

double_id (bool, optional) – Whether to pass the --double-id flag to the plink2 command. Defaults to True.
for_fam_update_file (Path, optional) – Path to the family information file for updating the .fam file. If None, no update is performed.
threads (int, optional) – Number of CPU threads to use. If None, defaults to (available CPU cores - 2) or 10 if CPU count can’t be determined.
memory (int, optional) – Memory allocation in MB for plink2. If None, defaults to approximately 2/3 of available system memory.

Return type:

None