ideal_genom_qc.get_references

class ideal_genom_qc.get_references.Fetcher1000Genome(destination: Path = None, built: str = '38')

Bases: object

get_1000genomes(url_pgen: str = None, url_pvar: str = None, url_psam: str = None) Path

Download and decompress 1000 Genomes reference data.

This method downloads the PLINK2 binary files (.pgen, .pvar, .psam) for the 1000 Genomes reference dataset, corresponding to the specified genome build (37 or 38). If the files already exist in the destination directory, the download is skipped.

Parameters:

url_pgen (str, optional): Custom URL for downloading the .pgen file. If None, uses default URL based on genome build.

url_pvar (str, optional): Custom URL for downloading the .pvar file. If None, uses default URL based on genome build.

url_psam (str, optional): Custom URL for downloading the .psam file. If None, uses default URL based on genome build.

Returns:

Path: Path object pointing to the decompressed .pgen file location.

Note:

The method requires plink2 to be installed and accessible in the system path for decompressing the .pgen file.
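Because the download depends on an external plink2 executable, it can help to check for it before starting. The following is a minimal sketch, not part of the package: plink2_available and fetch_1000genomes are hypothetical helper names, and the sketch assumes the executable is called plink2 on your system. It only uses the constructor and method signatures documented above.

```python
import shutil
from pathlib import Path


def plink2_available(executable: str = "plink2") -> bool:
    """Return True if the named executable can be found on the system PATH."""
    return shutil.which(executable) is not None


def fetch_1000genomes(destination: Path, build: str = "38") -> Path:
    """Hypothetical wrapper: check for plink2, then download the reference files."""
    if not plink2_available():
        raise RuntimeError("plink2 was not found on PATH; it is needed to decompress the .pgen file")

    # Imported lazily so plink2_available() can be used without the package installed.
    from ideal_genom_qc.get_references import Fetcher1000Genome

    fetcher = Fetcher1000Genome(destination=destination, built=build)
    # No URLs are passed, so the defaults for the chosen build are used.
    return fetcher.get_1000genomes()
```

Calling fetch_1000genomes(Path("reference"), build="38") would then return the path to the decompressed .pgen file, or fail early with a clear message if plink2 is missing.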

get_1000genomes_binaries() Path

Convert downloaded 1000 Genomes data into PLINK binary files (.bed, .bim, .fam).

This method processes the downloaded 1000 Genomes data files and converts them into PLINK binary format. If the binary files already exist, it skips the conversion process. The method handles file cleanup and proper renaming of output files. The conversion is done in two steps:

  1. Convert the pfile to binary format, including only SNPs from chromosomes 1-22, X, Y, and MT.

  2. Update variant IDs and create the final binary files.
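The two conversion steps can be sketched as a pair of plink2 command lines. This is an illustration only and not necessarily what the package runs: conversion_commands is a hypothetical helper, the temporary prefix and the chr:pos:ref:alt ID template are assumptions, and real data may need further flags (for example to handle multiallelic variants or compressed .pvar files). The function only builds the argument lists; it does not execute anything.

```python
from pathlib import Path


def conversion_commands(pfile_prefix: Path, out_prefix: Path) -> list[list[str]]:
    """Build (but do not run) two plink2 commands mirroring the documented steps."""
    # Assumed name for the intermediate fileset written by step 1.
    tmp_prefix = Path(f"{out_prefix}_tmp")

    # Step 1: pfile -> bed/bim/fam, keeping only SNPs on chromosomes 1-22, X, Y, MT.
    step1 = [
        "plink2", "--pfile", str(pfile_prefix),
        "--snps-only", "just-acgt",
        "--chr", "1-22,X,Y,MT",
        "--make-bed", "--out", str(tmp_prefix),
    ]

    # Step 2: rewrite variant IDs (assumed template chr:pos:ref:alt) and write the final files.
    step2 = [
        "plink2", "--bfile", str(tmp_prefix),
        "--set-all-var-ids", "@:#:$r:$a",
        "--make-bed", "--out", str(out_prefix),
    ]
    return [step1, step2]
```

Each list could be passed to subprocess.run(cmd, check=True) in order; keeping command construction separate from execution makes the steps easy to inspect or log first.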

Returns:

Path object pointing to the generated binary files (without extension). The actual files created will be .bed, .bim, .fam and .psam with the same prefix.

Return type:

Path
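Since the returned Path is a prefix without an extension, the individual files have to be derived from it. The sketch below shows one way to do that and to confirm the conversion produced what is expected. The helper names binary_fileset and missing_binaries are hypothetical, and only the three core PLINK files (.bed, .bim, .fam) are checked. The suffix is appended as a string rather than with Path.with_suffix, so a prefix that already contains a dot is not truncated.

```python
from pathlib import Path

# Core PLINK 1 binary fileset extensions named in the method description.
BINARY_EXTENSIONS = (".bed", ".bim", ".fam")


def binary_fileset(prefix: Path) -> dict[str, Path]:
    """Map each extension to the file path that shares the given prefix."""
    return {ext: Path(f"{prefix}{ext}") for ext in BINARY_EXTENSIONS}


def missing_binaries(prefix: Path) -> list[Path]:
    """Return the expected files that do not exist on disk."""
    return [path for path in binary_fileset(prefix).values() if not path.is_file()]
```

A caller might run missing_binaries(fetcher.get_1000genomes_binaries()) and stop if the result is not empty.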

class ideal_genom_qc.get_references.FetcherLDRegions(destination: Path = None, built: str = '38')

Bases: object

get_ld_regions() Path

Downloads or creates high LD regions file based on genome build version.

This method handles the retrieval of high Linkage Disequilibrium (LD) regions for different genome builds (37 or 38). For build 37, it downloads the regions from a GitHub repository. For build 38, it creates the file from predefined coordinates.

Returns:

Path: Path to the created/downloaded LD regions file. Returns empty Path if download fails for build 37.

Raises:

None explicitly, but may raise standard I/O related exceptions.

Notes:

  • For build 37: Downloads from genepi-freiburg/gwas repository

  • For build 38: Creates file from hardcoded coordinates from GWAS-pipeline

  • Files are named as ‘high-LD-regions_GRCh{build}.txt’

  • Creates destination directory if it doesn’t exist
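Because a failed build 37 download is signalled by an empty Path rather than an exception, callers may want to validate the result explicitly. The following is a minimal sketch under that assumption; expected_ld_filename, ld_regions_usable, and fetch_ld_regions are hypothetical helpers, not part of the package.

```python
from pathlib import Path


def expected_ld_filename(build: str) -> str:
    """File name following the documented pattern high-LD-regions_GRCh{build}.txt."""
    return f"high-LD-regions_GRCh{build}.txt"


def ld_regions_usable(path: Path) -> bool:
    """True only for an existing, non-empty file.

    An empty Path() refers to the current directory, so it fails the is_file() check.
    """
    return path.is_file() and path.stat().st_size > 0


def fetch_ld_regions(destination: Path, build: str = "38") -> Path:
    """Hypothetical wrapper that turns the empty-Path failure signal into an exception."""
    # Imported lazily so the helpers above work without the package installed.
    from ideal_genom_qc.get_references import FetcherLDRegions

    fetcher = FetcherLDRegions(destination=destination, built=build)
    ld_file = fetcher.get_ld_regions()
    if not ld_regions_usable(ld_file):
        raise FileNotFoundError(f"No usable LD regions file for build {build} in {destination}")
    return ld_file
```

Raising on failure keeps downstream steps from silently running without the high-LD exclusion list.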