ideal_genom_qc.get_references
- class ideal_genom_qc.get_references.Fetcher1000Genome(destination: Path = None, built: str = '38')
Bases: object
- get_1000genomes(url_pgen: str = None, url_pvar: str = None, url_psam: str = None) Path
Download and decompress 1000 Genomes reference data. This method downloads the PLINK2 binary files (.pgen, .pvar, .psam) for the 1000 Genomes reference dataset, corresponding to the specified genome build (37 or 38). If the files already exist in the destination directory, the download is skipped.
Parameters:
- url_pgen (str, optional): Custom URL for downloading the .pgen file.
If None, uses default URL based on genome build.
- url_pvar (str, optional): Custom URL for downloading the .pvar file.
If None, uses default URL based on genome build.
- url_psam (str, optional): Custom URL for downloading the .psam file.
If None, uses default URL based on genome build.
Returns:
Path: Path object pointing to the decompressed .pgen file location.
Note:
The method requires plink2 to be installed and accessible in the system path for decompressing the .pgen file.
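The skip-if-present behaviour and the plink2 requirement described above can be sketched roughly as follows. This is an illustration only, not the library's implementation: `missing_pfiles`, `plink2_available`, the `reference_data` directory, and the `all_phase3` file prefix are hypothetical names chosen for the example.

```python
from __future__ import annotations

import shutil
from pathlib import Path

# Extensions of the PLINK2 file set named in the documentation above.
PFILE_EXTENSIONS = (".pgen", ".pvar", ".psam")


def missing_pfiles(destination: Path, prefix: str) -> list[Path]:
    """Return the PLINK2 files that are not yet present in `destination`."""
    candidates = [destination / f"{prefix}{ext}" for ext in PFILE_EXTENSIONS]
    return [path for path in candidates if not path.exists()]


def plink2_available() -> bool:
    """Check whether a `plink2` executable can be found on the system path."""
    return shutil.which("plink2") is not None


if __name__ == "__main__":
    dest = Path("reference_data")  # hypothetical destination directory
    todo = missing_pfiles(dest, "all_phase3")  # hypothetical file prefix
    if not todo:
        print("All files present, download can be skipped.")
    else:
        print("Files still to download:", [p.name for p in todo])
    if not plink2_available():
        print("plink2 not found on PATH; decompression would fail.")
```

Checking for plink2 before starting a large download is a design choice of this sketch; the documentation only states that plink2 must be available for the decompression step.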
- get_1000genomes_binaries() Path
Convert downloaded 1000 Genomes data into PLINK binary files (.bed, .bim, .fam). This method processes the downloaded 1000 Genomes data files and converts them into PLINK binary format. If the binary files already exist, it skips the conversion process. The method handles file cleanup and proper renaming of output files. The conversion is done in two steps:
1. Convert pfile to binary format including only SNPs from chromosomes 1-22, X, Y, MT.
2. Update variant IDs and create final binary files.
- Returns:
Path object pointing to the generated binary files (without extension). The actual files created will be .bed, .bim, .fam and .psam with the same prefix.
- Return type:
Path
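As a rough sketch of what a two-step conversion like this could look like on the plink2 command line, the example below builds the two argument lists without running them. The flags used (`--pfile`, `--snps-only`, `--chr`, `--make-bed`, `--bfile`, `--set-all-var-ids`, `--out`) are standard plink2 options, but whether this method uses exactly these flags is an assumption. The `chr:pos:ref:alt` ID template, the `_tmp` intermediate prefix, and the helper name `build_conversion_commands` are likewise hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def build_conversion_commands(pfile_prefix: Path, out_prefix: Path) -> list[list[str]]:
    """Return two plink2 invocations as argument lists (nothing is executed)."""
    # Hypothetical name for the intermediate file set written by step 1.
    tmp_prefix = out_prefix.with_name(out_prefix.name + "_tmp")

    # Step 1: pfile -> bed/bim/fam, keeping only SNPs on chromosomes 1-22, X, Y, MT.
    step1 = [
        "plink2",
        "--pfile", str(pfile_prefix),
        "--snps-only",
        "--chr", "1-22,X,Y,MT",
        "--make-bed",
        "--out", str(tmp_prefix),
    ]

    # Step 2: rewrite variant IDs (assumed template chr:pos:ref:alt) and
    # write the final binary file set.
    step2 = [
        "plink2",
        "--bfile", str(tmp_prefix),
        "--set-all-var-ids", "@:#:$r:$a",
        "--make-bed",
        "--out", str(out_prefix),
    ]
    return [step1, step2]


if __name__ == "__main__":
    # Hypothetical prefixes, used only to show the resulting commands.
    for cmd in build_conversion_commands(Path("data/all_phase3"), Path("data/1kG_ref")):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once plink2 is installed; keeping the commands as lists avoids shell quoting problems with the `$r` and `$a` placeholders.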
- class ideal_genom_qc.get_references.FetcherLDRegions(destination: Path = None, built: str = '38')
Bases: object
- get_ld_regions() Path
Downloads or creates high LD regions file based on genome build version. This method handles the retrieval of high Linkage Disequilibrium (LD) regions for different genome builds (37 or 38). For build 37, it downloads the regions from a GitHub repository. For build 38, it creates the file from predefined coordinates.
Returns:
Path: Path to the created or downloaded LD regions file. Returns an empty Path if the download fails for build 37.
Raises:
None explicitly, but may raise standard I/O related exceptions.
Notes:
For build 37: Downloads from genepi-freiburg/gwas repository
For build 38: Creates file from hardcoded coordinates from GWAS-pipeline
Files are named as ‘high-LD-regions_GRCh{build}.txt’
Creates destination directory if it doesn’t exist
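The naming and directory-creation behaviour in the notes above can be sketched as below. This is not the library's code: `ld_regions_path` and `write_regions` are hypothetical helpers, the build check is a guard added for the sketch, and the one-region-per-line layout (chromosome, start, end, label) is an assumption about the file format. No real LD coordinates are included; the caller supplies them.

```python
from __future__ import annotations

from pathlib import Path


def ld_regions_path(destination: Path, build: str) -> Path:
    """Return the LD regions file path, creating `destination` if needed."""
    if build not in ("37", "38"):
        # Guard added for this sketch; the documented builds are 37 and 38.
        raise ValueError(f"unsupported genome build: {build!r}")
    destination.mkdir(parents=True, exist_ok=True)
    return destination / f"high-LD-regions_GRCh{build}.txt"


def write_regions(path: Path, regions: list[tuple[str, int, int, str]]) -> Path:
    """Write one region per line: chromosome, start, end, label (assumed layout)."""
    lines = [f"{chrom} {start} {end} {label}" for chrom, start, end, label in regions]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```

For build 38, a caller would pass its predefined coordinates to `write_regions`; for build 37, the same path would instead be the download target.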