pysradb package
Submodules
pysradb.basedb module
- class pysradb.basedb.BASEdb(sqlite_file)
 Bases: object
- all_row_counts()
 Get row counts of all tables in the db file.
- Returns
 - row_counts: DataFrame
 A dataframe with table names and corresponding row count.
- desc_table(table)[source]¶
 Describe all fields in a table.
- Parameters
 - table: string
 Table name. See list_tables for getting all table names
- Returns
 - table_desc: DataFrame
 A DataFrame with field name and its schema description
- get_row_count(table)[source]¶
 Get row counts for a table.
- Parameters
 - table: string
 Table name. See list_tables for getting all table names
- Returns
 - row_count: int
 Number of rows in table
- list_fields(table)[source]¶
 List all fields in a given table.
- Parameters
 - table: string
 Table name. See list_tables for getting all table names
- Returns
 - field_list: list
 A list of field names for the table
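The BASEdb helpers above wrap an SQLite file. As a rough illustration of what such helpers can look like, here is a minimal sketch built only on the standard-library sqlite3 module. It is not pysradb's implementation: the in-memory `study` table and its columns are hypothetical, and the function bodies are assumptions about one reasonable way to list tables, list fields, and count rows.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in the database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def list_fields(conn, table):
    """Return the column names of `table` via PRAGMA table_info."""
    # Table names cannot be bound as SQL parameters, so validate them first.
    if table not in list_tables(conn):
        raise ValueError(f"unknown table: {table}")
    # Each PRAGMA row is (cid, name, type, notnull, dflt_value, pk).
    return [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]


def get_row_count(conn, table):
    """Return the number of rows in `table`."""
    if table not in list_tables(conn):
        raise ValueError(f"unknown table: {table}")
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


# Hypothetical in-memory database standing in for a real metadata file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE study (study_accession TEXT, study_title TEXT)")
conn.executemany(
    "INSERT INTO study VALUES (?, ?)",
    [("SRP0000001", "first study"), ("SRP0000002", "second study")],
)

tables = list_tables(conn)
fields = list_fields(conn, "study")
count = get_row_count(conn, "study")
```

Validating the table name against the list of known tables is a design choice of this sketch; it avoids interpolating arbitrary text into SQL.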
pysradb.cli module
Command line interface for pysradb
- class pysradb.cli.ArgParser(prog=None, usage=None, description=None, epilog=None, parents=[], formatter_class=<class 'argparse.HelpFormatter'>, prefix_chars='-', fromfile_prefix_chars=None, argument_default=None, conflict_handler='error', add_help=True, allow_abbrev=True)[source]¶
 Bases: argparse.ArgumentParser
- class pysradb.cli.CustomFormatterArgP(prog, indent_increment=2, max_help_position=24, width=None)[source]¶
 Bases: argparse.ArgumentDefaultsHelpFormatter, argparse.RawDescriptionHelpFormatter
pysradb.download module
Utility function to download data
- pysradb.download.download_file(url, file_path, md5_hash=None, timeout=10, block_size=1048576, show_progress=False)[source]¶
 Resumable download. Expects the server to support byte ranges.
- Parameters
 - url: string
 URL
- file_path: string
 Local file path to store the downloaded file
- md5_hash: string
 Expected MD5 string of downloaded file
- timeout: int
 Seconds to wait before terminating request
- block_size: int
 Size in bytes of each chunk to read (default: 1024 * 1024 = 1 MB)
- show_progress: bool
 Show progress bar
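To make the "resumable" and MD5 parts of download_file concrete, the following sketch shows the two building blocks such a download typically needs: a Range header derived from the bytes already on disk, and a chunked MD5 check. The helper names `range_header` and `md5_matches` are hypothetical and are not part of pysradb; the actual network request is deliberately left out.

```python
import hashlib
import os
import tempfile


def range_header(file_path):
    """Build a Range header that resumes after the bytes already on disk."""
    offset = os.path.getsize(file_path) if os.path.exists(file_path) else 0
    # "bytes=N-" asks the server for everything from byte N onwards.
    return {"Range": f"bytes={offset}-"} if offset else {}


def md5_matches(file_path, md5_hash, block_size=1024 * 1024):
    """Compare the file's MD5 digest with the expected hex string."""
    digest = hashlib.md5()
    with open(file_path, "rb") as handle:
        # Read in fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(block_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == md5_hash.lower()


with tempfile.TemporaryDirectory() as tmp_dir:
    partial = os.path.join(tmp_dir, "partial.bin")
    with open(partial, "wb") as handle:
        handle.write(b"hello")
    headers = range_header(partial)           # resume after 5 bytes
    missing = range_header(partial + ".new")  # nothing on disk yet
    ok = md5_matches(partial, hashlib.md5(b"hello").hexdigest())
```

A server that does not support byte ranges may ignore the header and send the whole file, which is why the docstring states the requirement explicitly.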
- pysradb.download.get_file_size(row, url_col)[source]¶
 Get size of file to be downloaded.
- Parameters
 - row: pd.DataFrame row
 - url_col: str
 url_column
- Returns
 - content_length: int
 
pysradb.exceptions module
This file contains custom Exceptions for pysradb
pysradb.filter_attrs module
- pysradb.filter_attrs.expand_sample_attribute_columns(metadata_df)[source]¶
 Expand sample attribute columns to individual columns.
Since the sample_attribute column content can differ between rows, even for rows from the same project (SRP), we explicitly iterate through the rows first to determine which additional columns need to be created.
- Parameters
 - metadata_df: DataFrame
 Dataframe as obtained from sra_metadata or equivalent
- Returns
 - expanded_df: DataFrame
 Dataframe with additional columns pertaining to sample_attribute appended
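The two-pass idea described for expand_sample_attribute_columns can be sketched with plain dictionaries. This is an illustration only, not pysradb's code: it assumes each sample_attribute value looks like `key: value || key: value`, and the helper names and example rows are hypothetical.

```python
def parse_sample_attribute(sample_attribute, sep=" || "):
    """Split one sample_attribute string into a {key: value} dict."""
    pairs = {}
    for item in (sample_attribute or "").split(sep):
        key, found, value = item.partition(": ")
        if found:  # skip fragments that have no "key: value" shape
            pairs[key.strip()] = value.strip()
    return pairs


def expand_rows(rows):
    """Append one column per attribute key seen in any row."""
    parsed = [parse_sample_attribute(row.get("sample_attribute")) for row in rows]
    # First pass: collect the union of keys, keeping first-seen order.
    columns = []
    for attrs in parsed:
        for key in attrs:
            if key not in columns:
                columns.append(key)
    # Second pass: fill every row, using None where a key is absent.
    expanded = []
    for row, attrs in zip(rows, parsed):
        new_row = dict(row)
        for column in columns:
            new_row[column] = attrs.get(column)
        expanded.append(new_row)
    return expanded


# Hypothetical metadata rows from the same project.
rows = [
    {"run_accession": "SRR0000001",
     "sample_attribute": "source_name: liver || cell type: hepatocyte"},
    {"run_accession": "SRR0000002",
     "sample_attribute": "source_name: brain || treatment: none"},
]
expanded = expand_rows(rows)
```

Collecting the keys before filling any row is what guarantees that every output row has the same set of columns.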
- pysradb.filter_attrs.guess_cell_type(sample_attribute)[source]¶
 Guess possible cell type from sample_attribute data.
- Parameters
 - sample_attribute: string
 sample_attribute string as in the metadata column
- Returns
 - cell_type: string
 Possible cell type of sample. Returns None if no match found.
pysradb.geodb module
Methods to interact with GEO
- class pysradb.geodb.GEOdb(sqlite_file)
 Bases: pysradb.basedb.BASEdb
- geo_convert(from_acc)
 Convert one GEO accession to another.
- Parameters
 - from_acc: string
 GPL/GSE/GSM accession ID
- Returns
 - mapping_df: DataFrame
 A dataframe with relevant mappings
- gse_metadata(gse)[source]¶
 Get metadata for GSE ID.
- Parameters
 - gse: string
 GSE ID
- Returns
 - metadata_df: DataFrame
 A dataframe with relevant fields
- gse_to_gsm(gse)[source]¶
 Fetch GSMs for a GSE.
- Parameters
 - gse: string
 GSE ID
- Returns
 - mapping_df: DataFrame
 A dataframe with relevant mappings
- gsm_metadata(gsm)[source]¶
 Get metadata for GSM ID.
- Parameters
 - gsm: string
 GSM ID
- Returns
 - metadata_df: DataFrame
 A dataframe with relevant fields
pysradb.search module
This file contains the search classes for the search feature.
- class pysradb.search.EnaSearch(verbosity=2, return_max=20, query=None, accession=None, organism=None, layout=None, mbases=None, publication_date=None, platform=None, selection=None, source=None, strategy=None, title=None, suppress_validation=False)[source]¶
 Bases: pysradb.search.QuerySearch
Subclass of QuerySearch that implements search by querying the ENA API.
See also
QuerySearch: Superclass of EnaSearch
Methods
search()
Sends the user query via requests to the ENA API and stores the search result as an instance attribute in the form of a pandas dataframe.
show_result_statistics()
Shows summary information about search results.
visualise_results()
Generate graphs that visualise the search results.
get_plot_objects()
Get the plot objects for plots generated.
_format_query_string()
formats the input user query into a string
_format_request()
formats the request payload
_format_result(content)
formats the search query output and converts it into a pandas dataframe
- class pysradb.search.GeoSearch(verbosity=2, return_max=20, query=None, accession=None, organism=None, layout=None, mbases=None, publication_date=None, platform=None, selection=None, source=None, strategy=None, title=None, geo_query=None, geo_dataset_type=None, geo_entry_type=None, suppress_validation=False)[source]¶
 Bases: pysradb.search.SraSearch
Subclass of SraSearch that can query both the GEO DataSets and SRA APIs.
See also
GeoSearch.info: GeoSearch usage details
SraSearch: Superclass of GeoSearch
QuerySearch: Superclass of SraSearch
Methods
search()
Sends the user query via requests to SRA, GEO DataSets, or both, depending on the search query. If the query is sent to both APIs, the intersection of the two sets of query results is returned.
show_result_statistics()
Shows summary information about search results.
visualise_results()
Generate graphs that visualise the search results.
get_plot_objects()
Get the plot objects for plots generated.
_format_geo_query_string()
formats the GEO DataSets portion of the input user query into a string.
_format_geo_request()
formats the GEO DataSets request payload
_format_result(content)
formats the search query output and converts it into a pandas dataframe
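The intersection behaviour described for GeoSearch.search() can be pictured with a small sketch. It is an assumption-laden illustration, not the library's logic: the function name, the choice of `run_accession` as the join key, and the example rows are all hypothetical.

```python
def intersect_by_accession(sra_rows, geo_rows, key="run_accession"):
    """Keep only the SRA rows whose accession also appears in the GEO rows."""
    geo_keys = {row[key] for row in geo_rows}
    return [row for row in sra_rows if row[key] in geo_keys]


# Hypothetical result sets returned by the two APIs.
sra_rows = [
    {"run_accession": "SRR0000001"},
    {"run_accession": "SRR0000002"},
    {"run_accession": "SRR0000003"},
]
geo_rows = [
    {"run_accession": "SRR0000002"},
    {"run_accession": "SRR0000003"},
    {"run_accession": "SRR0000004"},
]
common = intersect_by_accession(sra_rows, geo_rows)
```

Only entries present in both result sets survive, so a query sent to both APIs can return fewer rows than either API returns alone.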
- class pysradb.search.QuerySearch(verbosity=2, return_max=20, query=None, accession=None, organism=None, layout=None, mbases=None, publication_date=None, platform=None, selection=None, source=None, strategy=None, title=None, suppress_validation=False)[source]¶
 Bases: object
This is the base class for the search feature.
This class takes as input the user’s search query, which has been tokenized by the ArgParser. The query will be sent to either SRA or ENA depending on the user’s input, and the results will be returned as a pandas dataframe.
- Parameters
 - verbosity: integer
 The level of detail of the search result.
- return_max: int
 The maximum number of entries to be returned.
- query: str
 The main query string.
- accession: str
 A relevant study / experiment / sample / run accession number.
- organism: str
 Scientific name of the sample organism
- layout: str
 Library layout. Possible inputs: single, paired
- mbases: int
 Size of the sample of interest rounded to the nearest megabase.
- publication_date: str
 The publication date of the run in the format dd-mm-yyyy. If a date range is desired, input should be in the format dd-mm-yyyy:dd-mm-yyyy
- platform: str
 Sequencing platform used for the run. Some possible inputs include: illumina, ion torrent, oxford nanopore
- selection: str
 Library selection. Some possible inputs: cdna, chip, dnase, pcr
- source: str
 Library source. Some possible inputs: genomic, metagenomic, transcriptomic
- strategy: str
 Library preparation strategy. Some possible inputs: wgs, amplicon, rna seq
- title: str
 Title of the experiment associated with the run
- suppress_validation: bool
 Defaults to False. If this is set to True, the user input format checks will be skipped. Setting this to True may cause the program to behave in unexpected ways, but allows the user to search queries that do not pass the format check.
- Attributes
 - self.df: Pandas DataFrame
 The search result belonging to this search instance
Methods
get_df()
Returns the dataframe storing this search result.
search()
Executes the search.
show_result_statistics()
Shows summary information about search results.
visualise_results()
Generate graphs that visualise the search results.
get_plot_objects()
Get the plot objects for plots generated.
- visualise_results(graph_types=('all',), show=False, saveto='./search_plots/')[source]¶
 Generate graphs that visualise the search results.
This method will only work if the optional dependency, matplotlib, is installed in the system.
- Parameters
 - graph_types: tuple
 Tuple of strings naming the types of graphs to generate. Possible strings: all, daterange, organism, source, selection, platform, basecount
- saveto: str
 Directory where the generated graphs are saved.
- show: bool
 Whether plotted graphs are shown immediately.
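The publication_date parameter of QuerySearch accepts either a single date (dd-mm-yyyy) or a range (dd-mm-yyyy:dd-mm-yyyy). The sketch below shows one way such input could be parsed and checked with the standard library. The function `parse_publication_date` is hypothetical and is not the validation pysradb performs.

```python
from datetime import datetime


def parse_publication_date(value):
    """Parse 'dd-mm-yyyy' or 'dd-mm-yyyy:dd-mm-yyyy' into (start, end) dates."""
    parts = value.split(":")
    if len(parts) not in (1, 2):
        raise ValueError(f"expected one date or a start:end range, got {value!r}")
    # strptime raises ValueError when a part does not match dd-mm-yyyy.
    dates = [datetime.strptime(part, "%d-%m-%Y").date() for part in parts]
    start, end = dates[0], dates[-1]
    if start > end:
        raise ValueError("start date is after end date")
    return start, end


single = parse_publication_date("01-02-2020")
date_range = parse_publication_date("01-01-2019:31-12-2019")
```

A single date is returned as a range whose start and end coincide, which keeps the return type uniform for callers.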
- class pysradb.search.SraSearch(verbosity=2, return_max=20, query=None, accession=None, organism=None, layout=None, mbases=None, publication_date=None, platform=None, selection=None, source=None, strategy=None, title=None, suppress_validation=False)[source]¶
 Bases: pysradb.search.QuerySearch
Subclass of QuerySearch that implements search by querying the NCBI Entrez API.
See also
QuerySearch: Superclass of SraSearch
Methods
search()
Sends the user query via requests to the NCBI Entrez API and returns search results as a pandas dataframe.
show_result_statistics()
Shows summary information about search results.
visualise_results()
Generate graphs that visualise the search results.
get_plot_objects()
Get the plot objects for plots generated.
get_uids()
Get NCBI uids retrieved during this search query.
_format_query_string()
formats the input user query into a string
_format_request()
formats the request payload
_format_result(content)
formats the search query output.
pysradb.sradb module
Methods to interact with SRA
- class pysradb.sradb.SRAdb(sqlite_file)
 Bases: pysradb.basedb.BASEdb
- download(srp=None, df=None, url_col=None, out_dir=None, filter_by_srx=[], use_ascp=False, ascp_dir=None, ascp_bin=None, skip_confirmation=False, threads=1)
 Download SRA files.
- Parameters
 - srp: string
 SRP ID (optional)
- df: Dataframe
 A dataframe as obtained from sra_metadata
- url_col: string
 Column of df to use for downloading
- out_dir: string
 Directory location for download
- filter_by_srx: list
 List of SRX ids to filter
- protocol: string
 [‘fasp’/’ftp’] fasp => faster download, ftp => slower
- ascp_dir: string
 Location of ascp directory
- gse_to_gsm(gses, sample_attribute=False, detailed=False, expand_sample_attributes=False)[source]¶
 Convert GSE to GSM
- Parameters
 - gses: string or list
 List of GSE ID
- Returns
 - gse_to_gsm_df: DataFrame
 
- gse_to_srp(gses, sample_attribute=False, detailed=False, expand_sample_attributes=False)[source]¶
 Convert GSE to SRP
- Parameters
 - gses: string or list
 List of GSE ID
- Returns
 - gse_to_srp_df: DataFrame
 
- gsm_to_gse(gsms, sample_attribute=False, detailed=False, expand_sample_attributes=False)[source]¶
 Convert GSM to GSE
- Parameters
 - gsms: string or list
 List of GSM ID
- Returns
 - gsm_to_gse_df: DataFrame
 
- gsm_to_srp(gsms, sample_attribute=False, detailed=False, expand_sample_attributes=False)[source]¶
 Convert GSM to SRP.
- Parameters
 - gsms: string or list
 List of GSM ID
- Returns
 - gsm_to_srp_df: DataFrame
 
- gsm_to_srr(gsms, sample_attribute=False, detailed=False, expand_sample_attributes=False)[source]¶
 Convert GSMs to SRR.
- Parameters
 - gsms: string or list
 List of GSM id
- sample_attribute: bool
 Include sample_attribute column
- Returns
 - gsm_to_srr_df: DataFrame
 DataFrame with two columns for GSM/SRR
- gsm_to_srs(gsms, sample_attribute=False, detailed=False, expand_sample_attributes=False)[source]¶
 Convert GSM to SRS.
- Parameters
 - gsms: list or string
 List of gsms
- Returns
 - gsm_to_srs_df: DataFrame
 
- gsm_to_srx(gsms, sample_attribute=False, detailed=False, expand_sample_attributes=False)[source]¶
 Convert GSM to SRX.
- Parameters
 - gsms: string or list
 List of GSM IDs
- Returns
 - gsm_to_srx_df: DataFrame
 
- search_by_expt_id(srx)[source]¶
 Search for a SRX/GSM id in the experiments.
- Parameters
 - srx: string
 SRX (experiment_accession) ID
- Returns
 - results: dict
 Dictionary with relevant hits
- search_sra(search_str, out_type=['study_accession', 'experiment_accession', 'sample_accession', 'run_accession'], assay=False, sample_attribute=False, detailed=False, expand_sample_attributes=False, output_read_lengths=False)[source]¶
 Search SRA for any search term.
- Parameters
 - search_str: string
 SQL like text string to search. SQL like text => For example, terms in quotes “” enforce an exact search.
- Returns
 - query_df: DataFrame
 Dataframe with relevant query results
- sra_metadata(acc, out_type=['study_accession', 'experiment_accession', 'sample_accession', 'run_accession'], assay=False, sample_attribute=False, detailed=False, expand_sample_attributes=False, output_read_lengths=False, acc_is_searchstr=False)[source]¶
 Get metadata for the provided SRA accession.
- Parameters
 - acc: string or list
 SRA accession ID
- out_type: list
 List of columns to output
- assay: bool
 True if assay should be outputted
- sample_attribute: bool
 True if sample_attribute should be outputted
- detailed: bool
 True if full metadata tables should be outputted
- expand_sample_attributes: bool
 Should sample_attribute column be expanded?
- output_read_lengths: bool
 True if read lengths should be calculated
- acc_is_searchstr: bool
 True if acc is a search string
- Returns
 - metadata_df: DataFrame
 A dataframe with all relevant columns
- srp_to_gse(srp, sample_attribute=False, detailed=False, expand_sample_attributes=False)[source]¶
 Convert SRP to GSE
- Parameters
 - srp: string
 SRP ID
- Returns
 - srp_to_gse_df: DataFrame
 
- srp_to_srr(srp, sample_attribute=False, detailed=False, expand_sample_attributes=False)[source]¶
 Convert SRP to SRR.
- Parameters
 - srp: string
 SRP ID
- Returns
 - srp_to_srr_df: DataFrame
 
- srp_to_srs(srp, sample_attribute=False, detailed=False, expand_sample_attributes=False)[source]¶
 Convert SRP to SRS.
- Parameters
 - srp: string
 SRP ID
- Returns
 - srp_to_srs_df: DataFrame
 DataFrame with two columns for SRS
- srp_to_srx(srp, sample_attribute=False, detailed=False, expand_sample_attributes=False)[source]¶
 Convert SRP to SRX/SRR.
- Parameters
 - srp: string
 SRP ID
- Returns
 - srp_to_srx_df: DataFrame
 DataFrame with two columns for SRX/SRR
- srr_to_gsm(srrs, sample_attribute=False, detailed=False, expand_sample_attributes=False)[source]¶
 Convert SRR to GSM
- Parameters
 - srrs: string or list
 List of SRR
- Returns
 - srr_to_gsm_df: DataFrame
 
- srr_to_srp(srrs, sample_attribute=False, detailed=False, expand_sample_attributes=False)[source]¶
 Convert SRR to SRP.
- Parameters
 - srrs: list of string
 List of SRR IDs
- Returns
 - srr_to_srp_df: DataFrame
 
- srr_to_srs(srrs, sample_attribute=False, detailed=False, expand_sample_attributes=False)[source]¶
 Convert SRR to SRS.
- Parameters
 - srrs: list of string
 List of SRR IDs
- Returns
 - srr_to_srs_df: DataFrame
 
- srr_to_srx(srrs, sample_attribute=False, detailed=False, expand_sample_attributes=False)[source]¶
 Convert SRR to SRX.
- Parameters
 - srrs: string or list
 List of SRR id
- sample_attribute: bool
 Include sample_attribute column
- Returns
 - srr_to_srx_df: DataFrame
 DataFrame with two columns for SRX/SRR
- srs_to_gsm(srss, sample_attribute=False, detailed=False, expand_sample_attributes=False)[source]¶
 Convert SRS to GSM.
- Parameters
 - srss: list or string
 List of SRS ID
- Returns
 - srs_to_gsm_df: DataFrame
 
- srs_to_srx(srss, sample_attribute=False, detailed=False, expand_sample_attributes=False)[source]¶
 Convert SRS to SRX.
- Parameters
 - srss: list or string
 List of SRS IDs
- Returns
 - srs_to_srx_df: DataFrame
 
- srx_to_srp(srxs, sample_attribute=False, detailed=False, expand_sample_attributes=False)[source]¶
 Convert SRXs to SRP.
- Parameters
 - srxs: string or list
 List of SRX id
- sample_attribute: bool
 Include sample_attribute column
- Returns
 - srx_to_srp_df: DataFrame
 DataFrame with two columns for SRX
- pysradb.sradb.download_sradb_file(download_dir='/github/workspace/docs', overwrite=True, keep_gz=False)[source]¶
 Download SRAmetadb.sqlite file.
- Parameters
 - download_dir: string
 Directory to download SRAmetadb.sqlite
- overwrite: bool
 Overwrite existing file(s). Set to True by default.
- keep_gz: bool
 Keep the .gz file after extraction is complete. Set to False by default, in which case the .gz file is deleted.
pysradb.sraweb module
Utilities to interact with SRA online
- class pysradb.sraweb.SRAweb(api_key=None)
 Bases: pysradb.sradb.SRAdb
- fetch_ena_fastq(srp)
 Fetch FASTQ records from ENA (EXPERIMENTAL)
- Parameters
 - srp: string
 Study accession
- Returns
 - srr_url: list
 List of SRR fastq urls
- static format_xml(string)[source]¶
 Wrap 'string' in a fake root element to make it valid XML.
- Parameters
 - string: str
 
- Returns
 - xml: str
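The "fake root" idea behind format_xml can be illustrated as follows. An XML document must have exactly one top-level element, so a fragment with several sibling elements only parses once it is wrapped. The helper `wrap_in_root`, the tag name `root`, and the fragment itself are hypothetical; this is a sketch, not the method's actual body.

```python
import xml.etree.ElementTree as ET


def wrap_in_root(fragment, root_tag="root"):
    """Surround an XML fragment with a single enclosing element."""
    return f"<{root_tag}>{fragment}</{root_tag}>"


# Two sibling elements: not a valid document on their own.
fragment = "<Run acc='SRR0000001'/><Run acc='SRR0000002'/>"

try:
    ET.fromstring(fragment)
    fragment_is_valid = True
except ET.ParseError:
    fragment_is_valid = False

# With the fake root in place the parser accepts the text.
root = ET.fromstring(wrap_in_root(fragment))
accessions = [run.get("acc") for run in root]
```

Note that this simple wrapping assumes the fragment does not start with its own XML declaration, which would have to be removed first.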
 
- gse_to_gsm(gse, **kwargs)[source]¶
 Convert GSE to GSM
- Parameters
 - gse: string or list
 List of GSE ID
- Returns
 - gse_to_gsm_df: DataFrame
 
- gse_to_srp(gse, **kwargs)[source]¶
 Convert GSE to SRP
- Parameters
 - gse: string or list
 List of GSE ID
- Returns
 - gse_to_srp_df: DataFrame
 
- gsm_to_gse(gsm, **kwargs)[source]¶
 Convert GSM to GSE
- Parameters
 - gsm: string or list
 List of GSM ID
- Returns
 - gsm_to_gse_df: DataFrame
 
- gsm_to_srp(gsm, **kwargs)[source]¶
 Convert GSM to SRP.
- Parameters
 - gsm: string or list
 List of GSM ID
- Returns
 - gsm_to_srp_df: DataFrame
 
- gsm_to_srr(gsm, **kwargs)[source]¶
 Convert GSMs to SRR.
- Parameters
 - gsm: string or list
 List of GSM id
- sample_attribute: bool
 Include sample_attribute column
- Returns
 - gsm_to_srr_df: DataFrame
 DataFrame with two columns for GSM/SRR
- sra_metadata(srp, sample_attribute=False, detailed=False, expand_sample_attributes=False, output_read_lengths=False, **kwargs)[source]¶
 Get metadata for the provided SRA accession.
- Parameters
 - acc: string or list
 SRA accession ID
- out_type: list
 List of columns to output
- assay: bool
 True if assay should be outputted
- sample_attribute: bool
 True if sample_attribute should be outputted
- detailed: bool
 True if full metadata tables should be outputted
- expand_sample_attributes: bool
 Should sample_attribute column be expanded?
- output_read_lengths: bool
 True if read lengths should be calculated
- acc_is_searchstr: bool
 True if acc is a search string
- Returns
 - metadata_df: DataFrame
 A dataframe with all relevant columns
pysradb.taxid2name module
pysradb.utils module
- class pysradb.utils.TqdmUpTo(*_, **__)
 Bases: tqdm.std.tqdm
Class-based tqdm progress bar. Provides update_to(n), which uses tqdm.update(delta_n). Inspired by [twine#242](https://github.com/pypa/twine/pull/242), [here](https://github.com/pypa/twine/commit/42e55e06).
Credits: https://github.com/tqdm/tqdm/blob/69326b718905816bb827e0e66c5508c9c04bc06c/examples/tqdm_wget.py
- pysradb.utils.confirm(preceeding_text)[source]¶
 Confirm user input.
- Parameters
 - preceeding_text: str
 Text to print
- Returns
 - response: bool
 
- pysradb.utils.copyfileobj(fsrc, fdst, bufsize=16384, filesize=None, desc='')[source]¶
 Copy file object with a progress bar.
- Parameters
 - fsrc: filehandle
 Input file handle
- fdst: filehandle
 Output file handle
- bufsize: int
 Length of output buffer
- filesize: int
 Input file size
- desc: string
 Description for tqdm status
- pysradb.utils.get_gzip_uncompressed_size(filepath)[source]¶
 Get uncompressed size of a .gz file
- Parameters
 - filepath: string
 Path to input file
- Returns
 - filesize: int
 Uncompressed file size
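One common way to obtain the uncompressed size of a .gz file without decompressing it is to read the 4-byte ISIZE trailer that the gzip format stores at the end of the file. Whether get_gzip_uncompressed_size works this way is an assumption; the sketch below, with the hypothetical name `gzip_uncompressed_size`, only demonstrates the technique.

```python
import gzip
import os
import struct
import tempfile


def gzip_uncompressed_size(filepath):
    """Read the ISIZE trailer: uncompressed size modulo 2**32, little-endian."""
    with open(filepath, "rb") as handle:
        handle.seek(-4, os.SEEK_END)  # the last four bytes of the file
        (size,) = struct.unpack("<I", handle.read(4))
    return size


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.gz")
    with gzip.open(path, "wb") as handle:
        handle.write(b"x" * 1000)
    size = gzip_uncompressed_size(path)
```

Because ISIZE is stored modulo 2**32, the value is only reliable for single-member archives whose uncompressed content is smaller than 4 GiB.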
- pysradb.utils.mkdir_p(path)[source]¶
 Python version of mkdir -p.
- Parameters
 - path: string
 Path to directory to create
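In current Python versions the behaviour of `mkdir -p` (create missing parents, do not fail if the directory exists) is available through `os.makedirs` with `exist_ok=True`. The sketch below shows that equivalent; it is an assumption about the semantics, not a copy of the function in pysradb.utils.

```python
import os
import tempfile


def mkdir_p(path):
    """Create `path` and any missing parents; succeed if it already exists."""
    os.makedirs(path, exist_ok=True)


with tempfile.TemporaryDirectory() as tmp_dir:
    nested = os.path.join(tmp_dir, "a", "b", "c")
    mkdir_p(nested)
    mkdir_p(nested)  # calling it again on an existing directory is a no-op
    created = os.path.isdir(nested)
```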
- pysradb.utils.order_dataframe(df, columns)[source]¶
 Order a dataframe.
Moves the given columns to the front; the remaining columns follow.
- Parameters
 - df: Dataframe
 Dataframe
- columns: list
 List of columns that need to be put in front
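The column reordering described for order_dataframe boils down to computing a new column order. The sketch below does this on plain lists so that it runs without pandas; the function name `front_first` is hypothetical, and keeping the remaining columns in their original order is an assumption.

```python
def front_first(all_columns, front_columns):
    """Return all_columns reordered so that front_columns come first."""
    # Ignore requested columns that are not actually present.
    front = [column for column in front_columns if column in all_columns]
    rest = [column for column in all_columns if column not in front]
    return front + rest


ordered = front_first(
    ["run_accession", "study_accession", "experiment_accession", "sample_accession"],
    ["study_accession", "experiment_accession"],
)
```

With a pandas DataFrame `df`, the resulting list could be applied as `df[front_first(list(df.columns), columns)]`.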
- pysradb.utils.path_leaf(path)[source]¶
 Get path’s tail from a filepath.
- Parameters
 - path: string
 Filepath
- Returns
 - tail: string
 Filename
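A widely used recipe for extracting the tail of a path, regardless of whether it uses forward or backward slashes and whether it ends in a separator, relies on the standard-library ntpath module. The sketch below shows that recipe; that pysradb's path_leaf is written this way is an assumption.

```python
import ntpath


def path_leaf(path):
    """Return the last path component, tolerating a trailing separator."""
    head, tail = ntpath.split(path)
    # If the path ends with a separator, tail is empty: use the head's basename.
    return tail or ntpath.basename(head)


plain = path_leaf("downloads/SRP0000001/reads.fastq.gz")
trailing = path_leaf("downloads/SRP0000001/")
windows = path_leaf("C:\\data\\reads.fastq.gz")
```

Using ntpath rather than os.path means both separator styles are understood on any operating system.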
- pysradb.utils.requests_3_retries()[source]¶
 Generates a requests session object that allows 3 retries.
- Returns
 - session: requests.Session
 requests session object that allows 3 retries for server-side errors.
Module contents
Top-level package for pysradb.