Tutorial

DBGen is a general-purpose database to support genomic data analysis studies. The current implementation is based on MongoDB.

Supported services

DBGen has been designed in order to provide user-friendly APIs for typical bioinformatic workflows. The following functionalities are currently supported:

  • extremely simple interface to the database backend

  • automatic import of source files containing minimal information about biological samples i.e. project name, run accession, FTP links to download the genotype, and sample phenotypes (if available)

  • automatic download of sample genotypes (using the provided FTP links)

  • simple interface to save the results (raw files) of any bioinformatic tool

  • a collection of methods to query the database efficiently

Backend database

Setup

To use this package, you need to download, install, and configure MongoDB on your machine. You can follow the instructions on the official MongoDB website.

Start and stop the backend

DBGen provides two methods to start and stop the backend database service, dbgen.start_db and dbgen.shutdown_db:

import dbgen

configs = dbgen.load_cfg()
dbgen.start_db(configs)
...
dbgen.shutdown_db(configs)
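
If an exception is raised between the start and shutdown calls, the backend service may be left running. The sketch below shows one way to guarantee the shutdown call. The helper backend_session is hypothetical (it is not part of DBGen); it only assumes that the start and stop functions each take the configs object, as shown above.

```python
from contextlib import contextmanager


@contextmanager
def backend_session(start, stop, configs):
    """Call start(configs), yield configs, and always call stop(configs).

    `backend_session` is a hypothetical helper, not a DBGen function.
    """
    start(configs)
    try:
        yield configs
    finally:
        # runs even if the body of the `with` block raises
        stop(configs)


# Stand-in callables so the sketch runs without a MongoDB instance.
calls = []
with backend_session(lambda c: calls.append("start"),
                     lambda c: calls.append("stop"),
                     {"port": 27017}):
    calls.append("work")
print(calls)
```

With DBGen installed, the same helper would presumably be used as `with backend_session(dbgen.start_db, dbgen.shutdown_db, configs): ...`.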

Customize DBGen

DBGen can be easily configured using command line parameters. There are 5 user parameters:

  • --password: user password

  • --database: database name (default: dbgen_test)

  • --host: host name (default: localhost)

  • --port: port (default: 27017)

  • --root-data-dir: root directory for input data (default: ./test/data)

You can load the configuration parameters directly in Python using DBGen. The method load_cfg returns an object of class argparse.Namespace:

import dbgen

configs = dbgen.load_cfg()

The configs object will be required by other DBGen methods to update the database.
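
To make the shape of the configs object concrete, here is a minimal sketch of a parser that accepts the five documented parameters with the documented defaults. It is not DBGen's actual implementation: build_parser is a hypothetical name, and the None default for --password and the integer type for --port are assumptions.

```python
import argparse


def build_parser():
    """Build an illustrative parser mirroring the documented DBGen options."""
    parser = argparse.ArgumentParser(description="DBGen-style configuration (illustrative)")
    # assumption: no default password is documented, so None is used here
    parser.add_argument("--password", default=None, help="user password")
    parser.add_argument("--database", default="dbgen_test", help="database name")
    parser.add_argument("--host", default="localhost", help="host name")
    # assumption: the port is parsed as an integer
    parser.add_argument("--port", type=int, default=27017, help="port")
    parser.add_argument("--root-data-dir", default="./test/data",
                        help="root directory for input data")
    return parser


# argparse exposes --root-data-dir as the attribute root_data_dir
configs = build_parser().parse_args(["--port", "27018"])
print(configs.port, configs.root_data_dir)
```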

DBGen tables

The database schema of DBGen is composed of 5 tables:

  • Species: the basic unit of classification and a taxonomic rank of an organism

  • Dataset: a homogeneous collection of samples from one or more organisms

  • Sample: collected individuals

  • Phenotype: observable characteristics of a sample

  • Result: output result of bioinformatic tools

[Figure: DBGen database schema (database_schema.png)]

Workflow

A typical DBGen workflow consists of 4 steps:

  • load sample data (with the corresponding phenotypes) from source files

  • download sample genotype (e.g. fastq files)

  • save the results (raw files) of bioinformatic analyses (e.g. VCF files)

  • query the database

Load data

Once you have loaded the configuration parameters and started the backend database service, you are ready to upload data into DBGen.

To simplify the insertion of new information, DBGen scans a user-defined directory looking for new data. The default directory is ./test/data. You can change it using the command line parameter --root-data-dir.

In order to work properly, DBGen requires a specific tree structure under the root data directory:

root-data-dir
|
+-- <species name>
|   +-- <year>_<publicationName>.txt
|   +-- <year>_<publicationName>.txt
|   +-- ...
|
+-- <species name>
|   +-- <year>_<publicationName>.txt
|   +-- ...
+-- ...
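
As an illustration of this layout, the sketch below walks a root data directory and derives the species and dataset names from the paths. The function list_source_files is hypothetical and only mirrors the documented tree structure; how DBGen scans the directory internally may differ.

```python
from pathlib import Path


def list_source_files(root_data_dir):
    """Return sorted (species, dataset, path) tuples found under root_data_dir.

    Hypothetical helper: it assumes the documented layout
    <root>/<species name>/<year>_<publicationName>.txt
    """
    found = []
    for path in Path(root_data_dir).glob("*/*.txt"):
        species = path.parent.name   # directory name is the species name
        dataset = path.stem          # file name without ".txt" is the dataset name
        found.append((species, dataset, path))
    return sorted(found)
```

For example, `list_source_files("./test/data")` would yield one tuple per source file, which can then be passed to species_name and dataset_name arguments in later queries.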

Each source file must be a TSV (tab-separated values) file with the following columns:

Project name  URLs                                                         Run accession  Phenotype A  Phenotype B
------------  -----------------------------------------------------------  -------------  -----------  -----------
PRJNA497094   ftp.baz/SRR8074810_1.fastq.gz;ftp.baz/SRR8074810_2.fastq.gz  SRR8074810     R            S
PRJNA497094   ftp.baz/SRR8074811_1.fastq.gz;ftp.baz/SRR8074811_2.fastq.gz  SRR8074811     R            S
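
To check a source file before importing it, a file in this format can be parsed with the standard csv module. The sketch below is not DBGen's importer: read_source_file is a hypothetical helper that assumes the header names shown above, treats every remaining column as a phenotype, and splits the URLs field on semicolons as in the example rows.

```python
import csv
import io


def read_source_file(handle):
    """Parse a DBGen-style TSV source file into a list of sample dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    samples = []
    for row in reader:
        samples.append({
            "project": row.pop("Project name"),
            "run_accession": row.pop("Run accession"),
            # several FTP links are separated by semicolons
            "urls": [u for u in row.pop("URLs").split(";") if u],
            # assumption: all remaining columns are phenotypes
            "phenotypes": dict(row),
        })
    return samples


example = io.StringIO(
    "Project name\tURLs\tRun accession\tPhenotype A\tPhenotype B\n"
    "PRJNA497094\tftp.baz/SRR8074810_1.fastq.gz;ftp.baz/SRR8074810_2.fastq.gz"
    "\tSRR8074810\tR\tS\n"
)
samples = read_source_file(example)
print(samples[0]["urls"])
```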

Download sample genotype

DBGen provides a simple method to download samples’ genotype. You just need to specify the name of the species and/or the name of the dataset you are interested in.

dbgen.Sample.download_raw_data(species_name="<species name>", dataset_name="<year>_<publicationName>")
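
How download_raw_data retrieves the files is not described here. As a rough illustration of what fetching the FTP links involves, the sketch below maps each link to a local file name and streams it to disk with the standard library. Both functions are hypothetical, and prefixing links that lack a scheme with ftp:// is an assumption based on the example URLs.

```python
import posixpath
import shutil
import urllib.request
from pathlib import Path


def plan_downloads(urls, dest_dir):
    """Return (full_url, local_path) pairs for a list of genotype links."""
    plan = []
    for url in urls:
        # assumption: links without a scheme are FTP links
        full_url = url if "://" in url else "ftp://" + url
        target = Path(dest_dir) / posixpath.basename(full_url)
        plan.append((full_url, target))
    return plan


def fetch(full_url, target):
    """Stream one remote file to disk without loading it into memory."""
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(full_url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)


plan = plan_downloads(
    ["ftp.baz/SRR8074810_1.fastq.gz", "ftp.baz/SRR8074810_2.fastq.gz"],
    "downloads",
)
print(plan[0])
```

The fetch function is defined but not called here, since ftp.baz is a placeholder host.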

Save bioinformatic results

Once you have obtained the results of a bioinformatic pipeline, you can import them into DBGen using the save_result method.

dbgen.Sample.save_result(sample_id=sample_primary_key,
                         tool_name="<tool name>",
                         version="<tool version>",
                         date="<current date>",
                         parameters="<tool parameters>",
                         raw_result_path="</path/to/result/file>")
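
The placeholder arguments above can be filled in programmatically. The sketch below builds the keyword arguments for one result file. The helper result_kwargs is hypothetical, and the ISO date format (YYYY-MM-DD) and the absolute path are assumptions rather than documented requirements of save_result.

```python
import datetime
import os


def result_kwargs(sample_id, tool_name, version, parameters, result_file):
    """Build the keyword arguments shown above for one result file."""
    return {
        "sample_id": sample_id,
        "tool_name": tool_name,
        "version": version,
        # assumption: dates are stored as ISO strings, e.g. "2019-09-02"
        "date": datetime.date.today().isoformat(),
        "parameters": parameters,
        # assumption: an absolute path avoids depending on the working directory
        "raw_result_path": os.path.abspath(result_file),
    }


kwargs = result_kwargs("652g6736f37719hbd", "AMRFinder+", "1.0.0", "-l 20", "result.txt")
print(sorted(kwargs))
```

With DBGen installed, the call would presumably be `dbgen.Sample.save_result(**kwargs)`.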

Query the database

There are 6 basic methods available in DBGen to query the database:

  • get_phenotype_names: get the set of phenotype names

  • get_tool_names: get the set of bioinformatic tool names

  • get_download_urls: get samples’ URLs used to download their genotype

  • get_raw_data: get samples’ genotype (e.g. fastq files)

  • get_phenotypes: get the observed samples’ phenotype

  • get_results: get the results of bioinformatic analyses

The first two methods return a Python set of strings. The last four methods return a pandas DataFrame.

Get phenotype names

You can get the list of available phenotypes using the get_phenotype_names method. You just need to specify the name of the species and/or the name of the dataset you are interested in.

dbgen.Phenotype.get_phenotype_names(species_name="<species name>", dataset_name="<year>_<publicationName>")

The method will return a Python set of strings, e.g.:

{'<phenotype name 1>', '<phenotype name 2>', '<phenotype name 3>', ...}

Get tool names

You can get the list of tools used so far using the get_tool_names method. You just need to specify the name of the species and/or the name of the dataset you are interested in.

dbgen.Result.get_tool_names(species_name="<species name>", dataset_name="<year>_<publicationName>")

The method will return a Python set of strings, e.g.:

{'<tool name 1>', '<tool name 2>', '<tool name 3>', ...}

Get raw genotype

You can get the raw genotype of samples using the get_raw_data method. You just need to specify the name of the species and/or the name of the dataset you are interested in.

dbgen.Sample.get_raw_data(species_name="<species name>", dataset_name="<year>_<publicationName>")

The method will return a pandas DataFrame, e.g.:

Sample primary key  Project    Run accession  Species        Dataset      Genotype files
------------------  ---------  -------------  -------------  -----------  ----------------------
652g6736f37719hbd   PRJEB5225  ERR410034      Staph. Aureus  2009_Austin  [<RawFile>, <RawFile>]
dug36ij3db73d8h92   PRJEB5225  ERR410035      Staph. Aureus  2009_Austin  [<RawFile>, <RawFile>]

The last column will contain the raw genotype of each sample. If you want to access the file, you can use the following procedure:

results = dbgen.Sample.get_raw_data(dataset_name="2009_Austin")

# get the first genotype file of the first sample
raw_data = results.iloc[0, -1][0]

# save the file locally so that
# it can be processed by bioinformatics tools
with open(raw_data.name, "wb") as f:
    raw_bytes = raw_data.file.read()
    f.write(raw_bytes)
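
Genotype files such as fastq archives can be large, so reading a whole file into memory with read() may be costly. A possible alternative is to copy the file in chunks. The sketch below assumes that raw_data.file behaves like a standard binary file object; save_file_object is a hypothetical helper, and an in-memory stream stands in for the database file.

```python
import io
import os
import shutil
import tempfile


def save_file_object(file_obj, target_path, chunk_size=1024 * 1024):
    """Copy a binary file-like object to target_path in fixed-size chunks."""
    with open(target_path, "wb") as out:
        shutil.copyfileobj(file_obj, out, chunk_size)
    return target_path


# Stand-in for raw_data.file, so the sketch runs without a database.
fake_file = io.BytesIO(b"@read1\nACGT\n+\nFFFF\n")
target = os.path.join(tempfile.mkdtemp(), "example.fastq")
save_file_object(fake_file, target)
print(os.path.getsize(target))
```

With a real query result, the call would presumably be `save_file_object(raw_data.file, raw_data.name)`.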

Get genotype URLs

You can get the list of URLs used to download the genotype of samples using the get_download_urls method. You just need to specify the name of the species and/or the name of the dataset you are interested in.

dbgen.Sample.get_download_urls(species_name="<species name>", dataset_name="<year>_<publicationName>")

The method will return a pandas DataFrame, e.g.:

Sample primary key  Project    Run accession  Species        Dataset      Genotype files
------------------  ---------  -------------  -------------  -----------  ------------------------------------------------------------
652g6736f37719hbd   PRJEB5225  ERR410034      Staph. Aureus  2009_Austin  [ftp.baz/ERR410034_1.fastq.gz, ftp.baz/ERR410034_2.fastq.gz]
dug36ij3db73d8h92   PRJEB5225  ERR410035      Staph. Aureus  2009_Austin  [ftp.baz/ERR410035_1.fastq.gz, ftp.baz/ERR410035_2.fastq.gz]

The last column will contain the list of URLs to download the genotype of each sample.

Get phenotypes

You can get samples’ phenotypes using the get_phenotypes method. You need to specify the name of the species and/or the name of the dataset, plus the name of the phenotype you are interested in.

dbgen.Phenotype.get_phenotypes(species_name="<species name>",
                               dataset_name="<year>_<publicationName>",
                               phenotype_name="<name of the phenotype>")

The method will return a pandas DataFrame, e.g.:

Sample primary key  Species        Dataset      Mupirocin
------------------  -------------  -----------  ---------
652g6736f37719hbd   Staph. Aureus  2009_Austin  R
dug36ij3db73d8h92   Staph. Aureus  2009_Austin  S

The last column will contain the observed phenotype for each sample.

Get results

You can get the results of bioinformatic pipelines using the get_results method. You just need to specify the name of the species and/or the name of the dataset you are interested in.

dbgen.Result.get_results(species_name="<species name>",
                         dataset_name="<year>_<publicationName>")

The method will return a pandas DataFrame, e.g.:

Sample primary key  Species        Dataset      Tool        Version  Parameters   Result
------------------  -------------  -----------  ----------  -------  -----------  -------------
652g6736f37719hbd   Staph. Aureus  2009_Austin  AMRFinder+  1.0.0    -l 20        <File Object>
dug36ij3db73d8h92   Staph. Aureus  2009_Austin  BLASTn      0.8.0    -p 0.2       <File Object>
dug36ij3db73d8h92   Staph. Aureus  2009_Austin  AMRFinder+  1.1.2    -l 30 -k 24  <File Object>

The last column will contain a file object containing the result of each bioinformatic tool.

Examples

Some working examples are presented in this section. Consider the following directory structure.

./test/data
|
+-- species1
|   +-- 2010_AuthorName1.txt
|
+-- species2
    +-- 2009_AuthorName3.txt
    +-- 2011_AuthorName3.txt

import os
import dbgen

def test_dbgen():

    # load configuration and connect to the backend database
    configs = dbgen.load_cfg()
    dbgen.start_db(configs)
    dbgen.connect_db(configs)

    # load source data
    dbgen.import_data(configs)

    # some queries
    s1 = dbgen.Sample.get_raw_data(species_name="species2", dataset_name="2009_AuthorName3")
    s2 = dbgen.Sample.get_raw_data(dataset_name="2009_AuthorName3")
    s3 = dbgen.Sample.get_raw_data(species_name="species2")
    u1 = dbgen.Sample.get_download_urls(dataset_name="2009_AuthorName3")
    u2 = dbgen.Sample.get_download_urls(species_name="species2")
    u3 = dbgen.Sample.get_download_urls(species_name="species2", dataset_name="2009_AuthorName3")
    p0 = dbgen.Phenotype.get_phenotype_names(species_name="species2")
    p1 = dbgen.Phenotype.get_phenotypes(species_name="species2", phenotype_name="Mupirocin")
    p2 = dbgen.Phenotype.get_phenotypes(dataset_name="2009_AuthorName3", phenotype_name="Mupirocin")
    p3 = dbgen.Phenotype.get_phenotypes(species_name="species2",
                                        dataset_name="2009_AuthorName3",
                                        phenotype_name="Mupirocin")

    # download samples' genotype
    dbgen.Sample.download_raw_data(species_name="species2", dataset_name="2009_AuthorName3")

    # get raw genotype
    s1 = dbgen.Sample.get_raw_data(species_name="species2", dataset_name="2009_AuthorName3")
    s2 = dbgen.Sample.get_raw_data(dataset_name="2009_AuthorName3")
    s3 = dbgen.Sample.get_raw_data(species_name="species2")

    # save a local genotype file
    raw_data = s1.iloc[0, -1][1]
    with open(raw_data.name, "wb") as f:
        raw_bytes = raw_data.file.read()
        f.write(raw_bytes)

    # pretend to run a bioinformatic pipeline
    # on some samples and save the results
    # into DBGen
    root_path = "./test/db/cooked/"
    if not os.path.exists(root_path):
        os.makedirs(root_path)
    for k, v in s1.iterrows():
        tool_name = "AMRFinder+"
        version = "0.0.1"
        date = "2019-09-02"
        parameters = "-c 20 -v 39"
        file_path = os.path.join(root_path, "testfile_%s.txt" % v["run accession"])
        # write a small placeholder result file, one sentence per line
        with open(file_path, "w") as raw_result:
            raw_result.write("Hello %s\n" % v["run accession"])
            raw_result.write("This is our new text file\n")
            raw_result.write("and this is another line.\n")
            raw_result.write("Why? Because we can.\n")
        raw_result_path = os.path.abspath(file_path)

        dbgen.Sample.save_result(k, tool_name, version,
                                 date, parameters, raw_result_path)

    # query the result table
    res0 = dbgen.Result.get_results(species_name="species2", dataset_name="2009_AuthorName3")
    res1 = dbgen.Result.get_results(species_name="species2")
    res2 = dbgen.Result.get_results(dataset_name="2009_AuthorName3")

    # shutdown the database backend
    dbgen.shutdown_db(configs)

    return