Generate chem dataset

This R notebook uses the full downloaded ChEMBL database (version 28) to extract drug information of interest.

This script requires the following dataset files:

These data sources are aggregated and stored in an sqlite3 database at the path given by the file.output variable. A temporary cache sqlite3 database is also created at the path given by the file.local variable.
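As a rough sketch of how the two paths might be used, assuming the DBI and RSQLite packages are installed: the temporary paths, the table names drug_cache and drug_ids, and the toy rows below are placeholders for illustration, not the notebook's actual code.

```r
library(DBI)

# Placeholder paths; the notebook defines its own file.output and file.local.
file.output <- tempfile(fileext = ".sqlite")
file.local  <- tempfile(fileext = ".sqlite")

con.out   <- dbConnect(RSQLite::SQLite(), file.output)
con.local <- dbConnect(RSQLite::SQLite(), file.local)

# Toy intermediate result kept in the cache database (hypothetical table name).
drugs <- data.frame(chembl_id = c("CHEMBL25", "CHEMBL1431"),
                    stringsAsFactors = FALSE)
dbWriteTable(con.local, "drug_cache", drugs, overwrite = TRUE)

# Read the cached rows back and store them in the output database.
cached <- dbReadTable(con.local, "drug_cache")
dbWriteTable(con.out, "drug_ids", cached, overwrite = TRUE)

tables.out <- dbListTables(con.out)

dbDisconnect(con.local)
dbDisconnect(con.out)
```

Keeping the cache in a separate file means it can be deleted without touching the final output.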

Cross-reference with PubChem

In this work we use the pre-computed 2D-structure kernel available from the PubChem interface. For this purpose, we need to map each drug in the ChEMBL database to its counterpart in the PubChem database.
This mapping is done through the UniChem cross-referencing service.

For this you will need the downloaded unichem
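The cross-referencing step amounts to a join between the drug list and an identifier mapping. The sketch below uses base R and a small in-memory mapping table in place of the downloaded UniChem file; the column names chembl_id and pubchem_cid and the CHEMBL_UNMAPPED entry are assumptions made for illustration.

```r
# Drugs selected from ChEMBL (toy subset; the last id is a made-up placeholder).
drugs <- data.frame(
  chembl_id = c("CHEMBL25", "CHEMBL1431", "CHEMBL_UNMAPPED"),
  stringsAsFactors = FALSE
)

# Stand-in for the ChEMBL-to-PubChem mapping read from the UniChem download.
mapping <- data.frame(
  chembl_id   = c("CHEMBL25", "CHEMBL1431"),
  pubchem_cid = c(2244L, 4091L),
  stringsAsFactors = FALSE
)

# Left join: keep every drug, with NA where no PubChem entry is known.
xref <- merge(drugs, mapping, by = "chembl_id", all.x = TRUE)

# Drugs that cannot be looked up in the PubChem kernel.
unmapped <- xref$chembl_id[is.na(xref$pubchem_cid)]
print(xref)
```

Using all.x = TRUE makes missing mappings visible as NA instead of silently dropping those drugs.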

Parsing mechanism-of-action entries

Mechanism-of-action data is part of the manually curated information in the ChEMBL database. It records the assumed or verified mechanism of action of each drug.

The result is stored in the drug_moa table of the output sqlite3 database file.

Parse Anatomical Therapeutic Chemical (ATC) Classifications

The Anatomical Therapeutic Chemical (ATC) classification system groups drugs by the organ or body system they act on and by the families their active ingredients belong to.
It is a hierarchy of 5 levels, of which we use only the first 4. We also simplify and compact these levels for better presentation.
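The level structure is encoded in the ATC code itself: the first character is level 1, the first three characters level 2, the first four level 3, the first five level 4, and the full seven-character code level 5. A minimal base R sketch of extracting the first four levels follows; the function name atc_levels is hypothetical, and the simplification and compaction done in the notebook are not reproduced here.

```r
# Split full 7-character ATC codes into their first four hierarchy levels.
atc_levels <- function(code) {
  stopifnot(is.character(code), all(nchar(code) == 7))
  data.frame(
    atc_code = code,
    level1   = substr(code, 1, 1),  # anatomical main group
    level2   = substr(code, 1, 3),  # therapeutic subgroup
    level3   = substr(code, 1, 4),  # pharmacological subgroup
    level4   = substr(code, 1, 5),  # chemical subgroup
    stringsAsFactors = FALSE
  )
}

# Example codes: metformin (A10BA02) and acetylsalicylic acid (N02BA01).
atc <- atc_levels(c("A10BA02", "N02BA01"))
print(atc)
```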

The result is stored in the drug_atc table of the output sqlite3 database file.

Parse Experimental Factor Ontology (EFO) Classifications

The Experimental Factor Ontology (EFO) is an ontological categorisation of molecules of biological interest. It therefore assigns a varying number of tags, at varying depth levels, to each of the selected drugs.

As with the ATC, we simplify and compact a selection of the data. To parse this information, the EFO ontology must be downloaded and available at the path given by the file.efo variable, and the ontologyIndex R package must be installed. You will also need the tidyr package.
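Because each drug carries a different number of tags, the data is easiest to handle as a long table with one row per drug and tag, which can then be compacted to one row per drug. The sketch below shows that reshaping in base R only, as a stand-in for the ontologyIndex and tidyr steps; the drug names and tag ids are made up for illustration.

```r
# Made-up tags per drug; a drug may have several tags or none.
tags <- list(
  DRUG_A = c("EFO_x1", "EFO_x2"),
  DRUG_B = "EFO_x3",
  DRUG_C = character(0)
)

# Long format: one row per (drug, tag) pair. Drugs without tags drop out.
long <- data.frame(
  drug   = rep(names(tags), lengths(tags)),
  efo_id = unlist(tags, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Compact format: one semicolon-separated string per drug.
compact <- vapply(split(long$efo_id, long$drug), paste, character(1),
                  collapse = ";")
print(long)
print(compact)
```

The long format suits a database table such as drug_efo, while the compact form is more convenient for display.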

The result is stored in the drug_efo table of the output sqlite3 database file.

Parsing chemical information of the drugs

These chemical properties are not used in this project, but are provided for a future extension.

Feel free to skip this section.

The result is stored in the drug_chemprops table of the output sqlite3 database file.

Parse drug SMILES structures

These structures can be used to compute drug similarities manually, for example with a SOAP kernel. In this work we use pre-computed similarities through the PubChem interface (see Cross-reference with PubChem).

Feel free to skip this section.

The result is stored in the drug_dtructs table of the output sqlite3 database file.