1. Data sets for chemoinformatics and computational medicinal chemistry
Description: Data sets provided herein are a part of a freely available database of annotated compound data sets developed in our laboratory for chemoinformatics and computational medicinal chemistry from 2007 to 2014. The individual data sets and corresponding publications are described in the following article.
Article: Hu Y & Bajorath J. Compound data sets and software tools for chemoinformatics and medicinal chemistry applications: update and data transfer [v1; ref status: indexed, f1000r.es/32j] F1000Research 3:69, 2014.
2. Programs for chemoinformatics and computational medicinal chemistry
Description: Programs provided herein are a part of a freely available database of software tools developed in our laboratory for chemoinformatics and computational medicinal chemistry from 2007 to 2011. The individual programs and corresponding publications are described in the following article.
Article: Hu Y & Bajorath J. Compound data sets and software tools for chemoinformatics and medicinal chemistry applications: update and data transfer [v1; ref status: indexed, f1000r.es/32j] F1000Research 3:69, 2014.
3. Drug-unique scaffolds
Description: A list of 221 drug-unique scaffolds that represented approved drugs but were not detected in currently available bioactive compounds is provided. For each scaffold, the corresponding approved drug(s) are listed with their IDs in DrugBank and names. The structures of scaffolds and drugs are provided in canonical SMILES representation.
Article: Hu Y & Bajorath J. Many drugs contain unique scaffolds with varying structural relationships to scaffolds of currently available bioactive compounds. Eur. J. Med. Chem. 76, 427-434, 2014.
4. Detailed data sets of MMP-cliffs, SAR transfer series, RECAP-MMPs and compound activities
Description: An up-to-date version of three MMP-based data sets derived from compounds included in ChEMBL release 17 is presented. These data sets include activity cliffs, structure-activity relationship (SAR) transfer series, and second generation MMPs based upon retrosynthetic rules. The structural data and information are provided in eight different files comprising the data sets of MMP-cliffs, SAR transfer series, and RECAP-MMPs. Compound activities are incorporated in files of RECAP-MMPs. For transfer series, substituted fragments are also provided. All MMP-cliffs, SAR transfer series with approximate or regular potency progression, and RECAP-MMPs are provided in canonical SMILES representation on a per-target basis separately for the Ki and IC50 subsets. The corresponding files are clearly designated.
Article: Hu Y, de la Vega de León A, Zhang B & Bajorath J. Matched molecular pair-based data sets for computer-aided medicinal chemistry [v2; ref status: indexed, f1000r.es/2w9] F1000Research 3:36, 2014.
5. Prediction of Compounds in Different Local SAR Environments using ECP
Description: SD files of 15 data sets reported in the manuscript are uploaded. Each data set is represented by its CHEMBL Target ID. The file format is provided in the file 'description.txt'.
Article: Namasivayam V, Gupta-Ostermann D, Balfer J, Heikamp K & Bajorath J. Prediction of compounds in different local structure-activity relationship environments using emerging chemical patterns. J. Chem. Inf. Model. 54, 1301-1310, 2014.
6. A Set of 72 Kinase Inhibitors
Description: A set of 72 known kinase inhibitors was assembled from ChEMBL release 18. For each inhibitor, MACCS and ECFP4 representations are provided. Both fingerprints were calculated using an in-house implementation based upon OpenEye's OEChem toolkit. For MACCS, we used SMARTS patterns adapted from RDKit.
Article: Balfer J, Hu Y & Bajorath J. Compound structure-independent activity prediction in high-dimensional target space. Mol. Inf. 33, 544-558, 2014.
7. The ‘SAR Matrix’ method and its extensions for applications in medicinal chemistry and chemogenomics
Description: SD files of data sets reported in the manuscript that are used to generate the SARMs and CSMs are uploaded. The given data sets are represented by the corresponding figure numbers (as reported in the publication). The file format is provided in the file 'description.txt'.
Article: Gupta-Ostermann D & Bajorath J. The ‘SAR Matrix’ method and its extensions for applications in medicinal chemistry and chemogenomics [v2; ref status: indexed, f1000r.es/3gh] F1000Research 3:113, 2014.
8. Classification of Binding Modes for Kinase-Inhibitor Complex Structures, 3D Activity Cliffs Formed by Kinase Inhibitors, and Structural Analogues of 3D-Cliff Compounds
Description: The classification of crystallographic binding modes is provided for 884 kinase-inhibitor complex structures that were assembled from the PDB. In addition, a total of 105 three-dimensional activity cliffs formed by kinase inhibitors are listed, together with the corresponding potency information. Furthermore, 2D structural analogs of 3D cliff-forming inhibitors were identified from the ChEMBL database on the basis of matched molecular pairs. These analogs and their activity information are also provided.
Article: Furtmann N, Hu Y & Bajorath J. Comprehensive analysis of three-dimensional activity cliffs formed by kinase inhibitors with different binding modes and cliff mapping of structural analogs. J. Med. Chem. 58, 30-40, 2015.
9. Visualization and Graphical Interpretation of Bayesian Compound Classification Models
Description: A prototypic Python implementation of the new visualization method is provided.
Article: Balfer J & Bajorath J. Introduction of a methodology for visualization and graphical interpretation of Bayesian classification models. J. Chem. Inf. Model. 54, 2451-2456, 2014.
10. Kinase Inhibitors and Scaffolds
Description: On the basis of the assessment of compound and scaffold coverage of the human kinome, sets of kinase inhibitors and scaffolds with high-confidence activity data assembled from ChEMBL release 18 are provided. In addition, sets of scaffolds that are involved in different types of structural relationships, have varying degrees of promiscuity, or represent highly potent inhibitors are made available. A documentation file (README.txt) is included.
Article: Hu Y & Bajorath J. Exploring the scaffold universe of kinase inhibitors. J. Med. Chem. 58, 315-332, 2015.
11. Drug activity data
Description: Three sets of approved drugs with associated activity data at varying confidence levels are provided. Approved drugs are taken from DrugBank. Each drug in each set is associated with ChEMBL activity records for individual time intervals.
Article: Hu Y & Bajorath J. Monitoring drug promiscuity over time [v2; ref status: indexed, f1000r.es/4bh] F1000Research 3:218, 2014.
12. AnalogExplorer Program
Description: The AnalogExplorer program, which implements a new method for graphical analysis of analog series and associated structure-activity relationship information, is provided. The use of the program is detailed in the file 'HowToRunAnalogExplorer.txt'.
Article: Zhang B, Hu Y & Bajorath J. AnalogExplorer – a new method for graphical analysis of analog series and associated structure-activity relationship information. J. Med. Chem. 57, 9184-9194, 2014.
13. 31 ChEMBL data sets for regression modeling
Description: From ChEMBL version 17, 31 compound data sets have been selected for regression modeling. Compounds had to be active against human targets in a direct inhibition/binding assay with the highest ChEMBL confidence score and Ki values below 100 micromolar. Multiple Ki values for the same compound were averaged if they fell into the same order of magnitude; otherwise, they were disregarded. Duplicates, known pan-assay interference compounds, and other reactive molecules were removed. Only sets with at least 500 compounds were considered.
Article: Balfer J & Bajorath J. Systematic artifacts in support vector regression-based compound potency prediction revealed by statistical and activity landscape analysis. PLoS One 10, e0119301, 2015.
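The Ki aggregation rule described for data set 13 can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "same order of magnitude" means the same integer decade of log10(Ki), that values are given in nM, that the arithmetic mean is used, and that the 100 micromolar cutoff is applied to the averaged value. The function name `aggregate_ki` and its parameters are hypothetical.

```python
import math


def aggregate_ki(values_nm, max_ki_nm=100_000.0):
    """Return one Ki value (nM) for a compound, or None if it is disregarded.

    Assumptions (not taken from the publication):
    - values_nm holds strictly positive Ki measurements in nM;
    - "same order of magnitude" means equal floor(log10(Ki));
    - the arithmetic mean is used for averaging;
    - the 100 micromolar cutoff is applied to the averaged value.
    """
    if not values_nm:
        return None
    # Collect the decade (integer part of log10) of every measurement.
    decades = {math.floor(math.log10(v)) for v in values_nm}
    if len(decades) != 1:
        # Measurements span different orders of magnitude: disregard.
        return None
    average = sum(values_nm) / len(values_nm)
    return average if average < max_ki_nm else None


if __name__ == "__main__":
    print(aggregate_ki([10.0, 50.0]))   # same decade, averaged
    print(aggregate_ki([5.0, 50.0]))    # different decades, disregarded
```

A different reading of the rule, for example requiring that the largest and smallest value differ by less than a factor of 10, would only change the decade check.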
14. BindingDB activity classes
Description: A collection of 120 activity classes extracted from BindingDB and their ECFP4, GpiDAPH3, and MACCS fingerprint representations are provided.
Article: Garnett R, Gärtner T, Vogt M & Bajorath J. Introducing the 'active search' method for iterative virtual screening. J. Comput.-Aided Mol. Des. 29, 305-314, 2015.
15. Drug scaffolds and their structural relationships
Description: A list of 779 scaffolds extracted from approved drugs is provided. For each scaffold, the number of approved drugs it represents, the number and list of targets it is annotated with, and its SMILES representation are given. In addition, pairs of drug scaffolds that form substructure, CSK equivalence, MMP, and RECAP-MMP relationships are provided in separate files. Furthermore, drug scaffold pairs that display distinct activity profiles and form one or more types of structural relationships are given.
Article: Hu Y & Bajorath J. Structural and activity profile relationships between drug scaffolds. AAPS J. 17, 609-619, 2015.
16. Follow-up: Prospective compound design using the ‘SAR Matrix’ method and matrix-derived conditional probabilities of activity
Description: Details of the conditional probability calculations on the exemplary matrix provided in Figure 3 of the publication (see Gupta-Ostermann, Hirose, Odagami & Bajorath, Follow-up: Prospective compound design using the ‘SAR Matrix’ method and matrix-derived conditional probabilities of activity, F1000Research 2015, 4:75, DOI: 10.12688/f1000research.6271.1) are provided in an Excel sheet.
Informative SARMs from the PRISM library are included. Due to proprietary issues the structural information of compounds is not included. Key and value fragments of SARMs are provided with an identifier.
Article: Gupta-Ostermann, Hirose, Odagami & Bajorath, Follow-up: Prospective compound design using the ‘SAR Matrix’ method and matrix-derived conditional probabilities of activity, F1000Research 4:75, 2015.
17. Data sets for orthologous target pair analysis
Description: The set of all 803 originally identified orthologous target pairs (OTPs) and the subset of 222 OTPs with at least 10 shared compounds are provided herein. For each OTP, both organisms, the target, the number of shared compounds, the OTP category, and the number of reference articles are reported. In addition, the list of all 1149 candidate compounds and their human target assignments is provided.
Article: Dimova D, Stumpfe D & Bajorath J. Identification of orthologous target pairs with shared active compounds and comparison of organism-specific activity patterns. Chem. Biol. Drug Des. 86, 1105-1114, 2015.
18. Bioactive compound classes from ChEMBL20 for molecular hierarchy
Description: Compound structures, scaffolds, and CSKs are provided for two sets: all 78,150 compounds and the subset of 54,042 compounds whose scaffold is shared by at least one other compound. In addition, each compound is annotated with target and potency information.
Article: Stumpfe D, Dimova D & Bajorath J. Systematic assessment of scaffold hopping versus activity cliff formation across bioactive compound classes following a molecular hierarchy. Bioorg. Med. Chem. 23, 3183-3191, 2015.
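The relationship between the two compound sets of data set 18 can be sketched in Python: the subset keeps only compounds whose scaffold occurs in at least one other compound. This is a minimal illustration, not the authors' code. It assumes that scaffolds are given as canonical SMILES strings, so that string equality identifies identical scaffolds. The function name and the example records are hypothetical.

```python
from collections import Counter


def compounds_with_shared_scaffold(records):
    """Keep compounds whose scaffold is shared by at least one other compound.

    records: list of (compound_id, scaffold_smiles) tuples. Scaffolds are
    assumed to be canonical SMILES; compounds without a scaffold (empty
    string or None) are dropped.
    """
    # Count how many compounds carry each scaffold.
    counts = Counter(scaffold for _, scaffold in records if scaffold)
    return [
        (compound_id, scaffold)
        for compound_id, scaffold in records
        if scaffold and counts[scaffold] >= 2
    ]


if __name__ == "__main__":
    # Hypothetical identifiers and scaffolds, for illustration only.
    example = [
        ("cpd_1", "c1ccccc1"),
        ("cpd_2", "c1ccccc1"),
        ("cpd_3", "c1ccncc1"),
        ("cpd_4", ""),
    ]
    print(compounds_with_shared_scaffold(example))
```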
19. Currently available 3D activity cliffs and 2D-analogs of 3D-cliff compounds
Description: Three-dimensional activity cliffs (3D-cliffs) were systematically determined on the basis of currently available X-ray structures in the PDB. The lists of all 236, 292, and 595 3D-cliffs that were identified from the Ki, IC50, and Ki/IC50 sets, respectively, are provided. In addition, 2D structural analogs of 3D-cliff compounds, identified from the ChEMBL database (release 19) on the basis of matched molecular pairs, are given.
Article: Furtmann N, Hu Y, Gütschow M & Bajorath J. Identification and Analysis of Currently Available High-Confidence Three-Dimensional Activity Cliffs. RSC Adv. 5, 43660-43668, 2015.
Hu Y, Furtmann N & Bajorath J. Extension of Three-Dimensional Activity Cliff Information through Systematic Mapping of Active Analogs. RSC Adv. 5, 43006-43015, 2015.
20. Visualization and interpretation of support vector machine activity predictions
Description: A prototypic Python implementation of the new visualization method is provided (see Balfer & Bajorath, Visualization and Interpretation of Support Vector Machine Activity Predictions, J. Chem. Inf. Model. 55, 1136-1147, 2015).
Article: Balfer J & Bajorath J. Visualization and Interpretation of Support Vector Machine Activity Predictions. J. Chem. Inf. Model. 55, 1136-1147, 2015.
21. Sets of ChEMBL compounds with high or low confidence activity data
Description: Two sets of compounds assembled from ChEMBL release 20 that are annotated with high or low confidence activity data are provided in separate files. For each compound in a file, the unique compound identifier (i.e., molregno), the number of targets in individual years (from 1976 to 2014), and the list of target annotations (if any) are given.
Article: Hu Y, Jasial S & Bajorath J. Promiscuity progression of bioactive compounds over time [v2; ref status: indexed, http://f1000r.es/5h4] F1000Research 4(Chem. Inf. Sci.):118, 2015.
22. Knowledge base of two- and three-dimensional activity cliffs
Description: The results of up-to-date surveys and systematic analyses of 2D-cliffs including clusters, 3D-cliffs, and extensions of 3D-cliffs are made freely available in four separate data files. These files contain the lists of 2D-cliffs and cliff clusters, 3D-cliffs, 3D-cliff-MMPs, and superpositions of complex X-ray structures and 3D ligands for selected targets. The data organization and information are detailed in README.doc.
Article: Hu Y, Stumpfe D, Furtmann N & Bajorath J. Promiscuity progression of bioactive compounds over time [v1; ref status: indexed, http://f1000r.es/5ir] F1000Research 4(Chem. Inf. Sci.):168, 2015.
23. 3D activity cliff information for target-specific interaction hotspot analysis
Description: Seven subsets of recently identified three-dimensional activity cliffs (3D-cliffs) were used for the identification of target-specific interaction hotspots. The list of all 3D-cliffs as well as the superpositions of complex X-ray structures and 3D ligands for the corresponding targets are provided. The data organization and information are detailed in README.doc.
Article: Furtmann N, Hu Y, Gütschow M & Bajorath J. Identification of interaction hotspots in structures of drug targets on the basis of three-dimensional activity cliff information. Chem. Biol. Drug Des. 86, 1458-1465, 2015.
24. ChEMBL20 data sets for multi-property landscape analysis
Description: Six data sets from ChEMBL (version 20) are provided with their ChEMBLID, SMILES (in SMILES.zip) and descriptor values (Properties.zip). Additionally, the coordinates of compounds and property axes displayed in the linked article are given (Coordinates.zip).
Article: de la Vega de León A, Kayastha S, Dimova D, Schultz T & Bajorath J. Visualization of multi-property landscapes for compound selection and optimization. J. Comput.-Aided Mol. Des. 29, 695-705, 2015.
25. Compounds with multi-target activities forming target cliffs and selectivity cliffs
Description: On the basis of high-confidence activity data assembled from ChEMBL release 20, compounds active against one or more pairs of targets are provided (in the file 'Compounds_In_Target_Pairs.xlsx') for the Ki and IC50 value-based data sets, respectively. Target cliffs formed by selective compounds are listed in the file 'Target_Cliffs.xlsx'. In addition, selectivity cliffs formed by structurally analogous compounds with significantly different selectivity are given in the file 'Selectivity_Cliffs.xlsx'. The content of individual files is detailed in the README file.
Article: Hu Y & Bajorath J. Systematic Assessment of Molecular Selectivity at the Level of Targets, Bioactive Compounds, and Structural Analogs. ChemMedChem in press, 2016.
26. AnalogExplorer and AnalogExplorer2
Description: AnalogExplorer is a computational method for the organization and graphical analysis of analog series from medicinal chemistry (see Zhang B, Hu Y & Bajorath J. AnalogExplorer - a new method for graphical analysis of analog series and associated structure-activity relationship information. J. Med. Chem. 2014, 57, 9184-9194). AnalogExplorer2, the second generation program, explicitly takes stereochemical information into account during graphical analysis. The following tools and data sets are deposited. Three executable files of the original AnalogExplorer program are provided for different applications: the analysis of multiple analog series from a given compound set, the analysis of an individual series, and selectivity analysis. With the exception of the OpenEye OEChem library (OpenEye Scientific Software; www.eyesopen.com/), for which a license is required, jar files of external libraries are also provided. In addition, all compound sets analyzed in the original publication of AnalogExplorer are deposited. These compound sets were taken from ChEMBL version 18 (https://www.ebi.ac.uk/chembl/). Furthermore, three executable files are provided for AnalogExplorer2 (for multiple analog series, individual series, and selectivity analysis) as well as exemplary compound sets taken from ChEMBL version 20. The usage of the programs is detailed in the file “README.pdf”.
Article: Hu Y, Zhang B, Vogt M & Bajorath J. AnalogExplorer2 - Stereochemistry sensitive graphical analysis of large analog series. F1000Research 4(Chem Inf Sci):1031, 2015.
27. Data sets for SAR progression analysis
Description: Four compound data sets assembled from ChEMBL are provided that have been subjected to SAR progression analysis, to be published in Journal of Medicinal Chemistry.
Article: Shanmugasundaram V, Zhang L, Kayastha S, de la Vega de León A, Dimova D & Bajorath J. Monitoring the progression of structure-activity relationship information during lead optimization. J. Med. Chem. in press, 2016.
28. Bioactive compounds with no structural analogs (high-confidence activity data)
Description: A set of 52,815 unique bioactive compounds (human targets, high-confidence activity data) with no structural analogs with high-confidence activity data was extracted from ChEMBL. For each compound the ChEMBL compound ID (CHEMBLID_Compound) and high-confidence target annotation(s) (CHEMBLID_Targets) are provided. The data set was generated as a part of an analysis to be published in 'Medicinal Chemistry Communications'.
Article: Dimova D, Stumpfe D & Bajorath J. Systematic assessment of analog relationships between bioactive compounds and promiscuity of analog sets. Med. Chem. Commun. 7, 230-236, 2016.
29. Currently available scaffolds and MMP cores in ChEMBL 20
Description: All target-based BM scaffolds (Ki: 35,872; IC50: 74,379), CSKs (Ki: 23,056; IC50: 49,216), MMP cores (Ki: 42,104; IC50: 73,616), and retrosynthetic MMP cores (Ki: 19,040; IC50: 32,382) for bioactive compounds with high-confidence activity data extracted from ChEMBL version 20 are provided herein.
Article: Hu Y, Stumpfe D & Bajorath J. Computational Exploration of Molecular Scaffolds in Medicinal Chemistry. J. Med. Chem. in press, 2016.
30. Classification of Matching Molecular Series on the Basis of SAR Phenotypes and Structural Relationships
Description: A database comprising a total of 13,236 pairs of MMS with different SAR characteristics is provided. For each pair the corresponding MMS-cores are provided as SMILES. In addition, for each MMS-core the number of compounds and the SAR phenotype are given. ChEMBL target IDs (CHEMBLID_Target) designate target sets from which the MMS pairs originate.
Article: Ghosh A, Dimova D & Bajorath J. Classification of matching molecular series on the basis of SAR phenotypes and structural relationships. Med. Chem. Commun. 7, 237-246, 2016.
31. Systematic Design of Analogs of Active Compounds Covering More than 1000 Targets
Description: The analog database consisting of 1,297,204 virtual compounds is provided. Virtual compounds are reported in SMILES representation. In addition, for each virtual compound all available ChEMBL analogs (CHEMBL_COMPOUND_ID) and their activities (CHEMBL_TARGET_IDs) are given.
Article: Dimova D & Bajorath J. Systematic design of analogs of active compounds covering more than 1000 targets. Med. Chem. Commun. in press, 2016.
32. PubChem compounds tested in primary and confirmatory assays
Description: The set of 437,257 compounds that were tested in both primary and confirmatory assays was assembled from the PubChem BioAssay collection and deposited in an Excel file. For each compound, its compound identifier in PubChem (i.e., cid), the number of primary and confirmatory assays it was tested in, and activity annotations are reported.
Article: Jasial S, Hu Y & Bajorath J. Determining the Degree of Promiscuity of Extensively Assayed Compounds. submitted, 2016.
33. Activity classes from different categories
Description: A total of 102 activity classes (ACs) were assembled from ChEMBL version 20 and classified as “easy” (i.e., yielding generally high compound recall using different fingerprints in benchmarking calculations), “preferred/intermediate” (moderate compound recall), or “difficult” (low compound recall). In addition, 10,000 randomly selected ZINC compounds are provided. For each compound, both MACCS and ECFP4 fingerprints are given. Furthermore, general information for these ACs is provided in the Excel file.
Article: Jasial S, Hu Y, Vogt M & Bajorath J. Activity-relevant similarity values for fingerprints and implications for similarity searching. submitted, 2016.