This post was inspired by the following conjunctions:
1) The BBC Radio 4 Today slot about the GSK transparency on clinical trials.
2) A new publication on Open data for drug discovery: learning from the biological community has appeared from ChEMBL and GSK authors.
3) My co-authors and I had the good news about the acceptance of "Challenges and Recommendations for Obtaining Chemical Structures of Industry-Provided Repurposing Candidates" (there is some background in this blog post chain and we will let you know when the paper goes online).
4) There is a useful commentary on the initial GSK press release.
5) A recent conference presentation from Ben Goldacre outlines his plans to mine clinicaltrials.gov (the tail end of his talk from 18:20 onwards) and, amongst other provocative ideas, one is to encourage independent interested parties (e.g. patient groups) to actively "prod" sponsors for overdue data.
6) I moved over to figshare a magazine article co-authored at the beginning of this year, entitled Connecting Up: assessing the name space and molecular mappings of the drug interventions in ClinicalTrials.gov
The essence of the GSK press announcement is that details from clinical trials will become available so that others (implicitly on the basis of approved but non-GSK affiliated data mining) can draw independent conclusions about safety and efficacy of their new therapeutic agents. In addition, the top-200 TB screening hits are due to be published (but no details yet on database depositions). As interesting as these declarations of intent are, the real utility for "us" will be dependent (as ever) on the technical details where the rubber meets the road (to use a US cliché). Notwithstanding, I'd take bets that other pharma companies will follow suit. The interesting "Open data" article is somewhat orthogonal to the code number/trial data issues per se but, significantly, it announces that GSK are pioneering a short-cut for direct deposition of supplementary SAR data into ChEMBL.
What we can do here is check how well, obviously retrospectively, GSK have publicly declared (or made plausibly findable at least) structures associated with the code names of their recent clinical trial candidates (i.e. the basics of transparency). Notwithstanding the caveats described by us and others (including Ben Goldacre) it is possible to perform queries at clinicaltrials.gov with useful specificity. For example, I was intrigued to find that nested wild cards could pull out current and pre-merger code name stems, via the simple query "Drug | Interventional Studies | GSK* OR GW* OR SB* | Glaxo SmithKline [Lead]" (you can even find some old SK* hits). The returns are shown below, along with a standard sortable Excel download.
I can drop the sheet out on figshare if anyone expresses an interest but it is easy to generate (note also this could be done for any consistently prefixed set of company codes but ideally triples). So where do we go from here ? I am sure many of you can think of interesting options for the list but for this post I shall just pop code names of the most recent trial declarations to see what I can find. I had intended to do the first 10 but, as is often the way, just the first one produced an extended story that became more than enough for one blog post.
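For anyone who wants to regenerate the list from the spreadsheet download, here is a minimal Python sketch that pulls prefixed code names out of intervention strings and groups them by stem. The regular expression, the function name and the `SB-123456` sample are my own assumptions for illustration; real intervention fields are messier than this pattern allows for.

```python
import re
from collections import defaultdict

# Assumed shape of a code name: a known stem, an optional hyphen,
# at least three digits and an optional single-letter suffix.
CODE_RE = re.compile(r"\b(GSK|GW|SB|SK)-?(\d{3,})([A-Z]?)\b")


def group_by_stem(interventions):
    """Return {stem: sorted list of hyphen-free code names} for the given strings."""
    groups = defaultdict(set)
    for text in interventions:
        for match in CODE_RE.finditer(text):
            stem, digits, suffix = match.groups()
            # Normalise to the hyphen-free form so spelling variants collapse
            groups[stem].add(f"{stem}{digits}{suffix}")
    return {stem: sorted(codes) for stem, codes in groups.items()}


if __name__ == "__main__":
    rows = [
        "Drug: GSK1605786",
        "Drug: GSK-1605786 versus placebo",
        "Drug: SB-123456",  # hypothetical code name, for illustration only
    ]
    print(group_by_stem(rows))
```

Swapping the stems in the pattern for another company's prefixes should work the same way, provided those prefixes are used consistently.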
While the code name resolution triage is already described for the NCATS code names, I tried what I hoped would be a shortcut via the NCBI all databases Entrez interface because you could, in theory, match to PubMed (PM) and PubChem (PC) at one pop, but, this turns out to be hyphen sensitive (and so is PM), so I had to revert back to the two-stop individual sources. First on the list, GSK1605786 is PC-negative but at least we get two PMs (below).
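Because the hyphen sensitivity means a single spelling can miss hits, one workaround is to generate both spellings of each code name and OR them together. The sketch below is only that, a sketch: the helper names are hypothetical, it only toggles the hyphen between the letter stem and the number, and the assumption that each variant should be quoted as a phrase may not suit every search interface.

```python
import re


def hyphen_variants(code):
    """Return the hyphen-free and hyphenated spellings of a code name.

    Only the junction between the leading letters and the first digit is
    varied; anything after that (e.g. the "-B" in CCX282-B) is kept as is.
    """
    match = re.fullmatch(r"([A-Z]+)-?(\d.*)", code)
    if match is None:
        return [code]
    stem, rest = match.groups()
    return [f"{stem}{rest}", f"{stem}-{rest}"]


def or_query(*codes):
    """Join every spelling variant of every code name into one OR query string."""
    terms = []
    for code in codes:
        for variant in hyphen_variants(code):
            if variant not in terms:
                terms.append(variant)
    return " OR ".join(f'"{term}"' for term in terms)


if __name__ == "__main__":
    print(or_query("GSK-1605786", "CCX282-B"))
```

The resulting string can be pasted into a search box or passed as the query term in a script.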
These were not GSK publications and I don't have access. However, what unexpectedly came up trumps was a Google Images search (below).
The match is to a ChEMBL blogpost on the April 2012 USANs. This gets pretty close to the horse's mouth because it includes a link to the PDF of the GSK approval for Vercirnon. I was just about to convert the IUPAC in the PDF via chemicalize.org but, having fortuitously resolved the code name to a USAN, I tried an open Google search (below).
So now we have extended the mapping chain to: vercirnon, vercirnon sodium (the usual USAN parent-salt doublets), Traficet-EN, CCX282-B (both ChemoCentryx legacy names as licensor) and GSK1605786 or GSK-1605786. This establishes that both the USAN (and the INN) PDF contents are Google-scraped. Last but not least we also have a ChemSpider (CS) database hit (below).
There are some interesting aspects to CS 8518913 but it needed some Tweeting and a comment feedback to suss out the details (and thanks for the responses). First up is that CS are taking a feed of new USANs via ChEMBL cloud resources. The quirk is that the RN (CAS Registry Number) from the USAN gives a false-positive return from PubChem because the query runs as (698394[All Fields] AND 73[All Fields] AND 9[All Fields]). We thus get back NSC698394 as CID 3107921. Should you want to interface pop or script up RN queries, the query needs to be ("698394-73-9"[CompleteSynonym]). To be fair, false positives like this case are both rare and obvious because they only happen a) where CS has an RN that PubChem does not and b) there happens to be a spurious 6-digit match in the CID fields. I am pleased to add that not only were my updates to the above entry added in short order, so you will now see the synonyms revised as below, but also the URLs associated with the RN flag were fixed.
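If you do want to script RN queries, a minimal Python sketch of building the field-restricted term might look like this. It assumes the NCBI E-utilities `esearch` endpoint with `db=pccompound`; the function names are mine, and the checksum guard (the standard CAS check-digit rule) is an extra I have added rather than anything the query itself requires. The code only constructs the URL and does not call the service.

```python
import re
from urllib.parse import urlencode

# Assumed endpoint: the NCBI E-utilities esearch service
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def cas_checksum_ok(rn):
    """Check the CAS check digit: weight digits 1, 2, 3... from the right, sum, mod 10."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", rn)
    if match is None:
        return False
    digits = (match.group(1) + match.group(2))[::-1]
    total = sum(int(d) * weight for weight, d in enumerate(digits, start=1))
    return total % 10 == int(match.group(3))


def rn_term(rn):
    """Quote the whole RN and restrict it to the synonym field so it is not split on hyphens."""
    if not cas_checksum_ok(rn):
        raise ValueError(f"not a valid CAS Registry Number: {rn!r}")
    return f'"{rn}"[CompleteSynonym]'


def rn_esearch_url(rn):
    """Build (but do not fetch) an esearch URL against the PubChem Compound database."""
    return ESEARCH + "?" + urlencode({"db": "pccompound", "term": rn_term(rn)})


if __name__ == "__main__":
    print(rn_term("698394-73-9"))
    print(rn_esearch_url("698394-73-9"))
```

Quoting the full RN is the point here: it stops the three hyphen-separated blocks being searched as independent numbers.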
OK - so who else has picked up this RN ? The Google results are below.
Surprisingly, along with the instantaneous capture of blogger posts, and the sources we already expected, this seems to be purchasable already (top link page below).
However, my Avast antivirus gave me a malicious URL warning from the LookChem home page so I am disinclined to inspect the entry, but it looks like a search engine optimised, dubious secondary brokerage operation (no stock, just calls for tenders). We could speculate their RN > structure link (note the patent reference) is derived from SciFinder, maybe after picking up the USANs. We should thus move swiftly to some more solid informatic ground via the PC link from the CS entry. The SIDs under CID 10343454 are shown below.
Only two of these are primary sources (the rest are piggy-backing) and, as already established, neither included any of the synonyms above. First was Thomson Pharma, back in 2006, presumably from a manual extraction of an early ChemoCentryx patent. Second in, 6 years later, was an automated patent extraction by SCRIPDB entering as the SID just a few months ago. OK, so the next go round the sources can be via the InChIKey JRWROCIMSDXGOZ-UHFFFAOYSA-N (below).
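The InChIKey step can also be scripted. The Python sketch below checks that a string has the 14-10-1 block layout of a standard InChIKey and builds a PubChem PUG REST lookup URL for it. The function names are mine, the format check says nothing about whether the key matches a real structure, and the URL is constructed but not fetched.

```python
import re

# Layout of a standard InChIKey: 14 letters, 10 letters, 1 letter, hyphen-separated
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")

# Assumed base URL for the PubChem PUG REST service
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def is_inchikey(text):
    """Return True if the text has the block layout of an InChIKey."""
    return INCHIKEY_RE.fullmatch(text) is not None


def pubchem_cids_url(inchikey):
    """Build a PUG REST URL asking for the CIDs that carry this InChIKey."""
    if not is_inchikey(inchikey):
        raise ValueError(f"not an InChIKey: {inchikey!r}")
    return f"{PUG_REST}/compound/inchikey/{inchikey}/cids/JSON"


if __name__ == "__main__":
    print(pubchem_cids_url("JRWROCIMSDXGOZ-UHFFFAOYSA-N"))
```

Because the key is a fixed-length string with no spaces, it also makes a convenient exact-match term for an ordinary web search.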
The CS hit was no surprise but the chemicalize.org direct match was, because there was no PubChem entry for this source as would be expected (I may follow up on this). Just for the record, the entry is shown below.
So, after these technical digressions where does this leave us in the transparency stakes ? To be fair to GSK a late licensed compound is not a good example as the research data was not generated on their watch. We can look at their clinical trials search portal and it comes up on the Google hits (below).
But, while it is nice to get the summary reports they do not specify the structure. However, if we put the legacy code in as well, we now bring back 11 studies (below).
We can do the same thing in PubMed (but remember you need the hyphens here) which brings us up to four reports.
OK, so let's tot up this week's quirk list:
1) Beyond the INN and USAN applications GSK have done nothing to "transparently" declare a name-to-structure mapping. For INNs and USANs it's only Google indexing that makes them findable. Why these two crucial operations have never seen fit to put up a proper public database is a mystery. Note also, unless you use domain selection in advanced Google search for GSK-1605786, you would be hard put to pick up the USAN PDF ranked at ~100 because the top matches are swamped by clinical trials mirroring and replication sites.
2) Note in this case the intersect between the USAN picked up by ChEMBL and the CS entry was fortuitous, not systematic capture. CS only had structure to link the new USAN information to because this had been pulled across from PC pre-2007. There is no direct Thomson Pharma feed to CS so this old one was picked by chance because ChEMBLdb (that does have a CS feed) does not have the structure.
3) The capture of USANs by ChEMBL with a CS feed is welcome but it does seem paradoxical that the structure mappings surface in London but not Bethesda. However, the structure itself is not yet in ChEMBLdb because neither ChemoCentryx nor GSK saw fit to publish a primary medicinal chemistry SAR paper that would have been captured by ChEMBL.
4) I can only access one of the four PubMeds, but my guess at the reason for the mapping failures on the Bethesda side is that none of the papers explicitly specified a structure that the MeSH annotators could have picked up and eventually linked to a PubChem entry.
5) Legacy code number changes due to licensing are becoming more common and hence more problematic because of the "lost forward-mapping". In this case, the recent publications have included synonyms but we still have two PubMeds and three clinicaltrials.gov entries that only retrieve via CCX282-B (i.e. they are GSK-1605786 -ve and thus do not link forwards). Note here we have suffix ambiguity for usage of CCX282 +/- B even by ChemoCentryx themselves. The status of the synonym Traficet-EN is also unclear because it is quoted as a Trade Mark but is not an approved brand name for the INN (anyone know what -EN stands for?)