Archive for the ‘Diagnostics’ Category

The interleukin lottery: playing the odds on numbers 9 and 16

The interleukins are an odd family.  One name encompasses dozens of secreted proteins that are linked by function rather than by structure.  And even that common function is very broadly defined: cytokines that communicate between cells of the immune system.

Defined in such a way, it’s perhaps not surprising that the interleukins have yielded some of the best biomarkers of inflammatory disease conditions, and even more importantly are the targets for a growing range of antibody therapeutics.  Interfering with interleukins is to biologicals what GPCRs are to small molecule drugs.

As with GPCRs, though, despite the success of interleukins as biomarkers and drug targets, some members of the superfamily are extensively studied and well understood, while others lie on the periphery largely ignored.  Type interleukin-1 into PubMed and it returns a staggering 54,690 papers.  Repeat the exercise for the rest of the interleukins and you make an interesting discovery: although there is a slight downward trend across the family (probably reflecting the decreasing time since each was first described), there are a couple of striking outliers (Figure 1) – family members that are much less well studied than the rest.  IL-9 has only 451 citations, IL-16 has 414 and IL-20 just 98.

Figure 1 : PubMed Citations for the Interleukin Family in December 2011. Note the log scale.

Are they really less interesting?  Or does this just reflect the positive reinforcement of previous publications?  Once one paper links a particular interleukin with a disease or physiological process, a crop of papers exploring that link quickly appears, casting in concrete the random process of discovery.  If that’s correct, these unloved interleukins might make excellent targets for research and drug discovery.

Take IL-9 for example: what little is known about this cytokine certainly doesn’t paint a picture of a backwater function undeserving of attention.  IL-9 is a product of CD4+ T cells (probably one of the Th2 group of cytokines that includes the much-studied IL-4 and IL-5) that promotes proliferation and survival of a range of haemopoietic cell types.  It signals through the Janus kinases (JAKs) to modulate the STAT transcription factors (both of which are validated drug targets in inflammatory diseases).  Polymorphisms in IL-9 have been linked to asthma, and in knockout animal studies the gene has been shown to be a determining factor in the development of bronchial hyper-reactivity.

IL-16 looks no less interesting.  It is a little-known ligand for the CD4 protein itself (CD4 is one of the most extensively studied proteins in all of biology, playing a key role on helper T cells, as well as acting as the primary receptor for HIV entry).  On T cells, which express the T Cell Receptor (TCR) complex, CD4 acts as an important co-stimulatory receptor, recruiting the Lck tyrosine kinase (a member of the Src family, and itself an interesting drug target being pursued by, among others, Merck).  But CD4 is also expressed on macrophages, in the absence of the TCR, and here it is ligand-mediated signaling in response to IL-16 that is likely to be the dominant function.

Another interesting feature of IL-16 is the processing it requires for activity.  Like several other cytokines, such as TGF-beta, IL-16 needs to be cleaved to have biological activity.  For IL-16 the convertase is the protease caspase-3, which is the lynchpin of the apoptosis induction cascade, tying together cell death and cell debris clearance.

Like IL-9, polymorphisms in the human IL-16 gene have also been associated with chronic inflammatory diseases, including coronary artery disease and asthma.  But perhaps the most interesting observations relating to IL-16 come from biomarker studies.  Our own studies at Total Scientific in our extensive range of preclinical models of chronic inflammatory diseases have repeatedly found IL-16 to be the best marker of disease activity.   In human studies, too, IL-16 levels in both serum and sputum have been associated with inflammatory status, particularly in asthma and COPD but also in arthritis and IBD.

After years in the backwater, perhaps it’s time for the ‘ugly ducklings’ of the interleukin family to elbow their way into the limelight.  After all, the rationale for adopting either IL-9 or IL-16 as a diagnostic biomarker, or even as a target for therapeutic intervention, is as good as the case for the better-known interleukins.  But the competition is likely to be less intense.

Many years ago, the Nobel laureate Arthur Kornberg, discoverer of DNA polymerase, said “If, one night, you lose your car keys, look under the lamppost – they may not be there, but it’s the only place you have a chance to find them”.  Sound advice – unless, of course, there are twenty others already searching in the pool of light under the lamppost.  Maybe the twinkle of metal in the moonlight is your chance to steal a march on the crowd.

Dr. David Grainger
CBO, Total Scientific

Smoke Screen: The intensifying debate about population screening generates more heat than light

If a test with prognostic value exists, should it be used for population screening? On the face of it, it’s a simple question, but it doesn’t have a simple answer.  Like most things in life, it depends on the context: how prevalent and how dangerous is the disease?  How invasive and how expensive is the test?

So if we are dealing with cancer, which can be fatal if not diagnosed early, and a screening test such as a mammogram or a blood test for PSA, then it seems obvious that the case for population screening must be impregnable.  Such was the basis for the wave of enthusiasm for screening twenty or thirty years ago that led to the introduction of a number of national screening campaigns, of which mammography was only the most high profile.

But the pendulum has swung the other way: October 2011 saw the US Preventive Services Task Force conclude that the mortality benefit of PSA screening for prostate cancer was small to none, while in the UK the NHS announced a review of the evidence for the effectiveness of its flagship breast cancer screening programme, after recent research suggested the benefits were being exaggerated.

If earlier diagnosis really does improve the outcome for those patients, what can possibly be the problem?  The problems are two-fold: over-diagnosis and cost-effectiveness.

The “obvious” case for screening focuses entirely on the benefit gained by the ‘true positives’ – that is, the people who are correctly identified as having the disease.  On the negative side is the harm done to the ‘false positives’ – the people who are treated for the disease, but who did not really have it.  This harm can be significant, both physically and mentally.  Being told you have cancer can be traumatic enough (interpreted by many people, even today, as an automatic death sentence), but undergoing an unnecessary mastectomy, or having an unnecessary course of radiotherapy or chemotherapy is arguably even tougher.

A quantitative accounting of benefit and harm is tricky because the benefit (in terms of the harm avoided) and the harm of over-diagnosis (in terms of the side-effects of the treatment) are different in kind and so difficult to compare.   But the number of people affected by each outcome is easy enough to ascertain.  For a test with 90% sensitivity and specificity (so better than most diagnostic tests in clinical use), applied to a disease like breast cancer with an incidence of 5 per 10,000 per year, the numbers look something like this:

For every million people screened, you will make a correct early diagnosis in 450 of the 500 people who will go on to get breast cancer; the remaining 50 will be missed (but of course, all 500 would have had to wait until clinical symptoms were obvious in the absence of a screening programme).  That looks pretty good.

But a specificity of 90% means 10 ‘false positives’ in every hundred healthy people screened.  That is a shocking 100,000 or so people given a positive diagnosis when in fact they did not have cancer at all!

Suddenly, the performance of the test doesn’t look so great.  Of the roughly 100,400 people given a positive diagnosis, fewer than half of one percent really had cancer.  More than 200 people were given a wrong diagnosis for every one that was correctly identified.  Clearly, that’s not a good enough performance to initiate treatment (whether mastectomy or chemotherapy).

Even if the test had been 99% specific, the 10,000 or so ‘false positives’ would still outnumber the real positives by more than twenty to one.
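The arithmetic above can be sketched in a few lines of Python – a minimal illustration using only the figures quoted in the text (the function and its name are mine, not from any screening statistics library):

```python
# Illustrative screening arithmetic from the text: 1,000,000 people screened,
# incidence 5 per 10,000 per year, 90% sensitivity, 90% specificity.

def screening_outcomes(population, incidence, sensitivity, specificity):
    """Return (true positives, missed cases, false positives, PPV)."""
    cases = population * incidence            # people who truly have the disease
    non_cases = population - cases
    tp = cases * sensitivity                  # correctly diagnosed early
    fn = cases - tp                           # missed by the test
    fp = non_cases * (1 - specificity)        # wrongly given a positive result
    ppv = tp / (tp + fp)                      # chance a positive result is real
    return tp, fn, fp, ppv

tp, fn, fp, ppv = screening_outcomes(1_000_000, 5 / 10_000, 0.90, 0.90)
print(round(tp), round(fn), round(fp), f"{ppv:.2%}")  # → 450 50 99950 0.45%
```

Re-running with a specificity of 0.99 drops the false positives to around 10,000 – still more than twenty for every true case detected.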

What this quantitative analysis clearly shows is that to have any chance of being useful for population screening (at least for a relatively rare condition, such as cancer), the usual diagnostic performance criteria have to be replaced with a new paradigm in which it is the decimal fractions after the 99% specificity that are scrutinized before the test is introduced.  Few, if any, molecular tests can reach this level of performance (at least while retaining any useful degree of sensitivity at the same time).   The US Preventive Services Task Force was certainly right to conclude that PSA testing, which most definitely doesn’t approach this level of diagnostic performance, has little value when used in screening mode.

Let me correct that:  PSA testing, when used in screening mode, does a whole lot more harm than good.  The US Preventive Services review found that over a 10-year period, 15-20% of men had a positive test triggering a biopsy (of which at least 80% were false positives).  The biopsy itself is not free from harm, being accompanied by fever, infection, bleeding, urinary incontinence and pain.  But the damning evidence comes from the trials of intervention in prostate tumours identified through screening.  Here, there was a small reduction in all-cause mortality following surgery or radiotherapy, but only in men under 65; by contrast, there was a 0.5% peri-operative mortality rate associated with surgery and a big increase in bowel dysfunction and urinary incontinence in the radiotherapy group.  The review rightly concluded that the screening programme yielded questionable benefits at the cost of substantial harms.

With that kind of conclusion, there is no need even to enter into a cost-effectiveness assessment.  Clearly, population screening is inherently costly (because of the very large number of tests that must be performed).  Even when the unit cost of the test is very low indeed, the cost burden is substantial.  Even if there were a net benefit (and the argument is closer for mammographic screening in breast cancer than it is for PSA screening and prostate cancer), the cost-effectiveness of the screening programme would not approach the levels required to justify spending on a new therapeutic product (at least not based on current NICE cost-effectiveness frameworks).  A back-of-the-envelope calculation suggests that mammography would have to be at least 10-fold cheaper than at present to win approval if it were a therapeutic.

Proponents of screening are quick to argue that the solution lies in proper stratification before applying the test – so instead of screening the whole population, only a higher risk sub-group is screened.  The stratification might be on the basis of age, or symptoms or some other demographic (indeed, such stratification takes place even in the current ‘universal’ breast cancer screening programme in the UK, since males are not screened even though breast cancer can and does occur, albeit at a much lower prevalence, among men).

Fine.  But if you want to incorporate stratification into the screening paradigm, it’s critical that the data on the performance of the test are gathered using that same paradigm.  Overlooking this can over-estimate the value of a test that discriminates very well between disease and the general healthy population but discriminates poorly between the disease and similar maladies with which it shares symptoms.   This has proven to be the difficulty for many, if not all, of the new range of molecular colon cancer tests currently in development.  These molecular tests typically have reasonably good sensitivity and specificity when comparing colon cancer with the general healthy population (achieving, perhaps, 90% sensitivity and specificity in the best studies).  That, though, as we have already seen, is nowhere near good enough performance for adoption as a general population screening tool.

No matter, suggest the proponents of such tests: let’s instead use it only in people with symptoms of colon cancer (such as fecal occult blood, intestinal pain or changes in bowel habits, for example).  Now, with a prevalence of colon cancer of 10-20% in this group, a test with 90% specificity would be more attractive – at least now the number of real positives might (just) outnumber the ‘false positives’.  True, but only if the test still has 90% specificity in this selected patient group!  In most cases, sadly, diagnostic performance falls away once you have stratified the subjects, precisely because the chance of a positive test is increased by inflammatory bowel conditions as well as by cancer.  There is nowhere left to go: for a test like this, there is no application in which it is sufficiently useful to justify clinical adoption (even if it were not a premium-priced molecular test).
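How much a higher prevalence helps can be made concrete with the same arithmetic – a hypothetical sketch that assumes, as the text warns one should not, that 90% sensitivity and specificity survive the stratification intact:

```python
# Positive predictive value (PPV) as prevalence rises, assuming 90% sensitivity
# and 90% specificity are preserved after stratification (an optimistic
# assumption, as discussed above).  Prevalence figures are those quoted.

def ppv(prevalence, sensitivity=0.90, specificity=0.90):
    true_pos = prevalence * sensitivity          # fraction of tested: real cases found
    false_pos = (1 - prevalence) * (1 - specificity)  # fraction wrongly flagged
    return true_pos / (true_pos + false_pos)

for prev in (0.0005, 0.10, 0.20):  # general population vs symptomatic group
    print(f"prevalence {prev:.2%}: PPV {ppv(prev):.1%}")
```

At 10% prevalence the real positives exactly balance the false positives (a PPV of 50%); any loss of specificity in the symptomatic group quickly erodes even that.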

Janet Woodcock, Director of the Center for Drug Evaluation and Research (CDER) at the FDA, summed it up perfectly at the recent US conference on Rare Diseases and Orphan Products, saying “How can something that is so widely used have such a small evidence base?  The FDA has never accepted PSA as a biomarker for that very reason – we don’t know what it means.”

What the analysis presented here proves is that you need a low cost, minimally burdensome test with superb diagnostic power coupled with a reasonably prevalent, but very nasty, disease that clearly benefits from early diagnosis and treatment.  That’s a pretty demanding set of criteria.

Neither this analysis, nor the review of the US Preventive Services team, published on October 11th, proves that PSA screening is useless, because any such judgement depends on a subjective trade-off of benefits and harms (and in any case, some statisticians have been quick to point out inadequacies in the meta-analysis framework that was used).  But the evidence that prostate cancer really does benefit a great deal from early diagnosis and aggressive treatment is weak, and PSA testing certainly doesn’t have outstanding diagnostic performance.  So the weight of argument is certainly heavily stacked against it.

For colon cancer, there is no doubt that the disease is relatively prevalent and benefits from early diagnosis and treatment.  By contrast, the tests that are available (whether immuno-FOBT or newer molecular tests) are nowhere near good enough in terms of diagnostic performance to justify use in a screening programme.

For breast cancer, the case is the strongest of the three.  Again, there is clear benefit from early diagnosis and treatment, and the test itself has the greatest diagnostic power.  The question is simply whether it is good enough.  It will be interesting indeed to read the conclusions of Sir Mike Richards, National Cancer Director for the UK, who has been charged with reviewing the evidence.  It will be even more interesting to see whether they use this opportunity to attempt a cost-effectiveness assessment, using a framework similar to NICE, at the same time.  After all, the breast cancer screening programme is paid for out of the same global NHS budget as all the rest of UK healthcare, including, interestingly, treatment for breast cancer with expensive new drugs such as Herceptin™.  It would be fascinating to know whether screening or more rapid treatment once symptoms appear would result in the best use of the available cash for the benefit of breast cancer sufferers in the UK.  Sadly, if the nature of the debate on PSA is anything to go by, I doubt the review will yield that much clarity.

The emotional, but evidence-light, arguments in favour of screening exert enormous pressure on healthcare providers.  For example, the American Urological Association (AUA) condemned the US Preventive Services report on prostate cancer screening, saying the recommendations against PSA “will ultimately do more harm than good to the many men at risk for prostate cancer” – although they provided no evidence to support their emotive statement.  After all, the general population find it hard to imagine how screening can possibly be harmful.   The debate will no doubt continue generating much heat, and only a little light.  Sadly, despite all the evidence to the contrary it is very hard to see wasteful and possibly even harmful national screening programmes being halted any time soon.

Dr. David Grainger
CBO, Total Scientific

Personalized Medicine Demands Investment in Innovative Diagnostics: Will the Returns be High Enough?

Several very senior pharma executives were recently overheard by a journalist discussing what each of them viewed as the most important changes in the way healthcare will be delivered over the coming decade.  Each listed several such factors, including increased payor pressure on prices, the mounting regulatory burden and the shift toward orphan indications, but there was unanimity on just one factor: the importance of personalized medicine.

Personalized medicine is the great white hope for the pharmaceutical industry: by treating only the fraction of the population who can benefit from a particular medicine, efficacy and value-for-money are substantially increased.  But the prices set by Pfizer and Abbott for the lung cancer drug Xalkori™ (a dual c-Met and ALK kinase inhibitor) and its companion diagnostic (a FISH assay for translocations affecting the ALK gene) following its US approval last week, while unremarkable on the face of it, nevertheless raise questions about the personalized medicine business model.

Xalkori™ (crizotinib) will cost $9,600 per month, yielding $50k to $75k per patient for the full treatment regimen – expensive, but pretty much in line with other newly approved medicines for small patient groups (only about 5% of non-small cell lung carcinomas – those with translocations affecting the ALK gene cluster – are amenable to treatment with this drug).

The Vysis ALK Break Apart™ FISH probe test, from Abbott, which identifies the patient subset sensitive to treatment with Xalkori™, by contrast, will cost less than $250 per patient.  Again, this is entirely consistent with pricing structure of DNA-based diagnostics used in the clinic.

So if there is nothing surprising about these prices, what’s the problem?  The distribution of income between the drug developer and the diagnostic developer is heavily biased towards the drug.  It’s not as extreme as the unit prices for the products suggest, because the diagnostic is applied to a wider population in order to identify the treatable subset.  So with 100 non-small cell lung carcinoma patients tested with the diagnostic (raising $25,000 of revenue for Abbott), 5 will be identified as suitable for treatment with Xalkori™ (raising $375,000 of revenue for Pfizer), assuming full market penetration in both cases.  The diagnostic product, therefore, garners about 6% of the total spend on the test and drug combined.
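A back-of-envelope check of that split, using the approximate figures quoted above (actual prices and eligibility rates will of course vary):

```python
# Revenue split between companion diagnostic and drug, illustrative figures only.

patients_tested = 100
test_price = 250        # Vysis ALK Break Apart FISH test, per patient (approx.)
drug_price = 75_000     # full course of Xalkori, upper estimate per patient
eligible = 5            # ~5% of NSCLC patients carry ALK translocations

dx_revenue = patients_tested * test_price   # revenue to the diagnostic developer
rx_revenue = eligible * drug_price          # revenue to the drug developer
diagnostic_share = dx_revenue / (dx_revenue + rx_revenue)
print(f"diagnostic share of combined spend: {diagnostic_share:.1%}")
```

The diagnostic ends up with just over 6% of the combined spend – the 94:6 split discussed below.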

There are lots of obvious reasons why this is the case: the cost of developing the drug product was more than 10-times higher than the development costs for a typical diagnostic.  Drugs take longer to develop, and have a much higher risk of failure.  The regulatory hurdles are much higher for drugs than diagnostics.  And in any case, the need for the diagnostic only became clear because of the success of the drug.  In short, 6% of the overall returns for the diagnostic partner in such a situation sounds generous.

However, the situation in oncology, where the vast majority of companion diagnostic products currently on the market are located, hides a bigger issue: the difficulty of earning rewards for genuine innovation in the field of diagnostics.  In oncology, not a great deal of innovation is required on the companion diagnostic side, since the test is tightly tied to the mechanism of action of the associated therapeutic.  In such situations, there is virtually no technical risk associated with the development of the diagnostic product.  The only risks are regulatory risk (which is relatively easy to mitigate, at least for the big players who understand the process well) and the risk that the associated therapeutic fails to win regulatory or market acceptance – in which case sales of the diagnostic product will also be non-existent.

But in other indications, finding companion diagnostics will require much more innovation.  For example, in chronic inflammatory diseases, picking the people who might show the best responses to anti-TNFs requires something more innovative than tests for genetic variation in the TNF-α gene or its receptors.  Because the biology of inflammation is complex, predicting the responses to drugs (even those with well-defined molecular mechanisms) is a substantial challenge – a challenge that, for the most part, remains unmet.

Indeed, in some cases innovations in biomarker discovery might actually drive new therapeutic approaches:  the management team of Total Scientific, in collaboration with Imperial College, London, discovered that low circulating levels of the amino acid proline are a powerful new biomarker for osteoporosis, predicting fracture risk as well as low bone mineral density.  This finding not only suggests that a diagnostic assay for serum proline may be clinically useful, but that therapeutic strategies directed to modulating proline metabolism may also be effective.  Our innovation in biomarker discovery may ultimately open up a whole new field of bone biology, spawning multiple high value therapeutic products.

In these situations where innovation is required in both the diagnostic and therapeutic domains (which will probably prove to be the majority of personalized medicine product combinations), a business model that splits the revenues 94% to the drug developer and 6% to the diagnostic developer seems skewed.  If the driving innovative step came from the biomarker end (as in the example with proline), the team with the original insight may hope to reap at least half the reward.

There are two major reasons why this is unlikely to happen: firstly, there is a glass ceiling on price for a diagnostic product.  Paying more than $200 or so for a molecular diagnostic, no matter how innovative or complex, is contrary to almost every healthcare reimbursement system worldwide.  Secondly, the barriers to prevent competition against the therapeutic component of the product combination are very high indeed (both from regulatory and intellectual property perspectives).  But in marked contrast, the barriers to prevent another competing product being launched against the diagnostic assay component of the combination are very much lower.

These two factors will likely combine to restrict the return to innovators in the diagnostics space relative to those in the therapeutic space, irrespective of the apparent value of their innovation.

This state of affairs is bad for everyone.  It limits the incentive for real investment in biomarker discovery independent of therapeutic development, so the chances of finding innovative new companion diagnostics outside of oncology are materially reduced.  As a result, even though (for example) a new test to determine which RA patients might respond best to anti-TNFs would be beneficial to patients (avoiding exposing patients who will not benefit to the drug, and immediately giving them the opportunity to try something else without waiting six months to see if they responded), and beneficial to payors (by reducing the number of patients treated with an expensive drug), such a test may simply never be developed.  Indeed, the economics of such a test might sustain a price for the product well above $200.

Yet the second problem would then intervene to drop the price: competition.  Since it is (usually) impossible to protect the concept of measuring a particular analyte (and only possible to protect a particular methodological approach to its measurement), others would most likely be free to develop different assays for the same analytes.  As the regulatory hurdles for developing competing tests are low – particularly once the first test has been launched, since fast-followers need only demonstrate equivalence – it would not be long before the first product to successfully predict responses to anti-TNFs among RA patients was subjected to competition, driving prices back down again.

Subtle though they seem, the differences in the IP and regulatory landscape for diagnostic tests compared with therapeutics threaten the viability of the personalized medicine business model.  Delivering on the promise of personalized medicine for both patients and the healthcare industry requires allocation of capital to drive innovation in both biomarker discovery and identification of novel therapeutic targets.

At first sight, developing diagnostic products, as opposed to therapeutics, is relatively attractive.  The limited demand on capital, short timeline to product launch, low technical and regulatory risk and the substantial medical need all favour developing diagnostic products.  But not if the discovery component becomes lengthy and expensive.  In other words, developing “me-better” diagnostics makes a lot of commercial sense, but investing in genuine innovation in biomarkers still looks unattractive.  And it is precisely these highly innovative new diagnostic products that will underpin the delivery of personalized medicine.

What can be done?  Not a great deal in the short term, perhaps.  But in the longer term, much-needed reforms of the regulation of diagnostic products might raise the barrier to competition against first-in-class assay products.  The current regulatory framework for therapeutics is draconian, demanding very high levels of safety from every aspect of the drug product, from manufacturing to long-term side-effects.  By contrast, despite some tinkering in recent years, the diagnostic regulatory framework remains relatively lax.  Home-brew tests are introduced with little regulation of manufacturing standards, and the focus of the regulators is on the accuracy of the measurement rather than on the clinical utility of the result.  This leaves open a weak spot in the overall protection of the patient, since an inaccurate diagnosis (leading to incorrect treatment) can be as harmful for the patient as treatment with an inherently unsafe medicine.  Just because molecular diagnostics are non-invasive, it doesn’t mean their potential to harm the patient is zero.

There are moves to close this loophole, and an unintended consequence of such regulatory tightening will be an increased barrier to competition.  Perhaps a period of data-exclusivity, much as applies in the therapeutics world, could be added to further protect truly innovative diagnostic products from early competition.

Such moves are essential to make innovation in biomarkers as commercially attractive as innovation in therapeutics.  It will be difficult to achieve in practice, however, as pressure on healthcare costs ratchets up still further over the coming decade.  Competition, lowering prices, is on the surface attractive to everyone.  But it is the differing protection from competition between therapeutics and diagnostics that leads to skewed incentives to invest in innovation in one area rather than the other.  Let’s hope that once combinations of therapeutics and companion diagnostics start to appear outside of oncology, the relative pricing of the associated products properly reflects the innovation in each of them.  If it doesn’t, our arrival in the world of truly personalized medicine may be delayed indefinitely.

Dr. David Grainger
CBO, Total Scientific

Ultra-sensitive NMR-based diagnosis for infectious diseases: the tortoise races the hare again

Obtaining rapid and reliable diagnosis of infectious diseases is usually limited by the sensitivity of the detection technology.   Even in severe sepsis, accompanied by organ failure and admission to an intensive care unit, the causative organism is often present at a level of less than one bacterium per milliliter of blood.  Similarly, in candidiasis the yeast cells are present at vanishingly low levels in body fluids, while in chlamydia infections the pathogen is located intracellularly and is almost entirely absent from the blood.

All these (and many other) pathogens have evolved to escape detection by the immune system, and its antibody sensors.  This, coupled with the low levels of organisms in samples from infected individuals, means that antibody-based diagnostic tests rarely have enough sensitivity to be useful.

Then came PCR.  The big selling point of the polymerase chain reaction is its exquisite sensitivity, while retaining useful specificity.  Under optimal conditions you can detect a single DNA molecule with this technique.   Surely PCR was going to revolutionize infectious disease diagnosis?

Not really.  There are several problems.  First, the very low levels of infectious organisms in the samples mean that there is a very large amount of other DNA (from the host cells) in the sample.  Unless some kind of enrichment is performed, the PCR reaction cannot achieve the necessary sensitivity in the presence of so much competing DNA template.  Secondly, DNA from dead organisms is detected just as efficiently as DNA from live ones, and worse still, DNA released from dead organisms can persist in the blood for weeks and months.   Together, these issues lead to high rates of both false positive and false negative findings, and for many infectious diseases such simple PCR tests perform too poorly in the clinic to be of value.

A common solution that deals with both these problems is to culture the sample prior to running the test.  The rapid growth of the infectious organism enriches the sample with the target DNA template, and at the same time differentiates viable organisms from dead ones.  PCR on cultured samples usually achieves the necessary sensitivity and specificity to be clinically useful – but for severe disease, such as sepsis, the time taken to culture the sample (which may be several days) is critical when the correct treatment needs to be started immediately.

As a result, there is still a massive product opportunity for new infectious disease diagnostics.

One approach is to try to confer on the PCR tests specificity for live organisms, and at the same time improve the ability to distinguish template derived from the organism from the high levels of host DNA.  A particularly promising solution from Momentum Biosciences is to employ the DNA ligase enzyme from live bacteria to ligate added DNA template, creating an artificial gene that is then amplified by conventional PCR.  The product is still in development, but it offers real hope of a sepsis test that can identify live organisms in less than 2 hours.

But another potential solution comes from a much more surprising approach: using nuclear magnetic resonance (NMR) spectroscopy.  NMR offers exquisite specificity to distinguish molecules in a sample based on their chemical structure, a property that underpins the use of the technique in metabolic profiling.  However, as anyone who has ever tried to exploit this elegant specificity will tell you, the problem with NMR is its lack of sensitivity.  Even with cutting-edge equipment costing millions, the sensitivity limit is usually above 10µM (which equates to some 6 × 10^15 molecules per milliliter of sample).  Not much use, one might think, for detecting a single cell in a milliliter of blood.
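For the record, the conversion behind that figure is a straightforward Avogadro-number calculation, not specific to any instrument:

```python
# How many molecules does a 10 µM detection limit correspond to,
# per millilitre of sample?

AVOGADRO = 6.022e23            # molecules per mole
detection_limit_molar = 10e-6  # 10 µM = 1e-5 mol/L

molecules_per_litre = detection_limit_molar * AVOGADRO
molecules_per_ml = molecules_per_litre / 1000
print(f"{molecules_per_ml:.1e} molecules per mL")  # → 6.0e+15 molecules per mL
```

Fifteen orders of magnitude away from a single cell per millilitre – which is exactly why the T2 Biosystems approach described next does not try to detect the organism’s molecules directly.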

But T2 Biosystems, based in Lexington, MA, have found a neat solution to the sensitivity problem of both antibodies and NMR.  By coating highly paramagnetic beads with antibodies specific for the infectious organism, they can readily detect the clumping of these beads in the presence of very low levels of antigen.  Again, the test is in development, but the company announced last week the closing of a $23M series D investment to bring the system to market.

There is an attractive irony in using a technique famed for its ultra-low sensitivity to solve a problem where sensitivity of detection was the limiting factor.  In the race to find clinically useful diagnostic tests for many infectious diseases, just as in Aesop’s race between the hare and the tortoise, the super-sensitive PCR took a massive early lead and for a long time looked like the only winner in an arena where the major barrier to success was sensitivity of detection.  But the wily old tortoise is not out of it yet: an ingenious twist added to low-sensitivity NMR might still win the race to clinical and commercial success in the infectious disease diagnostics arena.

Dr. David Grainger
CBO, Total Scientific Ltd.

Chemokines as biomarkers for cancer: Time to revisit an old friend?

A wide-ranging study pre-published on-line in Nature last month points the finger at the chemokine CCL2 (also known as MCP-1, or JE in mice) as a key regulator of tumour metastasis.  Intriguingly, CCL2 seems to participate in the generation of clinically-relevant metastatic disease on multiple levels: it promotes seeding of the shed metastatic cells, but it also promotes establishment and growth of the micrometastases, a process that is dependent on VEGF production from a tissue macrophage subset that responds to CCL2.  All this nicely suggests that CCL2 (and its signaling pathway) may be an attractive therapeutic avenue for reducing the risk of metastasis.  The close links between the academic authors and the global pharmaceutical company Johnson & Johnson suggest that this avenue is already being aggressively pursued.

But what about CCL2 as a biomarker for detecting early metastasis and directing treatment?  The study shows that the density of CCL2-expressing macrophages in the region of the metastasis is associated with disease progression, so it seems plausible that measuring CCL2 levels in appropriate biological samples (whether tissue or blood) might be a productive investigation.

All this has special resonance for scientists at Total Scientific.  A decade ago, similar data (here and here) linking CCL2 to the mechanism of atherosclerosis and vascular restenosis prompted us, among others, to investigate whether circulating levels of CCL2 might be predictive of coronary heart disease.

The bottom-line finding (that CCL2 levels in serum are not linked to heart disease) was disappointing.  But the process of getting to that conclusion was highly instructive.  CCL2 binds to blood cells through both high affinity (receptor) interactions and lower affinity (matrix) associations.  The amount of CCL2 bound to signaling receptors is essentially irrelevant for the measurement of CCL2 in blood, but the lower affinity associations turned out to be much more significant.  As much as 90% of the CCL2 in blood is bound to the enigmatic Duffy antigen on red blood cells (enigmatic because this receptor seems to be related to chemokine receptors but lacks any kind of signaling function).  Worse still, this equilibrium is readily disturbed during the processing of the blood sample: anticoagulants such as heparin or EDTA shift the equilibrium in one direction or the other, altering apparent CCL2 levels.  Minor variations in the sample preparation protocol can have dramatic effects on the measured levels – whether between studies or within a study – not a good sign for a biomarker intended to achieve clinical and commercial utility.

And it’s not only ex vivo variables that affect the equilibrium: red blood cell counts differ between subjects, with women typically having lower red blood cell counts, and lower total CCL2 levels as a result.  Since women also have lower rates of heart disease, a widespread failure to recognize the complexity of measuring CCL2 in blood fractions most likely contributed to a number of false-positive studies.  Needless to say, almost a decade on from those positive studies, CCL2 has not found a place as a biomarker for heart disease, probably because, as we discovered, the reported associations had their origins in a subtle measurement artifact.

Does this mean CCL2 is unlikely to be a useful biomarker for metastatic potential among cancer sufferers?  Not at all.  But it does mean that studies to investigate the possibility will have to be much more carefully designed than is typically the case.  Learning from our previous experiences studying CCL2 levels in heart disease patients, the Total Scientific team has assembled the necessary tools to address this question in cancer.

However, an old adage among biomarker researchers comes to mind: “If it looks simple to measure, it probably means you don’t know enough about it”.

Dr. David Grainger
CBO, Total Scientific Ltd.

Biomarkers: lessons from history

The term “biomarker” has come into widespread use only recently.  Looking back over the last fifty years of the literature, its use exploded in the 1980s and 1990s, and it continues to grow today.  However, biomarker research as we now know it has a much deeper history.

Here we are going to focus on just one paper, published in 1965, twelve years before the term “biomarker” appeared in either the title or abstract of any paper in the PubMed database[i].  This is a paper by Sir Austin Bradford Hill, which appeared in the Proceedings of the Royal Society of Medicine entitled “The Environment and Disease:  Association or Causation?”.

Sir Austin neatly and eloquently describes nine factors that he feels should be taken into account when assessing the relationship between an environmental factor and disease.  These are:

  1. Strength
  2. Consistency
  3. Specificity
  4. Temporality
  5. Biological gradient
  6. Plausibility
  7. Coherence
  8. Experiment
  9. Analogy

In this blog we discuss the applicability of each of these factors to biomarker research today.  Before we do, though, it is important to note that the aims of biomarker research today are much broader than the primary aim of Sir Austin’s paper – which was to discuss how an observed association between the environment and some disease may be assessed for the degree of causality involved.  Only a very few biomarkers lie directly on this causal path (some biomarkers change in response to the disease itself, others are only indirectly associated with the disease and its causes), but crucially their utility does not depend upon a causal association.  Nevertheless, particularly when biomarkers are used to aid the identification of disease, there are clear parallels between Sir Austin Bradford Hill’s assessment of causality and our current need to assess utility.

1.  Strength. Sir Austin’s primary factor to consider in the interpretation of causality was the strength of the association.  He argues that the stronger the association between two factors, the more likely it is that they are causally related.  However, he cautions against the converse interpretation – that a weak association implies a lack of causality.  In fact, the strength of an association depends on the proportion of the variance in one factor that is explained by the other over the relevant sampling timescale.  In other words, there may be a completely causal relationship between X and Y, but X may be only one factor (possibly a small factor) controlling Y.  The remaining variance in Y may even be random fluctuation (so that X is the only factor causally associated with Y), yet the strength of the observed association will be weak, unless time-averaged measurements are taken for both variables.
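This point is easy to demonstrate with a toy simulation (Python; the choice of noise three times the size of the signal is purely illustrative).  Here X completely determines the systematic part of Y, yet the observed correlation is “weak”:

```python
import random

random.seed(1)

# X is the ONLY systematic driver of Y, but most of Y's variance is
# random fluctuation -- so the observed correlation is weak even
# though the relationship is completely causal.
n = 10_000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [x + random.gauss(0, 3) for x in xs]   # noise SD three times the signal

def pearson(a, b):
    """Plain Pearson correlation coefficient."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

r = pearson(xs, ys)
print(f"r = {r:.2f}")   # roughly 0.3, despite a fully causal link from X to Y
```

A correlation of around 0.3 would be dismissed as weak in many studies, yet here the underlying relationship could not be more causal.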

The strength of the association is probably an even more important factor for assessing the utility of biomarkers than it was for assessing causality.  Firstly, it is clear to all that the stronger the association between a putative biomarker and the disease under examination, the more likely it is to have clinical application.  However, as with the arguments for causality, there are important caveats to insert.  The clinical utility of a putative biomarker often depends upon the shape of the receiver-operating characteristic (ROC) curve, not just the area underneath it.  For example, a test whose specificity remains at 100% even at lower sensitivity may have far more clinical utility than a test where both sensitivity and specificity are 90% – depending on the application – even if the overall strength of the association is weaker.

It’s also possible to improve the strength of a crude association, for example by subsetting the patient population.  A given biomarker may perform much better in, say, males than females, or in younger people rather than older people.  The applicability of the biomarker may be restricted, but the strength, and hence the clinical utility, of the association may be improved dramatically.  Despite these caveats, the strength of the association is a good “first pass” screening criterion for assessing the utility of biomarkers – much as for Sir Austin it yielded a good “first guess” as to whether an association was likely to be causal.

2.  Consistency.  Sir Austin Bradford Hill puts this essential feature of any biomarker programme second on his list of causality factors.  He states “Has [it] been repeatedly observed by different persons, in different places, circumstances and times?”.  This is an absolutely crucial issue, and one on which many a biomarker programme has failed.  One only has to look at the primary literature to realise that there have been dozens of potential biomarkers published, of which most have not been validated, as indicated by the lack of positive follow-on studies.  Much of this attrition can be put down to study design, something that was discussed in an earlier blog.

3.  Specificity. The discussion of specificity by Sir Austin Bradford Hill is also highly relevant to today’s biomarker research.  We live in an ‘omics world’, with the ability to measure levels of dozens, hundreds or even thousands of potential biomarkers with an ease that must have seemed like science fiction in 1965.  As a result, it is often trivial (in both the technical logical sense of the word and the everyday one) to identify a biomarker apparently associated with a disease.  Consider, however, how a marker of inflammation might behave: it will likely be strongly associated with any selected inflammatory disease, but it is unlikely to have any specificity over other inflammatory conditions.  For example, serum levels of C-reactive protein correlate well with rheumatoid arthritis, but because they are also associated with dozens of other inflammatory conditions the marker has little clinical utility for the diagnosis of RA (although, of course, it may be useful for monitoring disease activity once a robust differential diagnosis has been secured by other means).  Again, this raises the issue of study design: preliminary studies are often set up with the aim of identifying differences in levels of biomarkers between subjects with disease and healthy controls.  Such studies may provide a list of candidates, but ultimately most of these will not show adequate specificity – an issue only identified when a more suitable control population is used.

4.  Temporality. This is perhaps the most obvious of Bradford-Hill’s concepts: for a causal relationship between X and Y, changes in X must precede changes in Y.  Similarly, it is more useful in disease diagnosis when a biomarker changes before the disease is manifestly obvious.  On the face of it, the earlier the change can be detected before the disease exhibits clinically-relevant symptoms, the more useful that advance warning becomes.  In the limit, however, differences that are exhibited long before the disease (perhaps even for the whole life of the individual, such as genetic markers) become markers of risk rather than markers of the disease process itself.

5.  Biological gradient.  This feature of biomarker studies is just as important as it was when Sir Austin discussed it in relation to the causality of associations.  Our assessment of the utility of a biomarker increases if there is a dose-response association between levels of the biomarker and the presence or severity of disease.  So, examining colorectal cancer for example, one might give greater weight to a biomarker whose levels are somewhat elevated in patients who have large polyps and strongly elevated in patients who have overt cancer.  A gradient of elevation across patients with different stages of cancer would also add to the plausibility of the putative biomarker (see below).

6.  Plausibility. Of all the criteria put forward in the paper by Sir Austin Bradford Hill back in 1965, we find this the most interesting.  Prior to the ’omics era, the majority of experimental designs were based on a hypothesis of some sort – that is, plausibility was inherently built in to every experiment, simply because the act of measuring most analytes or potential biomarkers was expensive in both time and money.  To Sir Austin, it must have been the norm rather than the exception that observed associations had at least a degree of plausibility.

In the modern era this is no longer the case.  Thousands of genes, metabolites or proteins may now be examined in a very short period of time and (for the amount of data obtained) at a very reasonable cost.  And because recruiting additional subjects into a clinical study is typically significantly more expensive than measuring an additional analyte or ten, one often finds that the resulting dataset for most modern studies is “short and fat” – that is, many more analytes (variables) have been measured than there were patients (observations) in the first place.  Moreover, there is often no particular reason why many of the analytes have been measured – other than that they form part of a multi-analyte panel or some pre-selected group of biomarkers.  Post-hoc justification becomes the norm.  It is almost impossible to avoid.  We find a few “statistically significant” differences[ii], and then rush to explain them either from our own background knowledge or by some hurried literature searches.  The sum of biological knowledge (or at least published data) is orders of magnitude greater than it was in Hill’s day, and nowadays it is entirely undemanding to construct a plausibility argument for almost any association one might find in such a trawl.
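The scale of the problem is easy to simulate.  In the sketch below (Python; the group sizes and analyte count are arbitrary choices for illustration), every “biomarker” is pure noise, yet dozens reach nominal significance:

```python
import random

random.seed(42)

# "Short and fat": 10 cases vs 10 controls, 1000 analytes measured,
# none of which is genuinely associated with disease status.
n, n_analytes = 10, 1000
T_CRIT = 2.101   # two-sided 5% critical value of Student's t with 18 df

def t_stat(a, b):
    """Two-sample t statistic for equal-sized groups."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    se = ((va + vb) / len(a)) ** 0.5
    return (ma - mb) / se

hits = 0
for _ in range(n_analytes):
    cases = [random.gauss(0, 1) for _ in range(n)]
    controls = [random.gauss(0, 1) for _ in range(n)]
    if abs(t_stat(cases, controls)) > T_CRIT:
        hits += 1

print(f"'significant' analytes by chance alone: {hits} / {n_analytes}")
# typically around 50 -- about 5% false positives with nothing real in the data
```

Around fifty “discoveries” from a dataset containing no real signal at all: each one ripe for a hurried plausibility argument.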

We caution strongly against this approach, however.  Tempting though it is to take this route, the likelihood that any biomarkers identified in such experiments have real validity is almost nil, and enthusiastic but unwitting over-interpretation is often the outcome.  This does not mean that such datasets cannot be mined successfully, but doing so is a job for a professional, wary of the pitfalls.  And no such biomarker should be considered useful until it has been validated in some well-accepted manner.

Interestingly, from the perspective of 1965, Sir Austin Bradford Hill came to the conclusion that it would be “helpful if the causation we suspect is biologically plausible”, but today we do not share that perspective.  Armed with so much published data, an argument for plausibility can be built for almost any association – this lack of specificity means that plausibility has little predictive value as a criterion for assessing utility.  He did, however, state that from the perspective of the biological knowledge of the day, an association we observe may be one new to science, and it must not be dismissed “light-heartedly as just too odd.”  This holds true as much today as it did then.  When faced with two associations, one plausible and one off-the-wall, plausibility is not necessarily the primary criterion we should apply to determine utility.

7.  Coherence.  Similar to plausibility, this criterion highlights that while there may be no grounds to interpret something positively based on currently available biological knowledge, there may nevertheless be reason to doubt data based on existing scientific evidence.  The arguments against using coherence to assess utility of candidate biomarkers are the same as for plausibility.

8.  Experiment.  This is another crucial factor that is just as relevant in today’s world of biomarkers as it was in 1965.  Sometimes the fields of diagnostic medicine and experimental biology are not as well integrated as they should be.  Interpretation of biomarker identification or biomarker validation experiments is often limited by the availability of samples or data.  However, there is much to be said for taking the information learnt in the examination of biomarkers in patients back to the bench.  Here much tighter control may be applied to your experimental system, and hypotheses generated in vivo may be tested in vitro.  This may seem back-to-front, but it is an essential feature of any well-designed biomarker programme that it be tested experimentally.  This may be possible in patients, but it may often be carried out more cheaply and quickly at the bench or in animal models of disease.

9.  Analogy. Analogy falls into the same category as plausibility and coherence.  The huge range of published data, much of it poorly conducted and/or not followed through, means that testing the validity of a finding by analogy to existing biological knowledge is becoming ever more difficult.  It is not analogy that’s needed, but consistency – and that means more well-designed experiments.

Perhaps it’s time to bring Bradford-Hill’s criteria bang up to date for the 21st Century?  Much of his pioneering work applied to assessing causality between environmental factors and disease is just as valuable in assessing modern biomarkers for clinical utility.  For the initial assessment of biomarkers, as data begins to emerge from the first discovery studies it is consistency and specificity that carry the greatest weight, with temporality, strength of the association and biological gradient only a short distance behind.  The key is to design efficient studies that allow each of these critical parameters to be assessed at the earliest stages of the biomarker discovery programme – too often biomarkers are trumpeted as ready for use before this checklist has been completed, and quite often before any experiment has even been conceived of that might properly test each of them.

Experiment is a crucial component of the eventual validation of any biomarker, but the effort involved means that preliminary prioritization of candidate biomarkers will likely have to be undertaken without it.  Our Total Scientific Criteria (with appropriate deference to Sir Austin Bradford Hill) for assessing the utility of biomarkers might look something like this:

  1. Consistency
  2. Specificity
  3. Temporality
  4. Strength
  5. Biological gradient

There may be inflation in almost everything in the modern world, but at least when it comes to criteria for judging the utility of biomarkers we have gone from nine criteria to just five.  The pleasures of living in a simpler world!

Dr. David Mosedale and Dr. David Grainger
CEO and CBO, Total Scientific Ltd.


References

[i] Source:  PubMed search carried out in March 2011.

[ii] We are deliberately avoiding discussion of what might be statistically significant in such a short and fat dataset.  Interestingly, Sir Austin’s paper finishes with a discussion on statistical tests, and their potential overuse back in 1965.  This is well worth a read!

FDA guidance on the use of biomarkers as drug development tools

Back in September the US Food and Drug Administration announced that it was going to delay its publication of draft guidance on the qualification of drug development tools, originally promised for the summer.  However, this draft guidance was finally published at the end of October.  While still in draft form, the Guidance substantially expands on the outline of the pilot qualification process given in an article written by two members of the Center for Drug Evaluation and Research published in 2007.

The new guidance principally provides information on the proposed administrative process that will be followed by the FDA in order to qualify new drug development tools (DDTs).  Qualification is defined as “a conclusion that within the stated context of use, the results of assessment with a DDT can be relied upon to have a specific interpretation and application in drug development and regulatory review.”  The document discusses two forms of DDT – biomarkers and patient-reported outcome scales.  There are a couple of points that bear discussion in relation to biomarkers.

Firstly, the new qualification procedure is aimed at enhancing the utility of a qualified biomarker across the industry.  Hence, while previously use of a biomarker may have been part of an NDA, IND or BLA, this new programme is designed to make public those biomarkers that satisfy the qualification process, so that future drug development programmes can take advantage of already knowing that the biomarker has been qualified for a particular purpose.  Of course, wherever a new biomarker is proprietary, it can be retained as such by not using the new qualification process, but by remaining part of the NDA, IND or BLA.  This new programme, it seems therefore, is not particularly aimed at individual companies, but more towards collaborative groups that together can share the burden of the development of the drug development tools and submission to the FDA.  Indeed, less than a month after the draft guidance was published, several major pharmaceutical companies and leading academic institutions announced such a collaborative biomarker consortium for COPD.

Secondly, while there is detailed information as to the administrative process, there is no information on the level of evidence required by the FDA to take a biomarker through from submission to qualification.  There are a number of discrete stages that have to be undertaken, but nowhere are the criteria on which a new biomarker will be assessed described.  The means by which such an assessment may be made are described: the first stage includes a consultation process between the submitter and the FDA, and formal assessment of the biomarker follows, to include discussion at internal FDA meetings, discipline reviews and potentially even public discussions.  However, the level of evidence required for success at each stage is not discussed.

It is tempting to suggest that this gap in the document is due primarily to the difficulty in formalising the criteria required for qualification of a biomarker.  The wide range of uses to which biomarkers may be put – whether to preselect individuals for study or treatment, inform about disease progression, predict drug efficacy or toxicity, or follow treatment dynamically in an individual – makes it difficult to put together a priori criteria that will apply in all cases.  If this supposition is true, and each biomarker will be assessed on its own merits with no reference to pre-determined criteria, the new qualification procedures do at least give the scientific community the ability to comment and provide feedback on decisions made, since all new qualifications will be published.  Indeed the guidance states “Once a DDT is qualified for specific use, the context of use may become modified or expanded over time as additional data are collected, submitted, and analyzed.  Alternatively, if the growing body of scientific evidence no longer supports the context of use, the DDT qualification may be withdrawn.”  Of concern, such scrutiny will not be applied to proprietary biomarkers submitted as part of INDs, NDAs or BLAs, but some scrutiny and sharing of validation study data is at least a move in the right direction.  The FDA’s qualification process seems likely to stimulate a further increase in the utility of biomarkers as drug discovery tools.

It should be noted that the Guidance Document is still in draft form.  It was published on October 25, and the FDA is asking for comments and suggestions to be submitted by January 24, 2011, for consideration when preparing the final document.

Biomarkers: standing the test of time means good initial study design

It feels like every other day that another putative biomarker is identified that will predict the presence or extent of some disease or another, usually with an absurdly low p value.  So, if these biomarkers are so common, why is the subsequent commercialisation and clinical use of these potential diagnostics so difficult to achieve?

I believe that the biggest of the problems is in the design of the studies carried out, particularly in the early stages of biomarker research.

The first stage in the development of a biomarker is the study in which it is identified.  This initial study is usually designed to maximise the difference in phenotype between your control subjects and patients with disease.  This is usually thought of as the best way of identifying a biomarker for the disease in question.  However, it should always be remembered that the result of your experiment will always be dictated by its design.  Assuming that the scientific aspects of the study are carried out rigorously, the best outcome of a biomarker identification study can only be a biomarker that best distinguishes your two study groups.  However, the groups of subjects studied in the first identification of a biomarker are rarely those that a clinician will want to discriminate between.  A clinician is seldom faced with the need to determine whether a sample is indicative of a patient with disease or a healthy individual.  More typically their problem is in distinguishing different underlying pathologies with similar symptoms or, in the context of screening, in determining which of two apparently healthy subjects has underlying asymptomatic disease.

Once this initial mistake in study design has been made, it leads to something we all see often – diagnostic power that gradually shrinks the more you work with a candidate marker.  After the first study has been carried out, you try to repeat your preliminary work, usually with greater numbers of patients.  Another clinician is brought on board, or an additional clinical site.  The study is repeated, and the ability to distinguish subjects with disease from those without is markedly weaker than in your first study.  This should be entirely expected.  Your biomarker identification study looked for the difference between healthy subjects and those with your target disease.  Often your follow-up studies are not actually testing the same thing.  Now you are looking at distinguishing subjects with disease from subjects with similar symptoms, but who may have completely different underlying pathologies.  It should come as no surprise that your sensitivity and specificity have dropped.

At this stage consternation sets in for you and your investors.  You feel that there must be some improvement that can be made in the measurement of your biomarker.  Protocols are tightened up, samples re-assayed, statisticians called in.  But they are all in vain.  Your initial choice of biomarker was flawed, and it is all too late.

So what is the answer to this chain of events?  It is actually quite simple.  When you are designing your early biomarker studies, make absolutely sure that you have done all of your homework.  Understand exactly what problems clinicians face when trying to identify particular pathologies, and target your FIRST clinical studies accordingly.  You might be less likely to find that wonderful biomarker in the very first study, but you will quickly find yourself in one of two situations: either you will fail early (and cheaply, which your investors will thank you for) or you will find a biomarker that is much more likely to stand the test of time.

David Mosedale


Total Scientific

Total Scientific Ltd. is a contract research organisation that specialises in biomarkers.

The use of biomarkers is playing an ever-increasing role in both the pre-clinical and clinical phases of drug discovery, as well as its more traditional role as a core activity for many diagnostic companies. From target identification and validation, through pre-clinical and early clinical phases, the ability to predict or follow drug effects in vivo can significantly reduce the cost and time taken to develop new drugs.