NATURAL LANGUAGE PROCESSING AND

Identifying Chemical Compounds (Avoiding Redundancies)

One of the most promising examples of a machine harnessing human knowledge at a scale no individual could match is the use of natural language processing in drug discovery. The technology lets researchers synthesize massive public and private datasets, in both tabular and unstructured formats, relatively quickly.


As an example of natural language processing in action, at the end of 2016 Pfizer announced its efforts in “accelerating research into immuno-oncology,” a medical approach that seeks to use the body’s immune system to battle cancer. Pfizer’s researchers use natural language processing to analyze over a million articles in medical journals, 20 million abstracts of journal articles and 4 million patents. The platform pores over studies and identifies test results, perhaps results that no person has read for decades, that might shed light on possible treatments.
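To give a rough sense of what "identifying test results" in a large body of text can mean at its simplest, here is a minimal Python sketch that scans abstracts for sentences mentioning two terms together. It is an illustration only, not Pfizer's platform: the abstracts, the document IDs, and the function name `find_co_mentions` are all invented for this example, and real systems rely on far more sophisticated language models than plain substring matching.

```python
import re

# Hypothetical abstracts, invented purely for illustration.
ABSTRACTS = {
    "doc-1": "Compound A reduced tumor growth in mice. No toxicity was observed.",
    "doc-2": "We describe a new assay. Compound B had no measurable effect on tumor growth.",
    "doc-3": "Patient recruitment remains difficult for rare cancers.",
}


def find_co_mentions(abstracts, term_a, term_b):
    """Return (doc_id, sentence) pairs where both terms occur in one sentence."""
    hits = []
    for doc_id, text in abstracts.items():
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            lowered = sentence.lower()
            if term_a.lower() in lowered and term_b.lower() in lowered:
                hits.append((doc_id, sentence))
    return hits


hits = find_co_mentions(ABSTRACTS, "compound", "tumor growth")
for doc_id, sentence in hits:
    print(doc_id, "->", sentence)
```

Note that simple co-mention surfaces both the positive finding in doc-1 and the null result in doc-2; telling those apart is exactly the kind of work that requires real language understanding rather than keyword matching.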


Data science helps pharma companies gain insight from far more than previous clinical studies, however. Just as critical is the way they use it to analyze the millions of medical treatments that are taking place in real time across the world.


Because medical records are increasingly available in the cloud, drug companies can view and analyze which medications patients are being prescribed and the resulting outcomes. That type of information can’t be used to get a new drug approved, but it’s extremely valuable in guiding researchers toward hypotheses that they can test in a clinical trial.


Being able to expand the scope of analysis to millions of real-world patients is also important simply because it includes many who are either typically excluded or underrepresented in clinical trials:

  • The elderly or those with severe mobility limitations, for instance, rarely make it into a trial group.
  • Members of ethnic or racial minorities are also likely to be underrepresented.
  • Finally, while it may be challenging to assemble a clinical trial focused on a rare condition (men with breast cancer, for instance), real-time data generated across the country and the world can provide a large enough sample size for analyzing treatments and outcomes for even very rare conditions.
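
The sample-size point in the last bullet comes down to simple arithmetic, sketched below in Python. The population sizes and the prevalence figure are hypothetical placeholders chosen only to make the calculation easy to follow; they are not real epidemiological data, and the function name `expected_cases` is likewise just for illustration.

```python
def expected_cases(population, prevalence_per_100k):
    """Expected number of patients with a condition in a given population."""
    return population * prevalence_per_100k / 100_000


# Hypothetical figures for illustration only (not real epidemiological data).
PREVALENCE_PER_100K = 2        # assumed: 2 cases per 100,000 people
SINGLE_SITE_PATIENTS = 50_000  # assumed: patients reachable by one trial site
POOLED_PATIENTS = 20_000_000   # assumed: patients covered by pooled records

single_site = expected_cases(SINGLE_SITE_PATIENTS, PREVALENCE_PER_100K)
pooled = expected_cases(POOLED_PATIENTS, PREVALENCE_PER_100K)

print(f"Single site: about {single_site:.0f} expected case(s)")
print(f"Pooled records: about {pooled:.0f} expected cases")
```

Under these assumed numbers, a single site would expect roughly one eligible patient, while pooled records would cover several hundred, which is the difference between an anecdote and an analyzable sample.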