Making Discoveries Using Published Articles


THE SHORT READ:

 

The Question

Can you make potential scientific discoveries just by reviewing and reading already published research literature?

 

The Answer

Yes, if you perform a smart search of published research, it is possible to find important connections in existing knowledge that represent potential discoveries.

 


THE LONG READ:

 

How the Study Was Designed and What the Researchers Learned

A group of medical doctors, engineers, and industry and military researchers set out to find improved ways to treat four diseases (Raynaud’s phenomenon, multiple sclerosis, Parkinson’s disease, and cataracts) as well as improved ways to purify water.  They used a special approach called “literature-related discovery”.

Literature-related discovery means searching large databases of published research (such as well-known and well-respected medical websites, academic journals, clinical journals, and databases like Medline) to find unnoticed connections that might lead to new scientific knowledge or treatment programs.

The article summarized here is actually one of eight papers (by these same authors) in an entire special issue of the journal dedicated to applying this method and sharing what they learned.  The other articles in the issue demonstrate “proof-of-principle”: that their method of reviewing published literature can be used to identify potential new medical discoveries.

 

What is the basic idea of “literature-related discovery”?

The basic logic behind why this method works adds up like this:

  1. There is a lot more technical literature than any one person can read.
  2. With so much information out there, there are probably important links between published content that have not been noticed.
  3. Identifying even a small set of these links could lead to a lot of significant scientific discoveries.

The authors define two types of discovery using published literature:

  1. “literature-based discovery” where only published items, like articles and books, are analyzed.
  2. “literature-assisted discovery” where published items are analyzed and input from the authors of that content and other experts is added to the mix, through things like proposals, workshops, and panels.

 

How do the authors try to make discoveries using published literature?

The general idea is this:

They start by picking a problem that needs to be solved.  Then they come up with possible answers (for example, disease treatment options) by mining published writing using keyword and phrase searches.  Once they have possible answers, they review them for their discovery potential and use them to come up with even more targeted keyword searches to try on a bigger set of research articles.  They keep repeating this process until they’ve created a good list of potential new links.  Once they’ve got this list, they check whether someone else has already discovered an item on it by reviewing patents and articles and by consulting with experts on their team and outside experts.  Anything that passes this last cut is labelled as a potential discovery.
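The loop described above can be sketched in a few lines of Python.  This is only a rough illustration, not the authors’ actual procedure: the function names, the toy documents, and the fixed vocabulary are all invented for this example, and the `extract_terms` step stands in for what is really careful human review.

```python
from typing import Callable


def iterative_search(
    seed_terms: set[str],
    corpus: list[str],
    extract_terms: Callable[[list[str]], set[str]],
    already_known: set[str],
    max_rounds: int = 5,
) -> set[str]:
    """Search, pull new terms from the hits, and repeat until nothing new turns up."""
    terms = set(seed_terms)
    candidates: set[str] = set()
    for _ in range(max_rounds):
        # Keep every document that mentions at least one current search term.
        hits = [doc for doc in corpus if any(t in doc.lower() for t in terms)]
        new_terms = extract_terms(hits) - terms
        if not new_terms:
            break
        candidates |= new_terms
        terms |= new_terms
    # Final cut: drop anything already reported (patents, articles, expert knowledge).
    return candidates - already_known


# Hypothetical toy data; the sentences and vocabulary are made up for illustration.
VOCAB = {"factor a", "remedy b", "factor c", "topic d"}
corpus = [
    "condition x involves factor a",
    "remedy b changes factor a",
    "remedy b also affects factor c",
    "an unrelated note about topic d",
]


def extract_vocab_terms(docs: list[str]) -> set[str]:
    """Stand-in for human review: return the vocabulary items found in the hits."""
    return {v for v in VOCAB for d in docs if v in d}


result = iterative_search({"condition x"}, corpus, extract_vocab_terms, {"factor a"})
print(sorted(result))
```

Each round widens the search a little, so documents that never mention the original problem can still be reached through terms picked up along the way.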

The trick is to come up with a really good set of search terms and phrases, pulled out of the language used in the research literature itself, that will zoom in on key concepts that appear in pairs of research items related to the problem to be solved.  These could be new links and potential discoveries.
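One way to picture “key concepts that appear in pairs of research items” is with a small sketch.  It assumes each document has already been reduced to a set of concepts (a big assumption, since that is the hard part), and the function name and toy data are invented here rather than taken from the paper.

```python
from collections import defaultdict


def find_indirect_links(docs: list[set[str]], problem: str) -> dict[str, set[str]]:
    """Map each candidate concept to the bridging concepts that tie it to `problem`.

    A bridging concept shares a document with the problem.  A candidate shares a
    document with a bridging concept but never with the problem itself.
    """
    problem_docs = [d for d in docs if problem in d]
    bridges = set().union(*problem_docs) - {problem}
    links: dict[str, set[str]] = defaultdict(set)
    for doc in docs:
        if problem in doc:
            continue  # directly related documents hold no hidden links
        for bridge in doc & bridges:
            for candidate in doc - bridges:
                links[candidate].add(bridge)
    return dict(links)


# Hypothetical concept sets, one per document.
docs = [
    {"condition x", "factor a"},
    {"factor a", "remedy b"},
    {"remedy b", "factor c"},
    {"topic d", "topic e"},
]
links = find_indirect_links(docs, "condition x")
print(links)
```

In this toy case “remedy b” never appears alongside “condition x”, yet both appear alongside “factor a”, so it surfaces as a possible link worth a human look.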

This process is “iterative”:

The researchers start by reading and getting to know their problem area well.  Once they have a set of phrases and terms to search for, they pull together a batch of published articles that match their search terms.  Then they use special techniques called “clustering” to see if the material they have groups itself into particular themes.  These themes are key concepts that relate to the problem they are trying to solve.

From each theme (or “cluster”) they pull out specific phrases from the literature to use as new search terms.  Then they search a new set of articles using these new search terms.  Each time they do this, they refine their search terms and they change the set of research they are mining, to try and uncover new potential links.
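To make the clustering step more concrete, here is a minimal sketch.  The summary does not say which clustering technique the authors used, so this swaps in a very simple greedy grouping based on word overlap (Jaccard similarity) and then picks the most frequent two-word phrases in each group.  The function names, the threshold, and the toy sentences are all assumptions made for this example.

```python
from collections import Counter


def tokens(text: str) -> list[str]:
    return text.lower().split()


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of words two documents have in common."""
    return len(a & b) / len(a | b) if a | b else 0.0


def greedy_clusters(docs: list[str], threshold: float = 0.3) -> list[list[str]]:
    """Put each document in the first cluster whose first member is similar enough."""
    clusters: list[list[str]] = []
    for doc in docs:
        words = set(tokens(doc))
        for cluster in clusters:
            if jaccard(words, set(tokens(cluster[0]))) >= threshold:
                cluster.append(doc)
                break
        else:
            clusters.append([doc])  # nothing similar yet, so start a new theme
    return clusters


def top_phrases(cluster: list[str], n: int = 2) -> list[str]:
    """Return the n most frequent two-word phrases in a cluster."""
    counts = Counter()
    for doc in cluster:
        words = tokens(doc)
        counts.update(" ".join(pair) for pair in zip(words, words[1:]))
    return [phrase for phrase, _ in counts.most_common(n)]


# Toy documents, invented for illustration.
docs = [
    "membrane filtration removes particles",
    "membrane filtration removes bacteria",
    "solar heating kills bacteria quickly",
    "solar heating kills viruses",
]
clusters = greedy_clusters(docs)
for cluster in clusters:
    print(top_phrases(cluster))
```

The phrases printed for each cluster are the kind of thing that could be fed back in as new search terms for the next round.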

The study authors always start with a set of materials that is directly related to solving their problem and they move toward looking for answers in published research that is indirectly related to the problem they are trying to solve, each time using their new search terms to pull out key concepts that appear in the literature.

 

What problems did they have when trying to use published literature to make new discoveries?

Some of the older entries in the research databases they used were incomplete.  For example, many papers published before 1975 did not have abstracts, and obtaining full-text copies of all the articles was too labor-intensive or expensive.  Since this discovery method depends on the language used in published articles, it is only as good as the range of articles it can search.  Incomplete research records or entries undermine the power of this method.

Also, because of the very large number of search hits that needed to be reviewed by an actual person, non-experts (in this case, those without medical expertise) were recruited to help in the search.  The authors note that while non-experts were as good as experts at identifying obvious connections (for example, food-based links), they were not able to identify subtle connections (like biological protein mechanisms) that rely on technical jargon.

 

What lessons did the authors learn to help make their literature searches better?

The authors pointed out two things that they believed made their searches of the research literature more effective at making potential discoveries than other attempts.

First, the sheer volume of possibly relevant research documents that a search finds can be overwhelming to people trying to sort through them manually.  Prior studies solved this problem by applying “priority ranking” schemes: they would attempt to rank the value of potential discovery links based on things like the frequency with which a phrase appears or other numerical cut-offs (emphasizing the quantity of matches).  However, the authors point out that this approach may overlook or undervalue a lot of potential discoveries.  The authors did not use these kinds of numerical cut-offs.  Instead, to limit the amount of information people had to review, they combined more search terms (emphasizing the quality of the match), which automatically reduces the number of matches a search will find.
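The effect of combining search terms is easy to see in a small sketch.  This assumes the simplest possible search, where a document matches only if it contains every term; the function name and the toy corpus are invented for this example and are not from the paper.

```python
def search_all(corpus: list[str], terms: list[str]) -> list[str]:
    """Return the documents that contain every term (an AND-style query)."""
    return [doc for doc in corpus if all(t in doc.lower() for t in terms)]


# Toy corpus, invented for illustration.
corpus = [
    "Water filtration with membrane stacks",
    "Water filtration with sand beds",
    "Water heating in solar stills",
    "Air filtration in clean rooms",
]

broad = search_all(corpus, ["water"])
narrower = search_all(corpus, ["water", "filtration"])
narrowest = search_all(corpus, ["water", "filtration", "membrane"])
print(len(broad), len(narrower), len(narrowest))
```

Adding a term can only keep or shrink the list of hits, never grow it, so the reading load drops without anyone having to set a frequency cut-off.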

Second, the investigators improved their method’s discovery potential by applying a definition of discovery (“novel, interesting, plausible, and intelligible knowledge”) that focused on quality over quantity and emphasized newness. As a result, compared to other attempts at making discoveries using just research literature, they found more genuinely new options.  Other prior studies tended to unearth many already known discoveries, but few to no new discoveries.

 

What the Study Teaches Us

This study shares lessons that the authors learned from trying to find new ways to treat four different chronic diseases and to improve water purification methods by carrying out a structured search of existing research literature to find new connections.

Overall they demonstrate proof-of-principle: that by reviewing large amounts of technical literature on topics related (both directly and indirectly) to a problem we are trying to solve, we can find helpful new pathways to pursue further discovery.  This is even more true if we use techniques that encourage us to notice key themes in these subject areas and how those themes re-appear in other indirectly related areas.

The study authors noticed that, on the one hand, literature discovery methods tend to find fewer potential discoveries in literature directly related to the problem, and that those discoveries tend to come from topics published less frequently.  On the other hand, literature discovery searches often find more potential discoveries in literature not directly related to the problem.  They comment that this may be because it’s harder to notice connections in both of these cases (too few papers to read and too many papers to read).

At the end of their paper, the researchers point out that one challenge is not finding potential discoveries but following up on the large number of potential discoveries found.  Another challenge is that the focus of discovery is usually on “individual discoveries” (identifying one key concept that leads to a breakthrough).  But most literature-based discovery methods have not considered how to mine existing research to find “synergistic discoveries” (combining multiple key concepts or techniques to expand the frontiers of knowledge).

Lastly, the authors champion the importance of scientists themselves, over computational and data mining tools, to identify potential discoveries in existing literature.  They stress that while algorithms can assess quantity (like how frequently two concepts are linked), discovery is really about quality (what is the mechanism behind a link and is it important).  And humans are better equipped to recognize quality-based characteristics than computers are.


PUT IT IN ACTION:

 

Three Things to Try

 

(1)  Make conscious search choices: humans can recognize value, but algorithms can’t.

(2)  Use general search terms when you’re looking for new connections or ideas.

(3)  Try combining multiple concepts to come up with new discovery possibilities.

 


THE FINAL WORD:

 

Best Quote from the Study Authors

“The final step of linking the Abstracts to the core literature is the time consuming step.  But…suppose that an extra six man-months were required to read and evaluate the Abstracts.  This might add another $100 K to the total study cost.  The potential discovery could make major inroads on many chronic diseases in the medical field, and similar magnitude advances in the physical sciences.  If we have truly found a path to discovery, these additional costs due to labor intensity are miniscule compared to the potential payoff.  While we would prefer to eliminate these labor intensive costs if possible, we would rather incur these costs if the alternative is to use automation and lose most of the discovery.” (page 293)

 


Full Citation

Kostoff, Ronald N., Block, Joel A., Solka, Jeffrey L., et al.  “Literature-related discovery (LRD): Lessons learned, and future research directions”, Technological Forecasting & Social Change, volume 75, issue number 2, 2008, pages 276-299.  (6 pages + 17 pages of appendices)
