Lunchtime Provocation: Emily Singley

AI & Machine Learning in Scholarly Publishing: Services, Data, and Ethics

NISO Plus Forum, October 3, 2023, Washington DC

Delivered as a “Lunchtime provocation” in response to the prompt: What is the largest potential disruption regarding AI/ML from your perspective?

What is the largest potential disruption regarding AI/ML from your perspective? Are there corresponding opportunities, and if so what are they? 

AI has the potential to reinvent how we communicate about science - how we create the scientific literature, how we interrogate and interact with it, and how well we all, as a broader society, understand the scientific literature.

AI models have the potential to make science communication not only vastly more efficient, but also much more inclusive and universal.

This reinvention is already well underway.

How will searching change? What if, instead of “searching” or “discovering” information in the literature, we could interrogate it and enter into a dialogue with it?

  • Think about how we’ve traditionally searched the literature – typing in a few keywords into a scholarly database, getting back thousands of articles 
  • Then painstakingly sifting through those to find what we are looking for
  • That’s a big waste of time
  • AI promises to change that
  • AI has the potential to eliminate the “needle in a haystack” problem - when you need to find something really specific that’s buried in the text of a paper
  • Example:
    • My friend Jenny is a neuroscientist who studies decision making in monkeys using MRI protocols
    • One time, her protocol wasn’t working the way she expected, and she couldn’t figure out why
    • She needed to find other people who were doing similar experiments, using similar protocols
    • Now, that is a really difficult thing to search for - papers aren’t organized or tagged by protocol, and that information is often buried in a passage of text
    • Jenny didn’t really care whether the research topic was the same - it didn’t need to be decision making, or even monkeys; it was the methodology she was interested in
    • She got really frustrated and spent a lot of time on the search - time she could have spent on experiments and advancing scientific discovery
  • In the not too distant future, every single search box is going to be AI-assisted, able to find these kinds of answers instead of spitting back lots of papers
  • And we will see more and more LLMs pointed at trusted, reliable data sources, and you will be able to trace those answers back to the peer-reviewed citations
  • This saves researchers like Jenny time - more time to experiment, less time sifting through papers

How is the creation of scientific literature changing? How will writing change? 

  • GenAI is already assisting in writing scientific papers
  • Publishers - including Elsevier - already allow submission of AI-assisted papers – so long as the use of AI is documented and transparent 
  • We will likely see standards and conventions emerge to indicate AI-generated text
  • And publishers will continue to spend significant resources developing more sophisticated tools to vet the authenticity, quality, and validity of science communication - and AI will be key to helping us do that
  • Communicating science findings might, in the future, not even revolve around the static “paper” anymore
  • Why not let data talk directly to data - why not let Jenny’s MRI outputs and lab notebooks and datasets inform a subsequent researcher’s experiments directly, without the intermediary static step of a “paper”?
  • This is the promise we see when AI models converge with Open Science

How will understanding the scientific literature be reinvented? How will reading change? 

  • Summarization and synthesis of the literature is a clear use case for genAI
  • This will make it easier for interdisciplinary researchers to grasp unfamiliar subject areas, and for non-scientists to understand complex topics
  • And unlike ChatGPT, the tools that are emerging summarize trusted, accurate databases of research and point to real citations in the peer-reviewed literature, so you can easily validate the underlying sources
  • Scopus AI is currently in beta testing with users - you can type in a topic and get back a summary based on the 90 million peer-reviewed article abstracts in Scopus 
  • The summary includes accurate citations for the papers it is based on
  • We are also experimenting with generating policy briefs on scientific topics - an advance that could be very useful for scientific advisors who work with lawmakers and regulatory bodies

Translation 

  • Another way genAI will reinvent how we read and understand the scientific literature is through translation services
  • There is now the potential for scientists to understand one another regardless of their native language 
  • Imagine if an Egyptian scientist could communicate her findings in Arabic and the rest of the world could easily comprehend them?  
  • Think about the barrier that removes for her
  • This has the potential to greatly increase our global research output, as well as make the scholarly communication ecosystem more equitable and inclusive 
  • This will finally be like having a “fish in your ear”

So these are just a few of the ways I see search, reading, and writing beginning to radically transform.

What impact will this have on the information professions? What are the opportunities?

Opportunity for librarians to become metadata heroes

  • This is something that librarians have done for a long time - metadata creation is a core role for our profession - organizing and structuring data
  • Structured data has become a critical resource
  • It is what underpins these large models – and the better the structure, the more accurate and effective the AI can be
  • But too many large scientific datasets do not have consistent, reliable structured data
  • There is an urgent need for annotating, tagging, and structuring massive datasets so that AI models can analyze and make sense of them more accurately
  • Example: 
    • My friend Ajit is a researcher who is working on a new biomarker to be used with cancer therapies 
    • He needed to query a massive biomed repository to see if others were doing similar work 
    • The repository consisted of millions of large datasets – and was too large to be queryable by traditional means, so he needed to build and use an AI agent to search and interrogate all the data
    • But his agent kept spitting back garbage – why? Because the datasets weren’t consistently structured. This was very frustrating for him and wasted a lot of his time 
    • “I know the information I need is there, I just can’t get to it” - again, needle in a haystack
  • People like Ajit should be in the lab running experiments, not wrangling data
  • So we need more people – and I think they should be librarians – who have the data science skills to write tools that will automate structuring these large scientific datasets
  • MARC is not going to solve this problem
  • We need next-gen metadata librarians - we need librarians to become data scientists 
  • We need to train librarians to meet the metadata needs of today – rather than of yesterday

Conclusion 

  • GenAI will significantly disrupt how we read, write, and interact with the scientific literature 
  • It has great promise for accelerating scientific discovery, by making the communication of findings more efficient and minimizing the time spent querying the literature
  • Scientists will spend more time thinking, questioning, and experimenting, and less time slogging through databases and writing up lit reviews 
  • This is a good thing for humanity