| designation | D1-004 |
|---|---|
| author | andrew white |
| status | done |
| prepared date | October 18, 2024 |
| updated date | October 19, 2024 |
| model | vertex_ai/gemini-1.5-pro-002 |
| papers | 514 |
| cost | $0.65 |
| facts | 692 |
Abstract: I wanted to generate facts about crows derived from the primary scientific literature, and ended up with 692 of them. So here's one way to generate factoids by running LLMs over hundreds of research papers. You can see the facts, their source papers, and the context from which the LLM derived each fact below.
I recently led a project on creating super-human agents that can write literature reviews and answer specific questions.1 This agent is cool, but it can't do one specific task: looking at every possible paper. It was engineered for precision. For example, to write literature reviews it would write a list of questions the review should answer, and then go answer them by finding precise sources.
I've been thinking about how to do the recall task: execute an agent over each document in a set. Following on the idea from my last post about all scientific discoveries, I decided to run an LLM on a bunch of source documents. I've been impressed with Gemini's ability to work with long context lengths2 and so I wanted to see how well it does on reading entire papers.
I tried this prompt on a paper about crows:
"""
Write out a list of interesting facts from the given source.
Only create facts derived specifically from the primary source.
The facts should stand on their own and not describe or summarize specific parts of the source.
Report them in a JSON object that contains a list of entries, each with the following fields:
[
{{
"fact": "The fact you have written",
"context": "...a sentence or two from directly from the source that provides context for the fact..."
}},
]
Return an empty list if no facts about the topic can be derived from the source,
or the source is unrelated to the topic.
"""
I found some really nice initial results in their API studio and experimented a bit with the prompt/settings. Writing good facts is hard though. Here's an early one that came out:
Carrion crows are highly adaptable and their abundance increases near areas with accessible human-related food sources, such as waste management facilities and animal feeding areas.
which is obvious and uninteresting. So I iterated a bit by chatting with Gemini about what a good instruction guide would be for writing interesting facts. Here's an example of a fact that came out after the iteration:
Waste isn't always a bad thing for crows. In Austria's Rhine Valley, the number of carrion crows observed near waste disposal sites was much higher compared to areas without waste, highlighting how crows enjoy food even if it's waste!
The guide we came up with is:
Instructions for Recognizing and Writing a Good Fact from a Research Paper
Pretty good guide I think.
Finding papers about crows can be tough, especially if you haven't been downloading PDFs of crow papers for years. There are three main steps to getting lots of full-text research papers: (1) search for papers, (2) get links to PDFs, and (3) download the PDFs.
Getting the search results really depends on what you're trying to do. You could use Semantic Scholar to find all papers cited by another paper. Or you could use a topic list from OpenAlex which is a categorization of all of science.
For this project, I tried a few search engines that give back lists of papers. You can of course just go to Google Scholar and copy-paste results into an LLM to convert them into JSON. Or you can use a search engine with API access. Some examples are Semantic Scholar, SearXNG, and CrossRef.
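For example, here is a minimal sketch of step 1 using the Semantic Scholar search API; the query string is just an illustration, and heavy use may need an API key:
# Search Semantic Scholar for papers and keep title/authors/DOI
import requests

resp = requests.get(
    "https://api.semanticscholar.org/graph/v1/paper/search",
    params={
        "query": "carrion crow behavior",
        "fields": "title,authors,externalIds",
        "limit": 100,
    },
)
papers = [
    {
        "title": p["title"],
        "authors": [a["name"] for a in p.get("authors", [])],
        "doi": (p.get("externalIds") or {}).get("DOI"),
    }
    for p in resp.json()["data"]
]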
I used a variety of these engines to get a long list of papers about crows. For each paper, I have a title, authors, and DOI if I can get it. Then we go to step 2: getting links to PDFs.
Getting a link to a PDF is quite hard. Sometimes you get lucky and the papers retrieved have arxiv identifiers or direct links. Usually, you have a DOI, title, and author. You can use OpenAlex API (or their Unpaywall API) to find the best known open access link of the article. This sounds hard, but is actually not too bad with paper-qa:
# pip install paper-qa
import asyncio
from paperqa.clients import DocMetadataClient, OpenAlexProvider

async def get_pdf_url(title: str, authors: list[str] | None) -> str | None:
    # Look up the paper on OpenAlex by title and authors
    client = DocMetadataClient(clients=[OpenAlexProvider])
    result = await client.query(
        title=title,
        authors=authors or [],
    )
    # Best-known open access link reported by OpenAlex
    return result.pdf_url

# title and authors come from the search results in step 1
print(asyncio.run(get_pdf_url(title, authors)))
You can use more providers, like CrossRef or Semantic Scholar, but we only want the pdf_url, and that comes best from OpenAlex. This process isn't perfect and obviously there is often no open access link at all. In the end, I usually see about 15-25% success on getting a PDF link. It is highly field-dependent though.
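If you already have a DOI, you can also hit the OpenAlex REST API directly rather than going through paper-qa; a small sketch with requests (the DOI is a placeholder):
# Look up a work by DOI and read its best open access location
import requests

doi = "10.1000/example"  # placeholder DOI from the search step
work = requests.get(f"https://api.openalex.org/works/https://doi.org/{doi}").json()
best_oa = work.get("best_oa_location") or {}
print(best_oa.get("pdf_url"))  # None if OpenAlex knows no open access PDF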
Downloading PDFs is a real art. There are lots of complications, terms-of-use issues, and IP limitations. For example, the National Institutes of Health bans downloading PDFs from PubMed with automated methods and will permanently ban your IP address. It is insane that a government agency will permanently ban your IP address if you download too many PDFs. Don't think about getting clever with a proxy, because most proxies don't allow you to access .gov TLDs.
So my recommendation? Just downloading them manually is often the best approach. You can usually download about one PDF every 10 seconds, so you can set a goal number based on your tolerance. It also avoids a lot of the complications I mentioned above. Anyway - the state of scientific literature access is so depressing that it is a legitimate plan in chemistry to regenerate all published data to get around publication and data access issues. High-throughput chemistry has made it plausible to repeat the human history of science for many chemistry domains for less money than accessing the research papers.
I got 514 papers for my experiment on crow facts.
One important fact about downloading PDFs is that there is almost an inverse correlation between a publisher's standards and how easy it is to download their open access articles. For example, SpringerNature will put up Cloudflare screens that force you to wait a few seconds after downloading just 5 PDFs in quick succession, whereas MDPI will let you download many PDFs freely. So when doing actual research, rather than collecting crow facts, there is often an inverse correlation between ease of downloading and the impact of a paper. Remember - this is for open access papers, which can cost authors over $10,000 in some journals, not for paywalled or subscription-only articles.
Actually executing your prompt on the papers is pretty easy. You can click "get code" in the Gemini API studio and then ask Gemini to revise the code to be general. I actually used paper-qa to do this part, just because it has some nice features like a PDF reader, and that avoids needing to upload all the PDFs to Gemini.
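Here is a rough sketch of that map step. This is not the paper-qa pipeline I actually used; it assumes litellm for the Gemini call and pypdf for text extraction, with FACT_PROMPT being the prompt above and a papers/ directory of downloaded PDFs:
# Run the fact-extraction prompt over every downloaded PDF
import json
from pathlib import Path

from litellm import completion
from pypdf import PdfReader

def facts_from_pdf(pdf_path: Path, prompt: str) -> list[dict]:
    # Put the paper's full text into Gemini's long context window
    text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
    response = completion(
        model="vertex_ai/gemini-1.5-pro-002",
        messages=[{"role": "user", "content": f"{prompt}\n\nSource:\n{text}"}],
    )
    # The prompt asks for JSON, so parse the reply directly
    return json.loads(response.choices[0].message.content)

all_facts = [
    fact
    for pdf in Path("papers").glob("*.pdf")
    for fact in facts_from_pdf(pdf, FACT_PROMPT)
]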
The rate limits needed to accomplish this are pretty high. Another reason I used paper-qa is that it has rate limiting built in. I ended up setting a rate limit of 3.5 million tokens per minute because running papers back to back otherwise would trigger the API's rate limits. I'm confused about this to be honest - each request was only about 20,000 tokens of input and a few thousand of output. I had a lot of failures from their API - I had to cache and rerun quite a few times. In the end, I only spent $0.65 (which is great).
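If you're not using paper-qa's built-in limiter, a toy tokens-per-minute throttle is enough to keep back-to-back requests under the cap; this is just a sketch, not paper-qa's implementation:
# Naive tokens-per-minute throttle for sequential requests
import time

class TokenThrottle:
    def __init__(self, tokens_per_minute: int):
        self.tpm = tokens_per_minute
        self.window_start = time.monotonic()
        self.used = 0

    def acquire(self, n_tokens: int) -> None:
        elapsed = time.monotonic() - self.window_start
        if elapsed >= 60:
            # New minute: reset the budget
            self.window_start, self.used = time.monotonic(), 0
        elif self.used + n_tokens > self.tpm:
            # Budget exhausted: wait out the rest of the minute
            time.sleep(60 - elapsed)
            self.window_start, self.used = time.monotonic(), 0
        self.used += n_tokens

throttle = TokenThrottle(3_500_000)  # the limit used for this project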
I did try OpenAI's models and they work OK. GPT-4o wouldn't reject facts that are unrelated to crows - for example, it gave one about roosters, which are not in the corvid family. Gemini is also quite a bit cheaper.
Here's one of the rooster facts:
The highest-ranking rooster in a group gets the privilege of crowing first to announce the break of dawn.
The facts are pretty good overall. Some of the failure modes are facts about magpies or jackdaws, which are corvids but not exactly "crow facts." Another failure mode is assuming the reader has the context of the fact - like describing an experimental result without explaining the setup. Maybe 5% have that problem. There is also an indirect attribution problem: when a paper mentions a fact from another paper, we can't ground the fact in the cited paper's text and must just assume the citing author characterized it correctly.
You can fetch your own facts:
curl https://facts.drugcrow.ai