
Meet your (free) new lab partner: Nonprofit ‘AI scientist’ gets to work


Illustration of a scientist at a computer with a mechanical crow perched nearby, symbolising AI support in research.


How does research usually start? With a mountain of PDFs, of course! Luckily for researchers, AI is changing that: it can already sift, digest, sort and summarise hundreds of papers in mere minutes. What a time-saver! But that is not where AI’s impact on the scientific process will stop (not even close). Many, including the likes of Google DeepMind’s Demis Hassabis, believe that AIs will soon be able to automate the entire scientific process and make discoveries of their own. The era of ‘AI scientists’ is fast approaching. One of the main players emerging in this space is the Eric Schmidt-backed nonprofit FutureHouse, whose free platform of tailored AI agents is being trained specifically for “automating scientific discovery”.


On launch day, 1 May 2025, co-founder Sam Rodrigues claimed that FutureHouse can already perform a wide variety of scientific tasks better than humans. In this article, we dive a bit deeper to show you how you can use FutureHouse in your own research, how it compares with competing AI tools for research, and how a four-step experiment can help you see for yourself whether AI science is up to scratch.



Meet Your AI Research Team


Unlike widespread LLMs like ChatGPT, FutureHouse’s platform has access to a vast database of high-quality, open-access papers and specialised scientific tools running in the background. It currently offers a suite of four distinct and complementary AI tools: Crow, Falcon, Owl and Phoenix:


  • Crow can comb through the scientific literature and quickly answer questions about the papers it finds (or is provided with).


  • Falcon can conduct deeper literature searches, including of scientific databases, and produce detailed research reports similar to the “Deep Research” options of ChatGPT and Gemini.


  • Owl can search any field to tell you whether anyone has already done research on a specific topic, helping you find research gaps.


  • Phoenix uses additional, highly tailored software tools to help plan chemistry experiments.



Illustrated cards describing four AI science tools: Crow for concise search, Falcon for deep search, Phoenix for chemistry tasks, and Owl for precedent search.


When you stack all of that alongside the increasingly bold, headline-grabbing claims from the likes of xAI, OpenAI, Anthropic and Google - and then look at what DeepMind’s AlphaFold, IBM’s RoboRXN and other ‘self-driving labs’ are already capable of - you see a bigger trend: specialist ‘AI scientists’ are quickly moving from conference slides to everyday lab life.


So, how can you make the most of these new tools for your own research? And where is AI taking science? Let’s take a look.



Who (and what) is FutureHouse?


FutureHouse is a San Francisco nonprofit of roughly fifteen staff, founded in September 2023 with philanthropic backing from former Google CEO Eric Schmidt.


On its own LitQA benchmark, FutureHouse reports “superhuman” precision that edges out traditional LLMs like GPT-4-Turbo and Claude Opus.


The team has even kitted out its own robotic wet lab so hypotheses, experiments and analysis can run in one continuous loop - demo footage shows pipettes moving without human hands! The goal is essentially to automate the whole scientific process, from hypothesis to testing (🤯 isn’t this wild?!). See the video below for a demonstration of how the different AI tools can help researchers come up with new science.





AI ‘scientists’ versus mainstream LLMs


The last year of AI launches has felt like a rocket-race, with each vendor claiming its latest model will revolutionise research.


During the 9 July 2025 Grok-4 livestream, Elon Musk declared the model could solve “real-world engineering problems where the answers cannot be found anywhere on the Internet or in books”. OpenAI’s Sam Altman said during the launch of GPT‑5 that it is the first time that it “really feels like talking to an expert in any topic, like a PhD level expert”. Anthropic’s Claude 3 family highlights new science benchmarks and transparent step-by-step thinking, and Google’s Gemini models now have context windows big enough to swallow an entire special issue of Nature.


Yet bigger is not always better. TechCrunch testing showed Grok-4’s internal “chain-of-thought” sometimes searches Musk’s own posts before answering controversial questions, raising bias concerns. FutureHouse argues that science-tuned agents with transparent reasoning and curated databases give more reliable evidence trails.


Ready to find out if this can actually help with your own work? We thought up a four-step mini-experiment to help you point the way.



Mini-experiment (fun!)


Nothing beats hands-on testing, so here is a quick protocol you can run between coffee refills:


  1. Choose a stubborn research question.

  2. Ask FutureHouse’s Crow for a concise answer with citations.

  3. Pose the same question to ChatGPT, Gemini, Claude or Grok.

  4. Compare depth, citation quality and any hallucinations.


For a comprehensive, robust test, you will need to compare the output from FutureHouse’s Falcon with the various ‘Deep Research’ options from ChatGPT, Gemini, Claude or Grok, but since we don’t want to bog things down with long research reports, we opted for a quick-test using Crow. We used the following prompt:


💡 Find and list the top 3 most cited papers that discuss effective, evidence-based sleep training methods for 2-year-olds, and distil the findings of all three into a single sentence that summarises the available evidence (the potential for parental stress and long-term impact on the child's psychological and behavioural development).


QUICK TIP! Always remember to toggle the web search tool/functionality so that AI models are not limited to referencing their training data, but can also search online for recent papers.


The results of our little mini-experiment were as follows:


🤖 ChatGPT (GPT-5)


Here are the three most-cited, evidence-based papers on behavioural sleep training relevant to toddlers (2 years is within the “young children” scope used by these papers):


  1. Gradisar M, et al. Behavioral interventions for infant sleep problems: a randomized controlled trial (graduated extinction vs bedtime fading vs control). Pediatrics. 2016;137(6):e20151486. RCT; ~200+ citations. PubMed

  2. Hiscock H, Wake M. Randomised controlled trial of a behavioural infant sleep intervention to improve infant sleep and maternal mood. BMJ. 2002;324:1062. RCT; ~440+ citations. PubMed PMC

  3. Mindell JA, Kuhn BR, Lewin DS, Meltzer LJ, Sadeh A. Behavioral treatment of bedtime problems and night wakings in infants and young children. Sleep. 2006;29(10):1263-1276. Task-force evidence review; ~1,170+ citations. AASM PubMed


Evidence from the most-cited trials and reviews shows that behavioural sleep training methods such as graduated extinction, bedtime fading, and consistent bedtime routines are effective for 2-year-olds, improve both child sleep and parental wellbeing, and do not cause long-term harm to attachment, behaviour, or psychological development, though some short-term parental stress may occur.


Verdict: ✅

ChatGPT opted strictly for the highest citation counts, which meant including some older studies (2002, 2006), but it used reputable sources and didn’t hallucinate.


🔍 Gemini (2.5 Pro)


  1. Mindell, J. A., Kuhn, B., Lewin, D. S., Meltzer, L. J., & Sadeh, A. (2006). Behavioral treatment of bedtime problems and night wakings in infants and young children. Sleep, 29(10), 1263–1276.

    • This foundational systematic review is one of the most influential papers in the field. It analyzed 52 treatment studies to establish the overall efficacy of behavioral interventions for sleep problems in young children, including toddlers.

  2. Price, A. M., Wake, M., Ukoumunne, O. C., & Hiscock, H. (2012). Five-year follow-up of harms and benefits of behavioral infant sleep intervention: randomized trial. Pediatrics, 130(4), 643-651.

    • This landmark study is critical as it provides the strongest long-term evidence available. It followed up on families from a randomized controlled trial five years after they implemented sleep training (graduated extinction or camping out/fading) to assess for any lasting negative or positive consequences.

  3. Gradisar, M., Jackson, K., Spurrier, N. J., Gibson, J., Whitham, J., Sved-Williams, A., ... & Kennaway, D. J. (2016). Behavioral interventions for infant sleep problems: a randomized controlled trial. Pediatrics, 137(6).

    • This highly-cited randomised controlled trial directly compared two popular methods (graduated extinction and bedtime fading) against a control group. Crucially, it also included objective measures of maternal stress by testing for cortisol, the primary stress hormone.


Evidence from top-cited studies shows that behavioural sleep training is a highly effective method for toddlers that reduces parental stress without causing any demonstrable long-term harm to the child's psychological well-being or the parent-child bond.


Verdict: ✅

Gemini mostly cited the same authors as being the most influential, while referencing some more recent papers. It didn’t, however, directly link to the sources or indicate the citation count.


🪶 FutureHouse Crow

The top three most‐cited papers addressing effective, evidence‐based sleep training methods applicable to 2‐year‐olds are:


  • Meltzer and Mindell’s 2014 systematic review and meta‐analysis of behavioral interventions for pediatric insomnia (436 citations) (1.1)

  • Mindell and Williamson’s 2018 review of the benefits of a bedtime routine in young children (459 citations) (2.1)

  • Gradisar et al.’s 2016 randomized controlled trial on behavioral interventions for infant sleep problems (211 citations) (3.1)


Collectively, the reviewed evidence supports the safe and effective use of structured behavioural sleep-training techniques for 2-year-olds, with the benefits extending not only to improved sleep parameters but also to enhanced family well-being without compromising the child’s long-term psychological or behavioural development.


Verdict: ✅

Crow again cited the same influential authors, but seemed to favour the most recent (and perhaps more relevant) studies, with concise, albeit slightly more academic, summarisation.
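One quick way to sharpen step 4 of the protocol is to check how much the citation lists actually overlap: papers every model surfaces suggest consensus, while papers only one model returns deserve an extra click-through. Here is a minimal Python sketch using the first-author-plus-year keys from the three outputs above (the comparison logic is our own illustration, not a feature of any of the tools):

```python
def normalise(ref: str) -> str:
    """Lower-case and collapse whitespace so formatting quirks don't hide a match."""
    return " ".join(ref.lower().split())

def compare_citations(results: dict[str, list[str]]) -> tuple[set, dict]:
    """Return the references shared by all models, plus each model's unique picks."""
    sets = {model: {normalise(r) for r in refs} for model, refs in results.items()}
    shared = set.intersection(*sets.values())
    unique = {}
    for model, refs in sets.items():
        others = set.union(*(s for m, s in sets.items() if m != model))
        unique[model] = refs - others
    return shared, unique

# First author + year of each paper cited in the mini-experiment above.
results = {
    "ChatGPT": ["Gradisar 2016", "Hiscock 2002", "Mindell 2006"],
    "Gemini":  ["Mindell 2006", "Price 2012", "Gradisar 2016"],
    "Crow":    ["Meltzer 2014", "Mindell 2018", "Gradisar 2016"],
}

shared, unique = compare_citations(results)
print("All three models cited:", shared)   # only Gradisar et al. 2016
print("Cited by one model only:", unique)
```

In our run, only Gradisar et al. (2016) was cited by all three tools - exactly the kind of agreement (and disagreement) worth knowing before trusting any single answer.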



Which one you prefer is up to you. Although outside the scope of this blog, comparing long research reports from Falcon with those of the various Deep Research agents would shed more light on which models are analytically stronger or more prone to hallucinations (we might dedicate a future blog post to it). Nevertheless, the point is that AI now affords you some major opportunities for enhancing your research efficiency and output.


For chemists: you can let Phoenix draft a reaction route and check it against your own notebook or other chemistry AIs to compare (see below).


Do share your screenshots or “wow” moments on social media with #AISciBuddy and tag @AnimateYourScience so we can amplify the smartest finds (and copy yours!).


FutureHouse is far from the only player on this exciting new playground. Below we take a dip into the rest of the fast-growing ‘AI scientist’ ecosystem to see what else is out there. The key thing about FutureHouse, though, is that it’s completely free, with no apparent plans to monetise in the near term.


The other tools that offer AI research agents or literature review features are all paid platforms. So if you would rather not pay yet another subscription fee, FutureHouse may be the way to go.



The wider ‘AI scientist’ landscape (speed-date edition!)


Think of this as a lightning tour of lab-automation speed-dating - five recent projects, thirty seconds each, plenty of inspiration. Try out the ones that seem relevant to your research, and let us know what you think. AI is, after all, currently one giant global experiment!


For each tool or project below: its domain super-power, and why it matters.


  • AlphaFold & GNoME (Google DeepMind). Super-power: protein folding and the discovery of 380,000 candidate materials. Why it matters: concrete biology and materials breakthroughs.


  • RoboRXN (IBM). Super-power: an advanced AI chemist integrated with the first remotely accessible, physical autonomous chemical laboratory for testing materials synthesis. Why it matters: a free molecular-recipe playground.


  • ChemCrow (open-sourced). Super-power: a GPT-augmented chemistry agent proven in wet-lab syntheses. Why it matters: Phoenix, on the FutureHouse platform, was built on top of ChemCrow.


  • Self-driving labs (UIUC SAMPLE, Berkeley A-Lab). Super-power: closed-loop robotics that made 41 new compounds in 17 days. Why it matters: labs that learn while you sleep.


  • Google Earth AI (Google). Super-power: a collection of geospatial AI models and datasets, including AlphaEarth Foundations, a state-of-the-art model that acts like a “virtual satellite”. Why it matters: integrates vast amounts of Earth observation data into a unified digital representation of the planet (wildfire detection, flood forecasting, monitoring crops and deforestation, etc.).

From proteins to polymers, these systems show that automated science is not science fiction. Each advances a different slice of the research pipeline or a different field, with exciting possibilities for the future.


As you’re taking your first steps in this weird and exciting new ‘AI scientist’ world, it will pay to keep in mind some caveats and ethics that still shadow autonomous science.



Caveats, ethics and future outlook


Like any shiny new instrument, AI agents come with safety labels that deserve a careful read.


Hallucinations happen even in ‘Deep Research’ or domain-tuned models, so always click through every citation. Bias remains tricky: Grok-4’s Musk-centric reasoning is a textbook example of founder fingerprints on an output.


Fully autonomous systems such as Sakana’s AI-Scientist still need human validation to catch buggy code and dodgy references. Nevertheless, the potential upside is compelling - faster hypothesis cycles, lightning-quick lit-reviews, cheaper access to expertise and a level research playing field for labs with tight budgets.


If that sounds like the future you want in your own ‘lab’, the next step is simple: FutureHouse’s agents are live today and free to use, so run your own mini-experiment and tell us what surprises you the most.


If you want a guided tour of the broader AI toolset available for researchers, and practical tricks for slotting these agents into your day-to-day workflow, join our “AI for Researchers” online course. It is packed with live demos, free tool resources and real-world case studies to help you harness AI ethically in research.


We cannot wait to see what discoveries you and your new robot lab partner make next!


Disclaimer: This article is an independent review. We have not received any contact, payment or sponsorship from any of the AI software tools mentioned in this piece. All observations and evaluations are based on our unbiased testing.
