Does Agentic Deep Search Converge? Reproducibility Questions for LLM-Driven Literature Discovery

Location

Ngee Ann Kongsi Auditorium (NAKA)

Start Date

4 June 2026, 11:00 AM

End Date

4 June 2026, 11:30 AM

Description

Agentic deep-search tools such as Undermind.ai, AI2 Paper Finder, and Elicit are rapidly gaining adoption. They use large language models (LLMs) to iteratively broaden queries, fetch sources, and re-rank what looks most relevant. This often outperforms traditional keyword search, but it raises a practical question for scholarship: if we rerun the same search, or ask the same question in semantically equivalent words, do we get essentially the same core set of papers? In short, does agentic search converge despite the non-determinism of LLMs, and when is it reproducible enough for research use?
This session reports preliminary results from a structured reproducibility test. We (1) run the same search multiple times to see how much the top-ranked results and the results judged “relevant” overlap across runs, and (2) pose semantically equivalent versions of the query (paraphrases/synonyms) to check robustness to wording. We report side-by-side results for AI2 Paper Finder (well-regarded, open source) and Undermind.ai (a popular commercial agentic search tool).
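One way to quantify the run-to-run overlap described above is sketched below. This is a minimal illustration, not the session's actual method: the choice of Jaccard similarity, the function names, and the paper identifiers are all assumptions made for the example. Each run is assumed to be a ranked list of paper IDs exported from a search tool.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections of paper IDs (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(runs, k=None):
    """Average Jaccard similarity over all pairs of runs.

    runs: list of ranked lists of paper IDs, one list per repeated search.
    k:    if given, compare only the top-k results of each run.
    """
    tops = [run[:k] if k is not None else run for run in runs]
    pairs = list(combinations(tops, 2))
    if not pairs:
        raise ValueError("need at least two runs to measure overlap")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def stable_core(runs, min_runs=None):
    """Paper IDs that appear in at least min_runs runs (default: every run)."""
    if min_runs is None:
        min_runs = len(runs)
    counts = Counter(pid for run in runs for pid in set(run))
    return {pid for pid, c in counts.items() if c >= min_runs}


# Hypothetical output of three repeated runs of the same query.
runs = [
    ["p1", "p2", "p3"],
    ["p1", "p2", "p4"],
    ["p1", "p3", "p4"],
]
overlap_all = mean_pairwise_jaccard(runs)        # overlap across full result lists
overlap_top2 = mean_pairwise_jaccard(runs, k=2)  # overlap restricted to the top 2
core = stable_core(runs)                         # papers returned by every run
```

The same functions could be applied to the paraphrase test by treating the result list for each reworded query as one "run". A rank-aware measure may be preferable where the ordering of results matters, since Jaccard similarity ignores rank.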
Why this matters: as AI-assisted search increasingly informs literature reviews, grant proposals, and policy briefs, the community needs clear evidence about reproducibility.

This document is currently not available here.
