An Exploratory QA Benchmark over a Million-Scale Data Lake
Can LLM agents discover the evidence before reasoning over it?
LakeQA evaluates Exploratory Question Answering (EQA), a setting where an LLM agent is given only a natural-language question and must discover the relevant evidence from a large heterogeneous data lake before answering. Unlike reading-comprehension benchmarks where evidence is provided, or open-domain QA over a small curated corpus, EQA requires the agent to repeatedly alternate between (i) reasoning about what evidence is missing and (ii) searching for documents that contain it.
LakeQA is built over a ~9.5 TB deduplicated collection of Wikipedia and open-source government data: about 40 million unique files spanning structured (CSV, JSON) and unstructured (TXT, PDF, HTML) formats. To answer a single LakeQA question, an agent needs to locate, inspect, and reason across 7.67 documents on average, drawn from a search space of more than one million candidates.
For more details about the task, the tools, the metrics, and a worked example, see the Details page.
Start from a natural-language task with no provided evidence.
Reason about what documents, tables, or facts must be discovered.
Use ontology tags and keyword queries to narrow millions of candidates.
Open, preview, download, and query the retrieved evidence.
Synthesize the final response from the evidence chain.
At each step, the agent uses a fixed tool interface (search, listdata, download, inspect, query) to discover, retrieve, and analyze data.
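The steps and tool interface above can be sketched as a simple decide-act loop. This is only an illustrative sketch, not LakeQA's actual harness: the five tool names come from the text, but the `Action` record, the `decide` callback (standing in for the LLM's reasoning step), the `"answer"` stop signal, and the `max_steps` budget are all assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable, Optional

# The five tool names listed on this page; their signatures are NOT specified here.
TOOL_NAMES = ("search", "listdata", "download", "inspect", "query")


@dataclass
class Action:
    """Hypothetical record of what the agent wants to do next."""
    tool: str          # one of TOOL_NAMES, or "answer" to stop (assumed convention)
    args: tuple = ()   # positional arguments passed to the tool


def explore(
    question: str,
    tools: Any,
    decide: Callable[[str, list], Action],
    max_steps: int = 20,
) -> tuple[Optional[str], list]:
    """Alternate between deciding what is missing and calling one tool.

    `decide` sees the question plus the trace of (action, observation)
    pairs so far and returns the next Action. Returns (answer, trace);
    the answer is None if the step budget runs out first.
    """
    trace: list = []
    for _ in range(max_steps):
        action = decide(question, trace)
        if action.tool == "answer":
            return action.args[0], trace
        if action.tool not in TOOL_NAMES:
            raise ValueError(f"unknown tool: {action.tool!r}")
        # Dispatch to the matching method on the tools object.
        observation = getattr(tools, action.tool)(*action.args)
        trace.append((action, observation))
    return None, trace
```

Keeping the loop independent of any particular planner means the same code can be driven by a scripted `decide` in tests and by an LLM-backed one in a real run.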
@inproceedings{lakeqa2026,
title = {{LakeQA}: An Exploratory QA Benchmark over a Million-Scale Data Lake},
author = {Haonan Wang and Jiaxiang Liu and Yurong Liu and Austin Senna Wijaya and Tianle Zhou and Eden Wu and Yijia Chen and Wanting You and Reya Vir and Daniela Pinto and Grace Fan and Yusen Zhang and Juliana Freire and Eugene Wu},
note = {Haonan Wang and Jiaxiang Liu contributed equally; order decided by a coin flip},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
series = {Proceedings of Machine Learning Research},
volume = {306},
publisher = {PMLR},
year = {2026},
address = {Seoul, South Korea}
}
End-to-end performance on the two LakeQA splits. See metric definitions.
Submit: open a PR adding a row to assets/data/leaderboard_full.json or assets/data/leaderboard_mini.json. The schema is in the README.
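Adding a row can be scripted before opening the PR. The helper below is a minimal sketch under two loud assumptions: that each leaderboard file is a top-level JSON array of row objects, and that a row is a plain JSON object. The real field names are defined in the README and are not reproduced here; the `add_leaderboard_row` name and the example keys are hypothetical.

```python
import json
from pathlib import Path


def add_leaderboard_row(path: str, row: dict) -> int:
    """Append `row` to a leaderboard JSON file and return the new row count.

    Assumes the file holds a top-level JSON array (an assumption; check
    the README schema before relying on this).
    """
    file = Path(path)
    rows = json.loads(file.read_text(encoding="utf-8"))
    if not isinstance(rows, list):
        raise ValueError("expected a top-level JSON array; check the README schema")
    rows.append(row)
    # Write back with stable indentation so the PR diff stays small.
    file.write_text(
        json.dumps(rows, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return len(rows)
```

A call might look like `add_leaderboard_row("assets/data/leaderboard_mini.json", {"system": "my-agent"})`, where the `"system"` key is a placeholder for whatever fields the README schema requires.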