LakeQA — An Exploratory QA Benchmark over a Million-Scale Data Lake

What is LakeQA?

LakeQA evaluates Exploratory Question Answering (EQA), a setting where an LLM agent is given only a natural-language question and must discover the relevant evidence from a large heterogeneous data lake before answering. Unlike reading-comprehension benchmarks where evidence is provided, or open-domain QA over a small curated corpus, EQA requires the agent to repeatedly alternate between (i) reasoning about what evidence is missing and (ii) searching for documents that contain it.

LakeQA is built over a ~9.5 TB deduplicated collection of Wikipedia and open-source government data: about 40 million unique files spanning structured (CSV, JSON) and unstructured (TXT, PDF, HTML) formats. To answer a single LakeQA question, an agent needs to locate, inspect, and reason across 7.67 documents on average, drawn from a search space of more than one million candidates.

For more details about the task, the tools, the metrics, and a worked example, see the Details page.

Paper (PDF) ↗ Dataset (S3) ↗ Code (GitHub) ↗

Benchmark workflow

Question

Start from a natural-language task with no provided evidence.

Plan missing evidence

Reason about what documents, tables, or facts must be discovered.

Search data lake

Use ontology tags and keyword queries to narrow millions of candidates.

Inspect / query files

Open, preview, download, and query the retrieved evidence.

Multi-hop answer

Synthesize the final response from the evidence chain.

At each step the agent uses a fixed tool interface (search, listdata, download, inspect, query) to discover, retrieve, and analyze data.
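The loop above can be sketched in a few lines of Python. This is only an illustration, not the benchmark's actual interface: the five tool names (search, listdata, download, inspect, query) come from the text, but every signature, the in-memory ToyLake class, the explore and next_step helpers, and the toy files are assumptions made up for this example. In a real run an LLM would play the role of next_step, deciding what evidence is still missing.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLake:
    """Hypothetical in-memory stand-in for the data lake: id -> (tags, text)."""
    files: dict
    downloaded: dict = field(default_factory=dict)

    def search(self, tags, keywords):
        """Ids whose tags include all `tags` and whose text mentions any keyword."""
        return sorted(
            fid for fid, (ftags, text) in self.files.items()
            if set(tags) <= ftags
            and any(k.lower() in text.lower() for k in keywords)
        )

    def listdata(self):
        """Ids of the files downloaded so far."""
        return sorted(self.downloaded)

    def download(self, fid):
        """Copy a file from the lake into the local workspace."""
        self.downloaded[fid] = self.files[fid][1]

    def inspect(self, fid, n_chars=40):
        """Preview the first characters of a downloaded file."""
        return self.downloaded[fid][:n_chars]

    def query(self, fid, needle):
        """Lines of a downloaded file that contain `needle` (case-insensitive)."""
        return [line for line in self.downloaded[fid].splitlines()
                if needle.lower() in line.lower()]


def explore(lake, next_step, max_steps=10):
    """Alternate between planning (next_step) and searching until done."""
    evidence = []
    for _ in range(max_steps):
        step = next_step(evidence)       # (i) reason about missing evidence
        if step is None:
            break
        tags, keywords, needle = step
        for fid in lake.search(tags, keywords):   # (ii) search for it
            lake.download(fid)
            evidence.extend((fid, line) for line in lake.query(fid, needle))
    return evidence


# Invented toy data; nothing here comes from the real data lake.
TOY_FILES = {
    "mayors.csv": ({"civic"}, "city,mayor\nExampleville,A. Person\nSampleton,B. Human"),
    "news.txt": ({"news"}, "A. Person took office in 2020.\nWeather was mild."),
}


def next_step(evidence):
    """Hard-coded two-hop plan standing in for the agent's reasoning."""
    if not evidence:
        return ({"civic"}, ["mayor"], "Exampleville")
    if len(evidence) == 1:
        mayor = evidence[0][1].split(",")[1]   # name found in the first hop
        return ({"news"}, [mayor], "took office")
    return None


if __name__ == "__main__":
    for fid, line in explore(ToyLake(files=TOY_FILES), next_step):
        print(fid, "->", line)
```

The second hop's keyword depends on what the first hop found, which is the property that makes the task multi-hop: the agent cannot issue all of its searches up front.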

Citation

Authors: Haonan Wang*, Jiaxiang Liu*, Yurong Liu, Austin Senna Wijaya, Tianle Zhou, Eden Wu, Yijia Chen, Wanting You, Reya Vir, Daniela Pinto, Grace Fan, Yusen Zhang, Juliana Freire, Eugene Wu.

Proceedings of the 43rd International Conference on Machine Learning (ICML), Seoul, South Korea, PMLR 306, 2026. *Equal contribution; order decided by a coin flip.

@inproceedings{lakeqa2026,
  title     = {{LakeQA}: An Exploratory QA Benchmark over a Million-Scale Data Lake},
  author    = {Haonan Wang and Jiaxiang Liu and Yurong Liu and Austin Senna Wijaya and Tianle Zhou and Eden Wu and Yijia Chen and Wanting You and Reya Vir and Daniela Pinto and Grace Fan and Yusen Zhang and Juliana Freire and Eugene Wu},
  note      = {Haonan Wang and Jiaxiang Liu contributed equally; order decided by a coin flip},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  series    = {Proceedings of Machine Learning Research},
  volume    = {306},
  publisher = {PMLR},
  year      = {2026},
  address   = {Seoul, South Korea}
}

Leaderboard

End-to-end performance on the two LakeQA splits. See metric definitions.

Submit: open a PR adding a row to assets/data/leaderboard_full.json or assets/data/leaderboard_mini.json. Schema is in the README.
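A minimal sketch of preparing such a row locally, under two loud assumptions: that each leaderboard file is a top-level JSON array of row objects, and that a row is a flat JSON object. Neither is confirmed here; the real schema is in the README, and the helper name add_leaderboard_row and the field name in the usage comment are placeholders.

```python
import json
from pathlib import Path


def add_leaderboard_row(path, row):
    """Append `row` to the JSON array at `path`; return the new row count.

    Assumes the file holds a top-level JSON array (an assumption; check
    the README schema before opening a PR).
    """
    p = Path(path)
    rows = json.loads(p.read_text(encoding="utf-8"))
    if not isinstance(rows, list):
        raise ValueError("expected a top-level JSON array; see the README schema")
    rows.append(row)
    # Rewrite the whole file with stable indentation to keep diffs readable.
    p.write_text(json.dumps(rows, indent=2, ensure_ascii=False) + "\n",
                 encoding="utf-8")
    return len(rows)


# Usage (placeholder field name, not the real schema):
# add_leaderboard_row("assets/data/leaderboard_mini.json",
#                     {"placeholder_field": "value"})
```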