Set up the dataset, run a baseline agent, and submit your results to the leaderboard.
LakeQA's data lake is hosted on a public S3 bucket and is accessed anonymously — no AWS account is required. Local installation needs:
git
Install the Python dependencies with:
pip install pandas boto3 botocore
LakeQA is published under two public S3 roots containing roughly 40 million deduplicated files from Wikipedia and Data.gov. There is no auth — clients use the anonymous (unsigned) S3 protocol.
Don't pre-download anything. The reference agent downloads only the files it inspects for each task.
This is how the example in Details works — see node 1's fact:
import io

import boto3
import pandas as pd
from botocore import UNSIGNED
from botocore.config import Config

bucket = "lakeqa-yc4103-datalake"
key = "datagov/2021-state-expenditures/files/rows.txt"

# Unsigned requests: the bucket is public, so no AWS credentials are needed.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
obj = s3.get_object(Bucket=bucket, Key=key)
# Read the whole object into memory and parse it as CSV.
df = pd.read_csv(io.BytesIO(obj["Body"].read()), low_memory=False)
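When an agent only needs to see what a file looks like, it may be enough to fetch the first few kilobytes instead of the whole object. The following is a minimal sketch of that idea, not part of the LakeQA tooling: `make_anonymous_client` and `peek_rows` are hypothetical helper names, and the sketch assumes the file is UTF-8 delimited text without quoted multi-line fields. `Range` is a standard parameter of `get_object`.

```python
import csv


def make_anonymous_client():
    """Build an unsigned S3 client, as in the snippet above."""
    # Imported lazily so the parsing helper below can be used without boto3.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    return boto3.client("s3", config=Config(signature_version=UNSIGNED))


def peek_rows(s3, bucket, key, max_bytes=65536, max_rows=5):
    """Return (header, first rows) parsed from the start of an object.

    Hypothetical helper: only the first max_bytes bytes are requested.
    """
    resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes=0-{max_bytes - 1}")
    raw = resp["Body"].read()
    lines = raw.decode("utf-8", errors="replace").splitlines()
    # If the range was filled completely, the last line may have been cut
    # off mid-row, so drop it to be safe.
    if len(raw) >= max_bytes and len(lines) > 1:
        lines = lines[:-1]
    rows = list(csv.reader(lines))
    if not rows:
        return [], []
    return rows[0], rows[1:1 + max_rows]


if __name__ == "__main__":
    # Requires network access and boto3.
    client = make_anonymous_client()
    header, rows = peek_rows(
        client,
        "lakeqa-yc4103-datalake",
        "datagov/2021-state-expenditures/files/rows.txt",
    )
    print(header)
    for row in rows:
        print(row)
```

Passing the client in as an argument keeps the helper easy to test with a stub. For full analysis, the `pd.read_csv` call above remains the simpler route.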
For local-only experiments or full reproducibility, mirror the bucket with the AWS CLI. Allocate at least 9.5 TB of disk and budget several hours of transfer time.
aws s3 sync --no-sign-request \
s3://lakeqa-yc4103-datalake/datagov \
./lakeqa-data/datagov
aws s3 sync --no-sign-request \
s3://lakeqa-yc4103-datalake/wikipedia \
./lakeqa-data/wikipedia
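With a partial or full mirror in place, code can prefer the local copy and fall back to S3 only when a file is missing. The sketch below assumes the directory layout produced by the two `aws s3 sync` commands above, so the key `datagov/...` maps to `./lakeqa-data/datagov/...`. The function names and the `fetch_remote` callback are hypothetical and not part of the LakeQA repository.

```python
from pathlib import Path


def local_path_for(key, mirror_root="lakeqa-data"):
    """Map an S3 key such as 'datagov/x/files/rows.txt' to its mirror path."""
    return Path(mirror_root).joinpath(*key.split("/"))


def read_lake_bytes(key, mirror_root="lakeqa-data", fetch_remote=None):
    """Read a file from the local mirror, falling back to fetch_remote(key).

    fetch_remote is any callable that returns the object's bytes, for
    example a thin wrapper around an unsigned boto3 get_object call.
    """
    path = local_path_for(key, mirror_root)
    if path.is_file():
        return path.read_bytes()
    if fetch_remote is None:
        raise FileNotFoundError(f"{key} is not mirrored under {mirror_root}")
    return fetch_remote(key)


if __name__ == "__main__":
    key = "datagov/2021-state-expenditures/files/rows.txt"
    print(local_path_for(key))
```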
Tasks themselves are small JSON files distributed with the code repository (next section). Each task has the structure shown in Details — Example task.
The reference agent, evaluation harness, and tool implementations live in a separate repository:
git clone https://github.com/lakeagent/datalake-qa.git
cd datalake-qa
pip install -e .
The repository ships with the five tools described in Details
(search, listdata, download, inspect, query),
a baseline agent skeleton, and an evaluator that produces the leaderboard metrics
(EM, runtime, cost, and the D_acc / D_ret precision/recall/F1).
Smoke-test your install by running the bundled baseline agent on the first task of LakeQA-mini:
python -m lakeqa.run \
--tasks tasks/lakeqa_mini.jsonl \
--task-id 1 \
--agent baseline \
--model claude-haiku-4-5 \
--out runs/smoke_test/
This downloads the task's gold datasets on demand, executes the agent loop until it submits an answer
(or hits the turn limit), and writes a per-task trace JSON plus a top-level predictions.jsonl
to runs/smoke_test/.
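A quick way to confirm that the smoke test wrote something sensible is to count the records in predictions.jsonl and see which fields they carry. The sketch below makes no assumption about the field names, since the prediction schema is not described here. It only assumes that each non-empty line is one JSON value, and `summarize_jsonl` is a hypothetical helper name.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path):
    """Return (record count, {top-level key: occurrences}) for a JSONL file."""
    n_records = 0
    key_counts = Counter()
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            n_records += 1
            if isinstance(record, dict):
                key_counts.update(record.keys())
    return n_records, dict(key_counts)


if __name__ == "__main__":
    count, keys = summarize_jsonl("runs/smoke_test/predictions.jsonl")
    print(f"{count} prediction(s)")
    for name, seen in sorted(keys.items()):
        print(f"  {name}: present in {seen} record(s)")
```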
Run on the full mini split:
python -m lakeqa.run \
--tasks tasks/lakeqa_mini.jsonl \
--agent baseline \
--model claude-haiku-4-5 \
--out runs/mini_haiku/
An agent is a Python class with a single step(state) -> tool_call method. The harness
handles the loop, the tool execution, and the final-answer extraction. Subclass the base agent and register it:
from lakeqa.agents import BaseAgent, register


@register("my_agent")
class MyAgent(BaseAgent):
    def step(self, state):
        # state.history    : list of prior tool calls + outputs
        # state.tools      : the 5 LakeQA tools
        # state.scratchpad : your own working memory
        return state.tools.search(query="missouri department of corrections")
Then run with --agent my_agent. See examples/ in the code repo for full
demonstrations.
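Because the harness calls step once per turn, an agent can use the length of state.history to decide what to do next. The following sketch cycles through a fixed list of search queries. It relies only on what is shown above (state.history, state.tools.search(query=...)). The agent name, the query list, and the fallback stand-ins for BaseAgent and register are assumptions made so that the example runs without the repository installed; they say nothing about how the real classes behave.

```python
from types import SimpleNamespace

try:
    from lakeqa.agents import BaseAgent, register
except ImportError:
    # Hypothetical stand-ins, used only when the repository is not installed.
    class BaseAgent:
        pass

    def register(name):
        def decorator(cls):
            return cls
        return decorator


@register("query_cycler")
class QueryCycler(BaseAgent):
    """Issue a different search query on each turn, based on history length."""

    # Illustrative queries only; a real agent would derive them from the task.
    QUERIES = [
        "missouri department of corrections",
        "missouri department of corrections expenditures",
        "state expenditures 2021",
    ]

    def step(self, state):
        turn = len(state.history)
        # Repeat the last query once the list is exhausted.
        query = self.QUERIES[min(turn, len(self.QUERIES) - 1)]
        return state.tools.search(query=query)


if __name__ == "__main__":
    # A fake state for a dry run; the real harness builds this object.
    fake_tools = SimpleNamespace(search=lambda query: ("search", query))
    state = SimpleNamespace(history=[], tools=fake_tools, scratchpad={})
    print(QueryCycler().step(state))
```

Exercising step against a fake state like this is a cheap check before spending model calls on a real run.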
Score a run against the gold annotations:
python -m lakeqa.eval \
--predictions runs/mini_haiku/predictions.jsonl \
--tasks tasks/lakeqa_mini.jsonl \
--report runs/mini_haiku/report.json
report.json contains the nine numbers you need for a leaderboard submission:
em, runtime_s, cost_usd,
dacc_p/dacc_r/dacc_f1,
dret_p/dret_r/dret_f1.
Submissions are accepted as pull requests against this site's repo. Add a row to the relevant JSON file and link your trace bundle (uploaded anywhere publicly accessible — your repo, a Hugging Face dataset, an S3 bucket, etc.).
Add the row to assets/data/leaderboard_full.json or assets/data/leaderboard_mini.json, using the schema below:
{
"rank": 1,
"model": "Your System Name",
"em": 23.08,
"runtime_s": 124.45,
"cost_usd": 0.96,
"dacc_p": 34.18, "dacc_r": 33.27, "dacc_f1": 40.03,
"dret_p": 2.61, "dret_r": 42.12, "dret_f1": 5.86,
"reported": "2026-05",
"source": "https://link-to-your-paper-or-traces",
"notes": "Optional short note about your system"
}
Field meanings live in the Details → Metrics section.
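Copying metrics by hand invites transcription mistakes, so a row can be assembled from report.json programmatically. The sketch below assumes report.json is a flat JSON object keyed by the metric names listed earlier; that layout is an assumption, and `build_leaderboard_row` is a hypothetical helper rather than something shipped with the repository.

```python
import json

# Metric names as listed for report.json and the leaderboard schema.
METRIC_KEYS = (
    "em", "runtime_s", "cost_usd",
    "dacc_p", "dacc_r", "dacc_f1",
    "dret_p", "dret_r", "dret_f1",
)


def build_leaderboard_row(report, model, reported, source, rank=None, notes=""):
    """Build a leaderboard row from a report dict, checking all metrics exist."""
    missing = [k for k in METRIC_KEYS if k not in report]
    if missing:
        raise KeyError(f"report is missing metrics: {missing}")
    row = {"rank": rank, "model": model}
    row.update({k: report[k] for k in METRIC_KEYS})
    row.update({"reported": reported, "source": source, "notes": notes})
    return row


if __name__ == "__main__":
    with open("runs/mini_haiku/report.json", encoding="utf-8") as fh:
        report = json.load(fh)
    row = build_leaderboard_row(
        report,
        model="Your System Name",
        reported="2026-05",
        source="https://link-to-your-paper-or-traces",
    )
    print(json.dumps(row, indent=2))
```

The rank is left as None by default here, since the position depends on the other rows already in the file.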
If LakeQA is useful in your work, please cite:
@inproceedings{lakeqa2026,
title = {{LakeQA}: An Exploratory QA Benchmark over a Million-Scale Data Lake},
author = {Haonan Wang and Jiaxiang Liu and Yurong Liu and Austin Senna Wijaya and Tianle Zhou and Eden Wu and Yijia Chen and Wanting You and Reya Vir and Daniela Pinto and Grace Fan and Yusen Zhang and Juliana Freire and Eugene Wu},
note = {Haonan Wang and Jiaxiang Liu contributed equally; order decided by a coin flip},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
series = {Proceedings of Machine Learning Research},
volume = {306},
publisher = {PMLR},
year = {2026},
address = {Seoul, South Korea}
}