Get started — LakeQA

1. Prerequisites

LakeQA's data lake is hosted on a public S3 bucket and is accessed anonymously — no AWS account is required. Local installation needs:

Python 3.10+
git
Approximately 9.5 TB of free disk space if you mirror the full data lake (Option B below). The on-demand flow (Option A) downloads files per task and does not need this.

Install the Python dependencies with:

pip install pandas boto3 botocore

2. Get the dataset

LakeQA is published under two public S3 roots containing roughly 40 million deduplicated files from Wikipedia and Data.gov. There is no auth — clients use the anonymous (unsigned) S3 protocol.

Data roots: s3://lakeqa-yc4103-datalake/datagov and s3://lakeqa-yc4103-datalake/wikipedia.
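The boto3 calls used later take a bucket name and a key prefix rather than an s3:// URI, so a small helper to split the two roots can be handy. The sketch below is illustrative only: split_s3_uri and list_keys are hypothetical helper names, and listing assumes the bucket permits anonymous list requests, which this page does not state.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_keys(s3, bucket: str, prefix: str, limit: int = 10) -> list[str]:
    """Return up to `limit` object keys under `prefix` via the S3 paginator."""
    keys: list[str] = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages with no matching objects have no "Contents" entry.
        for item in page.get("Contents", []):
            keys.append(item["Key"])
            if len(keys) >= limit:
                return keys
    return keys


bucket, prefix = split_s3_uri("s3://lakeqa-yc4103-datalake/datagov")
# With boto3 installed, an unsigned client can be passed in (assumption:
# anonymous listing is allowed on this bucket):
#   s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
#   print(list_keys(s3, bucket, prefix + "/"))
```

Passing the client in as an argument keeps the helper usable with any object that exposes get_paginator, which also makes it easy to exercise offline.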

Option A — On-demand (recommended for most users)

Don't pre-download anything. The reference agent downloads only the files it inspects for each task. This is how the example in Details works — see node 1's fact:

import io

import boto3
import pandas as pd
from botocore import UNSIGNED
from botocore.config import Config

bucket = "lakeqa-yc4103-datalake"
key = "datagov/2021-state-expenditures/files/rows.txt"

# Unsigned requests: the bucket is public, so no AWS credentials are needed.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Fetch the object into memory and parse it as CSV.
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(io.BytesIO(obj["Body"].read()), low_memory=False)
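When the same file is inspected across several tasks, a small on-disk cache avoids repeated downloads. The following is a minimal sketch, not part of the LakeQA code: cached_fetch and the cache directory layout are hypothetical, and the fetch callable is whatever function returns the object's bytes.

```python
from pathlib import Path
from typing import Callable


def cached_fetch(key: str, cache_dir: Path, fetch: Callable[[str], bytes]) -> Path:
    """Return a local path for `key`, calling `fetch` only on a cache miss."""
    local = Path(cache_dir) / key
    if not local.exists():
        # Mirror the key's directory structure under the cache root.
        local.parent.mkdir(parents=True, exist_ok=True)
        local.write_bytes(fetch(key))
    return local


# Example wiring with the unsigned boto3 client from the snippet above
# (left as a comment so this block runs without network access):
#   fetch = lambda k: s3.get_object(Bucket=bucket, Key=k)["Body"].read()
#   path = cached_fetch(key, Path("./lakeqa-cache"), fetch)
#   df = pd.read_csv(path, low_memory=False)
```

The cache is keyed on the S3 key alone, which is adequate as long as the objects in the bucket are not rewritten in place.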

Option B — Bulk download

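For local-only experiments or full reproducibility, mirror the bucket with the AWS CLI (installed separately; it is not one of the pip dependencies above). Allocate at least 9.5 TB of disk and plan for a long transfer: even at a sustained 1 Gbit/s, 9.5 TB takes roughly 21 hours.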

aws s3 sync --no-sign-request \
    s3://lakeqa-yc4103-datalake/datagov \
    ./lakeqa-data/datagov

aws s3 sync --no-sign-request \
    s3://lakeqa-yc4103-datalake/wikipedia \
    ./lakeqa-data/wikipedia
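As a rough planning aid for the full mirror, transfer time can be estimated from the dataset size and the sustained link speed. The helper below is a back-of-the-envelope sketch: transfer_hours is a hypothetical name, 1 TB is taken as 10^12 bytes, and protocol overhead and per-object request latency are ignored, so real syncs will be slower.

```python
def transfer_hours(size_tb: float, gbit_per_s: float) -> float:
    """Idealised hours needed to move `size_tb` terabytes at `gbit_per_s`."""
    size_bits = size_tb * 1e12 * 8          # decimal terabytes to bits
    seconds = size_bits / (gbit_per_s * 1e9)
    return seconds / 3600


for speed in (0.1, 1.0, 10.0):
    print(f"{speed:>5} Gbit/s: {transfer_hours(9.5, speed):.1f} h")
```

At 1 Gbit/s this gives about 21.1 hours for 9.5 TB, and about 2.1 hours at 10 Gbit/s.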

The task files

Tasks themselves are small JSON records, stored one per line in JSONL files that ship with the code repository (next section). Each task has the structure shown in Details → Example task.

Splits: tasks/lakeqa_full.jsonl contains all 1,007 tasks. tasks/lakeqa_mini.jsonl is a stratified 135-task subset for fast iteration.
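The split files can be read with the standard library alone. This is a minimal sketch that assumes the usual JSONL convention of one JSON object per line; load_tasks is a hypothetical helper, and no task field names are assumed.

```python
import json


def load_tasks(path: str) -> list[dict]:
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    tasks = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                tasks.append(json.loads(line))
    return tasks


# After cloning the code repository:
#   tasks = load_tasks("tasks/lakeqa_mini.jsonl")
#   print(len(tasks))  # the text above gives 135 tasks for the mini split
```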

3. Get the code

The reference agent, evaluation harness, and tool implementations live in a separate repository:

git clone https://github.com/lakeagent/datalake-qa.git
cd datalake-qa
pip install -e .

The repository ships with the five tools described in Details (search, listdata, download, inspect, query), a baseline agent skeleton, and an evaluator that produces the leaderboard metrics (EM, runtime, cost, and the D_acc / D_ret precision/recall/F1).

4. Run the baseline on one task

Smoke-test your install by running the bundled baseline agent on the first task of LakeQA-mini:

python -m lakeqa.run \
    --tasks tasks/lakeqa_mini.jsonl \
    --task-id 1 \
    --agent baseline \
    --model claude-haiku-4-5 \
    --out runs/smoke_test/

This downloads the files the agent inspects on demand, executes the agent loop until it submits an answer (or hits the turn limit), and writes a per-task trace JSON plus a top-level predictions.jsonl to runs/smoke_test/.

Run on the full mini split:

python -m lakeqa.run \
    --tasks tasks/lakeqa_mini.jsonl \
    --agent baseline \
    --model claude-haiku-4-5 \
    --out runs/mini_haiku/

5. Add your own agent

An agent is a Python class with a single step(state) -> tool_call method. The harness handles the loop, the tool execution, and the final-answer extraction. Subclass the base agent and register it:

from lakeqa.agents import BaseAgent, register

@register("my_agent")
class MyAgent(BaseAgent):
    def step(self, state):
        # state.history    : list of prior tool calls + outputs
        # state.tools      : the 5 LakeQA tools
        # state.scratchpad : your own working memory
        return state.tools.search(query="missouri department of corrections")

Then run with --agent my_agent. See examples/ in the code repo for full demonstrations.
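To make the division of labour between agent and harness concrete, here is a toy loop written from scratch. It is not the lakeqa harness: ToyState, run_loop, fake_execute, the tuple-shaped tool calls, and the ("answer", value) stopping convention are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyState:
    """Hypothetical stand-in for the harness state object."""
    history: list = field(default_factory=list)     # prior (call, output) pairs
    scratchpad: dict = field(default_factory=dict)  # agent-owned working memory


def toy_agent_step(state: ToyState) -> tuple[str, str]:
    """Search once, then answer with whatever the search returned."""
    if not state.history:
        return ("search", "missouri department of corrections")
    _, last_output = state.history[-1]
    return ("answer", last_output)


def fake_execute(name: str, arg: str) -> str:
    """Stand-in tool executor; the real tools query the data lake."""
    return f"{name} results for {arg!r}"


def run_loop(step, execute, max_turns: int = 5):
    """Call `step` until it returns ("answer", value) or turns run out."""
    state = ToyState()
    for _ in range(max_turns):
        name, arg = step(state)
        if name == "answer":
            return arg, state
        # Record the call and its output so the next step can see them.
        state.history.append(((name, arg), execute(name, arg)))
    return None, state


answer, final_state = run_loop(toy_agent_step, fake_execute)
print(answer)
```

The agent only decides the next call from the state it is given; looping, tool execution, and the turn limit all sit outside it, which mirrors the split described above.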

6. Evaluate

Score a run against the gold annotations:

python -m lakeqa.eval \
    --predictions runs/mini_haiku/predictions.jsonl \
    --tasks tasks/lakeqa_mini.jsonl \
    --report runs/mini_haiku/report.json

report.json contains the nine numbers you need for a leaderboard submission: em, runtime_s, cost_usd, dacc_p/dacc_r/dacc_f1, and dret_p/dret_r/dret_f1.
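Before preparing a submission it can be worth checking that the report holds every metric field named above. The check below is a sketch under one assumption: that report.json is a flat JSON object keyed by those field names. missing_report_keys is a hypothetical helper, not part of lakeqa.eval.

```python
import json

REPORT_KEYS = (
    "em", "runtime_s", "cost_usd",
    "dacc_p", "dacc_r", "dacc_f1",
    "dret_p", "dret_r", "dret_f1",
)


def missing_report_keys(report: dict) -> list[str]:
    """Return the metric fields that are absent or not numeric."""
    return [
        key for key in REPORT_KEYS
        if not isinstance(report.get(key), (int, float))
    ]


# Typical use after an evaluation run:
#   with open("runs/mini_haiku/report.json", encoding="utf-8") as fh:
#       problems = missing_report_keys(json.load(fh))
#   if problems:
#       raise SystemExit(f"report is missing: {problems}")
```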

7. Submit to the leaderboard

Submissions are accepted as pull requests against this site's repo. Add a row to the relevant JSON file and link your trace bundle (uploaded anywhere publicly accessible — your repo, a Hugging Face dataset, an S3 bucket, etc.).

Fork this site's repo.
Add an object to assets/data/leaderboard_full.json or assets/data/leaderboard_mini.json using the schema below.
Open a PR. Maintainers will sanity-check the trace and merge.

Submission schema

{
  "rank": 1,
  "model": "Your System Name",
  "em": 23.08,
  "runtime_s": 124.45,
  "cost_usd": 0.96,
  "dacc_p": 34.18, "dacc_r": 33.27, "dacc_f1": 40.03,
  "dret_p": 2.61,  "dret_r": 42.12, "dret_f1": 5.86,
  "reported": "2026-05",
  "source": "https://link-to-your-paper-or-traces",
  "notes": "Optional short note about your system"
}

Field meanings live in the Details → Metrics section.
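A submission object can be assembled from an evaluation report plus a few hand-written fields. The sketch below follows the schema shown above; make_entry is a hypothetical helper, the report is assumed to be a flat dict of metric fields, and how rank is assigned is left to the submitter because this page does not define it.

```python
import json

METRIC_KEYS = (
    "em", "runtime_s", "cost_usd",
    "dacc_p", "dacc_r", "dacc_f1",
    "dret_p", "dret_r", "dret_f1",
)


def make_entry(report: dict, *, rank: int, model: str,
               reported: str, source: str, notes: str = "") -> dict:
    """Build a leaderboard object with keys in the schema's order."""
    entry = {"rank": rank, "model": model}
    # Copy only the metric fields; a KeyError flags an incomplete report.
    entry.update({key: report[key] for key in METRIC_KEYS})
    entry.update({"reported": reported, "source": source, "notes": notes})
    return entry


# Example with placeholder values (not real results):
example = make_entry(
    {key: 0.0 for key in METRIC_KEYS},
    rank=1, model="Your System Name",
    reported="2026-05", source="https://link-to-your-paper-or-traces",
)
print(json.dumps(example, indent=2))
```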

8. Cite

If LakeQA is useful in your work, please cite:

@inproceedings{lakeqa2026,
  title     = {{LakeQA}: An Exploratory QA Benchmark over a Million-Scale Data Lake},
  author    = {Haonan Wang and Jiaxiang Liu and Yurong Liu and Austin Senna Wijaya and Tianle Zhou and Eden Wu and Yijia Chen and Wanting You and Reya Vir and Daniela Pinto and Grace Fan and Yusen Zhang and Juliana Freire and Eugene Wu},
  note      = {Haonan Wang and Jiaxiang Liu contributed equally; order decided by a coin flip},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  series    = {Proceedings of Machine Learning Research},
  volume    = {306},
  publisher = {PMLR},
  year      = {2026},
  address   = {Seoul, South Korea}
}