Task definition, a worked example, evaluation metrics, and why LakeQA is hard.
Each task in LakeQA is a tuple (Q, D, D*, A), where
Q is a natural-language question,
D = {D₁, …, Dn} is the data lake (each Di is a document
from Wikipedia or Data.gov), D* ⊆ D is the gold set of relevant documents,
and A is the final answer.
Given Q and tool access to D, the agent must:
(i) explore D to identify a relevant subset,
(ii) extract supporting facts and/or compute aggregated statistics from the retrieved files, and
(iii) synthesize the final answer.
Because D contains roughly 40 million documents, exhaustive content-level search is
computationally infeasible — agents must strategically use the search tool to prune the space using
ontology tags before examining contents.
The agent is provided with a fixed set of tools 𝒯 for discovery, retrieval, and local analysis:
| Tool | Input | Output / Purpose |
|---|---|---|
search(query) |
keyword or tag string | Returns a ranked list of dataset IDs relevant to the query. |
listdata(dataset-ids) |
list of dataset IDs | Lists files available under each dataset ID directory. |
download(dataset-ids) |
list of dataset IDs | Downloads all files for each dataset ID into a per-task local sandbox. |
inspect(path) |
file path (e.g. <id>/<file>) |
Returns the first k characters of the file (a lightweight preview). |
query(path, q) |
file path and query | Queries over locally downloaded data (e.g., filter, aggregate). |
A real task from LakeQA. The agent must combine three years of Missouri state expenditure tables with three years of state employee-pay tables, then chain into Wikipedia to find a person.
{
"question": "Across 2021–2023, which Missouri state agency stayed in the top five for professional services spending and also had the highest average employee count? In its headquarters city, a historically Black university was founded. Who founded that university? Return the name only. Your response should be in the format [Answer].",
"answer": "James Milton Turner",
"datasets_used": [
"2021-state-expenditures", "2022-state-expenditures", "2023-state-expenditures",
// 6 more — see assets/task_7.json
],
"nodes": {
"1": {
"source": "datagov/2021-state-expenditures/files/rows.txt",
"subquestion": "What were the top 5 Missouri state agencies by total professional services spending in fiscal year 2021?",
"fact": """ # imports compacted from 5 lines for readability
import io, pandas as pd, boto3
from botocore import UNSIGNED
from botocore.config import Config
source = "datagov/2021-state-expenditures/files/rows.txt"
bucket = "lakeqa-yc4103-datalake"
obj = boto3.client("s3", config=Config(signature_version=UNSIGNED)) \
.get_object(Bucket=bucket, Key=source)
df = pd.read_csv(io.BytesIO(obj["Body"].read()), low_memory=False)
result = (
df[df["Category Description"] == "PROFESSIONAL SERVICES"]
.groupby("Agency Name", dropna=False)["Payments Total"]
.sum()
.sort_values(ascending=False)
.reset_index()
)
answer = result.head(5)["Agency Name"].tolist()
""",
"answer": ["SOCIAL SERVICES", "CORRECTIONS", "PUBLIC SAFETY",
"TRANSPORTATION", "ELEMENTARY AND SECONDARY EDUCATION"]
},
// 8 more nodes hidden — see assets/task_7.json
},
"reasoning_hops": [
{ "hop_id": 1, "answer": ["CORRECTIONS", "SOCIAL SERVICES", "TRANSPORTATION"] },
{ "hop_id": 2, "answer": "CORRECTIONS" },
{ "hop_id": 3, "answer": "Jefferson City" },
{ "hop_id": 4, "answer": "Lincoln University" },
{ "hop_id": 5, "answer": "James Milton Turner" }
]
}
Trimmed from assets/task_7.json in this repo. The full task has 9 retrieval/computation nodes
— hops 1 and 2 each fan out across three yearly datasets. Node 1's fact is shown verbatim,
with the import lines compacted on a single line for readability.
We report end-to-end performance using Exact Match (EM) against the ground-truth answer (verified by multiple annotators), together with the wall-clock runtime and dollar cost per task. To study the role of search separately from reasoning, we additionally evaluate dataset-level precision, recall, and F1 on two dataset collections induced by the agent's traces:
EM — Exact Match against the gold answer (↑)Runtime — wall-clock seconds per task (↓)Cost — US dollars per task (↓)D_accPrecision vs D*Recall vs D*F1 vs D*Datasets the agent opened or queried while constructing its answer.
D_retPrecision vs D*Recall vs D*F1 vs D*Union of dataset IDs the agent had access to via discovery, regardless of whether they were used.
D_ret and D_acc?
The gap between D_ret and D_acc can be interpreted as reasoning failure
— the agent surfaced the relevant document but did not actually use it. Low recall on D_ret
can be interpreted as search failure — the agent was not able to pinpoint what to query to find
the relevant documents at all.
The agent must identify which documents contain the required evidence inside a collection where even enumerating candidate names is prohibitively expensive — over one million candidates per task.
The collection mixes kilobyte text files with gigabyte-scale tables, structured and unstructured formats, and separates each table from its metadata into different documents.
Tasks require an average of 7.67 documents (vs ~2–4 in prior multi-hop benchmarks). The agent must repeatedly reason about what is missing, propose targeted queries, and verify evidence before synthesizing an answer.