Details

Task definition, a worked example, evaluation metrics, and why LakeQA is hard.

What is a task?

Each task in LakeQA is a tuple (Q, D, D*, A), where Q is a natural-language question, D = {D₁, …, Dₙ} is the data lake (each Dᵢ is a document from Wikipedia or Data.gov), D* ⊆ D is the gold set of relevant documents, and A is the final answer.

Given Q and tool access to D, the agent must: (i) explore D to identify a relevant subset, (ii) extract supporting facts and/or compute aggregated statistics from the retrieved files, and (iii) synthesize the final answer. Because D contains roughly 40 million documents, exhaustive content-level search is computationally infeasible — agents must strategically use the search tool to prune the space using ontology tags before examining contents.
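The tuple (Q, D, D*, A) can be pictured as a small record. The following is a minimal sketch rather than LakeQA's actual schema: the class name `Task`, its field names, and the document IDs are illustrative assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One task (Q, D, D*, A). Field names are illustrative, not LakeQA's schema."""

    question: str            # Q: natural-language question
    lake: frozenset[str]     # D: IDs of every document in the data lake
    gold: frozenset[str]     # D*: IDs of the gold relevant documents
    answer: str              # A: final answer

    def __post_init__(self) -> None:
        # By definition D* is a subset of D.
        if not self.gold <= self.lake:
            raise ValueError("gold set D* must be a subset of the data lake D")


# Toy instance with made-up document IDs.
task = Task(
    question="placeholder question",
    lake=frozenset({"doc-a", "doc-b", "doc-c"}),
    gold=frozenset({"doc-a", "doc-c"}),
    answer="placeholder answer",
)
```

In practice D is far too large to hold in memory as a set; the record only makes the roles of the four components explicit.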

Tool interface

The agent is provided with a fixed set of tools 𝒯 for discovery, retrieval, and local analysis:

Tool                  | Input                          | Output / Purpose
search(query)         | keyword or tag string          | Returns a ranked list of dataset IDs relevant to the query.
listdata(dataset-ids) | list of dataset IDs            | Lists files available under each dataset ID directory.
download(dataset-ids) | list of dataset IDs            | Downloads all files for each dataset ID into a per-task local sandbox.
inspect(path)         | file path (e.g. <id>/<file>)   | Returns the first k characters of the file (a lightweight preview).
query(path, q)        | file path and query            | Queries over locally downloaded data (e.g., filter, aggregate).
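To make the call order concrete, here is a toy in-memory stand-in for the five tools. It is a sketch under stated assumptions: the table gives only tool names and inputs, so the Python signatures, return types, ranking rule, and substring-based `query` below are hypothetical, and the dataset IDs and file contents are invented.

```python
from __future__ import annotations


class InMemoryTools:
    """Toy stand-in for the tool set; signatures and behavior are assumptions."""

    def __init__(self, lake: dict[str, dict[str, str]], k: int = 26) -> None:
        self.lake = lake                    # dataset ID -> {file name -> contents}
        self.sandbox: dict[str, str] = {}   # "<id>/<file>" -> contents
        self.k = k                          # preview length used by inspect()

    def search(self, query: str) -> list[str]:
        # Rank dataset IDs by how many query terms occur in the ID.
        terms = query.lower().split()
        scored = [(sum(t in ds for t in terms), ds) for ds in self.lake]
        ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
        return [ds for score, ds in ranked if score > 0]

    def listdata(self, dataset_ids: list[str]) -> dict[str, list[str]]:
        return {ds: sorted(self.lake[ds]) for ds in dataset_ids}

    def download(self, dataset_ids: list[str]) -> None:
        # Copy every file of each dataset into the local sandbox.
        for ds in dataset_ids:
            for name, text in self.lake[ds].items():
                self.sandbox[f"{ds}/{name}"] = text

    def inspect(self, path: str) -> str:
        return self.sandbox[path][: self.k]

    def query(self, path: str, q: str) -> list[str]:
        # Toy query language: keep the lines that contain q.
        return [line for line in self.sandbox[path].splitlines() if q in line]


# Invented lake contents, for illustration only.
lake = {
    "toy-expenditures-2021": {
        "rows.txt": "Agency Name,Payments Total\nAGENCY A,10\nAGENCY B,3\n"
    },
    "toy-parks-2021": {"rows.txt": "Park,Visitors\nPARK X,7\n"},
}
tools = InMemoryTools(lake)

hits = tools.search("expenditures 2021")      # discovery
files = tools.listdata(hits[:1])              # what is inside the top hit?
tools.download(hits[:1])                      # retrieval into the sandbox
preview = tools.inspect("toy-expenditures-2021/rows.txt")
rows = tools.query("toy-expenditures-2021/rows.txt", "AGENCY A")  # local analysis
```

The order search, listdata, download, inspect, query mirrors the discovery, retrieval, and local-analysis roles named above; a real agent would interleave these calls as its plan evolves.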

Example task

A real task from LakeQA. The agent must combine three years of Missouri state expenditure tables with three years of state employee-pay tables, then chain into Wikipedia to find a person.

{
  "question": "Across 2021–2023, which Missouri state agency stayed in the top five for professional services spending and also had the highest average employee count? In its headquarters city, a historically Black university was founded. Who founded that university? Return the name only. Your response should be in the format [Answer].",

  "answer": "James Milton Turner",

  "datasets_used": [
    "2021-state-expenditures", "2022-state-expenditures", "2023-state-expenditures",
    // 6 more — see assets/task_7.json
  ],

  "nodes": {
    "1": {
      "source": "datagov/2021-state-expenditures/files/rows.txt",
      "subquestion": "What were the top 5 Missouri state agencies by total professional services spending in fiscal year 2021?",
      "fact": """   # imports compacted from 5 lines for readability
        import io, pandas as pd, boto3
        from botocore import UNSIGNED
        from botocore.config import Config

        source = "datagov/2021-state-expenditures/files/rows.txt"
        bucket = "lakeqa-yc4103-datalake"
        obj = boto3.client("s3", config=Config(signature_version=UNSIGNED)) \
                   .get_object(Bucket=bucket, Key=source)
        df = pd.read_csv(io.BytesIO(obj["Body"].read()), low_memory=False)

        result = (
            df[df["Category Description"] == "PROFESSIONAL SERVICES"]
              .groupby("Agency Name", dropna=False)["Payments Total"]
              .sum()
              .sort_values(ascending=False)
              .reset_index()
        )
        answer = result.head(5)["Agency Name"].tolist()
      """,
      "answer": ["SOCIAL SERVICES", "CORRECTIONS", "PUBLIC SAFETY",
                 "TRANSPORTATION", "ELEMENTARY AND SECONDARY EDUCATION"]
    },
    // 8 more nodes hidden — see assets/task_7.json
  },

  "reasoning_hops": [
    { "hop_id": 1, "answer": ["CORRECTIONS", "SOCIAL SERVICES", "TRANSPORTATION"] },
    { "hop_id": 2, "answer": "CORRECTIONS" },
    { "hop_id": 3, "answer": "Jefferson City" },
    { "hop_id": 4, "answer": "Lincoln University" },
    { "hop_id": 5, "answer": "James Milton Turner" }
  ]
}

Trimmed from assets/task_7.json in this repo. The full task has 9 retrieval/computation nodes: hops 1 and 2 each fan out across three yearly datasets, and hops 3 to 5 take one node each. Node 1's fact is shown verbatim, except that its import lines are compacted for readability.
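Hop 1 combines three per-year rankings into the set of agencies that stayed in the top five every year. A minimal sketch of that fan-in step follows, assuming per-year totals have already been computed as in node 1. The helper names `top_n` and `stayed_in_top_n` are hypothetical, and the totals use placeholder agencies and invented numbers, not values from the Missouri datasets.

```python
from __future__ import annotations


def top_n(totals: dict[str, float], n: int = 5) -> list[str]:
    """Names of the n largest totals, ties broken alphabetically."""
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:n]]


def stayed_in_top_n(yearly_totals: dict[int, dict[str, float]], n: int = 5) -> list[str]:
    """Names that appear in the top n for every year."""
    yearly_sets = [set(top_n(totals, n)) for totals in yearly_totals.values()]
    if not yearly_sets:
        return []
    return sorted(set.intersection(*yearly_sets))


# Placeholder agencies and invented totals, for illustration only.
yearly = {
    2021: {"A": 90, "B": 80, "C": 70, "D": 10},
    2022: {"A": 50, "B": 20, "C": 60, "D": 55},
    2023: {"A": 30, "B": 5, "C": 40, "D": 35},
}
survivors = stayed_in_top_n(yearly, n=3)
print(survivors)  # ['A', 'C']
```

Hop 2 would then rank only these survivors by average employee count across the same three years.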

Metrics

We report end-to-end performance using Exact Match (EM) against the ground-truth answer (verified by multiple annotators), together with wall-clock runtime and dollar cost per task. To study the role of search separately from reasoning, we additionally evaluate dataset-level precision, recall, and F1 against the gold set D* on two dataset collections induced by the agent's traces, the accessed set D_acc and the retrieval set D_ret:

End-to-end

  • EM — Exact Match against the gold answer (↑)
  • Runtime — wall-clock seconds per task (↓)
  • Cost — US dollars per task (↓)
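As an illustration of how EM could be scored when responses follow the "[Answer]" format requested in the example task, here is a hedged sketch. The bracket extraction and the case and whitespace normalization are assumptions, not LakeQA's documented scoring rules, and the function names are hypothetical.

```python
import re


def extract_answer(response: str) -> str:
    """Text inside the last [...] span, or the whole response if none is present."""
    spans = re.findall(r"\[([^\[\]]*)\]", response)
    return (spans[-1] if spans else response).strip()


def exact_match(response: str, gold: str) -> bool:
    # Assumption: case and surrounding whitespace are ignored before comparing.
    return extract_answer(response).casefold() == gold.strip().casefold()


print(exact_match("[James Milton Turner]", "James Milton Turner"))   # True
print(exact_match("[Lincoln University]", "James Milton Turner"))    # False
```

A stricter scorer would drop the case folding; the right choice depends on how the gold answers were written.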

Accessed set D_acc

  • Precision vs D*
  • Recall vs D*
  • F1 vs D*

Datasets the agent opened or queried while constructing its answer.

Retrieval set D_ret

  • Precision vs D*
  • Recall vs D*
  • F1 vs D*

Union of dataset IDs surfaced by the agent's discovery calls, regardless of whether the agent went on to use them.

Why both D_ret and D_acc?

The gap between D_ret and D_acc can be read as reasoning failure: the agent surfaced a relevant document but never used it. Low recall on D_ret can be read as search failure: the agent could not formulate queries that surface the relevant documents at all.
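The dataset-level metrics are ordinary set precision, recall, and F1 against D*. The sketch below computes them for a toy trace and lists the gold datasets that were surfaced but never accessed. The function name `prf1`, the zero-division conventions, and the dataset IDs are assumptions made for illustration.

```python
from __future__ import annotations


def prf1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1; empty denominators score 0.0."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy trace with made-up dataset IDs.
gold = {"g1", "g2", "g3", "g4"}                 # D*
d_ret = {"g1", "g2", "g3", "x1", "x2", "x3"}    # surfaced by discovery
d_acc = {"g1", "x1"}                            # actually opened or queried

ret_scores = prf1(d_ret, gold)   # recall 0.75: search found three of four gold datasets
acc_scores = prf1(d_acc, gold)   # recall 0.25: only one gold dataset was used

# Gold datasets the agent surfaced but never used (the "reasoning" gap).
surfaced_unused = sorted((d_ret & gold) - d_acc)
print(surfaced_unused)  # ['g2', 'g3']
```

In this toy trace, g4 was never surfaced (a search miss), while g2 and g3 were surfaced and then ignored (a reasoning miss).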

Why is it hard?

Challenge 1

Missing evidence and high search density

The agent must identify which documents contain the required evidence inside a collection where even enumerating candidate names is prohibitively expensive — over one million candidates per task.

Challenge 2

Abundance and heterogeneity

The collection mixes kilobyte text files with gigabyte-scale tables and structured with unstructured formats, and it stores each table and its metadata in separate documents.

Challenge 3

Exploratory planning and high reasoning density

Tasks require an average of 7.67 documents (vs ~2–4 in prior multi-hop benchmarks). The agent must repeatedly reason about what is missing, propose targeted queries, and verify evidence before synthesizing an answer.