Task Definition • Dataset Construction Framework • Dataset • Getting Started • Result
We introduce a new practical scenario, NL-conditional table discovery (nlcTD), where users specify both a query table and additional requirements expressed in natural language (NL). We provide the corresponding automated and highly configurable dataset construction framework together with a large-scale dataset. Figure 1 depicts TableCopilot, a future paradigm of table discovery we envision: an interactive agent that understands both tables and natural language.
Definition 1 (NL-conditional Table Discovery). Given a table repository
Example 1. Imagine a teacher analyzing students' performance with an existing table containing information like student ID, name, and major (as shown in Figure 2). If they directly use this table to search for related tables, the system might return many matches with varying content (e.g., table ⑥ includes students' habits), making it challenging to find the desired tables. At this point, if a condition can be added on top of the original table (such as "I want a table that can be unioned with the original table and includes students with a high grade."), the retrieved tables will better satisfy user needs and thus reduce user selection effort.
Figure 3 provides an overview of the nlcTD taxonomy with illustrative examples of NL conditions. We begin by treating keyword-based table search as a simplified case of nlcTD, extending it to form a single category. Next, we extend query-table-based search by adding NL conditions, creating two advanced categories: NL-conditional table union search targets rows, while NL-conditional table join search focuses on identifying relevant columns. Each category features distinct NL requests. Furthermore, we classify NL conditions based on table granularity into three levels: table-level, column-level, and mixed-mode conditions.

Figure 3: The taxonomy of nlcTD, consisting of 16 NL condition subcategories along with their illustrative examples.
As depicted in Figure 4, the construction process consists of three main stages. First, we collect a large and diverse set of tables and apply filtering to obtain high-quality original tables. Next, we adopt table splitting to construct queries that include both NL conditions and query tables, while simultaneously generating ground truth labels. Finally, to enhance the diversity and authenticity of the dataset, we apply large language models (LLMs) for semantic augmentation of the ground truths that have been generated via table splitting. Meanwhile, we manually annotate several ground truths based on real SQL use cases contained in the Spider dataset.

Figure 4: The three stages of constructing nlcTables: (1) Table Preprocessing: collecting, filtering, and labeling tables; (2) Query Construction: splitting tables vertically and horizontally to create joinable and unionable tables; (3) Ground Truth Generation: generating labels via automatic table splitting with semantic augmentation, and manual SQL-based labeling.
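The table-splitting idea in stage 2 can be pictured with a minimal, self-contained sketch. The helper names below are hypothetical (the actual logic lives in `union.py` and `join.py`): a horizontal split yields two tables that share a schema and are therefore unionable, while a vertical split around a shared key column yields two tables that are joinable on that key.

```python
import random

def split_horizontal(rows, rate=0.5):
    """Row-wise split: both halves keep the full schema, so they
    form a unionable pair (simplified sketch of stage 2)."""
    k = max(1, int(len(rows) * rate))
    shuffled = rows[:]            # leave the caller's list intact
    random.shuffle(shuffled)
    return shuffled[:k], shuffled[k:]

def split_vertical(headers, rows, key_col, rate=0.5):
    """Column-wise split that keeps a shared key column in both
    halves, so they form a joinable pair."""
    others = [i for i in range(len(headers)) if i != key_col]
    k = max(1, int(len(others) * rate))
    left_idx = [key_col] + others[:k]
    right_idx = [key_col] + others[k:]
    pick = lambda idx, row: [row[i] for i in idx]
    left = ([headers[i] for i in left_idx], [pick(left_idx, r) for r in rows])
    right = ([headers[i] for i in right_idx], [pick(right_idx, r) for r in rows])
    return left, right
```

The real framework additionally generates NL conditions, negative samples, and ground-truth labels during splitting; this sketch only shows where unionable and joinable pairs come from.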
Our nlcTables supports NL-only table search (nlcTables_K), NL-conditional table union search (nlcTables-U), and NL-conditional table join search (nlcTables-J). For the union and join tasks, fuzzy versions (nlcTables-U-fz and nlcTables-J-fz) are provided via semantic augmentation. You can download these five datasets from Table 1. In total, nlcTables contains 22,080 tables with a large average size and includes 21,200 labeled ground truths (GTs). More detailed statistics are shown in Table 2.
| Datasets | Download |
| --- | --- |
| NL-only table search (nlcTables_K) | Download |
| NL-conditional table union search (nlcTables-U) | Download |
| nlcTables-U-fz | Download |
| NL-conditional table join search (nlcTables-J) | Download |
| nlcTables-J-fz | Download |
This example shows how to construct your own nlcTD datasets. Remember to change the file paths to match your environment.
- Generate the NL-conditional unionable table search dataset:

  ```shell
  python union.py
  ```

- Generate the NL-conditional joinable table search dataset:

  ```shell
  python join.py
  ```
- Use your own original table for splitting. You need to convert your table into a JSON file with the structure below:
  ```json
  {
      "title": [
          "Hancock St & Cottage Ave",
          "Quincy Ave Opp President Plaza",
          "Washington St & Broad St",
          "Commercial St Opp Brookside Rd"
      ],
      "numCols": 4,
      "numericColumns": [],
      "dateColumns": [0, 1, 2, 3],
      "pgTitle": "",
      "numDataRows": 26,
      "secondTitle": "",
      "numHeaderRows": 1,
      "caption": "Bus schedule",
      "data": [
          ["09:00 AM", " ", "08:44 AM", "08:46 AM"],
          ["08:55 AM", "08:51 AM", "08:40 AM", "08:42 AM"],
          ["11:29 AM", " ", "11:12 AM", "11:13 AM"]
      ]
  }
  ```
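If your source table is not already in this layout, a small converter helps. The sketch below fills the fields shown above from a header list and row-major data; `table_to_json` is a hypothetical helper name (not part of this repository), and numeric/date column detection is deliberately simplified.

```python
import json

def table_to_json(headers, rows, caption="", pg_title=""):
    """Build a table dict in the layout above from column headers and
    row-major data. Numeric-column detection is a naive float() check;
    date columns are left empty for the caller to fill in."""
    def is_numeric(i):
        try:
            for r in rows:
                float(r[i])
            return True
        except (ValueError, IndexError):
            return False

    return {
        "title": list(headers),
        "numCols": len(headers),
        "numericColumns": [i for i in range(len(headers)) if is_numeric(i)],
        "dateColumns": [],  # populate if you detect date-like columns
        "pgTitle": pg_title,
        "numDataRows": len(rows),
        "secondTitle": "",
        "numHeaderRows": 1,
        "caption": caption,
        "data": [list(r) for r in rows],
    }

table = table_to_json(["Stop", "Time"], [["Main St", "09:00 AM"]],
                      caption="Bus schedule")
print(json.dumps(table, indent=2))
```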
- Change hyper-parameters. We have implemented several splitting functions based on our taxonomy; for example, the table-level theme query is handled by the `split_theme_table` function. You can adjust these parameters when calling the functions:
  ```python
  def split_theme_table(index, json_file, query_folder, datalake_folder,
                        query_txt, groundtruth_txt, ori_minRow=10,
                        max_duplicate=0.1, min_split_rate=0.2,
                        template_num=3, shuffle=1, neg_num=10, pos_num=5):
  ```
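One convenient way to tune these settings without editing every call site is to keep the defaults in a dict and merge per-run overrides before calling the splitting function. The keyword names below mirror the signature above; the merging pattern itself is plain Python, not an API of this repository.

```python
# Default hyper-parameters, mirroring split_theme_table's signature.
defaults = dict(ori_minRow=10, max_duplicate=0.1, min_split_rate=0.2,
                template_num=3, shuffle=1, neg_num=10, pos_num=5)

# Override only what you want to change for this run.
overrides = dict(min_split_rate=0.3, neg_num=20)
params = {**defaults, **overrides}

# Then pass the merged settings through, e.g.:
# split_theme_table(index, json_file, query_folder, datalake_folder,
#                   query_txt, groundtruth_txt, **params)
```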