2026-05-26 · data

LUCoS: Latent Unsupervised Context Selection for Tabular Foundation Models

Oroel Ipas, Guillermo Gomez-Trenado, Rocío Romero-Zaliz, Isaac Triguero

Key claim

LUCoS outperforms random selection and traditional selection methods when choosing which instances to label in low-label tabular learning.

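The paper presents LUCoS, a method for choosing which unlabeled instances to label in low-label tabular learning, based on the latent geometry of embeddings from an unsupervised tabular foundation model. The labeled instances then serve as context for a TabPFN-style classifier. LUCoS significantly outperforms random selection and traditional methods across the evaluated datasets and labeling budgets, which highlights the importance of representativeness in context selection.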

Novelty
8.0/10

The proposed LUCoS method introduces a new approach to context selection in low-label tabular learning, leveraging latent geometry for improved performance.

Reliability
8.0/10

The evaluation on 67 datasets with multiple metrics demonstrates solid experimental validation and robustness of the claims.

Deep reliability assessment

The methodology supports the narrower claim that, on low-label OpenML-CC18 classification tasks, selecting K-Medoids medoids in TabClustPFN latent space can outperform selection in raw feature space or in alternative embedding spaces for TabPFN-style context construction. The paper overclaims if it is read as a general solution for tabular cold-start labeling, because regression, high-cardinality classification, severe imbalance, very large unlabeled pools, and other tabular embedding models are not fully validated.

Reproducibility

No code repository is mentioned in the provided paper text. The benchmark uses public OpenML-CC18 datasets and describes the use of TabPFN-2.5, TabClustPFN embeddings, K-Medoids/KM++ selection, budgets from 1C to 32C, and Wilcoxon tests, but full reproducibility would depend on unavailable implementation details and code; the ZEUS comparison is also incomplete due to early termination and GPU memory limits.
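
Since the paired comparisons rest on Wilcoxon tests, a minimal sketch of the signed-rank statistic may help readers who want to re-check a comparison from per-dataset scores. It assumes the paired signed-rank variant, which this summary does not specify. The function name and the toy scores are hypothetical and are not taken from the paper.

```python
def wilcoxon_signed_rank(scores_a, scores_b):
    """Return (w_plus, w_minus) for paired samples; zero differences are dropped."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    # Sort positions by absolute difference, then assign 1-based ranks,
    # averaging the ranks inside each group of tied absolute differences.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]])):
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = avg_rank
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Made-up per-dataset scores (in percent) for illustration only; NOT paper results.
toy_method_a = [82, 75, 91, 68, 77, 85]
toy_method_b = [80, 76, 88, 68, 72, 81]
w_plus, w_minus = wilcoxon_signed_rank(toy_method_a, toy_method_b)
print(w_plus, w_minus)
```

The sketch stops at the rank sums. Turning them into a p-value requires the null distribution of the statistic, which a statistics library such as SciPy (`scipy.stats.wilcoxon`) provides.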

Discussion questions

  1. Does an unsupervised PFN embedding actually encode predictive similarity, or is LUCoS mostly benefiting from density/coverage in benchmarks where classes happen to align with clusters?
  2. For builders with costly labeling workflows, when is the extra compute and engineering complexity of latent embedding plus K-Medoids justified over random sampling or simple stratified heuristics?
  3. What result would falsify LUCoS: failure on datasets with rare but high-value minority classes, regression targets, strong domain shift, or a controlled benchmark where cluster structure is deliberately unaligned with labels?
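
One way to probe questions 1 and 3 is to measure how well cluster assignments line up with the true labels once labels are available. The sketch below uses cluster purity for this. Purity is a diagnostic chosen here for illustration, not one the paper is known to report, and the function name and toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, labels):
    """Fraction of points whose label equals the majority label of their cluster."""
    by_cluster = defaultdict(list)
    for cluster, label in zip(cluster_ids, labels):
        by_cluster[cluster].append(label)
    # For each cluster, count how many members carry its most common label.
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in by_cluster.values()
    )
    return majority_total / len(labels)


# Toy assignments for illustration only.
clusters = [0, 0, 0, 1, 1, 1]
aligned = cluster_purity(clusters, ["a", "a", "a", "b", "b", "b"])    # clusters match classes
unaligned = cluster_purity(clusters, ["a", "b", "a", "b", "a", "b"])  # clusters cut across classes
print(aligned, unaligned)
```

High purity on a dataset would suggest that the clusters mirror the classes, which is the setting where medoid selection is expected to help. A controlled benchmark with low purity would test whether the gains survive when that alignment is removed.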

Key figure

Figure 1 shows LUCoS embedding the unlabeled training set with an unsupervised tabular foundation model, selecting representative medoids in latent space, sending those original instances for expert labeling, and using the labeled subset as context for a supervised TabPFN-style classifier at inference time.
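
The selection step in that pipeline can be sketched as a small K-Medoids routine over latent vectors. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy 2-D points stand in for embeddings from an unsupervised tabular foundation model, the function names are hypothetical, and the deterministic farthest-first seeding is a simplification swapped in for whatever initialization the paper uses. It also assumes `k` does not exceed the number of distinct points.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_medoids(embeddings, k, n_iter=20):
    """Return sorted indices of k medoids found by alternating K-Medoids."""
    n = len(embeddings)
    # Full pairwise distance matrix: fine for small pools, O(n^2) memory.
    dist = [[euclidean(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]

    # Farthest-first seeding (assumption: a deterministic stand-in for the
    # paper's initialization): start at row 0, then repeatedly add the point
    # farthest from all current medoids.
    medoids = [0]
    while len(medoids) < k:
        medoids.append(
            max(range(n), key=lambda i: min(dist[i][m] for m in medoids))
        )

    for _ in range(n_iter):
        medoid_set = set(medoids)
        clusters = {m: [] for m in medoids}
        # Assignment step: each point joins its nearest medoid's cluster;
        # a medoid always stays in its own cluster so none ends up empty.
        for i in range(n):
            nearest = i if i in medoid_set else min(
                medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)
        # Update step: the new medoid is the member with the smallest
        # total distance to the other members of its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == medoid_set:
            break
        medoids = new_medoids
    return sorted(medoids)


# Toy latent vectors: two well-separated groups of five points each.
toy_latent = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0), (10.5, 10.5),
]
rows_to_label = select_medoids(toy_latent, k=2)
print(rows_to_label)  # [4, 9]: the central point of each group
```

The returned indices identify the original rows to send for expert labeling. Only those labeled rows would then be passed as context to the supervised classifier.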

Benchmark results

58 OpenML-CC18 datasets with complete ZEUS comparison, folds 1-5 · Mean AUC: 0.78 · vs LUCoS with ZEUS embeddings: +0.016
58 OpenML-CC18 datasets with complete ZEUS comparison, folds 1-5 · Mean AUC: 0.718 · vs LUCoS with ZEUS embeddings: +0.024
58 OpenML-CC18 datasets with complete ZEUS comparison, folds 1-5 · Mean AUC: 0.77 · vs LUCoS with ZEUS embeddings: +0.022
58 OpenML-CC18 datasets with complete ZEUS comparison, folds 1-5 · Mean AUC: 0.874 · vs LUCoS with ZEUS embeddings: +0.007