ASAP: Exploiting the Satisficing Generalization Edge in Neural Combinatorial Optimization

Abstract

Deep Reinforcement Learning (DRL) has emerged as a promising approach for solving Combinatorial Optimization (CO) problems, such as the 3D Bin Packing Problem (3D-BPP), Traveling Salesman Problem (TSP), or Vehicle Routing Problem (VRP), but these neural solvers often exhibit brittleness when facing distribution shifts. To address this issue, we uncover the Satisficing Generalization Edge, which we validate both theoretically and experimentally: identifying a set of promising actions is inherently more generalizable than selecting the single optimal action. To exploit this property, we propose Adaptive Selection After Proposal (ASAP), a generic framework that decomposes the decision-making process into two distinct phases: a proposal policy that acts as a robust filter, and a selection policy as an adaptable decision maker. This architecture enables a highly effective online adaptation strategy where the selection policy can be rapidly fine-tuned on a new distribution. Concretely, we introduce a two-phase training framework enhanced by Model-Agnostic Meta-Learning (MAML) to prime the model for fast adaptation. Extensive experiments on 3D-BPP, TSP, and CVRP demonstrate that ASAP improves the generalization capability of state-of-the-art baselines and achieves superior online adaptation on out-of-distribution instances.

Core Idea

Satisficing Generalization Edge

Finding a set of promising actions transfers better across distributions than committing to one exact optimal action.

Proposal + Selection

A proposal policy filters robust candidate actions, then a selection policy makes the final decision inside that smaller action space.

Fast Online Adaptation

On a new distribution, ASAP freezes the proposal policy and fine-tunes the lightweight selection policy for rapid adaptation.

Why Proposal Helps

Preliminary experiments motivating the proposal policy

Preliminary experiments show that proposal sets preserve high-quality actions under distribution shift, while top-1 action selection is more brittle.

Experimental Highlights

3

CO domains

Validated on 3D-BPP, TSP, and CVRP.

7.87%

TSP-1000 gain

POMO+ASAP improves the Clustered TSP-1000 optimality gap from 47.88% to 40.01% after adaptation.

1.65%

CVRP-50 gain

INViT-1V+ASAP with MAML improves Clustered CVRP-50 from 5.43% to 3.78% after adaptation.

Evaluation Distributions

Instance distributions for ID and OOD datasets

ASAP is evaluated across in-distribution and out-of-distribution settings with varied spatial structures and instance scales.

BibTeX

@article{fang2026asap,
  title={ASAP: Exploiting the Satisficing Generalization Edge in Neural Combinatorial Optimization},
  author={Han Fang and Paul Weng and Yutong Ban},
  journal={arXiv preprint arXiv:2501.17377},
  year={2026},
  url={https://arxiv.org/abs/2501.17377}
}