Auto-FL-Research: Agentic Search for Federated Learning Algorithms



Overview


Federated learning (FL) research is characterized by numerous small yet consequential algorithmic choices: optimizer variants, server aggregation rules, local training schedules, normalization and regularization techniques, and model architectures. These decisions are expensive to explore manually and difficult to compare fairly, particularly when candidate modifications alter the FL training or evaluation path. This paper presents Auto-FL-Research (AFR), a constrained coding-agent workflow designed for systematic FL algorithmic recipe search.


Methodology


AFR employs a multi-agent framework in which coding agents propose and implement candidate modifications across several dimensions:


  • Server aggregation rules
  • Client update schedules
  • Local objective functions
  • Registered model variants
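
To make the first of these dimensions concrete, here is a minimal sketch of the kind of server aggregation rule a candidate might edit: a sample-count-weighted average in the style of FedAvg. It is not taken from AFR itself; the `Weights` type, the function name `weighted_average`, and the plain-list parameter representation are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def weighted_average(updates: List[Tuple[Weights, int]]) -> Weights:
    """Average client parameters, weighting each client by its sample count.

    `updates` holds (client_weights, num_samples) pairs. All clients are
    assumed to report the same parameter names and shapes.
    """
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    first, _ = updates[0]
    result = {name: [0.0] * len(values) for name, values in first.items()}

    for weights, n in updates:
        share = n / total  # this client's fraction of all samples
        for name, values in weights.items():
            for i, v in enumerate(values):
                result[name][i] += v * share
    return result


if __name__ == "__main__":
    clients = [({"w": [1.0, 2.0]}, 1), ({"w": [3.0, 4.0]}, 3)]
    print(weighted_average(clients))
```

A candidate modification along this dimension would replace the weighting rule (for example with a different per-client share) while leaving the inputs and output shape unchanged.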

Task profiles constrain the mutation surface, compute budget, communication contract, and final model evaluation procedure. Each campaign documents candidate scores, runtime, edited source files, generated artifacts, and failure status, enabling comprehensive reproducibility analysis.
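
One way to picture a task profile and the per-candidate record is as a pair of small data structures plus a check that a candidate stayed within its constraints. The sketch below is illustrative only: every field name (`allowed_files`, `max_runtime_s`, `max_rounds`, and so on) and the `violations` helper are hypothetical, and the communication contract is assumed here to be a simple cap on rounds.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical constraints for one search campaign."""
    name: str
    allowed_files: FrozenSet[str]  # mutation surface: files a candidate may edit
    max_runtime_s: float           # compute budget
    max_rounds: int                # communication contract (assumed: a round cap)


@dataclass
class CandidateRecord:
    """Hypothetical per-candidate log entry."""
    candidate_id: str
    score: Optional[float]         # None if the run failed before evaluation
    runtime_s: float
    rounds_used: int
    edited_files: List[str]
    artifacts: List[str] = field(default_factory=list)
    failure: Optional[str] = None


def violations(profile: TaskProfile, record: CandidateRecord) -> List[str]:
    """Return a description of each constraint the candidate broke."""
    problems = []
    outside = sorted(set(record.edited_files) - profile.allowed_files)
    if outside:
        problems.append(f"edits outside mutation surface: {outside}")
    if record.runtime_s > profile.max_runtime_s:
        problems.append("compute budget exceeded")
    if record.rounds_used > profile.max_rounds:
        problems.append("communication contract exceeded")
    return problems
```

Recording the edited files alongside the score is what allows a later audit to tell whether a gain came from inside or outside the permitted surface.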


Experimental Setup


AFR was evaluated on two benchmark families:


  1. Five healthcare cross-silo FL tasks from the FLamby benchmark suite
  2. Grouped-client profiles for the five fixed LEAF datasets, plus the LEAF synthetic task

All experiments used five-seed repeat evaluations to assess statistical reliability.
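
A repeat evaluation of this kind can be summarized with a few lines of standard-library Python. The sketch below assumes that higher scores are better and that candidate and baseline runs are paired by seed; neither assumption is stated in the source, and the function name `summarize_repeats` and the example scores are made up for illustration.

```python
from statistics import mean, stdev
from typing import Dict, List


def summarize_repeats(candidate: List[float], baseline: List[float]) -> Dict[str, float]:
    """Summarize per-seed score differences between a candidate and a baseline.

    Both lists must be ordered by the same seeds (paired comparison).
    """
    if len(candidate) != len(baseline):
        raise ValueError("candidate and baseline need one score per seed")
    if len(candidate) < 2:
        raise ValueError("need at least two seeds to estimate spread")

    deltas = [c - b for c, b in zip(candidate, baseline)]
    return {
        "n": len(deltas),
        "mean_delta": mean(deltas),
        "std_delta": stdev(deltas),           # sample standard deviation
        "wins": sum(d > 0 for d in deltas),   # seeds where the candidate scored higher
    }


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    cand = [0.82, 0.80, 0.85, 0.79, 0.83]
    base = [0.80, 0.80, 0.80, 0.80, 0.80]
    print(summarize_repeats(cand, base))
```

Reporting the number of winning seeds next to the mean makes seed-sensitive outcomes visible: a positive mean driven by one or two seeds reads differently from a gain that holds on all five.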


Key Results


  • Positive findings: AFR demonstrated performance gains on four of the five FLamby tasks and five of the six LEAF profiles
  • Identified limitations: The evaluation also revealed seed-sensitive outcomes and search-selected failure cases
  • Control experiments: When compared against same-budget controls, several gains corresponded to genuine FL-recipe changes. However, some improvements were recovered by fixed-surface scalar controls, and others failed under repeat or held-out evaluation

Contributions and Implications


These mixed outcomes represent a core contribution of the work: they demonstrate a methodology for separating agent-generated candidates into three distinct categories:


  1. Repeated FL mechanisms: algorithmic changes that reliably improve performance
  2. Fixed-surface tuning effects: improvements attributable to hyperparameter optimization within the existing algorithmic surface
  3. Selected single-run artifacts: apparent gains that do not replicate under controlled conditions

This framework is particularly relevant for 2026 as FL research matures and the community increasingly requires rigorous evaluation protocols that can distinguish genuine algorithmic advances from statistical artifacts or search-induced overfitting.
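
The three-way separation of candidates described above can be sketched as a simple decision rule. This is an illustrative reading of the summary, not the paper's procedure: the function name `categorize`, its inputs, and both thresholds (`min_gain`, `recovery_ratio`) are assumptions. Gains are measured relative to the baseline, with positive values meaning improvement.

```python
def categorize(
    repeat_gain: float,
    heldout_gain: float,
    control_gain: float,
    min_gain: float = 0.0,
    recovery_ratio: float = 0.9,
) -> str:
    """Assign a candidate to one of three categories.

    repeat_gain:  mean gain of the candidate under repeat (multi-seed) evaluation
    heldout_gain: gain of the candidate under held-out evaluation
    control_gain: gain of a same-budget, fixed-surface scalar control
    """
    # A gain that vanishes under repeat or held-out evaluation is treated
    # as an artifact of selecting the best single run.
    if repeat_gain <= min_gain or heldout_gain <= min_gain:
        return "selected_single_run_artifact"

    # If scalar tuning alone recovers most of the gain, the algorithmic
    # change is not needed to explain it.
    if control_gain >= recovery_ratio * repeat_gain:
        return "fixed_surface_tuning_effect"

    return "repeated_fl_mechanism"


if __name__ == "__main__":
    print(categorize(repeat_gain=0.03, heldout_gain=0.02, control_gain=0.005))
```

The order of the checks matters in this sketch: replication is tested first, so a candidate that fails to repeat is never credited as either a tuning effect or a mechanism.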

      via ArXiv AI
