Auto-FL-Research: Agentic Search for Federated Learning Algorithms
Overview
Federated learning (FL) research is characterized by numerous small yet consequential algorithmic choices: optimizer variants, server aggregation rules, local training schedules, normalization and regularization techniques, and model architectures. These decisions are expensive to explore manually and difficult to compare fairly, particularly when candidate modifications alter the FL training or evaluation path. This paper presents Auto-FL-Research (AFR), a constrained coding-agent workflow designed for systematic FL algorithmic recipe search.
Methodology
AFR employs a multi-agent framework in which agents propose and implement candidate algorithm modifications across several dimensions:
- Server aggregation rules
- Client update schedules
- Local objective functions
- Registered model variants
Task profiles constrain the mutation surface, compute budget, communication contract, and final model evaluation procedure. Each campaign documents candidate scores, runtime, edited source files, generated artifacts, and failure status, enabling comprehensive reproducibility analysis.
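To make the role of a task profile concrete, the following is a minimal sketch of how such a profile and a per-candidate campaign record might be represented. The source does not describe AFR's actual schema, so every name here (`TaskProfile`, `CandidateRecord`, `surface_violations`, and all field names) is a hypothetical illustration, as is the choice to express the compute budget in rounds and the communication contract in bytes.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional

# Hypothetical schema: the source does not specify how AFR stores profiles or records.


@dataclass(frozen=True)
class TaskProfile:
    name: str
    mutable_files: FrozenSet[str]  # mutation surface: files a candidate may edit
    max_rounds: int                # compute budget (assumed unit: FL rounds)
    max_bytes_per_round: int       # communication contract (assumed unit: bytes)
    eval_metric: str               # fixed final-model evaluation procedure


@dataclass
class CandidateRecord:
    candidate_id: str
    score: Optional[float]         # None when the candidate failed before scoring
    runtime_s: float
    edited_files: List[str]
    artifacts: List[str] = field(default_factory=list)
    failure: Optional[str] = None  # failure status, if any


def surface_violations(profile: TaskProfile, record: CandidateRecord) -> List[str]:
    """Return the edited files that fall outside the profile's mutation surface."""
    return sorted(set(record.edited_files) - profile.mutable_files)


profile = TaskProfile(
    name="demo",
    mutable_files=frozenset({"server/aggregate.py", "client/schedule.py"}),
    max_rounds=10,
    max_bytes_per_round=1_000_000,
    eval_metric="accuracy",
)
record = CandidateRecord(
    candidate_id="c1",
    score=0.8,
    runtime_s=12.0,
    edited_files=["server/aggregate.py", "eval/metrics.py"],
)
violations = surface_violations(profile, record)
```

In this sketch a candidate that edits the evaluation path is flagged, which is one simple way to enforce the concern that modifications should not silently alter how the final model is scored.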
Experimental Setup
AFR was evaluated on two benchmark families:
- Five healthcare cross-silo FL tasks from the FLamby benchmark suite
- Grouped-client profiles for the five fixed LEAF datasets, plus the LEAF synthetic task
All experiments used five-seed repeat evaluations to assess statistical reliability.
Key Results
- Positive findings: AFR demonstrated performance gains on four of the five FLamby tasks and five of the six LEAF profiles
- Identified limitations: The evaluation also revealed seed-sensitive outcomes and search-selected failure cases
- Control experiments: When matched with same-budget controls, several gains corresponded to genuine FL-recipe changes. However, some improvements were recovered by fixed-surface scalar controls, and others failed under repeat or held-out evaluation
Contributions and Implications
These mixed outcomes represent a core contribution of the work: they demonstrate a methodology for separating agent-generated candidates into three distinct categories:
- Repeated FL mechanisms: algorithmic changes that reliably improve performance
- Fixed-surface tuning effects: improvements attributable to hyperparameter optimization within the existing algorithmic surface
- Selected single-run artifacts: apparent gains that do not replicate under controlled conditions
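The three-way separation of candidates described in this section can be illustrated with a simple decision rule over repeat-evaluation scores. This is only a sketch under stated assumptions, not the paper's procedure: the function name `classify_candidate`, the use of seed means, the `margin` threshold, and the convention that higher scores are better are all hypothetical choices made for illustration.

```python
from statistics import mean
from typing import Sequence


def classify_candidate(
    candidate: Sequence[float],
    baseline: Sequence[float],
    scalar_control: Sequence[float],
    margin: float = 0.0,
) -> str:
    """Assign a candidate to one of three categories from repeat-seed scores.

    Assumptions (not from the source): higher is better, seed means are
    compared directly, and `margin` is the smallest gain treated as real.
    """
    cand = mean(candidate)
    base = mean(baseline)
    ctrl = mean(scalar_control)

    # The gain does not survive repeat evaluation against the baseline.
    if cand - base <= margin:
        return "selected single-run artifact"
    # A same-budget fixed-surface scalar control recovers the gain.
    if ctrl >= cand - margin:
        return "fixed-surface tuning effect"
    # The gain repeats and is not explained by scalar tuning alone.
    return "repeated FL mechanism"


baseline = [0.70, 0.71, 0.69, 0.70, 0.72]
label_mechanism = classify_candidate(
    [0.80, 0.81, 0.79, 0.82, 0.80], baseline, [0.71, 0.70, 0.72, 0.70, 0.71], margin=0.02
)
label_tuning = classify_candidate(
    [0.80, 0.81, 0.79, 0.82, 0.80], baseline, [0.80, 0.80, 0.81, 0.79, 0.80], margin=0.02
)
label_artifact = classify_candidate(
    [0.70, 0.71, 0.70, 0.69, 0.71], baseline, [0.71, 0.70, 0.72, 0.70, 0.71], margin=0.02
)
```

A real protocol would likely replace the mean comparison with a proper statistical test and add held-out evaluation; the point of the sketch is only that the category depends on both the repeat baseline and the matched control.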
This framework is particularly relevant for 2026 as FL research matures and the community increasingly requires rigorous evaluation protocols that can distinguish genuine algorithmic advances from statistical artifacts or search-induced overfitting.
via ArXiv AI
