Building Supervised Fine-Tuning Data from NVIDIA Open-SWE-Traces: Trajectory Parsing, Patch Analysis, Token Budgets, and Tool-Use Metrics

In this tutorial, we explore NVIDIA's Open-SWE-Traces dataset as a practical resource for studying agentic software-engineering trajectories and preparing them for fine-tuning. We stream the dataset directly from Hugging Face, which makes it practical to work with large-scale data in Google Colab without a full local download. We inspect individual records, normalize multi-turn agent conversations, parse final code patches, extract useful metadata, and build an analysis DataFrame to understand trajectory length, tool usage, patch size, language distribution, and resolution outcomes. We then use these insights to curate a high-quality supervised fine-tuning (SFT) subset, retaining only trajectories that pass filters on success labels, token limits, language, and patch availability. As of 2026, Open-SWE-Traces has become a prominent resource for training code-generation and agentic AI models on software engineering tasks. The tutorial walks step by step through a pipeline that turns raw trajectories into clean SFT data for fine-tuning large language models on agent interactions.
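The patch-analysis and curation steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the tutorial's actual code: the record fields `id`, `resolved`, `num_tokens`, `language`, and `patch` are hypothetical names chosen for the example, and the dataset's real schema may differ. The thresholds are likewise placeholders.

```python
from typing import Any, Dict, List, Sequence


def patch_stats(patch: str) -> Dict[str, int]:
    """Count files touched and lines added/removed in a unified diff."""
    files = added = removed = 0
    in_hunk = False  # only count +/- lines that sit inside a hunk
    for line in patch.splitlines():
        if line.startswith("diff --git "):
            files += 1
            in_hunk = False  # the ---/+++ file headers follow; skip them
        elif line.startswith("@@"):
            in_hunk = True
        elif in_hunk and line.startswith("+"):
            added += 1
        elif in_hunk and line.startswith("-"):
            removed += 1
    return {"files": files, "added": added, "removed": removed}


def keep_for_sft(
    rec: Dict[str, Any],
    max_tokens: int = 32_000,                 # placeholder budget
    languages: Sequence[str] = ("python",),   # placeholder language filter
) -> bool:
    """Apply the four curation criteria to one record (hypothetical field names)."""
    tokens = rec.get("num_tokens")
    language = (rec.get("language") or "").lower()
    patch = rec.get("patch") or ""
    return (
        bool(rec.get("resolved"))                         # success label
        and isinstance(tokens, int) and tokens <= max_tokens  # token limit
        and language in languages                         # language filter
        and bool(patch.strip())                           # patch availability
    )


SAMPLE_PATCH = """diff --git a/app.py b/app.py
--- a/app.py
+++ b/app.py
@@ -1,3 +1,4 @@
 import os
-DEBUG = True
+DEBUG = False
+LOG_LEVEL = "info"
"""

# Toy records standing in for streamed dataset rows.
records: List[Dict[str, Any]] = [
    {"id": "t1", "resolved": True, "num_tokens": 1200, "language": "Python", "patch": SAMPLE_PATCH},
    {"id": "t2", "resolved": False, "num_tokens": 900, "language": "Python", "patch": SAMPLE_PATCH},
    {"id": "t3", "resolved": True, "num_tokens": 90_000, "language": "Python", "patch": SAMPLE_PATCH},
    {"id": "t4", "resolved": True, "num_tokens": 500, "language": "Python", "patch": ""},
]

curated = [r for r in records if keep_for_sft(r)]
stats = patch_stats(SAMPLE_PATCH)
```

In a real run, the toy `records` list would be replaced by an iterator over the streamed dataset (for example, one obtained from `datasets.load_dataset` with `streaming=True`), with the field names mapped to whatever the dataset actually provides. Tracking hunk boundaries keeps the `---` and `+++` file headers out of the added and removed counts.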

via MarkTechPost
