PromSet: An annotated dataset for translating natural language to PromQL
Description
PromSet is an annotated dataset designed to support natural language processing (NLP) research for system monitoring. It is particularly suited to training and evaluating large language models that translate natural language queries into their equivalents in PromQL, the query language used by the Prometheus monitoring tool.

An initial dataset was constructed from the results of our experiments on Prometheus, during which we created a set of queries and their natural language descriptions. We then extended it by collecting PromQL queries and their descriptions from various web sources. This raw data was curated, reviewed, corrected, and enriched with Gemini, resulting in a high-quality dataset suitable for research and development.

The dataset contains a total of 4,350 manually curated pairs, each linking an English description to a corresponding PromQL expression. It is provided in CSV format with two fields: description (a human-readable query) and promql (its equivalent in PromQL syntax). Each record represents a concrete and practical monitoring scenario, such as metric aggregation, label filtering, or time-based calculations. In many cases, a single PromQL query is associated with multiple English-language descriptions, increasing linguistic variation and enabling more robust model training.

By bridging the gap between human-readable instructions and machine-interpretable PromQL syntax, PromSet enables the development of intelligent systems capable of automatically understanding and generating monitoring queries. This facilitates the creation of more intuitive observability tools, streamlines DevOps workflows, and opens new avenues in research on natural language-to-code translation.
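As a rough illustration of the two-field CSV layout described above, the following Python sketch reads description/promql pairs and groups the descriptions that share a PromQL expression. Only the column names description and promql come from the dataset description. The sample rows, the helper names load_pairs and group_by_query, and the file name mentioned in the closing note are assumptions made for illustration and are not taken from PromSet.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only; these are NOT taken from PromSet.
# Fields containing commas or double quotes are quoted per CSV rules.
SAMPLE_CSV = '''description,promql
Total HTTP requests per second over the last 5 minutes,sum(rate(http_requests_total[5m]))
Per-second HTTP request rate summed across all series in a 5 minute window,sum(rate(http_requests_total[5m]))
"CPU idle rate per instance, averaged over 1 minute","avg by (instance) (rate(node_cpu_seconds_total{mode=""idle""}[1m]))"
'''


def load_pairs(handle):
    """Read (description, promql) pairs from an open CSV text handle."""
    reader = csv.DictReader(handle)
    return [(row["description"].strip(), row["promql"].strip()) for row in reader]


def group_by_query(pairs):
    """Map each PromQL expression to the list of descriptions paired with it."""
    grouped = defaultdict(list)
    for description, promql in pairs:
        grouped[promql].append(description)
    return dict(grouped)


pairs = load_pairs(io.StringIO(SAMPLE_CSV))
grouped = group_by_query(pairs)

for promql, descriptions in grouped.items():
    print(f"{promql}  ({len(descriptions)} description(s))")
```

To run this against the real file, the in-memory sample could be replaced with something like open("promset.csv", newline="", encoding="utf-8"), where the file name is a placeholder. Grouping by query is one way to keep all paraphrases of the same PromQL expression on the same side of a train/test split.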