Identifying User Stories in Issue Records
Nowadays, most software development companies have adopted agile development methodologies, which suggest capturing requirements through user stories. In practice, however, user stories are often poorly written and exhibit inherent quality defects. Moreover, the user stories of a software project are commonly buried in large volumes of issue request logs from software quality tracking systems, which makes them difficult to process later. To address these defects and formulate high-quality requirements, a current trend is the application of computational linguistics techniques to identify and then process user stories.

To train the models, data were taken from public sources that contain issues from real software development projects. These sources provide positive examples of user stories in the format “As a (type of user), I want (goal), [so that (some reason)]”, as well as negative examples (erroneous user stories, or sentences with a syntax similar to user stories but a different purpose). To obtain a larger data set suitable for testing the models, an algorithm was implemented that generates additional examples by splitting positive examples into random parts and mixing them, using the TensorFlow Tokenizer.

To determine the classification class to which each example belonged, a manual classification effort was performed; this may have introduced some degree of human error into the model, since there was no record of the previously classified data. The resulting dataset includes a total of 7997 positive and negative examples, of which 2618 are positive and the rest are negative. The task is therefore a binary classification problem, where issues classified as user stories belong to the positive class and the rest to the negative class.
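As a minimal illustration of the labeling task, the sketch below checks whether an issue's text matches the canonical “As a (type of user), I want (goal), [so that (some reason)]” template with a regular expression. This is a hypothetical baseline, not the trained model described above: the function name `label_issue` and the regex are assumptions for illustration, and a real classifier would also have to handle the near-miss negative examples that plain pattern matching cannot.

```python
import re

# Regex for the user story template
# "As a <role>, I want <goal>[, so that <reason>]".
# The "so that" clause is optional, matching the bracketed part
# of the template.
USER_STORY_RE = re.compile(
    r"^\s*as an?\s+(?P<role>.+?),\s*"
    r"i\s+want\s+(?P<goal>.+?)"
    r"(?:,?\s*so that\s+(?P<reason>.+))?\s*$",
    re.IGNORECASE,
)

def label_issue(text: str) -> int:
    """Return 1 (positive class: user story) if the issue text
    matches the template, 0 (negative class) otherwise."""
    return 1 if USER_STORY_RE.match(text) else 0

issues = [
    "As a customer, I want to reset my password, so that I can log in again.",
    "As an admin, I want to export usage reports",
    "Fix NullPointerException in login handler",
]
print([label_issue(t) for t in issues])  # [1, 1, 0]
```

A template matcher like this only separates well-formed user stories from unrelated issue text; distinguishing them from erroneous or syntactically similar sentences is precisely what motivates the learned models.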