Part-of-Speech Tagging of Program Identifiers for Improved Text-Based Software Engineering Tools

Author : Gupta, Samir; Malik, Sana; Pollock, Lori; Vijay-Shanker, K.
Booktitle : 21st Annual International Conference on Program Comprehension (AWARDED Conference Best Research Paper Award)
Date : May 2013
Publisher : IEEE
Keyword(s) : program understanding, part-of-speech tagging, natural language processing, identifiers
Document Type : In Conference Proceedings

Abstract :

To aid program comprehension, programmers choose identifiers for methods, classes, fields and other program elements primarily by following naming conventions in software. These software "naming conventions" follow systematic patterns which can convey deep natural language clues that can be leveraged by software engineering tools. For example, they can be used to increase the accuracy of software search tools, improve the ability of program navigation tools to recommend related methods, and raise the accuracy of other program analyses. After splitting multi-word names into their component words, the next step in extracting accurate natural language information is tagging each word with its part of speech (POS) and then chunking the name into natural language phrases. State-of-the-art approaches, most of which rely on traditional POS taggers trained on natural language documents, do not capture the syntactic structure of program elements. In this paper, we present a POS tagger and syntactic chunker for source code names that takes into account programmers' naming conventions to understand the regular, systematic ways a program element is named. We studied the naming conventions used in object-oriented programming and identified different grammatical constructions that characterize a large number of program identifiers. This study then informed the design of our POS tagger and chunker. Our evaluation results show a significant improvement in the accuracy of POS tagging of identifiers (11%-20%) over current approaches. With this improved accuracy, both automated software engineering tools and developers will be able to better capture and understand the information available in code.
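
The splitting step mentioned in the abstract, which comes before POS tagging, can be illustrated with a minimal sketch. This is not the authors' splitter or tagger; the function name split_identifier and the regular expression are assumptions chosen only to show how a multi-word identifier might be broken into component words.

```python
import re

# Hypothetical pattern (an assumption, not taken from the paper):
#   [A-Z]+(?![a-z])  an acronym run such as "XML" in "parseXMLFile"
#   [A-Z]?[a-z]+     a word with an optional leading capital, e.g. "get", "Name"
#   \d+              a run of digits
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase words."""
    return [word.lower() for word in _WORD.findall(name)]


if __name__ == "__main__":
    for identifier in ("getUserName", "parseXMLFile", "MAX_BUFFER_SIZE"):
        print(identifier, split_identifier(identifier))
```

The resulting word list is what a POS tagger for identifiers would take as input; the tagging and chunking themselves are the contribution of the paper and are not reproduced here.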
