Natural Language Processing

Throughout the life cycle of an application, 60-90% of resources are devoted to modifying the application to meet new requirements and to fix faults. Effective software tools are important for reducing these high maintenance costs. In our research, we have observed strong indicators that program literals, identifiers, and comments contain many natural language clues that could be leveraged to increase the effectiveness of many software tools.

Our research group has been investigating how best to extract and use natural language clues from code. We call this kind of analysis Natural Language Program Analysis (NLPA), since it combines natural language processing techniques with traditional program analysis to extract natural language information from the identifiers, literals, and comments of a program. Using NLPA, we have developed techniques and integrated tools that assist in performing software maintenance tasks, including program understanding, navigation, debugging, and aspect mining.
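As a rough illustration of what extracting natural language information from identifiers can mean at the simplest level, the sketch below splits identifiers into words using only naming conventions (camel case, underscores, digits). It is not the algorithm used by any of our tools; the function names `split_identifier` and `extract_words` are hypothetical and chosen only for this example.

```python
import re

# Matches, in order of preference: an acronym run that precedes a capitalized
# word (e.g. "HTTP" in "HTTPResponse"), a capitalized or lowercase word,
# an all-caps run, or a run of digits. Underscores match nothing and are skipped.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_identifier(identifier):
    """Split one identifier into lowercase words using naming conventions only."""
    return [word.lower() for word in _WORD.findall(identifier)]


def extract_words(source):
    """Collect the words of every identifier-like token in a source string.

    Assumption for this sketch: any token matching a typical identifier
    pattern is treated as an identifier, so language keywords are included.
    """
    words = []
    for token in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source):
        words.extend(split_identifier(token))
    return words
```

For example, `split_identifier("parseHTTPResponse")` returns `['parse', 'http', 'response']`. A convention-based splitter like this cannot separate words in identifiers that carry no case change or separator, and it does not expand abbreviations, which is part of why more sophisticated analysis is needed.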

Thus far, we have focused on NLPA tools that identify scattered but related code segments, whether to search through code to understand how a particular concern is implemented, to mine aspects, or to locate a bug. Our existing NLPA tools combine program structure information, such as calling relationships and code clone analysis, with the natural language of comments, identifiers, and maintenance requests. Although we have only begun to explore the potential of NLPA, our experimental results motivate further investigation of NLPA for software tools.

We believe that NLPA can be used to:

  • Increase the accuracy of software search tools by providing a natural language description of program artifacts to search
  • Increase the ability of program navigation tools to recommend related procedures by providing natural language clues
  • Increase the accuracy of other program analyses by providing access to natural language information

Subprojects

Contextual Search – Emily Hill, Lori Pollock, and K. Vijay-Shanker. “Automatically Capturing Source Code Context for Software Maintenance and Reuse.” International Conference on Software Engineering (ICSE), May 2009.

SAMURAI – Eric Enslen, Emily Hill, Lori Pollock, and K. Vijay-Shanker. “Mining Source Code to Automatically Split Identifiers for Software Analysis.” 6th IEEE Working Conference on Mining Software Repositories (MSR), May 2009.

AMAP – Emily Hill, Zachary P. Fry, Haley Boyd, Giriprasad Sridhara, Yana Novikova, Lori Pollock, and K. Vijay-Shanker. “AMAP: Automatically Mining Abbreviation Expansions in Programs to Enhance Software Maintenance Tools.” MSR 2008: 5th Working Conference on Mining Software Repositories, May 2008.

Dora – Emily Hill, Lori Pollock, and K. Vijay-Shanker. “Exploring the Neighborhood with Dora to Expedite Software Maintenance.” International Conference on Automated Software Engineering (ASE 2007), November 2007.

FindConcept – David Shepherd, Zachary P. Fry, Emily Hill, Lori Pollock, and K. Vijay-Shanker. “Using Natural Language Program Analysis to Locate and Understand Action-Oriented Concerns.” International Conference on Aspect Oriented Software Development (AOSD 2007), March 2007.

Timna – David Shepherd, Lori Pollock, and K. Vijay-Shanker. “Case Study: Supplementing Program Analysis with Natural Language Analysis to Improve a Reverse Engineering Task.” 7th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, ACM, June 2007.