Extracting Code Segments and Their Descriptions from Research Articles
Author : Chatterjee, Preetha; Gause, Benjamin; Hedinger, Hunter; Pollock, Lori
Booktitle : International Conference on Mining Software Repositories (MSR)
Date : May 2017
Publisher : IEEE
Project : Natural Language Program Analysis
Keywords: mining software repositories, information extraction, code snippet descriptions, text analysis
The availability of large corpora of online software-related documents presents an opportunity to use machine learning to improve integrated development environments, starting with the automatic collection of code examples along with their associated descriptions. Digital libraries of computer science research and education articles, from both conferences and journals, can be a rich source of code examples that are used to motivate or explain particular concepts or issues. Because they serve as examples in an article, these code examples are accompanied by natural language descriptions of their functionality, properties, or other associated information. Identifying code segments in these documents is relatively straightforward; this paper therefore tackles the problem of extracting the natural language text that is associated with each code segment in an article. We present and evaluate a set of heuristics that address the challenge that, unlike in developer communications such as online forums, this text is often not collocated with the code segment.