A Dataset for Evaluating Identifier Splitters

Author : Hill, Emily; Binkley, Dave; Lawrie, Dawn; Pollock, Lori; Vijay-Shanker, K.
Booktitle : The 10th Working Conference on Mining Software Repositories
Date : May 2013
Publisher : IEEE
Keyword(s) : identifier splitting, natural language analysis, software maintenance tools
Document Type : In Conference Proceedings

Abstract :

Software engineering and evolution techniques have recently started to exploit the natural language information in source code. A key step in doing so is splitting identifiers into their constituent words. While simple in concept, identifier splitting raises several challenging issues, leading to a range of splitting techniques. Consequently, the research community would benefit from a dataset (i.e., a gold set) that facilitates comparative studies of identifier splitting techniques. A gold set of 2,663 split identifiers was constructed from 8,522 individual human splitting judgments and can be obtained from www.cs.loyola.edu/~binkley/ludiso. This set's construction and observations aimed at its effective use are described.
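
To illustrate what identifier splitting and a gold-set comparison involve, the following is a minimal sketch and not the method or data format used in the paper. The function names `naive_split` and `accuracy`, and the four-entry `TOY_GOLD` dictionary, are hypothetical and invented for this example; the real gold set's file format is not described in this record.

```python
import re

def naive_split(identifier):
    """Split an identifier on underscores and camelCase boundaries.

    This is a deliberately simple baseline, not one of the techniques
    evaluated in the paper.
    """
    words = []
    for part in re.split(r"[_\W]+", identifier):
        # Acronym runs (XML), capitalized or lowercase words (Parser, get),
        # and digit runs are each treated as one word.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]

def accuracy(splitter, gold):
    """Fraction of identifiers whose split exactly matches the gold split."""
    correct = sum(1 for ident, expected in gold.items()
                  if splitter(ident) == expected)
    return correct / len(gold)

# Hypothetical stand-in for a gold set: identifier -> human-judged split.
TOY_GOLD = {
    "getUserName": ["get", "user", "name"],
    "XMLParser": ["xml", "parser"],
    "max_buffer_size": ["max", "buffer", "size"],
    "filename": ["file", "name"],  # same-case: no boundary marker to split on
}

if __name__ == "__main__":
    for ident in TOY_GOLD:
        print(ident, "->", naive_split(ident))
    print("accuracy:", accuracy(naive_split, TOY_GOLD))
```

The last toy entry shows one of the challenging cases the abstract alludes to: a same-case identifier such as `filename` carries no capitalization or separator cue, so a convention-based splitter leaves it whole and disagrees with the human judgment.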
