Brief Overview

Integrated development environments today include sophisticated program modeling and analyses behind the scenes to support developers in navigating, understanding, and modifying their code. While much can be learned from the results of static and dynamic analysis of source code, developers also look to others for advice and learning. As software development teams have become more globally distributed and the open source community has grown, developers rely increasingly on written documents for help they might previously have obtained through in-person conversations. My research activities cover (a) conducting empirical studies to analyze the information in the written documents of software archives, and (b) designing techniques to mine useful information from software archives that could be used to build or improve software maintenance and evolution tools.


Refereed Conferences.

  • Automatic Extraction of Opinion-based Q&A from Online Developer Chats
    Preetha Chatterjee, Kostadin Damevski, and Lori Pollock,
    The 43rd International Conference on Software Engineering (ICSE), Technical Track, May 2021.
    Preprint    DOI    Slides
    Virtual conversational assistants designed specifically for software engineers could have a huge impact on the time it takes for software engineers to get help. Research efforts are focusing on virtual assistants that support specific software development tasks such as bug repair and pair programming. In this paper, we study the use of online chat platforms as a resource for collecting developer opinions that could potentially help in building opinion Q&A systems, as a specialized instance of virtual assistants and chatbots for software engineers. Opinion Q&A has a stronger presence in chats than in other developer communications, so mining them can provide a valuable resource for developers in quickly gaining insight about a specific development topic (e.g., What is the best Java library for parsing JSON?). We address the problem of opinion Q&A extraction by developing automatic identification of opinion-asking questions and extraction of participants’ answers from public online developer chats. We evaluate our automatic approaches on chats spanning six programming communities and two platforms. Our results show that a heuristic approach to identifying opinion-asking questions works well (.87 precision), and that a deep learning approach customized to the software domain outperforms heuristics-based, machine-learning-based, and deep learning approaches for answer extraction in community question answering.
  • Software-related Slack Chats with Disentangled Conversations
    Preetha Chatterjee, Kostadin Damevski, Nicholas A. Kraft, and Lori Pollock,
    The 17th International Conference on Mining Software Repositories (MSR), Data Showcase Track, Oct 2020. Seoul, South Korea
    Preprint     Dataset    DOI    Slides
    More than ever, developers are participating in public chat communities to ask and answer software development questions. With over ten million daily active users, Slack is one of the most popular chat platforms, hosting many active channels focused on software development technologies, e.g., python, react. Prior studies have shown that public Slack chat transcripts contain valuable information, which could provide support for improving automatic software maintenance tools or help researchers understand developer struggles or concerns. In this paper, we present a dataset of software-related Q&A chat conversations, curated over two years from three open Slack communities (python, clojure, elm). Our dataset consists of 38,955 conversations comprising 437,893 utterances contributed by 12,171 users. We also share the code for a customized machine-learning-based algorithm that automatically extracts (or disentangles) conversations from the downloaded chat transcripts.
  • Extracting Archival-Quality Information from Software-Related Chats
    Preetha Chatterjee,
    The 42nd International Conference on Software Engineering (ICSE), Doctoral Symposium Track, Oct 2020. Seoul, South Korea
    Preprint    DOI    Slides

    Software developers are increasingly having conversations about software development via online chat services. Many of those chat communications contain valuable information, such as code descriptions, good programming practices, and causes of common errors/exceptions. However, the nature of chat community content is transient, as opposed to the archival nature of other developer communications such as email, bug reports, and Q&A forums. As a result, important information and advice are lost over time. The focus of this dissertation is Extracting Archival Information from Software-Related Chats, specifically to (1) automatically identify conversations which contain archival-quality information, (2) accurately reduce the granularity of the information reported as archival information, and (3) conduct a case study to investigate how archival-quality information extracted from chats compares to related posts in Q&A forums. Knowledge archived from developer chats could potentially be used in several applications, such as creating a new archival mechanism for a given chat community, augmenting Q&A forums, or facilitating the mining of specific information to improve software maintenance tools.
  • Exploratory Study of Slack Q&A Chats as a Mining Source for Software Engineering Tools
    Preetha Chatterjee, Kostadin Damevski, Lori Pollock, Vinay Augustine, and Nicholas A. Kraft,
    The 16th International Conference on Mining Software Repositories (MSR), Research Track, May 2019. Montreal, Canada
    Preprint    DOI    Slides
    Modern software development communities are increasingly social. Popular chat platforms such as Slack host public chat communities that focus on specific development topics such as Python or Ruby-on-Rails. Conversations in these public chats often follow a Q&A format, with someone seeking information and others providing answers in chat form. In this paper, we describe an exploratory study into the potential usefulness and challenges of mining developer Q&A conversations for supporting software maintenance and evolution tools. We designed the study to investigate the availability of information that has been successfully mined from other developer communications, particularly Stack Overflow. We also analyze characteristics of chat conversations that might inhibit accurate automated analysis. Our results indicate the prevalence of useful information, including API mentions and code snippets with descriptions, and several hurdles that need to be overcome to automate mining that information.
  • Extracting Code Segments and Their Descriptions from Research Articles
    Preetha Chatterjee, Benjamin Gause, Hunter Hedinger, and Lori Pollock,
    The 14th International Conference on Mining Software Repositories (MSR), Research Track, May 2017. Buenos Aires, Argentina
    Preprint    DOI    Slides
    The availability of large corpora of online software-related documents today presents an opportunity to use machine learning to improve integrated development environments by first automatically collecting code examples along with associated descriptions. Digital libraries of computer science research and education conference and journal articles can be a rich source for code examples that are used to motivate or explain particular concepts or issues. Because they are used as examples in an article, these code examples are accompanied by descriptions of their functionality, properties, or other associated information expressed in natural language text. Identifying code segments in these documents is relatively straightforward, thus this paper tackles the problem of extracting the natural language text that is associated with each code segment in an article. We present and evaluate a set of heuristics that address the challenges of the text often not being colocated with the code segment as in developer communications such as online forums.
  • What Information about Code Snippets Is Available in Different Software-Related Documents? An Exploratory Study
    Preetha Chatterjee, Manziba Akanda Nishi, Kostadin Damevski, Vinay Augustine, Lori Pollock, and Nicholas A. Kraft,
    The 24th IEEE International Conference on Software Analysis, Evolution, and Reengineering (SANER), Early Research Achievements Track, Feb 2017. Klagenfurt, Austria
    Preprint    DOI
    A large corpus of software-related documents is available on the Web, and these documents offer the unique opportunity to learn from what developers are saying or asking about the code snippets that they are discussing. For example, the natural language in a bug report provides information about what is not functioning properly in a particular code snippet. Previous research has mined information about code snippets from bug reports, emails, and Q&A forums. This paper describes an exploratory study into the kinds of information that are embedded in different software-related documents. The goal of the study is to gain insight into the potential value and difficulty of mining the natural language text associated with the code snippets found in a variety of software-related documents, including blog posts, API documentation, code reviews, and public chats.

Journal Publications.

  • Automatically Identifying the Quality of Developer Chats for Post Hoc Use
    Preetha Chatterjee, Kostadin Damevski, Nicholas A. Kraft, and Lori Pollock,
    Transactions on Software Engineering and Methodology (TOSEM), Feb 2021.
    Preprint    DOI
    Software engineers are crowdsourcing answers to their everyday challenges on Q&A forums (e.g., Stack Overflow) and more recently in public chat communities such as Slack, IRC, and Gitter. Many software-related chat conversations contain valuable expert knowledge that is useful both for mining to improve programming support tools and for readers who did not participate in the original chat conversations. However, most chat platforms and communities do not contain built-in quality indicators (e.g., accepted answers, vote counts). Therefore, it is difficult to identify conversations that contain useful information for mining or reading, i.e., conversations of post hoc quality. In this paper, we investigate automatically detecting developer conversations of post hoc quality from public chat channels. We first describe an analysis of 400 developer conversations that indicate potential characteristics of post hoc quality, followed by a machine learning-based approach for automatically identifying conversations of post hoc quality. Our evaluation of 2,000 annotated Slack conversations in four programming communities (python, clojure, elm, and racket) indicates that our approach can achieve precision of 0.82, recall of 0.90, F-measure of 0.86, and MCC of 0.57. To our knowledge, this is the first automated technique for detecting developer conversations of post hoc quality.
  • Finding Help with Programming Errors: An Exploratory Study of Novice Software Engineers’ Focus in Stack Overflow Posts
    Preetha Chatterjee, Minji Kong, Lori Pollock,
    Journal of Systems and Software (JSS), Research Paper, Jan 2020.
    Preprint    DOI    Slides
    Monthly, 50 million users visit Stack Overflow, a popular Q&A forum used by software developers, to share and gather knowledge and help with coding problems. Although Q&A forums serve as a good resource for seeking help from developers beyond the local team, the abundance of information can cause developers, especially novice software engineers, to spend considerable time in identifying relevant answers and suitable suggested fixes. This exploratory study aims to understand how novice software engineers direct their efforts and what kinds of information they focus on within a post selected from the results returned in response to a search query on Stack Overflow. The results can be leveraged to improve the Q&A forum interface, guide tools for mining forums, and potentially improve granularity of traceability mappings involving forum posts. We qualitatively analyze the novice software engineers’ perceptions from a survey as well as their annotations of a set of Stack Overflow posts. Our results indicate that novice software engineers pay attention to only 27% of code and 15-21% of text in a Stack Overflow post to understand and determine how to apply the relevant information to their context. Our results also discern the kinds of information prominent in that focus.


  • Exploring the Generality of a Java-based Loop Action Model for the Quorum Programming Language (Ph.D. Preliminary Project)
    This project explores an approach to automatically identify the higher-level abstraction of the action performed by a particular loop structure in Quorum, based on its structure, data flow, and linguistic characteristics.
    Many algorithmic steps require more than one statement to implement but are not big enough to warrant a method (e.g., add an element, find the maximum, determine a value). These steps are generally implemented by loops. Internal comments often describe these intermediary steps; unfortunately, only a small percentage of code is documented well enough to help new users. As a result, information at levels of abstraction between the individual statement and the whole method is not leveraged by current source code analyses, because that information is not easily available beyond any internal comments describing the code blocks.
    Hence, this project explores the generality of an approach to automatically determine the high-level actions of loop constructs. The approach is to mine the characteristics of a given loop structure over a repository of Quorum source code, map them to an action identification model originally developed for Java, and thus identify the action performed by the specified loop. The results are promising enough to conclude that this approach could be applied to other programming languages as well.

Talks and Presentations

Poster Presentations

  • Extracting Code Segments and Their Descriptions from Research Articles
    University of Delaware, 2017
  • What Information about Code Snippets Is Available in Different Software-Related Documents? An Exploratory Study
    Computing Research Association Women (CRA-W) Grad Cohort 2017, Washington, DC 2017

Mentoring Undergraduate Students in Research

Mentored and collaborated with 7 undergraduate students in performing data analysis and developing software research and maintenance tools.

  • 2019-2020: Brian Phillips, Humpher Owusu, Kevin Mason, Performed data analysis and case study on software developer communications.
  • 2018: Minji Kong, Co-authored Research Paper “Finding Help with Programming Errors: An Exploratory Study of Novice Software Engineers’ Focus in Stack Overflow Posts”, Journal of Systems and Software (JSS).
  • 2017: Qilin Ma, Developed an in-house Python-based research tool for mining developer discussions on Stack Overflow.
  • 2016: Benjamin Gause and Hunter Hedinger, Co-authored Full Research Paper “Extracting Code Segments and Their Descriptions from Research Articles”, 14th International Conference on Mining Software Repositories (MSR).

Research Funding