Mining the Characteristics of Jupyter Notebooks in Data Science Projects: When Code Classification is the Origin of Good AI in the Future

“Data Science Projects” are initiatives that apply data science principles to the design, development, analysis, and evaluation of software in order to improve efficiency, solve specific problems, or create new software features that better meet user needs.

A crucial tool that allows software developers to work efficiently on Data Science Projects today is the “Computational Notebook.” These tools facilitate data science work by integrating code and its output in one place: users no longer have to switch back and forth between a code editor and a separate data analysis program, they can immediately check the results of their code, and they can create visualizations inline. This convenience has made Computational Notebooks highly popular.
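To make “code and output in one place” concrete, the short sketch below reads a notebook’s underlying JSON with only the Python standard library and prints each code cell next to the output stored alongside it. The file name example_notebook.ipynb is hypothetical.

```python
import json

# An .ipynb file is plain JSON: every code cell stores both its source and the
# outputs produced when it was last run, so code and results travel together.
with open("example_notebook.ipynb", encoding="utf-8") as f:  # hypothetical file name
    notebook = json.load(f)

for cell in notebook["cells"]:
    if cell["cell_type"] != "code":
        continue
    print("CODE:  ", "".join(cell["source"]).strip())
    for output in cell.get("outputs", []):
        # 'stream' outputs keep printed text; 'execute_result' keeps the value's text form
        text = output.get("text") or output.get("data", {}).get("text/plain", "")
        print("OUTPUT:", "".join(text).strip())
```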

Computational Notebooks have been widely developed by users around the world to address new challenges in Data Science Projects. This raises two questions: How can we identify which Computational Notebooks are well developed and suitable as a guide for beginners in data science? And what factors help differentiate the quality of these Computational Notebooks?

The research titled “Mining the Characteristics of Jupyter Notebooks in Data Science Projects” was conducted by students participating in a summer internship program at the Nara Institute of Science and Technology in Japan. The study was later developed into a Senior Project by Miss Tasha Settewong, Miss Urisayar Kaewpichai, and Mr. Vacharavich Jiravatvanich, students of the B.Sc. program in Information and Communication Technology, Batch 17 (ICT International Program). The research was supervised by Asst. Prof. Dr. Morakot Choetkiertikul, Asst. Prof. Dr. Chaiyong Ragkhitwetsagul, and Asst. Prof. Dr. Thanwadee Sunetnanta, instructors in the Computer Science Group and the Software Engineering and Business Analytics (SEBA) research cluster. The study aims to identify the factors that determine the quality of today’s Computational Notebooks.

“Until now, we have never known what characteristics make a well-written Computational Notebook. There are no guidelines telling developers how to write or structure them to ensure quality. This led to our research question: studying the characteristics of Computational Notebooks that reflect quality. We are trying to determine what factors indicate whether a Computational Notebook is good or not.”

The study began by collecting Computational Notebooks written by Grandmaster and Novice users on Kaggle, an online data science community and competition platform. The analysis focused on four categories: Notebook Attributes, Code Quality, Textual Descriptions, and Visualization. The findings so far suggest that Code Quality is the strongest differentiator, particularly source code complexity, the libraries chosen, and the number of libraries used in the analysis; these factors are critical in determining whether a Computational Notebook is good. The research is ongoing, however, and more comprehensive results are expected.
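As an illustration of how such factors can be measured, the sketch below (not the authors’ actual pipeline) computes rough proxies for some of the studied signals from a notebook’s JSON: counts of code and markdown cells, the libraries imported, and the number of branching statements as a crude stand-in for source code complexity. The file name is hypothetical and only the Python standard library is used.

```python
import ast
import json
from collections import Counter

def notebook_metrics(path):
    """Rough proxies for notebook attributes, textual descriptions,
    library usage, and code complexity (illustrative only)."""
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)

    cells = nb.get("cells", [])
    code_cells = [c for c in cells if c.get("cell_type") == "code"]
    markdown_cells = [c for c in cells if c.get("cell_type") == "markdown"]

    libraries = Counter()
    branching = 0
    for cell in code_cells:
        source = "".join(cell.get("source", []))
        try:
            tree = ast.parse(source)
        except SyntaxError:  # skip cells containing notebook magics or broken code
            continue
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                libraries.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                libraries[node.module.split(".")[0]] += 1
            elif isinstance(node, (ast.If, ast.For, ast.While, ast.Try)):
                branching += 1  # crude stand-in for source code complexity

    return {
        "code_cells": len(code_cells),
        "markdown_cells": len(markdown_cells),
        "distinct_libraries": len(libraries),
        "top_libraries": libraries.most_common(5),
        "branching_statements": branching,
    }

print(notebook_metrics("some_kaggle_notebook.ipynb"))  # hypothetical path
```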

Asst. Prof. Dr. Morakot discussed the problems encountered during the research, chief among them the considerable amount of time required for data collection: the scope of the work involved studying real cases on Kaggle, which meant gathering a large number of Computational Notebooks. Another challenge was ensuring the quality of the collected data. How can we be certain that a notebook is not just a throwaway entry or a copy of someone else’s work? One simple screen for verbatim copies is sketched below.
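The following sketch illustrates that idea only; it is an assumption of this write-up, not the method used in the study. It hashes each notebook’s concatenated code cells so that exact copies collide on the same digest; the file names are hypothetical.

```python
import hashlib
import json
from collections import defaultdict

def code_fingerprint(path):
    """Hash the concatenated code cells so exact copies map to the same digest."""
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    code = "\n".join(
        "".join(c.get("source", []))
        for c in nb.get("cells", [])
        if c.get("cell_type") == "code"
    )
    return hashlib.sha256(code.encode("utf-8")).hexdigest()

# Group collected notebooks by fingerprint; any group with more than one
# member is a candidate set of duplicates worth inspecting by hand.
groups = defaultdict(list)
for path in ["nb_a.ipynb", "nb_b.ipynb"]:  # hypothetical file names
    groups[code_fingerprint(path)].append(path)
duplicates = {digest: paths for digest, paths in groups.items() if len(paths) > 1}
print(duplicates)
```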

He concluded with remarks about the potential future development of this work: “Everything in this study is based on code. We analyze the notebooks’ code quality, which makes the problem complex. Ultimately, this code will be used to develop future AI models. High-quality code is more likely to be accepted and built upon effectively. Therefore, knowing what constitutes quality code from the start is crucial, as it leads to better AI model training in the future.”