Fosscomm 2022

A Machine Learning toolbox, the road to publication.
2022-11-19, 18:00–18:30 (Europe/Athens), Room I

This presentation briefly covers the process of developing a package for publication and the prerequisites that peer-reviewed journals set for open-source software. At the end, we will look at some of the decisions made during the development of the package. The case study of the talk is the “HiPart: Hierarchical divisive clustering toolbox” package, which includes implementations of several machine learning algorithms along with visualizations and examples (https://github.com/panagiotisanagnostou/HiPart).


Introduction

As stated by the journal of reference, "Open-source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems." At the same time, machine learning has produced a large body of powerful learning algorithms for a wide range of applications.

The HiPart package, the focus of this presentation, is dedicated to machine learning; specifically, it implements divisive hierarchical clustering algorithms. Although most clustering packages focus on low- to medium-dimensional data, HiPart focuses on algorithms that can handle ultra-high-dimensional datasets. This capability, combined with strong time performance and competitive clustering results, is the main reason to use the package.

Journal-specific prerequisites

Each journal that publishes open-source software (OSS) has its own set of rules for doing so. In general, the following are the main requirements that a peer-reviewed journal may set for publishing OSS:

  • Release of the software under an accepted open-source license.
  • Clarity of the software design.
  • Novelty, breadth, and significance of the contribution.
  • Evidence of an active user community (number of active developers, number of stars on GitHub, or similar metrics).
  • Openness of the project to new developers who want to participate and contribute (public source code repository, bug tracker, mailing list/forum).
  • Quality of the comparison to previous related implementations, if any (run-time, memory requirements, features, algorithm-specific metrics).
  • Complete and easy-to-modify documentation, with a tutorial on typical and advanced usage of the package followed by thorough documentation for users or developers.
  • Extensive unit and integration testing, with a report of the code coverage of the tests (coverage close to 100% may be expected).
  • Continuous integration on all supported platforms, ideally with multiple versions of any relevant dependencies.
  • Freedom of the code (no dependence on proprietary software).

Beyond the software itself, the manuscript or the cover letter of a submission must include the OSS-specific details. Briefly, the manuscript should describe the software's nature, focus, advancements, and contributions, while the cover letter should contain all the information needed to review the package.

The open-source tracks of journals provide a venue for collecting and distributing OSS alongside the theoretical developments in machine learning. The availability of peer-reviewed software accompanied by concise articles makes it possible to build a common library of high-caliber machine learning software that is supported by the ML research community. Prior publication of the method may be acceptable, because most journals want to recognize the work that goes into turning a method into a highly useful piece of software. These tracks are intended to inspire the machine learning community to embrace open science, in which open-access publishing, open data standards, and open-source software are all encouraged.

Package-specific information

As already mentioned, the HiPart package’s main focus is the hierarchical divisive clustering of low- to high-dimensional datasets. On that basis, it is separated into three components. The first is the method implementation, in which the package takes an object-oriented approach to implementing the algorithms and follows design conventions similar to those of the scikit-learn library. Each algorithm is implemented as a class whose parameters are the algorithm’s hyperparameters and whose attributes hold its results.
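The scikit-learn-like design described above can be illustrated with a minimal sketch. The class below, ToyDivisiveClusterer, is hypothetical and is not part of HiPart; its parameter names and its splitting rule (cutting the highest-variance feature at its mean) are assumptions chosen only to keep the example short, and they stand in for the package's actual algorithms.

```python
from statistics import fmean, pvariance


class ToyDivisiveClusterer:
    """Hypothetical scikit-learn-style estimator (NOT the HiPart API)."""

    def __init__(self, max_clusters_number=2, min_sample_split=2):
        # Hyperparameters are stored unchanged in the constructor.
        self.max_clusters_number = max_clusters_number
        self.min_sample_split = min_sample_split

    def fit(self, X):
        # Start with every sample in a single root cluster.
        clusters = [list(range(len(X)))]
        while len(clusters) < self.max_clusters_number:
            # Only clusters with enough samples may be split further.
            candidates = [c for c in clusters if len(c) >= self.min_sample_split]
            if not candidates:
                break
            target = max(candidates, key=len)  # split the largest cluster
            left, right = self._split(X, target)
            if not left or not right:
                break  # degenerate split, e.g. identical points
            clusters.remove(target)
            clusters.extend([left, right])
        # Results are exposed as attributes with a trailing underscore.
        self.labels_ = [0] * len(X)
        for label, members in enumerate(clusters):
            for i in members:
                self.labels_[i] = label
        return self

    def fit_predict(self, X):
        return self.fit(X).labels_

    @staticmethod
    def _split(X, members):
        # Toy rule (an assumption): cut the highest-variance feature at its mean.
        n_features = len(X[0])
        columns = [[X[i][j] for i in members] for j in range(n_features)]
        axis = max(range(n_features), key=lambda j: pvariance(columns[j]))
        cut = fmean(columns[axis])
        left = [i for i in members if X[i][axis] <= cut]
        right = [i for i in members if X[i][axis] > cut]
        return left, right


X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.9, 5.3]]
labels = ToyDivisiveClusterer(max_clusters_number=2).fit_predict(X)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Keeping hyperparameters in the constructor and results in attributes such as labels_ is what lets a class of this kind be used interchangeably with other estimators that follow the same convention.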

The second component is static visualization. Two static visualization methods are included:

  • A 2-dimensional representation of all data partitions produced by each algorithm during the hierarchical procedure. The objective is to provide the user with insight into each node of the clustering tree and, by extension, each step of the algorithm’s execution.
  • A dendrogram depicting the splits made by each of the divisive algorithms. The dendrogram figure is generated with the SciPy package and accepts the full set of parameters described in that library’s documentation.

The third and final component is interactive visualization. In interactive mode, we offer the ability to manipulate the algorithms step by step. The user can select a specific step (a node of the tree) and move the split point on top of a two-dimensional visualization to alter the clustering result in real time. Each modification causes the algorithm to restart from that point on, rebuilding the subtree of the modified node.
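The restart-from-a-node behaviour can be sketched as follows. This is only an illustration under stated assumptions, not the package's implementation: the Node class, the grow and leaves helpers, the mean-based default split, and the one-dimensional values (standing in for the projections shown in the visualization) are all hypothetical.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import List, Optional


@dataclass
class Node:
    indices: List[int]                  # samples that reached this node
    split_point: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def grow(node, values, depth, split_point=None):
    """Split `node` at `split_point` (default: the mean) and recurse `depth` levels."""
    node.left = node.right = None       # discard any previous subtree
    node.split_point = None
    if depth == 0 or len(node.indices) < 2:
        return node
    if split_point is None:
        cut = fmean(values[i] for i in node.indices)
    else:
        cut = split_point
    left = [i for i in node.indices if values[i] <= cut]
    right = [i for i in node.indices if values[i] > cut]
    if not left or not right:
        return node                     # the cut separates nothing; keep a leaf
    node.split_point = cut
    # Children are always re-grown with the default (automatic) split.
    node.left = grow(Node(left), values, depth - 1)
    node.right = grow(Node(right), values, depth - 1)
    return node


def leaves(node):
    """Return the clusters (index lists) at the leaves, left to right."""
    if node.left is None:
        return [node.indices]
    return leaves(node.left) + leaves(node.right)


values = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0]   # assumed 1-D projections
root = grow(Node(list(range(len(values)))), values, depth=2)
print(leaves(root))  # [[0, 1], [2, 3], [4], [5]]

# Simulate the user moving the root's split point to 2.5:
# the whole subtree below the modified node is rebuilt.
grow(root, values, depth=2, split_point=2.5)
print(leaves(root))  # [[0, 1], [2], [3], [4, 5]]
```

Only the modified node receives the user-chosen split point; everything below it is recomputed automatically, which mirrors the idea that a manual change reorganizes the subtree rather than the whole tree.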

The package naturally lives in a repository, which is an essential part of our presentation at the conference. The repository is structured as a typical Python package: besides the source, documentation, and test folders, it contains a README.md, a LICENSE, and additional build information files (requirements.txt, setup.py, etc.).

Furthermore, several files configure the additional tooling that a package needs. First come the workflows that provide continuous integration and automatically deploy the package with each new release. They are followed by configuration files for utilities that are not essential to the project but are good practice, such as the Codecov test coverage validator and the Codacy code quality checker. These utilities help assure a level of code quality in the repository that a journal can accept.

Also, as good practice, the repository contains a folder with the experimental procedure used for the results published in the manuscript. The repeatability of research results is essential for the validation of research.

Finally, the package’s code is separated into manageable, subject-specific modules that are PEP 8 compliant. This means that the final form of the source code has been extensively checked for style and formatting errors. The code is divided into four major parts, three of which follow HiPart’s description: the algorithm execution, the static visualization, and the interactive visualization. The fourth part houses utility functions used across the entire package.

Mr. Panagiotis Anagnostou has been a Ph.D. candidate at the Department of Computer Science and Biomedical Informatics of the University of Thessaly, Greece, since December 2019, with the subject “Design and implementation of machine learning algorithms and user interface tools at the field of biomedicine”. He received a Diploma degree in Computer Science from the University of Patras, Greece, in 2019. In his current position, he has served as a teaching assistant for the courses “Analysis and big data mining at medicine and biology” and “Computer science and computational biomedicine” of the Interdepartmental Postgraduate Program “Computer science and computational biomedicine” of the University of Thessaly’s School of Science during the academic years 2019-22. Before that, he was a laboratory assistant for the Digital Electronics course at the Department of Computer Engineering and Informatics of the University of Patras during the academic year 2016-17. He also served as a researcher on the research project “Advanced personalized, multi-scale computer models preventing OsteoArthritis”, funded by the European Community’s H2020 Program under grant agreement No. 777159, during 2020-2021. His research interests lie within the fields of Machine Learning, Big Data Mining, and Big Data Applications.