Arkouda: Interactive Data Exploration Backed by Chapel
Exploratory data analysis (EDA) is the prerequisite for all data science. EDA is non-negotiably interactive—by far the most popular environment for EDA is a Jupyter notebook—and, as datasets grow, increasingly computationally intensive. Several existing projects attempt to combine interactivity and distributed computation using programming paradigms and tools from cloud computing, but none of these projects have come close to meeting our needs for high-performance EDA. To fill this gap, we have developed a prototype, called arkouda, that allows a user to interactively issue massively parallel computations on distributed data. We designed the API of arkouda to closely mimic NumPy, the underlying computational library used in approximately 80% of EDA workflows (based on a sample of Jupyter notebooks). Our vision is that users will import arkouda as a Python module in place of NumPy (e.g. “import arkouda as np”) and use familiar NumPy functions and syntax to interact with arrays of data residing on an HPC. The computational heart of arkouda is a Chapel interpreter that accepts a pre-defined set of commands from the Python frontend and uses Chapel’s built-in machinery for multi-locale and multithreaded execution. While arkouda, in our experience, comes closer than anything else to enabling high-performance EDA, the process of developing arkouda has also helped identify ways Chapel must improve in order to become a truly productive language for data science.