Computational Storage
Data movement between storage and compute is a major bottleneck in data-driven applications. By executing compute kernels on the storage device instead of moving the data through the memory hierarchy to the CPU cache, throughput can be increased and energy consumption reduced. This application architecture can also lower total cost of ownership for data-intensive workloads such as genomics and data analytics.
Status
At CRSS we are researching a computational storage device interface simulator that allows a user application to offload many concurrent compute tasks to the device. The simulator provides a platform for further developing the interface between user applications, device drivers, and computational storage devices. Given the current lack of readily available hardware designs, it lets research in these areas progress in parallel: we can explore different models for using this type of hardware, including possible constraints imposed by the NVMe specification as well as multiple approaches to offloading compute tasks to the device.
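As a rough illustration of what such an offload interface might expose, the sketch below shows one plausible descriptor for a single offloaded compute task. The field names and layout are illustrative assumptions only; they are not drawn from the NVMe specification or from any particular device design.

```c
#include <stdint.h>

/*
 * Hypothetical descriptor for one offloaded compute task.
 * All fields are assumptions made for illustration.
 */
struct cs_task_desc {
    uint32_t kernel_id;   /* which pre-registered compute kernel to run */
    uint32_t flags;       /* e.g. synchronous vs. asynchronous completion */
    uint64_t input_slba;  /* starting LBA of the input data on the device */
    uint32_t input_nlb;   /* number of logical blocks to read as input */
    uint32_t arg_len;     /* length of an opaque kernel-argument blob */
    uint64_t result_addr; /* host DMA address for the (small) result */
    uint32_t result_len;  /* size of the result buffer */
    uint32_t tag;         /* caller-chosen tag to match completions */
};
```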
The simulator is built on the QEMU machine emulator. Running the Intel SPDK userspace NVMe driver atop the emulated QEMU device gives high-throughput access to the PCIe bus, and the NVMe I/O queueing system allows thousands of requests to be in flight simultaneously, letting applications exploit the high degree of parallelism inherent in many data-driven workloads. On top of the simulator we are developing an application framework for compute kernel offload, which gives us the opportunity to explore the different interfaces and synchronization mechanisms available for connecting a user application to the device. By exploring these approaches to device interface design, we aim to give application developers a straightforward way to make the program modifications needed to port existing data-driven applications to the computational storage interface, reducing the engineering effort required to realize these performance and efficiency gains. The system also lets us evaluate the scalability of kernel offloading techniques and the computational cost of synchronization between the host and the storage accelerator.
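The sketch below shows, under stated assumptions, how a batch of offload requests could be issued through the SPDK NVMe driver and completed by polling a queue pair. The SPDK calls used (spdk_nvme_ctrlr_cmd_io_raw, spdk_nvme_qpair_process_completions, spdk_dma_zmalloc) are part of the public SPDK API; the vendor-specific opcode 0xC0 and the use of cdw10/cdw11 to carry a kernel identifier and a task argument are assumptions for illustration only, not a description of our actual device interface.

```c
#include <stdio.h>
#include <spdk/env.h>
#include <spdk/nvme.h>

static void
offload_done(void *cb_arg, const struct spdk_nvme_cpl *cpl)
{
	unsigned *outstanding = cb_arg;

	if (spdk_nvme_cpl_is_error(cpl)) {
		fprintf(stderr, "offload command failed\n");
	}
	(*outstanding)--;  /* one fewer request in flight */
}

/* Submit `count` hypothetical offload requests on one qpair, then poll until all complete. */
static int
submit_offload_batch(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_qpair *qpair,
		     uint32_t nsid, unsigned count)
{
	unsigned outstanding = 0;

	for (unsigned i = 0; i < count; i++) {
		/* Per-request DMA-able result buffer (leaked here for brevity). */
		void *result = spdk_dma_zmalloc(4096, 0x1000, NULL);
		if (result == NULL) {
			return -1;
		}

		struct spdk_nvme_cmd cmd = {0};
		cmd.opc = 0xC0;   /* assumed vendor-specific "execute kernel" opcode */
		cmd.nsid = nsid;
		cmd.cdw10 = 42;   /* assumed: kernel id */
		cmd.cdw11 = i;    /* assumed: per-task argument */

		int rc = spdk_nvme_ctrlr_cmd_io_raw(ctrlr, qpair, &cmd, result, 4096,
						    offload_done, &outstanding);
		if (rc != 0) {
			spdk_dma_free(result);
			return rc;
		}
		outstanding++;
	}

	/* Busy-poll the completion queue; multiple qpairs could be polled in parallel. */
	while (outstanding > 0) {
		spdk_nvme_qpair_process_completions(qpair, 0);
	}
	return 0;
}
```

Polling-based completion, as SPDK uses, trades host CPU cycles for low latency; this is one of the host-device synchronization costs the simulator lets us measure against interrupt-driven or batched alternatives.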