Performance Evaluation of Data Intensive Scientific Applications

Project ID
FG-98
Project Categories
Computer Science
Completed
Abstract
We would like to perform a detailed benchmarking effort for data-intensive applications using the resources provided by FutureGrid. While synergistic with our other research funded by the NSF Cluster Exploratory (CluE) and SDSC's Triton Resource Opportunity (TRO) programs, this work will showcase the use of supercomputing facilities for large-scale data processing, complementing work that we will perform on other private and public grid and cloud-based environments. Under the scope of this project, we wish to install our benchmark data sets and execute a number of benchmark queries on them, using both "traditional" and Hadoop-based solutions for serving these data sets. The application domains that we are interested in span multiple scientific disciplines, especially geosciences and bioinformatics. In the area of geosciences, we are interested in benchmarking the performance of database and Hadoop-based implementations for serving high-resolution geospatial data sets. In the area of bioinformatics, we are interested in studying the performance of various traditional and cloud-enabled codes for next-generation sequencing.
Use of FutureSystems
FutureGrid resources will be used to install our benchmark data sets and execute a number of benchmark queries on them, using both "traditional" and Hadoop-based solutions for serving these data sets. Because FutureGrid provides a variety of environments, we will use it to experiment with both shared-nothing and traditional HPC-style environments.
Scale of Use
We anticipate needing around 50,000 hours over the course of the next year.