Seismic Data Processing Platform Based on MapReduce and NoSQL

Project ID
Project Categories
Non-Life Science
We propose to develop a novel framework for seismic data processing built on two relatively new but stable technologies: MongoDB, a NoSQL database system, and Hadoop, a scalable parallel processing framework. The system will load core metadata from the IRIS DMC into MongoDB so that a seismologist can build a well-managed working dataset as input. Hadoop will provide scalable processing: it offers a mechanism to process the same dataset on systems ranging from a single multicore desktop to a state-of-the-art cluster with thousands of nodes. The Hadoop framework abstracts the processing flow into a concept called MapReduce. MapReduce greatly simplifies parallelization because most standard serial processing algorithms can be adapted by writing Map and Reduce wrapper procedures for the algorithm; Hadoop then handles scheduling and the flow of data through the system. We will use MongoDB to manage data created at all phases of processing. MongoDB is a document database, which will allow us to manage four additional types of data that are traditionally treated very differently: (1) process-generated metadata, (2) processed waveform data, (3) processing parameters used as input to individual algorithms, and (4) log output from individual processing algorithms. Integration of MongoDB and Hadoop is an established technology. The scripting language Pig Latin will be used to build processing workflows. Construction of the framework itself is largely plug-and-play; the initial development work needed is to define an efficient schema for MongoDB and to implement MapReduce functions for a suite of algorithms required for testing.
Research will center on four primary questions. (1) How can we most efficiently extract, transform, and load data into MongoDB? (2) What is the IO performance of the framework with different configurations? (3) What are the tradeoffs in how processing time scales for different algorithms? (4) How scalable is this system for seismic processing?
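The idea of adapting a serial algorithm with Map and Reduce wrappers can be illustrated with a minimal, pure-Python sketch. It does not use Hadoop, MongoDB, or Pig Latin; the record layout (a dict with "station" and "samples" fields), the station names, the choice of mean removal and stacking as algorithms, and the in-memory run_mapreduce driver are all assumptions made for illustration, not part of the proposed design.

```python
from collections import defaultdict


def demean(samples):
    """Serial algorithm: remove the mean from one waveform."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def map_demean(record):
    """Map wrapper: run the serial algorithm on one record.

    `record` is a hypothetical document-style dict with "station"
    and "samples" fields. Emits (key, value) pairs keyed by station.
    """
    yield record["station"], demean(record["samples"])


def reduce_stack(station, traces):
    """Reduce wrapper: average all traces that share a station key.

    Assumes every trace for a station has the same number of samples.
    """
    traces = list(traces)
    count = len(traces)
    stacked = [sum(column) / count for column in zip(*traces)]
    return station, stacked


def run_mapreduce(records, mapper, reducer):
    """Tiny in-memory stand-in for the scheduling and shuffle steps
    that a framework such as Hadoop would perform across nodes."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Placeholder records, not real data.
records = [
    {"station": "STA1", "samples": [1.0, 2.0, 3.0]},
    {"station": "STA1", "samples": [2.0, 4.0, 6.0]},
    {"station": "STA2", "samples": [5.0, 5.0]},
]
result = run_mapreduce(records, map_demean, reduce_stack)
```

In this sketch the serial function demean is left untouched; only the two thin wrappers know about keys and grouping, which is the property that makes existing algorithms straightforward to adapt.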
Use of FutureSystems
Implement our prototype system within FutureSystems.
Extract, transform, and load (ETL) seismic data into FutureSystems.
Experiment with different configurations to find the one with the most efficient IO performance.
Characterize the tradeoffs of running different algorithms with MapReduce.
Test the scalability of the system.
Evaluate the portability of the system to other platforms such as AWS.
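One common way to summarize the scalability test listed above is with the standard speedup and parallel-efficiency ratios: speedup is the one-worker run time divided by the n-worker run time, and efficiency is speedup divided by n. The sketch below assumes wall-clock timings are collected per worker count; the function name speedup_table and its input layout are hypothetical, and the numbers in the usage lines are placeholders, not measurements.

```python
def speedup_table(timings):
    """Compute speedup and parallel efficiency per worker count.

    `timings` maps number of workers -> wall-clock seconds and must
    include a one-worker baseline under the key 1.
    Returns {workers: (speedup, efficiency)}.
    """
    baseline = timings[1]
    table = {}
    for workers, seconds in sorted(timings.items()):
        speedup = baseline / seconds
        table[workers] = (speedup, speedup / workers)
    return table


# Placeholder timings for illustration only, not experimental results.
example = speedup_table({1: 100.0, 2: 50.0, 4: 40.0})
for workers, (speedup, efficiency) in example.items():
    print(f"{workers} workers: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

An efficiency close to 1 indicates near-linear scaling; a value that drops as workers are added suggests that IO or scheduling overhead is starting to dominate.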
Scale of Use
I want to test how scalable this system is for seismic processing, so I will need to scale up the number of VMs for an experiment.