5-9 September 2016
Prague Congress Centre

P4.062 Pre-Emptive Data Caching Infrastructure for Data Centric Analysis and Modelling

8 Sep 2016, 14:20
1h 40m
Foyer 2A (2nd floor), 3A (3rd floor) (Prague Congress Centre)

5. května 65, Prague, Czech Republic
Board: 62
Poster D. Diagnostics, Data Acquisition and Remote Participation P4 Poster session

Speaker

Ivan Lupelli (UKAEA-CCFE)

Description

The next generation of tokamaks, e.g. ITER, will have extremely large data collection rates (~0.3 PBytes per day), significantly larger than those experienced today, bringing new challenges in data management, data analysis and modelling. With long pulse durations it is important that data be accessible during the experiment for plant monitoring and quasi real-time analysis. One of the big data challenges for these use cases is to ensure that the appropriate data are made available very quickly, when they are required and where they are consumed. Given the data volumes and limited network capacity, not all data can be distributed in time – we have to be selective. How is this best achieved? One solution is to pre-emptively identify and distribute data to a local cache before an application or model requests it. Pre-emption relies on the analysis of historical access patterns using data mining techniques to identify a set of rules from which, following an initial data request, the most probable set of next requests can be inferred. Apache Spark is being used to capture the inference rules from IDAM data access logs accumulated from the MAST experiment over several years (>50 million records). Implementing these rules requires the inferred data sets to be copied to a cache on the host computers running the application code ahead of the next data request. This work is part of the SAGE EU H2020 project, led by Seagate, developing exascale data centric computing architectures. The SAGE hardware consists of multi-tiered storage and HPC compute nodes, where data are moved between tiers to where they are needed using the concept of percipience. The work presented will describe the Spark workflow, the results of the analysis, and an implementation of the pre-emptive caching infrastructure at MAST, together with plans for its implementation and testing on the SAGE platform.
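The inference step described above – following an initial data request, predict the most probable next requests and copy them to a local cache – can be sketched as follows. This is a minimal illustration only, using a first-order "A is usually followed by B" confidence model over per-session request sequences; the actual work mines rules at much larger scale with Apache Spark over IDAM access logs, and the function names, signal names, and the simple next-request model here are all hypothetical.

```python
from collections import Counter, defaultdict

def mine_rules(sessions, min_confidence=0.5):
    """Infer rules of the form 'after signal A, signal B is likely
    requested next' from historical access sequences, where each
    session is a list of signal names in request order (hypothetical
    stand-in for rules mined from real access logs)."""
    pair_counts = defaultdict(Counter)   # A -> counts of signals seen after A
    antecedent_counts = Counter()        # how often A appears as an antecedent
    for seq in sessions:
        for a, b in zip(seq, seq[1:]):
            pair_counts[a][b] += 1
            antecedent_counts[a] += 1
    rules = {}
    for a, followers in pair_counts.items():
        # confidence = P(next request is B | current request is A)
        scored = [(b, n / antecedent_counts[a]) for b, n in followers.items()]
        rules[a] = sorted(
            [(b, c) for b, c in scored if c >= min_confidence],
            key=lambda bc: -bc[1],
        )
    return rules

def prefetch_set(rules, requested_signal, limit=3):
    """Given an initial request, return up to `limit` signals to copy
    to the local cache before the application asks for them."""
    return [b for b, _ in rules.get(requested_signal, [])][:limit]

# Toy usage with made-up signal names: after "ip", "density" is requested
# in 2 of 3 sessions, so it is selected for pre-emptive caching.
sessions = [["ip", "density", "te"], ["ip", "density"], ["ip", "te"]]
rules = mine_rules(sessions, min_confidence=0.5)
print(prefetch_set(rules, "ip"))
```

In the system described, an equivalent rule set would be produced offline by the Spark workflow and consulted by the caching infrastructure on each incoming request.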

Co-authors

David Muir (UKAEA-CCFE, Abingdon, United Kingdom)
Ivan Lupelli (UKAEA-CCFE, Abingdon, United Kingdom)
Jonathan Hollocombe (UKAEA-CCFE, Abingdon, United Kingdom)
Rob Akers (UKAEA-CCFE, Abingdon, United Kingdom)
Shaun de Witt (UKAEA-CCFE, Abingdon, United Kingdom)

Presentation Materials

There are no materials yet.