Filecules and Small Worlds in Scientific Communities: Characteristics and Signif

6 June, 2016

Title: Filecules and Small Worlds in Scientific Communities: Characteristics and Signif
Lecturer: Adriana Iamnitchi
Affiliation: University of South Florida
Date given: July 2007.
Duration: 10 hours.


Most of today’s science depends on processing massive amounts of data in multi-institutional and even international collaborations. Grid computing focuses on enabling resource sharing for wide-area collaborations and has reached the stage where deployments are mature and many collaborations run in production mode. As with any growing technology, Grid usage characteristics, which inherently affect performance, could not have been predicted before or during design and implementation. This lack of evidence about usage characteristics has three significant consequences: (1) resource management solutions are evaluated on irrelevant traces; (2) quantitative comparison of alternative solutions to the same problem becomes impossible because of different experimental assumptions and independently generated workloads; and (3) solutions are designed in isolation, to fit the particular and possibly transitory needs of specific groups.

These concerns led us to analyze more than two years of workloads from a high-energy physics collaboration. In addition to contradicting previously accepted models, we discovered two novel data-usage patterns. First, a data-centric analysis reveals the existence of filecules: groups of files that are always processed together. Second, a user-centric analysis uncovers small-world properties in data sharing, which indicate emergent, interest-based grouping of users. We show that exploiting these patterns when designing resource management solutions leads to better scalability, lower costs, and increased adaptability to changing environments. In addition, these traces helped us evaluate the viability of offloading the storage needs of the high-energy physics collaboration to the Amazon Simple Storage Service (S3). We identify ways to exploit application-specific characteristics to reduce S3 storage costs and discuss the features that a storage utility service such as S3 must provide to support scientific data-sharing collaborations.
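
To make the notion of a filecule concrete, the following minimal Python sketch groups files that are requested by exactly the same set of jobs, which is one way to read "always processed together". The trace format (a mapping from job identifier to input file list), the function name `find_filecules`, and the sample data are assumptions made for illustration; they are not taken from the seminar material or from the analyzed workloads.

```python
from collections import defaultdict


def find_filecules(job_inputs):
    """Group files that are requested by exactly the same set of jobs.

    job_inputs: dict mapping a job id to the list of files it reads
    (a hypothetical trace format chosen for this sketch).
    Returns a sorted list of sorted file groups.
    """
    # Step 1: for every file, collect the set of jobs that read it.
    jobs_per_file = defaultdict(set)
    for job_id, files in job_inputs.items():
        for name in files:
            jobs_per_file[name].add(job_id)

    # Step 2: files with an identical job set form one group.
    groups = defaultdict(set)
    for name, jobs in jobs_per_file.items():
        groups[frozenset(jobs)].add(name)

    return sorted(sorted(group) for group in groups.values())


# Invented sample trace: three jobs reading four files.
trace = {
    "job1": ["a", "b", "c"],
    "job2": ["a", "b"],
    "job3": ["c", "d"],
}
filecules = find_filecules(trace)
print(filecules)
```

In this sample, files `a` and `b` are read by the same two jobs and therefore end up in one group, while `c` and `d` each form a group of their own. A resource manager could, under this reading, cache or replicate each group as a unit rather than file by file.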