Eur. Phys. J. Plus (2011) 126: 12 DOI 10.1140/epjp/i2011-11012-2
THE EUROPEAN PHYSICAL JOURNAL PLUS
Regular Article
Data management in HEP: An approach F. Furanoa CERN, European Organization for Nuclear Research - CH-1211, Gen`eve 23, Switzerland Received: 15 October 2010 / Revised: 5 December 2010 c Societ` Published online: 26 January 2011 – a Italiana di Fisica / Springer-Verlag 2011 Abstract. In this work we describe an approach to data access and data management in High Energy Physics (HEP), which privileges performance, simplicity and scalability, in storage systems that co-operate. We also show why the typical HEP workload is well positioned to access geographically distributed data repositories and then weigh the advantages and disadvantages of accessing data across the Wide Area Network. We discuss some points related to the architecture that a data access/management system should have in order to exploit these possibilities. Currently, this kind of methods were explored by using the xrootd/Scalla software suite, that is a workable example of a distributed non-transactional data repository for the HEP environment. The Scalla architecture naturally allows us to build globally federated repositories congruent with diverse HEP collaborations and their data access needs. These methodologies, however, are based on the concept of caching associated to a performant messaging system and to an efficient way of accessing data through Wide Area Network, thus they can be applied to other data access systems as well.
1 Introduction and related work In this work we address the various solutions we developed in order to initiate a smooth transition from a highperformance storage model composed by several nodes in different distant sites to a model where the nodes cooperate to give a unique file system-like coherent view of their content. With the term “node” we refer to the deployment of a storage site of any size, from a small set of disks up to the most complex hierarchical storage [1] installation. The task of promoting the maximum level of performance while making a potentially high number of storage nodes collaborate is very challenging to achieve. We have indeed to consider this goal from several points of view, functional and not functional, where, for instance performance has a very important role, together with the scalability of the whole system. The goal is to be able to effectively deploy a production-quality system, where the functionalities are effectively usable, and the data access performance is as close as possible to the one reachable by the hardware used to deploy this system. This can be considered as the major challenge in such systems, because requirements and expectations about performance and robustness are very high, and very often they are compared to distributed file systems, whose evolution over time has been quite promising, but, from our perspective, not yet to the point of being able to fully satisfy the requirements of very large-scale computing. To reach this objective, among other items, a design must face the difficulties dealing with the network latency, which varies by more than two orders of magnitude between local network access and Wide Area Network (WAN) access. With respect to distributed file systems there is some consensus that this objective is still possible, however, what is not yet clear because of the lack of experience is if the concept of mount-point is a technical obstacle, together with the restrictions that mount-points have with respect to plain data access primitives available in a library. This high latency typical of the Wide Area Network connections, and the intrinsically higher probability of temporary disconnections among clients and servers residing in different sites have historically been considered as very hard problems. From this perspective, this work addresses in a generic way the problems related to design, functionalities, performance and robustness in such systems, with a strong focus on enabling a generic data-intensive application to compute its inputs and save its outputs in a distributed storage environment, where the most stringent non-functional requirements are concerned with the data access speed and the processing failure rate in a very large-scale deployment. The idea behind our interest for WAN direct data access is not necessarily to force computations to be performed towards remote data. Hence, we do not believe that local storage clusters can be replaced at the present time, especially a
e-mail:
[email protected]
Page 2 of 15
The European Physical Journal Plus
in the cases where a large computing farm is accessing the data. What we will try to show in this paper is that the “classical” use case where a farm can only access a storage element very “close” to it, is not the only choice anymore, as it is demonstrated, for example, in a production environment by the implementation of the ALICE computing model [2]. Moreover, in a modern computing model dealing with data access for High Energy Physics experiments (but probably not only High Energy Physics), there are other very interesting use cases which can be easily accommodated with a well-working WAN-wide data access, for example: – Interactive data access, especially in the case of analysis applications that exploit the computing power of modern multicore architectures [3]. This would be even more precious for tasks like debugging an analysis or data-mining application without having to copy locally the data it accesses. – To allow a batch analysis job to continue its execution if it landed in a farm whose local storage element lost the data files which were supposed to be present. They may instead be present in another site. – Any other usage which could come to the mind of a user of the World Wide Web, which made interactivity available in a very easy way via WAN. Hence, the objectives this kinds of systems are aiming for are: – To be able to deploy a unified worldwide storage, which transparently integrates functionalities at both site level and global level. – To improve data access speed, transaction rate, scalability and efficiency. – To allow dealing with Wide Area Networks, both for distributing the overall system and accessing its data. – To implement fault tolerant communication policies. – To implement fault tolerance through the “self-healing” capabilities of the content of a Storage Element (SE). – To keep remote SEs synchronized with the metadata that describes their content. These objectives also led us to address the issues related with the scalability of the whole design, and the fact that the focus on performance is a major requirement, since these aspects can make, in practice, the difference between being able to run data analysis applications and being not. The case study for these methods is the ALICE [4,2] High Energy Physics experiment at CERN, where a system based on these principles has been incrementally deployed on an already-running production environment among many (about 60) sites spread all over the world, accessed concurrently by tens of thousands of data analysis jobs. However, given the generality of the discussed topics, the methods discussed in this paper can be extended to any other field with similar requirements. 1.1 The problem High Energy Physics experiments, from the computing point of view, rely typically on statistics about events and on all the relative procedures to identify and classify them. Very often, in order to get sufficient statistics, huge amounts of data are analysed in data stores which can reach sizes up to a few dozens of Petabytes, but which will likely grow much more in the future generations of experiments. Hence, the typical computing scenario deals with tens or hundreds million files in the repositories and tens of thousands of concurrent data analysing processes, often called jobs. Each job can open many files at once (e.g. about 100–150 in ALICE, up to 1000 in GLAST/Fermi), and keep them open for the time needed. With these numbers, it is easy to understand why the file open transaction rate for the storage system can be very high. With respect to that, a rate of O(103 ) file opens per second in a local cluster is not uncommon. In our experience the source of this traffic can be a combination of the local GRID site, a local batch system, a local PROOF [3] cluster, remote WAN-wide clients and occasional interactive users. Historically, a common strategy, largely adopted in HEP, to handle very large amounts of data (on the order of dozens of Petabytes per year) is to distribute the data among different computing centres, assigning a part of the data to each of them, possibly having some sites that are more important than others. This approach is related to the so-called MONARC model [5], that designs a fixed hierarchy of sites, together with a quite static set of paths through which the data files are pushed. There is some consensus about the fact that, although successful up to now, this approach may need a revision in order to better support the needs of a more demanding community that needs access to the data repositories. Among others, one of the consequences of this kind of architectures is that, for example, if a data processing job needs to access two data files, each one residing at a different site, the typically adopted simple-minded solution is to make sure that a single site contains both of them, and the processing is started there. Unfortunately, the situation can become more complicated than this, once one starts considering: – Applications accessing hundreds of files. – The need of having some kind of replica catalogue, in order to keep track of the (possibly automatic) data movements.
F. Furano: Data management in HEP: An approach
Page 3 of 15
– The latency in scheduling the massive replication of big data files. – The time needed to transfer them to the right place. – The substantial inefficiency of this approach if the analysis applications do not read entirely their input data files. It is very straightforward to see some simple consequences that come from the above scenario: – A hypothetical data analysis job must wait for all of its data to be present at a site, before starting. – The time to wait before starting can grow if some file transfers fail and have to be rescheduled. – The task of assigning files to sites (and replicating them) can easily degrade into a complex n-to-m matching problem, to be solved online, as new data comes and the old data is replicated in different places. An optimal solution is very difficult, if not impossible, to achieve, given the typically very high complexity of such systems in the real deployments. One could object that the complexity of an n-to-m online matching problem can be avoided in some cases by assigning to each site something like a “closed” subset of the data, i.e. a set of files whose processing does not need the presence of files residing elsewhere. This, depending on the problem, can be true, but generally implies a reduced number of choices for a job scheduler, since generally there are very few sites eligible for a given kind of processing. This situation can theoretically evolve to one where the sites containing the so-called “trendy datasets” are overloaded, while others can be idle. Moreover, this kind of solution “freezes” the possible combination of files to be analysed or “mined”, effectively reducing the extent of the algorithms that can be used without a “reshuffling” of the files. This problem is very well known in the field of High Energy Physics, where many users submit a great number of data analysis jobs a few weeks before the conferences where they expect to present their results. Another direction towards which this approach can evolve is the so-called “manual data management”, where the choices of the data distribution and the relative actions taken are almost manual, and which we did not consider in this work. Moreover, we could also object that making data processing jobs wait too much just because the distribution of files is not yet optimal for a given analysis could become a major source of inefficiency for the whole system, if we consider the users and their results as parts of it.
1.2 File catalogues and metadata catalogues Even if the discussed aspects are well known, however, it is in our opinion that the initiatives related to what we can call “transparent storage globalisation” are not apparently progressing fast enough to allow a system designer to design a very large distributed production storage deployment using them in a transparent way. What happens in many cases is that those designs include some external component whose purpose is to take decisions about file placements, based on the knowledge of the supposed current global status of the storage system, i.e. the current (and/or desired) association of data to sites. Invariably, this kind of component takes the form of a simple-minded file catalogue. Its deployment may represent a performance bottleneck in the architecture and also it can be a single point of failure, or a source of inconsistencies, as discussed later. Here we must point out the subtle difference that lies, in our view, between a file catalogue and an applicationrelated metadata catalogue. A file catalogue is a list of files, associated to other information, e.g. the sites/servers that host each particular file or the URLs to access them. When part of a common design, it often is the only way an application has to locate a file in a heavily distributed environment, hence it must be contacted for each file any application needs to access. A metadata catalogue, instead, contains information about the content of files, but not about where they can be accessed. Hence it can contain metadata about a non-existing file, or a file can have in principle no metadata entries. In our view, a worldwide storage system should shield the applications from its internals, like for example the tasks of locating files through sites or through servers. Assigning the task of locating files to a comprehensive storage system, in our view, can be highly desirable for the intrinsic robustness of the concept. A file catalogue instead must deal with the fact that the content of the repository might change in an uncontrollable way (e.g. a broken media or other issues), opening new failure opportunities, which can also be due to mis alignments between its content and the actual content of the repository. To give a simpler example, we consider a completely different domain: the ID3 tags for music files. A technically very good example of a comprehensive metadata catalogue for them is e.g. the www.musicbrainz.org website. We could also think about creating a file catalogue, indexing all the music files owned by the world’s population (and the scale would not be so different from the scale of the file catalogues of modern HEP experiments). The subtle difference among these two is that in the real world everybody is allowed not to own (or to lose or damage) a CD listed in this metadata catalogue. This would not be considered a fault by the maintainers of the metadata, and has no consequence on the information about that particular CD. In other words, this is just information about music releases which somebody could own and somebody not, and somebody else is unable to tell. Hence, an application willing to process a set of music files would only need to know which music files to process, and not how they can be accessed and in which data server in a worldwide deployment.
Page 4 of 15
The European Physical Journal Plus
On the other hand, a “file catalogue”-like system aims at being a “perfect” representation of reality, and this adds some layers of complexity to the task, last but not least that it is computationally very difficult (if not impossible) just to verify that all the entries are correct and that all the CDs owned (and sold/exchanged) by the persons are listed and correctly attributed. Our point, when dealing with this kind of problems is that, from a general perspective, our proposal is not to promote a metadata catalogue to become a file catalogue, since finding or notifying the location of the requested resources is a task belonging to the storage system layer, not to the layer which handles the information bookkeeping. This might be considered a minor, or even subtle aspect, but it makes a very big difference from the architectural point of view when the scale of the system is very large and it is distributed worldwide. 1.3 File systems and data access systems The purpose of building data access systems in order to publish data is common to several application domains, and various kinds of approaches are possible to achieve this objective. Such approaches are not equivalent, since they refer to very different requirements and data distribution patterns. But they share and give different proposals to the common problems of: – – – –
Coherency in data replication. Performance in data access. Scalability of a complex system. Robustness of a complex system in the case of faults.
Most of the work available in the literature concerning fault tolerant and fast data access only marginally deals with both communication robustness and highly available systems. An interesting effort in order to standardize the way storage and data access systems can be designed is represented by IEEE 1244 [6]. However, our opinion is that the sophisticated way in which they describe the components of a storage system might lead to sparse systems where it is mandatory to use some form of database technology just in order to glue together all the pieces of information distributed through the components. This could be considered a very controversial choice, and being absolutely forced, by design (and not by the true requirements), to adopt it, could be considered a serious pitfall of this design. One of the approaches that are closer to the one we consider is the one of the distributed file systems. Generally, the distributed file system paradigm is tied to policies dealing with distributed caching and coherency, path and filename semantics. The algorithms that handle such policies are known to cause network and CPU overhead when dealing with the operations on the files they manage [7,8]. Hence, there is a general consensus that the network overhead due to the synchronization of the internal data structures in distributed file systems becomes a serious issue when dealing with petabytes of data continuously accessed by thousands of clients. This is common to organizations that rely on massive data sharing, and, generally, is independent from the software framework or field of study. Moreover, at the best of our knowledge, in the world of distributed file systems up to now the fault-tolerance related development effort was not very strong, especially when it comes to considering the robustness of the client-server interactions in a Wide Area Network environment. Other interesting works can be found in the area of design of peer-to-peer file sharing networks [9,10]. There is general consensus that the peer-to-peer paradigm is a good proposal for the solution of problems dealing with scalability and fault tolerance [11,12]. Moreover, the illusion of having a unique huge repository obtained by aggregating a very large number of small nodes is very appealing from the point of view of the features that this approach provides. In this paper apparently we do not give much attention to the aspects linked with writing to such a distributed storage. The reason for this is that, in the use cases we consider, writing efficiently can be seen as an objective which is easier to accomplish from the purely technical point of view. In fact, the typical way data files are written in HEP is mainly sequential. For this reason, it is generally easier to get a high level of efficiency in writing new data files, also through Wide Area Networks. Instead, what is more difficult to deal with is the policy that decides where to write, in the case in which there are reasons to write to the storage hosted in a particular site. Generally, however, such choices at the moment in the HEP case are human-based at some extent, and hence there is not much left to decide to the overall storage system by now.
2 Goals and directions Given all these considerations, in our most recent work we considered multiple overall objectives: – Evolve an already existing and well-tested data access system in order to be able to cope with storage globalization. – Evaluate to which extent the devised solutions can be effective. Can they, for instance, support some interactive form of data analysis?
F. Furano: Data management in HEP: An approach
Page 5 of 15
– Find a way to insert in a working data analysis computing model some key features that enhance it in this direction, while opening the way for more ambitious scenarios, where the components are better integrated and the overall system is more robust and needs less “tinkering” to remain functional. Despite the fact that these items can be considered just as simple examples, in this work we address the various solutions we considered and adopted in order to pass from a high-performance storage model composed by several different distant sites coordinated by some software (or human-based) components to a model where the remote storage nodes cooperate to give a unique file system-like coherent view of their content. Once this is achieved, we could argue that a hypothetical job scheduler could be designed to schedule processing jobs to start in sites which are supposed to host the needed data. This, for instance, is the way the ALICE job submission behaves. But it may happen that a job is scheduled in a site even if the storage local to the site is missing some of the needed data files. Of course, the design must contemplate how to handle this case, since there could be several options, i.e.: – The missing files were supposed to be there. The storage system immediately pulls them into the local storage. – Letting the analysis software automatically access data that is in another site, by means of a normal client-server connection. – Let the jobs crash; somebody (maybe) will investigate on the reason why those files were not found. In this section, we also would like to highlight the kind of functionalities we are aiming for, via two relatively simple use cases. 2.1 A use case: the travelling physicist A physicist is waiting for the results of his analysis jobs on the GRID. There are many jobs, which produce several output files which will be saved on a storage element, e.g. at CERN. This physicist’s laptop has all the softwares configured to access those storage elements and draw his histograms, e.g. with ROOT [13]. This man leaves for a conference in a distant place, and the jobs finish while he is on the plane. Once arrived, he wants to simply plot the results from the laptop, and eventually save new modified histograms in the same storage element. Of course he has no time to lose in tweaking in order to get a remote copy of everything, nor he wants to create file replicas in some other storage element. To avoid confusion, all the things must stay where they are. What can this physicist expect? Can he do it? Our answer is “technologically, yes, it is more than possible”. 2.2 A use case: efficient data access for batch jobs This use case reflects the true production choices of the ALICE [14,2] analysis on the GRID. Each running job reads about 100-150MB of conditions data, from a storage element called ALICE::CERN::SE. These are conditions data accessed directly, not file copies, and the access was quite well optimised, i.e. only the needed bytes are read from the files: each job reads only what it effectively needs, with a maximum read speed of about 20MB/s, with a 100% utilization of one CPU core. Moreover, also the data files are sometimes accessed from a remote SE, especially in the case in which a file is lost or the local SE is down for maintainance. From the ALICE monitoring system [15], the remote access mechanisms turned out to be very robust and efficient, even if it is clear that there is room for further optimizations.
3 A unique view versus a unique thing The difficulty of efficiently accessing a large data repository distributed through different sites can be linked to the size of the data repository itself or to the policies dealing with the data replication. For instance, the whole repository could be bigger than the storage that a single site can provide, or there may be agreements between sites in order to cooperate to a common project. In this case, it could look easier or more efficient to assign to a particular site a particular part of a big data storage, especially if there is some other reason to keep that data in that site. For example, in the High Energy Physics world, the case could be that of a site which hosts a group of experts in some particular topic. This kind of site generally shows great interest in asking to physically host such a data partition. But in this case we do not need to focus on the requirements of the High Energy Physics computing, which represent a particular case. Another fair example could be that every town has the responsibility to store the pictures of all its citizens, and this storage site collaborates with the others in order to give a uniform, catalogue-free nation-wide repository of the citizens picture, able to efficiently feed several computing farms, not necessarily residing close to the storage centres.
Page 6 of 15
The European Physical Journal Plus
From this perspective, hence, it is somehow immaterial if the final purpose of the system is to handle High Energy Physics data or a very high number of pictures. The only important thing in our view is that the characteristics of the chosen storage model must match the characteristics of the computing model of which it is a component. In other words, one could wonder if, in the example dealing with the citizens’ pictures, an application could process a big part of this repository, without forcing the creation of replicas of the pictures’ files just because there might be the need to process them somewhere. The overhead of copying the data and keeping track of this process can be much greater than the task of just processing them. With this statement we mean that the decision of creating replicas should be based on other reasons, with respect to the purely functional requirement of being able to access the data. An example of such reasons could be to introduce some form of data redundancy to reduce the probability of losing important data. A solution to this problem is not necessarily obvious, and it largely depends on the efficiency of the communication paths between the analysis processes, but also in the efficiency of the mechanisms that locate the data in the network of the storage sites. We believe that, by exploiting these two points, which were very strongly considered in the design of the Scalla/xrootd system [16], one can build a transparent, high-performance file-system-like view of a very large storage distributed through many sites, suitable for a wide range of applications willing to populate and access them in real time. There have been other attempts to make this possible. For instance, one could analyze how the AliEn [14,2,17] user interface presents a unique view of all the data repository of the ALICE experiment, and argue that this goal has already been achieved. Despite the strong success of this kind of idea in the AliEn framework, unfortunately this is not completely true. In fact, what AliEn does is to glue the pieces of information regarding the content of many sites into a somehow coherent interface, which creates reasonably well the illusion of a common POSIX-like namespace. The questionable point of this is that AliEn does it by using a centralised database, which holds hundreds of millions of entries and needs constant maintenance, although it is very stable up to now. Nevertheless, we consider very safe the location chosen for this database in that system. In fact, this complex database was not put in front of (or embedded in) the storage system, but it is contacted by the data processing jobs or human users in a “quasi-offline” fashion. In other words, and with a certain degree of approximation, a task (or a user) first contacts this database in order to discover what it needs and where, and then the data processing jobs start running and contacting the storage systems, once fed with this information. This reduces the load on the storage access system of a big factor, thus allowing it to deliver transaction rates which can go up to several thousands of file opens per second, without the need for too special hardware. Hence, our response to the so-called “small files problem”, mainly related to the file open rate in the storage system, is to keep the storage access system as fast and lightweight as possible, in order to provide almost the same level of performance that the hardware can provide, while trying to remove all the obstacles to the free scaling of the system size. One question could come at this point: since distributed file systems (like e.g. AFS [18] or LUSTRE [19]) are widespread, can one build a worldwide file system using this kind of highly integrated technology? Doing this we could avoid all the problems dealing with keeping a file database aligned with the reality, and probably avoid having a unique bottleneck/point of failure. The answer is very difficult, and, in general, in the HEP domain, we believe that it is negative. A justification for this answer could be the fact that, to the best of our knowledge, the mainstream distributed file systems do not yet fulfill the requirement of distributing their functionality across distant sites while maintaining a very high efficiency. Another argument (discussed later in this work) is that also the efficiency in random data access that the distributed file systems can provide is not sufficient in the case of data-intensive applications, also due to the absence of mechanisms to drive their internal read ahead logics. In fact, in our experience, when a multitude of clients hits a storage system, read-ahead policies which are too aggressive make the performance much worse. Unfortunately, this is the case of most of the client-side implementations of such file systems. Hence, the case study which spurred the development of more advanced techniques for handling very large data repositories originated to a great extent from the technical evolution of the data storage of the ALICE experiment [4] (proton-proton and lead nuclei collisions at the LHC), but great care was put in order to evolve the system as a generic set of packaged tools, together with new generic methods and ideas, not specific to the ALICE experiment, or even to HEP for that matter. Historically speaking, another very important starting point has been the work done in order to refurbish the BaBar [20] computing model, after the awareness of the fact that using a central Database Management System containing all the data was not a very productive approach.
4 Xrootd and the Scalla software suite The basic system component of the Scalla software suite is a high-performance data server akin to a file server called xrootd. However, unlike NFS, the xrootd protocol includes a number of features like: – Capability to accommodate up to several tens of thousands concurrent physical client connections per server. – Capability to support several tens of thousands outstanding requests sharing the same physical connection.
F. Furano: Data management in HEP: An approach
Page 7 of 15
Fig. 1. Cell-like organization of an xrootd cluster.
– Communication optimizations, which allow clients to accelerate data transfers (e.g., overlapping requests, TCP multistreaming, etc.). – An extensive fault recovery protocol that allows data transfers to be interrupted and continued at an alternate server. – A comprehensive authentication/authorization framework. – The possibility of interfacing a data server with external systems, like tape drives, in order to provide transparent hierarchical storage capabilities [1]. – Peer-to-peer elements that allow xrootd servers to be clustered together while still providing a uniform name space. Xrootd’s clustering is accomplished by the cmsd component of the system. This is a specialized server that can co-ordinate xrootd’s activities and direct clients to appropriate servers in real-time. In essence, the system consists of: – A logical data network (i.e., the xrootd servers). – A logical control network (i.e., the cmsd servers), based on a proprietary message passing technology. The control network is used to cluster servers while the data network is used to deliver actual data. We define a node as a server pairing of an xrootd with a cmsd. A cmsd can assume multiple roles, depending on the nature of the task. In its manager role, the cmsd discovers the best server for a client file request and co-ordinates the organization of a cluster. In its server role, the cmsd provides sufficient information to its manager cmsd so that it can properly select a data server for a client request. Hence, a server cmsd is essentially an agent running on a data server node. In its supervisor role, the cmsd assumes the duties of both manager and server. As a manager, it allows server cmsds to cluster around it, it aggregates the information provided by the server cmsds and forwards the information to its manager cmsd. When a request is made by its manager, the supervisor cmsd determines which server cmsd will be used to satisfy the request. This role parceling allows the formation of data server cells that cluster around a local supervisor, which, in turn, clusters around a manager cmsd. We can now expand the definition of a node to encompass the role. A data server node consists of an xrootd coupled with a server cmsd, a supervisor node consists of an xrootd and a supervisor cmsd, and a manager node consists of an xrootd coupled with a manager cmsd. The term node is logical, since supervisors can execute on the same hardware used by data servers. 4.1 Cell-based organisation To limit the amount of message traffic in the system, a cell consists of 1-to-64 server nodes. Cells then cluster, in groups of up to 64. Clusters can, in turn, form superclusters, as needed. As shown in fig. 1, the system is organized as a B-64 tree, with a manager cmsd sitting at the root of the tree. Since supervisors also function as managers, the term manager should subsequently be assumed to include supervisors. A hierarchical organization provides a predictable message traffic pattern and is extremely well suited for conducting directed searches for file resources requested by a client. Additionally, its performance, which depends on the number of collaborating nodes, scales very quickly with only a small increase in messaging overhead. For instance, a two-level tree is able to cluster up to 262144 data servers, with no data server being more than two hops away from the root node. In order to provide enhanced fault-tolerance at the server side, manager nodes can be replicated. When nodes in the B-64 tree are replicated, the organization becomes a directed acyclic graph and maintains predictable message
Page 8 of 15
The European Physical Journal Plus
traffic. Thus, when the system is configured with n managers, there are at least n control paths from a manager to a data server node, allowing for a significant number of node failures before any part of the system becomes unreachable. Replicated manager nodes can be used as fail-over nodes or be asked to load balance the control traffic. Data server nodes are never replicated. Instead, file resources may be replicated across multiple data servers to provide the desired level of fault-tolerance as well as overall data access performance. The main reason why we do not use the term “file system” when referring to this kind of system is the fact that some requirements of file systems have been relaxed, in order to make it easier to head for extreme robustness, performance and scalability. The general idea was that, for example, it is much easier to give up atomicity in distributed transactions than being unable to make the system grow if its performance/size is not sufficient. For example, right now there was no need to dedicate a serious effort in order to allow for atomic distributed transactions or distributed file locks. These functionalities, when related to extreme performance and scalability are still an open research problem, even more critical when dealing with high-latency networks interconnecting nodes, clients and servers over a WAN. These kinds of simplifications were also related to the particular peer-to-peer–like behaviour of the xrootd mechanism behind the file location functionalities. In the xrootd case, moreover, we can claim for instance that the fact that there is no database is not entirely true. In fact, the database used to locate files is constituted by the message-passingbased aggregation of several hierarchical databases, which are the file systems of all the aggregated disk partitions. Of course, these are extremely well optimised for their purpose and are up to date by definition; hence there is no need to replicate their content into a typically slow external database, just to locate files in sites or single servers. Another characteristic of the xrootd daemon, which is extremely important, is the fact that each server hides its internal paths where the data is stored by means of a simple local string substitution in the file names. In our terminology this is called “localroot”. Doing this, each server in a cluster can export a namespace which is detached from the internal details like the names of the mount points of a given server. Using this kind of feature, in a typical setup, all the servers are configured in order to export the same namespace, and hence the same file exported by more than one server will appear to have the same file name without the need of potentially complex external name translations. This kind of internal filename prefix processing is also used to aggregate multiple mount points in the same server. By doing this, the processing applications are completely detached from the internal deployment details; all they see is a common coherent name space, and the computational overhead of this kind of local name translation is extremely low. 4.2 Fault tolerance Other aspects, which we consider as very important, are those related to robustness and fault tolerance. From one side, a system designer has to be free to decide what the system should do if, for instance, a disk breaks, and its content is lost. Given the relative frequency of this kind of hardware problems, the manual recovery approach should be reduced to a minimum or not needed at all. The typical solutions exploited in fail-safe HEP data management designs are: – The disk pool is just a disk-based cache of an external tape system, hence, if a disk breaks, its content will be (typically automatically) re-staged from the external units by some other server in the same cluster. Of course, the system must be able to work correctly even during this process. – The files are replicated somewhere else. Hence, some system could schedule the creation of a new replica of the lost ones. The difficult part is that finding out which files were lost in a potentially large repository can be problematic. The other aspect of fault tolerance is related to the client-server communication, typically based on TCP connections between clients and servers, and, in the xrootd case, also between server nodes. The operational principles that we applied rigorously are: – A client must never return an error if a connection breaks for any reason (network glitches, etc.), unless it failed after having retried for a certain number of times and/or failed to find an alternate location where to continue. – The server can explicitly signal every potentially long operation, so that the client goes into a pause state from which it can be woken up later. This avoids having to deal too much with timeouts. – Every inter-server connection is kept alive and eventually re-established if it seems to be broken.
5 Facing a Wide Area Network The previous considerations about robustness become more important when the TCP connections that connect clients to servers are established through a Wide Area Network, which gives a higher probability of disconnections over long times, together with introducing more difficulties in using efficiently the available bandwidth. From this perspective, we can say that our experience with the heavy usage of a “retry somewhere else and continue” mechanism has been extremely positive. A very welcome side effect of this is that typically no action is required to alert users, or pause any processing if one or more servers have to be upgraded or restarted.
F. Furano: Data management in HEP: An approach
Page 9 of 15
Given the stability of the communication mechanism, one could argue that it could finally let an application access efficiently the data it needs even if the repository is very far, or, possibly even over a fast ADSL connection from home. Unfortunately, for data intensive application, this is in general very difficult. Naively one would think that, given enough bandwidth, WANs seem capable of delivering enough data to a running process. However, the achievable performance has hitherto been very low for applications in the High Energy Physics domain. This is generally due to: – The characteristics of WAN data streams, often called the long, fat pipe problem: high latency paid once per request, and theoretically high achievable bandwidth per TCP connection. – The typical structure of HEP data processing applications, which, in their extreme simplification, consist in a loop of get chunk - compute chunk instructions. In [21] the characteristics of WAN networks and of some HEP requirements are discussed, in order to introduce some methods to enhance their data throughput. Such techniques (schematized in fig. 3 and fig. 4 further down) have been implemented in the Scalla/xrootd system during the various development cycles, and have reached now a good maturity. The data centres offering computing services for HEP experiments include computing facilities and storage facilities. In this environment, a simple application that has to analyze the content of some files typically will: – Open the files it has to access. – Cycle through the stored data structures, performing calculations and internally updating the results. – Output in some way the final results. In this context, we assume that these data access phases are executed sequentially, as it is in most data processing applications. These considerations refer to the concept of a file-based data store, which is a frequently used paradigm in HEP computing; however, the same performance issues can affect other data access methods. One of the key issues we tried to address is the fact that the computing phase of a HEP application is typically composed by a huge number of interactions with the data store. Hence, a computing application must deal with the fact that even a very short mean latency (e.g. 0.1 milliseconds) would be multiplied by the huge number of interactions (e.g. 108 ). This argument has historically been considered a serious issue which makes it impossible for data analysis applications to access remote repositories with a high degree of efficiency. However, very often, this is not necessarily true for all applications. A trivial example of this is when the application does not need to read the entire content of the files it opens [21]. A more complicated scenario involves complex data processing applications which can predict in some way which data chunks they are going to access. Our experience, discussed later, is that, if supported by adequate data access technologies, these applications can enhance their performance by up to two orders of magnitude, reaching levels comparable to those achievable through local access, which also gets a large performance benefit. In practice, this would mean that it could be possible, for example, to run a data analysis on a desktop or laptop PC, without having to gather all of its inputs before starting it. 5.1 Read requests and latency Figure 2 shows that, in a sequence of read requests from a client to a server, the transmission latency affects the requests twice, and is placed before and after the server side computation. If we assume that the latency is due to the network, and that the repository is not local to the computing client, the value of the latency can be even greater than 70–80 milliseconds. With a latency of 80 milliseconds, an application in Padova, Italy, requesting data from a server at SLAC, located in California, USA, for example, will have to wait at least 160 seconds just to issue 1000 data requests, even if they are for very small chunks. However, the work done at INFN Padova, SLAC and CERN addresses the problem in a more sophisticated way. If we can get some knowledge about the pattern (or the full sequence) of the data request issued by the application, the client side communication library can, in general, request in advance the data for the future requests, and store portions of it in memory as it arrives, while other portions of it are still travelling over the network. Moreover, issuing the data requests in parallel, with the current ones, has the advantage of having a very high probability that the outstanding data contain responses of future data reads. The difficult task here is to efficiently coordinate these fast chunks exchanges with the need of the application to get the next chunk as soon as possible. 5.2 Read requests and throughput The other parameter that is very important when dealing with data access is the data rate at which the application requests new data chunks to process, and the comparison with the data throughput that the data network can sustain. One of the characteristics that make WANs difficult to deal with in the case of data access is the fact that, even if the throughput of the network could be sufficient to feed the data processing, a single TCP stream through a
Page 10 of 15
The European Physical Journal Plus
Fig. 2. Impact of the communication latency on data access.
WAN is typically able to transfer data only at a fraction of the available bandwidth. Moreover, whenever the TCP stack supports advanced features like the “windows scaling”, this typically translates into a so-called ramp-up phase, during which the achievable bandwidth is not very high because the communicating systems are adjusting the internal TCP parameters. This depends upon the characteristics of the commonly used TCP stacks, and also on the common configurations of the network devices, and led to the development of high-performance file copy tools, such as BBCP, GRIDFTP, FDT and XRDCP (RIF), able to transfer data files at high speed by using a large number of parallel TCP connections. The approach of the Scalla/xrootd system with respect to this aspect is to make it possible to apply a similar mechanism to the transfer of a sequence of application-requested data chunks. In the default case, the regular TCP window scaling (if supported by the operating systems of both client and server) will be used. The main difference from a copy-like application is that a data processing application generates a stream of requests for data chunks whose sizes are orders of magnitude smaller (some kBs versus up to several megabytes) than the chunk sizes used for copy-like applications, which typically are all equal and quite large. This is supposed to have some impact on the achievable data rate in the case of a data processing application. Moreover, the fact that the data blocks are in general small and have a very variable size and pseudo-random offsets inside the files, adds a layer of complexity to the problem of correctly optimizing their flow. In general, a desirable feature would be the ability of the communication library to: – Split a request for a big data chunk into smaller chunks, if it has to be distributed through multiple parallel TCP streams. – Correctly handle a stream of multiple overlapping requests for small chunks, in order to achieve a higher efficiency in the WAN data transfer, even through a single TCP stream. – Even more important, make all this completely transparent to the application. 5.3 Hiding latency and enhancing parallelism To lower the total impact of latency on the data access, in the client side communication library (XrdClient or TXNetFile) [16,21–23], the following techniques are used: – A memory cache of the variably-sized data blocks read/written. – Asynchronous processing of both read and write requests, while exposing a synchronous API. – Ability for the client to issue asynchronous requests for data blocks, which will be carried out in parallel, and with no need for their responses to be serialized by the server.
F. Furano: Data management in HEP: An approach
Page 11 of 15
Fig. 3. Prefetching and outstanding data requests.
– Client interface allowing the application to provide the client with information about its future data requests. – Ability for the client to keep track of the outstanding requests, in order not to request twice the data chunks which are in transit and to make the application wait only if it tries to access data which is currently outstanding. Figure 3 shows a simple example of how such a mechanism works. The client side communication library can guess about the future data needs, and prefetch some data in advance at any moment. Alternatively, the client API can be used by external mechanisms (like TTreeCache in the ROOT framework [13,3]) to inform it about a sequence of future data chunks to request. This kind of informed prefetching mechanism drastically optimizes the flow of the requests which have to be forwarded to the server, thus hiding the overall data transfer latency, and is used as a buffer to coordinate the decisions about the outstanding data chunks. Moreover, thanks to the intrinsically asynchronous architecture used, the client side communication library is able to receive the data while the application code performs its processing on the previous chunks, thus reducing the impact of the data access on the data processing. For more information about caching and prefetching, the reader is referred to see [24] and [12]. Figure 4 shows some typical scenarios for remote data access, sorted in the way they have been implemented and evaluated. The most obvious one is the first one, where the needed files are transferred locally before starting the computation. This solution, much used for historical reasons, forces the application to wait for the successful completion of the data transfer before being able to start the computation, even if the processed data volume is a fraction of the transferred one. The second scenario refers to the application paying for the network latency for every data request. This makes the data access extremely inefficient, as discussed above. The third one, instead, shows that, if the data caching/prefetching is able to keep the cache miss ratio at a reasonably low level (or the application is able to inform the communication library about the future data requests), the achieved result is to keep the data transfer going in parallel with the computation, but a little in advance with respect to the actual data needs. This method is more efficient than the ones previously described, and in principle could represent a desirable solution to the problem. However, if the data demanding application generates requests for very small data chunks, the overhead of transferring and keeping track of a large number of small outstanding chunks could have a measurable impact. The challenge is to keep this impact as low as possible. If the client side issue is now that individually transferring small chunks (although in a parallel fashion) can degrade the data transfer efficiency, one obvious solution is to collect a number of data requests and issue a single request containing all of them. This technique, known as vectored reads, takes into account the fact that also the response will be constituted by a unique bigger data chunk, containing all the requested ones. This kind of solution can result in
Page 12 of 15
The European Physical Journal Plus
Fig. 4. Simple and advanced scenarios for remote data access technologies.
an advantage over the transfer of individual small chunks. On the other hand, it has the disadvantage of requesting both client, network and server to serialize the composite data request, in order to manage its response as a unique data block to transfer. This means that the application has to wait for the last sub-chunk to arrive before being able to process the first one. Moreover, the server does not send the first sub-chunk until it has finished reading them all. These aspects, in practice, make this solution less appealing and efficient, especially when the data servers are under load. An interesting idea, still to evaluate in practice, is to try to merge the advantages of individual asynchronous chunk transfers with the vectored data transfers, leading to the last scenario visible in fig. 4. The basic idea shown is to request data transfers through vectored reads which: – Contain enough sub-chunks to be able to reduce the protocol overhead of a client/server request. – Generate responses whose data chunk is sufficiently large to be efficiently transferred through the network pipe. – Deal with responses which are small enough to avoid an excessive serialization of the requests sequence. The result would be that the client application reads the data chunks it needs in small bursts, spread through the time line of the application, thus giving a more uniform load profile for the data server, and reducing the negative impact of chunk serialization in the communication pipe.
6 More about storage globalisation The other interesting item related to the recent WAN-oriented features of the Scalla/xrootd suite is the possibility of building a unique repository, composed by several sub-clusters residing in different sites. The xrootd architecture, based on a B-64 tree where servers are organised (or can be configured in order to self-organise) into clusters [22], accommodates by construction the fact that servers can connect across a Wide Area Network, provided that the communication protocol which connects them is robust enough. All this is accomplished with the introduction of a so-called meta-manager host, which the remote sites subscribe to. This new role of a server machine is the client entry point of a unique meta-cluster, which contains all the subscribed sites, potentially spread elsewhere. Of course, one important requirement is that the remote sites are actually able to connect to this meta-manager and that each storage cluster is accessible from the other sites, but, an even more important one is that all the subclusters expose the same coherent name space. In other words, a given data file must be exported with the same file name everywhere, and, of course, the data servers must be accessible from outside. With the possibility of accessing a very efficient coherent repository without having to know the location of a file in advance, we can implement many ideas. For instance, this mechanism is being used in the storage elements belonging to ALICE in order to quickly fetch a file that was supposed to be present on a storage element but for some reason is not accessible anymore. The next paragraph better explains the details of this kind of system and synthetically shows how this is performed in a way that is completely transparent to the client that requests the file.
7 The virtual mass storage system As said, the set of Wide Area Network related features of the Scalla/Xrootd system played a major role in opening the possibility to evolve a storage system in the direction of being able to exploit Wide Area Networks in a productive
F. Furano: Data management in HEP: An approach
Page 13 of 15
way. The major design challenge was how to exploit this in order to augment the overall robustness and usefulness of a large distributed storage deployment. Nevertheless, coming back to our real-life example, in this perspective we must consider the fact that the ALICE computing model [14,2] refers to a production environment, which is used heavily and continuously. Any radical change, hence, was not considered a good idea (given also the high complexity of the whole ALICE software infrastructure). Moreover, the idea of accessing remote data from data processing jobs was not new in the ALICE computing framework. For instance, this is the default way that the processing jobs have to access the so-called “conditions data”, which are stored in a Scalla/Xrootd-based storage cluster at CERN. Each processing job accesses 100–150 conditions data files (and 20000 concurrent jobs are considered a normal activity): historically the WAN choice always proved itself a very good one, with a surprisingly good performance for that task, and a negligible failure rate. On the other hand, the AliEn software [14], supported by its central services, already gives some sort of unique view of the entire repository, implementing namespace coherence across different sites by mapping the file names using a central relational database. Moving towards a model where the storage coherence is given by the storage system itself, and does not need external systems, would give an additional benefit to this key component of the ALICE computing. This advantage would be in the form of a performance enhancement, but also it would avoid the fact that an external system (like a relational database) cannot, by construction, keep easily track of the data files which can be lost due to malfunctioning disks. As already discussed, the form of a “metadata catalog” is by definition immune to this, and, in the recent times, the ALICE computing is slowly moving in this direction. An additional consideration is that, from a functional point of view, we considered as a poor solution the fact that, once a missing file is detected, nothing is done by the storage system to fetch it. This is especially infortunate when, for instance, the file that is missing in a site is available in a well-connected one, and pulling it immediately (and quickly) could be done without interrupting the requesting job. A careful consideration of all these aspects led to thinking that there was a way to enhance the ALICE Scalla/Xrootd-based distributed storage in a way which automatically tries to fix this kind of problems, and at the same time does not disrupt the existing design, allowing for an incremental evolution of the storage design, distributed across several almost independent sites contributing to the project. The generic solution to the problem has been to apply the following principles: – All of the xrootd-based storage elements are aggregated into a unique worldwide meta-cluster, which exposes a unique and namespace-coherent view of their content. – Each site that does not have tapes or similar mass storage systems considers the meta-cluster as its mass storage system. If an authorised request for a missing file comes to a storage cluster, then it tries to fetch it from one of the neighbour ones, as soon and as fast as possible. The host which manages the meta-cluster (which in the Scalla/Xrootd terminology is called meta-manager ) has been called ALICE Global redirector, and the idea for which a site sees the global namespace (to which it participates with its content) as its mass storage backend has been called Virtual Mass Storage System (VMSS). In this context, we consider as very positive statements the facts that: – In the xrootd architecture, locating a file is a very fast task (if the location is unknown to a manager server, finding it takes, in average, only a network round-trip time plus one local file system lookup). – No central repositories of data or metadata are needed for the mechanism to work, everything is performed through real-time message exchanges. – All of the features of the Scalla/Xrootd suite are preserved and enhanced (e.g. scalability, performance, etc.). – The design is completely backward compatible, e.g. in the ALICE case no changes were required to the complex software systems which constitute the computing infrastructure. Hence, the system, as designed, represents a way to deploy and exercise a global real-time view of a unique multiPetabyte distributed repository in an incremental way. This means that it is able to give a better service as more and more sites upgrade their storage clusters and join in. Figure 5 shows a generic small schema of the discussed architecture. In the lower part of the figure we can see three sites with a Scalla/Xrootd storage cluster. If a client instantiated by a processing job (on the leftmost site labelled GSI) does not find the file it needs to open, then the cluster named GSI (see picture) asks the global redirector to get the file, just requesting to copy the file from it. This copy request (which is just another normal xrootd client) is then redirected to the site that has the needed file, which is copied immediately to the leftmost site. The previously paused job can then continue its processing as if nothing particular had happened. Hence, we could say that the general idea about the Virtual Mass Storage is that: – Several distant sites provide a unique high-performance file-system–like data access structure. – They constitute some sort of cloud of collaborating sites, in order to i) make some recovery actions automatic, ii) give an efficient way to move data between them, if used by the overall design.
Page 14 of 15
The European Physical Journal Plus
Fig. 5. An exemplification of the Virtual Mass Storage System.
8 Conclusion and current directions The most obvious aspect that we would like to emphasize is that this was (and is) an ongoing attempt to evolve some methodologies towards what we believe is a desirable direction. The fact itself that such technologies are being exercised in a very heavy production environment, with 10–20K concurrent jobs running 24/7, each one opening up to 100–150 files, is a confirmation of the validity of several ideas, and a starting point that can be used to refine their implementation in future development cycles. By using the two mechanisms described (the possibility of efficiently accessing remote data and the possibility of creating inter-site meta-clusters) many things become possible, as these are very generic mechanisms that encapsulate the technicalities but are not linked in any way to particular deployments. For example, building a true and robust federation of sites now starts becoming much easier and efficient. For instance, one site could host the storage part, another one could host the worker nodes, without having to worry too much about the performance loss due to the latency of the link between the sites. Or both might have a part of the overall storage and appear as one, without having to deal with (generally less stable and very complex) “glue code” in order to build an artificial higher-level view of the resources. So far, all the tests and the production usages have been very successful, from the points of view of both performance and robustness. We believe that providing the possibilities of having a storage system able to work acrosss WANs properly and efficiently is a major accomplishment that will give many benefits, especially when dealing with user-level interactive data processing. As discussed, the features that allow this kind of approach are quite generic, and it is foreseeable that other systems will follow a similar technological path. With respect to the consistency between catalogues and the actual content of the remote site, other approaches are possible, for instance trying to update in real-time the content of the catalogues every time an inconsistency is detected. This approach could be a very interesting addition, that could also complement the VMSS approach, trying instead to fetch the file that was supposedly lost. In this case, both the catalogues and the storage systems would have a way to fix their content and/or propagate changes. One more place for improvements could be in the data processing applications or in the analysis framework that ties them to the data files (which for High Energy Physics are very often written in ROOT format). In this side for
F. Furano: Data management in HEP: An approach
Page 15 of 15
instance we could even now imagine better algorithms that discover the list of the chunks to read from a file. Definitely, this would increase the data access efficiency by feeding the latency hiding mechanisms of the xrootd client. With respect to the choices of the data chunks that may be needed by an application, we are also evaluating some specialisations of the basic read-ahead algorithm of the xrootd client, which can be applied in the case in which the application is not able to generate a list of the data chunks it will need to access. This effort could increase the data access efficiency of such applications with no additional development efforts at their side. Of course, if such applications intend to exploit the full power of the “informed prefetching” strategy, they should become able to indicate in advance exactly the chunks they will access. Other aspects worth facing are those related to the so-called “storage globalisation”. For instance, a working implementation of some form of latency/location-aware load balancing would be a very interesting feature for data analysis applications in the need to access remote files efficiently. At the present time, the xrootd load balancing algorithm, indeed, only considers the load of the servers (network throughput and CPU consumption) as a metric in order to decide where to redirect a client. As always, the challenge would be not only to provide one more feature, but to provide it in a way which is compatible with the requirements for extreme performance and scalability.
References 1. Hierarchical storage management, from Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Hierarchical storage management (July 2009). 2. L. Betev, F. Carminati, F. Furano, C. Grigoras, P. Saiz, The ALICE computing model: an overview, in Third International Conference on Distributed Computing and Grid-technologies in Science and Education, GRID2008, http://grid2008.jinr.ru/. 3. M. Ballintijn, R. Brun, F. Rademakers, G. Roland, Distributed Parallel Analysis Framework with PROOF, http://root. cern.ch/twiki/bin/view/ROOT/PROOF (July 2009). 4. The ALICE home page at CERN, http://aliceinfo.cern.ch/ (July 2009). 5. MONARC: Models of Networked Analysis at Regional Centres for LHC Experiments, http://monarc.web.cern.ch/MONARC/ (Oct 2010). 6. The IEEE Computer Society’s Storage System Standards Working Group, http://ssswg.org/ (January 2009). 7. F. Garcia, A. Calderon, J. Carretero, J.M. Perez, J. Fernandez, A Parallel and Fault Tolerant File System based on NFS Servers, in Euro-PDP’03 (IEEE Computer Society, 2003). 8. S. Ghemawat, H. Gobioff, S.T. Leung, The Google file system, in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (ACM, New York, 2003). 9. A. Bar-Noy, A. Freund, J.S. Naor, On-line Load Balancing in a Hierarchical Server Topology, in Proceedings of the 7th Annual European Symposium on Algorithms, 1999, SIAM J. Comput. 31, 527 (2001). 10. S.R. Waterhouse, D.M. Doolin, G. Kan, Y. Faybishenko, IEEE Internet Comput. 6, 68 (2002). 11. M. Ripeanu, I. Foster, A. Iamnitchi, IEEE Internet Comput. 6, 50 (2002). 12. T.E. Anderson, M. Dahlin, J.M. Neefe, D.A. Patterson, D.S. Rosselli, R.Y. Wang, ACM Trans. Comput. Syst. 14, 41 (1996). 13. ROOT: An Object-Oriented Data Analysis Framework, http://root.cern.ch (July 2009). 14. ALICE: Technical Design Report of the Computing, June 2005, ISBN 92-9083-247-9, http://aliceinfo.cern.ch/ Collaboration/Documents/TDR/Computing.html. 15. ALICE Grid Monitoring with MonALISA. Realtime monitoring of the ALICE activities, http://pcalimonitor.cern.ch/ (July 2009). 16. The Scalla/xrootd Software Suite, http://savannah.cern.ch/projects/xrootd; http://xrootd.slac.stanford.edu/ (July 2009). 17. D. Feichtinger, A.J. Peters, Authorization of Data Access in Distributed Storage Systems, in The 6th IEEE/ACM International Workshop on Grid Computing, 2005, http://ieeexplore.ieee.org/iel5/10354/32950/01542739.pdf?arnumber= 1542739 (July 2009). 18. J.H. Howard et al., ACM Trans. Comput. Syst. 6, 51 (1988). 19. LUSTRE: a scalable, secure, robust, highly-available cluster file system, http://www.lustre.org/ (July 2009). 20. The BABAR Collaboration home page, http://www.slac.stanford.edu/BFROOT (July 2009). 21. F. Furano, A. Hanushevsky, J. Phys. Conf. Ser. 119, 072016 (2008). 22. F. Furano, A. Hanushevsky, Managing commitments in a Multi Agent System using Passive Bids, in IEEE/WIC/ACM International Conference on Intelligent Agent Technology (IAT’05) (2005) pp. 698–701. 23. A. Dorigo, P. Elmer, F. Furano, A. Hanushevsky, XROOTD/TXNetFile: a highly scalable architecture for data access in the ROOT environment, in Proceedings of the 4th WSEAS International Conference on Telecommunications and Informatics, Prague, Czech Republic, March 13 - 15, 2005, edited by M. Husak, N. Mastorakis (World Scientific and Engineering Academy and Society (WSEAS), Stevens Point, Wisconsin) WSEAS Trans. Comput. 4, 348 (2005). 24. R.H. Patterson, G.A. Gibson, E. Ginting, D. Stodolsky, J. Zelenka, Informed prefetching and caching, in Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP’95) (ACM, New York, 1995) pp. 79–95, doi:10.1145/224057.224064.