OpenFabrics Management
The OpenFabrics Management Framework (OFMF) is in development to provide a common framework that simplifies the development of network fabric management applications and tools. It provides a centralized set of tools to control and interact with network fabrics.
Modern HPC and Enterprise computing systems can benefit from a more efficient way to assemble and control network fabrics. The OpenFabrics Alliance (OFA), together with its partners the DMTF, SNIA, and the Gen-Z Consortium, is developing a new open-source fabric management framework to provide a unified set of tools to control and monitor multiple network fabric types.
The increasingly complex computing problems being tackled today create diverse requirements for the fabric management tools and applications needed to operate more architecturally complex computing systems. Developers of such tools and management applications, in turn, are faced with a complex permutation of fabrics (InfiniBand, Gen-Z, Slingshot, and others).
Disaggregated resources, such as memory, storage, compute, and accelerators, are interconnected by high-speed fabrics. With no common way of querying or manipulating such fabrics and resources, a Gordian Knot of fabrics and resource allocation is being created. The victims of this Gordian Knot are the System Administrators, Application Designers, and System Architects who design, deploy, maintain, and use any sort of fabric-based computing system and who must supply their users with reliable, high-performance systems. This includes systems for High-Performance Computing, Machine Learning, Cloud-based systems, and Enterprise environments.
With exciting new advancements in technologies developed for HPC and Cloud computing systems, there are new diverse methods for designing and implementing both shared memory and remote IO storage.
Shared memory accessible by CPU cores, whether on the same server or shared across a network, is important in application design and operation. I/O capacity matters as well. Amdahl's balanced-system rule of thumb (the Amdahl number) states that a system needs about one bit of I/O per second for each instruction per second it executes. Amdahl's IOPS ratio rule adds that programs are expected to perform about one I/O operation for every 50,000 CPU instructions.
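The two rules of thumb above reduce to simple arithmetic. The following is a minimal sketch, assuming the 50,000-instruction figure quoted above and a balanced ratio of one bit of I/O per instruction; the function names are illustrative and not part of any OFMF tool.

```python
# Rule-of-thumb constants taken from the paragraph above.
INSTRUCTIONS_PER_IO = 50_000      # Amdahl's IOPS ratio: one I/O per 50,000 instructions
BALANCED_BITS_PER_INSTRUCTION = 1  # assumed "balanced" Amdahl number of 1 bit per instruction


def required_iops(instructions_per_second: float) -> float:
    """I/O operations per second suggested by the IOPS ratio rule."""
    return instructions_per_second / INSTRUCTIONS_PER_IO


def required_io_bytes_per_second(instructions_per_second: float) -> float:
    """I/O bandwidth (bytes/s) suggested by a balanced Amdahl number of 1."""
    return instructions_per_second * BALANCED_BITS_PER_INSTRUCTION / 8


def amdahl_number(io_bits_per_second: float, instructions_per_second: float) -> float:
    """Ratio of delivered I/O bits to executed instructions; about 1.0 is 'balanced'."""
    return io_bits_per_second / instructions_per_second
```

For example, a core retiring 1e9 instructions per second would call for roughly 20,000 I/O operations per second and 125 MB/s of I/O bandwidth under these rules, which is one way to reason about how much fabric capacity a node might need.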
New heterogeneous and diverse high-speed fabrics are being designed to allow application access to both shared memory and IO storage. Some of these fabrics are being designed to be optimized to provide low-latency memory operations. Other network designs are being optimized to provide high bandwidth.
Currently, no common framework and toolset exists for coordinating, arranging, and composing resources such as high-speed Fabric Attached Memory, non-volatile memory, and multiple concurrent network fabrics, while also providing applications and users with a common security toolkit.
Who?
The OpenFabrics Management Framework is designed for System Administrators, Application Programmers and users, HPC and Cloud Architecture Designers, and other stakeholders who are involved in the design and deployment of stable, high-speed, network-based computing systems.
How?
The OpenFabrics Management Framework (OFMF) provides a universal set of tools and services to manage attached fabrics. The OFMF uses a common language, Redfish, to allow clients to manipulate network fabrics and request information about them. Each vendor-specific fabric can be controlled and manipulated through a custom agent that provides its services and functions to the OFMF via the Redfish API.
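The agent arrangement described above can be pictured as a router that forwards Redfish-style requests to whichever agent owns the fabric. This is a minimal in-memory sketch, not the OFMF implementation: the class names `InMemoryAgent` and `OfmfRouter` are hypothetical, no network calls are made, and only the `/redfish/v1/Fabrics` URI prefix is taken from the Redfish URI convention.

```python
class InMemoryAgent:
    """Hypothetical stand-in for a vendor-specific fabric agent.

    A real agent would talk to the fabric manager or hardware; this one
    serves canned Redfish-style resources keyed by relative path.
    """

    def __init__(self, resources: dict) -> None:
        self._resources = resources

    def get(self, relative_path: str) -> dict:
        return self._resources[relative_path]


class OfmfRouter:
    """Hypothetical front end that routes requests to per-fabric agents."""

    PREFIX = "/redfish/v1/Fabrics"

    def __init__(self) -> None:
        self._agents = {}

    def register_agent(self, fabric_id: str, agent: InMemoryAgent) -> None:
        self._agents[fabric_id] = agent

    def get(self, uri: str) -> dict:
        # The collection itself is answered by the router.
        if uri == self.PREFIX:
            members = [{"@odata.id": f"{self.PREFIX}/{fid}"}
                       for fid in sorted(self._agents)]
            return {"Members": members, "Members@odata.count": len(members)}
        if not uri.startswith(self.PREFIX + "/"):
            raise KeyError(uri)
        # Everything below a fabric is delegated to that fabric's agent.
        fabric_id, _, rest = uri[len(self.PREFIX) + 1:].partition("/")
        return self._agents[fabric_id].get(rest)
```

A client would register one agent per fabric and then read `/redfish/v1/Fabrics/<fabric id>/...` without knowing which vendor's agent answers, which is the uniformity the paragraph above describes.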
The OpenFabrics Management Framework (OFMF) is designed to be versatile and allow clients to connect and interact with underlying high-speed fabrics.
In the diagram above, clients that interact with the OFMF can include libfabric, Kubernetes, Slurm, and myriad other types of clients that might utilize or monitor fabric services. The advantage of a centralized framework is that there is a uniform set of tools that the clients can access to gain insights and manipulate underlying fabrics.
The OFMF provides tools that interact with Redfish and Swordfish, which together provide both a database and a set of methods for creating a virtual mirror image of a physical set of fabrics. The OFMF can integrate, manipulate, monitor, and control multiple fabrics at the same time. In addition, the OFMF is being designed to choose optimal options for fabrics.
A fabric in the ‘Redfish domain’ can be modeled as a group of endpoints, resources, zones, and zones-of-zones. An endpoint can be considered a destination, such as a server connected by a network card, or a switch port. A resource can be considered a component that provides services to a fabric. A zone can be considered a set of endpoints and resources that form an integrated unit, such as a collection of remote memory. Finally, a zone-of-zones can be considered a unit, or collection, of zones.
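The four concepts in this paragraph map naturally onto a small data structure. The sketch below is illustrative only: the class and field names are assumptions chosen for readability and are not the Redfish schema definitions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Endpoint:
    """A destination on the fabric, e.g. a server NIC or a switch port."""
    name: str


@dataclass(frozen=True)
class Resource:
    """A component that provides services to the fabric, e.g. a memory module."""
    name: str
    kind: str


@dataclass
class Zone:
    """A set of endpoints and resources treated as one integrated unit."""
    name: str
    endpoints: list = field(default_factory=list)
    resources: list = field(default_factory=list)


@dataclass
class ZoneOfZones:
    """A collection of zones."""
    name: str
    zones: list = field(default_factory=list)

    def all_endpoints(self) -> set:
        # Flatten the endpoints of every contained zone, dropping duplicates.
        return {ep for zone in self.zones for ep in zone.endpoints}
```

Because endpoints are hashable value objects, the same endpoint can appear in several zones and still be counted once when a zone-of-zones is flattened.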
In the Redfish model there is no notion of physical separation. Thus a zone of memory, for instance, could be made up of two separate memory chunks that are routable to each other yet are not located in the same endpoint. The model above depicts only the logical resources, not the physical connections. Redfish models the physical fabric topology by associating actual fabric ports and fabric links, as shown below:
Use Case Descriptions
Create a Connection to Fabric Attached Memory
- In this Use Case, a client requests a fabric endpoint connection between a server and Fabric Attached Memory (FAM) with read/write permissions and no fabric encryption, selecting the connection with the lowest latency, highest bandwidth, and at least one redundant path in active/active mode.
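The selection criteria in this use case can be sketched as a filter followed by a ranking. This is a minimal sketch under stated assumptions: the `CandidateConnection` fields are hypothetical, and because "lowest latency" and "highest bandwidth" can conflict, the example assumes latency is compared first and bandwidth only breaks ties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateConnection:
    """Hypothetical description of one possible server-to-FAM connection."""
    name: str
    latency_ns: float
    bandwidth_gbps: float
    redundant_paths: int  # additional active paths beyond the primary


def pick_connection(candidates, min_redundant_paths=1):
    """Return the eligible candidate with the lowest latency.

    Candidates lacking the required redundancy are discarded first.
    Ties on latency go to the higher bandwidth (an assumed policy).
    """
    eligible = [c for c in candidates if c.redundant_paths >= min_redundant_paths]
    if not eligible:
        raise ValueError("no candidate offers the required redundant paths")
    return min(eligible, key=lambda c: (c.latency_ns, -c.bandwidth_gbps))
```

A different policy, such as a weighted score over latency and bandwidth, would only change the `key` function; the redundancy requirement stays a hard filter.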
K8s Cluster Create a Zone Use Case Description
- In this Use Case, use a Redfish Composition to allocate a zone object that defines a virtual, private network within a larger fabric.
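As a rough illustration of what a zone request might carry, the sketch below builds a JSON body for a zone of endpoints. It is an assumption-laden example: the property names (`ZoneType`, `Links`, `Endpoints`) and the `/redfish/v1/Fabrics/<id>/Zones` path reflect the general shape of the DMTF Redfish Zone resource and should be checked against the published schema; the helper name `build_zone_request` is hypothetical; and the full Composition Service workflow is not shown.

```python
import json


def build_zone_request(fabric_id: str, zone_name: str, endpoint_ids: list):
    """Return (target URI, JSON body) for creating a zone of endpoints.

    Nothing is sent over the network; the caller decides how to POST it.
    """
    base = f"/redfish/v1/Fabrics/{fabric_id}"
    body = {
        "Name": zone_name,
        "ZoneType": "ZoneOfEndpoints",  # assumed value; verify against the Zone schema
        "Links": {
            "Endpoints": [
                {"@odata.id": f"{base}/Endpoints/{endpoint_id}"}
                for endpoint_id in endpoint_ids
            ]
        },
    }
    return f"{base}/Zones", json.dumps(body, indent=2)
```

A Kubernetes-side client could call `build_zone_request("GenZ", "k8s-cluster-a", ["node1", "node2"])` and submit the result to the OFMF; the fabric and endpoint identifiers here are placeholders.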
Slurm Allocate a Batch Use Case Description
- In this Use Case, use a Redfish Composition to interact with Slurm to define a zone of nodes within a larger fabric.
Create a Fabric Attached Memory Block
- In this Use Case, provide a block of Fabric Attached Memory.
Proof-of-Concept Using the Gen-Z Fabric
The Gen-Z Consortium, the OpenFabrics Alliance, the DMTF, and the Storage Networking Industry Association (SNIA) are planning to demonstrate key features of the new OFMF and in-band Gen-Z fabric management at the upcoming SC21 Supercomputing Conference in St. Louis, Missouri.
The purpose of the proof-of-concept is to allow a user to interact with a GUI to reach across a Gen-Z fabric and assemble combinations of Fabric Attached Memory. The OFMF provides the means for clients to interact with the fabric. The clients consist of a User, a Fabric Attached Memory Manager, and a Composition Manager.
The Gen-Z Subnet Manager, Zephyr, will be responsible for initializing fabric endpoints, switches, and switch ports of the underlying high-speed fabric. The compute servers run the Gen-Z local host services, Llamas, which will map the Gen-Z resources bound to the servers by Zephyr and enable applications to access those resources.
On the left side of the drawing is the logical block diagram, with devices labeled per their Redfish logical representation. Each memory module is connected by a cable to a switch port. The following diagram is the Redfish bubble diagram of the fabric resources and their physical structures as initially discovered.
Clients of the OFMF can use the Redfish API to request logical connections (e.g., bindings) between the various systems (servers) and the memory resources, thus configuring individual systems using the disaggregated fabric resources. The OFMF Services will update the Redfish fabric model and send the proper commands to the Gen-Z fabric manager (Zephyr) to configure the actual fabric to match.
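The binding flow described here (update the Redfish model, then command the fabric manager so the real fabric matches) can be sketched as follows. Everything in this example is hypothetical: `RecordingFabricManager` merely records commands in place of Zephyr, and the command and model layouts are invented for illustration rather than taken from the proof-of-concept code.

```python
class RecordingFabricManager:
    """Hypothetical stand-in for the Gen-Z fabric manager (Zephyr).

    It only records the commands it receives so the flow can be inspected.
    """

    def __init__(self) -> None:
        self.commands = []

    def apply(self, command: dict) -> None:
        self.commands.append(command)


class OfmfService:
    """Hypothetical OFMF service that keeps model and fabric in step."""

    def __init__(self, fabric_manager: RecordingFabricManager) -> None:
        self.model = {"Connections": []}  # simplified fabric model
        self._fabric_manager = fabric_manager

    def bind(self, system: str, memory: str) -> dict:
        """Record a system-to-memory binding, then configure the fabric."""
        connection = {"Initiator": system, "Target": memory}
        # Step 1: update the logical (Redfish-style) model.
        self.model["Connections"].append(connection)
        # Step 2: tell the fabric manager to make the hardware match.
        self._fabric_manager.apply({"op": "bind", **connection})
        return connection
```

A production service would also need to roll back the model if the fabric manager rejects a command; that error path is omitted here to keep the two-step flow visible.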
Ongoing work is being performed at: https://github.com/OFMFWG