Disclaimer: This version of the protocol specification is missing all updates to the protocol since Linux 4.0.
Transparent Inter Process Communication Protocol
Abstract
This document specifies TIPC, a protocol specially designed for efficient communication within clusters of loosely coupled nodes.
TIPC provides two types of services to its applications:
An "all-in-one" L2 or L3 based message transport service:
- Reliable datagram unicast, anycast and multicast.
- Reliable connections with stream or message transport.
- Location transparent service addressing.
- Multi-binding of addresses.
- Supervised node-to-node transport links with loss-free failover.
A service and topology tracking function:
- Tracking nodes, processes, sockets, addresses and connections.
- Subscription/event function for service/functional addresses.
- Immediate feedback on service, topology or connectivity changes.
- Automatic neighbor discovery.
Status of This Memo
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as “work in progress.”
This Internet-Draft will expire on August 17, 2014.
Copyright Notice
Copyright (c) 2014 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
Table of Contents
   1.  Introduction
   2.  Conventions
   3.  Overview
     3.1.  Background
       3.1.1.  Existing Protocols
       3.1.2.  Assumptions
     3.2.  Architectural Overview
     3.3.  Functional Overview
       3.3.1.  API Adapters
       3.3.2.  Address Subscription
       3.3.3.  Address Distribution
       3.3.4.  Address Translation
       3.3.5.  Multicast
       3.3.6.  Connection Supervision
       3.3.7.  Routing and Link Selection
       3.3.8.  Neighbour Detection
       3.3.9.  Link Establishment/Supervision
       3.3.10. Link Failover
       3.3.11. Fragmentation/Reassembly
       3.3.12. Bundling
       3.3.13. Congestion Control
       3.3.14. Sequence and Retransmission Control
       3.3.15. Bearer Layer
     3.4.  Fault Handling
       3.4.1.  Fault Avoidance
       3.4.2.  Fault Detection
       3.4.3.  Fault Recovery
       3.4.4.  Overload Protection
     3.5.  Terminology
     3.6.  Abbreviations
   4.  TIPC Features
     4.1.  Network Topology
       4.1.1.  Network
       4.1.2.  Zone
       4.1.3.  Cluster
       4.1.4.  Node
     4.2.  Link
     4.3.  Port
     4.4.  Message
       4.4.1.  Taxonomy
       4.4.2.  Format
     4.5.  Addressing
       4.5.1.  Location Transparency
       4.5.2.  Network Address
       4.5.3.  Port Identity
       4.5.4.  Port Name
       4.5.5.  Port Name Sequence
       4.5.6.  Multicast Addressing
       4.5.7.  Publishing Scope
       4.5.8.  Lookup Policies
   5.  Port-Based Communication
     5.1.  Payload Messages
       5.1.1.  Payload Message Types
       5.1.2.  Payload Message Header Sizes
       5.1.3.  Payload Message Format
       5.1.4.  Payload Message Delivery
     5.2.  Connectionless Communication
     5.3.  Connection-based Communication
       5.3.1.  Connection Setup
       5.3.2.  Connection Shutdown
       5.3.3.  Connection Abortion
       5.3.4.  Connection Supervision
       5.3.5.  Flow Control
       5.3.6.  Sequentiality Check
     5.4.  Multicast Communication
   6.  Name Table
     6.1.  Distributed Name Table Protocol Overview
     6.2.  Name Distributor Message Processing
     6.3.  Name Distributor Message Format
     6.4.  Name Publication Descriptor Format
   7.  Links
     7.1.  TIPC Internal Header
       7.1.1.  Internal Message Header Format
       7.1.2.  Internal Message Header Fields Description
     7.2.  Link Creation
       7.2.1.  Link Setup
       7.2.2.  Link Activation
       7.2.3.  Link MTU Negotiation
       7.2.4.  Link Continuity Check
       7.2.5.  Sequence Control and Retransmission
       7.2.6.  Message Bundling
       7.2.7.  Message Fragmentation
       7.2.8.  Link Congestion Control
       7.2.9.  Link Load Sharing vs Active/Standby
       7.2.10. Load Sharing
       7.2.11. Active/Standby
     7.3.  Link Failover
       7.3.1.  Active Link Failure
       7.3.2.  Standby Link Failure
       7.3.3.  Second Link With Same Priority Comes Up
       7.3.4.  Second Link With Higher Priority Comes Up
       7.3.5.  Link Deletion
       7.3.6.  Message Bundler Protocol
       7.3.7.  Link State Maintenance Protocol
       7.3.8.  Link Changeover Protocol
       7.3.9.  Message Fragmentation Protocol
   8.  Broadcast Link
     8.1.  Broadcast Protocol
     8.2.  Piggybacked Acknowledge
     8.3.  Coordinated Acknowledge Interval
     8.4.  Coordinated Broadcast of Negative Acknowledges
     8.5.  Replicated Delivery
     8.6.  Congestion Control
   9.  Neighbor Detection
     9.1.  Neighbor Detection Protocol Overview
       9.1.1.  Link Request Message Processing
       9.1.2.  Link Response Message Processing
       9.1.3.  Link Discovery Message Format
       9.1.4.  Media Address Formats
   10. Topology Service
     10.1.  Topology Service Semantics
     10.2.  Topology Service Protocol
       10.2.1.  Subscription Message Format
       10.2.2.  Event Message Format
     10.3.  Monitoring Service Topology
     10.4.  Monitoring Physical Topology
   11. Configuration Service
     11.1.  Configuration Service Semantics
     11.2.  Configuration Service Protocol
       11.2.1.  Command Message Format
   12. Security Considerations
   13. IANA Considerations
   14. Contributors
   15. Acknowledgements
   16. References
     16.1.  Normative References
     16.2.  Informative References
   Appendix A.  Change Log
   Appendix B.  Remaining Issues
1. Introduction
This section explains the rationale behind the development of the Transparent Inter Process Communication (TIPC) protocol. It also gives a brief introduction to each service provided by this protocol, as well as the basic concepts needed to understand the further description of the protocol in this document.
2. Conventions
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119] (Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels,” March 1997.).
3. Overview
This section explains the rationale behind the development of the Transparent Inter Process Communication (TIPC) protocol. It also gives a brief introduction to the services provided by this protocol, as well as the basic concepts needed to understand the further description of the protocol in this document.
3.1. Background
There are no standard protocols available today that fully satisfy the special needs of application programs working within highly available, dynamic cluster environments. Clusters may grow or shrink by orders of magnitude, having member nodes crashing and restarting, having routers failing and replaced, having functionality moved around due to load balancing considerations, etc. All this must be handled without significant disturbances of the service(s) offered by the cluster. To minimize the effort by the application programmers to deal with such situations, and to maximize the chance that they are handled in a correct and optimal way, the cluster internal communication service should provide special support helping the applications to adapt to changes in the cluster. It should also, when possible, leverage the special conditions present within cluster environments to present a more efficient and more fault-tolerant communication service than more general protocols are capable of. This is the purpose of TIPC.
Version 1 of the TIPC protocol was proprietary, and has been widely deployed in Ericsson's customer networks. This document describes version 2 of the protocol. An open source implementation of version 2 is available as part of the standard Linux kernel at www.kernel.org.
3.1.1. Existing Protocols
TCP [RFC0793] (Postel, J., “Transmission Control Protocol,” September 1981.) has the advantage of being ubiquitous, stable, and well known by most programmers. Its most significant shortcomings in a real-time cluster environment are the following:
- It lacks any notion of service addressing and addressing transparency. Mechanisms exist (DNS, CORBA Naming Service) for transparent and dynamic lookup of the correct IP address of a destination, but these are in general too static and expensive to use.
- TCP has non-optimal performance, especially for intra-node communication and for short messages in general. For intra-node communication there are other and more efficient mechanisms available, at least on Unix, but then the location of the destination process has to be assumed, and can not be changed. It is desirable to have a protocol working efficiently for both intra-node and inter-node messaging, without forcing the user to distinguish between these cases in his code.
- The rather heavy connection setup/shutdown scheme of TCP is a disadvantage in a dynamic environment. The minimum number of signaling packets exchanged for even the shortest TCP transaction is seven (SYN, SYNACK etc.), while with TIPC this can be reduced to two, or even to zero if "implicit connect" is used.
- The connection-oriented nature of TCP makes it impossible to support true multicast.
SCTP [RFC2960] (Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M., Zhang, L., and V. Paxson, “Stream Control Transmission Protocol,” October 2000.) is message oriented, it provides some level of user connection supervision, message bundling, loss-free changeover, and a few more features that may make it more suitable than TCP as an intra-cluster protocol. Otherwise, it has all the drawbacks of TCP already listed above.
Apart from these weaknesses, neither TCP nor SCTP provide any topology information/subscription service, something that has proven very useful both for applications and for management functionality operating within cluster environments.
Both TCP and SCTP are general purpose protocols, in the sense that they can be used safely over the Internet as well as within a closed cluster. This apparent advantage is also their major weakness: they require functionality and header space to deal with situations that never, or only infrequently, occur within clusters.
3.1.2. Assumptions
TIPC has been designed based on the following assumptions, empirically known to be valid within most clusters.
- Most messages cross only one direct hop.
- Transfer time for most messages is short.
- Most messages are passed over intra-cluster connections.
- Packet loss rate is normally low; retransmission is infrequent.
- Available bandwidth and memory is normally high.
- For all relevant bearers, packets are check-summed by hardware.
- The number of inter-communicating nodes is relatively static and limited at any moment in time.
- Security is a less crucial issue in closed clusters than on the Internet.
These assumptions allow TIPC to use a simple, traffic-driven, fixed-size sliding window protocol located at the signalling link level, rather than a timer-driven transport level protocol. This in turn leads to other benefits, such as earlier release of transmission buffers, earlier packet loss detection and retransmission, and earlier detection of node unavailability, to mention but a few. Of course, situations with long transfer delays, high loss rates, long messages, security issues, etc. must also be dealt with, but they are treated as exceptions rather than as the general rule.
3.2. Architectural Overview
TIPC should be seen as a layer between an application using TIPC and a packet transport service such as Ethernet, InfiniBand, UDP, TCP, or SCTP. The latter are denoted by the generic term "bearer service", or simply "bearer", throughout this document.
TIPC provides reliable transfer of user messages between TIPC users, or more specifically between two TIPC ports, which are the endpoints of all TIPC communication. A TIPC user normally means a user process, but may also be a kernel-level function or a driver.
In standard terminology, TIPC spans the transport, network, and signalling link layers, although this does not prevent it from using another transport-level protocol as bearer, so that e.g. a TCP connection may serve as bearer for a TIPC signalling link.
            Node A                                       Node B
     -------------------                             -------------------
     |      TIPC       |                             |      TIPC       |
     |   Application   |                             |   Application   |
     |-----------------|                             |-----------------|
     |                 |                             |                 |
     |      TIPC       |TIPC address     TIPC address|      TIPC       |
     |                 |                             |                 |
     |-----------------|                             |-----------------|
     | L2 or L3 Bearer |Bearer address Bearer address| L2 or L3 Bearer |
     |     Service     |                             |     Service     |
     -------------------                             -------------------
              |                                               |
              +---------------- Bearer Transport -------------+

                     Figure 1: Architectural view of TIPC
3.3. Functional Overview
Functionally TIPC can be described as consisting of several layers performing different tasks, as shown in Figure 2 (Functional view of TIPC).
                             TIPC User
   ----------------------------------------------------------
      -------------    -------------
      |  Socket   |    | Other API |
      |  Adapter  |    | Adapters..|
      -------------    -------------
   ==========================================================
      ----------------------------------------------------
      | Address       | Address                          |
      | Subscription  | Resolution                       |
      |---------------+----------------------------------|
      | Address Table | Connection Supervision           |
      | Distribution  | Routing/Link Selection           |
      |--------------------------------------------------|
      |           | Neighbour Detection                  |   Node
      | Multicast | Link Establishment/Supervision       | <-------
      |           | Link Failover                        |  Internal
      |           |--------------------------------------|
      |           | Fragmentation/Defragmentation        |
      |           |--------------------------------------|
      |           | Bundling                             |
      |           |--------------------------------------|
      |           | Congestion Control                   |
      |           |--------------------------------------|
      |           | Sequence/Retransmission Control      |
      ----------------------------------------------------
   ==========================================================
      ------------   -------   -------   ------------
      | Ethernet |   | UDP |   | TCP |   | Mirrored |
      |          |   |     |   |     |   | Memory   |
      ------------   -------   -------   ------------

                  Figure 2: Functional view of TIPC
3.3.1. API Adapters
TIPC makes no assumptions about which APIs should be used, except that they must allow access to the TIPC services. It is possible to provide all functionality via a standard socket interface, an asynchronous port API, or any other dedicated interface that may be warranted. These layers MUST support transport-level congestion control and overload protection.
3.3.2. Address Subscription
The service "Topology Information and Subscription" provides the ability to query, and if necessary subscribe to, the availability of a service address, and thereby determine the availability of an associated physical/virtual resource or service.
This can be used by a distributed application to synchronize its startup, and may even serve as a simple, distributed event channel.
3.3.3. Address Distribution
Service addresses and their associated physical addresses must be equally available within the whole cluster. For performance and fault tolerance reasons it is not acceptable to keep the necessary address tables in one node; instead, TIPC must ensure that they are distributed to all nodes in the cluster, and that they are kept consistent at all times. This is the task of the Address Distribution Service, also called the Name Distribution Service.
3.3.4. Address Translation
The translation from a service address to a physical address is performed on-the-fly during message sending by this functional layer. This step must use an efficient algorithm, and multiple translations of a service address should be avoided where possible.
It is possible to bypass address translation altogether when sending messages if the sender is able to use a physical address as the destination address. For example, this can be done when a server responds to a connection setup request, or when communication between two applications occurs over an already established connection.
3.3.5. Multicast
This layer, supported by the three underlying layers, provides a reliable intra-cluster broadcast service, typically defined as a semi-static multicast group over the underlying bearer. It also provides the same features as an ordinary unicast link, such as message fragmentation, message bundling, and congestion control.
3.3.6. Connection Supervision
There are several mechanisms to ensure immediate detection and reporting of connection failure.
3.3.7. Routing and Link Selection
This is the step of finding the correct destination node, plus selecting the right link to use for reaching that node. If the destination turns out to be the local node, the rest of the stack is bypassed, and the message is sent directly to the receiving port.
3.3.8. Neighbour Detection
When a node is started it must make the rest of the cluster aware of its existence, and itself learn the topology of the cluster. By default this is done by use of broadcast, but there are other methods available.
3.3.9. Link Establishment/Supervision
Once a neighbouring node has been detected on a bearer, a signalling link is established towards it. The functional state of that link has to be supervised continuously, and proper action taken if it fails.
3.3.10. Link Failover
TIPC on a node establishes one link per destination node and functional bearer instance, typically one per configured Ethernet interface. Normally these run in parallel and share load equally, but special care has to be taken during the transition period when a link comes up or goes down, to ensure the guaranteed cardinality and sequentiality of the message delivery. This is done by this layer.
3.3.11. Fragmentation/Reassembly
When necessary TIPC fragments and reassembles messages that can not be contained within one MTU-size packet.
3.3.12. Bundling
Whenever there is a congestion situation, i.e. when a bearer or a link can not immediately send a packet as requested, TIPC starts to bundle messages into packets already waiting to be sent. When the congestion abates, the waiting packets are sent immediately and unbundled at the receiving node.
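The bundling behaviour described above can be sketched as follows. This is an illustrative model only: the queue layout, the `enqueue` function and the MTU value are assumptions made for the example, not structures mandated by the protocol.

```python
# Hypothetical sketch of TIPC-style message bundling: while a link is
# congested, new messages are appended to the last packet already waiting
# in the send queue, as long as the result still fits within the link MTU.

MTU = 1500  # assumed Ethernet-size MTU

def enqueue(send_queue, msg, congested):
    """Queue a message, bundling it into the last waiting packet if possible."""
    if congested and send_queue:
        last = send_queue[-1]
        if sum(len(m) for m in last) + len(msg) <= MTU:
            last.append(msg)      # bundle into the waiting packet
            return
    send_queue.append([msg])      # otherwise start a new packet

queue = []
enqueue(queue, b"a" * 700, congested=False)
enqueue(queue, b"b" * 700, congested=True)   # fits: bundled into packet 1
enqueue(queue, b"c" * 700, congested=True)   # would exceed MTU: new packet
```

When the congestion abates, each element of `queue` would be sent as one bearer packet and split back into individual messages at the receiver.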
3.3.13. Congestion Control
When a bearer instance becomes congested, i.e. when it is unable to accept more outgoing packets, all links on that bearer are marked as congested, and no further messages are sent over those links until the bearer opens up for traffic again. During this transition time messages are queued or bundled on the links, and then sent once the congestion has abated. A similar mechanism is used when the send window of a link becomes full, but it affects only that particular link.
3.3.14. Sequence and Retransmission Control
This layer ensures the cardinality and sequentiality of packets over a link.
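A minimal sketch of the fixed-size sliding window this layer relies on, under stated assumptions: the window size, class and method names are illustrative, not taken from the protocol definition.

```python
# Illustrative per-link sliding window: at most WINDOW packets may be in
# flight, and an acknowledged sequence number releases all buffered
# packets up to and including it, freeing them for garbage collection.

WINDOW = 50  # assumed link send window size

class LinkSender:
    def __init__(self):
        self.next_seq = 0
        self.unacked = {}          # seq -> packet, kept for retransmission

    def can_send(self):
        return len(self.unacked) < WINDOW

    def send(self, packet):
        assert self.can_send()
        self.unacked[self.next_seq] = packet
        self.next_seq += 1

    def ack(self, acked_seq):
        """Peer acknowledges everything up to and including acked_seq."""
        for seq in [s for s in self.unacked if s <= acked_seq]:
            del self.unacked[seq]

s = LinkSender()
for i in range(3):
    s.send(f"pkt{i}")
s.ack(1)            # releases pkt0 and pkt1; pkt2 remains buffered
```

Because the window is traffic-driven rather than timer-driven, lost packets are detected as soon as the peer reports a gap, consistent with the assumptions in Section 3.1.2.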
3.3.15. Bearer Layer
This layer adapts to some connectionless or connection-oriented transport service, providing the necessary information and services to enable the upper layers to perform their tasks.
3.4. Fault Handling
Most functions for improving system fault tolerance are described elsewhere, under the respective functions, but some aspects deserve separate mention.
3.4.1. Fault Avoidance
Strict Source Address Check : After the neighbour detection phase, a message arriving at a node must not only carry a valid Previous Node address; that address must also belong to one of the nodes known to have a direct link to the receiving node. A node may in practice be aware of at most a few hundred such nodes, while a network address is 32 bits long. The risk of accepting a garbled message having a valid address within that range, a sequence number that fits into the reception window, and otherwise valid header fields is therefore extremely small, no doubt less than one in several billion.
Sparse Port Address Space : As an extra measure, TIPC uses a 32-bit pseudo-random number as the first part of a port identity. This gives extra protection against corrupted messages, or against obsolete messages arriving at a node after long delays. Such messages will not find any destination port, and an attempt will be made to return them to the sender port. If there is no valid sender port, the message is quietly discarded.
Name Table Keys : When a name table is updated with a new publication, each publication is qualified with a Key field known only to the publishing port. This key must be presented and verified, in all instances of the name table, when the publication is withdrawn. If the key does not match, the withdrawal is refused.
Link Selectors : Whenever a message/packet is sent or routed, the link used for the next-hop transport is always selected in a deterministic way, based on the sender port's random number. The risk of packets arriving out of order is hence eliminated.
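The deterministic link selection above can be sketched as follows; the modulo rule and function name are an illustrative choice for the example, not mandated wording from this specification.

```python
# Hedged sketch of a link selector: every message from a given port is
# mapped onto the same link of a parallel link set, so packets from one
# sender can never be reordered across links.

def select_link(links, sender_port_ref):
    """Deterministically pick a link from the sender port's random reference."""
    return links[sender_port_ref % len(links)]

links = ["link-eth0", "link-eth1"]
ref = 0x5D3A91C7        # pseudo-random part of a (hypothetical) port identity
assert select_link(links, ref) == select_link(links, ref)  # always the same
```

Different ports hash onto different links, so a parallel link pair still shares load across the node as a whole.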
Repeated Name Lookups : If a lookup in the name table has returned a port identity that later turns out to be invalid, TIPC performs up to six new lookups before giving up and rejecting the message.
3.4.2. Fault Detection
The mechanisms for fault detection have been described in previous sections, but some of them will be briefly repeated here:
Transport Level Sequence Number, to detect disordered multi-hop packets.
Connection Supervision and Abortion mechanism.
Link Supervision and Continuity control.
3.4.3. Fault Recovery
When a failure has been detected, several mechanisms are used to eliminate the impact of the problem, or, when that is impossible, to help the application recover from it:
Link Failover: When a link fails, its traffic is directed over to the redundant link, if any, in such a way that message sequentiality and cardinality is preserved. This feature is described in Section 7.3 (Link Failover).
Returning Messages to Sender : When no destination is found for a message, the first 1024 bytes of it are returned to the sender port, along with a descriptive error code. This helps the application identify the exact point of failure and, if possible, find a new destination for the failed call. The complete list of error codes and their significance is described in Figure 9 (TIPC Error Codes).
3.4.4. Overload Protection
To overcome situations where the congestion/flow control mechanisms described earlier in this section are inadequate or insufficient, TIPC must provide an additional overload protection service:
Process Overload Protection
TIPC must maintain a counter for each process, or if this is impossible, for each port, keeping track of the total number of pending, unhandled payload messages for that process or port. When this counter reaches a critical value, which should be configurable, TIPC must selectively reject new incoming messages. Which messages to reject should be based on the same criteria as for the node overload protection mechanism, but all thresholds must be set significantly lower. Empirically, a ratio of 2:1 between the node-global thresholds and the port-local thresholds has worked well.
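A small sketch of such a rejection policy. Only the 2:1 node/port ratio follows the text above; the absolute threshold values and the importance-based scaling are assumptions made for the example.

```python
# Hedged sketch of port-level overload protection: new messages are
# rejected once a pending-message counter becomes critical, with
# higher-importance messages tolerating fuller queues (assumed policy).

PORT_THRESHOLD = 1000                    # assumed port-local limit
NODE_THRESHOLD = 2 * PORT_THRESHOLD      # node-global limit, 2:1 ratio

def accept_message(importance, port_pending, node_pending):
    """Decide whether a new incoming message may be queued.

    importance: 0 (LOW) .. 3 (CRITICAL).
    """
    scale = 1 + importance               # assumed per-importance scaling
    if node_pending >= NODE_THRESHOLD * scale:
        return False
    return port_pending < PORT_THRESHOLD * scale

assert accept_message(0, 999, 0)
assert not accept_message(0, 1000, 0)
assert accept_message(3, 1000, 0)        # critical messages still accepted
```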
3.5. Terminology
This section defines terms whose meaning may otherwise be unclear or ambiguous.
Application: A user-written program that directly utilizes TIPC for communication.
Bearer: An instance of a physical or logical transport medium, such as Ethernet, ATM/AAL or DCCP, over which messages can be sent.
Broadcast: The sending of a message to all other nodes in the sender's cluster, each of which receives a copy of the message. Note that what is considered a broadcast from the TIPC viewpoint may be mapped onto a multicast at the bearer (Ethernet or DCCP) level.
Connection: A logical channel for passing messages between two ports. Once a connection is established no address need be indicated when sending a message from either of the endpoints.
Cluster: A collection of nodes that are directly interconnected (i.e. fully meshed). All nodes in a cluster have network addresses that differ only in their node identifier.
Domain: A subset of topologically related nodes in a TIPC network, normally designated by a network address. For example, <Z.C.N> designates a specific node, <Z.C.0> designates any node within the specified cluster, <Z.0.0> designates any node within the specified zone, and <0.0.0> designates any node within the network.
Functional Address: Synonym for Service Address.
Internal Message: A message that is generated and consumed by an internal TIPC subsystem.
Link: A communication channel connecting two nodes, performing tasks such as message transfer, sequence ordering, retransmission, etc. A pair of nodes may be interconnected by one link on a single bearer, or by a pair of links on two bearers in either a load sharing or an active-plus-standby configuration.
Link Changeover: The act of moving all traffic from a failing link in a link pair to the remaining link, while retaining the original sequence order and cardinality of messages.
Link Endpoint: A communication endpoint, used in pairs by a link to send and receive TIPC messages between two nodes.
Location Transparency: The ability of an application within a cluster to communicate with another application without knowing the physical location of the latter. (This is sometimes called "addressing transparency".)
Message: The fundamental unit of information exchanged between TIPC ports or between TIPC subsystems. Consists of a TIPC message header, followed by from 0 to 66,000 bytes of data.
Message Bundling: The act of aggregating several messages into one packet (typically an Ethernet frame) to minimize the impact of congestion when messages cannot be sent immediately.
Message Fragmentation: The act of dividing a long message into several packets during transmission and later reassembling the fragments into the original message at the receiving end.
Multicast: The sending of a message to multiple TIPC ports, each of which receives a copy of the message.
Name: An alias for Port Name.
Name Sequence: An alias for Port Name Sequence.
Name Table: A TIPC-internal table, existing on each node, that keeps track of the mapping between port names and port identities.
Network: A collection of nodes that can communicate with one another via TIPC. The network may consist of a single node, a single cluster, a single zone, or a group of inter-connected zones.
Network Address: An integer that identifies a node, or set of nodes, within a TIPC network. It is a 32 bit integer, subdivided into three fields (8/12/12), representing a zone, cluster and node identifier, respectively; normally denoted as <Z.C.N>.
Network Identity: An integer that uniquely identifies a TIPC network. Used to keep traffic from different TIPC networks separated from each other when a common bearer is being used; for example, when multiple networks are running on a LAN in a lab environment.
Node: A computer within a TIPC network, uniquely identified by a network address.
Packet: The unit of data sent over a bearer. It may contain one or more complete TIPC messages, or a fragment of one TIPC message.
Payload Message: A message that carries application-related content between applications, or between an application and a service.
Port: A communication endpoint, capable of sending and receiving TIPC messages. Once created, a TIPC port persists until it is deleted by its owner, either explicitly or implicitly. In practice, a TIPC port is embedded in a POSIX socket in all existing implementations, and there is no need for the user application to distinguish between the two concepts.
Port Identity: A physical address that uniquely identifies a TIPC port within a network; normally denoted as <Z.C.N:reference>. Once a port is deleted its identity will not be reissued for a very long time.
Port Name: A service address that identifies a TIPC port as being capable of providing a specific service; normally denoted as {type,instance}. For load sharing and redundancy purposes several ports may bind to the same name; likewise, a single port may bind to multiple names if it provides multiple services.
Port Name Sequence: A mechanism for specifying a range of contiguous port names; normally denoted as {type,lower-instance,upper-instance}.
Service: A TIPC subsystem that communicates with applications or other TIPC subsystems using TIPC ports.
Service Address: A location independent address, identifying a port. Manifested as a Port Name in TIPC.
Scope: A shorthand form for expressing the domain that contains a node, as seen by that node; that is, own-node, own-cluster, or own-zone.
Unicast: The sending of a message to a single node in the network.
Zone: A "super-cluster" of clusters that are directly interconnected (i.e. fully meshed). All nodes in a zone have network addresses that share a common zone identifier.
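The domain wildcard rule from the Domain definition above can be expressed directly in code. This is a sketch of that rule only; the function name and tuple representation are choices made for the example.

```python
# Domain matching per the Domain definition: a zero field acts as a
# wildcard for all levels below it, so <Z.C.0> covers a whole cluster,
# <Z.0.0> a whole zone, and <0.0.0> the entire network.

def in_domain(addr, domain):
    """addr and domain are (zone, cluster, node) tuples; 0 means 'any'."""
    z, c, n = domain
    az, ac, an = addr
    if z == 0:
        return True                  # <0.0.0>: whole network
    if z != az:
        return False
    if c == 0:
        return True                  # <Z.0.0>: whole zone
    if c != ac:
        return False
    if n == 0:
        return True                  # <Z.C.0>: whole cluster
    return n == an                   # <Z.C.N>: one specific node

assert in_domain((1, 2, 3), (1, 2, 0))       # same cluster
assert not in_domain((1, 2, 3), (2, 0, 0))   # different zone
```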
3.6. Abbreviations
API - Application Programming Interface
MAC - Message Authentication Code [RFC2104] (Krawczyk, H., Bellare, M., and R. Canetti, “HMAC: Keyed-Hashing for Message Authentication,” February 1997.)
MTU - Maximum Transmission Unit
RTT - Round Trip Time, the elapsed time from the moment a message is sent to a destination to the moment it arrives back at the sender, provided the message is immediately bounced back by the receiver. Typically on the order of 100 microseconds, process to process, between 2 GHz CPUs connected via a 100 Mbps Ethernet switch.
4. TIPC Features
4.1. Network Topology
From a TIPC addressing viewpoint the network is organized in a four-level hierarchy:
   +------------------------------------------------+    +-------------+
   | Zone <1>                                       |    | Zone <2>    |
   |                                                |    |             |
   |  -------------------    -------------------   |    | ----------- |
   |  | Cluster <1.1>   |    | Cluster <1.2>   |   |    | | Cluster | |
   |  |                 |    |                 |   |    | |  <2.1>  | |
   |  | <1.1.1>         |    | <1.2.1>-<1.2.2> |   |    | | <2.1.1> | |
   |  |    |   \        |    |     \     /     |   |    | |    |    | |
   |  | <1.1.2>-<1.1.3> |    |      <1.2.3>    |   |    | | <2.1.2> | |
   |  -------------------    -------------------   |    | ----------- |
   +------------------------------------------------+    +-------------+

                     Figure 3: TIPC Network Topology
4.1.1. Network
The top level is the TIPC network as such. This is the ensemble of all nodes interconnected via TIPC, i.e. the domain where a node can reach another node by using a TIPC network address. A node wanting to communicate with another node within the network, irrespective of its location in the network hierarchy, must have a direct link to that node. There is no routing in TIPC, i.e., a message can not pass from one node to another via an intermediate node.
It is possible to create distinct, isolated networks, even on the same LAN, reusing the same network addresses, by assigning each network a Network Identity. This identity is not an address, and only serves the purpose of isolating networks from each other. Networks with different identities can not communicate with each other via TIPC.
4.1.2. Zone
It may be convenient for a system administrator to subdivide the nodes in a network into groups, or Zones, by assigning each zone a Zone Identity. A zone identity must be unique, and within the numeric range [1,255].
4.1.3. Cluster
The nodes within a zone may further be grouped into Clusters, by assigning them a Cluster Identity. A cluster identity must be unique within the zone, and within the numeric range [1,4095].
4.1.4. Node
A cluster consists of individual Nodes, each having a unique Node Identity within the cluster. A node identity must be within the numeric range [1,4095].
4.2. Link
The communication channel between a pair of nodes is called a Link. A link delivers units of data between the nodes with guaranteed, ordered delivery. A link is also actively supervised, and will be declared faulty if no traffic has been received from the other endpoint within a configurable amount of time.
There may be many working links between a pair of nodes, but only two links may be actively used for data transport at any moment in time.
4.3. Port
The endpoint of all data traffic inside each node is called a Port, typically accessible to its users via a standard socket API.
4.4. Message
The fundamental unit of information exchanged between TIPC ports or TIPC subsystems is called a Message.
4.4.1. Taxonomy
TIPC messages fall into two main classes.
A "payload message" carries application-specified content between applications, or between applications and TIPC services.
An "internal message" carries TIPC-specified content between TIPC subsystems.
Messages are further categorized based on their use, as indicated below:
User  User Name            Purpose                         Class
----  ---------            -------                         -----
  0   LOW_IMPORTANCE       Low Importance Data             payload
  1   MEDIUM_IMPORTANCE    Medium Importance Data          payload
  2   HIGH_IMPORTANCE      High Importance Data            payload
  3   CRITICAL_IMPORTANCE  Critical Importance Data        payload
  4   USER_TYPE_4          Reserved for future use         n/a
  5   BCAST_PROTOCOL       Broadcast Link Protocol         internal
  6   MSG_BUNDLER          Message Bundler Protocol        internal
  7   LINK_PROTOCOL        Link State Protocol             internal
  8   CONN_MANAGER         Connection Manager              internal
  9   USER_TYPE_9          Reserved for future use         n/a
 10   CHANGEOVER_PROTOCOL  Link Changeover Protocol        internal
 11   NAME_DISTRIBUTOR     Name Table Update Protocol      internal
 12   MSG_FRAGMENTER       Message Fragmentation Protocol  internal
 13   LINK_DISCOVER        Neighbor Detection Protocol     internal
 14   USER_TYPE_14         Reserved for future use         n/a
 15   USER_TYPE_15         Reserved for future use         n/a
Figure 4: TIPC Message Types
4.4.2. Format
Every TIPC message consists of a message header and a data part.
The message header format is user-dependent, and ranges in length from 6 to 11 words. The content of each word in the header is stored as a single 32-bit integer coded in network byte order. A small number of fields are common to all message header formats; the remaining fields are either unique to a single user or utilized by multiple users.
The format of the data part of a message is user-dependent, and ranges in length from 0 to 66,000 bytes.
The message header format and data format for each message user are described in detail in the section describing the message's use.
4.5. Addressing
4.5.1. Location Transparency
TIPC provides two service address types, Port Name and Port Name Sequence, to support location transparency, and two physical address types, Network Address and Port Identity, to be used when physical location knowledge is necessary for the user.
4.5.2. Network Address
A physical entity within a network is identified internally by a TIPC Network Address. This address is a 32-bit integer, structured into three fields: zone (8 MSB), cluster (12 bits), and node (12 LSB). The address is only filled in with as much information as is relevant for the entity concerned, e.g. a zone may be identified as 0x03000000 (<3.0.0>), a cluster as 0x03001000 (<3.1.0>), and a node as 0x03001005 (<3.1.5>). Any of these formats is sufficient for the TIPC routing function to find a valid destination for a message.
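As an illustration (not part of the protocol text), the <Z.C.N> packing described above can be expressed with a few bit operations. The helper names below are chosen for this sketch and are not mandated by the specification:

```python
def tipc_addr(zone, cluster, node):
    """Pack a <Z.C.N> triple into a 32-bit TIPC network address:
    zone in the 8 MSB, cluster in the next 12 bits, node in the 12 LSB."""
    return (zone << 24) | (cluster << 12) | node

def tipc_zone(addr):
    return addr >> 24

def tipc_cluster(addr):
    return (addr >> 12) & 0xFFF

def tipc_node(addr):
    return addr & 0xFFF

# The examples from the text:
assert tipc_addr(3, 0, 0) == 0x03000000   # zone <3.0.0>
assert tipc_addr(3, 1, 0) == 0x03001000   # cluster <3.1.0>
assert tipc_addr(3, 1, 5) == 0x03001005   # node <3.1.5>
```

A zone or cluster address is simply the same packing with the lower fields left as zero, which is why any of the three forms is routable.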
4.5.3. Port Identity
A Port Identity is produced internally by TIPC when a port is created, and is only valid as long as that physical instance of the port exists. It consists of two 32-bit integers. The first one is a random number with a period of 2^31-1, the second one is a fully qualified network address with the internal format as described earlier. A port identity may be used the same way as a port name, for connectionless communication or connection setup, as long as the user is aware of its limitations. The main advantage with using this address type over a port name is that it avoids the potentially expensive binding operation in the destination port, something which may be desirable for performance reasons.
4.5.4. Port Name
A port name is a persistent address typically used for connectionless communication and for setting up connections. Binding a port name to a port roughly corresponds to binding a socket to a port number in TCP, except that the port name is unique and has validity for the whole publishing scope indicated in the bind operation, not only for a specific node. This means that no network address has to be given by the caller when setting up a connection, unless the caller explicitly wants to reach a certain node, cluster or zone.
A port name consists of two 32-bits integers. The first integer is called the Name Type, and typically identifies a certain service type or functionality. The second integer is called the Name Instance, and is used as a key for accessing a certain instance of the requested service.
The type/instance structure of a port name helps support both service partitioning and service load sharing.
When a port name is used as the destination address for a message, it must be translated by TIPC to a port identity before it can reach its destination. This translation is performed on a node within the lookup scope indicated along with the port name.
4.5.5. Port Name Sequence
To give further support for service partitioning, TIPC also provides an address type called Port Name Sequence, or just Name Sequence. This is a three-integer structure defining a range of port names, i.e. a name type plus the lower and upper boundaries of the range. By allowing a port to bind to a sequence, instead of just an individual port name, it is possible to partition the service's range of responsibility into sub-ranges, without having to create a vast number of ports to do so.
There are very few limitations on how name sequences may be bound to ports. One may bind many different sequences, or many instances of the same sequence, to the same port, to different ports on the same node, or to different ports anywhere in the cluster or zone. The only restriction, in reality imposed by the implementation complexity it would involve, is that no partially overlapping sequences of the same name type may exist within the same publishing scope.
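The overlap restriction above can be sketched as a simple admission check. This is an illustrative interpretation, not normative text: "partially overlapping" is read here as ranges of the same type that intersect without being identical, while identical re-publication (multi-binding) and disjoint ranges are accepted:

```python
def overlaps(a, b):
    """True if two (lower, upper) ranges intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

def may_publish(existing, new_seq):
    """Check a new (type, lower, upper) binding against existing ones of the
    same publishing scope. Identical sequences and disjoint sequences are
    allowed; a partial overlap of the same name type is rejected.
    (Hypothetical model of the restriction, not the real implementation.)"""
    ntype, nlo, nhi = new_seq
    for etype, elo, ehi in existing:
        if etype != ntype:
            continue                       # other name types never conflict
        if (elo, ehi) == (nlo, nhi):
            continue                       # identical: multi-binding allowed
        if overlaps((elo, ehi), (nlo, nhi)):
            return False                   # partial overlap: rejected
    return True

table = [(17, 0, 9), (17, 10, 19)]
assert may_publish(table, (17, 0, 9))       # same sequence again: allowed
assert may_publish(table, (17, 20, 29))     # disjoint: allowed
assert not may_publish(table, (17, 5, 14))  # partial overlap: rejected
assert may_publish(table, (42, 5, 14))      # different type: allowed
```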
                            ---------------
                           | Partition B   |
                           |               |
                           O bind(type: 17 |
 -----------------         |      lower:10 |
|                 |        |      upper:19)|
|send(type: 17    |         ---------------
|     instance:7) O------+
|                 |      |
 -----------------       |   ---------------
                         |  | Partition A   |
                         |  |               |
                         +->O bind(type: 17 |
                            |      lower:0  |
                            |      upper:9) |
                             ---------------
Figure 5: Service addressing, using port name and port name sequence
When a port name is used as a destination address it is never used alone, contrary to what is indicated in Figure 5 (Service addressing, using port name and port name sequence). It has to be accompanied by a network address stating the scope and policy for the lookup of the port name. This will be described later.
4.5.6. Multicast Addressing
The concept of service addressing is also used to provide multicast functionality. If the sender of a message indicates a port name sequence instead of a port name, a replica of the message will be sent to all ports bound to a name sequence fully or partially overlapping with the sequence indicated.
                            ---------------
                           | Partition B   |
                           |               |
                      +--->O bind(type: 17 |
 -----------------    |    |      lower:10 |
|                 |   |    |      upper:19)|
|send(type: 17    |   |     ---------------
|     lower:7     O---+--+
|     upper:13)   |      |
 -----------------       |   ---------------
                         |  | Partition A   |
                         |  |               |
                         +->O bind(type: 17 |
                            |      lower:0  |
                            |      upper:9) |
                             ---------------
Figure 6: Service multicast, using port name sequence
Only one replica of the message will be sent to each identified target port, even if it is bound to more than one overlapping name sequence.
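The destination selection and de-duplication just described can be modeled as a set computation: a port qualifies if at least one of its bindings of the same name type overlaps the indicated range, and, being a set member, it receives exactly one replica. This is an illustrative model only; the names below are not from the specification:

```python
def multicast_targets(bindings, mtype, lower, upper):
    """Return the set of ports that should receive one replica of a multicast
    message sent to (mtype, lower, upper). 'bindings' is a list of
    (port, type, lower, upper) tuples. A port bound to several overlapping
    sequences still appears only once, since the result is a set."""
    targets = set()
    for port, btype, blo, bhi in bindings:
        if btype == mtype and blo <= upper and lower <= bhi:
            targets.add(port)
    return targets

bindings = [
    ("A", 17, 0, 9),      # Partition A
    ("B", 17, 10, 19),    # Partition B
    ("B", 17, 15, 19),    # B bound a second time, overlapping its own range
]
# send(type:17, lower:7, upper:13) reaches both partitions, B only once:
assert multicast_targets(bindings, 17, 7, 13) == {"A", "B"}
```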
Whenever possible and considered advantageous, this function will make use of the reliable cluster broadcast service also supported by TIPC.
4.5.7. Publishing Scope
The default visibility scope of a published (bound) port name is the local cluster. If the publication issuer wants to give it some other visibility, this must be indicated explicitly when binding the port. The available scopes are:
Value  Meaning
-----  -------
  1    Visibility within whole own zone
  2    Visibility within whole own cluster
  3    Visibility limited to own node
4.5.8. Lookup Policies
When a port name is looked up in the TIPC internal naming table for translation to a port identity, the following rules apply:
If indicated lookup domain is <Z.C.N>, the lookup algorithm must choose a matching publication from that particular node. If nothing is found on the given node, it must give up and reject the request, even if other matching publications exist within the zone.
If the lookup domain is <Z.C.0>, the algorithm must select round-robin among all matching publications within that cluster, treating node local publications no different than the others. If nothing is found within the given cluster, it must give up and reject the request, even if other matching publications exist within the zone. Note here that if the sender node is not part of the lookup domain, there may be cases where the message is redirected to a third node after lookup. E.g., if a node <Z.C1.N> sends a message with lookup domain <Z.C2.0>, the first lookup will happen on a node in cluster C2, which may quite well redirect the message to a third node in that cluster.
If the lookup domain is <Z.0.0>, the algorithm must select round-robin among all concerned publications within that zone, treating node or cluster local publications no different than the others. If nothing is found, it must give up and reject the request.
A lookup domain of <0.0.0> means that the nearest found publication must be selected. First a lookup with scope <own zone.own cluster.own node> is attempted. If that fails, a lookup with the scope <own zone.own cluster.0> is tried, and finally, if that fails, a lookup with the scope <own zone.0.0>. If that fails the request must be rejected.
Round-robin based lookup means that the algorithm must select equally among all the matching publications within the given scope. In practice this means stepping forward in a circular list referring to those publications between each lookup.
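The circular-list stepping described above can be sketched as follows. This is an illustrative model, not the real name table code; the class name is chosen for this sketch:

```python
class RoundRobinList:
    """Circular list of matching publications; each lookup steps forward,
    distributing translations equally among all entries, as the round-robin
    policy above requires."""
    def __init__(self, publications):
        self.pubs = list(publications)
        self.next = 0

    def lookup(self):
        if not self.pubs:
            return None                  # nothing found: reject the message
        pub = self.pubs[self.next]
        self.next = (self.next + 1) % len(self.pubs)
        return pub

rr = RoundRobinList(["port@<1.1.1>", "port@<1.1.2>", "port@<1.1.3>"])
picks = [rr.lookup() for _ in range(6)]
assert picks == ["port@<1.1.1>", "port@<1.1.2>", "port@<1.1.3>"] * 2
```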
5. Port-Based Communication
All application communication through TIPC is done by passing data ("payload") messages between a sender port and a receiver port.
5.1. Payload Messages
5.1.1. Payload Message Types
TIPC supports four different payload message types:
- Connection Based Messages: Messages sent over an established connection.
- Multicast Messages: Messages sent to all destinations bound to the given port name sequence destination, within the given destination scope.
- Port Name Addressed Messages: These are connection-less messages containing a port name as destination address.
- Direct Addressed Messages: Connection-less messages containing a port identity as destination address.
Figure 8 (TIPC Data Message Types) presents the message types and their corresponding type identifiers in the message header.
5.1.2. Payload Message Header Sizes
The header is organized so that it should be possible to omit certain parts of it, whenever any information is dispensable. The following header sizes are used:
- Connection Based Messages: 24 bytes.
- Multicast Messages: 44 bytes.
- Port Name Addressed Messages: 40 bytes.
- Direct Addressed Messages: 32 bytes.
5.1.3. Payload Message Format
All payload messages share a common base header format. The only difference between the message types is how many bytes of the base header they need to use, as described in Section 5.1.2 (Payload Message Header Sizes).
    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0:| Ver | User  | Hsize |N|D|S|R|          Message Size           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1:|Mtype| Error |Reroute|Lsc| RES |     Broadcast Acknowledge     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2:|        Link Acknowledge       |         Link Sequence         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w3:|                         Previous Node                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w4:|                        Originating Port                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5:|             Destination Port / Destination Network            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w6:|                        Originating Node                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w7:|                        Destination Node                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w8:|                           Name Type                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w9:|              Name Instance / Name Sequence Lower              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
wA:|                      Name Sequence Upper                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   \                                                               \
   /                                                               /
   \                             Data                              \
   /                                                               /
   \                                                               \
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   w0-w5: Required.   w6-wA: Conditional.   Data: Optional
Figure 7: TIPC Payload Message Format
The interpretation of the fields of the message is as follows:
- R, RES, RESERVED: Reserved area. These bits must be zero.
- Ver: 3 bits : Message protocol version number; currently 2. This field is present to facilitate future upgrades of the TIPC protocol.
- User: 4 bits : Message user, as described in Section 4.4.1 (Taxonomy). Payload messages utilize one of: LOW_IMPORTANCE, MEDIUM_IMPORTANCE, HIGH_IMPORTANCE, and CRITICAL_IMPORTANCE. This indicates the message's importance as set by the sending application, and determines how it should be treated in case of network or node overload. The higher the importance, the lower the risk that the message will be rejected or dropped during an overload situation.
- Hsize: 4 bits : The size of the message header, in 32-bit words. Payload messages can have message headers ranging from 6 to 11 words in length, depending on the value of Mtype.
- N: 1 bit : Non-sequenced message flag. If this bit is clear the message is part of the normal flow of messages between link endpoints. Payload messages that are sent to all cluster nodes using the broadcast link have this bit set.
- D: 1 bit : Destination droppable flag. If this bit is set the message should be silently dropped if it cannot be delivered to the specified destination port; if clear, the message should be returned to the originating port.
- S: 1 bit : Source droppable flag. If this bit is set the message should be silently dropped if the sending application is unable to send the message due to congestion; if clear, the sending application should be notified that the send operation was unsuccessful.
- Y: 1 bit : SYN bit. Indicates whether an RDM message is intended to initiate a connection setup.
- Message Size: 17 bits : The size of the message, including both header and data parts, in bytes. The maximum data part of a payload message is 66000 bytes, which makes it possible to tunnel maximum-size IP-messages through TIPC.
- Mtype: 3 bits : Message type. A user-specific value that indicates the exact nature of the message. Payload messages utilize the following values:
Mtype  Mtype Name  Purpose
-----  ----------  -------
  0    CONN_MSG    Data sent over an established connection
  1    MCAST_MSG   Data sent to a port name sequence by multicast
  2    NAMED_MSG   Data sent to a port name address
  3    DIRECT_MSG  Data sent to a port id address
Figure 8: TIPC Data Message Types
- Error: 4 bits : The error status of the message. The following values apply:
Error  Error Name     Meaning
-----  ----------     -------
  0    OK             No error
  1    ERR_NO_NAME    Destination port name is unknown
  2    ERR_NO_PORT    Destination port id does not exist
  3    ERR_NO_NODE    Destination node is unreachable
  4    ERR_OVERLOAD   Destination is congested
  5    CONN_SHUTDOWN  Normal connection shutdown occurred
Figure 9: TIPC Error Codes
- Reroute: 4 bits : The number of port name lookup operations performed on the message. Message is rejected or dropped after six failed lookups.
- Lsc: 2 bits : Lookup Scope. Indicates the scope of the lookup domain that was used during translation from name to port identity in a NAMED_MSG or MCAST_MSG. This information enables subsequent lookup of the original name, if necessary. The following values apply:
Lsc  Meaning
---  -------
 1   Zone Scope
 2   Cluster Scope
 3   Node Scope
Figure 10: TIPC Address Lookup Scopes
- Broadcast Acknowledge: 16 bits : The sequence number of the last in-sequence packet received by the sending node from the recipient node using the broadcast link. This allows the recipient node to release buffers associated with messages successfully transferred over the broadcast link. Messages sent to all cluster nodes over the broadcast link set this value to zero.
- Link Acknowledge: 16 bits : The sequence number of the last in-sequence packet received by the sending node from the recipient node using the link over which the packet is carried. This allows the recipient node to release buffers associated with messages successfully transferred over the link. Messages sent to all cluster nodes over the broadcast link set this value to zero.
- Link Sequence: 16 bits : The sequence number of the packet being transferred from the sending node to the recipient node on the associated link or the broadcast link. This allows the recipient node to detect lost packets, duplicate packets, or out-of-sequence packets.
- Previous Node: 32 bits : The network address of the last node visited by the message. In the case of intra-cluster messages this is usually, but not always, identical to Originating Node.
- Originating Port: 32 bits : The reference part of the port identifier of the originating port from which the message was sent.
- Destination Port: 32 bits : The reference part of the port identifier to which the message was sent; present only in messages sent to a single node over a link. For NAMED_MSG and MCAST_MSG messages this field is set to zero until name lookup has been completed.
- Destination Network: 32 bits : The network identity of the sender; present only in messages sent to all cluster nodes over the broadcast link.
- Originating Node: 32 bits : The network address of the node from which the message originally was sent. In most cases, this field has the same value as the Previous Node field. However, despite the fact that TIPC doesn't support proper routing, there are cases where the fields will differ, because a named message may be redirected after lookup. This is explained in Section 4.5.8 (Lookup Policies).
- Destination Node: 32 bits : The network address of the final destination node for a message. Messages sent to all cluster nodes over the broadcast link set this value to <0.0.0>.
- Name Type: 32 bits : The type part of the port name or port name sequence to which a message of type NAMED_MSG or MCAST_MSG was sent.
- Name Instance: 32 bits : The instance part of the port name to which a message of type NAMED_MSG was sent.
- Name Sequence Lower: 32 bits : The lower boundary of the port name sequence to which a message of type MCAST_MSG was sent.
- Name Sequence Upper: 32 bits : The upper boundary of the port name sequence to which a message of type MCAST_MSG was sent.
- Data: 0 to 66,000 bytes : The content and format of this region is specified by the application or service that sends the message.
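The bit layout of header word 0 in Figure 7 can be illustrated with a pack/unpack sketch. The shift amounts follow directly from the field widths given above (Ver 3 bits, User 4, Hsize 4, the N/D/S/R flags 1 bit each with R always zero, Message Size 17 bits); the function names are chosen for this sketch:

```python
def pack_word0(ver, user, hsize, n, d, s, msg_size):
    """Pack word 0 of a TIPC message header into a 32-bit integer, MSB first:
    Ver (3) | User (4) | Hsize (4) | N | D | S | R=0 | Message Size (17)."""
    return (ver << 29) | (user << 25) | (hsize << 21) | \
           (n << 20) | (d << 19) | (s << 18) | msg_size

def unpack_word0(w):
    """Inverse of pack_word0: extract the individual fields."""
    return {
        "ver":      (w >> 29) & 0x7,
        "user":     (w >> 25) & 0xF,
        "hsize":    (w >> 21) & 0xF,
        "n":        (w >> 20) & 0x1,
        "d":        (w >> 19) & 0x1,
        "s":        (w >> 18) & 0x1,
        "msg_size": w & 0x1FFFF,
    }

# A version-2 LOW_IMPORTANCE (user 0) message with a 6-word (24-byte)
# header and 100 bytes of data, destination droppable:
w0 = pack_word0(ver=2, user=0, hsize=6, n=0, d=1, s=0, msg_size=124)
f = unpack_word0(w0)
assert (f["ver"], f["user"], f["hsize"]) == (2, 0, 6)
assert f["d"] == 1 and f["msg_size"] == 124
```

Note that the 17-bit Message Size field comfortably holds the maximum message size of 66,000 bytes plus the header.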
5.1.4. Payload Message Delivery
CONN_MSG and DIRECT_MSG messages are delivered directly to the destination port. If this is impossible, e.g. because the destination port or node has disappeared, the message is dropped or rejected back to the sender, depending on the setting of the 'dest_droppable' bit.
NAMED_MSG messages are subject to a name table lookup before the final destination can be determined. The following procedure applies for finding the correct destination:
- Initially, the 'destination port' and 'destination node' fields of the message header are empty.
- If the sender node is within the lookup domain of the destination address (e.g., <1.1.1> is within domain <1.1.0>), a lookup is performed on that node. Only publications which have been published from a node that is also located within the requested domain, and which have a publication scope comprising the sender node are considered. (E.g., a publication with scope 'cluster' or 'zone' from node <1.1.3> is considered. A publication with scope 'node' from the same node is not seen. A publication with scope 'zone' from <1.2.1> is seen, but ignored.).
- If a matching publication is found, the destination port number and node address are added to the header, and the message is sent to the destination node for delivery. If no matching publication is found, the message is dropped or rejected back to the sender.
- If the sender node is outside the lookup domain, the message is forwarded to a node within that domain for further lookup. On the receiving node, the lookup procedure described in the previous steps is performed.
- If lookup was successful, and the message has reached the found destination node, it is now delivered to the found destination port, if it is still there.
- If the destination port has disappeared, the destination port and destination node fields are cleared from the header, and a new lookup is performed on the current node, still considering the originally indicated lookup domain, and following the steps described above. The original lookup domain is recreated from the complete current node address and the 2-bit 'lookup scope' conveyed in the message header. At the same time the 'reroute' counter is incremented. Up to six such lookup attempts will be made before the message is dropped or rejected.
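The retry-with-reroute-counter behavior in the steps above can be sketched as a loop. This is a hypothetical model of the control flow, not the real data structures: 'name_table' here simply maps a port name to an entry, with 'alive' marking whether the published port still exists:

```python
MAX_REROUTES = 6

def deliver_named(msg, name_table):
    """Sketch of NAMED_MSG delivery: repeat name lookup until a live
    destination port is found, giving up after six failed lookups, as the
    'reroute' counter rules above require."""
    while msg["reroute"] < MAX_REROUTES:
        dest = name_table.get(msg["name"])
        if dest is None:
            return "rejected"            # no matching publication at all
        if dest["alive"]:
            return "delivered to " + dest["port"]
        # Destination vanished after lookup: clear the destination fields
        # (implicit here), count the failed attempt, and look up again.
        msg["reroute"] += 1
    return "rejected"

table = {("svc", 7): {"port": "<1.1.2>:42", "alive": True}}
msg = {"name": ("svc", 7), "reroute": 0}
assert deliver_named(msg, table) == "delivered to <1.1.2>:42"
assert deliver_named({"name": ("svc", 8), "reroute": 0}, table) == "rejected"
```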
MCAST_MSG messages are also subject to name table lookups before the final destinations can be determined. The following procedure applies for finding the correct destinations:
- Initially, the Destination Port and Destination Node fields of the message header are empty.
- A first lookup is performed, unconditionally, on the sending node. Here, all node local matching destination ports are identified, and a copy of the message is sent to each of them.
- At the same time, the lookup identifies if there are any publications from external, cluster local, nodes. If so, a copy of the message is sent via the broadcast link to all nodes in the cluster.
- At each destination node, a final lookup is made, once again to identify node local destination ports. A copy of the message is sent to each of them.
- If any of the found destination ports have disappeared, or are overloaded, the corresponding message copy is silently dropped.
5.2. Connectionless Communication
There are two types of unicast connectionless communication: name addressed and direct addressed message transport. Both have already been described.
The main advantage of this type of communication is that there is no need for a connection setup procedure. Another is that messages can be sent from one to many destinations, as well as from many to one. A third is that a destination port can easily move around, without potential senders needing to be aware of it.
The main disadvantage of this kind of communication is the lack of flow control. It is very easy to overwhelm a single destination from many sources.
5.3. Connection-based Communication
User Connections are designed to be lightweight because of their potentially huge number, and because it must be possible to establish and shut down thousands of connections per second on a node.
5.3.1. Connection Setup
How a connection is established and terminated is not defined by the protocol, only how connections are supervised and, if necessary, aborted. Establishment and termination are instead left to the implementation to define. The following figures show two examples of this.
 -------------------              -------------------
| Client            |            | Server            |
|                   |            |                   |
| (3)create(cport)  |            | (1)create(suport) |
| (4)send(type:17,  |----------->0 (2)bind(type: 17, |
|        inst: 7)   0<------+    |\       lower:0    |
| (8)lconnect(sport)|       |    | \      upper:9)   |
|                   |       |    | /                 |
|                   |       |    |/(5)create(sport)  |
|                   |       +----0 (6)lconnect(cport)|
|                   |            | (7)send()         |
 -------------------              -------------------
Figure 11: Example of user defined establishment of a connection
In the example illustrated above the user defines how to set up the connection. In this case, the client starts by sending one payload-carrying NAMED_MSG message to the setup port (suport)(4). The setup server receives the message, and reads its contents and the client port (cport) identity. It then creates a new port (sport)(5), and connects it to the client port's identity(6). The lconnect() call is a purely node local operation in this case, and the connection is not fully established until the server has fulfilled the request and sent a response payload-carrying CONN_MSG message back to the client port(7). Upon reception of the response message the client reads the server port's identity and performs an lconnect() on it(8). This way, a connection has been established without exchanging a single protocol message.
 --------------------             -------------------
| Client             |           | Server            |
|                    |           | (1)create(suport) |
| (4)create(cport)   |   "SYN"   | (2)bind(type: 17, |
| (5)connect(type:17,|---------->0        lower:0    |
| (9)       inst: 7) 0<------+   |\       upper:9)   |
|                    |       |   | \ (3)accept()     |
|                    |  (7)  |   | / (8)             |
|                    |       |   |/ (6)              |
|                    |       +---0 (9)recv()         |
|                    |   "SYN"   |                   |
 --------------------             -------------------
Figure 12: TCP-style connection setup
The figure above shows an example where the user API-adapter supports a TCP-style connection setup, using hidden protocol messages to fulfil the connection. The client starts by calling connect()(5), causing the API to send an empty NAMED_MSG message ("SYN" in TCP terminology) to the setup port. Upon reception, the API-adapter at the server side creates the server port, performs a local lconnect()(6) on it towards the client port, and sends an empty CONN_MSG ("SYN") back to the client port (7). The accept() call in the server then returns, and the server can start waiting for messages (8). When the second SYN message arrives at the client, the API-adapter performs a node local lconnect() and lets the original connect() call return (9).
Note the difference between this protocol and the real TCP connection setup protocol. In our case there is no need for SYN_ACK messages, because the transport media between the client and the server (the node-to-node link) is reliable.
Also note from these examples that it is possible to retain full compatibility between these two very different ways of establishing a connection. Before the connection is established, a TCP-style client or server should interpret a payload message from a user-controlled counterpart as an implicit SYN, and perform an lconnect() before queueing the message for reading by the user. Conversely, a user-controlled client or server must perform an lconnect() when receiving the empty message from its TCP-style counterpart.
5.3.2. Connection Shutdown
 -------------------             -------------------
| Client            |           | Server            |
|                   |           |                   |
|                   |           |                   |
|          lclose() 0           0 lclose()          |
|                   |           |                   |
|                   |           |                   |
 -------------------             -------------------
Figure 13: Example of user defined shutdown of a connection
The figure above shows the simplest possible user defined connection shutdown scheme. If it is inherent in the user protocol when the connection should be closed, both parties will know the right moment to perform a "node local close" (lclose()), and no protocol messages need to be involved.
 --------------------            -------------------
| Client             |  "FIN"   | Server            |
|                    |          |                   |
|         (1)close() 0--------->0 (2)close()        |
|                    |          |                   |
|                    |          |                   |
 --------------------            -------------------
Figure 14: TCP-style shutdown of a connection
In the figure above a TCP-style connection close() is illustrated. This is simpler than the connection setup case, because the built-in connection abortion mechanism of TIPC can be used. When the client calls close() (1), TIPC must delete the client port. As will be described later, deleting a connected port has the effect that a CRITICAL_IMPORTANCE/CONN_MSG ("FIN" in TCP terminology) with error code NO_REMOTE_PORT is sent to the other end. Reception of such a message means that TIPC at the receiving side must shut down the connection, and this must happen even before the server is notified. The server must then call close() (2), not to close the connection, but to delete the port. TIPC does not send any "FIN" this time; the server port is already disconnected, and the client port is gone anyway. If both endpoints call close() simultaneously, two "FIN" messages will cross each other, but on reception they will have no effect, since there is no destination port, and they must be discarded by TIPC.
Note even here the automatic compatibility between a user-defined peer and a TCP-style one: any user, no matter the user API, must at any moment be ready to receive a "connection aborted" indication, and this is what in reality happens here.
5.3.3. Connection Abortion
When a connected port receives an indication from the TIPC link layer that it has lost contact with its peer node, it must immediately disconnect itself and send an empty CONN_MSG/NO_REMOTE_NODE to its owner process.
When a connected port is deleted without a preceding disconnect() call from the user it must immediately disconnect itself and send an empty CONN_MSG/NO_REMOTE_PORT to its peer port. This may happen when the owner process crashes, and the OS is reclaiming its resources.
When a connected port receives a timeout call, and is still in CONNECTED/PROBING state since the previous timer expiration, it must immediately disconnect itself and send an empty CONN_MSG/NO_REMOTE_PORT to its owner process.
When a connected port receives a CONN_MSG with error code, it must immediately disconnect itself and deliver the message to its owner process.
When a connected port receives a CONN_MSG from a port other than its peer port, it must immediately send an empty CONN_MSG/NO_CONNECTION to the originating port of that message.
When TIPC in a node receives a CONN_MSG/TIPC_OK for which it finds no destination port, it must immediately send an empty CONN_MSG/NO_REMOTE_PORT back to the originating port.
When a bound port receives a CONN_MSG from anybody, it must immediately send an empty CONN_MSG/NO_CONNECTION to the originating port.
5.3.4. Connection Supervision
A connection also implies automatic supervision of the existence and state of the endpoints.
In almost all practical cases the mechanisms for resource cleanup after process failure, and for supervision of peer nodes at the link level, are sufficient for immediate failure detection and abortion of connections.
However, because of the non-specified connection setup procedure of TIPC, there exist cases where a connection may remain incomplete. This may happen if the client in a user-defined setup/shutdown scheme forgets to call lconnect() (see "Example of user defined shutdown of a connection"), and then deletes itself, or if one of the parties fails to call lclose() (see "TCP-style connection shutdown"). These cases are considered very rare, and should normally have no serious consequences for the availability of the system, so a slow background timer is deemed sufficient to discover such situations.
When a connection is established each port starts a timer, whose purpose is to check the status of the connection. It does this by regularly (typical configured interval is once an hour) sending a CONN_PROBE message to the peer port of the connection. The probe has two tasks; first, to inform that the sender is still alive and connected; second, to inquire about the state of the recipient.
A CONN_PROBE or CONN_PROBE_REPLY message MUST be responded to immediately according to the following scheme:
---------------------------------------------------------------------
|     |                        |        Received Message Type       |
|     |                        |-----------------+------------------|
|     |                        |   CONN_PROBE    | CONN_PROBE_REPLY |
|==============================|====================================|
|     |           Multi-hop    | CRITICAL_IMPORTANCE+               |
|     |           seqno wrong  | TIPC_COMM_ERROR                    |
|     |           -------------|-----------------+------------------|
|     | Connected Multi-hop    |                 |                  |
|     | to sender seqno ok     |                 |                  |
|     | port      -------------|                 |                  |
|     |           Single hop   | CONN_PROBE_REPLY|   No Response    |
|     |------------------------|                 |                  |
|     | Not connected,         |                 |                  |
|Rece-| not bound              |                 |                  |
|ving |------------------------|-----------------+------------------|
|Port | Connected to           |                                    |
|State| other port             | CRITICAL_IMPORTANCE+               |
|     |------------------------| TIPC_NOT_CONNECTED                 |
|     | Bound to               |                                    |
|     | port name sequence     |                                    |
|     |------------------------|------------------------------------|
|     | Recv. node available,  | CRITICAL_IMPORTANCE+               |
|     | recv. port non-existent| TIPC_NO_REMOTE_PORT                |
|     |------------------------|------------------------------------|
|     | Receiving node         | CRITICAL_IMPORTANCE+               |
|     | unavailable            | TIPC_NO_REMOTE_NODE                |
---------------------------------------------------------------------
Figure 15: Response to probe/probe replies vs port state |
If everything is well then the receiving port will answer with a probe reply message, and the probing port will go to rest for another interval. It is inherent in the protocol that one of the ports - the one connected last - normally will remain passive in this relationship. Each time its timer expires it will find that it has just received and replied to a probe, so it will never have any reason to explicitly send a probe itself.
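The passive/active behaviour described above can be sketched as a small timer routine (illustrative Python; the class and method names are not from the protocol):

```python
class ConnProbe:
    """Connection probe timer sketch: probe only if the peer has
    been silent for a whole interval."""

    def __init__(self):
        # True if any message (including a probe we replied to)
        # arrived from the peer since the last timer expiration.
        self.heard_from_peer = False

    def on_receive_from_peer(self):
        self.heard_from_peer = True

    def on_timer_expiry(self):
        """Return True if a CONN_PROBE must be sent this interval."""
        if self.heard_from_peer:
            self.heard_from_peer = False
            return False   # peer proved itself; stay passive
        return True        # a full interval of silence: probe
```

The endpoint connected last keeps receiving (and replying to) probes just before its own timer fires, so `on_timer_expiry()` keeps returning False for it, matching the "normally passive" behaviour described above.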
When an error is encountered, one or two empty CONN_MSG messages are generated: one to indicate a connection abortion in the receiving port, if it exists, and one to do the same in the sending port.
The state machine for a port during this message exchange is described in "Connection-based Communication".
TOC |
5.3.4.1. Connection Manager
Although a TIPC internal user, the Connection Manager is special, because it uses the 36-byte payload message header format instead of the 40-byte internal format. This is because its messages must contain a destination port and an originating port.
The following message types are valid for Connection Manager:
User: 8 (CONN_MANAGER).
Message Types:
ID Value   Meaning
--------   ----------
0          Probe to test existence of peer (CONN_PROBE)
1          Reply to probe, confirming existence (CONN_PROBE_REPLY)
2          Acknowledge N Messages (MSG_ACK)
Figure 16: Connection Supervision Message Types |
MSG_ACK messages are used for transport-level congestion control, and carry one 32-bit integer in network byte order as data. This indicates the number of messages acknowledged, i.e. actually read by the port sending the acknowledgement. This information makes it possible for the other port to keep track of the number of sent, but not yet received and handled messages, and to take action if this value surpasses a certain threshold.
The details about why and when these messages are sent are described in "Connection Supervision"
TOC |
5.3.5. Flow Control
The mechanism for end-to-end flow control at the connection level has as its primary purpose to stop a sending process from overrunning a slower receiving process. Other tasks, such as bearer, link, network, and node congestion control, are handled by other mechanisms in TIPC. Because of this, the algorithm can be kept very simple. It works as follows:
The message sender (the API-adapter) keeps one counter, SENT_CNT, counting the messages he has sent but which have not yet been acknowledged. The counter is incremented for each sent message.
The receiver counts the number of messages he delivers to the user, ignoring any messages pending in the process in-queue. For every N messages delivered, he sends back a CONN_MANAGER/MSG_ACK containing this number in its data part.
When the sender receives the acknowledge message, he subtracts N from SENT_CNT, and stores the new value.
When the sender wants to send a new message he must first check the value of SENT_CNT, and if this exceeds a certain limit, he must abstain from sending the message. A typical measure to take when this happens is to block the sending process until SENT_CNT is under the limit again, but this will be API-dependent.
The recommended value for the send window N is at least 200 messages, and the limit for SENT_CNT should be at least 2*N.
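The counter scheme above can be sketched as follows (illustrative Python; the class and method names are assumptions, not protocol elements):

```python
N = 200             # recommended ack window, in messages
SENT_LIMIT = 2 * N  # recommended sender limit for SENT_CNT

class Sender:
    """Sender side: count unacknowledged messages."""
    def __init__(self):
        self.sent_cnt = 0
    def can_send(self):
        return self.sent_cnt < SENT_LIMIT
    def on_send(self):
        self.sent_cnt += 1
    def on_msg_ack(self, n):
        # CONN_MANAGER/MSG_ACK received: n messages were read by peer
        self.sent_cnt -= n

class Receiver:
    """Receiver side: ack every N messages actually delivered to
    the user (messages pending in the in-queue are not counted)."""
    def __init__(self):
        self.delivered = 0
    def on_deliver_to_user(self):
        """Return the count to put in a MSG_ACK, or None if no
        acknowledgement is due yet."""
        self.delivered += 1
        if self.delivered == N:
            self.delivered = 0
            return N
        return None
```

What an API does when `can_send()` is False (e.g. block the sending process) is left API-dependent, as stated above.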
TOC |
5.3.6. Sequentiality Check
Inter-cluster connection-based messages, and intra-cluster messages between cluster nodes, may need to be routed via intermediate nodes if there is no direct link between the two. This implies a small, but not negligible, risk that messages may be lost or re-ordered; e.g., an intermediate node may crash, or it may have changed its routing table in the interval between the messages. A connection-level sequence number is used to detect such problems, and this must be checked for each message received on the connection. If the sequence number is out of order, no attempt at re-sequencing should be made. The port discovering the sequence error must immediately abort the connection by sending one empty CONN_MSG/COMM_ERROR message to itself, and one to the peer port.
The sequence number must not be checked on single-hop connections, where the link protocol guarantees that no such errors can occur.
The first message sent on a connection has the sequence number 42.
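A receive-side check following these rules might look like this (illustrative Python; the 16-bit wrap-around is an assumption made here for the sketch, since the field width is not restated in this section):

```python
TIPC_COMM_ERROR = "COMM_ERROR"  # illustrative label for the error code
FIRST_SEQNO = 42                # first message sent on a connection

class ConnSeqCheck:
    """Per-connection, receive-side sequence check.  Applies only
    to multi-hop connections; single-hop links guarantee order."""

    def __init__(self, single_hop):
        self.single_hop = single_hop
        self.expected = FIRST_SEQNO

    def on_receive(self, seqno):
        """Return None if the message is in sequence, or the error
        code used to abort the connection (no resequencing is done)."""
        if self.single_hop:
            return None  # MUST NOT check on single-hop connections
        if seqno != self.expected:
            return TIPC_COMM_ERROR  # abort: CONN_MSG/COMM_ERROR both ways
        self.expected = (self.expected + 1) & 0xFFFF  # assumed 16-bit wrap
        return None
```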
TOC |
5.4. Multicast Communication
Section 4.5.6 (Multicast Addressing) describes the concept of multicast addressing in TIPC. Section 5.1.4 (Payload Message Delivery) describes in detail how multicast address lookup is performed.
Two additional details should be noted:
- In contrast to ordinary named addressing, a multicast domain is limited to cluster scope. It is not possible to send a multicast message outside the sending node's cluster.
- A multicast domain is always centered around the sending node, meaning that it is not possible to send a message to a different node or a different cluster, and have the multicast operation performed there.
The addressing concept makes it both logical and easy to permit such 'zone multicast' and 'redirected multicast' features, but at least for now it is deemed too risky to implement them.
TOC |
6. Name Table
The Name Table is a distributed database that keeps all port names that have been published in the network. It is used for translation from a port name to a corresponding port identity, or from a port name sequence to a corresponding set of port identities. In order to achieve acceptable translation times and fault tolerance, a replica of the table must exist on each node. The table replicas are not exactly identical; instead, each replica keeps exactly those publications that have been published in the domain to which the replica belongs, provided that it is directly reachable (via a link) from the publishing node.
TOC |
6.1. Distributed Name Table Protocol Overview
The replicas of the table must be kept consistent with the other instances within the same domain, and there must be no unnecessary delays in the synchronization between neighbouring table instances when a port name sequence is published or withdrawn. Inconsistencies are only tolerated for the short timespan it takes for update messages to reach the neighbouring nodes, or for the time it takes for a node to detect that a neighbouring node has disappeared.
TOC |
6.2. Name Distributor Message Processing
When a node establishes contact with a new node in the cluster or the zone, it must immediately send out the necessary number of NAME_DISTRIBUTOR/PUBLICATION messages to that node, in order to let it update its local NAME TABLE instance.
When a node loses contact with another node, it must immediately clean its NAME TABLE of all entries pertaining to that node.
When a port name sequence is published on a node, TIPC must immediately send out a NAME_DISTRIBUTOR/PUBLICATION message to all nodes within the publishing scope, in order to have them update their tables.
When a port name sequence is withdrawn on a node, TIPC must immediately send out a NAME_DISTRIBUTOR/WITHDRAWAL message to all nodes within the publishing scope, in order to have them remove the corresponding entry from their tables.
Brief, transient table inconsistencies may occur, despite the above, and are handled as follows: If a successful lookup on one node leads to a non-existing port on another node, the lookup is repeated on that node. If this lookup succeeds, but again leads to a non-existing port, another lookup is done. This procedure can be repeated up to six times before giving up and rejecting the message.
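The retry rule above can be sketched as follows (illustrative Python; `lookup` and `port_exists` are placeholders standing in for TIPC-internal operations):

```python
MAX_LOOKUPS = 6  # repeat up to six times before rejecting

def deliver(msg, lookup, port_exists):
    """Repeated-lookup rule for transient name table inconsistencies.
    `lookup(msg)` maps the port name to a destination identity using
    the name table of the node where the message currently is;
    `port_exists(dest)` checks whether the destination port exists."""
    for _ in range(MAX_LOOKUPS):
        dest = lookup(msg)
        if dest is None:
            return "reject: no match in name table"
        if port_exists(dest):
            return ("deliver", dest)
        # Lookup succeeded but led to a non-existing port:
        # repeat the lookup on the destination node.
    return "reject: destination port not found"
```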
TOC |
6.3. Name Distributor Message Format
The format of the name distribution message used to update remote name tables is shown in Figure 17 (Name Table Distributor Message Format).
    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0:| Ver | User  | Hsize |N|R|R|R|          Message Size           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1:|Mtype|        RESERVED       |     Broadcast Acknowledge       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2:|        Link Acknowledge       |         Link Sequence         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w3:|                         Previous Node                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w4:|                        Originating Port                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5:|            Destination Port / Destination Network             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w6:|                        Originating Node                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w7:|                        Destination Node                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w8:|                            RESERVED                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w9:|           Item Size           |M|           RESERVED          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   \                                                               \
   /                  Data (list of name items)                    /
   \                                                               \
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 17: Name Table Distributor Message Format |
The interpretation of the fields of the message is as follows:
- R, RESERVED: Defined in Payload Message.
- Ver: 3 bits: Defined in Payload Message.
- User: 4 bits: Defined in Payload Message. A NAME_DISTRIBUTOR message is identified by the value 11.
- Hsize: 4 bits: Defined in Payload Message. A NAME_DISTRIBUTOR message header is 40 bytes.
- N: 1 bit: Defined in Payload Message.
- Message Size: 17 bits: Defined in Payload Message.
- Mtype: 3 bits: Defined in Payload Message. A NAME_DISTRIBUTOR message specifies 0 for a name publication message or 1 for a name withdrawal message.
- Broadcast Acknowledge: 16 bits: Defined in Payload Message.
- Link Acknowledge: 16 bits: Defined in Payload Message.
- Link Sequence: 16 bits: Defined in Payload Message.
- Previous Node: 32 bits: Defined in Payload Message.
- Originating Port: 32 bits: Defined in Payload Message. A NAME_DISTRIBUTOR message sets this field to zero as the message originates with TIPC's name table subsystem.
- Destination Port: 32 bits: Defined in Payload Message. A NAME_DISTRIBUTOR message sets this field to zero as the message is destined for TIPC's name table subsystem.
- Destination Network: 32 bits: Defined in Payload Message.
- Originating Node: 32 bits: Defined in Payload Message.
- Destination Node: 32 bits: Defined in Payload Message.
- Item Size: 16 bits: The size, in words, of each name publication descriptor contained in Data. A value of zero indicates that Item Size is not specified by the sender, signifying that a 5-word descriptor size may be assumed.
- M, MORE: During the bulk update at node establishment, this bit indicates that there are more bulk messages to come. In the last bulk message this bit is set to zero, informing the receiver that it can open up for broadcast reception from this peer node. This is needed to avoid contention between publications/withdrawals sent via a unicast link and subsequent items sent via the broadcast link, since that would lead to an inconsistent name table at the receiver.
- Data: up to 66,000 bytes : A list of one or more name publication descriptors. The total number of descriptors in the message is equal to (Message Size - Hsize)/(Item Size * 4).
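The descriptor count formula can be expressed directly (illustrative Python; the header size is given in bytes, 40 for this message type):

```python
def descriptor_count(message_size, header_size, item_size_words):
    """Number of name publication descriptors in a NAME_DISTRIBUTOR
    message: (Message Size - header bytes) / (Item Size * 4).
    An Item Size of 0 means the default 5-word format is assumed."""
    words = item_size_words or 5
    return (message_size - header_size) // (words * 4)
```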
TOC |
6.4. Name Publication Descriptor Format
The format of a name publication descriptor is shown in Figure 18 (Name Table Distribution Items). The full seven word format MUST be used by nodes in multi-cluster TIPC networks; nodes in single-cluster TIPC networks MAY use the shorter five word format. All fields of the descriptor MUST be stored in network byte order.
    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0:|                             Type                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1:|                          Lower Bound                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2:|                          Upper Bound                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w3:|                           Reference                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w4:|                              Key                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5:|                              Node                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w6:|            RESERVED           |             Scope             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 18: Name Table Distribution Items |
Type: The type part of the published port name sequence.
Lower: The lower part of the published port name sequence.
Upper: The upper part of the published port name sequence.
Reference: The reference part of the publishing port's identity.
Key: A value created by the publishing port.
Node: The node part of the publishing port's identity. If this field is not present it can be assumed to be the same as Originating Node.
Scope: The distribution scope of the published port name sequence. If this field is not present then it can be assumed to be cluster-wide.
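Parsing a descriptor with these defaults might be sketched as follows (illustrative Python; the numeric scope encoding and the placement of Scope in the low-order bits of w6 are assumptions made for the sketch, not taken from this specification):

```python
import struct

def parse_descriptor(data, orig_node, default_scope=2):
    """Parse one name publication descriptor (network byte order).
    Seven-word descriptors carry Node and Scope explicitly; the
    five-word form omits them, in which case Node defaults to the
    message's Originating Node and Scope to cluster scope (the
    value 2 here is an assumed encoding)."""
    n_words = len(data) // 4
    if n_words >= 7:
        ptype, lower, upper, ref, key, node, w6 = struct.unpack("!7I", data[:28])
        scope = w6 & 0xFF  # exact field width within w6 assumed
    else:
        ptype, lower, upper, ref, key = struct.unpack("!5I", data[:20])
        node, scope = orig_node, default_scope
    return {"type": ptype, "lower": lower, "upper": upper,
            "reference": ref, "key": key, "node": node, "scope": scope}
```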
TOC |
7. Links
This section discusses the operation of unicast links that carry messages from the originating node to a single destination node to which it has a direct path.
The operation of TIPC's broadcast link is described in Section 8 (Broadcast Link).
TOC |
7.1. TIPC Internal Header
TOC |
7.1.1. Internal Message Header Format
    0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0:|vers |msg usr|hdr sz |n|resrv|          packet size            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1:|m typ|      sequence gap       |       broadcast ack no        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2:|link level ack no/bc gap after | link level/bc seqno/bc gap to |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w3:|                         previous node                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w4:| last sent broadcast/fragm no  |  next sent pkt/ fragm msg no  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5:|          session no           | res |r|berid|link prio|netpl|p|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w6:|                        originating node                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w7:|                        destination node                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w8:|                   transport sequence number                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w9:|     msg count/max packet      |        link tolerance         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   \                                                               \
   /                      User Specific Data                       /
   \                                                               \
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 19: TIPC Internal Message Header Format |
The internal header has a single format and size, 40 bytes. Some fields are relevant only to some users, but for simplicity of understanding and presentation we show it as a single header format.
TOC |
7.1.2. Internal Message Header Fields Description
The first four words are almost identical to the corresponding part of the data message header. The differences are described next.
- Sequence Gap: 13 bits. Used by: LINK_PROTOCOL. The fields 'Error Code','Reroute Count', 'Lookup Scope' and 'Options Position' have no relevance for LINK_PROTOCOL/STATE_MSG messages, so these 13 bits can be recycled in such messages. 'Sequence Gap' informs the recipient about the size of a gap detected in the sender's received packet sequence, from 'Link Level Acknowledge Number' and onwards. The receiver of this information must immediately retransmit the missing packets.
- Broadcast Gap After: 16 bits. Used by: BCAST_PROTOCOL. This field occupies the same physical space as Link Level Acknowledge Number, but is defined only for BCAST_PROTOCOL Negative Acknowledge messages. It indicates that a gap has been detected in the received sequence of broadcast data packets. The last packet received in sequence is indicated here.
- Broadcast Gap To: 16 bits. Used by: BCAST_PROTOCOL. This field occupies the same physical space as Broadcast Sequence Number and Link Level Sequence Number, but is defined only for BCAST_PROTOCOL Negative Acknowledge messages. It indicates that a gap has been detected in the received sequence of broadcast data packets. The first packet received out of sequence is indicated here.
Next come the fields which are unique for the internal header, from word 4 and onwards.
- Last Sent Broadcast: 16 bits. Used by: LINK_PROTOCOL. In order to speed up detection of lost broadcast packets, all LINK_PROTOCOL/STATE_MSG messages contain this information from the sender node. If the receiver finds that this is not in accordance with what he has received, he immediately broadcasts a BCAST_PROTOCOL/STATE_MSG back to the sender, with 'Broadcast Gap After' and 'Broadcast Gap To' set appropriately.
- Fragment Number: 16 bits. Used by: MSG_FRAGMENTER. Occupying the same space as 'Last Sent Broadcast', this value indicates the number of a message fragment within a fragmented message, starting from 1.
- Next Sent Packet: 16 bits. Used by: LINK_PROTOCOL. Link protocol messages bypass all other packets in order to maintain link integrity, and hence cannot have sequence numbers valid for the ordinary packet stream. But all receivers are dependent on this information to detect packet losses, and cannot completely rely on the assumption that a sequenced packet will arrive within acceptable time. To guarantee a worst-case packet loss detection time, even on low-traffic links, the equivalent of a valid sequence number has to be conveyed by the link continuity check (STATE_MSG) messages, and that is the purpose of this field.
- Fragment Message Number: 16 bits. Used by: MSG_FRAGMENTER. Occupying the same space as 'Next Sent Packet', this value identifies a fragmented message on the particular link where it is sent.
- Session Number: 16 bits. Used by: LINK_PROTOCOL. The risk of packets being reordered by the router is particularly elevated at the moment of first contact between nodes, so a check of sequentiality is needed even for LINK_PROTOCOL/RESET_MSG messages. The session number starts from a random value, and is incremented each time a link comes up. This way, redundant RESET_MSG messages, delayed by the router and arriving after the link has been brought to a working state,can be identified and ignored.
- Redundant Link: 1 bit. Used by: LINK_PROTOCOL. This bit is used only in RESET and ACTIVATE messages. When set, it informs the receiving endpoint that, despite the reset of the current link, the sender node still thinks it has a second working and active link towards the receiver, and that a failover procedure can be initiated, if possible. If the bit is zero the recipient MUST reset all links towards the sender, forcing it to go through a "lost contact" cycle. There is otherwise a small, but real, risk that the reset cycles for the link endpoints at the receiver node won't overlap in time, and that there never is a "lost contact" indication on that node, contrary to what has happened on the sender node. This, again, is necessary to ensure that the local name table is purged of publications pertaining to the sender, and that the name table instances are synchronized once contact is re-established between the peers.
- Bearer Identity: 3 bits. Used by: LINK_PROTOCOL. When a bearer is registered with the link layer of TIPC in a node, it is assigned a unique identifying number in the range [0,7]. This number will not necessarily be the same in different nodes, so a link endpoint needs to know the other endpoint's assigned identity for the same bearer. This is needed during the link changeover procedure, in order to identify the destination bearer and link instance of a tunneled packet.
- Link Priority: 5 bits. Used by: LINK_PROTOCOL. When there is more than one link between two nodes, one may want to use them in load-sharing or active/standby mode. Equal priority between links means load sharing; different priorities mean that the link with the higher numerical value takes all traffic. By offering a value range of 32 one can build in a default relation between different bearer types (e.g. UDP may be given lower priority than Ethernet), and no manual configuration of these values should normally be needed.
- Network Plane: 3 bits. Used by: LINK_PROTOCOL. When multiple parallel routers and multiple network interfaces are used it is useful, although not strictly needed by the protocol, to have a network-pervasive identifier telling which interfaces are connected to which routers. This relieves system managers from the burden of manually keeping track of the actual physical connectivity. Typically, the identifier 0 would be presented to the operator as 'Network A', identity 1 as 'Network B', etc. This identity must be agreed upon in the whole network, and therefore this field is present and valid in the header of all LINK_PROTOCOL messages. The 'negotiation' consists of letting the node with the lowest numerical value of its network address, typically node <1.1.1>, decide the identities. All others must strictly update their identities to the value received from any lower node.
- Probe: 1 bit. Used by: LINK_PROTOCOL. This one-bit field is used only by messages of type LINK_PROTOCOL/ STATE_MSG. When set it instructs the receiving link endpoint to immediately respond with a STATE_MSG. The Probe bit MUST NOT be set in the responding message.
- Message Count: 16 bits. Used by: MSG_BUNDLER, CHANGEOVER_PROTOCOL. This field is used for two different purposes. First, the message bundling function uses it to indicate how many packets are bundled in a bundle packet. Second, when a link goes down, the endpoint detecting the failure must send an ORIG_MSG to the other endpoint (tunneled through the remaining link) informing it about how many tunneled packets to expect. This gives the other endpoint a chance to know when the changeover is finished, so it can return to the normal link setup procedure.
- Max Packet: 16 bits. Used by: LINK_PROTOCOL. Occupying the same space as 'Message Count', this field is used by a link endpoint during MTU negotiation to tell its peer the size of the largest link state message it has received. The size value is specified as a count of 4-byte words, rather than bytes, to allow the largest possible message size (66060 bytes) to be represented using 16 bits. A value of 0 indicates that no information about the largest link state message is provided. Semantically, this field has two meanings, depending on the message type carrying it. When present in a RESET or ACTIVATE message, this field indicates the theoretical MTU of the sender, in practice the MTU allowed by the sender's local media interface. When present in STATE messages, the field serves as a confirmation that packets with that MTU have been received. This is useful for the MTU negotiation described earlier.
- Link Tolerance: 16 bits. Used by: LINK_PROTOCOL. Each link endpoint must have a limit for how long it can wait for packets from the other end before it declares the link failed. Initially this time may differ between the two endpoints, and must be negotiated. At link setup all RESET_MSG messages in both directions carry the sender's configured value in this field, and the highest numerical value will be the one chosen by both endpoints. In STATE_MSG messages this field is normally zero, but if the value is explicitly changed at one endpoint, e.g. by a configuration command, it will be carried by the next STATE_MSG and force the other endpoint to also change its value. Subsequent STATE_MSG messages return to the zero value. The unit of the value is [ms].
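The tolerance negotiation described above reduces to taking the numerically higher of the two configured values. A minimal sketch (illustrative Python, function name assumed):

```python
def negotiate_tolerance(own_ms, peer_ms):
    """Both endpoints adopt the higher of the two Link Tolerance
    values (in ms) carried by RESET_MSG messages at link setup."""
    return max(own_ms, peer_ms)
```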
TOC |
7.2. Link Creation
TOC |
7.2.1. Link Setup
TIPC automatically detects all neighbouring nodes that can be reached through an interface, and automatically establishes a link to each of those nodes, provided that the sender's bearer is configured to permit this.
This automatic configuration requires that TIPC be able to send Link Request messages to all possible receivers on that interface. This is easily done when the media type used by the interface supports some form of broadcast capability (eg. Ethernet); other media types might require the use of a "replicast" facility. Support for manual configuration of links on interfaces that can not support automatic neighbour discovery in any form is left for future study.
Whenever TIPC detects that a new interface has become active, it periodically broadcasts Link Request messages from that interface to other prospective members of the network, informing them of the node's existence. If a node that receives such a request determines that a link to the sender is required, it creates a new link endpoint and returns a unicast Link Response message to the sending node, which causes that node to create a corresponding link endpoint. The two link endpoints then begin the link activation process described in Section 7.2.2 (Link Activation)
The structure and semantics of Link Request and Link Response messages are described in Section 9 (Neighbor Detection)
                                        -------------
                                        |  <1.1.3>  |
                                        |           |
      ucast(dest:<1.1.1>,orig:<1.1.3>)  |           |
    <-----------------------------------|           |
                                        |           |
                                        -------------
  -------------
  |  <1.1.1>  |
  |           |  bcast(orig:<1.1.1>,dest:<1.1.0>)
  |           |------------------------------------>
  |           |
  -------------
                                        -------------
      ucast(dest:<1.1.1>,orig:<1.1.2>)  |  <1.1.2>  |
    <-----------------------------------|           |
                                        |           |
                                        |           |
                                        -------------
Figure 20: Neighbor Detection |
There are two reasons for the on-going broadcasting described above. First, it allows two nodes to discover each other even if the communication media between them is initially non-functional. (For example, in a dual-switch system one of the cables may be faulty or disconnected at start up time, while the cluster is still fully connected and functional via the other switch.) The continuous discovery mechanism allows the missing links to be created once a working cable is inserted, without requiring a restart of any of the nodes. Second, it allows users to replace (hot-swap) an interface card with one having a different media address (e.g. a MAC address for Ethernet), again without having to restart the node. When a node receives a Link Request message its originating media address is compared with the one previously stored for that destination, and if they differ the old one is replaced, allowing the link activation process to begin using the new address.
Link Request broadcasting begins 125 msec after an interface is enabled, then repeats at an interval that doubles after each transmission until it reaches an upper limit of 2000 msec; thereafter, broadcasts occur every 2000 msec if there are no active links on the interface, or every 600,000 msec if there is at least one active link. The broadcasts continue at these rates as long as the node is up. This pattern of broadcasts ensures that a node broadcasts frequently when an interface is first enabled or when there is no connectivity on the interface, and very slowly once some amount of connectivity exists. Such an approach places the bulk of the burden of neighbour discovery on the node that is increasing its connectivity to the TIPC network, allowing nodes that are already fully connected to take a more passive role.
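The broadcast schedule above can be sketched as follows (illustrative Python; the function name is an assumption, and link status is simplified to a fixed flag):

```python
def discovery_schedule(n, active_links=False):
    """First n Link Request broadcast intervals in msec: 125 msec,
    doubling up to a 2000 msec cap; at the cap, the interval is
    2000 msec with no active links on the interface, or
    600,000 msec once at least one link is active."""
    steady = 600_000 if active_links else 2000
    out, delay = [], 125
    for _ in range(n):
        out.append(delay if delay < 2000 else steady)
        delay = min(delay * 2, 2000)
    return out
```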
Note: This algorithm does not allow for rapid neighbour discovery in the event that a cluster is initially partitioned into two or more multi-node sections that later become able to communicate, as it can take up to 10 minutes for the partitions to discover one another. Further investigation is required to address this issue.
Each Link Request message contains a destination domain that indicates which neighbouring nodes are permitted to establish links to the transmitting interface; this value should be configurable on a per-interface basis. Typical settings include <0.0.0>, which permits connection to any node in the network, and <own_zone.own_cluster.0>, which permits connection to any node within the cluster.
A node receiving a Link Request message ensures that it belongs to the destination domain stated in the message, and that the Network Identity of the message is equal to its own. If so, and if a link does not already exist, it creates its end of the link and returns a unicast Link Response message back to the requesting node. This message then triggers the requesting node to create the other end of the link (if there is not one already), and the link activation phase then begins.
TOC |
7.2.2. Link Activation
Link activation and supervision are completely handled by the generic part of the protocol, in contrast to the partially media-dependent neighbour detection protocol.
The following FSM describes how a link is activated and supervised.
  ---------------                               ---------------
  |             |<---(CHECKPOINT == LAST_REC)---|             |
  |             |                               |             |
  |Working-Unknown|----TRAFFIC/ACTIVATE_MSG---->|Working-Working|
  |             |                               |             |
  |             |-------+        +-ACTIVATE_MSG>|             |
  ---------------        \      /               ------------A--
        |                 \    /                             |
   NO TRAFFIC/             \  /                     TRAFFIC/ |
   NO PROBE        RESET_MSG\/                  ACTIVATE_MSG |
   REPLY                    /\                               |
        |                  /  \                              |
  ---V-----------         /    \                --V------------
  |             |--------+      +---RESET_MSG-->|             |
  |             |                               |             |
  |Reset-Unknown|                               | Reset-Reset |
  |             |-----------RESET_MSG---------->|             |
  |             |                               |             |
  -------------A-                               ---------------
        |      |
  BLOCK/       |
  CHANGEOVER   | UNBLOCK/
  ORIG_MSG     | CHANGEOVER END
        |      |
  -V-------------
  |             |
  |   Blocked   |
  |             |
  ---------------
Figure 21: Link finite state machine |
A link endpoint's state is defined by its own state, combined with what is known about the peer endpoint's state. The following states exist:
Reset-Unknown
Own link endpoint reset, i.e. queues are emptied and sequence numbers are set back to their initial values. The state of the peer endpoint is unknown. LINK_PROTOCOL/RESET_MSG messages are sent periodically at CONTINUITY_INTERVAL to inform the peer about the own endpoint's state, and to force it to reset its own endpoint, if this has not already been done. If the peer endpoint is rebooting, or has reset for some other reason, it will sooner or later also reach the state Reset-Unknown, and start sending its own RESET_MSG messages periodically. At least one of the endpoints, and often both, will eventually receive a RESET_MSG and transfer to state Reset-Reset. If the peer is still active, i.e. in one of the states Working-Working or Working-Unknown, and has not yet detected the disturbance causing this endpoint to reset, it will sooner or later receive a RESET_MSG and transfer directly to state Reset-Reset. If a LINK_PROTOCOL/ACTIVATE_MSG message is received in this state, the link endpoint knows that the peer is already in state Reset-Reset, and can itself move directly on to state Working-Working. Any other messages are ignored in this state. CONTINUITY_INTERVAL is calculated as the smaller of LINK_TOLERANCE/4 and 0.5 s.
Reset-Reset
Own link endpoint reset, peer endpoint known to be reset, since the only way to reach this state is through receiving a RESET_MSG from the peer. The link endpoint periodically, at CONTINUITY_INTERVAL, sends ACTIVATE_MSG messages. This will eventually cause the peer to transfer to state Working-Working. The own endpoint also transfers to state Working-Working as soon as any message other than a RESET_MSG is received.
Working-Working
Own link endpoint working. Peer link endpoint known to be working, i.e. both can send and receive traffic messages. A periodic timer with the interval CONTINUITY_INTERVAL checks if anything has been received from the peer during the last interval. If not, the endpoint transfers to state Working-Unknown.
Working-Unknown
Own link endpoint working. Peer link endpoint in unknown state. LINK_PROTOCOL/STATE_MSG messages with the PROBE bit set are sent at an interval of CONTINUITY_INTERVAL/4 to force a response from the peer. If a calculated number of probes (LINK_TOLERANCE/(CONTINUITY_INTERVAL/4)) go unanswered, the state transfers to Reset-Unknown. The own link endpoint is reset, and the link is considered lost. If instead any kind of message, except LINK_PROTOCOL/RESET_MSG and LINK_PROTOCOL/ACTIVATE_MSG, is received, the state transfers back to Working-Working. Reception of a RESET_MSG in this situation brings the link to state Reset-Reset. An ACTIVATE_MSG will never be received in this state.
Blocked
The link endpoint is blocked from accepting any packets in either direction, except incoming, tunneled CHANGEOVER_PROTOCOL/ORIG_MSG. This state is entered upon the arrival of the first such message, and left when the last has been counted in and delivered. See description about the changeover procedure later in this section. The Blocked state may also be entered and left through the management commands BLOCK and UNBLOCK. This is also described later.
A newly created link endpoint starts from the state Reset-Unknown. The recommended default value for LINK_TOLERANCE is 0.8 s.
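The states and transitions above can be condensed into a small transition table. The following Python sketch is illustrative only; the state and event names are our own encoding, not protocol fields:

```python
LINK_TOLERANCE = 0.8  # seconds, recommended default

def continuity_interval(tolerance=LINK_TOLERANCE):
    # CONTINUITY_INTERVAL is the smaller of LINK_TOLERANCE/4 and 0.5 s
    return min(tolerance / 4, 0.5)

def next_state(state, event):
    """Transition table for one link endpoint (hypothetical encoding)."""
    transitions = {
        ("RESET_UNKNOWN",   "RESET_MSG"):        "RESET_RESET",
        ("RESET_UNKNOWN",   "ACTIVATE_MSG"):     "WORKING_WORKING",
        ("RESET_RESET",     "TRAFFIC"):          "WORKING_WORKING",
        ("RESET_RESET",     "ACTIVATE_MSG"):     "WORKING_WORKING",
        ("WORKING_WORKING", "SILENT_INTERVAL"):  "WORKING_UNKNOWN",
        ("WORKING_WORKING", "RESET_MSG"):        "RESET_RESET",
        ("WORKING_UNKNOWN", "TRAFFIC"):          "WORKING_WORKING",
        ("WORKING_UNKNOWN", "RESET_MSG"):        "RESET_RESET",
        ("WORKING_UNKNOWN", "PROBES_EXHAUSTED"): "RESET_UNKNOWN",
    }
    # events not listed for a state are ignored (stay in place)
    return transitions.get((state, event), state)
```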
7.2.3. Link MTU Negotiation
The actual MTU used by a link may vary with the media used. The two endpoints of a link may disagree on the allowed MTU (e.g. one using Ethernet jumbo frames and the other not), and intermediate switches may put a stricter limitation on the MTU size than what is visible from the endpoints. Therefore, TIPC implements an interval-halving MTU negotiation algorithm that intends to find the biggest possible MTU that can be used between the two link endpoints. This is done for each direction separately, so in theory we could end up with one MTU in one direction, and a different one in the opposite direction.
The algorithm works as follows:
A link endpoint starts out with an MTU of 1500 bytes, or the MTU reported from the bearer media, whichever is smaller (CURR_MTU). It also registers a wanted MTU (TARGET_MTU), which is equal to the one reported by the local interface. TARGET_MTU is sent along in the Max Packet field of all RESET and ACTIVATE messages to the other end, to let it know about the target to negotiate for. The other end will update its own TARGET_MTU to be the smaller of the one received and the one registered locally.
When the link has been established, using very short RESET and ACTIVATE messages, the endpoint lets its first STATE messages have the size of CURR_MTU + (TARGET_MTU - CURR_MTU)/2.
If any of those messages are received, the other endpoint responds with a STATE message where Max Packet confirms that the size is usable. CURR_MTU is updated to the new size, and the algorithm goes back to step 2.
After a number of trials (e.g. 10) with the attempted MTU without any confirmation from the other end, TARGET_MTU is decremented by 4, and the algorithm goes back to step 2. If the link state moves to WORKING_UNKNOWN during this negotiation, due to lost STATE messages, the link moves temporarily back to using CURR_MTU as packet size. However, as soon as the link is back in WORKING_WORKING state, the negotiation continues from where it was suspended.
After a number of iterations CURR_MTU is equal to TARGET_MTU, and the negotiation is over.
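The interval-halving loop can be sketched as follows. This is illustrative Python, not part of the protocol: the repeated trials per probe size are collapsed into a single `usable()` check, a hypothetical stand-in for the peer's Max Packet confirmation:

```python
def negotiate(curr_mtu, target_mtu, usable):
    """Converge on the largest probe size the path confirms.
    `usable(size)` models whether a STATE message of `size` gets through."""
    while curr_mtu < target_mtu:
        # probe halfway between what works and what we want (round up
        # so the loop always makes progress)
        probe = curr_mtu + (target_mtu - curr_mtu + 1) // 2
        if usable(probe):
            curr_mtu = probe          # peer confirmed via Max Packet
        else:
            target_mtu -= 4           # give up on this size, shrink target
    return curr_mtu
```

With a path that silently drops anything above 4000 bytes, the loop settles on exactly that limit, since every failed probe lowers the target while every confirmed probe raises the working size.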
7.2.4. Link Continuity Check
During normal traffic both link endpoints are in state Working-Working. At each expiration point, the background timer checkpoints the value of the Last Received Sequence Number. Before doing this, it compares the checkpoint from the previous expiration with the current value of Last Received Sequence Number, and if they differ, it takes the new checkpoint and goes back to sleep. If the two values don't differ, it means that nothing was received during the last interval, and the link endpoint must start probing, i.e. move to state Working-Unknown.
Note here that even LINK_PROTOCOL messages are counted as received traffic, although they don't contain valid sequence numbers. When a LINK_PROTOCOL message is received, the checkpoint value is moved, instead of Last Received Sequence Number, and hence the next comparison gives the desired result.
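The checkpoint comparison performed at each timer expiration might look like this (a sketch; the dict fields are hypothetical names for the endpoint's counters):

```python
def check_continuity(endpoint):
    """One expiration of the background timer.
    `endpoint` holds `checkpoint` and `last_received` counters."""
    if endpoint["last_received"] != endpoint["checkpoint"]:
        # traffic was seen during the interval: take a new checkpoint
        endpoint["checkpoint"] = endpoint["last_received"]
        return "WORKING_WORKING"
    # nothing received during the interval: start probing
    return "WORKING_UNKNOWN"
```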
7.2.5. Sequence Control and Retransmission
Each packet eligible to be sent on a link is assigned a Link Level Sequence Number, and appended to a send queue associated with the link endpoint. At the moment the packet is sent, its field Link Level Acknowledge Number is set to the value of Last Received Sequence Number.
When a packet is received in a link endpoint, its send queue is scanned, and all packets with a sequence number lower than the arriving packet's acknowledge number (modulo 2^16-1) are released.
If the packet's sequence number is equal to Last Received Sequence Number + 1 (mod 2^16-1), the counter is updated, and the packet is delivered upwards in the stack. A counter, Non Acknowledged Packets, is incremented for each message received, and if it reaches the value 10, a LINK_PROTOCOL/STATE_MSG is sent back to the sender. For any message sent, except BCAST_PROTOCOL messages, the Non Acknowledged Packets counter is set to zero.
Otherwise, if the sequence number is lower, the packet is considered a duplicate, and is silently discarded.
Otherwise, if a gap is found in the sequence, the packet is sorted into the Deferred Incoming Packets Queue associated with the link endpoint, to be re-sequenced and delivered upwards when the missing packets arrive. If that queue is empty, the gap is calculated and immediately transferred in a LINK_PROTOCOL/STATE_MSG back to the sending node. That node must immediately retransmit the missing packets. Also, for every 8 subsequent received out-of-sequence packets, such a message must be sent.
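The receive-side classification of an arriving sequence number can be sketched as follows, assuming a 16-bit sequence space (helper names are ours):

```python
SEQ_MOD = 1 << 16  # 16-bit sequence number space

def seq_gap(last_received, seqno):
    # distance from the last in-sequence packet to this one
    return (seqno - last_received) % SEQ_MOD

def classify(last_received, seqno):
    gap = seq_gap(last_received, seqno)
    if gap == 1:
        return "IN_SEQUENCE"   # deliver upwards, advance the counter
    if gap == 0 or gap > SEQ_MOD // 2:
        return "DUPLICATE"     # already seen: discard silently
    return "DEFER"             # gap ahead: queue, and NACK the hole
```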
7.2.6. Message Bundling
Sometimes a packet can not be sent immediately over a bearer, due to network or recipient congestion (link level send window overflow), or due to bearer congestion. In such situations it is important to utilize the network and bearer as efficiently as possible, and not stop important users from sending messages before this is absolutely unavoidable. To achieve this, messages which can not be transmitted immediately are bundled into already waiting packets whenever possible, i.e. when there are unsent packets in the send queue of a link. When the packet finally arrives at the receiving node it is split up into its individual messages again. Since the bundling layer is located below the fragmentation layer in the functional model of the stack, even message fragments may be bundled with other messages this way, but this can only happen to the last fragment of a message, the only one normally not filling an entire packet by itself.
It must be emphasized that message transmissions never are delayed in order to obtain this effect. In contrast to TCP's Nagle Algorithm, the only goal of the TIPC bundling mechanism is to overcome congestion situations as quickly and efficiently as possible.
7.2.7. Message Fragmentation
When a message is longer than the identified MTU of the link it will use, it is split up into fragments, each being sent in a separate packet to the destination node. Each fragment is wrapped into a packet headed by a TIPC internal header (see Figure 19 (TIPC Internal Message Header Format)). The User field of the header is set to MSG_FRAGMENTER, and each fragment is assigned a Fragment Number relative to the first fragment of the message. Each fragmented message is also assigned a Fragmented Message Number, to be present in all fragments. Fragmented Message Number must be a sequence number with the period of 2^16-1. At reception the fragments are reassembled so that the original message is recreated, and then delivered upwards to the destination port.
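A minimal illustration of the fragmentation and reassembly logic (the tuple layout and helper names are ours; real fragments carry full TIPC internal headers):

```python
def fragment(message, mtu_payload, frag_msg_no):
    """Split a message into (frag_msg_no, fragment_no, type, chunk) tuples.
    `mtu_payload` is the assumed per-packet payload capacity."""
    chunks = [message[i:i + mtu_payload]
              for i in range(0, len(message), mtu_payload)]
    packets = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            mtype = "FIRST_FRAGMENT"
        elif i == len(chunks) - 1:
            mtype = "LAST_FRAGMENT"
        else:
            mtype = "FRAGMENT"
        packets.append((frag_msg_no, i, mtype, chunk))
    return packets

def reassemble(packets):
    # reorder by Fragment Number, then concatenate the payloads
    return b"".join(p[3] for p in sorted(packets, key=lambda p: p[1]))
```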
7.2.8. Link Congestion Control
TIPC uses a common sliding window protocol to handle traffic flow at the signalling link level. When the send queue associated with each link endpoint reaches a configurable limit, the Send Window Limit, TIPC stops sending messages over that link. Packets may still be appended to or bundled into waiting packets in the queue, but only after having been subject to a filtering function, selecting or rejecting user calls according to the sent message's importance priority. LOW_IMPORTANCE messages are not accepted at all in this situation. MEDIUM_IMPORTANCE messages are still accepted, up to a configurable limit set for that user. All other users also have their individually configurable limits, recommended to be assigned values in the following ascending order: LOW_IMPORTANCE, MEDIUM_IMPORTANCE, HIGH_IMPORTANCE, CRITICAL_IMPORTANCE, CONNECTION_MANAGER, BCAST_PROTOCOL, ROUTE_DISTRIBUTOR, NAME_DISTRIBUTOR, MSG_FRAGMENTER. MSG_BUNDLER messages are not filtered this way, since those are packets created at a later stage. Whether to accept a message due for fragmentation or not is decided on its original importance, set before the fragmentation is done. Once such a message has been accepted, its individual fragments must be handled as being more important than the original message.
When the part of the queue containing sent packets is again under the Send Window Limit, the waiting packets must immediately be sent, but only until the Send Window Limit is reached again.
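The importance-based filtering at the congestion limit might be sketched like this (the numeric limits are illustrative, not normative; in a real implementation each is configurable per user):

```python
# Hypothetical per-importance queue limits, in the ascending order
# recommended by the text.
LIMITS = {
    "MEDIUM_IMPORTANCE":   100,
    "HIGH_IMPORTANCE":     150,
    "CRITICAL_IMPORTANCE": 200,
}

def accept(queue_len, send_window, importance):
    """Decide whether a new message may be queued on a link."""
    if queue_len < send_window:
        return True                        # link not congested
    if importance == "LOW_IMPORTANCE":
        return False                       # never queued beyond the window
    return queue_len < LIMITS[importance]  # per-user overflow limit
```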
7.2.9. Link Load Sharing vs Active/Standby
When a link is created it is assigned a Link Priority, used to determine its relation to a possible parallel link to the same node. There are two possible relations between parallel working links.
7.2.10. Load Sharing
Load Sharing is used when the links have the same priority value. Payload traffic is shared equally over the two links, in order to take full advantage of available bandwidth. The selection of which link to use must be done in a deterministic way, so that message sequentiality can be preserved for each individual sender port. To obtain this a Link Selector is used. This must be a value correlated to the sender in such a way that all messages from that sender choose the same link, while guaranteeing a statistically equal possibility for both links to be selected for the overall traffic between the nodes. A simple example of a link selector with the right properties is the last two bits of the random number part of the originating port's identity; another is the same bits of the Fragmented Message Number in message fragments.
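A link selector with these properties reduces to a deterministic index computation, for example (a sketch; the function name is ours):

```python
def select_link(links, selector_bits):
    """Pick one of the parallel links deterministically per sender.
    `selector_bits` could be the low bits of the sender port's random
    identity, or of Fragmented Message Number for message fragments."""
    return links[selector_bits % len(links)]
```

Because the same sender always presents the same selector bits, all of its messages follow one link, while a roughly uniform distribution of selector values spreads the aggregate traffic over both links.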
7.2.11. Active/Standby
When the priority of one link has a higher numerical value than that of the others, all traffic will go through that link, denoted the Active Link. The other links will be kept up and working with the help of the continuity timer and probe messages, and are called Standby Links. The task of these links is to take over traffic in case the active link fails.
Link Priority has a value within the range [1,31]. When a link is created it inherits a default priority from its corresponding bearer, and this should normally not need to be changed thereafter. However, Link Priority must be reconfigurable in run-time.
7.3. Link Failover
When the link configuration between two nodes changes, the moving of traffic from one link to another must be performed in such a way that message sequentiality and cardinality per sender are preserved. The following situations may occur:
7.3.1. Active Link Failure
Before opening the remaining link for messages with the failing link's selector, all packets in the failing link's send queue must be wrapped into messages (tunneling messages) to be sent over the remaining link, irrespective of whether this is a load sharing active link or a standby link. These messages are headed by a TIPC Internal Header, with the User field set to CHANGEOVER_PROTOCOL and Message Type set to ORIG_MSG. On the tunneling link the messages are subject to congestion control, fragmentation and bundling, like any other messages. Upon arrival in the receiving node, the tunneled packets are unwrapped, and moved over to the failing link's receiving endpoint. This link endpoint must now be reset, if this has not already been done, and must itself initiate tunneling of its own queued packets in the opposite direction. The unwrapped packets' original sequence numbers are compared to the Last Received Sequence Number of the failed link's receiving endpoint, and the packets are delivered upwards or dropped according to their relation to this value. There is no need for the failing link to consider packet sequentiality or possible losses in this case: the tunneling link must be considered a reliable medium guaranteeing all the necessary properties. The header of the first ORIG_MSG sent in each direction must contain a valid number in the Message Count field, in order to let the receiver know how many packets to expect. During the whole changeover procedure both link endpoints must be blocked for any normal message reception, to avoid that the link is inadvertently activated again before the changeover is finished. When the expected number of packets has been received, the link endpoint is deblocked, and can go back to the normal activation procedure.
7.3.2. Standby Link Failure
This case is trivial, as there is no traffic to redirect.
7.3.3. Second Link With Same Priority Comes Up
When a link is active, and a second link with the same priority comes up, half of the traffic from the first link must be taken over by the new link. Before opening the new link for new user messages, the packets in the existing link's send queue must be transmitted over that link. This is done by wrapping copies of these packets into messages (tunnel messages) to be sent over the new link. The tunnel messages are headed by a TIPC Internal Header, with the User field set to CHANGEOVER_PROTOCOL and Message Type set to DUPLICATE_MSG. On the tunneling link the messages are subject to congestion control, fragmentation and bundling, just like any other messages. Upon arrival in the receiving node, the tunneled packets are unwrapped, and delivered to the original link's receiving endpoint, just like any other packet arriving over that link's own bearer. If the original packet has already arrived over that bearer, the tunneled packet is dropped as a duplicate; otherwise the tunneled packet will be accepted, and the original packet dropped as a duplicate when it arrives.
7.3.4. Second Link With Higher Priority Comes Up
When a link is active, and a second link with a higher numerical priority comes up, all traffic from the first link must be taken over by the new link. The handling of this case is identical to the case when a link with same priority comes up. After the traffic takeover has finished, no more senders will select the old link, but this does not affect the takeover procedure.
7.3.5. Link Deletion
Once created, a link endpoint continues to exist as long as its associated interface continues to exist.
Note: The persistence of a link endpoint whose peer cannot be reached for a significant period of time requires further study. It may be desirable for TIPC to reclaim the resources associated with such an endpoint by automatically deleting the endpoint after a suitable interval.
7.3.6. Message Bundler Protocol
User: 6 (MSG_BUNDLER)
Message Types: None
A MSG_BUNDLER packet contains as many bundled packets as indicated in Message Count. All bundled messages start at a 4-byte aligned position in the packet. Each bundled packet is a complete packet, including header, but with the fields Broadcast Acknowledge Number, Link Level Sequence Number and Link Level Acknowledge Number left undefined. Any kind of packets, except LINK_PROTOCOL and MSG_BUNDLER packets, may be bundled.
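The 4-byte alignment and greedy packing described above could be sketched as follows (helper names are ours; a real implementation bundles messages into packets already waiting in the send queue rather than building bundles from scratch):

```python
def pad4(n):
    # bundled messages start at 4-byte aligned positions
    return (n + 3) & ~3

def bundle(packets, mtu):
    """Greedily pack whole packets into bundles of at most `mtu` bytes."""
    bundles, cur, cur_len = [], [], 0
    for p in packets:
        need = pad4(len(p))
        if cur and cur_len + need > mtu:
            bundles.append(cur)       # current bundle is full, start a new one
            cur, cur_len = [], 0
        cur.append(p)
        cur_len += need
    if cur:
        bundles.append(cur)
    return bundles
```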
7.3.7. Link State Maintenance Protocol
User: 7 (LINK_PROTOCOL)
ID Value  Meaning
--------  ------------------------------------------------------
0         Detailed state of a working link endpoint (STATE_MSG)
1         Reset receiving endpoint (RESET_MSG)
2         Sender in RESET_RESET, ready to receive (ACTIVATE_MSG)
Figure 22: Link Maintenance Protocol Messages
RESET_MSG messages must have a data part consisting of a zero-terminated string. This string is the name of the bearer instance used by the sender node for this link. Examples of such names are "eth0", "vmnet1" or "udp". These messages must also contain valid values in the fields Session Number, Link Priority and Link Tolerance.
ACTIVATE_MSG messages do not need to contain any valid fields except Message User and Message Type.
STATE_MSG messages may leave bearer name and Session Number undefined, but Link Priority and Link Tolerance must be set to zero in the normal case. If any of these values are non-zero, it implies an order to the receiver to change its local value to the one in the message. This must be done when a management command has changed the corresponding value at one link endpoint, in order to enforce the same change at the other endpoint. Network Identity must be valid in all messages.
Link protocol messages must always be sent immediately, disregarding any traffic messages queued in the link. Hence, they can not follow the ordinary packet sequence, and their sequence number must be ignored at the receiving endpoint. To facilitate this, these messages should be given a sequence number guaranteed not to fit in sequence. The recommended way to do this is to give such messages the next unassigned Link Level Sequence Number + 32768. This way, at reception the test for the user LINK_PROTOCOL needs to be performed only once, after the sequentiality check has failed, and we never need to reverse the Next Received Link Level Sequence Number.
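The out-of-sequence guarantee can be illustrated as follows, assuming a 16-bit sequence space (function names are ours):

```python
SEQ_MOD = 1 << 16  # 16-bit sequence number space

def protocol_msg_seqno(next_unassigned):
    # LINK_PROTOCOL messages carry the next unassigned sequence number
    # plus half the sequence space, so they can never fit in sequence
    return (next_unassigned + 32768) % SEQ_MOD

def fits_in_sequence(last_received, seqno):
    return (seqno - last_received) % SEQ_MOD == 1
```

Because the offset is half the sequence space, the sequentiality check always fails for such messages, and only then does the receiver need to test for the LINK_PROTOCOL user.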
7.3.8. Link Changeover Protocol
User: 10 (CHANGEOVER_PROTOCOL)
ID Value  Meaning
--------  ------------------------------------------------------
0         Tunneled duplicate of packet (DUPLICATE_MSG)
1         Tunneled failed-over original of packet (ORIGINAL_MSG)
Figure 23: Changeover Message Types
DUPLICATE_MSG messages contain no extra information in the header apart from the first three words. The first ORIGINAL_MSG message sent out MUST contain a valid value in the Message Count field, in order to inform the recipient about how many such messages, including the first one, to expect. If this field is zero in the first message, it means that there are no packets wrapped in that message, and none to expect.
7.3.9. Message Fragmentation Protocol
User: 12 (MSG_FRAGMENTER)
ID Value  Meaning
--------  ------------------------------------------
0         First fragment of message (FIRST_FRAGMENT)
1         Body fragment of message (FRAGMENT)
2         Last fragment of message (LAST_FRAGMENT)
Figure 24: Fragmentation Message Types
All packets contain a dedicated identifier, Fragmented Message Number, to distinguish them from packets belonging to other messages from the same node. All packets also contain a sequence number within their respective message, the Fragment Number field, in order to, if necessary, reorder the packets when they arrive at the destination node. Both these sequence numbers must be incremented modulo 2^16-1.
8. Broadcast Link
To effectively support the service multicast feature described in Section 5.4 (Multicast Communication), a reliable cluster broadcast service is provided by TIPC.
Although seen as a broadcast service from a TIPC viewpoint, at the bearer level this service may be implemented as a multicast group comprising all nodes in the cluster.
At the multicast/broadcast sending node a sequence of actions is followed:
- When a service multicast is requested, TIPC first looks up all matching destinations in its name translation table.
- If any node external port is on the destination list, the message is sent to the multicast link for broadcast transport off node.
- If the own node is on the list, a replica is sent to the service multicast receive function in the own node.
8.1. Broadcast Protocol
User: 5 (BCAST_PROTOCOL).
There is only one type of BCAST_PROTOCOL message, but it is still used for two very different purposes:
- STATE_MSG as broadcast: This carries a Broadcast Negative Acknowledge, and is sent as a broadcast visible to all other nodes according to the rules stated in the section Multicast Protocol.
- STATE_MSG as unicast: This works as a synchronization message between a pair of nodes, and is the very first message sent over a newly activated first link to a new node. It conveys the sender's broadcast sequence number at the moment the link went up at the sending node, so that the receiver can know from which sequence number he should start accepting broadcast messages. This is needed to avoid that the receiver inadvertently accepts and acknowledges broadcast packets which were actually sent out before contact was established between the nodes. Otherwise the sender would receive too many acks for those packets, and release them prematurely, with easily predictable result.
Note that the receiver is still not allowed to start accepting broadcast messages. He can do this only when he knows the initial bulk update of the name table is finished, i.e., when he sees a NAME_DISTRIBUTOR packet with the M bit unset, as described in Section 6.3 (Name Distributor Message Format).
8.2. Piggybacked Acknowledge
All packets, without exception, passed from one node to another contain a valid value in the field Acknowledged Bcast Number. Since there is always some traffic going on between all nodes in the cluster (in the worst case only link supervision messages), the receiving node can trust that the Last Acknowledged Bcast counter it has for each node is kept well up-to-date. This value will under no circumstances be older than one CONTINUITY_INTERVAL, so it will inhibit a lot of unnecessary retransmissions of packets which in reality have already been received at the other end.
8.3. Coordinated Acknowledge Interval
If the received packet fits in sequence as described above, AND if the last four bits of the sequence number of the received packet are equal to the last four bits of the own node's network address, a LINK_PROTOCOL/STATE_MSG is generated and sent back as unicast to the sending node, acknowledging the packet, and implicitly all previously received packets. This means that e.g. node <Z.C.1> will only explicitly acknowledge packets number 1, 17, 33, and so on, while node <Z.C.2> will acknowledge packets number 2, 18, 34, etc. This condition significantly reduces the number of explicit acknowledges needing to be sent, taking advantage of the normally ongoing traffic over each link.
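The acknowledge condition is a simple low-bits comparison, e.g. (a sketch; the function name is ours):

```python
def should_ack(bcast_seqno, node_addr):
    """Explicitly acknowledge only when the low four bits of the broadcast
    sequence number match the low four bits of the own node address."""
    return (bcast_seqno & 0xF) == (node_addr & 0xF)
```

So a node whose address ends in ...0001 acknowledges sequence numbers 1, 17, 33 and so on; a node ending in ...0010 acknowledges 2, 18, 34, spreading the explicit acknowledges evenly over the cluster.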
8.4. Coordinated Broadcast of Negative Acknowledges
If the Last Sent Broadcast field of a LINK_PROTOCOL/STATE_MSG differs from the registered last received broadcast data packet, or if a broadcast data packet is received out of sequence, a BCAST_PROTOCOL/STATE_MSG ("NACK") packet MAY be broadcast back to the node in question. It is RECOMMENDED that such NACKs not be sent every time a gap is detected, to avoid possible overload of the sender node. It is RECOMMENDED that a node always look into NACKs broadcast from other nodes, so it can identify whether these report the same sequence gap as registered locally for that node. In that case, the node SHOULD delay sending its own corresponding NACK until a later occasion.
8.5. Replicated Delivery
When an in-sequence service multicast is delivered upwards in the stack, TIPC looks up in the NAME TABLE and finds all node local destination ports. The destination list created this way is stripped of all duplicates, so that only one message replica is sent to each identified destination port.
8.6. Congestion Control
Messages sent over the "broadcast link" are subject to the same congestion control mechanisms as point-to-point links, with prioritized transmission queue appending, message bundling, and as a last resort a return value to the sender indicating the congestion. Typically this return value is taken care of by the socket layer code, blocking the sending process until the congestion abates. Hence, the sending application should never notice the congestion at all.
9. Neighbor Detection
TIPC supports the automatic discovery of the physical network topology and the establishment of links between neighboring nodes through the use of a neighbor detection protocol.
9.1. Neighbor Detection Protocol Overview
A node initiates neighbor detection by sending a "link request" message to all of its potential neighbors over each bearer that the node has been configured to use. This message identifies the requesting node and specifies both the subset of network nodes the node is willing to establish links to and the media address to be used by such links. A node that receives a link request message and determines that a new link between the nodes must be established must return a "link response" message to the requesting node; this message identifies the receiving node and specifies the receiving node's own media address. The exchange of messages permits each node to create a link endpoint which has the necessary information to begin communicating with its peer.
The conditions under which a node sends link request messages are not specified in this document. For example, implementations may send messages periodically as long as a node is operational, and may suspend the sending of requests whenever a node has working links to all of its potential neighbors. In contrast, the conditions under which a node sends link response messages are specified.
9.1.1. Link Request Message Processing
A link request message SHOULD be sent to all potential neighbors simultaneously using multicasting or broadcasting if a bearer's media type supports this capability; otherwise, separate link request messages SHOULD be sent to all potential neighbors individually.
A node that receives a link request message MUST ignore the message if it is not supposed to communicate with the requesting node on the associated bearer. Conditions that prohibit communication include the following:
The requesting node has a different TIPC network identifier than the receiving node.
The receiving node has the same TIPC network address as the requesting node (i.e. a node must ignore a message from itself).
The requesting node does not lie within the network domain that the receiving node is authorized to communicate with over the associated bearer.
The receiving node does not lie within the network domain that the requesting node has specified in its request.
In addition, a node that receives a link request message MUST ignore the message if it would interfere with existing communication with the requesting node. (Request messages of this nature can arise if network nodes are not configured correctly, resulting in two or more nodes having the same network address.) Conditions that cause interference include the following:
The receiving node currently has a working link to the requesting node on the associated bearer.
The receiving node has a working link to the requesting node on another bearer that was established using a different node signature.
A node that receives a link request message that is not ignored SHOULD establish a link endpoint capable of communicating with the requesting node. If the receiving node currently has a (non-operational) link endpoint to the requesting node on the associated bearer it MUST delete or reconfigure the link endpoint to preclude the existence of two parallel links to the same node on the same bearer. If the receiving node currently has one or more (non-operational) link endpoints to the requesting node on other bearers that were established using a different node signature it MUST delete or reconfigure those link endpoints to preclude the existence of links to two different nodes having the same network address.
Once the receiving node has established the required link endpoint it MUST send a link response message to the requesting node on the associated bearer. The link response message MUST be directed only to the requesting node; if possible, it SHOULD be sent without using multicasting or broadcasting.
9.1.2. Link Response Message Processing
A node that receives a link response message MUST ignore the message if it is not supposed to communicate with the responding node on the associated bearer. Conditions that prohibit communication include the following:
The responding node has a different TIPC network identifier than the receiving node.
The receiving node has the same TIPC network address as the responding node (i.e. a node must ignore a message from itself).
The responding node does not lie within the network domain that the receiving node is authorized to communicate with over the associated bearer.
The receiving node does not lie within the network domain that the responding node has specified in its response.
In addition, a node that receives a link response message MUST ignore the message if it would interfere with existing communication with the responding node. Conditions that cause interference include the following:
The receiving node currently has a working link to the responding node on the associated bearer.
The receiving node has a working link to the responding node on another bearer that was established using a different node signature.
A node that receives a link response message that is not ignored SHOULD establish a link endpoint capable of communicating with the responding node. If the receiving node currently has a (non-operational) link endpoint to the responding node on the associated bearer it MUST delete or reconfigure the link endpoint to preclude the existence of two parallel links to the same node on the same bearer. If the receiving node currently has one or more (non-operational) link endpoints to the responding node on other bearers that were established using a different node signature it MUST delete or reconfigure those link endpoints to preclude the existence of links to two different nodes having the same network address.
Once the receiving node has established the required link endpoint it MUST NOT send a link configuration message (either a request or a response) to the responding node.
9.1.3. Link Discovery Message Format
The format of the link discovery message used to exchange link requests and link responses is shown in Figure 25 (Neighbor Discovery Message Format)
     3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
     1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0: | Ver | User  | Hsize |N|R|R|R|          Message Size           |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1: |Mtype|      Capabilities     |         Node Signature          |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2: |                       Destination Domain                      |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w3: |                         Previous Node                         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w4: |                          Network Id                           |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5: |                                                               |
    +                                                               +
w6: |                                                               |
    +                                                               +
w7: |                                                               |
    +                                                               +
w8: |                                                               |
    +                         Media Address                         +
w9: |                                                               |
    +                                                               +
w10:|                                                               |
    +                                                               +
w11:|                                                               |
    +                                                               +
w12:|                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w13:|                           Reserved                            |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w14:|                           Reserved                            |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w15:|                           Reserved                            |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 25: Neighbor Discovery Message Format
The interpretation of the fields of the message is as follows:
R, RESERVED: Defined in Payload Message.
Ver: 3 bits : Defined in Payload Message.
User: 4 bits: Defined in Payload Message. A LINK_DISCOVERY message is identified by the value 13.
Hsize: 4 bits: Defined in Payload Message. A LINK_DISCOVERY message header is 64 bytes.
N: 1 bit: Defined in Payload Message. A LINK_DISCOVERY message sets this bit, as it is not part of a normal flow of messages over a link.
Message Size: 17 bits: Defined in Payload Message. A LINK_DISCOVERY message is 64 bytes in length.
Mtype: 3 bits: Defined in Payload Message. A LINK_DISCOVERY message specifies 0 for a link request message or 1 for a link response message.
Capabilities: 13 bits : A bitmap indicating capabilities of the sender node that receivers may need to be aware of. Currently only one bit is defined: LSB of the field (bit 15 in word 1) indicates that the sender node will send out SYN messages with the SYN bit set. All other bits are reserved for future use, and MUST be zero.
Destination Domain: 32 bits: The network domain to which the message is directed. <Z.C.N> denotes that the sender desires a link to a specific node; <Z.C.0>, <Z.0.0>, and <0.0.0> denote that the message can be processed by any node in the sender's cluster, zone, and network, respectively.
Previous Node: 32 bits: Defined in Payload Message.
Network Id: 32 bits: The network identity of the sender.
Media Address: 32 bytes: The media address of the sender. This has a media-specific format, described in Section 9.1.4 (Media Address Formats).
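As a non-normative illustration of the bit packing described above, the following sketch packs and unpacks word 1 of a discovery message, assuming the Figure 25 layout (Mtype in the three most significant bits, Capabilities in the next thirteen, Node Signature in the low sixteen); the function names are illustrative only:

```python
import struct

# Sketch only: word 1 of a LINK_DISCOVERY message, assuming the layout
# of Figure 25 (Mtype | Capabilities | Node Signature).
LINK_REQUEST, LINK_RESPONSE = 0, 1   # Mtype values from Section 9.1.3

def pack_discovery_w1(mtype, capabilities, node_signature):
    word = ((mtype & 0x7) << 29) | ((capabilities & 0x1FFF) << 16) \
           | (node_signature & 0xFFFF)
    return struct.pack('!I', word)   # sent in network byte order

def unpack_discovery_w1(data):
    (word,) = struct.unpack('!I', data)
    return word >> 29, (word >> 16) & 0x1FFF, word & 0xFFFF
```

A receiver applies the inverse shifts and masks to recover the three fields, e.g. `unpack_discovery_w1(pack_discovery_w1(LINK_RESPONSE, 0, 7))` yields `(1, 0, 7)`.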
9.1.4. Media Address Formats
The format of the sender's media address is media-specific. Currently, the following formats are defined:
9.1.4.1. Ethernet Address Format
     3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
     1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5: |                      Zero                     | Addr Type = 1 |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w6: |                     Ethernet MAC Address                      |
    +                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w7: |                               |                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                +-+-+-+-+-+-+-+
w8: |                                                               |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w9: |                                                               |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w10:|                              Zero                             |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w11:|                                                               |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w12:|                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 26: Ethernet Address Format
9.1.4.2. Infiniband Address Format
     3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
     1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5: |                                                               |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w6: |                                                               |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w7: |                       Infiniband Address                      |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w8: |                                                               |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w9: |                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w10:|                                                               |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w11:|                              Zero                             |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w12:|                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 27: Infiniband Address Format
Note that there is no address type field in the Infiniband address format.
9.1.4.3. UDP/IPv4 Address
     3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
     1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5: |                      Zero                     | Addr Type = 3 |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w6: |                          IPv4 Address                         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w7: |          Port Number          |              Zero             |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w8: |                                                               |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w9: |                                                               |
    +-+-+-+-+-+-+-                Zero                +-+-+-+-+-+-+-+
w10:|                                                               |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w11:|                                                               |
    +-+-+-+-+-+-+-                                    +-+-+-+-+-+-+-+
w12:|                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 28: UDP/IPv4 Address Format
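The UDP/IPv4 media address layout above can be sketched as follows; a non-normative example assuming Addr Type = 3 occupies the low byte of the first word, with the remaining words zero-filled (the helper name is illustrative):

```python
import socket
import struct

# Sketch only: the 32-byte Media Address of Figure 28, assuming
# Addr Type = 3 occupies the low byte of the first word.
def pack_udp_ipv4_media_addr(ip, port):
    w5 = struct.pack('!I', 3)            # Zero (24 bits) | Addr Type = 3
    w6 = socket.inet_aton(ip)            # IPv4 address in network order
    w7 = struct.pack('!HH', port, 0)     # Port Number | Zero
    return w5 + w6 + w7 + b'\x00' * 20   # zero fill to 32 bytes total
```

The result is exactly the 32 bytes that would be placed in the Media Address field of the discovery message in Figure 25.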
10. Topology Service
TIPC provides a message-based mechanism for an application to learn about the port names that are visible to its node. This is achieved by communicating with a Topology Service that has knowledge of the contents of the node's name table.
10.1. Topology Service Semantics
A "topology subscription" is a request by a subscriber to TIPC, telling TIPC to indicate when a port name sequence overlapping the requested range is published or withdrawn. Subscription for an individual port name is requested by specifying a port name sequence whose lower and upper instance values are identical.
An "event" is a response by TIPC to a subscriber, telling the subscriber about a change in availability of the port name(s) specified by a subscription, or in the status of the subscription itself. Each event associated with the availability of port names indicates the portion of the requested port name sequence that has changed its availability, as well as identifying the physical address involved in the change. A subscription may cause zero, one, or more events during its lifetime.
10.2. Topology Service Protocol
An application subscribing for the availability of port name sequences must follow these steps:
- Establish a TIPC connection to the Topology Server, using the port name {1,1}.
- Send a subscription message on the new connection for each port name sequence to be monitored.
- Wait for arrival of event messages indicating status changes for the requested port name sequence(s).
After a subscription has been received and registered by the Topology Server, the subscriber will immediately receive zero or more events, in accordance with the state of the name table at the time of registration, and the flags in the subscription message. Thereafter, the subscriber will receive an event for each change in the name table corresponding to the subscription.
Each subscription issued by an application remains registered until one of the following conditions arises:
- The time limit specified for the subscription expires. (This results in the Topology Server issuing a final event to the application, indicating that the subscription has timed out.)
- The subscription is cancelled by the application. (This is achieved by resending the original subscription message with a cancellation bit set; no acknowledgement is provided by the Topology Server.)
- The application's connection to the Topology Server is terminated.
10.2.1. Subscription Message Format
The format of a subscription message is shown in Figure 29 (Format of topology subscription message). The first five words are integers, while the format of the final two words is unspecified. The words of a subscription message may be sent in network byte order or host byte order; however, all words MUST use the same ordering. (The byte ordering used in a specific subscription message can be deduced by examining the high-order and low-order bytes of the fifth word of the message, exactly one of which will be non-zero.)
     3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
     1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0  |                              Type                             |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1  |                             Lower                             |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2  |                             Upper                             |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w3  |                            Timeout                            |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w4  |                          RESERVED                       |C|S|P|
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5  |                         User Reference                        |
w6  |                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 29: Format of topology subscription message
The interpretation of the fields of the message is as follows:
Type: The type of the port name sequence subscribed for.
Lower: Lower bound of the port name sequence subscribed for.
Upper: Upper bound of the port name sequence subscribed for.
Timeout: The time before the subscription expires, in milliseconds. A timeout of zero means that the subscription expires immediately, but the Topology Server MUST still respond with all events reflecting the state of the requested sequence at the time of the subscription's arrival; this enables an application to perform a one-shot inquiry into the name table to obtain a result immediately, regardless of whether or not the desired names are present. A timeout of 0xffffffff means the subscription will never expire.
Filter: Describes the semantics of the subscription. All bits must be zero, except for the following:
Name Description
---- -----------
 P   When set, the S-bit MUST NOT be set. The Topology Server MUST send an event for each publication or withdrawal of a sequence overlapping the requested one. When clear, the S-bit MUST be set.
 S   When set, the P-bit MUST NOT be set. The Topology Server MUST send an event only when the number of sequences overlapping the requested one goes from zero to non-zero, or vice versa. When clear, the P-bit MUST be set.
 C   When clear, the Topology Server MUST register the subscription specified by the message. When set, the Topology Server MUST cancel a registered subscription corresponding to the one indicated in this message, if one exists. 'Corresponding' means that all the fields (except the C-bit itself) have the same value as in the original subscription message, and the message is submitted via the same connection.

Figure 30: Definition of bits in a subscription message

User Reference: An opaque 8-byte character sequence, to be used by the subscriber for his own purposes. The Topology Server MUST NOT interpret or alter this field in any way, and must return it, along with the rest of the original subscription, in all event messages.
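As a non-normative sketch, a subscription message can be built in network byte order as follows; the filter flag values are assumed from the bit positions shown in Figure 29 (P in bit 0, S in bit 1, C in bit 2 of word 4):

```python
import struct

# Sketch only: filter flag values assumed from the bit positions
# shown in Figure 29 (P = 0x1, S = 0x2, C = 0x4).
SUB_PORTS, SUB_SERVICE, SUB_CANCEL = 0x1, 0x2, 0x4

def pack_subscription(seq_type, lower, upper, timeout_ms, filter_bits,
                      usr_ref=b'\x00' * 8):
    """Build a 28-byte subscription message in network byte order."""
    assert len(usr_ref) == 8   # opaque User Reference, returned verbatim
    return struct.pack('!IIIII', seq_type, lower, upper,
                       timeout_ms, filter_bits) + usr_ref
```

A one-shot name table inquiry uses the same helper with a Timeout of zero; a Timeout of 0xffffffff yields a subscription that never expires.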
10.2.2. Event Message Format
The format of an event message is shown in Figure 31 (Format of topology event message). The first five words in the message are integers; the remainder of the message has the format specified in Figure 29 (Format of topology subscription message). All words of an event message MUST be sent using the same byte order as the subscription message that registered the subscription. (The byte ordering used in a specific event message can be deduced by examining the high-order and low-order bytes of the tenth word of the message, exactly one of which will be non-zero.)
     3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
     1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0  |                             Event                             |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1  |                          Found Lower                          |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2  |                          Found Upper                          |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w3  |                          Port Number                          |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w4  |                          Node Address                         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w5  /                                                               /
    \                          Subscription                         \
w11 /                                                               /
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 31: Format of topology event message
The interpretation of the fields of the message is as follows:
- Event: Identifies the status change relating to a subscription.
The value MUST be one of the following:
Value Description
----- -----------
  1   A sequence overlapping with the requested range was published.
  2   A sequence overlapping with the requested range was withdrawn.
  3   The timeout limit specified by the subscription has been reached.
Figure 32: Topology event definitions
- Found Lower: The lower bound of the actually published or withdrawn sequence that overlaps the requested sequence. In timeout events this field is the Lower value of the associated subscription message.
- Found Upper: The upper bound of the actually published or withdrawn sequence that overlaps the requested sequence. In timeout events this field is the Upper value of the associated subscription message.
- Port Number: The reference portion of the port identifier associated with the published or withdrawn sequence. In timeout events this field is zero.
- Node Address: The network address portion of the port identifier associated with the published or withdrawn sequence. In timeout events this field is zero.
- Subscription: An exact copy of the subscription message (as described in Figure 29 (Format of topology subscription message)) that triggered the sending of the event message.
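The byte-order deduction described above can be sketched as follows; a non-normative example assuming the 48-byte layout of Figure 31 (a 20-byte event header followed by the 28-byte subscription copy), so that the tenth word, bytes 36-39, is the copied Filter word:

```python
import struct

# Sketch only: deduce the sender's byte order from an event message
# and unpack the five header words. Offsets assume the Figure 31 layout.
def event_byte_order(event_msg):
    hi, lo = event_msg[36], event_msg[39]   # ends of the copied Filter word
    if lo and not hi:
        return '!'   # network (big-endian) byte order
    if hi and not lo:
        return '<'   # little-endian host byte order
    raise ValueError('cannot deduce byte order')

def parse_event(event_msg):
    fmt = event_byte_order(event_msg)
    # Event, Found Lower, Found Upper, Port Number, Node Address
    return struct.unpack(fmt + 'IIIII', event_msg[:20])
```

The check works because exactly one of the P-bit and S-bit is set in a registered subscription, so the Filter word always has exactly one non-zero end byte.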
10.3. Monitoring Service Topology
The service topology of the network can be continuously monitored by subscribing for the relevant port names or name sequences corresponding to the services of interest to the application.
10.4. Monitoring Physical Topology
The physical topology of the network can be considered a special case of the functional topology, and can be monitored in the same way. To track the availability or disappearance of a specific node or group of nodes, an application running on these node(s) can publish a port name representing this "function"; this name can then be subscribed to by other applications. TIPC's Topology Service can then notify subscribing applications whenever it discovers or loses contact with a node publishing that name.
TIPC enables an application to easily monitor the availability of the nodes within its cluster by having each node automatically publish the reserved name {0,<Z.C.N>} with cluster scope, where <Z.C.N> is the network address of the node. The port identifier associated with this name identifies the node's Configuration Service.
11. Configuration Service
TIPC provides a message-based mechanism for an application to inquire about the configuration and status of a TIPC network and, in some instances, to alter the configuration. This is achieved by communicating with a Configuration Service that implements a variety of network management-style commands.
11.1. Configuration Service Semantics
A "configuration command" is an operation supported by TIPC's Configuration Service that alters the configuration of a network node or returns information about the current configuration or state of the network. There are three classes of configuration command defined by TIPC:
- "Public commands" are operations that can be issued by any application and executed by the Configuration Service on any network node. These operations are typically non-intrusive and MUST NOT impact other applications running on the affected node.
- "Protected commands" are operations that can only be issued by an application that has network administration privileges on its node and executed by the Configuration Service on any network node. These operations are potentially intrusive and MAY impact other applications running on the affected node.
- "Private commands" are operations that can only be issued by an application that has network administration privileges on its node and executed by the Configuration Service on its node only. These operations are typically intrusive and MAY impact other applications running on the affected node.
A "command message" is a message exchanged by an application and TIPC. There are two classes of command message defined by TIPC:
- A "command request" is a request by an application to the Configuration Service, asking it to perform a specific configuration command.
- A "command reply" is a response by the Configuration Service to an application that acknowledges that a command request has been acted upon, and returns any requested information.
11.2. Configuration Service Protocol
Command messages may be sent over any protocol (e.g. Netlink [RFC 3549]), and may have different formats, to be decided by the particular implementation. Defining such formats falls outside the scope of this document. Here, we only define the formats that MUST be used when the command messages are carried over TIPC itself.
An application that interacts with the Configuration Service uses TIPC payload messages containing command requests and replies. The application MUST follow these steps:
- Send a connectionless command request to a Configuration Server using the port name {0,<Z.C.N>}, where <Z.C.N> is the network address of the node to be queried or manipulated.
- Wait for the arrival of a command reply from the Configuration Server corresponding to the previously issued command request.
After a command request is received by the Configuration Server, the server will attempt to perform the requested operation and return a command reply indicating the results of the operation.
11.2.1. Command Message Format
The data portion of a command message consists of a command descriptor followed by zero or more command arguments.
11.2.1.1. Command Descriptor
The format of a command descriptor is shown in Figure 33. All fields of the command descriptor MUST be stored in network byte order.
     3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
     1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0: |                             Length                            |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1: |            Command            |             Flags             |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2: |                                                               |
    +                            RESERVED                           +
w3: |                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 33
The interpretation of the fields of the descriptor is as follows:
- Length: 32 bits: The total length of the command, including the command descriptor and its arguments.
- Command: 16 bits : Identifies a specific configuration command.
- Flags: 16 bits : Describes the semantics of the configuration command.
Two values are defined:
Flags Meaning
----- -------
  0   Message is a command reply
  1   Message is a command request
Figure 34: Command Message Types
- RESERVED: 8 bytes: Defined in Payload Message.
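A non-normative sketch of building the 16-byte command descriptor of Figure 33 in network byte order (the function and constant names are illustrative; flag values are those of Figure 34):

```python
import struct

# Sketch only: a 16-byte descriptor (Figure 33) followed by optional
# pre-encoded TLV arguments. Flag values are taken from Figure 34.
CMD_REPLY, CMD_REQUEST = 0, 1

def pack_cmd_descriptor(command, flags, args=b''):
    length = 16 + len(args)              # descriptor plus all arguments
    # '!IHH8x': 32-bit Length, 16-bit Command, 16-bit Flags, 8 zero bytes
    return struct.pack('!IHH8x', length, command, flags) + args
```

For example, a GET_NODES request would pass `0x0001` as the command and append a single NET_ADDR TLV as `args`.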
11.2.1.1.1. Command Arguments
A command message contains zero or more Type-Length-Value (TLV) triplets that provide details about the associated request or reply. The set of TLVs associated with a command request may differ from the set of TLVs associated with its reply.
The format of a command argument TLV is shown in Figure 35 (Command argument TLV format). The first two fields of the TLV MUST be stored in network byte order; the order used in the value field that follows depends on the TLV's type. TLV triplets MUST begin on a 32-bit word boundary offset from the start of the command message; thus, it may be necessary to include one, two, or three bytes of padding between adjacent TLVs in a command message.
     3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
     1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0: |            Length             |             Type              |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1: \                                                               \
    /                             Value                             /
wN: \                                                               \
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 35: Command argument TLV format
The interpretation of the fields of a command argument TLV is as follows:
- Length: 16 bits : The length of the TLV, in bytes.
- Type: 16 bits : Identifies the data type encoded in Value.
- Value: 0 to 65,531 bytes : The data portion of the TLV.
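The TLV layout and 32-bit alignment rule above can be sketched as follows; a non-normative helper in which the padding bytes appended to reach the next word boundary are not counted in Length:

```python
import struct

# Sketch only: Length counts the 4-byte header plus the value; padding
# to the next 32-bit boundary is appended but not counted in Length.
def pack_tlv(tlv_type, value):
    length = 4 + len(value)
    padding = (-length) % 4              # 0-3 bytes to the next boundary
    return struct.pack('!HH', length, tlv_type) + value + b'\x00' * padding
```

For instance, a STRING TLV carrying the 9-byte value "eth:eth0" plus its terminating zero has Length 13 and occupies 16 bytes on the wire.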
11.2.1.2. Command Argument TLV Descriptions
The TLVs defined for TIPC's Configuration Service are described in this section.
11.2.1.2.1. VOID
VOID (type 1) is a zero-byte TLV type that can be used as a placeholder in command messages. Currently, no command messages utilize this type.
11.2.1.2.2. UNSIGNED
UNSIGNED (type 2) is a TLV type designating a generic unsigned integer. It is represented by a 32-bit integer, which MUST be stored in network byte order.
11.2.1.2.3. STRING
STRING (type 3) is a TLV type designating a moderately-sized character string. It is represented by a zero-terminated sequence of characters, which may range from one byte to 128 bytes, including the terminating zero character.
11.2.1.2.4. LARGE_STRING
LARGE_STRING (type 4) is a TLV type designating a large-sized character string. It is represented by a zero-terminated sequence of characters, which may range from one byte to 2048 bytes, including the terminating zero character.
11.2.1.2.5. ULTRA_STRING
ULTRA_STRING (type 5) is a TLV type designating a very large-sized character string. It is represented by a zero-terminated sequence of characters, which may range from one byte to 32768 bytes, including the terminating zero character.
11.2.1.2.6. ERROR_STRING
ERROR_STRING (type 16) is a TLV type designating the reason for the failure of a command request. It is represented by a zero-terminated sequence of characters, which may range from one byte to 128 bytes, including the terminating zero character.
The first character of an ERROR_STRING may be a special error code character, lying in the range 0x80 to 0xFF, which corresponds to one of the following pre-defined reasons:
Value Meaning
----- -------
0x80  The request contains incorrect TLV(s)
0x81  The request requires network administrator privileges
0x83  The designated node does not permit requests from off-node
0x84  The request is not supported
0x85  The request has invalid argument values
Figure 36: Command Error Codes
11.2.1.2.7. NET_ADDR
NET_ADDR (type 17) is a TLV type designating a TIPC network address. It is represented by a 32-bit integer denoting zone, cluster, and node identifiers (using 8, 12, and 12 bits, respectively), with the zone identifier occupying the most significant bits and the node identifier occupying the least significant bits. This value MUST be stored in network byte order.
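The 8/12/12-bit packing described above can be sketched as:

```python
# Sketch only: <Z.C.N> packing with zone in bits 31-24, cluster in
# bits 23-12, and node in bits 11-0 of the 32-bit network address.
def tipc_addr(zone, cluster, node):
    return (zone << 24) | (cluster << 12) | node

def tipc_zone(addr):
    return (addr >> 24) & 0xFF

def tipc_cluster(addr):
    return (addr >> 12) & 0xFFF

def tipc_node(addr):
    return addr & 0xFFF
```

For example, tipc_addr(1, 1, 7) yields 0x01001007, the address written <1.1.7>.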
11.2.1.2.8. MEDIA_NAME
MEDIA_NAME (type 18) is a TLV type designating a media type usable for TIPC messages. It is represented by a zero-terminated sequence of characters, which may range from one byte to 16 bytes, including the terminating zero character.
As an example, the media type for Ethernet bearers is "eth".
11.2.1.2.9. BEARER_NAME
BEARER_NAME (type 19) is a TLV type designating a TIPC bearer. It is represented by a zero-terminated sequence of characters, which may range from one byte to 32 bytes, including the terminating zero character.
The resulting string MUST have the form "medianame:interfacename". For example, an Ethernet bearer may have the name "eth:eth0".
11.2.1.2.10. LINK_NAME
LINK_NAME (type 20) is a TLV type designating a TIPC link endpoint. It is represented by a zero-terminated sequence of characters, which may range from one byte to 60 bytes, including the terminating zero character.
The resulting string MUST have the form "Z.C.N:own_side_interfacename-Z.C.N:peer_side_interfacename". For example, an Ethernet link endpoint may have the name "1.1.7:eth0-1.1.12:eth0".
11.2.1.2.11. NODE_INFO
NODE_INFO (type 21) is a TLV type designating the reachability status (up/down) of a neighboring node. It is represented by the 8-byte structure shown in Figure 37 (Node Availability Info). All fields of this structure MUST be stored in network byte order.
     3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
     1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0  |                          Node Address                         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1  |                               Up                              |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 37: Node Availability Info
The interpretation of the fields of the structure is as follows:
- Node Address: 32 bits : The network address of a neighboring node.
- Up: 32 bits : Non-zero if there is a working link to the specified node.
11.2.1.2.12. LINK_INFO
LINK_INFO (type 22) is a TLV type designating the status (up/down) of a link endpoint. It is represented by the 68-byte structure shown in Figure 38. The first two fields of this structure MUST be stored in network byte order.
     3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
     1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0  |                          Node Address                         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1  |                               Up                              |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2  \                                                               \
    /                           Link Name                           /
w16 \                                                               \
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 38
The interpretation of the fields of the structure is as follows:
- Node Address: 32 bits : The network address of a neighboring node.
- Up: 32 bits : Non-zero if the specified link is working.
- Link Name: 60 bytes : Zero-terminated string identifying a local link endpoint. MUST have format "Z.C.N:own_side_interfacename-Z.C.N:peer_side_interfacename".
11.2.1.2.13. BEARER_CONFIG
BEARER_CONFIG (type 23) is a TLV type used to enable a bearer. It is represented by the 40-byte structure shown in Figure 39 (Value field of BEARER_CONFIG TLV). The first two fields of this structure MUST be stored in network byte order.
     3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
     1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0  |                            Priority                           |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1  |                        Discovery Domain                       |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2  \                                                               \
    /                          Bearer Name                          /
w9  \                                                               \
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 39: Value field of BEARER_CONFIG TLV
The interpretation of the fields of the structure is as follows:
- Priority: 32 bits : Desired priority for bearer.
- Discovery Domain: 32 bits : Network domain whose nodes the bearer will establish links to. This MUST be a domain containing the node itself.
- Bearer Name: 32 bytes : Zero-terminated string designating the name of a bearer. MUST have format "medianame:interfacename".
11.2.1.2.14. LINK_CONFIG
LINK_CONFIG (type 24) is a TLV type used to change the properties of a link. It is represented by the 64-byte structure shown in Figure 40 (Link Configuration Command Format). The first field of this structure MUST be stored in network byte order.
     3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
     1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0  |                             Value                             |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1  \                                                               \
    /                           Link Name                           /
w15 \                                                               \
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 40: Link Configuration Command Format
The interpretation of the fields of the structure is as follows:
- Value: 32 bits : Desired value of the link property being set.
- Link Name: 60 bytes : Zero-terminated string designating a local link endpoint. MUST have format "Z.C.N:interfacename-Z.C.N:interfacename".
11.2.1.2.15. NAME_TBL_QUERY
NAME_TBL_QUERY (type 25) is a TLV type used when requesting name table information. It is represented by the 16-byte structure shown in Figure 41 (Value field of NAME_TBL_QUERY TLV). All fields of this structure MUST be stored in network byte order.
     3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
     1 0 9 8 7 6 5 4|3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w0  |A|                            Depth                            |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w1  |                              Type                             |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w2  |                          Lower Bound                          |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
w3  |                          Upper Bound                          |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 41: Value field of NAME_TBL_QUERY TLV
The interpretation of the fields of the structure is as follows:
- A: 1 bit : All types flag. If this bit is set, the Configuration Server examines all entries in the name table, rather than just the name sequence specified by {Type, Lower Bound, Upper Bound}; in such cases Type, Lower Bound, and Upper Bound MUST be set to zero. If this bit is clear, only names having a type value of Type are examined.
- Depth: 31 bits : Amount of information to be displayed. The value MUST be one of the following:
Depth Meaning
----- -------
  1   Display port name type info only
  2   As 1, but also display port name instance info
  3   As 2, but also display port identity info
  4   As 3, but also display any additional info available
Figure 42: Name Table Display Codes
- Type: 32 bits : The type value in the set of desired port names.
- Lower Bound: 32 bits : The lowest instance value in the set of desired port names.
- Upper Bound: 32 bits : The highest instance value in the set of desired port names.
11.2.1.3. Command Message Descriptions
The set of commands that MAY be supported by the Configuration Service are described in this section.
The description of the command reply message for each command assumes that the associated command request is executed successfully. If an error occurs during the processing of a request the Configuration Service MUST include a TLV of type ERROR_STRING as part of the command reply returned to the requesting application.
11.2.1.3.1. NOOP
NOOP (command 0x0000) is a public command that performs no action. This command may be useful for demonstrating that an application can interact successfully with the Configuration Service.
The command request contains no TLV. The command reply contains no TLV.
11.2.1.3.2. GET_NODES
GET_NODES (command 0x0001) is a public command that is used to obtain information about the status of a node's neighbors.
The command request contains a single TLV of type NET_ADDR, which represents a network domain. The command reply contains zero or more TLVs of type NODE_INFO, one for each node within the specified domain that this node has a direct link to (even if it is not currently operational).
11.2.1.3.3. GET_MEDIA_NAMES
GET_MEDIA_NAMES (command 0x0002) is a public command that is used to obtain the names of all media types currently configured on a node.
The command request contains no TLV. The command reply contains zero or more TLVs of type MEDIA_NAME.
11.2.1.3.4. GET_BEARER_NAMES
GET_BEARER_NAMES (command 0x0003) is a public command that is used to obtain the names of all bearers currently configured on a node.
The command request contains no TLV. The command reply contains zero or more TLVs of type BEARER_NAME.
11.2.1.3.5. GET_LINKS
GET_LINKS (command 0x0004) is a public command that is used to obtain information about the status of a node's link endpoints.
The command request contains a single TLV of type NET_ADDR, which specifies a network domain. The command reply contains zero or more TLVs of type LINK_INFO, corresponding to the node's own broadcast link endpoint and any link endpoint whose peer node lies within the specified network domain.
11.2.1.3.6. SHOW_NAME_TABLE
SHOW_NAME_TABLE (command 0x0005) is a public command that is used to obtain information about the contents of a node's name table.
The command request contains a single TLV of type NAME_TBL_QUERY. The command reply contains a single TLV of type ULTRA_STRING, whose content is unspecified.
11.2.1.3.7. SHOW_PORTS
SHOW_PORTS (command 0x0006) is a public command that is used to obtain status and statistics information about the ports currently in use on a node.
The command request contains no TLV. The command reply contains a single TLV of type ULTRA_STRING, whose content is unspecified.
11.2.1.3.8. SHOW_LINK_STATS
SHOW_LINK_STATS (command 0x000B) is a public command that is used to obtain status and statistics information about a link endpoint.
The command request contains a single TLV of type LINK_NAME. The command reply contains a single TLV of type ULTRA_STRING, whose content is unspecified.
11.2.1.3.9. SHOW_STATS
SHOW_STATS (command 0x000F) is a public command that is used to obtain status and statistics information about TIPC for a node.
The command request contains a single TLV of type UNSIGNED, which indicates the information to be obtained; a value of zero returns all available information, while no other values are currently defined. The command reply contains a single TLV of type ULTRA_STRING, whose content is unspecified.
11.2.1.3.10. GET_REMOTE_MNG
GET_REMOTE_MNG (command 0x4003) is a protected command that is used to determine whether a node can be remotely managed by another node in the TIPC network.
The command request contains no TLV. The command reply contains a single TLV of type UNSIGNED; a value of zero indicates that the node's Configuration Service is unable to process command requests issued by another node, while any other value indicates that processing of off-node command requests is enabled.
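A reply like this is consumed by locating its single UNSIGNED TLV and extracting the 32-bit value. The sketch below assumes the same TLV framing as elsewhere in this specification (16-bit length including the header, 16-bit type, network byte order); the numeric UNSIGNED type value is an assumption for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed TLV type value for UNSIGNED; the authoritative number is the
 * one assigned in the TLV type table of this specification. */
#define TIPC_TLV_UNSIGNED 2

/* Extract the 32-bit value from an UNSIGNED TLV at the start of buf.
 * Returns 0 on success and -1 if the TLV is malformed or of the wrong
 * type. */
static int parse_unsigned_tlv(const uint8_t *buf, size_t len, uint32_t *val)
{
    if (len < 8)
        return -1;
    uint16_t tlv_len  = ((uint16_t)buf[0] << 8) | buf[1];
    uint16_t tlv_type = ((uint16_t)buf[2] << 8) | buf[3];
    if (tlv_type != TIPC_TLV_UNSIGNED || tlv_len < 8)
        return -1;
    *val = ((uint32_t)buf[4] << 24) | ((uint32_t)buf[5] << 16) |
           ((uint32_t)buf[6] << 8)  |  (uint32_t)buf[7];
    return 0;
}
```

For GET_REMOTE_MNG, a decoded value of zero means off-node command requests are rejected; any non-zero value means they are processed.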
11.2.1.3.11. GET_MAX_PORTS
GET_MAX_PORTS (command 0x4004) is a protected command that is used to obtain the maximum number of ports that can be supported simultaneously by a node.
The command request contains no TLV. The command reply contains a single TLV of type UNSIGNED.
11.2.1.3.12. GET_NETID
GET_NETID (command 0x400B) is a protected command that is used to obtain the TIPC network identifier used by a node.
The command request contains no TLV. The command reply contains a single TLV of type UNSIGNED.
11.2.1.3.13. ENABLE_BEARER
ENABLE_BEARER (command 0x4101) is a protected command that is used to initiate a node's use of the specified bearer for TIPC messaging. The node will respond to requests from neighboring nodes to establish new links if the nodes lie within the specified discovery domain.
The command request contains a single TLV of type BEARER_CONFIG. The command reply contains no TLV.
11.2.1.3.14. DISABLE_BEARER
DISABLE_BEARER (command 0x4102) is a protected command that is used to terminate a node's use of the specified bearer for TIPC messaging. The node deletes all existing link endpoints that utilize that bearer and will ignore all requests from neighboring nodes to establish new links.
The command request contains a single TLV of type BEARER_NAME. The command reply contains no TLV.
11.2.1.3.15. SET_LINK_TOL
SET_LINK_TOL (command 0x4107) is a protected command that is used to configure the tolerance attribute of a link endpoint. (The tolerance attribute of the link's peer endpoint will be configured to match automatically.)
The command request contains a single TLV of type LINK_CONFIG. The command reply contains no TLV.
11.2.1.3.16. SET_LINK_PRI
SET_LINK_PRI (command 0x4108) is a protected command that is used to configure the priority attribute of a link endpoint. (The priority attribute of the link's peer endpoint will be configured to match automatically.)
The command request contains a single TLV of type LINK_CONFIG. The command reply contains no TLV.
11.2.1.3.17. SET_LINK_WINDOW
SET_LINK_WINDOW (command 0x4109) is a protected command that is used to configure the message window attribute of a link endpoint. (Unlike the tolerance and priority attributes, the window attribute of the link's peer endpoint is not configured to match automatically.)
The command request contains a single TLV of type LINK_CONFIG. The command reply contains no TLV.
11.2.1.3.18. RESET_LINK_STATS
RESET_LINK_STATS (command 0x410C) is a protected command that is used to reset the statistics counters for a link endpoint.
The command request contains a single TLV of type LINK_NAME. The command reply contains no TLV.
11.2.1.3.19. SET_NODE_ADDR
SET_NODE_ADDR (command 0x8001) is a private command that is used to configure the network address of a node.
The command request contains a single TLV of type NET_ADDR, indicating the desired network address. The command reply contains no TLV.
11.2.1.3.20. SET_REMOTE_MNG
SET_REMOTE_MNG (command 0x8003) is a private command that is used to configure whether a node can be remotely managed by another node in the TIPC network.
The command request contains a single TLV of type UNSIGNED; a value of zero disables the node's Configuration Service from processing command requests issued by another node, while any other value enables processing of off-node command requests. The command reply contains no TLV.
11.2.1.3.21. SET_MAX_PORTS
SET_MAX_PORTS (command 0x8004) is a private command that is used to configure the maximum number of ports that can be supported simultaneously by a node.
The command request contains a single TLV of type UNSIGNED. The command reply contains no TLV.
11.2.1.3.22. SET_NETID
SET_NETID (command 0x800B) is a private command that is used to configure the TIPC network identifier used by a node.
The command request contains a single TLV of type UNSIGNED. The command reply contains no TLV.
12. Security Considerations
TIPC is a special-purpose transport protocol designed for operation within a secure, closed network of interconnected nodes within a cluster. TIPC does not possess any native security features, and relies on the properties of the selected bearer protocol (e.g., IPsec) when such features are needed.
13. IANA Considerations
This memo includes no request to IANA.
14. Contributors
15. Acknowledgements
Thanks to Marshall Rose for developing the XML2RFC format.
16. References
16.1. Normative References
[RFC2119] Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels,” BCP 14, RFC 2119, March 1997.
16.2. Informative References
[RFC0793] Postel, J., “Transmission Control Protocol,” STD 7, RFC 793, September 1981.
[RFC2960] Stewart, R., Xie, Q., Morneault, K., Sharp, C., Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M., Zhang, L., and V. Paxson, “Stream Control Transmission Protocol,” RFC 2960, October 2000.
[RFC2104] Krawczyk, H., Bellare, M., and R. Canetti, “HMAC: Keyed-Hashing for Message Authentication,” RFC 2104, February 1997.
Appendix A. Change Log
The following changes have been made from draft-spec-tipc-09.
- Rewritten abstract.
- Filled numerous remaining empty chapters and paragraphs.
- Updated many other chapters and paragraphs for more accuracy.
- Stylistic and language changes.
- Updated network hierarchy section to reflect that we now have looser requirements on intra-cluster and intra-zone connectivity.
- Removed all references to TIPC-level message routing, since this is not supported.
- Added description of broadcast initial synchronization protocol.
- Introduced SYN bit in message header.
- Extended neighbor discovery protocol header to 64 bytes.
- Added definition of a capability bit in neighbor discovery protocol.
- Introduced two new bearers in neighbor discovery protocol: InfiniBand and UDP/IPv4.
Appendix B. Remaining Issues
This document is a "work-in-progress" edition of the specification for version 2 of the TIPC protocol. It is believed to be accurate and complete, although there is certainly potential for improvement in many chapters.
This document reflects the capabilities of TIPC 2.0 as implemented by the Open Source TIPC project (see http://tipc.sf.net).
Authors' Addresses
Jon Paul Maloy
Ericsson
8400, boul. Decarie
Ville Mont-Royal, Quebec H4P 2N2
Canada
Phone: +1 514 591-5578
EMail: jon.maloy@ericsson.com

Allan Stephens
Wind River
350 Terry Fox Drive, Suite 200
Kanata, ON K2K 2W5
Canada
Phone: +1 613 270-2259
EMail: allan.stephens@windriver.com