TIPC Programmer's Guide

Jon Maloy, Red Hat Inc., October 2020


Introduction

This document is designed to assist software developers writing applications that use TIPC. Introductory information about the TIPC protocol, its main concepts, and instructions for setting up and operating a node and a cluster are available at the project website at http://www.tipc.io. To benefit from this text, the reader should also have at least basic knowledge about computer networking and be familiar with socket programming.

It should be noted that TIPC has undergone significant functional changes and upgrades between Linux 3.10, the version referred to in the now superseded TIPC 2.0 documentation, and Linux 5.8, the version on which this document is based.

The most important of those changes are:

  • A new, work queue based topology server. (Introduced by Ying Xue in Linux 3.10.)
  • A new management tool, tipc, obsoleting the old tipc-config. (Richard Alpe, Linux 3.18.)
  • Introduction of kernel name space support. (Ying Xue, Linux 3.19.)
  • Ability to carry TIPC over UDP/IPv4 or UDP/IPv6 bearers. (Erik Hugne, Linux 4.0.)
  • A more robust, performant and scalable multicast service. (Jon Maloy, Linux 4.3.)
  • Ability to scale out to hundreds of nodes, using the Overlapping Ring Monitoring Algorithm. (Jon Maloy, Linux 4.7.)
  • Introduction of a message bus service through the Communication Group feature. (Jon Maloy, Linux 4.14.)
  • Obsoleted the zone concept, which was never implemented anyway. (Jon Maloy, Linux 4.17.)
  • Introduced a flat, 128 bit node identity, replacing the previous structured 32 bit node address. (Jon Maloy, Linux 4.17.)
  • Making the node identity auto configurable. (Jon Maloy, Linux 4.17.)
  • Changed terminology in the user API, from port name to service address, port name sequence to service range and port identity to socket address. (Jon Maloy, Linux 4.17.)
  • Added extensive tracing support. (Tuong Lien, Linux 4.20.)
  • Introduced a 'Smart Nagle' mechanism that switches itself on and off depending on the traffic characteristics. This doubles stream socket throughput for small message traffic. (Jon Maloy, Linux 5.4.)
  • Added a built-in AEAD-based encryption and authentication protocol for all inter-node traffic. (Tuong Lien, Linux 5.4.) NOT YET IN RHEL-8.
  • A 'wormhole' inter-name space mechanism that makes it possible for messages to take a shortcut, bypassing any network interfaces, between containers located on the same host. This gives inter-container traffic the same performance as normal intra-node traffic, and much better than TCP. (Hoang Le, Linux 5.4.)
  • Introduced a variable window link flow control algorithm, based on the Reno algorithm, giving a 25% improvement in maximum link throughput. (Jon Maloy/Xin Long, Linux 5.5.)
  • Made binding table updates, which are typically broadcast in nature, use true network broadcast when the network infrastructure so permits. (Hoang Le, Linux 5.8.) NOT YET IN RHEL-8.

During the same period there has been a huge effort to improve TIPC code quality, regarding readability, maintainability, robustness and performance. Data structures have been simplified and disentangled, the locking policy has been much improved, and the code aligned with common Linux coding style guidelines.
It should be emphasized that through all these changes great care has been taken to remain backwards compatible with earlier versions of TIPC, both regarding the protocol and the user API.

1. TIPC Fundamentals

A brief summary of the major concepts used in TIPC is provided in the following sections.

1.1. Cluster

A TIPC network consists of individual processing elements or nodes. Those nodes are arranged into clusters according to their assigned cluster identity. All nodes having the same cluster identity will establish links to each other, provided the network is set up to allow mutual neighbor discovery between them. It is only necessary to change the cluster identity from its default value if nodes in different clusters may potentially discover each other, e.g., if they are attached to the same subnet. Nodes in different clusters cannot communicate with each other using TIPC.

Each node has a 128 bit node identity which must be unique within the node's cluster. If the node will be part of a cluster, the user can either rely on the auto configuration capability of the node, where the identity is generated when the first bearer is attached, or he can set the identity explicitly, e.g., from the node's host name or a UUID. If a node will not be part of a cluster its identity can remain at the default value, zero.

Each node is also uniquely identified by a 32 bit hash number. This number is generated from the node identity, and is used internally by the protocol as node address. An extension to the neighbor discovery protocol guarantees that its value is unique within the cluster. This number is helpful both for protocol performance and for backwards compatibility.

1.2. Addressing

The TIPC socket API provides three different address types:

  • Service Address. This address type consists of a 32 bit service type identifier and a 32 bit service instance identifier. The type identifier is typically determined and hard coded by the user application programmer, but its value may have to be coordinated with other applications which might be present in the same cluster. The instance identifier is often calculated by the program, based on application specific criteria.

  • Service Range. This address type represents a range of service addresses of the same type and with instances between a lower and an upper range limit. By binding a socket to this address type one can make it represent many instances, something which has proved useful in many cases. This address type is also used as multicast address, which will be explained in more detail later.

  • Socket Address. This address is a reference to a specific socket in the cluster. It contains a 32 bit port number and a 32 bit node hash number. The port number is generated by the system when the socket is created, and the node hash is generated from the corresponding node identity as explained earlier. An address of this type can be used for connecting or for sending messages in the same way as service addresses can be used, but is only valid as long as the referenced socket exists.

When binding a socket to a service address or address range, the visibility scope of the binding must be indicated. There are two options, TIPC_NODE_SCOPE if the user only wants node local visibility, and TIPC_CLUSTER_SCOPE if he wants cluster global visibility. There are almost no limitations to how sockets can be bound to service addresses: one socket can be bound to many addresses or ranges, and many sockets can be bound to the same address or range. The service types 0 through 63 are however reserved for system internal use, and are not available for user space applications.

When sending a message by service address the sender may indicate a lookup scope, also called lookup domain. This is a node hash number, limiting the set of eligible destination sockets to the indicated node. If this value is zero, all matching sockets in the whole cluster, as visible from the source node, are eligible.

1.3. Messaging

TIPC message transmission can be performed in different ways.

1.3.1. Datagram Messaging

Datagram messages are discrete data units between 1 and 66,000 bytes in length, transmitted between non-connected sockets of type SOCK_DGRAM or SOCK_RDM. Just like their UDP counterparts, TIPC datagrams are not guaranteed to reach their destination, but their chances of being delivered are still much better than for the former. Because of the link layer delivery guarantee, the only limiting factor for datagram delivery is the socket receive buffer size. The chances of success can also be increased by the sender, by giving his socket an appropriate delivery importance priority. There are four such priority levels. The default value is TIPC_LOW_IMPORTANCE, but this can be changed via the socket option TIPC_IMPORTANCE.

Furthermore, when there is receive buffer overflow, the sender can choose whether he just wants his message to be dropped, or if he should receive an indication about the failure to deliver. In the latter case, a potentially truncated (to the first 1024 bytes) copy of the original message is returned to the sender, along with the error code TIPC_ERR_OVERLOAD. This mechanism also works for other types of delivery failure, so a user may even encounter the error codes TIPC_ERR_NO_NODE and TIPC_ERR_NO_PORT. The sender reads the error code and returned bytes as ancillary data from the socket. The default setting for this property is off, but it can be enabled via the socket option TIPC_DEST_DROPPABLE.
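
As an illustration, the following minimal sketch creates a SOCK_RDM socket, raises its delivery importance, and clears the 'droppable' property so that rejected messages are returned to it. Error handling is omitted for brevity.

#include <sys/socket.h>
#include <linux/tipc.h>

static int make_rdm_socket(void)
{
        int sd = socket(AF_TIPC, SOCK_RDM, 0);
        int imp = TIPC_HIGH_IMPORTANCE;
        int droppable = 0; /* 0 = rejected messages are returned to sender */

        setsockopt(sd, SOL_TIPC, TIPC_IMPORTANCE, &imp, sizeof(imp));
        setsockopt(sd, SOL_TIPC, TIPC_DEST_DROPPABLE, &droppable, sizeof(droppable));
        return sd;
}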

Datagram messages can be sent either by socket address, service address or multicast address.

  • If a socket address is indicated the message is transmitted to that exact socket.

  • When a service address is used, there might be several matching destinations, and the transmission method becomes what is often denoted anycast, i.e., any of the matching destinations may be selected. When this is the case, the function translating from service address to socket address uses a round-robin algorithm to decrease the risk of load bias among the destinations. It should however be noted that this algorithm is node global, so unless a sender is alone on the node to use this address he has no guarantee that his particular messages will be evenly distributed among the destinations.

  • The address type service range also doubles as multicast address. When an application specifies a service range as destination address, it effectively instructs TIPC to send a copy of the message to all matching sockets in the cluster. Any socket bound to one or more instances inside the indicated multicast range will receive exactly one copy of the message, never more. Multicast datagram messaging differs from its unicast/anycast counterpart in one respect: the lookup scope is always cluster global, and cannot be changed.

The risk of message rejection can be reduced by increasing the receive buffer size from the default value.

Datagram messages sent by service address may be subject to another mechanism intended to reduce the risk of delivery failure. Because of the transaction-free and non-atomic nature of binding table updates, a primary address lookup for a message may be successful on the source node, while it turns out that the destination socket has disappeared at arrival on the target node. In such cases, a secondary service lookup is attempted on the destination node, and only if that fails is the message dropped or returned to sender.

When a datagram is received in a socket, the receiver can read out the source socket address from the recvmsg() control block, as one would expect. In addition, it is possible to read out the service address the sender was using, if any. This feature might prove convenient in some cases.

Because of the lack of a delivery guarantee for datagram messages, this transmission method should only be used when the programmer feels confident there is no risk of receive buffer overflow, or that he can handle the consequences. If he needs a more robust mechanism, with end-to-end flow control, he should instead consider using group messaging.
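
The following sketch illustrates anycast and multicast datagram transmission as described above. The service type 18888 and the instance values are arbitrary examples, and error handling is omitted.

#include <string.h>
#include <sys/socket.h>
#include <linux/tipc.h>

static void send_examples(int sd)
{
        struct sockaddr_tipc dst;
        const char msg[] = "hello";

        /* Anycast to any socket bound to service address {18888, 17} */
        memset(&dst, 0, sizeof(dst));
        dst.family = AF_TIPC;
        dst.addrtype = TIPC_SERVICE_ADDR;
        dst.addr.name.name.type = 18888;
        dst.addr.name.name.instance = 17;
        dst.addr.name.domain = 0;  /* 0 = cluster global lookup scope */
        sendto(sd, msg, sizeof(msg), 0, (struct sockaddr *)&dst, sizeof(dst));

        /* Multicast to all sockets bound inside range {18888, 0..99} */
        memset(&dst, 0, sizeof(dst));
        dst.family = AF_TIPC;
        dst.addrtype = TIPC_ADDR_MCAST;
        dst.addr.nameseq.type = 18888;
        dst.addr.nameseq.lower = 0;
        dst.addr.nameseq.upper = 99;
        sendto(sd, msg, sizeof(msg), 0, (struct sockaddr *)&dst, sizeof(dst));
}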

1.3.2. Connection Oriented Messaging

Connections can be established and used much like we are used to from TCP: the server creates a SOCK_STREAM socket and calls accept(); the client creates a blocking or non-blocking socket of the same type, does connect() on it and starts sending or receiving. The address types used can be any of service address or socket address (client side), or service address or service range (server side).

TIPC does however provide two varieties of this scenario, which may be useful in some cases.

  • First, instead of being of type SOCK_STREAM, the sockets can be created as SOCK_SEQPACKET, implying that data exchange must happen in units of maximum 66,000 byte messages.

  • Second, a client can initialize a connection by simply sending a data message to an accept()'ing socket. Likewise, the spawned server socket can respond with a data message back to the client to complete the connection. This way, TIPC provides an implied, also known as 0-RTT, connection setup mechanism that is particularly time saving in many cases.

The most distinguishing property of TIPC connections is still their ability to react promptly to loss of contact with the peer socket, without resorting to active neighbor heart-beating.

  • When a socket is ungracefully closed, either by the user or because of a process crash, the kernel socket code will on its own initiative issue a FIN message to the peer.

  • When contact to a cluster node is lost, the local link layer will issue FIN messages to all sockets having connections towards that node. The peer node failure discovery time is configurable down to 50 ms, while the default value is 1,500 ms.

  • To handle the very unlikely scenario of a half-finished, dangling connection, each socket endpoint maintains a 1-hour period timer to probe the peer if it has been silent during the past period. If there has been no response at the next timeout expiration the connection is aborted.
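
As an illustration of the connection setup described above, here is a minimal server side sketch; the service type 18888 is an arbitrary example value and error handling is omitted. A client side sketch, using implied connect, is shown in section 7.3.

#include <string.h>
#include <sys/socket.h>
#include <linux/tipc.h>

static int server_socket(void)
{
        struct sockaddr_tipc srv;
        int lsd = socket(AF_TIPC, SOCK_SEQPACKET, 0);

        memset(&srv, 0, sizeof(srv));
        srv.family = AF_TIPC;
        srv.addrtype = TIPC_SERVICE_ADDR;
        srv.scope = TIPC_CLUSTER_SCOPE;
        srv.addr.name.name.type = 18888;
        srv.addr.name.name.instance = 17;

        bind(lsd, (struct sockaddr *)&srv, sizeof(srv));
        listen(lsd, 0);                  /* backlog parameter is ignored */
        return accept(lsd, NULL, NULL);  /* blocks until a client connects */
}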

1.3.3. Group Messaging

Group messaging can, somewhat simplistically, be described as datagram messaging with end-to-end flow control, and hence with a delivery guarantee. There are however a few notable differences that must be described further.

  • Messaging can only be done within a closed group of member sockets.

  • A socket joins a group by calling the socket option TIPC_GROUP_JOIN with struct tipc_group_req* as argument. Part of this structure is a service address, where the type field indicates the group identity and the instance field indicates member identity. Hence, a member can only bind to one single service address, and nothing more.

  • When sending an anycast message, the lookup algorithm applies the regular round-robin algorithm. However, it also considers the current load, i.e., the advertised send window, on potential receivers before making a selection.

  • Just like with regular datagram multicasting, group multicast is performed by indicating a service range as destination. However, in group multicast only the lower value of the range is considered during lookup. This means that only those members which have joined the group with exactly that instance value will receive a copy of a sent multicast message.

  • There is also a group broadcast mode which transmits a message to all group members, without considering their member identity. The sender indicates his intention to broadcast by using the send() primitive.

  • There is a dedicated coordinated sliding window protocol in place to handle "traffic crunches", i.e., cases where many members may try to simultaneously send traffic to the same destinations. This means that the delivery guarantee is valid even in such extreme cases.

  • When receiving a message, the receiver uses recvmsg(), and can from the accompanying struct msghdr read out both source addresses of the sender: its socket address and its bound service address.

  • Apart from the message delivery guarantee, there is also a sequentiality guarantee. This guarantee is even valid between messaging modes, e.g., a member may send unicast, anycast, multicast or broadcast in any order, and the messages are guaranteed to be received in the same order by any particular recipient.

  • When joining a group, a member may indicate if it wants to be eligible for loopback copies of its own sent anycast/multicast/broadcast messages. By default, this setting is off.

  • When joining a group, a member may indicate if it wants to receive join or leave events for other members of the group. This feature leverages the service tracking feature, but contrary to other users, a group member will receive the events in the member socket proper. Because of this, it has become possible to issue a sequentiality guarantee: a join event will always arrive before the first message from a new member, and a leave event is always delivered after the last message from the leaving member. An event is simply the reception of an empty out-of-band message, accompanied by the new member's two source addresses. In addition, a leave event message has the EOR bit set. The default value of the event subscription setting is off.

Message groups are both scalable and performant, but how much depends on the members' distribution across nodes and on the traffic pattern. On a single node with full traffic blast there is no problem establishing a group with, e.g., 64 members, while there can be more if the group is distributed across nodes. If optimal performance for a sender is important, it is recommended not to let it switch between unicast/anycast and multicast/broadcast too frequently.
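
A minimal sketch of a member joining a group and sending a group broadcast might look as follows. The group and member identities are arbitrary example values, and error handling is omitted.

#include <string.h>
#include <sys/socket.h>
#include <linux/tipc.h>

static int join_group(void)
{
        struct tipc_group_req mreq;
        int sd = socket(AF_TIPC, SOCK_RDM, 0);

        memset(&mreq, 0, sizeof(mreq));
        mreq.type = 4711;                    /* group identity  */
        mreq.instance = 42;                  /* member identity */
        mreq.scope = TIPC_CLUSTER_SCOPE;
        mreq.flags = TIPC_GROUP_MEMBER_EVTS; /* subscribe to join/leave events */
        setsockopt(sd, SOL_TIPC, TIPC_GROUP_JOIN, &mreq, sizeof(mreq));

        /* A send() without destination address is a group broadcast */
        send(sd, "to all members", 15, 0);
        return sd;
}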

1.4. Service Tracking

TIPC provides a service tracking function that makes it possible for an application to follow the availability of service addresses and service ranges in the cluster.

1.4.1. Service Subscription

An application accesses the topology service by opening a SOCK_SEQPACKET type connection to the TIPC internal topology service, using the service address {TIPC_TOP_SRV, TIPC_TOP_SRV}. It can then send one or more service subscription messages to the topology service, indicating the service address or range it wants to track. In return, the topology service sends service event messages back to the application whenever matching addresses are bound or unbound by sockets within the cluster. An application is allowed to have multiple subscriptions active at the same time, using the same connection.

The exchange of messages between application and topology service is entirely asynchronous. The application may issue new subscription requests at any time, while the topology service may send event messages about matching bindings to the application at any time.

The connection between the application and the topology service continues until the application terminates it, or until the topology service encounters an error that requires it to terminate the connection. When the connection ends, for whatever reason, all pertaining subscriptions are automatically canceled by TIPC.

Although service subscriptions are most often directed towards the node local topology server, it is fully possible to establish connections to other nodes' servers as well. This might be useful if there is a need to subscribe for node local bindings on a remote node. It should be noted that there is no need to issue the subscription in network byte order in such cases: the receiving topology server will detect the used representation and respond to it correspondingly.

1.4.2. Cluster Topology Subscription

When TIPC establishes contact with another node, it internally creates a binding {type = TIPC_NODE_STATE, instance = peer node hash number} in the binding table. This makes it possible for applications on a node to keep track of reachable peer nodes at any time.

1.4.3. Cluster Connectivity Subscription

When TIPC establishes a new link to another node, it internally creates a binding {type = TIPC_LINK_STATE, instance = peer node hash number} in the binding table. This makes it possible for applications on a node to keep track of working links to peer nodes at any time. This type of binding differs from the topology subscription binding described above in that there may be two links, and hence two bindings, to keep track of for each peer node. Although this binding type is only published with node visibility, it is possible to combine it with remote node topology subscriptions, as described above, to obtain a full and continuous matrix view of the connectivity in the cluster.


2. Socket API

Programmers can access the capabilities of TIPC using the BSD socket API. It supports the common socket API a programmer will know from other socket-based protocols, but in the context of TIPC some routines are given a specific interpretation that has to be understood. This section outlines those differences.

2.1. TIPC Specific Structures

The TIPC specific parts of the socket API are defined in /usr/include/linux/tipc.h in most Linux distributions. The following structures are defined:

2.1.1. Address Types

struct tipc_socket_addr {
        __u32 ref;
        __u32 node;
};

struct tipc_service_addr {
        __u32 type;
        __u32 instance;
};

struct tipc_service_range {
        __u32 type;
        __u32 lower;
        __u32 upper;
};

struct sockaddr_tipc {
        unsigned short family;
        unsigned char  addrtype;
        signed   char  scope;
        union {
                struct tipc_socket_addr id;
                struct tipc_service_range nameseq;
                struct {
                        struct tipc_service_addr name;
                        __u32 domain;
                } name;
        } addr;
};
The meanings of the first three address types are described in the Addressing section above. The last one,
struct sockaddr_tipc, is used during interaction with the local socket, and is filled in as follows:
  • family must always be AF_TIPC.
  • addrtype is one of TIPC_SOCKET_ADDR, TIPC_SERVICE_ADDR, TIPC_SERVICE_RANGE or
    TIPC_ADDR_MCAST. The latter value is used when sending multicast messages, and indicates a service range.
  • scope is one of TIPC_CLUSTER_SCOPE or TIPC_NODE_SCOPE, and is only used during binding.
  • domain indicates a lookup scope during sending, and contains either a valid node hash number or zero.
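
As an example, the following sketch binds a socket to the service range {18888, 0..99} with cluster visibility, and later unbinds it again by repeating the call with the arithmetic inverse of the scope, as described under bind() in section 2.2. The type value is arbitrary and error handling is omitted.

#include <string.h>
#include <sys/socket.h>
#include <linux/tipc.h>

static void bind_unbind(int sd)
{
        struct sockaddr_tipc a;

        memset(&a, 0, sizeof(a));
        a.family = AF_TIPC;
        a.addrtype = TIPC_SERVICE_RANGE;
        a.scope = TIPC_CLUSTER_SCOPE;
        a.addr.nameseq.type = 18888;
        a.addr.nameseq.lower = 0;
        a.addr.nameseq.upper = 99;
        bind(sd, (struct sockaddr *)&a, sizeof(a));  /* bind   */

        a.scope = -TIPC_CLUSTER_SCOPE;
        bind(sd, (struct sockaddr *)&a, sizeof(a));  /* unbind */
}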

2.1.2. Service Tracking Definitions

struct tipc_subscr {
        struct tipc_service_range seq;
        __u32 timeout;
        __u32 filter;
        char usr_handle[8];
};

struct tipc_event {
        __u32 event;
        __u32 found_lower;
        __u32 found_upper;
        struct tipc_socket_addr port;
        struct tipc_subscr s;
};

Once a connection has been established to the topology service, the application can send instances of struct tipc_subscr, one by one, to it. Its fields must be filled in as follows:

  • seq: The service address or range of interest. If only a single service address is tracked the upper and lower fields are set to be equal.
  • timeout: a value specifying, in milliseconds, the life time of the subscription. If this field is set to TIPC_WAIT_FOREVER the subscription will never expire.
  • filter: A bit field specifying how the topology service should act on the subscription. Three possible actions can be specified:
    - TIPC_SUB_SERVICE: The subscriber only wants 'edge' events, i.e., a TIPC_PUBLISHED event when the first matching binding is encountered, and a TIPC_WITHDRAWN event when the last matching binding is removed from the binding table. Hence, this event type only informs whether there currently exists any matching binding in the cluster.
    - TIPC_SUB_PORTS: The subscriber wants an event for each matching update of the binding table. This way, it can keep track of each individual matching service binding that exists in the cluster.
    - TIPC_SUB_CANCEL: The subscriber doesn't want any more events for this service range, i.e., the subscription is canceled. Apart from the 'cancel' bit, this subscription must be a copy of the original subscription request.
  • usr_handle: A 64 bit field which is set and used at the subscriber's discretion.

After the subscription is sent, the subscriber will receive zero or more messages containing a struct tipc_event with the following information:

  • event: Indicates type of event. Three values are possible:
    - TIPC_PUBLISHED: A matching binding was found in the binding table. If there are matches in the table at the moment the subscription is issued, there will be an event for one or all of these matches, as specified by the event filter. Thereafter there will be an event for each change, also depending on what was specified.
    - TIPC_WITHDRAWN: A matching binding was removed from the binding table. This event type also follows the rules specified in the event filter.
    - TIPC_SUBSCR_TIMEOUT: The subscription expired, as specified by the given timeout value, and has been removed.
  • found_lower/found_upper: Describes the matching binding's range. This range is not necessarily the same as the one subscribed for, as any overlap between the two is considered a match and triggers an event.
  • port: The socket address of the matching socket.
  • s: An exact copy of the original subscription request.
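
Putting the above together, a minimal sketch of a subscriber tracking all bindings of an (arbitrarily chosen) service type 18888 might look as follows; error handling is omitted.

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/tipc.h>

static void track_service_type(void)
{
        struct sockaddr_tipc topsrv;
        struct tipc_subscr sub;
        struct tipc_event evt;
        int sd = socket(AF_TIPC, SOCK_SEQPACKET, 0);

        /* Connect to the topology service at {TIPC_TOP_SRV, TIPC_TOP_SRV} */
        memset(&topsrv, 0, sizeof(topsrv));
        topsrv.family = AF_TIPC;
        topsrv.addrtype = TIPC_SERVICE_ADDR;
        topsrv.addr.name.name.type = TIPC_TOP_SRV;
        topsrv.addr.name.name.instance = TIPC_TOP_SRV;
        connect(sd, (struct sockaddr *)&topsrv, sizeof(topsrv));

        /* Subscribe for all instances of service type 18888, forever */
        memset(&sub, 0, sizeof(sub));
        sub.seq.type = 18888;
        sub.seq.lower = 0;
        sub.seq.upper = 0xffffffff;
        sub.timeout = TIPC_WAIT_FOREVER;
        sub.filter = TIPC_SUB_PORTS;  /* one event per matching binding */
        send(sd, &sub, sizeof(sub), 0);

        while (recv(sd, &evt, sizeof(evt), 0) == sizeof(evt)) {
                if (evt.event == TIPC_PUBLISHED)
                        printf("up: {%u, %u..%u} at <%u:%u>\n",
                               evt.s.seq.type, evt.found_lower, evt.found_upper,
                               evt.port.node, evt.port.ref);
                else if (evt.event == TIPC_WITHDRAWN)
                        printf("down: {%u, %u..%u}\n",
                               evt.s.seq.type, evt.found_lower, evt.found_upper);
        }
}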

2.1.3. Socket Control Structures

 
struct tipc_group_req {
        __u32 type;
        __u32 instance;
        __u32 scope;
        __u32 flags;
};
struct tipc_group_req is passed along in a setsockopt() call to join a communication group, and in
getsockopt() to read out the own membership settings after joining. Its fields are used as follows:
  • type: The group identity.
  • instance: The member identity.
  • scope: Visibility scope. TIPC_CLUSTER_SCOPE or TIPC_NODE_SCOPE.
  • flags: Bit field with member properties: TIPC_GROUP_LOOPBACK and/or TIPC_GROUP_MEMBER_EVTS.
struct tipc_sioc_nodeid_req {
        __u32 peer;
        char node_id[TIPC_NODEID_LEN];
};

struct tipc_sioc_ln_req {
        __u32 peer;
        __u32 bearer_id;
        char linkname[TIPC_MAX_LINK_NAME];
};
  • struct tipc_sioc_nodeid_req is used with ioctl() to retrieve a node identity, given a node hash number as key.
  • struct tipc_sioc_ln_req is used with ioctl() to retrieve a link name, given a peer node hash number and a bearer identity as key. The hash number is obtained from tipc_event::port::node, and the bearer identity is obtained from the lowest 16 bits of tipc_event::port::ref in a received cluster connectivity event. (The upper 16 bits contain the peer endpoint's bearer identity, and make it possible for a third party to correlate the two endpoints.) A cluster connectivity subscription is one ordered with a {TIPC_LINK_STATE, 0, 0xffffffff} service range.
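
As an illustration, the following sketch resolves the link name behind a received cluster connectivity event, using the key extraction rules just described. Error handling is omitted.

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/tipc.h>

static void print_link_name(int sd, const struct tipc_event *evt)
{
        struct tipc_sioc_ln_req lnr;

        memset(&lnr, 0, sizeof(lnr));
        lnr.peer = evt->port.node;               /* peer node hash number */
        lnr.bearer_id = evt->port.ref & 0xffff;  /* own bearer identity   */
        if (!ioctl(sd, SIOCGETLINKNAME, &lnr))
                printf("link: %s\n", lnr.linkname);
}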

2.2. Routines

The following socket API routines are supported by TIPC:

  • int accept(int sockfd, struct sockaddr *cliaddr, socklen_t *addrlen)
    Accept a new connection on a socket. If non-NULL, cliaddr is set to the socket address of the peer socket.

  • int bind(int sockfd, const struct sockaddr *addr, socklen_t addrlen)
    Bind or unbind a service address or range to a socket. Visibility scope is set to either TIPC_NODE_SCOPE or TIPC_CLUSTER_SCOPE. An unbind operation is performed by using the arithmetic inverse of the originally used scope, or by setting addrlen to zero. Note that scope 0 (zero) is legal to use, with the meaning TIPC_CLUSTER_SCOPE, but that such bindings cannot be individually unbound. Also note that bind() can be called multiple times on the same socket.

  • int close(int sockfd)
    Closes socket. If the socket is unconnected, unprocessed messages remaining in the receive queue are rejected (i.e. discarded or returned), according to their 'droppable' bit.

  • int connect(int sockfd, const struct sockaddr *servaddr, socklen_t addrlen)
    Connect the socket to a peer. Although this is using TIPC's more efficient two-way handshake mechanism, it does at the surface look exactly like a TCP connection setup. If a service address is used, the lookup is subject to the same rules as for datagram messaging. Both blocking and non-blocking setup is supported.

  • int getpeername(int sockfd, struct sockaddr *peeraddr, socklen_t *addrlen)
    Get the socket address of the peer socket. This call is only defined for a connected socket.

  • int getsockname(int sockfd, struct sockaddr *localaddr, socklen_t *addrlen)
    Get the socket address of the socket. Note that this call also is useful for obtaining the own node's hash number, as that is part of the obtained address.

  • int getsockopt(int sockfd, int level, int optname,
                   void *optval, socklen_t *optlen)

    Get the current value of a socket option. The following values of optname are supported at level SOL_TIPC:
    TIPC_IMPORTANCE: See setsockopt() below.
    TIPC_DEST_DROPPABLE: See setsockopt() below.
    TIPC_CONN_TIMEOUT: See setsockopt() below.
    TIPC_SOCK_RECVQ_DEPTH: Returns the number of messages in the socket's receive queue.
    TIPC_GROUP_JOIN: Returns a struct tipc_group_req* with identity and settings of a group member.

  • int ioctl(int fd, unsigned long request, ...)
    There are two ioctl() calls:
    1) SIOCGETLINKNAME: takes a struct tipc_sioc_ln_req* with a node hash number and a bearer identity as keys and returns a link name.
    2) SIOCGETNODEID: takes a struct tipc_sioc_nodeid_req* with a node hash number as key and
         returns a node identity.

  • int listen(int sockfd, int backlog)
    Enables a socket to listen for connection requests. The backlog parameter is ignored.

  • int poll(struct pollfd *fdarray, unsigned long nfds, int timeout)
    Indicates the readiness of the specified TIPC socket(s) for I/O operations, using the standard poll() mechanism. The returned event flags are set as follows:
    POLLRDNORM and POLLIN are set if the socket's receive queue is non-empty.
    POLLOUT is set if the socket is not in a transmit congested state, i.e., is not blocked trying to send over a congested link or to a congested peer.
    POLLHUP is set if a socket's connection has been terminated.
    Note that select() and epoll_create()/epoll_ctl()/epoll_wait() also are
    supported according to the same rules.

  • ssize_t recv(int sockfd, void *buff, size_t nbytes, int flags)
    Attempt to receive a message from the socket.
    Note that a return value of 0 has a special meaning in TIPC, depending on the socket type:
    - For datagram sockets it indicates a returned undelivered data message, whose first 1024 bytes
      can now be read as ancillary data.
    - For a connected socket, it indicates a graceful shutdown by the peer socket; otherwise -1 is returned.
    - For group sockets, it indicates reception of a membership event, when applicable.

    Applications can determine the exact cause of connection termination and/or message non-delivery by using recvmsg() instead of recv(). The MSG_PEEK, MSG_WAITALL and MSG_DONTWAIT flags are supported during reception; all other flags are ignored.

  • ssize_t recvfrom(int sockfd, void *buff, size_t nbytes, int flags,
                     struct sockaddr *from, socklen_t *addrlen):

    Attempt to receive a message from the socket. If successful, the socket address of the message sender is returned in the from parameter. Also see comments for the recv() routine.

  • ssize_t recvmsg(int sockfd, struct msghdr *msg, int flags)
    Attempt to receive a message from the socket. If successful, the socket address of the message sender is captured in the msg_name field of msg(if non-NULL) and ancillary data relating to the message is captured in the msg_control field of msg (if non-NULL).
    The following ancillary data objects may be captured:
    TIPC_ERRINFO: The TIPC error code associated with a returned data message or a connection termination message, and the length of the returned data. (8 bytes: error code + data length)
    TIPC_RETDATA: The contents of a returned data message, up to a maximum of 1024 bytes.
    TIPC_DESTNAME: The service address or range that was specified by the sender of the message (12 bytes).

    If ancillary data objects capture is requested (i.e. msg->msg_control is non-NULL) but insufficient space is provided, the MSG_CTRUNC flag is set to indicate that one or more available objects were not captured.

  • int select(int maxfdp1, fd_set *readset, fd_set *writeset,
               fd_set *exceptset, const struct timeval *timeout)

    Indicates the readiness of the specified TIPC socket(s) for I/O operations, using the standard select() mechanism. See entry for poll().

  • ssize_t send(int sockfd, const void *buff, size_t nbytes, int flags)
    This routine has two uses:
    1) Attempt to send a message from a connected socket to its peer socket.
    2) Attempt to send a group broadcast from a group member socket.
    In case 1) only the MSG_DONTWAIT flag is supported.

  • ssize_t sendmsg(int sockfd, struct msghdr *msg, int flags)
    Attempt to send a message from the socket to the specified destination. There are three cases:
    1) If the destination is a socket address the message is unicast to that specific socket.
    2) If the destination is a service address, it is an anycast to any matching destination.
    3) If the destination is a service range, the message is a multicast to all matching sockets.
        Note however that the rules for what is a match differ between datagram and group messaging.

  • ssize_t sendto(int sockfd, const void *buff, size_t nbytes, int flags,
                   const struct sockaddr *to, socklen_t addrlen)

    Attempts to send a message from the socket to the specified destination. See comments under sendmsg().

  • int setsockopt(int sockfd, int level, int optname, const void *optval,
                   socklen_t optlen)

    Set a socket option. The following optname values are supported at level SOL_TIPC:
    TIPC_IMPORTANCE: This option governs how likely a message sent by the socket is to be affected by
           congestion. A message with higher importance is less likely to be delayed due to link congestion
           and less likely to be rejected due to receiver congestion. The following values are defined:
       TIPC_LOW_IMPORTANCE, TIPC_MEDIUM_IMPORTANCE, TIPC_HIGH_IMPORTANCE, and
           TIPC_CRITICAL_IMPORTANCE, where TIPC_LOW_IMPORTANCE is default value.
    TIPC_DEST_DROPPABLE: This option governs the handling of a sent message if it cannot be delivered
            to its destination. If set, the message is discarded; otherwise it is returned to the sender. By default,
            this option is enabled for SOCK_RDM and SOCK_DGRAM sockets, and disabled otherwise.
    TIPC_CONN_TIMEOUT: This option specifies the number of milliseconds connect() will wait before
            giving up because of lack of response. Default value is 8000 ms.
    TIPC_MCAST_BROADCAST: Force datagram multicasts from this socket to be transmitted as
            bearer broadcast/multicast (instead of replicated unicast) whenever possible.
    TIPC_MCAST_REPLICAST: Force datagram multicasts from this socket to be transmitted as
            replicated unicast instead of bearer broadcast/multicast.
    TIPC_GROUP_JOIN: Join a communication group. Argument is a struct tipc_group_req*.
    TIPC_GROUP_LEAVE: Leave a communication group. No argument.

  • int shutdown(int sockfd, int howto)
    Shuts down socket send and receive operations on a connected socket. The socket's peer is notified that the connection was gracefully terminated (by means of the TIPC_CONN_SHUTDOWN error code), rather than as the result of an error. Applications should normally call shutdown() to terminate a connection before calling close(). The howto parameter must be set to SHUT_RDWR, to terminate both reading and writing, since there is no support for partial shutdown in TIPC.

  • int socket(int family, int type, int protocol)
    Creates an endpoint for communication. TIPC supports the following socket types: SOCK_DGRAM and SOCK_RDM for datagram and group messaging, and SOCK_SEQPACKET and SOCK_STREAM for connection oriented messaging. The family parameter must be set to AF_TIPC. The protocol parameter must be set to 0.

2.3. Examples

A variety of useful demo, test and utility programs can be downloaded from our project page, and should be useful in understanding how to write an application that uses TIPC.

3. libtipc C API

Many programmers will after a while develop a wrapper around the TIPC socket API, both to reduce the code footprint and to save work when new users are added. In tipcutils we provide a small but powerful example of such a wrapper, which can be copied by the users and modified according to their needs.


/* Addressing:
 * - If (type == 0) struct tipc_addr is referring to a socket
 * - If (node == 0) the lookup/binding scope is cluster global
 */
struct tipc_addr {
        uint32_t type;
        uint32_t instance;
        uint32_t node;
};

uint32_t tipc_own_node(void);
char* tipc_ntoa(const struct tipc_addr *addr, char *buf, size_t len);
char* tipc_rtoa(uint32_t type, uint32_t lower, uint32_t upper, uint32_t node,
                char *buf, size_t len);

/* Socket:
 * - 'Rejectable': sent messages will return if rejected at destination
 */
int tipc_socket(int sk_type);
int tipc_sock_non_block(int sd);
int tipc_sock_rejectable(int sd);
int tipc_close(int sd);
int tipc_sockaddr(int sd, struct tipc_addr *addr);

int tipc_bind(int sd, uint32_t type, uint32_t lower,
              uint32_t upper, uint32_t scope);
int tipc_unbind(int sd, uint32_t type, uint32_t lower, uint32_t upper);

int tipc_connect(int sd, const struct tipc_addr *dst);
int tipc_listen(int sd, int backlog);
int tipc_accept(int sd, struct tipc_addr *src);
int tipc_join(int sd, struct tipc_addr *member, bool events, bool loopback);
int tipc_leave(int sd);

/* Messaging:
 * - NULL pointer parameters are always accepted
 * - tipc_sendto() to an accepting socket initiates two-way connect
 * - If no err pointer given, tipc_recvfrom() returns 0 on rejected message
 * - If (*err != 0) buf contains a potentially truncated rejected message
 * - Group event: tipc_recvfrom() returns 0; err == 0/-1 indicates up/down
 */
int tipc_recvfrom(int sd, void *buf, size_t len, struct tipc_addr *socket,
                  struct tipc_addr *member, int *err);
int tipc_recv(int sd, void *buf, size_t len, bool waitall);
int tipc_sendmsg(int sd, const struct msghdr *msg);
int tipc_sendto(int sd, const void *msg, size_t len,
                const struct tipc_addr *dst);
int tipc_send(int sd, const void *msg, size_t len);
int tipc_mcast(int sd, const void *msg, size_t len,
               const struct tipc_addr *dst);

/* Topology Server:
 * - Expiration time in [ms]
 * - If (expire < 0) subscription never expires
 */
int tipc_topsrv_conn(uint32_t topsrv_node);
int tipc_srv_subscr(int sd, uint32_t type, uint32_t lower,
                    uint32_t upper, bool all, int expire);
int tipc_srv_evt(int sd, struct tipc_addr *srv, struct tipc_addr *addr,
                 bool *up, bool *expired);
bool tipc_srv_wait(const struct tipc_addr *srv, int expire);

int tipc_neigh_subscr(uint32_t topsrv_node);
int tipc_neigh_evt(int sd, uint32_t *neigh_node, bool *up);

int tipc_link_subscr(uint32_t topsrv_node);
int tipc_link_evt(int sd, uint32_t *neigh_node, bool *up,
                  int *local_bearerid, int *remote_bearerid);
char* tipc_linkname(char *buf, size_t len, uint32_t peer, int bearerid);

3.1. Addressing

A potential user having read the previous chapters should have no problems understanding and using this API. There are however some major differences to the socket API that should be explained further:

  • The two address types struct tipc_socket_addr and struct tipc_service_addr have been unified into one single address type struct tipc_addr. In the service tracking API, the service "type" zero can be interpreted as "node service". This has been extended to also be valid for the addressing/messaging API, so service type == 0 means "node service" even here, while the instance number represents a port number on that node.
  • Multicast can only be sent to individual instances, both for datagram and group messaging. Experience shows that few, if any, users need to use service ranges for multicast matching.
  • Because service ranges are now needed only for socket binding, even that address type has been removed, replaced by discrete lower/upper parameters in the tipc_bind() routine.
  • Both lookup scope and binding scope are now indicated by a valid node hash number or a zero.
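
As a minimal usage sketch, the following program sends one datagram and waits for a reply via the wrapper. The service type 18888 is an arbitrary example value, and the header file name tipcc.h is an assumption based on the tipcutils package layout.

#include <stdbool.h>
#include <sys/socket.h>
#include "tipcc.h"  /* libtipc header from tipcutils (name assumed) */

int main(void)
{
        struct tipc_addr dst = { .type = 18888, .instance = 17, .node = 0 };
        char buf[100];
        int sd = tipc_socket(SOCK_RDM);

        tipc_sendto(sd, "hello", 6, &dst);          /* anycast by service address */
        tipc_recv(sd, buf, sizeof(buf), false);     /* wait for one reply         */
        return tipc_close(sd);
}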

3.2. Examples

For this API, too, there are a number of examples available for download at the project page.

4. GO API

The utility package provides an example of a wrapper in the language GO, based on the libtipc C API.

5. Python API

The utility package provides an example of a wrapper in the language Python, based on the libtipc C API.

6. Java API

The utility package provides an example of a wrapper in the language Java, based on the libtipc C API.

7. TIPS and Techniques

This section illustrates some techniques that may be useful when designing applications using TIPC.

7.1. Determining own node

Create a TIPC socket. Then call getsockname() to obtain the socket's socket address. The own node's hash number is found in the node field of that address. If the node identity is wanted, one can thereafter call ioctl() with SIOCGETNODEID and the node hash as arguments.
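
A minimal sketch of this technique might look as follows. Note that printing the node identity as text assumes it was set from a text string, e.g., the host name; error handling is omitted.

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/tipc.h>

int main(void)
{
        struct sockaddr_tipc sa;
        socklen_t len = sizeof(sa);
        struct tipc_sioc_nodeid_req nr;
        int sd = socket(AF_TIPC, SOCK_RDM, 0);

        getsockname(sd, (struct sockaddr *)&sa, &len);
        printf("own node hash: 0x%08x\n", sa.addr.id.node);

        memset(&nr, 0, sizeof(nr));
        nr.peer = sa.addr.id.node;
        ioctl(sd, SIOCGETNODEID, &nr);
        printf("own node identity: %.*s\n", TIPC_NODEID_LEN, nr.node_id);
        return 0;
}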

7.2. Changing socket receive buffer

When looking into /proc/sys/net/tipc/tipc_rmem one will observe that the default receive buffer size is 2097152 bytes, while the lower settable limit is 524288 bytes and the upper settable limit is 16777216 bytes. The receive buffer can be modified within these limits simply by writing a new value into the file.
The receive buffer can also be modified for only one socket by calling setsockopt() with the option SO_RCVBUF at level SOL_SOCKET. In the latter case one may also have to increase the generic upper limit in /proc/sys/net/core/rmem_max to match the one in tipc_rmem.

The set limit lim applies to datagram messages sent with TIPC_LOW_IMPORTANCE, while the general pattern per importance level is: [ lim, lim * 2, lim * 4, lim * 8 ].
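
A minimal sketch of the per-socket variant, assuming /proc/sys/net/core/rmem_max already permits the new value:

#include <sys/socket.h>

static void set_rcvbuf(int sd)
{
        int lim = 4 * 1024 * 1024;  /* 4 MB, applies at TIPC_LOW_IMPORTANCE */

        setsockopt(sd, SOL_SOCKET, SO_RCVBUF, &lim, sizeof(lim));
}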

7.3. When to use implied connect

The easiest way to decide whether to use the implied connect technique or the explicit connect approach is to design your program assuming that you will be using the latter.
If you end up with: 1) connect() followed by one single 2) send(), later followed by a 3) recv(), then the connect() and the send() can be combined into a single sendto(). This saves your program the overhead of an additional system call, and, more importantly, the exchange of empty handshaking messages during connection establishment. In all other cases explicit connect() should be used.
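
A minimal client side sketch of this technique, with an arbitrary example service type and no error handling, might look as follows:

#include <string.h>
#include <sys/socket.h>
#include <linux/tipc.h>

static int request_response(const void *req, size_t reqlen, void *rsp, size_t rsplen)
{
        struct sockaddr_tipc srv;
        int sd = socket(AF_TIPC, SOCK_SEQPACKET, 0);

        memset(&srv, 0, sizeof(srv));
        srv.family = AF_TIPC;
        srv.addrtype = TIPC_SERVICE_ADDR;
        srv.addr.name.name.type = 18888;
        srv.addr.name.name.instance = 17;

        /* Implied connect: carries data and sets up the connection in one step */
        sendto(sd, req, reqlen, 0, (struct sockaddr *)&srv, sizeof(srv));
        return recv(sd, rsp, rsplen, 0);  /* socket is now connected */
}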

7.4. Processing a returned message

When a message is returned by the receiver, it is truncated to max 1024 bytes, and put into the ancillary data area of the original sender socket. An example of how this data, including the error code, can be read by the sender application is shown in the libtipc code in the utility package.
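
A minimal sketch of such a read, based on the ancillary data objects described under recvmsg() in section 2.2, might look as follows; error handling is kept to a minimum.

#include <string.h>
#include <sys/socket.h>
#include <linux/tipc.h>

static void read_returned(int sd)
{
        char data[1024], ctrl[CMSG_SPACE(8) + CMSG_SPACE(1024)];
        struct iovec iov = { data, sizeof(data) };
        struct msghdr msg;
        struct cmsghdr *cm;
        __u32 err = 0, retlen = 0;

        memset(&msg, 0, sizeof(msg));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = ctrl;
        msg.msg_controllen = sizeof(ctrl);
        if (recvmsg(sd, &msg, 0) < 0)
                return;

        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                if (cm->cmsg_level != SOL_TIPC)
                        continue;
                if (cm->cmsg_type == TIPC_ERRINFO) {
                        memcpy(&err, CMSG_DATA(cm), 4);        /* e.g. TIPC_ERR_OVERLOAD */
                        memcpy(&retlen, CMSG_DATA(cm) + 4, 4); /* length of returned data */
                } else if (cm->cmsg_type == TIPC_RETDATA) {
                        /* CMSG_DATA(cm) holds the first 'retlen' bytes
                         * of the rejected message */
                }
        }
}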