Building distributed messaging software is easy. Building distributed messaging software that is reliable and scalable is hard. At Poynt, our solution to this is called Poynt Cloud Messaging (PCM). It lets us remotely communicate with and control our payment terminals distributed all over the world. In this post I will discuss the technologies behind this system and some of the design decisions we've made to build PCM.
During the early days of developing the Poynt Smart Terminal, the engineering team needed a way to remotely ping a device. While we would like to claim that this solved our need for a quick way to diagnose that a terminal is alive, it was more likely that some engineers misplaced their terminal in the office and needed a way to find it.
It turned out that building such a system was not simple. Many questions came up:
- How frequently should the terminal poll to receive messages?
- Where, and for how long, should the server store a message if the terminal is not online to receive it?
- How does the server know that the terminal has properly processed a message?
- How do you prevent duplicate messages over an unreliable network (i.e., the internet)?
Perhaps the hardest question was this: how do we simulate an "instant response" effect from a message that was just sent?
In order to simulate an always-connected state, we needed to pick a reliable protocol to use on top of TCP/IP. We originally prototyped with our terminals connected to RabbitMQ using AMQP. The variety of ways to route messages using exchanges and message queues in RabbitMQ was very attractive. However, we ultimately abandoned RabbitMQ due to the difficulty of diagnosing runtime issues with it.
In evaluating an alternative to AMQP, we considered WebSocket and MQTT. WebSocket was a clear winner due to the ease and simplicity with which we were able to implement a solution on both the client and server sides. While the MQTT protocol was fully developed at the time, its support in our environment (Android) was far from mature. We will likely revisit this decision in the near future as MQTT has gained momentum in the IoT world. As with many technology decisions, this one came down to the familiarity of the engineering team and the practicality of the implementation.
With WebSocket as our protocol, we set out to create a PCM Client that ensures the stability of the connection without overtaxing the terminal's battery. The PCM Client is an Android service running in the background. To keep the connection alive, we use Android's AlarmManager to schedule regular heartbeat pings to the server. Our client also gracefully retries establishing the connection should there be a network disruption. However, each reconnect attempt time is offset by a random jitter in order to prevent our terminals from creating a "stampede" effect by all trying to reconnect at the same time. We also realized that individual terminals have different usage patterns: some rely heavily on PCM because of the applications installed on them, while others have an idle PCM connection most of the time. This prompted us to introduce PCM profiles on the client. The PCM client can be configured to switch between an "aggressive", "normal", or "passive" profile that determines how frequently it attempts to keep the connection active.
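The jittered reconnect delay described above can be sketched roughly as exponential backoff plus a random offset. All constants and names here are illustrative assumptions, not Poynt's actual values:

```java
import java.util.concurrent.ThreadLocalRandom;

/**
 * Hypothetical sketch of a reconnect delay calculation: exponential
 * backoff capped at a maximum, plus random jitter so that terminals
 * that lost connectivity together do not all reconnect at once.
 */
public class ReconnectBackoff {
    private static final long BASE_DELAY_MS = 1_000;  // first retry after ~1s
    private static final long MAX_DELAY_MS  = 60_000; // cap the backoff at 1 minute
    private static final long MAX_JITTER_MS = 10_000; // up to 10s of random spread

    /** Delay before the given retry attempt (attempt >= 0). */
    public static long delayForAttempt(int attempt) {
        long exp = BASE_DELAY_MS << Math.min(attempt, 8); // 1s, 2s, 4s, ...
        long capped = Math.min(exp, MAX_DELAY_MS);
        long jitter = ThreadLocalRandom.current().nextLong(MAX_JITTER_MS);
        return capped + jitter;
    }
}
```

Because the jitter term is random per terminal, a fleet that drops offline at the same moment spreads its reconnects over several seconds instead of hammering the server simultaneously.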
In order to scale the server side against the number of active TCP connections from our terminals, we needed a way for each terminal to discover where it should connect, rather than hardcoding the connection to a static host name. The solution was a general-purpose Discovery API that the terminal polls every hour for its assigned PCM host.
Having a mechanism to route our terminal to different servers helps in scaling and distributing the load across our systems.
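A discovery response for this kind of scheme might look something like the following. The field names, host, and TTL are purely hypothetical, shown only to illustrate the shape of such an API:

```json
{
  "terminalId": "d7f3c1a2-0000-0000-0000-000000000000",
  "pcmHost": "wss://pcm-03.poynt.example/ws",
  "ttlSeconds": 3600
}
```

Returning an expiry alongside the assigned host lets the server side rebalance terminals across instances over time: once the TTL lapses, the terminal polls again and may be directed elsewhere.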
The PCM server implementation has gone through many iterations as we constantly adjust it to scale. The frontend of PCM is a Spring Java application using the Spring WebSocket library. Spring WebSocket made it extremely easy to accept websocket connections: we simply had to implement the WebSocketHandler interface to get a handle on a WebSocketSession. Using the session, we are able to push messages down to the client. Sessions are stored in a memory-resident cache, and each PCM server instance is configured to evict older sessions using an LRU algorithm when the cache is full. This limits memory consumption.
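An LRU-evicting session cache like the one described can be sketched in a few lines with `java.util.LinkedHashMap`. This is an illustrative stand-in, not Poynt's implementation: in the real server the values would be Spring `WebSocketSession` objects, and the capacity here is arbitrary.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Illustrative in-memory session cache keyed by terminal ID, evicting the
 * least-recently-used entry once it is full. A generic V stands in for the
 * WebSocketSession values a real PCM server would store.
 */
public class SessionCache<V> extends LinkedHashMap<String, V> {
    private final int maxSessions;

    public SessionCache(int maxSessions) {
        // accessOrder=true: iteration order is least-recently-used first
        super(16, 0.75f, true);
        this.maxSessions = maxSessions;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
        // Evict the least-recently-used session once the cache exceeds capacity
        return size() > maxSessions;
    }
}
```

Because `accessOrder` is set, merely reading a session (e.g. to push a message) refreshes it, so only terminals that have been silent the longest fall out of the cache.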
The backend of PCM is powered by Kafka and Couchbase DB.
In order to deliver a message instantly to a connected PCM terminal, we needed a way to route the message from the PCM server instance that received it to the instance holding the recipient terminal's websocket connection. Because we already had Kafka as the main messaging bus for inter-application communication in our infrastructure, we ended up using Kafka as a message broadcasting system between our PCM servers. When a PCM server instance receives a message from the sender and does not have a websocket connection to the recipient terminal, it broadcasts that message on Kafka to all other PCM server instances.
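The broadcast pattern can be illustrated with a tiny in-memory stand-in for the bus (Kafka plays this role in production). Every server subscribes; on a broadcast, an instance delivers the message only if it holds the recipient terminal's session. All class and method names here are illustrative:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BiConsumer;

/** In-memory stand-in for the inter-server broadcast bus (Kafka in production). */
class BroadcastBus {
    private final List<BiConsumer<String, String>> subscribers = new CopyOnWriteArrayList<>();

    void subscribe(BiConsumer<String, String> onMessage) {
        subscribers.add(onMessage);
    }

    /** Fan the message out to every PCM server instance. */
    void broadcast(String terminalId, String payload) {
        subscribers.forEach(s -> s.accept(terminalId, payload));
    }
}

/** Simplified PCM server: holds local websocket sessions, ignores other broadcasts. */
class PcmServer {
    private final Map<String, StringBuilder> localSessions = new ConcurrentHashMap<>();

    PcmServer(BroadcastBus bus) {
        bus.subscribe(this::onBroadcast);
    }

    void connect(String terminalId) {
        localSessions.put(terminalId, new StringBuilder());
    }

    private void onBroadcast(String terminalId, String payload) {
        StringBuilder session = localSessions.get(terminalId);
        if (session != null) {
            session.append(payload); // stands in for pushing over the websocket
        }
    }

    String delivered(String terminalId) {
        StringBuilder s = localSessions.get(terminalId);
        return s == null ? null : s.toString();
    }
}
```

The key property is that instances holding no session for the terminal simply drop the broadcast, so the fan-out costs each uninvolved server only a cheap lookup.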
However, what happens when a terminal is not online to receive a message? We needed a way to temporarily store messages and easily retrieve them when the terminal returns. We also did not want to keep messages in our datastore forever, since they become stale. Couchbase was a good fit for PCM's needs because it provides a TTL mechanism on documents as well as a very efficient way to do scanning lookups of messages through Couchbase Views. All messages intended for a terminal are identified by the terminal's UUID appended with a counter kept specific to that terminal. As new messages are stored, the counter is incremented and the result is used as the message ID. When a terminal reconnects, the PCM server performs a Couchbase View scan, similar to a traditional RDBMS index scan, for all messages with IDs that begin with the terminal's UUID. Once the messages are processed, they are removed from Couchbase. In essence, we created a dynamic message queue within Couchbase for each terminal simply by formatting the ID in a specific way, without having to do much in terms of managing the queue.
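A per-terminal message ID scheme of this kind might look like the sketch below. Zero-padding the counter keeps lexicographic order equal to numeric order, which is what a prefix scan over document IDs relies on; the separator and padding width here are assumptions, not Poynt's actual format:

```java
/**
 * Hypothetical per-terminal message ID scheme: terminal UUID plus a
 * monotonically increasing, zero-padded counter. IDs for one terminal
 * share a common prefix and sort in insertion order.
 */
public class MessageIds {
    private static final int COUNTER_WIDTH = 12;

    /** Build the document ID for the next stored message for this terminal. */
    public static String messageId(String terminalUuid, long counter) {
        return String.format("%s::%0" + COUNTER_WIDTH + "d", terminalUuid, counter);
    }

    /** Prefix used to scan all pending messages for a terminal. */
    public static String scanPrefix(String terminalUuid) {
        return terminalUuid + "::";
    }
}
```

With IDs shaped like this, "fetch everything queued for terminal X, oldest first" reduces to a single ordered range scan over the prefix, and no separate queue structure has to be maintained.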
Couchbase allows us to scale up to millions of terminals without maintaining individual message queues in a traditional message broker, which would not perform well at that scale.
PCM was originally built as a simple push messaging system for Poynt's internal use. However, it has grown into a system that supports many of the use cases of the Poynt Developer Platform. Along the way, we've had to make many interesting and nontrivial design choices to scale the system up. While the system is constantly evolving to better handle increased use, we're pretty happy with the state it has evolved to so far.
Our engineers are still using PCM to locate their misplaced terminals today, but PCM has grown to support use cases like the Payment Bridge API and Mission Control Remote Terminal Management. For a detailed guide on how to integrate with PCM, head over to the PCM developer's guide.
About the Author
Victor Chau is a founding engineer at Poynt, working on the cloud infrastructure that powers its platform. Prior to joining Poynt, he was a payments platform architect at PayPal.