The problem is that there is very little support documentation available from VMware for vRealize Log Insight, and even the support engineer assigned to our case admitted to that. That is when I realised that actually, the cause of our problem might have very little to do with “Log Insight”, but everything to do with Cassandra. So, that evening I spent a few hours doing research on Cassandra and how Log Insight makes use of Cassandra (or rather how Cassandra is configured within each of the Log Insight appliances that makes up the Log Insight cluster).
So, what is Cassandra?
Cassandra is an open source high performance distributed database system, initially developed by Facebook and used as the database system for many of the large internet brands that we see day in and day out. Instagram, eBay, Apple, Netflix, CERN, Reddit, the list goes on and on (in fact, check out the list here http://www.planetcassandra.org/companies/); they all realised that the traditional relational SQL database architecture just won’t make the cut in terms of performance and scalability. Apple currently operates the largest Cassandra deployment, with over 70,500 nodes, storing over 10PB of data. I think it is safe to say, that when implementing Cassandra correctly, there are few limits to what can be achieved. However, the emphasis lies on “implementing Cassandra correctly”.
Cassandra has a masterless architecture, unlike master/slave relational database systems and is capable of delivering continuous availability. As data is distributed across nodes, Cassandra provides real write/read anywhere capabilities, which allows for data to be written/read to data nodes anywhere in the world and distributed and replicated appropriately. It enables millisecond response times with linear scalability. Throughput is doubled with two nodes, quadrupled with four nodes etc.
So Cassandra is scalable, flexible and “always on”, but the question is, how is it implemented in VMware Log Insight? The answer is simple. It’s rather default. Basically, all Log Insight nodes in a cluster are configured to assume that all sibling nodes are in the same server rack. And this is essentially why we ran into performance problems in our Log Insight implementation.
The day after the P1 call was opened with VMware support, I spent the majority of the day looking at how Cassandra is configured within Log Insight, and in our deployment of Log Insight. By the end of that Tuesday, VMware still had no answer as to why it’s so slow, and the case then got forwarded to the engineering and product management teams. However, before that escalation had happened, I was still investigating the issue on my own and by mid-day on Tuesday while VMware GSS was still working with another colleague of mine to diagnose the problem, I realised what had caused our issue. It was a simple miss-configuration with massive performance implications.
During troubleshooting, I noticed that on some “remote forwarders”, Log insight would just shut down after a few minutes, which required a manual service restart by “service loginsight start” form the console. Monitoring the Cassandra logs, I noticed that there were a lot of “remote node n is now DOWN”, “remote node n is now UP” events. This concerned me. Using Cassandra’s rather awesome nodetool utility, I could see that some nodes were periodically down. This was a bad sign. Also, I could see that the node distribution in the ring was random, and made no sense in terms of topology with regards to where these servers were located on the planet. So that was easy to find. The difficulty thing to find out was why these nodes went down, especially since no clear error messages were logged on a node as it went down, other than “shutting down”.
At this point, I started digging into the Cassandra 2.0 documentation provided by Datastax. I compared the configuration found within Log Insight’s implementation of Cassandra to the possible options as documented by Datastax. This is where I diagnosed the issue. VMware’s implementation of Cassandra within Log Insight assumes that all of the nodes within the Log Insight (Cassandra) cluster are located within the same server rack. However, I found 4x nodes within our cluster that were remote, very remote. This is apart form the fact that the cluster had more nodes than what I had expected to see. The cluster should have been 6 nodes, but it had 10.
So, What Had Happened?
One of the consultants who was deploying the Log Insight infrastructure, and performing initial stand up tasks such as deploying the OVA’s and initialising the new cluster, had misunderstood some ambiguity within the design (which was produced by a 3rd party) and added 4x instances which were meant to be standalone Log Insight instances configured to forward events to the main cluster in Europe, as “data nodes” to the master cluster. The issue is that these 4x instances weren’t local at all. In fact two of them were in the US and the other two were in Asia, with the main cluster in Europe. The default Cassandra configuration within Log Insight just doesn’t understand that topology. Cassandra stores data on data nodes in a similar way to RAID 5 works in disk storage (Cassandra experts, please don’t hurt me for using that slightly flawed analogy). Data nodes are placed in a ring and data is written from node to node. Now, if the Cassandra configuration doesn’t take the node topology into account, you could end up with something like this:
The following is only a theoretical example, and is a very simplified way of describing how Cassandra writes and distributed data, so is not completely accurate. However, it should discribe the problems that can be encountered when not configuring Cassandra with the correct Snitch topology for your deployment. For more information, see the official Cassandra documentation.
1x write to the Cassandar DB enters node 1 in the Europe data centre. The write is forwarded to node two, but node two is actually in Asia, so latency comes into play. From node two, the write is forwarded to node 3, which is again in Europe. It’s then forwarded to node 4 in the US after which it makes its way to node 5 back in Asia and then on to node 6 in Europe. It then hits node 7 in the US again before making it back to node 8 in Europe to complete the ring.
Now that’s a complex route to travel. But it’s gets more complicated. Not only do you have nodes accepting agent traffic in Europe, agents are committing writes to all nodes, and as this is Log Insight, were talking 3000 – 5000 events ingesting into the application each and every second. And that’s only with 500 agents deployed. Imagine what will happen when we get all 10,000 agents forwarding logs.
The Log Insight application simply cannot be deployed and is not supported to be deployed across datacenter in this way. It just won’t work. However, although this won’t be supported, I reckon that with some time (which I don’t have) I probably could make such as setup work with Log Insight, simply by configuring Cassandra to understand the node topology properly.
The real fear I have is that Cassandra skills, and those who understand Cassandra are few and far apart. Even within VMware, who built (Ok probably bought) Log Insight on Cassandra, it took 48 hours to some up with an answer as to why we were experiencing these massive performance issues.
How the issue was resolved?
When VMware support came back and confirmed that our deployment isn’t supported in this configuration as I had expected, we basically had to rebuild the entire environment. Well, I say that simply from a support point of view, as removing nodes from a Log Insight cluster isn’t supported by VMware, although it is technically possible, as I will demonstrate below.
Before the 4x remote nodes were added to the cluster, the cluster’s performance was very good. Unfortunately, the Log Insight implementation was by this time integrated into vRealize Operations Manager (vROPS). I guess I could have simply deleted the entire deployment and started again, but I wanted to do things properly by breaking the link to vROPS first. In order to do this, I wanted to use the Log Insight web interface to unregister from vROPS, but as the interface was so slow and unusable, I had to fix the broken Log Insight First.
In order to stabilise the Log Insight cluster, I had to find a way of safely removing the remote nodes that were causing the performance problems. The steps I had to take to achieve this were:
Please do note again, the following procedure is to remove a node from Log Insight and is not supported.
Open an SSH session to each Log Insight node
Establish the node states for each node, but running nodetool status –r from one of the main cluster nodes, for this I used the Log Insight Master node (Cassandra has no concept of Mater/Slave nodes). On the vRealize Log Insight 2.5 appliances, the nodetool can be found at /usr/lib/loginsight/application/lib/apache-cassandra-2.0.10/bin/nodetool
nodetool status -r will produce something like this:
Datacenter: datacenter1 ======================= Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns Host ID Rack UN vrlimn.spiesr.com 3.56 MB 256 50.6% 8665bb98-a3d9-4d4d-91f7-0733f2708e33 rack1 UN vrlidn.spiesr.com 76.94 KB 256 49.4% 18481ba1-e949-4e4b-b156-2a691d8d5df8 rack1
The “UN” indicates that the node is UP and NORMAL. If any nodes are showing “DN”, then they are “DOWN” and “NORMAL”. In order to remove and nodes, they all need to show “UN”.
You can try and restart the Log Insight service on any node that is showing as down by running service loginsight restart. However, if the service is already running, wait a minute to so before running nodetool status again to see if they are reporting as “UN”. If the Log Insight service is running, it means that Cassandra is running as Log Insight cannot run without Cassandra.
On each of the nodes, except the master node, monitor the cassandra.log file using tail. You need to do this so that you can tell if a Cassandra stops on a node while you’re trying to remove a node. If you spot one log that states that Cassandra has stopped on the node, you will need to restart Log Insight on that node before this procedure will work.
tail -f /storage/var/loginsight/cassandra.log
To remove a node from the cluster, you need to use the decommission command with the nodetool command:
./nodetool –h <node IP address> decommission
Once the decommission command has finished the following will be logged in the removed node’s cassandra.log file:
INFO [RMI TCP Connection(5)-192.168.1.36] 2015-09-06 16:45:40,658 StorageService.java (line 2927) Announcing that I have left the ring for 30000ms INFO [AntiEntropyStage:1] 2015-09-06 16:45:40,666 Validator.java (line 254) [repair #b3e1f820-54b6-11e5-869c-25fb6c5d8739] Sending completed merkle tree to /192.168.1.36 for logdb/spock_pending_clusters INFO [AntiEntropyStage:1] 2015-09-06 16:45:40,780 Validator.java (line 254) [repair #b3e1f820-54b6-11e5-869c-25fb6c5d8739] Sending completed merkle tree to /192.168.1.36 for logdb/fields_by_name INFO [RMI TCP Connection(5)-192.168.1.36] 2015-09-06 16:46:10,659 ThriftServer.java (line 141) Stop listening to thrift clients INFO [RMI TCP Connection(5)-192.168.1.36] 2015-09-06 16:46:10,688 Server.java (line 182) Stop listening for CQL clients INFO [RMI TCP Connection(5)-192.168.1.36] 2015-09-06 16:46:10,690 Gossiper.java (line 1279) Announcing shutdown INFO [RMI TCP Connection(5)-192.168.1.36] 2015-09-06 16:46:12,690 MessagingService.java (line 701) Waiting for messaging service to quiesce INFO [ACCEPT-/192.168.1.34] 2015-09-06 16:46:12,692 MessagingService.java (line 941) MessagingService has terminated the accept() thread INFO [RMI TCP Connection(5)-192.168.1.36] 2015-09-06 16:46:12,698 StorageService.java (line 966) DECOMMISSIONED INFO [BatchlogTasks:1] 2015-09-06 16:46:15,266 ColumnFamilyStore.java (line 794) Enqueuing flush of Memtable-batchlog@1224669606(4472/44720 serialized/live bytes, 8 ops) INFO [FlushWriter:7] 2015-09-06 16:46:15,266 Memtable.java (line 355) Writing Memtable-batchlog@1224669606(4472/44720 serialized/live bytes, 8 ops) INFO [FlushWriter:7] 2015-09-06 16:46:15,272 Memtable.java (line 402) Completed flushing; nothing needed to be retained. Commitlog position was ReplayPosition(segmentId=1441557218309, position=4394196)
With the remote nodes removed from the cluster, performance of the Log Insight cluster was normal, and I was able to unregister Log Insight with vROPS. However, I still went ahead to rebuild the Log Insight environment in order to conform to VMware's support statement.
I'll publish some more information on Cassandra's configuration options soon. These options aren't really relevant to Log Insight as I doubt that Log Insight will be supported once the Cassandra configuration has been tampered with. However, I'd like to cover node topology and Cassandra snitches in future posts, as it is quite interesting.