September 24, 2008
NetDocuments Report - Slowness in Service
The most important service aspects of NetDocuments are availability and response time. For nine days (primarily afternoons in U.S. time zones), beginning in late August 2008 and continuing for a few days into September, our Service experienced slowness. The purpose of this document is to explain to our customers, in plain English and with candor, what happened and what procedures have been implemented to prevent such problems from recurring. During the performance degradation, the NetDocuments Service was always available, but it was slow.

We thought it might be helpful to describe the methodology we employ to monitor the performance of the NetDocuments Service. The three major components of our “Service” are 1) our systems, such as servers, multiplexers, storage devices, etc.; 2) network devices, such as routers, firewalls, intrusion prevention systems, etc.; and 3) our software applications. With this structure in mind, consider the following information.

Historical Perspective of NetDocuments Service

We monitor service availability in 12 areas described below. Since 2003, we have recorded all service performance and summarized the cumulative total service failures in the table below.
We pay extremely careful attention to Application Software failures (bugs) and Human failures (uploading the wrong patch, mis-cabling connections, resetting the wrong server, etc.), which are often the major causes of downtime. We have had no problems in these two areas since we first launched the NetDocuments service in the fall of 1999.

Historically, database failures represent the highest probability of service interruption, according to the table above. To remove this risk, we have eliminated the use of a database for indexing documents. Since we upgraded our Directory from NDS to eDirectory, the stability of the service has increased, and we have experienced no directory service failures.

The biggest environmental change since January 2003 was the move in November 2006 to LexisNexis' facility, which today constitutes our primary datacenter. Service uptime at the LexisNexis datacenter, from November 2006 to the time of publication of this document, is 99.9906%.

Cause of Service Slowness

After an extensive investigation by NetDocuments IT staff, LexisNexis' IT staff, and hardware and software vendors, we identified the primary cause of the recent service slowness as the malfunction of an Intrusion Prevention System (IPS), which inspects network packets for malicious intent. The IPS at the LexisNexis datacenter was taking too long to inspect each network packet, slowing communications between the back-end servers and the storage system during peak times.

Usually, IPS systems are set to monitor only the front-end public area of the network. However, the datacenter staff decided to place network inspection at the back end of our environment as well, to protect against the remote chance of a hacker taking control of the web servers. LexisNexis carefully monitors the IPS for service failures. Our data logs confirm that everything is being monitored.
However, in this particular case, the IPS failure alerts were not being sent to the LexisNexis IT staff because the individuals responsible for managing alerts were not fully aware that the IPS was monitoring back-end traffic (since packet inspection is usually performed only at the front end). In our division of labor with LexisNexis, the entire network infrastructure, including load balancing, BGP-4, firewalls, routers and switches, and intrusion prevention and detection, is the responsibility of LexisNexis.

When NetDocuments alerted LexisNexis to service slowness in the afternoons (peak times), they promptly investigated the network and concluded everything was running well, because there were no reported failures. The IPS at the back end was considered optional and, because it was not hardwired to the alert system, was not monitored to the same level as the other network components.

After opening two consecutive tickets and being assured all was well with the network, NetDocuments assumed the network infrastructure was sound and launched an intensive investigation into other areas, primarily the disk subsystem and operating system. Under the guidance of our vendors, we increased the traffic flow between servers and disks in an attempt to remedy the slowness, believing that bandwidth was being restricted by configuration settings. Our vendors pointed out that the servers could communicate with the disks at a faster rate, and we promptly took action to make this happen. We increased the number of CIFS channels between the servers and the NAS (Network Attached Storage), changed key configuration parameters, and re-channeled the iSCSI traffic to isolate it from CIFS. However, this fine-tuning only put more stress on the failing IPS, which had to work harder to handle the increased traffic.
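As a rough illustration of why more traffic made matters worse, the sketch below models an inline inspection device as a simple single-server queue (the M/M/1 model, chosen here purely as an assumption; this report does not describe how the IPS behaved internally). The function name and all packet rates are hypothetical and are not measurements from the incident.

```python
def mean_delay_ms(arrival_rate_pps: float, service_rate_pps: float) -> float:
    """Mean time a packet spends at the device (queueing plus inspection),
    in milliseconds, under the assumed M/M/1 model: W = 1 / (mu - lambda)."""
    if arrival_rate_pps >= service_rate_pps:
        # Offered load meets or exceeds capacity: the queue grows without bound.
        return float("inf")
    return 1000.0 / (service_rate_pps - arrival_rate_pps)


# Hypothetical inspection capacity in packets per second (not from the report).
SERVICE_RATE_PPS = 10_000.0

if __name__ == "__main__":
    for load_pps in (5_000.0, 8_000.0, 9_500.0, 9_900.0):
        delay = mean_delay_ms(load_pps, SERVICE_RATE_PPS)
        print(f"{load_pps:>8.0f} pps -> {delay:.2f} ms per packet")
```

Under these assumed numbers, the per-packet delay grows from 0.2 ms at half load to 10 ms at 99% load, which is the general pattern one would expect when traffic is pushed through a device that is already close to its inspection capacity.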
On September 8, 2008, after our planned weekend service maintenance, the increased traffic flow caused the IPS to generate enough network congestion to be detected by those monitoring the service. Service performance improved immediately, by a factor of 10, after the IPS was fixed at about 2:30 pm Mountain Time.

NetDocuments Service performance is measured by a third-party tool called WebMetrics. Every 5 minutes, a WebMetrics server at one of several locations around the world accesses NetDocuments. It performs a 14-step operation and records the elapsed time for each step. Normally, a 7-second elapsed time is considered excellent. If the average response for a particular 60-minute period is worse than 12 seconds, we consider it too slow. The total recorded slowness was 1,680 minutes, spread over a three-week period. Even if we count these slow times as downtime, NetDocuments' high-performance uptime (when the system is neither down nor slow) remains a very respectable 99.9285%.

Future Prevention

Reed Elsevier Technology Services (RETS) has made several improvements in monitoring the IPS:
In addition, NetDocuments is launching many other preventative initiatives in other areas of the Datacenter to minimize unexpected slowness or downtime. We regret the recent performance degradation and accept responsibility for it. We appreciate your business and will always strive to ensure that you have as close to 100% uptime as possible.

Yours sincerely,
Alvin Tedjamulia, CTO
NetDocuments