Managing news websites is no trivial task, and it can cost a company millions if the website goes down for even a minute. A live news website should always be well equipped to handle a sudden increase in traffic or load, server problems related to power failures and even hard-disk crashes. During such events, the main challenge is to continue streaming data to website viewers without interruption so that users remain unaware about the back-end problems.
If the website shows up such errors to the viewer, it results in turning away hundreds (if not thousands) of visitors while reducing the effectiveness of purchased ad campaigns and associated revenue. ibnlive.in.com, one of Network 18′s most reputed news websites, faced such problems, and gradually resolved them by implementing PostgreSQL servers along with a Memcached server and a load balancer.
ibnlive.com witnessed a massive growth in the number of website viewers over the years, but it did not foresee the technical problems that this growth brought. The high load of traffic was tough to handle, and as a result, the website started experiencing frequent downtimes. As a media website, it had to prepare for high-traffic days and felt the need for a replication feature.
When the database server went down due to power failure, the website did not display news stories that viewers requested, and instead, displayed error messages to them. “One of the worst problems we faced was when a hard-disk server blew out, and the website went down for over two hours — at a time when we were expecting high traffic,” reveals Manish Singh, senior Linux administrator, CNN-IBN.
For a live news website that gets 8-9 million views a day, a downtime of two hours is unacceptable. He continues, “We took a few hours to gather the data that was stored on the server, but there were many corrupted files. After deleting those, we managed to retrieve only 70 per cent of the files stored. For two hours and ten minutes, there were no stories on the website, and we lost many clients, revenue and couldn’t fulfil our purchased ad campaigns, as the advertisements were not displayed during the downtime.”
A stand-by solution
Memcached 1.2.5 was installed as a “middleman” server for the website. This object-caching system saves repeated database queries by caching “answer” data when it is obtained from the database server(s). Hence, when a visitor to the site tries to view a story, the website sends a request to the Memcached server. If the required data is available in cache, the Memcached server supplies it, saving a database query. In case the data is not cached by the Memcached server, it acts as an intermediary and requests the required data from other servers. Responses to the Memcached server are passed back to the website. “This way, even if a database server fails, the website continues to communicate smoothly with the viewer, without any interruptions,” comments Singh.
Singh’s team also set up a stand-by server (log archiving, in other words), by which the server gets updates from other main servers every ten minutes, and stores updated 16 MB files using Wal (Write ahead log) archives. The log describes every change made to the database’s data files. This log exists primarily for crash-safety purposes, and if the system crashes, the database can be restored by replaying the log entries made since the last checkpoint. Stories that formed the website’s content were stored in different directories.
When a main server failed, the stand-by server would be converted into a live server manually, ensuring a smooth flow of data. It also provided PITR (point-in-time recovery), which allowed data to be restored or recovered to the state it was in at any time since PITR logging was started for that database. However, this was not a mature solution, as it had to be worked on manually. There was a chance that the server could crash, or even go down due to a power failure.
The cherry on top: PostgreSQL 9.0
While the stand-by and Memcached server partly met the requirements of the website, it was PostgreSQL 9.0 that completely solved the problems associated with manual replication. In version 9.0, streaming replication is automated, and people do not have to manually convert the stand-by server into a live server. Singh explains, “We also felt the need for a load balancer, as we get high traffic on the website, and implemented Pgpool II. This distributes load uniformly across different database servers. For example, if Database Server 1 has too much load, and Database Server 2 is less loaded, the load balancer shifts some of the load from Server 1 to 2.”
ibnlive.in.com currently runs on five back-end servers: one Memcached, one load balancer and three database servers, and Singh believes this takes care of future requirements. “The replication feature has brought flexibility with it. If our Web traffic suddenly increases to 15 million visitors a day, we can deploy two more servers to go live, making it a much more scalable solution than what we previously used, i.e., MySQL. The server could not handle 2000-3000 connections and would fail, but PostgreSQL now handles over 4,000 connections smoothly. It also handles a good volume of data, as we serve about 32 GB of data from the database server on a daily basis. In terms of concurrency, the website is able to handle over a 100 connections at a time, enabling over 8 million page views in a day, as the Memcached server handles it. Also, from the application point of view, it does not get corrupted easily,” Singh reveals.
The website had first migrated to PostgreSQL 8.0 from MySQL in 2007. “We first tested PostgreSQL on some of our servers. The ease of configuration and tuning made a huge difference, and we were convinced about it since viewers did not have to wait long to see their requested data,” Singh explains. The team was satisfied with the migration to PostgreSQL 8.0, since the developers did not have to add, delete or modify much data. It was extracted in MySQL format and inserted in PostgreSQL, with minimal modification using PHP coding. The company quickly upgraded to 8.3 the next year, and later, to 9.0 in 2010.
Singh is planning to upgrade to PostgreSQL 9.1, but only after testing it on the servers. “It comes with stability, and also offers a synchronous solution for replication. Now, asynchronous solutions allow some delay between the time of a commit and its propagation to the other servers, with the possibility that some transactions might be lost in the switch to a backup server and those load-balanced servers might return slightly stale results. The synchronous server will replicate data written on the main master server to the other servers at the same time, making it possible to avoid delay.” Singh concludes.