7 things one can do to scale up a web application


Recently at work, we went through a quick exercise of scaling up our web application, which taught me a few things I thought were worth sharing with the community. We are using the following technology stack at work:

  • Python as our primary language for most of our backend work
  • Pylons (web framework)
  • MongoDB (NoSQL datastore)
  • Redis (Cache)

Let's jump into the seven steps that worked for us; I hope most of them can be applied to any web application.

Profile your web application:

In order to understand the execution pattern from a performance perspective, the first step is to profile the application. Profiling can bring out some interesting insights about your application: within 10 minutes of profiling, we were able to identify some very important (low-hanging) performance fixes. Another reason for enabling profiling is that once you make a performance fix, it helps you quickly measure the difference as well.

These days most web frameworks ship with profiling tools. For our web framework, Pylons, we used ProfileMiddleware from the paste package.

  • You need to install the python-profiler package. The following command should do the trick if you are using Ubuntu:

sudo apt-get install python-profiler

  • Add the following lines to your Pylons application's middleware.py, i.e. /config/middleware.py:

from paste.debug.profile import ProfileMiddleware

app = ProfileMiddleware(app, config, log_filename='profile.log.tmp', limit=40)  # in the custom middleware section

With the above steps in place, you should see profiler output on the console (stdout) if you are running in dev mode. Now identify the code paths that are the real bottlenecks.

DB Query Profiling

Since most web apps are powered by some kind of data store, profiling your data store queries will give you some interesting insights into the slower operations in your application. In our case we are using MongoDB, which provides a command line switch (-v up to -vvvvv, for increasing verbosity levels) to understand the query execution happening at the server. It helped us identify some of the most frequent and slow queries, and in 90% of the cases all we needed to do was define indexes on our collections and we were done. Things may not be that simple in your case, but it will at least give your engineers enough to understand what needs to be attacked in the application.
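In the MongoDB shell, the profile-then-index loop looks roughly like this (a sketch; the collection and field names are invented, not from our application):

```javascript
// Log queries slower than 100 ms into db.system.profile
db.setProfilingLevel(1, 100);

// Inspect the slowest captured queries
db.system.profile.find().sort({millis: -1}).limit(5);

// Add an index on the offending field
db.users.ensureIndex({email: 1});

// Verify the query now uses the index
db.users.find({email: "a@b.com"}).explain();
```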

Enable Data Caching

Caching can be your biggest friend for scaling up, and it can be done at various levels. The caching strategy depends on the use cases in your application, and for some of the popular ones like page-level caching, most frameworks provide support out of the box. For example, for Pylons the Beaker cache module supports most caching use cases. To give an example of a caching scenario: we observed that most of our application pages can be cached for the non-logged-in state, so we wrote a custom caching module to enable page-level caching for non-logged-in mode. We are now in the process of going one step down to enable data-level caching for the logged-in version as well. (I am going to do a follow-up blog post on the caching work that we did.)
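To sketch what data-level caching can look like, here is a minimal in-memory TTL cache decorator (a toy illustration; in a real setup the store would be Redis or Beaker, and the function names here are made up):

```python
import time
import functools

def cached(ttl_seconds):
    """Cache a function's results in memory for ttl_seconds.

    A stand-in for a real cache backend such as Redis or Beaker.
    """
    def decorator(func):
        store = {}  # key -> (expiry_time, value)

        @functools.wraps(func)
        def wrapper(*args):
            now = time.time()
            entry = store.get(args)
            if entry is not None and entry[0] > now:
                return entry[1]  # cache hit: skip the expensive work
            value = func(*args)
            store[args] = (now + ttl_seconds, value)
            return value
        return wrapper
    return decorator

calls = {"n": 0}  # counts how often the expensive work actually runs

@cached(ttl_seconds=60)
def render_page(slug):
    # Imagine an expensive template render / DB hit here
    calls["n"] += 1
    return "<html>%s</html>" % slug
```

Repeated calls within the TTL return the cached value without re-running the body.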

Background certain tasks

While improving response times for some of the requests in our web application, we found a lot of work that did not need to be performed inline in the request handling path and could be performed as a background job. There are standard off-the-shelf components available these days for most web frameworks, e.g. Resque if you are using Ruby on Rails. In our case, we used the Python-based Celery for backgrounding certain tasks.
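Celery needs a broker and worker processes, which is more setup than fits in a post; the underlying idea of moving work off the request path can be sketched with the standard library alone (a toy illustration, not our production code):

```python
import queue
import threading

task_queue = queue.Queue()
results = []

def worker():
    # Pull jobs off the queue and run them outside the request path
    while True:
        job = task_queue.get()
        if job is None:  # sentinel: shut the worker down
            break
        func, args = job
        results.append(func(*args))
        task_queue.task_done()

def send_welcome_email(user):
    # Placeholder for slow work (SMTP call, image resize, ...)
    return "emailed %s" % user

# Start one background worker thread
threading.Thread(target=worker, daemon=True).start()

# The request handler just enqueues the job and returns immediately
task_queue.put((send_welcome_email, ("alice",)))
task_queue.join()  # only for this demo; a real handler would not wait
```

Celery gives you the same shape with durable queues, retries, and multiple worker machines.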

Combining JS/CSS

We observed that we had 17 CSS and 9 JS files included across different pages of our application, leading to 26 IO requests, which was bad from both the server and the page-load perspective. By simply combining the JS files into one file and the CSS files into another, we cut down on 23 IO requests to our server, which improved our page-load performance as well. Most web frameworks provide minification/combining modules for JS/CSS files. In our case, we used the MinificationWebHelpers module; the javascript_link and stylesheet_link helpers need to be passed extra flags as shown below.

${h.stylesheet_link('/css/ext-libs/jqModal.css',
                    '/css/ext-libs/jquery-ui-1.8.custom.css',
                    '/css/ext-libs/jquery.jcarousel.css',
                    ..
                    '/css/explore/exp_common.css',
                    minified=True, combined=True, combined_filename='app.css')}

Serve Static Content from Another Server

If your web application contains a lot of static content like images, it is a good idea to serve that content from a service like Amazon S3 which is better suited for the purpose. It will further cut down on the IO requests served from your web server. We used Amazon S3 for serving our images. In our case there was also some content that is not exactly static, like user images that change when a user uploads a new one; we used the Amazon S3 API (Python's boto library) to push new/changed images to S3 on the fly. You can take further advantage of hosting images on S3 by enabling Amazon's CDN service (CloudFront) to serve this content from Amazon's CDN infrastructure, which can further improve page-load performance.
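The on-the-fly push can be sketched with boto like this (a reconstruction, not our exact code; the bucket name is made up, and boto is imported lazily inside the function so the rest of the app does not need it at import time):

```python
def push_image_to_s3(local_path, key_name, bucket_name="my-app-images"):
    """Upload a new/changed user image to S3 and make it public."""
    import boto  # classic boto (pre boto3), as used in the post

    conn = boto.connect_s3()  # credentials come from env vars / boto config
    bucket = conn.get_bucket(bucket_name)
    key = bucket.new_key(key_name)
    key.set_contents_from_filename(local_path)
    key.set_acl('public-read')  # images are served directly to browsers
    return "https://%s.s3.amazonaws.com/%s" % (bucket_name, key_name)
```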

Correct Logging Strategy

This one is very low-hanging and may not be a problem in your case, but we observed that a lot of logging was enabled in our production setup that needed to be bumped down in log level. A quick one-hour exercise led to assigning the right log levels to all the noisy log statements.
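With Python's standard logging module, the fix is mostly about levels (an illustrative snippet, not our actual logger configuration):

```python
import logging

logging.basicConfig()
logger = logging.getLogger("myapp")

# Development: everything, including chatty debug output
logger.setLevel(logging.DEBUG)

# Production: only warnings and above reach the handlers
logger.setLevel(logging.WARNING)

logger.debug("per-request details")    # dropped in production
logger.warning("something looks off")  # still logged
```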

I hope you find the above tips useful. It would be great to hear about some of the tips that you have applied in your own app. We are pretty much done with the vertical scaling exercise, and I am going to follow up this post with the horizontal scaling exercise that we are starting this week.

Database Backup from MongoDB to Amazon S3 and Restoring it Back

Copyright: John Boston

I couldn’t have asked for a better picture for my blog entry. I am going to use John’s words straight from this photo’s page on Flickr to set the ground for today’s blog post.

This server was destroyed in the Choteau fire in NE Oklahoma on 11/27. This is why I stress to my clients the need for off-site backups. That’s a telephone on top.

Any small business owners out there should ask themselves “What would I do tomorrow if none of my data was retrievable”. The answer in many cases is “I’d go out of business”. Backups are cheap insurance, and if you don’t store them off-site, you will regret it one day. Even taking yesterday’s tape home with you is better than leaving it at the office.

Offsite backup is much needed (check out the ma.gnolia disaster for a real example), and everyone writes their own mechanism for taking offsite backups. Last week, I spent some time developing a simple mechanism for taking a snapshot of a certain section of our database. We are using MongoDB at work. Though we have a real-time replication mechanism in place, we wanted to keep an offsite backup of some section of user data in another place, and we decided to go with the Amazon S3 service. I thought of sharing some of the code snippets here because I think there are a lot of folks who could make use of this and save their time. So, the rest of the blog discusses the steps for backing up data from MongoDB, storing it in the Amazon S3 store, and then restoring it back from Amazon S3.

The first step is to take a backup of the database. The documentation on the MongoDB website describes various strategies for taking backups; the one that I followed was:

  1. Fsync, Write Lock
  2. Backup
  3. Unlock

Now, this strategy works well for scenarios where the data size you are backing up is small, because the first operation is going to prevent any writes on the DB and will pretty much halt your application.

Fsync, WriteLock

The fsync + write lock operation ensures that all pending writes are flushed to the datafiles and then locks the DB against any further writes. It is a good point at which to take a snapshot of the database, because the dataset is in a consistent state with no pending writes. I created a JS file, fsync_lock.js, for this.
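The original snippet is missing from this copy of the post; the file presumably contains the standard fsync + lock command:

```javascript
// fsync_lock.js: flush pending writes and block further writes
db.runCommand({fsync: 1, lock: 1});
```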

Here is the command to do the first operation:

/opt/mongodb/bin/mongo admin fsync_lock.js


The next step is actually taking the backup of your database. There are different methods of taking a backup mentioned on the MongoDB website, and I have used the export method in this example. Let's assume the database you want to back up is named "my_db".

The following command will back up your db:

/opt/mongodb/bin/mongodump -d my_db -o dump

The above command will generate backup files in the folder dump/my_db.

Unlock the database:

Since we are done with the snapshotting exercise, unlock the database. I have created a small JS file, unlock.js, for the unlocking operation. The script contains the following single JS line:

db.$cmd.sys.unlock.findOne();

The following command will unlock the database:

/opt/mongodb/bin/mongo admin unlock.js

Saving the backup archive in S3:

I will share a Python script that I wrote for four common operations that you will be doing with the Amazon S3 store:

  • Saving a backup archive in the S3 bucket
  • A list operation to check all archives in the S3 bucket
  • Retrieving a backup archive from S3
  • Deleting all backup archives

There is a Python library called boto which is widely used for accessing Amazon Web Services programmatically, and there is a very good tutorial to get started with it. You need to create the bucket for saving the data in the S3 store first.

Here is the aws_s3.py file, which implements the four operations I just described.
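The embedded script did not survive in this copy of the post; here is a hedged reconstruction with boto (the function names and bucket name are my guesses, not the original code):

```python
# aws_s3.py (reconstruction): the four backup operations against S3

def save_archive(bucket, local_path, key_name):
    """Upload a backup archive to the S3 bucket."""
    key = bucket.new_key(key_name)
    key.set_contents_from_filename(local_path)

def list_archives(bucket):
    """Return the names of all archives stored in the bucket."""
    return [key.name for key in bucket.list()]

def get_archive(bucket, key_name, local_path):
    """Download a backup archive from S3 to a local file."""
    bucket.get_key(key_name).get_contents_to_filename(local_path)

def delete_all_archives(bucket):
    """Remove every archive from the bucket."""
    for key in bucket.list():
        key.delete()

# Usage (requires AWS credentials in env vars / boto config):
#   import boto
#   conn = boto.connect_s3()
#   bucket = conn.get_bucket("my-db-backups")  # pre-created bucket, name assumed
#   save_archive(bucket, "my_db_backup.tar.gz", "my_db_backup.tar.gz")
```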

Here is the complete shell script that runs the backup for you:
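The script itself is missing from this copy; a sketch of how the pieces fit together (the archive name and the `put` subcommand of aws_s3.py are assumptions):

```sh
#!/bin/sh
# backup_to_s3.sh (sketch): lock, dump, unlock, archive, push to S3
/opt/mongodb/bin/mongo admin fsync_lock.js
/opt/mongodb/bin/mongodump -d my_db -o dump
/opt/mongodb/bin/mongo admin unlock.js

ARCHIVE=my_db_backup_$(date +%Y%m%d).tar.gz
tar -czf "$ARCHIVE" dump/my_db
python aws_s3.py put "$ARCHIVE"
```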

When shit happens:

The rest of the blog focuses on restoring your database from the backup saved in S3. The restore involves the following steps:

  1. Getting the backup archive from S3: you can use the list operation from aws_s3.py to list the backup archives that you have saved in S3. To retrieve a backup archive from S3, use the following command:

python aws_s3.py get

  2. Restoring your DB: restoring the database from the backup is a simple step. You need to stop your application from accessing the database first.

Assuming unarchiving the backup archive creates a directory dump which contains the my_db folder with all the files:

/opt/mongodb/bin/mongorestore -d my_db dump/my_db

Here is a complete shell script to do the complete restore operation:
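This script is also missing from this copy; the restore boils down to something like the following (the archive name is an assumption):

```sh
#!/bin/sh
# restore_from_s3.sh (sketch): fetch the archive from S3 and restore it
ARCHIVE=my_db_backup_20110115.tar.gz
python aws_s3.py get "$ARCHIVE"
tar -xzf "$ARCHIVE"    # creates dump/my_db
/opt/mongodb/bin/mongorestore -d my_db dump/my_db
```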

This is pretty much it. Let me know your comments, or ping me at @sunil in case you face any trouble using these scripts.

Displaying dates in a user-friendly format in Python

In this post, I am going to share a quick piece of code for functionality which I think is required in almost every web application these days: displaying dates in a more user-friendly format. Every application has objects in the system that have time-stamps associated with them, e.g. user objects have a creation time or a last activity time, and a content publishing system has a content publishing time.

Consider Twitter, for example; just note the time-stamps in the picture below. "40 minutes ago" or "1 hour ago" is way easier to understand than "26th Jan, 5:30 PM". In fact, if a user is presented with a date format like "26th Jan, 2009 5:23 PM", he tries to convert it into an easier format in his mind: does it mean 2 hours ago, a week ago, or 2 months ago? So if you really care about a good user experience, you will want to display the date in a more user-friendly manner.

So, let's talk about the code which does that for you in Python. We need a function which accepts a Python datetime object and returns a 'user friendly' version of it. We will be using the python-dateutil module, which provides some handy date manipulation functions; the relativedelta module in the dateutil package does most of the magic for us.

The relativedelta(date1, date2) function in dateutil.relativedelta returns a relativedelta object which captures the difference between the two dates. So, for example, if two dates differ by 3 years, 2 months, 10 hours, 5 minutes, 30 seconds, the relativedelta object stores these difference values in rdelta.years, rdelta.months, rdelta.hours, rdelta.minutes, rdelta.seconds respectively. I have written code that looks at this difference between the current date and the passed date and computes the friendly-format version for the given date. The function displays the top unit only by default; what that means is, if the timestamp is 2 years, 6 months old, then the default display will be "2 years", and the full version will display the top two units, i.e. "2 years, 6 months". If you want the friendly format to display the top two units of time instead of one, pass display_full_version=True.

Now you can jump to the code directly. If you have any questions, you can contact me at @sunil.

The Code
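The embedded snippet is missing from this copy of the post; here is a reconstruction that matches the behavior described above (the function name is my choice, not necessarily the original):

```python
from datetime import datetime
from dateutil.relativedelta import relativedelta

def user_friendly_date(date, display_full_version=False):
    """Return a friendly string like "2 years ago" for a past datetime."""
    rdelta = relativedelta(datetime.now(), date)
    units = [("year", rdelta.years), ("month", rdelta.months),
             ("day", rdelta.days), ("hour", rdelta.hours),
             ("minute", rdelta.minutes), ("second", rdelta.seconds)]
    # Keep only the non-zero units, largest first, pluralized as needed
    parts = ["%d %s%s" % (value, name, "s" if value > 1 else "")
             for name, value in units if value > 0]
    if not parts:
        return "just now"
    # Top unit only by default; top two with display_full_version=True
    top = 2 if display_full_version else 1
    return ", ".join(parts[:top]) + " ago"
```

For a timestamp 2 years, 6 months old this yields "2 years ago" by default and "2 years, 6 months ago" with display_full_version=True.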

Shit happens, It does! - Violet Hill


Picture by Anant S.

We’ve all had embarrassing moments in our careers that involved inadvertently wreaking havoc on a production system. When it happens, for a second you (so desperately) want to believe it didn’t. You are so afraid that you don't even want to cross-check that it actually happened.

GitHub went through an outage yesterday, and Chris was brave enough to reveal how it happened; the Hacker News post then generated a good buzz around the subject. While reading the comments on both threads, I hand-picked a few interesting stories about production mishaps. Here they are:

seldo: My worst was discovering I had written a unique ID generator which was (due to me typing “==” instead of “!=”), producing duplicate IDs – and not only that, it was producing them at exponentially increasing rates – and every duplicate ID was destroying an association in our database, making it unclear what records belonged to who.

pixdamix: Mine was for a French social networking site 4 years ago. They used to send mails everyday to say “hey look at the people who you might know”. The links on the mail would automatically log the user on the website. When I sent the code live it took 2 days (and more than 50000 mails to found out that when I sent a mail to person Z about person Y the link logged in Z ON Y’s account.

SkyMarshal: I sent a test email to thousands of customers in your prod database encouraging them to use web check-in for their non-existent flight tomorrow. Yeah, did that five years ago, talk about heart-attack-inducing. Quickly remedied by sending a second email to the same test set, thankfully, but that’s the kind of mistake you never forget.

Would love to hear about your production mishaps if any :). 

Serializable decorator for Python Class

I am going to describe a quick recipe for adding serializability to a Python class. I was playing with Redis lately and needed a quick way to save Python objects in Redis. I wanted the objects to be saved in JSON format, and found jsonpickle to be an appropriate module to use.

Initially I thought of having a base class with serialize/deserialize functions which other classes could simply inherit to add serialize functionality. Here is the base class:

import jsonpickle

class Serializer(object):
  def serialize(self):
    return jsonpickle.encode(self)

  @staticmethod
  def deserialize(ser_str):
    return jsonpickle.decode(ser_str)

To add serialize functionality to a Python class, all I have to do is inherit from the above class as shown below, and I am done. (Note that deserialize is a static method.)

class Visitor(Serializer): pass  

But then I thought a more elegant way of doing the same would be to specify a decorator '@serializable' while defining the class. So here is the code for the decorator:

def serializable(cls):
  class wrapper(object):
    def __init__(self, *args):
      self.wrapped = cls(*args)

    def __getattr__(self, *args):
      return getattr(self.wrapped, *args)

    def serialize(self):
      return jsonpickle.encode(self.wrapped)

    @staticmethod
    def deserialize(ser_str):
      return jsonpickle.decode(ser_str)

  return wrapper

So now, I can simply specify the decorator on the Visitor class to make it serializable, as shown in the following snippet.

@serializable
class Visitor(object):
  def __init__(self, ip_addr = None, agent = None, referrer = None):
    self.ip = ip_addr
    self.ua = agent
    self.referrer = referrer
    self.time = datetime.datetime.now()

orig_visitor = Visitor('192.168', 'UA-1', 'http://www.google.com')
#serialize the object
pickled_visitor = orig_visitor.serialize()
#restore object
recov_visitor = Visitor.deserialize(pickled_visitor)

So this was it. Both of the above approaches have their pros and cons; do share your views. If you are interested in Python decorators, you might find the PythonDecoratorLibrary useful.