Login

PyCon India Presentation on Redis And Python

Pune (India) witnessed probably the second biggest PyCon event this year. It was third PyCon to be held in India and was very successfull event.

I took a comprehensive tutorial on Redis & Python subject. I am embedding the slides from the same session below.

Hope you will find it useful. Ping me @_sunil_ if you have any questions.

Parsing signed_request parameter in Python based Facebook Canvas application

Recently Facebook announced a new way to passing user information who is viewing your Facebook canvas application using "signed_request" parameter which is implemented on top of new signature scheme based on OAuth2.0 proposal. Facebook documentation describes "signed_request" as

The signed_request parameter is a simple way to make sure that the data you're receiving is the actual data sent by Facebook. It is signed using your application secret which is only known by you and Facebook. If someone were to make a change to the data, the signature would no longer validate as they wouldn't know your application secret to also update the signature.

Facebook's python-sdk does not support parsing request parameter. Today at work, I had to write this piece of code snippet for parsing "signed_request", so thought of sharing it here.

I know there is some cryptic code in base64_url_decode because translate, maketrans does not work that well with unicode strings. Anyways, if you have any questions, just drop a line in the commments below or message me @_sunil_.

Why StackOverFlow hates Ruby and Loves C#

No wonder why stackoverflow folks decide to go with C#, the answer is simple because C# is going to make lot of money for them :) and it works really hard for them.

If you are interested in economics of programming languages with StackOverFlow, you are going to find this one amusing.

An hour back, just out of curiosity I thought of checking out popularity of programming languages on stackoverflow. I was looking for simple representative figures to compare engagement value of programming lanugages on stackoverflow. The simplest I could think of was 'Number of Questions tagged to each language'. I chose four languages:

For the simplcity, let assume that each question for a language asked on Stackoverflow generates 1 USD (I know in the long run, value of each question will be way over 1USD as it will increase over time...but lets keep it 1 USD fixed), then here are some interesting economics that each programming language creates for StackOverFlow.

1. Dollars earned so far from the four languages:  poor ruby :)

stackover_dollars_earned_till_now

2. Dollars earned this Month: poorer ruby :)

stackover_dollars_earned_last_month

3. Dollars earned this this week: somebody save ruby :)

stackover_dollars_earned_last_weekOn the serious note, the stats for all the four languages are interesting and puzzled me when I tried to draw some logic out of them. Oh yes, the money graphs were made using the 'Piles of Samples' visualization from Google. You can reach me @_sunil_

7 things one can do to scale up a web application

scaling up

Recently at work, we had to undertake a quick exercise of scaling up our web application which taught me a few things which I thought of sharing with the community. We are using following technology stack at work: 

  • Python as our primary language for most of our work at the backend
  • Pylons (Webframework)
  • MongoDB (NoSQL datastore)
  • Redis (Cache)

Lets jump in to the seven steps that worked for us and hope that most of them can be applied to any web application.

1. Profile your web application: In order to understand the execution pattern from performance perspective, first step would be to profile application. Profiling can bring out some interesting insights about your application. Within 10 minutes of profiling we were able to figure out some of the very important (low-hanging) performance fixes. Another reason for enabling profiling is that if you do some performance fixes, it quickly helps you measure the difference as well.

These days most of the web frameworks provide profiling tools with them. For our web framework Pylons, we used  ProfileMiddlware from paste package.

  • You need to install python-profiler package. following command should do the trick for you if you are using Ubuntu:

sudo apt-get install python-profiler

  • Add following lines in your pylon's application middleware.py i.e. <app_name>/config/middleware.py

from paste.debug.profile import ProfileMiddleware

app = ProfileMiddleware(app, config, log_filename='profile.log.tmp', limit=40) #in the custom middle ware section

With above steps performed, you should see profiler output on the console (stdout) if you are running in dev mode. Now identlfy the code paths which are the real bottlenecks.

2. DB Query Profiling: Since most of webapps are powered by some kind of data store, profiling data store query profiling would give you some interesting insights about the slower operations in your application. In our case, since we are using mongodb. It provided one command line switch for verbose mode (-vvvvv) for different verbosity levels to understand query execution happening at server. It helped us to identify some of the most frequent and slow queries and in 90% of the cases, all we needed was to define indexes on our collections and we were done. Things may not be that simple in your case but it will ateast give your engineers enough to understand what needs to be attacked in the application.

3. Enable Data Caching: Caching can be your biggest friend for scaling up and it can be done at various levels. Caching strategy depends on usecases in your application and for some of the popular usecases like page-level-caching etc, most of the frameworks provides support out of the box. For ex. for Pylons, beaker cache module provides supports most of caching use cases.Just to give you example of caching scenarios, in our cases we observed most of our application pages can be cached for non-logged state and we wrote our custom caching module to enable page-level caching for non-logged mode. Now we are in process of going one step down to enable data-level caching for even logged-in version. Caching can be your biggest friend for scaling up your app (I am going to do a follow up blog post on caching work that we did)

4. Background certain tasks: While improving response time for some of the requests in our webapplication, we found lot of things which were not needed to be performed inline in the request handling path and could be performed as background job. There are some standard off the shelf components available these days for most of the web frameworks. For ex. resque if you are using Ruby on Rails. In our case, we used python based Celery for backgrounding certain tasks.

5. Combining JS/CSS:  We observed that we had 17 CSS and 9 JS files being included in different pages of our application which were leading to 26 IO requests which were bad from the server as well page-load perspective. So simply combining all the JS files and CSS files in one file for JS and CSS each, we cut down on 23 IO requests from our server which improved our page load performance as well. Most of the webframeworks provide minification/combining modules for JS/CSS files. In our case, we used MinificationWebHelpers module. javascript_link helper and stylesheet_link needs to be passed with extra flags as shown below.

${h.stylesheet_link('/css/ext-libs/jqModal.css',
                    '/css/ext-libs/jquery-ui-1.8.custom.css',
                    '/css/ext-libs/jquery.jcarousel.css',
                    ...........
                    '/css/explore/exp_common.css',
                    minified = True, combined = True, combined_filename = 'app.css')}

6. Server Static content from Other Server: If your web application contains lot of static content like images etc., then it would be good idea to serve the static content from other services like Amazon S3 which are better suited for this purpose. It will further cut down on IO requests being served from your web server. We used Amazon S3 for serving our images. Also in our case, there was some content which is not exactly static like user images which get changed when user uploads a new image, we used Amazon S3 API (python's boto library) to push the new/changed images on the fly to the Amazon S3. You can take further advantage of hosting images on Amazon S3 by enabling Amazon CDN service to power this content from Amazon's CDN infrastructure which can further improve page-load performance.

7. Correct Logging Strategy: This one is a very low-hanging one and may not be the problem in your case but we observed that there were lot of logs enabled in our production setup and were needed to bumped down in their log-levels. A quick one hour exercise led to assigning right log levels to all the noisy log statement.

I hope you will find above tips useful. It would be great to hear about some of the tips that you must have applied in your app. We are pretty much done with the vertical scaling exercise and I am going to follow up this post with the horizontal scaling exercise which we are starting off this week.

Displaying date in user friendly format in Python

In this post, I am going to share a quick piece of code for a functionality which I think is required in almost every web application these days i.e. displaying date format in more user friendly format. Every application has objects in the system which have time-stamps associated with them i.e. user objects will have creation time or last activity time or a content publishing system will have content publishing time.

Consider twitter for example, just note the time-stamps in the picture below. 40 minutes ago or 1 hour ago is way easier to understand than 26th Jan, 5:30 PM. Infact, if a user is presented with a date format like 26th Jan, 2009 5:23PM, he tries to compute an easier format like whether it means 2 hours ago, a week ago or 2 months in his mind. So if you really care for a good user experience, you would like to display the date in a more user friendly manner.tw_timeline

So, lets talk about the code which does that for you in Python. We need to have a function which accepts a Python datetime object and returns us a 'user friendly' version of that. We will be using python-dateutil module  which provides some handy date manipulation functions. relativedelta module in the dateutil package does most of the magic for us.

relativedelta(date1, date2) function in dateutil.relativedelta returns a relativedelta objects which captures difference between the two dates. So for ex, if two dates differs by3 years, 2 months, 10 hours, 5 minutes, 30 seconds, the relativedelta objects is going to store these difference values in rdelta.year, rdelta.months, rdelta.hours, rdelta.minutes, rdelta.seconds respectively. I have written code that looks at this difference of current date with the passed date and compute the friently format version for given date. The function displays top unit only by default what that means is, if timestamp is 2 year, 6 months older, then default display will be "2 years" and full version will display two top units i.e. "2 years, 6 months".  If you want friendly format to display top two units of time instead of one unit, then you need to pass 'display_full_version = True'.

Now you can jump on to the code directly. If you have any questions, you can contact me at @_sunil_

The Code

URL Shortener Service Using Redis

I have been following 'Redis' from sometime and got an opportunity to use it in one of the weekend projects (URL Shortner). I had a great experience using Redis in the project, so thought of writing a blog entry describing some aspect of the project to demonstrate a use case for Redis.

The project was to build URL shortener web application. We want our URL shortner service to provide following functionalities:

  • URL Shorten/Expand: Given a long URL (say, http://www.google.com/....), our service will generate a short-key e.g. xyz so that a short URL http://ab.ly/xyz (hosted on a domain http://ab.ly) when accessed, visitor will be redirected to the original long URL (http://www.google.com/..)
  • Record Clicks: We want to record the number of times a short URL (http://ab.ly/xyz) has been visited
  • Recent Visitors for a short URL: We want to store basic information e.g referrer, user agent, ip etc for recent n visitors for a given short URL.
  • List of Recent Short URLs generated by the service.

Skipping the web-application details, I would focus on the core URL-Shortner module that I wrote in python using Redis as back-end storage. You can read more about the Redis at http://code.google.com/p/redis/, here is a quick description from Redis website.

"Redis is an advanced key-value store. It is similar to memcached but the dataset is not volatile, and values can be strings, exactly like in memcached, but also lists, sets, and ordered sets. All this data types can be manipulated with atomic operations to push/pop elements, add/remove elements, perform server side union, intersection, difference between sets, and so forth. Redis supports different kind of sorting abilities." more..

One good part about the redis is that it took just 2 minutes (a few lines of instructions) to get Redis up and running. I loved the simplicity and documentation of the project. I think simplicity of installation and getting a sample app running in few minutes are signs of a good open source project.  Big thanks to Salvator Sanfilippo (lead developer of Redis and team) for the wonderful work. 

Rather than debating over why Redis for storage part, I want to focus on HOW Redis can be used to build a URL shortener service (that is the best part I like about weekend projects, you dont have to break your head to justify why you are doing things in a specific manner :)).

The core module is written in python, so we will use py-redis, the python bindings for Redis client. You can download latest code of py-redis from the project's github page. Again, py-redis is also a very well documented library and Thanks to Andy McCurdy.

Basic Design:

The basic design is that we will use a global counter with redis key 'next.url.id' for generating key for short urls. So everytime we generate a new short URL, we just increment the global counter and returns the hexadecimal version of the counter value as the short url key.

Assuming current value of the counter stored in the key 'next.url.id' is 1000. So next URL shorten request for a URL (say, http://www.google.com/...) will increment the counter to 1001 (hex value = 3e9). So Redis key 'url:<short-url-key>:id' (e.g.'url:3e8:id') for short url http://ab.ly/3e8 will contain string value for long URL 'http://www.google.com/...'

For each short URL, we need to store clicks. So, we will maintain another key 'clicks:<short-url-key>:url'  (e.g.'clicks:3e8:url') with value type int to store number of clicks.

For each short URL, we also will maintain another key 'visitors:<short-url-key>:url' (e.g. 'visitors:3e8:url') which will point to list data structure which will store each visitor detail in form of JSON format. 

A global key with name 'global:urls' pointing to a LIST data structure containing the hex-code of n recent short URLs.

Code Details:

Lets look at some code of the UrlShortner module. please note that you will encounter references to 'self.r' in the code which refers to the redis object created using self.r = Redis(db=0) in the  __init__ method of our core module.

Now lets look at the basic functions of the URL shortner core module.

1. Shorten function of our module will look something like this. Given a long_url, this function generates the short-url-key which can be distributed to outside world. e.g. for http://www.xyz.com/... ==> 3e8 as above, short-url = http://ab.ly/3e8

    def shorten(self, long_url=None):

        url_hash = '%x' % self.r.incr('next.url.id') #hex value of the counter

        #next we store the long_url against this hex key

        self.r.set('url:%s:id' % url_hash, long_url)

        #push this new short url in global list of urls 

        self.r.push('global:urls', url_hash)

        return url_hash 

2. Expand function is required by the service to resolve a short URL needed for redirection when a user visits the short url. So expand function will look like this:

    def expand(self, url_hash = None):

        return self.r.get('url:%s:id' % url_hash) 


    3. When someone visits a short URL, then we need a function to record the visit information (e.g IP address, user agent, http referrer etc). Here is what visit function will look like:

    def visit(self, url_hash = None, ip_addr = None, agent = None, referrer = None):

            #create an object of Visitor type

            visitor = Visitor(ip_addr, agent, referrer)

            #push the visitor object in the visitor list of this short URL

            self.r.push('visitors:%s:url' % url_hash, json.dumps(visitor))

            #increments the clicks of the short url

            return self.r.incr(clicks:%s:url % url_hash)

Please note that we would have wanted the above two operations to be atomic. but I realized very soon that they do not have any side effect if they are performed in an interleaved fasion. 

4. In order to find out the click count of a short URL, the code will be very simple as:

    def clicks(self, url_hash = None):

        return self.r.get(self.URL_CLICKS_KEY % url_hash)

 

5. In order to list down the recent 100 visitors of a given short URL, the function will look like this:

    def recent_visitors(self, url_hash = None):

            visitors = []

            for v in self.r.lrange('visitors:%s:url' % url_hash, 0, 100):

                visitors.append(json.loads(v))

            return visitors 

Please note above, as we were storing visitor info object in JSON format, we need to convert back in to python object.        

6. In order to view recent 100 short urls in the system, another helper function in the module will look something like this:

    def short_urls(self):

        return self.r.lrange('global:urls', 0 , 100)


So a web application on top of this core module can be written to power up URL Shortening web service. I will share the core module on github after giving some final touches and will share the link. 

I hope the post will help users to think more about use cases for Redis. Do share your thoughts in comments.

Deserialize a binary search tree from an inorder sequence

A friend of mine asked me about the ways a binary search tree (BST henceforth) could be transferred over the network. There are many ways to do so but I found one approach worth discussing here. BST can be serialized by  storing the BST in pre-order or in-order sequence and then wired over the network and then reconstructed by deserializing the sequence. Deserialization from a pre-order sequence is pretty straight forward but deserializing a in-order sequence in to BST is worth a blog post, so I decided to post it here.  I have been spending some time with Python for sometime so decided to give a try (The fact is that I find Python to be really cool).

DISCLAIMER: I am not a Python expert, so please do not expect perfect pythonic code here :)

It is being assumed that an in-order sequence will be given as input to be de-serialized.

Defined a class TreeNode to capture a  properties of tree node.

class TreeNode():
def __init__(self, val):
          self.val = val
          self.left = None
          self.right = None

We can use famous Divide and Conquer philosophy to solve this problem by building first the left sub-tree, then right-subtree and then combining the two to create the complete tree. Here is the recursive function.

def build_tree(sequence, low, high):
        if low > high:
return None
        if low == high:
                  return TreeNode(sequence[low])
        mid = (low + high)/2
        node = TreeNode(sequence[mid])
        node.left = build_tree(sequence, low, mid -1)
        node.right = build_tree(sequence, mid + 1, high)
        return node

Recursive function to print the content of the tree in in-order sequence.

def traverse(node):
        if not node:
           return
        traverse(node.left)
        print node.val,
        traverse(node.right)

Finally lets test out the code with following inorder sequence. Currently, we do not check if the input sequence if indeed a sorted sequence or not.

if __name__ == "__main__":
         sequence = [8, 10, 11, 12, 14, 18, 20]
        tree = build_tree(sequence, 0, len(sequence) -1)
        traverse(tree)

Do share your views about it.