Performance – When do I start worrying?
A common problem for application designers is predicting when they need to start worrying about architectural or system improvements to their application. Do I need to add more resources? If yes, how long before I am compelled to do so? The question is not only when but also what. Should I plan a true caching layer on top of my application, or do I need to shard my database? Do I need to move to a distributed search infrastructure, and if so, when? Essentially, we are trying to find the functionalities of the application that will become critical over time. There are two main reasons why functionality that works nicely today becomes critical later:
- Volume: The data grows over time, so some queries that work well now become inefficient.
- Concurrency: Traffic increases, and simultaneous usage makes the system inefficient.
The problem therefore becomes finding the critical inefficiencies in the presence of future data and future traffic.
Performance tools like LoadRunner do not look suitable for this job. They certainly work well for finding inefficiencies from a black-box perspective, but you are the owner of your own application. You can do better! Moreover, these tools help in the testing phase but not so much in the planning phase. Here we will not focus on end-to-end performance but on its most critical part: the efficiency of retrieving query results (from the backend perspective), simply because we want to estimate, not tune. After going through the rest of the post, you may ask whether you should do this modeling at all, or instead make an educated guess or do simple calculations of the future data and performance. Don't let this post mislead you: do that if you can. Any quick, handy estimate is certainly gold. Moreover, a mathematical model will work far better than any simulated one, but unfortunately, in most cases, it may be quite difficult to build.
Now, if you are writing your own performance tool, a big question is: will you write a tool that emulates users the way LoadRunner does, hitting the application and running on the stage server in real time? That may be a bigger and more resource-consuming project than the application itself! Well, you won't have to. The concept of the proposed performance tool is based on Discrete Event Simulation (DES). The badly sketched picture below aims at describing the steps.

Modeling Users/Traffic: The first step is to model user behavior, for example how many users of the application are 1) heavily active, 2) moderately active, or 3) one-time visitors. You also have a growth rate of users (actual or projected). Users come and browse through pages (read actions) and make changes (write actions). If you have structured the application well, simulating user actions (using the model classes) should not be difficult. The three-way partitioning above is just an example. Since you know your application best, you can make your own models and categorizations.
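As a rough illustration of the user model, here is a minimal Python sketch. The category names follow the three-way example above, but every number, field name and helper (`UserCategory`, `build_population`, `grow`) is an assumption made up for illustration; real values would come from your own logs.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class UserCategory:
    name: str
    share: float              # fraction of the user base in this category
    mean_click_gap_s: float   # average seconds between two actions in a visit
    mean_return_gap_s: float  # average seconds between visits (inf = never returns)
    write_ratio: float        # probability that an action is a write


# Illustrative numbers only (assumptions, not measurements).
CATEGORIES = [
    UserCategory("heavy", 0.10, 20.0, 86_400.0, 0.30),
    UserCategory("moderate", 0.30, 45.0, 7 * 86_400.0, 0.10),
    UserCategory("one_time", 0.60, 60.0, float("inf"), 0.02),
]


def build_population(total_users, categories, rng):
    """Assign each simulated user a category according to the category shares."""
    weights = [c.share for c in categories]
    return rng.choices(categories, weights=weights, k=total_users)


def grow(users, growth_rate):
    """Number of users after one growth period at the given (actual or projected) rate."""
    return round(users * (1 + growth_rate))
```

Keeping the categories as plain data makes it easy to swap in a different partitioning without touching the simulator.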
Events: The above actions are events in the context of DES. Each event triggers another event after a time span. If the next event has to be chosen from a set of n events, use some analysis to put a probability on each of them. The time interval to the next event can be chosen randomly, with its mean at the average time between clicks. These values will differ between the user categories above. You will also have to assign each user type a return period, again chosen randomly (at run time) around typical values. To model the growth rate, introduce another event that adds a calculated number of users at a fixed period. If there are other sources of write actions, such as a crawler, add suitable events for them.
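Choosing the next event could look roughly like the sketch below. The transition table and its probabilities are hypothetical, and the exponential distribution for the gap is my assumption; the only requirement stated above is a random interval whose mean is the average time between clicks.

```python
import random

# Hypothetical transition table: for each action, the candidate next actions
# and the probability assigned to each (derived from your own analysis).
TRANSITIONS = {
    "view_home": [("search", 0.5), ("view_item", 0.3), ("leave", 0.2)],
    "search": [("view_item", 0.6), ("search", 0.2), ("leave", 0.2)],
    "view_item": [("post_comment", 0.2), ("search", 0.3), ("leave", 0.5)],
    "post_comment": [("view_item", 0.5), ("leave", 0.5)],
}


def next_event(current_action, now, mean_gap_s, rng):
    """Pick the follow-up action and the simulated time at which it fires."""
    actions, weights = zip(*TRANSITIONS[current_action])
    action = rng.choices(actions, weights=weights, k=1)[0]
    # Exponential gap (an assumption) whose mean is the average time between clicks.
    gap = rng.expovariate(1.0 / mean_gap_s)
    return now + gap, action
```

A call such as `next_event("search", 120.0, 30.0, random.Random(7))` returns a `(time, action)` pair that can be pushed onto the event queue. Passing a different `mean_gap_s` per user category gives each category its own pace.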
Simulation: Now that we have modeled the future traffic and events, we need to bootstrap the simulation with the current average load, which is pretty simple. Note that, unlike a thousand-thread load generator, this is a single-threaded simulator, and it jumps to the head of the event queue without waiting for real time to elapse. This is depicted pictorially below. You may skip actually firing most of the read queries and just focus on writing parametrized write queries, which add the simulated data. In most cases, you will be able to emulate a few days of system run in a few hours.
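The core loop of such a single-threaded simulator can be sketched with a priority queue. This is a minimal sketch, not the actual tool: `simulate`, `demo_handler` and the event names are hypothetical, and the place where a real tool would run its parametrized write query is only marked by a comment.

```python
import heapq
import itertools
import random


def simulate(initial_events, handler, horizon_s):
    """Process events in time order up to horizon_s and return the load profile.

    initial_events: iterable of (time_s, name); handler(now, name) returns
    the follow-up events as a list of (time_s, name).
    """
    counter = itertools.count()  # tie-breaker for events at the same time
    queue = [(t, next(counter), name) for t, name in initial_events]
    heapq.heapify(queue)
    load_profile = []
    while queue:
        # The clock jumps straight to the next event; no real time elapses.
        now, _, name = heapq.heappop(queue)
        if now > horizon_s:
            break
        load_profile.append((now, name))
        for t, follow_up in handler(now, name):
            heapq.heappush(queue, (t, next(counter), follow_up))
    return load_profile


def demo_handler(rng):
    """Toy event logic (illustrative assumptions throughout)."""
    def handler(now, name):
        if name == "add_users":
            # Growth event with a fixed period of one day; the new user starts reading.
            return [(now + 86_400.0, "add_users"), (now, "read")]
        if name == "write":
            # A real tool would run a parametrized write query here to add data.
            return [(now + rng.expovariate(1 / 30.0), "read")]
        if name == "read" and rng.random() < 0.7:  # the user keeps browsing
            nxt = "write" if rng.random() < 0.1 else "read"
            return [(now + rng.expovariate(1 / 30.0), nxt)]
        return []
    return handler
```

For example, `simulate([(0.0, "add_users")], demo_handler(random.Random(0)), 3 * 86_400.0)` walks through three simulated days almost instantly, because the only cost is the work done per event.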

A list of DES tools that might help is given here.
Inference: At the end of the simulation, you have a snapshot of the simulated future data and a load profile containing sets of concurrent events (read actions). Concurrent events are defined as the set of events occurring within a given time span in the load profile. By varying this window, you can over- or under-approximate the actual concurrent load. Simply by looking at the events, you can derive a lot of insight into future behavior. A simple multi-threaded query executor can then fire these queries simultaneously for the peak window and check whether the response of each query is acceptable under concurrent load. By looking at the events and the responses, you can better predict the type of improvement that suits the specific application. For starters:
- Are they locking on some bottleneck? (distributed instance of that module?)
- Is a significant chunk of the queries actually the same? (caching?)
- How many of the queries are independent of each other? (replication, sharding?)
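The inference step can be sketched as follows, assuming the load profile is a list of `(time_s, query)` pairs. The function names are hypothetical, and `run_query` stands for whatever callable executes one query against your simulated data.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def peak_window(load_profile, window_s):
    """Return the largest group of events that fit within window_s seconds."""
    events = sorted(load_profile, key=lambda e: e[0])
    best = (0, 0)  # slice bounds of the best window found so far
    start = 0
    for end in range(len(events)):
        # Shrink from the left until the window spans at most window_s.
        while events[end][0] - events[start][0] > window_s:
            start += 1
        if end + 1 - start > best[1] - best[0]:
            best = (start, end + 1)
    return events[best[0]:best[1]]


def repeat_share(queries):
    """Fraction of queries that repeat an earlier one (a hint that caching may help)."""
    if not queries:
        return 0.0
    return 1.0 - len(set(queries)) / len(queries)


def fire_concurrently(queries, run_query):
    """Run all queries at once, one thread each, and return (query, seconds) pairs."""
    def timed(query):
        t0 = time.perf_counter()
        run_query(query)
        return query, time.perf_counter() - t0

    with ThreadPoolExecutor(max_workers=max(1, len(queries))) as pool:
        return list(pool.map(timed, queries))
```

Widening `window_s` over-approximates the concurrent load and narrowing it under-approximates it, so it is worth running the executor for a couple of window sizes and comparing the latencies.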
By running the simulation repeatedly, for a few days at each stretch, and doing the analysis after each run, you can also get an idea of when the application will start to become critical and when it will break down completely. Again, it will not be accurate to the calendar date, but it gives a high-level answer: is it weeks, months, or years?
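The repeated simulate-then-analyze cycle can be sketched as a simple loop. Everything here is hypothetical: `first_breach` is an illustrative helper, and `ToySystem` is only a stand-in (latency growing linearly with row count is an assumption) for the real simulator and query executor.

```python
def first_breach(simulate_stretch, peak_latency_s, limit_s, stretch_days, max_stretches):
    """Return the simulated days elapsed when peak latency first exceeds limit_s.

    Returns None if the limit is never exceeded within max_stretches runs.
    """
    for i in range(1, max_stretches + 1):
        simulate_stretch(stretch_days)   # add another stretch of simulated data
        if peak_latency_s() > limit_s:   # analysis after each run
            return i * stretch_days
    return None


class ToySystem:
    """Stand-in for the simulator and executor (illustrative only)."""

    def __init__(self, rows_per_day, seconds_per_row):
        self.rows = 0
        self.rows_per_day = rows_per_day
        self.seconds_per_row = seconds_per_row

    def simulate_stretch(self, days):
        self.rows += days * self.rows_per_day

    def peak_latency_s(self):
        # Assumed linear growth of latency with data volume.
        return self.rows * self.seconds_per_row
```

With the toy numbers `ToySystem(100_000, 1e-6)`, a 2-second limit and 7-day stretches, the loop reports the breach after 21 simulated days. The result is only as fine-grained as the stretch length, which matches the weeks/months/years level of answer this approach is meant to give.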
I have kept this post procedural for simplicity, but not all parts are must-haves, and some may not or cannot be followed exactly. If you can project the current load profile into the future and use this method only to add data, such shortcuts lead to quicker estimates and are hence always preferable. Do whatever you must to get these estimates as quickly as possible and as close to reality as possible. The post started with the question, "When do I start worrying?" Well, if you have stopped, then you are probably going out of business as well.
I remember the first time I had my hands on a PC: I tried installing Linux the very first day, many years back, and got the PC to crash. After two days of looking at the blank screen, I went to a lab with dumb terminals (colorless, mouseless, no GUI) and figured out how to use lynx to connect to the internet and search for solutions. It worked. I think the search engine was 39.com. From lynx to Chrome, from 39.com to Google, from then to now, I have found tech blogs and discussion forums helping me a lot. The motivation behind this blog is to give something back to that community.
This is very useful. Can you share your experience with the tool, with some indicative numbers? A coarse example would be great.
Also, how much time do you think, on average, would be required to come up with this estimation set-up for a typical web application?
Are there any other scenarios where a product might need this? I can think only of web applications open to end users.
Thanks.
You are right. This tool is particularly suitable for web applications where unpredictable, heavy concurrent access is a concern. Moreover, it is more useful for web applications where real-time data mining is more prominent than static content serving (like photos). If I leave out the user categorization, which is very application-specific, the tool development should take 10-14 days by my best estimate. For a typical web application and one production unit, it simulates 3 months of data in 2 hours, though evidently this will vary from application to application and from model to model. The subsequent steps of analyzing the load profiles are relatively simple.
Nice work !
Thanks man.
You start worrying when you are sure it will be a problem. YAGNI (you ain't gonna need it). Look at what gives your business needs more value. Having worked on several high-traffic applications, we never had to think about performance until we knew for sure we needed to optimize it. Then we profile the app, figure out where the bottlenecks are, and come up with a solution.
I totally agree with KK. Good article too.