Replicate production leak into test environment



This post is about a use case I encountered in a client environment.

The production webapp, running Magnolia EE 3.6.8 on Apache Tomcat behind Apache httpd, with MySQL, was permanently under high load: CPU pinned at 100% even when the internet connection was pulled out of the live servers.

The client also has a test environment, but unfortunately it does not fully match the production one.

The main goal was to replicate the critical situation in the test environment, in order to investigate and test fixes freely, without disturbing the live site.

So, these are the steps that were followed:

  1. Trace Apache accesses
  2. Reproduce the same requests against the test environment
  3. Fix the error (if found) and rerun the test
  4. Based on the log file produced at step 2, compute some statistics to summarize the work

Trace Apache accesses

This is easy. Apache httpd stores an access log by default. Simply make sure this feature is enabled and wait for the error to happen. You can find more info here: [apache-access-log].
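As a reference, a minimal httpd configuration sketch could look like the following; the format name, the log path and the trailing %D (request serve time) are assumptions for illustration, not the client's actual setup:

# Hypothetical httpd.conf snippet: combined-style log plus request serve time (%D, microseconds)
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" [%D]" combined_timed
CustomLog /var/log/apache2/apache-access.log combined_timed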

The log entries look like this snippet:

82.72.135.51 - - [14/Apr/2010:04:05:35 +0200] "GET /home/comunicazione/impressum.html HTTP/1.0" 200 4906 ""  "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.1)" [9580]
82.72.135.51 - - [14/Apr/2010:04:04:24 +0200] "GET /home/channels/com/about/teleone.html HTTP/1.0" 200 5469 ""  "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.1)" [10401]
82.72.135.51 - - [14/Apr/2010:04:03:41 +0200] "GET /home/net/02-rightBox/06/smart/0/image/showcase.jpg HTTP/1.0" 503 402 ""  "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; GTB6; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)" [11725]
82.72.135.51 - - [14/Apr/2010:04:05:39 +0200] "GET /docroot/pria-layout/css/templates/t11-content-page.css HTTP/1.0" 200 305 ""  "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; InfoPath.2; .NET CLR 3.5.30729; .NET CLR 3.0.30618)" [20709]
82.72.135.51 - - [14/Apr/2010:04:04:01 +0200] "GET /home/net/rete/radio.html?po=4f6679a4-4171-4c01-8727-5cf4708eaaf1&date=25.12.2009 HTTP/1.0" 200 5502 "-"  "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)" [8945]

Reproduce accesses on the test instance

In order to reproduce these calls against the test instance, a second file was first produced, containing only the URI of each request.

The code used is a simple UNIX command:

cat apache-access.log | awk '{print $7}' >> localhost-resources.txt

where ‘7’ is the “column” (field) of the access log you want to extract from the original apache-access.log into localhost-resources.txt.

This produced a file like this:

...
/resource/uri/page.html
/resource/uri/stylesheet.css
/resource/uri/page2.html
/resource/uri/page3.html
...

Now another shell command launches curl to send the same requests to the webapp (a Tomcat webapp running on localhost).

A third file is produced, which will be used to compute some statistics.

# Read URIs from the file given as $1 and request each one from the local Tomcat, printing status, total time, size and speed
for i in $(cat $1); do curl -w "%{http_code}|%{time_total}|%{size_download}|%{speed_download}|$i\n" -s -o /dev/null http://localhost:8080$i; done
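Saved as a small script, this loop takes the URI file as its first argument; a hypothetical invocation (the script and log file names below are just examples) redirects its output to the third file:

# Hypothetical invocation: replay every URI and collect one pipe-separated line per request
sh replay.sh localhost-resources.txt > localhost-replay.log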

The log file looks something like this:

...
200|0.001|61|47843.000|/resource/uri/page.html
200|0.002|314|208915.000|/resource/uri/stylesheet.css
200|0.394|29049|73802.000|/resource/uri/page2.html
404|0.003|27260|9422744.000|/resource/uri/page3.html
...

On another machine we run the top UNIX command, in order to keep performance monitored.

(screenshot: top output during the test)
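If a log is preferable to a live view, top can also run in batch mode; the interval and the output file name below are arbitrary:

# Sample CPU and memory usage every 5 seconds for the whole test run
top -b -d 5 > top-during-test.log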

Now it is easy to filter the requests that ended with 500 / 503… codes (or 404), or to filter requests that took more than x seconds, in order to discover critical pages / actions.
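Since the replay log is pipe-separated (status|total_time|size|speed|URI), a couple of awk one-liners are enough; the file name is the one assumed above:

# Requests that came back with a 5xx status code
awk -F'|' '$1 >= 500' localhost-replay.log

# Requests that took more than 1 second
awk -F'|' '$2 > 1' localhost-replay.log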

Fix the error and retry the test

The error was easy to find. An RSS feed triggered a loop in the Java code. After a few requests for the same RSS feed, Java reached 100% of the processors without serving pages. This brought the client's servers down. After fixing the code that generates the XML feed (too heavy JCR queries!), the test was relaunched without any further problems.

Statistics to summarize the work

Now it is time to report some numbers.

(screenshots: request and timing statistics; text version for cut & paste below)

  • Requests: 100,447 generated requests
    • 77,127 (77%) returned 200 (resource loaded as expected)
    • 22,496 (22%) returned 404 (resource not found; this is normal, since the test environment does not have some of the extra websites)
    • 824 (1%) returned 500, 302 or 400. The 500s were investigated and we found another little bug in some redirect code.
  • Timing: 11,987 seconds (around 3 hours and 20 minutes)
    • 98,765 (98%) responses below 1 second
    • 1,682 (2%) responses between 1 and 5 seconds
    • 0 responses over 5 seconds
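For the record, counts like these can be derived directly from the replay log with awk, along these lines (again assuming the log file name used earlier):

# Count responses per HTTP status code
awk -F'|' '{count[$1]++} END {for (code in count) print code, count[code]}' localhost-replay.log

# Count responses below and at or above 1 second
awk -F'|' '$2 < 1 {fast++} $2 >= 1 {slow++} END {print fast, slow}' localhost-replay.log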

More details on this procedure

The method used here is really simple and full of constraints, but it helped the team reach an analytic solution to the puzzle.

I am convinced that starting from the Apache logs is a good idea, because they represent a real pool of usage requests. The test can be optimized by collecting only the requests that occurred around the time the servers hit the 100% CPU threshold; alternatively, it is possible to use only the requests that generated error HTTP codes (500, 503…).
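For example, to replay only the requests that already failed in production, the URI extraction can filter on the status code field first ($9 in the log format shown above):

# Keep only requests whose status code was 5xx and extract their URIs
awk '$9 >= 500 {print $7}' apache-access.log >> localhost-resources.txt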

I strongly encourage the use of some kind of standard stress tool, because the curl method is really simple and cannot simulate concurrent calls, calls coming from different IPs, different request frequencies, and so on.
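ApacheBench (ab) is one example of such a tool; a minimal sketch, with arbitrary numbers, to hammer a single critical page with concurrent requests:

# 10,000 requests for one page, 50 concurrent connections
ab -n 10000 -c 50 http://localhost:8080/home/comunicazione/impressum.html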

A good idea is to log, at the same time, the per-process and total CPU and memory usage, in order to correlate requests with server load.


Written by Matteo Pelucco
