Perl Scripting :: Data Analytics

One of my jobs is to be a data analyst: I generate statistics by parsing, scraping,
and formatting data, and ultimately produce graphs and self-analysis
tools for managers.

I have been working with a sanitized Bugzilla database to generate daily in-flow/out-flow trend graphs for blocker bugs. The Bugzilla DB admin has been very kind to provide me with a sanitized Bugzilla dump database, hosted on a secret server in the colo. I can access that DB instance in read-only mode, and only by VPN'ing to a specific server in the server room. So, naturally, I am getting a little tired of going through multiple firewalls to access the naked DB instance.

So, today I got an idea: why not get all the required information from the Bugzilla front end of the master DB itself [instead of going through a sanitized back-end DB replica]? This will allow me to quickly automate the graph-generation job.

So, here is what I did in about two hours: I wrote a crazy Perl script that fetches an ATOM feed from Bugzilla and grabs the list of all bug IDs that match my query. Then, for each bug, Bugzilla is queried again to get an XML output for that bug [which contains a ton of information: when the bug was filed, who filed it, status, product, component, severity, priority, blocker status, etc.].
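The first script can be sketched roughly like this. The base URL, query parameters, and feed markup below are placeholders, not the actual ones from my setup; the ID extraction is the real idea, since every ATOM entry links to show_bug.cgi?id=NNN.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical Bugzilla instance -- substitute your own.
my $base = 'https://bugzilla.example.com';

# Pull unique bug IDs out of an ATOM feed, preserving first-seen order.
# Each <entry> links to show_bug.cgi?id=NNN, so a global match suffices.
sub extract_bug_ids {
    my ($atom) = @_;
    my %seen;
    return grep { !$seen{$_}++ } $atom =~ /show_bug\.cgi\?id=(\d+)/g;
}

# In the real script the feed comes from a saved query and LWP, e.g.:
#   use LWP::Simple qw(get);
#   my $atom = get("$base/buglist.cgi?bug_severity=blocker&ctype=atom");
#   for my $id (extract_bug_ids($atom)) {
#       my $xml = get("$base/show_bug.cgi?ctype=xml&id=$id");
#       ...  # parse out status, product, component, severity, priority
#   }
```

The dedup matters because a saved query's feed can mention the same bug more than once across entries.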

However, the XML response from Bugzilla does not include the historical activity on the bug. This activity log records everything that has ever happened to the bug and captures the transient state information across its entire life history.

I can query Bugzilla to get the activity info on any bug, but Bugzilla returns it as a plain old HTML table. The table contains rowspans, colspans, etc., to make my life exciting.

So, I wrote another crazy Perl script that queries for each bug [that fits my criteria] and gets the activity in HTML table format. It then parses the HTML table to extract information such as how many times the bug went RESOLVED - Reopened - RESOLVED, when the bug was actually marked as confirmed, when it was flagged as a blocker and by whom, and any other such info I might be interested in.
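The parsing step can be sketched as follows. This is a simplified illustration on toy markup: a real activity table has rowspans/colspans, which a module like HTML::TableExtract copes with far better than a regex; the cell-grabbing regex and the RESOLVED/REOPENED counting below are only a sketch of the idea.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Grab the text of each <td> cell from a chunk of (simple) table HTML.
# Illustrative only: real Bugzilla activity tables with rowspans need a
# proper parser such as HTML::TableExtract.
sub table_cells {
    my ($html) = @_;
    return $html =~ /<td[^>]*>\s*([^<]*?)\s*<\/td>/g;
}

# Given the sequence of status values in activity order, count how many
# times the bug bounced from RESOLVED back to REOPENED.
sub reopen_count {
    my (@statuses) = @_;
    my $n = 0;
    for my $i (0 .. $#statuses - 1) {
        $n++ if $statuses[$i] eq 'RESOLVED'
             && $statuses[$i + 1] eq 'REOPENED';
    }
    return $n;
}
```

The same pass over the cells can watch for other field changes, such as the first time the blocker flag was set and who set it.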

I pumped all this data into a simple sqlite3 database which is created on the fly. This way, I can port my entire script set to any machine that has Perl and the right Perl modules.
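The on-the-fly database creation looks something like this with DBI and DBD::SQLite. The table name and columns here are my guess at the kind of schema involved, not the actual one:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# SQLite creates the file on first connect if it doesn't exist yet,
# which is what makes the script set portable.
my $dbh = DBI->connect('dbi:SQLite:dbname=bugstats.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

# Hypothetical schema: one row per recorded change on a bug.
$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS bug_activity (
    bug_id     INTEGER NOT NULL,
    changed_at TEXT    NOT NULL,   -- timestamp of the change
    who        TEXT,               -- who made the change
    field      TEXT,               -- e.g. bug_status, priority
    old_value  TEXT,
    new_value  TEXT
)
SQL

# Insert one parsed activity row.
my $ins = $dbh->prepare('INSERT INTO bug_activity VALUES (?, ?, ?, ?, ?, ?)');
$ins->execute(12345, '2009-07-01 10:00', 'someone@example.com',
              'bug_status', 'RESOLVED', 'REOPENED');

$dbh->disconnect;
```

With the activity in SQL, the daily in-flow/out-flow counts become straightforward GROUP BY queries over the timestamps.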

Using the seed information I grabbed from the Bugzilla database, I was able to create some nice graphs that show the in-flow/out-flow trends of blocker bugs in more than one view, and on more than one time scale.

Look at the graphs generated here: LINK.

Just like my previous graphs here, here and here, the above graph set will also be generated by a daily cron job.

Your reviews and comments are always appreciated.

Comments

Samuel Sidler said…
Doesn't the Metrics team already have a bunch of this bug data given to them every day for Pentaho? I know they haven't done many reports with it, but it seems like they've already solved the data problem...
The OLAP DB is much bigger than the Bugzilla DB. The creation of a data warehouse for bug metrics is a long-term project and not ready to use yet. You already know the behind-the-firewall issue for the portal.
Sam, have you checked out the new set of graphs? How are they?
Unknown said…
It might be useful for you to know that Bugzilla 3.4 (which we're upgrading to as soon as we can finish porting our local changes forward, probably in about a month) lets you retrieve the bug history via XML-RPC.
