Should large source files be split into smaller ones for workflow efficiency?

I am working on a study of the risk associated with code changes made in the Mozilla Firefox browser.

Here is the ten-thousand-foot view of my approach:

  • Get the list of all bugs that were fixed in Firefox in the last year [from Bugzilla].
  • Assign a score to each bug fix.
    • Example: if a source file was modified to fix a security bug in the most recent quarter, give it 20 points; if the fix landed 4 quarters ago, give it 5 points. So a security fix contributes 20, 15, 10, or 5 points depending on when it landed within the last 365 days.

      If a source file was modified to fix a regression bug in the most recent quarter, give it 16 points; if the fix landed 4 quarters ago, give it 4 points. So a regression fix contributes 16, 12, 8, or 4 points.

      By the same logic, a general bug fix contributes 12, 9, 6, or 3 points depending on whether it landed 0-90, 91-180, 181-270, or 271-365 days ago. (A sketch of this scoring appears after this list.)

      Then add up the scores for each source file and stack-rank the files in descending order.

  • Each source file thus gets a cumulative score based on how many bugs were fixed in it, what types of bugs they were, and when the fixes landed.
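
For concreteness, here is a minimal sketch of the scoring and ranking in Python. The weights (20/16/12) and the quarterly decay are the ones from my example above; the function names and the shape of the input data are just illustrative, not part of the study itself.

    from datetime import date

    # Base score per bug type for a fix that landed in the most recent
    # quarter; the score decays by a quarter of the base per ~91 days.
    BASE_SCORE = {"security": 20, "regression": 16, "general": 12}

    def fix_score(bug_type, fix_date, today=None):
        """Score one bug fix: full base score in the most recent quarter,
        three quarters of it in the previous quarter, and so on."""
        today = today or date.today()
        age_days = (today - fix_date).days
        if age_days < 0 or age_days > 365:
            return 0                      # outside the one-year window
        quarter = min(age_days // 91, 3)  # 0..3, ~91 days per quarter
        base = BASE_SCORE[bug_type]
        return base - quarter * (base // 4)

    def rank_files(fixes):
        """fixes: iterable of (file_path, bug_type, fix_date) tuples.
        Returns (file_path, cumulative_score), stack-ranked descending."""
        scores = {}
        for path, bug_type, fix_date in fixes:
            scores[path] = scores.get(path, 0) + fix_score(bug_type, fix_date)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
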
Using the patch attachments in Bugzilla and the bug IDs in the source-control log comments, I could build a fairly accurate mapping from bug number to the source files that were actually modified to fix that bug.
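
That mapping can be scripted. A minimal sketch, assuming a local Mercurial clone and commit messages that follow the usual "Bug NNNNNN - ..." convention (the regex and the hg template are my assumptions, not a guaranteed format):

    import re
    import subprocess
    from collections import defaultdict

    BUG_RE = re.compile(r"\b[Bb]ug\s+(\d+)\b")

    def bugs_to_files(repo_path):
        """Map bug number -> set of files touched by commits whose log
        message mentions that bug."""
        # One record per commit: "<first line of description>\t<files>";
        # {files} is space-separated in Mercurial's template language.
        out = subprocess.run(
            ["hg", "log", "--template", "{desc|firstline}\t{files}\n"],
            cwd=repo_path, capture_output=True, text=True, check=True,
        ).stdout
        mapping = defaultdict(set)
        for line in out.splitlines():
            desc, _, files = line.partition("\t")
            match = BUG_RE.search(desc)
            if match:
                mapping[int(match.group(1))].update(files.split())
        return mapping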

Remember: this study is done only on fixed bugs, not on all bugs in Bugzilla.
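
Pulling only the fixed bugs is straightforward with Bugzilla's REST API. A minimal sketch (the field list is an assumption on my part, and last_change_time only approximates "fixed in the last year", so the resolution date still needs checking per bug):

    import requests

    BUGZILLA = "https://bugzilla.mozilla.org/rest/bug"

    def fetch_fixed_bugs(since_iso, limit=500):
        """Fetch Firefox bugs resolved as FIXED, changed since a date."""
        params = {
            "product": "Firefox",
            "bug_status": "RESOLVED",
            "resolution": "FIXED",
            "last_change_time": since_iso,  # e.g. "2013-01-01"
            "include_fields": "id,keywords",
            "limit": limit,
        }
        resp = requests.get(BUGZILLA, params=params, timeout=60)
        resp.raise_for_status()
        return resp.json()["bugs"]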

Without getting into too many details: the top 10 files in the stack-ranked fix-score order contain 12% of all bug fixes. Since the executable is built from thousands of files, that is a significant concentration.

All of these top-scoring files are large. For example, the number-one file has fifteen thousand lines of code in it.

But if I drill into a file and look at which methods were actually modified to fix the bugs, the hotspots are concentrated into segments of the file itself. A typical pattern is that 20% of the file contains 80% of its fixes.
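
To make that concrete: given the line ranges each fix touched, the concentration can be measured directly. A minimal sketch (the input format is hypothetical):

    def hotspot_concentration(file_lines, touched_ranges, target=0.80):
        """Smallest fraction of the file's lines (taken in order of how
        often each line was touched) that accounts for `target` of all
        fix touches.  touched_ranges: one (start, end) per fix, 1-based."""
        touches = [0] * (file_lines + 1)
        for start, end in touched_ranges:
            for line in range(start, min(end, file_lines) + 1):
                touches[line] += 1
        total = sum(touches)
        covered = lines_used = 0
        for count in sorted(touches, reverse=True):
            if total == 0 or covered >= target * total:
                break
            covered += count
            lines_used += 1
        return lines_used / file_lines

    # A result around 0.20 means 20% of the file's lines carry 80%
    # of the fix activity.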

It is fairly obvious that the top 10 files should have additional review mechanisms in place, given that they have been centers of so much bug-fix activity. But the actual bug fixing in those files is not uniformly spread across them.

Here is my question: since the bug-fix hotspots within the files are concentrated, does it make sense to recommend splitting the large files into smaller ones, so that only the small files containing the hotspot segments are tracked for additional review, while the safer portions follow the normal check-in procedures?