Should large source files be split into smaller ones for workflow efficiency?

I am working on a study of risk associated with code changes made in the Mozilla Firefox browser.

Here is the ten-thousand-foot view of my approach:

  • Get the list of all bugs that were fixed in Firefox in the last year (from Bugzilla).
  • Assign a score to each bug fix, based on the type of bug and on how recently it was fixed:
    • Security fix: 20, 15, 10 or 5 points, depending on whether the bug was fixed 0-90, 91-180, 181-270 or 271-365 days ago.
    • Regression fix: 16, 12, 8 or 4 points for the same four quarters.
    • General bug fix: 12, 9, 6 or 3 points for the same four quarters.
  • Add up the scores for each source file and stack rank the files in descending order. Each source file thus gets a cumulative score based on how many bugs were fixed in it, what types of bugs they were, and when they were fixed.

Using the bug patch attachments from Bugzilla and the bug IDs in the source control log comments, I could create a fairly accurate mapping from each bug number to the source files that were modified to fix it.
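
The scoring scheme above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the tooling used for the study: the function names, the tuple layout of a fix, the file names and the dates are all hypothetical, and it assumes each fix has already been mapped to the files it touched.

```python
from collections import defaultdict
from datetime import date, timedelta

# Points for a fix made in the most recent quarter, by bug type (from the post).
BASE_POINTS = {"security": 20, "regression": 16, "general": 12}


def fix_score(bug_type, fixed_on, today):
    """Score one fix: full points for 0-90 days ago, then 3/4, 1/2 and 1/4
    of the base for 91-180, 181-270 and 271-365 days ago."""
    age_days = (today - fixed_on).days
    if age_days < 0 or age_days > 365:
        return 0  # outside the one-year study window
    quarter = min(max(age_days - 1, 0) // 90, 3)  # 0 = most recent quarter
    return BASE_POINTS[bug_type] * (4 - quarter) // 4


def rank_files(fixes, today):
    """fixes: iterable of (bug_type, fixed_on, files_touched).
    Returns (file, cumulative_score) pairs, highest score first."""
    totals = defaultdict(int)
    for bug_type, fixed_on, files in fixes:
        points = fix_score(bug_type, fixed_on, today)
        for path in files:
            totals[path] += points
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical data: three fixes touching two made-up files.
today = date(2010, 6, 30)
example_fixes = [
    ("security", today - timedelta(days=10), ["a.cpp"]),
    ("regression", today - timedelta(days=100), ["a.cpp", "b.cpp"]),
    ("general", today - timedelta(days=300), ["b.cpp"]),
]
ranking = rank_files(example_fixes, today)
```

In this made-up example a.cpp collects 20 + 12 points and b.cpp collects 12 + 3, so a.cpp ranks first.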

Remember: this study covers only the fixed bugs, not all of the bugs in Bugzilla.

Without getting into too many details, I found that the top 10 files in the stack-ranked order of fix score account for 12% of all bug fixes. Since there are thousands of files in the executable, this is a significant concentration.

All of these top-scoring files are large. For example, the number one file has fifteen thousand lines of code in it.

But if I drill into a source file and look at which methods were actually modified to fix the bugs, the hotspots are concentrated in a few segments of the file. A typical scenario is that about 20% of the source file contains 80% of the fixes.
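
One way to put a number on that concentration is sketched below, assuming per-method line counts and fix counts are already available for a file. The function name, the input layout and the sample figures are hypothetical, and the greedy selection (densest methods first) is just one reasonable choice, not necessarily the smallest possible set.

```python
def hotspot_share(functions, target_percent=80):
    """functions: list of (name, line_count, fix_count) for one source file.
    Greedily picks the methods with the most fixes per line until they cover
    target_percent of the file's fixes; returns (share_of_lines, names)."""
    total_lines = sum(lines for _, lines, _ in functions)
    total_fixes = sum(fixes for _, _, fixes in functions)
    if total_lines == 0 or total_fixes == 0:
        return 0.0, []

    def density(entry):
        _, lines, fixes = entry
        return fixes / lines if lines else 0.0

    covered_lines = covered_fixes = 0
    hot = []
    for name, lines, fixes in sorted(functions, key=density, reverse=True):
        # Integer comparison avoids floating-point surprises at the threshold.
        if covered_fixes * 100 >= target_percent * total_fixes:
            break
        hot.append(name)
        covered_lines += lines
        covered_fixes += fixes
    return covered_lines / total_lines, hot


# Hypothetical file: 1000 lines and 100 fixes spread over four methods.
sample = [("Parse", 200, 40), ("Layout", 100, 40), ("Init", 300, 10), ("Misc", 400, 10)]
share, hot_methods = hotspot_share(sample)
```

For the made-up sample, the two densest methods cover 80 of the 100 fixes while holding 300 of the 1000 lines.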

It seems fairly clear that the top 10 files should have additional review mechanisms in place, given that they have been centers of so much bug-fix activity. But the actual bug fixing in those files is not spread uniformly across each file.

Here is my question: since the bug-fix hotspots in the files are concentrated, does it make sense to recommend that each large file be split into smaller files, so that only the small files containing the hotspot segments are tracked for additional review while the safer portions follow the normal check-in procedures?

Comments

Taras said…
Yes.

I think we should do it at function granularity. So risky functions would have an attribute

i.e.
nsresult NEEDS_EXTRA_REVIEW
foo() {
}
where NEEDS_EXTRA_REVIEW is a macro that expands to nothing at compile time.
smaug said…
Taras' suggestion to use some annotation looks good.
But splitting methods out of the file makes coding and reviewing harder. nsEditor related code for example has the methods of the classes in many different files, and following that code is way too difficult.
We should just merge those different files to one.
Gerv said…
When you say "in Firefox", exactly which products did you include? Core? Toolkit?

An interesting question is whether you can connect regressions to the bugs they regressed, and then see whether fixes in particular files or places are more likely to cause regressions. That might indicate that the code is fragile and should be refactored or reviewed.

In fact, it would probably be worth evaluating anything you wanted to tag with NS_EXTRA_REVIEW for possible refactoring.
Did you look at "bugs fixed per line", or some other metric that accounts for file size? Could it just be that the big files have more bugs because they have more code?
I think your scoring mixes up "code which is changing a lot" and "code which is scary". I'd be much more interested in this analysis if you only counted changes for security/regression bugs (and counted them as equally severe).