Economics chapter added to “Empirical software engineering using R”

March 26th, 2017

The Economics chapter of my Empirical software engineering book has been added to the draft pdf (download here).

This is a slim chapter; it might grow a bit, but I suspect not by a huge amount. Reasons include lots of interesting data being confidential and me not having spent a lot of time on this topic over the years (so my stash of accumulated data is tiny). Also, a significant chunk of the economics data I have is used to discuss issues in the Ecosystems and Projects chapters; perhaps some of this material will migrate back once these chapters are finalized.

You might argue that Economics is more important than Human cognitive characteristics and should have appeared before it (in chapter order). I would argue that hedonism by those involved in producing software is the important factor that pushes (financial) economics into second place (still waiting for data to argue my case in print).

Some of the cognitive characteristics data I have been waiting for arrived, and has been added to this chapter (some still to be added).

As always, if you know of any interesting software engineering data, please tell me.

I am after a front cover. A woodcut of alchemists concocting a potion appeals, perhaps with various software references discreetly included, or astronomy related (the obvious candidate has already been used). The related modern stuff I have seen does not appeal. Suggestions welcome.

Ecosystems next.


Happy 30th birthday to GCC

March 22nd, 2017

Thirty years ago today Richard Stallman announced the availability of a beta version of gcc on the mod.compilers newsgroup.

Everybody and his dog was writing C compilers in the late 1980s and early 1990s (a C compiler validation suite vendor once told me they had sold over 150 copies; a compiler vendor has to be serious to fork out around $10,000 for a validation suite). Did gcc become the dominant open source compiler because one compiler would inevitably become dominant, or was there some collection of factors that gave gcc a significant advantage?

I think gcc’s market dominance was driven by two environmental factors, with some help from a technical compiler implementation decision.

The technical implementation decision was the use of RTL as the optimization+code generation strategy. Jack Davidson’s 1981 PhD thesis (and much later the LCC book) describe the gory details. The code generators for nearly every other C compiler were closely tied to the machine being targeted (because the implementers were focused on getting a job done, not producing a portable compiler system). Had they been so inclined, Davidson and Christopher Fraser could have been the authors of the dominant C compiler.

The first environment factor was the creation of a support ecosystem around gcc. The glue that nourished this ecosystem was the money made writing code generators for the never ending supply of new cpus that companies were creating (that needed a C compiler). In the beginning Cygnus Solutions were the face of gcc+tools; Michael Tiemann, a bright affable young guy, once told me that he could not figure out why companies were throwing money at them and that perhaps it was because he was so tall. Richard Stallman was not the easiest person to get along with and was probably somebody you would try to avoid meeting (I don’t know if he has mellowed). If Cygnus (who had created 175 host/target combinations by 1999) had gone with a different compiler, gcc would be as well-known today as Hurd.

Yes, people writing Masters and PhD theses were using gcc as the scaffolding for their fancy new optimization techniques (e.g., here, here and here), but this work essentially played the role of an R&D group trying to figure out where effort ought to be invested writing production code.

Sun’s decision to unbundle the development environment (i.e., stop shipping a C compiler with every system) caused some developers to switch to another compiler, some choosing gcc.

The second environment factor was the huge leap in available memory on developer machines in the 1990s. Compiler vendors cannot ship compilers that do fancy optimization if developers don’t have computers with enough memory to hold the optimization information (many, many megabytes). Until developer machines contained lots of memory, a one-man band could build a compiler producing code that was essentially as good as everybody else’s. An open source market leader could not emerge until the man+dog compilers could be clearly seen to be inferior.

During the 1990s the amount of memory likely to be available in developers’ computers grew dramatically, allowing gcc to support more and more optimizations (donated by a myriad of people targeting some aspect of code generation that they found interesting). Code generation improved dramatically and man+dog compilers became obviously second/third rate.

Would things be different today if Linus Torvalds had not selected gcc? If Linus had chosen a compiler licensed under a more liberal license than copyleft, things might have turned out very differently. LLVM started life in 2003 and one of my predictions for 2009 was its demise in the next few years; I failed to see the importance of licensing to Apple (who essentially funded its development).

Eventually, success.

With success came new existential threats, in particular death by a thousand forks.

A serious fork occurred in 1997. Stallman was clogging up the works; fortunately he saw the writing on the wall and in 1999 stepped aside.

Money is what holds together the major development teams supporting gcc and llvm. What happens when customers wanting support for new back-ends dry up, what happens when major companies stop funding development? Do we start seeing adverts during compilation? Chris Lattner, the driving force behind llvm, recently moved to Tesla; will it turn out that his continuing management was as integral to the continuing success of llvm as getting rid of Stallman was to the continuing success of gcc?

Will a single mainline version of gcc still be the dominant compiler in another 30 years’ time?

Time will tell.

Learning from some legal decisions

March 13th, 2017

The British and Irish Legal Information Institute provides “Access to Freely Available British and Irish Public Legal Information”. Searching the England and Wales High Court (Technology and Construction Court) Decisions throws up some interesting reading (when searching on software).

For those who have never seen a decent sized project go wrong from the inside, DE BEERS UK LIMITED (Formerly: THE DIAMOND TRADING COMPANY LIMITED) vs. ATOS ORIGIN IT SERVICES UK LIMITED provides a well written example. De Beers contracted Atos to write some software. The development of the software did not go well. Were the original requirements/spec underdone or were subsequent personnel not up to the job? Difficult to tell from the Decision, as is the reason Atos thought they had a chance of winning a court case.

SAP UK LIMITED vs. DIAGEO GREAT BRITAIN LIMITED was a licensing dispute, or more accurately an example of why it is important to check what your third-party software gets up to. Diageo had signed a licensing agreement with SAP and 5,800 Diageo users had used an app which, unknown to them, made use of SAP. The end result was a bill for £55 million, which Diageo had not been expecting.

There are probably more interesting cases to learn from, but I am supposed to be writing a book in my ‘spare’ time.

Uncovering the undefined behaviors

March 7th, 2017

I think that all programming languages contain some constructs that have undefined behavior.

The C Standard explicitly lists various constructs as having undefined behavior. It also specifies that: “Undefined behavior is otherwise indicated in this International Standard by the words ‘undefined behavior’ or by the omission of any explicit definition of behavior.” The second half of the sentence refers to what might be called implicit undefined behavior. Implicit undefined behavior can be subdivided into intentional and unintentional. Intentional undefined behavior applies to constructs that the committee considered and decided (and continues to decide) to say nothing about (e.g., question 19), while unintentional undefined behavior applies to constructs that the committee did not explicitly consider (when discovered, these often end up as defect reports, which are sometimes resolved as intentionally undefined behavior).

Fans of some languages claim that ‘their’ language does not contain any undefined behaviors.

Ada does not explicitly specify any construct as having undefined behavior, but it does specify that some constructs generate a bounded error; a rose by any other name…

I sometimes bump into language inventors claiming that ‘their’ language is fully specified, i.e., does not contain any undefined behaviors. My first question to them, about the behavior of division involving negative values, invariably requires me to explain that there are two possible ways of doing it (ignorance is bliss when fully specifying a language). The invariable answer is that the behavior is whatever the underlying implementation does (which is often written in C). In other words, they have imported all the undefined behaviors of the implementation language.
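For anybody who has not met the two ways of dividing negative values, here is a minimal sketch in Python. The function names are mine, purely for illustration. One approach rounds the quotient toward zero (which is what C has required since C99), the other rounds toward negative infinity (which is what Python's // operator does); the two only disagree when the operands have different signs and the division is inexact.

```python
def div_truncate(a: int, b: int) -> int:
    """Integer division rounding toward zero (the C99 and later rule)."""
    q = abs(a) // abs(b)
    # The quotient is negative only when the operands have different signs.
    return q if (a < 0) == (b < 0) else -q


def div_floor(a: int, b: int) -> int:
    """Integer division rounding toward negative infinity (Python's // rule)."""
    return a // b


# Compare the two rules on every sign combination of 7 and 2.
for a, b in [(7, 2), (-7, 2), (7, -2), (-7, -2)]:
    print(f"{a:3d} / {b:2d}: truncate={div_truncate(a, b):2d} floor={div_floor(a, b):2d}")
```

A language specification that just says "integer division" has not picked one of these.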

Follow-up questions include: what is the order of expression evaluation (e.g., left-to-right, right-to-left, inside out…), what is the order of function argument evaluation (often driven by the direction of stack growth), what is the order of initialization and other order related questions that come to mind. Their fully specified language quickly turns out to be a sham.
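Why argument evaluation order matters is easiest to see when the arguments have side effects. The following is a small sketch with made-up helper names (make_counter, subtract). Python itself specifies left-to-right evaluation, so the right-to-left case is simulated by hand to show what an implementation making the other choice would compute.

```python
def make_counter():
    """Return a function that yields 1, 2, 3, ... on successive calls."""
    state = {"n": 0}

    def next_value():
        state["n"] += 1
        return state["n"]

    return next_value


def subtract(x, y):
    return x - y


# Python evaluates arguments left to right: first argument 1, second 2.
nxt = make_counter()
left_to_right = subtract(nxt(), nxt())

# Simulated right-to-left order: the second argument is evaluated first.
nxt = make_counter()
second = nxt()
first = nxt()
right_to_left = subtract(first, second)

print(left_to_right, right_to_left)
```

Same source expression, two different answers; a specification that is silent on the order leaves the result to the implementation.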

A recent post by John Regehr talks about Gödel’s incompleteness Theorem as a source of undefined behavior. My understanding is that the underlying argument is built on non-termination. How is it possible to tell the difference between non-termination and lasting longer than the age of the universe? In itself I don’t think this theorem is a source of undefined behavior; more enlightenment welcome.

C compilers of the 20th century running on Microsoft operating systems

March 2nd, 2017

There used to be a huge variety of C compilers available for sale under MS-DOS and later Microsoft Windows. A C compiler validation suite vendor once told me they had sold over 150 copies; a compiler vendor has to be serious to fork out around $10,000 for a validation suite (actually good value for money given the volume of tests in a commercial suite).

C compilers of the 20th century running on Microsoft operating systems would make a great specialist subject for a Mastermind contestant. The August 1983 issue of BYTE must be the go-to reference for C in the 1980s.

Here is my current list of compilers that were once and perhaps still are commercially available on Microsoft operating systems.

Aztec C: from Manx Software Systems.

Borland C: from Borland

cc65: …and on Github.

IBM PC C Compiler: from Lattice???

Lattice C:…

CI-C86: from Computer Innovations.


DeSmet C:…

Digital Research C: Was this ever sold on a Microsoft OS?

Eco-C and Eco-C88 C:…

LCC: sold as a book in the 20th century, but its Microsoft OS implementations, such as lcc-win (with over 2 million copies distributed) and Pelles C, are really 21st century compilers.

Mark Williams C compiler: A US company having an entry in the German Wikipedia ranked significantly higher by Google than its English Wikipedia page shows that this compiler was a big success on the Atari ST (very popular in Germany) but not DOS/Windows.

MetaWare High C:…

Microsoft C: The compiler that nobody got fired for buying. Vendors had to try hard to generate worse code than this compiler (which some achieved, i.e., MIX) and also very hard to provide better runtime support (which nobody ever could). Version 2 of Microsoft C was actually the Lattice C compiler.

MIX C from Mix Software


Supersoft C:…

TopSpeed C: from Jensen & Partners International.

Watcom C: open sourced as Open Watcom

Wizard C: from Bob Jervis who sold (licensed???) it to Borland, where it became Turbo C.

Zorland C, Zortech C: from Walter Bright and my compiler of choice for several years.

If you know of a compiler that is missing from this list, or have better information, please let me know in the comments. Hopefully I will start to remember more about long forgotten C compilers.


Estimating the yearly spend on developing software

February 28th, 2017

How much does a software company spend on developing its software?

The plot below shows revenue vs software development costs for 100 US companies, in industry categories Computer programming services and Packaged software, with revenues greater than $100 million during 2014-2015. The data is from company accounts filed with the government (code+data, plus the Georgia Tech financial analysis lab where I found the data).

Company revenue vs amount spent on software development

A straight line fits very well (a quadratic is slightly better, but let’s keep things simple) and shows companies spending 13% of their revenue on software development. A log-log graph suggests a power law, but in this case the fitted exponent is one, i.e., no power law as such.
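To see why a fitted exponent of one means "no power law as such", here is a minimal sketch using made-up numbers (not the data behind the plot; least_squares is my own helper). A power law has the form cost = a × revenue^b, which is a straight line with slope b on log-log axes; when b is one this collapses to plain proportionality, cost = a × revenue.

```python
import math


def least_squares(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up revenues in $ millions; an assumption for illustration only.
revenue = [120, 250, 400, 900, 1500, 4000, 9000]
# Made-up development costs, constructed as exactly 13% of revenue.
dev_cost = [0.13 * r for r in revenue]

# Straight-line fit on the original scale.
slope, intercept = least_squares(revenue, dev_cost)

# Straight-line fit on the log-log scale; the slope is the power-law exponent.
exponent, log_scale = least_squares(
    [math.log(r) for r in revenue],
    [math.log(c) for c in dev_cost],
)

print(f"linear fit:  cost = {slope:.3f} * revenue + {intercept:.3f}")
print(f"log-log fit: cost = {math.exp(log_scale):.3f} * revenue^{exponent:.3f}")
```

With real data the points scatter around the line, but the reading is the same: an exponent close to one says spending scales in proportion to revenue.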

If 13% is the figure for companies that would be expected to be spending heavily to develop software, how much do companies in other industry sectors pay? Google and Facebook are media companies (their income is from advertising); do they really spend that much on software?

There are an estimated 3.3 million software developers in the US. What is the average cost of a software developer? If we take an average salary of $80K, and do the usual doubling to factor in overheads, we get $160K. This gives a total software development cost (most of the cost is for people) in the US of around $0.5 trillion per year.
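The back-of-the-envelope arithmetic, spelled out as a sketch (the variable names are mine; the figures are the ones quoted above):

```python
developers = 3_300_000          # estimated software developers in the US
average_salary = 80_000         # assumed average salary, in dollars
overhead_multiplier = 2         # the "usual doubling" to cover overheads

cost_per_developer = average_salary * overhead_multiplier
total_cost = developers * cost_per_developer

print(f"cost per developer: ${cost_per_developer:,}")
print(f"total yearly cost:  ${total_cost / 1e12:.3f} trillion")
```

The total comes out at $528 billion, which rounds to the $0.5 trillion used here.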

The above plot shows 6% of the estimated $0.5 trillion yearly software development costs in the US. Who is spending the other 94%? One place to look is the Form 10-K that public companies are required to submit to the Securities and Exchange Commission.

Facebook’s 10-K, for 2015, shows $4,816 million spent on R&D (is this all software?) and $3,633 million on “Computer software, office equipment and other” (I’m guessing almost none of this is capitalized software). Dividing R&D expenditure by number of employees (12,691 at the end of 2015) gives $380K. I know average Silicon valley salaries are high, but not that high. I have enough trouble following my own company’s accounts, so trying to understand Facebook’s is a lost cause before it starts.
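The per-employee figure is a simple division; the sketch below (variable names are mine) also compares it against the $160K fully loaded developer cost assumed earlier.

```python
rd_spend = 4_816_000_000        # Facebook R&D for 2015, from the 10-K
employees = 12_691              # headcount at the end of 2015
loaded_developer_cost = 160_000 # the $80K salary doubled for overheads

rd_per_employee = rd_spend / employees
ratio = rd_per_employee / loaded_developer_cost

print(f"R&D per employee: ${rd_per_employee:,.0f}")
print(f"multiple of assumed developer cost: {ratio:.2f}")
```

Note that the divisor is all employees, not just developers, which makes the per-developer figure even harder to explain.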

Scraping the Form 10-Ks on the SEC site will not provide sensible numbers; they will have to be read and analyzed. There is enough material for several MBA projects…


NWIP for Monochrome inkjet yield

February 23rd, 2017

As a member of IST/5, the British Standards’ programming language committee, I receive a daily notification of relevant documents that have arrived at BSI. The email arrives just before midnight and contains a generous helping of acronyms, such as: N13344 SC 28 ISO-IECJTC1-SC28 N2051 NWIP for Monochrome inkjet yield.

The line break on the above line resulted in “Monochrome inkjet yield” appearing at the start of a line and it caught my attention, so I downloaded the document.

SC28 is the ISO committee for office equipment and this NWIP (New Work Item Proposal) is for WG2 (the Working Group responsible for consumables) to create a new ISO Standard with the title: “Method for the Determination of Ink Cartridge Yield for Monochrome Inkjet Printers and Multifunction Devices that Contain Printer Components”. Voting, on whether or not work should start on this proposal, closes on July 12.

Why was information about inkjet yield sent to a programming language list? Are SC28/WG2 having a membership drive and have been tipped off that our workload is declining? More importantly, are they following the C++ model of having regular meetings in Hawaii; the paperwork does not say. The standard for color inkjet printers appeared in 2009; was the production of this document such a traumatic event that it decimated committee membership and it has taken eight years to put together a skeleton group?

Attached to the proposal is a 20-page draft document; somebody has been busy.

So how is it proposed that monochrome inkjet yield be calculated? You need at least nine inkjet cartridges, three printers and a room at a temperature of 23 degrees (plus/minus 2 degrees, with readings taken every 15 minutes and an hourly running average calculated; “… temperature can have a profound effect on test results.”). Load “… a common medium weight paper and must conform to the printer’s list of approved papers.” into the three printers that have been “… temperature acclimated to the test room environment.” and count the number of pages printed by each printer (using at least three cartridges in each printer) before “…an end of life judgement.” Divide total number of pages printed by total number of cartridges used and there you go.

End of life? “The cartridge yield is determined by an end of life judgement, or signalled with either of two phenomena: fade, caused by depletion of ink in the cartridge or automatic printing stop caused by an Ink Out detection function.”

What is fade?
“3.1 Fade
A phenomenon where a significant reduction in uniformity occurs due to ink depletion.
NOTE In this test, fade is defined as a noticeably lighter, 3 mm or greater, gap located in the text, in the bar chart, or in the boxes around the periphery of the test page. The determination of the change in lightness is to be made referenced to the 25th page printed for each cartridge in testing. For examples of fade, please consult Annex A.”

And Annex A?
“Examples of Fade <future edit: add picture>”

Formulas for calculating the standard deviation and a 90% confidence interval are given (the 90% confidence interval formula assumes a Normal distribution; I would have thought that the distribution of pages printed by a cartridge might be skewed and a bootstrap procedure would be more reliable).
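For the curious, here is a sketch of the two approaches, using page counts I have invented for nine cartridges. The Normal-based interval below is the textbook mean ± z × s/√n form, which is my assumption; the draft's exact formula is not reproduced here. The bootstrap is a plain percentile bootstrap of the mean, with bootstrap_ci being my own helper.

```python
import math
import random
import statistics

# Invented page counts for nine cartridges; not taken from the draft.
pages_per_cartridge = [412, 398, 455, 430, 389, 467, 421, 402, 510]

# Yield: total pages printed divided by total cartridges used.
cartridge_yield = sum(pages_per_cartridge) / len(pages_per_cartridge)
std_dev = statistics.stdev(pages_per_cartridge)

# 90% interval assuming the mean is Normally distributed.
z = statistics.NormalDist().inv_cdf(0.95)  # two-sided 90% leaves 5% per tail
half_width = z * std_dev / math.sqrt(len(pages_per_cartridge))
normal_ci = (cartridge_yield - half_width, cartridge_yield + half_width)


def bootstrap_ci(values, level=0.90, resamples=10_000, seed=1):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(resamples)
    )
    lower = means[round((1 - level) / 2 * resamples)]
    upper = means[round((1 + level) / 2 * resamples) - 1]
    return lower, upper


boot_ci = bootstrap_ci(pages_per_cartridge)

print(f"yield: {cartridge_yield:.1f} pages, standard deviation {std_dev:.1f}")
print(f"Normal 90% interval:    {normal_ci[0]:.1f} to {normal_ci[1]:.1f}")
print(f"bootstrap 90% interval: {boot_ci[0]:.1f} to {boot_ci[1]:.1f}")
```

The bootstrap makes no assumption about the shape of the distribution, so a skewed sample produces an asymmetric interval rather than one forced to be symmetric about the mean.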

It is daylight now and my interest in inkjet yield is satiated. But if you, dear reader, have a longing for more, then Ms. Michelle Pangborn (Hewlett-Packard), USA or Mr. Nobuaki Hamada (Epson), Japan are the people to contact.

Some printer test pages to add to your link collection.

DACS: Software Life Cycle Empirical/Experience Database

February 19th, 2017

Economic data relating to software development is very, very hard to find. Companies just don’t want to reveal how much they spent on, or charged for, writing a software system. This kind of data is invariably confidential.

I’m currently working on the Economics chapter of my book on Empirical Software Engineering and the data is somewhat thin.

I’m hoping one of my readers can help out with a copy of the “DACS data”.

DACS (The Data & Analysis Center for Software), a US DoD information analysis center, used to sell copies of their Software Life Cycle Empirical/Experience Database for $50. The most interesting data set was the DACS Productivity Dataset containing effort and schedule data on over 500 software projects.

DACS was merged into CSIAC (Cyber Security & information systems Information Analysis Center; not sure if I capitalized the appropriate information) and the data availability is no more.

If you have a copy of this data, or know somebody who does, please send me a copy.

The person who put the data together, Richard Nelson, no longer works for the government, has a consulting firm registered in Orlando, and is an officer of the NASA Alumni League Florida Chapter. All the obvious searches for an email address fail, and I suspect that a retirement is being enjoyed.

Of course I am always happy to hear about any software engineering data that you think I don’t have.

Fault density: so costly to calculate that few values are reliable

February 10th, 2017

Fault density (i.e., number of faults per thousand lines of code) often appears in claims relating to software quality.

Fault density sounds like a very useful value to know; unfortunately most quoted values are meaningless because obtaining reliable data is very costly.

The starting point for calculating fault density is the number of reported faults (I will leave the complexity of what constitutes a line of code for a future post). Most faults don’t get reported.

If there are no reported faults, fault density is zero. The more often software is executed the more likely a fault will be experienced (i.e., the larger the range of input values thrown at a program the more likely it will go down a path containing a fault). Comparing like-with-like requires knowing how many different kinds of input a program processed to experience a given number of faults; we don’t want to fall into the trap of claiming heavily used code is less fault prone than lightly used code.

What counts as a fault? One study found that 46% of reported faults in Open Source bug tracking systems were misclassified (e.g., a fault report was actually a request for enhancement). Again, comparing like-with-like requires agreement on what constitutes a fault.

How should faults in code that is no longer shipped be counted? If the current version of a program contains 100K lines and previous versions contained 50K lines that have been deleted, should the faults in those 50K lines contribute to the fault density of the current program? I would say not, which means somebody has to figure out which reported faults apply to code in the current version of the program.
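A sketch of how much the answer moves depending on what gets counted. The FaultReport record and the six reports are invented for illustration; the 100K line count is the example above.

```python
from dataclasses import dataclass


@dataclass
class FaultReport:
    report_id: int
    is_fault: bool         # False for, e.g., an enhancement request filed as a fault
    in_current_code: bool  # False if the affected lines have since been deleted


def fault_density(reports, current_lines):
    """Faults per thousand lines, counting only genuine faults in shipped code."""
    counted = sum(1 for r in reports if r.is_fault and r.in_current_code)
    return counted / (current_lines / 1000)


# Invented reports against a program whose current version has 100K lines.
reports = [
    FaultReport(1, is_fault=True, in_current_code=True),
    FaultReport(2, is_fault=True, in_current_code=True),
    FaultReport(3, is_fault=False, in_current_code=True),   # misclassified
    FaultReport(4, is_fault=True, in_current_code=False),   # code was deleted
    FaultReport(5, is_fault=True, in_current_code=True),
    FaultReport(6, is_fault=True, in_current_code=True),
]
current_lines = 100_000

naive = len(reports) / (current_lines / 1000)
filtered = fault_density(reports, current_lines)

print(f"naive density:    {naive:.2f} faults per KLOC")
print(f"filtered density: {filtered:.2f} faults per KLOC")
```

The code is trivial; the cost is in filling in the two boolean fields for every report, which is the work that rarely gets done.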

I am aware of fewer than half a dozen fault density values that I would consider reliable (most calculated during the Rome period). Everything else is little better than reading tea leaves.

I have been reading your interesting paper

February 2nd, 2017

In the last six years or so I have sent around 420 emails whose first line started: “I have been reading your interesting paper”, followed a few lines later by: “Would it be possible to obtain a copy of the data?”, and then some background and links to blog posts and my previous book.

The response breakdown is roughly as follows:

Received data                       136  32%
No reply                            132  32%
Pending (received a positive reply)  49  12%
Confidential                         42  10%
No longer have the data              20   5%
Best known address bounces           11   3%

Thanks to those 136 researchers who took the time to collect together their data and send me a copy.

The “No reply” responses get a second email 6-9 months after the first. I’m hoping that the availability of a draft of the book will generate some positive publicity that reminds researchers they have had an email from me and are missing out.

The “Confidential” case is relatively low because it is often obvious that the data is confidential and I don’t bother asking for a copy (I only use data that can be made public).

A common reason behind “No longer have the data” is a change of laptop and sometimes a change of jobs. If the paper is more than five years old, I tend not to ask unless the data looks very interesting. My experience, and that of others, shows that research data has a relatively short half-life.

I try quite hard to find a workable address, sometimes emailing supervisors and going via LinkedIn.
