Using SQL for Complex Business Logic Is a Bad Idea

All of the code between the BEGIN and END was plain SQL with no error or exception handling at all. Having so many inserts, updates, deletes, and joins is an accident waiting to happen. Initially, the developers were delighted that their code ran without errors each night. One could even see a little of the Dunning–Kruger effect as they began to tempt fate, saying things like “This PL/SQL stuff is pretty simple, just add a BEGIN, COMMIT, and END and you’re good to go.” The only problem was that the nightly runs had simply not yet hit any inconvenient errors. This initial success would prove to be short-lived.
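To make the antipattern concrete, here is a minimal sketch of the kind of block being described. The table and column names (sales_load, sales_summary, job_log, and so on) are hypothetical stand-ins, not the actual schema:

    BEGIN
      -- A long run of plain SQL with no exception handling anywhere.
      INSERT INTO sales_summary (region, total_amount, reporting_period)
        SELECT region, SUM(amount), TRUNC(SYSDATE, 'MM')
        FROM   sales_load
        GROUP  BY region;

      DELETE FROM sales_load
      WHERE  load_date < TRUNC(SYSDATE);

      -- The log insert sits inside the same transaction, so any
      -- later failure rolls it back along with everything else.
      INSERT INTO job_log (job_name, run_date, status)
      VALUES ('NIGHTLY_SALES', SYSDATE, 'SUCCESS');

      COMMIT;
    END;
    /

A single statement failing anywhere in a block like this takes the whole transaction, log row included, down with it.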

Over time, with new incoming data and minor modifications to the script, the overly complex SQL block failed because of an attempted duplicate key insertion. This is a common enough error condition, and one that it is prudent to prepare for in your PL/SQL code. As you’ll see in later chapters, it’s easy enough to add this type of error handling. But in the present case, once the SQL block failed, the whole transaction was automatically rolled back. Sadly, the log table update was part of the same transaction, so it too was rolled back. The difficulty with this approach is that any clues about the source of the error are rolled back into oblivion along with the data. This would result in a great deal of developer angst.
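For reference, trapping this particular failure takes only a handler: Oracle predefines the DUP_VAL_ON_INDEX exception for it. A sketch, using the same hypothetical tables as before:

    BEGIN
      INSERT INTO sales_summary (region, total_amount, reporting_period)
        SELECT region, SUM(amount), TRUNC(SYSDATE, 'MM')
        FROM   sales_load
        GROUP  BY region;
      COMMIT;
    EXCEPTION
      WHEN DUP_VAL_ON_INDEX THEN
        -- The duplicate key no longer fails silently: undo the work,
        -- surface the cause, and re-raise so the caller sees a failure.
        ROLLBACK;
        DBMS_OUTPUT.PUT_LINE('Duplicate key: ' || SQLERRM);
        RAISE;
    END;
    /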

The overall effect was that, after the first time the error appeared, it looked like one of the following had occurred:

  • The job didn’t run at all.
  • The job succeeded without errors but produced no data.
  • The job was rolled back.

It was hard to know which of these had happened. It’s a classic integration scenario where there is insufficient logging detail (and business workflow knowledge) to assist with any attempt at diagnosing the problem.
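One well-established way out of the rolled-back-log trap is to write log rows in an autonomous transaction, which commits independently of the caller. A sketch, again using a hypothetical job_log table:

    CREATE OR REPLACE PROCEDURE log_job_status (
      p_job_name IN VARCHAR2,
      p_status   IN VARCHAR2,
      p_message  IN VARCHAR2
    ) AS
      PRAGMA AUTONOMOUS_TRANSACTION;
    BEGIN
      -- This COMMIT affects only the autonomous transaction, so the
      -- log row survives even when the main transaction rolls back.
      INSERT INTO job_log (job_name, run_date, status, message)
      VALUES (p_job_name, SYSDATE, p_status, p_message);
      COMMIT;
    END log_job_status;
    /

Called from an exception handler, a procedure like this preserves exactly the diagnostic trail that the original design destroyed.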

The job then continued to fail each night for a full week, and the issue only came to light when the downstream team began wondering why there were no sales figures for that week. In the week in question, many sales had been made and bonuses were due, but the development team knew none of this. The developers were still laboring under the misapprehension that no error had occurred. After a few more days of missing finance report data, the matter was escalated and became a development priority.

It was quite embarrassing for the developers, who then had to work late attempting to cobble together the required report data by hand. As they did not understand the business-specific data constraints and rules, this manual override approach resulted in handcrafted changes to table data and even more errors in the final results. Worse still, some of the upstream data had changed by this time, so the results were an erroneous mixture of old and new data.

Needless to say, the finance team was not impressed. Then another downstream team from the finance group started asking for their consolidated data. It’s not hard to see how tempers started to fray, emails started to fly, and the whole sorry affair deteriorated into a blame game. All because of a poorly crafted PL/SQL and SQL script.

This type of antipattern, and the resultant panicky responses to user-reported errors, tends to undermine the faith that end users have in the overall process. It can also undermine the confidence of the developers. This is a tough road for developers trying to produce resilient solutions, and it’s entirely avoidable. In Part III, starting in Chapter 7, I’ll develop a workflow similar to the one just described, but far more resilient.