Showing posts with label disasters. Show all posts

Tuesday, August 07, 2007

Minneapolis Bridge Lessons for Digital Security

The Minneapolis bridge collapse is a tragedy. I had two thoughts that related to security.

  1. If the bridge collapsed due to structural or design flaws, the proper response is to investigate the designers, contractors, inspectors, and maintenance personnel from a safety and negligence perspective. Based on the findings, architectural and construction changes plus new safety operations might be applied in the future. This is a technical and operational response.

  2. If the bridge collapsed due to attack, the proper response is to investigate, apprehend, prosecute, and incarcerate the criminals. Redesigning bridges to withstand bomb attacks is unlikely. This is a threat reduction and deterrence response.


Do you agree with that assessment? If yes, why do you think response 1 (try to improve the "bridge" and similar operations) is the response to every digital security attack (i.e., case 2)? My short answer: everyone blames the victim, not the criminal.

The NTSB is on scene in Minneapolis with law enforcement to figure out if the bridge collapse was caused by scenario 1 or 2. Why don't we have a National Digital Security Board investigating breaches? My short answer: it's easier to hide a massive security breach than the destruction of any bridge, building, plane, or train.

Monday, July 09, 2007

More Engineering Disasters

I've written several times about engineering disasters here and elsewhere.

Watching more man-made failures on The History Channel's "Engineering Disasters," I realized that lessons learned the hard way by safety, mechanical, and structural engineers and operators can be applied to those practicing digital security.

In 1983, en route from Virginia to Massachusetts, the World War II-era bulk carrier SS Marine Electric sank in high seas. The almost forty-year-old ship was ferrying 24,000 tons of coal and 34 Merchant Mariners, none of whom had survival suits to resist the February chill of the Atlantic. All but three died.

The owner of the ship, Marine Transport Lines (MTL), blamed the crew and one of the survivors, Chief Mate Bob Cusick, for the disaster. Investigations of the wreck and a trial revealed that the Marine Electric's coal hatch covers were in disrepair, as Cusick had reported prior to the disaster. Apparently the American Bureau of Shipping (ABS), an inspection organization on which the Coast Guard relied but which was funded by ship operators like MTL, had faked reports on the Marine Electric's condition. With gaping holes in the coal hatches, the ship's cargo holds filled with water in high seas and doomed the crew.

In the wake of the disaster, the Coast Guard recognized that ABS could not be an impartial investigator because ship owners could essentially pay to have their vessels judged seaworthy. A broad review of ship inspections revealed that many similar ships, and others, were unsound, and they were removed from service. Unreliable Coast Guard inspectors were also removed. Finally, the Coast Guard created its rescue swimmer team (dramatized by the recent movie "The Guardian") to act as a rapid response unit.

The lessons from the Marine Electric disaster are numerous.

  1. First, be prepared for incidents and have an incident response team equipped and trained for rapid and effective "rescue."

  2. Second, be suspicious of reports done by parties with conflicts of interest. Stories abound of vulnerability assessment companies who find all of their clients "above average." To rate them otherwise would be to potentially lose future business.

  3. Third, understand how to perform forensics to discover root causes of security incidents, and be willing to act decisively if those findings demonstrate problems applicable to other business assets.

In 1931, a Fokker F-10 Trimotor carrying eight passengers and crew crashed near Kansas City, Kansas. All aboard died, including Notre Dame football coach Knute Rockne. At the time of the disaster, plane crashes were fairly common. Because commercial passenger service had only become popular in the late 1920s, the public did not have much experience with flying. The death of Knute Rockne caused shock and outrage.

Despite the crude state of crash forensics in 1931, the Civil Aeronautics Authority (CAA) determined the plane crashed because its wood wing separated from its steel body during bad weather. TWA, operator of the doomed flight, removed all F-10s from service and burned them. Public pressure forced the CAA, forerunner of today's Federal Aviation Administration, to remove the veil of secrecy applied to its investigation and reporting processes. TWA turned to Donald Douglas for a replacement aircraft, and the very successful DC-3 was born.

The crash of TWA flight 599 provides several sad lessons for digital security.

  1. First, few seem to care about disasters involving new technologies until a celebrity dies. While no one would like to see such an event occur, it's possible real change of opinion and technology will not happen until a modern Knute Rockne suffers at the hands of a security incident.

  2. Second, authorities often do not have a real incentive to fix processes and methods until a tragedy like this occurs. Out of this incident came pressure to deploy flight data recorders and to create more robust aviation organizations.

  3. Third, real inspection regulations and technological innovation followed the crash, so such momentum may appear after digital wrecks.

The final engineering disaster involves the Walt Disney Concert Hall in Los Angeles. This amazing, innovative structure, with a polished stainless steel skin, was completed in October 2003. Once it was finished, visitors immediately noticed a problem with its construction. The sweeping curves of its roof acted like a parabolic mirror, focusing the sun's rays like a laser on nearby buildings, intersections, and sections of the sidewalk. Temperatures exceeded 140 degrees Fahrenheit in some places, while drivers and passersby were temporarily blinded by the glare.

Investigators decided to model the entire facility in a computer simulation, then monitor for the highest levels of sunlight over the course of a year. Using this data, they discovered that 2% of the building's skin was causing the reflection problems. The remediation plan, implemented in March 2005, involved sanding the problematic panels to remove their sheen. The six-week, $60,000 effort fixed the glare.
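
The approach is easy to picture in miniature. Below is a toy sketch of that kind of analysis (hypothetical Python with invented panel geometry and sun direction, not the investigators' actual model): reflect incoming sunlight off each panel of a modeled skin and flag the small fraction of panels whose reflections converge on a sensitive spot.

  import numpy as np

  def reflect(incident, normal):
      # Reflect an incoming ray direction about a unit surface normal.
      return incident - 2 * np.dot(incident, normal) * normal

  rng = np.random.default_rng(0)
  num_panels = 1000

  # Toy panel normals: mostly upward-facing, perturbed to mimic curvature.
  normals = np.tile([0.0, 0.0, 1.0], (num_panels, 1))
  normals += 0.1 * rng.normal(size=(num_panels, 3))
  normals /= np.linalg.norm(normals, axis=1, keepdims=True)

  # Invented sun direction and direction toward a sidewalk "hot spot."
  sun = np.array([0.3, -0.2, -0.93])
  sun /= np.linalg.norm(sun)
  target = np.array([0.5, 0.5, -0.7])
  target /= np.linalg.norm(target)

  reflections = np.array([reflect(sun, n) for n in normals])
  # A panel is a problem if its reflection points nearly at the target.
  problem = np.flatnonzero(reflections @ target > 0.995)

  print(f"{len(problem)} of {num_panels} panels "
        f"({100 * len(problem) / num_panels:.1f}%) focus glare on the target")

The real investigation had to account for a full year of sun positions; the point of the sketch is simply that simulation can isolate the few components responsible for a system-wide problem.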

The lessons from the concert hall involve complexity and unexpected consequences. Architect Frank Gehry wanted to push the envelope of architecture with his design. His innovation produced a building that no one, prior to its construction, really understood. Had the system been modeled before being built, it's possible the problems could have been avoided. This situation is similar to those involving enterprise network and software architects who design systems that no single person truly understands. Worse, the system may expose services or functionality never expected by its creators. Explicitly taking steps to simulate and test a new design prior to deployment is critical.

Digital security engineers should not ignore the lessons their analog counterparts have to offer. A commitment to learn from the past is the best way to avoid disasters in the future.

Friday, September 22, 2006

Nisley on Failure Analysis

Since I'm not a professional software developer, the only reason I pay attention to Dr. Dobb's Journal is Ed Nisley. I cited him earlier in Ed Nisley on Professional Engineering and Insights from Dr. Dobb's. The latest issue features Failure Analysis, Ed's look at NASA's documentation on mission failures. Ed writes:

[R]eviewing your projects to discover what you do worst can pay off, if only by discouraging dumb stunts.

What works for you also works for organizations, although few such reviews make it to the outside world. NASA, however, has done a remarkable job of analyzing its failures in public documents that can help the rest of us improve our techniques.


Documenting digital disasters has been a theme of this blog, although my request for readers to share their stories went largely unheeded. This is why I would like to see (and maybe create/lead) a National Digital Security Board.

Here are a few excerpts from Ed's article. I'm not going to summarize it; it takes about 5 minutes to read. These are the concepts I want to remember.

NASA defines the "root" cause of a mishap as [a]long a chain of events leading to a mishap, the first causal action or failure to act that could have been controlled systematically either by policy/practice/procedure or individual adherence to policy/practice/procedure.

The root causes of these mishaps (incorrect units, invalid inputs, inverted G-switches) seem obvious in retrospect. How could anyone have possibly made those mistakes?

In addition to the root cause, the MIB Reports also identify a "contributing" cause as [a] factor, event or circumstance which led directly or indirectly to the dominant root cause, or which contributed to the severity of the mishap.


The "chain of events" is symptomatic of disasters. A break in that chain prevents the disaster.

However, the MIB [Mishap Investigation Board] discovered that [t]he Software Interface Specification (SIS) was developed but not properly used in the small forces ground software development and testing. End-to-end testing ... did not appear to be accomplished. (emphasis added)

Lack of end-to-end testing appears to be a common theme with disasters.

Mars, the Death Planet for spacecraft, might not have been the right venue for NASA's then-new "Faster, Better, Cheaper" mission-planning process...

The Mars Program Independent Assessment Team (MPIAT) Report pointed out that overall project management decisions caused the cascading series of failed verifications and tests. One slide of their report showed the MCO and MPL project constraints: Schedule, cost, science requirements, and launch vehicle were established constraints and margins were inadequate. The only remaining variable was risk.

In this context, "Faster" means flying more missions, getting rid of "non-value-added" work, and reducing the cycle time by working smarter rather than harder. "Cheaper" has the obvious meaning: spending less to get the same result. The MCO [Mars Climate Orbiter] and MPL [Mars Polar Lander] missions together cost less than the previous (successful) Mars Pathfinder mission.

The term "Better" has an amorphous definition, which I believe is the fundamental problem. In general, management gets what it measures and, if something cannot be measured, management simply won't insist on getting it.

You can easily demonstrate that you're doing things faster, that you've eliminated "non-value-added" operations, and that you're spending less money than ever before. You cannot show that those decisions are better (or worse), because the only result that really matters is whether the mission actually returns science data. Regrettably, you can measure that aspect of "better" after the fact and, in space, there are no do-overs.
(emphasis added)

The last part is crucial. For digital security, the only result that really matters is whether you preserve confidentiality, integrity, and availability, usually by preventing and/or mitigating compromise. All the other stuff -- "percentage of systems certified and accredited," "percentage of systems with anti-virus applied," "percentage of systems with current patch levels" -- is absolutely secondary. In the Mars mission context, who cares if you build the spacecraft quicker, launch on time, and spend less money, if the vehicle crashes and the mission fails?

Thankfully NASA is taking steps to learn from its mistakes by investigating and documenting these disasters. It's time the digital security world learned something from these rocket scientists.

Wednesday, August 30, 2006

Pandemic Reporting Like Digital Security Incident Reporting

The 12 August 2006 issue of the Economist featured the story Global Health: A Shot of Transparency (subscription required). It reminded me of the state of reporting digital security incidents.

At the moment, the world's pandemic-alert system is distressingly secretive. Some countries, such as Vietnam, have been fairly open about new outbreaks of the sorts of infectious disease that might lead to pandemics, and have even invited foreigners in to help diagnose the problem. Most, however, have not been so forthright. Public-health experts point to China and Thailand, both of which suffered outbreaks of potential pandemic illnesses in the past few years (SARS in China and avian influenza in Thailand) as examples of places that do not fully disclose the relevant details...

The reasons for countries' reluctance to share information are understandable, though hardly defensible. Some believe that full disclosure could cause locals to panic and foreign tourists to stay away...

Larry Brilliant, a former WHO official who helped to eradicate smallpox in India, dreams of an open-source, non-governmental, public-access network that would help the world move quickly whenever potential pandemics start brewing. He looks for inspiration to the Global Public Health Intelligence Network (GPHIN), an obscure programme run by the Canadian government that searches public databases in seven languages looking for early signs of disease outbreak...

His proposed open network could well spot the next, as yet undiscovered, threat.


I don't want to stretch the analogy too far, but some interesting ideas are here. I wonder what the effect of publishing the IP addresses of botnet hosts would be? Not the controllers, but the hosts themselves. That would reveal (at least in the narrow botnet case) how widespread certain compromises might be (ignoring the NAT effect).

I was hoping to hear other ways of encouraging reporting, but no others appeared in the article.

Wednesday, April 26, 2006

Disaster Stories Help Envisage Risks

The April 2006 issue of Information Security Magazine features an article titled Security Survivor All-Stars. It profiles people at five locations -- LexisNexis, U Cal-Berkeley, ChoicePoint, CardSystems, and Georgia Technology Authority -- who suffered recent and well-publicized intrusions. My guess is that InfoSecMag managed to arrange these interviews by putting a "happy face spin" on the story: "We know your organization was a security mess, but let's look on the bright side and call you an all-star!" Although the article is light on details, I recommend reading these disaster stories. They help make security incidents more real to management.

ChoicePoint is one of the companies profiled. That story really bothers me. To know why, read The Five Most Shocking Things About the ChoicePoint Debacle and The Never-Ending ChoicePoint Story by Sarah D. Scalet. I noticed that InfoSecMag did not interview ChoicePoint chairman and CEO Derek V. Smith, author of The Risk Revolution: Threats Facing America & Technology’s Promise for a Safer Tomorrow and A Survival Guide in the Information Age (both published prior to the ChoicePoint debacle).

InfoSecMag also avoided interviewing former ChoicePoint CISO Rich Baich, author of Winning as a CISO. No, I am not making this up. This is the same Mr. Baich about whom Ms. Scalet wrote the following. Baich is speaking:

"Look, I'm the chief information security officer. Fraud doesn't relate to me." He indicated that he would be doing the CISO community a service by explaining to the media why fraud was not an information security issue. (The company later denied his request to grant the interview.)

The feds, however, are acting as if it's an information security issue. ChoicePoint has indicated that the Federal Trade Commission is "conducting an inquiry into our compliance with federal laws governing consumer information security and related issues."


In this interview with TechTarget, Baich says:

It's created a media frenzy; this has been mislabeled a hack and a security breach. That's such a negative impression that suggests we failed to provide adequate protection. Fraud happens every day. Hacks don't.

Wow, this guy is out of touch. Instead of having difficulty finding work, he's now on the speaking circuit as a Managing Director with PricewaterhouseCoopers. And why is he still a CISSP? This is an excellent example of problems with the CISSP -- no one loses their certification.

For a stark contrast, peruse the Maryland Real Estate Commission - Disciplinary Actions site. You can read about the real estate workers who lost their licenses for malpractice. It is sad to think that information security is treated less seriously than selling real estate.

By the way -- everyone who wants an overview of risk management frameworks should read Alphabet Soup by Shon Harris in the same InfoSecMag issue.

Thursday, February 09, 2006

Ed Nisley on Professional Engineering

I get a free subscription to Dr. Dobb's Journal. The March 2006 issue features an article by Ed Nisley titled "Professionalism." Ed is a software developer with a degree in Electrical Engineering. After working at a computer manufacturer for ten years in New York state, he decided to become a "consulting engineer." Following the state's advice, Ed pursued a license to be a Professional Engineer. Now, 20 years after first earning his PE license, Ed declined to renew it. He says "the existing PE license structure has little relevance and poses considerable trouble for software developers." You have to register with DDJ to read the whole article, but the process is free and the article is worthwhile.

Here are a few of Ed's reasons to no longer be a PE:

  • "[T]o maintain my Professional Engineering license, I must travel to inconvenient places, take largely irrelevant courses, and pay a few kilobucks. As nearly as I can tell from the course descriptions, the net benefit would be close to zero."

  • ["T]here's no generally applicable Software Engineering Body of Knowledge (SWEBOK) upon which to base a Software Engineering examination, so (as I understand it) a Texas engineer seeking a PE license for software activities must demonstrate a suitable amount of experience, as attested by letters of recommendation." (He was discussing efforts in Texas to make software engineers be PEs.)

  • "A 2001 ACM task force report on Licensing of Software Engineers Working on Safety-Critical Software concluded that professional licensing as it stands today simply wouldn't work in that field. They observe that very few 'software engineers' have an engineering degree accredited by the Accreditation Board for Engineering and Technology, which all state PE licensing boards require. Most programmers, it seems, don't have the opportunity to forget Thermo and Chem, having not studied them in the first place."

  • "Software development also moves much faster than the NCEES testing process. Mechanical and electrical engineering questions dating back three decades remain perfectly useful, but most recent graduates have little knowledge of Fortran and GOTOs."

  • "If you produce work as a PE, you must follow established design practices or risk a malpractice lawsuit when your design fails. Software engineering, even in the embedded field, simply doesn't have any known-good design practices: Most projects fail despite applying the current crop of Best Practices."

  • "Worse, without a good self-imposed technical solution, we're definitely going to get legislative requirements that won't solve the problem."


If you think that creating a test designed for "software engineers" is a good idea, check out the rest of the article to see Ed's experience taking the exams. They sound like nothing more than a check to ensure the ability to answer a smattering of science and math questions.

The process reminded me of an exam we took at the Air Force Academy for what was then called (and may still be) Engineering 410. This was supposed to be a "capstone course" that all seniors took to demonstrate their engineering prowess. Yes, even your local history/political science double major took chemistry, physics (two courses), math (Cal III and Diff Eq), thermodynamics, and the five pure engineering courses (electrical, mechanical, civil, aeronautical, astronautical) prior to this capstone course. (That's why I have Bachelor of Science degrees and not BAs. At a normal college I would also have a minor in Engineering.)

To enter the capstone course, all students had to pass a cross-subject exam, where anything studied up to that point was fair game. I should add that non-engineering subjects like biology or the "soft sciences" were also included. If you failed the exam (with a possibility of one retake) you failed the course. If you took the course in the fall semester, you could return in the spring. If you took the course in the last semester of senior year (like me), and you failed the test, you were coming back for a special "fifth year" (USAFA has no real "fifth year" of study!) just to take Engineering 410.

In the dreamworld of the academic faculty, I'm sure they believed this exam would test the quality of the "engineers" they were producing. In reality all they tested was our ability to cram as much as we could fit into our brains prior to the test. By the time I was a senior I had no clue what I had studied in chemistry or biology three years earlier. After reading Ed's story, it sounds like his PE exams were exactly the same. They test the candidate's ability to remember information (itself no mean feat, granted) and then apply that to a test. They say nothing about whether the candidate is a good or even qualified engineer.

If the test is worthless, what might really drive PEs to do good work? I think the fact that PEs can lose their license to practice is a big factor. That happened in the 1981 Hyatt collapse I blogged about earlier. If you're a PE and you lose your license because your project fails, you've lost your ability to make a living. If you're a software developer and your project fails, you continue working or you get a job elsewhere.

Incidentally, the skies over USAFA looked exactly like the photo posted above. Every day. Ok, I'm kidding, but it felt like that. That is a real photo taken 10 August 2004.

Sunday, February 05, 2006

Another Engineering Disaster

Does the following sound like any security project you may have worked on?

  1. Executives decide to pursue a project with a timetable that is too aggressive, given the nature of the task.

  2. They appoint a manager with no technical or engineering experience to "lead" the project. He is a finance major who can neither create nor understand design documents. (This sounds like the news of MBAs being in vogue, as I reported earlier.)

  3. The project is hastily implemented using shoddy techniques and lowest-cost components.

  4. No serious testing is done. The only "testing" even tried does not stress the solution in any meaningful way -- it only "checks a box."

  5. Shortly after implementation, the solution shows signs of trouble. The project manager literally patches the holes and misdirects attention without addressing the underlying flaws.

  6. Catastrophe eventually ensues.


What I've just described is the Boston Molasses Flood of 1919, best described by the Boston Society of Civil Engineers in their newsletter (.pdf). I learned about this event by watching another episode of Engineering Disasters on Modern Marvels. Here's what happened.

  1. In 1915, United States Industrial Alcohol needed to build a tank in Boston to support World War I munitions production. They decide to place it in an immigrant-dominated part of the city, largely populated by Italians.

  2. USIA puts Arthur Jell in charge. He is a finance major with no technical or engineering experience or training. He can't even read blueprints, yet he designs a five-story, 90' diameter tank capable of holding over 2 million gallons of molasses, in the middle of a populated area.

  3. The tank is built by contractors who use thin steel and too few rivets. No one supervises their work. They hurry to complete the tank 2 days before it is filled.

  4. Prior to being filled, the tank is "tested" by holding between 4 and 8 inches of water!

  5. The tank stands three years, although apparently it was never filled to capacity until shortly before its collapse. During those three years, molasses leaks from the tank daily. Jell orders the leaks plugged and has the tank painted brown to disguise them.

  6. In 1918, with WWI ending and prohibition approaching, USIA decides to switch production from industrial alcohol to drinking alcohol. They want to cash out as fast as possible by supporting customers who want to "stock up" before prohibition begins. They accept a shipment of molasses from Cuba in January 1919, which fills the tank to capacity. Three days later, on January 15, the tank ruptures, killing 21 people and injuring 150.


USIA claimed Italian anarchists had destroyed the tank, but the evidence showed otherwise. USIA was subjected to the first-ever class action lawsuit in the US, which the company lost. Safety regulations were enacted that required supervision of construction, real testing, and stamped approval of blueprints by architects and engineers.

I foresee a similar event, with similar consequences, for the digital security industry. Hopefully not as much death and destruction will occur, but the remedies will be the same.

Thursday, December 01, 2005

Engineering Disasters in Information Security Magazine

The December 2005 issue of Information Security magazine features an article I wrote titled History Lessons with the subtitle "Digital security could learn a lot from engineering's great disasters." It is based on this blog entry describing analog engineering disasters like the 1931 Yangtze River dam failure, the 1944 Cleveland LNG tank fire, the 1981 Kansas City Hyatt Regency hotel walkway collapse, and the 1993 Atlanta Marriott parking lot sinkhole.

I am considering expanding this topic of digital security disasters to encompass a new book. I would like to take a historical and technical look at digital security failures on a case-by-case basis. Ideally the cases would be based on testimony from witnesses or participants wishing to (anonymously) share lessons with their colleagues.

My concept is simple: when a bridge fails in the "analog" world, everyone knows about it. The disaster is visible, and engineers can analyze and learn from the event. The lessons they take away make future bridges stronger and safer. I do not see this happening in the digital world. Organizations suffer disasters all the time due to poor techniques, tools, configuration, management decisions, and so on. Unfortunately, few people ever hear about these problems, so they are repeated elsewhere. The only parties to benefit are intruders. Security engineers never get to learn from the mistakes of others.

What would a sample story look like? As a simple example, I know of an ISP who suffered a two-hour router ACL drop that allowed remote intruders to exploit their development network. The ISP suffered a major compromise by Russian intruders that required a dedicated multi-week, guerrilla-warfare incident response effort. Several lessons can be learned: (1) a router with an ACL is not a firewall, especially when you can attack any high port using source port 20 TCP; (2) development networks with unpatched machines should not have publicly routable IP addresses or be Internet-facing; and (3) there is value in monitoring to detect when your defensive measures fail.
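
To make lesson (1) concrete, here is a minimal sketch (hypothetical Python, not taken from the actual incident) of why a stateless ACL that trusts traffic by source port is not a firewall: an attacker who simply binds to TCP port 20 can reach any high port behind the filter.

  def acl_permits(src_port, dst_port):
      # Model of a naive "FTP data" router ACL: permit TCP from source
      # port 20 to any destination port above 1023; deny everything else.
      return src_port == 20 and dst_port > 1023

  # A legitimate FTP data connection from a server's port 20: permitted.
  print(acl_permits(20, 51515))   # True

  # An attacker deliberately sourcing a scan from port 20 to reach a service
  # on an unpatched, Internet-facing development host: also permitted.
  print(acl_permits(20, 6000))    # True -- e.g., X11 listening on TCP 6000

  # A stateful firewall would instead check whether an inside host had
  # actually initiated a related FTP session before allowing this traffic.

A real router ACL is more elaborate than a one-line predicate, but the weakness is the same: decisions based only on packet headers, with no connection state, can be satisfied by any attacker who controls those headers.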

Would any of you be willing to share your stories with me? I would be willing to communicate in any reasonable manner you wish to preserve your identities and sensitivities. The goal of the book is to provide real-world cases that can teach lessons to fellow security engineers. I am not trying to embarrass or humiliate anyone. I do not expect to hear any company or personal names, and if you still provide them I will not repeat them in the book. I am most interested in stories that have plenty of technical details.

Please email your thoughts to richard at taosecurity dot com. Thank you.

Monday, October 24, 2005

More on Engineering Disasters and Bird Flu

Here's another anecdote from the Engineering Disasters story I wrote about recently. In 1956 the cruise ship Andrea Doria was struck and sunk by the ocean liner Stockholm. At that time radar was still a fairly new innovation on sea vessels. Ship bridges were dimly lit, and the controls on radar systems were not illuminated. It is possible that the Stockholm radar operators misinterpreted the readings on their equipment, believing the Andrea Doria was 12 miles away when it was really 2 miles away. The ships literally turned towards one another on a collision course, based on faulty interpretation of radar contact in the dense fog. Catastrophe ensued.

This disaster shows how humans can never be removed from the equation, and they are often at center stage when failures occur. The commentator on the show said a 10-cent light bulb illuminating the radar control station could have shown that the radar range was set differently from what the operator assumed. Following the Andrea Doria collision, illumination was added to ship radar controls. This story reminded me that the latest security technology is worthless -- or even worse, damaging -- in the hands of people who are not trained or able to use it properly.

On a different subject, I heard an interview on NPR with Health and Human Services Secretary Mike Leavitt about bird flu. He likened the situation to "surveillance" of a dry forest during fire season. He said that the best defense was vigilance and rapid response. His analogy assumed being nearby when a small fire erupts. First responders who are quickly on the scene can stamp out a fire before it becomes uncontrollable. If the response team is unaware of the fire, it can spread and then be beyond containment. He concluded the interview saying "ultimately, another pandemic will come. Right now we are not prepared."

I thought his comments applied well to digital security incidents. NSM is surveillance, and incident response helps stamp out fires (or bird flu outbreaks) quickly before they exceed an organization's capacity to deal with them. Is your organization ready? If you want to know, TaoSecurity provides services like incident response training and CSIRT assessments and evaluations.

Saturday, October 22, 2005

Further Thoughts on Engineering Disasters

My TiVo managed to save a few more episodes of Modern Marvels. You may remember I discussed engineering disasters last month. This episode of the show of the same title took a broader look at the problem. Three experts provided comments that resonated with me.

First, Dr. Roger McCarthy of Exponent, Inc. offered the following story about problems with the Hubble Space Telescope. When Hubble was built on earth, engineers did not sufficiently address issues with the weight of the lens on Earth and deflections caused by gravity. When Hubble was put in orbit, the lens no longer deflected and as a result it was not the proper shape. Engineers on Earth had never tested the lens because they could not figure out a way to do it.

So, they launched and hoped for the best -- only to encounter a disaster that required a $50 million orbital repair mission. Dr. McCarthy's comment was "A single test is worth a thousand expert opinions." This is an example of management by fact instead of management by belief, mentioned previously on this blog.

Second, Dr. Charles Perrow, author of Normal Accidents: Living With High-Risk Technologies, explained the makings of a disaster. Essentially, he said disasters are caused by the unforeseen consequences of multiple, individually non-devastating, failures in complex systems. Most catastrophes could be prevented if any one of the small failures had not occurred.

Third, Mary Schiavo commented on the Challenger disaster. She described the well-known problems with operating the Shuttle's rocket O-rings in temperatures below 53 degrees F. The Shuttle had launched at lower temperatures prior to the Challenger explosion, but NASA knew they were risking catastrophe. Ms. Schiavo said NASA engineers begged their managers not to let Challenger launch, seeing that chunks of ice covered the launch pad and Shuttle. They were overruled and disaster occurred.

This struck a chord with me, because a few days earlier I read a news story in Time about how Steve Jobs gets Apple to bring innovative products to market:

Apple CEO Steve Jobs [will] tell you an instructive little story. Call it the Parable of the Concept Car. "Here's what you find at a lot of companies," he says, kicking back in a conference room at Apple's gleaming white Silicon Valley headquarters, which looks something like a cross between an Ivy League university and an iPod. "You know how you see a show car, and it's really cool, and then four years later you see the production car, and it sucks? And you go, What happened? They had it! They had it in the palm of their hands! They grabbed defeat from the jaws of victory!

"What happened was, the designers came up with this really great idea. Then they take it to the engineers, and the engineers go, 'Nah, we can't do that. That's impossible.' And so it gets a lot worse. Then they take it to the manufacturing people, and they go, 'We can't build that!' And it gets a lot worse."

When Jobs took up his present position at Apple in 1997, that's the situation he found. He and Jonathan Ive, head of design, came up with the original iMac, a candy-colored computer merged with a cathode-ray tube that, at the time, looked like nothing anybody had seen outside of a Jetsons cartoon. "Sure enough," Jobs recalls, "when we took it to the engineers, they said, 'Oh.' And they came up with 38 reasons. And I said, 'No, no, we're doing this.' And they said, 'Well, why?' And I said, 'Because I'm the CEO, and I think it can be done.'"


Would Steve Jobs have overruled the NASA engineers and launched Challenger? Who knows.

From what I have learned, disasters are prone to happen in complex, tightly-coupled systems. The only way to try to avoid them is to test and monitor their operation, exercise response, and then implement those plans when catastrophe occurs. Anything less is like launching a defective, untested Hubble and hoping for the best, and then paying through the nose to clean up the mess.

Here are a few footnotes to this post. Dr. McCarthy's company offers security engineering services, including services for information systems. They are described thus: "We have assembled one of the largest private collections of computerized accident and incident data in the world. Our web-based solutions put this information at your disposal, giving you comprehensive risk data quickly and at low cost." Dr. McCarthy was recently elected to the National Academy of Engineering, which has a Computer Science and Telecommunications Board with an Improving Cybersecurity Research in the United States project. My research for this story also led me to the System Safety Society.