Do you really know your applications - the case for Operational Architecture
The business owners I’ve worked with tell me that one of their most frustrating experiences with IT organizations is the extended systems outage. The impact on their operations is as wide-reaching as the business processes those systems support. Some of the more notable impacts to the business included:
- Financial penalties for missing critical milestones
- Overtime paid to operations staff to catch up on the backlog of work
- Revenue impact resulting from loss of customers
- Costs associated with acquiring or reacquiring customers
- Impact to customer and production analytics resulting from data loss, especially around high-volume transactional systems
More often than not, the outage was extended because the IT department lacked even basic documentation about the application, system, or service, documentation that would have allowed it to speed service restoration.
The question is: do you know enough about your applications to restore service quickly if they suffer an outage? In my experience consulting and working with various application and infrastructure teams, the answer is no, particularly for applications or infrastructure that have been transferred from one manager or team to another.
Source code and requirements documentation are not sufficient. Systems and services developed today go beyond the older client / server constructs and are instead built on a framework of interdependent services, such as cloud computing. Understanding these systems therefore requires a larger documentation framework. It should include information about the services, their dependencies, the server and network infrastructure, and their respective SLAs.
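To illustrate why dependencies and their SLAs belong in the documentation, here is a minimal sketch of how the availability of a service shrinks as it leans on other services. The service names and availability figures are hypothetical, and the calculation assumes that every component must be up and that failures are independent, which real systems only approximate.

```python
from math import prod

# Hypothetical availability targets (fraction of time up) for one
# application and the services it depends on. Not taken from any real SLA.
SLAS = {
    "web-frontend": 0.999,
    "order-api": 0.999,
    "database": 0.9995,
}


def composite_availability(slas):
    """Availability when every listed component must be up at once.

    Assumes component failures are independent, so the individual
    availabilities multiply.
    """
    return prod(slas.values())


def allowed_downtime_hours(availability, period_hours=30 * 24):
    """Hours of downtime the availability figure permits in one period."""
    return (1 - availability) * period_hours


if __name__ == "__main__":
    overall = composite_availability(SLAS)
    print(f"Composite availability: {overall:.4%}")
    print(f"Downtime budget per 30 days: {allowed_downtime_hours(overall):.2f} hours")
```

Under these assumptions the composite figure is always lower than the weakest individual SLA, which is why each dependency and its SLA needs to be written down.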
I have used this framework to document the operational architecture of systems that I’ve had to support. This framework should be used in conjunction with coding and requirements documentation to provide a holistic view of the applications / services your team is required to support:
- Overview – a section describing the purpose of the application or service and impact to the business. Include here key contact information and usage (such as who is using the services)
- Key supporting tasks – a description of the tasks that are required to support the application or service and who performs those tasks
- Maintenance tasks – the regularly recurring tasks that must be performed and who performs them
- Server information – what are the servers, VMs, SAN, or other devices that are used to deliver the services or applications you support
- Backup schedule – when are the backups made for each of the devices that are mentioned in the previous section and what type of backups are made. You also want to know the recovery SLA for each backup.
- Reboot / maintenance schedule – if there is a regular reboot schedule for any of these devices, or other maintenance windows that would require an outage of these dependent devices, you will note them here as this will play into your SLA calculations
- Databases – the name, location, and configurations of any databases that are used and any security requirements or configurations associated with these databases
- Architecture diagram – a diagram of how the various technical components fit together. Include any services that your application or service depends on. Also include the network infrastructure and any specific security requirements inherent to the network infrastructure
- Services – list the names of the services that the application or service physically runs and any permissions needed to support those services. Also list any child processes, whether they run on their own or as part of another service, and their purpose in delivering the application or service.
- Key files and configurations – list any files that are used in support of the service or application. If this is a Wintel environment, note any registry entries that are required as well
- Recovery procedures – lay out the disaster recovery procedures that are in place (or reference their location) as well as any procedures to recover parts of the system, particularly those components that may be prone to corruption or failure.
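One way to keep this framework honest is to hold it as a structured record and check which sections are still empty. The sketch below is only an illustration: the class name, field names, and sample values are hypothetical, chosen to mirror the sections listed above, and most teams would keep the actual content in a wiki or document rather than in code.

```python
from dataclasses import dataclass, field, fields


@dataclass
class OperationalArchitecture:
    """One record per application or service; fields mirror the list above."""
    overview: str = ""
    key_supporting_tasks: list = field(default_factory=list)
    maintenance_tasks: list = field(default_factory=list)
    server_information: list = field(default_factory=list)
    backup_schedule: dict = field(default_factory=dict)
    reboot_maintenance_schedule: dict = field(default_factory=dict)
    databases: list = field(default_factory=list)
    architecture_diagram: str = ""  # path or link to the diagram
    services: list = field(default_factory=list)
    key_files_and_configurations: list = field(default_factory=list)
    recovery_procedures: str = ""  # text or a reference to its location


def missing_sections(doc):
    """Return the names of sections that have not been filled in yet."""
    return [f.name for f in fields(doc) if not getattr(doc, f.name)]


if __name__ == "__main__":
    # Hypothetical, partially documented inherited system.
    doc = OperationalArchitecture(
        overview="Order entry service used by the sales team",
        server_information=["app-vm-01", "db-vm-01"],
    )
    for name in missing_sections(doc):
        print("Still undocumented:", name)
```

Running a check like this against an inherited system gives a quick picture of how much documentation work remains.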
As you can see, the documentation of a system goes far beyond the narrow confines of source code and requirements documents. If you have this as a minimum, your ability to recover from an outage will improve. This is also not an exhaustive list. You may find the need to document other infrastructure or configurations, such as firewall rules or other security protocols needed to support your application or service. The goal here is to have as much information about the systems you maintain as possible.
That said, if the system is poorly documented and / or inherited, this will not be an easy task to complete. It will take some time, but the benefits to the business of going through this exercise clearly outweigh the time you will invest in it.


