Wednesday, December 11, 2013

Designing and Building your CMDB, Part 2: Test, Test, and Test Again

Apologies for the long gap between these two parts: project work has eaten up all my time and it's not been possible to post as often as I'd have liked!

So the question left at the end of Part 1 was: how do we know we've correctly identified everything related to our system for the CMDB? The answer is that we don't: we didn't write the software, so we can't claim to fully understand it. Now I'll explain why that doesn't matter as much as you'd think.

As with most things in business it's not about a 100% guarantee - it's about doing the right amount of work to minimise the risk of impact to the business from the "unknown unknowns" (those things we don't know we don't know). To work out how well we've reduced that risk we need to go back to the reason we're doing this: to take the knowledge out of people's heads and put it into a system so it can be shared. I don't need to create a CI for every item in a config file - the CMDB needs to know where the file is and what it's configuring. The rest is up to the person looking at it and the problem or incident they're dealing with.
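To make that concrete, here is a minimal sketch of what a CI for a config file might hold. The class, the field names, the file path and the "UK InfoMaps" system name are all my own illustrative assumptions, not something your CMDB tool will necessarily give you. The point is only that the record stores where the file is and what it configures, and nothing about the individual settings inside it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigFileCI:
    """A CI recording where a config file lives and what it configures."""
    name: str
    location: str    # where the file is
    configures: str  # the CI this file configures


# Hypothetical values for illustration only; the path is made up.
web_config = ConfigFileCI(
    name="InfoMaps web.config",
    location=r"\\WIN005\example\path\web.config",
    configures="UK InfoMaps",
)

print(f"{web_config.name}: {web_config.location} configures {web_config.configures}")
```

Anyone who needs the detail opens the file; the CMDB just tells them which file to open.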

To test that our CIs will meet our needs we need to look at how we expect them to be used once they've been created. Here are a few examples of the scenarios we might like to test our configuration against:
  • A user X who has just joined the company needs to be added to the users list (1)
  • User X has changed role and no longer needs access (2)
  • User X needs access to the Admin Interface (3)
  • User X can't access the system (4)
  • User X can access parts of the system but isn't seeing any maps (5)
  • Microsoft has released a patch for a critical vulnerability in IIS and Engineer Y needs to find all the boxes with IIS installed so he can patch them manually (6)
  • Emails from "System X" don't seem to be being sent (7)
As you can see, this list could go on forever but, as I pointed out earlier, we're not trying to capture every possible thing that could happen - we're just trying to cover the 90% of things that are most likely to occur and a few other things we (as developers) might like to worry about.
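If you want these checks to be repeatable, one option is to write down which CIs you expect someone to need for each scenario and compare that against what is actually in the CMDB. The sketch below is a rough illustration of that idea. The scenario-to-CI mapping is my assumption, and I have deliberately left a hypothetical "SMTP server" CI out of the CMDB set purely so the check has a gap to report.

```python
# CI names recorded in the CMDB (a small illustrative subset).
CMDB_CIS = {
    "UK InfoMaps Standard Users",
    "FR LocalMaps Standard Users",
    "UK InfoMaps Administrators",
    "FR LocalMaps Administrators",
    "IIS",
    "WIN005",
}

# Scenario number -> CIs we assume someone needs in order to resolve it.
SCENARIO_NEEDS = {
    1: {"UK InfoMaps Standard Users", "FR LocalMaps Standard Users"},
    3: {"UK InfoMaps Administrators", "FR LocalMaps Administrators"},
    6: {"IIS", "WIN005"},
    7: {"SMTP server"},  # hypothetical CI name, deliberately absent above
}


def find_gaps(needs, cmdb):
    """Return {scenario: sorted missing CI names} for unsupported scenarios."""
    return {
        number: sorted(cis - cmdb)
        for number, cis in sorted(needs.items())
        if cis - cmdb
    }


gaps = find_gaps(SCENARIO_NEEDS, CMDB_CIS)
print(gaps)
```

An empty result means every scenario has the CIs it needs; anything else is a list of things to go and add.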

Now that we've got our list let's go through and see whether we have enough information so that someone who isn't familiar with the system but has access to the CI structure can solve the issues we've highlighted:
  1. Looking at the CI list we have the CIs "UK InfoMaps Standard Users" and "FR LocalMaps Standard Users", so if the new user joins in the UK we add them to the former, and if in France we add them to the latter
  2. As 1, except rather than adding them we just remove them
  3. Again we have two Active Directory groups, "UK InfoMaps Administrators" and "FR LocalMaps Administrators", so we can add them to the right group depending on the country
  4. The first non-straightforward one! We have the two DNS entries, so the person taking the call can quickly test these and see whether the service is down for everyone or just for this user. If it's down for everyone: is the box down? Is the Hyper-V host down? Is the database accessible? Is there an error message? In short there are lots of things to try - and with access to the CI list you can start to do clever things like look for other services using the same Hyper-V host - if they're running then the problem probably isn't with the entire host, etc
  5. Two DNS entries provide some points for testing - is Google down (it has been known ...)? Has the firewall changed so the ports are blocked?
  6. IIS is linked to WIN005 so it should just be a quick case of searching the CMDB for IIS and seeing which boxes have IIS components on them
  7. Is the SMTP server accessible? Is the user account locked?
As you can see there is a lot here that can be done with relatively little technical experience. And (trust me as a developer - this bit is key!) *if* the incident eventually gets escalated to a software engineer, a lot more information will already be in the call, because whoever answered the phone will have done most of that work. Nobody has to chase people for answers to simple questions like "what box is it on?". The key question is what you want your software engineers doing: chasing users for answers, or fixing issues and then getting on with other work?
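Scenarios 4 and 6 both come down to following relationships between CIs, so here is a rough sketch of how those two lookups might work if the relationships were held as simple triples. Only the IIS-to-WIN005 link comes from our example; the other box names, the host names and the relationship labels are hypothetical, and a real CMDB product will have its own query interface.

```python
# (source CI, relationship, target CI). Only IIS -> WIN005 is from the
# example system; every other entry is hypothetical.
RELATIONS = [
    ("IIS", "installed_on", "WIN005"),
    ("IIS", "installed_on", "WIN009"),
    ("WIN005", "hosted_on", "HV-HOST-01"),
    ("WIN007", "hosted_on", "HV-HOST-01"),
    ("WIN009", "hosted_on", "HV-HOST-02"),
]


def boxes_with(component, relations=RELATIONS):
    """Scenario 6: every box that has the given component installed."""
    return sorted(
        {target for source, rel, target in relations
         if source == component and rel == "installed_on"}
    )


def neighbours_on_same_host(box, relations=RELATIONS):
    """Scenario 4: other boxes sharing a Hyper-V host with the given box."""
    hosts = {target for source, rel, target in relations
             if source == box and rel == "hosted_on"}
    return sorted(
        {source for source, rel, target in relations
         if rel == "hosted_on" and target in hosts and source != box}
    )


print(boxes_with("IIS"))
print(neighbours_on_same_host("WIN005"))
```

If the neighbours on the same host are still responding, the host itself is probably fine and whoever has the call can move on to the next check.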

The next part of this series (which will hopefully not take so long to put together) will continue this example and look at metrics and the things you might like to consider doing to keep your CMDB up-to-date and relevant as your business changes.
