Friday, January 8, 2016

Build and Release Management Tasks for Visual Studio Online


I’ve just published the first iteration of a new Visual Studio Online (VSO) / Visual Studio Team Services set of build tasks. 
There are 3 tasks at the moment. 
  • Swap Azure Deployment Task – For cloud services, swaps the production and staging slots, with two options.  The first option is to include the configuration data for each slot (the configuration data from staging goes to production and vice versa); I’ve run across a few situations where this has been necessary.  The second option is to delete the staging slot once the swap completes; in this scenario we assume we no longer need the old production deployment.  The reason for implementing this was to save costs in Azure deployments, especially in lower environments.
  • Remove Azure Deployment Slot Task – Also targeted at cloud services; given the cloud service and slot name, removes any deployment and, optionally, the associated VHD.
  • Update Azure Configuration Setting – Targeted at cloud services; updates the value of a configuration setting (.cscfg) in an existing deployment, with no need to redeploy the entire package through the Azure deployment pipeline.
You can install the extension to your own VSO instance from the Visual Studio Marketplace: https://marketplace.visualstudio.com/items/polarissolutions.vso-agent-tasks.  Please feel free to give any feedback!
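For the curious, here is roughly what the swap and cleanup behavior looks like if you do it by hand with the classic Azure Service Management PowerShell cmdlets. This is a sketch, not the tasks' actual implementation; the service name and config path are placeholders.

# Swap the production and staging slots (VIP swap) for a cloud service.
Move-AzureDeployment -ServiceName "MyCloudService"

# Remove what is now sitting in staging (the old production), deleting the backing VHD to save on storage costs.
Remove-AzureDeployment -ServiceName "MyCloudService" -Slot Staging -DeleteVHD -Force

# Push an updated .cscfg to an existing deployment without redeploying the package.
Set-AzureDeployment -Config -ServiceName "MyCloudService" -Slot Production -Configuration "C:\configs\MyCloudService.cscfg"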

Monday, September 28, 2015

TFS and SQL locks: monitoring for the future

Nothing like walking in Monday morning and the first thing you hear is 'Hey Chris, any idea why TFS is down?'


The first thing I do, like anyone, is attempt to log into the web interface to see if I'm getting an HTTP error, a .NET error, or something else. I received none; in fact, I didn't even get the authentication pop-up (I'm off domain), yet the server never responded with an error. We investigated the application logs on the application tier and nothing was out of the ordinary. Launching the TFS Administration Console and SQL Profiler, we found that it couldn't get a response to a command from the database (it would connect and attempt a query, but would never get a response in the proper time and would fail gracefully).


I was able to connect to the database and perform some system-level queries on the actual database server. It was pushing 50-60% CPU continuously from sqlservr.exe in Task Manager. Commands were slow, but for the most part the server was responsive, as were the various databases we were querying against.


Digging deep into my limited knowledge of DMVs, I was able to locate a query to show me any resource contention that might be happening. Interestingly enough, it showed a few RESOURCE_SEMAPHORE waits that had been hanging out for many hours on the TfsConfiguration database with clearly no intention of letting go.
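For reference, a query along these lines will surface long-running waits. This is a representative sketch, not necessarily the exact query I ran; the server name is a placeholder, and Invoke-Sqlcmd requires the SQL Server PowerShell module.

$query = @"
SELECT r.session_id,
       r.wait_type,
       r.wait_time AS wait_duration_ms,
       DB_NAME(r.database_id) AS database_name,
       r.command,
       t.text AS sql_text
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE r.wait_type IS NOT NULL
ORDER BY r.wait_time DESC;
"@

Invoke-Sqlcmd -ServerInstance "MyTfsSqlServer" -Query $query | Format-Table -AutoSize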


Here's where it got a little interesting. Look at the screen capture below of the output from the above query. Notice the wait durations and the database? The interesting thing is, once we rebooted, I've not seen any entries here last longer than a few seconds.


[Screenshot: sqlresourcelocks – DMV query output showing the long-running RESOURCE_SEMAPHORE waits]

The culprit seemed to be SPID 110, but just as I was about to kill these SPIDs, management made the call to reboot the server. In the past that had worked, so I completely understand, especially when it was killing productivity and we had everything running through it (Release Management, Test Management, version control, work items, the whole kit and caboodle). Upon rebooting, we did see that all locks were removed and did not show back up.


The challenge we have is recreating the situation. It happens roughly every 4-6 weeks with no clear pattern at this point, but that's what infrastructure has reported. Their request was for a way to get ahead of this problem. Since I'm not 100% sure that what I found is the issue, I don't have a firm answer, but I had them run the query provided above through the monitoring tool; if we see wait times above ~5 minutes, set off an alarm and make sure TFS is still responding. We don't have causation, but there is at least some correlation.
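A minimal sketch of that monitoring idea, reusing the $query from above; Send-Alert is a hypothetical stand-in for whatever notification hook your monitoring tool exposes.

# Flag any request that has been waiting longer than ~5 minutes.
$longWaits = Invoke-Sqlcmd -ServerInstance "MyTfsSqlServer" -Query $query |
    Where-Object { $_.wait_duration_ms -gt (5 * 60 * 1000) }

if ($longWaits) {
    # Send-Alert is hypothetical; wire this up to your monitoring tool.
    Send-Alert -Message "TFS SQL waits over 5 minutes detected; verify TFS is still responding."
}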


It's not over though. I wanted to know more, so I headed over to the tfs/_oi page to see if anything stood out. It did. The graph below shows the timing of all the TFS jobs on the server. At the top, we see Common Structures Warehouse Sync.


[Screenshot: commonstructuressyncbad – job timing graph with Common Structures Warehouse Sync at the top]


This seemed odd; normally this job is pretty quick, as there isn't a lot of data that goes along with it compared to so many other tables. We dug deeper into the details and found the execution times and queuing were way off.


[Screenshot: commonstructuretimingbad – execution and queue times for the sync job]

I went to our own (company) TFS server to see if we had the same issue. The timings were way different, and as suspected the Common Structures Warehouse Sync job took almost no time at all to execute.

[Screenshot: commonstructuressyncgood – job timing graph on our own server]

Drilling down into the details, we see that, as suspected, this job should take a minimal amount of resources.

[Screenshot: commonstructuretiminggood – healthy execution times for the sync job]

As to why this is happening, I'm not sure; I've sent the question through a colleague to get it answered by the product team. In the meantime, we are simply seeing if we can find a relationship between the locking we are seeing and TFS not responding. I'll write a follow-up post if I get more information. Until then, monitor and reboot! :)

Tuesday, December 2, 2014

TFS 2013.4, Team Fields, Test Plan Work Item Definition, and the quest for the missing backlog

At my current assignment we upgraded from TFS 2013.2 to TFS 2013.4 last night. Everything went swimmingly with the install and the upgrade, but when we went to look at the backlogs, a configuration error showed up.

TF400917: The current configuration is not valid for this feature. This feature cannot be used until you correct the configuration.


Details about the validation error appear below:
  • The following element contains an error: TypeFields/TypeField[type='Team']. TF400517: The value of this element is set to: MyCompany.TeamField. You must set the value to a field that exists in all of the work item types that are defined in Microsoft.TestPlanCategory. The field does not exist in the following work item types: Test Plan.

For fun, I did follow the link it provided for additional detail, but the error message itself seemed fairly clear, as I've dealt with something similar in the past regarding Team Fields.

It is important to note that we have customized our process template (every time I say this, Angela's blood pressure goes up a little bit ;)), and that presents certain challenges when we upgrade.

By default, the included process templates use the Area Path to define teams and their associated backlogs.  However, if you are using the Area Path for another purpose in your organization or work items, Microsoft did us a solid by allowing us to define a custom field for storing team association.
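If you want to see where that mapping lives, it's in the process configuration; a quick way to inspect it (server and project names are placeholders):

witadmin exportprocessconfig /collection:http://myservername:8080/tfs/DefaultCollection /p:MyProject /f:"ProcessConfig.xml"

# Inside ProcessConfig.xml, the mapping the error message refers to looks like this:
#   <TypeField refname="MyCompany.TeamField" type="Team" />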

One thing I learned in the past is that for the Team Field to work correctly, it must exist in all work item definitions.  In the past (2013.2) this seemed to only affect the visibility of items rather than causing an actual exception.  That has apparently changed in the most recent releases, as seen with the error above.

To solve the issue, simply export the work item definition for the Test Plan work item type with something similar to the following.

witadmin exportwitd /collection:http://myservername:8080/tfs/DefaultCollection /p:MyProject /n:"Test Plan" /f:"TestPlan.xml"  

From here, make the same Team Field modifications you've already made in your other work item definitions (the error tells us they don't yet exist in the Test Plan definition).

<FIELD name="Team" refname="MyCompany.TeamField" reportable="dimension" type="String"></FIELD>

And the same for displaying that in the work item window.

<Control FieldName="MyCompany.TeamField" Type="FieldControl" Label="Team" LabelPosition="Left" EmptyText="&lt;None&gt;" />

Last step is to save and import the modified Test Plan work item definition.

witadmin importwitd /collection:http://myservername:8080/tfs/DefaultCollection /p:MyProject /f:"TestPlan.xml"

Refresh TWA and your backlog should be available again!

Note: You may find that test plans are not listed in TWA under the Test navigation hub. This is likely because a known team is not assigned to the test plan.  To correct it, find the test plan ID (which you can get from MTM), search for that work item ID in TWA, and assign the appropriate team in the team field we just designated.  Refresh again, and your plans are available!



Thursday, November 20, 2014

The perfect storm: Windows 2012, UAC, and silent [un]installs with MSI

Over the past few months I've done a lot of work with the Microsoft Installer (MSI) platform, and in that time I've spent many sleepless nights trying to extract valuable debugging information from rather cryptic error messages.  As such, I've been labeled the go-to resource for solving these obscure issues.

Recently one of my colleagues approached me asking for assistance with an issue she was having when trying to uninstall and subsequently re-install an MSI package via Release Management.  When they ran msiexec from the command line as the same account as the deployment agent, the command was successful; however, when the agent attempted to do so, it came back with this vague error:

Error 1601: Windows Installer is not accessible

Not the most useful error message, so we started debugging, looking for the usual suspects: group membership, local security policy, even a pending reboot. Everything was in order.  Next step: grab some more debugging/logging information from the uninstall process using the built-in logging mechanisms.

msiexec /x {79C635ED-150B-4DE7-92D2-522E1EDC1BC5} /quiet /lvx* c:\logfiles\uninstall-ffi.log

NOTE: I specify a directory called logfiles on the root drive because msiexec.exe is executed from the working directory c:\windows\syswow64 and would have attempted to write the log file to that directory, which would fail because of rights assignment (and it's a bad practice anyway). You may need to assign additional permissions to this directory so all users (or at least your deployment agent account) can write to it.

The log file produced more information, but it was still somewhat cryptic.

=== Verbose logging started: 11/20/2014  12:25:40  Build type: SHIP UNICODE 5.00.9600.00  Calling process: C:\Windows\SysWOW64\msiexec.exe ===
MSI (c) (2C:F8) [12:25:40:388]: Resetting cached policy values
MSI (c) (2C:F8) [12:25:40:388]: Machine policy value 'Debug' is 0
MSI (c) (2C:F8) [12:25:40:388]: ******* RunEngine:
           ******* Product: {79C635ED-150B-4DE7-92D2-522E1EDC1BC5}
           ******* Action: 
           ******* CommandLine: **********
MSI (c) (2C:F8) [12:25:40:388]: Client-side and UI is none or basic: Running entire install on the server.
MSI (c) (2C:F8) [12:25:40:388]: Grabbed execution mutex.
MSI (c) (2C:F8) [12:25:40:403]: Failed to connect to server. Error: 0x80070005

MSI (c) (2C:F8) [12:25:40:403]: Note: 1: 2774 2: 0x80070005 
1: 2774 2: 0x80070005 
MSI (c) (2C:F8) [12:25:40:403]: Failed to connect to server.
MSI (c) (2C:F8) [12:25:40:403]: MainEngineThread is returning 1601
=== Verbose logging stopped: 11/20/2014  12:25:40 ===

Notice the line that says Failed to connect to server. Error: 0x80070005. Some research shows this is often an access-denied error across many Microsoft products, but exactly what access is being denied is unclear. If you look deeper into the recesses of how the Windows Installer works, you will see there is a service that is invoked and runs under the highest-privileged account it can [SYSTEM]. In order to connect to the service the user must have elevated access rights, which our user did, so what gives?

The answer: with all the modern security standards in place to make sure you actually intend to do what you're doing, Windows 2012 enforces UAC whether you are an administrator or not.  When running a silent install/uninstall you never see the elevation prompt, and in turn the engine fails with the cryptic 1601 / 0x80070005 error.

While likely not the most elegant of fixes, disabling UAC [via registry] on these servers seems to do the trick. TechNet provides a download to make this even easier.
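If you prefer to make the registry change by hand, this PowerShell sketch flips the well-known EnableLUA switch (my understanding is the TechNet download automates the same thing); note the setting is only read at boot.

Set-ItemProperty -Path "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System" -Name "EnableLUA" -Value 0 -Type DWord

# The change takes effect only after a reboot.
Restart-Computer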

Once the key was modified, we attempted to publish a release through RM again; the error was gone and the deployment moved forward!

Sunday, March 30, 2014

Transforming a software company - Part 1: Tripping out of the gate

Some of you may wonder, "Where the hell has this guy been the past few months?"  At the very least you may wonder why all my emails are signed with my blog that hasn't really been updated since November when I (re-)released Team Explorer Build Tree in conjunction with fellow developer Josh Rack.  That little development ended up getting me invited to work on the future versions of Community TFS Build Extensions with Mike Fourie and Terje Sandstrom.

That's not what has had me in the weeds the past few months.  No sir/ma'am, it's a way more complex story than that.  Now that there is a little time and to prevent Angela and Chris from constantly saying "you should blog this...", I am.  Grab your smoking jacket, relax in your favorite arm chair, pour a tall glass of your favorite beverage and saddle up for this epic $^%&* tale.

We landed a contract for a large product distributor to do a TFS implementation and Agile transformation project, with responsibilities split between myself and fellow Polaris cohort Angela Dugan.  My responsibilities included (but were not limited to) infrastructure setup of TFS, source migration, and developer training.  Angela had a bit more involvement in driving the Agile transformation across the organization as a whole and using TFS to assist in managing that process.  It was also a chance for me to hone my consulting skills by shadowing Angela and establish credibility within my own organization.

That was the plan...

Prior to consulting I worked in the bureaucratic world of financial IT. Forms and approvals, and approvals to fill out forms that get more approvals to talk to the guy next to you about getting a VM for a couple days to test on.  That would take 3 months.  Real servers?  They were top of the line when purchased, outdated when finally delivered.

Maybe that's exaggerating a little, but those who have dared to explore this part of the IT universe know this to be a long and often very frustrating process.  Adding in some consultant who demands equipment yesterday and raises hell until it's provided just makes it worse.  I didn't want to be hated immediately; there was plenty of time for that later.

Trying to be the good consultant, I had several communications with the development manager prior to my first on-site visit.  Knowing that there was a parent organization and some sort of data center, I assumed the organization at least had some process for provisioning production-level servers.  I sent specifications based on the demand they had described and reiterated the importance of having these ready the day I walked in.  This included Active Directory accounts and the permissions I needed for each account.  Simple, right?

Finally game day arrives; I was scheduled to be on site 4 days a week for 3 weeks.  Kiss the wife and kids goodbye, pack the truck, and head out on an 8-hour drive (I'm not a fan of flying) to St. Cloud, Minnesota.

I arrive about 10pm, check in to the hotel, and go to catch up on what I've missed while travelling through northern Wisconsin (where Edge network is the only type of data you can get).

This is where the first problem arrives: I've forgotten my power supply.  I also learned that universal power supplies work with everything except the Lenovo W5XX series.  Fortunately, my boss had one drop-shipped to the hotel; it would be there first thing Tuesday.  I could survive one day going back and forth between sleep mode, and I was sure this organization had a PC I could use temporarily if needed. I would later find out it wouldn't matter.

The development manager (also a consultant) was not local either.  However, he flew in on Monday morning and always arrived to the office a little bit later.  I was instructed to do the same.  Cool beans, I forgot socks anyways and who doesn't love browsing Walmart at 8am?  Even grabbed the complimentary continental breakfast from the hotel.  

Around 10, the development manager arrives, makes a short introduction, grabs a few employees, a notepad, sits down ready to write and asks "Ok, so what types of servers do we need?"  

"You're kidding right?", I awkwardly laugh.  

No one else laughs.


I come to find out it's no joke.  After a quick interrogation I have a slight glimmer of hope of actually getting servers provisioned and ready in a short time (they had a Hyper-V farm in a shared data center), but we have to find the contact that does said provisioning.  We do!

Within moments all hope is shattered as I find out there are in fact forms to be filled out.  Not only that, the lead time on new virtual machines was 4 weeks (which was a bit ridiculous); the more ridiculous part was it really took 6.  Yes, for Hyper-V machines.

Panic ensued.  The company we were contracting for happened to be purchasing the source code from a 3rd party and, as part of the cutover, wanted a company-owned repository to keep the code in.

On top of the acquisition, the organization had contracted with another development company in Chicago to do future development until a full-time staff could be ramped up to do all development internally.  They were starting the following week and currently had nowhere to share code.  We were about to bring development to a grinding halt and burn a lot of cash really fast.  Not on my watch you don't...

"Alright, I have an idea..."

I recommended we migrate the purchased source code to TFService and use that to continue development in the short term.  It only required a Windows Live ID, and everyone was willing to work with that.  At the same time, once the servers were finally provisioned and I had set up TFS locally, we would implement the TFS Integration Platform (TIP) to continuously replicate changes from TFService to on-premises for both source control and work item tracking.  Once everything looked clear on the local configuration, we'd throw the switch and cut over.

That's when the real joyride began.  

In the following posts I will be covering a number of topics: the details behind our migration, challenges with project management, proper (and more importantly improper) tactics for agile software development, process template customization, lab management, release management, TFS Event Workflows, ClickOnce deployment, SpecFlow, SalesForce integration with TFS, WiX, and much, much more.


Until then, get some rest, wash your jacket, subscribe to my feed and come back for more later.  

Saturday, November 16, 2013

Bringing back Team Foundation Server Build Folders in 2013!

I've mentioned before my job is pretty cool for a number of reasons but most often it's for the people I get to work with.

A few weeks ago I had the pleasure of being part of a team that was migrating 50+ active projects from Team Foundation Server 2008, as well as literally hundreds of build definitions.  Beyond the build definitions, they had created a number of custom tasks to extend the build capabilities: complex deployment to multiple environments, custom metrics, you name it.

They did it with one TFSBuild.proj file.  Make no mistake.  One. Impressive work Tim Stall.

The challenge of hundreds of build definitions 


I wish I was making that part up, but I'm not. These guys had hundreds of definitions. One may argue that it could have been set up differently; however, until you've witnessed the system in action, walked through the complexity, or understood all of the outputs, don't judge. This was the best way.

The company had structured their build definitions by product, then by branch, then by environment (per branch).   Perhaps this picture will better explain it.

As you can see, it's an ugly mess just begging for organization.  What's a programmer's best organization tool?  Tree views.

Wait, didn't someone make something for this?


Yes!  Inmeta Solutions made a package for Visual Studio 2010 that was pretty awesome, but for whatever reason development stopped after 2010 and Microsoft never included it in future versions.  This became troublesome for those of us who loved the tool but also wanted to move beyond VS 2010.  When attempting to go to 2012, we found it to be simply incompatible.

Why didn't we just use that source code?  

Fellow coding cohort (and coordinator of the new project) Josh Rack had done some research on contributing to the existing project, but in the end we wanted to extend it a little differently.  Not because anything was wrong with the Inmeta solution; we just wanted to do something different.  That's the fun of open source/community development, isn't it?  :)

Josh started off using some of the existing code and trimming it back to the necessities.  That's when he contacted me and asked me to patch a few bugs and add some features to work more cleanly with Visual Studio 2012/2013 as well as Team Foundation Server 2012/2013.

Currently the features are simple: a hierarchical view of the build tree using '.' as the separator character, plus the ability to view the queue, edit a definition, and queue a new build.  For example, hypothetical definitions named Contoso.Main.CI, Contoso.Main.Nightly, and Contoso.Release1.CI would render as a Contoso node containing Main (with CI and Nightly leaves) and Release1 (with a CI leaf).  We've got a couple of items in the queue that we think many users will enjoy (such as the ability to choose the separator character and use it across your entire team).

At this very moment, I'm waiting for Rack to go ahead and publish the project so you guys can download the latest bits. There are two installers, one for 2012 and the other for 2013; for whatever reason we couldn't get it to play well with 2013 without using the 2013 libraries.  Maybe we'll fix that sometime; until then, enjoy having build folders back!

You can grab the latest bits and log any issues over at Codeplex.  Let us know what you think! 

Sunday, April 21, 2013

Real Life Lab Management – Developer Metrics and Value

For the past week I’ve been at a pretty progressive client: very sexy offices, young hipster types, no bureaucracy.  Compared to where I came from, it's basically the opposite.  That in and of itself has been quite a change.

 

Top that off, this is my first assignment and I get to do one of the coolest features in Visual Studio and Team Foundation Server 2012: Lab Management.  We were not going full-blown SCVMM lab management, but rather building around the idea of Standard Lab Management.

[Image: gremlin]

Are they significantly different?  That's a yes-and-no kind of answer.  No, because in the end it comes down to the Test Agents being installed on the machines and registering with the test controller.  TFS 2012 even makes it easy enough that you don’t need to install them on each one by hand.  In turn, this has made things really easy regardless of your choice of virtualization platform.

On the other hand, I say yes, they are different in how much of the feature set you can use.  SCVMM environments make it exponentially easier to debug the pesky ‘no-repro’ gremlins from production.

Alright, back to my progressive client.  It’s a cool setup: they’ve recently upgraded everything to Visual Studio/Team Foundation Server 2012, ASP.NET MVC4, and Database Projects (the good ones), brought someone in to write web tests and CodedUI tests, bought a good amount of supporting hardware, and asked us to tie it all together and make it happen in just under a week.  Success.

[Screenshot: LabBuildResults – lab build summary]

It was amazing: we were literally watching, through the lab management viewer, two Windows 7 virtual machines running a battery of CodedUI tests as if someone were actually there.  At the end, all the information was dumped into the build detail and you could see how you were doing. It was glorious.

Ok. What does all this mean?

 

 

[Screenshot: LabChildBuildResults – child build details]

When I was at JPMorgan, management had a number of different ways they wanted to measure what was labeled developer efficiency. Some metrics were as simple as churn rates; some were more advanced, such as operational complexity, rate of change of code relative to requirements, etc. In the end we would get a single number, or a couple of numbers, and no one had any idea what they meant.



Let's fast-forward in time again to our current implementation. This isn't your ideal environment for development, but it is a common one. This company has put together a fairly new team taking over a fairly large product, and people need to find ways to prove what they are doing is providing value. The double-edged sword we often face is trying to fix the application while also justifying that the work we've been doing provides real value. We know things like unit tests and code coverage are supposed to do that, but we struggle both to explain it and to make it look good.

What do we do instead? Work our asses off tirelessly for weeks in hopes that management notices development is doing everything it can to make a better product and grants some mercy.

Today, friends, I'm here to help you with that, in hopes of giving you some more work/home-life balance and even a little personal satisfaction. Let's take a look at some key metrics listed on our build report and what they tell us about progress and quality.

First off, we will start with the lab build and the CodedUI tests. One thing I notice is that blog entries such as this one show a lot of successes, or they show a failure and then an immediate success. That's the thing: these are not one-day stories; these tests are going to continue to fail for a while. You can expect the first few weeks to be a color-blind man's nightmare of green and yellow, with quite a spike in bug rates. This is not uncommon, nor should it be used as a driver for some sort of large strategic change in your development organization.

[Screenshot: CodeCoverageDetail2 – build summary with test results]

What you will find is a lot of data errors, especially in CodedUI tests, if you don't use something like a gold database that is guaranteed to contain the same data each time and to be properly updated with the latest schema changes. Entity Framework and Code First do a really good job of providing a solution going forward, but most of us are working on existing apps under tight budgets with enough other stressful factors that we can't take the time to implement one properly. It took us close to 2 years at JPM to do it (granted, it was a pretty complex set of databases), but there are a lot of factors to consider that we often don't. I digress...

Does that mean they provide no value? Of course not!  The data just needs to be interpreted differently right now than it will be in a few weeks. For now, it means we need to focus on the data quality of our CodedUI tests. Do we implement something to seed data each time? Do we make our data more dynamic by using existing data sources (even the target database itself, to determine which user names already exist)?

What we should not infer is that the application is buggy (we know it is) and that our development team is doing a poor job. This data simply exposes what the teams would have discovered over a few months anyway; the tools did it for us in a few minutes. That's a good thing. This is the beginning of our process, not the end.

Now we have a starting point to measure how good of a job we are doing at making our quality of integration tests better.

Second metric: code coverage and unit tests. I'll be honest, I wasn't really a fan of unit tests for a long time. I felt they didn't expose bugs as fast as something like integration tests could, or at least not the bugs that had more customer impact than the ones unit tests would find. The problem came back to selling my own value to management: how good a job was I doing? They wanted to see numbers; I had no idea how to give them numbers, or at least how to explain them. Enter code coverage and what it means to developer productivity and application quality.

You can see in the details of this build that, across our passing unit tests, only about 2% of the code that makes up this application is being tested. Again, early in the process of implementing new quality controls such as more complete unit testing and integration testing, we can expect what would be considered 'poor' metrics (like 2% coverage ;) ). It gives us the starting point. For developers, it allows you to show how, over the next few weeks, you are increasing the amount of code tested relative to the entire code base.

At this point in time, it's important not to focus on bug rates. As with CodedUI tests, you can expect a dramatic increase in the number of bugs discovered relative to the amount of code covered. The reasons are slightly different in this case (data shouldn't really be a factor in a unit test), but again this just exposes what you would have discovered later anyway. The other thing coverage does not tell you is whether the tested code fits the requirements; there are many other testing tools for that.

Lastly, for our developers it comes down to how we make that 2% rise to 10% in the next 3 weeks. The beauty of Microsoft's code coverage and analysis tools is that they tell you exactly which blocks are covered and which ones are not. (By the way, a block can be something as small as the code inside an if {} statement; that's how granular it gets.)

[Screenshot: CodeCoverageDetail3 – coverage results broken down by namespace and method]

If you click on View Test Results for the unit tests, the results are downloaded to your local machine for analysis. The detail goes as far as saying exactly which namespaces, methods, and blocks are tested and which ones are not. Double-click on one of the lines, and the red coloring indicates the uncovered blocks. This way you know exactly what code isn't executed by any tests and can pretty easily add new tests to cover those conditions.

[Screenshot: CodeCoverageDetail4 – uncovered blocks highlighted in the editor]

The value to the developer is twofold: you are not guessing what to write, and you gain the ability to communicate the value you and your team are providing to management in a very clear and measurable way.

In my next article, we will take a deeper look at some of the reports that come with TFS 2012 and how managers can use them to communicate a concise and clear message to stakeholders about where development stands and what to expect in the future. These are not bad things; they allow people with a heavy (usually financial) investment to help early in the project (and spend more time thinking about the consequences) rather than making knee-jerk reactions late in the project that may have significantly more radical consequences.