Category Archives: Troubleshooting

Understanding File System IO, Lessons Learned From SQL Server

I do more than just SQL Server. I enjoy programming. In my former life I have worked with C/C++ and Assembler. As I spent more and more time with SQL Server my programming took a back seat career wise. Having that background though really helps me day in and day out understanding why SQL Server does some of the things it does at the system level.

Fast forward several years and I’ve moved away from C/C++ and spent the last few years learning C#.

Now that I work mostly in C# I do look up solutions for my C# dilemmas on sites like http://www.codeplex.com and http://www.codeproject.com. I love the internet for this very reason, hit a road block do a search and let the collective knowledge of others speed you on your way. But, it can be a trap if you don’t do your own homework.

I write mostly command line or service based tools these days not having any real talent for GUI’s to speak of. Being a person obsessed with performance I build these things to be multi-threaded, especially with today’s computers having multiple cores and hyper threading it just makes since to take advantage of the processing power. This is all fine and dandy until you want to have multiple threads access a single file and all your threads hang out waiting for access.

So, I do what I always do, ask by best friend Google what the heck is going on. As usual, he gave me several quality links and everything pointed to the underlying file not being set in asynchronous mode. Now having done a lot of C++ I knew about asynchronous IO, buffered and un-buffered. I could have made unmanaged code calls to open or create the file and pass the safe handle back, but just like it sounds it is kind of a pain to setup and if you are going down that path you might as well code it all up in C++ anyway.

Doing a little reading on MSDN I found all the little bits I needed to set everything to rights. I set up everything to do asynchronous IO and I started my test run again. It ran just like it had before slow and painful. Again, I had Mr. Google go out and look for a solution for me, sometimes being lazy is a bad thing, and he came back with several hits where people had also had similar issues. I knew I wasn’t the only one! The general solution? Something I consider very, very .Net, use a background thread and a delegate to keep the file access from halting your main thread, so your app “feels” responsive. It is still doing synchronous IO. Your main thread goes along but all file access is still bottle-necked on a single reader/writer thread. Sure, it solves the issue of program “freezing” up on file access but doesn’t really solve the problem of slow file access that I am really trying to fix.

I know that SQL Server uses asynchronous un-buffered IO to get performance from the file system. I did some refresh reading on the MSDN site again and struck gold. Writes to the file system may OR may not be asynchronous depending on several factors. One of which is, if the file must be extended everything goes back to synchronous IO while it extends the file. Well, since I was working with a filestream and a newly created file every time I was pretty much guaranteeing that I would be synchronous no matter what. At this point I dropped back to C++. I started to code it up when I realized I was doing things differently in my C++ version.

I was manually creating the file and doing an initial allocation growing it out to the size the file buffer and the file length on close if need be.

I started up my C++ version of the code and watched all the IO calls using Sysinternal’s Process Monitor. I watched my C++ version, and lo, it was doing asynchronous IO in the very beginning then switching to synchronous IO as the file started growing. I fired up my instance of SQL Server and watched as the asynchronous IO trucked right along…. until a file growth happened and everything went synchronous for the duration of the growth.

AH HA!

So, taking that little extra knowledge I manually created my file in C# set an initial default size and wouldn’t you know asynchronous IO kicked right in until it had to grow the file. I had to do a little extra coding watching for how much free space was in the file when I get close I now pause any IO,  manually the file by some amount and then start up the writes again keeping things from going into a synchronous mode without me knowing.

So, there you go my little adventure and how my old skills combined with knowing how  SQL Server works helped me solve this problem. Never assume that your new skills and old skills won’t overlap.

Found a Bug in SQL Server 2005

It doesn’t happen often but every once in a while you may be the lucky person to find a previously unknown bug in SQL Server.

It was a normal morning for me, checking the status of our servers going over any failure messages waiting for the day to ramp up. That’s when one of our lead developers came around the corner and told me he had an error when he had tried to create an index on a table he was working on. The more he tried to explain the error the more I started to worry. I had him send me the code and the error statement.

Location:     BtreeMgr.cpp:5372

Expression:   bufferLen > currOffset + ACCESSSOR_OVERHEAD

SPID:         116

Process ID:   5016

Msg 3624, Level 20, State 1, Line 2

A system assertion check has failed. Check the SQL Server error log for details. Typically, an assertion failure is caused by a software bug or data corruption. To check for database corruption, consider running DBCC CHECKDB. If you agreed to send dumps to Microsoft during setup, a mini dump will be sent to Microsoft. An update might be available from Microsoft in the latest Service Pack or in a QFE from Technical Support.

Msg 0, Level 20, State 0, Line 0

A severe error occurred on the current command.  The results, if any, should be discarded.

I had what we like to call in the high availability space a “pucker moment”. This wasn’t your normal, I typed something wrong and got an error, kind of problem. This was a real SEV 20 with an assert, the core engine had just puked on something it shouldn’t have.

Like all good DBA’s the first thing I did was run a DBCC on the database this error was generated from.

While that was going on I asked my very good friend, Google, if he had seen this particular assert before. For the first time in a very long time Google failed me! In the last few years if I hit this kind of hard error someone else has too and it is ether being fixed in a hot fix or addressed in the next version of SQL Server, but not this time.

So, we have this same schema on another server and the developer tried the exact same code there and had the exact same error.

I had him document the steps he took to get to this point and to his credit the steps were clear, concise and easily reproducible.

The DBCC finished with zero problems detected, which let me calm down a bit. That coupled with the fact it looked like I had a repeatable test case When the second database had cleared the DBCC I set about my task of reproducing the error and trying to find a work around. Lucky for us it was a simple matter of column organization in the index and we were able to apply it successfully and carry on with life.

I bundled up the documentation I had accumulated, ran the test case confirmed the bug and sent it off to the powers that be at Microsoft. Since we had a work around and it wasn’t a show stopper I didn’t raise it as a critical server down issue but Microsoft still worked it in a timely fashion.

So, what was the problem you say? It was an interesting edge condition.

We have a table that contains a composite primary key and the rest is made up of bit fields, a flag table.

We had to add a new column, another bit flag, to the table.

The non-clustered covering index was dropped the column was added to the end of the table.

The index was updated with the new column at the end of the column list and then *POOF* it blew up.

I think it has to do with two specific things.

First, bit fields are stored in a compact manor where multiple bits share a byte and aren’t truly separate from every other but field. It would be a huge waste of space to store each bit in it’s own byte but it would make things like this index issue less likely to happen.

Secondly we did a column add but didn’t drop and recreate the table repopulating it in the process so things at the page level weren’t nice and neat. The underlying clustered index wasn’t effected but when we tried to add an index back with the new field it couldn’t do it. The fix was simple, change the column order in the non-clustered index moving the new column up one. We verified the data without the index was correct and with the index was correct.

I haven’t tried it yet, but I am betting included columns won’t suffer the assert ether since the items don’t have to be sorted in the index.

So there you go! Having been on the software side of things a couple of times I always find it interesting when I find bugs in others products and work the issue to conclusion.

What is your take away from all of this? Never be afraid to submit a bug report to Microsoft. I have heard people say to the effect someone else will or has hit the bug and they will submit it. DON’T RELY ON THE ACTIONS OF OTHERS! Reporting bugs helps the community as a whole and makes the product better. When you assume someone else is worked it you are putting YOUR production servers in the hands of strangers. If someone has submitted it and it didn’t turn up in a search they will let you know, and be very kind about it to boot. You will get piece of mind that it is being worked on and it is a bug, or that you may keep someone else from stumbling onto this and not having the knowledge to fix it or work around it.

Wes