Category Archives: C#

At The End of the IO Road With C#? Pave New Road!

Not being one for letting a problem get the best of me, I took another look at the asynchronous overlapped IO problem. If you read my last post on the subject, you know I’ve done a lot of work on this already. None of the things I said last time have changed. If you want to do asynchronous and un-buffered IO in C# using the native file stream calls, you can’t… so I rolled my own. The kicker is, I don’t use any unmanaged code to do it: no calls to VirtualAlloc() or anything else via DLL imports. Oh, and the speed is spectacular.

The Goal

My ultimate goal was to build a routine that would do un-buffered asynchronous IO. That means I don’t want the OS doing any buffering or funny stuff with the IOs I issue, for reads or writes. SQL Server uses this method to harden writes to the disk, and it also performs well with excellent predictability. If you have ever used Windows to do a regular copy you will see it eating up memory to buffer both reads and writes. Copy the same file a couple of times and you will notice the first pass runs at about the speed you expect, but the second pass may run twice as fast. That is all Windows, buffering as much data as it can and holding on to that buffer. That’s great for smaller files, but if you are pushing around multi-gigabyte files it is a disaster. As the system becomes starved for memory it pages, then starts throttling back. Your 100MB/sec copy is now crawling along at 20MB/sec.

Where we left off..

I had settled on a simple routine that would allow me to do un-buffered reads from a file and write to a buffered file either on disk or across the network.

internal class UnBufferedFileCopy
{
	public static int CopyBufferSize = 8 * 1024 * 1024;
	public static byte[] Buffer = new byte[CopyBufferSize];
	const FileOptions FileFlagNoBuffering = (FileOptions)0x20000000;

	public static int CopyFileUnbuffered(string inputfile, string outputfile)
	 {
		var infile = new FileStream(inputfile, FileMode.Open, FileAccess.Read
		, FileShare.None, 8, FileFlagNoBuffering | FileOptions.SequentialScan);
		var outfile = new FileStream(outputfile, FileMode.Create, FileAccess.Write
		, FileShare.None, 8, FileOptions.WriteThrough);

		int bytesRead;
		while ((bytesRead = infile.Read(Buffer, 0, CopyBufferSize)) != 0)
		{
			outfile.Write(Buffer, 0, bytesRead);
		}

		outfile.Close();
		outfile.Dispose();
		infile.Close();
		infile.Dispose();
		return 1;
	}
}

There are two problems with this routine. First, only the read from the source is truly un-buffered. C# offers the write-through flag and I thought that would be enough. I fired up Process Monitor and watched the IO issued on writes, and it wasn’t buffer-sized requests; the writes were always broken up into 64KB chunks. So the read request would fetch, say, 16MB of data and pass that to the write request, which would then break it up into chunks. This wasn’t the behavior I was going for! Doing some additional research I found that adding the no-buffering flag to the write-through flag gave me the results I was after, almost. You can’t do fully un-buffered writes, synchronous or asynchronous. To do an un-buffered write, the buffer area built from the byte array must be page aligned in memory and all calls must read or write a multiple of the page size. Again, that just isn’t possible in managed code. So, I investigated a horrible kludge of a solution: I do un-buffered writes until I get to the last block of data, then I close and reopen the file in a buffered mode and write the last block. It isn’t pretty, but it works. It also means I can’t use write-through and un-buffered on a file smaller than the buffer size. Not a huge deal, but something to be aware of if you are doing a lot of small files. If you are going the small file route, the first routine will probably be OK.

internal class UnBufferedFileCopy
{
	public static int CopyBufferSize = 8 * 1024 * 1024;
	public static byte[] Buffer1 = new byte[CopyBufferSize];
	const FileOptions FileFlagNoBuffering = (FileOptions)0x20000000;

	public static int CopyFileUnbuffered(string inputfile, string outputfile)
	{
		//grab the source file size so we know where the last block starts
		long infilesize = new FileInfo(inputfile).Length;

		var infile = new FileStream(inputfile, FileMode.Open, FileAccess.Read
			, FileShare.None, 8, FileFlagNoBuffering | FileOptions.SequentialScan);

		//open output file, set length to prevent growth and file fragmentation, and close it.
		//We have to do it this way so we can do unbuffered writes to it later
		var outfile = new FileStream(outputfile, FileMode.Create, FileAccess.Write
			, FileShare.None, 8, FileOptions.WriteThrough);
		outfile.SetLength(infilesize);
		outfile.Dispose();

		//open file for write unbuffered
		outfile = new FileStream(outputfile, FileMode.Open, FileAccess.Write
			, FileShare.None, 8, FileOptions.WriteThrough | FileFlagNoBuffering);

		long totalbyteswritten = 0;
		int bytesRead1;
		//hold back one buffer
		while (totalbyteswritten < infilesize - CopyBufferSize)
		{
			bytesRead1 = infile.Read(Buffer1, 0, CopyBufferSize);
			totalbyteswritten = totalbyteswritten + CopyBufferSize;
			outfile.Write(Buffer1, 0, bytesRead1);
		}

		//close the file handle that was using unbuffered and write through
		outfile.Dispose();

		//read the tail of the file, the block we held back
		bytesRead1 = infile.Read(Buffer1, 0, CopyBufferSize);

		//open file for write buffered. We do this so we can write the tail of the file
		//it is a kludge but hey, you get what you get in C#
		outfile = new FileStream(outputfile, FileMode.Open, FileAccess.Write, FileShare.None, 8,
			FileOptions.WriteThrough);

		//go to the right position in the file
		outfile.Seek(infilesize - bytesRead1, SeekOrigin.Begin);
		//flush the last buffer synchronous and buffered.
		outfile.Write(Buffer1, 0, bytesRead1);

		outfile.Dispose();
		infile.Dispose();
		return 1;
	}
}

This is as close to fully un-buffered IO on both the read and write side of things as I could get. There is a lot going on, but it is still synchronous all the way. If you look at Performance Monitor it will show you a sawtooth pattern as you read then write, since you are only ever doing one or the other. Using this to copy a file across the LAN to another server never got better than 75MB/sec throughput. Not horrible, but a long way from the 105MB/sec I get from something like FastCopy or TeraCopy. Heck, it’s not even close to the theoretical 125MB/sec a gigabit connection could support. That leaves the last piece of the puzzle: going asynchronous.

Threading in C#, To Produce or Consume?

We know that using the asynchronous file IO built into C# isn’t an option. That doesn’t mean we can’t pattern something of our own after it. I’ve done quite a bit of threading in C#. It isn’t as difficult as C/C++, but you can still blow your foot off. It adds a whole other level of complexity to your code. This is where a little thought and design on paper, or a flow chart, can help you out quite a bit. It’s also good to research design patterns and multi-threading; a lot of smart people have tackled these problems and developed well designed solutions. Our particular problem is a classic producer/consumer pattern, and a simple one at that. We have a producer, the read thread, putting data in a buffer. We have a consumer, the write thread, that takes that data and writes it to disk. My first priority was to model this as simply as possible. I’m not worried about multiple readers or writers. I am concerned with locking and blocking; keeping the time something has to be locked to a minimum is going to be key. That led me to a simple solution: one read thread and the buffer it reads into, one write thread and the buffer it writes from, and one intermediate buffer to pass data between them. Basically, an overlap buffer that is the same size as the read and write buffers. To give you a better visual before showing you the code, here are a couple of flow charts.

Read File
http://www.lucidchart.com/documents/view/4cac057f-d81c-472e-9764-52c00afcbe04

Write File
http://www.lucidchart.com/documents/view/4cac0726-dd14-46a6-8d44-53710afcbe04

There are a few things you need to be aware of. There is no guarantee of order on thread execution. That is why I’m using a lock object and a semaphore flag to let me know if the buffer is actually available to be written to or read from. Keep the lock scope small; the lock can be a bottleneck and basically drop you back into a synchronous mode. Watch for deadlocks: with the lock and the semaphore flag in play, if your ordering is wrong you can get into a deadlock between the two threads where they just sit and spin waiting for either the lock or the flag to clear. At this point I’m confident I don’t have any race or deadlocking situations.

Here is a simplified sample; I’m serious, this is as small a sample as I could code up.

internal class AsyncUnbuffCopy
{
	//file names
	private static string _inputfile;
	private static string _outputfile;
	//synchronization object
	private static readonly object Locker1 = new object();
	//buffer size
	public static int CopyBufferSize;
	private static long _infilesize;
	//buffer read
	public static byte[] Buffer1;
	private static int _bytesRead1;
	//buffer overlap
	public static byte[] Buffer2;
	private static bool _buffer2Dirty;
	private static int _bytesRead2;
	//buffer write
	public static byte[] Buffer3;
	//total bytes read
	private static long _totalbytesread;
	private static long _totalbyteswritten;
	//filestreams
	private static FileStream _infile;
	private static FileStream _outfile;
	//secret sauce for unbuffered IO
	const FileOptions FileFlagNoBuffering = (FileOptions)0x20000000;

	private static void AsyncReadFile()
	{
		//open input file
		_infile = new FileStream(_inputfile, FileMode.Open, FileAccess.Read, FileShare.None, CopyBufferSize,
		FileFlagNoBuffering);
		//if we have data read it
		while (_totalbytesread < _infilesize)
		{
			_bytesRead1 = _infile.Read(Buffer1, 0, CopyBufferSize);
			lock (Locker1)
			{
				while (_buffer2Dirty) Monitor.Wait(Locker1);
				Buffer.BlockCopy(Buffer1, 0, Buffer2, 0, _bytesRead1);
				_buffer2Dirty = true;
				Monitor.PulseAll(Locker1);
				_bytesRead2 = _bytesRead1;
				_totalbytesread = _totalbytesread + _bytesRead1;
			}
		}
		//clean up open handle
		_infile.Close();
		_infile.Dispose();
	}

	private static void AsyncWriteFile()
	{
		//open output file set length to prevent growth and file fragmentation and close it.
		//We have to do it this way so we can do unbuffered writes to it later
		_outfile = new FileStream(_outputfile, FileMode.Create, FileAccess.Write, FileShare.None, 8,
		FileOptions.WriteThrough);
		_outfile.SetLength(_infilesize);
		_outfile.Close();
		_outfile.Dispose();
		//open file for write unbuffered
		_outfile = new FileStream(_outputfile, FileMode.Open, FileAccess.Write, FileShare.None, 8,
		FileOptions.WriteThrough | FileFlagNoBuffering);
		while (_totalbyteswritten < _infilesize - CopyBufferSize)
		{
			lock (Locker1)
			{
				while (!_buffer2Dirty) Monitor.Wait(Locker1);

				Buffer.BlockCopy(Buffer2, 0, Buffer3, 0, _bytesRead2);
				_buffer2Dirty = false;
				Monitor.PulseAll(Locker1);
				_totalbyteswritten = _totalbyteswritten + CopyBufferSize;
			}
			_outfile.Write(Buffer3, 0, CopyBufferSize);
		}
		//close the file handle that was using unbuffered and write through
		_outfile.Close();
		_outfile.Dispose();
		lock (Locker1)
		{
			while (!_buffer2Dirty) Monitor.Wait(Locker1);
			//open file for write buffered. We do this so we can write the tail of the file
			//it is a kludge but hey, you get what you get in C#
			_outfile = new FileStream(_outputfile, FileMode.Open, FileAccess.Write, FileShare.None, 8,
			FileOptions.WriteThrough);
			//this should always be true but I haven't run all the edge cases yet
			if (_buffer2Dirty)
			{
				//go to the right position in the file
				_outfile.Seek(_infilesize - _bytesRead2, 0);
				//flush the last buffer, synchronous and buffered.
				_outfile.Write(Buffer2, 0, _bytesRead2);
			}
		}
		//close the file handle that was using unbuffered and write through
		_outfile.Close();
		_outfile.Dispose();
	}
	
	public static int AsyncCopyFileUnbuffered(string inputfile, string outputfile, int buffersize)
	{
		//set file name globals
		_inputfile = inputfile;
		_outputfile = outputfile;
		//setup single buffer size, remember this will be x3.
		CopyBufferSize = buffersize * 1024 * 1024;
		//buffer read
		Buffer1 = new byte[CopyBufferSize];
		//buffer overlap
		Buffer2 = new byte[CopyBufferSize];
		//buffer write
		Buffer3 = new byte[CopyBufferSize];
		//get input file size for later use
		var f = new FileInfo(_inputfile);
		long s1 = f.Length;
		_infilesize = s1;

		//create read thread and start it.
		var readfile = new Thread(AsyncReadFile) { Name = "ReadThread", IsBackground = true };
		readfile.Start();

		//create write thread and start it.
		var writefile = new Thread(AsyncWriteFile) { Name = "WriteThread", IsBackground = true };
		writefile.Start();

		//wait for threads to finish
		readfile.Join();
		writefile.Join();
		Console.WriteLine();
		return 1;
	}
}

As you can see, we have gotten progressively more complex with each pass until we have finally arrived at my goal. With zero unmanaged code and only one undocumented flag, I’ve built a C# program that actually does fast IO like the low-level big boys. To handle the small file issue I just drop back to my old copy routine to move those files along. You can see a working sample at http://github.com/SQLServerIO/UBCopy. It has MD5 verification built in as well.
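
For reference, the small file fallback boils down to a size check before picking a code path. Here is a minimal sketch of that idea; the threshold test and the use of File.Copy as the fallback are my illustration, not the exact UBCopy logic.

public static int SmartCopy(string inputfile, string outputfile, int buffersizeMb)
{
	//a file smaller than one copy buffer can't take the write-through + no-buffering path
	long length = new FileInfo(inputfile).Length;
	if (length < (long)buffersizeMb * 1024 * 1024)
	{
		//fall back to a plain copy for small files
		File.Copy(inputfile, outputfile, true);
		return 1;
	}
	//large files get the asynchronous, un-buffered treatment
	return AsyncUnbuffCopy.AsyncCopyFileUnbuffered(inputfile, outputfile, buffersizeMb);
}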

So, how well does it work?

FastCopy 1.99r4
TotalRead = 1493.6 MB
TotalWrite = 1493.6 MB
TotalFiles = 1 (0)
TotalTime= 15.25 sec
TransRate= 97.94 MB/s
FileRate = 0.07 files/s

UBCopy 1.5.2.1851 — Managed Code
File Copy Started
%100
File Copy Done
File Size MB : 1493.62
Elapsed Seconds : 15.26
Megabytes/sec : 102.63
Done.

I think it will do just fine.

At The End of the IO Road With C#

Previously I’ve written about doing fun IO stuff in C#. I found out that some of my old tricks still worked in C# but….

Now, having done a lot of C++, I knew about async IO, buffered and un-buffered, and could have made unmanaged code calls to open or create the file and pass the handle back. But just like it sounds, that is kind of a pain to set up, and if you are going down that path you might as well code it all up in C++ anyway.

I was mostly right. I have been working on a file sync tool for managing all my SQL Server backup files. Naturally, I wanted it to be as fast as humanly possible. Wanting that speed and getting it from the CLR are two completely different things. I know how to do asynchronous IO, and with a little trick you can do un-buffered IO as well. The really crappy part is you can’t do both in the CLR.

From my previous post, you know that SQL Server does asynchronous, un-buffered IO on reads and writes. The CLR allows you to do asynchronous reads with a fun bit of coding and a callback structure. I took this code from one of the best papers on C# and IO: Sequential File Programming Patterns and Performance with .NET. I made some minor changes and cleaned up the code a bit.

internal class AsyncFileCopy
    {
        // globals
        private const int Buffers = 8; // number of outstanding requests
        private const int BufferSize = 8*1024*1024; // request size, eight megabytes
        public static FileStream Source; // source file stream
        public static FileStream Target; // target file stream
        public static long TotalBytes; // total bytes to process    
        public static long BytesRead; // bytes read so far    
        public static long BytesWritten; // bytes written so far
        public static long Pending; // number of I/O's in flight
        public static Object WriteCountMutex = new Object[0]; // mutex to protect count
        // Array of buffers and async results.  
        public static AsyncRequestState[] Request = new AsyncRequestState[Buffers];

        public static void AsyncBufferedFileCopy(string inputfile, string outputfile)
        {
            Source = new FileStream(inputfile, // open source file
                                    FileMode.Open, // for read
                                    FileAccess.Read, //
                                    FileShare.Read, // allow other readers
                                    BufferSize, // buffer size
                                    FileOptions.Asynchronous); // use async
            Target = new FileStream(outputfile, // create target file
                                    FileMode.Create, // fault if it exists
                                    FileAccess.Write, // will write the file
                                    FileShare.None, // exclusive access
                                    BufferSize, // buffer size
                                    FileOptions.Asynchronous); // async, but still buffered
            TotalBytes = Source.Length; // Size of source file
            Target.SetLength(TotalBytes); //Set target file length to avoid file growth
            var writeCompleteCallback = new AsyncCallback(WriteCompleteCallback);
            for (int i = 0; i < Buffers; i++) Request[i] = new AsyncRequestState(i);
            // launch initial async reads
            for (int i = 0; i < Buffers; i++)
            {
                // no callback on reads.                     
                Request[i].ReadAsyncResult = Source.BeginRead(Request[i].Buffer, 0, BufferSize, null, i);
                Request[i].ReadLaunched.Set(); // say that read is launched
            }
            // wait for the reads to complete in order, process buffer and then write it. 
            for (int i = 0; (BytesRead < TotalBytes); i = (i + 1)%Buffers)
            {
                Request[i].ReadLaunched.WaitOne(); // wait for flag that says buffer is reading
                int bytes = Source.EndRead(Request[i].ReadAsyncResult); // wait for read complete
                BytesRead += bytes; // process the buffer <your code goes here>
                Interlocked.Increment(ref Pending); // track the write we are about to issue
                Target.BeginWrite(Request[i].Buffer, 0, bytes, writeCompleteCallback, i); // write it
            } // end of reader loop
            while (Pending > 0) Thread.Sleep(10); // wait for all the writes to complete                 
            Source.Close();
            Target.Close(); // close the files                     
        }

        // Asynchronous callback completes writes and issues the next read
        public static void WriteCompleteCallback(IAsyncResult ar)
        {
            lock (WriteCountMutex)
            {
                // protect the shared variables
                int i = Convert.ToInt32(ar.AsyncState); // get request index
                Target.EndWrite(ar); // mark the write complete
                Interlocked.Decrement(ref Pending); // this write is no longer in flight
                BytesWritten += BufferSize; // advance bytes written
                Request[i].BufferOffset += Buffers*BufferSize; // stride to next slot 
                if (Request[i].BufferOffset < TotalBytes)
                {
                    // if not all read, issue next read
                    Source.Position = Request[i].BufferOffset; // issue read at that offset
                    Request[i].ReadAsyncResult = Source.BeginRead(Request[i].Buffer, 0, BufferSize, null, i);
                    Request[i].ReadLaunched.Set();
                }
            }
        }

        #region Nested type: AsyncRequestState

        public class AsyncRequestState
        {
            // data that tracks each async request
            public byte[] Buffer; // IO buffer to hold read/write data
            public long BufferOffset; // buffer strides thru file BUFFERS*BUFFER_SIZE
            public IAsyncResult ReadAsyncResult; // handle for read requests to EndRead() on.
            public AutoResetEvent ReadLaunched; // Event signals start of read 

            public AsyncRequestState(int i)
            {
                // constructor    
                BufferOffset = i*BufferSize; // offset in file where buffer reads/writes
                ReadLaunched = new AutoResetEvent(false); // semaphore says reading (not writing)
                Buffer = new byte[BufferSize]; // allocates the buffer
            }
        }

        #endregion
    }

The fun bit about this code is that you don’t need to spawn your own threads to do the work. All of it happens from a single thread call and the async happens in the background. I do make sure to grow the file up front to prevent dropping back into synchronous mode on file growths.
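
For completeness, kicking one off is a single blocking call from the caller’s point of view; the overlapped IO happens behind it. The paths here are just examples.

//one call from the caller's thread; the overlapped reads and writes happen in the background
AsyncFileCopy.AsyncBufferedFileCopy(@"C:\backups\nightly.bak", @"\\fileserver\backups\nightly.bak");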

This next bit is the un-buffered stuff.

internal class UnBufferedFileCopy
{
    public static int CopyBufferSize = 8 * 1024 * 1024;

    public static byte[] Buffer = new byte[CopyBufferSize];

    const FileOptions FileFlagNoBuffering = (FileOptions)0x20000000;

    public static int CopyFileUnbuffered(string inputfile, string outputfile)
    {
        var infile = new FileStream(inputfile,
                                    FileMode.Open, FileAccess.Read, FileShare.None, 8
, FileFlagNoBuffering | FileOptions.SequentialScan);
        var outfile = new FileStream(outputfile, FileMode.Create, FileAccess.Write,
                                     FileShare.None, 8, FileOptions.WriteThrough);

        int bytesRead;
        while ((bytesRead = infile.Read(Buffer, 0, CopyBufferSize)) != 0)
        {
            outfile.Write(Buffer, 0, bytesRead);
        }

        outfile.Close();
        outfile.Dispose();
        infile.Close();
        infile.Dispose();
        return 1;
    }
}

Since this is a synchronous call I’m not worried about extending the file for performance, though there is still the fragmentation issue to think about. Without the pre-allocation step the code is a bit cleaner. The secret sauce on this one is creating your own file option and passing it in.

const FileOptions FileFlagNoBuffering = (FileOptions)0x20000000;

I hear you asking now, where did this thing come from? Well, that is simple: it is a regular flag you can pass to CreateFile if you are doing things in C or C++ when you create a file handle. I got curious as to what the CLR was actually doing in the background. It has to make a call to the OS at some point, and that means unmanaged code.

internal class UnmanagedFileCopy
{
    public static int CopyBufferSize = 8 * 1024 * 1024;

    public static byte[] Buffer = new byte[CopyBufferSize];

    private const int FILE_FLAG_NO_BUFFERING = unchecked(0x20000000);
    private const int FILE_FLAG_OVERLAPPED = unchecked(0x40000000);
    private const int FILE_FLAG_SEQUENTIAL_SCAN = unchecked(0x08000000);
    private const int FILE_FLAG_WRITE_THROUGH = unchecked((int)0x80000000);
    private const int FILE_FLAG_NONE = unchecked(0x00000000);

    public static FileStream infile;
    public static SafeFileHandle inhandle;
    public static FileStream outfile;
    public static SafeFileHandle outhandle;

    [DllImport("KERNEL32", SetLastError = true, CharSet = CharSet.Auto, BestFitMapping = false)]
    private static extern SafeFileHandle CreateFile(String fileName,
                                                    int desiredAccess,
                                                    FileShare shareMode,
                                                    IntPtr securityAttrs,
                                                    FileMode creationDisposition,
                                                    int flagsAndAttributes,
                                                    IntPtr templateFile);

    public static void CopyUnmanaged(string inputfile, string outputfile)
    {
        outhandle = CreateFile(outputfile,
                   (int)FileAccess.Write,
                   (int)FileShare.None,
                   IntPtr.Zero,
                   FileMode.Create,
                   FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH,
                   IntPtr.Zero);

        inhandle = CreateFile(inputfile,
                                  (int)FileAccess.Read,
                                  (int)FileShare.None,
                                  IntPtr.Zero,
                                  FileMode.Open,
                                  FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN,
                                  IntPtr.Zero);

        outfile = new FileStream(outhandle, FileAccess.Write, 8, false);
        infile = new FileStream(inhandle, FileAccess.Read, 8, false);

        int bytesRead;
        while ((bytesRead = infile.Read(Buffer, 0, CopyBufferSize)) != 0)
        {
            outfile.Write(Buffer, 0, bytesRead);
        }

        outfile.Close();
        outfile.Dispose();
        outhandle.Close();
        outhandle.Dispose();
        infile.Close();
        infile.Dispose();
        inhandle.Close();
        inhandle.Dispose();
    }
}

If I were building my own unmanaged calls, this would be it. When you profile the managed code for object creates/destroys you see that it is making calls to SafeFileHandle. Being the curious guy I am, I did a little more digging. For those of you who don’t know, there is an open source implementation of the Common Language Runtime called Mono. That means you can download the source code and take a look at how things are done. Poking around in the FileStream and associated code I saw that it had all the file flags in the code, but un-buffered was commented out… Now I had a mystery on my hands. I tried to implement asynchronous un-buffered IO using all unmanaged code calls and couldn’t do it. There is a fundamental difference between a byte array in the CLR and what I can set up in native C++. One of the things you have to be able to do if you want asynchronous un-buffered IO is sector align all reads and writes, including the in and out memory buffers. You can’t do that in C#. You have to allocate an unmanaged segment of memory and handle the reads and writes through that buffer. At the end of the day, you would have written all the C++ you need to do the file copy and wrapped it in a managed code loop.
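
To make the alignment rule a little more concrete, here is a tiny sketch of the size half of it. The 4096 is an assumption; the real number is the volume’s sector size, which you would query from the OS. The other half, aligning the buffer’s memory address, is exactly the part a managed byte[] gives you no control over.

const int SectorSize = 4096; //assumed sector size for illustration

static int RoundUpToSector(int requestedSize)
{
	//FILE_FLAG_NO_BUFFERING requires every IO size to be a multiple of the sector size
	return ((requestedSize + SectorSize - 1) / SectorSize) * SectorSize;
}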

So, you can do asynchronous OR un-buffered, but not both. From Sequential File Programming Patterns and Performance with .NET:

the FileStream class does a fine job. Most applications do not need or want un-buffered IO. But, some applications like database systems and file copy utilities want the performance and control un-buffered IO offers.

And that is a real shame; I’d love to write some high performance IO stuff in C#. I settled on doing un-buffered IO, since these copies go from a SQL Server, which will always be under some kind of memory pressure, to the file server. If I could do both asynchronous and un-buffered I could get close to wire speed, around 105 to 115 megabytes a second. Just doing un-buffered gets me around 80 megabytes per second. Not horrible, but not the best.

Creating UDL Files On The Fly

Many years ago, in the dark ages of DTS, I created a little app that would take two parameters and build a UDL file. This and some crafty VBScript allowed me to loop through a list of servers on the fly. I hadn’t thought about this code in almost ten years until I came across John Paul Cook’s blog post on using UDL files. I thought I’d just post up the code; it is basic but got the job done! I did clean it up a bit since it was written when .NET 1.0 was all the rage. The secret sauce is writing the file encoded as Unicode and doing binary writes to get all the little bits in the correct place; the UDL file format is picky that way. Enjoy!

using System;
using System.IO;
using System.Text;

namespace CreateUDL
{
    class Program
    {
        static void Main(string[] args)
        {
            if (args.Length < 2)
            {
                Console.WriteLine("Not enough arguments!");
                return;
            }

            string str2 = args[0]; //path prefix
            string str3 = args[1]; //server name

            if (str2.Length == 0)
            {
                Console.WriteLine("must provide file name");
                return;
            }
            if (str3.Length == 0)
            {
                Console.WriteLine("must provide server name");
                return;
            }

            //build the full file name from the path and server name
            string filename = str2 + str3 + ".udl";

            try
            {
                var output = new FileStream(filename, FileMode.OpenOrCreate);
                var writer = new BinaryWriter(output);

                //write the Unicode byte order mark (0xFEFF); the UDL format requires it
                writer.Write((short)(-257));

                string s = "[oledb]\r\n";
                byte[] bytes = Encoding.Unicode.GetBytes(s);
                writer.Write(bytes);

                s = "; Everything after this line is an OLE DB initstring\r\n";
                bytes = Encoding.Unicode.GetBytes(s);
                writer.Write(bytes);

                s = "Provider=SQLOLEDB.1;Integrated Security=SSPI;" +
                    "Persist Security Info=False;" +
                    "Initial Catalog=master;Data Source=" + str3 + "\r\n";
                bytes = Encoding.Unicode.GetBytes(s);
                writer.Write(bytes);

                writer.Flush();
                writer.Close();
            }
            catch (Exception e)
            {
                Console.WriteLine(e.Message);
            }
        }
    }
}



Adventures in SQL CLR and C#

I’ve toyed with the CLR in SQL Server 2005 off and on since the first Yukon beta had it enabled. And I’ll be honest with you, I was not a fan. It wasn’t a “YOU got chocolate in my peanut butter!” kind of moment for me. I really thought it was going to be a disaster of biblical proportions. As SQL Server DBAs we caught a break: adoption wasn’t exactly stellar. There are enough restrictions and little gotchas to keep developers from wholesale abandoning Transact-SQL for something more familiar. Fast forward a few years and now I’m not so scared. My biggest worry back then was memory usage. I’m still not very comfortable with it, but on a 64-bit platform you can mitigate those issues by adding more memory. On a 32-bit platform you could cause all kinds of damage by squeezing the lower 4GB memory space to the point where you could have connection and backup failures due to lack of memory. Oh, and the fix is usually restarting SQL Server. An example of this comes directly from http://msdn.microsoft.com/en-us/library/ms131075.aspx

Scalable Memory Usage

In order for managed garbage collection to perform and scale well in SQL Server, avoid large, single allocation. Allocations greater than 88 kilobytes (KB) in size will be placed on the Large Object Heap, which will cause garbage collection to perform and scale much worse than many smaller allocations. For example, if you need to allocate a large multi-dimensional array, it is better to allocate a jagged (scattered) array.

88KB!

This memory thing is serious.
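
To put a number on it, here is a minimal sketch of the jagged-array advice; the sizes are mine, chosen so each row of 10,000 doubles comes in around 80KB and stays under the large object threshold quoted above.

static double[][] AllocateMatrix(int rows, int cols)
{
	//a single double[rows, cols] at 10,000 x 10,000 would be one giant allocation
	//on the Large Object Heap; splitting it into rows keeps each allocation small
	var jagged = new double[rows][];
	for (int i = 0; i < rows; i++)
	{
		jagged[i] = new double[cols];
	}
	return jagged;
}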

The other biggie is what you can, or cannot do using the CLR.

Again from MSDN http://msdn.microsoft.com/en-us/library/ms131047.aspx

SAFE is the most reliable and secure mode with associated restrictions in terms of the allowed programming model. SAFE assemblies are given enough permission to run, perform computations, and have access to the local database. SAFE assemblies need to be verifiably type safe and are not allowed to call unmanaged code.

UNSAFE is for highly trusted code that can only be created by database administrators. This trusted code has no code access security restrictions, and it can call unmanaged (native) code.

EXTERNAL_ACCESS provides an intermediate security option, allowing code to access resources external to the database but still having the reliability guarantees of SAFE.

From most restrictive (SAFE) to least restrictive (UNSAFE) permissions. It’s something you don’t worry about in general as a C# programmer, but in the database it’s always an issue in some way.

What it boils down to:

If you are just talking to SQL Server using basic C# stuff, leave it in SAFE, which is the default.

If you need access to the file system, the registry, and a few other limited resources, EXTERNAL_ACCESS is the way to go.

If you want the ability to completely tank a production SQL Server, UNSAFE puts it all into your hands. You can call unmanaged code via P/Invoke; all bets are off.

 

Some additional light reading on what libraries can and can’t be called in the CLR.

http://msdn.microsoft.com/en-us/library/ms403273.aspx

Fun stuff: no finalizers or static fields, though read-only static fields are OK. You will see why this is important to me a little later on.

http://msdn.microsoft.com/en-us/library/ms403279.aspx
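
A tiny example of the finalizer/static field restriction mentioned above, since it bites me later with #ziplib; this is my own illustration, not code from the documentation.

public static class ClrRestrictionExample
{
	//not allowed in SAFE or EXTERNAL_ACCESS assemblies: writable static state
	//public static int CallCount;

	//allowed: read-only static fields
	public static readonly int MaxBufferSize = 8 * 1024 * 1024;
}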

 

T-SQL vs. CLR

The other thing I had been promoting, and not always correctly, is putting complicated math functions in the CLR. Generally, I’ve found that most math problems run faster in the CLR than in native T-SQL, and for the most part that holds true for the core algorithm. Once you add data retrieval into the mix, things shift back in T-SQL’s favor for a lot of operations. Like everything else, test your ideas using real world scenarios, or as close as you can get, before deciding on one technology over another. A prime example for me was coding up the Pythagorean and Haversine equations for the classic distance between two zip codes in T-SQL and in C# via the CLR. Running test data through an array, the C# solution ran rings around the T-SQL function I had coded up, but once it had to start pulling and pushing data back to the database the T-SQL solution was the clear winner.
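
For the curious, the CLR side of that test looked roughly like the function below. This is a reconstruction for illustration, not the exact code I benchmarked.

using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public static class GeoFunctions
{
	[SqlFunction(IsDeterministic = true, IsPrecise = false)]
	public static SqlDouble HaversineMiles(SqlDouble lat1, SqlDouble lon1,
	                                       SqlDouble lat2, SqlDouble lon2)
	{
		if (lat1.IsNull || lon1.IsNull || lat2.IsNull || lon2.IsNull)
			return SqlDouble.Null;

		const double EarthRadiusMiles = 3956.0;
		double dLat = (lat2.Value - lat1.Value) * Math.PI / 180.0;
		double dLon = (lon2.Value - lon1.Value) * Math.PI / 180.0;
		//standard Haversine formula
		double a = Math.Sin(dLat / 2) * Math.Sin(dLat / 2) +
		           Math.Cos(lat1.Value * Math.PI / 180.0) *
		           Math.Cos(lat2.Value * Math.PI / 180.0) *
		           Math.Sin(dLon / 2) * Math.Sin(dLon / 2);
		double c = 2 * Math.Atan2(Math.Sqrt(a), Math.Sqrt(1 - a));
		return new SqlDouble(EarthRadiusMiles * c);
	}
}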

Another aspect where the CLR can be much better is string manipulation. I’ve written a couple of small UDFs to handle some of this; since a LIKE ‘%…%’ search would cause a table scan anyway, the CLR UDF was faster at the actual string handling than T-SQL was using all of its string functions.
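
As an example of the kind of string UDF I mean, here is a minimal sketch; the function name and behavior are mine, not one of the actual UDFs.

using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public static class StringHelpers
{
	[SqlFunction(IsDeterministic = true, IsPrecise = true)]
	public static SqlInt32 CountOccurrences(SqlString input, SqlString pattern)
	{
		if (input.IsNull || pattern.IsNull || pattern.Value.Length == 0)
			return SqlInt32.Null;

		string s = input.Value;
		string p = pattern.Value;
		int count = 0;
		int pos = 0;
		//simple ordinal scan through the string, all done in the CLR
		while ((pos = s.IndexOf(p, pos, StringComparison.Ordinal)) != -1)
		{
			count++;
			pos += p.Length;
		}
		return new SqlInt32(count);
	}
}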

I’m also seeing quite a bit on using the CLR for rolling aggregates and other kinds of aggregation problems. I don’t have any personal experience with that in the CLR yet.

There are also some things that aren’t practical at all in T-SQL. Some would say you shouldn’t be using the database for that stuff in the first place, but that is an argument for a different post.

 

And Now for Something Completely Different…

I’ve recently started working on my most complex project using the CLR. Some aspects have been covered by other folks like Adam Machanic, Robin Dewson, and Jonathan Kehayias, but there were some specific requirements that I needed.

Thus was born….

SQL Server File System Tools

This is a codeplex hosted project and all the source code is available there for your viewing pleasure.

I’ve done a lot of C# stuff but this was my first hard core CLR app for SQL Server.

What the assembly does is pretty simple: store files in the database either native, encrypted, or compressed. Yoel Martinez wrote up a nice UDF that does blob compression using the CLR. Between that and the examples in Pro SQL Server 2005 on storing files in the database, I knew I could do what I needed to do.

The wrinkle in my project was not just reading the file and storing it compressed, it was putting it back on disk compressed as well. Enter #ziplib (SharpZipLib). This library lets you pretty easily create standard zip files that even Windows Explorer can open and extract from. So with all the bits in place, I set out to build my little tool.

 

Development Cycle

The first thing I did was put together all the samples I’d found, build them up as a set of stored procedures instead of UDFs, and just get the file in and out working. Next I added compression via C#’s DeflateStream to see what it would take to get the data flowing in and out and what the performance hit in memory and time would look like. At this point I was pretty optimistic I could knock this thing out in a day or two, tops. That was all fine and dandy until I started integrating the #ziplib library. My initial goal was to have the assembly set to EXTERNAL_ACCESS, since that was the most restrictive security model that still allowed file access.
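
The DeflateStream test was nothing fancy; a compress helper like the sketch below is all it took to turn a byte array into a compressed payload. The class and method names here are just for illustration.

using System.IO;
using System.IO.Compression;

public static class BlobCompression
{
	public static byte[] Compress(byte[] data)
	{
		using (var output = new MemoryStream())
		{
			using (var deflate = new DeflateStream(output, CompressionMode.Compress, true))
			{
				deflate.Write(data, 0, data.Length);
			}
			//the MemoryStream now holds the deflate-compressed payload
			return output.ToArray();
		}
	}
}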

Since the guys that wrote #ziplib didn’t have the CLR in mind, there are several things that break without UNSAFE set. As I mentioned earlier, the use of finalizers and static fields were the two big ones. I will recode those parts at some point, but for now they are still in place. The second thing is that the library covers a lot more functionality than I actually need, so I’ve removed the bits I can without refactoring the library. The resulting DLL isn’t horribly big at this point, but I figure when I get around to recoding the finalizers I’ll refactor down to just what I need. One big plus for me is that #ziplib is all managed code written in C#, so it is easily added directly into my DLL; I don’t have to register two assemblies or call down to the file system for an unmanaged DLL. Encryption is handled by RijndaelManaged, which is built into .NET 2.0.

The big downer for me was trying to debug the code in Visual Studio 2008. When it did work it was OK, but it would often fail to connect or fail to register the assemblies, so I just fell back to injecting debug messages and running tests manually in SSMS.

One thing I really like about programming languages like C# is method overloading; I really wish you could do that with stored procedures! Since I can’t, there were only two options: one stored procedure with lots of flags and variables that may or may not be used, handling it all under the covers, or a separate procedure for each option, with simple parameters and a descriptive name. I voted for option two. Some of the T-SQL procedures are used internally by the CLR procedures, while all the CLR procedures are called by the user.

Here is the list of procedures and what they do.

Called by CLR procedures as helpers

InsertFile
Is called by every proc that inserts a file into the database.

SavePassphrase
Called by StorePassPhrase to handle insert into database.

RetrievePassPhrase
Called by any proc that has to decrypt a file stream

RetrieveFile
Called by any proc that retrieves a file from the database

Called by Users

T-SQL Procedures

InsertTag
Called by User supplying a file id and list of key words or “tags” as a search helper other than file name.

RetrieveFileDetails
Get details on a single file or every file stored in the database.

CLR Procedures

StorePassPhrase
Give it a password and it generates a secure hash stored into the database for symmetric encryption

Below all store a file from the file system into the database.
StoreFile
StoreFileEncrypted
StoreFileCompressed
StoreFileEncryptedCompressed

Below all retrieve a file from the database back to the file system.
ExtractFileToDisk
ExtractFileToDiskEncrypted
ExtractFileToDiskCompressed
ExtractFileToDiskEncryptedCompressed

Below all retrieve a file from the database but returns a record set with the file name and the BLOB data.
ExtractFileToRecord
ExtractFileToRecordEncrypted
ExtractFileToRecordCompressed
ExtractFileToRecordEncryptedCompressed

And lastly, I put in an xp_getfiledetails clone, since I wanted a way to verify the file is on disk and get its attributes. It seemed pretty straightforward since I’m gathering the same details when I write the file to the database anyway.
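
The details themselves come straight from FileInfo; here is a sketch of the idea (the method name and output format are mine, not the actual procedure’s).

using System.IO;

public static class FileDetails
{
	public static string Describe(string path)
	{
		var fi = new FileInfo(path);
		if (!fi.Exists)
			return "File not found: " + path;

		//same attributes I capture when the file is written into the database
		return string.Format("{0} | {1} bytes | created {2:u} | modified {3:u}",
			fi.Name, fi.Length, fi.CreationTimeUtc, fi.LastWriteTimeUtc);
	}
}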

Final Thoughts

This project isn’t done yet. There are a few more things to be added beyond the code cleanup I mentioned already.

An offline decryption tool, so the files dumped to disk still encrypted can be worked with.

Additional stored procedures for searching for files by tag or by attributes like name, size, etc.

A real installer and not a zip file with T-SQL scripts.

After that it goes into maintenance mode with no new features, just work on speeding it up, reducing the memory impact, and fixing any bugs that are found. I really want to avoid this growing into a huge library. Keep it simple: do one thing and do it well.

Resources

Here are some things that helped me along the way.

Development Tools

Visual Studio 2008
I could have used Notepad, but hey, I’m getting lazy in my old age.

JetBrains ReSharper 4.5
If you are using Visual Studio, ReSharper is a must. I feel like I’m programming in the stone age without it.

GhostDoc
A free tool to help you document your C# code using XMLDoc. Yet another thing I wish I could do with stored procedures.

Doxygen
If you are building documentation and have XMLDoc in your code this can make it easier to gather it all together. It isn’t perfect but it is free.

Books

Both are solid texts. This first one has a chapter on the CLR:

Pro SQL Server 2005

 

This one is dedicated entirely to the CLR and was also invaluable to me:

Pro SQL Server 2005 Assemblies

 

On The Web

http://www.sqlclr.net
Devoted to the CLR with some nice resources.

My Twitter buddies are always there to answer a question or two!

Until next time!

-Wes

Understanding File System IO, Lessons Learned From SQL Server

I do more than just SQL Server. I enjoy programming. In my former life I worked with C/C++ and Assembler. As I spent more and more time with SQL Server, my programming took a back seat career-wise. Having that background, though, really helps me day in and day out in understanding why SQL Server does some of the things it does at the system level.

Fast forward several years and I’ve moved away from C/C++ and spent the last few years learning C#.

Now that I work mostly in C#, I look up solutions for my C# dilemmas on sites like http://www.codeplex.com and http://www.codeproject.com. I love the internet for this very reason: hit a road block, do a search, and let the collective knowledge of others speed you on your way. But it can be a trap if you don’t do your own homework.

I write mostly command line or service based tools these days, not having any real talent for GUIs to speak of. Being a person obsessed with performance, I build these things to be multi-threaded; with today’s computers having multiple cores and hyper-threading, it just makes sense to take advantage of the processing power. This is all fine and dandy until you want multiple threads to access a single file and all your threads end up hanging around waiting for access.

So, I did what I always do and asked my best friend Google what the heck was going on. As usual, he gave me several quality links, and everything pointed to the underlying file not being opened in asynchronous mode. Now, having done a lot of C++, I knew about asynchronous IO, buffered and un-buffered. I could have made unmanaged code calls to open or create the file and pass the safe handle back, but just like it sounds that is kind of a pain to set up, and if you are going down that path you might as well code it all up in C++ anyway.

Doing a little reading on MSDN, I found all the little bits I needed to set everything to rights. I set everything up to do asynchronous IO and started my test run again. It ran just like it had before: slow and painful. Again, I had Mr. Google go out and look for a solution for me (sometimes being lazy is a bad thing), and he came back with several hits where people had similar issues. I knew I wasn’t the only one! The general solution? Something I consider very, very .NET: use a background thread and a delegate to keep the file access from halting your main thread, so your app “feels” responsive. It is still doing synchronous IO. Your main thread goes along, but all file access is still bottlenecked on a single reader/writer thread. Sure, it solves the issue of the program “freezing” up on file access, but it doesn’t really solve the problem of slow file access that I am actually trying to fix.

I know that SQL Server uses asynchronous, un-buffered IO to get performance from the file system. I did some refresh reading on the MSDN site and struck gold: writes to the file system may or may not be asynchronous depending on several factors. One of them is that if the file must be extended, everything goes back to synchronous IO while the file is extended. Since I was working with a FileStream on a newly created file every time, I was pretty much guaranteeing that I would be synchronous no matter what. At this point I dropped back to C++. I started to code it up when I realized I was doing things differently in my C++ version.

I was manually creating the file and doing an initial allocation, growing it out to the size of the file buffer, and then setting the final file length on close if need be.

I started up my C++ version of the code and watched all the IO calls using Sysinternals Process Monitor. I watched my C++ version, and lo, it was doing asynchronous IO in the very beginning, then switching to synchronous IO as the file started growing. I fired up my instance of SQL Server and watched as the asynchronous IO trucked right along… until a file growth happened and everything went synchronous for the duration of the growth.

AH HA!

So, taking that little extra knowledge, I manually created my file in C#, set an initial default size, and wouldn’t you know it, asynchronous IO kicked right in until it had to grow the file. I had to do a little extra coding to watch how much free space was left in the file; when I get close, I now pause any IO, manually grow the file by some amount, and then start the writes back up, keeping things from dropping into synchronous mode without me knowing.
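
The pre-grow trick is simple enough to show in a few lines. This is a sketch under my own assumptions (the growth chunk size and the method name are mine), not the exact code from the tool.

const long GrowthChunk = 64L * 1024 * 1024; //assumed growth increment

static void EnsureCapacity(FileStream stream, long position, long bytesAboutToWrite)
{
	//if the next write would run past the current allocation, pause and extend the
	//file in one big synchronous step so the writes themselves can stay asynchronous
	long required = position + bytesAboutToWrite;
	if (required > stream.Length)
	{
		long newLength = ((required / GrowthChunk) + 1) * GrowthChunk;
		stream.SetLength(newLength);
	}
}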

So, there you go: my little adventure and how my old skills, combined with knowing how SQL Server works, helped me solve this problem. Never assume that your new skills and old skills won’t overlap.