SATA and write-through

Basically, this post originates in a inspiring comment from James Hamilton a couple of years ago on SATA and write reliability.

We all know that Enterprise-Class transactionnal database systems use concepts such as ARIES to enforce transactionnal consistency. SQL Server implements ARIES semantics using what is called the Write-Ahead protocol. It means that any modification to a data in a transaction requires that all of the log records up to this transaction must be hardened on a media that can survive a power outage. These log records must be written ahead of the data, thus the name.

The concept of ‘hardening‘ is technically implemented using a special feature named write-through. This feature is defined in the win32 API as a flag passed to a file object during its initialization with CreateFile(),  FILE_FLAG_WRITE_THROUGH. Quoting MSDN:

“Write operations will not go through any intermediate cache, they will go directly to disk.”

SQL Server uses FILE_FLAG_WRITE_THROUGH to open data and transaction log files. Now the way this flag is interpreted depends on what kind of storage interface lies beneath the IO manager. For example, the SCSI protocol handles this flag using a special ForceUnitAccess (FUA) bit in a command disk block. A write CDB looks like this:

Whenever the file is opened or created using FILE_FLAG_WRITE_THROUGH, the scsi storport driver will translate this by turning FUA on.

This is where our story begins, because other protocols derived from ATAPI / IDE specification, including S-ATA, usually ignore this flag. Many SATA disks out there explicitly do not support the feature:

sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't
support DPO or FUA

What does that mean ? Simply that when the windows kernel respects the requirement carried out by the write-through flag, the default SATA miniport driver just clears it for performance reasons (see JH post mentionned at the top).

I can see you tipping your chair back. Wow wow wow, wait a second. You mean my SATA commodity drives aren’t secure with regard to write persistance ? Well yours, I don’t know. But mine isn’t. Here’s how you can find out.

Diving in the IO waters

Before going any further, some interesting readings : Handling IRPs: what every driver writer needs to know.

Anytime SQL Server opens a data or transaction log file, it calls CreateFile(). The kernel then transfers the control to the Object Manager and initializes a FILE_OBJECT structure which represents the file along with its attributes. When CreateFile() has FILE_FLAG_WRITE_THROUGH set, the .Flags bitmask of the FILE_OBJECT will have a FO_WRITE_THROUGH property on, represented by a value of 0×00000010 as per winddk.h. This means that the Object Manager has understood the requirement and pushes it into the object structure.

Then when a write occurs, using WriteFile() / WriteFileGather() depending on the target, the control is transfered to the IO Manager. Everytime an IO has to be carried out the IO manager will create an IO Request Packet (IRP) to track it. IRPs are pure kernel structures and cannot be observed in user mode.

Every IRP has a number of stacks associated called IO_STACK_LOCATIONs. There is one IO_STACK_LOCATION per driver on the IO path. The header points to the current IO_STACK_LOCATION, and is updated as each driver passes the IRP to the next driver in the stack.

As for the FILE_OBJECT, an IRP also has a .Flags member to embed active switches. One of them is SL_WRITE_THROUGH, a sort of counterpart of FO_WRITE_THROUGH, and it equals 0×04 (see winddk.h for the complete list of .Flags members and bitmask values). Our mission should we choose to accept it is to track down an IRP during a write-through write on a SATA device, and check if the requirement is propagated properly to the IO_STACK_LOCATION corresponding to our driver.

Down the rabbit hole:

I’m not going to attach sqlservr in a kernel debugger. I’ve tried, scratched my head so much I’m almost bald now, thanks. Yeah bald is beautiful, I know. Seriously I found it impractical for what I’m looking for (catching a living write IRP).  I decided to code a small piece of C program posting write-through, sector-aligned, synchronous IOs every 1ms, and then looking at it in liveKD.

// -------------------------------------------------------------------------------------
// writeThrough.cpp : Ecrit en boucle des caractères dans un fichier en mode Writethrough / No Buffering
// Utilisé avec livekd, permet de scruter le contenu des IO_STACK_LOCATION dans l'IRP correspondante
// pour contrôler si le bit SL_WRITE_THROUGH est bien propagé jusqu'au driver (pb ATA / IDE /S-ATA).
//
// On utilisera un IOCTL (IOCTL_DISK_GET_DRIVE_GEOMETRY) pour récupérer la taille du secteur
//
// dbaffaleuf@capdata.fr (c) CapData Consulting 2011
//
// --------------------------------------------------------------------------------------

#include <stdafx.h>
#include <windows.h>
#include <stdlib.h>
#include <winioctl.h>
#include <direct.h>

int _tmain(int argc, _TCHAR* argv[])
{
if (argc!=2)
{
    printf("Usage is:  writeThrough outputfilename\n");
    return 1;
}

// Récupération de la taille du secteur disque avec un IOCTL ---------------------------------------------------
DISK_GEOMETRY dg;
HANDLE hDrive;
DWORD ioctlJnk;

hDrive=CreateFile(TEXT("\\\\.\\PhysicalDrive0")
      ,0
      ,FILE_SHARE_READ
      | FILE_SHARE_WRITE
      ,NULL
      ,OPEN_EXISTING
      ,0
      ,NULL);

if (INVALID_HANDLE_VALUE==hDrive)
{
      printf("Unable to open drive PhysicalDrive0. Last error=%d\n",GetLastError());
      return 1;
}

if (! DeviceIoControl(hDrive
      ,IOCTL_DISK_GET_DRIVE_GEOMETRY
      ,NULL
      ,0
      ,&dg
      ,sizeof(dg)
      ,&ioctlJnk
      ,NULL) )
{
      printf("Error on IOCTL (IOCTL_DISK_GET_DRIVE_GEOMETRY) to %s.  Last error=%d\n",argv[1],GetLastError());
      return 1;
}

CloseHandle(hDrive);

// Ouverture du fichier Writethrough + No Buffering ---------------------------------------------------------
HANDLE hOutputFile=CreateFile(argv[1]
      ,GENERIC_READ
      | GENERIC_WRITE
      ,0
      ,NULL
      ,CREATE_ALWAYS
      ,FILE_ATTRIBUTE_NORMAL
      | FILE_FLAG_WRITE_THROUGH
      | FILE_FLAG_NO_BUFFERING
      ,NULL);

if (INVALID_HANDLE_VALUE==hOutputFile)
{
      printf("Unable to open file %s.  Last error=%d\n",argv[1],GetLastError());
      return 1;
}

// Initialisation du tampon et écriture --------------------------------------------------------------------
wchar_t *csBuffer = (wchar_t *)HeapAlloc(GetProcessHeap(),HEAP_ZERO_MEMORY,dg.BytesPerSector * sizeof(wchar_t));
int len = swprintf_s(csBuffer,dg.BytesPerSector,L"abcdefghijklmnopqrstuvwxyz");
DWORD dwBytesWritten;
DWORD dwTotalBytesWritten=0;

do
{
      if (! WriteFile(hOutputFile
      ,csBuffer
      ,dg.BytesPerSector
      ,&dwBytesWritten
      ,NULL))
     {
           printf ("WriteFile failed with error %d.\n", GetLastError());
           return 1;
     }

     dwTotalBytesWritten+=dwBytesWritten;
     printf("Operation terminée, %d octets écrits\n", dwTotalBytesWritten);
     Sleep (1);

} while (TRUE);

HeapFree(GetProcessHeap(),0,csBuffer);
CloseHandle(hOutputFile);

return 0;
}

OK so nothing particular here, except that FILE_FLAG_WRITE_THROUGH requires to post sector aligned writes, so we need to collect the sector size using an IOCTL, that’s how SQL does it. Let’s run this :

C:\DBA\DEV\CPP\writeThrough\Debug>writeThrough.exe test.txt
Operation terminée, 512 octets écrits
Operation terminée, 1024 octets écrits
Operation terminée, 1536 octets écrits
Operation terminée, 2048 octets écrits
Operation terminée, 2560 octets écrits
Operation terminée, 3072 octets écrits
...

In the meantime, you will obviously have to install a kernel debugger such as LiveKD and set the proper symbol path:

set _NT_SYMBOL_PATH=srv*c:\temp\symbols*http://msdl.microsoft.com/download/symbols   

You may need to run a bcdedit /debug to enable debugging in kernel mode, and reboot the box. Now we need to find our process in the kernel address space, look for the handle of our output file, inspect the FILE_OBJECT and try to display any on-the-fly IRP we can find.

lkd> !sym quiet
quiet mode - symbol prompts on

lkd> !process 0 0 writeThrough.exe
PROCESS 881b39d0  SessionId: 0  Cid: 0c08    Peb: 7ffde000  ParentCid: 0960
DirBase: 0aa80740  ObjectTable: e4168638  HandleCount:   9.
Image: writeThrough.exe

OK here’s our process, we can define it as the current process we’ll look at:

lkd> .process 881b39d0
Implicit process is now 881b39d0

Then we need to enumerate the handles owned by this process, and try to look for our test.txt output file:

0: kd> !handle 0 3
(...)
07dc: Object: 8701f1b0  GrantedAccess: 0012019f Entry: e3baefb8
Object: 8701f1b0  Type: (8a57a900) File
ObjectHeader: 8701f198 (old version)
HandleCount: 1  PointerCount: 1
Directory Object: 00000000  Name: \CAPDATA\DEV\DEV++\writeThrough\Debug\test.txt {HarddiskVolume3}
(...)

Here it is, Object: 8701f1b0. Let’s inspect the FILE_OBJECT structure for test.txt:

0: kd> dt nt!_FILE_OBJECT 8701f1b0
+0x000 Type             : 0n5
+0x002 Size             : 0n112
+0x004 DeviceObject     : 0x8a4a5a30_DEVICE_OBJECT
+0x008 Vpb              : 0x8a4a71e8 _VPB
+0x00c FsContext        : 0xe6be4928 Void
+0x010 FsContext2       : 0xe6be4a80 Void
+0x014 SectionObjectPointer : 0x8839f5cc _SECTION_OBJECT_POINTERS
+0x018 PrivateCacheMap  : (null)
+0x01c FinalStatus      : 0n0
+0x020 RelatedFileObject : 0x88752ec0 _FILE_OBJECT
+0x024 LockOperation    : 0 ''
+0x025 DeletePending    : 0 ''
+0x026 ReadAccess       : 0x1 ''
+0x027 WriteAccess      : 0x1 ''
+0x028 DeleteAccess     : 0 ''
+0x029 SharedRead       : 0 ''
+0x02a SharedWrite      : 0 ''
+0x02b SharedDelete     : 0 ''
+0x02c Flags            : 0x4101a
+0x030 FileName         : _UNICODE_STRING "\CAPDATA\DEV\DEV++\writeThrough\Debug\test.txt"
+0x038 CurrentByteOffset : _LARGE_INTEGER 0x7c00
+0x040 Waiters          : 0
+0x044 Busy             : 0
+0x048 LastLock         : (null)
+0x04c Lock             : _KEVENT
+0x05c Event            : _KEVENT
+0x06c CompletionContext : (null)

Look at the .Flags member. Remember the Object Manager records the write-through requirement by setting the FO_WRITE_THROUGH (0×00000010) in the bitmask:

0: kd> ? (0x4101a & 0x00000010)
Evaluate expression: 16 = 00000010

Yep, here it is. For further investigation, we need to keep track of the device object identifier:

+0x004 DeviceObject     : 0x8a4a5a30_DEVICE_OBJECT

Hang on !

That’s where the fun really begins. You may need to run the following code more that once to be able to catch an active IRP on the fly, these things are really transient:

lkd> .foreach /pS 1 /ps 3 (irpaddr {.foreach /pS 21 /ps 9 (irpaddressblk {!irpfind 0 0 device 0x8a4a5a30}) { .echo ${irpaddressblk} } }) { !irp ${irpaddr} 1 }

You’ll have to put everything in one row and that makes it difficult to read. Basically we have two nested loops, the inner one is returning an IRP address for our device object:

irpaddr {.foreach /pS 21 /ps 9 (irpaddressblk {!irpfind 0 0 device 0x8a4a5a30}

And the outer one is trying to print the contents of the IRP structure by calling !irp on the IRP address returned (for more details have a look at the DDK documentation):

.foreach /pS 1 /ps 3 (...) { .echo ${irpaddressblk} } }) { !irp ${irpaddr} 1 }
 

And if you are lucky, you may catch one active IRP on its way to the driver:

Irp is active with 10 stacks 6 is current (= 0x882a312c)
Mdl=88195a60: No System Buffer: Thread 881059e8:  Irp stack trace.
Flags = 00000043
ThreadListEntry.Flink = 882a3018
ThreadListEntry.Blink = 882a3018
IoStatus.Status = 00000000
IoStatus.Information = 00000000
RequestorMode = 00000000
Cancel = 00
CancelIrql = 0
ApcEnvironment = 00
UserIosb = a0e60878
UserEvent = a0e60820
Overlay.AsynchronousParameters.UserApcRoutine = 00000000
Overlay.AsynchronousParameters.UserApcContext = 00000000
Overlay.AllocationSize = 00000000 - 00000000
CancelRoutine = 00000000
UserBuffer = 88083000
&Tail.Overlay.DeviceQueueEntry = 882a3048
Tail.Overlay.Thread = 881059e8
Tail.Overlay.AuxiliaryBuffer = 00000000
Tail.Overlay.ListEntry.Flink = 00000000
Tail.Overlay.ListEntry.Blink = 00000000
Tail.Overlay.CurrentStackLocation = 882a312c
Tail.Overlay.OriginalFileObject = 899ef028
Tail.Apc = 00000000
Tail.CompletionKey = 00000000
cmd  flg cl Device   File     Completion-Context
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
>[  4, 0]   0  0 8a483770 00000000 f77077ca-8a483548
\Driver\Disk
Args: 00000000 00000000 00000000 00000000
[  4, 0]   0  0 8a483548 00000000 f7238962-8a4a5ae8
\Driver\PartMgr
Args: 00000000 00000000 00000000 00000000
[  4, 0]   0  0 8a4a5a30 00000000 f74a7680-8a48a7b8
\Driver\Ftdisk
Args: 00000000 00000000 00000000 00000000
[  4, 0]   0  0 8a48a700 00000000 f7063e3f-a0e6045c
\Driver\VolSnap
Args: 00000000 00000000 00000000 00000000
[  4, 0]   0  0 8979c020 00000000 00000000-00000000
\FileSystem\Ntfs
Args: 00000000 00000000 00000000 00000000

Nice. At the bottom we can see the stack locations piled up (Ntfs -> VolSnap -> FtDisk -> PartMgr -> Disk). The current location pointed with ‘>’ on \Driver\Disk with major,minor = [4,0] shows us this is an IRP_MJ_WRITE, exactly what we seek. And what about the .Flags member:

lkd> ? (0x00000043 & 0x4)
Evaluate expression: 0 = 00000000

Alas. No SL_WRITE_THROUGH here. The FO_WRITE_THROUGH has been ignored somehow, and this write-through operation may end up right in the disk cache with a nice ack to the caller. You would argue that I’d only need to uncheck the cache checkbox for the disk.

Well, according to this article  from  MSR and the Disk Subsystem Performance whitepaper:

“There is no guarantee that the write caching will actually be disabled. In many cases a controller cache is only configurable using a special utility provided by the manufacturer.”

What then ?

Bottom-line, use Enterprise-Class disks for your data and transaction log files. And by that I mean SCSI.

Backup is different, because the IOs are buffered, and SQL Server issues a call to FlushFileBuffers() to make sure the on-board caches are properly flushed to the media. The directive is correctly understood by SATA, so you can use commodity disks for backup. You can check this by setting a breakpoint on kernel32!FlushFileBuffers when sqlservr is attached in windbg during a backup:

0:018> bp kernel32!FlushFileBuffers
0:018> g
Breakpoint 0 hit
eax=000006f8 ebx=35a364e0 ecx=651af380 edx=7c91e4f4 esi=35a364e0 edi=3602ddc0
eip=7c8126d1 esp=651af3c8 ebp=651af3f0 iopl=0         nv up ei pl nz na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000206
kernel32!FlushFileBuffers:
7c8126d1 8bff            mov     edi,edi

Actually, the situation is not as dramatic as it seems. Many RAID controllers have battery-backed cache and do not need to honor the write-through requirement. And we can hope that in future designs, judicious calls to FlushFileBuffers()  can be made in very appropriate scenarios to minimize the places where transactionnal consistency is in danger.

I’m looking forward to hear your thoughs on this. Keep in touch ~David.

Ps: you can read this article in french right here.

About these ads

About dbaffaleuf

I've been in the IT for a decade administering mission critical RDBMS systems such as Sybase ASE and Microsoft SQL Server, for some of the largests investment banks, asset managers and telecoms in France. I came to the position of principal consultant on SQL Server technologies at CapData Consulting in 2006, where I still manage a team of senior DBAs, do expertise jobs and software support for our remote monitoring solutions. I also have the privilege of sharing my experience by teaching classes on SQL Server topics, and blogging in french at http://blog.capdata.fr and in english at SQL Server Central. (http://www.sqlservercentral.com/Authors/Articles/David__Baffaleuf/405129/).
This entry was posted in Coding, Storage engine and tagged , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s