Denali SP1, PASS keynotes and incoming features for the next mountain

If you didn’t get the chance to attend PASS this year, you can still watch the keynotes and a couple of sessions live on SQLPASSTV.

Many announcements were made about the future release of SQL Server, in particular:

– Hekaton project unleashed. This is an in-memory storage engine designed for highly demanding OLTP systems. Removing the need for page latching will improve the performance of existing T-SQL by orders of magnitude (Hekaton is Greek for 100x) without modifying the code. Although the secret had leaked previously (see Aaron Bertrand’s post here) and we were expecting a big performance bump, the demos were very impressive.

– Updatable and clustered ColumnStore indexes: the read-only constraint when adding a columnstore index to a table was a big pain compared to other vendors’ similar technology (I can only think of Sybase’s IQ engine, which has been around for about 20 years). Finally the next release of SQL Server will allow read/write. I really can’t wait to see the technical and performance implications of updating such structures.

– PDW v2: new features like updatable columnstore indexes will also be added to the next release of PDW. Insane query execution times were shown during the demo against 1 PB of data! We have a session on PDW v2 coming up in December at the French SQLRally called ‘Journées SQL’; in case you’re in Paris you can already register right here. Among other features, the Polybase engine allows querying both relational and Hadoop data from SSMS. A demo of this was shown during keynote 2.

SQL Server 2012 Service Pack 1 available: again, an announcement made during keynote 1 by Ted Kummert.

In my humble opinion, the first keynote was much better than the second one. During the Polybase demo, negative comments flooded Twitter. Ironically, the subject of the demo was Twitter sentiment analysis, mining positive and negative feedback from Twitter logs. You may read Brent Ozar’s comments about this here.

As far as I’m concerned, I must admit that I can’t really appreciate what it takes to build a demo like this, but I found that it took a long time to get to the climax (at least what I considered to be the climax), when Julie added a SQL Server replica on Azure at the end of the session. Fortunately, just after the keynote, I watched Adam Machanic‘s session on parallelism and THAT WAS A MOMENT. The greatest session on the live feed so far. I’m still collecting a lot of facts about parallelism, serial zones, MAXDOP and CTFP, and this helped a lot. Thanks Adam!

Today, don’t miss Dr. DeWitt’s session on Big Data; check out the schedule at SQLPASSTV.

Cheers ~David


How about generating object DDL from T-SQL?

Some time ago, I tried to figure out how to generate the DDL of objects from the T-SQL command line. PowerShell lets us do that from the PS command line, as does SMO. But SMO is not usable from SQLCLR so far, so our list of options grows thin.

We’d be tempted to play with the evil feature itself, AKA COM Automation…

Why not ?

If you google around you’ll read many warnings against the use of COM (or OLE) Automation with SQL Server. In most cases they are legitimate, but only under the assumption that the external COM library is loaded into the SQL Server address space, where it can jeopardize the process and all of its memory structures. Actually, there is a way around that.

You can use a DLL host process to host that library, and make it available to SQL Server through a COM server and the OLE Automation stored procedures (sp_OA…). This is done using the @context parameter of the sp_OACreate stored procedure. Quoting from MSDN:

1 = In-process (.dll) OLE server only.
4 = Local (.exe) OLE server only.
5 = Both in-process and local OLE server allowed

When you specify a context value of 4, a local OLE server does not have access to any SQL Server resources, and it cannot damage SQL Server memory or resources.

This means that when you pass @context=4, the DLL is loaded into a separate dllhost.exe process, and can no longer trash any memory structure inside the SQL Server address space. Good.
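The isolation argument can be illustrated outside SQL Server too. Here is a minimal sketch in Python (purely illustrative, no COM involved): code hosted in a separate process can crash as hard as it likes without corrupting the host’s memory.

```python
# Sketch of the out-of-process idea behind @context=4: run fragile code in a
# separate process so that even a hard crash cannot touch the host's memory.
import subprocess, sys

# Simulate a hosted component dying violently with os._exit().
code = "import os; os._exit(42)"
proc = subprocess.run([sys.executable, "-c", code])

assert proc.returncode == 42     # the child process died...
host_state = {"intact": True}
assert host_state["intact"]      # ...but the host keeps running, memory untouched
```

An in-process crash (the @context=1 analogue) would have taken the host down with it; that is the whole point of paying the inter-process marshalling cost.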

Write and register the COM library:

A sample program (I am by no means a C# guru here :)) that calls the SMO Script() method for reverse engineering:

using System;
using System.Collections.Generic;
using System.Collections.Specialized;
using System.Linq;
using System.Text;
using Microsoft.SqlServer.Management.Smo;
using Microsoft.SqlServer.Management.Common;

namespace getDDL
{
     public interface IgetDDL
     {
         string ddlThis(string _srvname, string _dbname, int _objid);
     }
     public class cgetDDL : IgetDDL
     {
         string ddlstr;

         public cgetDDL() { }
         public string ddlThis(string _srvname, string _dbname, int _objid)
         {
             Server Srv = new Server(_srvname);
             Database Db = Srv.Databases[_dbname];
             Table _t = Db.Tables.ItemById(_objid);   // keep the table around: indexes, checks and triggers are scripted from it below
             StringCollection _tablescr = _t.Script();

             foreach (string _line in _tablescr)
                 ddlstr += _line + '\n';
             ddlstr += "GO" + '\n';

             IndexCollection _indexes = _t.Indexes;
             foreach (Index _i in _indexes)
             {
                 StringCollection _iscr = _i.Script();
                 foreach (string _line in _iscr)
                     ddlstr += _line + '\n';
             }
             if (_indexes.Count != 0)
                 ddlstr += "GO" + '\n';

             CheckCollection _checks = _t.Checks;
             foreach (Check _c in _checks)
             {
                 StringCollection _cscr = _c.Script();
                 foreach (string _line in _cscr)
                     ddlstr += _line + '\n';
             }
             if (_checks.Count != 0)
                 ddlstr += "GO" + '\n';

             TriggerCollection _triggers = _t.Triggers;
             foreach (Trigger _trg in _triggers)
             {
                 StringCollection _trgscr = _trg.Script();
                 foreach (string _line in _trgscr)
                     ddlstr += _line + '\n';
             }
             if (_triggers.Count != 0)
                 ddlstr += "GO" + '\n';

             foreach (Column _col in _t.Columns)
             {
                 if (_col.DefaultConstraint != null)
                 {
                     StringCollection _def = _col.DefaultConstraint.Script();
                     foreach (string _line in _def)
                         ddlstr += _line + '\n';
                     ddlstr += "GO" + '\n';
                 }
            }

             return ddlstr;
         }
     }
}

The ddlThis() method is published through the public IgetDDL interface. You’ll need to set ComVisible to true in the AssemblyInfo.cs file, and sign the assembly:

[assembly: ComVisible(true)]

Here comes the COM part

Now we need to register the library and expose the type library file (.tlb) to the COM Server. We’ll use regasm to do that:

Your .NET path here>regasm /codebase getDDL.dll /tlb:getDDL.tlb
Microsoft (R) .NET Framework Assembly Registration Utility 2.0.50727.3053
Copyright (C) Microsoft Corporation 1998-2004. All rights reserved.
Types registered successfully
Assembly exported to 'C:\DBA\DEV\C#\getDDL\bin\Debug\getDDL.tlb', and the type library was registered successfully

Once the TLB file has been generated, we can fire up the comexp.msc snap-in and create an empty COM+ application called getDDL. I’ll give the COM server the same service account as SQL Server, so it has elevated privileges, but you can create a dedicated service account and declare it as sysadmin in SQL Server.

And finally, just drag and drop the TLB file inside the components of your new application and check that the exposed ddlThis() is visible as expected:

The service can now be started, and the presence of dllhost.exe checked in Task Manager. Next step, back to some T-SQL.

Last part, the sp_OA procedure:

First thing, we need to authorize SQL Server to run our very dangerous code:

sp_configure 'Ole Automation Procedures',1;
GO
reconfigure;
GO

And create our usp_getDDL encapsulating all the OA calls to the COM server:

create procedure usp_getDDL (@tablename varchar(128))
AS
BEGIN
     DECLARE @hr int, @obj int;
     DECLARE @Source nvarchar(255), @Desc nvarchar(255);
     DECLARE @DDLTEXT nvarchar(4000);
     DECLARE @dbname sysname, @objid int, @srvname sysname;

     select @srvname = @@servername
     select @dbname= db_name()
     select @objid = object_id(@tablename)
     if (@objid is NULL)
     BEGIN
         print 'Table '+@tablename+' not found in the current database'
         RETURN
     END

     execute @hr = sp_OACreate 'getDDL.cgetDDL', @obj OUT, 4          --> @context = 4, DLL hosted in a separate process
     if (@hr <> 0)
     BEGIN
         EXEC sp_OAGetErrorInfo @obj,
         @Source OUT,
         @Desc OUT;
         SELECT HR = convert(varbinary(4),@hr),
         Source=@Source,
         Description=@Desc;
         RETURN
     END

     execute @hr = sp_OAMethod @obj, 'ddlThis', @DDLTEXT OUT, @srvname, @dbname, @objid
     if (@hr <> 0)
     BEGIN
         EXEC sp_OAGetErrorInfo @obj,
         @Source OUT,
         @Desc OUT;
         SELECT HR = convert(varbinary(4),@hr),
         Source=@Source,
         Description=@Desc;
         RETURN
     END

     execute @hr = sp_OADestroy @obj

     select @DDLTEXT as 'DDL'
END

You’ll notice sp_OACreate’s @context=4, which is of extreme importance in our case. At last, we are able to retrieve object DDL using the stored procedure from a SQLCMD prompt:

DOS>sqlcmd -E
1> use CAPDATA
2> go
Changed database context to 'CAPDATA'.
1> exec usp_getddl 'ANNONCE'
2> go

DDL
------------------------------------------------------------------------------------------------------

SET ANSI_NULLS ON
SET QUOTED_IDENTIFIER ON
CREATE TABLE [dbo].[ANNONCE](
 [ID_ANNONCE] [int] NOT NULL,
 [ID_JOURNAL] [int] NOT NULL,
 [ID_PROPALOUER] [int] NOT NULL,
 [DATETIME_ANNONCE] [datetime] NOT NULL,
 [PRIX] [numeric](6, 2) NULL,
 [NB_PARUTION] [int] NULL
) ON [PRIMARY]

GO
ALTER TABLE [dbo].[ANNONCE] ADD  CONSTRAINT [PK_ANNONCE] PRIMARY KEY CLUSTERED
(
 [ID_ANNONCE] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF,
ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
GO

In the end, we have found a modest but safe way to make COM Automation and SQL Server work together. BTW, I have found another very interesting initiative on CodePlex using SMO and Service Broker, by fellow SQL Server MVP Simon Sabin.

Cheers,
~David


SATA and write-through

Basically, this post originates in an inspiring comment from James Hamilton a couple of years ago on SATA and write reliability.

We all know that enterprise-class transactional database systems use concepts such as ARIES to enforce transactional consistency. SQL Server implements ARIES semantics using what is called the write-ahead protocol. It means that any modification to data in a transaction requires all of the log records up to and including that modification to be hardened on a medium that can survive a power outage. These log records must be written ahead of the data, hence the name.
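As a toy illustration of the write-ahead rule (not SQL Server’s actual implementation, just the invariant), here is a minimal sketch: a dirty page may be flushed only once the log records covering its last modification are durable.

```python
# Minimal write-ahead logging sketch (illustrative only, not SQL Server's engine).
# Invariant: a dirty page can be flushed to disk only after all log records
# up to and including the page's last-modifying LSN have been hardened.

class MiniWal:
    def __init__(self):
        self.next_lsn = 1
        self.log_buffer = []      # log records not yet on stable storage
        self.flushed_to_lsn = 0   # highest LSN hardened on disk

    def log(self, record):
        """Append a log record; return its LSN."""
        lsn = self.next_lsn
        self.next_lsn += 1
        self.log_buffer.append((lsn, record))
        return lsn

    def harden_log(self, up_to_lsn):
        """Simulate a write-through flush of the log up to a given LSN."""
        self.log_buffer = [(l, r) for (l, r) in self.log_buffer if l > up_to_lsn]
        self.flushed_to_lsn = max(self.flushed_to_lsn, up_to_lsn)

    def can_flush_page(self, page_lsn):
        """WAL rule: the page's last-modifying LSN must already be durable."""
        return page_lsn <= self.flushed_to_lsn

wal = MiniWal()
lsn = wal.log("UPDATE t SET c = 1")
assert not wal.can_flush_page(lsn)   # log not hardened yet: the page must wait
wal.harden_log(lsn)
assert wal.can_flush_page(lsn)       # log durable: the page may now be written
```

The `harden_log` step is precisely where write-through matters: if the "flush" is silently absorbed by a disk cache, the invariant is broken without anyone noticing.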

The concept of ‘hardening’ is technically implemented using a special feature named write-through. This feature is exposed in the Win32 API as a flag passed to a file object during its initialization with CreateFile(), FILE_FLAG_WRITE_THROUGH. Quoting MSDN:

“Write operations will not go through any intermediate cache, they will go directly to disk.”

SQL Server uses FILE_FLAG_WRITE_THROUGH to open data and transaction log files. Now, the way this flag is interpreted depends on the kind of storage interface beneath the IO manager. For example, the SCSI protocol handles this flag using a special ForceUnitAccess (FUA) bit in a command descriptor block. A write CDB looks like this:
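Assuming the standard WRITE(10) layout from the SCSI Block Commands specification (where FUA is bit 3 of byte 1), a sketch of building such a CDB:

```python
# Sketch of a SCSI WRITE(10) CDB (10 bytes). The FUA (Force Unit Access) flag
# is bit 3 of byte 1; when set, the target must write through to the medium,
# not just to its volatile cache. Illustrative; see the SBC spec for details.

WRITE_10_OPCODE = 0x2A
FUA_BIT = 0x08  # bit 3 of byte 1

def build_write10_cdb(lba, num_blocks, fua=False):
    cdb = bytearray(10)
    cdb[0] = WRITE_10_OPCODE
    if fua:
        cdb[1] |= FUA_BIT
    cdb[2:6] = lba.to_bytes(4, "big")          # logical block address
    cdb[7:9] = num_blocks.to_bytes(2, "big")   # transfer length in blocks
    return bytes(cdb)

cdb = build_write10_cdb(lba=0x1000, num_blocks=8, fua=True)
assert cdb[0] == 0x2A
assert cdb[1] & FUA_BIT                        # FUA requested: bypass write cache
assert not build_write10_cdb(0, 1)[1] & FUA_BIT
```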

Whenever a file is opened or created with FILE_FLAG_WRITE_THROUGH, the SCSI Storport driver translates this by turning FUA on.

This is where our story begins, because other protocols derived from ATAPI / IDE specification, including S-ATA, usually ignore this flag. Many SATA disks out there explicitly do not support the feature:

sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't
support DPO or FUA

What does that mean? Simply that while the Windows kernel respects the requirement carried by the write-through flag, the default SATA miniport driver just clears it for performance reasons (see the James Hamilton post mentioned at the top).

I can see you tipping your chair back. Wow wow wow, wait a second. You mean my commodity SATA drives aren’t safe with regard to write persistence? Well, yours, I don’t know. But mine isn’t. Here’s how you can find out.

Diving into the IO waters

Before going any further, some interesting reading: Handling IRPs: What Every Driver Writer Needs to Know.

Anytime SQL Server opens a data or transaction log file, it calls CreateFile(). The kernel then transfers control to the Object Manager and initializes a FILE_OBJECT structure, which represents the file along with its attributes. When CreateFile() has FILE_FLAG_WRITE_THROUGH set, the .Flags bitmask of the FILE_OBJECT will have the FO_WRITE_THROUGH property on, represented by a value of 0x00000010 as per winddk.h. This means that the Object Manager has understood the requirement and pushed it into the object structure.

Then when a write occurs, using WriteFile() / WriteFileGather() depending on the target, control is transferred to the IO Manager. Every time an IO has to be carried out, the IO manager creates an IO Request Packet (IRP) to track it. IRPs are pure kernel structures and cannot be observed in user mode.

Every IRP has a number of associated stack locations called IO_STACK_LOCATIONs, one per driver on the IO path. The header points to the current IO_STACK_LOCATION and is updated as each driver passes the IRP to the next driver in the stack.

As with the FILE_OBJECT, an IRP also has a .Flags member to embed active switches. One of them is SL_WRITE_THROUGH, a sort of counterpart of FO_WRITE_THROUGH, which equals 0x04 (see winddk.h for the complete list of .Flags members and bitmask values). Our mission, should we choose to accept it: track down an IRP during a write-through write on a SATA device, and check whether the requirement is propagated properly to the IO_STACK_LOCATION corresponding to our driver.
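The two bitmask tests we are about to perform in the debugger reduce to simple ANDs; a quick sketch with the winddk.h values quoted above (the sample Flags values are the ones observed in the debugging session below):

```python
# The two write-through markers, as quoted from winddk.h above.
FO_WRITE_THROUGH = 0x00000010   # FILE_OBJECT.Flags
SL_WRITE_THROUGH = 0x04         # stack location / IRP Flags

def has_flag(flags, mask):
    return (flags & mask) != 0

# Values observed in the debugging session:
file_object_flags = 0x4101a     # from dt nt!_FILE_OBJECT
irp_flags = 0x00000043          # from !irp

assert has_flag(file_object_flags, FO_WRITE_THROUGH)   # Object Manager kept it
assert not has_flag(irp_flags, SL_WRITE_THROUGH)       # lost on the way down
```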

Down the rabbit hole:

I’m not going to attach sqlservr to a kernel debugger. I’ve tried, scratched my head so much I’m almost bald now, thanks. Yeah, bald is beautiful, I know. Seriously, I found it impractical for what I’m looking for (catching a live write IRP). I decided to code a small C program posting write-through, sector-aligned, synchronous IOs every 1 ms, and then to look at it in LiveKD.

// -------------------------------------------------------------------------------------
// writeThrough.cpp : writes characters in a loop to a file opened in write-through /
// no-buffering mode. Used with livekd, it lets us inspect the contents of the
// IO_STACK_LOCATIONs in the corresponding IRP, to check whether the SL_WRITE_THROUGH
// bit is properly propagated down to the driver (the ATA / IDE / S-ATA problem).
//
// An IOCTL (IOCTL_DISK_GET_DRIVE_GEOMETRY) is used to retrieve the sector size.
//
// dbaffaleuf@capdata.fr (c) CapData Consulting 2011
//
// --------------------------------------------------------------------------------------

#include <stdafx.h>
#include <windows.h>
#include <stdlib.h>
#include <winioctl.h>
#include <direct.h>

int _tmain(int argc, _TCHAR* argv[])
{
if (argc!=2)
{
    printf("Usage is:  writeThrough outputfilename\n");
    return 1;
}

// Retrieve the disk sector size with an IOCTL ---------------------------------------------------
DISK_GEOMETRY dg;
HANDLE hDrive;
DWORD ioctlJnk;

hDrive=CreateFile(TEXT("\\\\.\\PhysicalDrive0")
      ,0
      ,FILE_SHARE_READ
      | FILE_SHARE_WRITE
      ,NULL
      ,OPEN_EXISTING
      ,0
      ,NULL);

if (INVALID_HANDLE_VALUE==hDrive)
{
      printf("Unable to open drive PhysicalDrive0. Last error=%d\n",GetLastError());
      return 1;
}

if (! DeviceIoControl(hDrive
      ,IOCTL_DISK_GET_DRIVE_GEOMETRY
      ,NULL
      ,0
      ,&dg
      ,sizeof(dg)
      ,&ioctlJnk
      ,NULL) )
{
      printf("Error on IOCTL (IOCTL_DISK_GET_DRIVE_GEOMETRY) to PhysicalDrive0.  Last error=%d\n",GetLastError());
      return 1;
}

CloseHandle(hDrive);

// Open the output file with write-through + no buffering ---------------------------------------------------------
HANDLE hOutputFile=CreateFile(argv[1]
      ,GENERIC_READ
      | GENERIC_WRITE
      ,0
      ,NULL
      ,CREATE_ALWAYS
      ,FILE_ATTRIBUTE_NORMAL
      | FILE_FLAG_WRITE_THROUGH
      | FILE_FLAG_NO_BUFFERING
      ,NULL);

if (INVALID_HANDLE_VALUE==hOutputFile)
{
      printf("Unable to open file %s.  Last error=%d\n",argv[1],GetLastError());
      return 1;
}

// Initialize the buffer and start the write loop ------------------------------------------------------------
// FILE_FLAG_NO_BUFFERING also requires the buffer *address* to be sector aligned,
// so we use VirtualAlloc (page aligned) rather than HeapAlloc.
wchar_t *csBuffer = (wchar_t *)VirtualAlloc(NULL,dg.BytesPerSector * sizeof(wchar_t),MEM_COMMIT|MEM_RESERVE,PAGE_READWRITE);
int len = swprintf_s(csBuffer,dg.BytesPerSector,L"abcdefghijklmnopqrstuvwxyz");
DWORD dwBytesWritten;
DWORD dwTotalBytesWritten=0;

do
{
      if (! WriteFile(hOutputFile
      ,csBuffer
      ,dg.BytesPerSector
      ,&dwBytesWritten
      ,NULL))
     {
           printf ("WriteFile failed with error %d.\n", GetLastError());
           return 1;
     }

     dwTotalBytesWritten+=dwBytesWritten;
     printf("Write complete, %d bytes written\n", dwTotalBytesWritten);
     Sleep (1);

} while (TRUE);

VirtualFree(csBuffer,0,MEM_RELEASE);
CloseHandle(hOutputFile);

return 0;
}

OK, nothing special here, except that opening with FILE_FLAG_NO_BUFFERING requires posting sector-aligned writes, so we need to collect the sector size using an IOCTL; that’s how SQL Server does it. Let’s run this:

C:\DBA\DEV\CPP\writeThrough\Debug>writeThrough.exe test.txt
Write complete, 512 bytes written
Write complete, 1024 bytes written
Write complete, 1536 bytes written
Write complete, 2048 bytes written
Write complete, 2560 bytes written
Write complete, 3072 bytes written
...

In the meantime, you will obviously have to install a kernel debugger such as LiveKD and set the proper symbol path:

set _NT_SYMBOL_PATH=srv*c:\temp\symbols*http://msdl.microsoft.com/download/symbols   

You may need to run bcdedit /debug on to enable kernel-mode debugging, and reboot the box. Now we need to find our process in the kernel address space, look for the handle of our output file, inspect the FILE_OBJECT, and try to display any on-the-fly IRP we can find.

lkd> !sym quiet
quiet mode - symbol prompts on

lkd> !process 0 0 writeThrough.exe
PROCESS 881b39d0  SessionId: 0  Cid: 0c08    Peb: 7ffde000  ParentCid: 0960
DirBase: 0aa80740  ObjectTable: e4168638  HandleCount:   9.
Image: writeThrough.exe

OK here’s our process, we can define it as the current process we’ll look at:

lkd> .process 881b39d0
Implicit process is now 881b39d0

Then we need to enumerate the handles owned by this process, and try to look for our test.txt output file:

0: kd> !handle 0 3
(...)
07dc: Object: 8701f1b0  GrantedAccess: 0012019f Entry: e3baefb8
Object: 8701f1b0  Type: (8a57a900) File
ObjectHeader: 8701f198 (old version)
HandleCount: 1  PointerCount: 1
Directory Object: 00000000  Name: \CAPDATA\DEV\DEV++\writeThrough\Debug\test.txt {HarddiskVolume3}
(...)

Here it is, Object: 8701f1b0. Let’s inspect the FILE_OBJECT structure for test.txt:

0: kd> dt nt!_FILE_OBJECT 8701f1b0
+0x000 Type             : 0n5
+0x002 Size             : 0n112
+0x004 DeviceObject     : 0x8a4a5a30_DEVICE_OBJECT
+0x008 Vpb              : 0x8a4a71e8 _VPB
+0x00c FsContext        : 0xe6be4928 Void
+0x010 FsContext2       : 0xe6be4a80 Void
+0x014 SectionObjectPointer : 0x8839f5cc _SECTION_OBJECT_POINTERS
+0x018 PrivateCacheMap  : (null)
+0x01c FinalStatus      : 0n0
+0x020 RelatedFileObject : 0x88752ec0 _FILE_OBJECT
+0x024 LockOperation    : 0 ''
+0x025 DeletePending    : 0 ''
+0x026 ReadAccess       : 0x1 ''
+0x027 WriteAccess      : 0x1 ''
+0x028 DeleteAccess     : 0 ''
+0x029 SharedRead       : 0 ''
+0x02a SharedWrite      : 0 ''
+0x02b SharedDelete     : 0 ''
+0x02c Flags            : 0x4101a
+0x030 FileName         : _UNICODE_STRING "\CAPDATA\DEV\DEV++\writeThrough\Debug\test.txt"
+0x038 CurrentByteOffset : _LARGE_INTEGER 0x7c00
+0x040 Waiters          : 0
+0x044 Busy             : 0
+0x048 LastLock         : (null)
+0x04c Lock             : _KEVENT
+0x05c Event            : _KEVENT
+0x06c CompletionContext : (null)

Look at the .Flags member. Remember the Object Manager records the write-through requirement by setting the FO_WRITE_THROUGH (0x00000010) in the bitmask:

0: kd> ? (0x4101a & 0x00000010)
Evaluate expression: 16 = 00000010

Yep, here it is. For further investigation, we need to keep track of the device object identifier:

+0x004 DeviceObject     : 0x8a4a5a30_DEVICE_OBJECT

Hang on !

That’s where the fun really begins. You may need to run the following code more than once to be able to catch an active IRP on the fly; these things are really transient:

lkd> .foreach /pS 1 /ps 3 (irpaddr {.foreach /pS 21 /ps 9 (irpaddressblk {!irpfind 0 0 device 0x8a4a5a30}) { .echo ${irpaddressblk} } }) { !irp ${irpaddr} 1 }

You’ll have to put everything on one line, which makes it hard to read. Basically we have two nested loops; the inner one returns the IRP addresses for our device object:

irpaddr {.foreach /pS 21 /ps 9 (irpaddressblk {!irpfind 0 0 device 0x8a4a5a30}

And the outer one tries to print the contents of each IRP structure by calling !irp on the returned IRP address (for more details, have a look at the DDK documentation):

.foreach /pS 1 /ps 3 (...) { .echo ${irpaddressblk} } }) { !irp ${irpaddr} 1 }
 

And if you are lucky, you may catch one active IRP on its way to the driver:

Irp is active with 10 stacks 6 is current (= 0x882a312c)
Mdl=88195a60: No System Buffer: Thread 881059e8:  Irp stack trace.
Flags = 00000043
ThreadListEntry.Flink = 882a3018
ThreadListEntry.Blink = 882a3018
IoStatus.Status = 00000000
IoStatus.Information = 00000000
RequestorMode = 00000000
Cancel = 00
CancelIrql = 0
ApcEnvironment = 00
UserIosb = a0e60878
UserEvent = a0e60820
Overlay.AsynchronousParameters.UserApcRoutine = 00000000
Overlay.AsynchronousParameters.UserApcContext = 00000000
Overlay.AllocationSize = 00000000 - 00000000
CancelRoutine = 00000000
UserBuffer = 88083000
&Tail.Overlay.DeviceQueueEntry = 882a3048
Tail.Overlay.Thread = 881059e8
Tail.Overlay.AuxiliaryBuffer = 00000000
Tail.Overlay.ListEntry.Flink = 00000000
Tail.Overlay.ListEntry.Blink = 00000000
Tail.Overlay.CurrentStackLocation = 882a312c
Tail.Overlay.OriginalFileObject = 899ef028
Tail.Apc = 00000000
Tail.CompletionKey = 00000000
cmd  flg cl Device   File     Completion-Context
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[  0, 0]   0  0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
>[  4, 0]   0  0 8a483770 00000000 f77077ca-8a483548
\Driver\Disk
Args: 00000000 00000000 00000000 00000000
[  4, 0]   0  0 8a483548 00000000 f7238962-8a4a5ae8
\Driver\PartMgr
Args: 00000000 00000000 00000000 00000000
[  4, 0]   0  0 8a4a5a30 00000000 f74a7680-8a48a7b8
\Driver\Ftdisk
Args: 00000000 00000000 00000000 00000000
[  4, 0]   0  0 8a48a700 00000000 f7063e3f-a0e6045c
\Driver\VolSnap
Args: 00000000 00000000 00000000 00000000
[  4, 0]   0  0 8979c020 00000000 00000000-00000000
\FileSystem\Ntfs
Args: 00000000 00000000 00000000 00000000

Nice. At the bottom we can see the stack locations piled up (Ntfs -> VolSnap -> FtDisk -> PartMgr -> Disk). The current location, marked with ‘>’ on \Driver\Disk with major,minor = [4,0], shows this is an IRP_MJ_WRITE, exactly what we are after. And what about the .Flags member?

lkd> ? (0x00000043 & 0x4)
Evaluate expression: 0 = 00000000

Alas, no SL_WRITE_THROUGH here. The FO_WRITE_THROUGH has been dropped somewhere along the way, and this write-through operation may end up right in the disk cache with a nice ack to the caller. You could argue that I’d only need to uncheck the write-cache checkbox for the disk.

Well, according to this article from MSR and the Disk Subsystem Performance whitepaper:

“There is no guarantee that the write caching will actually be disabled. In many cases a controller cache is only configurable using a special utility provided by the manufacturer.”

What then ?

Bottom line: use enterprise-class disks for your data and transaction log files. And by that I mean SCSI.

Backups are different, because their IOs are buffered, and SQL Server issues a call to FlushFileBuffers() to make sure the on-board caches are properly flushed to the media. That directive is correctly honored by SATA, so you can use commodity disks for backups. You can check this by setting a breakpoint on kernel32!FlushFileBuffers with sqlservr attached in windbg during a backup:

0:018> bp kernel32!FlushFileBuffers
0:018> g
Breakpoint 0 hit
eax=000006f8 ebx=35a364e0 ecx=651af380 edx=7c91e4f4 esi=35a364e0 edi=3602ddc0
eip=7c8126d1 esp=651af3c8 ebp=651af3f0 iopl=0         nv up ei pl nz na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000206
kernel32!FlushFileBuffers:
7c8126d1 8bff            mov     edi,edi
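As an aside, the flush-after-buffered-writes pattern is not Windows specific: os.fsync() is the POSIX analogue of FlushFileBuffers(). A minimal sketch of the pattern (illustrative, not SQL Server’s backup code):

```python
# Buffered writes followed by an explicit cache flush: the pattern backups use.
# os.fsync() is the POSIX analogue of Win32 FlushFileBuffers(); it asks the OS
# (and, for well-behaved drives, the on-board cache) to push data to the medium.
import os, tempfile

path = os.path.join(tempfile.mkdtemp(), "backup.bak")
with open(path, "wb") as f:
    f.write(b"backup payload")   # goes through the buffer cache
    f.flush()                    # drain the user-space buffer
    os.fsync(f.fileno())         # then force OS buffers down to the device

assert os.path.getsize(path) == len(b"backup payload")
```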

Actually, the situation is not as dramatic as it seems. Many RAID controllers have a battery-backed cache and do not need to honor the write-through requirement. And we can hope that in future designs, judicious calls to FlushFileBuffers() will be made in appropriate scenarios to minimize the places where transactional consistency is in danger.

I’m looking forward to hearing your thoughts on this. Keep in touch ~David.

PS: you can read this article in French right here.


A word about the Keynotes series

When I started as a Sybase DBA more than a decade ago, there were no blogs, no community, and very little documentation on how the storage engine basically worked. So I did what everybody was doing at that time: keeping keynotes in a notebook. The notebook soon became filled with crappy notes, and I started a second one, then a third one, etc…

You may find many useful pieces of information here and there about how SQL Server manages memory, handles IO, scheduling and query optimization, but apart from MSDN (which is good but not exhaustive), everything is scattered across many places, so having my own central store of information makes things easier for me.

Don’t get me wrong: these do not pretend to be regular blog posts. Each is rather a bag of information on a subject, and every post will be updated regularly as I gather new stuff. This is close to what Neil Hannon says:

‘The writer writes for himself, not for you’

Writing things down has always been a way for me to grasp concepts and memorize the details of an implementation. I guess I’ll tweet or otherwise communicate whenever anything gets updated.

Keep in touch. ~David


Point in time recovery and fn_dump_dblog()

For starters, and to create some content, I will simply translate articles I’ve written in French over the past 3 years. Actually I am no lazy writer, but it is quite difficult to find original subjects for articles; so much has already been written. In time, I’ll come up with 100% new articles.

For this first one, we’ll discuss point-in-time recovery using an undocumented TVF I bumped into while trying to get my hands on OPENROWSET internals.

I want to do PIT recovery, what are my options?

When you think of point-in-time recovery, the first thing that comes to mind is the STOPAT / STOPATMARK / STOPBEFOREMARK keywords used in conjunction with the RESTORE DATABASE command. You can either stop the restore operation at a date (STOPAT) or at a transaction name (STOPATMARK / STOPBEFOREMARK). But in the end, I find these approaches quite impractical: you never know the exact time your colleague deleted a couple of rows by mistake. If he says the delete occurred at about 13h45, that is quite imprecise in a real-life environment with hundreds of transactions per second. Maybe the delete really occurred at 13h47, and if you decide to restore at 13h45, you will potentially discard committed transactions, which is not good…

Or you could use named transactions to restore at or before a specific transaction. Again, in real life, who uses explicitly named transactions? I know naming transactions is good practice in T-SQL development, but objectively, it is still barely done. The remaining option is another syntax of STOPATMARK / STOPBEFOREMARK:

RESTORE LOG ... WITH STOPBEFOREMARK = 'lsn:<lsn_number>'

Now, the real problem with this syntax: how do you find the LSN value of each transaction in a transaction log backup?

Enter fn_dump_dblog():

If you copy the MDF/LDF files of the resource database and attach them, you will eventually discover this TVF buried in the system objects list:

create function sys.fn_dump_dblog
(
@start    nvarchar (25) = NULL,
@end      nvarchar (25) = NULL,
@devtype  nvarchar (260) = NULL, -- NULL(DISK) | DISK | TAPE | VIRTUAL_DEVICE
@seqnum   Int         = 1,
@fname1   nvarchar (260) = NULL,
@fname2   nvarchar (260) = NULL,
@fname3   nvarchar (260) = NULL,
...
@fname64   nvarchar (260) = NULL
)
returns table
as
return select
[Current LSN],  [Operation],  [Context],  [Transaction ID],   [Tag Bits],
[Log Record Fixed Length],  [Log Record Length],   [Previous LSN],
(...)
from OpenRowset (DBLog, @start, @end, @devtype, @seqnum,
@fname1,  @fname2,  @fname3,  @fname4,  @fname5,
(...)
@fname62,  @fname63,  @fname64)

It takes an awful lot of @fname arguments, and as long as it is neither documented nor supported, I don’t even know whether all those arguments are actually used in the underlying code. But some of them are very interesting: @devtype takes the device type (DISK by default), and @fname1 the path of a transaction log backup. The function returns an output similar to fn_dblog(), including the LSN of each LOP_BEGIN_XACT operator. Let’s have a try.
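One practical wrinkle, and this part is a recipe commonly used in the field rather than anything documented: [Current LSN] comes back as three colon-separated hexadecimal fields, while STOPBEFOREMARK = 'lsn:…' expects a single decimal number, built by converting each field to decimal and padding the block offset to 10 digits and the slot to 5. A sketch:

```python
# Convert an LSN as displayed by fn_dump_dblog()/fn_dblog()
# ('00000014:0000005c:0001', three hex fields) into the single decimal
# form expected by RESTORE ... WITH STOPBEFOREMARK = 'lsn:...'.
# A widely used recipe, but an assumption here: verify it against your build.

def lsn_hex_to_decimal(lsn):
    vlf, block, slot = (int(part, 16) for part in lsn.split(":"))
    # block offset padded to 10 digits, slot number to 5 digits
    return f"{vlf}{block:010d}{slot:05d}"

assert lsn_hex_to_decimal("00000014:0000005c:0001") == "20000000009200001"
```

You would then restore with something like STOPBEFOREMARK = 'lsn:20000000009200001'.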

Oops, I’ve deleted some data by mistake

Forgetting a WHERE clause in a DML statement can happen to anybody, but using fn_dump_dblog() in conjunction with the standby mode of a restore allows us to restore to just before the offending transaction. Let’s create some data to illustrate this:

use master
GO
if exists (select 1 from sys.databases where name='demorecovery')
 drop database demorecovery
GO
create database demorecovery
GO

use demorecovery
GO
create table T1(
 a numeric identity,
 b char(4000) default replicate('b',4000),
 c bigint default round(rand()*100,0))
GO
insert into T1 default values
GO 1000
create unique clustered index IDX_T1C on T1(a)
GO
create index IDX_T1NC__c on T1(c)
GO

And switch to full recovery model:

alter database demorecovery set recovery full
GO
backup database demorecovery to disk='C:\UTRECHT\MSSQL.1\MSSQL\Backup\demorecovery_FULL.bak' with init, stats
GO

Let’s create some transactions and back them up:

delete from T1 where a < 10
GO
backup log demorecovery to disk='C:\UTRECHT\MSSQL.1\MSSQL\Backup\demorecovery_LOG1.bak' with init, stats
GO
delete from T1 where a < 20
GO
backup log demorecovery to disk='C:\UTRECHT\MSSQL.1\MSSQL\Backup\demorecovery_LOG2.bak' with init, stats
GO
delete from T1 where a < 30
GO
backup log demorecovery to disk='C:\UTRECHT\MSSQL.1\MSSQL\Backup\demorecovery_LOG3.bak' with init, stats
GO

Another two transactions, and then the fateful mistake:

delete from T1 where a < 40
GO
delete from T1 where a < 50
GO
delete from T1 where a < 60  -- <-- Oh my 😦
GO
backup log demorecovery to disk='C:\UTRECHT\MSSQL.1\MSSQL\Backup\demorecovery_LOG4.bak' with init, stats
GO

Gone Fishing

The first thing to do is shut down the applications, to avoid further transactions being processed past your potential recovery point. You can switch the database to single user mode with rollback immediate to do that. Then we need to recover the database to the point in time just before the last 10 rows were removed by mistake. We are going to use a combination of restore log with standby and fn_dump_dblog to do this. First, we need to identify the individual transactions embedded in the log backup taken just after the delete. This is where fn_dump_dblog comes in:

select [Current LSN]
 ,Operation
 ,[Transaction ID]
 ,AllocUnitId
 ,[Begin Time]
 ,[Transaction Name]
 ,[End Time]
 ,[Description]
from fn_dump_dblog(DEFAULT, DEFAULT,DEFAULT, DEFAULT,
'C:\UTRECHT\MSSQL.1\MSSQL\Backup\demorecovery_LOG4.bak',
DEFAULT,DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT)
where Operation  = 'LOP_BEGIN_XACT'

Current LSN             Operation         Transaction ID  Begin Time               Transaction Name    Description
----------------------- ----------------- --------------  ------------------------ ------------------- -----------------------------------------------------------------------
00000041:00000338:000e  LOP_BEGIN_XACT    0000:000008b8   2011/07/13 15:59:43:580  DELETE              DELETE;0x0105000000000005150000006aa3aebb1b5c3491642b62e2e8030000
00000042:00000021:0001  LOP_BEGIN_XACT    0000:000008b9   2011/07/13 15:59:43:587  DELETE              DELETE;0x0105000000000005150000006aa3aebb1b5c3491642b62e2e8030000
00000042:00000076:0001  LOP_BEGIN_XACT    0000:000008ba   2011/07/13 15:59:43:630  DELETE              DELETE;0x0105000000000005150000006aa3aebb1b5c3491642b62e2e8030000

So we have three candidates: 00000041:00000338:000e, 00000042:00000021:0001 and 00000042:00000076:0001. We have to try each one manually, using standby mode, to pinpoint the exact recovery point. First, we need to restore the initial set of backups:

restore database demorecovery from disk='C:\UTRECHT\MSSQL.1\MSSQL\Backup\demorecovery_FULL.bak' with stats, norecovery
GO
restore log demorecovery from disk='C:\UTRECHT\MSSQL.1\MSSQL\Backup\demorecovery_LOG1.bak' with stats, norecovery
GO
restore log demorecovery from disk='C:\UTRECHT\MSSQL.1\MSSQL\Backup\demorecovery_LOG2.bak' with stats, norecovery
GO
restore log demorecovery from disk='C:\UTRECHT\MSSQL.1\MSSQL\Backup\demorecovery_LOG3.bak' with stats, norecovery
GO
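A handy detail before going further: the [Current LSN] values returned by fn_dump_dblog can be turned into the stopbeforemark argument with a simple string concatenation, the ‘lsn:0x’ prefix telling RESTORE that the mark is a hexadecimal LSN (this is the format used in the restores below):

```sql
-- Build the STOPBEFOREMARK argument from a [Current LSN] value
declare @lsn nvarchar(25) = '00000042:00000021:0001'  -- second candidate from fn_dump_dblog
select 'lsn:0x' + @lsn as stopbeforemark
-- gives: lsn:0x00000042:00000021:0001
```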

Let’s apply the first transaction, stopping just before the second, and see if our missing data is still here:

restore log demorecovery from disk='C:\UTRECHT\MSSQL.1\MSSQL\Backup\demorecovery_LOG4.bak' with stats,
standby='C:\UTRECHT\MSSQL.1\MSSQL\Backup\stdby_files.bak', stopbeforemark='lsn:0x00000042:00000021:0001'
GO

select state_desc, is_in_standby from sys.databases where name='demorecovery'
GO

state_desc    is_in_standby
------------- --------------
ONLINE        1

use demorecovery
GO
select * from T1 where a < 60
GO

a      b                            c
------ ---------------------------- ----
40     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 31
41     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 28
42     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 77
43     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 18
44     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 12
45     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 42
46     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 2
47     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 46
48     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 39
49     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 74
50     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 45
51     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 2
52     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 32
53     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 10
54     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 2
55     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 99
56     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 71
57     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 38
58     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 2
59     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 41

Still here. So our recovery point is a little further forward. Let’s apply the second transaction, stopping just before the third:

use master
GO
restore log demorecovery from disk='C:\UTRECHT\MSSQL.1\MSSQL\Backup\demorecovery_LOG4.bak' with stats,
standby='C:\UTRECHT\MSSQL.1\MSSQL\Backup\stdby_files.bak', stopbeforemark='lsn:0x00000042:00000076:0001'
GO

use demorecovery
GO
select * from T1 where a < 60
GO

a      b                            c
------ ---------------------------- ----
50     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 45
51     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 2
52     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 32
53     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 10
54     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 2
55     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 99
56     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 71
57     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 38
58     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 2
59     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 41

Closing in. We can recover the last transaction to confirm:

use master
GO
restore log demorecovery from disk='C:\UTRECHT\MSSQL.1\MSSQL\Backup\demorecovery_LOG4.bak' with stats,
standby='C:\UTRECHT\MSSQL.1\MSSQL\Backup\stdby_files.bak'
GO

use demorecovery
GO
select * from T1 where a < 60
GO

a      b                            c
------ ---------------------------- ----

Looks like that’s our point. We need to restore the whole backup set over again and apply all transactions up to LSN ‘0x00000042:00000076:0001’, this time with recovery:

use master
GO
restore database demorecovery from disk='C:\UTRECHT\MSSQL.1\MSSQL\Backup\demorecovery_FULL.bak' with stats, norecovery
GO
restore log demorecovery from disk='C:\UTRECHT\MSSQL.1\MSSQL\Backup\demorecovery_LOG1.bak' with stats, norecovery
GO
restore log demorecovery from disk='C:\UTRECHT\MSSQL.1\MSSQL\Backup\demorecovery_LOG2.bak' with stats, norecovery
GO
restore log demorecovery from disk='C:\UTRECHT\MSSQL.1\MSSQL\Backup\demorecovery_LOG3.bak' with stats, norecovery
GO

restore log demorecovery from disk='C:\UTRECHT\MSSQL.1\MSSQL\Backup\demorecovery_LOG4.bak' with stats,
stopbeforemark='lsn:0x00000042:00000076:0001', recovery
GO

use demorecovery
GO
select * from T1 where a < 60
GO

a      b                            c
------ ---------------------------- ----
50     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 45
51     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 2
52     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 32
53     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 10
54     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 2
55     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 99
56     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 71
57     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 38
58     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 2
59     bbbbbbbbbbbbbbbbbbbbbbbbbbbb 41

Note that there is no way to apply further transactions past the recovery point, even if they appear to be 100% valid: once the database has been recovered, no further log backup can be applied. It means that the sooner you become aware of the problem, the fewer transactions you’ll lose.

Summary:

Finally, a way to recover to the exact point in time you need, thanks to an undocumented function.

See you next time. ~David

Posted in Backup & Recovery

As an introduction…

I’ve been blogging in French for my company for over 3 years, almost exclusively on technical subjects. The main reason I’m opening this space is to try to reach more people, because when you don’t blog in English, you don’t get that much feedback.

I’ll still be blogging in French on my company’s blog, though, but I think there’s room for something other than binary digits and dbcc stuff in a SQL blog. There’s also room for sharing personal thoughts, making fun of myself, and taking a little perspective on my career.

I like the sense of community in this job: meeting people who do the same work as you do, sharing ideas, habits, and philosophy about what you do. This is typically something you wouldn’t put in a corporate blog.

See you soon.

David.

Posted in Other musings