Friday, February 10, 2006

Database filtering of documents - performance issues

Our application has a requirement to filter the documents a user can see based on various criteria, e.g. security or approval rights. A user may wish to see only the documents which they can approve or have approved. This would seem to be a common pattern in many business applications.

To implement this, our business objects create a list of document identifiers, which is then appended to a SQL statement that selects the data and filters the documents returned. We used to use the IN clause for this, with a pattern such as


SELECT * FROM PurchaseOrders WHERE PurchaseOrderID IN ( CommaSeparatedListOfIDs )

where the list of order IDs was generated by the business object. After usage of the application increased quickly, we encountered a stack overflow exception (http://support.microsoft.com/kb/q288095). To resolve this we changed our approach to use temporary tables, i.e. insert all the IDs from the business object into a temporary table and then use that table to filter the main query, e.g.

SELECT * FROM PurchaseOrders WHERE PurchaseOrderID IN ( SELECT ID FROM UniqueTempTableName )

The IDs are inserted into the temporary table using batched INSERTs, e.g. in batches of 20 statements at a time. However, when upwards of 10000 document identifiers are inserted into the table, this can take over 5 seconds. For example, an ASP screen which presents a list of invoices takes 15 seconds, of which 11 seconds is spent inserting into the temporary table. Over the last day I've been exploring ways to minimise this.
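The temporary table pattern described above can be sketched as follows. This is only an illustration: the table and column names come from the queries in this post, three literal IDs stand in for a real batch of 20, and the UNION ALL form at the end is an untimed variation that I have not measured against the one-statement-per-ID form.

```sql
-- UniqueTempTableName is the per-request table name used in the post.
CREATE TABLE UniqueTempTableName (ID int NOT NULL)

-- One batch as currently sent: one INSERT statement per identifier.
INSERT INTO UniqueTempTableName (ID) VALUES (84)
INSERT INTO UniqueTempTableName (ID) VALUES (85)
INSERT INTO UniqueTempTableName (ID) VALUES (86)

-- Possible variation (assumption, not timed): one INSERT per batch,
-- with the identifiers supplied as a UNION ALL of single-row SELECTs.
INSERT INTO UniqueTempTableName (ID)
SELECT 87 UNION ALL
SELECT 88 UNION ALL
SELECT 89

-- Filter the main query against the populated table.
SELECT *
FROM PurchaseOrders
WHERE PurchaseOrderID IN (SELECT ID FROM UniqueTempTableName)

-- Clean up once the request has finished.
DROP TABLE UniqueTempTableName
```

Whether the single-statement form actually reduces the insert time would need to be checked against the 10000-identifier case.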

Extended stored procedure
My first thought was to create an extended stored procedure which would build an in-memory result set from a comma-separated list of identifiers. The extended stored procedure would be called from a UDF, which would return a table variable that could be used in a SELECT statement, e.g.

CREATE FUNCTION dbo.Tfn_IDs(@listIdentifiers text)
RETURNS @idTable TABLE (id int)
AS
BEGIN
    -- Rejected by SQL Server: INSERT ... EXEC is not allowed inside a function
    INSERT INTO @idTable
    EXEC master..xp_IDContainer @listIdentifiers
    RETURN
END


Unfortunately this does not work: inside a UDF, EXEC cannot be used as the source for an INSERT INTO statement.
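As far as I know, the restriction is specific to function bodies (and, on SQL Server 2000, to table variables as the target). In an ordinary batch or stored procedure, INSERT ... EXEC into a temporary table is accepted. The sketch below shows that form; the dynamic SQL string is only a stand-in for the xp_IDContainer call, which does not exist yet.

```sql
-- Temporary table that receives the rows produced by EXEC.
CREATE TABLE #ids (id int NOT NULL)

-- Stand-in for the hypothetical call, which would read:
--   INSERT INTO #ids (id) EXEC master..xp_IDContainer @listIdentifiers
INSERT INTO #ids (id)
EXEC ('SELECT 84 UNION ALL SELECT 85')

-- The collected identifiers can now filter the main query.
SELECT id FROM #ids

DROP TABLE #ids
```

This gives up the convenience of calling a UDF directly from the SELECT statement, so it is an alternative shape rather than a fix for the function above.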

OPENXML
I then realised I may be able to generate an XML representation of the identifiers using the extended stored procedure, e.g. in the format

<values>
<value id="84"/>
<value id="85"/>
</values>

and then use the XML in a SQL OPENXML statement using

INSERT INTO @IDTable
SELECT id
FROM OPENXML (@hDoc, '/values/value')
WITH (id int)
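For completeness, here is a minimal end-to-end sketch of the OPENXML approach, run as a plain batch rather than inside a UDF. The XML is a hard-coded literal for illustration. Element and attribute names are case-sensitive, so the row pattern and the WITH column name are written in lowercase to match the document.

```sql
DECLARE @hDoc int
DECLARE @xml varchar(8000)   -- local variables cannot be of type text, so the demo uses varchar
DECLARE @IDTable TABLE (id int)

SET @xml = '<values><value id="84"/><value id="85"/></values>'

-- Parse the XML text and obtain a document handle.
EXEC sp_xml_preparedocument @hDoc OUTPUT, @xml

-- Flag 1 requests attribute-centric mapping, so column "id" reads attribute "id".
INSERT INTO @IDTable (id)
SELECT id
FROM OPENXML (@hDoc, '/values/value', 1)
WITH (id int)

-- Release the memory held by the parsed document.
EXEC sp_xml_removedocument @hDoc

SELECT id FROM @IDTable
```

The document handle should always be released with sp_xml_removedocument, since the parsed document is held in server memory until then.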


UDF Only
I then went back to an approach that uses only a UDF, without an extended stored procedure. The UDF parses the comma-separated list of values and inserts each one into a table variable, which it returns. I won't outline the solution in detail, but it uses CHARINDEX to find the start and end position of each identifier, with the comma as the delimiter. It would be interesting to compare the timings of this and the OPENXML approach. The XML approach could be slower because of the more verbose structure of the data, the cost of building the XML document internally in SQL Server, and the cost of evaluating XPath expressions. However, that may be offset by using a single INSERT INTO statement rather than one INSERT INTO statement for each identifier found and parsed in the UDF-only approach.
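A minimal sketch of the CHARINDEX-based parsing might look like the following. The function name dbo.Tfn_SplitIDs is made up for this example, the parameter is declared as varchar(8000), and the input is assumed to be well formed (integers separated by single commas, no empty entries).

```sql
CREATE FUNCTION dbo.Tfn_SplitIDs (@list varchar(8000))
RETURNS @idTable TABLE (id int)
AS
BEGIN
    DECLARE @start int, @comma int
    SET @start = 1
    SET @comma = CHARINDEX(',', @list, @start)

    -- Each pass extracts the identifier between @start and the next comma.
    WHILE @comma > 0
    BEGIN
        INSERT INTO @idTable (id)
        VALUES (CAST(SUBSTRING(@list, @start, @comma - @start) AS int))

        SET @start = @comma + 1
        SET @comma = CHARINDEX(',', @list, @start)
    END

    -- The final identifier has no trailing comma.
    IF @start <= LEN(@list)
        INSERT INTO @idTable (id)
        VALUES (CAST(SUBSTRING(@list, @start, LEN(@list) - @start + 1) AS int))

    RETURN
END
GO

-- Usage: filter the main query with the parsed identifiers.
SELECT *
FROM PurchaseOrders
WHERE PurchaseOrderID IN (SELECT id FROM dbo.Tfn_SplitIDs('84,85,86'))
```

GO is the batch separator used by the query tools, needed here because CREATE FUNCTION must be the first statement in its batch.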

However, after increasing the size of the input to this solution, it quickly became apparent that CHARINDEX truncates text data passed to it to varchar(8000), so it won't work with a long text value containing 10000 identifiers with an average of 3 characters each (roughly 40000 characters including the commas).

Back to a UDF + extended stored procedure
To overcome the issues faced in the UDF-only solution, an extended stored procedure would probably be the logical place to put the logic that parses the comma-separated string into values for an insert statement. However, instead of using the result set of an EXEC call as the source of an INSERT INTO statement, which we've shown isn't possible, we could use an output parameter to get each identifier. Pseudo T-SQL could look like

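-- The xp_IDContainer_* procedures are proposed, not yet written; values come back via OUTPUT parameters.
-- @idTable is the table variable returned by the enclosing UDF.
DECLARE @sessionToken int, @atEnd int, @ID int
EXEC master..xp_IDContainer_ParseValue '1,2,3', @sessionToken OUTPUT
EXEC master..xp_IDContainer_EndOfResults @sessionToken, @atEnd OUTPUT
WHILE @atEnd = 0
BEGIN
    EXEC master..xp_IDContainer_GetNextID @sessionToken, @ID OUTPUT
    INSERT INTO @idTable (id) VALUES (@ID)
    EXEC master..xp_IDContainer_EndOfResults @sessionToken, @atEnd OUTPUT
END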


Has anyone solved this problem using the approaches I've highlighted, or other approaches which are successful with large volumes of data?
