Pages

Monday, July 6, 2009

Generic Process to Convert XML Data

Below I have drawn out a general architectural idea. As you can see, the application(s) output data into a general staging XML table. It can be one table per application or one table per server, or table(s) for any other grouping. The decision should be made based on the number of records, number of applications, and practicality. The number of records should be the last concern because application(s) will only insert statistical data and never consume it.



Let start with a simple example of an XML schema and pretend that the statistical data consists of three fields: companyid integer, prodno varchar, and userid integer. To be able to validate the XML value against the registered XML schema, the custom schema has to be created and registered in SQL Server. Remember that the schema registration is per database.
To avoid the schema registration in every database, I created a special staging database per server to hold table(s) with staging statistical data. The same database can be used to hold tables with extracted "raw" statistical data. But I would recommend create another independent database. I will not explain in detail the XML schema because it is not a purpose of the article. Instead I will touch on the necessary parts of the XML syntax's.
drop XML SCHEMA COLLECTION stat
go
create XML SCHEMA COLLECTION stat AS
'
















'
To verify the proper schema registration we can declare the XML variable type of schema "stat" and assign data.
declare @xmlvar1 xml (stat);
set @xmlvar1 = '

6
prd1234
12345

'
As I said before, different applications and different parts of the same application usually output different statistical data. This means that we will have data packed into multiple XML schemas that have to be inserted into one staging table. The following table can be created for such purposes that were described above.
CREATE TABLE [dbo].[XML_TestTbl](
[XML_ID] [bigint] IDENTITY(1,1) NOT NULL primary key,
[Table_NM] AS (('XML_TestTbl')),
[XML_Data] [xml] NOT NULL,
[Schema_NM] [varchar](20) NOT NULL,
[InsertDate] [datetime] NOT NULL default ( getdate() ) )
Computed field [Table_NM] AS (('XML_TestTbl')) is created for the future generic process and needs to be able to link extracted XML record with staging table in case that multiple staging tables exist.
As I mentioned before, some additional standard fields can be added to the table. For example, we can add application name or module code, or both.
The next step before we extract XML data from table XML_TestTbl into standard table with columns, is to verify that data is matching with registered schema in SQL Server. The simplest way to do it is to allow SQL Server to do the job for us by inserting data into another table with a column of XML data type of the registered schema.
declare @verify table (XML_Data XML(stat) );
Insert into @verify(XML_Data)
Select XML_Data from XML_TestTbl where schema_nm = 'stat';
If there are no errors, our verification for this schema is done, and it is possible to load data into regular raw data tables with regular columns by using the following statement as an example.
SELECT STATEMENT 1
select
[XML_Data].query('//customerid').value('.','integer') as customerid,
[XML_Data].query('//prodno').value('.','varchar(max)') as prodno,
[XML_Data].query('//userid').value('.','integer') as userid
from dbo.XML_TestTbl
where schema_nm = 'stat';
As you can see from the statement, each row in XML_TestTbl in field XML_Data has to keep only one row for output to do the easiest XML to column data conversion. To do this, XML inserted into table XML_TestTbl (XML_Data), has to be split to keep only one to one row. For example, it can be done by the code below. However, you can do it by many other ways
declare @xmlvar1 xml ;
set @xmlvar1 = '


6
prd06
12345




3
prd03
12345

'
select T1.rt.query('.')
from @xmlvar1.nodes('/*/*') T1(rt)
As you can see, if this is the case and multiple rows can be sent by application(s) with one XML record, the XML structure has to have another tag between and tags.
One of the issues with SELECT STATEMENT 1 is that the string value does not have a limit. Although in most cases we do know the size of the inserted data. Lets say, prodno has to be varchar(6). Let us limit the size of the prodno in the registration schema by adding schema's own data type.
drop XML SCHEMA COLLECTION stat
go
create XML SCHEMA COLLECTION stat AS
'





















'
declare @xmlvar1 xml (stat);
set @xmlvar1 = '

6
prd12
12345

'

If we are trying to set the value for prodno = ‘prd1234' where data length is more than 6 characters we will have an error:
Msg 6926, Level 16, State 1, Line 4
XML Validation: Invalid simple type value: 'prd1234'. Location: /*:statroot[1]/*:statrecord[1]/*:prodno[1]
Now we know how to add schema custom data type from the base one defined in SQL Server.
As you can see from the example, the first and last line for XML requires a name space of registered XML schema. A namespace is required for the custom data type declaration.
To minimize the amount of data passed by application, add some security, and provide generic decoding solution, which incidentally will be the topic of my next article; it is possible to save XML data in table XML_TestTbl without first and last tags like this:

6
prd12
12345

First and last tags will be added by the generic process to verify data and they are not required for the XML data extraction. However, the schema must be registered with root element tag. SQL Server XML meta data tables will be used in the verification process. Let us see how to get a namespace for the first tag

We need to use sys.xml_schema_collections and sys.xml_schema_namespaces
select sn.name as namespace_nm, sc.name as schema_nm
from sys.xml_schema_collections sc
inner join sys.xml_schema_namespaces sn
on sc.xml_collection_id = sn.xml_collection_id
where sc.name = 'stat'
Table sys.xml_schema_collections © 2008 Microsoft
Returns a row per XML schema collection. An XML schema collection is a named set of XSD definitions. The XML schema collection itself is contained in a relational schema, and it is identified by a schema-scoped Transact-SQL name. The following tuples are unique: xml_collection_id, namespace_id, and schema_id, name.
Column name
Data type
Description
xml_collection_id
int
ID of the XML schema collection. Unique within the database.
schema_id
int
ID of the relational schema that contains this XML schema collection.
principal_id
int
ID of the individual owner if different from the schema owner. By default, schema-contained objects are owned by the schema owner. However, an alternate owner may be specified by using the ALTER AUTHORIZATION statement to change ownership.
NULL = No alternate individual owner.
name
sysname
Name of the XML schema collection.
create_date
datetime
Date the XML schema collection was created.
modify_date
datetime
Date the XML schema collection was last altered.

Table sys.xml_schema_namespaces © 2008 Microsoft
Returns a row per XSD-defined XML namespace. The following tuples are unique: collection_id, namespace_id, and collection_id, and name.
Column name
Data type
Description
xml_collection_id
int
ID of the XML schema collection that contains this namespace.
name
nvarchar (4000)
Name of XML namespace. Blank name indicates no target namespace.
xml_namespace_id
int
1-based ordinal that uniquely identifies the XML namespace in the database.

It means that the next code can be utilized to verify XML data with root element constructed by process.
declare @schemanm varchar(20), @namespacenm varchar(20), @xmlvar1 varchar(1000);
declare @verify table (XML_Data XML(stat) );

set @schemanm = 'stat';

select @namespacenm = sn.name
from sys.xml_schema_collections sc
inner join sys.xml_schema_namespaces sn
on sc.xml_collection_id = sn.xml_collection_id
where sc.name = @schemanm;

set @xmlvar1 = '
6
prd12
12345
'

Insert into @verify(XML_Data)
select ' ' +
cast(@xmlvar1 as varchar(max) ) +
'
'
select xml_data from @verify;

Conclusion
This article describes the theory of generic process architecture that allows store data with different input XML schemas into one staging location and then creates a process that will output data into designated standard tables. It is important for such data that is not required for the immediate consumption and that will be used by applications or reports later with some form of aggregation or the other data processed form. In our case, I used this idea to gather statistical data. Different type of statistical "raw" data is gathered from the application(s) into one table and extracted daily. Then, once a week staging table clean up by the process which deleting processed records. In my next article I will show you how to create generic code for the XML data verification and extraction and I will explain some of the many additional options of the process.

0 comments:

Post a Comment