I love situations when the provider of software advertises some feature but does not tell what it really is and how it is implemented.
Both Ask Tom and Stackoverflow don’t really tell you what is it really about.
Well, in such cases it is best to actually see at some code which is parsing the data to tell the difference between binary and CLOB storage for XMLType column.
CLOB storage type
First of all, if you declare an XMLTYPE as a CLOB data this way:
CREATE TABLE TAB1 (
A NUMBER,
B XMLTYPE
) XMLTYPE B STORE AS CLOB;
This is the easy case. Data is stored as some CLOB value, it may be in row, may be out of row. But the inserted XML data is stored exactly as is. With white spaces, formatting, etc.
Binary storage type
The recommended way by Oracle is the binary type. You use it by default:
CREATE TABLE TAB2 (
A NUMBER(3),
B XMLTYPE
);
In this case the database is using some compression to make the size smaller.
The main facts are:
- The data is stored in a binary format,
- All text fields are stored in UTF-8 encoding – you can see the values between binary tags,
- The underlying type to hold XMLType data is BLOB (in or out of row),
- White space is removed,
- All XML namespaces are stored in a dictionary,
- All XML tags names are stored in a dictionary,
- All XML attributes names are stored in a dictionary,
- The dictionaries are shared among all schemas,
- There are different values for keys which originate from distinct namespaces.
And the most tricky thing: before any DML operation takes place you need to secure the dictionaries. Thus and an additional transaction is run to insert the data to the dictionaries. The additional transaction is committed despite the fact of the main transaction being committed or rolled back in the end.
Ok, so where is the dictionary? This is the tricky thing which is not explained.
The dictionaries is stored in in XDB schema in 3 tables + 1 key table.
- XDB.XDB$TTSET – The key table contains the key token.
- XDB.X$NMxxxxxx – Namespace table (xxx is the token read from the token table)
- XDB.X$QNxxxxxx – Dictionary table for XML tags and attributes (xxx is the token read from the token table)
- XDB.X$PTxxxxxxx – Pt table (xxx is the token read from the token table)
Real life examples
Run some test script and find out by yourself:
DECLARE
BEGIN
FOR V IN 0..16999999 LOOP
INSERT INTO TAB2 VALUES(V, xmltype('<A><B>1</B><C ee' || TRIM(TO_CHAR(V, '000000000')) || '="x">Test</C></A>'));
IF MOD(V, 20000) = 19999 THEN
COMMIT
DELETE FROM TAB2;
COMMIT;
END IF;
END LOOP;
END;
/
I hope you did not run the SQL above on any production system, or else you would exploit the size of XDB schema, particularly size of XDB.X$QNxxx. Watch it growing as you insert large number of unique values for XML tag or attribute names.
How to clean up unused/orphaned dictionary data? Refer to database support.
More information.
Try yourself with OpenLogReplicator. The current master branch (version 1.4.0) contains some preview version of supported reading of XMLType data. By default XMLType binary data is written to output in HEX values. After enabling the experimental feature of parsing binary XMLType data (setting “flags”: 65536) you can see OLR trying to parse the data and writing original XML data to output. Have fun.
An interesting fact is that the dictionary keys could be 1 byte, 2 byte or 4 byte. I have not tested that, but I suppose that the data pump export/import should contain also the XDB dictionary data when you export a table with XMLType column. It could happen that a key is 1 byte before export but become 4 bytes after import. Size of the data row would increase.