XML

XML - eXtensible Markup Language

Read in Elmasri & Navathe (EN)

Chapter 12: XML

XML

XML is a markup language more structured than HTML. Its strict hierarchical structure makes it suitable for textual encoding of databases; see Fig 12.3 and below.

The primary role of XML in databases is for transporting the data (eg formatting the result of a query), not storing it natively. There are attempts at defining native XML databases, but it is hard to get around the problem that tagged text is an inefficient storage mechanism. For example, finding records means searching linearly. If you are worried about going to a NoSQL approach because your database is getting too big, XML is not for you!

While the encoding of Fig 12.3 (below) looks promising, note that the <Worker> entries are more complex than they are in our relational database; they contain more than just a key to the EMPLOYEE table and the hours attribute. Arguably, even the employee and hours information really should be in a separate XML section <works_on>....</works_on>.

This method does illustrate the hierarchical structure supported by XML.

In this version, we are using xml only to list the data; we are not using xml to define its format or the table structure.

XML Structured Data: EN Fig 12.3

<?xml version="1.0" standalone="yes">
<Projects>
    <Project>
       <Name>ProductX</Name>
       <Number>1</Number>
       <Location>Bellaire</Location>
       <Dept_no>5</Dept_no>
       <Worker>
          <Ssn>123456789</Ssn>
          <Last_name>Smith</Last_name>
          <Hours>32.5</Hours>
       </Worker>
       <Worker>
          <Ssn>453453453</Ssn>
          <First_name>Joyce</First_name>
          <Hours>20.0</Hours>
       </Worker>
    </Project>

    <Project>
       <Name>ProductY</Name>
       <Number>2</Number>
       <Location>Sugarland</Location>
       <Dept_no>5</Dept_no>
       <Worker>
          <Ssn>123456789</Ssn>
          <Hours>7.5</Hours>
       </Worker>
       <Worker>
          <Ssn>453453453</Ssn>
          <Hours>20.0</Hours>
       </Worker>
       <Worker>
          <Ssn>333445555</Ssn>
          <Hours>10.0</Hours>
       </Worker>
    </Project>
</Projects>

The SQL query that might elicit the above data might be

select p.pname, p.pnumber, p.plocation, e.ssn, e.fname, w.hours
from project p, works_on w, employee e
where p.pnumber = w.pno and w.essn = e.ssn and p.pnumber in (1,2)
order by p.pnumber

The output of the query in MySQL's tabular form is

+----------+---------+-----------+-----------+----------+-------+
| pname    | pnumber | plocation | ssn       | fname    | hours |
+----------+---------+-----------+-----------+----------+-------+
| ProductX |       1 | Bellaire | 123456789 | John     | 32.5 |
| ProductX |       1 | Bellaire | 453453453 | Joyce    | 20.0 |
| ProductY |       2 | Sugarland | 123456789 | John     |   7.5 |
| ProductY |       2 | Sugarland | 333445555 | Franklin | 10.0 |
| ProductY |       2 | Sugarland | 453453453 | Joyce    | 20.0 |
+----------+---------+-----------+-----------+----------+-------+

One advantage of XML is that it is easier to parse by text-based front-ends.

Note that the XML is not merely an alternative format for the five rows above; the information is hierarchical:

ProductX, 1, Bellaire

John, 32.5
Joyce, 20

ProductY, 2, Sugarland

John, 7.5
Franklin, 10
Joyce, 20

XML DTD (Document Type Definition)

One approach to imposing field structure on XML documents is illustrated by Fig 12.4 (reproduced as text below), which shows a small XML DTD (Document Type Definition) document. DTD was the original mechanism for defining data formats in XML. One of the weaknesses of DTD is that it has a very limited set of atomic attribute types; in the example here, #PCDATA (parsed character data) essentially means a string. There is no provision for strings of fixed length, or CHAR versus VARCHAR.

Here is the DTD code:

<!DOCTYPE projects [
    <!ELEMENT projects (project +)>
    <!ELEMENT project (Name, Number, Location, DeptNo?, Workers)
          <!ATTLIST Project ProjId ID #REQUIRED>
    >
    <!ELEMENT Name (#PCDATA)>
    <!ELEMENT Number (#PCDATA)>
    <!ELEMENT Location (#PCDATA)>
    <!ELEMENT DeptNo (#PCDATA)>
    <!ELEMENT Workers (Worker *)>
    <!ELEMENT Worker (SSN, LastName?, FirstName? , hours)>
    <!ELEMENT SSN (#PCDATA)>
    <!ELEMENT LastName (#PCDATA)>
    <!ELEMENT FirstName (#PCDATA)>
    <!ELEMENT hours (#PCDATA)>
] >
/pld/353
The syntax here is
    !IElement
    type_identifier being defined
    type-spec in parentheses: list of attributes, or definition as a list

Attributes followed by ? can be null (that is, can appear 0 or 1 times). Attributes or subtypes followed by + can appear 1 or more times. Attributes or subtypes followed by * can appear 0 or more times.

DTD allows us to specify that

Projects is a sequence of one or more Project objects
A Project has a Name, Number, Location, Dept_no (optional), and Workers. It also has a ProjId atribute.
Workers is a sequence of zero or more Worker objects.
Worker objects have SSn, Last_name (opt), First_name (opt), and Hours.

We can now verify that Fig 12.3 actually fits this form. If we wanted to enforce this, we might change the first line to
<?xml version="1.0" standalone="no">
<!DOCTYPE Projects SYSTEM "proj.dtd">

This allows us to enforce which fields can be present, but doesn't really spell out the full schema involving other tables.

Chapter 16: more about normalization

XML Schema Definition (XSD)

For this we move on to XML schema, or XML Schema Definitions, XSD, illustrated in Fig 12.5. While this format is more complex, it also supports unordered data, and shares a more general XML syntax. Note in the first of these the three entities department, employee (with subtype dependentNameUnique), and project. The locations table will later be listed as a multivalued attribute of Department, and dependents will be a multivalued attribute of Employee. WorksOn will be a multivalued attribute of Employee, and Worker a multivalued attribute of Project.

This format can serve (though perhaps not terribly efficiently) as a replacement for the SQL table definitions for the COMPANY database.

Parts of the XML schema:

defining the main entities of "company"
key and foreign-key constraints

constraints are given names here
xsd:selector is the table name
xsd:field is the attribute name within the table

table definitions

Note in the Employee and Project sections the reference to WorksOn, and in Employee the employeeDependent, and in Department the departmentLocation.

Here is the Fig 12.5 XML schema.

The MySQL command mysqldump can produce xml representations of specific tables. If I run the following command

mysqldump -p --xml plddb employee

I get XML that represents just the employee table. The table definition looks something like this:

    <table_structure name="employee">
        <field Field="fname" Type="varchar(15)" Null="NO" Key="" Extra="" Comment="" />
        <field Field="minit" Type="char(1)" Null="YES" Key="" Extra="" Comment="" />
        <field Field="lname" Type="varchar(15)" Null="NO" Key="" Extra="" Comment="" />
        <field Field="ssn" Type="char(9)" Null="NO" Key="PRI" Extra="" Comment="" />
        <field Field="bdate" Type="date" Null="YES" Key="" Extra="" Comment="" />
        <field Field="address" Type="varchar(30)" Null="YES" Key="" Extra="" Comment="" />
        <field Field="sex" Type="char(1)" Null="YES" Key="" Extra="" Comment="" />
        <field Field="salary" Type="decimal(10,2)" Null="YES" Key="" Extra="" Comment="" />
        <field Field="super_ssn" Type="char(9)" Null="YES" Key="MUL" Extra="" Comment="" />
        <field Field="dno" Type="int(11)" Null="NO" Key="MUL" Extra="" Comment="" />
        ...
    </table_structure>

Then follows the actual data:

        <table_data name="employee">
        <row>
                <field name="fname">John</field>
                <field name="minit">B</field>
                <field name="lname">Smith</field>
                <field name="ssn">123456789</field>
                <field name="bdate">1965-01-09</field>
                <field name="address">731 Fondren, Houston TX</field>
                <field name="sex">M</field>
                <field name="salary">30000.00</field>
                <field name="super_ssn">333445555</field>
                <field name="dno">5</field>
        </row>
        <row>
                <field name="fname">Franklin</field>
                <field name="minit">T</field>
                <field name="lname">Wong</field>
                <field name="ssn">333445555</field>
                <field name="bdate">1955-12-08</field>
                <field name="address">638 Voss, Houston TX</field>
                <field name="sex">M</field>
                <field name="salary">40000.00</field>
                <field name="super_ssn">888665555</field>
                <field name="dno">5</field>
        </row>
 	...
	</table_data>

There is no attempt here at identifying hierarchical relationships, or even to use XML tags that reflect the individual fields.