XML - eXtensible Markup Language
Read in Elmasri & Navathe (EN)
XML
XML is a markup language more structured than HTML. Its strict hierarchical
structure makes it suitable for textual encoding of databases; see Fig
12.3 and below.
The primary role of XML in databases is for transporting the data (eg
formatting the result of a query), not storing it natively. There are
attempts at defining native XML databases, but it is hard to get around the
problem that tagged text is an inefficient storage mechanism. For example,
finding records means searching linearly. If you are worried about going to
a NoSQL approach because your database is getting too big, XML is not for
you!
While the encoding of Fig 12.3 (below) looks promising, note that the
<Worker> entries are more complex than they are in our relational
database; they contain more than just a key to the EMPLOYEE table and the hours attribute. Arguably, even the
employee and hours information really should be in a separate XML section
<works_on>....</works_on>.
This method does illustrate the hierarchical structure supported by XML.
In this version, we are using xml only to list
the data; we are not using xml to define its format or the table structure.
XML Structured Data: EN Fig 12.3
<?xml version="1.0" standalone="yes">
<Projects>
<Project>
<Name>ProductX</Name>
<Number>1</Number>
<Location>Bellaire</Location>
<Dept_no>5</Dept_no>
<Worker>
<Ssn>123456789</Ssn>
<Last_name>Smith</Last_name>
<Hours>32.5</Hours>
</Worker>
<Worker>
<Ssn>453453453</Ssn>
<First_name>Joyce</First_name>
<Hours>20.0</Hours>
</Worker>
</Project>
<Project>
<Name>ProductY</Name>
<Number>2</Number>
<Location>Sugarland</Location>
<Dept_no>5</Dept_no>
<Worker>
<Ssn>123456789</Ssn>
<Hours>7.5</Hours>
</Worker>
<Worker>
<Ssn>453453453</Ssn>
<Hours>20.0</Hours>
</Worker>
<Worker>
<Ssn>333445555</Ssn>
<Hours>10.0</Hours>
</Worker>
</Project>
</Projects>
The SQL query that might elicit the above data might be
select p.pname, p.pnumber, p.plocation,
e.ssn, e.fname, w.hours
from project p, works_on w, employee e
where p.pnumber = w.pno and w.essn = e.ssn and p.pnumber in (1,2)
order by p.pnumber
The output of the query in MySQL's tabular form is
+----------+---------+-----------+-----------+----------+-------+
| pname | pnumber | plocation |
ssn | fname |
hours |
+----------+---------+-----------+-----------+----------+-------+
| ProductX | 1 | Bellaire |
123456789 | John | 32.5 |
| ProductX | 1 | Bellaire |
453453453 | Joyce | 20.0 |
| ProductY | 2 | Sugarland |
123456789 | John | 7.5 |
| ProductY | 2 | Sugarland |
333445555 | Franklin | 10.0 |
| ProductY | 2 | Sugarland |
453453453 | Joyce | 20.0 |
+----------+---------+-----------+-----------+----------+-------+
One advantage of XML is that it is easier to parse by text-based front-ends.
Note that the XML is not merely an alternative format for the five rows
above; the information is hierarchical:
- ProductX, 1, Bellaire
- ProductY, 2, Sugarland
- John, 7.5
- Franklin, 10
- Joyce, 20
XML DTD (Document Type Definition)
One approach to imposing field structure on XML documents is illustrated by
Fig 12.4 (reproduced as text below), which
shows a small XML DTD (Document
Type Definition) document. DTD was the original mechanism for defining data
formats in XML. One of the weaknesses of DTD is that it has a very limited
set of atomic attribute types; in the example here, #PCDATA (parsed
character data) essentially means a string. There is no provision
for strings of fixed length, or CHAR versus VARCHAR.
Here is the DTD code:
<!DOCTYPE projects [
<!ELEMENT projects (project +)>
<!ELEMENT project (Name, Number, Location,
DeptNo?, Workers)
<!ATTLIST Project ProjId ID
#REQUIRED>
>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Number (#PCDATA)>
<!ELEMENT Location (#PCDATA)>
<!ELEMENT DeptNo (#PCDATA)>
<!ELEMENT Workers (Worker *)>
<!ELEMENT Worker (SSN, LastName?, FirstName?
, hours)>
<!ELEMENT SSN (#PCDATA)>
<!ELEMENT LastName (#PCDATA)>
<!ELEMENT FirstName (#PCDATA)>
<!ELEMENT hours (#PCDATA)>
] >
/pld/353
The syntax here is
!IElement
type_identifier being defined
type-spec in parentheses: list of attributes, or
definition as a list
Attributes followed by ? can be null (that is, can appear 0 or 1 times).
Attributes or subtypes followed by + can appear 1 or more times. Attributes
or subtypes followed by * can appear 0 or more times.
DTD allows us to specify that
- Projects is a sequence of one
or more Project objects
- A Project has a Name, Number,
Location, Dept_no (optional), and Workers.
It also has a ProjId atribute.
- Workers is a sequence of zero
or more Worker objects.
- Worker objects have SSn,
Last_name (opt), First_name (opt), and Hours.
We can now verify that Fig 12.3 actually fits this form. If we wanted to
enforce this, we might change the first line to
<?xml version="1.0" standalone="no">
<!DOCTYPE Projects SYSTEM "proj.dtd">
This allows us to enforce which fields can be present, but doesn't really
spell out the full schema involving other tables.
Chapter 16: more about
normalization
XML Schema Definition (XSD)
For this we move on to XML schema,
or XML Schema Definitions, XSD, illustrated in Fig 12.5. While this format
is more complex, it also supports unordered data, and shares a more general
XML syntax. Note in the first of these the three entities department,
employee (with subtype dependentNameUnique), and project. The locations
table will later be listed as a multivalued attribute of Department, and
dependents will be a multivalued attribute of Employee. WorksOn will be a
multivalued attribute of Employee, and Worker a multivalued attribute of
Project.
This format can serve (though perhaps not terribly efficiently) as a
replacement for the SQL table definitions for the COMPANY database.
Parts of the XML schema:
- defining the main entities of "company"
- key and foreign-key constraints
- constraints are given names here
- xsd:selector is the table name
- xsd:field is the attribute name within the table
- table definitions
Note in the Employee and Project sections the reference to WorksOn, and in
Employee the employeeDependent, and in Department the departmentLocation.
Here is the Fig
12.5 XML schema.
The MySQL command mysqldump can produce xml
representations of specific tables. If I run the following command
mysqldump -p --xml plddb employee
I get XML that represents just the employee table. The table definition
looks something like this:
<table_structure name="employee">
<field Field="fname" Type="varchar(15)" Null="NO" Key="" Extra="" Comment="" />
<field Field="minit" Type="char(1)" Null="YES" Key="" Extra="" Comment="" />
<field Field="lname" Type="varchar(15)" Null="NO" Key="" Extra="" Comment="" />
<field Field="ssn" Type="char(9)" Null="NO" Key="PRI" Extra="" Comment="" />
<field Field="bdate" Type="date" Null="YES" Key="" Extra="" Comment="" />
<field Field="address" Type="varchar(30)" Null="YES" Key="" Extra="" Comment="" />
<field Field="sex" Type="char(1)" Null="YES" Key="" Extra="" Comment="" />
<field Field="salary" Type="decimal(10,2)" Null="YES" Key="" Extra="" Comment="" />
<field Field="super_ssn" Type="char(9)" Null="YES" Key="MUL" Extra="" Comment="" />
<field Field="dno" Type="int(11)" Null="NO" Key="MUL" Extra="" Comment="" />
...
</table_structure>
Then follows the actual data:
<table_data name="employee">
<row>
<field name="fname">John</field>
<field name="minit">B</field>
<field name="lname">Smith</field>
<field name="ssn">123456789</field>
<field name="bdate">1965-01-09</field>
<field name="address">731 Fondren, Houston TX</field>
<field name="sex">M</field>
<field name="salary">30000.00</field>
<field name="super_ssn">333445555</field>
<field name="dno">5</field>
</row>
<row>
<field name="fname">Franklin</field>
<field name="minit">T</field>
<field name="lname">Wong</field>
<field name="ssn">333445555</field>
<field name="bdate">1955-12-08</field>
<field name="address">638 Voss, Houston TX</field>
<field name="sex">M</field>
<field name="salary">40000.00</field>
<field name="super_ssn">888665555</field>
<field name="dno">5</field>
</row>
...
</table_data>
There is no attempt here at identifying hierarchical relationships, or even
to use XML tags that reflect the individual fields.