Comp 353/453: Database Programming, Corboy
L08, 4:15 Mondays
Week 8, Mar 18
midterm
XML
XML is a markup language more structured than HTML. Its strict hierarchical
structure makes it suitable for textual encoding of databases; see Fig
12.3 and below.
The primary role of XML in databases is for transporting the data (eg
formatting the result of a query), not storing it natively. There are
attempts at defining native XML databases, but it is hard to get around the
problem that tagged text is an inefficient storage mechanism. For example,
finding records means searching linearly. If you are worried about going to
a NoSQL approach because your database is getting too big, XML is not for
you!
While the encoding of Fig 12.3 (below) looks promising, note that the
<Worker> entries are more complex than they are in our relational
database; they contain more than just a key to the EMPLOYEE table and the hours attribute. Arguably, even the
employee and hours information really should be in a separate XML section
<works_on>....</works_on>.
This method does illustrate the hierarchical structure supported by XML.
In this version, we are using xml only to list
the data; we are not using xml to define its format or the table structure.
XML Structured Data: EN Fig 12.3
<?xml version="1.0" standalone="yes">
<Projects>
<Project>
<Name>ProductX</Name>
<Number>1</Number>
<Location>Bellaire</Location>
<Dept_no>5</Dept_no>
<Worker>
<Ssn>123456789</Ssn>
<Last_name>Smith</Last_name>
<Hours>32.5</Hours>
</Worker>
<Worker>
<Ssn>453453453</Ssn>
<First_name>Joyce</First_name>
<Hours>20.0</Hours>
</Worker>
</Project>
<Project>
<Name>ProductY</Name>
<Number>2</Number>
<Location>Sugarland</Location>
<Dept_no>5</Dept_no>
<Worker>
<Ssn>123456789</Ssn>
<Hours>7.5</Hours>
</Worker>
<Worker>
<Ssn>453453453</Ssn>
<Hours>20.0</Hours>
</Worker>
<Worker>
<Ssn>333445555</Ssn>
<Hours>10.0</Hours>
</Worker>
</Project>
</Projects>
The SQL query corresponding to this might be
select p.pname, p.pnumber, p.plocation,
e.ssn, e.fname, w.hours
from project p, works_on w, employee e
where p.pnumber = w.pno and w.essn = e.ssn and p.pnumber in (1,2)
order by p.pnumber
The output of the query in MySQL is
+----------+---------+-----------+-----------+----------+-------+
| pname | pnumber | plocation |
ssn | fname |
hours |
+----------+---------+-----------+-----------+----------+-------+
| ProductX | 1 | Bellaire |
123456789 | John | 32.5 |
| ProductX | 1 | Bellaire |
453453453 | Joyce | 20.0 |
| ProductY | 2 | Sugarland |
123456789 | John | 7.5 |
| ProductY | 2 | Sugarland |
333445555 | Franklin | 10.0 |
| ProductY | 2 | Sugarland |
453453453 | Joyce | 20.0 |
+----------+---------+-----------+-----------+----------+-------+
One advantage of XML is that it is easier to parse by text-based front-ends.
XML DTD (Document Type Definition)
One approach to imposing field structure on XML documents is illustrated by
Fig 12.4 (reproduced
as text below), which shows a small XML DTD
(Document Type Definition) document. DTD was the original mechanism for
defining data formats in XML. One of the weaknesses of DTD is that it has a
very limited set of atomic attribute types; in the example here, #PCDATA (parsed character data) essentially means
a string. There is no provision for strings of fixed length, or CHAR versus
VARCHAR.
Here is the DTD code:
<!DOCTYPE projects [
<!ELEMENT projects (project +)>
<!ELEMENT project (Name, Number, Location,
DeptNo?, Workers)
<!ATTLIST Project ProjId ID
#REQUIRED>
>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Number (#PCDATA)>
<!ELEMENT Location (#PCDATA)>
<!ELEMENT DeptNo (#PCDATA)>
<!ELEMENT Workers (Worker *)>
<!ELEMENT Worker (SSN, LastName?, FirstName?
, hours)>
<!ELEMENT SSN (#PCDATA)>
<!ELEMENT LastName (#PCDATA)>
<!ELEMENT FirstName (#PCDATA)>
<!ELEMENT hours (#PCDATA)>
] >
The syntax here is
!IElement
type_identifier being defined
type-spec in parentheses: list of attributes, or
definition as a list
Attributes followed by ? can be null (that is, can appear 0 or 1 times).
Attributes or subtypes followed by + can appear 1 or more times. Attributes
or subtypes followed by * can appear 0 or more times.
DTD allows us to specify that
- Projects is a sequence of one
or more Project objects
- A Project has a Name, Number,
Location, Dept_no (optional), and Workers.
It also has a ProjId atribute.
- Workers is a sequence of zero
or more Worker objects.
- Worker objects have SSn,
Last_name (opt), First_name (opt), and Hours.
We can now verify that Fig 12.3 actually fits this form. If we wanted to
enforce this, we might change the first line to
<?xml version="1.0" standalone="no">
<!DOCTYPE Projects SYSTEM "proj.dtd">
This allows us to enforce which fields can be present, but doesn't really
spell out the full schema involving other tables.
XML Schema Definition (XSD)
For this we move on to XML schema,
or XML Schema Definitions, XSD, illustrated in Fig 12.5. While this format
is more complex, it also supports unordered data, and shares a more general
XML syntax. Note in the first of these the three entities department,
employee (with subtype dependentNameUnique), and project. The locations
table will later be listed as a multivalued attribute of Department, and
dependents will be a multivalued attribute of Employee. WorksOn will be a
multivalued attribute of Employee, and Worker a multivalued attribute of
Project.
This format can serve (though perhaps not terribly efficiently) as a
replacement for the SQL table definitions for the COMPANY database.
Parts of the XML schema:
- defining the main entities of "company"
- key and foreign-key constraints
- constraints are given names here
- xsd:selector is the table name
- xsd:field is the attribute name within the table
- table definitions
Note in the Employee and Project sections the reference to WorksOn, and in
Employee the employeeDependent, and in Department the departmentLocation.
Here is the Fig 12.5
XML schema.