Database week 8

Comp 353/453: Database Programming, Corboy L08, 4:15 Mondays

Week 8, Mar 18

midterm

XML

XML is a markup language more structured than HTML. Its strict hierarchical structure makes it suitable for textual encoding of databases; see Fig 12.3 and below.

The primary role of XML in databases is for transporting the data (eg formatting the result of a query), not storing it natively. There are attempts at defining native XML databases, but it is hard to get around the problem that tagged text is an inefficient storage mechanism. For example, finding records means searching linearly. If you are worried about going to a NoSQL approach because your database is getting too big, XML is not for you!

While the encoding of Fig 12.3 (below) looks promising, note that the <Worker> entries are more complex than they are in our relational database; they contain more than just a key to the EMPLOYEE table and the hours attribute. Arguably, even the employee and hours information really should be in a separate XML section <works_on>....</works_on>.

This method does illustrate the hierarchical structure supported by XML.

In this version, we are using xml only to list the data; we are not using xml to define its format or the table structure.

XML Structured Data: EN Fig 12.3

<?xml version="1.0" standalone="yes">
<Projects>
    <Project>
       <Name>ProductX</Name>
       <Number>1</Number>
       <Location>Bellaire</Location>
       <Dept_no>5</Dept_no>
       <Worker>
          <Ssn>123456789</Ssn>
          <Last_name>Smith</Last_name>
          <Hours>32.5</Hours>
       </Worker>
       <Worker>
          <Ssn>453453453</Ssn>
          <First_name>Joyce</First_name>
          <Hours>20.0</Hours>
       </Worker>
    </Project>

    <Project>
       <Name>ProductY</Name>
       <Number>2</Number>
       <Location>Sugarland</Location>
       <Dept_no>5</Dept_no>
       <Worker>
          <Ssn>123456789</Ssn>
          <Hours>7.5</Hours>
       </Worker>
       <Worker>
          <Ssn>453453453</Ssn>
          <Hours>20.0</Hours>
       </Worker>
       <Worker>
          <Ssn>333445555</Ssn>
          <Hours>10.0</Hours>
       </Worker>
    </Project>
</Projects>

The SQL query corresponding to this might be

select p.pname, p.pnumber, p.plocation, e.ssn, e.fname, w.hours
from project p, works_on w, employee e
where p.pnumber = w.pno and w.essn = e.ssn and p.pnumber in (1,2)
order by p.pnumber

The output of the query in MySQL is

+----------+---------+-----------+-----------+----------+-------+
| pname    | pnumber | plocation | ssn       | fname    | hours |
+----------+---------+-----------+-----------+----------+-------+
| ProductX |       1 | Bellaire | 123456789 | John     | 32.5 |
| ProductX |       1 | Bellaire | 453453453 | Joyce    | 20.0 |
| ProductY |       2 | Sugarland | 123456789 | John     |   7.5 |
| ProductY |       2 | Sugarland | 333445555 | Franklin | 10.0 |
| ProductY |       2 | Sugarland | 453453453 | Joyce    | 20.0 |
+----------+---------+-----------+-----------+----------+-------+

One advantage of XML is that it is easier to parse by text-based front-ends.

XML DTD (Document Type Definition)

One approach to imposing field structure on XML documents is illustrated by Fig 12.4 (reproduced as text below), which shows a small XML DTD (Document Type Definition) document. DTD was the original mechanism for defining data formats in XML. One of the weaknesses of DTD is that it has a very limited set of atomic attribute types; in the example here, #PCDATA (parsed character data) essentially means a string. There is no provision for strings of fixed length, or CHAR versus VARCHAR.

Here is the DTD code:

<!DOCTYPE projects [
    <!ELEMENT projects (project +)>
    <!ELEMENT project (Name, Number, Location, DeptNo?, Workers)
          <!ATTLIST Project ProjId ID #REQUIRED>
    >
    <!ELEMENT Name (#PCDATA)>
    <!ELEMENT Number (#PCDATA)>
    <!ELEMENT Location (#PCDATA)>
    <!ELEMENT DeptNo (#PCDATA)>
    <!ELEMENT Workers (Worker *)>
    <!ELEMENT Worker (SSN, LastName?, FirstName? , hours)>
    <!ELEMENT SSN (#PCDATA)>
    <!ELEMENT LastName (#PCDATA)>
    <!ELEMENT FirstName (#PCDATA)>
    <!ELEMENT hours (#PCDATA)>
] >

The syntax here is
    !IElement
    type_identifier being defined
    type-spec in parentheses: list of attributes, or definition as a list

Attributes followed by ? can be null (that is, can appear 0 or 1 times). Attributes or subtypes followed by + can appear 1 or more times. Attributes or subtypes followed by * can appear 0 or more times.

DTD allows us to specify that

Projects is a sequence of one or more Project objects
A Project has a Name, Number, Location, Dept_no (optional), and Workers. It also has a ProjId atribute.
Workers is a sequence of zero or more Worker objects.
Worker objects have SSn, Last_name (opt), First_name (opt), and Hours.

We can now verify that Fig 12.3 actually fits this form. If we wanted to enforce this, we might change the first line to
<?xml version="1.0" standalone="no">
<!DOCTYPE Projects SYSTEM "proj.dtd">

This allows us to enforce which fields can be present, but doesn't really spell out the full schema involving other tables.

XML Schema Definition (XSD)

For this we move on to XML schema, or XML Schema Definitions, XSD, illustrated in Fig 12.5. While this format is more complex, it also supports unordered data, and shares a more general XML syntax. Note in the first of these the three entities department, employee (with subtype dependentNameUnique), and project. The locations table will later be listed as a multivalued attribute of Department, and dependents will be a multivalued attribute of Employee. WorksOn will be a multivalued attribute of Employee, and Worker a multivalued attribute of Project.

This format can serve (though perhaps not terribly efficiently) as a replacement for the SQL table definitions for the COMPANY database.

Parts of the XML schema:

defining the main entities of "company"
key and foreign-key constraints

constraints are given names here
xsd:selector is the table name
xsd:field is the attribute name within the table

table definitions

Note in the Employee and Project sections the reference to WorksOn, and in Employee the employeeDependent, and in Department the departmentLocation.

Here is the Fig 12.5 XML schema.