Even More SQL!

Record counts

CTEs (moved to sql2.html)

http://blog.2ndquadrant.com/postgresql-ctes-are-optimization-fences/

Random data

Functions

Serial

Dellstore

Recursive CTEs

OVER

Postgres information_schema and system catalog

Postgres pg_stats and query planning



Record counts

Suppose we want to peek at the first 20 records of a large dataset (say, bigcompany.employee, or passwords.password). We can do this:

select lname from employee limit 20;

What if we want the first 20, in sorted order?

select lname from employee order by lname limit 20;

What if we want the last 20? Here we have to know the total. There are 308 employees in bigcompany:

select count(*) from employee;

So the last of them are

select lname from employee offset 288;

But that's too much math. How about this:

select lname from employee offset ((select count(*) from employee) -20);



Random data

Suppose we want some random data. First we try this:

select generate_series(1,20) as id, 1000000*random();

Now let's store this in a table:

create table randos (
    id integer primary key,
    value integer
);

To populate the table, use

insert into randos select generate_series(1,20) as id, 1000000*random();

Note that the values are now integers because we asked for that in create table.

If when creating the table we declared id to be char(5), this still works:

create table randostr (
    id char(5) primary key,
    value integer
);

insert into randostr select generate_series(100000, 100200) as id, 1000000*random();


Functions

We can evaluate expressions in SQL:

select 2+3;

Note the use of select. In the examples below, the first three are special "constants"; the rest are functions.

select current_date;
select current_time;
select current_timestamp;

select now();
select timeofday();

We can also define functions. Recall that employee.salary was defined as numeric(10,2).

create or replace function getsal(name varchar) returns numeric
as
$$
    select e.salary from employee e where e.lname = name;
$$
language sql;

select getsal('Wong');

drop function getsal(varchar);

A function is a procedure if it returns void.

We use create or replace here because functions, unlike tables, are frequently redefined. If we'd just used create, we'd have to drop the function before redefining it.

We can also use the language PL/pgsql, which has some extensions. We specify it with language plpgsql. The main advantage lies in communicating with a remote database server. Using the SQL language means that each individual query has to be sent to the server for execution; PL/pgsql effectively batches the different queries.

Here are two functions that might be used to do grade calculations using the University database. The first compares two grades. We should get grade_less('B+','A-') = true, grade_less('B','B+') = true, etc. It illustrates declarations, substrings and if/elsif/else/end-if blocks

-- implements < operator for grades
create or replace function grade_less(g1 varchar, g2 varchar) returns boolean
as
$$
DECLARE
    l1 char;        -- the letter, 'A', 'B', etc
    l2 char;
    mod1 char;      -- the modifier, '+' or '-'
    mod2 char;
BEGIN
    l1 := upper(substring(g1 from 1 for 1));      -- 1st char of string is position 1
    l2 := upper(substring(g2 from 1 for 1));    -- we take 1 char starting at position 1
    if    l1 < l2 then return false;        -- 'A' is a better grade than 'B'
    elsif l1 > l2 then return true;
    end if;
    -- now we know l1 = l2
    if char_length(g1) >= 2 then
        mod1 := substring(g1 from 2 for 1);
    else
        mod1 := ' ';
    end if;
    if char_length(g2) >= 2 then
        mod2 := substring(g2 from 2 for 1);
    else
        mod2 := ' ';
    end if;
    if    mod1 = mod2 then return false;
    elsif mod1 = '-' then return true;
    elsif mod2 = '+' then return true;
    else return false;
    end if;
END;
$$
language plpgsql;

To test this, try

select grade_less('B+','A-');
select grade_less('B','B+');

The next function converts letter grades to a numeric form: 'B+' is 3.333, etc. Instead of using the previous function, we could also use the one below, and then compare the results numerically:

create or replace function grade_num(g varchar) returns real
as
$$
declare
    num real;
    letter char;
    mod char;
begin
    letter := upper(substring(g from 1 for 1));
    if    letter = 'A' then num:= 4.0;
    elsif letter = 'B' then num := 3.0;
    elsif letter = 'C' then num := 2.0;
    elsif letter = 'D' then num := 1.0;
    elsif letter = 'F' then return 0;
    end if;
    if char_length(g) = 1 then return num;
    end if;
    mod := substring(g from 2 for 1);
    if    mod = '+' then return num + 1.0/3;
    elsif mod = '-' then return num - 1.0/3;
    else return num;
    end if;
end;
$$
language plpgsql;

See sql2.html#universal_quantification for an application of these functions.


Serial

Suppose we define the following table:

create table serialdemo (
    ident serial primary key,
    name varchar(20)
);

The ident field will be automatically assigned increasing consecutive values:

insert into serialdemo values(default, 'alice');
insert into serialdemo values(default, 'bob');
insert into serialdemo values(7, 'zorro');
insert into serialdemo values(default, 'clarisse');
insert into serialdemo values(default, 'dale');
insert into serialdemo values(default, 'edmund');
insert into serialdemo values(default, 'fiona');

Now let's try one more. The problem is that the next value for ident is 7, which we already used.

insert into serialdemo values(default, 'georg');

This fails! But if we try it again, it works!

We can also try insertions within transactions:

begin;
insert into serialdemo
values (default, 'zool');


insert into serialdemo
values (default, 'harald');
rollback;

What is going on here?


One potential problem with using serial keys is that external observers can guess how many customers you have; see the German tank problem. An alternative is to use so-called UUIDs. To load the UUID module, run the following as user postgres:

\c pld
create extension if not exists pgcrypto;

The installation of the module can be verified with select gen_random_uuid();. Now, in database pld, I can define this:

create table uuiddemo (
    ident uuid primary key default gen_random_uuid(),
    name varchar(20)
);

We can then insert.

insert into uuiddemo values(default, 'alice');
insert into uuiddemo values(default, 'bob');
insert into uuiddemo values(default, 'clarisse');
insert into uuiddemo values(default, 'dale');
insert into uuiddemo values(default, 'edmund');
insert into uuiddemo values(default, 'fiona');



Dellstore

The Dellstore data is in one large file, dellstore.sql. This file includes create table commands and data-insertion commands. (For some reason, table keys are not created until the very end of the file, with alter table, but this does not really matter.)

Here's how to install the tables. You can either create a new named database or else just load the tables into your existing company/university database.

To create a new named database for Dellstore:

    createdb -U postgres dellstore
or
    createdb -U postgres -W dellstore    # -W to force request for password

To load the information into Dellstore:

From the shell:
    psql dellstore < dellstore.sql
Alternatively, from within postgres, where psql was started in the directory containing dellstore.sql:
    \i dellstore.sql

I had considerable difficulty with the string O'DONNELL in products.actor, as the quotation mark was a non-ASCII, non-UTF8 symbol.

Let's try some queries:

select count(*) from customers;
select count(*) from products;

select * from customers limit 50;

Who ordered product 4789?

select c.firstname, c.lastname, o.orderid
from customers c join orders o on c.customerid = o.customerid join orderlines ol on o.orderid = ol.orderid
where prod_id = 4789;

What titles did 'YHOODAJPYJ' order?

select p.title
from customers c join orders o on c.customerid = o.customerid
    join orderlines ol on o.orderid = ol.orderid
    join products p on ol.prod_id = p.prod_id
where c.lastname = 'YHOODAJPYJ';

Who ordered products that cost less than $10?

select c.lastname
from customers c join orders o on c.customerid = o.customerid
    join orderlines ol on o.orderid = ol.orderid
    join products p on ol.prod_id = p.prod_id
where p.price < 10.0;

What products were ordered 16 or more times?

select p.title, p.prod_id, count(*)
from products p join orderlines ol on p.prod_id = ol.prod_id
group by p.prod_id
having count(*) >= 16;

How many products did each customer order?

select count(*), c.lastname
from customers c join orders o on c.customerid = o.customerid
group by c.customerid
limit 50;

What was the maximum number of products ordered

with ccounts as
(select count(*) as count, c.lastname from customers c join orders o on c.customerid = o.customerid group by c.customerid)
select max(count) from ccounts;

We can also do this with an OVER instead of a GROUP BY (OVER is discussed below):

with ccounts(count) as
(select count(*) over (partition by c.customerid) from customers c join orders o on c.customerid = o.customerid)
select max(count) from ccounts;

Which customers placed the most orders? Here is one approach:

select c.firstname, c.lastname
from customers c join orders o on c.customerid = o.customerid
group by c.customerid
having count(*) = (
    with ccounts as
    (select count(*) as count, c.lastname from customers c join orders o on c.customerid = o.customerid group by c.customerid)
    select max(count) from ccounts
);

The following approach is trickier, but a little faster (about twice as fast). It uses Common-Table Expressions, or CTEs. The idea is to run the inner query once, generating the customer last name and per-customer order count. We name the resulting table ccounts. We then use that temporary table twice more, once to get the maximum value of the order count and then once to find the records whose count matches this maximum:

with ccounts as (
    select count(*) as count, c.lastname from customers c join orders o on c.customerid = o.customerid group by c.customerid
) select cc.lastname, cc.count
from ccounts cc
where cc.count = (select max(count) from ccounts);

What product sold the most?

-- separate query to find the max sales
with pcounts as
(select sum(quantity) as quan, p.prod_id from products p join orderlines ol on p.prod_id = ol.prod_id group by p.prod_id)
select max(pc.quan) from pcounts pc ;

select p.prod_id, p.title, sum(ol.quantity)
from products p join orderlines ol on p.prod_id = ol.prod_id
group by p.prod_id
having sum(ol.quantity) =
(with pcounts as
(select sum(quantity) as quan from products p join orderlines ol on p.prod_id = ol.prod_id group by p.prod_id)
select max(pc.quan) from pcounts pc) ;

Using the previous CTE approach, we can also do this:

with pcounts as (
    select sum(quantity) as quan, p.title from products p join orderlines ol on p.prod_id = ol.prod_id group by p.prod_id
) select pc.title, pc.quan
from pcounts pc
where pc.quan = (select max(quan) from pcounts);



Recursive CTEs

We looked earlier at basic Common Table Expressions (CTEs) as an alternative to writing nested queries. These begin with with followed by a definition of a temporary table, and then a query:

with temp as (select w.essn from works_on w where w.pno = 10)
select e.fname, e.lname from employee e where e.ssn in (select * from temp);

This one doesn't add any particular efficiency. But sometimes we need the same table twice, and in that case using a CTE (so we evaluate the table only once) is more efficient:

with ccounts as (
    select count(*) as count, c.lastname from customers c join orders o on c.customerid = o.customerid group by c.customerid
) select cc.lastname, cc.count
from ccounts cc
where cc.count = (select max(count) from ccounts);

We build the temporary table ccounts, and then run two queries:

By using the same temporary table twice, the query is more efficient.

It turns out we can also define recursive CTEs. Here's a classic simple one

with recursive times10 (n) as (
    select 10
    UNION ALL
    select n+10 from times10 where n < 100
)
select * from times10;

The table times10 has a single attribute: n. The type of n is integer, though that will be inferred. The select 10 states that the value 10 is in table times10. This is UNIONed with the values of the second select clause, which states that if n is in the table, then we also have n+10. Having built that table, we list all the values. The recursion must be limited by the table construction; it is not sufficient to define an "infinite" table that we then limit through the final select clause.

Here is another:

with recursive triangle (n,val) as (
    select 1, 1
    union
    select t.n+1, t.val+n+1 from triangle t where t.n <= 20
)  select * from triangle;

Real Applications

So what good does this do us in real life? The fundamental application of recursive CTEs is usually to "flatten out" some kind of complex hierarchical relationship. For example, the employee table defines a "supervisor" relationship, via the super_ssn field. We can use this to determine whether or not Alice is supervisor of Bob. But what if the question is whether there is a supervisor "chain" from Alice down to Bob? That is, are there S1, ...SN so that Alice supervises S1, who supervises S2, ..., who supervises SN, who supervises Bob?

Here's a recursive CTE to list all employees supervised directly or indirectly by Mr Borg (888-66-5555):

-- returns all employees supervised -- directly or indirectly -- by 888-66-5555 (Borg)
with recursive supervision as (
    select e.fname, e.lname, e.ssn, e.super_ssn from employee e
    where e.ssn = '888665555'
    UNION
    select e.fname, e.lname, e.ssn, e.super_ssn from employee e join supervision s on s.ssn = e.super_ssn
)  select * from supervision;

We define the table "supervision" recursively, and its attributes implicitly. That is, the atomic part of the recursive definition says that a record should look like (fname, lname, ssn, ssn). The attributes listed in the recursive part do in fact match; it is an error if they do not. (Because most database types are strings, it is an error that is easy to make.)

The next table lists pairs of employees (a,b), by ssn, where b is a "supervisor ancestor" of a:

with recursive superancestor as (
   select e.ssn, e.super_ssn from employee e where e.super_ssn is not null
   UNION
   select e.ssn, s.super_ssn from employee e join superancestor s
   on e.super_ssn = s.ssn
)  select * from superancestor;

Note how the recursion works: (a,b) is in the table if a is supervised by b (the atomic case), or a has supervisor s, and s has supervisor ancestor c.

Let's do that another way: this time the recursive relationship is that a has a supervisor ancestor s, and the supervisor of s is c:

with recursive superancestor as (
   select e.ssn, e.super_ssn from employee e where e.super_ssn is not null
   UNION
   select e.ssn, s.super_ssn from superancestor e join employee s
   on e.super_ssn = s.ssn where s.super_ssn is not null
)  select * from superancestor;

Is anyone their own supervisor-ancestor? First of all, would this lead to an infinite cycle in the CTE table? No: it would simply lead to a set of pairs (ai,ai+1), i<n where a0 is supervised by a1 is supervised by a2 is supervised by aN is supervised by a0.

Let's formulate the query, using our superancestor relationship above. The goal is to see if anyone is their own superancestor:

with recursive superancestor as (
   select e.ssn, e.super_ssn from employee e where e.super_ssn is not null
   UNION
   select e.ssn, s.super_ssn from employee e join superancestor s
   on e.super_ssn = s.ssn
)  select sa.ssn from superancestor sa where sa.ssn = sa.super_ssn;

We have to add someone!

update employee set super_ssn = '111220177' where ssn = '111220162';
update employee set super_ssn = '111220154' where ssn = '111220162';    -- undo

This should lead to a chain 177->165->162->177.



OVER and window queries

Sometimes we'd like to use aggregation functions but with more flexibility than provided by GROUP BY. Recall that GROUP BY runs the aggregation functions over each group. Let's warm up with some examples:

select sum(salary) from employee;

select dno, salary from employee;

select dno, sum(salary) from employee group by dno;

This one fails, because it has no group-by clause:  select dno, salary, sum(salary) from employee;

But this works:

select dno, salary, sum(salary) over() from employee;

What is it doing? Applying the sum(salary) function to all salaries. This isn't necessarily what we're looking for. But how about this:

select dno, lname, salary, sum(salary) over(partition by dno) from employee;

Now, we have a row for each employee, but the last column is a sum of that employee's department's salaries. This cannot be done with group by (at least without an inner query), because if we group by dno, we can't have an output record for each employee.

The partition clause, inside the over(), determines what we're taking the sum of. We find the partition bucket that the current record belongs to (that is, we find the employee's dno), and take the sum over all salaries in that bucket.

There's also an order by clause, for things within the bucket.

select lname, salary, sum(salary) over (order by lname) from employee;
select lname, salary, sum(salary) over (order by salary desc) from employee;

In each case the sum(salary) is cumulative. The order-by clause determines the order of addition in the cumulative sum.

In the next two examples the rows we're summing are spelled out explicitly; in each case we are doing a cumulative sum. In the first case the sum is cumulative over all employees; in the second it is cumulative by department.

select lname, salary, sum(salary) over (order by lname rows between unbounded preceding and current row) from employee;

select lname, dno, salary, sum(salary) over (partition by dno order by lname rows between unbounded preceding and current row) from employee;

The role of sum() with an over() clause is quite different from its role as an aggregation function. There are a set of functions that can be used with over(), known as window functions. Many apply to the entire partition, or (in the presence of an order clause) from the beginning of the partition to the current record. Here's an example:

select lname, row_number() over(order by lname) from employee;

What happens if we omit the over()?

The median

How about finding the median? We can't reference row_number() directly in the where clause, but we can use a CTE:

with temp(lname, salary, rn) as (select lname, salary, row_number() over (order by salary) from employee)
select * from temp where rn = (select count(*) from employee)/2;

select lname, salary, row_number() over (order by salary) from employee where rn = (select count(*) from employee)/2;

Some window functions apply to a "window frame", that is, a defined range within the ordered partition. We define this window frame with the between clause. Here's an example in which each employee is listed with the average salary of the employee, the previous employee and the following employee:

select lname, salary, avg(salary) over (order by salary rows between 1 preceding and 1 following) from employee;



Using the Postgres information_schema and system catalogs

The information_schema is more or less standardized; the Postgres system catalog (schema pg_catalog) is postgres-specific. We can see these schemas with the psql command \dnS. Though we'll refer to the objects in these schemas as tables, many of them are actually views. While reading from them is reasonable, updates to system tables can destroy your database.

These tables can shed light on

For some of these examples, the following psql option is useful:
    \x auto
It forces the printing of overly wide tables in a more vertical format. Disable with \x off.

Some references:

All the postgres commands, like \d and \d table, get their information from information_schema and pg_catalog queries.

Information about Tables

First we'll use the table information_schema.columns:

select column_name, table_schema, data_type from information_schema.columns where table_name = 'employee';

select * from information_schema.columns where table_name = 'employee' and column_name = 'ssn';

This gives the "normal" columns to table "employee", but does not give the system columns. For that we'll use the pg_class table. We can see the columns of this table with \d pg_class.

select oid, relpages, reltuples FROM pg_class WHERE relname = 'employee';

This is the output (on my system. right now.)

  oid  | relpages | reltuples
-------+----------+-----------
 16386 |        1 |         9

The oid (not printed with "select *"!) is the so-called table oid: an internal identifier of the table. The other two columns are estimates of the number of pages and number of tuples.

Now let's try a query that is comparable to the above information_schema.columns query.

select a.attname, a.atttypid, t.typname from pg_class c join pg_attribute a on c.oid=a.attrelid join pg_type t on t.oid = a.atttypid where c.relname = 'employee';

Here's the output.

  attname  | atttypid | typname
-----------+----------+---------
 tableoid  |       26 | oid
 cmax      |       29 | cid
 xmax      |       28 | xid
 cmin      |       29 | cid
 xmin      |       28 | xid
 ctid      |       27 | tid
 fname     |     1043 | varchar
 minit     |     1042 | bpchar
 lname     |     1043 | varchar
 ssn       |     1042 | bpchar
 bdate     |     1082 | date
 address   |     1043 | varchar
 sex       |     1042 | bpchar
 salary    |     1700 | numeric
 super_ssn |     1042 | bpchar
 dno       |       23 | int4


Note the first six.


More from pg_catalog

select datname from pg_database;

select datname, datdba, encoding from pg_database order by datname;

How about the name and owner of databases? This works, if you have DBA privileges:

select d.datname, d.datdba, a.rolname, pg_encoding_to_char(d.encoding) from pg_database d join pg_authid a on d.datdba = a.oid order by d.datname;

But if you're not a privileged user, use pg_roles instead of pg_authid; the former does not have the rolpassword field (which can be used to hack passwords).

select d.datname, d.datdba, a.rolname, pg_encoding_to_char(d.encoding) from pg_database d join pg_roles a on d.datdba = a.oid order by d.datname;


Information on permission grants

Access to tables is created through specific "grants". Here we view the grants on a table:

select grantee, privilege_type from information_schema.role_table_grants
where table_name='employee';

Here are the grantors and grantees:

select * from information_schema.table_privileges where table_name = 'employee';
select * from information_schema.column_privileges where table_name = 'employee' and column_name = 'ssn';



pg_stats

This is where Postgres keeps its information on where-clause selectivity. Some information: postgresql.org/docs/9.6/static/row-estimation-examples.html.

The pg_stats table is where postgres stores the statistics it has gathered to assist in query planning. The histogram_bounds attribute contains a histogram describing the data distribution. If histogram_bounds is null, there is no histogram data available to the query planner! We'll start with the Dellstore database (named "dell" on my machine):

select tablename, attname from pg_stats where schemaname = 'public' and histogram_bounds is null;

  tablename  |       attname       
------------+----------------------
 orderlines | quantity
 orderlines | orderlineid
 products   | special
 products   | price                  <--- not integer
 products   | category
 customers  | gender
 customers  | income
 customers  | age
 customers  | password
 customers  | creditcardexpiration
 customers  | creditcardtype
 customers  | region
 customers  | country
 customers  | state
 customers  | address2

Now let's see which attributes do have histogram data:

select tablename, attname from pg_stats where schemaname = 'public' and histogram_bounds is not null;


 tablename  |    attname    
------------+----------------
 orderlines | orderid
 orderlines | prod_id
 orderlines | orderdate
 inventory  | prod_id
 inventory  | quan_in_stock
 inventory  | sales
 products   | prod_id
 products   | title
 products   | actor
 products   | common_prod_id
 cust_hist  | customerid
 cust_hist  | orderid
 cust_hist  | prod_id
 customers  | customerid
 customers  | firstname
 customers  | lastname
 customers  | address1
 customers  | city
 customers  | zip
 customers  | email
 customers  | phone
 customers  | creditcard
 customers  | username
 orders     | orderid
 orders     | orderdate
 orders     | customerid
 orders     | netamount
 orders     | tax
 orders     | totalamount

Let's look at a couple of these:

select histogram_bounds from pg_stats where tablename = 'orders' and attname = 'totalamount';
select histogram_bounds from pg_stats where tablename = 'customers' and attname = 'zip';

Now let's try some queries on these attributes, with differing levels of selectivity:

explain select * from customers where customerid between 2000 and 2400; // index scan
explain select * from customers where customerid between 2000 and 15000;    // still an index scan
explain select * from customers where customerid between 2000 and 20000;

The cutoff seems to be somewhere around 15893. [!]

What if we try this with zip:

explain select * from customers where zip between 60000 and 62000;

Selectivity doesn't matter as there's no index on zip. We could add it, though.

Now let's combine two where conditions:

explain select * from customers where zip between 60000 and 62000 and customerid between 2000 and 2400;

We start with an index scan on customerid.

explain select * from customers where firstname between 'P' and 'Q';

We get a sequential scan. There's no index on firstname.

explain select * from customers where zip between 60000 and 62000 and firstname between 'P' and 'Q';

Again, there is no useful index.


Next we run analyze select. This actually runs the query.

explain analyze select * from customers where zip between 60000 and 62000 and firstname between 'P' and 'Q';

                                                       QUERY PLAN                                                       
-------------------------------------------------------------------------------------------------------------------------
 Seq Scan on customers  (cost=0.00..888.00 rows=9 width=268) (actual time=0.241..3.498 rows=13 loops=1)
   Filter: ((zip >= 60000) AND (zip <= 62000) AND ((firstname)::text >= 'P'::text) AND ((firstname)::text <= 'Q'::text))
   Rows Removed by Filter: 19987
 Planning time: 0.158 ms
 Execution time: 3.520 ms


Note that the estimator thinks there will be 9 rows, but the actual number of rows is 13.


Here's another:

explain analyze select * from customers where zip between 60000 and 70000 and firstname between 'P' and 'T';

                                                       QUERY PLAN                                                       
-------------------------------------------------------------------------------------------------------------------------
  Seq Scan on customers  (cost=0.00..888.00 rows=174 width=268) (actual time=0.579..47.226 rows=185 loops=1)
   Filter: ((zip >= 60000) AND (zip <= 70000) AND ((firstname)::text >= 'P'::text) AND ((firstname)::text <= 'T'::text))
   Rows Removed by Filter: 19815
 Planning time: 0.138 ms
 Execution time: 47.303 ms

The estimator thinks 174 but really there are 185.