Record counts
CTEs (moved to sql2.html)
http://blog.2ndquadrant.com/postgresql-ctes-are-optimization-fences/
Dellstore
Postgres information_schema and system catalog
Postgres pg_stats and query planning
Suppose we want to peek at the first 20 records of a large dataset (say, bigcompany.employee, or passwords.password). We can do this:
select lname from employee limit 20;
What if we want the first 20, in sorted order?
select lname from employee order by lname limit 20;
What if we want the last 20? Here we have to know the total. There are 308 employees in bigcompany:
select count(*) from employee;
So the last of them are
select lname from employee offset 288;
But that's too much math. How about this:
select lname from employee offset ((select count(*) from employee) -20);
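If what we want is the last 20 in sorted order, we can skip the count entirely by sorting in descending order (a sketch; note the rows come back in reverse order):
select lname from employee order by lname desc limit 20;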
Suppose we want some random data. First we try this:
select generate_series(1,20) as id, 1000000*random();
Now let's store this in a table:
create table randos (
id integer primary key,
value integer
);
To populate the table, use
insert into randos select generate_series(1,20) as id, 1000000*random();
Note that the values are now integers because we asked for that in create table.
If, when creating the table, we had declared id to be char(5), this would still work:
create table randostr (
id char(5) primary key,
value integer
);
insert into randostr select generate_series(10000, 10200) as id, 1000000*random();
We can evaluate expressions in SQL:
select 2+3;
Note the use of select. In the examples below, the first three are special "constants"; the rest are functions.
select current_date;
select current_time;
select current_timestamp;
select now();
select timeofday();
We can also define functions. Recall that employee.salary was defined as numeric(10,2).
create or replace function getsal(name varchar) returns numeric
as
$$
select e.salary from employee e where e.lname = name;
$$
language sql;
select getsal('Wong');
drop function getsal(varchar);
A function is a procedure if it returns void.
We use create or replace here because functions, unlike tables, are frequently redefined. If we'd just used create, we'd have to drop the function before redefining it.
We can also use the language PL/pgsql, which has some extensions. We specify it with language plpgsql. The main advantage lies in communicating with a remote database server. Using the SQL language means that each individual query has to be sent to the server for execution; PL/pgsql effectively batches the different queries.
Here are two functions that might be used to do grade calculations using the University database. The first compares two grades. We should get grade_less('B+','A-') = true, grade_less('B','B+') = true, etc. It illustrates declarations, substrings, and if/elsif/else/end-if blocks.
-- implements < operator for grades
create or replace function grade_less(g1 varchar, g2 varchar) returns boolean
as
$$
DECLARE
    l1 char;        -- the letter, 'A', 'B', etc
    l2 char;
    mod1 char;      -- the modifier, '+' or '-'
    mod2 char;
BEGIN
    l1 := upper(substring(g1 from 1 for 1));    -- 1st char of string is position 1
    l2 := upper(substring(g2 from 1 for 1));    -- we take 1 char starting at position 1
    if l1 < l2 then return false;               -- 'A' is a better grade than 'B'
    elsif l1 > l2 then return true;
    end if;
    -- now we know l1 = l2
    if char_length(g1) >= 2 then
        mod1 := substring(g1 from 2 for 1);
    else
        mod1 := ' ';
    end if;
    if char_length(g2) >= 2 then
        mod2 := substring(g2 from 2 for 1);
    else
        mod2 := ' ';
    end if;
    if mod1 = mod2 then return false;
    elsif mod1 = '-' then return true;
    elsif mod2 = '+' then return true;
    else return false;
    end if;
END;
$$
language plpgsql;
To test this, try
select grade_less('B+','A-');
select grade_less('B','B+');
The next function converts letter grades to a numeric form: 'B+' is 3.333, etc. Instead of using the previous function, we could also use the one below, and then compare the results numerically:
create or replace function grade_num(g varchar) returns real
as
$$
declare
    num real;
    letter char;
    mod char;
begin
    letter := upper(substring(g from 1 for 1));
    if letter = 'A' then num := 4.0;
    elsif letter = 'B' then num := 3.0;
    elsif letter = 'C' then num := 2.0;
    elsif letter = 'D' then num := 1.0;
    elsif letter = 'F' then return 0;
    end if;
    if char_length(g) = 1 then return num;
    end if;
    mod := substring(g from 2 for 1);
    if mod = '+' then return num + 1.0/3;
    elsif mod = '-' then return num - 1.0/3;
    else return num;
    end if;
end;
$$
language plpgsql;
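For example, to compare two grades numerically:
select grade_num('B+') > grade_num('A-');   -- false: 3.33 < 3.67
select grade_num('B') < grade_num('B+');    -- true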
See sql2.html#universal_quantification for an application of these functions.
Suppose we define the following table:
create table serialdemo (
ident serial primary key,
name varchar(20)
);
The ident field will be automatically assigned increasing consecutive values:
insert into serialdemo values(default, 'alice');
insert into serialdemo values(default, 'bob');
insert into serialdemo values(7, 'zorro');
insert into serialdemo values(default, 'clarisse');
insert into serialdemo values(default, 'dale');
insert into serialdemo values(default, 'edmund');
insert into serialdemo values(default, 'fiona');
Now let's try one more. The problem is that the next value for ident is 7, which we already used.
insert into serialdemo values(default, 'georg');
This fails! But if we try it again, it works! The failed insert still advances the sequence past 7, so the retry gets the next value, 8.
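One way to avoid the failure entirely is to first resynchronize the sequence with the table's maximum key. A sketch, using the built-in functions pg_get_serial_sequence and setval:
select setval(pg_get_serial_sequence('serialdemo', 'ident'), (select max(ident) from serialdemo));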
We can also try insertions within transactions:
begin;
insert into serialdemo values (default, 'zool');
insert into serialdemo values (default, 'harald');
rollback;
What is going on here?
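The key fact is that sequence operations are not transactional: a rolled-back insert still consumes its value. A minimal sketch showing this directly, assuming the default sequence name serialdemo_ident_seq:
begin;
select nextval('serialdemo_ident_seq');    -- consumes a value
rollback;
select nextval('serialdemo_ident_seq');    -- the rolled-back value is not reused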
One potential problem with using serial keys is that external observers can guess how many customers you have; see the German tank problem. An alternative is to use so-called UUIDs. To load the UUID module, run the following as user postgres:
\c pld
create extension if not exists pgcrypto;
The installation of the module can be verified with select gen_random_uuid();. Now, in database pld, I can define this:
create table uuiddemo (
ident uuid primary key default gen_random_uuid(),
name varchar(20)
);
We can then insert.
insert into uuiddemo values(default, 'alice');
insert into uuiddemo values(default, 'bob');
insert into uuiddemo values(default, 'clarisse');
insert into uuiddemo values(default, 'dale');
insert into uuiddemo values(default, 'edmund');
insert into uuiddemo values(default, 'fiona');
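Listing the table should show each name paired with a random-looking 128-bit identifier:
select * from uuiddemo;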
The Dellstore data is in one large file, dellstore.sql.
This file includes create table
commands and data-insertion commands. (For some reason, table keys are not
created until the very end of the file, with alter
table, but this does not really matter.)
Here's how to install the tables. You can either create a new named
database or else just load the tables into your existing
company/university database.
To create a new named database for Dellstore:
createdb -U postgres dellstore
or
createdb -U postgres -W dellstore    # -W to force request for password
To load the information into Dellstore:
From the shell:
psql dellstore < dellstore.sql
Alternatively, from within postgres, where psql was started in the
directory containing dellstore.sql:
\i dellstore.sql
I had considerable difficulty with the string O'DONNELL in
products.actor, as the quotation mark was a non-ASCII, non-UTF8 symbol.
Let's try some queries:
select count(*) from customers;
select count(*) from products;
select * from customers limit 50;
Who ordered product 4789?
select c.firstname, c.lastname, o.orderid
from customers c join orders o on c.customerid = o.customerid join
orderlines ol on o.orderid = ol.orderid
where prod_id = 4789;
What titles did 'YHOODAJPYJ' order?
select p.title
from customers c join orders o on c.customerid = o.customerid
join orderlines ol on o.orderid = ol.orderid
join products p on ol.prod_id = p.prod_id
where c.lastname = 'YHOODAJPYJ';
Who ordered products that cost less than $10?
select c.lastname
from customers c join orders o on c.customerid = o.customerid
join orderlines ol on o.orderid = ol.orderid
join products p on ol.prod_id = p.prod_id
where p.price < 10.0;
What products were ordered 16 or more times?
select p.title, p.prod_id, count(*)
from products p join orderlines ol on p.prod_id = ol.prod_id
group by p.prod_id
having count(*) >= 16;
How many orders did each customer place?
select count(*), c.lastname
from customers c join orders o on c.customerid = o.customerid
group by c.customerid
limit 50;
What was the maximum number of orders placed by any one customer?
with ccounts as
(select count(*) as count, c.lastname from customers c join orders o on
c.customerid = o.customerid group by c.customerid)
select max(count) from ccounts;
We can also do this with an OVER instead of a GROUP BY (OVER is discussed below):
with ccounts(count) as
(select count(*) over (partition by c.customerid) from customers c join
orders o on c.customerid = o.customerid)
select max(count) from ccounts;
Which customers placed the most orders? Here is one approach:
select c.firstname, c.lastname
from customers c join orders o on c.customerid = o.customerid
group by c.customerid
having count(*) = (
with ccounts as
(select count(*) as count, c.lastname from customers c
join orders o on c.customerid = o.customerid group by c.customerid)
select max(count) from ccounts
);
The following approach is trickier, but a little faster (about twice as fast). It uses Common-Table Expressions, or CTEs. The idea is to run the inner query once, generating the customer last name and per-customer order count. We name the resulting table ccounts. We then use that temporary table twice more, once to get the maximum value of the order count and then once to find the records whose count matches this maximum:
with ccounts as (
select count(*) as count, c.lastname from customers c
join orders o on c.customerid = o.customerid group by c.customerid
) select cc.lastname, cc.count
from ccounts cc
where cc.count = (select max(count) from ccounts);
What product sold the most?
-- separate query to find the max sales
with pcounts as
(select sum(quantity) as quan, p.prod_id from products p join orderlines
ol on p.prod_id = ol.prod_id group by p.prod_id)
select max(pc.quan) from pcounts pc ;
select p.prod_id, p.title, sum(ol.quantity)
from products p join orderlines ol on p.prod_id = ol.prod_id
group by p.prod_id
having sum(ol.quantity) =
(with pcounts as
(select sum(quantity) as quan from products p join orderlines ol on
p.prod_id = ol.prod_id group by p.prod_id)
select max(pc.quan) from pcounts pc) ;
Using the previous CTE approach, we can also do this:
with pcounts as (
select sum(quantity) as quan, p.title from products p
join orderlines ol on p.prod_id = ol.prod_id group by p.prod_id
) select pc.title, pc.quan
from pcounts pc
where pc.quan = (select max(quan) from pcounts);
We looked earlier at basic Common Table Expressions (CTEs) as an
alternative to writing nested queries. These begin with with
followed by a definition of a temporary table, and then a query:
with temp as (select w.essn from works_on w where w.pno = 10)
select e.fname, e.lname from employee e where e.ssn in (select * from temp);
This one doesn't add any particular efficiency. But sometimes we need the same table twice, and in that case using a CTE (so we evaluate the table only once) is more efficient:
with ccounts as (
select count(*) as count, c.lastname from customers c
join orders o on c.customerid = o.customerid group by c.customerid
) select cc.lastname, cc.count
from ccounts cc
where cc.count = (select max(count) from ccounts);
We build the temporary table ccounts, and then run two queries against it: one to find the maximum count, and one to find the records whose count matches that maximum. By using the same temporary table twice, the query is more efficient.
It turns out we can also define recursive CTEs. Here's a classic simple one:
with recursive times10 (n) as (
select 10
UNION ALL
select n+10 from times10 where n < 100
)
select * from times10;
The table times10 has a single attribute: n. The type of n is integer, though that will be inferred. The select 10 states that the value 10 is in table times10. This is UNIONed with the values of the second select clause, which states that if n is in the table, then we also have n+10. Having built that table, we list all the values. The recursion must be limited by the table construction; it is not sufficient to define an "infinite" table that we then limit through the final select clause.
Here is another:
with recursive triangle (n, val) as (
    select 1, 1
    union
    select t.n+1, t.val+t.n+1 from triangle t where t.n <= 20
) select * from triangle;
So what good does this do us in real life? The fundamental application of recursive CTEs is usually to "flatten out" some kind of complex hierarchical relationship. For example, the employee table defines a "supervisor" relationship, via the super_ssn field. We can use this to determine whether or not Alice is supervisor of Bob. But what if the question is whether there is a supervisor "chain" from Alice down to Bob? That is, are there S1, ...SN so that Alice supervises S1, who supervises S2, ..., who supervises SN, who supervises Bob?
Here's a recursive CTE to list all employees supervised directly or indirectly by Mr Borg (888-66-5555):
-- returns all employees supervised -- directly or indirectly -- by 888-66-5555 (Borg)
with recursive supervision as (
    select e.fname, e.lname, e.ssn, e.super_ssn from employee e
    where e.ssn = '888665555'
    UNION
    select e.fname, e.lname, e.ssn, e.super_ssn from employee e join supervision s on s.ssn = e.super_ssn
) select * from supervision;
We define the table "supervision" recursively, and its attributes implicitly. That is, the atomic part of the recursive definition says that a record should look like (fname, lname, ssn, ssn). The attributes listed in the recursive part do in fact match; it is an error if they do not. (Because most database attributes are strings, such a mismatch is easy to make without triggering a type error.)
The next table lists pairs of employees (a,b), by ssn, where b is a "supervisor ancestor" of a:
with recursive superancestor as (
    select e.ssn, e.super_ssn from employee e where e.super_ssn is not null
    UNION
    select e.ssn, s.super_ssn from employee e join superancestor s on e.super_ssn = s.ssn
) select * from superancestor;
Note how the recursion works: (a,b) is in the table if a is supervised by b (the atomic case), or if a's supervisor s has supervisor-ancestor b (the recursive case).
Let's do that another way: this time the recursive relationship is that a has a supervisor ancestor s, and the supervisor of s is c:
with recursive superancestor as (
    select e.ssn, e.super_ssn from employee e where e.super_ssn is not null
    UNION
    select e.ssn, s.super_ssn from superancestor e join employee s on e.super_ssn = s.ssn where s.super_ssn is not null
) select * from superancestor;
Is anyone their own supervisor-ancestor? First of all, would this lead to an infinite loop in building the CTE table? No: it would simply lead to a finite set of pairs (ai, ai+1), i < N, where a0 is supervised by a1, which is supervised by a2, ..., and aN is supervised by a0. Because the recursion uses UNION, which discards duplicate rows, it terminates as soon as no new pairs appear.
Let's formulate the query, using our superancestor relationship above. The goal is to see if anyone is their own superancestor:
with recursive superancestor as (
    select e.ssn, e.super_ssn from employee e where e.super_ssn is not null
    UNION
    select e.ssn, s.super_ssn from employee e join superancestor s on e.super_ssn = s.ssn
) select sa.ssn from superancestor sa where sa.ssn = sa.super_ssn;
We have to add someone!
update employee set super_ssn = '111220177' where ssn = '111220162';
update employee set super_ssn = '111220154' where ssn = '111220162';    -- undo
This should lead to a chain 177->165->162->177.
Sometimes we'd like to use aggregation functions but with more flexibility than provided by GROUP BY. Recall that GROUP BY runs the aggregation functions over each group. Let's warm up with some examples:
select sum(salary) from employee;
select dno, salary from employee;
select dno, sum(salary) from employee group by dno;
This one fails, because it has no group-by clause: select dno, salary, sum(salary) from employee;
But this works:
select dno, salary, sum(salary) over() from employee;
What is it doing? Applying the sum(salary) function to all salaries. This isn't necessarily what we're looking for. But how about this:
select dno, lname, salary, sum(salary) over(partition by dno) from employee;
Now, we have a row for each employee, but the last column is a sum of that employee's department's salaries. This cannot be done with group by (at least without an inner query), because if we group by dno, we can't have an output record for each employee.
The partition clause, inside the over(), determines what we're taking the sum of. We find the partition bucket that the current record belongs to (that is, we find the employee's dno), and take the sum over all salaries in that bucket.
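Any aggregate can be used this way, not just sum. For example, here each employee row also shows the average salary within that employee's department:
select dno, lname, salary, avg(salary) over (partition by dno) from employee;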
There's also an order by clause, for things within the bucket.
select lname, salary, sum(salary) over (order by lname) from employee;
select lname, salary, sum(salary) over (order by salary desc) from employee;
In each case the sum(salary) is cumulative. The order-by clause determines the order of addition in the cumulative sum.
In the next two examples the rows we're summing are spelled out explicitly; in each case we are doing a cumulative sum. In the first case the sum is cumulative over all employees; in the second it is cumulative by department.
select lname, salary, sum(salary) over (order by lname rows between unbounded preceding and current row) from employee;
select lname, dno, salary, sum(salary) over (partition by dno order by lname rows between unbounded preceding and current row) from employee;
The role of sum() with an over() clause is quite different from its role as an aggregation function. There are a set of functions that can be used with over(), known as window functions. Many apply to the entire partition, or (in the presence of an order clause) from the beginning of the partition to the current record. Here's an example:
select lname, row_number() over(order by lname) from employee;
What happens if we omit the over()? How about finding the median? We can't reference row_number() directly in the where clause, but we can use a CTE:
with temp(lname, salary, rn) as
(select lname, salary, row_number() over (order by salary) from employee)
select * from temp where rn = (select count(*) from employee)/2;
By contrast, the direct version fails; the alias rn is not visible in the where clause:
select lname, salary, row_number() over (order by salary) as rn from employee where rn = (select count(*) from employee)/2;    -- fails
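Postgres also has a built-in ordered-set aggregate for this, percentile_cont, which computes the median directly (interpolating between the two middle values if necessary):
select percentile_cont(0.5) within group (order by salary) from employee;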
Some window functions apply to a "window frame", that is, a defined range within the ordered partition. We define this window frame with the between clause. Here's an example in which each employee is listed with the average salary of the employee, the previous employee and the following employee:
select lname, salary, avg(salary) over (order by salary rows between 1 preceding and 1 following) from employee;
The information_schema is more or less standardized; the Postgres system catalog (schema pg_catalog) is Postgres-specific. We can see these schemas with the psql command \dnS. Though we'll refer to the objects in these schemas as tables, many of them are actually views. While reading from them is reasonable, updates to system tables can destroy your database.
These tables can shed light on table structure, database ownership, access privileges, and the statistics used in query planning; the examples below touch on each of these.
For some of these examples, the following psql option is useful:
\x auto
It forces the printing of overly wide tables in a more vertical format.
Disable with \x off.
All the postgres commands, like \d and \d table, get their information
from information_schema and pg_catalog queries.
First we'll use the table information_schema.columns:
select column_name, table_schema, data_type
from information_schema.columns where table_name = 'employee';
select * from information_schema.columns where table_name = 'employee' and column_name = 'ssn';
This gives the "normal" columns of table "employee", but does not give the system columns. For that we'll use the pg_class table. We can see the columns of this table with \d pg_class.
select oid, relpages, reltuples FROM pg_class WHERE relname = 'employee';
This is the output (on my system, right now):
  oid  | relpages | reltuples
-------+----------+-----------
 16386 |        1 |         9
The oid (not printed with "select *"!) is the so-called table oid: an internal identifier of the table. The other two columns are estimates of the number of pages and number of tuples.
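A quick way to get a table's oid directly is a cast to regclass:
select 'employee'::regclass::oid;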
Now let's try a query that is comparable to the above information_schema.columns query.
select a.attname, a.atttypid, t.typname from pg_class c join pg_attribute a on c.oid=a.attrelid join pg_type t on t.oid = a.atttypid where c.relname = 'employee';
Here's the output.
  attname  | atttypid | typname
-----------+----------+---------
 tableoid  |       26 | oid
 cmax      |       29 | cid
 xmax      |       28 | xid
 cmin      |       29 | cid
 xmin      |       28 | xid
 ctid      |       27 | tid
 fname     |     1043 | varchar
 minit     |     1042 | bpchar
 lname     |     1043 | varchar
 ssn       |     1042 | bpchar
 bdate     |     1082 | date
 address   |     1043 | varchar
 sex       |     1042 | bpchar
 salary    |     1700 | numeric
 super_ssn |     1042 | bpchar
 dno       |       23 | int4
Note the first six: these are the system columns.
select datname from pg_database;
select datname, datdba, encoding from pg_database order by datname;
How about the name and owner of databases? This works, if you have DBA privileges:
select d.datname, d.datdba, a.rolname, pg_encoding_to_char(d.encoding) from pg_database d join pg_authid a on d.datdba = a.oid order by d.datname;
But if you're not a privileged user, use pg_roles instead of pg_authid; the former does not have the rolpassword field (which can be used to hack passwords).
select d.datname, d.datdba, a.rolname, pg_encoding_to_char(d.encoding) from pg_database d join pg_roles a on d.datdba = a.oid order by d.datname;
Access to tables is created through specific "grants". Here we view the grants on a table:
select grantee, privilege_type from information_schema.role_table_grants
where table_name = 'employee';
Here are the grantors and grantees:
select * from information_schema.table_privileges where table_name = 'employee';
select * from information_schema.column_privileges where table_name = 'employee' and column_name = 'ssn';
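To watch a grant appear in these tables, we can grant a privilege and re-run the query; here alice is a hypothetical role:
grant select on employee to alice;       -- alice is a hypothetical role
select grantee, privilege_type from information_schema.role_table_grants where table_name = 'employee';
revoke select on employee from alice;    -- undo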
This is where Postgres keeps its information on where-clause selectivity. Some information: postgresql.org/docs/9.6/static/row-estimation-examples.html.
The pg_stats table is where postgres stores the statistics it has gathered to assist in query planning. The histogram_bounds attribute contains a histogram describing the data distribution. If histogram_bounds is null, there is no histogram data available to the query planner! We'll start with the Dellstore database (named "dell" on my machine):
select tablename, attname from pg_stats where schemaname = 'public' and histogram_bounds is null;
 tablename  |       attname
------------+----------------------
 orderlines | quantity
 orderlines | orderlineid
 products   | special
 products   | price                  <--- not integer
 products   | category
 customers  | gender
 customers  | income
 customers  | age
 customers  | password
 customers  | creditcardexpiration
 customers  | creditcardtype
 customers  | region
 customers  | country
 customers  | state
 customers  | address2
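These statistics are gathered by the analyze command. If histogram_bounds is unexpectedly null, running analyze and re-querying may populate it; pg_stats also records related statistics such as n_distinct and null_frac:
analyze customers;
select attname, n_distinct, null_frac from pg_stats where tablename = 'customers' and attname = 'zip';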
Now let's see which attributes do have histogram data:
select tablename, attname from pg_stats where schemaname = 'public' and histogram_bounds is not null;
 tablename  |    attname
------------+----------------
 orderlines | orderid
 orderlines | prod_id
 orderlines | orderdate
 inventory  | prod_id
 inventory  | quan_in_stock
 inventory  | sales
 products   | prod_id
 products   | title
 products   | actor
 products   | common_prod_id
 cust_hist  | customerid
 cust_hist  | orderid
 cust_hist  | prod_id
 customers  | customerid
 customers  | firstname
 customers  | lastname
 customers  | address1
 customers  | city
 customers  | zip
 customers  | email
 customers  | phone
 customers  | creditcard
 customers  | username
 orders     | orderid
 orders     | orderdate
 orders     | customerid
 orders     | netamount
 orders     | tax
 orders     | totalamount
Let's look at a couple of these:
select histogram_bounds from pg_stats where tablename = 'orders' and attname = 'totalamount';
select histogram_bounds from pg_stats where tablename = 'customers' and attname = 'zip';
Now let's try some queries on these attributes, with differing levels of selectivity:
explain select * from customers where customerid between 2000 and 2400;     -- index scan
explain select * from customers where customerid between 2000 and 15000;    -- still an index scan
explain select * from customers where customerid between 2000 and 20000;    -- now a sequential scan
The cutoff seems to be somewhere around 15893. [!]
What if we try this with zip:
explain select * from customers where zip between 60000 and 62000;
Selectivity doesn't matter as there's no index on zip. We could add it, though.
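A sketch of adding such an index (the name customers_zip_idx is our own choice), after which the planner can consider an index scan for sufficiently selective zip ranges:
create index customers_zip_idx on customers(zip);
explain select * from customers where zip between 60000 and 62000;
drop index customers_zip_idx;    -- undo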
Now let's combine two where conditions:
explain select * from customers where zip between 60000 and 62000 and customerid between 2000 and 2400;
We start with an index scan on customerid.
explain select * from customers where firstname between 'P' and 'Q';
We get a sequential scan. There's no index on firstname.
explain select * from customers where zip between 60000 and 62000 and firstname between 'P' and 'Q';
Again, there is no useful index.
Next we run explain analyze. This actually runs the query.
explain analyze select * from customers where zip between 60000 and 62000 and firstname between 'P' and 'Q';
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Seq Scan on customers  (cost=0.00..888.00 rows=9 width=268) (actual time=0.241..3.498 rows=13 loops=1)
   Filter: ((zip >= 60000) AND (zip <= 62000) AND ((firstname)::text >= 'P'::text) AND ((firstname)::text <= 'Q'::text))
   Rows Removed by Filter: 19987
 Planning time: 0.158 ms
 Execution time: 3.520 ms
Note that the estimator thinks there will be 9 rows, but the actual number
of rows is 13.
Here's another:
explain analyze select * from customers where zip between 60000 and 70000 and firstname between 'P' and 'T';
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Seq Scan on customers  (cost=0.00..888.00 rows=174 width=268) (actual time=0.579..47.226 rows=185 loops=1)
   Filter: ((zip >= 60000) AND (zip <= 70000) AND ((firstname)::text >= 'P'::text) AND ((firstname)::text <= 'T'::text))
   Rows Removed by Filter: 19815
 Planning time: 0.138 ms
 Execution time: 47.303 ms
The estimator thinks 174 but really there are 185.