Like most consultants, I’m using more and more open source software. A big change for me was to use PostgreSQL following 20 years experience of MS SQL Server.
I’m no DBA, so user administration etc is of no great interest, I’m only interested in retrieving and loading data. What follows is part comparison, part ‘things I wish I’d known before I started’. I’ve skipped over some of the technical details, they’re available elsewhere from more knowledgeable people.
I hope it’s useful, please leave comments at the bottom. I’d love to be told I’m wrong about some of this, I’d also like to hear about any changes you’ve found.
Definitely top of the list of things I wish I’d known! PostgreSQL changes everything to lower case unless you quote the text. That means that if you create a table called
"ForgetAboutUsingCamelCase", it fails if you use
SELECT * FROM ForgetAboutUsingCamelCaseas the table name would be interpreted as
forgetaboutusingcamelcase. You’d need to use
SELECT * FROM "ForgetAboutUsingCamelCase". Instead, stick to using
forget_about_camel_caseand just live with lots of underscores everywhere.
- Bite the bullet, use Linux
We’re all happy in our Windows comfort zone but in the open source world, Linux is King. Some PostgreSQL addons are only available for Linux too. I even managed to find a performance setting that will instantly break an installation on Windows.
- Every table’s a heap!
I’m so used to the notion that a table is stored in the order of the clustered index (typically primary key) that this came as a shock. New rows are just added to the end of the heap. Updates are added at the end with the existing row just marked for later deletion. You can cluster a table on an index but the table order will only match for that initial clustering (you can cluster again later). There’s a process called Vacuuming that tidies up behind the scenes.
The really clever part is that each row is marked with an incrementing ID. That means a query can pick up a slightly older version of a row if it needs to. The advantage of this method is that there are fewer reasons to lock pages and no page splits as data is inserted or updated.
- Indexed Views vs Materialised Views
I always like the idea of Indexed Views in SQL Server. Just create a regular view, make sure it has a primary key, add an index to the primary key. You immediately have a cached copy of the data for the view. Better still, it’s updated automatically as data is added to the featured tables. Sounds great, but there’s no outer joins, unions or sub-queries. From experience, if my query doesn’t feature any of those, it doesn’t need a performance improvement, so they’re rarely used!
PostgreSQL has Materialised Views instead. These are a point in time snapshot of the query result. No automatic updates but no limitations in what you can use in the SQL. Same query definition, just add MATERIALIZED at the start and WITH DATA at the end. I’ve found that the easiest way to manage the updates is to write a simple stored procedure that’s run each night.
- Materialised Views 2 – ODBC
ODBC (not sure about JDBC) doesn’t show the materialised views by default. There are settings that can be changed, but do you really want the admin headache of doing that? I’ve been creating materialised views with a m_v_ prefix, then creating a standard view that queries the materialised views. Another advantage of that is that you can UNION multiple materialised views to save refresh time as you only refresh recent data.
- Views are ‘schema-bound’
Possibly the most irritating feature of PostgreSQL. If you create a view (view1) that references another view (view2), you can’t recreate view2. You can add columns to view2 but to make any significant changes you need to drop both views and recreate them. Similarly, you can’t make significant changes to any table referenced by those views without dropping everything.
- Table data types can’t be changed
Same reason as above. MS SQL does a good job of this as it does some clever stuff in the background and copies the data to a new table. In PostgreSQL, you end up just adding a new column with the correct type as recreating all those views is just too much bother.
- PgAdmin isn’t SQL Server Management Studio
Sorry, it just isn’t as good. There’s no option to switch connections for a query and it’s a bit clunky. If you have another query tool you like using, stick to that.
- Learn about Indexes
SQL Server makes you so lazy! Slow query, ask for index recommendation, do a ‘sanity check’, create index. There is a tool that does something similar in PosgreSQL but it’s very, very expensive for a small installation. Learn how indexes work and how to spot potential candidates. It will make you a better SQL Server consultant too.
- A new index type – BRIN
Clever little index, just stores the highest and lowest value for a column in each page. Great for indexing dates if your table is approximately in date order. Just bear in mind that you need a lot of data before the index is used.
- Where are all my comments?
Like all well behaved consultants, I write lots of comments in my view and table definitions. MS SQL saves your definition text including all your comments for posterity. PostgreSQL just creates the view and discards everything else. When you request the view definition, PostgreSQL creates the query from the view itself. All your comments have disappeared. You can add comments using SQL but that just adds a comment to the view itself, not in any specific place.
- What’s happened to VARCHAR?
In MS SQL, you create a column with type VARCHAR (or NVARCHAR) with a length. In PostgreSQL, the length is optional and (oddly) performance is supposed to be better if you don’t specify a length as there’s one less thing to check on insert. That’s fine except the ODBC then returns the data as the MS SQL equivalent of TEXT which can’t be manipulated by some products (Crystal Reports in this case).
- Why is my view returning TEXT instead of VARCHAR?
If you have a VARCHAR column and do any manipulation (concatenation, substring etc), the result is TEXT. You have to cast the result to VARCHAR. When you create a view, cast the result of each column to the correct data type. That includes any numeric columns. Saves a load of work later especially as you can’t change the data type in a table or view without dropping first.
- What are all these colons?
They’re short-hand for ‘CAST’. They’re a bit odd at first and are often unneeded but you’ll get used to it.
- Casts have to be explicit
In MS SQL, it’s quite accepted to do something like
SELECT * from table where datefield > '2022-01-01'MS SQL automatically casts the string to a date or datetime. In PostgreSQL, you have to do
SELECT * from table where datefield > 2022-01-01'::timestamp
- Text is case sensitive
This is a real pain in your WHERE clause, especially if the source data is a bit iffy. You can make a column not case sensitive (for sort order), but then you can’t filter on it! Best solution is to fix the source data. If that’s not possible, create an index on the upper case of the column, then filter using upper case. You could do something in a view, just watch the performance.
- No COLUMNSTORE Indexes
There is an old project that adds them but nothing’s been done with it for a while. There’s also a commercial add-on. Hopefully, they’ll be added natively in the future.