PostgreSQL: updating millions of rows


Chris Skorlinski, Microsoft SQL Server Escalation Services. Yes, I mostly post Replication or High Availability topics, but I found this interesting enough to share on my Repl Talk blog.

The most efficient solution is to use two queries: the first calculates a random ID, and the second retrieves the corresponding row.

    gab=# SELECT CASE WHEN id = 0 THEN 1 ELSE id END
          FROM (SELECT ROUND(RANDOM() * (SELECT MAX(id) FROM big_data)) as id) as r;
        id
    ----------
     45125146
    (1 row)

    Time: 0.511 ms

    gab=# SELECT * FROM big_data WHERE id = 45125146;
        id    |            some_data
    ----------+----------------------------------
     45125146 | 5589c2f8f711ce9c149b5d2e05b99afb
    (1 row)

    Time: 2.115 ms

There is no way to keep PostgreSQL from scanning more than one row in a single query (neither Common Table Expressions nor JOINs will solve this issue).
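For contrast, the naive single-query approach that the two-query trick avoids looks like this (my own illustration, reusing the big_data table from above):

    -- Forces PostgreSQL to read every row and sort the whole table
    SELECT * FROM big_data
    ORDER BY RANDOM()
    LIMIT 1;

ORDER BY RANDOM() assigns a random sort key to every row before returning a single result, so on tens of millions of rows it takes orders of magnitude longer than the two indexed lookups shown above.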

Using two queries is acceptable; however, this solution has a major flaw: if any row was created and later deleted, the first query might calculate an ID that no longer exists in the table.
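A common workaround for that flaw (a sketch of my own, not from the original post) is to relax the second query to fetch the first existing row at or above the computed ID; an index on id keeps the lookup cheap:

    SELECT * FROM big_data
    WHERE id >= 45125146   -- the ID computed by the first query
    ORDER BY id
    LIMIT 1;

This never returns an empty result as long as the computed ID does not exceed MAX(id), at the cost of slightly biasing selection toward rows that sit just after large gaps.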

The UPDATE was estimated to take DAYS, not HOURS, to complete. We were called in to explore how to make this update run faster.
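One standard way to make such an UPDATE tractable (a sketch of my own, not necessarily the fix these engineers applied) is to split it into small batches so each transaction commits quickly and the log can be reused. A minimal T-SQL sketch, with a hypothetical table and flag column:

    DECLARE @rows INT = 1;
    WHILE @rows > 0
    BEGIN
        -- Touch at most 10,000 rows per iteration; loop until no rows qualify.
        UPDATE TOP (10000) dbo.big_table
        SET processed = 1
        WHERE processed = 0;
        SET @rows = @@ROWCOUNT;
    END;

Each batch commits on its own, so a failure midway loses only the current batch, locks are held briefly, and the transaction log stays small, whereas a single monolithic UPDATE holds locks and log space for the entire run.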

We learned the database had been restored from SQL Server 2000 to SQL Server 2008.

To clarify, when I say “lots of rows,” I’m talking about millions of rows from a delimited text file.
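At that scale, loading row by row with individual INSERT statements is impractical; a delimited file is normally loaded in bulk. A hypothetical BULK INSERT sketch (the path, table name, and delimiter are my own):

    BULK INSERT dbo.big_table
    FROM 'C:\data\big_file.txt'
    WITH (
        FIELDTERMINATOR = '|',      -- column delimiter in the source file
        ROWTERMINATOR   = '\n',
        BATCHSIZE       = 100000,   -- commit every 100,000 rows
        TABLOCK                     -- enables minimally logged loading
    );

BATCHSIZE keeps individual transactions small, and TABLOCK allows SQL Server to minimally log the load under the bulk-logged or simple recovery model, both of which matter when the file holds millions of rows.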