There are no install instructions in the .zip file, but there is a GitHub wiki. That wiki, however, has obsolete instructions. There is an open issue, though, and the comments on that issue include fixes for them.
I stopped messing with it once I saw how they wanted me to install something into root's crontab.
My hypothesis: postgresql has none of these jazzy monitoring tools because it never crashes.
Sure, as traffic grows, slow queries need to be optimized, crappy N+1 ORMs need to be replaced, and boxes need more RAM and faster disks, but when was the last time you heard of anyone complain about postgresql or blame it for any downtime?
/snark
I've certainly had PostgreSQL crash and all it took was someone in a Verizon datacenter accidentally yanking out the cables connecting the server to its RAID arrays. It even crashed a second time when someone "tripped over a cable" and took out the arrays again and a third time when the redundant power supplies turned out not to be...
OTOH, complete loss of storage resulted in sub-second data loss with automatic recovery the first two times, on a five-year-old Pg installation, and minimal issues the third time. The master had a three-year uptime on the main server process when it went down. Overall I'm pretty comfortable with that track record.
I want to be able to scale out by adding more machines. I want to be able to fail over automatically to another data center when the first one goes down. I have yet to see a straightforward way to accomplish this with PG. Their wiki lists a bunch of tools related to this, but as far as I can tell they are either abandoned or don't cover this. I can't understand all the positive things I read about PG.
The promotion is done by rebinding the URL of the database and restarting the app. This shares a mechanism neatly with changing passwords, which is one reason we decided it was worth throwing out network-transparent orthodoxy when it came to HA: the clients must be controlled anyway to deal with security considerations.
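A minimal sketch of what that looks like operationally, assuming the app reads its connection string from an environment variable (DATABASE_URL and myapp below are placeholder names, not anything from the comment above):

    # point the app at the newly promoted database, then bounce it
    export DATABASE_URL="postgres://app:secret@new-primary.example.com:5432/appdb"
    systemctl restart myapp    # or however the app is supervised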
One approach I used when we migrated data centres, to avoid having to manage the timing of IP address changes in our apps, was to use haproxy.
We configured slaves in the new data centre, set up an haproxy there that used the old data centre databases as the main backend and the new data centre databases as the backup, changed the apps to point at haproxy, shut down the masters and let haproxy shift all the traffic, then promoted the slaves once we were certain they'd caught up. We had a period of seconds where the sites were effectively read-only, but that was it.
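Roughly, the haproxy side of that setup can be sketched like this (hostnames and ports are illustrative, not the real ones); the "backup" keyword is what keeps the new data centre's databases idle until the old masters stop passing health checks:

    listen postgres
        bind 0.0.0.0:5432
        mode tcp
        option tcplog
        # old DC keeps the traffic while it is reachable
        server old_dc_master old-dc-db.example.com:5432 check
        # "backup": only receives connections once the server above is down
        server new_dc_slave  new-dc-db.example.com:5432 check backup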
We're planning to roll out haproxy with keepalived or ucarp to mediate between a lot more of our backend services.
> I want to be able to failover automatically to another data center when the first one goes down. I have yet to see a straight forward way to accomplish this with PG
Of all the issues that could cause a machine to decide that a failover is needed, most of the root causes actually make a failover undesirable: a hardware failure, for example (where failover is good), is way less likely than unreachability due to load (where failover is disastrous), and unless you are very careful, an automated solution will act the same way in both cases.
Add to that the huge cost of failing back, during which time there's no more slave to fail over to: until 9.4 is released, failing back requires you to copy all the data back to the failed master at the file-system level to bring it back up as a slave.
After 9.4, re-synchronizing an old failed master to the new master will actually be possible in most cases (and a mistaken failover is usually one of those cases).
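(If I read that right, it refers to what pg_rewind makes possible, relying on wal_log_hints or data checksums being enabled; a rough failback sketch under that assumption, with made-up paths and hostnames:)

    # the failed master must be shut down cleanly first
    pg_ctl -D /var/lib/postgresql/9.4/main stop
    # rewind its data directory so it can re-sync from the new master
    pg_rewind --target-pgdata=/var/lib/postgresql/9.4/main \
              --source-server="host=new-master.example.com port=5432 user=repl dbname=postgres"
    # then configure it as a slave of the new master and start it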
In case of an emergency, first make sure that a failover would actually help (if you're down because of high load or a misconfiguration of your system, failing over won't help but will only make things worse), then fail over manually.
As I said, there are way fewer possible emergencies where failing over would help than there are emergencies where failing over would actually cause more damage.
This is valid for all non-master-master database configurations I've had to deal with so far, but, again, it's even more pronounced with Postgres because of the very time-consuming (and bandwidth-consuming, which can mean costly when you cross the public internet) failback, during which you have nowhere to fail over to again.
If you really, really want to do it, have a look at pgpool (http://www.pgpool.net/), which can automatically fail over to a slave and which can also load-balance reads across one or more slaves. It's quite the out-of-the-box solution.
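A minimal pgpool.conf sketch of that kind of setup (hostnames and the failover script are invented; parameter names are from pgpool-II as I recall them):

    backend_hostname0 = 'db-master.example.com'
    backend_port0     = 5432
    backend_weight0   = 1
    backend_hostname1 = 'db-slave.example.com'
    backend_port1     = 5432
    backend_weight1   = 1
    # spread read-only queries across the backends
    load_balance_mode = on
    # placeholder script that promotes a slave when the master disappears
    failover_command  = '/usr/local/bin/promote_slave.sh %d %H'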
As for scaling horizontally at the DB level, that can be achieved with foreign data wrappers calling out to other servers, but it isn't built in.
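As a sketch of what "calling out" looks like with postgres_fdw (server name, credentials and columns are invented for illustration):

    CREATE EXTENSION postgres_fdw;

    CREATE SERVER shard_2
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'shard2.example.com', port '5432', dbname 'appdb');

    CREATE USER MAPPING FOR app_user
        SERVER shard_2
        OPTIONS (user 'app_user', password 'secret');

    -- queries against this table are shipped to the remote node
    CREATE FOREIGN TABLE events_shard_2 (
        id         bigint,
        user_id    bigint,
        created_at timestamptz
    )
    SERVER shard_2
    OPTIONS (schema_name 'public', table_name 'events');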
While Postgres doesn't rival Oracle and MS SQL Server in feature checkboxes, it's a very solid DBMS with many advanced SQL features, and it's free and open source. You can do a lot without hitting the scaling problems.
Well, in the world of free software you have a choice - an easily-scaled database system that offers you few guarantees and is very hard to code safely against, or one that is harder to scale but easier to code safely against. That doesn't make PG bad - it's just a choice you make. If you want to pay, you can get a little bit closer to having both.
Of course you hear lines like the one you quoted all the time - that's because they're true. The overwhelming majority of applications don't have needs beyond a master-slave pair of DB systems, and trends are moving further in that direction every day as RAM gets cheaper. Stack Overflow runs on a single not-that-beefy master-slave DB setup.
Typical web applications are insanely amenable to read caches, and that's where you do most of your scaling. If you're at the point where your scaling needs truly exceed a large single DB system, I'd hope you have some money to throw at the problem.
Imagine my users table has a unique constraint on email address. Or my hotel_room_reservations table has to guarantee I never store overlapping reservations for the same room. In either case, on insert and on update, I need to check the new or updated row vs a bunch of other rows.
Those scenarios make horizontal scaling not so simple. Sure, get rid of those constraints, and you can insert as fast as your network can ship the bytes. But that's not a database anymore. That's a glorified flat file.
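For reference, both constraints are trivial to state on a single node; the reservations one is an exclusion constraint, which needs btree_gist for the equality part (table layout here is purely illustrative):

    CREATE EXTENSION btree_gist;

    CREATE TABLE users (
        id    serial PRIMARY KEY,
        email text NOT NULL UNIQUE
    );

    CREATE TABLE hotel_room_reservations (
        id      serial PRIMARY KEY,
        room_id int NOT NULL,
        during  tstzrange NOT NULL,
        -- no two rows may have the same room and overlapping ranges
        EXCLUDE USING gist (room_id WITH =, during WITH &&)
    );

Enforcing either of these means looking at other rows at write time, which is exactly what naive horizontal partitioning gives up.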
I agree that the documentation still needs a lot of work.
However, it doesn't say or suggest to set up any crontab as root. As the only requirement is to have a stored procedure called regularly, you can use the local postgres user's crontab, for example, or any other place. I will clarify this point in the documentation as soon as possible.
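For example, something along these lines in the postgres user's crontab would do (the database and function names below are placeholders, not the project's actual ones; check the docs for the real call):

    # edit with: sudo -u postgres crontab -e
    # call the maintenance procedure every 5 minutes
    */5 * * * * psql -d opm -c "SELECT run_collection()" >/dev/null 2>&1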
I'm involved in both projects. Here's some insight:
OPM and POWA are not related. They were born at different times and were developed by two different R&D teams at DALIBO, a French PostgreSQL company. The fact that they've been released publicly at almost the same time is merely a coincidence.
OPM is designed to help a DBA manage a very large number of servers. It provides answers to the questions: "Is there a problem somewhere on my PostgreSQL servers? Which instances need improvement?"
POWA is more focused on showing the "real-time" activity of one specific instance. It provides answers to questions like: "Which queries are slowing down my server right now?"
If you're familiar with Oracle products, the difference between OPM and POWA is more-or-less the difference between Oracle AWR and Oracle Grid Control.
In a nutshell, if you only have one or two PostgreSQL servers in production, then POWA is easier to install and probably enough for you. If you want more stats and an overview of your servers, then OPM is a better choice.
Both projects are autonomous and moving fast. We're just pushing code on github without direct economic goals or marketing plans :-) In the long run we'll see which one finds its user base and how they evolve.
That being said, we also have other people working on similar PostgreSQL tools, in particular:
pgBadger ( http://dalibo.github.io/pgbadger/ ) is a log analyzer that provides rich reports of database activity. Basically it answers the question: "Which queries should be optimized?" (A typical invocation is sketched after the next item.)
pgCluu ( http://pgcluu.darold.net/ ) is an audit/reporting tool, more focused on system and resource utilization. It answers questions such as: "What's the load on my CPUs? Is my memory swapping?"
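To give a concrete taste of the pgBadger invocation mentioned above (the log path is just an example, and the log needs to include query durations, e.g. via log_min_duration_statement):

    pgbadger /var/log/postgresql/postgresql-9.3-main.log -o /var/www/reports/pgbadger.html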
So basically, regarding PostgreSQL monitoring, we're just moving in every possible direction at the same time... Some say it's a "waste of time"; we prefer the term "Darwinian Approach" :P
Minor grammar nitpick: "Open PostgreSQL Monitoring is a free software" is incorrect English, as "software" is uncountable: we don't say "a software", but a program, a system, or something like that. You could also say "is free software".
I am unsure what some of the graphs measure and what the labels mean when viewing the demo. Tooltips, descriptions, etc. would be helpful in my opinion. Overall it seems like a great start though.
For all the PGSQL graphs, the meaning of each graph and label is (only :/) documented in the check_pgactivity "--help" output, as that is the direct source of these graphs. Sadly, we haven't had time yet to write proper documentation on the wiki page or in the UI. We'll add better documentation as soon as possible.
The SYSTM graphs correspond to standard Nagios system checks.