That Needle In The Haystack

There are times in technology when something goes wrong.  It happens.  It’s the cost of progress.

It can be a lot of fun finding what caused things to go wrong and then fixing it.

Follow this trip, for instance.  Hopefully it helps someone else with Google-hosted addresses that forward to an internal mailing list server running Mailman.

 

The story starts with a third-party application that included a Mailman mailing list installation which, with its own custom hooks, worked well.  The lists were not used much, perhaps annually, but they did the job.

The organization moved all of its mail to Google over a year ago and, because of the custom nature of this application and its mailing lists, those lists remained on this server.  Then it gets reported that mail to those lists is going into a veritable black hole: completely disappearing, never to be seen again.

Get out the shovel and start digging through the layers.  Verify that outbound mail is working from the command line and from the Mailman server (subscribe yourself from the web interface), and verify that inbound email is making it to the server.  All of that checks out, but you are left with a log entry in /var/log/mailman/vette which reads:

Apr 23 12:45:56 2013 (3254) Message discarded, msgid: <5176BAAE.5040004@my.domain.com>, handler: Approve
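
If the vette log is busy, it can help to pull out just the discards and see which handler dropped them.  Here is a minimal Python sketch that parses a line shaped like the one above.  The pattern is modeled on that single entry only, so treat it as an assumption: other Mailman versions or other vette messages may be formatted differently, and the function name parse_discard is made up for this example.

```python
import re

# Pattern modeled on the one vette entry quoted in this post (an assumption;
# other entries in the same log may not follow this shape).
VETTE_DISCARD = re.compile(
    r"^(?P<when>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) "
    r"\((?P<pid>\d+)\) Message discarded, "
    r"msgid: (?P<msgid><[^>]*>), handler: (?P<handler>\w+)\s*$"
)


def parse_discard(line):
    """Return the fields of a 'Message discarded' line as a dict, else None."""
    match = VETTE_DISCARD.match(line)
    return match.groupdict() if match else None


sample = (
    "Apr 23 12:45:56 2013 (3254) Message discarded, "
    "msgid: <5176BAAE.5040004@my.domain.com>, handler: Approve"
)
entry = parse_discard(sample)
print(entry["handler"], entry["msgid"])
```

Feeding every line of the vette file through parse_discard and skipping the None results gives a quick list of message IDs to chase through /var/log/maillog.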

At least we know it is making it to Mailman before being dropped on the floor.  Google time!  Only to discover that your keywords “mailman”, “vette”, “message discarded” and “handler: approve” come up with a lot of hits.  Weeding through them takes a bit of time, but no fruit is yielded.  Refining that search doesn’t change too much.

Take a break and accomplish a few other things, then after 10 minutes, come back.  Search again.  Read.  Research.  Close the overabundance of tabs that were left open.  Open a few more.  Refine that search one last time and the message you were looking for is there: http://mail.python.org/pipermail/mailman-users/2009-September/067226.html

Reading through it, it looks and sounds like what you are experiencing.  And there is actually an intelligent answer in that posting.  OK.  Time to prove we have a match.

Back to Mailman.  Check out the Postfix configuration.  Being a little rusty with Postfix, and not being the person who built this, it’s a little daunting since there are references to /etc/mailman/aliases (wha?!).  Fiddle with the aliases there, only to find out that it is really using /etc/aliases and some custom version of newaliases.  Finally get that working such that any inbound email to the list gets delivered to the root user instead.

Fire off a couple of emails to the list and tail /var/log/maillog.  Scratch your head as you watch the message get pushed off to Mailman when you explicitly told it not to.  Dig, dig, and find that someone created a cronjob to rebuild /etc/aliases every 30 minutes.  Look at the clock: it’s 32 minutes after.  Fix the aliases file again, run newaliases, restart Postfix, and fire off your test email again.

Note that you are now on test email #8, each of which includes the line “Please ignore and discard this message.  There is no need to respond.”

Watching the logs, you see the message come through and get saved.  Checking out /var/mail/root, you see the “X-BeenThere” header mentioned in the posting.  Voilà.  That’s the problem!
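
Rather than paging through the spool by eye, the header check can be scripted.  This is a rough sketch using Python's standard mailbox module; it assumes the spool is a plain mbox file (which /var/mail/root usually is), and the function name been_there_values is invented for this example.

```python
import mailbox


def been_there_values(mbox_path):
    """Return (subject, [X-BeenThere values]) for every message in an mbox."""
    # create=False: fail loudly instead of creating an empty spool file.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return [
            (msg.get("Subject", ""), msg.get_all("X-BeenThere", []))
            for msg in box
        ]
    finally:
        box.close()


# Example use on the mail server (needs read access to the spool):
#   for subject, values in been_there_values("/var/mail/root"):
#       if values:
#           print(subject, "->", values)
```

Any test message that shows up with a non-empty list already carried the header before Mailman ever touched it.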

Clean-up time.  Revert /etc/aliases, despite knowing that a cronjob will revert it for you anyhow, and restart Postfix.  Back up

/usr/lib/mailman/Mailman/Handlers/Approve.py

and modify the last few lines to include a custom version of the “x-beenthere” check.  Move the .pyc and .pyo files to the same location you stored your backup, so the stale compiled copies cannot be loaded in place of your change.  Restart Mailman.  Restart Postfix just for kicks.  Run a test through, number 9, and you watch it come in, get processed, and a slew of emails go out.  Hooray.

Notify everyone!  Oh, and then you check everything back into Puppet to make sure you don’t have to remember this stuff, and, while you are doing so, start deleting all of the unsolicited responses to your test message.
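
For anyone wondering what an “x-beenthere” check even does, here is a standalone illustration.  It is not the code from Approve.py and not the modification made on this server (those lines are not reproduced here); it only sketches the general idea of a loop check: a message counts as already posted when the list's own address appears in one of its X-BeenThere headers.  The function name already_posted and the address staff-list@my.domain.com are made up for the example.

```python
from email import message_from_string


def already_posted(msg, list_address):
    """True if list_address appears in any X-BeenThere header of msg."""
    seen = [value.strip().lower() for value in msg.get_all("X-BeenThere", [])]
    return list_address.strip().lower() in seen


# Hypothetical message that arrives with the header already set.
raw = (
    "From: someone@my.domain.com\n"
    "To: staff-list@my.domain.com\n"
    "X-BeenThere: staff-list@my.domain.com\n"
    "Subject: Please ignore\n"
    "\n"
    "Please ignore and discard this message.\n"
)
msg = message_from_string(raw)

print(already_posted(msg, "staff-list@my.domain.com"))  # True
print(already_posted(msg, "other-list@my.domain.com"))  # False
```

A check like this is a sensible guard against mail loops, which is why a message that shows up already stamped with the list's address gets quietly discarded.  Any custom version has to decide when that stamp can be trusted and when it cannot.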

Fedora 18 “Spherical Cow” – What happened here?

[EDIT: Further examination revealed that a majority of the problems experienced were related to the glibc + kernel included with Fedora 18 and the updated glibc+kernel for Fedora 17.  The result was the same in both situations: unstable VIA and IVTV drivers.]

 

Fedora 18 has the release name “Spherical Cow”.  It should probably be called “Spherical Dung”.

 

“Why?” you might ask.

 

Ok, here goes:

From an end user perspective, Fedora 18 is an effective desktop that continues to force certain elements to evolve (*cough* Gnome 3 *cough*), and while that is welcome news, it doesn’t quite cut it.

From a systems administration point of view, Fedora 18 makes some great strides in cleaning up the messy transition from SysV to systemd, eliminating some of the annoyances, and attempting to make the SA’s life easier at the same time it makes it harder (holy dependency hell.. try removing NetworkManager for one).  That’s ok, because that’s what the Fedora tree is all about: being on the leading edge (not bleeding.. or hemorrhaging like with Rawhide).

I heard all of the rumors about Fedora 18 being the buggiest release ever, but having heard all of that before, my level of skepticism was high.  Unfortunately, after a week of battling with a test server, I have resigned myself to agreeing.

In my case, specifically, I was testing on a two-year-old system: dual-core Athlon X2 4000+, 4GB RAM, one IDE drive, two SATA drives, MSI motherboard with a VIA chipset.  This system ran fine under Fedora 16, and I followed the upgrade path (anaconda to Fedora 17, then fedup (bad name, BTW.. whoever came up with that should be hit with a clue-by-four)).

I experienced problems right out of the gate.  The graphical (X) interface would not come up and, once it had been attempted, nothing would display on the screen at all, even if the runlevel were brought back down to 3.  Drives randomly reported DRTY errors and locked up the system, and, after all of that, even the NIC would report physical errors and go off-line.  This could happen right away or under high I/O load.  Either way, it was painful.

Thinking the upgrade was the wrong way to go, I performed a fresh install.  That helped.  Well, it only helped to change the drive errors that were reported.  Oh, and although X was working again, now the system would lock up after anywhere from 10 minutes to 3 hours.

Thoughts of the motherboard having died came about, and trying all of this in another system did quiet things down a bit (different chipset, I might add), but the capture cards in the system now started reporting that they, too, were locked and inaccessible.  Rolling back the kernel, even to an F16 kernel, did not help.

The above was a painful week.  Finally giving up and installing Fedora 17 quieted everything down and the system has been running stable ever since.

 

Edit: I’m now running Fedora 19 and had to abandon the motherboard I was using.  Supposedly, the VIA chipset on that board has horrible driver support in later kernels and was the ultimate source of the problem.

EMR and Healthcare – Ergonomics

Some time ago I was involved in a huge electronic medical records (EMR) roll-out whose project management set up evaluation teams with physician involvement.  This was an excellent way to evaluate all of the aspects of the EMR and get buy-in from the end customers who were going to be impacted the most by the change from paper to electronic records.

IT plays an important role, from understanding the individual physician’s workflow to knowing all of the pieces involved in the hardware.  The impact is huge.

Dubbed “ergonomics”, the placement of the computing equipment in the examination room, the posture of the physician, and the computing equipment in use all have an impact, positive or negative, on the experience for the physician and the patient, and, in the end, on how efficient and effective a physician can be at their job.  An inefficient physician means seeing fewer patients, which in turn means making less money for the practice.

This draft document was intended to be presented to each physician practice as part of their “go-live” package, but certain elements got in the way.  I share this information here in the hopes that what was learned through an eight-month-long process could be useful for others.

 

Ergonomic White Paper – v5 DRAFT