Discussion:
Small modification for queue runners?
Michael Haardt
2004-11-30 11:41:08 UTC
Hello,

I just had an idea for improving queue run performance on larger queues:
How about having each queue runner not spawn a single delivery attempt
at a time, but a fixed number?

Right now, you can configure the number of queue runners in total, but
quite often I see them stepping on each other's toes. A single queue
runner that keeps a fixed number of deliveries running would not attempt
to deliver a message that is being tried by another queue runner, only
to find that the message is locked.

Right now, a queue runner forks a child and listens on a pipe to it.
The new queue runner would have to hold an array of pipes to listen to,
starting a new child whenever one exits, until it reaches the end of the
queue. Keeping 20 children running, the queue would be traversed only
once, rather than 20 times as with 20 queue runners of one child each.
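
In rough C, the main loop could look like the sketch below. This is an
illustration of the idea only, not Exim code: next_message() and
deliver() stand in for the real spool walking and delivery, and wait()
stands in for the array of pipes.

  /* Sketch: keep up to MAXKIDS delivery children running at once. */
  #include <sys/wait.h>
  #include <unistd.h>

  #define MAXKIDS 20

  extern const char *next_message(void);  /* next queue entry, NULL at end */
  extern void deliver(const char *id);    /* attempt one delivery */

  void run_queue(void)
  {
    const char *id;
    int kids = 0;

    while ((id = next_message()) != NULL)
    {
      if (kids == MAXKIDS) { wait(NULL); kids--; }  /* reap a finished child */
      if (fork() == 0) { deliver(id); _exit(0); }   /* child: one delivery */
      kids++;
    }
    while (kids-- > 0) wait(NULL);  /* drain the remaining children */
  }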

Just an idea. Any comments?

Michael
Philip Hazel
2004-12-01 09:48:23 UTC
Post by Michael Haardt
How about having each queue runner not spawn a single delivery attempt
at a time, but a fixed number?
Thanks for the idea, but I am not convinced that it would improve
performance much, if at all. After all, how much resource does it take
to open a file (that is already open by another process), attempt to
lock it, find that it is already locked, and so move on? That is what
happens when a queue runner checks a message that is already being
processed.
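
For illustration, the test amounts to no more than this (a rough sketch,
not Exim's actual locking code):

  /* Open the spool file and try a non-blocking lock; if that fails,
   * somebody else is already delivering the message and we move on. */
  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>

  int message_is_busy(const char *spoolfile)
  {
    struct flock fl;
    int fd = open(spoolfile, O_RDWR);
    if (fd < 0) return 1;            /* gone already: skip it */

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;             /* exclusive lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                    /* the whole file */

    if (fcntl(fd, F_SETLK, &fl) < 0) /* F_SETLK never blocks */
    {
      close(fd);
      return 1;                      /* already locked elsewhere */
    }
    close(fd);                       /* closing releases the lock */
    return 0;
  }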

Actually, what I have written is not the whole truth. Exim creates a
subprocess in which those tests take place. I suppose some resource
could be saved by doing the test before creating the subprocess.
Post by Michael Haardt
Right now, you can configure the number of queue runners in total, but
quite often I see them stepping on each other's toes. A single queue
runner that keeps a fixed number of deliveries running would not attempt
to deliver a message that is being tried by another queue runner, only
to find that the message is locked.
I don't think having a queue runner deliver n messages at once makes any
difference - for each message you still have to test whether some other
Exim process is working on it. Meanwhile, you have added considerable
complication to the queue runner code.
Post by Michael Haardt
Right now, a queue runner forks a child and listens on a pipe to it.
The new queue runner would have to hold an array of pipes to listen to,
starting a new child whenever one exits, until it reaches the end of the
queue. Keeping 20 children running, the queue would be traversed only
once, rather than 20 times as with 20 queue runners of one child each.
Circumstances have changed a lot since I started work on Exim nearly 10
years ago. The design was for an environment where over 95% of messages
are delivered right away, so queue runners are dealing with only the
problem messages, which are a small percentage. This is still true in
our environment, though the volume of email has increased enormously, so
5% covers a lot more actual messages.

The bottom line is that Exim does not perform particularly well in
environments where the queue regularly gets very large. It was never
designed for this; deliveries from the queue were always intended to be
"exceptions" rather than the norm.
--
Philip Hazel University of Cambridge Computing Service,
***@cus.cam.ac.uk Cambridge, England. Phone: +44 1223 334714.
Get the Exim 4 book: http://www.uit.co.uk/exim-book
Michael Haardt
2004-12-01 10:38:46 UTC
Post by Philip Hazel
Thanks for the idea, but I am not convinced that it would improve
performance much, if at all. After all, how much resource does it take
to open a file (that is already open by another process), attempt to
lock it, find that it is already locked, and so move on? That is what
happens when a queue runner checks a message that is already being
processed.
If many queue runners are active on a large queue, Exim appears to become
unfair, trying the same messages over and over while others sit
on the queue untouched. A central queue runner that spawns multiple
deliveries would coordinate the delivery attempts, thus introducing
more fairness.
Post by Philip Hazel
Actually, what I have written is not the whole truth. Exim creates a
subprocess in which those tests take place. I suppose some resource
could be saved by doing the test before creating the subprocess.

Queue runners would get way more efficient if they tried to obtain locks
before forking, but the unfairness issue is still there.
Post by Philip Hazel
I don't think having a queue runner deliver n messages at once makes any
difference - for each message you still have to test whether some other
Exim process is working on it. Meanwhile, you have added considerable
complication to the queue runner code.

Yes, the lock cannot be avoided that way, but hitting an already locked
message will become a rare exception, even with large queues.

Michael
Philip Hazel
2004-12-01 14:36:48 UTC
Post by Michael Haardt
If many queue runners are active on a large queue, Exim appears to become
unfair, trying the same messages over and over while others sit
on the queue untouched.
Have you got evidence for that? How do you know the messages are
untouched? Each queue runner should process the queue[*] in a "random"
order. Every queue runner should look at every message on the queue
eventually. If you have evidence otherwise, it is evidence of a bug.
Post by Michael Haardt
Queue runners would get way more efficient if they tried to obtain locks
before forking, but the unfairness issue is still there.
I have wishlisted this idea. However, I think that in practice, in most
configurations, the number of clashes will be small, and it is only when
there is a clash that this will make any difference. I don't really
think it will get "way more efficient", certainly not in most common
cases. For instance, if you have 5 queue runners, there will most likely
be 5 clashes (or maybe a few more for new messages that are being
delivered), but if you are scanning a queue of 1,000 messages that won't
be noticed.
--
Philip Hazel University of Cambridge Computing Service,
***@cus.cam.ac.uk Cambridge, England. Phone: +44 1223 334714.
Get the Exim 4 book: http://www.uit.co.uk/exim-book
Michael Haardt
2004-12-01 15:43:04 UTC
Post by Philip Hazel
Post by Michael Haardt
If many queue runners are active on a large queue, Exim appears to become
unfair, trying the same messages over and over while others sit
on the queue untouched.
Have you got evidence for that? How do you know the messages are
untouched? Each queue runner should process the queue[*] in a "random"
order. Every queue runner should look at every message on the queue
eventually. If you have evidence otherwise, it is evidence of a bug.
Sometimes I saw messages that stayed on the queue for hours. They came
in, weren't delivered instantly due to high load at the time, and it took
really long until the queue runners got around to processing them. That
means I need more queue runners, but increasing their number a lot gives
me the "Spool file already locked" problem a lot, which means queue
runners just consume CPU time without doing anything useful.

Perhaps the randomisation is not random enough, but a central queue
runner would avoid the need for it altogether.
Post by Philip Hazel
I have wishlisted this idea. However, I think that in practice, in most
configurations, the number of clashes will be small, and it is only when
there is a clash that this will make any difference. I don't really
think it will get "way more efficient", certainly not in most common
cases. For instance, if you have 5 queue runners, there will most likely
be 5 clashes (or maybe a few more for new messages that are being
delivered), but if you are scanning a queue of 1,000 messages that won't
be noticed.
I currently have 300 queue runners working on queues of between 20,000
and 100,000 messages. For an MTA not designed to do that, Exim works
fairly well, but to scale beyond that, modifications are required.

Michael
John W. Baxter
2004-12-01 17:25:46 UTC
Post by Michael Haardt
I currently have 300 queue runners working on queues of between 20,000
and 100,000 messages. For an MTA not designed to do that, Exim works
fairly well, but to scale beyond that, modifications are required.
Yikes.

We have a monitor running which alerts me when Exim's queue grows over a
configurable size (currently 500; we've had the monitor set as low as 60
or 70).

--John
Philip Hazel
2004-12-02 09:47:29 UTC
Post by Michael Haardt
I currently have 300 queue runners working on queues of between 20,000
and 100,000 messages. For an MTA not designed to do that, Exim works
fairly well, but to scale beyond that, modifications are required.
Good Grief! I am quite amazed that Exim works at all on queues that
long. And 300 queue runners! Words fail me....
From my perspective, 1000 messages is a big queue, and more than two or
three simultaneous queue runners is excessive.

Exim is just not designed to operate with queues of any great length.

Even a single queue runner operating on 20,000 messages or more is going
to run badly. For a start, it will take time and memory to create its
list of messages to process. With split_spool_directory set, it first
makes a list of subdirectories, and then it processes the subdirectories
one by one, but even then you will have 400-500 messages per
subdirectory. It will take a long time to work its way through 20,000
messages.

Why are your queues so long? If messages arrive and are not delivered
because of load, then perhaps you need more hardware or a faster
Internet connection? (I realize that cost starts to be a factor.)

I know that large ISPs that have to deal with large numbers of waiting
messages do it by using multiple servers in a two- (or more) stage
configuration. Messages come into the first-level server; if they are
not immediately delivered, fallback_hosts is used to shunt them off to
the second-level server. So the first-level hosts never have a queue of
any length, and can therefore operate efficiently on messages that can
be delivered without delay. In a three-stage system, messages that
haven't been delivered from the second-level hosts within, say, 6 hours,
are passed on to a third-level server. This is where the big queues
occur, but since the messages are already well-delayed, its performance
is not so crucial.

More Background (for anyone searching the archives)
---------------------------------------------------

Before I wrote Exim we ran Smail, but before that we ran an MTA that
used a central "queue manager" process to control all deliveries. This
was a nightmare. Because everything had to go through it, it was a
bottleneck. What was worse, however, was that it kept lists of messages
in main memory. These lists could get corrupted so that it could
"forget" that a message existed. Such messages apparently vanished, only
to reappear as if by magic when the queue manager was restarted (which
it was from time to time because it could also get stuck).

I far preferred Smail's approach, which I adopted for Exim. There is no
separate list of messages. The files on disk ARE the queue. They are
processed by independent, short-lived processes. If one such process
crashes or gets stuck, or whatever, it does not impact on the entire
email service. This seems a nice application of the KISS principle.

You could Do It Yourself
------------------------

There is nothing to stop you writing your own "Exim scheduling server"
if you want to. You can turn off Exim's starting of queue runners. Your
own server could read the spool directories to obtain a list of
messages, and if it wants to, look into the files to find the
recipients (the spool file format is documented). Your server can then
run as many subprocesses as it likes, and in each one it can run

exim -Mc <message-id>

or perhaps better

exim -q <message-id> <same-message-id>

to make it deliver in "queue run" mode. Personally, I would not be happy
with such a project because of the problems of bottlenecking and single
point of failure (and all the other problems of long running processes,
such as memory leaks).
--
Philip Hazel University of Cambridge Computing Service,
***@cus.cam.ac.uk Cambridge, England. Phone: +44 1223 334714.
Get the Exim 4 book: http://www.uit.co.uk/exim-book
Tony Finch
2004-12-02 11:10:15 UTC
Post by Philip Hazel
You could Do It Yourself
Personally, I would not be happy with such a project because of the
problems of bottlenecking and single point of failure (and all the other
problems of long running processes, such as memory leaks).
Postfix seems to be able to use this model successfully. Perhaps someone
would be able to learn some lessons from it and avoid the pitfalls.
Though we wouldn't benefit from it -- our queues are only a few hundred on
each machine. They were about ten times bigger before we started using
callout verification and largely eliminated double bounces.

Tony.
--
f.a.n.finch <***@dotat.at> http://dotat.at/
MALIN HEBRIDES: NORTHEAST 4 OR 5 INCREASING 6. RAIN LATER. GOOD BECOMING
MODERATE.
Philip Hazel
2004-12-02 14:20:57 UTC
Post by Tony Finch
Postfix seems to be able to use this model successfully.
The explanation is probably that Wietse is a much better programmer than
I am. I know that I don't understand the ramifications of the
coordination that would be necessary to write something like that.
--
Philip Hazel University of Cambridge Computing Service,
***@cus.cam.ac.uk Cambridge, England. Phone: +44 1223 334714.
Get the Exim 4 book: http://www.uit.co.uk/exim-book
Michael Haardt
2004-12-03 14:18:41 UTC
Post by Philip Hazel
Post by Tony Finch
Postfix seems to be able to use this model successfully.
The explanation is probably that Wietse is a much better programmer than
I am. I know that I don't understand the ramifications of the
coordination that would be necessary to write something like that.
Now that I have had more time to think about it, let me summarise my
partial understanding of the problem and show a simple solution:

Currently, queue runners start one delivery each. Multiple queue runners
run uncoordinated. In order to avoid locking collisions, each queue
runner randomises the list of messages before processing it.

Until now, everybody assumed that the randomisation would treat all
mails fairly. With small queues and small numbers of queue runners,
everything works that way, and efficiently so.

With larger queues and larger numbers of queue runners, the picture
changes: some mails stay on the queue pretty long. Using even more
queue runners doesn't change things a lot, but CPU usage increases badly,
up to the point where I can't run enough queue runners to get rid
of all messages. Small bursts of messages let the load rise, thus queueing
even more and making everything worse. If you remove the hints databases
at this point, the system becomes unusable, which tells me they really do
help a lot and otherwise hide the problem.

So why does this happen? This is the part I don't understand entirely yet.
I suspect that most CPU time is spent trying to deliver messages that
are either currently locked by another delivery or, what happens way
more often, that were just tried by another queue runner. Single queue
runs are very long, which means new messages are not recognised until a
new queue runner is allowed to start. In the meantime, old messages are
tried by all queue runners, each finding that its retry time has not yet
come, which wastes CPU time badly. At least that's part of the picture.
I don't see a square effect here, but something like that must be going
on, because running twice the number of queue runners lets all hell break
loose instead of merely slowing things down by half.

Thanks to Philip's hint to look at -Mc, the solution
is a small script which takes the output of exim -bpra, extracts all
message IDs, and feeds lines of "exim -Mc <id>" to a program which starts
a shell for each line of input, keeping up to a configurable number of
subshells running as long as there is enough work.

The effect is dramatic: a queue of 244,000 mails was cut down to
53,000 in two hours with 300 parallel deliveries going on. The 300 Exim
queue runners before could not achieve that, probably because this
queue run treats all messages fairly, not retrying one before every other
has been tried.

In case anybody wants to run similar experiments, I append the small
parallel shell program and the script.

Philip: Could you shed some light on why the queue runner needs the
pipe? What does the process tree look like when delivering multiple
messages down the same channel? Does one delivery fork and exec a new
delivery, passing it the channel? If so, why can't it just exec it?

Michael
----------------------------------------------------------------------
#include <sys/types.h>
#include <sys/wait.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* parsh: read one shell command line per input line and keep up to
 * "parallelity" of them running at once. */
int main(int argc, char *argv[])
{
  char ln[4096], *end;
  int p, res = 0, status = 0, usage = 0;

  if (argc != 2) usage = 1;
  else
  {
    p = strtol(argv[1], &end, 10);
    if (*end || p < 1) usage = 1;
  }
  if (usage)
  {
    fprintf(stderr, "Usage: parsh parallelity\n");
    exit(1);
  }
  while (fgets(ln, sizeof(ln), stdin))
  {
    if (p == 0) { wait(&status); ++p; }    /* all slots busy: reap one child */
    if (res == 0 && status) res = status;  /* remember the first failure */
    switch (vfork())
    {
      case 0:  /* child: hand the whole line to a shell */
        execl("/bin/sh", "sh", "-c", ln, (const char *)0);
        fprintf(stderr, "parsh: exec failed: %s\n", strerror(errno));
        _exit(2);  /* after vfork, _exit() must be used, not exit() */
      case -1:
        fprintf(stderr, "parsh: vfork failed: %s\n", strerror(errno));
        exit(2);
      default:
        --p;  /* parent: one more slot in use */
    }
  }
  while (wait(&status) != -1)  /* drain the remaining children */
    if (res == 0 && status) res = status;
  return res;
}
----------------------------------------------------------------------
#!/bin/rc

CONCURRENCY=300

echo 'Exim queue runner started.'

exec >/dev/null >[2=1] </dev/null

fn sighup {}

@{
  @{ while (true) { /usr/exim/bin/exim -bpra; sleep 60 } } | awk '{
    # the first line of each queue entry carries the message id in $3
    id=$3
    # skip the recipient lines up to the blank line that ends the entry
    do { getline } while ($0!="")
    print "/usr/exim/bin/exim -Mc " id
  }' | parsh $CONCURRENCY
} &
Philip Hazel
2004-12-03 16:46:07 UTC
Post by Michael Haardt
Currently, queue runners start one delivery each. Multiple queue runners
run uncoordinated. In order to avoid locking collisions, each queue
runner randomises the list of messages before processing it.
No. The reason for randomising is so that one message that takes forever
to deliver does not always hold up the queue run at the same point. I do
not see locking collisions as a big issue. As I said, they consist of
"open the file, try to get a lock, oops it's already locked, exit". That
really should not use very many resources.
Post by Michael Haardt
So why does this happen? This is the part I don't understand entirely yet.
I suspect that most CPU time is spent trying to deliver messages that
are either currently locked by another delivery or, what happens way
more often, that were just tried by another queue runner.
The second of those is much more likely than the first. I really can't
see that detecting a lock and skipping is going to delay you much. In
the second case, the queue runner will route the message, then consult
the hints, and only then discover that it isn't time yet. Depending on
how your routing works, that might be the bottleneck.

Another issue with queue runners is that they scan the directory
and build a list of message ids in main memory. But even that shouldn't
be a really big issue. In the light of your experiment, it seems not.
It looks as if your test shows that it *is* the redundant trying of messages
that have just been tried that is your problem when you run so many
queue runners.

It would be helpful if it were possible to profile a queue runner in
your environment, to see exactly where it is spending CPU time.
Post by Michael Haardt
Philip: Could you shed some light on why the queue runner needs the
pipe? What does the process tree look like when delivering multiple
messages down the same channel? Does one delivery fork and exec a new
delivery, passing it the channel?
Yes.
Post by Michael Haardt
If so, why can't it just exec it?
The reason it can't just exec it is that the original queue runner needs
to wait until the entire sequence of deliveries has happened. Otherwise
it would not be following the rule "one queue runner does one delivery
at a time"[*]. The original process that the queue runner creates may
finish long before the entire chain. The pipe is a convenient way of
detecting when all the forked processes have terminated.

[*]Actually, the rule is already broken if one message has deliveries to
more than one host, and there are other messages waiting for both of
them.
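
In outline the trick is this (a sketch, not the real code): every
process in the chain inherits the write end of a pipe, nothing is ever
written to it, and the queue runner's read() returns end-of-file exactly
when the last descendant has exited. The helper here is a stand-in.

  #include <unistd.h>

  extern void deliver_and_maybe_fork_more(int pipe_wr);

  void wait_for_whole_chain(void)
  {
    int fd[2];
    char buf;

    if (pipe(fd) < 0) return;

    if (fork() == 0)  /* the first delivery process */
    {
      close(fd[0]);   /* descendants keep only the write end */
      deliver_and_maybe_fork_more(fd[1]);  /* grandchildren inherit fd[1] */
      _exit(0);
    }

    close(fd[1]);     /* the parent MUST drop its own write end */
    while (read(fd[0], &buf, 1) > 0)
      ;               /* blocks until every copy of fd[1] is closed */
    close(fd[0]);     /* the entire chain has now terminated */
  }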
--
Philip Hazel University of Cambridge Computing Service,
***@cus.cam.ac.uk Cambridge, England. Phone: +44 1223 334714.
Get the Exim 4 book: http://www.uit.co.uk/exim-book
Peter Bowyer
2004-12-03 17:31:08 UTC
Post by Philip Hazel
In the second case, the queue runner will route the message, then
consult the hints, and only then discover that it isn't time yet.
Post by Tony Finch
I wonder if it'd be worth statting the file and skipping it if the
atime is too recent.
But it wouldn't know what 'too recent' means without routing the message and
applying the retry rules - unless you mean a global rule such as 'less than
5 minutes is always too soon'.
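
By a global rule I mean something as crude as this (a sketch only; the
threshold is arbitrary, and it assumes atime updates are enabled):

  #include <sys/stat.h>
  #include <time.h>

  #define TOO_SOON (5 * 60)  /* seconds */

  int tried_too_recently(const char *spoolfile)
  {
    struct stat st;
    if (stat(spoolfile, &st) < 0) return 0;  /* can't tell: don't skip */
    return time(NULL) - st.st_atime < TOO_SOON;
  }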

Peter
Nigel Metheringham
2004-12-06 09:33:46 UTC
Post by Peter Bowyer
Post by Tony Finch
I wonder if it'd be worth statting the file and skipping it if the
atime is too recent.
But it wouldn't know what 'too recent' means without routing the message and
applying the retry rules - unless you mean a global rule such as 'less than
5 minutes is always too soon'.
queue_only mode would be a rather major wrinkle in this.

As for portability and people turning off atime updates (yes, that's me),
at least that would mean that atime was set to an earlier timestamp than
it should be, so you would then fall through into current behaviour -
although it has cost you an extra stat() (unless there's an existing one
that can be used) for no benefit on those systems.

Nigel.
--
[ Nigel Metheringham ***@InTechnology.co.uk ]
[ - Comments in this message are my own and not ITO opinion/policy - ]
Tony Finch
2004-12-03 17:26:00 UTC
Post by Philip Hazel
In the second case, the queue runner will route the message, then
consult the hints, and only then discover that it isn't time yet.
I wonder if it'd be worth statting the file and skipping it if the atime
is too recent.

Tony.
--
f.a.n.finch <***@dotat.at> http://dotat.at/
MALIN HEBRIDES: NORTHEAST 4 OR 5 INCREASING 6. RAIN LATER. GOOD BECOMING
MODERATE.
John W. Baxter
2004-12-03 19:12:03 UTC
Post by Tony Finch
Post by Philip Hazel
In the second case, the queue runner will route the message, then
consult the hints, and only then discover that it isn't time yet.
I wonder if it'd be worth statting the file and skipping it if the atime
is too recent.
It would be necessary to know whether atime recording has been turned off
(the noatime option in mount (as seen in the old Linux release whose man
page I happened to read) and the moral equivalent in the fstab). Possibly
one knows that because the interval seems very large.

This "feels like" another great opportunity for per-OS and per-OS-version
code to creep into Exim (and in the case of Linux, possibly
per-distribution).

---john
Philip Hazel
2004-12-06 10:34:01 UTC
Post by John W. Baxter
This "feels like" another great opportunity for per OS and per OS version
code to creep into Exim (and in the case of Linux, possibly
per-distribution).
Given Nigel's subsequent post, this could probably only be done as an
option, which doesn't seem too good an idea.

Brooding over the original discussion over the weekend, I can see that,
for situations where there is a queue of any size, a queue runner that
makes ONE list of messages and then tries to deliver several at once
would waste less time looking at messages unnecessarily than several
queue runners. This is borne out by Michael's experiments with his own
"external queue runner". Exim really does work best when one queue
runner can finish its work before the next one starts, because queue
runners were never intended to be a primary delivery mechanism.

A more radical idea would be to invent yet another hint. When a
message fails to be delivered and is left on the queue, the hints for
various delivery actions (hosts etc) are updated together at the end of
the delivery process. It should be straightforward to remember the
earliest retry time for any of them. So one could make a hint record,
keyed by message id, that contained this time. A queue runner could
consult this file before forking and trying to deliver a message.
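
Concretely, something along these lines (a sketch using plain ndbm; the
file name and record layout here are inventions for illustration, not
Exim's real hints format):

  #include <ndbm.h>
  #include <fcntl.h>
  #include <string.h>
  #include <time.h>

  #define DBFILE "/var/spool/exim/db/msg_retry"

  int worth_trying(const char *msg_id)
  {
    DBM *db = dbm_open(DBFILE, O_RDONLY, 0640);
    datum key, val;
    time_t earliest;
    int yes = 1;  /* no hint available: try the message */

    if (db == NULL) return 1;
    key.dptr = (char *)msg_id;
    key.dsize = strlen(msg_id);
    val = dbm_fetch(db, key);
    if (val.dptr != NULL && val.dsize == sizeof(earliest))
    {
      memcpy(&earliest, val.dptr, sizeof(earliest));
      yes = time(NULL) >= earliest;  /* skip if retry time not yet reached */
    }
    dbm_close(db);
    return yes;
  }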

The problem with this is the bottleneck of the hints database. The queue
runner would have to keep opening and closing it so that delivery
processes that were finishing could update it. This leads us on into the
territory of database concurrent usage. I believe that BDB4 has
mechanisms for this, but it would be a radical departure to insist that
all Exim users use BDB4.

My current position is that this is an area that might in the future be
played with, but it isn't something for the short-term. And ideally,
someone who has these huge queues needs to do experiments - it isn't
something I can simulate.
--
Philip Hazel University of Cambridge Computing Service,
***@cus.cam.ac.uk Cambridge, England. Phone: +44 1223 334714.
Get the Exim 4 book: http://www.uit.co.uk/exim-book
Michael Haardt
2004-12-06 11:55:20 UTC
Post by Philip Hazel
A more radical idea would be to invent yet another hint. When a
message fails to be delivered and is left on the queue, the hints for
various delivery actions (hosts etc) are updated together at the end of
the delivery process. It should be straightforward to remember the
earliest retry time for any of them. So one could make a hint record,
keyed by message id, that contained this time. A queue runner could
consult this file before forking and trying to deliver a message.
That optimisation does not solve my problem of too few simultaneous
deliveries, but it may reduce the total time for a queue run by a
great amount. I thought about it as well, but abandoned the idea so
far, because it effectively introduces message-based retry times. If you
reduce the retry time for a host or domain, existing messages will not be
delivered sooner, because it does not change their existing earliest
retry times.
Post by Philip Hazel
My current position is that this is an area that might in the future be
played with, but it isn't something for the short-term. And ideally,
someone who has these huge queues needs to do experiments - it isn't
something I can simulate.
How about saving that for the next time I need to squeeze out more
performance? :-)

Michael
Philip Hazel
2004-12-06 14:10:49 UTC
Post by Michael Haardt
As I see it, the queue runner parent keeps a pipe open because its
child terminates without waiting for the grandchildren it spawns right
before terminating. If it did not fork grandchildren, but exec'd them,
the parent would still wait for the same process, and would not need a pipe.
But I am probably missing something in this picture.
You are! :-) The spawning is not necessarily "right before terminating".

Suppose an email is addressed to two recipients, A and B, on different
mail servers. When Exim has delivered to A, it notices that there is
another email that previously could not get through to A. So it forks a
new process and hands over the connection to it. Now it goes on and
delivers to B. Waiting for the B delivery to happen is a bad idea,
because you are holding the SMTP connection to A open. And anyway, there
may also be C and D and E...

A more "sophisticated" design would be able to deliver the other waiting
messages from the existing process, but Exim is not sophisticated in
this way.
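
In sketch form (not the actual code; deliver_down_channel() is a
stand-in), the hand-over relies simply on the open socket surviving
fork():

  #include <unistd.h>

  extern void deliver_down_channel(int smtp_fd, const char *msg_id);

  void hand_over_connection(int smtp_fd, const char *waiting_msg_id)
  {
    if (fork() == 0)
    {
      /* child: now owns the live connection; deliver the waiting
       * message down it, then QUIT and close */
      deliver_down_channel(smtp_fd, waiting_msg_id);
      _exit(0);
    }
    close(smtp_fd);  /* parent drops its copy and goes on to B */
  }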
Post by Michael Haardt
That optimisation does not solve my problem of too few simultaneous
deliveries, but it may reduce the total time for a queue run by a
great amount. I thought about it as well, but abandoned the idea so
far, because it effectively introduces message-based retry times.
Sort of partially, I suppose. If the host comes back up again, the
message won't be looked at as early as it otherwise would (but you might
hope that it would get delivered down an existing channel). It is
certainly an added complication that would be hard to describe.
Post by Michael Haardt
If you reduce the retry time for a host or domain, existing messages
will not be delivered sooner, because it does not change their existing
earliest retry times.
But that is true already. If you reduce the retry time for a host, it
does not affect the existing hints data, which includes the next time to
try that host.

Philip
--
Philip Hazel University of Cambridge Computing Service,
***@cus.cam.ac.uk Cambridge, England. Phone: +44 1223 334714.
Pete Carah
2004-12-05 06:54:37 UTC
One other thing to consider here is that unix takes *very* long to open a
file in a directory with 500k files (the spool of 244k messages mentioned).
I've seen a full minute to create a file in such a directory (fairly old
machine but not *that* old). (create is slower than *any* other open.)
If the spool is going to get that big one should hash the names into
subdirectories... And if all queue runners are doing this at the same
time, readdir() will get pretty bogged down.

One other approach to speeding this up would be a db file (hash or btree)
parallel to the spool directory (or indexing into it). That should speed up
the randomisation of the queue runners' sending order a lot over a raw
directory if the queue gets very big. OTOH it would be lots of extra
overhead in any of my cases... (my queue never gets over 100-200).

FreeBSD has the dirhash option but I don't know if the kernel memory it
uses for that is persistent when the dir gets *that* big. I don't know
what other unices (or similar) do here. Mac HFS would be OK (btree dirs;
worse for sequential reads but quick for opens) but I don't know if OS X
uses those or not.

-- Pete
Adrian Phillips
2004-12-05 11:12:04 UTC
Pete> One other thing to consider here is that unix takes *very*
Pete> long to open a file in a directory with 500k files (the
Pete> spool of 244k messages mentioned). I've seen a full minute

Depends on the filesystem - reiserfs is hash based and takes no time
at all to open or create a file in a directory with 500k
files. Scanning the whole directory takes a few seconds though.

Sincerely,

Adrian Phillips
--
Who really wrote the works of William Shakespeare ?
http://www.pbs.org/wgbh/pages/frontline/shakespeare/
Greg A. Woods
2004-12-05 21:22:04 UTC
[ On Saturday, December 4, 2004 at 22:54:37 (-0800), Pete Carah wrote: ]
Subject: Re: [exim] Small modification for queue runners?
One other thing to consider here is that unix takes *very* long to open a
file in a directory with 500k files (the spool of 244k messages mentioned).
As you almost hint, that depends on the filesystem, the implementation of
the filesystem, and perhaps also the regularity with which the directory
is used.
One other approach to speeding this up would be a db file (hash or btree)
parallel to the spool directory (or indexing into it).
No, I don't think so -- at least not at user level.

Simple directory level hashing at the user level would likely suffice if
this were a really big problem for enough people (see the URL below).
It's almost trivial to implement in this kind of application too, and
it's a very widely used solution (e.g. Cyrus IMAP is one e-mail related
example where this technique is used).
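
Something as simple as this would do (an illustrative sketch only; the
hash function is arbitrary):

  #include <stdio.h>

  /* Derive a one-character bucket from the message id so that no single
   * directory ever holds more than a fraction of the queue. */
  void spool_path(char *buf, size_t len, const char *spooldir,
                  const char *msg_id)
  {
    static const char buckets[] = "0123456789abcdefghijklmnopqrstuvwxyz";
    unsigned h = 0;
    const char *p;

    for (p = msg_id; *p; p++) h = h * 31 + (unsigned char)*p;
    snprintf(buf, len, "%s/%c/%s", spooldir, buckets[h % 36], msg_id);
  }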
FreeBSD has the dirhash option but I don't know if the kernel memory it
uses for that is persistent when the dir gets *that* big.
With most modern FFS implementations there's lots of caching of vnodes
and metadata, and with FFS "soft dependencies" there's also much of the
benefit of "async" and "noatime" mounts without their dangers. These
combine to make the effects of accessing directories with many files
less painful than they once were.

FreeBSD's "dirhash" is indeed a potential benefit, though it needs
careful tuning and control.
I don't know
what other unices (or similar) do here. Mac HFS would be OK (btree dirs;
worse for sequential reads but quick for opens) but I don't know if OS X
uses those or not.
The design goals for SGI's XFS included solving this problem of handling
many files in one directory and their papers and reports suggest they
met this goal quite well (they use B-tree indexing within the directory
file).

FFSv2 apparently handles many files per directory even better than the
old 4.4BSD FFS with softdep. ReiserFS and one called GFS apparently do
well too.

The scripts included in this message might be useful to anyone who wants
to investigate how well any given system behaves:

http://mail-index.netbsd.org/tech-kern/2000/12/19/0016.html
--
Greg A. Woods

+1 416 218-0098 VE3TCP RoboHack <***@robohack.ca>
Planix, Inc. <***@planix.com> Secrets of the Weird <***@weird.com>
Michael Haardt
2004-12-06 11:48:47 UTC
Post by Philip Hazel
Looks like your test shows that it *is* the redundant trying of messages
that have just been tried that is your problem when you run so many
queue runners.
Yes. I badly need simultaneous deliveries to make good use of the system,
so I used multiple queue runners. If I have 20,000 mails on the queue, 100
of them deliverable right now, although slowly, one queue runner cannot
deliver them before more pile up, but I have a ratio of 19,900 failed
deliveries vs. 100 successful ones. That's ok. Using 10 queue runners,
I have a ratio of 199,900 failed deliveries vs. 100 successful ones.

Using 10 spawned deliveries from a central queue runner, I can deliver
10 times as many, yet I have the ratio of a single Exim queue runner.
Post by Philip Hazel
Post by Michael Haardt
Philip: Could you shed some light on why the queue runner needs the
pipe? What does the process tree look like when delivering multiple
messages down the same channel? Does one delivery fork and exec a new
delivery, passing it the channel?
Yes.
Post by Michael Haardt
If so, why can't it just exec it?
The reason it can't just exec it is that the original queue runner needs
to wait until the entire sequence of deliveries has happened. Otherwise
it would not be following the rule "one queue runner does one delivery
at a time"[*]. The original process that the queue runner creates may
finish long before the entire chain. The pipe is a convenient way of
detecting when all the forked processes have terminated.
[*]Actually, the rule is already broken if one message has deliveries to
more than one host, and there are other messages waiting for both of
them.
As I see it, the queue runner parent keeps a pipe open because its
child terminates without waiting for the grandchildren it spawns right
before terminating. If it did not fork grandchildren, but exec'd them,
the parent would still wait for the same process, and would not need a pipe.

But I am probably missing something in this picture.

Michael
Bill Hacker
2004-12-06 02:57:35 UTC
Stumped here:

- I have two PGSQL-driven Exim 4.43+28 installs, near-as-dammit
identical (both are still in testing mode) as to build, environment,
users in the DB, directory structure, perms, etc. They differ primarily
in domain names & IP numbers.

- Both were built "WITH_PGSQL", found the relevant dependencies (libpq),
and reflect that in the lookup capability list output by the 'exim
-bV' command.

- ~/exim/configure is identical, save for IP & domain name.

- The one built a few days ago works fine. The one built today throws
an error I cannot find on Google, in docs, or in the 32,000+ message
archive I conveniently keep of this very mailing list:

- The error message is:

Exim configuration error in line 14 of /usr/local/etc/exim/configure:
main option "pgsql_server" unknown.

- The line cited is one or the other (never both at the same time) of:

hide pgsql_servers = (/tmp/.s.PGSQL.5432)/<dbname>/<dbuser>/<dbuserpwd>

OR

hide pgsql_servers = localhost/<dbname>/<dbuser>/<dbuserpwd>

PGSQL is responding to ZPsycopgDA on *both* the IP port and the Unix
socket, so it doesn't appear to be a DB issue.



Syntactically, it sounds as if Exim is telling me it was not compiled to
understand the "pgsql_server" call - yet it was.

exim -bV so confirms.

Plenty of 'evidence' available if no one has a QNDA.

Stumped here...................................


Bill Hacker
Stephen Gran
2004-12-06 03:12:12 UTC
First - please start a new thread rather than replying to an old message
- I am easily confused.
Post by Bill Hacker
main option "pgsql_server" unknown.
^^^^^^^^^^^^
Note the singular
Post by Bill Hacker
hide pgsql_servers = (/tmp/.s.PGSQL.5432)/<dbname>/<dbuser>/<dbuserpwd>
OR
hide pgsql_servers = localhost/<dbname>/<dbuser>/<dbuserpwd>
^^^^^^^^^^^^^
Note the plural

Typo?
--
--------------------------------------------------------------------------
| Stephen Gran | Violence is a sword that has no handle |
| ***@lobefin.net | -- you have to hold the blade. |
| http://www.lobefin.net/~steve | |
--------------------------------------------------------------------------
Bill Hacker
2004-12-06 11:26:20 UTC
No longer.


"BDOS Error on B:" syndrome. An irrelevant error message, see below.
Post by Bill Hacker
- I have two PGSQL-driven Exim 4.43+28 installs, near-as-dammit
identical (both are still in testing mode) as to build, environment,
users in the DB, directory structure, perms, etc. They differ primarily
in domain names & IP numbers.
- Both were built "WITH_PGSQL", found the relevant dependencies (libpq),
and reflect that in the lookup capability list output by the 'exim
-bV' command.
- ~/exim/configure is identical, save for IP & domain name.
- The one built a few days ago works fine. The one built today throws
an error I cannot find on Google, in docs, or in the 32,000+ message
main option "pgsql_server" unknown.
hide pgsql_servers = (/tmp/.s.PGSQL.5432)/<dbname>/<dbuser>/<dbuserpwd>
OR
hide pgsql_servers = localhost/<dbname>/<dbuser>/<dbuserpwd>
PGSQL is responding to ZPsycopgDA on *both* the IP port and the Unix
socket, so it doesn't appear to be a DB issue.
And it was not - I could 'see' the open connection from Exim to PG.
Post by Bill Hacker
Syntactically, it sounds as if Exim is telling me it was not compiled to
understand the "pgsql_server" call - yet it was.
exim -bV so confirms.
Plenty of 'evidence' available if no one has a QNDA.
Stumped here...................................
Bill Hacker
Quick 'N Dirty Answer it was not, but for future reference:
'pgsql_servers' was only 'unknown' in a special case.

- The 'daemonized' exim instance was running fine, and able to make its
DB lookups, as the logs showed once I pointed some actual inbound
traffic onto the box.

The problem came about 'coz (lazy me!) I had been doing initial testing
from an 'on box' shell account su'ed to root.

mail -s Test(n) <valid on-box OR off-box address><Newline>
Test(n)<Newline>
.
<full-stop><Newline> or <Ctrl-D>

- Mail/Sendmail style, this invoked a separate *non-daemon* instance of
exim - but not with the daemon's UID:GID.

- While this one was able to read the /usr/local/etc/exim/configure
file, and permitted to ignore 'never_users = root', it was NOT able to
hide its identity when attempting the connection to postgres - which
refused it, as pg won't speak to root.

- su'ing to pgsql reversed the situation - now the DB call would have
been acceptable, but user:group pgsql:pgsql had no right to read the
exim configure file (inciting a quite explicit error message) - so
events never reached that stage.

- dropping to an ordinary 'wheel' user, with rights to neither the
configure file nor the DB socket, confirmed it.

Good news, I suppose, is that the security model seems to be QED... <g>

Bill Hacker
Michael Haardt
2004-12-02 10:18:39 UTC
Post by Philip Hazel
Why are your queues so long? If messages arrive and are not delivered
because of load, then perhaps you need more hardware or a faster
Internet connection? (I realize that cost starts to be a factor.)
Once in a while, messages do not get delivered instantly due to load,
but usually load is fine. An example is extraordinarily big newsletters,
because they tend to generate a flood of over-quota bounces. But as
I said, that's not my real problem.

My queues are so long for various reasons:

o I keep undelivered mail for 6 days before giving up
o I am backup MX for a bunch of smaller sites
o The total number of processed messages is much, much higher.
Like your site, most mail will be delivered instantly. ;-)

I should mention that I run three such nodes in my cluster.
Post by Philip Hazel
[Many large ISPs run a multi-level queue]
I was thinking about employing that system as well, but as it is, the
only real problem is CPU usage from queue runners that fork processes
that don't do anything useful. That's not going to change with multiple
queues, which just cure the load-induced problem of new messages not
being looked at soon.

300 queue runners usually use about one CPU of a dual Athlon 2600+
system. Occasionally much less is used; occasionally both are
used entirely. Things look much worse with 500 or 600, though. On
average, 360 I/O transactions/s are done, with peaks at a little over
800. I guess the system could manage even 1,000, if everything else
were perfectly balanced.

See? Exim is real great software, works far beyond what you thought
it could do and there is potential to move even further. :-)
Post by Philip Hazel
to make it deliver in "queue run" mode. Personally, I would not be happy
with such a project because of the problems of bottlenecking and single
point of failure (and all the other problems of long running processes,
such as memory leaks).
The current model is one extreme: a horde of uncoordinated queue runners,
each spawning one delivery at a time. A central queue runner is the other.
I suggest something in between: give a queue runner the option to spawn
more than one delivery at a time. There could still be multiple queue
runners, e.g. one per directory for split spools. That way you make use
of simultaneous IO capacity, as provided by RAIDs, when traversing the
queue.

The historical experience of a broken central queue manager is of course
a good reason for never wanting to see one again. Qmail, on the other
hand, shows a very well working, very stable central queue manager that
does work on files. I just don't like it otherwise, for a bunch of reasons.

Michael
Philip Hazel
2004-12-02 14:18:30 UTC
Post by Michael Haardt
See? Exim is real great software, works far beyond what you thought
it could do and there is potential to move even further. :-)
What can I say? I'm amazed.
Post by Michael Haardt
The current model is one extreme: a horde of uncoordinated queue runners,
each spawning one delivery at a time. A central queue runner is the other.
I suggest something in between: give a queue runner the option to spawn
more than one delivery at a time. There could still be multiple queue
runners, e.g. one per directory for split spools. That way you make use
of simultaneous IO capacity, as provided by RAIDs, when traversing the
queue.
Well, I'm still not convinced that you will gain very much by running
one queue runner that does two deliveries at once compared with two
queue runners that do one delivery at once. But I'm always ready to be
proved wrong... However, there is no chance of my implementing anything
like that in the near future.
Post by Michael Haardt
The historical experience of a broken central queue manager is of course
a good reason for never wanting to see one again. Qmail, on the other
hand, shows a very well working, very stable central queue manager that
does work on files.
Oh, I'm not saying it can't be done. I'm just saying that I wouldn't
want to do it! I go for the easy approach, where the consequences of
my mistakes are less serious. :-)
--
Philip Hazel University of Cambridge Computing Service,
***@cus.cam.ac.uk Cambridge, England. Phone: +44 1223 334714.
Get the Exim 4 book: http://www.uit.co.uk/exim-book
Greg A. Woods
2004-12-02 19:55:16 UTC
[ On Thursday, December 2, 2004 at 11:18:39 (+0100), Michael Haardt wrote: ]
Subject: Re: [exim] Small modification for queue runners?
o I am backup MX for a bunch smaller sites
If they contribute significantly to your queue load then you should
consider doing their work on a separate host or hosts.

Further, if they consistently contribute a lot of undeliverable
messages to your queues because their primary MX hosts are not regularly
available, then you should _strongly_ suggest to those folks that they
give up on trying to operate their own primary MX and instead simply
fetch their mail from your server(s) by IMAP or POP. SMTP is really
only intended to be used on a fully connected, 24x7 network, and the
store-and-forward design is simply a reliability and integrity feature
intended only to handle _rare_ exceptions to normal connectivity and
operation.

Even though SMTP is by its design a store-and-forward protocol, SMTP
queues are not very good for "long-term" storage of some messages
especially when the vast majority of other messages will never sit in
the queue for more than a minute or so.
--
Greg A. Woods

+1 416 218-0098 VE3TCP RoboHack <***@robohack.ca>
Planix, Inc. <***@planix.com> Secrets of the Weird <***@weird.com>
Tony Finch
2004-12-02 20:03:39 UTC
Post by Greg A. Woods
[ On Thursday, December 2, 2004 at 11:18:39 (+0100), Michael Haardt wrote: ]
Subject: Re: [exim] Small modification for queue runners?
o I am backup MX for a bunch smaller sites
If they contribute significantly to your queue load then you should
consider doing their work on a separate host or hosts.
Further, if they consistently contribute a lot of undeliverable
messages to your queues because their primary MX hosts are not regularly
available, then you should _strongly_ suggest to those folks that they
give up on trying to operate their own primary MX and instead simply
fetch their mail from your server(s) by IMAP or POP.
And if they are usually up you should do call-forward verification.
This is a BIG bonus for reducing queue sizes.

Tony.
--
f.a.n.finch <***@dotat.at> http://dotat.at/
MALIN HEBRIDES: NORTHEAST 4 OR 5 INCREASING 6. RAIN LATER. GOOD BECOMING
MODERATE.
Michael Haardt
2004-12-06 14:49:29 UTC
Post by Philip Hazel
Suppose an email is addressed to two recipients, A and B, on different
mail servers. When Exim has delivered to A, it notices that there is
another email that previously could not get through to A. So it forks a
new process and hands over the connection to it. Now it goes on and
delivers to B. Waiting for the B delivery to happen is a bad idea,
because you are holding the SMTP connection to A open. And anyway, there
may also be C and D and E...
Now I get it. How about putting that somewhere in the code as a comment? :)

If a message has many recipients, it may start a bunch of deliveries.
Typical newsletters without VERP thus bypass the otherwise controlled
parallelism after a failure such as too-high local load, possibly
resulting in a new load peak.

Would anybody object if the choice were only a) to delay delivery of
B until all messages were transmitted to the host for A or b) never to
send messages down the same channel?

Neither would need a pipe, thus making multiple deliveries from one
queue runner very easy.

That sounds as if I wanted to avoid using a semaphore for the number of
concurrent deliveries - and I do, given the portability of semaphores and
shared memory in the existing Unix world, because Exim currently runs on
a bunch of odd systems.
Post by Philip Hazel
Post by Michael Haardt
If you reduce the retry time for a host or domain, existing messages
will not be delivered sooner, because it does not change their existing
earliest retry times.
But that is true already. If you reduce the retry time for a host, it
does not affect the existing hints data, which includes the next time to
try that host.
Yes, but it's just one host, so using fixdb to change its record causes
delivery of all messages.

Michael
Philip Hazel
2004-12-07 09:55:40 UTC
Post by Michael Haardt
If a message has many recipients, it may start a bunch of deliveries.
Typical newsletters without VERP thus bypass the otherwise controlled
parallelism after a failure such as too-high local load, possibly
resulting in a new load peak.
Indeed, but remember my original design parameters: the assumption is
that most deliveries will happen first time, NOT via a queue runner. I
hadn't thought about high load issues at that time (the various load
controls were all added to Exim later).
Post by Michael Haardt
Would anybody object if the choice were only a) to delay delivery of
B until all messages were transmitted to the host for A or b) never to
send messages down the same channel?
Not quite so simple. If you have remote_max_parallel set greater than
one, delivery to A and B may be happening simultaneously, in different
processes that cannot communicate with each other. So the only way to do
(a) would be to set remote_max_parallel=1, which is probably not a good
idea. And doing (b) removes an important optimization that sometimes is
very helpful. Making it an option is no problem; making it the default
isn't right.
Post by Michael Haardt
Neither would need a pipe, thus making multiple deliveries from one
queue runner very easy.
It isn't particularly hard managing an array of pipes. That is, in fact,
what the delivery process already does in order to implement
remote_max_parallel. Each delivery subprocess passes back information
about what happened using a pipe.
Post by Michael Haardt
Yes, but it's just one host, so using fixdb to change its record causes
delivery of all messages.
Ah! Somebody that is brave enough to use fixdb... congratulations! I
wasn't thinking of that.
--
Philip Hazel University of Cambridge Computing Service,
***@cus.cam.ac.uk Cambridge, England. Phone: +44 1223 334714.
Get the Exim 4 book: http://www.uit.co.uk/exim-book