Discussion:
Messages bouncing too soon
Mark Rigby-Jones
2005-04-13 13:53:19 UTC
Hi all,

I'm getting a strange issue where exim seems to be bouncing messages much
sooner than the final retry rule. The scenario in question is with messages
coming into a server which are then sent on to the customer's SMTP server.

Earlier in the week one customer had to take their server down and, as
expected, mail queued up here with lots of "retry time not reached for any
host" messages in the log. However, after a day or so, messages started to
bounce with the message "all hosts have been failing for a long time and
were last tried after this message arrived".

The retry rule for the messages (as checked with exim -brt) is as follows,
giving a final retry timeout of two weeks.
cdb*@;/etc/mail/retry/server.cdb * F,2h,15m; F,16h,1h; F,2w,8h;
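
To illustrate how the two-week figure falls out of that rule, here is a minimal Python sketch that splits the retry steps and reports the last cutoff. It assumes the usual Exim time suffixes (s, m, h, d, w) and that the cutoff of the last step is the overall timeout. The helper names (parse_time, parse_retry_steps, final_cutoff) are hypothetical, and this is not Exim's own parser.

```python
import re

# Seconds per unit for Exim-style time values such as "2h", "15m", "2w".
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_time(text):
    """Convert a time value like '2w' or '1h30m' to seconds."""
    parts = re.findall(r"(\d+)([smhdw])", text)
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"bad time value: {text!r}")
    return sum(int(n) * UNIT_SECONDS[u] for n, u in parts)


def parse_retry_steps(spec):
    """Split 'F,2h,15m; F,16h,1h; ...' into (letter, cutoff_seconds, args)."""
    steps = []
    for chunk in spec.split(";"):
        chunk = chunk.strip()
        if not chunk:  # tolerate a trailing ';'
            continue
        letter, cutoff, *args = [field.strip() for field in chunk.split(",")]
        steps.append((letter, parse_time(cutoff), args))
    return steps


def final_cutoff(spec):
    """Assumption: the last step's cutoff is the overall retry timeout."""
    return parse_retry_steps(spec)[-1][1]


rule = "F,2h,15m; F,16h,1h; F,2w,8h;"
print(final_cutoff(rule) / 86400, "days")  # 14.0 days
```

Only the retry part of the line is parsed here; the domain pattern and error field are left out of the sketch.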

The entry for the customer's server in the retry.db file did correctly show
the start time of the failure as being a couple of days ago.

I'm at a loss to explain why these messages are being bounced. Any hints as
to what else to look at would be greatly appreciated.

mrj
--
Mark Rigby-Jones, Operations Manager @ Community Internet plc
Windsor House, 12 High Street, Kidlington, Oxford OX5 2PJ, UK
Tel: +44-1865-856000 (Direct: +44-1865-856009) Fax: +44-1865-856001
***@community.net.uk <*> http://www.community.net.uk/~mrj
--
## List details at http://www.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://www.exim.org/eximwiki/
Philip Hazel
2005-04-14 19:08:32 UTC
Post by Mark Rigby-Jones
Earlier in the week one customer had to take their server down and, as
expected, mail queued up here with lots of "retry time not reached for any
host" messages in the log. However, after a day or so, messages started to
bounce with the message "all hosts have been failing for a long time and were
last tried after this message arrived".
Please RTFM, in particular section 32.8 of the 4.50 manual. Exim does
host-based retrying, not message-based retrying.
--
Philip Hazel University of Cambridge Computing Service,
***@cus.cam.ac.uk Cambridge, England. Phone: +44 1223 334714.
Get the Exim 4 book: http://www.uit.co.uk/exim-book
Mark Rigby-Jones
2005-04-15 10:21:23 UTC
Post by Philip Hazel
Post by Mark Rigby-Jones
Earlier in the week one customer had to take their server down and, as
expected, mail queued up here with lots of "retry time not reached for any
host" messages in the log. However, after a day or so, messages started to
bounce with the message "all hosts have been failing for a long time and were
last tried after this message arrived".
Please RTFM, in particular section 32.8 of the 4.50 manual. Exim does
host-based retrying, not message-based retrying.
*nods* I am aware of that; my issue is that the host had been failing for
less than two days (I verified this in the retry db file; unfortunately the
customer got their server back up before I could snapshot it), whilst the
retry rule had two weeks until the final cutoff.

I would expect messages to be bounced quickly once the remote host had been
down for two weeks or more, but I wouldn't expect any messages to be
bounced during the first two weeks - or am I misunderstanding something?

Thanks,
mrj
--
Mark Rigby-Jones, Operations Manager @ Community Internet plc
Windsor House, 12 High Street, Kidlington, Oxford OX5 2PJ, UK
Tel: +44-1865-856000 (Direct: +44-1865-856009) Fax: +44-1865-856001
***@community.net.uk <*> http://www.community.net.uk/~mrj
Philip Hazel
2005-04-15 10:51:01 UTC
Post by Mark Rigby-Jones
I would expect messages to be bounced quickly once the remote host had been
down for two weeks or more, but I wouldn't expect any messages to be bounced
during the first two weeks - or am I misunderstanding something?
The timeout depends on your retry settings. I am currently in Maputo, on
the end of a very slow network connection, and I cannot devote much time
to email. If you want to do further checking, the thing to do would be
to send a test message to one of the failing domains, with debugging
enabled, and using the -N command line option so that it doesn't
actually get delivered. The debugging output should give some clue as to
why it is bouncing. (I presume you are using release 4.50; if not, scan
the ChangeLogs to see if there's anything relevant that has been
changed.)
--
Philip Hazel University of Cambridge Computing Service,
***@cus.cam.ac.uk Cambridge, England. Phone: +44 1223 334714.
Get the Exim 4 book: http://www.uit.co.uk/exim-book
Miros/law Baran
2005-04-15 13:28:29 UTC
Post by Philip Hazel
Post by Mark Rigby-Jones
I would expect messages to be bounced quickly once the remote host had been
down for two weeks or more, but I wouldn't expect any messages to be bounced
during the first two weeks - or am I misunderstanding something?
The timeout depends on your retry settings. I am currently in Maputo, on
the end of a very slow network connection, and I cannot devote much time
to email. If you want to do further checking, the thing to do would be
to send a test message to one of the failing domains, with debugging
enabled, and using the -N command line option so that it doesn't
actually get delivered. The debugging output should give some clue as to
why it is bouncing. (I presume you are using release 4.50; if not, scan
the ChangeLogs to see if there's anything relevant that has been
changed.)
I've seen similar weird behaviour too. I'm not sure if it has something
in common, but the 'retry timeout exceeded' message did ring a bell.

The scenario: An e-mail was sent to ***@domain.tld, which has two
MXes, alternator.domain.tld and rotanretla.domain.tld; both have
Postfix with sender callouts enabled.

Because neither of the MXes has an ident daemon running, and my setup had
rfc1413_query_timeout set to 10s, the situation was as follows:

myhost.tld -> alternator [conn. 1] EHLO (and stuff)
alternator -> myhost.tld [conn. 2] EHLO (callout), 10s for banner,
                                   probable timeout
alternator -> myhost.tld [conn. 1] 450
myhost.tld -> rotanretla [conn. 3] EHLO (and stuff)
rotanretla -> myhost.tld [conn. 4] EHLO (callout), 10s for banner,
                                   probable timeout
rotanretla -> myhost.tld [conn. 3] 450

So far, so good. Then, something unexpected: the exim daemon creates an
immediate bounce for that mail:

--8<--
***@domain.tld
SMTP error from remote mailer after RCPT TO:<***@domain.tld>:
host rotanretla.domain.tld [xxx.xxx.xxx.xxx]: 450 <***@myhost.tld>:
Sender address rejected: unverified address: Address verification in progress:
[this is the error message from the remote server]
retry timeout exceeded
[and this is what exim has to say in that situation]
--8<--

The behaviour was triggered by two immediate 450s from both the MXes on
the receiving side. Retry rules are pretty standard:

* * F,2h,15m; G,16h,1h,1.5; F,4d,6h

No additional retry controls are set. The exim version is 4.50 (the
Debian package 4.50-5), with a simple, one-file configuration (I don't
use the default Debian way of configuring Exim).

Kind regards
Jubal
--
[ Miros/law L Baran, baran-at-knm-org-pl, neg IQ, cert AI ] [ 0101010 is ]
[ BOF2510053411, makabra.knm.org.pl/~baran/, alchemy pany ] [ The Answer ]

Things are more like they used to be than they are now.
Philip Hazel
2005-04-26 11:12:19 UTC
Post by Mark Rigby-Jones
I would expect messages to be bounced quickly once the remote host had been
down for two weeks or more, but I wouldn't expect any messages to be bounced
during the first two weeks - or am I misunderstanding something?
You are correct. Exim should only bounce messages after a host has been
down for the full retry time. This is a tricky area of the code which
has had problems in the past. Now that your problem has gone away, I'm
not sure exactly how I can proceed to try to figure out what happened.
Eyeballing the code doesn't always show up anything, but I'll take a
look, just in case.
--
Philip Hazel University of Cambridge Computing Service,
***@cus.cam.ac.uk Cambridge, England. Phone: +44 1223 334714.
Get the Exim 4 book: http://www.uit.co.uk/exim-book
Mark Rigby-Jones
2005-04-27 10:51:16 UTC
Post by Philip Hazel
You are correct. Exim should only bounce messages after a host has been
down for the full retry time. This is a tricky area of the code which has
had problems in the past. Now that your problem has gone away, I'm not
sure exactly how I can proceed to try to figure out what happened.
Eyeballing the code doesn't always show up anything, but I'll take a
look, just in case.
That particular problem has indeed gone away as we have no customers with
broken mail servers at the moment. I am seeing something very similar with
outgoing email (this came to light yesterday after Yahoo! decided to
"de-prioritize" our mail servers). I can see that domains are getting
marked as past their final cutoff in the retry DB file:

[***@f2.mail exim/log]% exinext <domain>.com
Transport: <ipaddr> [<ipaddr>] error 110: Connection timed out
first failed: 26-Apr-2005 01:18:00
last tried: 27-Apr-2005 00:10:41
next try at: 27-Apr-2005 10:50:41
past final cutoff time

[***@f2.mail exim/log]% exim -brt <domain>.com
Retry rule: * * F,2h,15m; F,16h,1h; F,4d,8h;

I'm not sure how to get more useful debug information, as I presume the
issue is with the delivery attempt which sets the 'past final cutoff time'
flag (as testing a domain which is already failed simply shows it being
rejected because that flag is set).

mrj
--
Mark Rigby-Jones, Operations Manager @ Community Internet plc
Windsor House, 12 High Street, Kidlington, Oxford OX5 2PJ, UK
Tel: +44-1865-856000 (Direct: +44-1865-856009) Fax: +44-1865-856001
***@community.net.uk <*> http://www.community.net.uk/~mrj
Philip Hazel
2005-04-27 13:40:36 UTC
Post by Mark Rigby-Jones
I can see that domains are getting marked
Thanks for that information. I've been eyeballing the code, and not
seeing anything I can fix, but knowing that it is setting the cutoff
flag incorrectly narrows down the possibilities somewhat. I'll take a
closer look at that bit of code.
--
Philip Hazel University of Cambridge Computing Service,
***@cus.cam.ac.uk Cambridge, England. Phone: +44 1223 334714.
Get the Exim 4 book: http://www.uit.co.uk/exim-book
Philip Hazel
2005-04-27 14:00:19 UTC
Post by Mark Rigby-Jones
Transport: <ipaddr> [<ipaddr>] error 110: Connection timed out
first failed: 26-Apr-2005 01:18:00
last tried: 27-Apr-2005 00:10:41
next try at: 27-Apr-2005 10:50:41
past final cutoff time
Retry rule: * * F,2h,15m; F,16h,1h; F,4d,8h;
That data does not make sense. The host appears to have been down for
around 23 hours. So the retrying should be happening every 8 hours. However,
it seems to have calculated the next retry with an interval of 10 hours
and 40 minutes.
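
The arithmetic behind those two figures can be checked directly from the three timestamps in the exinext output. This is only a sketch: the format string is an assumption about how exinext prints dates, and parsing "Apr" with %b assumes an English or C locale.

```python
from datetime import datetime, timedelta

FMT = "%d-%b-%Y %H:%M:%S"  # matches e.g. "26-Apr-2005 01:18:00"

first_failed = datetime.strptime("26-Apr-2005 01:18:00", FMT)
last_tried = datetime.strptime("27-Apr-2005 00:10:41", FMT)
next_try = datetime.strptime("27-Apr-2005 10:50:41", FMT)

down_for = last_tried - first_failed    # how long the host had been failing
next_interval = next_try - last_tried   # the gap Exim scheduled
four_days = timedelta(days=4)           # final cutoff of the default rule

print("down for:", down_for)            # 22:52:41
print("next interval:", next_interval)  # 10:40:00
print("inside the 4d cutoff:", down_for < four_days)  # True
```

So the host had been failing for just under 23 hours, well inside the four-day cutoff, yet the scheduled gap matches none of the fixed 15m, 1h or 8h intervals in that rule.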

This may be a silly question, but you aren't sharing the hints data
between more than one host, are you?

Philip
--
Philip Hazel University of Cambridge Computing Service,
***@cus.cam.ac.uk Cambridge, England. Phone: +44 1223 334714.
Mark Rigby-Jones
2005-04-29 10:00:53 UTC
Post by Philip Hazel
That data does not make sense. The host appears to have been down for
around 23 hours. So the retrying should be happening every 8 hours. However,
it seems to have calculated the next retry with an interval of 10 hours
and 40 minutes.
*nods* I saw that, but wasn't entirely sure how it was calculated. For
reference, the complete retry ruleset:

# Domain                          Error  Retries
# ------                          -----  -------
*                                 *      senders=: G,4h,5m,2
cdb*@;/etc/mail/retry/dialup.cdb  *      F,28d,7d
cdb*@;/etc/mail/retry/server.cdb  *      F,2h,15m; F,16h,1h; F,14d,8h
cdb*@;/etc/mail/retry/local.cdb   *      F,2h,15m; F,16h,1h; F,14d,8h
*                                 *      F,2h,15m; F,16h,1h; F,4d,8h
Post by Philip Hazel
This may be a silly question, but you aren't sharing the hints data
between more than one host, are you?
Nope, it's just on the one host.

mrj
--
Mark Rigby-Jones, Operations Manager @ Community Internet plc
Windsor House, 12 High Street, Kidlington, Oxford OX5 2PJ, UK
Tel: +44-1865-856000 (Direct: +44-1865-856009) Fax: +44-1865-856001
***@community.net.uk <*> http://www.community.net.uk/~mrj
Philip Hazel
2005-04-29 15:08:54 UTC
Post by Mark Rigby-Jones
Post by Philip Hazel
That data does not make sense. The host appears to have been down for
around 23 hours. So the retrying should be happening every 8 hours. However,
it seems to have calculated the next retry with an interval of 10 hours
and 40 minutes.
*nods* I saw that, but wasn't entirely sure how it was calculated. For
# Domain Error Retries
# ------ ----- -------
* * senders=: G,4h,5m,2
* * F,2h,15m; F,16h,1h; F,4d,8h
Post by Philip Hazel
This may be a silly question, but you aren't sharing the hints data
between more than one host, are you?
Nope, it's just on the one host.
Given that set of retry rules, I am completely baffled... wait ... I
notice that 10 hours and 40 minutes is 640 minutes. That amount of retry
time can be the result of 5 minutes multiplied by 2 several times, which
is the algorithm in your first retry rule. Did the message have an empty
sender? But that rule should have timed out after 4 hours...
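
The 640-minute observation is easy to reproduce. The sketch below simply doubles a 5-minute interval, as the G,4h,5m,2 step describes; it is an idealised model that assumes every retry happens exactly when due, not a copy of Exim's scheduling code, and doubling_intervals is a hypothetical helper name.

```python
def doubling_intervals(start_minutes, factor, limit_minutes):
    """Yield start, start*factor, ... up to and including limit_minutes."""
    interval = start_minutes
    while interval <= limit_minutes:
        yield interval
        interval *= factor


intervals = list(doubling_intervals(5, 2, 640))
print(intervals)             # [5, 10, 20, 40, 80, 160, 320, 640]
print(intervals.index(640))  # 7, i.e. 5 * 2**7 == 640
print(sum(intervals[:6]))    # 315 minutes elapsed after six retries
```

In this idealised schedule the first six intervals already add up to 315 minutes, which is past the 4-hour (240-minute) cutoff, so an interval of 640 minutes could only appear after that rule's cutoff had been reached.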
--
Philip Hazel University of Cambridge Computing Service,
***@cus.cam.ac.uk Cambridge, England. Phone: +44 1223 334714.
Get the Exim 4 book: http://www.uit.co.uk/exim-book
Philip Hazel
2005-04-29 15:11:37 UTC
Post by Philip Hazel
Given that set of retry rules, I am completely baffled... wait ... I
notice that 10 hours and 40 minutes is 640 minutes. That amount of retry
time can be the result of 5 minutes multiplied by 2 several times, which
is the algorithm in your first retry rule. Did the message have an empty
sender? But that rule should have timed out after 4 hours...
.... and of course, it had! (Silly me.)
--
Philip Hazel University of Cambridge Computing Service,
***@cus.cam.ac.uk Cambridge, England. Phone: +44 1223 334714.
Get the Exim 4 book: http://www.uit.co.uk/exim-book
Mark Rigby-Jones
2005-04-29 16:56:48 UTC
Post by Philip Hazel
Post by Mark Rigby-Jones
# Domain Error Retries
# ------ ----- -------
* * senders=: G,4h,5m,2
* * F,2h,15m; F,16h,1h; F,4d,8h
Given that set of retry rules, I am completely baffled... wait ... I
notice that 10 hours and 40 minutes is 640 minutes. That amount of retry
time can be the result of 5 minutes multiplied by 2 several times, which
is the algorithm in your first retry rule. Did the message have an empty
sender? But that rule should have timed out after 4 hours...
The particular message I was looking at didn't have an empty sender, but by
that point, the host was already marked as past its final cutoff time in
the retry hints file.

Could, then, a message with an empty sender failing after 4 hours cause the
host to get marked as 'past final cutoff' in the hints db even though other
messages have a final cutoff time of four days? That would certainly
explain pretty much everything I'm seeing, and I have a sneaking suspicion
that the first time I saw this behaviour (on a different server) was not
long after I introduced the "senders=:" retry rule.

Hmmm. I think I'll comment out that rule for the time being and see if it
makes any difference.

mrj
--
Mark Rigby-Jones, Operations Manager @ Community Internet plc
Windsor House, 12 High Street, Kidlington, Oxford OX5 2PJ, UK
Tel: +44-1865-856000 (Direct: +44-1865-856009) Fax: +44-1865-856001
***@community.net.uk <*> http://www.community.net.uk/~mrj
Philip Hazel
2005-05-03 08:52:30 UTC
Post by Mark Rigby-Jones
Post by Mark Rigby-Jones
* * senders=: G,4h,5m,2
<snip>
Post by Mark Rigby-Jones
Could, then, a message with an empty sender failing after 4 hours cause the
host to get marked as 'past final cutoff' in the hints db even though other
messages have a final cutoff time of four days?
Aarrgghh!! Yes, of course. My goodness, I didn't think through the
implications of adding the "senders" facility.

For host errors, Exim does *host-based* retries, not message-based
retries. So yes, if the problem is with the host, it is going to
misbehave in exactly the way you describe.
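
The interaction can be pictured with a toy model: one retry record per host, shared by every message, but with the cutoff chosen from the rule that matches the sender of the message being attempted. This is a deliberately simplified sketch and not Exim's code. HostRecord, cutoff_for, record_failure and would_bounce are hypothetical names, and the real bounce decision also considers whether the host was last tried after the message arrived.

```python
from dataclasses import dataclass

HOUR = 3600
DAY = 24 * HOUR

# Cutoffs taken from the two rules discussed in the thread.
NULL_SENDER_CUTOFF = 4 * HOUR  # "* * senders=: G,4h,5m,2"
DEFAULT_CUTOFF = 4 * DAY       # "* * F,2h,15m; F,16h,1h; F,4d,8h"


@dataclass
class HostRecord:
    """Toy stand-in for one per-host entry in the retry hints database."""
    first_failed: int
    expired: bool = False  # models the "past final cutoff time" flag


def cutoff_for(sender):
    """Pick the cutoff by sender, as the 'senders=:' rule does."""
    return NULL_SENDER_CUTOFF if sender == "" else DEFAULT_CUTOFF


def record_failure(record, sender, now):
    """A failed attempt updates the shared record using the sender's rule."""
    if now - record.first_failed >= cutoff_for(sender):
        record.expired = True


def would_bounce(record):
    """In this toy model, any message for an expired host bounces."""
    return record.expired


host = HostRecord(first_failed=0)
record_failure(host, "user@example.com", now=5 * HOUR)
print(would_bounce(host))  # False: 5h is well inside the 4-day cutoff
record_failure(host, "", now=5 * HOUR)
print(would_bounce(host))  # True: a null-sender message tripped the 4h cutoff
```

Once the null-sender message has set the flag on the shared record, ordinary messages to the same host are treated as past the final cutoff too, which is the behaviour described above.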

The "senders" feature was added with errors given to MAIL FROM:<> in
mind, so as to quickly get rid of bounces to sites that reject null
senders. I did not think of what might happen for other errors.

I will update the documentation in due course to point out this issue,
and suggest that "senders" is used only in conjunction with a test for
specific errors that are not host based. At present, there is the
ability to check for RCPT TO errors, but not for MAIL FROM.

What a mess. Sorry.
--
Philip Hazel University of Cambridge Computing Service,
***@cus.cam.ac.uk Cambridge, England. Phone: +44 1223 334714.
Get the Exim 4 book: http://www.uit.co.uk/exim-book