Discussion:
SMTP command timeout on connection - how to troubleshoot
Scott Neader
2012-02-17 21:11:25 UTC
One particular remote mail server seems to be having problems delivering
mail to one of my local servers running Exim. The server reports that it's
running "EdgeWave mag2700" if this matters.

I have in my Exim config:

smtp_receive_timeout = 120s
log_selector = +subject +arguments +received_recipients

Here is a log snippet, redacted:

2012-02-17 14:22:52 H=sbox.example.net [1.2.3.4] Warning: Sender rate 1.3 / 1h
2012-02-17 14:22:53 1RyUKT-0003th-B6 <= ***@example.net H=sbox.example.net [ 1.2.3.4 ] P=esmtp S=43627 id=00e901ccedb1$f272dcf0$d75896d0$@example.net T="test 1A" for ***@myuser.com
2012-02-17 14:24:53 SMTP command timeout on connection from sbox.example.net [ 1.2.3.4 ]
2012-02-17 14:24:53 H=sbox.example.net [ 1.2.3.4 ] Warning: "Connection Ratelimit - sbox.example.net [ 1.2.3.4 ] because of notquit: command-timeout (1.3/1h max:1.2)"

Note that even though the log says "Connection Ratelimit", I have this server
set to disregard rate limits for troubleshooting purposes, so we are
continuing to accept connections from them with no rate limit.

You can see that the remote server connects, provides the email headers,
then seems to do nothing for 120 seconds, at which time we disconnect them.

What are my options for troubleshooting? Of course the remote mail server
admin feels that it's OUR problem. It's hard to refute that, when I have
2977 other timeout messages in my log (this 1.2.3.4 server accounts for 146
of them). Of course, probably a lot of those are spammers.
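
One way to see which hosts account for those timeout lines is to tally them per remote address. A minimal Python sketch, assuming the mainlog holds one entry per line in the format quoted above; the function name and the regular expression are illustrative and not part of Exim:

```python
import re
from collections import Counter

# Matches e.g. "SMTP command timeout on connection from host.example [1.2.3.4]".
# Assumption: the host name may be missing, leaving only the bracketed address.
TIMEOUT_RE = re.compile(
    r"SMTP command timeout on connection from "
    r"(?:(?P<host>\S+) )?\[\s*(?P<ip>[0-9A-Fa-f.:]+)\s*\]"
)

def count_command_timeouts(lines):
    """Return a Counter mapping remote IP -> number of command timeouts."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts
```

Passing an open log file to `count_command_timeouts(...)` and calling `.most_common(20)` on the result lists the busiest addresses, which makes it easier to separate a handful of real mail servers from the long tail of one-off zombies.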

- Scott
--
## List details at https://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/
Jeremy Harris
2012-02-17 22:05:28 UTC
Post by Scott Neader
EdgeWave mag2700
You don't say what your system is running though.
Can you run wireshark, and grab a failing smtp conversation?
--
Jeremy
Scott Neader
2012-02-17 22:12:57 UTC
I am running Exim 4.69.

I will see if I can get a cap from my side. Not sure if the cap from the
remote side is useful, but I have some of those.

- Scott
Scott Neader
2012-02-17 22:08:02 UTC
I was able to get the remote mail server admin to send me a packet capture
in .pcap format (if anyone wants to see it, I'd be glad to share, nothing
confidential in the cap).

What I see is that our Exim server sends the "250 OK id=xxxxxxx" message
just fine, and within a few ms, their server sends an ACK packet.

Here's the funny part... 120 seconds later, my Exim server sends a "421
my.servername.net: SMTP command timeout - closing connection" packet, and
their server ACKs that as well. Then my server sends the "FIN, ACK", their
server sends the ACK, and the connection is closed.

Any ideas what is going on?

- Scott
Graeme Fowler
2012-02-17 22:20:49 UTC
Make sure there are no Cisco PIX or ASA devices with "smtp fixup" or "inspect smtp" switched on between you and the remote site.

Graeme
W B Hacker
2012-02-18 10:51:57 UTC
Post by Scott Neader
I was able to get the remote mail server admin to send me a packet capture
in .pcap format (if anyone wants to see it, I'd be glad to share, nothing
confidential in the cap).
What I see is that our Exim server sends the "250 OK id=xxxxxxx" message
just fine, and within a few ms, their server sends an ACK packet.
Here's the funny part... 120 seconds later, my Exim server sends a "421
my.servername.net: SMTP command timeout - closing connection" packet, and
their server sends the ACK for that also, then my server sends the "FIN,
ACK", and their server sends the ACK, and the connection is closed.
Any ideas what is going on?
- Scott
Set your timeout to 4 minutes instead of two minutes and see what, if
anything, changes.

My 'SWAG' is that, '250 OK' having been conveyed, a 'QUIT' will
transpire *before* your timeout arrives, all-hands will stand-down
normally, and the time-on-teat won't be enough longer to matter.

IOW you won't actually reach the 4 minutes, but at least will not have
confused the issue with an extraneous time-out message.

Bill
--
韓家標
Scott Neader
2012-02-21 05:54:54 UTC
Post by W B Hacker
Set your timeout to 4 minutes instead of two minutes and see what, if
anything, changes.
I changed the timeout to 240 seconds, with no change... it just hangs for
240 seconds after I send the 250 OK, and then I disconnect due to timeout.

Post by Graeme Fowler
Make sure there are no Cisco PIX or ASA devices with "smtp fixup" or
"inspect smtp" switched on between you and the remote site.
I'm told they do not have a PIX or ASA, but they do have a series 7200
router... it is capable of "smtp fixup", so I am asking them to ask their
router folks if it is enabled.

As another data point... I remembered having this problem with another ISP
a few months back. I just telnet'd to port 25 on their server and guess
what:

ESMTP EdgeWave mag4000

Yep... so here we have two of these "EdgeWave" mail servers that I can't
get mail from (but I can send them mail fine).

Anyone out there interested in looking at the packet capture .pcap file I
have?

I will try to look through my logs for "SMTP command timeout", sift out
the obvious spam zombie PCs, and look for real mail servers, to see
whether there are more EdgeWave issues out there. Maybe it's
something?

- Scott
Scott Neader
2012-02-21 06:00:21 UTC
FYI, I'm seeing a number of timeouts from a mail provider called
"redcondor.net" and sure enough, when I telnet to port 25:

220 smtp450.redcondor.net ESMTP EdgeWave mag4000e


LOL... http://redcondor.com/ -- owned "by EdgeWave".

I guess the question is... is it something EdgeWave is doing, or is it
Exim?

We sure don't get along, that is for sure.
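
The manual telnet check can be scripted when there are many candidate hosts to look at. A rough Python sketch, assuming the remote host answers on port 25 with a single 220 greeting line like the one above; both function names are made up for illustration:

```python
import socket

def read_banner(host, port=25, timeout=10.0):
    """Connect to an SMTP port and return the first greeting line as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with sock.makefile("rb") as reader:
            return reader.readline().decode("ascii", errors="replace").strip()

def banner_mentions(banner, product):
    """True if a 220 greeting line names the given product (case-insensitive)."""
    code, _, rest = banner.partition(" ")
    return code.startswith("220") and product.lower() in rest.lower()
```

Calling `banner_mentions(read_banner(ip), "EdgeWave")` for each address that shows up in the timeout lines would show how many of them are this one product rather than unrelated zombies.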

- Scott
Scott Neader
2012-02-21 07:05:48 UTC
I opened a ticket with EdgeWave, to see if they were aware of any
particular problems talking to Exim. Their response was this:

Per RFC 2821, If your mail server is issuing a 250 response at the end of
the smtp session then that means that the message was successfully
delivered. (as far as the sender is concerned).
I am not aware of any general issues with sending mail to exim mail
servers.

I do agree they have a point... when Exim sends the 250 OK... shouldn't
Exim then deposit the message into the local mailbox or whatever needs to
happen at this point? Doesn't it seem wrong for Exim to send a 250 OK, but
not have actually accepted the message?

- Scott
Heiko Schlittermann
2012-02-21 08:42:14 UTC
Hi,
Post by Scott Neader
I opened a ticket with EdgeWave, to see if they were aware of any
particular problems talking to Exim. Their response was this:
Per RFC 2821, If your mail server is issuing a 250 response at the end of
the smtp session then that means that the message was successfully
delivered. (as far as the sender is concerned).
I am not aware of any general issues with sending mail to exim mail
servers.
I do agree they have a point... when Exim sends the 250 OK... shouldn't
Exim then deposit the message into the local mailbox or whatever needs to
happen at this point? Doesn't it seem wrong for Exim to send a 250 OK, but
not have actually accepted the message?
If I understand correctly, roughly the following happens:

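Remote Server                          Your Server

[ TCP SYN/SYN-ACK/ACK ]
                                  <--  220 [greeting]

[ HELO/EHLO, MAIL FROM, RCPT TO and their replies ]

DATA                              -->
                                  <--  354 Enter Message
[ message headers and body ]      -->
.                                 -->
                                  <--  250 OK id=1Rzl72-00089K-6J

[several minutes]
                                  <--  421 [hostname] SMTP command timeout

[ TCP FIN/FIN-ACK/ACK ]
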

I just tried this with one of our servers, using netcat as the client
and simply not sending the QUIT. It did exactly what I expected:
Exim delivered the mail as soon as the 250 OK was sent. Then, minutes
later, a notice about the command timeout appeared in the mainlog.
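
The same experiment can be scripted instead of typed into netcat. A minimal sketch using Python's standard smtplib, assuming a test server you control; the function name, the addresses you pass in, and the HELO name `client.example` are placeholders, not anything from the thread:

```python
import smtplib

def send_without_quit(host, port, sender, recipient, body, wait_seconds):
    """Deliver one message, then stay silent instead of sending QUIT.

    Returns (code, text) of whatever the server says next; a server with
    a command timeout is expected to answer with a 421 eventually.
    """
    # wait_seconds must exceed the server's smtp_receive_timeout, otherwise
    # our own socket read gives up before the server does.
    client = smtplib.SMTP(host, port, local_hostname="client.example",
                          timeout=wait_seconds)
    try:
        # sendmail() runs EHLO, MAIL FROM, RCPT TO, DATA and the final "."
        client.sendmail(sender, [recipient], body)
        # Deliberately no client.quit(): just block until the server speaks.
        code, text = client.getreply()
        return code, text.decode("ascii", errors="replace")
    finally:
        client.close()
```

If the message shows up in the mailbox right after the 250 and the 421 only arrives when the timeout expires, the server side behaves as described above and the missing QUIT is purely the client's doing.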

From my POV your problem is not related to this unhappy end of the
connection. (But I'd say it's impolite behaviour by the other side.)
--
Heiko :: dresden : linux : SCHLITTERMANN.de
GPG Key 48D0359B : 3061 CFBF 2D88 F034 E8D2 7E92 EE4E AC98 48D0 359B
Graeme Fowler
2012-02-21 09:51:51 UTC
Post by Heiko Schlittermann
From my POV your problem is not related to this unhappy end of the
connection. (But I'd say it's unpolite behaviour of (by?) the other side…)
We saw this very recently after a site firewall upgrade. Almost
identical behaviour except the final "\n.\n" was never received so the
message was deferred. We logged a 421 command timeout.

The Cisco device was on our premises, not the remote, and removing the
erroneous "inspect SMTP" made it all go away.

Graeme
David Woodhouse
2012-02-22 10:42:52 UTC
Post by Scott Neader
I was able to get the remote mail server admin to send me a packet capture
in .pcap format (if anyone wants to see it, I'd be glad to share, nothing
confidential in the cap).
What I see is that our Exim server sends the "250 OK id=xxxxxxx" message
just fine, and within a few ms, their server sends an ACK packet.
Hm, if the message they see is really 'OK id=xxxxxxx' with a real
Exim-like queue ID (why the hell do you feel the need to obfuscate a
local queue ID, anyway?) then it's unlikely to have been generated
anywhere but your server.

If you search your logs for that specific ID, what do you see?

If you do a capture on *your* end, does it match what they see at their
end?
--
dwmw2
Scott Neader
2012-02-22 15:56:44 UTC
Hi David. First, I wasn't obfuscating the ID, I was just saying "we send
our 250, they ack". Didn't think the actual log message and ID would be
important enough to paste into the email...

Anyway... you are right... these messages ARE getting delivered. I was
looking at log messages based on IP, and only seeing the connection, data
and delivery messages, but I did not look at how Exim dealt with the
message... when looking by ID as you have suggested, it shows the messages
are being delivered into the local mailbox.
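
Looking up a message by ID rather than by IP can be done in bulk as well. A small Python sketch, assuming Exim 4.x style IDs such as 1RyUKT-0003th-B6 and the default "date time ID ..." line layout; the function name and the summary fields are hypothetical:

```python
import re
from collections import defaultdict

# Assumption: lines start with "YYYY-MM-DD HH:MM:SS <message-id> ...",
# where the ID has the 6-6-2 shape seen in the log snippet in this thread.
MSG_ID_RE = re.compile(
    r"^\S+ \S+ ([0-9A-Za-z]{6}-[0-9A-Za-z]{6}-[0-9A-Za-z]{2}) (.*)$"
)

def summarize_messages(lines):
    """Group log lines by message ID and record what happened to each."""
    summary = defaultdict(
        lambda: {"received": False, "deliveries": 0, "completed": False}
    )
    for line in lines:
        match = MSG_ID_RE.match(line.rstrip("\n"))
        if not match:
            continue  # connection-level lines (such as timeouts) carry no ID
        msg_id, rest = match.groups()
        entry = summary[msg_id]
        if rest.startswith("<= "):            # message arrival
            entry["received"] = True
        elif rest.startswith(("=> ", "-> ")):  # successful delivery
            entry["deliveries"] += 1
        elif rest.startswith("Completed"):
            entry["completed"] = True
    return dict(summary)
```

A message that was received, delivered at least once and completed was handled normally, whatever the connection did afterwards. Exim also ships the exigrep utility, which pulls together all log lines for matching messages.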

The mystery still stands as to why I am seeing all these SMTP command
timeouts from just these "EdgeWave" mail servers. If the EdgeWave server
has received our "250 OK" message, and their packet capture shows they have
received it, and they have sent an ACK, then why don't they DISCONNECT?

I have started a ticket with EdgeWave, to see if they have any interest in
figuring this out.

Regarding a packet capture on my side, I have to admit, I have never done
it on command-line Linux before (done many on Windoze via
Ethereal/WireShark), so I will have to research that.

Thanks for the input -- much appreciated!!

- Scott
David Woodhouse
2012-02-22 16:27:57 UTC
Post by Scott Neader
The mystery still stands as to why I am seeing all these SMTP command
timeouts from just these "EdgeWave" mail servers. If the EdgeWave server
has received our "250 OK" message, and their packet capture shows they have
received it, and they have sent an ACK, then why don't they DISCONNECT?
I don't know why they don't send a QUIT and then disconnect. Perhaps
they keep the connection open in case they want to use it to deliver
another mail in the future? Stranger things have happened...
--
dwmw2
Scott Neader
2012-02-22 16:36:27 UTC
Are you willing to look at the cap file from their side, to see if they are
doing things right? I'd like to be able to tell them "look, RFC XXX says
after we send the 250 OK, you should send a QUIT, but your cap shows you are
not..." (or whatever), but I'm just not knowledgeable enough.

- Scott
David Woodhouse
2012-02-23 07:49:53 UTC
Post by Scott Neader
Are you willing to look at the cap file from their side, to see if they are
doing things right? I'd like to be able to tell them... look, RFC XXX says
after we send the 250 OK, you should send a QUIT but your cap shows you are
not..." (or whatever) -- but I'm just not knowledgeable enough.
By all means, send it my way. Note that the only "problem" this causes
is an extra line in your log and a small amount of memory used while
Exim is waiting to die, right?
--
dwmw2
Scott Neader
2012-02-23 17:24:26 UTC
Thanks, David, I'll send it to you direct.

My concerns about the timeouts are:

1) I have seen in the past that all of my Exim sockets can be consumed by
misbehaving mail servers (or spam zombies) and thus we defer mail. I'm
open to discussion on this, if I'm doing something wrong, or
misunderstanding.

2) The far-end customer (using EdgeWave) is reporting SOME fatal errors.
Most messages are getting through, but I only found the problem after
being contacted by their ISP, asking why we aren't accepting some of
their mail.

3) We have rate limits set up for misbehaving mail servers, and these
timeouts are counted toward the rate limit. I will need to research to
find out how to stop counting timeouts toward rate limits, if I am to start
ignoring these timeouts as non-issues.

4) It seems most servers with this timeout problem are either EdgeWave mail
servers, or spam zombie home computers. I'm hesitant to ignore these
timeouts, but if the Exim community feels that I should, then I will.
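
Before deciding which timeouts to ignore, it may help to split them into those that follow an accepted message (the EdgeWave pattern) and those where the client never delivered anything. A heuristic Python sketch, assuming default log timestamps and that a timeout belongs to the most recent arrival from the same IP; the function name, the result keys and the `slack` parameter are invented for illustration:

```python
import re
from datetime import datetime, timedelta

STAMP = "%Y-%m-%d %H:%M:%S"
ARRIVAL_RE = re.compile(r"^(\S+ \S+) \S+ <= .*?\[\s*([0-9A-Fa-f.:]+)\s*\]")
TIMEOUT_RE = re.compile(
    r"^(\S+ \S+) SMTP command timeout on connection from "
    r".*?\[\s*([0-9A-Fa-f.:]+)\s*\]"
)

def classify_timeouts(lines, receive_timeout=120, slack=5):
    """Split command timeouts by whether a message arrived just before."""
    window = timedelta(seconds=receive_timeout + slack)
    last_arrival = {}  # remote IP -> time of its latest "<=" line
    result = {"after_message": [], "no_message": []}
    for line in lines:
        match = ARRIVAL_RE.match(line)
        if match:
            last_arrival[match.group(2)] = datetime.strptime(match.group(1), STAMP)
            continue
        match = TIMEOUT_RE.match(line)
        if match:
            when = datetime.strptime(match.group(1), STAMP)
            ip = match.group(2)
            seen = last_arrival.get(ip)
            # Heuristic: a timeout within one timeout period of an arrival
            # is taken to be the same connection idling after its 250.
            key = "after_message" if seen and when - seen <= window else "no_message"
            result[key].append((ip, match.group(1)))
    return result
```

Timeouts in the first bucket cost a connection slot for one timeout period but lose no mail; the second bucket is where connections that never sent a message end up. Busy hosts with overlapping connections can blur the split, so treat the result as a rough guide.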

Thanks!!

- Scott
Todd Lyons
2012-02-23 19:27:47 UTC
By any chance do you have a firewall (Cisco ASA for example) on which you
block all or most ICMP?

A few years ago, I experienced issues with a few particular remote
sites and their erratic mail delivery to us. We had blocked most
ICMP types at the firewall for PCI compliance. We relaxed the rule
and blocked just a few specific ICMP types (the time query ones) and
all of a sudden those issues went away. It must have been breaking
path MTU discovery.

...Todd
--
SOPA: Any attempt to [use legal means to] reverse technological
advances is doomed.  --Leo Leporte
Scott Neader
2012-02-24 04:51:01 UTC
Post by Todd Lyons
By any chance do you have a firewall (Cisco ASA for example) that you
block all or most ICMP?
My Exim server does not. However, the far end EdgeWave server
(66.43.215.27) does have a Cisco 7201 in front of it, and the server is not
ping-able.
Post by Todd Lyons
A few years ago, I experienced issues with a few particular remote
sites and their erratic mail delivery to us. We had blocked most
ICMP types at the firewall for PCI compliance. We relaxed the rule
and blocked just a few specific ICMP types (the time query ones) and
all of a sudden those issues went away. It must have been breaking
path mtu discovery.
Thanks for that... that is the second suggestion that it could be the
customer's firewall/router causing these problems. I am relaying to them.

- Scott
Phil Pennock
2012-02-22 19:51:30 UTC
Post by Scott Neader
Regarding a packet capture on my side, I have to admit, I have never done
it on command-line Linux before (done many on Windoze via
Ethereal/WireShark), so I will have to research that.
tcpdump -w foo-$(date +%s).cap -s 1500 -i ethDEVICE port 25

ethDEVICE on Linux systems used to typically be eth0.

The format of the capture dump is portable and you can then open it in
WireShark on another OS. Hit Ctrl-C when you want to stop the dumping.
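
On a busy server it can help to restrict the capture to the one remote host. A small Python sketch that only builds the tcpdump argument list, assuming the standard `host` and `port` filter primitives; the function name and the file prefix are hypothetical. It uses `-s 0` (tcpdump's default full snap length) because an Ethernet frame carrying a 1500-byte IP packet is 1514 bytes, so `-s 1500` may clip the tail of full-size packets:

```python
import time

def tcpdump_argv(interface, remote_ip, port=25, prefix="smtp"):
    """Build a tcpdump command line limited to one remote host and port."""
    outfile = f"{prefix}-{int(time.time())}.cap"   # same idea as $(date +%s)
    capture_filter = f"host {remote_ip} and port {port}"
    return ["tcpdump", "-w", outfile, "-s", "0", "-i", interface, capture_filter]
```

The list can be handed to `subprocess.run(...)` as root, or joined into a shell command line with the filter expression quoted.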
Scott Neader
2012-02-22 20:19:58 UTC
Very useful, Phil -- thanks very much !!!

- Scott
Mike Kennedy
2012-02-22 16:45:56 UTC
I had a similar experience with a JVM that maintained an SMTP connection
pool; the connections were being held open until it went to use them and
found they had timed out. In my case, the people administering the JVM were
cooperative and set the pool to expire connections rather than keep them
around until they timed out on the server side. I don't know EdgeWave from
Adam, though, and my mail servers don't communicate directly with any
similar DLP product, so this is little better than a WAG in your case.
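
To make the difference concrete, here is a hypothetical Python sketch of a pool that closes idle connections itself instead of waiting for the server to time them out. Nothing here reflects how the JVM pool or EdgeWave actually works; the class, its parameters and the injected `connect`/`close` callables are assumptions for illustration:

```python
import time

class IdleExpiringPool:
    """Keeps idle connections for at most max_idle seconds."""

    def __init__(self, connect, close, max_idle, clock=time.monotonic):
        self._connect = connect    # callable returning a new connection
        self._close = close        # callable that sends QUIT and closes
        self._max_idle = max_idle  # keep this below the server's timeout
        self._clock = clock
        self._idle = []            # (connection, time it was released)

    def reap(self):
        """Close connections that have been idle too long; call periodically."""
        now = self._clock()
        fresh = []
        for conn, released_at in self._idle:
            if now - released_at >= self._max_idle:
                self._close(conn)
            else:
                fresh.append((conn, released_at))
        self._idle = fresh

    def acquire(self):
        self.reap()
        if self._idle:
            return self._idle.pop()[0]
        return self._connect()

    def release(self, conn):
        self._idle.append((conn, self._clock()))
```

The important part is that `reap()` runs on a timer and not only on the next `acquire()`; a pool that checks connections only when it next needs one is exactly the case that leaves the server to hit its command timeout.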