-
Notifications
You must be signed in to change notification settings - Fork 729
Improve SIP packet detection using heuristic parsing #2024
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## dev #2024 +/- ##
==========================================
- Coverage 83.45% 83.45% -0.01%
==========================================
Files 312 312
Lines 54632 54733 +101
Branches 11816 11585 -231
==========================================
+ Hits 45594 45677 +83
- Misses 7811 7818 +7
- Partials 1227 1238 +11
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@sorooshm78 I just saw we already have most of the logic that you implemented: https://github.com/seladb/PcapPlusPlus/blob/master/Packet%2B%2B/src/UdpLayer.cpp#L113-L120 The 2 things you implemented on top of what we have is parsing the request version and the URI, but are they actually required? 🤔 |
|
@seladb Upon reviewing the code in Wireshark, I concluded that further investigation is needed. The logic in Wireshark is essentially designed to search the first line and split it into three tokens. Since the format for request and response is defined in the RFC, this determination was made based on that. Specifically, the RFC defines SIP requests and responses in the following way:
However, based on your comment, I noticed that part of the logic in your parsing already includes this check, so I decided to build on that. For SIP responses, everything was straightforward and clear because of the fixed length of the version and the three-digit status code, which made detection easy. I could use the However, for Therefore, I added two static methods to the
This was necessary to ensure that the parsing of SIP requests was handled correctly without violating the existing design logic. Please let me know if you think there's a better way to approach this or if there are any concerns with my changes. |
|
@sorooshm78 I think I understand it better now: you'd like to be able to parse SIP messages that are not on port 5060. Since the parsing logic is part of the fast path and can potentially run on every captured packet, it should be as efficient as possible, which is why we need to be careful when parsing strings. That's the reason we check the port first and then try to parse the first line. To be honest, I don't see a good way to run this parsing logic on every packet in an efficient manner that would not hurt the performance. Do you know if there are standard SIP ports other than 5060? If yes, maybe we can add them to the list and be able to parse a larger variaty of SIP packets... 🤔 |
@seladb A solution I have considered plausible is a 2 pass system. Basically the ports work only as a hint to which protocol type to try to parse as first. This would allow the library to match packets from "known" protocol ports fast, while also allowing parsing capability of packets coming from "unknown" ports. Below is a flow chart on how it can work. flowchart TD
start[Inbound Packet]-->portCheck{Check Ports}
portCheck -->|Port Matches| discectHintCheck{Run dissector for hinted port}
discectHintCheck --> |No match| runAllDissectors
portCheck -->|No match| runAllDissectors{Run all dissectors}
discectHintCheck -->|Data matches| packetParsed[Parsed Packet]
runAllDissectors -->|Data matches| packetParsed
runAllDissectors--> |No match| unknownPacket[Unknown Packet]
Basically if the inbound packet is HTTP and comes on port 80, it will be fast tracked to the HTTP parser by the port 80 -> HTTP hint. If it came on port 41518, it would fail all port hints and go to the second stage where all the possible protocols are tested for signature match, eventually getting to the HTTP parser and matching. Such architecture should keep performance on the fast path, while allowing protocols on unknown ports to be parsed. The only trade-off is that completely unparsable packets need to go through more checks until they are discarded. |
|
@seladb
|
@sorooshm78 I think this aligns with @Dimi1010 's suggestion above ☝️ The main issue I see with this approach is that all of the non-classified protocols (meaning protocols PcapPlusPlus doesn't yet parse) will fall into this bucket and will be checked for SIP matching (and in the future, maybe more protocols). Maybe we can combine this with option (2) - we can have add 2 static // Parse with checking the port
static SipLayer* parseSipLayer(uint8_t* data, size_t dataLen, Layer* prevLayer, Packet* packet, uint16_t srcPort, uint16_t dstPort);
// Parse without checking the port
static SipLayer* parseSipLayer(uint8_t* data, size_t dataLen, Layer* prevLayer, Packet* packet);The first method will do roughly what the current parsing logic does: check the ports first, then check if it's a request or a response. It will be called instead of the current SIP parsing. The second method will do what option (2) suggests: check the first 3 characters: if it looks like a SIP response - continue checking the version, status code, etc. If it looks like a request - continue checking the SIP method. This method will be called last - after all the port checks. Please let me know what you think. |
|
The approach you suggested works very well for SIP, and I agree it’s a good fit there. However, what I’m proposing (and I believe this is also what @Dimi1010 intends) is a more general change to how PcapPlusPlus detects protocols. Right now detection relies heavily on ports, but changing default ports is not rare at all – it’s actually quite common in real deployments. This is not specific to SIP; any supported protocol might run on a non-default port. In such cases, PcapPlusPlus will simply fail to identify the packet if we only look at ports. I understand that PcapPlusPlus doesn’t support every protocol, but even for the protocols it does support, port-only detection means we don’t get full coverage. As soon as someone changes the default port, those packets are treated as “unknown” by the library. So the trade-off is:
What I’m suggesting is:
Any new protocol added in the future can follow the same pattern. Instead of removing heuristic functions, we can focus on keeping them as lightweight and fast as possible. For example, in my initial SIP heuristic I was taking the first line, splitting it into three tokens, and then parsing each one, which is more expensive. A faster approach is to just look at the first 3 characters of the payload:
This keeps the heuristic very cheap, but still doesn’t rely on the port. |
|
I was indeed talking in the general case. Not just for SIP, but as a general solution to the current limitation of port based detection, which is brittle. Yes, it will mean that the packets that don't fit the fast-path "known ports" heuristic will take slower to parse as they have to run through all dissectors, but I deem that acceptable for the improved functionality. Otherwise, they won't be parsed at all. As for the heuristic dissector functions. I think it's fine if they have some more expensive checks as long as they are later in the order. IMO, the first checks in a detector should be cheap ones that can reject as many true negatives as possible. I'm not super familiar with SIP, but the proposed solution seems fine? If the first characters are not "SIP," that can reject the majority of the true negatives without further inspection? |
|
Another idea that came to my mind is to make the use of heuristics configurable. That way, users who care more about processing speed can disable them and keep the current behavior, while users who prefer higher accuracy and fully parsed results can enable them. |
|
@sorooshm78 @Dimi1010 my suggestion is very similar to what both of you suggest: // Parse with checking the port
static SipLayer* parseSipLayer(uint8_t* data, size_t dataLen, Layer* prevLayer, Packet* packet, uint16_t srcPort, uint16_t dstPort);
// Parse without checking the port
static SipLayer* parseSipLayer(uint8_t* data, size_t dataLen, Layer* prevLayer, Packet* packet);The first overload is the "parse SIP by port" which will be called in the same place it's called today. The second overload is the one who does "heuristic parsing" and will be called at the end, after the port search is exhausted. I agree we can think of a more generic approach which will have heuristics to parse more protocols, but for now maybe we can start with SIP only and see how it goes.
@sorooshm78 I agree that could be a good heuristic that can minimize the performance impact. |
|
@seladb |
@seladb Just to clarify, do you mean that the first one does only the port check or port check + heuristic for parse test? Because I think only the heuristic parse check overload is needed (without the port params). The dispatch by port hint should be handled inside the Note: IMO, since the SipLayer constructors accept pure data buffers during parse, the actual heuristic function should be implemented as a standalone function, instead of inside This is a quick draft on how I can imagine the 2 stage parser to work. enum class SipPacketDissectResult
{
Rejected = 0,
Request = 1,
Response = 2,
};
SipPacketDissectResult dissectSipHeuristic(const uint8_t* data, size_t dataLen)
{
// Do heuristic checks here.
// Return SipPacketDissectResult::Rejected on reject.
// Return SipPacketDissectResult::Request or SipPacketDissectResult::Response if it matches a type.
}
UdpLayer::parseNextLayer()
{
// Guards against insufficient data length here.
uint16_t portSrc, portDst;
// Unpack ports
bool skipPortCheck = false;
for (int i = 0; i < 2; i++)
{
/* Other if (skipPortCheck || TLayer::isTPort(portSrc) || TLayer::isTPort(portDst)) */
// SIP check
if (skipPortCheck || SipLayer::isSipPort(portSrc) || SipLayer::isSipPort(portDst))
{
auto dissectResult = SipLayer::dissectSipHeuristic(udpData, udpDataLen));
// Possibly add initial "if(dissectResult == SipPacketDissectResult::Rejected) {} else"
// to short-circuit the other ifs but probably won't be necessary. Benchmark first.
if (dissectResult == SipPacketDissectResult::Request)
{
constructNextLayer<SipRequestLayer>(udpData, udpDataLen, m_Packet);
break; // We constructed the next layer. Exit the for-loop.
}
else if (dissectResult == SipPacketDissectResult::Response)
{
constructNextLayer<SipResponseLayer>(udpData, udpDataLen, m_Packet);
break; // We constructed the next layer. Exit the for-loop.
}
}
/* Other if (skipPortCheck || TLayer::isTPort(portSrc) || TLayer::isTPort(portDst)) */
// ----- After all if parse checks ------
// After the first iteration is done. Set skipPortCheck to true to ignore the ports on the second run.
skipPortCheck = true;
}
// Final check if no layer could be identified.
if(!hasNextLayer())
{
// Construct a generic payload layer.
constructNextLayer<PayloadLayer>(udpData, udpDataLen, m_Packet);
}
}
bool SipLayer::isSipPort(uint16_t port) { /*... */}
// Maybe not needed?
static SipLayer* SipLayer::parseSipLayer(uint8_t* data, size_t dataLen, Layer* prevLayer, Packet* packet)
{
// Does heuristic checks here.
// Return `nullptr` on reject.
// Return SipLayer object on success.
}PS: This architecture would also allow multiple protocol types to share a "port hint" and still work, although the more overlap there is the less benefits there are to actually hinting. |
|
@Dimi1010 I think that parsing a SIP layer if a port is known is quite different vs. any packet: if you know it's a SIP port you can expect a request or response and parse them accordingly. For example: if the dst port is 5060 it can't be a SIP request and vise versa. However, for any packet, you should first decide if it's a request or a response (probably by examining the first 3 characters), and then continue parsing. That's why I suggested 2 parsing methods vs. just one.
Instead of
I think we should find a better approach than the 2 iteration for-loop, because packets that don't match any layer will be parsed twice as slow, as they will throw the whole check again... |
9ff7190 to
4d846ad
Compare
4d846ad to
b34c27b
Compare
Dimi1010
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
seladb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sorooshm78 please find last few comments, this PR is almost ready to be merged 👍
seladb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this PR is ready to be merged.
@Dimi1010 please let us know if you have any additional comments.
Dimi1010
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple questions, but otherwise LGTM.
|
@sorooshm78 PR has been merged. Thank you for working on this. 🙂 |
|
Thank you so much, @sorooshm78, for working on this. I really appreciate the effort, and I apologize for the long review cycle 🙏 ❤️ |
|
@seladb Thanks a lot! No worries at all — I really appreciate the review and the feedback, and I’d be happy to contribute more to this project in the future. |
This PR improves SIP packet detection in PcapPlusPlus by introducing heuristic parsing based on Wireshark’s SIP dissector. SIP messages are now detected from the UDP payload itself, not only when using port 5060.
static bool dissectSipHeuristic(const uint8_t* data, size_t dataLen)to detect SIP requests/responses from payload content (Wireshark-style logic)UdpLayerso SIP packets on non-standard ports are correctly classifiedRelated issue: #2022
Note: I’m not yet fully familiar with PcapPlusPlus’ internal structure, so if there are better places, names, or patterns for this logic, I’m happy to adjust the PR based on your feedback