Skip to content
This repository has been archived by the owner on Sep 7, 2020. It is now read-only.

Feature/collect link metrics #831

Merged
merged 14 commits into from
Mar 19, 2020
Merged

Feature/collect link metrics #831

merged 14 commits into from
Mar 19, 2020

Conversation

mariomaz
Copy link
Collaborator

@mariomaz mariomaz commented Feb 14, 2020

See "[TASK] Link metric collection for Ethernet links #390" for a detailed description of the task.

Currently we only support collecting link metrics for Ethernet interfaces and it is done partially, because we need topology information to know which are the neighbors of the node. This is not available yet.

Collecting metrics for WiFi interfaces requires adding support for standard commands in DWPAL. See "Send standard NL80211 commands using DWPAL #782" for more information.

For tests_flows.sh to pass, "Fix several problems with the Device Information TLV #825" should be merged first.

I preferred to create a draft-PR so you can verify the design of the solution and give early feedback. Also to be particularly reviewed is the way MAC address of requested specific neighbor is compared against MAC address of the controller.

Copy link
Collaborator

@arnout arnout left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed up to "split link metric query TLV into two TLVs".

agent/src/beerocks/slave/link_metrics/link_metrics.h Outdated Show resolved Hide resolved
agent/src/beerocks/slave/link_metrics/link_metrics.h Outdated Show resolved Hide resolved
* @return True on success and false otherwise.
*/
virtual bool get_link_metrics(const std::string &local_interface_name,
const sMacAddr &neighbor_interface_address,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in practice it will never be possible to distinguish between different neighbours that are connected on the same interface. This is a serious shortcoming of the 1905.1 spec, but not much we can do about it... @mariomaz do you think you can ask your colleagues who are more familiar with 1905.1 how you are supposed to deal with interfaces where there are multiple neighbours, or none?

Anyway, my point is: I think in practice we will only use the local_interface_name.

(nitpick)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you.
I've checked our 1905.1 implementation at this point and found this TODO which also confirms your opinion:

            //   TODO: In Linux there is no easy way to obtain this
            //   information. We will just report the amount of *total* packets
            //   transmitted from our local interface, no matter the
            //   destination.
            //   This information will only be correct when the local interface
            //   is connected to one single remote interface... however we
            //   better report this than nothing at all.

However, be aware that this is an interface which will be implemented by different collectors, not only the 802.3 collector. In the 802.11 collector it is possible to distinguish between different neighbors and thus we do need the neighbor_interface_address parameter.

Comment on lines 135 to 172
req.hdr.nlmsg_len = NLMSG_LENGTH(sizeof(struct rtgenmsg));
req.hdr.nlmsg_type = RTM_GETLINK;
req.hdr.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
req.hdr.nlmsg_seq = 1;
req.hdr.nlmsg_pid = 0;
req.gen.rtgen_family = AF_PACKET; /* no preferred AF, we will get *all* interfaces */

io.iov_base = &req;
io.iov_len = req.hdr.nlmsg_len;
rtnl_msg.msg_iov = &io;
rtnl_msg.msg_iovlen = 1;
rtnl_msg.msg_name = &kernel;
rtnl_msg.msg_namelen = sizeof(kernel);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason to open-code all this, rather than using libnl3 genl? In fact, we could move base_wlan_hal_nl80211::send_nl80211_msg to network_utils and use that function. The nl80211 part of it is only hardcoded in the genl_ctrl_resolve call.

Anyway, something that can be done later.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you.

This is something that could probably be done using changes in "Feature/add support for nl80211 standard commands to BWL #900". However, be aware that such changes have been made into BWL and not in network_utils.

Anyway, I agree with you that this is a refactoring that could be done later (I think it is not going to be so easy and quick as might seem at first glance).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not indeed move base_wlan_hal_nl80211::send_nl80211_msg to BCL? just that function, shouldn't be too hard even as a followup commit in this PR. I really prefer that we don't duplicate code when we really don't need to. Still, as you say, it can be done later, so (nitpoick)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method base_wlan_hal_nl80211::send_nl80211_msg() is going to disappear after we had approved the design and merged "Feature/add support for nl80211 standard commands to BWL #900" and done a refactoring of the NL80211 version of the BWL.

The work to be done is then to implement the get_link_metrics() method using the new classes defined in #900.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regardless of the new classes, the question remains why you would do all the parsing manually rather than use one of the rtnl_link_get_ functions to get all info in a nice struct.

/**
* This message contains the stats for the interface we are interested in
*/
if (0 == strcmp(local_interface_name.c_str(), (char *)RTA_DATA(attribute))) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there is a way to specify this already in the GET message, so instead of getting all interfaces, you get just one.

You would need to convert it to ifindex first though, IIRC.

(nitpick).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a nitpick comment, the change could probably be done as part of the previous refactoring.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strcmp is baned C function if I'm not mistaken.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it appears I was mistaken and apparently the list of banned C functions by Intel SDL is listed in github:
https://github.com/intel/safestringlib/wiki/SDL-List-of-Banned-Functions
We should add this to our CONTRIBUTING.md, perhaps @vitalii-komisarenko @abelog or @VladyslavTupikin can do that as good quick first PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although strcmp is not in the list of banned C functions, I've replaced it with std::strncmp() C++ function which is safer.

}
}

link_metrics.transmitter.packet_errors = stats->tx_errors;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we have to do this right...

All the stats fields are uint32_t and wrap. tx_errors probably won't wrap, but the number of rx/tx packets will wrap if the device stays up for a couple of days.

Therefore, we have to keep the previously measured value and subtract it (wrapping subtraction, I'm not entirely sure what the standard C++ way of doing that is).

While we're at it, we should probably also keep the timestamp. In 1905.1 the time of the measurement isn't specified (which renders it completely useless), but in R2 there is a timestamp included. And to make the metrics somewhat useful even in R1, we should probably divide by the sampling time so the numbers are at least comparable with each other.

So, I think it would be better to keep an object for the link metrics for each interface.

That also means you can keep the netlink fd open. Minor optimisation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry but I cannot see the timestamp field you mentioned. Maybe I'm not looking at the right document. Can you please clarify which are those R1 and R2 docs?
I agree with you that if we cannot include the time measurement window along with the counter, then it is useless to provide the number of packets/errors during such measurement period.

I think however that, even if we could provide timestamp or elapsed time since last measure, we have no way to know if the number of packets we read with the netlink socket have wrapped (and how many times) since last call to get link metrics. Although such wrapping is quite improbable to go unnoticed, given the frequency with which calls to get link metrics will be probably issued, we cannot guarantee that wrapping won't happen. Keeping timestamps is useless if they are not provided by the kernel.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry but I cannot see the timestamp field you mentioned.

I should have written "we should keep a timestamp".

we have no way to know if the number of packets we read with the netlink socket have wrapped (and how many times) since last call to get link metrics.

Stupid me again, I thought these were 64-bit numbers, but they're only 64-bit if you get them through netlink, not if you get them from IFLA_STATS64. Which apparently requires special handling because they're unaligned...

Which brings me to the question: why don't we use rtnl for this? E.g. https://gist.github.com/beejeebus/cc4fb07472cf5a0afd41 rtnl_link takes care of the 64-bit handling and fallback to 32-bit if the 64-bit stats are not available.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear: since the code is there now anyway, I would merge it as is. But when later we refactor with #900, I think we should also look into using rtnl_link instead of doing all the parsing manually. However, before going into that, we should check if the libnl-tiny used in openwrt also has it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read in https://wireless.wiki.kernel.org/en/developers/documentation/nl80211

libnl tiny
OpenWrt folks created a tiny version of libnl based on a git snapshot, which only contains genl, not rtnetlink or any of the netfilter stuff, and compiles down to less than 30k in binary size. You can find it here:
https://git.openwrt.org/?p=openwrt/openwrt.git;a=tree;f=package/libs/libnl-tiny;hb=HEAD

@ghost ghost linked an issue Feb 19, 2020 that may be closed by this pull request
3 tasks
@ghost ghost requested review from tomereli and morantr February 19, 2020 09:18
@mariomaz
Copy link
Collaborator Author

mariomaz commented Mar 2, 2020

I've also replaced strcmp() banned function with std::strncmp()

Copy link
Collaborator

@tomereli tomereli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review only for now, so just comments

agent/src/beerocks/slave/CMakeLists.txt Show resolved Hide resolved
/**
* This message contains the stats for the interface we are interested in
*/
if (0 == strcmp(local_interface_name.c_str(), (char *)RTA_DATA(attribute))) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strcmp is baned C function if I'm not mistaken.

/**
* This message contains the stats for the interface we are interested in
*/
if (0 == strcmp(local_interface_name.c_str(), (char *)RTA_DATA(attribute))) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it appears I was mistaken and apparently the list of banned C functions by Intel SDL is listed in github:
https://github.com/intel/safestringlib/wiki/SDL-List-of-Banned-Functions
We should add this to our CONTRIBUTING.md, perhaps @vitalii-komisarenko @abelog or @VladyslavTupikin can do that as good quick first PR.

Comment on lines 135 to 172
req.hdr.nlmsg_len = NLMSG_LENGTH(sizeof(struct rtgenmsg));
req.hdr.nlmsg_type = RTM_GETLINK;
req.hdr.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
req.hdr.nlmsg_seq = 1;
req.hdr.nlmsg_pid = 0;
req.gen.rtgen_family = AF_PACKET; /* no preferred AF, we will get *all* interfaces */

io.iov_base = &req;
io.iov_len = req.hdr.nlmsg_len;
rtnl_msg.msg_iov = &io;
rtnl_msg.msg_iovlen = 1;
rtnl_msg.msg_name = &kernel;
rtnl_msg.msg_namelen = sizeof(kernel);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not indeed move base_wlan_hal_nl80211::send_nl80211_msg to BCL? just that function, shouldn't be too hard even as a followup commit in this PR. I really prefer that we don't duplicate code when we really don't need to. Still, as you say, it can be done later, so (nitpoick)

@mariomaz mariomaz force-pushed the feature/collect-link-metrics branch from e2e175f to f42ee33 Compare March 5, 2020 08:53
@mariomaz mariomaz requested review from arnout and tomereli March 5, 2020 11:09
@mariomaz mariomaz force-pushed the feature/collect-link-metrics branch from 60f68a6 to 7485d32 Compare March 6, 2020 09:29
@mariomaz mariomaz marked this pull request as ready for review March 6, 2020 10:07
@mariomaz mariomaz requested a review from rmelotte as a code owner March 6, 2020 10:07
Copy link
Collaborator

@arnout arnout left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except for doing the test in python instead of shell, all my comments are nitpicky.

}
}

link_metrics.transmitter.packet_errors = stats->tx_errors;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry but I cannot see the timestamp field you mentioned.

I should have written "we should keep a timestamp".

we have no way to know if the number of packets we read with the netlink socket have wrapped (and how many times) since last call to get link metrics.

Stupid me again, I thought these were 64-bit numbers, but they're only 64-bit if you get them through netlink, not if you get them from IFLA_STATS64. Which apparently requires special handling because they're unaligned...

Which brings me to the question: why don't we use rtnl for this? E.g. https://gist.github.com/beejeebus/cc4fb07472cf5a0afd41 rtnl_link takes care of the 64-bit handling and fallback to 32-bit if the 64-bit stats are not available.

Comment on lines 135 to 172
req.hdr.nlmsg_len = NLMSG_LENGTH(sizeof(struct rtgenmsg));
req.hdr.nlmsg_type = RTM_GETLINK;
req.hdr.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
req.hdr.nlmsg_seq = 1;
req.hdr.nlmsg_pid = 0;
req.gen.rtgen_family = AF_PACKET; /* no preferred AF, we will get *all* interfaces */

io.iov_base = &req;
io.iov_len = req.hdr.nlmsg_len;
rtnl_msg.msg_iov = &io;
rtnl_msg.msg_iovlen = 1;
rtnl_msg.msg_name = &kernel;
rtnl_msg.msg_namelen = sizeof(kernel);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regardless of the new classes, the question remains why you would do all the parsing manually rather than use one of the rtnl_link_get_ functions to get all info in a nice struct.

framework/tlvf/src/src/CmduMessageRx.cpp Outdated Show resolved Hide resolved
Comment on lines +100 to +111
* The IEEE 1905.1 standard says about the Link Metric Query TLV and the neighbor type
* octet that "If the value is 0, then the EUI48 field is not present; if the value is 1,
* then the EUI-48 field shall be present."
*
* However, optional fields are not currently supported by TLVF.
*
* As a workaround, instead of defining a tlvLinkMetricQuery TLV with an optional field,
* we have defined two different TLVs, one with the optional field and the other one
* without it. Application must then check the length of received TLV to know if optional
* field (MAC address of neighbor device) is present or not and then create an instance of
* either tlvLinkMetricQuery or tlvLinkMetricQueryAllNeighbors respectively.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if I said it before, but this is really an excellent explanation.

Comment on lines +2437 to +2574
/**
* Set alias flag to true if link metrics for a specific neighbor have been requested
*/
bool specific_neighbor =
ieee1905_1::eLinkMetricNeighborType::SPECIFIC_NEIGHBOR == neighbor_type;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nitpick) I would set this directly in the condition above instead of saving neighbor_type.

tests/test_flows.sh Outdated Show resolved Hide resolved
@mariomaz mariomaz force-pushed the feature/collect-link-metrics branch from 7485d32 to 20eee01 Compare March 16, 2020 17:01
Copy link
Collaborator

@tomereli tomereli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have only nitpicks - this is an amazing work @mariomaz - well done!

@@ -0,0 +1,279 @@
/* SPDX-License-Identifier: BSD-2-Clause-Patent
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me some time to understand that this is only for ethernet links. I guess I should know by heart that 802.3 is Ethernet, but consider changing the file name to eth_link_metrics_collector.cpp (nitpick)

* If the media type of the link is IEEE 802.11, then this value is the estimated RSSI in dB at
* the receive side of the link expressed in dB; otherwise, it is set to 0xFF.
*/
uint8_t rssi = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So why not set it to UINT8_MAX by default instead of 0? Then you don't need to set it in the ethernet link collector implementation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This struct is going to be used in the WiFi link collector implementation too. UINT8_MAX is not a default value, but the value to use for ethernet links (according to standard)

/**
* Get link metrics information
*/
if (collector &&
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to split to 2 if conditions (nitpick)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some static code analyzers would complain in such case.

* tlvLinkMetricQueryAllNeighbors without it. Application must check which of both TLVs has
* been received inside the message.
*/
std::shared_ptr<ieee1905_1::tlvLinkMetricQueryAllNeighbors> tlvLinkMetricQueryAllNeighbors;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can handle each option in separate functions? (nitpick)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the code is common to both options. Separated functions would contain lots of duplicated code so I prefer to leave it as it is now.

@arnout arnout force-pushed the feature/collect-link-metrics branch from c64e2d1 to 75f286f Compare March 18, 2020 18:50
Copy link
Collaborator Author

@mariomaz mariomaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I pressed the wrong button

@mariomaz mariomaz added the don't merge This PR is not ready for merge or review label Mar 19, 2020
@mariomaz mariomaz force-pushed the feature/collect-link-metrics branch from 75f286f to 636e1cf Compare March 19, 2020 18:26
@mariomaz mariomaz added ready for merge PR is ready to be merged (automatically) and removed don't merge This PR is not ready for merge or review labels Mar 19, 2020
Define a structure to hold collected link metrics. This structure will
later be used to fill the TLVs of the Link Metric response message.

Use uint32_t instead of uint16_t for max throughput capacity and bit
rate fields to allow for bigger values.
Spec document uses uint16_t so a saturation to the maximum value is
required when copying values to the TLV.

Link metrics have to be collected differently for wired and wireless
links. Define an interface class for a link metrics collector. Derived
implementation classes will override pure virtual methods.

Create an implementation class for Ethernet links. This implementation
reads network interface stats using a Netlink socket and the
NETLINK_ROUTE protocol.

TODO in future tasks:
- Implement link metrics collector for WiFi links in ([TASK] Link
metric collection for WiFi links #792)
- Move all link metrics related classes to BPL ([TASK] Move link
metric related classes to BPL #910)
- Implement Ethernet link metrics collector using and extending Netlink
classes defined in "Feature/add support for nl80211 standard commands
to BWL #900"

Signed-off-by: Mario Maz <[email protected]>
Signed-off-by: Mario Maz <[email protected]>
Media type group is the MSB of the media type field. It is not included
in any message as is. It is used internally to decide which mechanism
has to be used to compute the media type.

Signed-off-by: Mario Maz <[email protected]>
The Link Metric Query TLV contains an optional field. However, TLVF
does not currently support optional fields.

As a workaround we create two different TLVs, one that includes the
optional field and another one which does not. When parsing received
message, the application will check the length of included TLV to know
which one is it.

Signed-off-by: Mario Maz <[email protected]>
Method getNextTlvLength() returns an uit16_t with the length of next
TLV in message. However, the method is not considering the endianness
and returns the 2 bytes swapped.

Fix getNextTlvLength to consider endianness swapping bytes of returned
length value.

Swapping is only correct in the context of CmduMessageRx::parse().
Remove getNextTlvData and getNextTlvType(eTlvType) from CmduMessage.
Move getNextTlvType() and getNextTlvLength down into CmduMessageRx and
make them private.

Signed-off-by: Mario Maz <[email protected]>
Check the length of received TLV to know if optional field (MAC address
of neighbor device) is present or not and then create an instance of
either tlvLinkMetricQuerySpecificNeighbor or tlvLinkMetricQuery
respectively.

Signed-off-by: Mario Maz <[email protected]>
After having created different TLVs to cope with optional field in
tlvLinkMetricQuery and moved some inner enums out of their classes,
update test that was was using such TLV.

Signed-off-by: Mario Maz <[email protected]>
Add handler for LINK_METRIC_QUERY_MESSAGE message. On reception, obtain
requested link metrics, build LINK_METRIC_RESPONSE_MESSAGE and send it.

Currently only wired interfaces are supported. Obtaining link metrics
for wireless interfaces will be done in a future task.

Topology database is required to know which neighbor nodes are
connected to each interface. Since we do not have such database now, a
temporary solution has been implemented: a function that returns a list
with a single (neighbor, interface) pair for the controller. In the
future, once we have the topology database available, this function
will completed to return all links between interfaces and neighbors.

Signed-off-by: Mario Maz <[email protected]>
Modify the strings written to log so test_flows can later check
which messages and TLVs have been received.

Fix some typos in log messages.

Signed-off-by: Mario Maz <[email protected]>
@mariomaz mariomaz force-pushed the feature/collect-link-metrics branch from 636e1cf to 3133c23 Compare March 19, 2020 18:30
Remove agent's emulated response because now the Link Metric query
message is processed by agent. Also, as the message handler is now in
the backhaul_manager_thread, the log file to look for messages has to
be changed.

Signed-off-by: Mario Maz <[email protected]>
Update test_flows.py to create test for LINK_METRIC_QUERY_MESSAGE
reception in agent.

Check also that controller receives LINK_METRIC_RESPONSE_MESSAGE.

Signed-off-by: Mario Maz <[email protected]>
Link Metric Query message is handled now in backhaul manager thread
so remove its (empty) handler from son slave thread.

Signed-off-by: Mario Maz <[email protected]>
Now that we have a get_media_type() function, use it wherever it was
previously being computed a MediaType field.

Signed-off-by: Mario Maz <[email protected]>
@mariomaz mariomaz force-pushed the feature/collect-link-metrics branch from 3133c23 to 8919a56 Compare March 19, 2020 18:32
@mergify mergify bot merged commit 0011ffe into master Mar 19, 2020
@mergify mergify bot deleted the feature/collect-link-metrics branch March 19, 2020 18:45
@mariomaz mariomaz removed the ready for merge PR is ready to be merged (automatically) label Apr 9, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[TASK] Link metric collection for Ethernet links
3 participants