Thoughts on multiprocessing (and networking) #131
Comments
I did some quick testing and the ROS installation through RoboStack went great. It took <5 min with no issues: rviz2 worked and the "topic" examples in ros2 examples also worked. Not all ROS packages are currently supported in RoboStack's conda packages, notably moveit and zed_ros_wrapper are missing, but the realsense packages are available. In total 613/1441 packages are supported; I assume they only count packages listed on the ROS index.

Given that the installation process seems to be very smooth, the most important remaining issue is performance. The default DDS that is provided is Fast-DDS. However, it seems to be using UDP for communication (seen in Fast-DDS monitor), even for two processes running on the same computer. This is probably also the reason why I can't publish more than about 1M points smoothly at 10 Hz, which is about 160 MB/s (each point is 16 bytes in the example). For the full-resolution Zed2i point cloud at 15 fps, we need about 500 MB/s, so it's still quite far off. Luckily Fast-DDS supports shared memory transport. I hope it's not too difficult to enable that for ROS2. Here are two sources I'm looking into:
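As a quick sanity check on those bandwidth figures (the 1920×1080 full-resolution cloud size below is an assumption; the 16 bytes per point comes from the example above):

```python
# Rough bandwidth estimates; 16 bytes per point as in the ros2 pointcloud example.
example_points = 1_000_000              # points in the example point cloud
print(example_points * 16 * 10 / 1e6)   # 160.0 MB/s at 10 Hz

zed_points = 1920 * 1080                # assumed full-resolution Zed2i cloud
print(zed_points * 16 * 15 / 1e6)       # ~498 MB/s at 15 fps
```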
Enabling shared memory seems fairly simple. I first created this XML file:

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
<!-- Default publisher profile -->
<publisher profile_name="default publisher profile" is_default_profile="true">
<qos>
<data_sharing>
<kind>AUTOMATIC</kind>
</data_sharing>
</qos>
<historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</publisher>
<!-- Default subscription profile -->
<subscriber profile_name="default subscription profile" is_default_profile="true">
<qos>
<data_sharing>
<kind>AUTOMATIC</kind>
</data_sharing>
</qos>
<historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</subscriber>
</profiles>
```

And then set these environment variables:

```bash
export RMW_FASTRTPS_USE_QOS_FROM_XML=1
export FASTRTPS_DEFAULT_PROFILES_FILE=/home/idlab185/ros2_examples/rclpy/topics/pointcloud_publisher/examples_rclpy_pointcloud_publisher/fastdds_profile.xml
```

However, this did not seem to use shared memory. So I tried forcing it by changing
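For reference, forcing the shared-memory transport explicitly would presumably mean adding something like the following to the profiles file (a sketch based on the Fast DDS XML profile docs; untested, and the `shm_transport`/`shm_participant` names are made up):

```xml
<!-- Sketch: register a SHM transport and make the default participant use only it. -->
<transport_descriptors>
    <transport_descriptor>
        <transport_id>shm_transport</transport_id>
        <type>SHM</type>
    </transport_descriptor>
</transport_descriptors>

<participant profile_name="shm_participant" is_default_profile="true">
    <rtps>
        <userTransports>
            <transport_id>shm_transport</transport_id>
        </userTransports>
        <useBuiltinTransports>false</useBuiltinTransports>
    </rtps>
</participant>
```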
The ROS message my node is trying to publish is: `from sensor_msgs.msg import PointCloud2`. It seems like ROS is thus publishing that as an unbounded Fast-DDS data type. I hope that can be configured, or that we can define custom ROS messages that are bounded. Related:
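For context on why the type is unbounded: the `sensor_msgs/PointCloud2` definition contains unbounded sequences (abridged from memory, field order may be off):

```
std_msgs/Header header
uint32 height
uint32 width
PointField[] fields   # unbounded sequence of field descriptors
uint8[] data          # unbounded byte sequence -> no fixed maximum size
```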
The docs in
So as far as I know, it's a known ROS2 limitation that most of the messages (those with unbounded types) in the standard interfaces can't be passed over shared memory. It's a bit of a pity that we can't make use of the standard interfaces (for now, and if we need high performance), and I hope we can still visualize our customized (bounded) point cloud messages in rviz2 etc.
After reading this comment, I'm afraid that making the types bounded is not sufficient, as ROS still also uses

What this means in practice is that we will need custom messages for each camera resolution we want to be able to pass over shared memory with ROS2, e.g.:
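As an illustration of the kind of per-resolution message this implies (hypothetical message name and sizes, assuming a 1920×1080 cloud with packed XYZ floats and RGB bytes):

```
# PointCloud1080p.msg (hypothetical; fixed-size arrays keep the DDS type bounded)
std_msgs/Header header
float32[6220800] points   # 1920 * 1080 * 3 -> x, y, z per pixel
uint8[6220800] colors     # 1920 * 1080 * 3 -> r, g, b per pixel
```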
Huh, I didn't know that. A little annoying that we would need to create different messages for each type, but I can live with it. What is the throughput with shared memory that you can get with Fast-DDS? And for the network communication, did you tune the configuration (best effort, max throughput)?
Shared memory is just a region of RAM, so theoretically we could get up to ~40 GB/s on Gorilla. I assume the DDS implementations are quite optimized and won't add too much overhead for large arrays (e.g. images). However, we still have to test this in practice and see if we can get it configured. For the network communication I didn't change any of the default ROS2 or Fast-DDS settings. A very rough estimate for the throughput I got with UDP was 160 MB/s. For CycloneDDS there is some tuning advice in this ROS2 how-to guide. For Fast-DDS (formerly Fast RTPS) I haven't found instructions. Another thing to check is whether we need to explicitly configure the usage of the "loopback interface" when transferring data between processes on the same host. Maybe this is enabled when we set the
Concretely what I'm proposing:
We can start from this ros2 example: pointcloud_publisher.py. I would create a
Thanks! Let's evaluate this on (1) complexity for end-users and developers and (2) throughput performance.
I may have an easy-to-install and easy-to-use alternative to RoboStack (which, unlike Victor-Louis' experience above, I got very annoyed with during the installation process). I was planning to benchmark a couple of libraries & frameworks for IPC, but after 0MQ worked pretty much out of the box, I will stop here for now and continue to investigate this instead.

0MQ is pip installable (pyzmq). The code (below) supports publishing RGB images, depth images, and colored point clouds from one process and subscribing from another. Since the publishing process can be launched as a child process, the code is as easy to use as Victor-Louis' current solution in airo-camera-toolkit.

So far, it looks like I can achieve a throughput of about 600 MB/s. There's no need to manage shared memory ourselves, since 0MQ does it all for free. Though we do still need to handle serialization ourselves. As long as you just send NumPy arrays, it's easy to do (just use

The code itself is not very complex either: it's only about 100 lines for publisher and subscriber. See https://gist.github.com/m-decoster/2eea84ad5fb4d364724af54aca70a1d4

To be continued.

Update: without depth images I get a throughput of 1261 MB/s, which is sufficient to send over point clouds and RGB images at 15 FPS! Possibly this line is the culprit causing
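For reference, a minimal pyzmq publish/subscribe sketch in the same spirit as the gist (not the gist itself; the address, topic name, and header format here are made up):

```python
import numpy as np
import zmq

ADDRESS = "tcp://127.0.0.1:5555"  # hypothetical; ipc:// endpoints also work on Linux


def publish(array: np.ndarray) -> None:
    """Send one NumPy array as: topic frame, shape/dtype header frame, raw bytes."""
    socket = zmq.Context.instance().socket(zmq.PUB)
    socket.bind(ADDRESS)
    header = f"{array.dtype.str}|{','.join(map(str, array.shape))}".encode()
    # A real publisher would bind once and send in a loop (PUB/SUB drops messages
    # sent before subscribers have connected).
    socket.send_multipart([b"rgb", header, array.tobytes()])


def receive() -> np.ndarray:
    """Receive one array and reconstruct it from the header and payload frames."""
    socket = zmq.Context.instance().socket(zmq.SUB)
    socket.connect(ADDRESS)
    socket.setsockopt(zmq.SUBSCRIBE, b"rgb")
    _topic, header, payload = socket.recv_multipart()
    dtype_str, shape_str = header.decode().split("|")
    shape = tuple(int(s) for s in shape_str.split(","))
    return np.frombuffer(payload, dtype=np.dtype(dtype_str)).reshape(shape)
```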
Thanks for digging into this @m-decoster! I'm dumping a few links that were on my 'to read' list on this topic:
Looking forward to your findings!
Ah I think @Victorlouisdg had also identified this line as a huge performance hit.
Be sure to check out my last comment here about the

Then for the future, I agree we should look to outsource our multiprocess communication. The speeds of ZeroMQ seem promising, and it seems like you can define data shape/size at runtime (as opposed to compile-time for ROS2), so it's definitely worth considering. However, I'm honestly still a fan of exploring the ROS2 option first, because it is more standard in the robotics community. As a lab, I think we could save a lot of time if we embraced ROS (e.g. also for Schunk drivers and navigation), instead of avoiding it at all costs.
As discussed on Friday, we will, for now, use airo-ipc and stress test it in upcoming experiments and demos. I propose closing this issue for now, as we can always re-open it when there is new information. For completeness' sake, here are two more libraries that are interesting to keep an eye on:
Thoughts on multiprocessing (and networking)
I'm creating this issue to collect some thoughts on multiprocessing in general, our options and their pros and cons. Before I commit too much more time to our multiprocessing code, I want to be sure we’re implementing the right solution.
I’ve split this post into several chapters:
- The problem with using a single process for everything
- The airo-mono philosophy
- Options for multiprocessing
- Conclusion and action points
The problem with using a single process for everything
For some use cases, e.g. retrieving images & point clouds at full resolution and fps (possibly from multiple cameras), servoing at high frequency, recording videos, logging and saving data, it’s very difficult to keep everything running smoothly in a single (Python) process.
Concretely: it's very hard to record videos of your experiments with the same camera that you are using for making robot control decisions, which is a shame.
The airo-mono philosophy
Airo-mono has the (implicit) motto: "Keep simple things simple", which has made it a great tool for research and prototyping. In practice, this means keeping everything pip installable (except maybe camera SDKs) and providing the functionality as Python functions, or Python classes with intuitive and standardized interfaces (and a few CLI tools and simple OpenCV "apps").
The (ideal) standard airo-mono-based project getting-started workflow would be like this:
I think we all agree this has been a great success, and is not something we want to compromise on. So that is important to keep in mind when considering the multiprocess options.
Options for multiprocessing
Multiprocessing (or process-based parallelism) has been around for a long time, and a central topic is inter-process communication. I believe our main options are:
- Shared memory (Python's built-in `multiprocessing`)
- Cyclone DDS
- ROS2
In the following subsections, I will explain each of these briefly and the pros/cons I believe they have.
Shared memory
Almost all operating systems support the concept of shared memory. Shared memory is simply a part of main memory (RAM) where multiple processes can write/read to (normally, a process has its private part of main memory). Reading and writing to shared memory can thus be very fast.
Python has a built-in package `multiprocessing` that makes it easy to create blocks of shared memory. You just provide a name and an amount of bytes. Additionally, it integrates with numpy pretty well (however, you have to communicate the shape of numpy arrays to receivers yourself over shared memory, which is a bit clunky, but it works). This is what I've currently used in the `MultiprocessRGBPublisher` classes.
Pros:
Cons:
The first three cons come down to the fact that we have to manage the shared memory ourselves (and maintain that code).
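A minimal sketch of that pattern with the standard library (the block name, shape, and dtype below are illustrative; both processes must agree on them out of band):

```python
import numpy as np
from multiprocessing import shared_memory

SHAPE, DTYPE = (1080, 1920, 3), np.uint8  # must be communicated to readers somehow

# Producer process: create a named block and view it as a numpy array.
shm = shared_memory.SharedMemory(name="rgb_frame", create=True,
                                 size=int(np.prod(SHAPE)))
frame = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
frame[:] = 128  # write the latest camera image here

# Consumer process: attach by name. The shape/dtype are not stored in the block,
# which is the "clunky" part mentioned above.
shm_view = shared_memory.SharedMemory(name="rgb_frame")
image = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm_view.buf)
```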
Cyclone DDS
DDS is short for Data Distribution Service, and it is a form of inter-process communication. DDS generally also supports passing data between computers connected via a network, so data is typically passed over IP. CycloneDDS uses UDP by default but can also be configured to use TCP. However, when operating in this mode, throughput is much lower than over shared memory (and likely can't handle our full camera data streams). To fix this, CycloneDDS also has support for using shared memory, but I'm unsure how easy this is to install and configure.
CycloneDDS Python support seems pretty nice. Defining messages seems to be not much harder than defining a dataclass; see the GitHub README for an example (Chatter).
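For reference, a sketch of what that looks like, loosely based on the Chatter example in the cyclonedds-python README (field names and topic name are simplified here, so treat it as an approximation):

```python
from dataclasses import dataclass

from cyclonedds.domain import DomainParticipant
from cyclonedds.idl import IdlStruct
from cyclonedds.pub import DataWriter
from cyclonedds.sub import DataReader
from cyclonedds.topic import Topic


# A message type is just a dataclass that inherits from IdlStruct.
@dataclass
class Chatter(IdlStruct, typename="Chatter"):
    message: str
    count: int


participant = DomainParticipant()
topic = Topic(participant, "chatter", Chatter)
writer = DataWriter(participant, topic)
reader = DataReader(participant, topic)

writer.write(Chatter(message="hello", count=1))
for sample in reader.take():
    print(sample.message, sample.count)
```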
Questions:
Python / C++ interoperability
Cons:
To be honest, I find that CycloneDDS falls into a somewhat undesirable gray zone between doing it ourselves and using ROS2. CycloneDDS is one of the middleware options for ROS2, so maybe if we go this route, we should just bite the bullet and use ROS2. It seems silly to me to define custom CycloneDDS messages instead of using many of the existing ROS2 message types.
ROS2
The “old” ROS was in some sense similar to a custom DDS. However, with ROS2 they chose not to implement the communication middleware themselves anymore, but instead rely on several different DDS options. So basically ROS2 has message types that are not specific to any DDS, and it converts these to the message types of the specific DDSs.
There are several reasons we have currently opted out of the ROS2 ecosystem. A `sudo apt upgrade` can break everyone's projects, and dependency management for ROS packages in general can also be difficult.

Problem 1 might be solvable, e.g. by running ROS2 in a Docker container. The caveat is that performance will likely not be great. We would probably need to configure the ROS2 DDS to use shared memory, and then also mount the host's shared memory into the Docker container, but that seems doable.
Correction: I’ve just realized that running ROS2 in a Docker container does not solve our problem, as it would require moving our airo-mono Python scripts into the container as well. A better solution might be to explore RoboStack, which is a young project that allows installing ROS into conda environments.
Problem 2 is mostly a "dev problem". If problem 1 can be solved, airo-mono users don't even need to be aware that ROS2 is being used. For example, an airo-mono user could create a `Zed2i(multiprocess=True)`, which could behind the scenes start a Docker container and run a publisher/receiver that uses the zed_ros_wrapper. Additionally, this could be completely opt-in, e.g. we could raise a RuntimeError if a user enables multiprocess without having Docker installed.
Pros:
Cons:
Conclusion and Action Points
In conclusion, I believe long-term the best solution would be to revisit ROS2, especially if we get it working within conda through RoboStack (paper). However, for the time being, our `multiprocessing`-based code works well for me and allows me to record videos of my data collection, which is my primary use case for wanting multiprocessing.
Action Points: