Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce ros2 bag reindex #539

Closed
wants to merge 21 commits into from
Closed

Conversation

jhdcs
Copy link
Contributor

@jhdcs jhdcs commented Oct 15, 2020

This pull request addresses issue #395.

A new verb, reindex has been added, which attempts to reconstruct a metadata.yaml file from the sqlite3 databases found within the bag.

Functionality allowing different forms of reindex for bag types has been added as well.

Readme has been updated to indicate new verb, as well as how to use it.

Unit tests for reindex have been added to rosbag2_tests.

@jhdcs jhdcs requested a review from a team as a code owner October 15, 2020 19:00
@jhdcs jhdcs requested review from emersonknapp and Karsten1987 and removed request for a team October 15, 2020 19:00
@jhdcs jhdcs force-pushed the bug/missing_metadata branch from f568738 to eaabdc0 Compare October 15, 2020 19:27
@Karsten1987 Karsten1987 changed the title Bug/missing metadata Introduce ros2 bag reindex Oct 19, 2020
Copy link
Collaborator

@Karsten1987 Karsten1987 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, I hope you don't mind that I changed your title. That way it's a bit more obvious what this PR is doing.

Thanks for the PR, a couple of remarks though:

  • The reindexer has to be more generic. I found a couple of places where .db3 and sqlite3 is hardcoded. Does it make more sense to have that reindex functionality live within the sqlite3 plugin then?
  • As mentioned inline, while I appreciate that you follow the style, I don't think we have to introduce interfaces for the reindexer. I don't think we need that flexibility on that level.
  • Boost is currently kind of a show-stopper. I think we have to come up with a rcpputils solution which can iterate over files wo/ boost.

rosbag2_cpp/CMakeLists.txt Outdated Show resolved Hide resolved
rosbag2_cpp/include/rosbag2_cpp/reindexer.hpp Show resolved Hide resolved
storage_.reset();
}
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

metadata_ = rosbag2_storage::BagMetadata{};

// This reindexer will only work on SQLite files, so this can't change
metadata_.storage_identifier = "sqlite3";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that has to be more generic. What happens if I try to reindex a ros1 bagfile? Or some custom storage?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought sequential_reindexer was supposed to be the reindexer that worked on the sqlite files. Not sure why. I guess I'm derpy like that.

@jhdcs
Copy link
Contributor Author

jhdcs commented Oct 20, 2020

Made a few comments on your suggestions, thank you for those! And no problem on the name change - I'm not spectacular with titles :)

I did have one question that's been nagging me, though: Since it's theoretically possible to extract all the data that a metadata.yaml file provides, why is the file needed? Is there another format that can't store all the information needed to read it within itself?

@jhdcs jhdcs marked this pull request as draft October 20, 2020 14:47
@Karsten1987
Copy link
Collaborator

Karsten1987 commented Oct 20, 2020

I did have one question that's been nagging me, though: Since it's theoretically possible to extract all the data that a metadata.yaml file provides, why is the file needed?

The idea here is that the actual data payload can be huge compared to the metadata/index and distributed over multiple files. With a dedicated metadata file, we have all that data at hand without having to iterate over the payload.
For example, imaging having multiple GBs of data stored on a network drive. In order to get the metadata, you'll only have to download a tiny yaml file without ever having to open a huge file over your network.

@jhdcs
Copy link
Contributor Author

jhdcs commented Oct 20, 2020

Ok. Well, the good news is I've gotten the Linux/Unix reindexer working again. The bad news is I don't really have a way of testing the Windows version right now, since ROS 2 refuses to build itself on my Windows VM...

Anyhoo, now trying to figure out the best way to transplant the reindexer into the sqlite3 plugin. If I'm reading this correctly, the sqlite3 plugin is a StorageFactoryImpl, correct? I thought that was a bit lower-level than what the reindexer does... Is there another spot that I'm missing?

Otherwise, I'll try to wire up a new reindex function in the StorageFactoryImpl, and propagate the changes along. Fingers crossed...

@Karsten1987
Copy link
Collaborator

Karsten1987 commented Oct 20, 2020

if you want to make your life a bit easier, I guess you could directly plant that reindex logic in a python script within ros2bag. There's a python API for sqlite3 and all it needed would be a bit of logic to fail gracefully when that script is not available for other storage plugins than sqlite3.

The reason why I could see that as a feasible solution is that I don't think we need proper API for such a logic - I can't see a use-case other than just running a script to repair a broken record.

@jhdcs
Copy link
Contributor Author

jhdcs commented Oct 20, 2020

That... Actually might fix a LOT of problems - We wouldn't need to worry about two code paths for finding out the files within a folder in either Windows or Linux, for example. And I've already updated the get_metadata function to return QOS profiles, so as long as I can call it from Python, we're gold.

On a slightly different subject; as you've probably noticed, I realized I was using the wrong license blurb at the top of the reindex files I've added. However, the copyright lint doesn't seem to like what my employers wanted. Is there something I can do to fix this?

EDIT: Just realized you meant I can query SQLite from python directly. Even better! Should I undo the changes to the get_metadata function, then?

@Karsten1987
Copy link
Collaborator

On a slightly different subject; as you've probably noticed, I realized I was using the wrong license blurb at the top of the reindex files I've added. However, the copyright lint doesn't seem to like what my employers wanted. Is there something I can do to fix this?

I would recommend you'd use the regular license text and change the copyright to reflect your company name. That should pass the linters - as you can see Open Robotics, Amazon and Bosch have contributions as such. For all other legal hints, I would put them in a separate doc block below the copyright header. If that's okay with your employer.

@Karsten1987
Copy link
Collaborator

@jhdcs you might want to have a look at ros2/rcutils#249. It's still in a draft but might be the PR you're looking for the iterate through a directory.

@jhdcs
Copy link
Contributor Author

jhdcs commented Oct 30, 2020

Got the python version probably working, albeit unlinted. I need to write unit tests for it, which means I need to find some test databases. Should I just copy them over from the other test packages?

Also, I'm not sure how to get this to work with compression yet...

@jhdcs jhdcs marked this pull request as ready for review November 4, 2020 16:26
@jhdcs
Copy link
Contributor Author

jhdcs commented Nov 4, 2020

I've gotten the reindex operation working for uncompressed and file-compressed SQLite bags, now. Should I squash the commits? There appears to be a commit somewhere in the rebase that's confusing DCO.

@jhdcs jhdcs force-pushed the bug/missing_metadata branch from d7cb035 to 99509f8 Compare November 4, 2020 20:47
@jhdcs
Copy link
Contributor Author

jhdcs commented Nov 4, 2020

Figured out why the test is failing - in order to support the zstd compression scheme, a python package had to be installed via pip. The particular package is called "zstandard".

If we don't want to add new python packages to the workspace, I can comment out the areas of the code that reference it. However, reindexing operations will not be able to support zstd compressed bags.

@jhdcs
Copy link
Contributor Author

jhdcs commented Nov 5, 2020

Reindexing for message-compressed files has been implemented. This code is ready for review.

However, we will need to figure out how we want to handle the inclusion of the zstandard package.

@jhdcs jhdcs requested a review from Karsten1987 November 18, 2020 19:24
@jhdcs
Copy link
Contributor Author

jhdcs commented Nov 20, 2020

May I please have another review on this? I'm not sure why the Rpr__rosbag2__ubuntu_focal_amd64 check is failing.

@jhdcs
Copy link
Contributor Author

jhdcs commented Dec 1, 2020

Okay, got some time to look into it. The Rpr__rosbag2__ubuntu_focal_amd64 check is failing because the zstandard python package is not included. Which means that the file can't be opened/recompressed, and therefore the test can't find the new metadata.yaml file, and the whole thing falls down.

Do you have any recommendations for solving this? Without that zstandard package, we're locked out of working with compressed files...

Copy link
Collaborator

@emersonknapp emersonknapp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there - sorry I have not been involved on this PR to date!

I don't think we want to install the python-based zstandard package - because it may not match up with the version being used by the C++ code. Another reason is that we may support other compression implementations, and we don't want equivalent python bindings for every one of them. The real solution around this is probably to add an endpoint to the rosbag2_py API to decompress bagfiles, having it choose the correct compression implementation under the hood to do so.

I think as a first step with this feature it would be acceptable to support reindexing uncompressed bags and print out an error message when trying to work on a compressed bag - we could have a followup ticket for Enable reindex on compressed bagfiles

If I am reading correctly, this doesn't need to actually worry about compressed messages, right? Since the type data is available in the sqlite table uncompressed.

rosidl_typesupport_cpp
rosidl_typesupport_introspection_cpp
PUBLIC
ament_index_cpp
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: don't add extra indent, removes unrelated whitespace changes

)


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove unrelated whitespace change

@@ -10,7 +10,10 @@

<buildtool_depend>ament_cmake</buildtool_depend>

<build_depend>libboost-filesystem-dev</build_depend>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this may be left from an old version of master? You'll probably want to revert all the changes from rosbag2_cpp, rosbag2_storage_default_plugins, rosbag2_tests, and rosbag2_transport - they don't seem related to your PR.

Additionally, I've noticed you checked in a __pycache__ directory in rosbag2_py - that should be removed as well.

parser.add_argument(
'-s', '--storage-id', default='sqlite3',
help="storage identifier to be used, defaults to 'sqlite3'")
parser.add_argument(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both compression-format and compression-mode choices shouldn't be duplicated. Perhaps they should be factored out of record.py and this file into a common place - or you could at least pull the values out of record.py's add_argument calls into a global level variable for that module, and from ros2bag.verb.record import COMPRESSION_MODES - something like that. The duplication will inevitable get desynchronized in the future

compression_fmt: Literal['', 'zstd'],
compression_mode: Literal['', 'none', 'file', 'message']) -> DBMetadata:
# Handle compression
if compression_fmt != '':
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if compression_fmt != '':
if compression_fmt:

handles both None and ''. I would recommend using None rather than empty-string for optional values.

Also - this block would be much more readable as

if compression_fmt in ['none', 'message']:
  # open normally
elif compression_fmt == 'file':
  # open compressed
else:
  raise ValueError('unknown compression mode')

The "by process of elimintation !=" pattern is a lot less readable, and includes some duplicated lines.


def reindex(uri: str,
storage_id: str,
compression_fmt: Literal['', 'zstd'],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is really a Literal value - it'll require shotgun surgery if a new compression format is added. You should probably take an Optional[str] (for both fmt and mode) - the underlying logic can decide whether or not it can handle this value - overkill and brittle to handle it in the type system.

return super(RosbagYamlDumper, self).increase_indent(flow, False)


class MetadataWriter:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering we already have a MetadataWriter in C++, this duplicated code will be prone to rot if the metadata format changes at all - this should be a call into rosbag2_storage MetadataIO for the logic in this file

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I've seen ROS code in the past calling C++ libraries from within Python code. Is there precedent for this, or is this something that will change going forward?

@jhdcs
Copy link
Contributor Author

jhdcs commented Dec 8, 2020

If you are concerned about duplicating functions that are within C++, should I just go back to the C++ version of the code I had originally pushed? Perhaps with a little more logic on the Python side to grab the relative file paths, and thus avoid having to use Boost?

Or should I just try to open up the C++ API to accept calls to both get the metadata, and write it out?

@emersonknapp
Copy link
Collaborator

I'm concerned with code duplication in general. I think when Karsten suggested a Python script, I don't think he intended to suggest that we rewrite existing logic in Python - just that a Python top level script would be easier to manage than fully-C++ logic.

We definitely do not want to introduce a boost dependency, we intentionally removed it from the ros2 core and rosbag2 in most places. For path manipulation, the C++ stdlib handles a lot of that now, and there are a good number of shims for any missing functions in https://github.com/ros2/rcpputils

I personally think that expanding the rosbag2_py API to expose metadata operations would be a desirable feature in and of itself, and could be a standalone PR that would enable this feature

Copy link

@danthony06 danthony06 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that overall the changes look good, and it would be nice to get this functionality merged into future releases.

@jhdcs
Copy link
Contributor Author

jhdcs commented Jan 4, 2021

I'm concerned with code duplication in general. I think when Karsten suggested a Python script, I don't think he intended to suggest that we rewrite existing logic in Python - just that a Python top level script would be easier to manage than fully-C++ logic.

We definitely do not want to introduce a boost dependency, we intentionally removed it from the ros2 core and rosbag2 in most places. For path manipulation, the C++ stdlib handles a lot of that now, and there are a good number of shims for any missing functions in https://github.com/ros2/rcpputils

I personally think that expanding the rosbag2_py API to expose metadata operations would be a desirable feature in and of itself, and could be a standalone PR that would enable this feature

I could revert all of the python changes and go back to the C++ version. However, that still leaves us with the problem that I can't query what files are in a directory. rccputils doesn't have that functionality in its filesystem_helper.hpp file, and std::filesystem doesn't exist in C++14. It does in C++17, but I'm not sure if or when ros2 will be switching to that version.

The only other possibility I can think of that doesn't add a new dependency is to figure out how to have Python send a list of strings to C++, and have it converted into a std::vector. That way Python can query what files are in the directory, and give it to C++ for further processing. But I have no idea how to do that...

@emersonknapp
Copy link
Collaborator

Yes - I am currently in the process of refactoring the compression code pretty seriously. Like I said in an earlier comment on this review, though, I think it's perfectly acceptable to partially implement this feature (not supporting compressed bags for a first pass).

As a note - I am finding this PR difficult to review due to the sheer size of it - and would prefer smaller PRs that implement individual atomic functionalities.

For example - if you are introducing a SequentialReindexer class that can do the reindexing functionality - then a PR introducing just that class, with tests and a usable C++ API would be a great PR that I'll be much more likely to be able to review effectively. I'm looking for PRs that are self-contained, demonstrably correct, and do one usable thing well, even if it doesn't do everything you would want it to do. That's something that we could really make meaningful progress on.

@jhdcs jhdcs force-pushed the bug/missing_metadata branch from 25a6814 to d499aeb Compare January 15, 2021 12:27
@ros-discourse
Copy link

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/tooling-wg-breakout-meeting-2021-01-22-rosbag2-reindexing/18489/3

Jacob Hassold added 18 commits January 18, 2021 08:54
Signed-off-by: Jacob Hassold <[email protected]>
Distro A, OPSEC #4584

Signed-off-by: Jacob Hassold <[email protected]>
Distro A, OPSEC #4584

Signed-off-by: Jacob Hassold <[email protected]>
Distro A, OPSEC #4584

Signed-off-by: Jacob Hassold <[email protected]>
Distro A, OPSEC #4584

Signed-off-by: Jacob Hassold <[email protected]>
Distro A, OPSEC #4584

Signed-off-by: Jacob Hassold <[email protected]>
Distro A, OPSEC #4584

Signed-off-by: Jacob Hassold <[email protected]>
Distro A, OPSEC #4584

Signed-off-by: Jacob Hassold <[email protected]>
Distro A, OPSEC #4584

Signed-off-by: Jacob Hassold <[email protected]>
Distro A, OPSEC #4584

Signed-off-by: Jacob Hassold <[email protected]>
Distro A, OPSEC #4584

Signed-off-by: Jacob Hassold <[email protected]>
Distro A, OPSEC #4584

Signed-off-by: Jacob Hassold <[email protected]>
Distro A, OPSEC #4584

Signed-off-by: Jacob Hassold <[email protected]>
Distro A, OPSEC #4584

Signed-off-by: Jacob Hassold <[email protected]>
Distro A, OPSEC #4584

Signed-off-by: Jacob Hassold <[email protected]>
Distro A, OPSEC #4584

Signed-off-by: Jacob Hassold <[email protected]>
Distro A, OPSEC #4584

Signed-off-by: Jacob Hassold <[email protected]>
Distro A, OPSEC #4584

Signed-off-by: Jacob Hassold <[email protected]>
@jhdcs jhdcs force-pushed the bug/missing_metadata branch from 5e5798a to a2d6321 Compare January 18, 2021 13:55
Jacob Hassold added 3 commits January 18, 2021 09:11
Distro A, OPSEC #4584

Signed-off-by: Jacob Hassold <[email protected]>
Distro A, OPSEC #4584

Signed-off-by: Jacob Hassold <[email protected]>
Distro A, OPSEC #4584

Signed-off-by: Jacob Hassold <[email protected]>
Copy link
Collaborator

@Karsten1987 Karsten1987 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First off, thanks a lot for keeping this PR alive, I know it's not trivial. A few things here:

  • I would try to keep the change to a minimal set. That is, if possible, I would prefer if we'd started without compression and end-to-end integration to fix the internal architecture of reindexing. Similarly, I was wondering what's the purpose of introducing an interface for the indexer. Don't get me wrong here, there might be a legit reason for it, I just want to rule out that we don't introduce unnecessary complexity. That is, I believe a reindexing of the database will always yield the same result, I don't really see what's a sequential reindexing doing differently from a different method of reindexing.
  • We've currently introduced a set of functions within rcutils (linked inline) which replaces much of the custom logic to iterate through files in a directory. I strongly advertise this.
  • I feel there's a lot of code around file handling, rosbag versioning and such which could be reduced or reorganized to be reused across the rosbag2 functionality. I believe there's code within the sequential_indexer wich can be reused within the sequential reader and writer.
  • I don't really like that rosbag2_transport is involved in this. I know it's currently our only entry-point from ros2bag to rosbag2_cpp, but would it make sense to rather go through rosbag2_py instead? I don't think it's a hard requirement here for me, but I think we should generally rethink the architecture on how to deal with python and cpp, now that we have rosbag2_py

Comment on lines +144 to +200
#ifdef WIN32
{
// Code placed in scope so variables can't accidentally be used elsewhere
WIN32_FIND_DATA FindFileData;
HANDLE hFind;

std::string search_dir = base_folder.string() + "\\*";
hFind = FindFirstFile(search_dir, &FindFileData);
// If an error occurs, we want to abort
if (hFind == INVALID_HANDLE_VALUE) {
DWORD dwError = GetLastError();
std::error_code ec(dwError, std::system_category());
throw(std::system_error(ec));
}

// Loop through the directory, collecting file names as we go
do {
if (ffd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
// I guess it's a directory in the bag file?
// Still, not interested in it.
continue;
} else {
auto temp_path = rcpputils::fs::path(ffd.cFileName);

// We are ONLY interested in database files
if (temp_path.extension().string() != ".db3") {
continue;
}
output.emplace_back(temp_path);
}
} while (FindNextFile(hFind, &ffd) != 0);

FindClose(hFind);
}
#else
{
// Code placed in scope so variables can't accidentally be used elsewhere
auto dirp = opendir(base_folder.string().c_str());
// If an error occurs, we want to abort
if (dirp == NULL) {
throw std::system_error(errno, std::generic_category());
}
dirent * dp;
while ((dp = readdir(dirp)) != NULL) {
auto non_const_folder = rcpputils::fs::path(base_folder);
auto temp_path = non_const_folder /= rcpputils::fs::path(dp->d_name);

// We are ONLY interested in database files
if (temp_path.extension().string() != ".db3") {
continue;
}

output.emplace_back(temp_path);
}
closedir(dirp);
}
#endif
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've recently added code which supersedes this: ros2/rcutils#323

// We're on a UNIX system. Import their filesystem stuff instead
#include <dirent.h>
#endif

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

{
reindexer_impl->reindex(storage_options);
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

@@ -11,6 +11,7 @@
<buildtool_depend>ament_cmake</buildtool_depend>

<depend>ament_index_cpp</depend>
<depend>libboost-filesystem</depend>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<depend>libboost-filesystem</depend>

const std::vector<rcpputils::fs::path> & files,
const rosbag2_storage::StorageOptions & storage_options);

// Compairson function for std::sort with our filepath convention
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Compairson function for std::sort with our filepath convention
// Comparison function for std::sort with our filepath convention

std::make_unique<rosbag2_storage::MetadataIo>());

virtual ~SequentialReindexer();

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

{
namespace reindexers
{
namespace details
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like the following two functions should be accessible independently from the reindexer. Aren't we having similar code in the reader already? I think we're introducing code duplication here.
Especially, given that version differentiation, I believe it's important to keep code as such to a minimum.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a spot that you think this would fit better in?

Also, looking back on the code, not sure why I have the version parameter there. Huh. That should probably come out.

@@ -430,19 +430,19 @@ rosbag2_storage::BagMetadata SqliteStorage::get_metadata()

auto statement = database_->prepare_statement(
"SELECT name, type, serialization_format, COUNT(messages.id), MIN(messages.timestamp), "
"MAX(messages.timestamp) "
"MAX(messages.timestamp), offered_qos_profiles "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we needing this change here for the reindexing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because the original metadata file includes the QOS profiles of the topics, so I have to query them from the database. I did some searching around, and I don't think this change affects anything else in the project, though if you have an alternate way to query the QOS profiles from the database, that might be preferable.

@jhdcs jhdcs mentioned this pull request Feb 8, 2021
@jhdcs
Copy link
Contributor Author

jhdcs commented Feb 8, 2021

Closing in favor of #641

@jhdcs jhdcs closed this Feb 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants