Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow spaces in octal attributes #74358

Closed

Conversation

am11
Copy link
Member

@am11 am11 commented Aug 22, 2022

Contributes to #74316

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Aug 22, 2022
@ghost
Copy link

ghost commented Aug 22, 2022

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

Contributes to #74316

Author: am11
Assignees: -
Labels:

area-System.IO

Milestone: -

Copy link
Member

@carlossanlop carlossanlop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution, @am11. Left a few notes/questions.

@@ -208,6 +208,8 @@ internal static TarEntryType GetCorrectTypeFlagForFormat(TarEntryFormat format,
T value = T.Zero;
foreach (byte b in buffer)
{
if (b == (byte)' ') continue; // skip space
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few notes:

  • You could avoid the cast by using 0x20 (or 32 in decimal). The comment can stay since it clarifies it's a space.
  • We should add a note explaining why space can be allowed here (some exotic archives might contain entries where the octal value contains unexpected spaces).
  • In the tar archives that failed with this error, did you notice that the space was in the middle of the octal number? If not, maybe spaces are only fine to be detected in the edges of the span. I'm conflicted because this could mean complicating the method. What do you think?
  • Which octal field threw this exception? I think we should only limit allowing spaces to the fields that failed. I do not like the idea of allowing spaces in all octal fields. But will await for your confirmation.

Copy link
Member Author

@am11 am11 Aug 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(byte)' ' cast incurs no cost. In fact, it is the preferred style used extensively in this repo for readability (see e.g. two lines below in the same file).

In main branch, we are already ignoring trailing nulls and spaces in all octal fields. while other implementations ignore both the leading and trailing spaces. I have not found any implementation which is allowing spaces in the middle of the value, so I agree that we should do something about it.

Given:

null null <space> null null 44 null 61 null null
null null <space> a b c null null 44 null 61 null null

lib-archive octal parser parses 44 (octal 36) for both, while busybox tar parses first one as 44 and second one as 0. Note that it applies to all header attributes in both implementations.

I am leaning to follow lib-archive's behavior rather than busybox's, what do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carlossanlop, I went ahead and made the change.

what do you think?

From other comments, I assume your reply is you don't like libarchive's behavior, neither busybox's nor GNU tar's. Could you point me to the spec which suggests that we should only trim the space and null from the end of octal values, but not the beginning? i.e. why is this acceptable in the main branch

buffer = TrimEndingNullsAndSpaces(buffer);

but not when these bytes appear at the beginning?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think those tar implementations are way too flexible, they are basically ignoring what the spec describes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -20,19 +20,6 @@ public void MalformedArchive_TooSmall()
Assert.Throws<EndOfStreamException>(() => reader.GetNextEntry());
}

[Fact]
public void MalformedArchive_HeaderSize()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why were these tests removed?

Copy link
Member

@carlossanlop carlossanlop Aug 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but I am concerned that we now allow a whole header to contain invalid data. I do not think we want this level of flexiblity.

The test that is being removed would've thrown at some point due to having 512 consecutive 0x1 bytes. Let's analyze each one of the fields in the header:

From V7:

  • name: This can be assumed to be valid since we store it as string and don't validate the value until we extract.
  • mode, uid, guid, size, mtime, checksum: they all expect octal numbers. With your change, they would not throw if there are non-octal chars.
  • linkname: Same as name.

From Ustar and Gnu:

  • typeflag: We should be ok with any character here, even if it's not among the ones defined in our enum (this should throw right now, but it's a bug unrelated to this change, we should allow any ASCII character Verified, we allow values outside of the enum).
  • magic: this should throw. We only have three possible values: either it's all nulls, or it's the Ustar magic, or it's the Gnu magic.
  • version: Similar to magic.
  • uname/gname: would not throw, we store this as a string.
  • devmajor/devminor: Octal, this would pass with your change.
  • prefix: Same as name and linkname.

Then since the size had some data, but we didn't find any characters, we would set size as 0 with your change. So no data.

if (ds != null && ds.Length > 0)
{
using var memoryStream = new MemoryStream();
ds.CopyTo(memoryStream);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing is being done with the memory stream. What is this for?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't look into the implementation of the stream that Tar returns; I read it into a dummy stream to ensure it could successfully read to the end. It seems a reasonable small check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah alright, that's fine. Although we already have tests that verify both seekable and unseekable data streams can be read.

Comment on lines 146 to 147
if (entry.EntryType == TarEntryType.Directory)
continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This directory entry check is unnecessary if the entry.DataStream is already being checked for null a few lines below.

Comment on lines 128 to 131
[Theory]
[InlineData("node-tar")]
[InlineData("tar-rs")]
public void Read_TestArchives_Successfully(string subset)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be improved so that we can investigate failures much more easily.

Instead of iterating through the files inside the test method, you can pass the file names as a MemberData. Plus they can be reused.

        public static IEnumerable<object[]> GetNodeTarFiles()
        {
            string nodeTarPath = Path.Join(Directory.GetCurrentDirectory(), "tar", "note-tar");
            foreach (string file in Directory.EnumerateFiles(nodeTarPath, "*.tar", SearchOption.AllDirectories))
            {
                yield return new object[] { file };
            }
        }

        public static IEnumerable<object[]> GetTarRsFiles()
        {
            string nodeTarPath = Path.Join(Directory.GetCurrentDirectory(), "tar", "tar-rs");
            foreach (string file in Directory.EnumerateFiles(nodeTarPath, "*.tar", SearchOption.AllDirectories))
            {
                yield return new object[] { file };
            }
        }

     [Theory]
        [MemberData(nameof("GetNodeTarFiles"))]
        [MemberData(nameof("GetTarRsFiles"))]
        public void Read_TestArchives_Successfully(string subset)

Then you can remove the forloop inside the test method.

Comment on lines -212 to +216
if (digit >= 8)
{
ThrowInvalidNumber();
}
if (b < (byte)'0' || b > (byte)'7') break;

value = checked((value * octalFactor) + T.CreateTruncating(digit));
value = checked((value * octalFactor) + T.CreateTruncating((uint)(b - '0')));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which assets have octal fields with non-octal characters on either end? I'm concerned that we are basically allowing garbage and not respecting the tar spec anymore.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea let's not anywhere intentionally be more lenient than we discover existing tools are.

Copy link
Member Author

@am11 am11 Aug 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danmoseley, it is exactly as lenient as libarchive's octal parsing, it allows any non-octal character at the beginning and end.

GNU tar and busybox implementations allow both leading and trailing whitespaces and nulls.

.NET's implementation is the strictest one, we are only ignoring null and spaces when they appear at the end, and we do not allow them at the beginning.

I haven't find formal specification calling out for and against these behaviors, so it is up to us to decide which existing implementation to follow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But ignoring nulls and spaces is not the same as ignoring any character that is either smaller than '0' or larger than '7'.

Copy link
Member Author

@am11 am11 Aug 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

libarchive ignores any non-octal character. :)
https://github.com/libarchive/libarchive/blob/9e5279615033fa073d0bbae83b366359eab3b62f/contrib/untar.c#L45

if this is too lenient, I can confine it to only allow nulls and spaces.

ex = Record.Exception(() => entry.Length);
Assert.Null(ex);
Assert.NotNull(entry.Name);
Assert.True(Enum.IsDefined(entry.EntryType));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would fail if an entry of one of the node-tar and tar-rs asserts has an entry type that is not in our TarEntryType enum. I checked, and we pass the value directly without validating it is among the predefined enums:

typeFlag: (TarEntryType)buffer[FieldLocations.TypeFlag])

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those files have known entry types, so this assert is valid for the current set of assets. If in the future, we ever updated assets in those directories from upstream into dotnet/runtime-assets, and those new files have some unknown entry types (very unlikely), we can adjust the code accordingly.

@am11 am11 force-pushed the feature/system.formats.tar/hardlinks-support branch from 842dc3f to 96a88cc Compare August 23, 2022 00:59
@carlossanlop
Copy link
Member

Of those assets which fail when their octal fields have unexpected characters, do we know what tool was used to create them?

@carlossanlop
Copy link
Member

Closing this in favor of #74412 .

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Formats.Tar community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants