Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][CI] a crossbow job with MinRelSize enabled #31132

Closed
asfimport opened this issue Feb 14, 2022 · 36 comments
Closed

[C++][CI] a crossbow job with MinRelSize enabled #31132

asfimport opened this issue Feb 14, 2022 · 36 comments

Comments

@asfimport
Copy link
Collaborator

asfimport commented Feb 14, 2022

Reporter: Jonathan Keane / @jonkeane
Assignee: Kouhei Sutou / @kou

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-15678. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Jonathan Keane / @jonkeane:
The pull request linked has the starts of this — but there's still an unidentified segfault in one of the tests

@asfimport
Copy link
Collaborator Author

Weston Pace / @westonpace:
Ok, so I was finally able to track this down. Fortunately (unfortunately?) it is not really a compiler bug (or maybe it is, I'm not sure). At the very least, I think we can avoid it.

level_comparison.cc is compiled with -msse4.2.

level_comparison_avx2.cc is compiled with -mavx2

This is expected and the functions they generate are housed in separate namespaces so they don't get confused. However, both functions rely on the function arrow::internal::FirstTimeBitmapWriter::AppendWord. The function is not templated but it is defined in the header file (and is not marked inline). I'm not really sure how we aren't getting a duplicate symbol error but some reading suggests it is implicitly inlined at link time.

In the object file (libparquet.a), there are two identical symbols named __ZN5arrow8internal21FirstTimeBitmapWriter10AppendWordEyx. One of them has SHLX and one of them has SHL. This disassembly of the SHLX version matches exactly the disassembly in the stack trace that @jonkeane posted in the PR. The two calling functions are (parquet::internal::standard::DefLevelsBatchToBitmap and parquet::internal::bmi2::DefLevelsBatchToBitmap.

So I think, the -O3 version is inlining the functions. The -Os version is not (-Os seems to discourage inlining in general). The linker is then faced with two identical symbols and just picks one (again, trying to optimize for size). It just so happens the version it picked was the one with SHLX.

So, as a test, we can try splitting the implementation part of bitmap_writer.h into bitmap_writer.cc (at least for FirstTimeBitmapWriter). The .cc file should then only be compiled once (with sse4.2). However, it's very possible we are just hitting the tip of the iceberg here, as any header file linked in by these avx2 compiled versions could be a ticking time bomb.

@asfimport
Copy link
Collaborator Author

David Li / @lidavidm:
Good catch. This is exactly the same problem we ran into before with kernels:

// SumArray must be parameterized with the SIMD level since it's called both from
// translation units with and without vectorization. Normally it gets inlined but
// if not, without the parameter, we'll have multiple definitions of the same
// symbol and we'll get unexpected results.
(ARROW-13382). I wonder if we should reconsider the plan of vectorizing kernels by rebuilding the same source multiple times given this potential pitfall.

@asfimport
Copy link
Collaborator Author

Weston Pace / @westonpace:
We can perhaps add a static check that reports symbols outside the appropriate namespace. We might need some configurable suppression. For example, level_conversion_bmi2.cc.o would report:


0000000000000000 W arrow::util::ArrowLogBase& arrow::util::ArrowLogBase::operator<< <char [51]>(char const (&) [51])
0000000000000000 W std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > arrow::util::StringBuilder<char const (&) [33]>(char const (&) [33])
0000000000000000 W void arrow::util::StringBuilderRecursive<char const (&) [33]>(std::ostream&, char const (&) [33])
0000000000000000 W arrow::util::detail::StringStreamWrapper::stream()
0000000000000000 W arrow::util::Voidify::operator&(arrow::util::ArrowLogBase&)
0000000000000000 W arrow::util::Voidify::Voidify()
0000000000000000 W arrow::bit_util::BytesForBits(long)
0000000000000000 W arrow::internal::FirstTimeBitmapWriter::AppendWord(unsigned long, long)
0000000000000000 W arrow::internal::FirstTimeBitmapWriter::Finish()
0000000000000000 W arrow::internal::FirstTimeBitmapWriter::FirstTimeBitmapWriter(unsigned char*, long, long)
0000000000000000 W parquet::ParquetException::ParquetException<char const (&) [33]>(char const (&) [33])
0000000000000000 W parquet::ParquetException::~ParquetException()
0000000000000000 W parquet::ParquetException::~ParquetException()
0000000000000000 W arrow::internal::FirstTimeBitmapWriter::position() const
0000000000000000 W parquet::ParquetException::what() const
0000000000000000 W std::exception::exception()
0000000000000000 W char const (&std::forward<char const (&) [33]>(std::remove_reference<char const (&) [33]>::type&)) [33]

level_comparison_avx.cc.o looks to be in better shape:


0000000000000000 W short const& std::max<short>(short const&, short const&)
0000000000000000 W short const& std::min<short>(short const&, short const&)

But yes, if we have a better solution for this problem it might be safer.

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
Wow, thanks for the diagnosis @westonpace.
So, it turns out that our method for compiling multiple versions of code is violating the one-definition-rule for any inline function or method used in the caller code.

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
@bkietz You may have some idea about how to fix this cleanly and reliably.

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
So, currently we are doing something such as:

clang -c something_avx2.cc -mavx2 

An alternative would be not to pass the optimization flag on the command line but enable it selectively inside the source code, e.g.:

clang -c something_avx2.cc -DARROW_SPECIALIZED_SIMD_TARGET=avx2
namespace parquet {
namespace internal {
namespace PARQUET_IMPL_NAMESPACE {

#ifdef ARROW_SPECIALIZED_SIMD_TARGET

#define STRINGIFY_EXPANDED(a) ARROW_STRINGIFY(a)
#pragma clang attribute push (__attribute__((target( STRINGIFY_EXPANDED(ARROW_SPECIALIZED_SIMD_TARGET)) )), apply_to=function)

#endif

...

#ifdef ARROW_SPECIALIZED_SIMD_TARGET
#pragma clang attribute pop
#endif

}  // namespace PARQUET_IMPL_NAMESPACE
}  // namespace internal
}  // namespace parquet

This way we would avoid enabling the particular instruction set on code inlined from other headers. Of course, perhaps that's not actually desirable...

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
In any case, this is probably too involved a change for 8.0.0, so the 8.0.0 fix would simply to disable SIMD optimizations for Homebrew?

@asfimport
Copy link
Collaborator Author

Weston Pace / @westonpace:
That seems like an good solution to me. I had no idea it was possible.

If we want the other headers to be included then we already have a bit of a solution demonstrated in level_conversion_inc.h. In the common header file you require some kind of target namespace to be defined.


namespace parquet {
namespace internal {
#ifndef PARQUET_IMPL_NAMESPACE
#error "PARQUET_IMPL_NAMESPACE must be defined"
#endif
namespace PARQUET_IMPL_NAMESPACE {
...
}  // namespace PARQUET_IMPL_NAMESPACE
}  // namespace internal
}  // namespace parquet

However, anything that includes one of these "common headers" must define that namespace...


#define PARQUET_IMPL_NAMESPACE standard
#include "parquet/level_conversion_inc.h"
#undef PARQUET_IMPL_NAMESPACE

@asfimport
Copy link
Collaborator Author

Jonathan Keane / @jonkeane:
@kou Do you think you might be able to take a look at this?

The comment at #12928 (comment) has a good explanation of what's going on and following that there are a few possible fixes (though none of them were fully implemented or decided

@asfimport
Copy link
Collaborator Author

Kouhei Sutou / @kou:
How about using template to distinct implementation for each architecture?


diff --git a/cpp/src/arrow/compute/kernels/codegen_internal.h b/cpp/src/arrow/compute/kernels/codegen_internal.h
index fa50427bc3..a4bd0eb586 100644
--- a/cpp/src/arrow/compute/kernels/codegen_internal.h
+++ b/cpp/src/arrow/compute/kernels/codegen_internal.h
@@ -710,8 +710,8 @@ struct ScalarUnaryNotNullStateful {
                        Datum* out) {
       Status st = Status::OK();
       ArrayData* out_arr = out->mutable_array();
-      FirstTimeBitmapWriter out_writer(out_arr->buffers[1]->mutable_data(),
-                                       out_arr->offset, out_arr->length);
+      FirstTimeBitmapWriter<> out_writer(out_arr->buffers[1]->mutable_data(),
+                                         out_arr->offset, out_arr->length);
       VisitArrayValuesInline<Arg0Type>(
           arg0,
           [&](Arg0Value v) {
diff --git a/cpp/src/arrow/compute/kernels/row_encoder.cc b/cpp/src/arrow/compute/kernels/row_encoder.cc
index 10a1f4cda5..26316ec315 100644
--- a/cpp/src/arrow/compute/kernels/row_encoder.cc
+++ b/cpp/src/arrow/compute/kernels/row_encoder.cc
@@ -42,7 +42,7 @@ Status KeyEncoder::DecodeNulls(MemoryPool* pool, int32_t length, uint8_t** encod
     ARROW_ASSIGN_OR_RAISE(*null_bitmap, AllocateBitmap(length, pool));
     uint8_t* validity = (*null_bitmap)->mutable_data();
 
-    FirstTimeBitmapWriter writer(validity, 0, length);
+    FirstTimeBitmapWriter<> writer(validity, 0, length);
     for (int32_t i = 0; i < length; ++i) {
       if (encoded_bytes[i][0] == kValidByte) {
         writer.Set();
diff --git a/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc b/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc
index 7d8d2edc4b..433df0f1b7 100644
--- a/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc
+++ b/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc
@@ -353,8 +353,8 @@ struct IsInVisitor {
     const auto& state = checked_cast<const SetLookupState<Type>&>(*ctx->state());
     ArrayData* output = out->mutable_array();
 
-    FirstTimeBitmapWriter writer(output->buffers[1]->mutable_data(), output->offset,
-                                 output->length);
+    FirstTimeBitmapWriter<> writer(output->buffers[1]->mutable_data(), output->offset,
+                                   output->length);
 
     VisitArrayDataInline<Type>(
         this->data,
diff --git a/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc b/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc
index 611601cab8..da7de1c277 100644
--- a/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc
+++ b/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc
@@ -1456,7 +1456,7 @@ struct MatchSubstringImpl {
         [&matcher](const void* raw_offsets, const uint8_t* data, int64_t length,
                    int64_t output_offset, uint8_t* output) {
           const offset_type* offsets = reinterpret_cast<const offset_type*>(raw_offsets);
-          FirstTimeBitmapWriter bitmap_writer(output, output_offset, length);
+          FirstTimeBitmapWriter<> bitmap_writer(output, output_offset, length);
           for (int64_t i = 0; i < length; ++i) {
             const char* current_data = reinterpret_cast<const char*>(data + offsets[i]);
             int64_t current_length = offsets[i + 1] - offsets[i];
diff --git a/cpp/src/arrow/util/bit_util_benchmark.cc b/cpp/src/arrow/util/bit_util_benchmark.cc
index 258fd27785..66a81b4e04 100644
--- a/cpp/src/arrow/util/bit_util_benchmark.cc
+++ b/cpp/src/arrow/util/bit_util_benchmark.cc
@@ -386,7 +386,7 @@ static void BitmapWriter(benchmark::State& state) {
 }
 
 static void FirstTimeBitmapWriter(benchmark::State& state) {
-  BenchmarkBitmapWriter<internal::FirstTimeBitmapWriter>(state, state.range(0));
+  BenchmarkBitmapWriter<internal::FirstTimeBitmapWriter<>>(state, state.range(0));
 }
 
 struct GenerateBitsFunctor {
diff --git a/cpp/src/arrow/util/bit_util_test.cc b/cpp/src/arrow/util/bit_util_test.cc
index 6c2aff4fbe..9b9f19feb1 100644
--- a/cpp/src/arrow/util/bit_util_test.cc
+++ b/cpp/src/arrow/util/bit_util_test.cc
@@ -832,14 +832,14 @@ TEST(FirstTimeBitmapWriter, NormalOperation) {
     const uint8_t fill_byte = static_cast<uint8_t>(fill_byte_int);
     {
       uint8_t bitmap[] = {fill_byte, fill_byte, fill_byte, fill_byte};
-      auto writer = internal::FirstTimeBitmapWriter(bitmap, 0, 12);
+      auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 0, 12);
       WriteVectorToWriter(writer, {0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1});
       //                      {0b00110110, 0b1010, 0, 0}
       ASSERT_BYTES_EQ(bitmap, {0x36, 0x0a});
     }
     {
       uint8_t bitmap[] = {fill_byte, fill_byte, fill_byte, fill_byte};
-      auto writer = internal::FirstTimeBitmapWriter(bitmap, 4, 12);
+      auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 4, 12);
       WriteVectorToWriter(writer, {0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1});
       //                      {0b00110110, 0b1010, 0, 0}
       ASSERT_BYTES_EQ(bitmap, {static_cast<uint8_t>(0x60 | (fill_byte & 0x0f)), 0xa3});
@@ -848,15 +848,15 @@ TEST(FirstTimeBitmapWriter, NormalOperation) {
     {
       uint8_t bitmap[] = {fill_byte, fill_byte, fill_byte, fill_byte};
       {
-        auto writer = internal::FirstTimeBitmapWriter(bitmap, 0, 6);
+        auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 0, 6);
         WriteVectorToWriter(writer, {0, 1, 1, 0, 1, 1});
       }
       {
-        auto writer = internal::FirstTimeBitmapWriter(bitmap, 6, 3);
+        auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 6, 3);
         WriteVectorToWriter(writer, {0, 0, 0});
       }
       {
-        auto writer = internal::FirstTimeBitmapWriter(bitmap, 9, 3);
+        auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 9, 3);
         WriteVectorToWriter(writer, {1, 0, 1});
       }
       ASSERT_BYTES_EQ(bitmap, {0x36, 0x0a});
@@ -864,23 +864,23 @@ TEST(FirstTimeBitmapWriter, NormalOperation) {
     {
       uint8_t bitmap[] = {fill_byte, fill_byte, fill_byte, fill_byte};
       {
-        auto writer = internal::FirstTimeBitmapWriter(bitmap, 4, 0);
+        auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 4, 0);
         WriteVectorToWriter(writer, {});
       }
       {
-        auto writer = internal::FirstTimeBitmapWriter(bitmap, 4, 6);
+        auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 4, 6);
         WriteVectorToWriter(writer, {0, 1, 1, 0, 1, 1});
       }
       {
-        auto writer = internal::FirstTimeBitmapWriter(bitmap, 10, 3);
+        auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 10, 3);
         WriteVectorToWriter(writer, {0, 0, 0});
       }
       {
-        auto writer = internal::FirstTimeBitmapWriter(bitmap, 13, 0);
+        auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 13, 0);
         WriteVectorToWriter(writer, {});
       }
       {
-        auto writer = internal::FirstTimeBitmapWriter(bitmap, 13, 3);
+        auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 13, 3);
         WriteVectorToWriter(writer, {1, 0, 1});
       }
       ASSERT_BYTES_EQ(bitmap, {static_cast<uint8_t>(0x60 | (fill_byte & 0x0f)), 0xa3});
@@ -900,8 +900,8 @@ TEST(FirstTimeBitmapWriter, AppendWordOffsetOverwritesCorrectBitsOnExistingByte)
   auto check_append = [](const std::string& expected_bits, int64_t offset) {
     std::vector<uint8_t> valid_bits = {0x00};
     constexpr int64_t kBitsAfterAppend = 8;
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), offset,
-                                           /*length=*/(8 * valid_bits.size()) - offset);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), offset,
+                                             /*length=*/(8 * valid_bits.size()) - offset);
     writer.AppendWord(/*word=*/0xFF, /*number_of_bits=*/kBitsAfterAppend - offset);
     writer.Finish();
     EXPECT_EQ(BitmapToString(valid_bits, kBitsAfterAppend), expected_bits);
@@ -918,8 +918,8 @@ TEST(FirstTimeBitmapWriter, AppendWordOffsetOverwritesCorrectBitsOnExistingByte)
   auto check_with_set = [](const std::string& expected_bits, int64_t offset) {
     std::vector<uint8_t> valid_bits = {0x1};
     constexpr int64_t kBitsAfterAppend = 8;
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), offset,
-                                           /*length=*/(8 * valid_bits.size()) - offset);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), offset,
+                                             /*length=*/(8 * valid_bits.size()) - offset);
     writer.AppendWord(/*word=*/0xFF, /*number_of_bits=*/kBitsAfterAppend - offset);
     writer.Finish();
     EXPECT_EQ(BitmapToString(valid_bits, kBitsAfterAppend), expected_bits);
@@ -936,8 +936,8 @@ TEST(FirstTimeBitmapWriter, AppendWordOffsetOverwritesCorrectBitsOnExistingByte)
   auto check_with_preceding = [](const std::string& expected_bits, int64_t offset) {
     std::vector<uint8_t> valid_bits = {0xFF};
     constexpr int64_t kBitsAfterAppend = 8;
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), offset,
-                                           /*length=*/(8 * valid_bits.size()) - offset);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), offset,
+                                             /*length=*/(8 * valid_bits.size()) - offset);
     writer.AppendWord(/*word=*/0xFF, /*number_of_bits=*/kBitsAfterAppend - offset);
     writer.Finish();
     EXPECT_EQ(BitmapToString(valid_bits, kBitsAfterAppend), expected_bits);
@@ -954,8 +954,8 @@ TEST(FirstTimeBitmapWriter, AppendWordOffsetOverwritesCorrectBitsOnExistingByte)
 
 TEST(FirstTimeBitmapWriter, AppendZeroBitsHasNoImpact) {
   std::vector<uint8_t> valid_bits(/*count=*/1, 0);
-  internal::FirstTimeBitmapWriter writer(valid_bits.data(), /*start_offset=*/1,
-                                         /*length=*/valid_bits.size() * 8);
+  internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), /*start_offset=*/1,
+                                           /*length=*/valid_bits.size() * 8);
   writer.AppendWord(/*word=*/0xFF, /*number_of_bits=*/0);
   writer.AppendWord(/*word=*/0xFF, /*number_of_bits=*/0);
   writer.AppendWord(/*word=*/0x01, /*number_of_bits=*/1);
@@ -966,8 +966,8 @@ TEST(FirstTimeBitmapWriter, AppendZeroBitsHasNoImpact) {
 TEST(FirstTimeBitmapWriter, AppendLessThanByte) {
   {
     std::vector<uint8_t> valid_bits(/*count*/ 8, 0);
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), /*start_offset=*/1,
-                                           /*length=*/8);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), /*start_offset=*/1,
+                                             /*length=*/8);
     writer.AppendWord(0xB, 4);
     writer.Finish();
     EXPECT_EQ(BitmapToString(valid_bits, /*bit_count=*/8), "01101000");
@@ -975,8 +975,8 @@ TEST(FirstTimeBitmapWriter, AppendLessThanByte) {
   {
     // Test with all bits initially set.
     std::vector<uint8_t> valid_bits(/*count*/ 8, 0xFF);
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), /*start_offset=*/1,
-                                           /*length=*/8);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), /*start_offset=*/1,
+                                             /*length=*/8);
     writer.AppendWord(0xB, 4);
     writer.Finish();
     EXPECT_EQ(BitmapToString(valid_bits, /*bit_count=*/8), "11101000");
@@ -986,8 +986,8 @@ TEST(FirstTimeBitmapWriter, AppendLessThanByte) {
 TEST(FirstTimeBitmapWriter, AppendByteThenMore) {
   {
     std::vector<uint8_t> valid_bits(/*count*/ 8, 0);
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), /*start_offset=*/0,
-                                           /*length=*/9);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), /*start_offset=*/0,
+                                             /*length=*/9);
     writer.AppendWord(0xC3, 8);
     writer.AppendWord(0x01, 1);
     writer.Finish();
@@ -995,8 +995,8 @@ TEST(FirstTimeBitmapWriter, AppendByteThenMore) {
   }
   {
     std::vector<uint8_t> valid_bits(/*count*/ 8, 0xFF);
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), /*start_offset=*/0,
-                                           /*length=*/9);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), /*start_offset=*/0,
+                                             /*length=*/9);
     writer.AppendWord(0xC3, 8);
     writer.AppendWord(0x01, 1);
     writer.Finish();
@@ -1012,8 +1012,8 @@ TEST(FirstTimeBitmapWriter, AppendWordShiftsBitsCorrectly) {
     ASSERT_GE(offset, 8);
     std::vector<uint8_t> valid_bits(/*count=*/10, preset_buffer_bits ? 0xFF : 0);
     valid_bits[0] = 0x99;
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), offset,
-                                           /*length=*/(9 * sizeof(kPattern)) - offset);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), offset,
+                                             /*length=*/(9 * sizeof(kPattern)) - offset);
     writer.AppendWord(/*word=*/kPattern, /*number_of_bits=*/64);
     writer.Finish();
     EXPECT_EQ(valid_bits[0], 0x99);  // shouldn't get changed.
@@ -1051,15 +1051,15 @@ TEST(TestAppendBitmap, AppendWordOnlyAppropriateBytesWritten) {
 
   uint64_t bitmap = 0x1FF;
   {
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), /*start_offset=*/1,
-                                           /*length=*/(8 * valid_bits.size()) - 1);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), /*start_offset=*/1,
+                                             /*length=*/(8 * valid_bits.size()) - 1);
     writer.AppendWord(bitmap, /*number_of_bits*/ 7);
     writer.Finish();
     EXPECT_THAT(valid_bits, ElementsAreArray(std::vector<uint8_t>{0xFE, 0x00}));
   }
   {
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), /*start_offset=*/1,
-                                           /*length=*/(8 * valid_bits.size()) - 1);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), /*start_offset=*/1,
+                                             /*length=*/(8 * valid_bits.size()) - 1);
     writer.AppendWord(bitmap, /*number_of_bits*/ 8);
     writer.Finish();
     EXPECT_THAT(valid_bits, ElementsAreArray(std::vector<uint8_t>{0xFE, 0x03}));
diff --git a/cpp/src/arrow/util/bitmap_writer.h b/cpp/src/arrow/util/bitmap_writer.h
index 65d0d188d7..7a70b16f15 100644
--- a/cpp/src/arrow/util/bitmap_writer.h
+++ b/cpp/src/arrow/util/bitmap_writer.h
@@ -21,6 +21,8 @@
 #include <cstring>
 
 #include "arrow/util/bit_util.h"
+#include "arrow/util/config.h"
+#include "arrow/util/dispatch.h"
 #include "arrow/util/endian.h"
 #include "arrow/util/macros.h"
 
@@ -78,6 +80,7 @@ class BitmapWriter {
   int64_t byte_offset_;
 };
 
+template <DispatchLevel level = ARROW_COMPILE_TIME_DISPATCH_LEVEL>
 class FirstTimeBitmapWriter {
   // Like BitmapWriter, but any bit values *following* the bits written
   // might be clobbered.  It is hence faster than BitmapWriter, and can
diff --git a/cpp/src/arrow/util/config.h.cmake b/cpp/src/arrow/util/config.h.cmake
index 55bc2d0100..1ecb4f39f3 100644
--- a/cpp/src/arrow/util/config.h.cmake
+++ b/cpp/src/arrow/util/config.h.cmake
@@ -36,6 +36,8 @@
 
 #define ARROW_PACKAGE_KIND "@ARROW_PACKAGE_KIND@"
 
+#define ARROW_COMPILE_TIME_DISPATCH_LEVEL ::arrow::internal::DispatchLevel::@ARROW_SIMD_LEVEL@
+
 #cmakedefine ARROW_COMPUTE
 #cmakedefine ARROW_CSV
 #cmakedefine ARROW_CUDA
diff --git a/cpp/src/arrow/util/dispatch.h b/cpp/src/arrow/util/dispatch.h
index fae9293f9e..d6f4dbb028 100644
--- a/cpp/src/arrow/util/dispatch.h
+++ b/cpp/src/arrow/util/dispatch.h
@@ -33,6 +33,7 @@ enum class DispatchLevel : int {
   SSE4_2,
   AVX2,
   AVX512,
+  BMI2,
   NEON,
   MAX
 };
@@ -105,6 +106,8 @@ class DynamicDispatch {
         return cpu_info->IsSupported(CpuInfo::AVX2);
       case DispatchLevel::AVX512:
         return cpu_info->IsSupported(CpuInfo::AVX512);
+      case DispatchLevel::BMI2:
+        return cpu_info->IsSupported(CpuInfo::BMI2);
       default:
         return false;
     }
diff --git a/cpp/src/parquet/encoding.cc b/cpp/src/parquet/encoding.cc
index af0e543c3e..34f0eef3b5 100644
--- a/cpp/src/parquet/encoding.cc
+++ b/cpp/src/parquet/encoding.cc
@@ -348,8 +348,8 @@ class PlainEncoder<BooleanType> : public EncoderImpl, virtual public BooleanEnco
     } else {
       auto n_valid = bit_util::BytesForBits(data.length() - data.null_count());
       PARQUET_THROW_NOT_OK(sink_.Reserve(n_valid));
-      ::arrow::internal::FirstTimeBitmapWriter writer(sink_.mutable_data(),
-                                                      sink_.length(), n_valid);
+      ::arrow::internal::FirstTimeBitmapWriter<> writer(sink_.mutable_data(),
+                                                        sink_.length(), n_valid);
 
       for (int64_t i = 0; i < data.length(); i++) {
         if (data.IsValid(i)) {
diff --git a/cpp/src/parquet/level_comparison_avx2.cc b/cpp/src/parquet/level_comparison_avx2.cc
index b33eb2e295..521cf96520 100644
--- a/cpp/src/parquet/level_comparison_avx2.cc
+++ b/cpp/src/parquet/level_comparison_avx2.cc
@@ -16,7 +16,9 @@
 // under the License.
 
 #define PARQUET_IMPL_NAMESPACE avx2
+#define PARQUET_DISPATCH_LEVEL ::arrow::internal::DispatchLevel::AVX2
 #include "parquet/level_comparison_inc.h"
+#undef PARQUET_DISPATCH_LEVEL
 #undef PARQUET_IMPL_NAMESPACE
 
 namespace parquet {
diff --git a/cpp/src/parquet/level_conversion.cc b/cpp/src/parquet/level_conversion.cc
index ffdca476dd..ab440af95a 100644
--- a/cpp/src/parquet/level_conversion.cc
+++ b/cpp/src/parquet/level_conversion.cc
@@ -28,7 +28,9 @@
 
 #include "parquet/level_comparison.h"
 #define PARQUET_IMPL_NAMESPACE standard
+#define PARQUET_DISPATCH_LEVEL ARROW_COMPILE_TIME_DISPATCH_LEVEL
 #include "parquet/level_conversion_inc.h"
+#undef PARQUET_DISPATCH_LEVEL
 #undef PARQUET_IMPL_NAMESPACE
 
 namespace parquet {
@@ -43,7 +45,7 @@ void DefRepLevelsToListInfo(const int16_t* def_levels, const int16_t* rep_levels
                             int64_t num_def_levels, LevelInfo level_info,
                             ValidityBitmapInputOutput* output, OffsetType* offsets) {
   OffsetType* orig_pos = offsets;
-  optional<::arrow::internal::FirstTimeBitmapWriter> valid_bits_writer;
+  optional<::arrow::internal::FirstTimeBitmapWriter<>> valid_bits_writer;
   if (output->valid_bits) {
     valid_bits_writer.emplace(output->valid_bits, output->valid_bits_offset,
                               output->values_read_upper_bound);
diff --git a/cpp/src/parquet/level_conversion_bmi2.cc b/cpp/src/parquet/level_conversion_bmi2.cc
index 274d54e503..679d01d0c9 100644
--- a/cpp/src/parquet/level_conversion_bmi2.cc
+++ b/cpp/src/parquet/level_conversion_bmi2.cc
@@ -17,7 +17,9 @@
 #include "parquet/level_conversion.h"
 
 #define PARQUET_IMPL_NAMESPACE bmi2
+#define PARQUET_DISPATCH_LEVEL ::arrow::internal::DispatchLevel::BMI2
 #include "parquet/level_conversion_inc.h"
+#undef PARQUET_DISPATCH_LEVEL
 #undef PARQUET_IMPL_NAMESPACE
 
 namespace parquet {
diff --git a/cpp/src/parquet/level_conversion_inc.h b/cpp/src/parquet/level_conversion_inc.h
index 710d2f6237..4b5a9def80 100644
--- a/cpp/src/parquet/level_conversion_inc.h
+++ b/cpp/src/parquet/level_conversion_inc.h
@@ -296,7 +296,10 @@ static constexpr int64_t kExtractBitsSize = 8 * sizeof(extract_bitmap_t);
 template <bool has_repeated_parent>
 int64_t DefLevelsBatchToBitmap(const int16_t* def_levels, const int64_t batch_size,
                                int64_t upper_bound_remaining, LevelInfo level_info,
-                               ::arrow::internal::FirstTimeBitmapWriter* writer) {
+#ifndef PARQUET_DISPATCH_LEVEL
+#error "PARQUET_DISPATCH_LEVEL must be defined"
+#endif
+                               ::arrow::internal::FirstTimeBitmapWriter<PARQUET_DISPATCH_LEVEL>* writer) {
   DCHECK_LE(batch_size, kExtractBitsSize);
 
   // Greater than level_info.def_level - 1 implies >= the def_level
@@ -330,7 +333,7 @@ int64_t DefLevelsBatchToBitmap(const int16_t* def_levels, const int64_t batch_si
 template <bool has_repeated_parent>
 void DefLevelsToBitmapSimd(const int16_t* def_levels, int64_t num_def_levels,
                            LevelInfo level_info, ValidityBitmapInputOutput* output) {
-  ::arrow::internal::FirstTimeBitmapWriter writer(
+  ::arrow::internal::FirstTimeBitmapWriter<PARQUET_DISPATCH_LEVEL> writer(
       output->valid_bits,
       /*start_offset=*/output->valid_bits_offset,
       /*length=*/output->values_read_upper_bound);

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
@kou We can do that for the specific symptoms here. However, a more general solution will have to be found since other files have the same problem: compiling SIMD-specific code which calls into other routines.

@asfimport
Copy link
Collaborator Author

Ben Kietzman / @bkietz:
@pitrou The most robust solution I can think of is to avoid linking between objects with differing instruction sets altogether. We'd have something like

$ nm libarrow_compute_avx2.so | grep DefLevelsBitmapSimd
0000000000404ff0 t DefLevelsBitmapSimd

That library would be acquired with dlopen(path, RTLD_LOCAL)/LoadLibrary(path) which would guarantee that any functions like FirstTimeBitmapWriter::* which might have been recompiled with illegal instructions are not available outside libarrow_compute_avx2.so.

@asfimport
Copy link
Collaborator Author

Ben Kietzman / @bkietz:
IIUC, we'll still need to pass -mavx2 so that we can include immintrin.h so the attribute described in the ARROW_SPECIALIZED_SIMD_TARGET approach would need to be attached to the {}non{}-SIMD functions to ensure that they're compiled with no special instructions

... or I suppose we could try to declare all the intrinsics manually at function scope

ARROW_SIMD_FUNCTION(avx2) void SimdThing() {
  // inlined from immintrin.h:
  typedef unsigned short __mmask16;
  extern
    __inline
    __mmask16
    __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
        _mm512_int2mask (int __M);
}

@asfimport
Copy link
Collaborator Author

Ben Kietzman / @bkietz:
On further investigation, we can include immintrin.h with or without -mavx2 and clang at least will not complain unless the intrinsics are referenced, so

#include <immintrin.h>

[[gnu::target("avx2")]]
void use_simd() {
  __m256i arg;
  _mm256_abs_epi16 (arg);
}

int main() { use_simd(); }

compiles and runs happily without any special compilation flags. Using an attribute like this seems viable provided we can be certain that the modified target isn't transitively applied to functions which might be invoked for the first time inside a SIMD enabled function

@asfimport
Copy link
Collaborator Author

Ian Cook / @ianmcook:
@jonkeane  this issue is marked as a blocker for 9.0.0. Should this block the release?

@asfimport
Copy link
Collaborator Author

Jonathan Keane / @jonkeane:
Last I checked, the homebrew maintainers have said that they will disable all optimization for arrow if we don't get this sorted on our own. So not required if we're ok with that (though we should engage with them on this)

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
Disabling all optimizations for Arrow is brutal. It would be much better to simply disable runtime SIMD optimizations (by passing -DARROW_RUNTIME_SIMD_LEVEL=NONE to CMake, AFAIR).

@asfimport
Copy link
Collaborator Author

Raúl Cumplido / @raulcd:
Is this still a blocker?

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
Ideally it would... But there's little chance for it to be fixed in time for 9.0.0.

As I said above, the workaround should be to disable runtime SIMD optimizations on the affected builds. Somehow has to validate that suggestion, though (i.e. someone who's able to reproduce this issue).

@asfimport
Copy link
Collaborator Author

Jacob Wujciak / @assignUser:
Looking at [ARROW-15664] and this PR it seems like a workaround has been implemented for homebrew IIUC, so this is still an issue but as the real fix wont happen for 9.0.0 it shouldn't be a blocker anymore?

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
If that was actually accepted by Homebrew then fine.

@asfimport
Copy link
Collaborator Author

Jacob Wujciak / @assignUser:
That was my impression: issue and PR in homebrew-core.
Maybe @jonkeane can confirm?

@asfimport
Copy link
Collaborator Author

Jonathan Keane / @jonkeane:
Homebrew only accepted that as a temporary workaround and has threatened to turn off optimizations if we don't resolve this. They haven't yet followed through yet, though. Homebrew/homebrew-core#94724 (comment)

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
Note that I suggested a perhaps more acceptable workaround above.

@asfimport
Copy link
Collaborator Author

Krisztian Szucs / @kszucs:
@jonkeane can you give an update on this issue?

@asfimport
Copy link
Collaborator Author

Krisztian Szucs / @kszucs:
Postponing to 10.0 for now.

@asfimport
Copy link
Collaborator Author

Jonathan Keane / @jonkeane:
I have no updates beyond what's discussed above: there are a few approaches, none of them ideal, we need someone to champion this (or risk the homebrew maintainers turning off optimizations on us)

@asfimport
Copy link
Collaborator Author

Raúl Cumplido / @raulcd:
This was a blocker for the last release and is still a blocker for the 10.0.0 release. @jonkeane do you know if there has been any move?

@asfimport
Copy link
Collaborator Author

Jonathan Keane / @jonkeane:
I thought that @kou was going to take a look at this (or at least the underlying multiple SIMD instruction ordering issue that causes the failures...)

The only update I have is that I continue to run into the segfault in CI for downstream projects I'm working on, so it continues to be an issue for pre-built libarrow on machines like github's macos runners.

@asfimport
Copy link
Collaborator Author

Kouhei Sutou / @kou:
Yes. I will fix this before the 10.0.0 release. Sorry for not working on this yet.

@asfimport
Copy link
Collaborator Author

Kouhei Sutou / @kou:
Summary of this problem:

Problem:

  • Parquet module is crashed with -DCMAKE_BUILD_TYPE=MinSizeRel

    Why the problem is happened:

  • We compile the same code (level_conversion_inc.h) multiple times with different optimization flags such as -msse4.2 and -mavx2

  • The code calls the same function (arrow::internal::FirstTimeBitmapWriter::AppendWord()) that is defined in header file

  • The called function isn't inlined with -DCMAKE_BUILD_TYPE=MinSizeRel

  • It generates multiple definitions for the called (not-inlined) function (arrow::internal::FirstTimeBitmapWriter::AppendWord())

    Proposed solutions so far:

  1. Force to inline functions that are called from the code that are compiled with SIMD related optimization flags

  2. Restrict SIMD related optimization area to only the target function

    For 1., we have two approaches for it:

@asfimport
Copy link
Collaborator Author

Kouhei Sutou / @kou:
I propose one more approach for the proposed solution 1.:

How about always enabling inline optimization for SIMD optimized compile units (level_conversion_bmi2.cc) even when an user specifies -DCMAKE_BUILD_TYPE=MinSizeRel?

It may increases binary size but it may be better that SIMD related code prioritizes performance than binary size.

We don't need to write manual template/\_\_attribute\_\_((always\_inline)) s with this approach.

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
Well, I don't think we can force the compiler to inline everything.

@asfimport
Copy link
Collaborator Author

Kouhei Sutou / @kou:
We don't have this problem with -DCMAKE_BUILD_TYPE=Release. So it may work with most cases.

@asfimport
Copy link
Collaborator Author

Kouhei Sutou / @kou:
Issue resolved by pull request 14342
#14342

kou added a commit that referenced this issue Jul 30, 2023
### Rationale for this change

Summary of this problem: #31132 (comment)

Why this problem is happen again? Because I removed `ENV["HOMEBREW_OPTIMIZATION_LEVEL"] = "O2"` in #36583. The solution we chose by #14342 was forcing to use `-O2` for SIMD related code. It works for `-DCMAKE_BUILD_TYPE=MinSizeRel` but it doesn't work for Homebrew.

Because Homebrew's CC https://github.com/Homebrew/brew/blob/master/Library/Homebrew/shims/super/cc forces to use the same `-O` flag. The default is `-Os`. If we specify `-O2`, Homebrew's CC replaces it to `-Os`. If we use  `ENV["HOMEBREW_OPTIMIZATION_LEVEL"] = "O2"`, Homebrew's CC always use `-O2`. So the solution we chose by #14342 isn't used for Homebrew.

But Homebrew thinks that `ENV["HOMEBREW_OPTIMIZATION_LEVEL"] = "O2"` is a workaround. So we need another solution for Homebrew.

Here are candidate solutions:
1. `-DARROW_RUNTIME_SIMD_LEVEL=NONE`
2. Remove `ENV.runtime_cpu_detection if Hardware::CPU.intel?`

"1. `-DARROW_RUNTIME_SIMD_LEVEL=NONE`" works because we don't use the runtime SIMD dispatch feature (the problematic feature) entirely.

"2. Remove `ENV.runtime_cpu_detection if Hardware::CPU.intel?`" works but I don't know why... If `ENV.runtime_cpu_detection` is called, Homebrew's CC stops replacing `-march=*`. If we call `ENV.runtime_cpu_detection`, `-march=haswell` is used for AVX2 related code and `-march=skylake-avx512` is used for AVX512 including BMI2 related code. If we don't call `ENV.runtime_cpu_detection`, `-march=nehalem` is always used. (Note that SIMD related flags such as `-mbmi2` aren't removed by Homebrew's CC. So I think that SIMD is enabled.) I don't know why but "the one-definition-rule violation" (see the summary for details: #31132 (comment) ) isn't happen.

FYI: CPU info for GitHub Actions macOS hosted-runner:

```console
$ sysctl hw.optional machdep.cpu
hw.optional.adx: 0
hw.optional.aes: 1
hw.optional.avx1_0: 1
hw.optional.avx2_0: 0
hw.optional.avx512bw: 0
hw.optional.avx512cd: 0
hw.optional.avx512dq: 0
hw.optional.avx512f: 0
hw.optional.avx512ifma: 0
hw.optional.avx512vbmi: 0
hw.optional.avx512vl: 0
hw.optional.bmi1: 0
hw.optional.bmi2: 0
hw.optional.enfstrg: 0
hw.optional.f16c: 1
hw.optional.floatingpoint: 1
hw.optional.fma: 0
hw.optional.hle: 0
hw.optional.mmx: 1
hw.optional.mpx: 0
hw.optional.rdrand: 1
hw.optional.rtm: 0
hw.optional.sgx: 0
hw.optional.sse: 1
hw.optional.sse2: 1
hw.optional.sse3: 1
hw.optional.sse4_1: 1
hw.optional.sse4_2: 1
hw.optional.supplementalsse3: 1
hw.optional.x86_64: 1
machdep.cpu.address_bits.physical: 43
machdep.cpu.address_bits.virtual: 48
machdep.cpu.arch_perf.events: 127
machdep.cpu.arch_perf.events_number: 7
machdep.cpu.arch_perf.fixed_number: 0
machdep.cpu.arch_perf.fixed_width: 0
machdep.cpu.arch_perf.number: 4
machdep.cpu.arch_perf.version: 1
machdep.cpu.arch_perf.width: 48
machdep.cpu.cache.L2_associativity: 8
machdep.cpu.cache.linesize: 64
machdep.cpu.cache.size: 256
machdep.cpu.mwait.extensions: 3
machdep.cpu.mwait.linesize_max: 4096
machdep.cpu.mwait.linesize_min: 64
machdep.cpu.mwait.sub_Cstates: 16
machdep.cpu.thermal.ACNT_MCNT: 0
machdep.cpu.thermal.core_power_limits: 0
machdep.cpu.thermal.dynamic_acceleration: 0
machdep.cpu.thermal.energy_policy: 0
machdep.cpu.thermal.fine_grain_clock_mod: 0
machdep.cpu.thermal.hardware_feedback: 0
machdep.cpu.thermal.invariant_APIC_timer: 1
machdep.cpu.thermal.package_thermal_intr: 0
machdep.cpu.thermal.sensor: 0
machdep.cpu.thermal.thresholds: 0
machdep.cpu.tlb.data.small: 64
machdep.cpu.tlb.inst.large: 8
machdep.cpu.tlb.inst.small: 64
machdep.cpu.tlb.shared: 512
machdep.cpu.tsc_ccc.denominator: 0
machdep.cpu.tsc_ccc.numerator: 0
machdep.cpu.xsave.extended_state: 7 832 832 0
machdep.cpu.xsave.extended_state1: 0 0 0 0
machdep.cpu.brand: 0
machdep.cpu.brand_string: Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
machdep.cpu.core_count: 3
machdep.cpu.cores_per_package: 4
machdep.cpu.extfamily: 0
machdep.cpu.extfeature_bits: 4967106816
machdep.cpu.extfeatures: SYSCALL XD EM64T LAHF RDTSCP TSCI
machdep.cpu.extmodel: 3
machdep.cpu.family: 6
machdep.cpu.feature_bits: 18427078393948011519
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH MMX FXSR SSE SSE2 SS HTT SSE3 PCLMULQDQ MON VMX SSSE3 CX16 SSE4.1 SSE4.2 x2APIC POPCNT AES VMM PCID XSAVE OSXSAVE TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_feature_bits: 643 0
machdep.cpu.leaf7_feature_bits_edx: 3154117632
machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET SMEP ERMS MDCLEAR IBRS STIBP L1DF ACAPMSR SSBD
machdep.cpu.logical_per_package: 4
machdep.cpu.max_basic: 13
machdep.cpu.max_ext: 2147483656
machdep.cpu.microcode_version: 1070
machdep.cpu.model: 58
machdep.cpu.processor_flag: 0
machdep.cpu.signature: 198313
machdep.cpu.stepping: 9
machdep.cpu.thread_count: 3
machdep.cpu.vendor: GenuineIntel
```

### What changes are included in this PR?

"1. `-DARROW_RUNTIME_SIMD_LEVEL=NONE`" because it's straightforward and  "2. Remove `ENV.runtime_cpu_detection if Hardware::CPU.intel?`" may also disable runtime SIMD dispatch implicitly.

This also adds the following debug information for easy to debug in future:
* CPU information for GitHub Actions runner
* Homebrew's build logs

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* Closes: #36685

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
R-JunmingChen pushed a commit to R-JunmingChen/arrow that referenced this issue Aug 20, 2023
…ache#36705)

### Rationale for this change

Summary of this problem: apache#31132 (comment)

Why this problem is happen again? Because I removed `ENV["HOMEBREW_OPTIMIZATION_LEVEL"] = "O2"` in apache#36583. The solution we chose by apache#14342 was forcing to use `-O2` for SIMD related code. It works for `-DCMAKE_BUILD_TYPE=MinSizeRel` but it doesn't work for Homebrew.

Because Homebrew's CC https://github.com/Homebrew/brew/blob/master/Library/Homebrew/shims/super/cc forces to use the same `-O` flag. The default is `-Os`. If we specify `-O2`, Homebrew's CC replaces it to `-Os`. If we use  `ENV["HOMEBREW_OPTIMIZATION_LEVEL"] = "O2"`, Homebrew's CC always use `-O2`. So the solution we chose by apache#14342 isn't used for Homebrew.

But Homebrew thinks that `ENV["HOMEBREW_OPTIMIZATION_LEVEL"] = "O2"` is a workaround. So we need another solution for Homebrew.

Here are candidate solutions:
1. `-DARROW_RUNTIME_SIMD_LEVEL=NONE`
2. Remove `ENV.runtime_cpu_detection if Hardware::CPU.intel?`

"1. `-DARROW_RUNTIME_SIMD_LEVEL=NONE`" works because we don't use the runtime SIMD dispatch feature (the problematic feature) entirely.

"2. Remove `ENV.runtime_cpu_detection if Hardware::CPU.intel?`" works but I don't know why... If `ENV.runtime_cpu_detection` is called, Homebrew's CC stops replacing `-march=*`. If we call `ENV.runtime_cpu_detection`, `-march=haswell` is used for AVX2 related code and `-march=skylake-avx512` is used for AVX512 including BMI2 related code. If we don't call `ENV.runtime_cpu_detection`, `-march=nehalem` is always used. (Note that SIMD related flags such as `-mbmi2` aren't removed by Homebrew's CC. So I think that SIMD is enabled.) I don't know why but "the one-definition-rule violation" (see the summary for details: apache#31132 (comment) ) isn't happen.

FYI: CPU info for GitHub Actions macOS hosted-runner:

```console
$ sysctl hw.optional machdep.cpu
hw.optional.adx: 0
hw.optional.aes: 1
hw.optional.avx1_0: 1
hw.optional.avx2_0: 0
hw.optional.avx512bw: 0
hw.optional.avx512cd: 0
hw.optional.avx512dq: 0
hw.optional.avx512f: 0
hw.optional.avx512ifma: 0
hw.optional.avx512vbmi: 0
hw.optional.avx512vl: 0
hw.optional.bmi1: 0
hw.optional.bmi2: 0
hw.optional.enfstrg: 0
hw.optional.f16c: 1
hw.optional.floatingpoint: 1
hw.optional.fma: 0
hw.optional.hle: 0
hw.optional.mmx: 1
hw.optional.mpx: 0
hw.optional.rdrand: 1
hw.optional.rtm: 0
hw.optional.sgx: 0
hw.optional.sse: 1
hw.optional.sse2: 1
hw.optional.sse3: 1
hw.optional.sse4_1: 1
hw.optional.sse4_2: 1
hw.optional.supplementalsse3: 1
hw.optional.x86_64: 1
machdep.cpu.address_bits.physical: 43
machdep.cpu.address_bits.virtual: 48
machdep.cpu.arch_perf.events: 127
machdep.cpu.arch_perf.events_number: 7
machdep.cpu.arch_perf.fixed_number: 0
machdep.cpu.arch_perf.fixed_width: 0
machdep.cpu.arch_perf.number: 4
machdep.cpu.arch_perf.version: 1
machdep.cpu.arch_perf.width: 48
machdep.cpu.cache.L2_associativity: 8
machdep.cpu.cache.linesize: 64
machdep.cpu.cache.size: 256
machdep.cpu.mwait.extensions: 3
machdep.cpu.mwait.linesize_max: 4096
machdep.cpu.mwait.linesize_min: 64
machdep.cpu.mwait.sub_Cstates: 16
machdep.cpu.thermal.ACNT_MCNT: 0
machdep.cpu.thermal.core_power_limits: 0
machdep.cpu.thermal.dynamic_acceleration: 0
machdep.cpu.thermal.energy_policy: 0
machdep.cpu.thermal.fine_grain_clock_mod: 0
machdep.cpu.thermal.hardware_feedback: 0
machdep.cpu.thermal.invariant_APIC_timer: 1
machdep.cpu.thermal.package_thermal_intr: 0
machdep.cpu.thermal.sensor: 0
machdep.cpu.thermal.thresholds: 0
machdep.cpu.tlb.data.small: 64
machdep.cpu.tlb.inst.large: 8
machdep.cpu.tlb.inst.small: 64
machdep.cpu.tlb.shared: 512
machdep.cpu.tsc_ccc.denominator: 0
machdep.cpu.tsc_ccc.numerator: 0
machdep.cpu.xsave.extended_state: 7 832 832 0
machdep.cpu.xsave.extended_state1: 0 0 0 0
machdep.cpu.brand: 0
machdep.cpu.brand_string: Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
machdep.cpu.core_count: 3
machdep.cpu.cores_per_package: 4
machdep.cpu.extfamily: 0
machdep.cpu.extfeature_bits: 4967106816
machdep.cpu.extfeatures: SYSCALL XD EM64T LAHF RDTSCP TSCI
machdep.cpu.extmodel: 3
machdep.cpu.family: 6
machdep.cpu.feature_bits: 18427078393948011519
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH MMX FXSR SSE SSE2 SS HTT SSE3 PCLMULQDQ MON VMX SSSE3 CX16 SSE4.1 SSE4.2 x2APIC POPCNT AES VMM PCID XSAVE OSXSAVE TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_feature_bits: 643 0
machdep.cpu.leaf7_feature_bits_edx: 3154117632
machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET SMEP ERMS MDCLEAR IBRS STIBP L1DF ACAPMSR SSBD
machdep.cpu.logical_per_package: 4
machdep.cpu.max_basic: 13
machdep.cpu.max_ext: 2147483656
machdep.cpu.microcode_version: 1070
machdep.cpu.model: 58
machdep.cpu.processor_flag: 0
machdep.cpu.signature: 198313
machdep.cpu.stepping: 9
machdep.cpu.thread_count: 3
machdep.cpu.vendor: GenuineIntel
```

### What changes are included in this PR?

"1. `-DARROW_RUNTIME_SIMD_LEVEL=NONE`" because it's straightforward and  "2. Remove `ENV.runtime_cpu_detection if Hardware::CPU.intel?`" may also disable runtime SIMD dispatch implicitly.

This also adds the following debug information for easy to debug in future:
* CPU information for GitHub Actions runner
* Homebrew's build logs

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* Closes: apache#36685

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…ache#36705)

### Rationale for this change

Summary of this problem: apache#31132 (comment)

Why this problem is happen again? Because I removed `ENV["HOMEBREW_OPTIMIZATION_LEVEL"] = "O2"` in apache#36583. The solution we chose by apache#14342 was forcing to use `-O2` for SIMD related code. It works for `-DCMAKE_BUILD_TYPE=MinSizeRel` but it doesn't work for Homebrew.

Because Homebrew's CC https://github.com/Homebrew/brew/blob/master/Library/Homebrew/shims/super/cc forces to use the same `-O` flag. The default is `-Os`. If we specify `-O2`, Homebrew's CC replaces it to `-Os`. If we use  `ENV["HOMEBREW_OPTIMIZATION_LEVEL"] = "O2"`, Homebrew's CC always use `-O2`. So the solution we chose by apache#14342 isn't used for Homebrew.

But Homebrew thinks that `ENV["HOMEBREW_OPTIMIZATION_LEVEL"] = "O2"` is a workaround. So we need another solution for Homebrew.

Here are candidate solutions:
1. `-DARROW_RUNTIME_SIMD_LEVEL=NONE`
2. Remove `ENV.runtime_cpu_detection if Hardware::CPU.intel?`

"1. `-DARROW_RUNTIME_SIMD_LEVEL=NONE`" works because we don't use the runtime SIMD dispatch feature (the problematic feature) entirely.

"2. Remove `ENV.runtime_cpu_detection if Hardware::CPU.intel?`" works but I don't know why... If `ENV.runtime_cpu_detection` is called, Homebrew's CC stops replacing `-march=*`. If we call `ENV.runtime_cpu_detection`, `-march=haswell` is used for AVX2 related code and `-march=skylake-avx512` is used for AVX512 including BMI2 related code. If we don't call `ENV.runtime_cpu_detection`, `-march=nehalem` is always used. (Note that SIMD related flags such as `-mbmi2` aren't removed by Homebrew's CC. So I think that SIMD is enabled.) I don't know why but "the one-definition-rule violation" (see the summary for details: apache#31132 (comment) ) isn't happen.

FYI: CPU info for GitHub Actions macOS hosted-runner:

```console
$ sysctl hw.optional machdep.cpu
hw.optional.adx: 0
hw.optional.aes: 1
hw.optional.avx1_0: 1
hw.optional.avx2_0: 0
hw.optional.avx512bw: 0
hw.optional.avx512cd: 0
hw.optional.avx512dq: 0
hw.optional.avx512f: 0
hw.optional.avx512ifma: 0
hw.optional.avx512vbmi: 0
hw.optional.avx512vl: 0
hw.optional.bmi1: 0
hw.optional.bmi2: 0
hw.optional.enfstrg: 0
hw.optional.f16c: 1
hw.optional.floatingpoint: 1
hw.optional.fma: 0
hw.optional.hle: 0
hw.optional.mmx: 1
hw.optional.mpx: 0
hw.optional.rdrand: 1
hw.optional.rtm: 0
hw.optional.sgx: 0
hw.optional.sse: 1
hw.optional.sse2: 1
hw.optional.sse3: 1
hw.optional.sse4_1: 1
hw.optional.sse4_2: 1
hw.optional.supplementalsse3: 1
hw.optional.x86_64: 1
machdep.cpu.address_bits.physical: 43
machdep.cpu.address_bits.virtual: 48
machdep.cpu.arch_perf.events: 127
machdep.cpu.arch_perf.events_number: 7
machdep.cpu.arch_perf.fixed_number: 0
machdep.cpu.arch_perf.fixed_width: 0
machdep.cpu.arch_perf.number: 4
machdep.cpu.arch_perf.version: 1
machdep.cpu.arch_perf.width: 48
machdep.cpu.cache.L2_associativity: 8
machdep.cpu.cache.linesize: 64
machdep.cpu.cache.size: 256
machdep.cpu.mwait.extensions: 3
machdep.cpu.mwait.linesize_max: 4096
machdep.cpu.mwait.linesize_min: 64
machdep.cpu.mwait.sub_Cstates: 16
machdep.cpu.thermal.ACNT_MCNT: 0
machdep.cpu.thermal.core_power_limits: 0
machdep.cpu.thermal.dynamic_acceleration: 0
machdep.cpu.thermal.energy_policy: 0
machdep.cpu.thermal.fine_grain_clock_mod: 0
machdep.cpu.thermal.hardware_feedback: 0
machdep.cpu.thermal.invariant_APIC_timer: 1
machdep.cpu.thermal.package_thermal_intr: 0
machdep.cpu.thermal.sensor: 0
machdep.cpu.thermal.thresholds: 0
machdep.cpu.tlb.data.small: 64
machdep.cpu.tlb.inst.large: 8
machdep.cpu.tlb.inst.small: 64
machdep.cpu.tlb.shared: 512
machdep.cpu.tsc_ccc.denominator: 0
machdep.cpu.tsc_ccc.numerator: 0
machdep.cpu.xsave.extended_state: 7 832 832 0
machdep.cpu.xsave.extended_state1: 0 0 0 0
machdep.cpu.brand: 0
machdep.cpu.brand_string: Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz
machdep.cpu.core_count: 3
machdep.cpu.cores_per_package: 4
machdep.cpu.extfamily: 0
machdep.cpu.extfeature_bits: 4967106816
machdep.cpu.extfeatures: SYSCALL XD EM64T LAHF RDTSCP TSCI
machdep.cpu.extmodel: 3
machdep.cpu.family: 6
machdep.cpu.feature_bits: 18427078393948011519
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH MMX FXSR SSE SSE2 SS HTT SSE3 PCLMULQDQ MON VMX SSSE3 CX16 SSE4.1 SSE4.2 x2APIC POPCNT AES VMM PCID XSAVE OSXSAVE TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_feature_bits: 643 0
machdep.cpu.leaf7_feature_bits_edx: 3154117632
machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET SMEP ERMS MDCLEAR IBRS STIBP L1DF ACAPMSR SSBD
machdep.cpu.logical_per_package: 4
machdep.cpu.max_basic: 13
machdep.cpu.max_ext: 2147483656
machdep.cpu.microcode_version: 1070
machdep.cpu.model: 58
machdep.cpu.processor_flag: 0
machdep.cpu.signature: 198313
machdep.cpu.stepping: 9
machdep.cpu.thread_count: 3
machdep.cpu.vendor: GenuineIntel
```

### What changes are included in this PR?

"1. `-DARROW_RUNTIME_SIMD_LEVEL=NONE`" because it's straightforward and  "2. Remove `ENV.runtime_cpu_detection if Hardware::CPU.intel?`" may also disable runtime SIMD dispatch implicitly.

This also adds the following debug information for easy to debug in future:
* CPU information for GitHub Actions runner
* Homebrew's build logs

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* Closes: apache#36685

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants