[C++][CI] a crossbow job with MinRelSize enabled #31132

asfimport · 2022-02-14T17:46:30Z

Reporter: Jonathan Keane / @jonkeane
Assignee: Kouhei Sutou / @kou

Related issues:

[C++] parquet reader Segfaults with illegal SIMD instruction (is duplicated by)

PRs and other links:

_{Note: This issue was originally created as ARROW-15678. Please see the migration documentation for further details.}

asfimport · 2022-03-02T17:44:33Z

Jonathan Keane / @jonkeane:
The pull request linked has the starts of this — but there's still an unidentified segfault in one of the tests

asfimport · 2022-04-17T08:39:05Z

Weston Pace / @westonpace:
Ok, so I was finally able to track this down. Fortunately (unfortunately?) it is not really a compiler bug (or maybe it is, I'm not sure). At the very least, I think we can avoid it.

level_comparison.cc is compiled with -msse4.2.

level_comparison_avx2.cc is compiled with -mavx2

This is expected and the functions they generate are housed in separate namespaces so they don't get confused. However, both functions rely on the function arrow::internal::FirstTimeBitmapWriter::AppendWord. The function is not templated but it is defined in the header file (and is not marked inline). I'm not really sure how we aren't getting a duplicate symbol error but some reading suggests it is implicitly inlined at link time.

In the object file (libparquet.a), there are two identical symbols named __ZN5arrow8internal21FirstTimeBitmapWriter10AppendWordEyx. One of them has SHLX and one of them has SHL. This disassembly of the SHLX version matches exactly the disassembly in the stack trace that @jonkeane posted in the PR. The two calling functions are (parquet::internal::standard::DefLevelsBatchToBitmap and parquet::internal::bmi2::DefLevelsBatchToBitmap.

So I think, the -O3 version is inlining the functions. The -Os version is not (-Os seems to discourage inlining in general). The linker is then faced with two identical symbols and just picks one (again, trying to optimize for size). It just so happens the version it picked was the one with SHLX.

So, as a test, we can try splitting the implementation part of bitmap_writer.h into bitmap_writer.cc (at least for FirstTimeBitmapWriter). The .cc file should then only be compiled once (with sse4.2). However, it's very possible we are just hitting the tip of the iceberg here, as any header file linked in by these avx2 compiled versions could be a ticking time bomb.

asfimport · 2022-04-19T12:37:39Z

David Li / @lidavidm:
Good catch. This is exactly the same problem we ran into before with kernels:

arrow/cpp/src/arrow/compute/kernels/aggregate_internal.h

Lines 135 to 138 in 6c10a38

    
           // SumArray must be parameterized with the SIMD level since it's called both from 
        
           // translation units with and without vectorization. Normally it gets inlined but 
        
           // if not, without the parameter, we'll have multiple definitions of the same 
        
           // symbol and we'll get unexpected results.

(ARROW-13382). I wonder if we should reconsider the plan of vectorizing kernels by rebuilding the same source multiple times given this potential pitfall.

asfimport · 2022-04-19T17:21:14Z

Weston Pace / @westonpace:
We can perhaps add a static check that reports symbols outside the appropriate namespace. We might need some configurable suppression. For example, level_conversion_bmi2.cc.o would report:


0000000000000000 W arrow::util::ArrowLogBase& arrow::util::ArrowLogBase::operator<< <char [51]>(char const (&) [51])
0000000000000000 W std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > arrow::util::StringBuilder<char const (&) [33]>(char const (&) [33])
0000000000000000 W void arrow::util::StringBuilderRecursive<char const (&) [33]>(std::ostream&, char const (&) [33])
0000000000000000 W arrow::util::detail::StringStreamWrapper::stream()
0000000000000000 W arrow::util::Voidify::operator&(arrow::util::ArrowLogBase&)
0000000000000000 W arrow::util::Voidify::Voidify()
0000000000000000 W arrow::bit_util::BytesForBits(long)
0000000000000000 W arrow::internal::FirstTimeBitmapWriter::AppendWord(unsigned long, long)
0000000000000000 W arrow::internal::FirstTimeBitmapWriter::Finish()
0000000000000000 W arrow::internal::FirstTimeBitmapWriter::FirstTimeBitmapWriter(unsigned char*, long, long)
0000000000000000 W parquet::ParquetException::ParquetException<char const (&) [33]>(char const (&) [33])
0000000000000000 W parquet::ParquetException::~ParquetException()
0000000000000000 W parquet::ParquetException::~ParquetException()
0000000000000000 W arrow::internal::FirstTimeBitmapWriter::position() const
0000000000000000 W parquet::ParquetException::what() const
0000000000000000 W std::exception::exception()
0000000000000000 W char const (&std::forward<char const (&) [33]>(std::remove_reference<char const (&) [33]>::type&)) [33]

level_comparison_avx.cc.o looks to be in better shape:


0000000000000000 W short const& std::max<short>(short const&, short const&)
0000000000000000 W short const& std::min<short>(short const&, short const&)

But yes, if we have a better solution for this problem it might be safer.

asfimport · 2022-04-21T19:07:55Z

Antoine Pitrou / @pitrou:
Wow, thanks for the diagnosis @westonpace.
So, it turns out that our method for compiling multiple versions of code is violating the one-definition-rule for any inline function or method used in the caller code.

asfimport · 2022-04-21T19:08:32Z

Antoine Pitrou / @pitrou:
@bkietz You may have some idea about how to fix this cleanly and reliably.

asfimport · 2022-04-21T19:50:52Z

Antoine Pitrou / @pitrou:
So, currently we are doing something such as:

clang -c something_avx2.cc -mavx2

An alternative would be not to pass the optimization flag on the command line but enable it selectively inside the source code, e.g.:

clang -c something_avx2.cc -DARROW_SPECIALIZED_SIMD_TARGET=avx2

namespace parquet {
namespace internal {
namespace PARQUET_IMPL_NAMESPACE {

#ifdef ARROW_SPECIALIZED_SIMD_TARGET

#define STRINGIFY_EXPANDED(a) ARROW_STRINGIFY(a)
#pragma clang attribute push (__attribute__((target( STRINGIFY_EXPANDED(ARROW_SPECIALIZED_SIMD_TARGET)) )), apply_to=function)

#endif

...

#ifdef ARROW_SPECIALIZED_SIMD_TARGET
#pragma clang attribute pop
#endif

}  // namespace PARQUET_IMPL_NAMESPACE
}  // namespace internal
}  // namespace parquet

This way we would avoid enabling the particular instruction set on code inlined from other headers. Of course, perhaps that's not actually desirable...

asfimport · 2022-04-21T19:54:11Z

Antoine Pitrou / @pitrou:
In any case, this is probably too involved a change for 8.0.0, so the 8.0.0 fix would simply to disable SIMD optimizations for Homebrew?

asfimport · 2022-04-21T19:58:06Z

Weston Pace / @westonpace:
That seems like an good solution to me. I had no idea it was possible.

If we want the other headers to be included then we already have a bit of a solution demonstrated in level_conversion_inc.h. In the common header file you require some kind of target namespace to be defined.


namespace parquet {
namespace internal {
#ifndef PARQUET_IMPL_NAMESPACE
#error "PARQUET_IMPL_NAMESPACE must be defined"
#endif
namespace PARQUET_IMPL_NAMESPACE {
...
}  // namespace PARQUET_IMPL_NAMESPACE
}  // namespace internal
}  // namespace parquet

However, anything that includes one of these "common headers" must define that namespace...


#define PARQUET_IMPL_NAMESPACE standard
#include "parquet/level_conversion_inc.h"
#undef PARQUET_IMPL_NAMESPACE

asfimport · 2022-05-18T22:02:12Z

Jonathan Keane / @jonkeane:
@kou Do you think you might be able to take a look at this?

The comment at #12928 (comment) has a good explanation of what's going on and following that there are a few possible fixes (though none of them were fully implemented or decided

asfimport · 2022-05-19T00:44:10Z

Kouhei Sutou / @kou:
How about using template to distinct implementation for each architecture?


diff --git a/cpp/src/arrow/compute/kernels/codegen_internal.h b/cpp/src/arrow/compute/kernels/codegen_internal.h
index fa50427bc3..a4bd0eb586 100644
--- a/cpp/src/arrow/compute/kernels/codegen_internal.h
+++ b/cpp/src/arrow/compute/kernels/codegen_internal.h
@@ -710,8 +710,8 @@ struct ScalarUnaryNotNullStateful {
                        Datum* out) {
       Status st = Status::OK();
       ArrayData* out_arr = out->mutable_array();
-      FirstTimeBitmapWriter out_writer(out_arr->buffers[1]->mutable_data(),
-                                       out_arr->offset, out_arr->length);
+      FirstTimeBitmapWriter<> out_writer(out_arr->buffers[1]->mutable_data(),
+                                         out_arr->offset, out_arr->length);
       VisitArrayValuesInline<Arg0Type>(
           arg0,
           [&](Arg0Value v) {
diff --git a/cpp/src/arrow/compute/kernels/row_encoder.cc b/cpp/src/arrow/compute/kernels/row_encoder.cc
index 10a1f4cda5..26316ec315 100644
--- a/cpp/src/arrow/compute/kernels/row_encoder.cc
+++ b/cpp/src/arrow/compute/kernels/row_encoder.cc
@@ -42,7 +42,7 @@ Status KeyEncoder::DecodeNulls(MemoryPool* pool, int32_t length, uint8_t** encod
     ARROW_ASSIGN_OR_RAISE(*null_bitmap, AllocateBitmap(length, pool));
     uint8_t* validity = (*null_bitmap)->mutable_data();
 
-    FirstTimeBitmapWriter writer(validity, 0, length);
+    FirstTimeBitmapWriter<> writer(validity, 0, length);
     for (int32_t i = 0; i < length; ++i) {
       if (encoded_bytes[i][0] == kValidByte) {
         writer.Set();
diff --git a/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc b/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc
index 7d8d2edc4b..433df0f1b7 100644
--- a/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc
+++ b/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc
@@ -353,8 +353,8 @@ struct IsInVisitor {
     const auto& state = checked_cast<const SetLookupState<Type>&>(*ctx->state());
     ArrayData* output = out->mutable_array();
 
-    FirstTimeBitmapWriter writer(output->buffers[1]->mutable_data(), output->offset,
-                                 output->length);
+    FirstTimeBitmapWriter<> writer(output->buffers[1]->mutable_data(), output->offset,
+                                   output->length);
 
     VisitArrayDataInline<Type>(
         this->data,
diff --git a/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc b/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc
index 611601cab8..da7de1c277 100644
--- a/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc
+++ b/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc
@@ -1456,7 +1456,7 @@ struct MatchSubstringImpl {
         [&matcher](const void* raw_offsets, const uint8_t* data, int64_t length,
                    int64_t output_offset, uint8_t* output) {
           const offset_type* offsets = reinterpret_cast<const offset_type*>(raw_offsets);
-          FirstTimeBitmapWriter bitmap_writer(output, output_offset, length);
+          FirstTimeBitmapWriter<> bitmap_writer(output, output_offset, length);
           for (int64_t i = 0; i < length; ++i) {
             const char* current_data = reinterpret_cast<const char*>(data + offsets[i]);
             int64_t current_length = offsets[i + 1] - offsets[i];
diff --git a/cpp/src/arrow/util/bit_util_benchmark.cc b/cpp/src/arrow/util/bit_util_benchmark.cc
index 258fd27785..66a81b4e04 100644
--- a/cpp/src/arrow/util/bit_util_benchmark.cc
+++ b/cpp/src/arrow/util/bit_util_benchmark.cc
@@ -386,7 +386,7 @@ static void BitmapWriter(benchmark::State& state) {
 }
 
 static void FirstTimeBitmapWriter(benchmark::State& state) {
-  BenchmarkBitmapWriter<internal::FirstTimeBitmapWriter>(state, state.range(0));
+  BenchmarkBitmapWriter<internal::FirstTimeBitmapWriter<>>(state, state.range(0));
 }
 
 struct GenerateBitsFunctor {
diff --git a/cpp/src/arrow/util/bit_util_test.cc b/cpp/src/arrow/util/bit_util_test.cc
index 6c2aff4fbe..9b9f19feb1 100644
--- a/cpp/src/arrow/util/bit_util_test.cc
+++ b/cpp/src/arrow/util/bit_util_test.cc
@@ -832,14 +832,14 @@ TEST(FirstTimeBitmapWriter, NormalOperation) {
     const uint8_t fill_byte = static_cast<uint8_t>(fill_byte_int);
     {
       uint8_t bitmap[] = {fill_byte, fill_byte, fill_byte, fill_byte};
-      auto writer = internal::FirstTimeBitmapWriter(bitmap, 0, 12);
+      auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 0, 12);
       WriteVectorToWriter(writer, {0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1});
       //                      {0b00110110, 0b1010, 0, 0}
       ASSERT_BYTES_EQ(bitmap, {0x36, 0x0a});
     }
     {
       uint8_t bitmap[] = {fill_byte, fill_byte, fill_byte, fill_byte};
-      auto writer = internal::FirstTimeBitmapWriter(bitmap, 4, 12);
+      auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 4, 12);
       WriteVectorToWriter(writer, {0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1});
       //                      {0b00110110, 0b1010, 0, 0}
       ASSERT_BYTES_EQ(bitmap, {static_cast<uint8_t>(0x60 | (fill_byte & 0x0f)), 0xa3});
@@ -848,15 +848,15 @@ TEST(FirstTimeBitmapWriter, NormalOperation) {
     {
       uint8_t bitmap[] = {fill_byte, fill_byte, fill_byte, fill_byte};
       {
-        auto writer = internal::FirstTimeBitmapWriter(bitmap, 0, 6);
+        auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 0, 6);
         WriteVectorToWriter(writer, {0, 1, 1, 0, 1, 1});
       }
       {
-        auto writer = internal::FirstTimeBitmapWriter(bitmap, 6, 3);
+        auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 6, 3);
         WriteVectorToWriter(writer, {0, 0, 0});
       }
       {
-        auto writer = internal::FirstTimeBitmapWriter(bitmap, 9, 3);
+        auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 9, 3);
         WriteVectorToWriter(writer, {1, 0, 1});
       }
       ASSERT_BYTES_EQ(bitmap, {0x36, 0x0a});
@@ -864,23 +864,23 @@ TEST(FirstTimeBitmapWriter, NormalOperation) {
     {
       uint8_t bitmap[] = {fill_byte, fill_byte, fill_byte, fill_byte};
       {
-        auto writer = internal::FirstTimeBitmapWriter(bitmap, 4, 0);
+        auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 4, 0);
         WriteVectorToWriter(writer, {});
       }
       {
-        auto writer = internal::FirstTimeBitmapWriter(bitmap, 4, 6);
+        auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 4, 6);
         WriteVectorToWriter(writer, {0, 1, 1, 0, 1, 1});
       }
       {
-        auto writer = internal::FirstTimeBitmapWriter(bitmap, 10, 3);
+        auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 10, 3);
         WriteVectorToWriter(writer, {0, 0, 0});
       }
       {
-        auto writer = internal::FirstTimeBitmapWriter(bitmap, 13, 0);
+        auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 13, 0);
         WriteVectorToWriter(writer, {});
       }
       {
-        auto writer = internal::FirstTimeBitmapWriter(bitmap, 13, 3);
+        auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 13, 3);
         WriteVectorToWriter(writer, {1, 0, 1});
       }
       ASSERT_BYTES_EQ(bitmap, {static_cast<uint8_t>(0x60 | (fill_byte & 0x0f)), 0xa3});
@@ -900,8 +900,8 @@ TEST(FirstTimeBitmapWriter, AppendWordOffsetOverwritesCorrectBitsOnExistingByte)
   auto check_append = [](const std::string& expected_bits, int64_t offset) {
     std::vector<uint8_t> valid_bits = {0x00};
     constexpr int64_t kBitsAfterAppend = 8;
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), offset,
-                                           /*length=*/(8 * valid_bits.size()) - offset);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), offset,
+                                             /*length=*/(8 * valid_bits.size()) - offset);
     writer.AppendWord(/*word=*/0xFF, /*number_of_bits=*/kBitsAfterAppend - offset);
     writer.Finish();
     EXPECT_EQ(BitmapToString(valid_bits, kBitsAfterAppend), expected_bits);
@@ -918,8 +918,8 @@ TEST(FirstTimeBitmapWriter, AppendWordOffsetOverwritesCorrectBitsOnExistingByte)
   auto check_with_set = [](const std::string& expected_bits, int64_t offset) {
     std::vector<uint8_t> valid_bits = {0x1};
     constexpr int64_t kBitsAfterAppend = 8;
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), offset,
-                                           /*length=*/(8 * valid_bits.size()) - offset);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), offset,
+                                             /*length=*/(8 * valid_bits.size()) - offset);
     writer.AppendWord(/*word=*/0xFF, /*number_of_bits=*/kBitsAfterAppend - offset);
     writer.Finish();
     EXPECT_EQ(BitmapToString(valid_bits, kBitsAfterAppend), expected_bits);
@@ -936,8 +936,8 @@ TEST(FirstTimeBitmapWriter, AppendWordOffsetOverwritesCorrectBitsOnExistingByte)
   auto check_with_preceding = [](const std::string& expected_bits, int64_t offset) {
     std::vector<uint8_t> valid_bits = {0xFF};
     constexpr int64_t kBitsAfterAppend = 8;
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), offset,
-                                           /*length=*/(8 * valid_bits.size()) - offset);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), offset,
+                                             /*length=*/(8 * valid_bits.size()) - offset);
     writer.AppendWord(/*word=*/0xFF, /*number_of_bits=*/kBitsAfterAppend - offset);
     writer.Finish();
     EXPECT_EQ(BitmapToString(valid_bits, kBitsAfterAppend), expected_bits);
@@ -954,8 +954,8 @@ TEST(FirstTimeBitmapWriter, AppendWordOffsetOverwritesCorrectBitsOnExistingByte)
 
 TEST(FirstTimeBitmapWriter, AppendZeroBitsHasNoImpact) {
   std::vector<uint8_t> valid_bits(/*count=*/1, 0);
-  internal::FirstTimeBitmapWriter writer(valid_bits.data(), /*start_offset=*/1,
-                                         /*length=*/valid_bits.size() * 8);
+  internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), /*start_offset=*/1,
+                                           /*length=*/valid_bits.size() * 8);
   writer.AppendWord(/*word=*/0xFF, /*number_of_bits=*/0);
   writer.AppendWord(/*word=*/0xFF, /*number_of_bits=*/0);
   writer.AppendWord(/*word=*/0x01, /*number_of_bits=*/1);
@@ -966,8 +966,8 @@ TEST(FirstTimeBitmapWriter, AppendZeroBitsHasNoImpact) {
 TEST(FirstTimeBitmapWriter, AppendLessThanByte) {
   {
     std::vector<uint8_t> valid_bits(/*count*/ 8, 0);
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), /*start_offset=*/1,
-                                           /*length=*/8);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), /*start_offset=*/1,
+                                             /*length=*/8);
     writer.AppendWord(0xB, 4);
     writer.Finish();
     EXPECT_EQ(BitmapToString(valid_bits, /*bit_count=*/8), "01101000");
@@ -975,8 +975,8 @@ TEST(FirstTimeBitmapWriter, AppendLessThanByte) {
   {
     // Test with all bits initially set.
     std::vector<uint8_t> valid_bits(/*count*/ 8, 0xFF);
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), /*start_offset=*/1,
-                                           /*length=*/8);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), /*start_offset=*/1,
+                                             /*length=*/8);
     writer.AppendWord(0xB, 4);
     writer.Finish();
     EXPECT_EQ(BitmapToString(valid_bits, /*bit_count=*/8), "11101000");
@@ -986,8 +986,8 @@ TEST(FirstTimeBitmapWriter, AppendLessThanByte) {
 TEST(FirstTimeBitmapWriter, AppendByteThenMore) {
   {
     std::vector<uint8_t> valid_bits(/*count*/ 8, 0);
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), /*start_offset=*/0,
-                                           /*length=*/9);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), /*start_offset=*/0,
+                                             /*length=*/9);
     writer.AppendWord(0xC3, 8);
     writer.AppendWord(0x01, 1);
     writer.Finish();
@@ -995,8 +995,8 @@ TEST(FirstTimeBitmapWriter, AppendByteThenMore) {
   }
   {
     std::vector<uint8_t> valid_bits(/*count*/ 8, 0xFF);
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), /*start_offset=*/0,
-                                           /*length=*/9);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), /*start_offset=*/0,
+                                             /*length=*/9);
     writer.AppendWord(0xC3, 8);
     writer.AppendWord(0x01, 1);
     writer.Finish();
@@ -1012,8 +1012,8 @@ TEST(FirstTimeBitmapWriter, AppendWordShiftsBitsCorrectly) {
     ASSERT_GE(offset, 8);
     std::vector<uint8_t> valid_bits(/*count=*/10, preset_buffer_bits ? 0xFF : 0);
     valid_bits[0] = 0x99;
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), offset,
-                                           /*length=*/(9 * sizeof(kPattern)) - offset);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), offset,
+                                             /*length=*/(9 * sizeof(kPattern)) - offset);
     writer.AppendWord(/*word=*/kPattern, /*number_of_bits=*/64);
     writer.Finish();
     EXPECT_EQ(valid_bits[0], 0x99);  // shouldn't get changed.
@@ -1051,15 +1051,15 @@ TEST(TestAppendBitmap, AppendWordOnlyAppropriateBytesWritten) {
 
   uint64_t bitmap = 0x1FF;
   {
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), /*start_offset=*/1,
-                                           /*length=*/(8 * valid_bits.size()) - 1);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), /*start_offset=*/1,
+                                             /*length=*/(8 * valid_bits.size()) - 1);
     writer.AppendWord(bitmap, /*number_of_bits*/ 7);
     writer.Finish();
     EXPECT_THAT(valid_bits, ElementsAreArray(std::vector<uint8_t>{0xFE, 0x00}));
   }
   {
-    internal::FirstTimeBitmapWriter writer(valid_bits.data(), /*start_offset=*/1,
-                                           /*length=*/(8 * valid_bits.size()) - 1);
+    internal::FirstTimeBitmapWriter<> writer(valid_bits.data(), /*start_offset=*/1,
+                                             /*length=*/(8 * valid_bits.size()) - 1);
     writer.AppendWord(bitmap, /*number_of_bits*/ 8);
     writer.Finish();
     EXPECT_THAT(valid_bits, ElementsAreArray(std::vector<uint8_t>{0xFE, 0x03}));
diff --git a/cpp/src/arrow/util/bitmap_writer.h b/cpp/src/arrow/util/bitmap_writer.h
index 65d0d188d7..7a70b16f15 100644
--- a/cpp/src/arrow/util/bitmap_writer.h
+++ b/cpp/src/arrow/util/bitmap_writer.h
@@ -21,6 +21,8 @@
 #include <cstring>
 
 #include "arrow/util/bit_util.h"
+#include "arrow/util/config.h"
+#include "arrow/util/dispatch.h"
 #include "arrow/util/endian.h"
 #include "arrow/util/macros.h"
 
@@ -78,6 +80,7 @@ class BitmapWriter {
   int64_t byte_offset_;
 };
 
+template <DispatchLevel level = ARROW_COMPILE_TIME_DISPATCH_LEVEL>
 class FirstTimeBitmapWriter {
   // Like BitmapWriter, but any bit values *following* the bits written
   // might be clobbered.  It is hence faster than BitmapWriter, and can
diff --git a/cpp/src/arrow/util/config.h.cmake b/cpp/src/arrow/util/config.h.cmake
index 55bc2d0100..1ecb4f39f3 100644
--- a/cpp/src/arrow/util/config.h.cmake
+++ b/cpp/src/arrow/util/config.h.cmake
@@ -36,6 +36,8 @@
 
 #define ARROW_PACKAGE_KIND "@ARROW_PACKAGE_KIND@"
 
+#define ARROW_COMPILE_TIME_DISPATCH_LEVEL ::arrow::internal::DispatchLevel::@ARROW_SIMD_LEVEL@
+
 #cmakedefine ARROW_COMPUTE
 #cmakedefine ARROW_CSV
 #cmakedefine ARROW_CUDA
diff --git a/cpp/src/arrow/util/dispatch.h b/cpp/src/arrow/util/dispatch.h
index fae9293f9e..d6f4dbb028 100644
--- a/cpp/src/arrow/util/dispatch.h
+++ b/cpp/src/arrow/util/dispatch.h
@@ -33,6 +33,7 @@ enum class DispatchLevel : int {
   SSE4_2,
   AVX2,
   AVX512,
+  BMI2,
   NEON,
   MAX
 };
@@ -105,6 +106,8 @@ class DynamicDispatch {
         return cpu_info->IsSupported(CpuInfo::AVX2);
       case DispatchLevel::AVX512:
         return cpu_info->IsSupported(CpuInfo::AVX512);
+      case DispatchLevel::BMI2:
+        return cpu_info->IsSupported(CpuInfo::BMI2);
       default:
         return false;
     }
diff --git a/cpp/src/parquet/encoding.cc b/cpp/src/parquet/encoding.cc
index af0e543c3e..34f0eef3b5 100644
--- a/cpp/src/parquet/encoding.cc
+++ b/cpp/src/parquet/encoding.cc
@@ -348,8 +348,8 @@ class PlainEncoder<BooleanType> : public EncoderImpl, virtual public BooleanEnco
     } else {
       auto n_valid = bit_util::BytesForBits(data.length() - data.null_count());
       PARQUET_THROW_NOT_OK(sink_.Reserve(n_valid));
-      ::arrow::internal::FirstTimeBitmapWriter writer(sink_.mutable_data(),
-                                                      sink_.length(), n_valid);
+      ::arrow::internal::FirstTimeBitmapWriter<> writer(sink_.mutable_data(),
+                                                        sink_.length(), n_valid);
 
       for (int64_t i = 0; i < data.length(); i++) {
         if (data.IsValid(i)) {
diff --git a/cpp/src/parquet/level_comparison_avx2.cc b/cpp/src/parquet/level_comparison_avx2.cc
index b33eb2e295..521cf96520 100644
--- a/cpp/src/parquet/level_comparison_avx2.cc
+++ b/cpp/src/parquet/level_comparison_avx2.cc
@@ -16,7 +16,9 @@
 // under the License.
 
 #define PARQUET_IMPL_NAMESPACE avx2
+#define PARQUET_DISPATCH_LEVEL ::arrow::internal::DispatchLevel::AVX2
 #include "parquet/level_comparison_inc.h"
+#undef PARQUET_DISPATCH_LEVEL
 #undef PARQUET_IMPL_NAMESPACE
 
 namespace parquet {
diff --git a/cpp/src/parquet/level_conversion.cc b/cpp/src/parquet/level_conversion.cc
index ffdca476dd..ab440af95a 100644
--- a/cpp/src/parquet/level_conversion.cc
+++ b/cpp/src/parquet/level_conversion.cc
@@ -28,7 +28,9 @@
 
 #include "parquet/level_comparison.h"
 #define PARQUET_IMPL_NAMESPACE standard
+#define PARQUET_DISPATCH_LEVEL ARROW_COMPILE_TIME_DISPATCH_LEVEL
 #include "parquet/level_conversion_inc.h"
+#undef PARQUET_DISPATCH_LEVEL
 #undef PARQUET_IMPL_NAMESPACE
 
 namespace parquet {
@@ -43,7 +45,7 @@ void DefRepLevelsToListInfo(const int16_t* def_levels, const int16_t* rep_levels
                             int64_t num_def_levels, LevelInfo level_info,
                             ValidityBitmapInputOutput* output, OffsetType* offsets) {
   OffsetType* orig_pos = offsets;
-  optional<::arrow::internal::FirstTimeBitmapWriter> valid_bits_writer;
+  optional<::arrow::internal::FirstTimeBitmapWriter<>> valid_bits_writer;
   if (output->valid_bits) {
     valid_bits_writer.emplace(output->valid_bits, output->valid_bits_offset,
                               output->values_read_upper_bound);
diff --git a/cpp/src/parquet/level_conversion_bmi2.cc b/cpp/src/parquet/level_conversion_bmi2.cc
index 274d54e503..679d01d0c9 100644
--- a/cpp/src/parquet/level_conversion_bmi2.cc
+++ b/cpp/src/parquet/level_conversion_bmi2.cc
@@ -17,7 +17,9 @@
 #include "parquet/level_conversion.h"
 
 #define PARQUET_IMPL_NAMESPACE bmi2
+#define PARQUET_DISPATCH_LEVEL ::arrow::internal::DispatchLevel::BMI2
 #include "parquet/level_conversion_inc.h"
+#undef PARQUET_DISPATCH_LEVEL
 #undef PARQUET_IMPL_NAMESPACE
 
 namespace parquet {
diff --git a/cpp/src/parquet/level_conversion_inc.h b/cpp/src/parquet/level_conversion_inc.h
index 710d2f6237..4b5a9def80 100644
--- a/cpp/src/parquet/level_conversion_inc.h
+++ b/cpp/src/parquet/level_conversion_inc.h
@@ -296,7 +296,10 @@ static constexpr int64_t kExtractBitsSize = 8 * sizeof(extract_bitmap_t);
 template <bool has_repeated_parent>
 int64_t DefLevelsBatchToBitmap(const int16_t* def_levels, const int64_t batch_size,
                                int64_t upper_bound_remaining, LevelInfo level_info,
-                               ::arrow::internal::FirstTimeBitmapWriter* writer) {
+#ifndef PARQUET_DISPATCH_LEVEL
+#error "PARQUET_DISPATCH_LEVEL must be defined"
+#endif
+                               ::arrow::internal::FirstTimeBitmapWriter<PARQUET_DISPATCH_LEVEL>* writer) {
   DCHECK_LE(batch_size, kExtractBitsSize);
 
   // Greater than level_info.def_level - 1 implies >= the def_level
@@ -330,7 +333,7 @@ int64_t DefLevelsBatchToBitmap(const int16_t* def_levels, const int64_t batch_si
 template <bool has_repeated_parent>
 void DefLevelsToBitmapSimd(const int16_t* def_levels, int64_t num_def_levels,
                            LevelInfo level_info, ValidityBitmapInputOutput* output) {
-  ::arrow::internal::FirstTimeBitmapWriter writer(
+  ::arrow::internal::FirstTimeBitmapWriter<PARQUET_DISPATCH_LEVEL> writer(
       output->valid_bits,
       /*start_offset=*/output->valid_bits_offset,
       /*length=*/output->values_read_upper_bound);

asfimport · 2022-05-19T07:39:53Z

Antoine Pitrou / @pitrou:
@kou We can do that for the specific symptoms here. However, a more general solution will have to be found since other files have the same problem: compiling SIMD-specific code which calls into other routines.

asfimport · 2022-05-25T20:35:45Z

Ben Kietzman / @bkietz:
@pitrou The most robust solution I can think of is to avoid linking between objects with differing instruction sets altogether. We'd have something like

$ nm libarrow_compute_avx2.so | grep DefLevelsBitmapSimd
0000000000404ff0 t DefLevelsBitmapSimd

That library would be acquired with dlopen(path, RTLD_LOCAL)/LoadLibrary(path) which would guarantee that any functions like FirstTimeBitmapWriter::* which might have been recompiled with illegal instructions are not available outside libarrow_compute_avx2.so.

asfimport · 2022-05-25T20:50:18Z

Ben Kietzman / @bkietz:
IIUC, we'll still need to pass -mavx2 so that we can include immintrin.h so the attribute described in the ARROW_SPECIALIZED_SIMD_TARGET approach would need to be attached to the {}non{}-SIMD functions to ensure that they're compiled with no special instructions

... or I suppose we could try to declare all the intrinsics manually at function scope

ARROW_SIMD_FUNCTION(avx2) void SimdThing() {
  // inlined from immintrin.h:
  typedef unsigned short __mmask16;
  extern
    __inline
    __mmask16
    __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
        _mm512_int2mask (int __M);
}

asfimport · 2022-05-25T21:19:40Z

Ben Kietzman / @bkietz:
On further investigation, we can include immintrin.h with or without -mavx2 and clang at least will not complain unless the intrinsics are referenced, so

#include <immintrin.h>

[[gnu::target("avx2")]]
void use_simd() {
  __m256i arg;
  _mm256_abs_epi16 (arg);
}

int main() { use_simd(); }

compiles and runs happily without any special compilation flags. Using an attribute like this seems viable provided we can be certain that the modified target isn't transitively applied to functions which might be invoked for the first time inside a SIMD enabled function

asfimport · 2022-07-07T14:55:01Z

Ian Cook / @ianmcook:
@jonkeane this issue is marked as a blocker for 9.0.0. Should this block the release?

asfimport · 2022-07-07T22:48:53Z

Jonathan Keane / @jonkeane:
Last I checked, the homebrew maintainers have said that they will disable all optimization for arrow if we don't get this sorted on our own. So not required if we're ok with that (though we should engage with them on this)

asfimport · 2022-07-08T08:09:56Z

Antoine Pitrou / @pitrou:
Disabling all optimizations for Arrow is brutal. It would be much better to simply disable runtime SIMD optimizations (by passing -DARROW_RUNTIME_SIMD_LEVEL=NONE to CMake, AFAIR).

asfimport · 2022-07-22T10:23:06Z

Raúl Cumplido / @raulcd:
Is this still a blocker?

asfimport · 2022-07-22T11:47:42Z

Antoine Pitrou / @pitrou:
Ideally it would... But there's little chance for it to be fixed in time for 9.0.0.

As I said above, the workaround should be to disable runtime SIMD optimizations on the affected builds. Somehow has to validate that suggestion, though (i.e. someone who's able to reproduce this issue).

asfimport · 2022-07-22T12:29:41Z

Jacob Wujciak / @assignUser:
Looking at [ARROW-15664] and this PR it seems like a workaround has been implemented for homebrew IIUC, so this is still an issue but as the real fix wont happen for 9.0.0 it shouldn't be a blocker anymore?

asfimport · 2022-07-22T12:39:00Z

Antoine Pitrou / @pitrou:
If that was actually accepted by Homebrew then fine.

asfimport · 2022-07-22T16:59:19Z

Jacob Wujciak / @assignUser:
That was my impression: issue and PR in homebrew-core.
Maybe @jonkeane can confirm?

asfimport · 2022-07-22T19:26:30Z

Jonathan Keane / @jonkeane:
Homebrew only accepted that as a temporary workaround and has threatened to turn off optimizations if we don't resolve this. They haven't yet followed through yet, though. Homebrew/homebrew-core#94724 (comment)

asfimport · 2022-07-22T19:27:54Z

Antoine Pitrou / @pitrou:
Note that I suggested a perhaps more acceptable workaround above.

asfimport · 2022-07-25T08:38:09Z

Krisztian Szucs / @kszucs:
@jonkeane can you give an update on this issue?

asfimport · 2022-07-25T21:11:07Z

Krisztian Szucs / @kszucs:
Postponing to 10.0 for now.

asfimport · 2022-07-25T22:50:01Z

Jonathan Keane / @jonkeane:
I have no updates beyond what's discussed above: there are a few approaches, none of them ideal, we need someone to champion this (or risk the homebrew maintainers turning off optimizations on us)

asfimport · 2022-10-06T09:17:09Z

Raúl Cumplido / @raulcd:
This was a blocker for the last release and is still a blocker for the 10.0.0 release. @jonkeane do you know if there has been any move?

asfimport · 2022-10-06T13:06:46Z

Jonathan Keane / @jonkeane:
I thought that @kou was going to take a look at this (or at least the underlying multiple SIMD instruction ordering issue that causes the failures...)

The only update I have is that I continue to run into the segfault in CI for downstream projects I'm working on, so it continues to be an issue for pre-built libarrow on machines like github's macos runners.

asfimport · 2022-10-06T20:59:50Z

Kouhei Sutou / @kou:
Yes. I will fix this before the 10.0.0 release. Sorry for not working on this yet.

asfimport · 2022-10-07T07:05:48Z

Kouhei Sutou / @kou:
Summary of this problem:

Problem:

Parquet module is crashed with -DCMAKE_BUILD_TYPE=MinSizeRel

Why the problem is happened:
We compile the same code (level_conversion_inc.h) multiple times with different optimization flags such as -msse4.2 and -mavx2
The code calls the same function (arrow::internal::FirstTimeBitmapWriter::AppendWord()) that is defined in header file
The called function isn't inlined with -DCMAKE_BUILD_TYPE=MinSizeRel
It generates multiple definitions for the called (not-inlined) function (arrow::internal::FirstTimeBitmapWriter::AppendWord())

Proposed solutions so far:

Force to inline functions that are called from the code that are compiled with SIMD related optimization flags
Restrict SIMD related optimization area to only the target function

For 1., we have two approaches for it:

Use template:
- https://issues.apache.org/jira/browse/ARROW-15678?focusedCommentId=17524280&#comment-17524280 : It's used in our compute module
- https://issues.apache.org/jira/browse/ARROW-15678?focusedCommentId=17539142&#comment-17539142 : A patch for this case
Use \_\_attribute\_\_((always\_inline))
- ARROW-15678: [C++][CI] a crossbow job with MinRelSize enabled #12928 (comment)
  
  This approach seems ad-hoc. We need to apply this approach to called functions when we find this problem in other codes.
  
  For 2., we have one approach for it:
Use target attribute:
- https://issues.apache.org/jira/browse/ARROW-15678?focusedCommentId=17526053#comment-17526053 : A sample code that shows this approach
  
  (I'm not sure that this approach is portable. For example, can we use this approach with MSVC?)

asfimport · 2022-10-07T07:18:56Z

Kouhei Sutou / @kou:
I propose one more approach for the proposed solution 1.:

How about always enabling inline optimization for SIMD optimized compile units (level_conversion_bmi2.cc) even when an user specifies -DCMAKE_BUILD_TYPE=MinSizeRel?

It may increases binary size but it may be better that SIMD related code prioritizes performance than binary size.

We don't need to write manual template/\_\_attribute\_\_((always\_inline)) s with this approach.

asfimport · 2022-10-07T07:44:18Z

Antoine Pitrou / @pitrou:
Well, I don't think we can force the compiler to inline everything.

asfimport · 2022-10-07T07:56:05Z

Kouhei Sutou / @kou:
We don't have this problem with -DCMAKE_BUILD_TYPE=Release. So it may work with most cases.

asfimport · 2022-10-10T21:18:56Z

Kouhei Sutou / @kou:
Issue resolved by pull request 14342
#14342

### Rationale for this change Summary of this problem: #31132 (comment) Why this problem is happen again? Because I removed `ENV["HOMEBREW_OPTIMIZATION_LEVEL"] = "O2"` in #36583. The solution we chose by #14342 was forcing to use `-O2` for SIMD related code. It works for `-DCMAKE_BUILD_TYPE=MinSizeRel` but it doesn't work for Homebrew. Because Homebrew's CC https://github.com/Homebrew/brew/blob/master/Library/Homebrew/shims/super/cc forces to use the same `-O` flag. The default is `-Os`. If we specify `-O2`, Homebrew's CC replaces it to `-Os`. If we use `ENV["HOMEBREW_OPTIMIZATION_LEVEL"] = "O2"`, Homebrew's CC always use `-O2`. So the solution we chose by #14342 isn't used for Homebrew. But Homebrew thinks that `ENV["HOMEBREW_OPTIMIZATION_LEVEL"] = "O2"` is a workaround. So we need another solution for Homebrew. Here are candidate solutions: 1. `-DARROW_RUNTIME_SIMD_LEVEL=NONE` 2. Remove `ENV.runtime_cpu_detection if Hardware::CPU.intel?` "1. `-DARROW_RUNTIME_SIMD_LEVEL=NONE`" works because we don't use the runtime SIMD dispatch feature (the problematic feature) entirely. "2. Remove `ENV.runtime_cpu_detection if Hardware::CPU.intel?`" works but I don't know why... If `ENV.runtime_cpu_detection` is called, Homebrew's CC stops replacing `-march=*`. If we call `ENV.runtime_cpu_detection`, `-march=haswell` is used for AVX2 related code and `-march=skylake-avx512` is used for AVX512 including BMI2 related code. If we don't call `ENV.runtime_cpu_detection`, `-march=nehalem` is always used. (Note that SIMD related flags such as `-mbmi2` aren't removed by Homebrew's CC. So I think that SIMD is enabled.) I don't know why but "the one-definition-rule violation" (see the summary for details: #31132 (comment) ) isn't happen. FYI: CPU info for GitHub Actions macOS hosted-runner: ```console $ sysctl hw.optional machdep.cpu hw.optional.adx: 0 hw.optional.aes: 1 hw.optional.avx1_0: 1 hw.optional.avx2_0: 0 hw.optional.avx512bw: 0 hw.optional.avx512cd: 0 hw.optional.avx512dq: 0 hw.optional.avx512f: 0 hw.optional.avx512ifma: 0 hw.optional.avx512vbmi: 0 hw.optional.avx512vl: 0 hw.optional.bmi1: 0 hw.optional.bmi2: 0 hw.optional.enfstrg: 0 hw.optional.f16c: 1 hw.optional.floatingpoint: 1 hw.optional.fma: 0 hw.optional.hle: 0 hw.optional.mmx: 1 hw.optional.mpx: 0 hw.optional.rdrand: 1 hw.optional.rtm: 0 hw.optional.sgx: 0 hw.optional.sse: 1 hw.optional.sse2: 1 hw.optional.sse3: 1 hw.optional.sse4_1: 1 hw.optional.sse4_2: 1 hw.optional.supplementalsse3: 1 hw.optional.x86_64: 1 machdep.cpu.address_bits.physical: 43 machdep.cpu.address_bits.virtual: 48 machdep.cpu.arch_perf.events: 127 machdep.cpu.arch_perf.events_number: 7 machdep.cpu.arch_perf.fixed_number: 0 machdep.cpu.arch_perf.fixed_width: 0 machdep.cpu.arch_perf.number: 4 machdep.cpu.arch_perf.version: 1 machdep.cpu.arch_perf.width: 48 machdep.cpu.cache.L2_associativity: 8 machdep.cpu.cache.linesize: 64 machdep.cpu.cache.size: 256 machdep.cpu.mwait.extensions: 3 machdep.cpu.mwait.linesize_max: 4096 machdep.cpu.mwait.linesize_min: 64 machdep.cpu.mwait.sub_Cstates: 16 machdep.cpu.thermal.ACNT_MCNT: 0 machdep.cpu.thermal.core_power_limits: 0 machdep.cpu.thermal.dynamic_acceleration: 0 machdep.cpu.thermal.energy_policy: 0 machdep.cpu.thermal.fine_grain_clock_mod: 0 machdep.cpu.thermal.hardware_feedback: 0 machdep.cpu.thermal.invariant_APIC_timer: 1 machdep.cpu.thermal.package_thermal_intr: 0 machdep.cpu.thermal.sensor: 0 machdep.cpu.thermal.thresholds: 0 machdep.cpu.tlb.data.small: 64 machdep.cpu.tlb.inst.large: 8 machdep.cpu.tlb.inst.small: 64 machdep.cpu.tlb.shared: 512 machdep.cpu.tsc_ccc.denominator: 0 machdep.cpu.tsc_ccc.numerator: 0 machdep.cpu.xsave.extended_state: 7 832 832 0 machdep.cpu.xsave.extended_state1: 0 0 0 0 machdep.cpu.brand: 0 machdep.cpu.brand_string: Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz machdep.cpu.core_count: 3 machdep.cpu.cores_per_package: 4 machdep.cpu.extfamily: 0 machdep.cpu.extfeature_bits: 4967106816 machdep.cpu.extfeatures: SYSCALL XD EM64T LAHF RDTSCP TSCI machdep.cpu.extmodel: 3 machdep.cpu.family: 6 machdep.cpu.feature_bits: 18427078393948011519 machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH MMX FXSR SSE SSE2 SS HTT SSE3 PCLMULQDQ MON VMX SSSE3 CX16 SSE4.1 SSE4.2 x2APIC POPCNT AES VMM PCID XSAVE OSXSAVE TSCTMR AVX1.0 RDRAND F16C machdep.cpu.leaf7_feature_bits: 643 0 machdep.cpu.leaf7_feature_bits_edx: 3154117632 machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET SMEP ERMS MDCLEAR IBRS STIBP L1DF ACAPMSR SSBD machdep.cpu.logical_per_package: 4 machdep.cpu.max_basic: 13 machdep.cpu.max_ext: 2147483656 machdep.cpu.microcode_version: 1070 machdep.cpu.model: 58 machdep.cpu.processor_flag: 0 machdep.cpu.signature: 198313 machdep.cpu.stepping: 9 machdep.cpu.thread_count: 3 machdep.cpu.vendor: GenuineIntel ``` ### What changes are included in this PR? "1. `-DARROW_RUNTIME_SIMD_LEVEL=NONE`" because it's straightforward and "2. Remove `ENV.runtime_cpu_detection if Hardware::CPU.intel?`" may also disable runtime SIMD dispatch implicitly. This also adds the following debug information for easy to debug in future: * CPU information for GitHub Actions runner * Homebrew's build logs ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * Closes: #36685 Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>

…ache#36705) ### Rationale for this change Summary of this problem: apache#31132 (comment) Why this problem is happen again? Because I removed `ENV["HOMEBREW_OPTIMIZATION_LEVEL"] = "O2"` in apache#36583. The solution we chose by apache#14342 was forcing to use `-O2` for SIMD related code. It works for `-DCMAKE_BUILD_TYPE=MinSizeRel` but it doesn't work for Homebrew. Because Homebrew's CC https://github.com/Homebrew/brew/blob/master/Library/Homebrew/shims/super/cc forces to use the same `-O` flag. The default is `-Os`. If we specify `-O2`, Homebrew's CC replaces it to `-Os`. If we use `ENV["HOMEBREW_OPTIMIZATION_LEVEL"] = "O2"`, Homebrew's CC always use `-O2`. So the solution we chose by apache#14342 isn't used for Homebrew. But Homebrew thinks that `ENV["HOMEBREW_OPTIMIZATION_LEVEL"] = "O2"` is a workaround. So we need another solution for Homebrew. Here are candidate solutions: 1. `-DARROW_RUNTIME_SIMD_LEVEL=NONE` 2. Remove `ENV.runtime_cpu_detection if Hardware::CPU.intel?` "1. `-DARROW_RUNTIME_SIMD_LEVEL=NONE`" works because we don't use the runtime SIMD dispatch feature (the problematic feature) entirely. "2. Remove `ENV.runtime_cpu_detection if Hardware::CPU.intel?`" works but I don't know why... If `ENV.runtime_cpu_detection` is called, Homebrew's CC stops replacing `-march=*`. If we call `ENV.runtime_cpu_detection`, `-march=haswell` is used for AVX2 related code and `-march=skylake-avx512` is used for AVX512 including BMI2 related code. If we don't call `ENV.runtime_cpu_detection`, `-march=nehalem` is always used. (Note that SIMD related flags such as `-mbmi2` aren't removed by Homebrew's CC. So I think that SIMD is enabled.) I don't know why but "the one-definition-rule violation" (see the summary for details: apache#31132 (comment) ) isn't happen. FYI: CPU info for GitHub Actions macOS hosted-runner: ```console $ sysctl hw.optional machdep.cpu hw.optional.adx: 0 hw.optional.aes: 1 hw.optional.avx1_0: 1 hw.optional.avx2_0: 0 hw.optional.avx512bw: 0 hw.optional.avx512cd: 0 hw.optional.avx512dq: 0 hw.optional.avx512f: 0 hw.optional.avx512ifma: 0 hw.optional.avx512vbmi: 0 hw.optional.avx512vl: 0 hw.optional.bmi1: 0 hw.optional.bmi2: 0 hw.optional.enfstrg: 0 hw.optional.f16c: 1 hw.optional.floatingpoint: 1 hw.optional.fma: 0 hw.optional.hle: 0 hw.optional.mmx: 1 hw.optional.mpx: 0 hw.optional.rdrand: 1 hw.optional.rtm: 0 hw.optional.sgx: 0 hw.optional.sse: 1 hw.optional.sse2: 1 hw.optional.sse3: 1 hw.optional.sse4_1: 1 hw.optional.sse4_2: 1 hw.optional.supplementalsse3: 1 hw.optional.x86_64: 1 machdep.cpu.address_bits.physical: 43 machdep.cpu.address_bits.virtual: 48 machdep.cpu.arch_perf.events: 127 machdep.cpu.arch_perf.events_number: 7 machdep.cpu.arch_perf.fixed_number: 0 machdep.cpu.arch_perf.fixed_width: 0 machdep.cpu.arch_perf.number: 4 machdep.cpu.arch_perf.version: 1 machdep.cpu.arch_perf.width: 48 machdep.cpu.cache.L2_associativity: 8 machdep.cpu.cache.linesize: 64 machdep.cpu.cache.size: 256 machdep.cpu.mwait.extensions: 3 machdep.cpu.mwait.linesize_max: 4096 machdep.cpu.mwait.linesize_min: 64 machdep.cpu.mwait.sub_Cstates: 16 machdep.cpu.thermal.ACNT_MCNT: 0 machdep.cpu.thermal.core_power_limits: 0 machdep.cpu.thermal.dynamic_acceleration: 0 machdep.cpu.thermal.energy_policy: 0 machdep.cpu.thermal.fine_grain_clock_mod: 0 machdep.cpu.thermal.hardware_feedback: 0 machdep.cpu.thermal.invariant_APIC_timer: 1 machdep.cpu.thermal.package_thermal_intr: 0 machdep.cpu.thermal.sensor: 0 machdep.cpu.thermal.thresholds: 0 machdep.cpu.tlb.data.small: 64 machdep.cpu.tlb.inst.large: 8 machdep.cpu.tlb.inst.small: 64 machdep.cpu.tlb.shared: 512 machdep.cpu.tsc_ccc.denominator: 0 machdep.cpu.tsc_ccc.numerator: 0 machdep.cpu.xsave.extended_state: 7 832 832 0 machdep.cpu.xsave.extended_state1: 0 0 0 0 machdep.cpu.brand: 0 machdep.cpu.brand_string: Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz machdep.cpu.core_count: 3 machdep.cpu.cores_per_package: 4 machdep.cpu.extfamily: 0 machdep.cpu.extfeature_bits: 4967106816 machdep.cpu.extfeatures: SYSCALL XD EM64T LAHF RDTSCP TSCI machdep.cpu.extmodel: 3 machdep.cpu.family: 6 machdep.cpu.feature_bits: 18427078393948011519 machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH MMX FXSR SSE SSE2 SS HTT SSE3 PCLMULQDQ MON VMX SSSE3 CX16 SSE4.1 SSE4.2 x2APIC POPCNT AES VMM PCID XSAVE OSXSAVE TSCTMR AVX1.0 RDRAND F16C machdep.cpu.leaf7_feature_bits: 643 0 machdep.cpu.leaf7_feature_bits_edx: 3154117632 machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET SMEP ERMS MDCLEAR IBRS STIBP L1DF ACAPMSR SSBD machdep.cpu.logical_per_package: 4 machdep.cpu.max_basic: 13 machdep.cpu.max_ext: 2147483656 machdep.cpu.microcode_version: 1070 machdep.cpu.model: 58 machdep.cpu.processor_flag: 0 machdep.cpu.signature: 198313 machdep.cpu.stepping: 9 machdep.cpu.thread_count: 3 machdep.cpu.vendor: GenuineIntel ``` ### What changes are included in this PR? "1. `-DARROW_RUNTIME_SIMD_LEVEL=NONE`" because it's straightforward and "2. Remove `ENV.runtime_cpu_detection if Hardware::CPU.intel?`" may also disable runtime SIMD dispatch implicitly. This also adds the following debug information for easy to debug in future: * CPU information for GitHub Actions runner * Homebrew's build logs ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * Closes: apache#36685 Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>

asfimport closed this as completed Oct 10, 2022

asfimport assigned kou Jan 11, 2023

asfimport mentioned this issue Jan 11, 2023

[C++] parquet reader Segfaults with illegal SIMD instruction #31119

Closed

westonpace mentioned this issue Jul 14, 2023

[R][C++] Homebrew nightly test failing with illegal opcode #36685

Closed

This was referenced Jul 26, 2023

GH-36685: [R][C++] Fix illegal opcode failure with Homebrew #36705

Merged

[C++][SIMD] Avoid one-definition-rule violation of arrow::internal::BitmapWriter without depending on -O2 #36902

Open

[C++][CI] a crossbow job with MinRelSize enabled #31132

[C++][CI] a crossbow job with MinRelSize enabled #31132

Comments

asfimport commented Feb 14, 2022 • edited Loading

Related issues:

PRs and other links:

asfimport commented Mar 2, 2022

asfimport commented Apr 17, 2022

asfimport commented Apr 19, 2022

asfimport commented Apr 19, 2022

asfimport commented Apr 21, 2022

asfimport commented Apr 21, 2022

asfimport commented Apr 21, 2022

asfimport commented Apr 21, 2022

asfimport commented Apr 21, 2022

asfimport commented May 18, 2022

asfimport commented May 19, 2022

asfimport commented May 19, 2022

asfimport commented May 25, 2022

asfimport commented May 25, 2022

asfimport commented May 25, 2022

asfimport commented Jul 7, 2022

asfimport commented Jul 7, 2022

asfimport commented Jul 8, 2022

asfimport commented Jul 22, 2022

asfimport commented Jul 22, 2022

asfimport commented Jul 22, 2022

asfimport commented Jul 22, 2022

asfimport commented Jul 22, 2022

asfimport commented Jul 22, 2022

asfimport commented Jul 22, 2022

asfimport commented Jul 25, 2022

asfimport commented Jul 25, 2022

asfimport commented Jul 25, 2022

asfimport commented Oct 6, 2022

asfimport commented Oct 6, 2022

asfimport commented Oct 6, 2022

asfimport commented Oct 7, 2022

asfimport commented Oct 7, 2022

asfimport commented Oct 7, 2022

asfimport commented Oct 7, 2022

asfimport commented Oct 10, 2022

asfimport commented Feb 14, 2022 •

edited

Loading