Merge branch 'master' into Hossein/FastHierVisitor

hosseinmoein · Nov 13, 2024 · 3a90b94
2 parents 101e81c + 6bf1500

Showing 10 changed files with 1,836 additions and 38 deletions.
diff --git a/README.md b/README.md
@@ -56,7 +56,7 @@ I have followed a few <B>principles in this library</B>:<BR>
 
 ### Performance
 You have probably heard of Polars DataFrame. It is implemented in Rust and ported with zero-overhead to Python (as long as you don’t have a loop). I have been asked by many people to write a comparison for <B>DataFrame vs. Polars</B>. So, I finally found some time to learn a bit about Polars and write a very simple benchmark.<BR>
-I wrote the following identical programs for both Polars and C++ DataFrame (and Pandas). I used Polars version: 0.19.14 (Pandas version: 1.5.3, Numpy version: 1.24.2). And I used C++20 clang compiler with -O3 option. I ran both on my, somewhat outdated, MacBook Pro.<BR>
+I wrote the following identical programs for both Polars and C++ DataFrame (and Pandas). I used Polars version: 0.19.14 (Pandas version: 1.5.3, Numpy version: 1.24.2). And I used C++20 clang compiler with -O3 option. I ran both on my, somewhat outdated, MacBook Pro (Intel chip, 96GB RAM).<BR>
 In both cases, I created a dataframe with 3 random columns. The C++ DataFrame also required an additional index column of the same size. Polars doesn’t believe in index columns (that has its own pros and cons. I am not going through it here).
 Each program has three identical parts. First it generates and populates 3 columns with 300m random numbers each (in case of C++ DataFrame, it must also generate a sequential index column of the same size). That is the part I am _not_ interested in. In the second part, it calculates the mean of the first column, the variance of the second column, and the Pearson correlation of the second and third columns. In the third part, it does a select (or filter as Polars calls it) on one of the columns.
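Reviewer note: the second part of the benchmark described above (mean, variance, Pearson correlation) can be sketched in plain standard C++ as below. This is only an illustration of the three statistics, assuming sample variance (division by n - 1). The helper names `col_mean`, `col_variance` and `pearson_corr` are hypothetical; they are neither the C++ DataFrame API nor the benchmark code from the README.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helpers over std::vector<double>. They are not part of
// C++ DataFrame and not the README benchmark itself.

// Arithmetic mean of a non-empty column.
double col_mean(const std::vector<double> &col)  {
    assert(! col.empty());

    double  sum = 0.0;

    for (const double v : col)  sum += v;
    return (sum / static_cast<double>(col.size()));
}

// Sample variance (divides by n - 1); needs at least two values.
double col_variance(const std::vector<double> &col)  {
    assert(col.size() >= 2);

    const double    m = col_mean(col);
    double          sum_sq = 0.0;

    for (const double v : col)  sum_sq += (v - m) * (v - m);
    return (sum_sq / static_cast<double>(col.size() - 1));
}

// Pearson correlation of two equally sized columns.
// The result is undefined (division by zero) if either column is constant.
double pearson_corr(const std::vector<double> &x,
                    const std::vector<double> &y)  {
    assert(x.size() == y.size() && x.size() >= 2);

    const double    mx = col_mean(x);
    const double    my = col_mean(y);
    double          sxy = 0.0;
    double          sxx = 0.0;
    double          syy = 0.0;

    for (std::size_t i = 0; i < x.size(); ++i)  {
        const double    dx = x[i] - mx;
        const double    dy = y[i] - my;

        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    return (sxy / std::sqrt(sxx * syy));
}
```

These loops are single-threaded and make several passes over the data, so they say nothing about the timings quoted in the README; they only pin down what is being computed.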
 

diff --git a/docs/HTML/DataFrame.html b/docs/HTML/DataFrame.html
@@ -622,7 +622,7 @@ <H2 ID="2"><font color="blue">API Reference with code samples <font size="+4">&#
     </tr>
 
     <tr class="item" onmouseover="this.style.backgroundColor='#ffff66';" onmouseout="this.style.backgroundColor='#d4e3e5';">
-      <td title="These are other functionalities of DataFrame" style="text-align:center;background-color:LightGrey;color:DarkBlue">Gears &nbsp;&nbsp; <font size="+3">&#x2699;</font></td>
+      <td title="These are other functionalities of DataFrame" style="text-align:center;background-color:LightGrey;color:DarkBlue">Gears &amp; Stuff &nbsp;&nbsp; <font size="+3">&#x2699;</font></td>
     </tr>
 
     <tr class="item" onmouseover="this.style.backgroundColor='#ffff66';" onmouseout="this.style.backgroundColor='#d4e3e5';">
@@ -946,6 +946,14 @@ <H2 ID="2"><font color="blue">API Reference with code samples <font size="+4">&#
       <td title="Calculates the diff between shifted values">struct <a href="https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/DiffVisitor.html">DiffVisitor</a>{}</td>
     </tr>
 
+    <tr class="item" onmouseover="this.style.backgroundColor='#ffff66';" onmouseout="this.style.backgroundColor='#d4e3e5';">
+      <td title="Gives you the first dataitem in the given column">struct <a href="https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/FirstVisitor.html">FirstVisitor</a>{}</td>
+    </tr>
+
+    <tr class="item" onmouseover="this.style.backgroundColor='#ffff66';" onmouseout="this.style.backgroundColor='#d4e3e5';">
+      <td title="Gives you the last dataitem in the given column">struct <a href="https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/FirstVisitor.html">LastVisitor</a>{}</td>
+    </tr>
+
     <tr class="item" onmouseover="this.style.backgroundColor='#ffff66';" onmouseout="this.style.backgroundColor='#d4e3e5';">
       <td title="Calculates product">struct <a href="https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/ProdVisitor.html">ProdVisitor</a>{}</td>
     </tr>

diff --git a/docs/HTML/FirstVisitor.html b/docs/HTML/FirstVisitor.html
diff --git a/docs/HTML/self_contained.html b/docs/HTML/self_contained.html
@@ -51,7 +51,7 @@
   It also has some disadvantages:
   <UL>
     <LI>There might be functionalities that are hard/time-consuming to implement that are already there</LI>
-	<LI>If you find a battle-test library, the debugging is already done for you</LI>
+	<LI>If you find a battle-tested library, the debugging is already done for you</LI>
 	<LI>There might be industry-wide standards/trends that you want to follow by using a reputed library</LI>
   </UL>
   <BR>

diff --git a/include/DataFrame/DataFrameStatsVisitors.h b/include/DataFrame/DataFrameStatsVisitors.h
@@ -6382,16 +6382,12 @@ struct  LinearFitVisitor  {
                 const H &x_begin, const H &x_end,
                 const H &y_begin, const H &y_end)  {
 
-        const size_type col_s = std::distance(x_begin, x_end);
+        const size_type col_s =
+            std::min(std::distance(x_begin, x_end),
+                     std::distance(y_begin, y_end));
         const auto      thread_level = (col_s < ThreadPool::MUL_THR_THHOLD)
             ? 0L : ThreadGranularity::get_thread_level();
 
-#ifdef HMDF_SANITY_EXCEPTIONS
-        if (col_s != size_type(std::distance(y_begin, y_end)))
-            throw DataFrameError("LinearFitVisitor: two columns must be "
-                                 "of equal sizes");
-#endif // HMDF_SANITY_EXCEPTIONS
-
         value_type  sum_x { 0 };   // Sum of all observed x
         value_type  sum_y { 0 };   // Sum of all observed y
         value_type  sum_x2 { 0 };  // Sum of all observed x squared
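Reviewer note: as far as this hunk shows, a length mismatch between the x and y ranges no longer throws under `HMDF_SANITY_EXCEPTIONS`; the fit is computed over the shorter of the two ranges. The sketch below illustrates that behavior with an ordinary least-squares fit. `linear_fit` is a hypothetical free function written for this note, not the library's `LinearFitVisitor`, and it ignores threading entirely.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical sketch, not the library's LinearFitVisitor.
// Returns {slope, intercept} of the least-squares line
// y = slope * x + intercept, using only the first
// min(length of x, length of y) pairs.
// The slope is undefined (division by zero) if all used x values are equal.
template<typename Iter>
std::pair<double, double>
linear_fit(Iter x_begin, Iter x_end, Iter y_begin, Iter y_end)  {
    const auto  n = std::min(std::distance(x_begin, x_end),
                             std::distance(y_begin, y_end));

    assert(n >= 2);

    double  sum_x = 0.0;   // Sum of all used x
    double  sum_y = 0.0;   // Sum of all used y
    double  sum_x2 = 0.0;  // Sum of all used x squared
    double  sum_xy = 0.0;  // Sum of all used x * y

    for (auto i = n; i > 0; --i, ++x_begin, ++y_begin)  {
        const double    x = *x_begin;
        const double    y = *y_begin;

        sum_x += x;
        sum_y += y;
        sum_x2 += x * x;
        sum_xy += x * y;
    }

    const double    dn = static_cast<double>(n);
    const double    slope =
        (dn * sum_xy - sum_x * sum_y) / (dn * sum_x2 - sum_x * sum_x);
    const double    intercept = (sum_y - slope * sum_x) / dn;

    return { slope, intercept };
}

// Convenience overload for whole vectors.
inline std::pair<double, double>
linear_fit(const std::vector<double> &x, const std::vector<double> &y)  {
    return (linear_fit(x.begin(), x.end(), y.begin(), y.end()));
}
```

With five x values and four y values the fifth x is silently ignored. Callers that relied on the removed exception to detect mismatched columns would now need to compare the sizes themselves.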
@@ -7543,30 +7539,32 @@ is_normal(const V &column, double epsl, bool check_for_standard)  {
     svisit.post();
 
     const value_type    mean = static_cast<value_type>(svisit.get_mean());
-    const value_type    std = static_cast<value_type>(svisit.get_std());
-    const value_type    high_band_1 = static_cast<value_type>(mean + std);
-    const value_type    low_band_1 = static_cast<value_type>(mean - std);
+    const value_type    stdev = static_cast<value_type>(svisit.get_std());
+    const value_type    high_band_1 = static_cast<value_type>(mean + stdev);
+    const value_type    low_band_1 = static_cast<value_type>(mean - stdev);
     double              count_1 = 0.0;
     const value_type    high_band_2 =
-        static_cast<value_type>(mean + std * 2.0);
-    const value_type    low_band_2 = static_cast<value_type>(mean - std * 2.0);
+        static_cast<value_type>(mean + stdev * 2.0);
+    const value_type    low_band_2 =
+        static_cast<value_type>(mean - stdev * 2.0);
     double              count_2 = 0.0;
     const value_type    high_band_3 =
-        static_cast<value_type>(mean + std * 3.0);
-    const value_type    low_band_3 = static_cast<value_type>(mean - std * 3.0);
+        static_cast<value_type>(mean + stdev * 3.0);
+    const value_type    low_band_3 =
+        static_cast<value_type>(mean - stdev * 3.0);
     double              count_3 = 0.0;
 
-    for (auto citer : column) [[likely]]  {
-        if (citer >= low_band_1 && citer < high_band_1)  {
+    for (const auto &val : column) [[likely]]  {
+        if (val >= low_band_1 && val < high_band_1)  {
             count_3 += 1;
             count_2 += 1;
             count_1 += 1;
         }
-        else if (citer >= low_band_2 && citer < high_band_2)  {
+        else if (val >= low_band_2 && val < high_band_2)  {
             count_3 += 1;
             count_2 += 1;
         }
-        else if (citer >= low_band_3 && citer < high_band_3)  {
+        else if (val >= low_band_3 && val < high_band_3)  {
             count_3 += 1;
         }
     }
@@ -7578,7 +7576,7 @@ is_normal(const V &column, double epsl, bool check_for_standard)  {
         std::fabs((count_3 / col_s) - 0.997) <= epsl)  {
         if (check_for_standard)
             return (std::fabs(mean - 0) <= epsl &&
-                    std::fabs(std - 1.0) <= epsl);
+                    std::fabs(stdev - 1.0) <= epsl);
         return (true);
     }
     return (false);
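Reviewer note: the loop in this hunk counts how many values fall within one, two and three standard deviations of the mean, and the visible condition compares the three-sigma fraction against 0.997. A minimal standalone sketch of that kind of empirical-rule test follows. It assumes the one- and two-sigma targets are the usual 0.68 and 0.95 (those lines are outside the hunk) and uses the sample standard deviation. `looks_normal` is a hypothetical name, not the library's `is_normal`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch of an empirical-rule (68-95-99.7) check.
// The 0.68 and 0.95 targets and the sample standard deviation are
// assumptions; only the 0.997 comparison is visible in the hunk.
bool looks_normal(const std::vector<double> &col, double epsl)  {
    assert(epsl >= 0.0);

    const std::size_t   col_s = col.size();

    if (col_s < 2)  return (false);

    const double    n = static_cast<double>(col_s);
    double          sum = 0.0;

    for (const double val : col)  sum += val;

    const double    mean = sum / n;
    double          sum_sq = 0.0;

    for (const double val : col)  sum_sq += (val - mean) * (val - mean);

    const double    stdev = std::sqrt(sum_sq / (n - 1.0));
    double          count_1 = 0.0;
    double          count_2 = 0.0;
    double          count_3 = 0.0;

    // Half-open bands [mean - k * stdev, mean + k * stdev), as in the hunk
    for (const double val : col)  {
        const double    dist = val - mean;

        if (dist >= -stdev && dist < stdev)  count_1 += 1;
        if (dist >= -2.0 * stdev && dist < 2.0 * stdev)  count_2 += 1;
        if (dist >= -3.0 * stdev && dist < 3.0 * stdev)  count_3 += 1;
    }

    return (std::fabs((count_1 / n) - 0.68) <= epsl &&
            std::fabs((count_2 / n) - 0.95) <= epsl &&
            std::fabs((count_3 / n) - 0.997) <= epsl);
}
```

A check of this kind is a coarse heuristic: other bell-shaped distributions can produce similar band fractions, so it is best read as a quick screen rather than a statistical test.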
@@ -7597,28 +7595,30 @@ is_lognormal(const V &column, double epsl)  {
     StatsVisitor<value_type, int>   log_visit;
 
     svisit.pre();
-    for (auto citer : column) [[likely]]  {
-        svisit(dummy_idx, static_cast<value_type>(std::log(citer)));
-        log_visit(dummy_idx, citer);
+    for (auto val : column) [[likely]]  {
+        svisit(dummy_idx, static_cast<value_type>(std::log(val)));
+        log_visit(dummy_idx, val);
     }
     svisit.post();
 
     const value_type    mean = static_cast<value_type>(svisit.get_mean());
-    const value_type    std = static_cast<value_type>(svisit.get_std());
-    const value_type    high_band_1 = static_cast<value_type>(mean + std);
-    const value_type    low_band_1 = static_cast<value_type>(mean - std);
+    const value_type    stdev = static_cast<value_type>(svisit.get_std());
+    const value_type    high_band_1 = static_cast<value_type>(mean + stdev);
+    const value_type    low_band_1 = static_cast<value_type>(mean - stdev);
     double              count_1 = 0.0;
     const value_type    high_band_2 =
-        static_cast<value_type>(mean + std * 2.0);
-    const value_type    low_band_2 = static_cast<value_type>(mean - std * 2.0);
+        static_cast<value_type>(mean + stdev * 2.0);
+    const value_type    low_band_2 =
+        static_cast<value_type>(mean - stdev * 2.0);
     double              count_2 = 0.0;
     const value_type    high_band_3 =
-        static_cast<value_type>(mean + std * 3.0);
-    const value_type    low_band_3 = static_cast<value_type>(mean - std * 3.0);
+        static_cast<value_type>(mean + stdev * 3.0);
+    const value_type    low_band_3 =
+        static_cast<value_type>(mean - stdev * 3.0);
     double              count_3 = 0.0;
 
-    for (auto citer : column) [[likely]]  {
-        const auto  log_val = std::log(citer);
+    for (const auto &val : column) [[likely]]  {
+        const auto  log_val = std::log(val);
 
         if (log_val >= low_band_1 && log_val < high_band_1)  {
             count_3 += 1;