datatable.options.nthreads may need to be set manually under some situations #3329

lmullany · 2022-08-04T20:30:52Z

lmullany
Aug 4, 2022

My apologies up front if this is too vague, but I am trying to understand when setting a key on a datatable Frame is slow. Specifically, I have a three column Frame, df with the following stypes: stype.str32, stype.date32, stype.int32. The Frame is not that big, only 400,000 rows, and the first three rows look like this:

   | a      b               c
   | str32  date32      int32
-- + -----  ----------  -----
 0 | M29    2014-10-01     65
 1 | M60    2014-10-01     15
 2 | M72    2014-10-01      6
[3 rows x 3 columns]

However, when I try to set a key on a and b, using the following:

df.key = ['a','b']

it churns away for more than 60 seconds.. Anyone have any suggestions, as to under what conditions makes this operation particulary slow?

On the same machine using R data.table, the same operation takes no time at all

> system.time(setkey(df,a,b))
   user  system elapsed 
  0.035   0.001   0.006

Answered by lmullany

Aug 5, 2022

Solution: Make sure that the number of threads being detected by datatable upon initialization does not exceed the actual maximum available. In some situations, the number for cpus on the system (say extracted from /proc/cpu_info for example, will be more than are actually available to the job (say on a compute cluster where the number of cpus provided to the job is less than that on the node).

In such cases, extract the actual number available and set using datatable.options.nthreads=x, where x is the equal to or less than the actual number available.

View full answer

oleksiyskononenko · 2022-08-04T21:14:56Z

oleksiyskononenko
Aug 4, 2022

To set a key on a set of columns, datatable invokes a group by operation on those columns: https://github.com/h2oai/datatable/blob/main/src/core/frame/key.cc#L118 Grouping two columns having 400K rows is a pretty expensive operation, because at the end it requires sorting of one string and one integer column. Note, that internally date32 is stored as int32 (time64 as int64), meaning that setting a key on a date32 column vs int32 column should not be much different in terms of performance.

From what I see in the R's data.table documentation, setkey() will also sort the provided columns. I'm not an expert in R's data.table, but don't think it is possible to sort 400K rows instantly, so there is either some lazy evaluation involved or the columns were already sorted before setting the key up.

Btw, can you share your data and platform information? My feeling that sorting of two columns should not take 60s anyways. Also, what are the units for the R's data.table timing? If that's in seconds, I guess Python datatable should demonstrate similar performance. At least this is what I observe locally when I group by randomly generated 400K x 2 frame.

0 replies

lmullany · 2022-08-04T21:59:01Z

lmullany
Aug 4, 2022
Author

Thanks much for your time @oleksiyskononenko , I've figured out that it is most definitely a platform issue.. On a similar machine, I'm able to set the key (good question, but not, it is not already sorted), using datatable in a time frame that is similar to what I'm seeing with R.

%%timeit
from datatable import dt
df = dt.fread("df.csv")
df.key = ['a','b']

51.3 ms ± 5.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

There are some import differences between the machines.

The slower performing has 32 cores, is running RedHat Centos 7, and is running data.table 1.0.0
The faster performing as 24 cores is running Ubuntu 18.04, and is running the development version.

I would try to install the development version from github on the slower performing, but I can't get it to compile.. the gcc version perhaps is too old? 4.8.5?

2 replies

oleksiyskononenko Aug 4, 2022

I see, try to play with the number of cores that datatable uses for sorting. It could be that on a 32 core system it just uses too many cores (all 32) and performance actually goes down. Check the value of dt.options.sort.nthreads on Ubuntu and try to set the same value on Centos.

Yeah, we require gcc 6+: https://datatable.readthedocs.io/en/latest/start/install.html#install-latest-dev-version But I don't think you need to build anything from the source. Have you tried just to pip install the latest dev wheels from our Python repo: https://h2o-release.s3.amazonaws.com/datatable/index.html ?

Anyways, I do not expect the difference of four orders of magnitude between the current dev and 1.0.0, the difference will be actually very minor. The source of the issue is probably the over-parallelization.

lmullany Aug 4, 2022
Author

@oleksiyskononenko , thanks - of course, for some reason I did not realize the development wheels were on the repo - I've downloaded and installed, and yes, you are right - that does not make a difference..

I wasn't able to get any change by setting sort.nthreads option down to 24 (which is what the value was on the faster Ubuntu machine)

lmullany · 2022-08-04T22:37:39Z

lmullany
Aug 4, 2022
Author

@oleksiyskononenko its interesting, because on the same platform, pandas is not having a problem at all, with sorting:

With pandas - read and sort

%%time
import pandas as pd
df =pd.read_csv("df.csv")
df.sort_values(by=['a','b'])

CPU times: user 180 ms, sys: 14.4 ms, total: 194 ms
Wall time: 194 ms

With datatable - read and sort

%%time
import datatable as dt
df = dt.fread("df.csv")
df.sort(['a','b'])

CPU times: user 5min 48s, sys: 121 ms, total: 5min 48s
Wall time: 21.9 s

4 replies

oleksiyskononenko Aug 4, 2022

hm, can we decouple reading the data from sorting to see where the problem is?

oleksiyskononenko Aug 4, 2022

Btw, pandas is single-threaded, so could have no issues. You can also play with dt.options.nthreads to see if it makes any difference.

lmullany Aug 5, 2022
Author

hm, can we decouple reading the data from sorting to see where the problem is?

Yes, I should have split this, but I tested and all the bottleneck is in the sort, not the read. Both pd.read_csv() and dt.fread() and lightning fast

lmullany Aug 5, 2022
Author

Btw, pandas is single-threaded, so could have no issues. You can also play with dt.options.nthreads to see if it makes any difference.

Thanks, when I get back to this I’ll take a look at dt.options.threads

lmullany · 2022-08-05T12:31:41Z

lmullany
Aug 5, 2022
Author

@oleksiyskononenko , I set dt.options.nthreads = 1 and the problem disappeared:

%%timeit
df.sort(['a','b'])

69.1 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

However, there is more to the story. The problem only seems to be occuring within an interactive jupyter notebook session on a compute cluster. If I ssh into the cluster, and run the same code in a python terminal (where dt.options.nthreads is initialized at 8), there is no slow down.

3 replies

oleksiyskononenko Aug 5, 2022

Thanks for the information. I thought there are 32 cores on the machine in question? Not sure why nthreads was initially set to 8 then. By default, this parameter is set to the number of cores available.

Anyways, we seem to have a problem with not adjusting this parameter automatically. Can you share your data for me to reproduce?

lmullany Aug 5, 2022
Author

Actually, this seems to be a problem perhaps specific to this system and/or only within a jupyter session. In particular, with the help of one of my colleagues, I've learn that the number of CPUs available on the system is being recognized (for example the initial dt.options.threads is set at 32, but that is not the number available to my job! I've found that if I manually set nodes_available=x where x is the actual number of nodes available to my job (not that on the system), then dt.options.threads = nodes_availble, will solve the problem completely.

A similar issue occurs with multiprocessing.

import multiprocessing as mp
mp.cpu_count()

32

but in fact, if only 16 (or 8, or 24, or any number less are truly available), then we have a major slowdown..

oleksiyskononenko Aug 5, 2022

Yeah, that could be the source of the problem. If you have 1 core available, but nthreads is defined as 32 — then there's gonna be a huge overhead due to unnecessary parallelization. Unfortunately, it seems that sometimes we don't have a reliable way to determine the number of threads available.

lmullany · 2022-08-05T21:41:27Z

lmullany
Aug 5, 2022
Author

Solution: Make sure that the number of threads being detected by datatable upon initialization does not exceed the actual maximum available. In some situations, the number for cpus on the system (say extracted from /proc/cpu_info for example, will be more than are actually available to the job (say on a compute cluster where the number of cpus provided to the job is less than that on the node).

In such cases, extract the actual number available and set using datatable.options.nthreads=x, where x is the equal to or less than the actual number available.

4 replies

samukweku Aug 10, 2022

Noob question, how do you identify the number of threads available for the job?

lmullany Aug 10, 2022
Author

Its a good question @samukweku , and I think it will be system/platform dependent.

For example, on the specific compute cluster on which I identified this problem (i.e discrepancy between cpus for system at /proc/cpu_info and those available for the actual job) the Slurm Workload Manger is utilized for job scheduling, and there is an environ value that can be accessed and set as below:

import os
import datatable as dt
dt.options.nthreads = int(os.environ['SLURM_CPUS_ON_NODE'])

nankaimy Sep 27, 2022

import os
os.environ['SLURM_CPUS_ON_NODE']
import os
os.environ['SLURM_CPUS_ON_NODE']

Traceback (most recent call last):

File "", line 2, in
os.environ['SLURM_CPUS_ON_NODE']

File "C:\myprogram\Anaconda3\lib\os.py", line 678, in getitem
raise KeyError(key) from None

KeyError: 'SLURM_CPUS_ON_NODE'

windows python 3.7 not run

lmullany Mar 22, 2023
Author

@nankaimy as I indicate above, this will by system-specific.. SLURM_CPUS_ON_NODE is an environ variable on my system, but clearly is not on yours.. investigate the environ variables, and try to determine the key for the variable that indicates the number of cpus, and use that instead

datatable.options.nthreads may need to be set manually under some situations #3329

Replies: 5 comments · 13 replies

lmullany Aug 4, 2022 Author

lmullany Aug 4, 2022 Author

lmullany Aug 4, 2022 Author

lmullany Aug 5, 2022 Author

lmullany Aug 5, 2022 Author

lmullany Aug 5, 2022 Author

lmullany Aug 5, 2022 Author

lmullany Aug 5, 2022 Author

lmullany Aug 10, 2022 Author

lmullany Mar 22, 2023 Author

Replies: 5 comments 13 replies

lmullany
Aug 4, 2022
Author

lmullany Aug 4, 2022
Author

lmullany
Aug 4, 2022
Author

lmullany Aug 5, 2022
Author

lmullany Aug 5, 2022
Author

lmullany
Aug 5, 2022
Author

lmullany Aug 5, 2022
Author

lmullany
Aug 5, 2022
Author

lmullany Aug 10, 2022
Author

lmullany Mar 22, 2023
Author