Decimals are not supported properly #1602

ibobak · 2024-06-07T21:14:00Z

Current Behaviour

Spark Dataframe structure:

root
 |-- device_id: string (nullable = true)
 |-- device_install_date: date (nullable = true)
 |-- max_device_event_date: date (nullable = true)
 |-- distinct_play_days: long (nullable = true)
 |-- sessions: long (nullable = true)
 |-- playtime_sec_total: decimal(38,6) (nullable = true)
 |-- intersession_sec_sum: decimal(38,6) (nullable = true)
 |-- playtime_sec_per_session: decimal(38,6) (nullable = true)
 |-- playtime_sec_per_playing_day: decimal(38,6) (nullable = true)
 |-- days_since_install: long (nullable = true)
 |-- avg_ses_between_sessions: decimal(38,6) (nullable = true)
 |-- loyalty_index: double (nullable = true)
 |-- install_date: date (nullable = true)

Code:

from ydata_profiling import ProfileReport

report = ProfileReport(df_basic_features_3, minimal=True, title=app_code)
report.to_file(f"profiling/{app_code}_features_3.html")

Look at the distribution it produced for playtime_sec_total:

Then I converted this DataFrame to a pandas DataFrame, and here is what the distribution actually looks like:

So the conclusion is this: the product is totally buggy with fields of this type, and I don't trust it anymore.

Expected Behaviour

You need to fix the handling of decimal fields.
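Until decimals are handled natively, one possible workaround is to cast the decimal columns to double before building the report. The sketch below is not part of ydata-profiling: `cast_expressions` is a hypothetical helper, and it assumes the column types arrive in the form PySpark's `DataFrame.dtypes` reports them, for example `("playtime_sec_total", "decimal(38,6)")`. It only builds the SQL expressions, so it runs without a Spark session.

```python
import re

# Matches Spark type strings such as "decimal(38,6)" as reported by DataFrame.dtypes.
_DECIMAL_TYPE = re.compile(r"^decimal\(\d+,\s*\d+\)$")


def cast_expressions(dtypes):
    """Build selectExpr() expressions that cast decimal columns to double.

    `dtypes` is a list of (column_name, type_string) pairs, the shape
    returned by a PySpark DataFrame's `dtypes` attribute.
    """
    exprs = []
    for name, type_string in dtypes:
        quoted = f"`{name}`"  # backticks quote identifiers in Spark SQL
        if _DECIMAL_TYPE.match(type_string):
            exprs.append(f"CAST({quoted} AS DOUBLE) AS {quoted}")
        else:
            exprs.append(quoted)  # non-decimal columns pass through unchanged
    return exprs


# Illustrative input only: three of the columns from the schema above.
sample_dtypes = [
    ("device_id", "string"),
    ("sessions", "bigint"),
    ("playtime_sec_total", "decimal(38,6)"),
]
print(cast_expressions(sample_dtypes))

# Hypothetical usage with the DataFrame from this report (needs a Spark session):
# df_cast = df_basic_features_3.selectExpr(*cast_expressions(df_basic_features_3.dtypes))
# report = ProfileReport(df_cast, minimal=True, title=app_code)
```

Note that a double carries roughly 15 to 17 significant decimal digits, so very large decimal(38,6) values can lose precision in the cast. That is usually acceptable for a distribution plot but may not be for exact sums.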

Data Description

see above

Code that reproduces the bug

see above

pandas-profiling version

ydata-profiling==4.8.3

Dependencies

a2wsgi==1.10.4
aiohttp==3.9.5
aiosignal==1.3.1
alembic==1.13.1
altair==5.3.0
annotated-types==0.6.0
anyio==4.3.0
apache-airflow==2.7.1
apache-airflow-providers-common-sql==1.13.0
apache-airflow-providers-ftp==3.9.0
apache-airflow-providers-http==4.11.0
apache-airflow-providers-imap==3.6.0
apache-airflow-providers-sqlite==3.8.0
apispec==6.6.1
argcomplete==3.3.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
arviz==0.16.1
asgiref==3.8.1
asn1crypto==1.5.1
asttokens==2.4.1
async-lru==2.0.4
async-timeout==4.0.3
attrs==23.2.0
Babel==2.15.0
backcall==0.2.0
backoff==2.2.1
bcrypt==4.1.3
beautifulsoup4==4.12.3
bleach==6.1.0
blinker==1.8.2
boto3==1.28.29
botocore==1.31.85
build==1.2.1
cachelib==0.9.0
cachetools==5.3.3
cattrs==23.2.3
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
chroma-hnswlib==0.7.3
chromadb==0.4.24
click==8.1.7
cloudpickle==2.2.1
colorama==0.4.6
coloredlogs==15.0.1
colorlog==4.8.0
comm==0.2.2
ConfigUpdater==3.2
connexion==3.0.6
cons==0.4.6
contourpy==1.2.1
cron-descriptor==1.4.3
croniter==2.0.5
cryptography==42.0.7
cycler==0.12.1
dacite==1.8.1
databricks-cli==0.18.0
dataclasses-json==0.6.6
debugpy==1.8.1
decorator==5.1.1
defusedxml==0.7.1
Deprecated==1.2.14
dill==0.3.8
dnspython==2.6.1
docker==6.1.3
docutils==0.21.2
email-validator==1.3.1
entrypoints==0.4
et-xmlfile==1.1.0
etuples==0.3.9
exceptiongroup==1.2.0
executing==2.0.1
fastapi==0.111.0
fastapi-cli==0.0.4
fastjsonschema==2.19.1
fastprogress==1.0.3
filelock==3.14.0
Flask==2.2.5
Flask-AppBuilder==4.3.6
Flask-Babel==2.0.0
Flask-Caching==2.3.0
Flask-JWT-Extended==4.6.0
Flask-Limiter==3.7.0
Flask-Login==0.6.3
Flask-Session==0.8.0
Flask-SQLAlchemy==2.5.1
Flask-WTF==1.2.1
flatbuffers==24.3.25
fonttools==4.51.0
fqdn==1.5.1
frozenlist==1.4.1
fsspec==2024.5.0
gitdb==4.0.11
GitPython==3.1.43
google-auth==2.29.0
google-re2==1.1.20240501
googleapis-common-protos==1.63.0
graphviz==0.20.3
greenlet==3.0.3
grpcio==1.64.0
gunicorn==20.1.0
h11==0.14.0
h5netcdf==1.3.0
h5py==3.11.0
htmlmin==0.1.12
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.23.2
humanfriendly==10.0
idna==3.7
ImageHash==4.3.1
importlib-metadata==6.11.0
importlib_resources==6.4.0
inflection==0.5.1
ipykernel==6.19.2
ipynb-py-convert==0.4.6
ipython==8.10.0
ipython-genutils==0.2.0
ipywidgets==7.6.5
isoduration==20.11.0
itsdangerous==2.2.0
jedi==0.19.1
Jinja2==3.1.4
jmespath==1.0.1
joblib==1.4.2
json5==0.9.25
jsonpatch==1.33
jsonpointer==2.4
jsonschema==4.22.0
jsonschema-specifications==2023.12.1
jupyter-contrib-core==0.4.2
jupyter-contrib-nbextensions==0.7.0
jupyter-events==0.10.0
jupyter-highlight-selected-word==0.2.0
jupyter-lsp==2.2.5
jupyter-nbextensions-configurator==0.6.3
jupyter_client==7.4.4
jupyter_core==5.7.2
jupyter_server==2.14.0
jupyter_server_terminals==0.5.3
jupyterlab==4.2.1
jupyterlab-execute-time==3.1.2
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.2
jupyterlab_widgets==3.0.10
kiwisolver==1.4.5
kubernetes==29.0.0
langchain==0.1.13
langchain-community==0.0.38
langchain-core==0.1.52
langchain-text-splitters==0.0.2
langsmith==0.1.67
lazy-object-proxy==1.10.0
lazyprofiler==0.1.1
limits==3.12.0
linkify-it-py==2.0.3
llvmlite==0.42.0
lockfile==0.12.2
logical-unification==0.4.6
lxml==5.2.2
Mako==1.3.5
Markdown==3.6
markdown-it-py==3.0.0
MarkupSafe==2.1.5
marshmallow==3.21.2
marshmallow-oneofschema==3.1.1
marshmallow-sqlalchemy==0.26.1
matplotlib==3.8.4
matplotlib-inline==0.1.7
mdit-py-plugins==0.4.1
mdurl==0.1.2
miniKanren==1.0.3
mistune==3.0.2
mlflow==2.5.0
mmh3==4.1.0
more-itertools==10.2.0
mpmath==1.3.0
msgspec==0.18.6
multidict==6.0.5
multimethod==1.11.2
multipledispatch==1.0.0
mypy-extensions==1.0.0
nbclassic==1.0.0
nbclient==0.10.0
nbconvert==7.16.4
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.3
notebook==7.2.0
notebook_shim==0.2.4
numba==0.59.1
numpy==1.23.5
oauthlib==3.2.2
onnx==1.15.0
onnxconverter-common==1.14.0
onnxmltools==1.12.0
onnxruntime==1.17.1
openai==1.22.0
openpyxl==3.1.2
opentelemetry-api==1.24.0
opentelemetry-exporter-otlp==1.24.0
opentelemetry-exporter-otlp-proto-common==1.24.0
opentelemetry-exporter-otlp-proto-grpc==1.24.0
opentelemetry-exporter-otlp-proto-http==1.24.0
opentelemetry-instrumentation==0.46b0
opentelemetry-instrumentation-asgi==0.46b0
opentelemetry-instrumentation-fastapi==0.46b0
opentelemetry-proto==1.24.0
opentelemetry-sdk==1.24.0
opentelemetry-semantic-conventions==0.45b0
opentelemetry-util-http==0.46b0
optuna==3.5.0
optuna-fast-fanova==0.0.4
ordered-set==4.1.0
orjson==3.10.3
overrides==7.7.0
packaging==23.2
pandas==1.5.3
pandas-datareader==0.10.0
pandasql==0.7.3
pandocfilters==1.5.1
parso==0.8.4
pathspec==0.12.1
patsy==0.5.6
pendulum==2.1.2
pexpect==4.9.0
pgcopy==1.6.0
phik==0.12.4
pickleshare==0.7.5
pillow==10.3.0
platformdirs==4.2.2
plotly==5.22.0
pluggy==1.5.0
posthog==3.5.0
prison==0.2.1
prometheus_client==0.20.0
prompt-toolkit==3.0.43
protobuf==3.20.2
psutil==5.9.8
psycopg2==2.9.9
psycopg2-binary==2.9.7
ptyprocess==0.7.0
pulsar-client==3.5.0
pure-eval==0.2.2
pyarrow==12.0.1
pyasn1_modules==0.4.0
pycountry==23.12.11
pycparser==2.22
pydantic==2.7.0
pydantic_core==2.18.1
pydeck==0.9.1
Pygments==2.18.0
PyJWT==2.8.0
pymc==5.6.0
pyparsing==3.1.2
pypdf==4.1.0
PyPika==0.48.9
pyproject_hooks==1.1.0
pytensor==2.12.3
python-daemon==3.0.1
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-json-logger==2.0.7
python-multipart==0.0.9
python-nvd3==0.16.0
python-slugify==8.0.4
pytz==2023.4
pytzdata==2020.1
PyWavelets==1.6.0
PyYAML==6.0.1
pyzmq==26.0.3
querystring-parser==1.2.4
redshift-connector==2.0.911
referencing==0.35.1
requests==2.32.3
requests-oauthlib==2.0.0
requests-toolbelt==1.0.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.7.1
rich-argparse==1.4.0
rpds-py==0.18.1
s3transfer==0.6.2
scikit-learn==1.3.2
scipy==1.12.0
scramp==1.4.5
seaborn==0.12.2
Send2Trash==1.8.3
setproctitle==1.3.3
shap==0.42.1
shellingham==1.5.4
six==1.16.0
skl2onnx==1.16.0
slicer==0.0.7
smart-open==6.3.0
smmap==5.0.1
sniffio==1.3.1
soupsieve==2.5
spark_framework @ git+https://github.com/ibobak/spark_framework.git@8dcf0f5b29e71721d4d6069a76ae4fde1e7e7bde
SQLAlchemy==1.4.49
SQLAlchemy-JSONField==1.0.2
SQLAlchemy-Utils==0.41.2
sqlparse==0.5.0
stack-data==0.6.3
starlette==0.37.2
statsmodels==0.14.2
streamlit==1.32.2
sympy==1.12
tabulate==0.9.0
tenacity==8.0.1
termcolor==2.4.0
terminado==0.18.1
text-unidecode==1.3
threadpoolctl==3.5.0
tinycss2==1.3.0
tokenizers==0.19.1
tomli==2.0.1
toolz==0.12.1
tornado==6.2
tqdm==4.66.2
traitlets==5.9.0
typeguard==4.3.0
typer==0.12.3
types-python-dateutil==2.9.0.20240316
typing-inspect==0.9.0
typing_extensions==4.12.0
tzdata==2024.1
uc-micro-py==1.0.3
ujson==5.10.0
unicodecsv==0.14.1
uri-template==1.3.0
urllib3==2.0.7
uvicorn==0.30.0
uvloop==0.19.0
visions==0.7.6
watchdog==4.0.1
watchfiles==0.22.0
wcwidth==0.2.13
webcolors==1.13
webencodings==0.5.1
websocket-client==1.8.0
websockets==12.0
Werkzeug==3.0.3
widgetsnbextension==3.5.2
wordcloud==1.9.3
wrapt==1.16.0
WTForms==3.1.2
xarray==2024.3.0
xarray-einstats==0.7.0
xgboost==2.0.2
XlsxWriter==3.2.0
yarl==1.9.4
ydata-profiling==4.8.3
zipp==3.18.2

OS

Ubuntu 22.04

Checklist

There is not yet another bug report for this issue in the issue tracker
The problem is reproducible from this bug report. This guide can help to craft a minimal bug report.
The issue has not been resolved by the entries listed under Common Issues.


fabclmnt · 2024-07-09T20:56:09Z

Hi @ibobak ,

Thank you for reporting the issue. Regarding ydata-profiling for Spark, we have only launched an initial version, which not only includes a small subset of the functionality but also has some known issues.

We are looking for contributors that are willing to keep evolving the Spark integration, as this was something initiated by the community. If you're open to it, feel free to check the issues labelled with the tag spark.

azory-ydata added the needs-triage label Jun 7, 2024

fabclmnt added the bug 🐛 (Something isn't working) and spark ⚡ (PySpark features!) labels and removed the needs-triage label Jul 9, 2024
