Data Science is extracting useful and actionable information out of structured and unstructured data.
Exploratory Data Analysis (EDA)
When you get a dataset, it’s a set of rows and columns. If it’s a supervised learning task, there are labels as well. But before you go straight to modeling, you should make yourself familiar with the data first.
Oftentimes, an hour spent looking at the data is more useful than an hour spent tweaking the model. After all, garbage in, garbage out, so what you feed the model should be as clean as possible.
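A first look does not need much code. The sketch below assumes the data is already loaded into a Pandas dataframe. The helper name `first_look` and the toy columns are made up for illustration. It collects the shape, column types, missing-value counts and the number of duplicated rows, which is usually enough to spot the most obvious problems.

```python
import pandas as pd


def first_look(df: pd.DataFrame) -> dict:
    """Collect a few basic facts about a dataframe before any modeling."""
    return {
        # (number of rows, number of columns)
        "shape": df.shape,
        # column name -> dtype name, e.g. "float64" or "object"
        "dtypes": df.dtypes.astype(str).to_dict(),
        # column name -> count of missing values
        "missing": df.isna().sum().to_dict(),
        # rows that are identical to an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
    }


if __name__ == "__main__":
    # Toy data, invented for this example.
    df = pd.DataFrame(
        {"age": [31, 45, None, 45], "city": ["Oslo", "Lima", "Lima", "Lima"]}
    )
    for key, value in first_look(df).items():
        print(key, value)
```

Calling `df.describe()` on top of this gives per-column summary statistics for the numeric columns.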
ydata-profiling
This is both a Python library and a command-line tool. The Python library can analyse Pandas dataframes, and the command-line tool can analyse CSV files.
The tool works a bit slowly, and the generated reports make your browser use a lot of RAM. But the analysis is very good and helpful.
You can run it on a CSV like this:
uv run --python cpython-3.12.10-linux-x86_64-gnu --with ydata-profiling --with setuptools -- ydata_profiling data.csv report.html
This will read data.csv and output a report.html.
For some reason I needed to run it with Python 3.12; it didn’t work with Python 3.14. This will probably be fixed in the future.
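The library side works in much the same way when the data is already in a dataframe. This is a minimal sketch, assuming ydata-profiling and Pandas are installed. The wrapper name `write_profile` and the report title are made up for illustration; `ProfileReport` and `to_file` are the library's own entry points.

```python
import pandas as pd


def write_profile(df: pd.DataFrame, out_path: str = "report.html") -> str:
    """Profile `df` with ydata-profiling and write an HTML report to `out_path`."""
    # Imported inside the function so a script still loads if the package is missing.
    from ydata_profiling import ProfileReport

    report = ProfileReport(df, title="Data profile")
    report.to_file(out_path)
    return out_path


if __name__ == "__main__":
    # Same input file as in the command above.
    df = pd.read_csv("data.csv")
    print(write_profile(df))
```

If the report is too slow or too heavy, `ProfileReport` also accepts `minimal=True`, which turns off the more expensive computations.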