Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
Over in dask/dask#12022 (comment), I'm debugging a test failure with dask and pandas 3.x that comes down to the behavior of DataFrame.copy(deep=True) with an arrow-backed extension array.
The relevant code is the copy method in pandas/pandas/core/arrays/arrow/array.py (line 1092 in 628c7fb):
def copy(self) -> Self:
As a result, after DataFrame.copy(deep=True), you'll still have a reference back to the original buffer. If the output of .copy(deep=True) is the only object left holding a reference to the original buffer, that buffer still won't be garbage collected. Consider:
import pandas as pd
import pyarrow as pa

pool = pa.default_memory_pool()
print("before", pool.bytes_allocated())  # 0

# Baseline: deleting the DataFrame releases its arrow buffers.
df = pd.DataFrame({"a": ["a", "b", "c"] * 1000})
print("df", pool.bytes_allocated())  # 27200
del df
print("after del df", pool.bytes_allocated())  # 0

# An empty slice followed by a deep copy still references df2's buffers.
df2 = pd.DataFrame({"a": ["a", "b", "c"] * 1000})
clone = df2.iloc[:0].copy(deep=True)
print("df2", pool.bytes_allocated())  # 27200
del df2
print("after - clone", pool.bytes_allocated())  # 27200
Maybe this is fine. We can probably figure out some workaround in dask: in this case we're making an empty dataframe object as a kind of Schema object, and we can probably do something other than df.iloc[:0].copy(deep=True). But perhaps pandas could consider changing the behavior here.
The downside is that df.copy(deep=True) will become more expensive and use more memory.
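As one possible shape for the dask-side workaround, here is a minimal sketch that builds the empty schema-like frame from the dtypes alone instead of slicing the original. `empty_like` is a hypothetical helper name, not a pandas or dask API. It assumes unique column names and does not preserve the original index type.

```python
import pandas as pd


def empty_like(df: pd.DataFrame) -> pd.DataFrame:
    """Build a zero-row frame with df's column names and dtypes.

    Each column is a freshly constructed empty Series, so nothing here is
    derived from (or keeps a reference to) the arrays backing df.
    Assumes column names are unique; the result gets a default empty index.
    """
    return pd.DataFrame(
        {name: pd.Series(dtype=dtype) for name, dtype in df.dtypes.items()}
    )


if __name__ == "__main__":
    df2 = pd.DataFrame({"a": ["a", "b", "c"] * 1000, "b": range(3000)})
    meta = empty_like(df2)
    del df2  # meta holds no reference to df2's buffers
    print(meta.dtypes)
```

Because the columns are created from scratch, deleting the original frame should let its buffers be released regardless of how `copy(deep=True)` behaves.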
Installed Versions
In [4]: pd.show_versions()
INSTALLED VERSIONS
------------------
commit : 962168f06d15d1aced28b414eb82909d3c930916
python : 3.12.8
python-bits : 64
OS : Darwin
OS-release : 24.5.0
Version : Darwin Kernel Version 24.5.0: Tue Apr 22 19:53:27 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T6041
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 3.0.0.dev0+2254.g962168f06d
numpy : 2.4.0.dev0+git20250717.d02611a
dateutil : 2.9.0.post0
pip : None
Cython : None
sphinx : None
IPython : 9.4.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : 2025.7.0
html5lib : None
hypothesis : 6.136.1
gcsfs : None
jinja2 : 3.1.6
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
psycopg2 : None
pymysql : None
pyarrow : 21.0.0
pyiceberg : None
pyreadstat : None
pytest : 8.4.1
python-calamine : None
pytz : 2025.2
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
qtpy : None
pyqt5 : None
Prior Performance
No response