Moving to PyArrow dtypes by default #61618

datapythonista · 2025-06-09T21:51:57Z

There have been some discussions about this before, but as far as I know there is not a plan or any decision made.

My understanding is (feel free to disagree):

If we were starting pandas today, we would only use Arrow as storage for DataFrame columns / Series
After all the great work that has been done on building new pandas types based on PyArrow, we are not considering other Arrow implementations
Based on 1 and 2, we are moving towards pandas based on PyArrow, and the main question is what's the transition path

@jbrockmendel commented this, and I think many others share this point of view, based on past interactions:

There's a path to making it feasible to use PyArrow types by default, but that path probably takes multiple major release cycles. Doing it for 3.0 would be, frankly, insane.

It would be interesting to know why exactly, I guess it's mainly because of two main reasons:

Finish PyArrow types and making operations with them as reliable and fast as the original pandas types
Giving users time to adapt

I don't know the exact state of the PyArrow types, and how often users will face problems if using them instead of the original ones. From my perception, there aren't any major efforts to make them better at this point. So, I'm unsure if the situation in that regard will be very different if we make the PyArrow types the default ones tomorrow, or if we make them the default ones in two years.

My understanding is that the only person who is paid consistently to work on pandas is Matt, and he's doing an amazing job at keeping the project going, reviewing most of the PRs, keeping the CI in good shape... But I don't think not him not anyone else if being able to put hours into developing new things as it used to be. For reference, this is the GitHub chart of pandas activity (commits) since pandas 2.0:

So, in my opinion, the existing problems with PyArrow will start to be addressed significantly, whenever they become the default ones.

So, in my opinion our two main options are:

We move forward with the PyArrow transition. pandas 3.0 will surely not be the best pandas version ever if we start using PyArrow types, but pandas 3.1 will be much better, and pandas 3.2 may be as good as pandas 2 in reliability and speed, but much closer to what we would like pandas to be.

Of course not all users are ready for pandas 3.0 with Arrow types. They can surely pin to pandas=2 until pandas 3 is more mature and they made the required changes to their code. We can surely add a flag pandas.options.mode.use_arrow = False that reverts the new default to the old status quo. So users can actually move to pandas 3.0 but stay with the old types until we (pandas) and them are ready to get into the new default types. The transition from Python 2 to 3 (which is surely an example of what not to do) took more than 10 years. I don't think in our case we need as much. And if there is interest (aka money) we can also support the pandas 2 series while needed.

The other option is to continue with the transition to the new nullable types, that my understanding is that we implemented because PyArrow didn't exist at that time. Continue to put our little resources on them. Making users adapt their code to a new temporary status quo, not the final one we envision, and stay in this transition period and delay the move to PyArrow I assume around 6 years (Brock mentioned multiple major release cycles, so I assume something like 3 at a rate of one major release every 2 years).

It will be great to know what are other people's thoughts and ideal plans, and see what makes more sense. But to me personally, based on the above information, it doesn't sound more insane to move to PyArrow in pandas 3, than to move in pandas 6.

The text was updated successfully, but these errors were encountered:

jbrockmendel · 2025-06-09T22:30:04Z

Moving users to pyarrow types by default in 3.0 would be insane because #32265 has not been addressed. Getting that out of the way was the point of PDEP 16, which you and Simon are oddly hostile toward.

mroeschke · 2025-06-09T23:38:29Z

Since I helped draft PDEP-10, I would like a world where the Arrow type system with NA semantics would be the only pandas type system.

Secondarily, I would like a world where pandas has the Arrow type system with NA semantics and the legacy (NaN) Numpy type system which is completely independent from the Arrow type system (e.g. a user cannot mix the two in any way)

I agree with Brock that null semantics (NA vs NaN) must inevitably be discussed with adopting a new type system.

I've also generally been concerned about the growing complexity of PDEP-14 with the configurability of "storage" and NA semantics (like having to define a comparison hierarchy #60639). While I understand that we've been very cautious about compatibility with existing types, I don't think this is maintainable or clearer for users in the long run.

My ideal roadplan would be:

In pandas 3.x
- Deprecate the NumPy nullable types
- Deprecate NaN as a missing value for Arrow types
- Deprecate mixing Arrow types and Numpy types in any operation
- Add some global configuration API, set_option("type_system", "legacy" | "pyarrow"), that configures the "default" type system as either NaN semantics with NumPy or NA semantics with Arrow (with the default being "legacy")
In pandas 4.x
- Enforce above deprecations
- Deprecate "legacy" as the "default" type system
In pandas 5.x
- Enforce "pyarrow" as the new "default" type system

jbrockmendel · 2025-06-10T00:06:24Z

Deprecate NaN as a missing value for Arrow types

So with this deprecation enforced, NaN in a constructor, setitem, or csv would be treated as distinct from pd.NA? If so, I’m on board but I expect that to be a painful deprecation.

datapythonista · 2025-06-10T00:06:53Z

To be clear, I'm not hostile towards PDEP-16 at all. I think it's important for pandas to have clear and simple missing value handling, and while incomplete, I think the PDEP and the discussions have been very useful and insightful. Amd I really appreciate that work.

I just don't see PDEP-16 as a blocker for moving to pyarrow, even less if implemented at the same time. And also, I wouldn't spend time with our own nullable dtypes, I would implement PDEP-16 only for pyarrow types.

I couldn't agree more on the points Matt mentions for pandas 3.x. Personally I would change the default earlier. Sounds like pandas 4.x is mostly about showing a warning to users until they manually change the default, which I personally wouldn't do. But that's a minor point, I like the general idea.

mroeschke · 2025-06-10T00:17:14Z

Deprecate NaN as a missing value for Arrow types

So with this deprecation enforced, NaN in a constructor, setitem, or csv would be treated as distinct from pd.NA?

Correct. Yeah I am optimistic that most of the deprecation would hopefully go into ArrowExtensionArray, but of course there are probably a lot of one-off places that need addressing.

I just don't see PDEP-16 as a blocker for moving to pyarrow

There is a lot of references about type systems in that PDEP that I think would warrant some re-imagining given which type systems are favored. As mentioned before (unfortunately) I think type systems and missing value semantics need to be discussed together

datapythonista · 2025-06-10T08:31:56Z

I created a separate issue #61620 for the option mentioned in the description and in Matt's roadmap, since I think that's somehow independent, and no blocked by PDEP-15, by this issue, or by nothing else that I know.

There is a lot of references about type systems in that PDEP that I think would warrant some re-imagining given which type systems are favored. As mentioned before (unfortunately) I think type systems and missing value semantics need to be discussed together

I fully agree with this. But I'm not sure I fully understand why PDEP-16 must be a blocker for defaulting to PyArrow types.

For users already using PyArrow, they'll have to follow the deprecation transition if they are using NaN as missing. To me, this can be started in 3.0, 4.0 or whenever. And probably the earlier the better, so no more code is written with the undesired behavior.

For users not yet using PyArrow, I do understand that it's better to force the move when the PyArrow dtypes behave as we think they should behave. I'm not convinced this should be a blocker, and even less if the deprecation of the special treatment of NaN is also implemented in 3.0 or in the version when PyArrow types become the default. Maybe I'm wrong, but if you are getting data from parquet, csv... you are not immediately affected by this missing value semantics problems. You need to create data manually (rare for most professional use cases imho), or you need to be explicitly setting NaN in existing data (also rare in my personal experience). Am I missing something that makes this point important enough to be a blocker for moving to PyArrow and stop investing time in types that we plan to deprecate? If you have an example of code commonly used that is problematic for what I'm proposing, that can surely convince me, and help identify the optimal transition path.

simonjayhawkins · 2025-06-10T09:42:13Z

Correct. Yeah I am optimistic that most of the deprecation would hopefully go into ArrowExtensionArray, but of course there are probably a lot of one-off places that need addressing.

@mroeschke a couple of questions:

could the ArrowExtensionArray be described as a true extension array, i.e. using the extension array interface for 3rd party EAs.?
does the ArrowExtensionArray rely on any Cython code for the implementation to work?

simonjayhawkins · 2025-06-10T09:57:53Z

Deprecate NaN as a missing value for Arrow types

So with this deprecation enforced, NaN in a constructor, setitem, or csv would be treated as distinct from pd.NA? If so, I’m on board but I expect that to be a painful deprecation.

I have always understood the ArrowExtensionArray and ArrowDtype to be experimental, no PDEP and no roadmap item, for the purpose of evaluating PyArrow types in pandas to potentially eventually use as a backend for pandas nullable dtypes.

So I can sort of understand why the ArrowDtypes are no longer pure and have allowed pandas semantics to creep into the API.

As experimental dtypes why do they need any deprecation at all? Where do we promote these types as pandas recommended types?

simonjayhawkins · 2025-06-10T10:07:16Z

just to be clear about my previous statement, there is a roadmap item

Apache Arrow interoperability
Apache Arrow is a cross-language development platform for in-memory data. The Arrow logical types are closely aligned with typical pandas use cases.

We'd like to provide better-integrated support for Arrow memory and data types within pandas. This will let us take advantage of its I/O capabilities and provide for better interoperability with other languages and libraries using Arrow.

but I've never interpreted this to cover adopting the current ArrowDtype system throughout pandas

simonjayhawkins · 2025-06-10T11:14:34Z

My ideal roadplan would be:

@mroeschke given the current level of funding and interest for the development of the pandas nullable dtypes and the proportion of core devs that now appear to favor embracing the ArrowDtype instead, I fear that this may be, at this time, the most pragmatic approach in keeping pandas development moving forward as it does seem to have slowed of late. I'm not necessarily comfortable deprecating so much prior effort but then the same could have been said about Panel many years ago and I'm not sure anyone misses it today. If the community wants nullable dtypes by default, they may be less interested in the implementation details or even to some extent the performance. If the situation changes and there is more funding and contributions in the future and we have released a Windows 8 in the meantime then we could perhaps bring back the pandas nullable types.

WillAyd · 2025-06-10T13:16:09Z

The goal of this and PDEP-13 were pretty aligned; prefer PyArrow to build out our default type system where applicable, and fill in the gaps using whatever we have as a fallback. That PDEP conversation stalled; not sure if its worth reviving or if this issue is going to tackle a smaller subset of the problem, but in any case I definitely support this

mroeschke · 2025-06-10T16:17:11Z

But I'm not sure I fully understand why PDEP-16 must be a blocker for defaulting to PyArrow types.

@datapythonista I just would like some agreement that that defaulting to PyArrow types also matches PDEP-16's proposal to (only) NA semantics for this type as well when making the change for a consistent story, but I suppose they don't need to be done at the same time

could the ArrowExtensionArray be described as a true extension array, i.e. using the extension array interface for 3rd party EAs.?

@simonjayhawkins yes, it purely uses ExtensionArray and ExtensionDtype to implement functionality.

does the ArrowExtensionArray rely on any Cython code for the implementation to work?

It does not, and ideally it won't. When interacting with other parts of Cython in pandas, e.g. groupby, we've created hooks to convert to numpy first.

As experimental dtypes why do they need any deprecation at all? Where do we promote these types as pandas recommended types?

This is a valid point; technically it shouldn't require deprecation.

While our docs don't necessarily state a recommended type, anecdotally, it has felt like in past year or two there's been quite a number of conference talks, blogs posts, books that have "celebrated" the newer Arrow types in pandas. Although attention != usage, it may still warrant some care if changing behavior IMO.

If the situation changes and there is more funding and contributions in the future and we have released a Windows 8 in the meantime then we could perhaps bring back the pandas nullable types.

An alternative to completely deprecating and removing the pandas NumPy nullable types is to spin them off into their own repository & package and treat them like any other 3rd party ExtensionArray library for users that still want to use them.

simonjayhawkins · 2025-06-10T16:48:31Z

If the situation changes and there is more funding and contributions in the future and we have released a Windows 8 in the meantime then we could perhaps bring back the pandas nullable types.

An alternative to completely deprecating and removing the pandas NumPy nullable types is to spin them off into their own repository & package and treat them like any other 3rd party ExtensionArray library for users that still want to use them.

we have a roadmap item:

Extensibility
Pandas extending.extension-types allow for extending NumPy types with custom data types and array storage. Pandas uses extension types internally, and provides an interface for 3rd-party libraries to define their own custom data types.

Many parts of pandas still unintentionally convert data to a NumPy array. These problems are especially pronounced for nested data.

We'd like to improve the handling of extension arrays throughout the library, making their behavior more consistent with the handling of NumPy arrays. We'll do this by cleaning up pandas' internals and adding new methods to the extension array interface.

the roadmap also states

pandas is in the process of moving roadmap points to PDEPs

so we could perhaps have a separate issue on this (perhaps culminating in a PDEP to clear this off that list) to address any points to make the spin off easier?

simonjayhawkins · 2025-06-11T11:17:51Z

But I'm not sure I fully understand why PDEP-16 must be a blocker for defaulting to PyArrow types.

@datapythonista I just would like some agreement that that defaulting to PyArrow types also matches PDEP-16's proposal to (only) NA semantics for this type as well when making the change for a consistent story, but I suppose they don't need to be done at the same time

In the absence of a competing proposal, I think you should proceed assuming that PDEP-16 is going to be accepted someday. A stale PDEP cannot be allowed to hold up development in other areas, especially if you have the bandwidth and resources to move forward on this.

simonjayhawkins · 2025-06-11T11:23:49Z

@mroeschke

Deprecate the NumPy nullable types

just to be clear, Int, Float, Boolean and pd.NA variants of StringArray only?

Deprecate mixing Arrow types and Numpy types in any operation

does that include numpy object type?

Dr-Irv · 2025-06-11T14:50:28Z

I'm going to throw out another (possibly crazy) idea based on these comments:

If we were starting pandas today, we would only use Arrow as storage for DataFrame columns / Series

An alternative to completely deprecating and removing the pandas NumPy nullable types is to spin them off into their own repository & package and treat them like any other 3rd party ExtensionArray library for users that still want to use them.

The idea is as follows:

We create a new repository corresponding to what is in pandas 2.3 - let's call it npandas ("n" indicating "numpy-based")
pandas 3.0 goes fully PyArrow - removes the numpy storage and removes the nullable extension types

We only do bug fixes to npandas - so if people want to use the numpy types or nullable types, they use that version.

If they want "modern" pandas, they fully buy into pandas 3.0 and PyArrow support - requiring pyarrow and start working with the future.

Then we don't have to worry about a transition path that WE maintain. The transition is up to the users. They can take working code with npandas 2.3 and try pandas 3.0, and report issues in converting from npandas 2.3 to pandas 3.0. But we don't have to keep all these code paths working - we have 2 separate projects.

WillAyd · 2025-06-11T14:58:46Z

Generally curious but why do we feel the need to remove the extension types? If anything, couldn't we just update them to use PyArrow behind the scenes?

I realize we use the term "numpy_nullable" throughout, but conceptually there's nothing different between what they and their corresponding Arrow types do. The Arrow implementation is just going to be more lightweight, so why not update those in place and not require users to change their code?

datapythonista · 2025-06-11T15:27:13Z

All proposals and discussion points so far seem very interesting. Personally I think it's great we are having this conversation.

Based on the feedback above, what makes most sense to me is:

Require PyArrow in 3.0
We deprecate the NaN behavior in Arrow types. Correct me if I'm wrong, but I think it's not a huge effort, Bodo will probably fund the development, and Brock and Simon have availability to do it
We default to PyArrow dtypes starting in 3.0, but we provide a global flag (e.g. pandas.options.mode.use_legacy_typing)for users who aren't ready to use the Arrow types.

This wouldn't take us to a perfect situation, but it makes the transition path very easy for both users and us, and we can start focussing on pandas backed by Arrow, while gathering feedback from users quite fast without creating them any inconvenient (other that adding a line of code setting the option). After getting the feedback we will be in a better position to decide on deprecating the legacy types, moving them to a separate repo, maintaining them forever...

mroeschke · 2025-06-11T15:56:41Z

@simonjayhawkins

just to be clear, Int, Float, Boolean and pd.NA variants of StringArray only?

Yes and I'd say all variants of StringArray (ArrowExtensionArray supports all functionality of StringArray)

does that include numpy object type?

I would say yes.

jbrockmendel · 2025-06-11T16:20:58Z

We default to PyArrow dtypes starting in 3.0, but we provide a global flag

3.0 is already huge (CoW, Strings) and way, way late. Let's not risk adding another year to that lateness.

jbrockmendel · 2025-06-11T16:37:18Z

We deprecate the NaN behavior in Arrow types. Correct me if I'm wrong, but I think it's not a huge effort, Bodo will probably fund the development, and Brock and Simon have availability to do it

Deprecating that behavior is relatively easy (though I kind of expect @jorisvandenbossche to chime in that we're missing something important). The hard part is the step after that telling users they have to change all their existing ser[1] = np.nan to use pd.NA instead (and constructor calls, and csvs).

As to "huge effort", off the top of my head, some tasks that would be required in order to make pyarrow dtypes the default include:

a) making object, category, interval, period dtypes Just Work
b) changing the non-numeric Index subclasses to use pyarrow dtypes
c) updating all the tests to work with both/all backend global options. Even if all the non-test code is bug-free, this is months of work.
d) (longer-term) try to ameliorate the performance hit for axis=1 operations, arithmetic, etc

datapythonista · 2025-06-11T17:28:19Z

Thanks @jbrockmendel, this is great feedback.

Do you think implementing the global option (without defaulting to arrow or fixing the above points) is reasonable for 3.0 (releasing it soon)?

For numpy nullable types, does that same list apply? Or just the work on missing value semantics is needed to consider them ready to be the default?

jbrockmendel · 2025-06-11T18:35:29Z

Everything except the performance would need to be done regardless of the default.

jorisvandenbossche · 2025-06-11T19:19:46Z

Trying to catch up with all discussions .. but already two quick points of feedback:

Timing:

We default to PyArrow dtypes starting in 3.0

3.0 is already huge (CoW, Strings) and way, way late. Let's not risk adding another year to that lateness.

I agree with @jbrockmendel here. Personally I would certainly not consider any of this for 3.0. There is still lots of work to do before we can even consider making the pyarrow dtypes the default (right now users cannot even try to this out fully), which I think will easily take at least a year.
(and saying that as the person who caused a year of delay for 3.0 thinking we could "quickly" include the string dtype feature in pandas 3.0 at a point where we were otherwise already ready to release ...)

General approach of the "PyArrow dtypes":

Personally, I would prefer that we go for the model chosen for the StringDtype, i.e. not necessarily the fact that there are multiple backends (or the na_value), but I mean the fact that we went for a pd.StringDtype() as the default dtype and not for pd.ArrowDtype(pa.large_string()).
We should make good use of the Arrow type system, but IMO we 1) should not necessarily expose all of its details to our users (e.g. string vs large_string vs string_view, but have a default "string" dtype (with potentially options for power users than want to customized it)), and so I think we should hide some implementation details of (Py)Arrow under a nice "pandas logical dtype" system (i.e. @WillAyd's PDEP), and 2) by having separate classes per "type" of dtype, we can also provide a better user experience (e.g. a datetime/timestamp dtype will have a timezone attribute, but numeric dtypes don't need this).

As another example of this, I think we should create a variant of pd.CategoricalDtype that is nullable and uses pyarrow under the hood, and not have people rely on the fact that this translates to pd.ArrowDtype(pa.dictionary(...))
(in the line of what Brock said above about "making object, category, interval, period dtypes Just Work")

simonjayhawkins · 2025-06-11T20:05:01Z

Thanks @jorisvandenbossche

2. General approach of the "PyArrow dtypes":

This aligns with my expectation of the future direction of pandas. Do we have any umbrella discussion anywhere that could be used as the basis of a PDEP so this is on the roadmap. (note: I'm not in any way volunteering to take that on)

I don't think PDEP-16 covers the using of pyarrow as a backend only and hiding the functionality behind our pandas nullable types as it is focused on the missing value semantics. Now, I appreciate that is part of that plan, but is not the plan as a roadmap item.

But to be fair, ArrowDType was introduced without a PDEP or IMO a roadmap item that covered what was implemented. As said above, these types have been promoted at conferences etc and are potentially being actively used. This is now IMO pulling the development team in two different directions. For instance, I am very frustrated at the ArrowDType for string being available to users as it has caused so much confusion even within the team. IMO we don't need this as "experimental" since we have the pandas nullable dtype available. However, I also appreciate that for evaluating an experimental dtype system (not just individual dtypes) we should perhaps include it. So I am conflicted.

Now we are where we are. And we probably now have as many (or more) core-devs that appear to favor the ArrowDtype system. So managing that is now likely as difficult as the technical discussions themselves.

@mroeschke please don't take this as criticism as the work you have done is great (for the whole project, the maintenance and not just the Arrow work) and any issues that have arisen are systematic failures. As @WillAyd pointed out the PDEP process needs to be able to allow to make decisions without every minute detail ironed out before the PDEP is approved.

datapythonista · 2025-06-11T20:29:32Z

Thanks all for the feedback. Personally feels like while opening so many discussions have been a bit chaotic (sorry about that as I started or reopened most), I think it's been very productive.

I'd like to know how do you feel about trying to unify all the type system topics into a single PDEP that broadly defines a roadmap for the next steps and final goal. Maybe I'm too optimistic, but I think there is agreement in most points, and it's just the details that need to be decided. Some of the things that I think could go into the PDEP and there is mostly agreement:

"Final" pandas dtypes backed by arrow (probably also agreement on not exposing it to the user)
Eventually making PyArrow a required dependency (unclear when)
Implementing a global option to at least change between the status quo and the final dtype system (maybe other options)
The PyArrow dtypes still need work before being the default/final:
- NA semantics
- Functionality not yet implemented
- Bugs / undesired behavior
- Performance
We should be testing the final PyArrow type system via our test suite with the global option (this can lead the work on what needs to be fixed)
There should be a transition path from the status quo to the final types system that needs to be defined
We need to take into consideration that we have around 500 work hours of funding, and that if things don't change, it doesn't seem likely that anyone will put the huge amount of time this needs to finish the roadmap

Does people feel like it make sense to write a PDEP with the general picture of this roadmap (that should superseed this issue, PDEP-15 and probably others)? Does anyone want to lead the effort to get this PDEP done?

mroeschke · 2025-06-11T20:33:40Z

please don't take this as criticism as the work you have done is great

None taken :) . Yes, the development of these types preceded our agreed-upon approval process for larger pandas enhancements.

Personally, I would prefer that we go for the model chosen for the StringDtype

FWIW, the early model of ArrowDtype almost followed a variant of the StringDtype model (#46972). I have come to like the benefits of the current ArrowDtype (probably with some bias), but I'll argue my case in a different forum.

simonjayhawkins · 2025-06-12T09:08:21Z

FWIW, the early model of ArrowDtype almost followed a variant of the StringDtype model (#46972). I have come to like the benefits of the current ArrowDtype (probably with some bias), but I'll argue my case in a different forum.

Surely, this is that forum? (From the title of the discussion)

I think that we should have another discussion for pandas nullable dtypes by default, which arguably is already opened #58243?

@mroeschke you have come up with a proposed (maybe seen as alternative) roadmap and I for one am more than happy to give it serious consideration. To do this IMO we must keep the discussion about pandas nullable types in a different thread. Discussion is good. With respect to bias, not a problem in this thread. To actively participate in this discussion I need to avoid any bias that I may have towards the pandas nullable types arising from my involvement in the pd.NA variant of the string array.

In my head, I can see a path that allows both to progress in parallel avoiding conflict. I'll elaborate later when I can present it as a coherent plan. (it involves separation of type systems as you have suggested, some deprecations as you have suggested but not removals, instead moving things behind flags as suggested) - A pure arrow approach and a cuddly wrapped one. Now that users have already been exposed to the more "raw" types, users may not needed to be shielded from this to the same extent and that could indeed be a path to the pandas nullable types. We may need to recognize the realities on the ground to come to a resolution with respect to the natural evolution and adoption of the ArrowDtype, the current level of funding and more importantly what we would have approval to use the funds for and lastly the decline in contributions of late and therefor the realistic speed of progress/interest on the nullable types.

simonjayhawkins · 2025-06-12T10:26:19Z

does that include numpy object type?

I would say yes.

df = pd.DataFrame(
    {
        "int": pd.Series([1, 2, 3], dtype=pd.ArrowDtype(pa.int64())),
        "str": pd.Series(["a", "b", pd.NA], dtype=pd.ArrowDtype(pa.string())),
    }
)
print(df)
#    int   str
# 0    1     a
# 1    2     b
# 2    3  <NA>

res = df.iloc[2]
res
# int       3
# str    <NA>
# Name: 2, dtype: object

res.dtype
# dtype('O')

res["int"]
# 3

type(res["int"])

print(df.T)
#      0  1     2
# int  1  2     3
# str  a  b  <NA>

df.T.dtypes
# 0    object
# 1    object
# 2    object
# dtype: object

So loc/iloc and transpose, and along with other axis=1 operations, on a heterogeneous dataframe of arrow only types, currently return a numpy array of Python objects. Numpy object dtype with pd.NA doesn't really work so well IMO.

How would we avoid numpy object dtype in a arrow only dtype system? Breaking API changes to the return types of iloc/loc? A nullable object dtype? having pyarrow scalars in the numpy array instead? ....

Dr-Irv · 2025-06-12T13:56:19Z

So loc/iloc and transpose, and along with other axis=1 operations, on a heterogeneous dataframe of arrow only types, currently return a numpy array of Python objects. Numpy object dtype with pd.NA doesn't really work so well IMO.

That happens with the nullable types as well:

>>> sa = pd.Series([1,2,pd.NA], dtype="Int64")
>>> sb = pd.Series([1.1, pd.NA, 2.2], dtype="Float64")
>>> sc = pd.Series([pd.NA, "v2", "v3"], dtype="string")
>>> df = pd.DataFrame({"a":sa, "b":sb, "c": sc})
>>> df
      a     b     c
0     1   1.1  <NA>
1     2  <NA>    v2
2  <NA>   2.2    v3
>>> df.T
      0     1     2
a     1     2  <NA>
b   1.1  <NA>   2.2
c  <NA>    v2    v3
>>> df.T.dtypes
0    object
1    object
2    object
dtype: object

And even with the numpy types:

>>> san = pd.Series([1,2,3], dtype="int")
>>> sbn = pd.Series([1.1, np.nan, 2.2], dtype="float")
>>> scn = pd.Series([np.nan, "v2", "v3"])
>>> dfn = pd.DataFrame({"a":san, "b":sbn, "c": scn})
>>> dfn.dtypes
a      int64
b    float64
c     object
dtype: object
>>> dfn.T
     0    1    2
a    1    2    3
b  1.1  NaN  2.2
c  NaN   v2   v3
>>> dfn.T.dtypes
0    object
1    object
2    object
dtype: object
>>> dfn.iloc[0]
a      1
b    1.1
c    NaN
Name: 0, dtype: object

So here in dfn.iloc[0], we have an integer, a float and np.nan, and we infer the dtype as object

jbrockmendel · 2025-06-12T16:39:35Z

in a pyarrow-mode world, what happens if i want to include a 3rd party EA column that isn't pyarrow-based?

simonjayhawkins · 2025-06-13T08:47:03Z

Personally, I would prefer that we go for the model chosen for the StringDtype, i.e. not necessarily the fact that there are multiple backends (or the na_value), but I mean the fact that we went for a pd.StringDtype() as the default dtype and not for pd.ArrowDtype(pa.large_string()).
We should make good use of the Arrow type system, but IMO we 1) should not necessarily expose all of its details to our users (e.g. string vs large_string vs string_view, but have a default "string" dtype (with potentially options for power users than want to customized it)), and so I think we should hide some implementation details of (Py)Arrow under a nice "pandas logical dtype" system (i.e. @WillAyd's PDEP), and 2) by having separate classes per "type" of dtype, we can also provide a better user experience (e.g. a datetime/timestamp dtype will have a timezone attribute, but numeric dtypes don't need this).

@jorisvandenbossche

can you elaborate on what work has actually been done on using PyArrow as a backend to the existing pandas nullable types.

Is it only the StringArray so far?

simonjayhawkins · 2025-06-13T09:24:11Z

I'd like to know how do you feel about trying to unify all the type system topics into a single PDEP that broadly defines a roadmap for the next steps and final goal.

@datapythonista

I think this may be a big ask at this time. I think that we may need compare 2 alternative directions to find the most preferred approach rather than trying to unify the two at this time. (I think that ultimately we would need to unify the two to keep cohesion within the team)

@mroeschke wrote

I would like a world where the Arrow type system with NA semantics would be the only pandas type system.

and from this discussion about considering PyArrow as the default, it could be fair to say that the idea is that if pandas were designed today, it would use Arrow for column storage.

Where we are today is that significant work has already been done to build new pandas types based on PyArrow, perhaps suggesting a future where pandas is underpinned by Arrow instead of legacy NumPy types.

AFAICT this work is mainly the work on the ArrowDType and apart from the StringDtype there has been no/little progress/interest in using PyArrow for storage in the pandas nullable types.

@mroeschke also wrote

Secondarily, I would like a world where pandas has the Arrow type system with NA semantics and the legacy (NaN) Numpy type system which is completely independent from the Arrow type system (e.g. a user cannot mix the two in any way)

interpreting @jorisvandenbossche comment, mixing the two was arguably a goal of PDEP-13. This effort stalled and so arguably would need to be resurrected if we did not instead consider a roadmap where the type systems were perhaps instead considered as completely independent. So applying @mroeschke comment to pandas nullable types as well - i..e a pandas nullable type system which is completely independent from the Arrow type system (e.g. a user cannot mix the two in any way)

simonjayhawkins · 2025-06-13T10:11:32Z

@mroeschke

My ideal roadplan would be:

In pandas 3.x

Deprecate the NumPy nullable types

We need to acknowledge that this would be contentious.

Deprecate NaN as a missing value for Arrow types

This would perhaps make the transition to ArrowDtype less straightforward for existing users but after the initial pain, the resulting type system will be more maintainable? Can you perhaps elaborate on this point?

Deprecate mixing Arrow types and Numpy types in any operation

I think this point is a good suggestion that we could definitely explore further as this is something that we would be more likely to get consensus on

Add some global configuration API, set_option("type_system", "legacy" | "pyarrow"), that configures the "default" type system as either NaN semantics with NumPy or NA semantics with Arrow (with the default being "legacy")

This suggestion would make sense to me, especially if we moved in the direction of better separation of the dtype systems. It would help potentially address the issues of complexity in transition paths and compatibility of types, the concerns you raised about PDEP-14 and the issues raised in PDEP-13

datapythonista · 2025-06-13T10:54:40Z

Thanks Simon for the feedback

I think this may be a big ask at this time

I think we agree on many things, and it's the details, and that discussing too many related topics at once makes the discussion very complex. What I'm proposing is to start with a PDEP with all the topics we can agree on, and a general roadmap that doesn't go too much into details.

Our vision is to move to a pandas that doesn't expose the underlying data storage, but so far it should move to using PyArrow? I don't think there have been strong opinions against this (maybe I misunderstood). If we can agree on this, that's a huge step in my opinion.

Do we consider pd.NA based missing values the ideal, and NaN something we consider good for backward compatibility? Feels like this is what most people believe (again, maybe I just misunderstood).

Should we have a global flag to enable the typing system that we consider unstable but closer to the "final" one we'd like to have? I think everybody agrees on this.

Should we use funds so we have progress in making the PyArrow types more stable and feature complete? I assume there is agreement. Should we start by numeric types, strings, dates and categorical maybe? I guess we can find an agreement on this.

So, I don't think it's a big ask. We can leave for other PDEPs any topic that happens to be too controversial at this point, and still agree on plenty of things. And if we have a document with a good enough roadmap, I think there will be big advantages:

You and Brock can start working with the funds we've already got in the top priorities
We will be able to have focused discussions, that don't need to deal with total uncertainty and keep reopening discussions all the time
Both pandas developers and users can manage their expectations and plan accordingly
We can use the roadmap to request funding, which at present day is not possible because we can't tell what we want to do long term with any level of detail

@jbrockmendel @jorisvandenbossche if any of you (or both) want to work with me in putting together a document with the things we consider we can find agreement with everybody at this point, I'd be happy to do that. Or if anyone else wants to work on this, I'm surely more than happy with it. I just think that Brock and Joris have besides the technical expertise quite a different point of view from mine, so working together may produce something that considers most opinions.

simonjayhawkins · 2025-06-13T12:19:14Z

@datapythonista writing a PDEP is potentially not so hard these days so long as the discussion has enough material and consensus to work with to produce a useable document for the initial draft.

Hence my reservation to delve into a PDEP "trying to unify all the type system topics into a single PDEP that broadly defines a roadmap for the next steps and final goal."

If you feel that there is enough agreement here to progress to a PDEP then my intention was not to hinder that.

To save you the effort of writing a PDEP we could instead perhaps at this stage get feedback on an AI generated one from the discussion so far in the first instance?

You will perhaps understand, after reading on, my comments regarding keeping discussions focused on the discussion title and that a future using nullable types should perhaps be in a different thread.

Below is a draft PDEP—Pandas Development Enhancement Proposal—that synthesizes the discussion from this GitHub thread into a coherent future direction for pandas. This draft aims to provide a roadmap for unifying pandas’ type system around a PyArrow-backed framework while offering an opt‑in transition path via a global configuration API.

PDEP‑XX: Unified Pandas Type System—Transitioning to PyArrow‑Backed Dtypes as Default

1. Abstract

This proposal outlines a phased roadmap to move pandas toward a unified type system that uses the PyArrow engine as its default backend. The primary goal is to modernize pandas by aligning column storage with a modern, cross‑language Arrow type system, thereby addressing longstanding challenges with missing value semantics and interoperability. In parallel, we will retain compatibility with the legacy NumPy‑based types during a multi‐release transition period via an explicit global configuration option. Ultimately, this migration will deliver a cleaner type system that supports both NA semantics and improved performance across operations.

2. Motivation

The current pandas architecture is rooted in NumPy‑centric types, which present several challenges:

Inconsistent missing value semantics: There is confusion between np.nan (legacy NumPy behavior) and pd.NA (pandas’ native missing value indicator).
Interoperability: With the rise of Apache Arrow as a cross‑language in‑memory format, it is natural to consider a backend that supports modern I/O and Arrow’s robust type system.
Long‑term vision: If pandas were designed from scratch, a PyArrow‑backed system would be the natural choice. Significant work on experimental ArrowDtypes (e.g., for strings, numerics, and categories) has already set the stage for this evolution.

This proposal seeks to establish a clear, step‑by‑step strategy for adopting PyArrow as the default while managing the vast code base and diverse user expectations.

3. Goals and Principles

Phased Transition: Allow current users to keep the legacy behavior while providing a controlled pathway for future migrations.
Global Configuration: Introduce a global flag (e.g., via pandas.options) that lets users toggle the “default” type system between the legacy (NumPy with NaN) and the new PyArrow‑backed system.
Unified Semantics: Eliminate ambiguities by standardizing on NA semantics (using pd.NA) for the new Arrow‑backed types.
Interoperability and Extensibility: Maintain strict boundaries between the new Arrow‑based types and the legacy types so that they are not inadvertently mixed in operations.
Backward Compatibility: Ensure that users who are not ready for the change can continue to use the legacy system by pinning to older pandas releases or explicitly setting the global option.

4. Proposal Details

4.1 Global Configuration Option

New Option: Introduce a configuration parameter, for example:

pandas.options.type_system = "legacy"  # alternatives: "legacy", "pyarrow"

Default Behavior: Initially, the default will be set to "legacy" to preserve current semantics. Users who wish to experiment or migrate early will set the option to "pyarrow".
Impact on Constructors and Functions: All constructors (e.g., pd.Series(), pd.DataFrame(), I/O functions) will consult this option. When "pyarrow" is selected, new columns are instantiated using Arrow-backed Extension Arrays based on the ArrowDtype.

4.2 Transition Roadmap

The migration path spans multiple major releases:

Pandas 3.x – Introduction and Deprecation Phase:
- Implementation:
  - Implement the new Arrow‑backed dtypes with complete adherence to the ExtensionArray/ExtensionDtype interface.
  - Integrate PyArrow as an optional backend for core dtypes (e.g., numeric, string, categorical, interval, period).
- Deprecation Warnings:
  - Begin deprecating legacy handling of missing values in the context of Arrow‑backed types (i.e., allow only pd.NA for missing values in this mode).
  - Warn users when mixing legacy and Arrow‑based types in operations, as mixing may lead to inconsistent results.
- Testing and Stability:
  - Ensure that comprehensive test suites verify expected behavior under both settings.
  - Identify and document any performance regressions (e.g., in axis‑1 operations) that need addressing.
Pandas 4.x – Warning and Migration Enforcement:
- More Aggressive Deprecations:
  - Emit warnings for legacy behavior in Arrow‑mode operations.
  - Gradually restrict mixing between the two type systems by raising warnings and, eventually, deprecation errors.
- Enhanced Documentation and Migration Guides:
  - Provide detailed guides on how to update code bases from using np.nan to pd.NA when operating with Arrow‑backed types.
  - Clarify behavior differences (e.g., transposition returning numpy object arrays) and propose workarounds.
Pandas 5.x – Full Transition:
- Default Switch:
  - Change the default global option value to "pyarrow".
  - Continue warning about legacy type usage, with planned removal of legacy handling in a future release.
- Removal of Legacy Defaults:
  - Eventually retire the legacy type system from core pandas (or spin it off into an external package for users who need it).
- Performance and Compatibility:
  - Finalize optimization efforts for vectorized operations, ensuring that the new system performs equivalently (or better) than the legacy system.

4.3 Handling Platform and Operational Concerns

Extension Array Consistency:
- Ensure that Arrow‑backed Extension Arrays (e.g., ArrowExtensionArray) implement all required operations without relying on Cython bridging for key operations.
Mixing Third‑Party Arrays:
- Define clear semantics for operations that involve a mix of PyArrow‑backed dtypes and third‑party extension types.
- Document these boundaries to avoid unexpected type coercions.
API Changes:
- Minimize breaking changes by providing interim deprecation cycles and detailed migration guides.
- Consider the use of explicit type constructors to allow fine‑grained control when necessary.

5. Implementation Considerations

Backward Compatibility:
- Users not yet ready for the migration can reliably use the legacy system by either pinning to an older release or explicitly setting pandas.options.type_system = "legacy".
Testing and Benchmarking:
- Develop rigorous benchmarks comparing numeric performance (e.g., axis‑1 operations, arithmetic) and ensure that the Arrow‑backed types are tested under real‑world workloads.
Collaboration with Funding and Community:
- Recognize that successful implementation depends on both core developer time and external contributions/funding.
- Establish a clear list of deliverables and milestones that can be revisited as new contributions are secured.
Documentation and Communication:
- Engage with the wide pandas user community (via conferences, mailing lists, GitHub discussions) to gather continuous feedback throughout the transition.

6. Risks and Mitigation

Potential for User Confusion:
- Mitigation: Clear documentation, comprehensive deprecation warnings, and migration guides.
Performance Regressions:
- Mitigation: Optimize and test critical operators; provide opt‑in for early adapters to help surface issues before full adoption.
Inadequate Funding or Development Resources:
- Mitigation: Maintain backwards compatibility through the global configuration option and consider spinning the legacy system as an independent package if necessary.

7. Final Remarks and Call for Community Involvement

This proposal represents a holistic vision for a more modern, consistent pandas type system—one that leverages PyArrow’s strengths for internal storage and interoperability while allowing a gradual transition from legacy practices. The outlined roadmap acknowledges the enormous efforts already invested, respects the need for backward compatibility, and offers a clear path for continuous evolution. We invite further feedback and encourage interested core developers and community members to help refine and eventually lead this transition.

Additional Considerations:

Example Usage:
A simple demonstration of how the global configuration affects DataFrame construction:

import pandas as pd
pd.options.type_system = "pyarrow"  # switch to the new system

df = pd.DataFrame({
    "integers": pd.Series([1, 2, pd.NA], dtype=pd.ArrowDtype("int64")),
    "strings": pd.Series(["a", "b", pd.NA], dtype=pd.ArrowDtype("string"))
})
print(df)

Community Roadmap Sessions:
Organizing virtual roadmap meetings or hackathons might catalyze development efforts, address outstanding technical challenges (such as object dtype behavior during transposition), and further synchronize community priorities.
Future Research:
The proposal opens further inquiry into using Arrow’s capabilities (such as zero‑copy transformations and interoperability with other Arrow‑based libraries) within pandas. Investigating these avenues could maximize performance and broaden the library’s ecosystem reach.

This draft is intended as a starting point for discussion. Feedback—especially concrete examples from real‑world usage and performance benchmarks—will be vital to refine and ensure successful adoption of this new direction for pandas.

jbrockmendel · 2025-06-13T14:36:41Z

Simon and Marc, please slow down. I woke up to at least ten screens of text from you two. Every bit of it means more to catch up on for the people who haven’t participated yet.

…

On Fri, Jun 13, 2025 at 5:19 AM Simon Hawkins ***@***.***> wrote: *simonjayhawkins* left a comment (pandas-dev/pandas#61618) <#61618 (comment)> @datapythonista <https://github.com/datapythonista> writing a PDEP is potentially not so hard these days so long as the discussion has enough material and consensus to work with to produce a useable document for the initial draft. Hence my reservation to delve into a PDEP "trying to unify all the type system topics into a single PDEP that broadly defines a roadmap for the next steps and final goal." If you feel that there is enough agreement here to progress to a PDEP then my intention was not to hinder that. To save you the effort of writing a PDEP we could instead perhaps at this stage get feedback on an AI generated one from the discussion so far in the first instance? You will perhaps understand, after reading on, my comments regarding keeping discussions focused on the discussion title and that a future using nullable types should perhaps be in a different thread. Below is a draft PDEP—Pandas Development Enhancement Proposal—that synthesizes the discussion from this GitHub thread into a coherent future direction for pandas. This draft aims to provide a roadmap for unifying pandas’ type system around a PyArrow-backed framework while offering an opt‑in transition path via a global configuration API. ------------------------------ PDEP‑XX: Unified Pandas Type System—Transitioning to PyArrow‑Backed Dtypes as Default 1. Abstract This proposal outlines a phased roadmap to move pandas toward a unified type system that uses the PyArrow engine as its default backend. The primary goal is to modernize pandas by aligning column storage with a modern, cross‑language Arrow type system, thereby addressing longstanding challenges with missing value semantics and interoperability. In parallel, we will retain compatibility with the legacy NumPy‑based types during a multi‐release transition period via an explicit global configuration option. Ultimately, this migration will deliver a cleaner type system that supports both NA semantics and improved performance across operations. 2. Motivation The current pandas architecture is rooted in NumPy‑centric types, which present several challenges: - *Inconsistent missing value semantics:* There is confusion between np.nan (legacy NumPy behavior) and pd.NA (pandas’ native missing value indicator). - *Interoperability:* With the rise of Apache Arrow as a cross‑language in‑memory format, it is natural to consider a backend that supports modern I/O and Arrow’s robust type system. - *Long‑term vision:* If pandas were designed from scratch, a PyArrow‑backed system would be the natural choice. Significant work on experimental ArrowDtypes (e.g., for strings, numerics, and categories) has already set the stage for this evolution. This proposal seeks to establish a clear, step‑by‑step strategy for adopting PyArrow as the default while managing the vast code base and diverse user expectations. 3. Goals and Principles - *Phased Transition:* Allow current users to keep the legacy behavior while providing a controlled pathway for future migrations. - *Global Configuration:* Introduce a global flag (e.g., via pandas.options) that lets users toggle the “default” type system between the legacy (NumPy with NaN) and the new PyArrow‑backed system. - *Unified Semantics:* Eliminate ambiguities by standardizing on NA semantics (using pd.NA) for the new Arrow‑backed types. - *Interoperability and Extensibility:* Maintain strict boundaries between the new Arrow‑based types and the legacy types so that they are not inadvertently mixed in operations. - *Backward Compatibility:* Ensure that users who are not ready for the change can continue to use the legacy system by pinning to older pandas releases or explicitly setting the global option. 4. Proposal Details 4.1 Global Configuration Option - *New Option:* Introduce a configuration parameter, for example: pandas.options.type_system = "legacy" # alternatives: "legacy", "pyarrow" - *Default Behavior:* Initially, the default will be set to "legacy" to preserve current semantics. Users who wish to experiment or migrate early will set the option to "pyarrow". - *Impact on Constructors and Functions:* All constructors (e.g., pd.Series(), pd.DataFrame(), I/O functions) will consult this option. When "pyarrow" is selected, new columns are instantiated using Arrow-backed Extension Arrays based on the ArrowDtype. 4.2 Transition Roadmap The migration path spans multiple major releases: 1. *Pandas 3.x – Introduction and Deprecation Phase:* - *Implementation:* - Implement the new Arrow‑backed dtypes with complete adherence to the ExtensionArray/ExtensionDtype interface. - Integrate PyArrow as an optional backend for core dtypes (e.g., numeric, string, categorical, interval, period). - *Deprecation Warnings:* - Begin deprecating legacy handling of missing values in the context of Arrow‑backed types (i.e., allow only pd.NA for missing values in this mode). - Warn users when mixing legacy and Arrow‑based types in operations, as mixing may lead to inconsistent results. - *Testing and Stability:* - Ensure that comprehensive test suites verify expected behavior under both settings. - Identify and document any performance regressions (e.g., in axis‑1 operations) that need addressing. 2. *Pandas 4.x – Warning and Migration Enforcement:* - *More Aggressive Deprecations:* - Emit warnings for legacy behavior in Arrow‑mode operations. - Gradually restrict mixing between the two type systems by raising warnings and, eventually, deprecation errors. - *Enhanced Documentation and Migration Guides:* - Provide detailed guides on how to update code bases from using np.nan to pd.NA when operating with Arrow‑backed types. - Clarify behavior differences (e.g., transposition returning numpy object arrays) and propose workarounds. 3. *Pandas 5.x – Full Transition:* - *Default Switch:* - Change the default global option value to "pyarrow". - Continue warning about legacy type usage, with planned removal of legacy handling in a future release. - *Removal of Legacy Defaults:* - Eventually retire the legacy type system from core pandas (or spin it off into an external package for users who need it). - *Performance and Compatibility:* - Finalize optimization efforts for vectorized operations, ensuring that the new system performs equivalently (or better) than the legacy system. 4.3 Handling Platform and Operational Concerns - *Extension Array Consistency:* - Ensure that Arrow‑backed Extension Arrays (e.g., ArrowExtensionArray) implement all required operations without relying on Cython bridging for key operations. - *Mixing Third‑Party Arrays:* - Define clear semantics for operations that involve a mix of PyArrow‑backed dtypes and third‑party extension types. - Document these boundaries to avoid unexpected type coercions. - *API Changes:* - Minimize breaking changes by providing interim deprecation cycles and detailed migration guides. - Consider the use of explicit type constructors to allow fine‑grained control when necessary. 5. Implementation Considerations - *Backward Compatibility:* - Users not yet ready for the migration can reliably use the legacy system by either pinning to an older release or explicitly setting pandas.options.type_system = "legacy". - *Testing and Benchmarking:* - Develop rigorous benchmarks comparing numeric performance (e.g., axis‑1 operations, arithmetic) and ensure that the Arrow‑backed types are tested under real‑world workloads. - *Collaboration with Funding and Community:* - Recognize that successful implementation depends on both core developer time and external contributions/funding. - Establish a clear list of deliverables and milestones that can be revisited as new contributions are secured. - *Documentation and Communication:* - Engage with the wide pandas user community (via conferences, mailing lists, GitHub discussions) to gather continuous feedback throughout the transition. 6. Risks and Mitigation - *Potential for User Confusion:* - *Mitigation:* Clear documentation, comprehensive deprecation warnings, and migration guides. - *Performance Regressions:* - *Mitigation:* Optimize and test critical operators; provide opt‑in for early adapters to help surface issues before full adoption. - *Inadequate Funding or Development Resources:* - *Mitigation:* Maintain backwards compatibility through the global configuration option and consider spinning the legacy system as an independent package if necessary. 7. Final Remarks and Call for Community Involvement This proposal represents a holistic vision for a more modern, consistent pandas type system—one that leverages PyArrow’s strengths for internal storage and interoperability while allowing a gradual transition from legacy practices. The outlined roadmap acknowledges the enormous efforts already invested, respects the need for backward compatibility, and offers a clear path for continuous evolution. We invite further feedback and encourage interested core developers and community members to help refine and eventually lead this transition. ------------------------------ *Additional Considerations:* - *Example Usage:* A simple demonstration of how the global configuration affects DataFrame construction: import pandas as pdpd.options.type_system = "pyarrow" # switch to the new system df = pd.DataFrame({ "integers": pd.Series([1, 2, pd.NA], dtype=pd.ArrowDtype("int64")), "strings": pd.Series(["a", "b", pd.NA], dtype=pd.ArrowDtype("string")) })print(df) - *Community Roadmap Sessions:* Organizing virtual roadmap meetings or hackathons might catalyze development efforts, address outstanding technical challenges (such as object dtype behavior during transposition), and further synchronize community priorities. - *Future Research:* The proposal opens further inquiry into using Arrow’s capabilities (such as zero‑copy transformations and interoperability with other Arrow‑based libraries) within pandas. Investigating these avenues could maximize performance and broaden the library’s ecosystem reach. This draft is intended as a starting point for discussion. Feedback—especially concrete examples from real‑world usage and performance benchmarks—will be vital to refine and ensure successful adoption of this new direction for pandas. — Reply to this email directly, view it on GitHub <#61618 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AB5UM6BPTAWEKZQLQRZ5VD33DK6VRAVCNFSM6AAAAAB66AB4EKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDSNZQGIYTQMBQHA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

Dr-Irv · 2025-06-13T15:33:15Z

Below is a draft PDEP—Pandas Development Enhancement Proposal—that synthesizes the discussion from this GitHub thread into a coherent future direction for pandas. This draft aims to provide a roadmap for unifying pandas’ type system around a PyArrow-backed framework while offering an opt‑in transition path via a global configuration API.

Have to admit that this draft looks pretty good as a starting point. Nice application of AI generated content

datapythonista added the Needs Discussion Requires discussion from core team before further action label Jun 9, 2025

datapythonista mentioned this issue Jun 9, 2025

PDEP-18: Nullable Object Dtype #61599

Open

mroeschke added the Arrow pyarrow functionality label Jun 9, 2025

Uh oh!

Moving to PyArrow dtypes by default #61618

Moving to PyArrow dtypes by default #61618

Comments

datapythonista commented Jun 9, 2025

jbrockmendel commented Jun 9, 2025

Uh oh!

mroeschke commented Jun 9, 2025

Uh oh!

jbrockmendel commented Jun 10, 2025

Uh oh!

datapythonista commented Jun 10, 2025

Uh oh!

mroeschke commented Jun 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

datapythonista commented Jun 10, 2025

Uh oh!

simonjayhawkins commented Jun 10, 2025

Uh oh!

simonjayhawkins commented Jun 10, 2025

Uh oh!

simonjayhawkins commented Jun 10, 2025

Uh oh!

simonjayhawkins commented Jun 10, 2025

Uh oh!

WillAyd commented Jun 10, 2025

Uh oh!

mroeschke commented Jun 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

simonjayhawkins commented Jun 10, 2025

Uh oh!

simonjayhawkins commented Jun 11, 2025

Uh oh!

simonjayhawkins commented Jun 11, 2025

Uh oh!

Dr-Irv commented Jun 11, 2025

Uh oh!

WillAyd commented Jun 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

datapythonista commented Jun 11, 2025

Uh oh!

mroeschke commented Jun 11, 2025

Uh oh!

jbrockmendel commented Jun 11, 2025

Uh oh!

jbrockmendel commented Jun 11, 2025

Uh oh!

datapythonista commented Jun 11, 2025

Uh oh!

jbrockmendel commented Jun 11, 2025

Uh oh!

jorisvandenbossche commented Jun 11, 2025

Uh oh!

simonjayhawkins commented Jun 11, 2025

Uh oh!

datapythonista commented Jun 11, 2025

Uh oh!

mroeschke commented Jun 11, 2025

Uh oh!

simonjayhawkins commented Jun 12, 2025

Uh oh!

simonjayhawkins commented Jun 12, 2025

Uh oh!

Dr-Irv commented Jun 12, 2025

Uh oh!

jbrockmendel commented Jun 12, 2025

Uh oh!

simonjayhawkins commented Jun 13, 2025

Uh oh!

simonjayhawkins commented Jun 13, 2025

Uh oh!

simonjayhawkins commented Jun 13, 2025

Uh oh!

datapythonista commented Jun 13, 2025

Uh oh!

simonjayhawkins commented Jun 13, 2025

PDEP‑XX: Unified Pandas Type System—Transitioning to PyArrow‑Backed Dtypes as Default

mroeschke commented Jun 10, 2025 •

edited

Loading

mroeschke commented Jun 10, 2025 •

edited

Loading

WillAyd commented Jun 11, 2025 •

edited

Loading