Conversation

@hmaarrfk (Contributor) commented Jan 1, 2024

closes #98

@mverleg (Owner) left a comment

It would be nice to update the readme. I'm happy to do it myself if you prefer.
It would be nice to update the readme. I'm happy to do it myself if you prefer.

else:
return _scalar_to_numpy(data_json, nptype)
# This code path is mostly for 0-dimensional arrays
# numpy scalars are separately decoded
@mverleg (Owner):
What about scalars that were serialized with encode_scalars_inplace before these changes?

elif isinstance(obj, scalar_types):
dct = hashodict((
('__numpy_scalar__', obj.item()),
('dtype', str(obj.dtype)),
@mverleg (Owner):

This new approach seems great and indeed in an ideal world it would work perfectly.

...

However, it was found in issue #18 that Python treats some numpy scalars as primitives, and refuses to call encoders for them (presumably for performance).

Which ones depend on the Python version, for extra confusion, although I guess Python 2 is less important now.

In any case, this function in Python 3 works for a lot of types, but not for float64, which is an important one. There are two concerns with this:

  • While keeping the numpy scalar type is better in general, doing it half the time seems like it adds more confusion than it's worth.
  • We've been doing it this way; it'd be a (slightly) breaking change to start doing it differently.

I think we'll need to think a bit more about this; maybe make it opt-in, or skip the scalars and handle 0-dimensional arrays only, if that's possible.
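[Editor's note] The fast-path behavior described above can be demonstrated without numpy: a pure-Python `float` subclass (standing in for `np.float64`, which also subclasses Python's `float`) is serialized by `json.dumps` without the `default` hook ever being called. A minimal sketch, using a hypothetical `F64` stand-in class:

```python
import json

class F64(float):
    """Stand-in for np.float64, which is also a subclass of float."""

calls = []

def default(obj):
    # In pyjson_tricks this would wrap numpy scalars in a
    # {"__numpy_scalar__": ..., "dtype": ...} dict; here we just record calls.
    calls.append(type(obj).__name__)
    return float(obj)

out = json.dumps({"x": F64(1.5)}, default=default)
print(out)    # {"x": 1.5} -- emitted via the plain-float fast path
print(calls)  # [] -- default() was never invoked for the float subclass
```

Because the encoder's type check is `isinstance(obj, float)`, subclass instances never reach `default`, which is exactly why `float64` comes out bare while `float32` gets wrapped.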

@hmaarrfk (Contributor, Author):

omg, I feel so ashamed not to have tested float64....

I literally tested uint8 through uint64, int8 through int64, datetime64s, and float32.... but not float64....

For reference, this is the current output:

{
    "uint32": {
        "__numpy_scalar__": 1,
        "dtype": "uint32"
    },
    "int32": {
        "__numpy_scalar__": 1,
        "dtype": "int32"
    },
    "float32": {
        "__numpy_scalar__": 1.0,
        "dtype": "float32"
    },
    "float64": 1.0,
    "datetime64[ns]": {
        "__numpy_scalar__": 1704235669639528000,
        "dtype": "datetime64[ns]"
    },
    "datetime64[us]": {
        "__numpy_scalar__": {
            "__datetime__": null,
            "year": 2024,
            "month": 1,
            "day": 2,
            "hour": 22,
            "minute": 47,
            "second": 49,
            "microsecond": 639546
        },
        "dtype": "datetime64[us]"
    }
}

@hmaarrfk hmaarrfk force-pushed the serialize_scalars branch 3 times, most recently from 0061bd4 to 3f9a396 Compare January 3, 2024 03:10
@hmaarrfk (Contributor, Author) commented Jan 3, 2024

The test you added in ac5c8be is in direct conflict with the spirit here of serializing numpy dtypes in a round-trip way.

@mverleg (Owner) commented Jan 4, 2024

The test you added in ac5c8be is in direct conflict with the spirit here of serializing numpy dtypes in a round-trip way.

Yeah I get that, but I think it could be useful to preserve backwards compatibility (and test to ensure we do so). That's what I meant in the previous review.

We've been doing it this way; it'd be a (slightly) breaking change to start doing it differently.

Maybe it's useful enough to break compatibility but I'm not sure. I guess people who really like the old format could use the primitives=True flag. I'll think about it.

@mverleg (Owner) commented Jan 26, 2024

I think it's best to keep this one for consideration in version 4.0 because of the backwards-compatibility concerns.

@mverleg mverleg added the v4.0 label Jan 26, 2024
@hmaarrfk (Contributor, Author):
Honestly, I think the challenges with float vs np.float64 occur because the default json encoder is used as a base. If you want, we can take a heavy-handed approach, effectively copying _make_iterencode and adding a stricter check against the float type.
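[Editor's note] A sketch of that heavy-handed approach (not pyjson_tricks code; `F64` stands in for `np.float64` and `default` is a hypothetical scalar handler): a recursive encoder that uses a strict `type(obj) is float` check, so float subclasses are routed through the handler instead of the plain-float fast path.

```python
import json

def encode(obj, default):
    """Recursively encode obj, but use strict type checks so that float
    *subclasses* reach `default` instead of the plain-float fast path
    (unlike the isinstance check in json's _make_iterencode)."""
    if type(obj) in (float, int, str, bool) or obj is None:
        return json.dumps(obj)
    if isinstance(obj, float):  # a subclass such as np.float64
        return encode(default(obj), default)
    if isinstance(obj, dict):
        return "{" + ", ".join(
            json.dumps(str(k)) + ": " + encode(v, default)
            for k, v in obj.items()) + "}"
    if isinstance(obj, (list, tuple)):
        return "[" + ", ".join(encode(v, default) for v in obj) + "]"
    return encode(default(obj), default)

class F64(float):
    """Stand-in for np.float64."""

def default(obj):
    # hypothetical handler mirroring the __numpy_scalar__ wrapping above
    if isinstance(obj, float):
        return {"__numpy_scalar__": float(obj), "dtype": "float64"}
    raise TypeError(type(obj))

encoded = encode({"x": F64(1.0), "y": 2.0}, default)
```

Under this scheme the `F64` value gets wrapped while the plain `2.0` stays bare, which is the consistency the thread is after; the cost is reimplementing (and maintaining) the encode loop.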

Ultimately, you could "solve" this issue by switching to another JSON serializer, such as orjson.

Your choice.

This isn't a deal-breaker for me at this exact moment; however, round-tripping is quite important to us, and as the challenges increase this might make me look elsewhere as a user.

Understanding your timeline for 4.0 would help us decide whether to stick with pyjson_tricks and continue to contribute to this wonderful project!

@hmaarrfk (Contributor, Author):
This came up again as we were reviewing our test suite and trying to clean up our warnings. It seems the main thing left is deciding how to deal with:

Maybe it's useful enough to break compatibility but I'm not sure. I guess people who really like the old format could use the primitives=True flag. I'll think about it.

Is that in fact the only outstanding issue? It might be possible, if I create enough test cases, to find a pattern for how to address these old datasets.
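[Editor's note] For what it's worth, the two formats can be distinguished mechanically on the decode side: new-format scalars arrive as a `{"__numpy_scalar__": ..., "dtype": ...}` dict, while old-format ones are plain JSON primitives and pass through an `object_hook` untouched. A stdlib-only sketch (with a `(value, dtype)` placeholder standing in for the real `np.dtype(...).type(...)` reconstruction):

```python
import json

def scalar_hook(dct):
    """object_hook for json.loads: restore new-format numpy scalars.
    With numpy available this would return np.dtype(dct["dtype"]).type(value);
    here a (value, dtype) pair stands in as a placeholder."""
    if set(dct) == {"__numpy_scalar__", "dtype"}:
        return (dct["__numpy_scalar__"], dct["dtype"])
    return dct

new_style = '{"x": {"__numpy_scalar__": 1.0, "dtype": "float32"}}'
old_style = '{"x": 1.0}'  # pre-change output: the scalar was stored bare

decoded_new = json.loads(new_style, object_hook=scalar_hook)
decoded_old = json.loads(old_style, object_hook=scalar_hook)
```

The asymmetry, of course, is that old-format data cannot recover its dtype; a tolerant decoder like this preserves backwards compatibility but can only round-trip data written in the new format.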

Merging this pull request may close: Roundtripping inconsistent for objects of shape=() and numpy scalars...