This repository was archived by the owner on Aug 8, 2024. It is now read-only.
Unable to pickle parsed output #27
Open
Description
I'm trying to do some multiprocess/distributed processing of Apache logs, which relies on pickle to serialize and deserialize data as it moves between scheduler and worker processes.
However, deserialization fails on the parsed outputs, in my case specifically on time_received_tz_datetimeobj and time_received_utc_datetimeobj, for input strings like:
import apache_log_parser
import pickle
mylist = ['157.55.39.31 - - [21/Mar/2019:07:56:41 +0000] "GET / HTTP/1.1" 200 6878 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
'40.77.167.37 - - [21/Mar/2019:07:59:11 +0000] "GET / HTTP/1.1" 301 469 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"'
]
logparser = apache_log_parser.make_parser('%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"')
parsed_logline = logparser(mylist[0])
_ = pickle.dumps(parsed_logline)
# this causes error:
pickle.loads(_)
(This is on Python 3.6.6 and apache-log-parser 1.7.0, by the way.)
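For anyone trying to narrow this down without the parser installed, here is a minimal sketch of what I suspect is going on. It rests on an assumption I have not verified against the parser's source: that the tzinfo object attached to the parsed datetimes is a tzinfo subclass whose __init__ requires an argument. The Python datetime docs note that such a subclass can be pickled but possibly not unpickled, because unpickling calls the class with no arguments. RequiredArgOffset and roundtrip_error below are hypothetical names made up for this illustration, not part of apache_log_parser.

```python
import pickle
from datetime import datetime, timedelta, timezone, tzinfo


class RequiredArgOffset(tzinfo):
    """Hypothetical stand-in: a tzinfo subclass whose __init__ needs an argument."""

    def __init__(self, name):
        self._name = name

    def utcoffset(self, dt):
        return timedelta(0)

    def tzname(self, dt):
        return self._name

    def dst(self, dt):
        return timedelta(0)


def roundtrip_error(obj):
    """Return the TypeError raised by a pickle round trip, or None on success."""
    data = pickle.dumps(obj)  # pickling itself succeeds
    try:
        pickle.loads(data)  # unpickling calls the tzinfo class with no arguments
    except TypeError as exc:
        return exc
    return None


# A datetime carrying the custom tzinfo versus one carrying the stdlib UTC zone.
custom = datetime(2019, 3, 21, 7, 56, 41, tzinfo=RequiredArgOffset("0000"))
stdlib = datetime(2019, 3, 21, 7, 56, 41, tzinfo=timezone.utc)

print("custom tzinfo:", roundtrip_error(custom))
print("stdlib tzinfo:", roundtrip_error(stdlib))
```

If the parser's tzinfo class behaves like this stand-in, giving its __init__ a default argument would be one way to make unpickling work, but that is a guess on my part.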
I can fix this in my implementation by converting the '0000' timezone to UTC:
import datetime

def to_utc(datetimeobj):
    # The parser's tzinfo for a "+0000" offset has the string form "'0000'".
    if str(datetimeobj.tzinfo) == "'0000'":
        return datetimeobj.astimezone(datetime.timezone.utc)
    else:
        return datetimeobj
parsed_logline['time_received_tz_datetimeobj'] = to_utc(parsed_logline['time_received_tz_datetimeobj'])
parsed_logline['time_received_utc_datetimeobj'] = to_utc(parsed_logline['time_received_utc_datetimeobj'])
But this seems like something more appropriate to do in the parser. That said, I'm not sure if this would break backwards compatibility with other Python versions.
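A slightly more general variant of the workaround above, in case it helps others: instead of special-casing '0000', it replaces whatever tzinfo is attached with a standard-library datetime.timezone carrying the same UTC offset, so non-UTC log lines keep their offset. This is only a sketch. with_stdlib_tz and make_picklable are hypothetical helper names, and it assumes the parsed output behaves like a plain dict whose problematic values are datetime objects.

```python
import pickle
from datetime import datetime, timedelta, timezone


def with_stdlib_tz(dt):
    """Return dt with its tzinfo replaced by an equivalent datetime.timezone."""
    offset = dt.utcoffset()
    if offset is None:  # naive datetime: nothing to replace
        return dt
    return dt.replace(tzinfo=timezone(offset))


def make_picklable(record):
    """Copy a dict, normalizing every datetime value (hypothetical helper)."""
    return {
        key: with_stdlib_tz(value) if isinstance(value, datetime) else value
        for key, value in record.items()
    }


# Stand-in for a parsed log line; a real one would come from the parser.
record = {
    "status": "200",
    "time_received_tz_datetimeobj": datetime(
        2019, 3, 21, 7, 56, 41, tzinfo=timezone(timedelta(hours=1))
    ),
}

restored = pickle.loads(pickle.dumps(make_picklable(record)))
print(restored["time_received_tz_datetimeobj"].isoformat())
```

Because only the tzinfo object is swapped and the wall-clock fields are left alone, the normalized datetime compares equal to the original one.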