Description
Describe the Bug
In get_legistar_content_uris, the BeautifulSoup lookup that populates extract_url from the page at a Legistar Event URL (legistar_ev[LEGISTAR_EV_SITE_URL]) misses available video links in some circumstances.
Expected Behavior
The City of Olympia has what appears to be a pretty standard Legistar implementation including Granicus-hosted media files. So I was surprised when the stock get_content_uris call didn't result in matches.
Here's an example Olympia Planning Commission event detail screen and the corresponding valid "Media" anchor tag:
<a id="ctl00_ContentPlaceHolder1_gridMain_ctl00_ctl06_hypVideo" onclick="window.open('Video.aspx?Mode=Granicus&ID1=1536&ID2=120417&G=19510D34-31FB-48B8-9C02-4D026953451C&Mode2=Video','video');return false;" href="#" style="color:Blue;font-family:Tahoma;font-size:10pt;">Media</a>
I identified three potential issues which could be addressed while hopefully not impacting existing matches in the wild. Here's the operative CDP code:
extract_url = soup.find(
    "a",
    id=re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo"),
    class_="videolink",
)
if extract_url is None:
    return (ContentUriScrapeResult.Status.UnrecognizedPatternError, None)
# the <a> tag will not have this attribute if there is no video
if "onclick" not in extract_url.attrs:
    return (ContentUriScrapeResult.Status.ContentNotProvidedError, None)
- videolink class - City of Olympia Media links do not have a videolink class assigned. Is this a requirement to differentiate links on other Legistar instances, or is the highly specific ID enough?
- find only identifies the first Media link instance - and in the example provided, the first Media link is not associated with a video, therefore resulting in a failure for the entire event. You could do a find_all and iterate through, but a different approach might be...
- onclick is a distinguishing attribute - while checked subsequently to provide a unique error, we could test for the presence of the onclick attribute to more quickly identify a valid Media link.
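The three points above can be reproduced outside the scraper with a small sketch. The markup below is a hypothetical, trimmed-down stand-in for an event detail page (the first Media anchor has no onclick, the second mirrors the Olympia anchor quoted earlier), and the variable names are illustrative only, not part of cdp-scrapers.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample page: the first Media link carries no video (no onclick),
# the second one does. Neither anchor has a "videolink" class.
SAMPLE_HTML = """
<table>
  <tr><td><a id="ctl00_ContentPlaceHolder1_gridMain_ctl00_ctl04_hypVideo">Media</a></td></tr>
  <tr><td><a id="ctl00_ContentPlaceHolder1_gridMain_ctl00_ctl06_hypVideo"
     onclick="window.open('Video.aspx?Mode=Granicus&amp;ID1=1536&amp;ID2=120417&amp;G=19510D34-31FB-48B8-9C02-4D026953451C&amp;Mode2=Video','video');return false;"
     href="#">Media</a></td></tr>
</table>
"""

# Same ID pattern as the code quoted in this issue.
VIDEO_ID = re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo")

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Issue 1: requiring class_="videolink" matches nothing on this page.
with_class = soup.find("a", id=VIDEO_ID, class_="videolink")

# Issue 2: without the class filter, find() stops at the first anchor,
# which has no onclick and would be reported as "content not provided".
first_by_id = soup.find("a", id=VIDEO_ID)

# Issue 3: filtering on the presence of onclick skips straight to a usable link.
with_onclick = soup.find("a", id=VIDEO_ID, onclick=True)

# Alternative from the second point: collect every candidate and filter.
all_with_onclick = [
    a for a in soup.find_all("a", id=VIDEO_ID) if "onclick" in a.attrs
]

print(with_class)                       # None
print("onclick" in first_by_id.attrs)   # False
print(with_onclick["id"])
print(len(all_with_onclick))
```

On this sample, only the onclick filter and the find_all variant reach the anchor that carries the Granicus video reference. Whether other Legistar instances rely on the videolink class is not something this sketch can answer.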
Here's how I suggest modifying the code:
extract_url = soup.find(
    "a",
    id=re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo"),
    onclick=True,
)
if extract_url is None:
    return (ContentUriScrapeResult.Status.UnrecognizedPatternError, None)
Reproduction
You can see where the Event Gather workflow is failing on the cdp-usa-wa-olympia instance here; while not specifically pointing out this issue, this is the next hiccup:
https://github.com/CannObserv/cdp-usa-wa-city-olympia/actions/runs/6999433306/job/19038863304
If this change isn't apt to break anything, I'd much rather change things here than have to derive a dedicated scraper class (at least not yet) and then override get_legistar_content_uris in that file. I'm not sure how to get the Python import hierarchy to respect an override otherwise.
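For reference, the override route mentioned above would not normally need any import tricks if the scraper class calls the helper through one of its own methods: a subclass can then replace that method. The sketch below uses stand-in classes only. BaseLegistarScraper, OlympiaScraper, the "site_url" key, and the function bodies are all hypothetical, and the real cdp-scrapers class and method names would need to be checked against the library before relying on this.

```python
from typing import Any, Dict, List


def get_legistar_content_uris(client: str, legistar_ev: Dict[str, Any]) -> List[str]:
    """Stand-in for a module-level helper; here it never finds a video."""
    return []


class BaseLegistarScraper:
    """Stand-in for a library base class that delegates to the helper."""

    def __init__(self, client: str) -> None:
        self.client = client

    def get_content_uris(self, legistar_ev: Dict[str, Any]) -> List[str]:
        # Default behavior: call the module-level helper.
        return get_legistar_content_uris(self.client, legistar_ev)

    def get_events(self, events: List[Dict[str, Any]]) -> List[List[str]]:
        # self.get_content_uris is resolved at call time, so a subclass
        # override is picked up here without touching any imports.
        return [self.get_content_uris(ev) for ev in events]


class OlympiaScraper(BaseLegistarScraper):
    """Hypothetical dedicated scraper that replaces only the URI lookup."""

    def get_content_uris(self, legistar_ev: Dict[str, Any]) -> List[str]:
        # Placeholder logic; a real override would parse the event page.
        return [f"custom:{legistar_ev['site_url']}"]


events = [{"site_url": "https://example.invalid/event/1"}]
base_result = BaseLegistarScraper("olympia").get_events(events)
custom_result = OlympiaScraper("olympia").get_events(events)
print(base_result)
print(custom_result)
```

The design point is only that the override lives on the class, so callers that go through the instance pick it up automatically. It does not change what code that imports the module-level function directly will see.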
Environment
- OS Version: [e.g. macOS 11.3.1]
- cdp-scrapers Version: [e.g. 0.5.0]