get_legistar_content_uris initial media URL extraction inaccurate for some events #145

@gregoryfoster


Describe the Bug

In get_legistar_content_uris, the BeautifulSoup code that locates the video link (extract_url) on a Legistar event page (legistar_ev[LEGISTAR_EV_SITE_URL]) misses available video links in some circumstances.

Expected Behavior

The City of Olympia has what appears to be a pretty standard Legistar implementation including Granicus-hosted media files. So I was surprised when the stock get_content_uris call didn't result in matches.

Here's an example Olympia Planning Commission event detail screen and the corresponding valid "Media" anchor tag:

<a id="ctl00_ContentPlaceHolder1_gridMain_ctl00_ctl06_hypVideo" onclick="window.open('Video.aspx?Mode=Granicus&amp;ID1=1536&amp;ID2=120417&amp;G=19510D34-31FB-48B8-9C02-4D026953451C&amp;Mode2=Video','video');return false;" href="#" style="color:Blue;font-family:Tahoma;font-size:10pt;">Media</a>
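To make clear what this anchor carries, here is a minimal sketch that pulls the relative video path and its query parameters out of the onclick value using only the standard library. The helper name video_path_from_onclick and the regular expression are my own illustration, not code from cdp-scrapers, and the onclick string is shown as an HTML parser would return it (with &amp; already decoded to &).

```python
import re
from urllib.parse import parse_qs, urlsplit

# onclick value from the Olympia "Media" anchor above, entities already decoded
ONCLICK = (
    "window.open('Video.aspx?Mode=Granicus&ID1=1536&ID2=120417"
    "&G=19510D34-31FB-48B8-9C02-4D026953451C&Mode2=Video','video');return false;"
)


def video_path_from_onclick(onclick):
    """Hypothetical helper: return the first argument of window.open(...), or None."""
    match = re.search(r"window\.open\('([^']+)'", onclick)
    return match.group(1) if match else None


path = video_path_from_onclick(ONCLICK)
# The path is relative to the Legistar site, so only the query string is parsed here
params = parse_qs(urlsplit(path).query)
print(path)
print(params["ID1"], params["ID2"], params["G"])
```

The point is only that the onclick handler is what holds the usable Granicus reference; the href="#" on the same anchor carries nothing.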

I identified three potential issues which could be addressed while hopefully not impacting existing matches in the wild. Here's the operative CDP code:

    extract_url = soup.find(
        "a",
        id=re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo"),
        class_="videolink",
    )
    if extract_url is None:
        return (ContentUriScrapeResult.Status.UnrecognizedPatternError, None)
    # the <a> tag will not have this attribute if there is no video
    if "onclick" not in extract_url.attrs:
        return (ContentUriScrapeResult.Status.ContentNotProvidedError, None)

  1. videolink class - City of Olympia Media links do not have a videolink class assigned. Is this class required to differentiate links on other Legistar instances, or is the highly specific ID enough?
  2. find returns only the first matching Media link - in the example provided, the first Media link is not associated with a video, so the entire event fails. You could do a find_all and iterate through the results, but a different approach might be...
  3. onclick is a distinguishing attribute - it is currently checked afterwards to return a distinct error, but we could instead test for its presence in the find call itself to identify a valid Media link directly.
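
For reference, the find_all alternative mentioned in item 2 might look roughly like the sketch below. The helper name first_media_link_with_video and the two-row sample markup are assumptions for illustration; in particular, the first anchor (id ending in ctl04) is a made-up stand-in for a Media link with no video attached.

```python
import re

from bs4 import BeautifulSoup

ID_PATTERN = re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo")


def first_media_link_with_video(soup):
    """Hypothetical helper: first matching <a> that has an onclick handler, else None."""
    for link in soup.find_all("a", id=ID_PATTERN):
        if "onclick" in link.attrs:
            return link
    return None


# Assumed markup: first Media link has no video, second one does
SAMPLE = """
<a id="ctl00_ContentPlaceHolder1_gridMain_ctl00_ctl04_hypVideo">Media</a>
<a id="ctl00_ContentPlaceHolder1_gridMain_ctl00_ctl06_hypVideo"
   onclick="window.open('Video.aspx?Mode=Granicus&amp;ID1=1536','video');return false;"
   href="#">Media</a>
"""

link = first_media_link_with_video(BeautifulSoup(SAMPLE, "html.parser"))
print(link["id"] if link is not None else "no video link found")
```

This keeps the ability to tell "no Media link at all" apart from "Media links exist but none has a video", at the cost of an explicit loop.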

Here's how I suggest modifying the code:

    extract_url = soup.find(
        "a",
        id=re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo"),
        onclick=True,
    )
    if extract_url is None:
        return (ContentUriScrapeResult.Status.UnrecognizedPatternError, None)
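
As a quick sanity check of the suggestion, the sketch below runs the current and proposed find calls against a small page fragment. The second anchor is the Olympia example from above; the first anchor (id ending in ctl04, no onclick) is an assumed stand-in for a Media link without video, and the variable names are mine.

```python
import re

from bs4 import BeautifulSoup

# Assumed fragment: a Media link without video, followed by the real Olympia anchor
HTML = """
<table>
  <tr><td><a id="ctl00_ContentPlaceHolder1_gridMain_ctl00_ctl04_hypVideo"
             style="color:Gray;">Media</a></td></tr>
  <tr><td><a id="ctl00_ContentPlaceHolder1_gridMain_ctl00_ctl06_hypVideo"
             onclick="window.open('Video.aspx?Mode=Granicus&amp;ID1=1536&amp;ID2=120417&amp;G=19510D34-31FB-48B8-9C02-4D026953451C&amp;Mode2=Video','video');return false;"
             href="#" style="color:Blue;font-family:Tahoma;font-size:10pt;">Media</a></td></tr>
</table>
"""

ID_PATTERN = re.compile(r"ct\S*_ContentPlaceHolder\S*_hypVideo")
soup = BeautifulSoup(HTML, "html.parser")

# Current behavior: requires class "videolink", which neither anchor has
current = soup.find("a", id=ID_PATTERN, class_="videolink")

# Dropping the class alone is not enough: the first ID match has no onclick
first_by_id = soup.find("a", id=ID_PATTERN)

# Proposed behavior: onclick=True matches only tags that have the attribute
proposed = soup.find("a", id=ID_PATTERN, onclick=True)

print(current)
print("onclick" in first_by_id.attrs)
print(proposed["id"])
```

One trade-off worth noting: with onclick=True in the find call, an event whose Media links all lack onclick would return None and be reported as UnrecognizedPatternError rather than ContentNotProvidedError.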

Reproduction

You can see where the Event Gather workflow is failing on the cdp-usa-wa-olympia instance here; while not specifically pointing out this issue, this is the next hiccup:
https://github.com/CannObserv/cdp-usa-wa-city-olympia/actions/runs/6999433306/job/19038863304

If this change isn't likely to break anything, I'd much rather fix it here than derive a dedicated scraper class (at least not yet) and override get_legistar_content_uris in that file. Otherwise, I'm not sure how to get the Python import hierarchy to respect an override.

Environment

  • OS Version: [e.g. macOS 11.3.1]
  • cdp-scrapers Version: [e.g. 0.5.0]

Labels: bug (Something isn't working)
