
Conversation about chicken-egg problem with provider and site ids in database #15

@shi-jie-samuel-tan


In our previous meeting, we faced the chicken-egg dilemma of deciding how we should assign provider ids as foreign keys for the npi, xwaiver, and pals listings that were scraped.

The solution that Josephine and I propose is as follows. We first "scrape" samhsa's master list of providers and run the samhsa comparison script on it to identify new providers as well as providers that were edited or removed. (It is not really scraping, because we are just downloading a csv file and importing the information into the Django API.) Before creating provider entities from the list of new providers, we perform a fuzzy search on our database for providers with the same first and last name, so that we do not add a provider that already exists in the database. We then create the new provider entities; at this point they have no npi, xwaiver, or pals foreign keys. Next, the pipeline scrapes our three sources (npi, xwaiver, and pals) using the list of new providers. The new listings are tagged with their respective provider ids, and the new providers are in turn given their npi, xwaiver, and pals foreign keys.
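To make the fuzzy-search step concrete, here is a minimal sketch using only Python's standard library. It assumes providers can be represented as `(first_name, last_name)` tuples; the function names and the `0.9` threshold are hypothetical and not taken from our codebase, and the real check would query the Django database rather than an in-memory list.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def split_new_and_existing(candidates, existing, threshold=0.9):
    """Separate candidates into new providers and likely duplicates.

    Both arguments are lists of (first_name, last_name) tuples. A candidate
    counts as a likely duplicate when both its first and last name are at
    least `threshold` similar to those of some existing provider.
    The threshold is an assumption and would need tuning on real data.
    """
    new, duplicates = [], []
    for first, last in candidates:
        is_duplicate = any(
            name_similarity(first, ex_first) >= threshold
            and name_similarity(last, ex_last) >= threshold
            for ex_first, ex_last in existing
        )
        (duplicates if is_duplicate else new).append((first, last))
    return new, duplicates
```

Only the `new` list would go on to provider creation; the `duplicates` list could be logged for manual review, since two different providers can share a name.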

If we adopt the above solution, we will have to do the following:

  1. Make the npi, xwaiver, and pals foreign keys optional on the provider entity, so that a provider can be created before we know its npi, xwaiver, or pals data. The npi, xwaiver, and pals listings, however, must each have a provider foreign key, because they can only be scraped after we know that the provider (and its provider id) exists.

  2. Finish issue #2 ("Automate npi number collection based on names") to make sure that we can scrape npi entries based on provider names alone. This will allow us to use the little information we have about the providers from samhsa's spreadsheet to match npi data with each provider.

  3. Finish issue #14 ("Explore better data scraping from SAMHSA database") to find a more robust method of filling in the blanks in our xwaiver data for each provider.

  4. Create a new GitHub issue for a script that we currently lack: one that automatically downloads samhsa's spreadsheet of providers and applies our samhsa comparison script to it before linking it to our Django API.
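The relationship in item 1 can be sketched without any framework as follows. This is only an illustration with plain dataclasses; the class and field names are hypothetical and not our actual models. In Django, the optional side would typically be expressed by passing `null=True, blank=True` to the `ForeignKey` fields on the provider model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Provider:
    id: int
    first_name: str
    last_name: str
    # Optional: unknown until the matching listing has been scraped.
    npi_id: Optional[int] = None
    xwaiver_id: Optional[int] = None
    pals_id: Optional[int] = None


@dataclass
class NpiListing:
    id: int
    provider_id: int  # Required: a listing is only scraped for a known provider.
    npi_number: str


def attach_npi(provider: Provider, listing: NpiListing) -> None:
    """Link a scraped npi listing back onto its provider."""
    if listing.provider_id != provider.id:
        raise ValueError("listing belongs to a different provider")
    provider.npi_id = listing.id
```

A provider is created first with all three keys empty, the listing is then scraped and tagged with the provider id, and finally the provider is updated to point back at the listing. The xwaiver and pals listings would follow the same pattern.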
