
Conversation

@cmatKhan commented Sep 8, 2025

This is another proposal for an AbstractHfAPI().

@cmatKhan mentioned this pull request Sep 9, 2025
@cmatKhan commented Sep 9, 2025

Focus on AbstractHfAPI() and HfCacheManager.py and the associated tests. The other changes fix some mypy typing errors -- administrative, in other words.

@MackLiao left a comment

LGTM

@cmatKhan commented Sep 9, 2025

Still need to confirm that download() works. Transferring this message from Slack to keep everything in one spot:

The task is: take my proposed AbstractHfAPI() method and see if download() works. If it doesn't, at least document the failure in a response to this pull request, but better yet see if you can fix it. Once the basic functionality works, try to break it by trying various different options. No need to be sneaky, but it shouldn't break in the face of 'normal' usage. Next, we want to figure out how it behaves when the cache is intermittently deleted -- the idea is that we can automatically delete the cache periodically, then on querying a file that is no longer there, download it in the background without the user really knowing what is happening.
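The delete-then-transparently-re-fetch behavior described above can be sketched as a pattern independent of huggingface_hub: check the cache first, and on a miss re-fetch without surfacing it to the caller. This is a minimal stand-in -- TransparentCache and fetch are hypothetical names, not part of the codebase; the real class would wrap the HF cache directory and an hf_hub_download-style call rather than a dict.

```python
import threading

class TransparentCache:
    """Sketch (not the real AbstractHfAPI) of the pattern: the cache may be
    deleted at any time, and the next access re-fetches without the user
    noticing. `fetch` stands in for an hf_hub_download-style call."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._cache = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            if key in self._cache:      # cache hit: no fetch needed
                return self._cache[key]
        value = self._fetch(key)        # cache miss: re-fetch; a real
        with self._lock:                # implementation might do this in
            self._cache[key] = value    # a background thread instead
        return value

    def clear(self):
        """Simulates periodically deleting the HF cache directory."""
        with self._lock:
            self._cache.clear()

calls = []
cache = TransparentCache(fetch=lambda k: calls.append(k) or f"data:{k}")
cache.get("a")
cache.get("a")        # served from cache, no second fetch
cache.clear()         # simulate cache deletion
cache.get("a")        # silently re-fetched
print(calls)          # → ['a', 'a']
```

The lock matters because the "delete the cache periodically" step can race with a query; the real implementation would need the same discipline around the cache directory.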

@cmatKhan commented Sep 9, 2025

A weakness of download(): if the revision/files already exist in the cache and you call download(), and the repo size is larger than the threshold limit, it will error even though no download is actually necessary.
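One possible fix, sketched with hypothetical helpers (bytes_to_download and check_threshold are illustrative, not the real API): count only the bytes that would actually be transferred against the threshold, so a fully cached repo never trips it.

```python
def bytes_to_download(files, sizes, cached):
    """Sum only the bytes that would actually be transferred.
    `files` is the repo file listing, `sizes` maps file -> size in bytes,
    `cached` is the set of files already present in the local cache."""
    return sum(sizes[f] for f in files if f not in cached)

def check_threshold(files, sizes, cached, threshold_mb):
    """Raise only when the *remaining* download exceeds the threshold."""
    needed = bytes_to_download(files, sizes, cached)
    if needed > threshold_mb * 1024 ** 2:
        raise ValueError(
            f"{needed / 1024 ** 2:.1f} MB to download exceeds "
            f"the {threshold_mb} MB threshold"
        )
    return needed

sizes = {"a.parquet": 700 * 1024 ** 2, "b.parquet": 10 * 1024 ** 2}
files = list(sizes)
# Everything already cached: nothing to transfer, so no error even though
# the total repo size (710 MB) exceeds the 500 MB threshold.
print(check_threshold(files, sizes, cached=set(files), threshold_mb=500))  # → 0
```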

@cmatKhan commented Sep 10, 2025

@MackLiao, this is how I am currently using HfRankResponse and the other classes.

The classes to focus on are AbstractHfAPI, HfQueryAPI, IncrementalAnalysisDB, and HfRankResponse.

# %%
from tfbpapi.HfQueryAPI import HfQueryAPI

mahendrawada_2025 = HfQueryAPI(repo_id="BrentLab/mahendrawada_2025", repo_type="dataset")

# %%
mahendrawada_2025.describe_table("chec_seq")

# %%
sql = "SELECT * FROM rna_seq LIMIT 5;"
results = mahendrawada_2025.query(sql)
print(results)

# %%
rank_response_sql = """
  WITH ranked_chec AS (
    SELECT *,
           ROW_NUMBER() OVER (
             PARTITION BY regulator_locus_tag
             ORDER BY peak_score DESC
           ) as rank_within_regulator
    FROM chec_seq
  ),
  joined_data AS (
    SELECT
      rc.*,
      rs.log2fc,
      CASE
        WHEN rs.log2fc IS NOT NULL THEN TRUE
        ELSE FALSE
      END as responsive
    FROM ranked_chec rc
    LEFT JOIN rna_seq rs
      ON rc.regulator_locus_tag = rs.regulator_locus_tag
      AND rc.target_locus_tag = rs.target_locus_tag
  ),
  binned_data AS (
    SELECT *,
           ((ROW_NUMBER() OVER (
              ORDER BY regulator_locus_tag, rank_within_regulator
            ) - 1) / 5 + 1) * 5 as bin
    FROM joined_data
  ),
  final_data AS (
    SELECT *,
           SUM(CASE WHEN responsive THEN 1 ELSE 0 END)
             OVER (ORDER BY bin ROWS UNBOUNDED PRECEDING) as bin_cum_sum
    FROM binned_data
  )
  SELECT * FROM final_data
  ORDER BY regulator_locus_tag, rank_within_regulator;
"""

res = mahendrawada_2025.query(rank_response_sql)

# %%
rr_intermediate_sql = """
WITH binned_data AS (
  SELECT a.regulator_locus_tag, a.target_locus_tag, peak_score, log2fc,
         CASE WHEN log2fc IS NOT NULL THEN 1 ELSE 0 END AS responsive,
         CEILING(ROW_NUMBER() OVER (PARTITION BY a.regulator_locus_tag ORDER BY a.regulator_locus_tag, a.target_locus_tag) / 5.0) * 5 AS bin_label
  FROM chec_seq AS a
  LEFT JOIN rna_seq AS b
  ON a.regulator_locus_tag = b.regulator_locus_tag
  AND a.target_locus_tag = b.target_locus_tag
)
SELECT regulator_locus_tag, target_locus_tag, peak_score, log2fc, responsive, bin_label,
       SUM(responsive) OVER (
         PARTITION BY regulator_locus_tag 
         ORDER BY bin_label
         RANGE UNBOUNDED PRECEDING
       ) AS cumulative_responsive
FROM binned_data
"""

rr_final_sql = """
WITH row_numbers AS (
  SELECT 
    a.regulator_locus_tag,
    a.target_locus_tag,
    ROW_NUMBER() OVER (
      PARTITION BY a.regulator_locus_tag 
      ORDER BY a.regulator_locus_tag, a.target_locus_tag
    ) AS row_num,
    CASE WHEN b.log2fc IS NOT NULL THEN 1 ELSE 0 END AS responsive
  FROM chec_seq a
  LEFT JOIN rna_seq b USING (regulator_locus_tag, target_locus_tag)
),
bin_aggregates AS (
  SELECT 
    regulator_locus_tag,
    CEILING(row_num / 5.0) * 5 AS bin_label,
    SUM(responsive) AS bin_responsive_count
  FROM row_numbers
  GROUP BY regulator_locus_tag, CEILING(row_num / 5.0) * 5
)
SELECT 
  regulator_locus_tag,
  bin_label,
  SUM(bin_responsive_count) OVER (
    PARTITION BY regulator_locus_tag 
    ORDER BY bin_label
  ) / bin_label AS rr
FROM bin_aggregates
ORDER BY regulator_locus_tag, bin_label
"""

res = mahendrawada_2025.query(rr_final_sql)


# %%
res

# %%
from tfbpapi.IncrementalAnalysisDB import IncrementalAnalysisDB
from tfbpapi.HfRankResponse import HfRankResponse

rr_db = IncrementalAnalysisDB(db_path="rr_tmpdb.duckdb")
rrapi = HfRankResponse(rr_db)

# %%
x = rrapi.compute(
    ranking_api=mahendrawada_2025,
    response_api=mahendrawada_2025,
    ranking_table="chec_seq",
    response_table="rna_seq",
    ranking_score_column="peak_score",
    response_column="log2fc",
    comparison_id='mahendrawada_2025_chec_seq_vs_mahendrawada_2025_rna_seq'
)


# %%
hackett_2020 = HfQueryAPI(repo_id="BrentLab/hackett_2020",
                          repo_type="dataset",
                          auto_download_threshold_mb=500)

hackett_2020.set_table_filter(
    "hackett_2020",
    """
        time = 15 AND 
        mechanism = 'ZEV' AND 
        restriction = 'P' AND 
        regulator_locus_tag IN 
        (SELECT 
        regulator_locus_tag 
        FROM 
        hackett_2020 
        WHERE time = 15 AND 
        mechanism = 'ZEV' AND 
        restriction = 'P'
        GROUP BY 
        regulator_locus_tag 
        HAVING COUNT(*) = 6175)
    """)

# %%
hackett_2020.describe_table("hackett_2020")

# %%
mahendrawada_hackett = rrapi.compute(
    ranking_api=mahendrawada_2025,
    response_api=hackett_2020,
    ranking_table="chec_seq",
    response_table="hackett_2020",
    ranking_score_column="peak_score",
    response_column="log2_shrunken_timecourses",
    responsive_condition="log2_shrunken_timecourses != 0",
    comparison_id='mahendrawada_2025_chec_seq_vs_hackett_2020'
)

# %%
tally = hackett_2020.query("SELECT regulator_locus_tag, COUNT(*) as target_count FROM hackett_2020 GROUP BY regulator_locus_tag ORDER BY target_count DESC;")

# %%
tally

# %%
x = hackett_2020.query("SELECT * FROM hackett_2020 WHERE regulator_locus_tag = 'YPL016W'")

# %%
x

# %%
y = rrapi.compute(
    ranking_api=mahendrawada_2025,
    response_api=mahendrawada_2025,
    ranking_table="chec_seq",
    response_table="rna_seq",
    ranking_score_column="peak_score",
    response_column="log2fc",
    comparison_id='mahendrawada_2025_chec_seq_vs_mahendrawada_2025_rna_seq'
)

# %%
rrapi.summarize("mahendrawada_2025_chec_seq_vs_mahendrawada_2025_rna_seq")

# %%
rrapi.summarize("mahendrawada_2025_chec_seq_vs_hackett_2020")

# %%
rrapi.summarize("mahendrawada_2025_chec_seq_vs_mahendrawada_2025_rna_seq")

There is a lot of code in the source files for these classes that Claude wrote. The task right now is to go through every line and make sure that we understand it, that it is documented clearly for us and "future us" (when we have forgotten about it in three months), and that it actually works. There is a ton of editing that needs to happen, and other functionality that needs to be added. Look at figure S1C in the Mateusiak_2025/supplement directory -- that is over common regulators between mcisaac, callingcards, and harbison, where the callingcards data is passing, and over a common set of target symbols, for example. It is a lot of filtering, in other words, and that needs to be easy to do in this iteration of the code.
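As a check on understanding what rr_final_sql above computes, the same rank response calculation for a single regulator can be written in plain Python. This is a sketch, not project code; note that the SQL numbers rows by locus tag while this version orders by the binding score, which is presumably the intent.

```python
import math

def rank_response(scores, responsive, bin_size=5):
    """For one regulator: order targets by binding score (highest first),
    bin them in groups of `bin_size`, and report cumulative responsive
    count divided by the bin label -- mirroring rr_final_sql above."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    cum, last_per_bin = 0, {}
    for rank, i in enumerate(order, start=1):
        cum += 1 if responsive[i] else 0
        label = math.ceil(rank / bin_size) * bin_size
        last_per_bin[label] = cum           # keep the last cum seen per bin
    return [(label, cum / label) for label, cum in sorted(last_per_bin.items())]

# Toy data: 10 targets, the 1st, 2nd, and 5th strongest are responsive.
scores =     [9, 8, 7, 6, 5, 4, 3, 2, 1, 0.5]
responsive = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(rank_response(scores, responsive))  # → [(5, 0.6), (10, 0.3)]
```

Having a tiny reference implementation like this makes it easy to unit-test the SQL against known inputs.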

@cmatKhan

This is the beginning of using the Hugging Face API classes, and the start of designing something like a "Rank Response API endpoint". We're going to be doing something very similar with univariate models (perturbation ~ binding, and maybe even binding ~ perturbation?), DTO, correlations within datasets, ... .

The design question right now is: how do we make these sorts of analyses and comparisons easy at the user level, and clear enough that they are maintainable and adjustable at the developer level?

@cmatKhan requested a review from @MackLiao September 10, 2025 23:15
@cmatKhan

@MackLiao, datainfo is close to an at least initially usable form. HfQueryAPI and the other objects aren't usable in their current state. However, the documentation builds, and there is a good tutorial on using DataCard. Pull this branch and run poetry run mkdocs serve to build and serve the documentation locally, then look at the tutorial.

@cmatKhan commented Sep 20, 2025

@MackLiao, there is now a tutorial showing how to use the get_metadata() and set_filter() functions of HfQueryAPI. I deployed this branch's current state to gh-pages, so you can use the gh-pages site here, too.

@MackLiao

OK -- I will test these two functions.

@cmatKhan

@MackLiao and @fei1cell, the rank response tutorial is an example of how we will use this data more seriously. One of the first tasks for the Shiny app re-write is going to be implementing this type of overview using the rank response.

The most important part, and the most challenging to figure out how to implement, is the filtering (as done in this tutorial) -- how to select data based on conditions (e.g., only regulators tf1, tf2, tf3, which are in minimal media + galactose).
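As a sketch of what that kind of filtering could look like with the set_table_filter() pattern shown earlier in this thread -- the regulator tags and the media/carbon_source columns below are made up for illustration, not a real schema -- the condition string can be assembled in Python and handed to the filter:

```python
# Hypothetical regulators and condition columns, purely illustrative.
regulators = ["TF1_LOCUS_TAG", "TF2_LOCUS_TAG", "TF3_LOCUS_TAG"]

in_list = ", ".join(f"'{r}'" for r in regulators)
condition = (
    f"regulator_locus_tag IN ({in_list}) "
    "AND media = 'minimal' "
    "AND carbon_source = 'galactose'"
)
# The string could then be passed to the filter mechanism shown above, e.g.
# api.set_table_filter("some_table", condition)
print(condition)
```

Building conditions programmatically like this is one answer to the design question: users supply lists and simple key/value conditions, and the developer-level code assembles the SQL.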
