
Conversation

@cmatKhan commented Sep 8, 2025

This is another proposal for an AbstractHfAPI().

@cmatKhan mentioned this pull request Sep 9, 2025
@cmatKhan commented Sep 9, 2025

Focus on AbstractHfAPI() and HfCacheManager.py and the associated tests. The other changes fix some mypy typing errors -- administrative, in other words.

@MackLiao left a comment

LGTM

@cmatKhan commented Sep 9, 2025

Still need to confirm that download() works. Transferring this message from Slack to keep everything in one spot:

The task is: take my proposed AbstractHfAPI() method and see if download() works. If it doesn't, at least document the failure in a response to this pull request, but better yet see if you can fix it. Once the basic functionality works, try to break it by trying various different options. No need to be sneaky, but it shouldn't break in the face of 'normal' usage. Next, we want to figure out how it behaves when the cache is intermittently deleted -- the idea is that we can automatically delete the cache periodically, then on querying a file that is no longer there, download it in the background without the user really knowing what is happening.
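The delete-then-transparently-re-fetch behavior described above can be sketched as a pattern independent of huggingface_hub: check the cache first, and on a miss re-fetch without surfacing it to the caller. This is a minimal stand-in -- TransparentCache and fetch are hypothetical names, not part of the codebase; the real class would wrap the HF cache directory and an hf_hub_download-style call rather than a dict.

```python
import threading

class TransparentCache:
    """Sketch (not the real AbstractHfAPI) of the pattern: the cache may be
    deleted at any time, and the next access re-fetches without the user
    noticing. `fetch` stands in for an hf_hub_download-style call."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._cache = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            if key in self._cache:      # cache hit: no fetch needed
                return self._cache[key]
        value = self._fetch(key)        # cache miss: re-fetch; a real
        with self._lock:                # implementation might do this in
            self._cache[key] = value    # a background thread instead
        return value

    def clear(self):
        """Simulates periodically deleting the HF cache directory."""
        with self._lock:
            self._cache.clear()

calls = []
cache = TransparentCache(fetch=lambda k: calls.append(k) or f"data:{k}")
cache.get("a")
cache.get("a")        # served from cache, no second fetch
cache.clear()         # simulate cache deletion
cache.get("a")        # silently re-fetched
print(calls)          # → ['a', 'a']
```

The lock matters because the "delete the cache periodically" step can race with a query; the real implementation would need the same discipline around the cache directory.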

@cmatKhan commented Sep 9, 2025

A weakness of download(): if the revision/files already exist in the cache and you call download(), and the repo size is larger than the threshold limit, it will error even though no download is actually necessary.
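One possible fix, sketched with hypothetical helpers (bytes_to_download and check_threshold are illustrative, not the real API): count only the bytes that would actually be transferred against the threshold, so a fully cached repo never trips it.

```python
def bytes_to_download(files, sizes, cached):
    """Sum only the bytes that would actually be transferred.
    `files` is the repo file listing, `sizes` maps file -> size in bytes,
    `cached` is the set of files already present in the local cache."""
    return sum(sizes[f] for f in files if f not in cached)

def check_threshold(files, sizes, cached, threshold_mb):
    """Raise only when the *remaining* download exceeds the threshold."""
    needed = bytes_to_download(files, sizes, cached)
    if needed > threshold_mb * 1024 ** 2:
        raise ValueError(
            f"{needed / 1024 ** 2:.1f} MB to download exceeds "
            f"the {threshold_mb} MB threshold"
        )
    return needed

sizes = {"a.parquet": 700 * 1024 ** 2, "b.parquet": 10 * 1024 ** 2}
files = list(sizes)
# Everything already cached: nothing to transfer, so no error even though
# the total repo size (710 MB) exceeds the 500 MB threshold.
print(check_threshold(files, sizes, cached=set(files), threshold_mb=500))  # → 0
```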

@cmatKhan commented Sep 10, 2025

@MackLiao, this is how I am currently using HfRankResponse and the other classes.

The classes to focus on are AbstractHfAPI, HfQueryAPI, IncrementalAnalysisDB, and HfRankResponse.

# %%
from tfbpapi.HfQueryAPI import HfQueryAPI

mahendrawada_2025 = HfQueryAPI(repo_id="BrentLab/mahendrawada_2025", repo_type="dataset")

# %%
mahendrawada_2025.describe_table("chec_seq")

# %%
sql = "SELECT * FROM rna_seq LIMIT 5;"
results = mahendrawada_2025.query(sql)
print(results)

# %%
rank_response_sql = """
  WITH ranked_chec AS (
    SELECT *,
           ROW_NUMBER() OVER (
             PARTITION BY regulator_locus_tag
             ORDER BY peak_score DESC
           ) as rank_within_regulator
    FROM chec_seq
  ),
  joined_data AS (
    SELECT
      rc.*,
      rs.log2fc,
      CASE
        WHEN rs.log2fc IS NOT NULL THEN TRUE
        ELSE FALSE
      END as responsive
    FROM ranked_chec rc
    LEFT JOIN rna_seq rs
      ON rc.regulator_locus_tag = rs.regulator_locus_tag
      AND rc.target_locus_tag = rs.target_locus_tag
  ),
  binned_data AS (
    SELECT *,
           ((ROW_NUMBER() OVER (
              ORDER BY regulator_locus_tag, rank_within_regulator
            ) - 1) / 5 + 1) * 5 as bin
    FROM joined_data
  ),
  final_data AS (
    SELECT *,
           SUM(CASE WHEN responsive THEN 1 ELSE 0 END)
             OVER (ORDER BY bin ROWS UNBOUNDED PRECEDING) as bin_cum_sum
    FROM binned_data
  )
  SELECT * FROM final_data
  ORDER BY regulator_locus_tag, rank_within_regulator;
"""

res = mahendrawada_2025.query(rank_response_sql)

# %%
rr_intermediate_sql = """
WITH binned_data AS (
  SELECT a.regulator_locus_tag, a.target_locus_tag, peak_score, log2fc,
         CASE WHEN log2fc IS NOT NULL THEN 1 ELSE 0 END AS responsive,
         CEILING(ROW_NUMBER() OVER (PARTITION BY a.regulator_locus_tag ORDER BY a.regulator_locus_tag, a.target_locus_tag) / 5.0) * 5 AS bin_label
  FROM chec_seq AS a
  LEFT JOIN rna_seq AS b
  ON a.regulator_locus_tag = b.regulator_locus_tag
  AND a.target_locus_tag = b.target_locus_tag
)
SELECT regulator_locus_tag, target_locus_tag, peak_score, log2fc, responsive, bin_label,
       SUM(responsive) OVER (
         PARTITION BY regulator_locus_tag 
         ORDER BY bin_label
         RANGE UNBOUNDED PRECEDING
       ) AS cumulative_responsive
FROM binned_data
"""

rr_final_sql = """
WITH row_numbers AS (
  SELECT 
    a.regulator_locus_tag,
    a.target_locus_tag,
    ROW_NUMBER() OVER (
      PARTITION BY a.regulator_locus_tag 
      ORDER BY a.regulator_locus_tag, a.target_locus_tag
    ) AS row_num,
    CASE WHEN b.log2fc IS NOT NULL THEN 1 ELSE 0 END AS responsive
  FROM chec_seq a
  LEFT JOIN rna_seq b USING (regulator_locus_tag, target_locus_tag)
),
bin_aggregates AS (
  SELECT 
    regulator_locus_tag,
    CEILING(row_num / 5.0) * 5 AS bin_label,
    SUM(responsive) AS bin_responsive_count
  FROM row_numbers
  GROUP BY regulator_locus_tag, CEILING(row_num / 5.0) * 5
)
SELECT 
  regulator_locus_tag,
  bin_label,
  SUM(bin_responsive_count) OVER (
    PARTITION BY regulator_locus_tag 
    ORDER BY bin_label
  ) / bin_label AS rr
FROM bin_aggregates
ORDER BY regulator_locus_tag, bin_label
"""

res = mahendrawada_2025.query(rr_final_sql)


# %%
res

# %%
from tfbpapi.IncrementalAnalysisDB import IncrementalAnalysisDB
from tfbpapi.HfRankResponse import HfRankResponse

rr_db = IncrementalAnalysisDB(db_path="rr_tmpdb.duckdb")
rrapi = HfRankResponse(rr_db)

# %%
x = rrapi.compute(
    ranking_api=mahendrawada_2025,
    response_api=mahendrawada_2025,
    ranking_table="chec_seq",
    response_table="rna_seq",
    ranking_score_column="peak_score",
    response_column="log2fc",
    comparison_id='mahendrawada_2025_chec_seq_vs_mahendrawada_2025_rna_seq'
)


# %%
hackett_2020 = HfQueryAPI(repo_id="BrentLab/hackett_2020",
                          repo_type="dataset",
                          auto_download_threshold_mb=500)

hackett_2020.set_table_filter(
    "hackett_2020",
    """
        time = 15 AND 
        mechanism = 'ZEV' AND 
        restriction = 'P' AND 
        regulator_locus_tag IN 
        (SELECT 
        regulator_locus_tag 
        FROM 
        hackett_2020 
        WHERE time = 15 AND 
        mechanism = 'ZEV' AND 
        restriction = 'P'
        GROUP BY 
        regulator_locus_tag 
        HAVING COUNT(*) = 6175)
    """)

# %%
hackett_2020.describe_table("hackett_2020")

# %%
mahendrawada_hackett = rrapi.compute(
    ranking_api=mahendrawada_2025,
    response_api=hackett_2020,
    ranking_table="chec_seq",
    response_table="hackett_2020",
    ranking_score_column="peak_score",
    response_column="log2_shrunken_timecourses",
    responsive_condition="log2_shrunken_timecourses != 0",
    comparison_id='mahendrawada_2025_chec_seq_vs_hackett_2020'
)

# %%
tally = hackett_2020.query("SELECT regulator_locus_tag, COUNT(*) as target_count FROM hackett_2020 GROUP BY regulator_locus_tag ORDER BY target_count DESC;")

# %%
tally

# %%
x = hackett_2020.query("SELECT * FROM hackett_2020 WHERE regulator_locus_tag = 'YPL016W'")

# %%
x

# %%
y = rrapi.compute(
    ranking_api=mahendrawada_2025,
    response_api=mahendrawada_2025,
    ranking_table="chec_seq",
    response_table="rna_seq",
    ranking_score_column="peak_score",
    response_column="log2fc",
    comparison_id='mahendrawada_2025_chec_seq_vs_mahendrawada_2025_rna_seq'
)

# %%
rrapi.summarize("mahendrawada_2025_chec_seq_vs_mahendrawada_2025_rna_seq")

# %%
rrapi.summarize("mahendrawada_2025_chec_seq_vs_hackett_2020")

# %%
rrapi.summarize("mahendrawada_2025_chec_seq_vs_mahendrawada_2025_rna_seq")

There is a lot of code in the source files for these classes that Claude wrote. The task right now is to go through every line and make sure that we understand it, that it is documented clearly for us and "future us" (when we have forgotten about it in three months), and that it actually works. There is a ton of editing that needs to happen, and other functionality that needs to be added. Look at figure S1C in the Mateusiak_2025/supplement directory -- that is over common regulators between mcisaac, callingcards, and harbison, where the callingcards data is passing, and over a common set of target symbols, for example. It is a lot of filtering, in other words, and that needs to be easy to do in this iteration of the code.
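As a check on understanding what rr_final_sql above computes, the same rank response calculation for a single regulator can be written in plain Python. This is a sketch, not project code; note that the SQL numbers rows by locus tag while this version orders by the binding score, which is presumably the intent.

```python
import math

def rank_response(scores, responsive, bin_size=5):
    """For one regulator: order targets by binding score (highest first),
    bin them in groups of `bin_size`, and report cumulative responsive
    count divided by the bin label -- mirroring rr_final_sql above."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    cum, last_per_bin = 0, {}
    for rank, i in enumerate(order, start=1):
        cum += 1 if responsive[i] else 0
        label = math.ceil(rank / bin_size) * bin_size
        last_per_bin[label] = cum           # keep the last cum seen per bin
    return [(label, cum / label) for label, cum in sorted(last_per_bin.items())]

# Toy data: 10 targets, the 1st, 2nd, and 5th strongest are responsive.
scores =     [9, 8, 7, 6, 5, 4, 3, 2, 1, 0.5]
responsive = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(rank_response(scores, responsive))  # → [(5, 0.6), (10, 0.3)]
```

Having a tiny reference implementation like this makes it easy to unit-test the SQL against known inputs.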

@cmatKhan

This is the beginning of using the Hugging Face API classes, and the start of designing something like a "Rank Response API endpoint". We're going to be doing something very similar with univariate models (perturbation ~ binding, and maybe even binding ~ perturbation?), DTO, correlations within datasets, ... .

The design question right now is: how do we make these sorts of analyses and comparisons easy at the user level, and clear enough that they are maintainable and adjustable at the developer level?

@cmatKhan requested a review from @MackLiao September 10, 2025 23:15
@cmatKhan

@MackLiao, datainfo is close to an at least initially usable form. HfQueryAPI and the other objects aren't usable in their current state. However, the documentation builds, and there is a good tutorial on using DataCard. Pull this branch and run poetry run mkdocs serve to build and serve the documentation locally, then look at the tutorial.

@cmatKhan commented Sep 20, 2025

@MackLiao, there is now a tutorial showing how to use the get_metadata() and set_filter() functions of HfQueryAPI. I deployed this branch's current state to gh-pages, so you can use the gh-pages site here, too.

@MackLiao

OK -- I will test these two functions.

@cmatKhan

@MackLiao and @fei1cell, the rank response tutorial is an example of how we will use this data more seriously. One of the first tasks for the Shiny app re-write is going to be implementing this type of overview using the rank response.

The most important part, and the most challenging to figure out how to implement, is the filtering (as done in this tutorial) -- how to select data based on conditions (e.g., only regulators tf1, tf2, tf3, which are in minimal media + galactose).
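As a sketch of what that kind of filtering could look like with the set_table_filter() pattern shown earlier in this thread -- the regulator tags and the media/carbon_source columns below are made up for illustration, not a real schema -- the condition string can be assembled in Python and handed to the filter:

```python
# Hypothetical regulators and condition columns, purely illustrative.
regulators = ["TF1_LOCUS_TAG", "TF2_LOCUS_TAG", "TF3_LOCUS_TAG"]

in_list = ", ".join(f"'{r}'" for r in regulators)
condition = (
    f"regulator_locus_tag IN ({in_list}) "
    "AND media = 'minimal' "
    "AND carbon_source = 'galactose'"
)
# The string could then be passed to the filter mechanism shown above, e.g.
# api.set_table_filter("some_table", condition)
print(condition)
```

Building conditions programmatically like this is one answer to the design question: users supply lists and simple key/value conditions, and the developer-level code assembles the SQL.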
