Huggingface #26
base: dev
Conversation
Focus on AbstractHfAPI() and HfCacheManager.py and the associated tests. The other changes have to do with fixing some mypy typing errors -- just administrative, in other words.
LGTM
Still need to confirm that download() works. Transferring this message from Slack to keep everything in one spot:
@MackLiao, this is how I am currently using HfRankResponse and the other classes; the classes to focus on are the ones used in the script below.
# %%
from tfbpapi.HfQueryAPI import HfQueryAPI
mahendrawada_2025 = HfQueryAPI(repo_id="BrentLab/mahendrawada_2025", repo_type="dataset")
# %%
mahendrawada_2025.describe_table("chec_seq")
# %%
sql = "SELECT * FROM rna_Seq LIMIT 5;"
results = mahendrawada_2025.query(sql)
print(results)
# %%
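# Rank targets within each regulator by peak_score, flag targets as responsive when a
# matching rna_seq log2fc exists, assign rows to bins of 5, and keep a running count
# of responsive rows.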
rank_response_sql = """
WITH ranked_chec AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY regulator_locus_tag
ORDER BY peak_score DESC
) as rank_within_regulator
FROM chec_seq
),
joined_data AS (
SELECT
rc.*,
rs.log2fc,
CASE
WHEN rs.log2fc IS NOT NULL THEN TRUE
ELSE FALSE
END as responsive
FROM ranked_chec rc
LEFT JOIN rna_seq rs
ON rc.regulator_locus_tag = rs.regulator_locus_tag
AND rc.target_locus_tag = rs.target_locus_tag
),
binned_data AS (
SELECT *,
((ROW_NUMBER() OVER (ORDER BY
regulator_locus_tag, rank_within_regulator) - 1) / 5 +
1) * 5 as bin
FROM joined_data
),
final_data AS (
SELECT *,
SUM(CASE WHEN responsive THEN 1 ELSE 0 END)
OVER (ORDER BY bin ROWS UNBOUNDED PRECEDING)
as bin_cum_sum
FROM binned_data
)
SELECT * FROM final_data
ORDER BY regulator_locus_tag, rank_within_regulator;
"""
res = mahendrawada_2025.query(rank_response_sql)
# %%
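# Intermediate version: number each regulator's chec_seq targets, label bins of 5, and
# accumulate the count of responsive targets per regulator across bins.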
rr_intermediate_sql = """
WITH binned_data AS (
SELECT a.regulator_locus_tag, a.target_locus_tag, peak_score, log2fc,
CASE WHEN log2fc IS NOT NULL THEN 1 ELSE 0 END AS responsive,
CEILING(ROW_NUMBER() OVER (PARTITION BY a.regulator_locus_tag ORDER BY a.regulator_locus_tag, a.target_locus_tag) / 5.0) * 5 AS bin_label
FROM chec_seq AS a
LEFT JOIN rna_seq AS b
ON a.regulator_locus_tag = b.regulator_locus_tag
AND a.target_locus_tag = b.target_locus_tag
)
SELECT regulator_locus_tag, target_locus_tag, peak_score, log2fc, responsive, bin_label,
SUM(responsive) OVER (
PARTITION BY regulator_locus_tag
ORDER BY bin_label
RANGE UNBOUNDED PRECEDING
) AS cumulative_responsive
FROM binned_data
"""
rr_final_sql = """
WITH row_numbers AS (
SELECT
a.regulator_locus_tag,
a.target_locus_tag,
ROW_NUMBER() OVER (
PARTITION BY a.regulator_locus_tag
ORDER BY a.regulator_locus_tag, a.target_locus_tag
) AS row_num,
CASE WHEN b.log2fc IS NOT NULL THEN 1 ELSE 0 END AS responsive
FROM chec_seq a
LEFT JOIN rna_seq b USING (regulator_locus_tag, target_locus_tag)
),
bin_aggregates AS (
SELECT
regulator_locus_tag,
CEILING(row_num / 5.0) * 5 AS bin_label,
SUM(responsive) AS bin_responsive_count
FROM row_numbers
GROUP BY regulator_locus_tag, CEILING(row_num / 5.0) * 5
)
SELECT
regulator_locus_tag,
bin_label,
SUM(bin_responsive_count) OVER (
PARTITION BY regulator_locus_tag
ORDER BY bin_label
) / bin_label AS rr
FROM bin_aggregates
ORDER BY regulator_locus_tag, bin_label
"""
res = mahendrawada_2025.query(rr_final_sql)  # assuming rr_final_sql, not sql, was intended here
# %%
res
# %%
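# Set up the rank response endpoint backed by a local DuckDB file, then compute the
# chec_seq (ranking) vs rna_seq (response) comparison within mahendrawada_2025.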
from tfbpapi.IncrementalAnalysisDB import IncrementalAnalysisDB
from tfbpapi.HfRankResponse import HfRankResponse
rr_db = IncrementalAnalysisDB(db_path="rr_tmpdb.duckdb")
rrapi = HfRankResponse(rr_db)
# %%
x = rrapi.compute(
ranking_api=mahendrawada_2025,
response_api=mahendrawada_2025,
ranking_table="chec_seq",
response_table="rna_seq",
ranking_score_column="peak_score",
response_column="log2fc",
comparison_id='mahendrawada_2025_chec_seq_vs_mahendrawada_2025_rna_seq'
)
# %%
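# Load hackett_2020 and restrict it to the 15-minute ZEV 'P' subset, keeping only
# regulators with measurements for all 6175 targets.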
hackett_2020 = HfQueryAPI(repo_id="BrentLab/hackett_2020",
repo_type="dataset",
auto_download_threshold_mb=500)
hackett_2020.set_table_filter(
"hackett_2020",
"""
time = 15 AND
mechanism = 'ZEV' AND
restriction = 'P' AND
regulator_locus_tag IN
(SELECT
regulator_locus_tag
FROM
hackett_2020
WHERE time = 15 AND
mechanism = 'ZEV' AND
restriction = 'P'
GROUP BY
regulator_locus_tag
HAVING COUNT(*) = 6175)
""")
# %%
hackett_2020.describe_table("hackett_2020")
# %%
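# Cross-dataset comparison: rank by mahendrawada_2025 chec_seq peak_score and score
# responsiveness against the filtered hackett_2020 timecourses.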
mahendrawada_hackett = rrapi.compute(
ranking_api=mahendrawada_2025,
response_api=hackett_2020,
ranking_table="chec_seq",
response_table="hackett_2020",
ranking_score_column="peak_score",
response_column="log2_shrunken_timecourses",
responsive_condition="log2_shrunken_timecourses != 0",
comparison_id='mahendrawada_2025_chec_seq_vs_hackett_2020'
)
# %%
tally = hackett_2020.query("SELECT regulator_locus_tag, COUNT(*) as target_count FROM hackett_2020 GROUP BY regulator_locus_tag ORDER BY target_count DESC;")
# %%
tally
# %%
x = hackett_2020.query("SELECT * FROM hackett_2020 WHERE regulator_locus_tag = 'YPL016W'")
# %%
x
# %%
y = rrapi.compute(
ranking_api=mahendrawada_2025,
response_api=mahendrawada_2025,
ranking_table="chec_seq",
response_table="rna_seq",
ranking_score_column="peak_score",
response_column="log2fc",
comparison_id='mahendrawada_2025_chec_seq_vs_mahendrawada_2025_rna_seq'
)
# %%
rrapi.summarize("mahendrawada_2025_chec_seq_vs_mahendrawada_2025_rna_seq")
# %%
rrapi.summarize("mahendrawada_2025_chec_seq_vs_hackett_2020")
# %%
rrapi.summarize("mahendrawada_2025_chec_seq_vs_mahendrawada_2025_rna_seq") There is a lot of code in the source files for these classes that Claude wrote. The task right now is to go through every line and make sure that we understand it, that it is documented clearly for us and "future us" (when we forget about it in 3 months), and that it actually works. There is a ton of editing that needs to happen, and other functionality that needs to be added. Look at figure S1C in the Mateusiak_2025/supplement directory -- that is over common regulators between mcisaac, callingcards and harbison, where the callingcards data is passing and a common set of target symbols, for example. It is a lot of filtering, in other words. that needs to be easy to do in this iteration of the code |
This is the beginning of using the huggingface API classes, and the start of designing something like a "Rank Response API endpoint". We're going to be doing a very similar thing with univariate models (perturbation ~ binding, and maybe even binding ~ perturbation?), DTO, correlations within datasets, and so on. The design questions right now are: how do we make doing these sorts of analyses/comparisons easy at the user level, and clear enough that they are maintainable and adjustable at the developer level? (One possible shared-endpoint shape is sketched below.)
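Purely as a sketch of one way to keep those endpoints uniform: each analysis could follow the same compute()/summarize() pattern that HfRankResponse uses in the script above, so adding a univariate-model or DTO endpoint mostly means implementing one class. Everything here beyond the compute() and summarize() names is an assumption, not the current design.

# Hypothetical sketch of a shared endpoint interface; only compute() and summarize()
# appear in the script above, the rest is an assumption.
from abc import ABC, abstractmethod
from typing import Any


class AnalysisEndpoint(ABC):
    """Common shape for rank response, univariate models, DTO, correlations, ..."""

    @abstractmethod
    def compute(self, **kwargs: Any) -> Any:
        """Run the analysis and persist results under a comparison_id."""

    @abstractmethod
    def summarize(self, comparison_id: str) -> Any:
        """Return a summary of a previously computed comparison."""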
…d. documentation works and is updated
@MackLiao, there is now a tutorial showing how to use the get_metadata() and set_filter() functions of HfQueryAPI. I deployed this branch's current state to gh-pages, so you can use the gh-pages site here, too.
OK -- I will test these two functions.
@MackLiao and @fei1cell, the rank response tutorial is an example of how we will use this data more seriously. One of the first tasks for the shiny app re-write is going to be implementing this type of overview using the rank response. The most important part, and the most challenging to figure out how to implement, is the filtering (as done in this tutorial) -- how to select data based on a condition (e.g., only regulators tf1, tf2, tf3, which are in minimal media + galactose). A sketch of that kind of condition-based filter follows.
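For illustration only -- the repo ID, table name, and column names below are assumptions; set_table_filter() is used the same way as in the hackett_2020 example above.

# Hypothetical condition-based filter: keep only regulators tf1/tf2/tf3 measured in
# minimal media + galactose. Repo, table, and column names are assumptions.
from tfbpapi.HfQueryAPI import HfQueryAPI

api = HfQueryAPI(repo_id="BrentLab/some_dataset", repo_type="dataset")
api.set_table_filter(
    "some_table",
    """
    regulator_symbol IN ('tf1', 'tf2', 'tf3') AND
    media = 'minimal' AND
    carbon_source = 'galactose'
    """,
)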
This is another proposal for an AbstractHfAPI(); a rough sketch of what such an interface might look like is below.
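The sketch is based only on the methods exercised in this thread (query, describe_table, set_table_filter, get_metadata); the signatures, return types, and constructor are assumptions, not the actual proposal.

# Hypothetical sketch of an AbstractHfAPI interface; method names are taken from the
# usage in this thread, but the signatures and return types are assumptions.
from abc import ABC, abstractmethod
from typing import Any


class AbstractHfAPI(ABC):
    """Common interface for classes that expose a HuggingFace repo for querying."""

    def __init__(self, repo_id: str, repo_type: str = "dataset") -> None:
        self.repo_id = repo_id
        self.repo_type = repo_type

    @abstractmethod
    def get_metadata(self) -> dict[str, Any]:
        """Return repo/table-level metadata."""

    @abstractmethod
    def describe_table(self, table: str) -> Any:
        """Describe the columns of a table in the repo."""

    @abstractmethod
    def set_table_filter(self, table: str, condition: str) -> None:
        """Register a SQL condition applied whenever `table` is queried."""

    @abstractmethod
    def query(self, sql: str) -> Any:
        """Run a SQL query against the repo's tables and return the result."""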