
Commit 5e556f3

added knowledgepack creation scripts (#12)
* added knowledgepack creation scripts
* removed installations in examples
* doc updates
1 parent f70588a commit 5e556f3

13 files changed: +168 -22 lines

README.md

Lines changed: 22 additions & 1 deletion
@@ -32,7 +32,7 @@ Pkg.add(url="https://github.com/JuliaGenAI/DocsScraper.jl")
 
 ## Building the Index
 ```julia
-crawlable_urls = ["https://juliagenai.github.io/DocsScraper.jl/dev/home/"]
+crawlable_urls = ["https://juliagenai.github.io/DocsScraper.jl/dev"]
 
 index_path = make_knowledge_packs(crawlable_urls;
     index_name = "docsscraper", embedding_dimension = 1024, embedding_bool = true, target_path="knowledge_packs")
@@ -100,3 +100,24 @@ using AIHelpMe: last_result
 # last_result() returns the last result from the RAG pipeline, ie, same as running aihelp(; return_all=true)
 print(last_result())
 ```
+## Output
+`make_knowledge_packs` creates the following files:
+
+```
+index_name\
+
+├── Index\
+│   ├── index_name__artifact__info.txt
+│   ├── index_name__vDate__model_embedding_size-embedding_type__v1.0.hdf5
+│   └── index_name__vDate__model_embedding_size-embedding_type__v1.0.tar.gz
+
+├── Scraped_files\
+│   ├── scraped_hostname-chunks-max-chunk_size-min-min_chunk_size.jls
+│   ├── scraped_hostname-sources-max-chunk_size-min-min_chunk_size.jls
+│   └── . . .
+
+└── index_name_URL_mapping.csv
+```
+- Index\: contains the .hdf5 and .tar.gz files along with artifact__info.txt, which records the sha256 and git-tree-sha1 hashes.
+- Scraped_files\: contains the scraped chunks and sources, separated by the hostnames of the URLs.
+- index_name_URL_mapping.csv: maps each scraped URL to its estimated package name.
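Editor's note: the new Output section pairs with the AIHelpMe workflow the README already documents. A minimal end-to-end sketch, assuming AIHelpMe's `load_index!` accepts the `index_path` returned by `make_knowledge_packs` (that call is not shown in this diff):

```julia
# Minimal sketch, not from this commit. Assumption: AIHelpMe.load_index!
# accepts the path returned by make_knowledge_packs.
using DocsScraper
using AIHelpMe
using AIHelpMe: pprint

crawlable_urls = ["https://juliagenai.github.io/DocsScraper.jl/dev"]
index_path = make_knowledge_packs(crawlable_urls;
    index_name = "docsscraper", embedding_dimension = 1024,
    embedding_bool = true, target_path = "knowledge_packs")

AIHelpMe.load_index!(index_path)  # point the RAG pipeline at the new pack
aihelp("How do I build an index with DocsScraper.jl?") |> pprint
```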

docs/make.jl

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ DocMeta.setdocmeta!(DocsScraper, :DocTestSetup, :(using DocsScraper); recursive
 
 makedocs(;
     modules = [DocsScraper],
-    authors = "Shreyas Agrawal @splendidbug and contributors", 
+    authors = "Shreyas Agrawal @splendidbug and contributors",
     sitename = "DocsScraper.jl",
     repo = "https://github.com/JuliaGenAI/DocsScraper.jl/blob/{commit}{path}#{line}",
     format = Documenter.HTML(;

docs/src/index.md

Lines changed: 22 additions & 1 deletion
@@ -30,7 +30,7 @@ Pkg.add(url="https://github.com/JuliaGenAI/DocsScraper.jl")
 
 ## Building the Index
 ```julia
-crawlable_urls = ["https://juliagenai.github.io/DocsScraper.jl/dev/home/"]
+crawlable_urls = ["https://juliagenai.github.io/DocsScraper.jl/dev"]
 
 index_path = make_knowledge_packs(crawlable_urls;
     index_name = "docsscraper", embedding_dimension = 1024, embedding_bool = true, target_path=joinpath(pwd(), "knowledge_packs"))
@@ -97,3 +97,24 @@ Tip: Use `pprint` for nicer outputs with sources and `last_result` for more details
 using AIHelpMe: last_result
 print(last_result())
 ```
+## Output
+`make_knowledge_packs` creates the following files:
+
+```
+index_name\
+
+├── Index\
+│   ├── index_name__artifact__info.txt
+│   ├── index_name__vDate__model_embedding_size-embedding_type__v1.0.hdf5
+│   └── index_name__vDate__model_embedding_size-embedding_type__v1.0.tar.gz
+
+├── Scraped_files\
+│   ├── scraped_hostname-chunks-max-chunk_size-min-min_chunk_size.jls
+│   ├── scraped_hostname-sources-max-chunk_size-min-min_chunk_size.jls
+│   └── . . .
+
+└── index_name_URL_mapping.csv
+```
+- Index\: contains the .hdf5 and .tar.gz files along with artifact__info.txt, which records the sha256 and git-tree-sha1 hashes.
+- Scraped_files\: contains the scraped chunks and sources, separated by the hostnames of the URLs.
+- index_name_URL_mapping.csv: maps each scraped URL to its estimated package name.
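Editor's note: to make the Scraped_files description above concrete, here is a hypothetical inspection script. It assumes the .jls chunk and source files are plain Julia `Serialization` dumps (suggested by the extension, not confirmed by this diff), and the pack path is illustrative:

```julia
# Hypothetical: peek inside a generated pack. The path below is illustrative,
# and the .jls files are assumed to be standard Serialization dumps.
using Serialization

pack_dir = joinpath("knowledge_packs", "docsscraper")
scraped = readdir(joinpath(pack_dir, "Scraped_files"); join = true)

# Pick the serialized chunks for one hostname and preview the first entry
chunks_file = first(filter(f -> occursin("chunks", basename(f)), scraped))
chunks = deserialize(chunks_file)
println("Loaded ", length(chunks), " chunks from ", basename(chunks_file))
println(first(chunks))
```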
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+# The example below demonstrates the creation of the Genie knowledge pack
+
+using DocsScraper
+
+# The crawler will run on these URLs to look for more URLs with the same hostname
+crawlable_urls = ["https://learn.genieframework.com/"]
+
+index_path = make_knowledge_packs(crawlable_urls;
+    target_path = joinpath("knowledge_packs", "dim=3072;chunk_size=384;Float32"), index_name = "genie", custom_metadata = "Genie ecosystem")

examples/scripts/generate_knowledge_pack.jl renamed to example_scripts/creating_knowledge_packs/juliaData_knowledge_pack.jl

Lines changed: 1 addition & 11 deletions
@@ -1,8 +1,5 @@
 # The example below demonstrates the creation of JuliaData knowledge pack
 
-using Pkg
-Pkg.activate(temp = true)
-Pkg.add(url = "https://github.com/JuliaGenAI/DocsScraper.jl")
 using DocsScraper
 
 # The crawler will run on these URLs to look for more URLs with the same hostname
@@ -24,11 +21,4 @@ single_page_urls = ["https://docs.julialang.org/en/v1/manual/missing/",
     "https://arrow.apache.org/julia/stable/reference/"]
 
 index_path = make_knowledge_packs(crawlable_urls; single_urls = single_page_urls,
-    embedding_dimension = 1024, embedding_bool = true,
-    target_path = joinpath(pwd(), "knowledge_to_delete"), index_name = "juliadata", custom_metadata = "JuliaData ecosystem")
-
-# The index created here has 1024 embedding dimensions with boolean embeddings and max chunk size is 384.
-
-# The above example creates the output directory (Link to the output directory). It contains the sub-directories "Scraped" and "Index".
-# "Scraped" contains .jls files of chunks and sources of the scraped URLs. Index contains the created index along with a .txt file
-# containing the artifact info. The output directory also contains the URL mapping csv.
+    target_path = joinpath("knowledge_packs", "dim=3072;chunk_size=384;Float32"), index_name = "juliadata", custom_metadata = "JuliaData ecosystem")
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+# The example below demonstrates the creation of the JuliaLang knowledge pack
+
+using DocsScraper
+
+# The crawler will run on these URLs to look for more URLs with the same hostname
+crawlable_urls = [
+    "https://docs.julialang.org/en/v1/", "https://julialang.github.io/IJulia.jl/stable/",
+    "https://julialang.github.io/PackageCompiler.jl/stable/", "https://pkgdocs.julialang.org/dev/",
+    "https://julialang.github.io/JuliaSyntax.jl/dev/",
+    "https://julialang.github.io/AllocCheck.jl/dev/", "https://julialang.github.io/PrecompileTools.jl/stable/",
+    "https://julialang.github.io/StyledStrings.jl/dev/"]
+
+index_path = make_knowledge_packs(crawlable_urls;
+    target_path = joinpath("knowledge_packs", "dim=3072;chunk_size=384;Float32"),
+    index_name = "julialang", custom_metadata = "JuliaLang ecosystem")
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+# The example below demonstrates the creation of the Makie knowledge pack
+
+using DocsScraper
+
+# The crawler will run on these URLs to look for more URLs with the same hostname
+crawlable_urls = ["https://docs.juliahub.com/MakieGallery/Ql23q/0.2.17/",
+    "https://beautiful.makie.org/dev/",
+    "https://juliadatascience.io/DataVisualizationMakie",
+    "https://docs.makie.org/v0.21/explanations/backends/glmakie", "https://juliadatascience.io/glmakie",
+    "https://docs.makie.org/v0.21/explanations/backends/cairomakie", "https://juliadatascience.io/cairomakie", "http://juliaplots.org/WGLMakie.jl/stable/",
+    "http://juliaplots.org/WGLMakie.jl/dev/", "https://docs.makie.org/v0.21/explanations/backends/wglmakie",
+    "https://docs.juliahub.com/MakieGallery/Ql23q/0.2.17/abstractplotting_api.html", "http://juliaplots.org/StatsMakie.jl/latest/",
+    "https://docs.juliahub.com/StatsMakie/RRy0o/0.2.3/manual/tutorial/", "https://geo.makie.org/v0.7.3/", "https://geo.makie.org/dev/",
+    "https://libgeos.org/doxygen/geos__c_8h.html", "https://docs.makie.org/v0.21/"]
+
+index_path = make_knowledge_packs(crawlable_urls;
+    target_path = joinpath("knowledge_packs", "dim=3072;chunk_size=384;Float32"), index_name = "makie", custom_metadata = "Makie ecosystem")
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+# The example below demonstrates the creation of the Plots knowledge pack
+
+using DocsScraper
+
+# The crawler will run on these URLs to look for more URLs with the same hostname
+crawlable_urls = [
+    "https://docs.juliaplots.org/stable/", "https://docs.juliaplots.org/dev/",
+    "https://docs.juliaplots.org/latest/",
+    "https://docs.juliaplots.org/latest/generated/statsplots/", "https://docs.juliaplots.org/latest/ecosystem/",
+    "http://juliaplots.org/PlotlyJS.jl/stable/",
+    "http://juliaplots.org/PlotlyJS.jl/stable/manipulating_plots/", "https://docs.juliaplots.org/latest/gallery/gr/",
+    "https://docs.juliaplots.org/latest/gallery/unicodeplots/",
+    "https://docs.juliaplots.org/latest/gallery/pgfplotsx/",
+    "https://juliaplots.org/RecipesBase.jl/stable/",
+    "https://juliastats.org/StatsBase.jl/stable/", "https://juliastats.org/StatsBase.jl/stable/statmodels/",
+    "http://juliagraphs.org/GraphPlot.jl/",
+    "https://docs.juliahub.com/GraphPlot/bUwXr/0.6.0/"]
+
+index_path = make_knowledge_packs(crawlable_urls;
+    target_path = joinpath("knowledge_packs", "dim=3072;chunk_size=384;Float32"), index_name = "plots", custom_metadata = "Plots ecosystem")
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+# The example below demonstrates the creation of the SciML knowledge pack
+
+using DocsScraper
+
+# The crawler will run on these URLs to look for more URLs with the same hostname
+crawlable_urls = ["https://sciml.ai/", "https://docs.sciml.ai/DiffEqDocs/stable/",
+    "https://docs.sciml.ai/DiffEqDocs/stable/types/sde_types/",
+    "https://docs.sciml.ai/ModelingToolkit/dev/", "https://docs.sciml.ai/DiffEqFlux/stable/",
+    "https://docs.sciml.ai/NeuralPDE/stable/", "https://docs.sciml.ai/NeuralPDE/stable/tutorials/pdesystem/",
+    "https://docs.sciml.ai/Optimization/stable/",
+    "https://docs.sciml.ai/SciMLSensitivity/stable/", "https://docs.sciml.ai/DataDrivenDiffEq/stable/", "https://turinglang.org/",
+    "https://turinglang.org/docs/tutorials/docs-00-getting-started/", "https://juliamath.github.io/MeasureTheory.jl/stable/",
+    "https://juliamath.github.io/MeasureTheory.jl/stable/", "https://docs.sciml.ai/DiffEqGPU/stable/",
+    "https://chevronetc.github.io/DistributedOperations.jl/dev/", "https://docs.sciml.ai/DiffEqBayes/stable/",
+    "https://turinglang.org/docs/tutorials/10-bayesian-differential-equations/index.html", "https://docs.sciml.ai/OrdinaryDiffEq/stable/",
+    "https://docs.sciml.ai/Overview/stable/", "https://docs.sciml.ai/DiffEqDocs/stable/solvers/sde_solve/",
+    "https://docs.sciml.ai/SciMLSensitivity/stable/examples/dde/delay_diffeq/", "https://docs.sciml.ai/DiffEqDocs/stable/tutorials/dde_example/",
+    "https://docs.sciml.ai/DiffEqDocs/stable/types/dae_types/", "https://docs.sciml.ai/DiffEqCallbacks/stable/",
+    "https://docs.sciml.ai/SciMLBase/stable/",
+    "https://docs.sciml.ai/DiffEqDocs/stable/features/callback_library/", "https://docs.sciml.ai/LinearSolve/stable/",
+    "https://docs.sciml.ai/ModelingToolkit/stable/",
+    "https://docs.sciml.ai/DataInterpolations/stable/", "https://docs.sciml.ai/DeepEquilibriumNetworks/stable/",
+    "https://docs.sciml.ai/DiffEqParamEstim/stable/",
+    "https://docs.sciml.ai/Integrals/stable/", "https://docs.sciml.ai/EasyModelAnalysis/stable/",
+    "https://docs.sciml.ai/GlobalSensitivity/stable/",
+    "https://docs.sciml.ai/ExponentialUtilities/stable/", "https://docs.sciml.ai/HighDimPDE/stable/",
+    "https://docs.sciml.ai/SciMLTutorialsOutput/stable/",
+    "https://docs.sciml.ai/Catalyst/stable/", "https://docs.sciml.ai/Surrogates/stable/",
+    "https://docs.sciml.ai/SciMLBenchmarksOutput/stable/",
+    "https://docs.sciml.ai/NeuralOperators/stable/", "https://docs.sciml.ai/NonlinearSolve/stable/",
+    "https://docs.sciml.ai/RecursiveArrayTools/stable/",
+    "https://docs.sciml.ai/ReservoirComputing/stable/", "https://docs.sciml.ai/MethodOfLines/stable/", "https://lux.csail.mit.edu/dev/"
+]
+
+# The crawler will not look for more URLs on these
+single_page_urls = [
+    "https://johnfoster.pge.utexas.edu/hpc-book/DifferentialEquations_jl.html",
+    "https://julialang.org/blog/2019/01/fluxdiffeq/", "https://juliapackages.com/p/galacticoptim",
+    "https://julianlsolvers.github.io/Optim.jl/stable/"]
+
+index_path = make_knowledge_packs(crawlable_urls; single_urls = single_page_urls,
+    target_path = joinpath("knowledge_packs", "dim=3072;chunk_size=384;Float32"), index_name = "sciml", custom_metadata = "SciML ecosystem")
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+# The example below demonstrates the creation of the Tidier knowledge pack
+
+using DocsScraper
+
+# The crawler will run on these URLs to look for more URLs with the same hostname
+crawlable_urls = ["https://tidierorg.github.io/Tidier.jl/dev/",
+    "https://tidierorg.github.io/TidierPlots.jl/latest/",
+    "https://tidierorg.github.io/TidierData.jl/latest/",
+    "https://tidierorg.github.io/TidierDB.jl/latest/"]
+
+index_path = make_knowledge_packs(crawlable_urls;
+    target_path = joinpath("knowledge_packs", "dim=3072;chunk_size=384;Float32"), index_name = "tidier", custom_metadata = "Tidier ecosystem")
