A backend service for handling the HTML files crawled by `bgm-archive-sh` and the JSON files derived from them.
It's a monolithic app that provides the following features:
- Process the HTML files crawled by `bgm-archive-sh`. Each HTML file is converted to a JSON file holding the stem (core) content. The JSON files live in a separate git repository that mirrors the folder structure of the HTML repo. Processing is done commit by commit: each commit in the HTML repo is reflected by a commit in the JSON repo, and the linked commits share the same commit message. See `/hook/commit` for details.
- Persist the JSON files to the database. Each topic is converted to one JSON file. To query the information we need efficiently, the app populates the database with fields from the JSON files. Note that `contentHtml` is not stored in the database. Other information, such as topic title, topic id, post id, likes, and blog tags, is inserted. See `/hook/db` for details.
- Expose the history of each topic via a RESTful API. Since every topic is versioned on top of git, it is easy to expose its history via an API. See the endpoints under the `/history` path for details.
- Build a git commit history cache in the database. Querying git history directly is slow, so some of it is cached in the database. See the `ba_cache_*` tables and their definitions, and `/hook/cache`, for details.
- Forum enhancement handlers. To enhance the forum experience, some endpoints query user statistics for specific types of topics. Also, to leverage the snapshots managed by git, there is an endpoint for querying deleted posts. See the endpoints under the `/forum-enhance` path for details.
- Hole detection. `bgm-archive-sh` may fail to find some hidden topics from the homepage, so there are endpoints to detect and to mask the holes left by deleted or uncrawled topics. See `/holes` for details.
- Spot-check. Some topics may change without showing up on the homepage. While processing HTML commits, the spot checker randomly picks topic ids and writes them to the `sc.txt` file under the topic folder of the HTML repo. `bgm-archive-sh` checks this file in its next round and crawls the topic ids listed in it.
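
The spot-check step can be illustrated with a minimal Python sketch. It is not the app's actual implementation: the function names, the uniform sampling over `1..known_max_id`, and the one-id-per-line layout of `sc.txt` are all assumptions made for illustration.

```python
import random
from pathlib import Path


def pick_spot_check_ids(known_max_id: int, sample_size: int, rng: random.Random) -> list[int]:
    """Pick distinct topic ids uniformly from 1..known_max_id (hypothetical strategy)."""
    # Never ask for more ids than exist, and never for a negative amount.
    sample_size = max(0, min(sample_size, known_max_id))
    return sorted(rng.sample(range(1, known_max_id + 1), sample_size))


def write_spot_check_file(topic_dir: Path, ids: list[int]) -> Path:
    """Write the ids to sc.txt in the topic folder, one per line (assumed format)."""
    topic_dir.mkdir(parents=True, exist_ok=True)
    sc_file = topic_dir / "sc.txt"
    sc_file.write_text("".join(f"{topic_id}\n" for topic_id in ids), encoding="utf-8")
    return sc_file
```

A crawler following this sketch would read `sc.txt` in its next round and fetch each listed id in addition to the topics it finds on the homepage.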
See `config.sample.json` for an example config file.
Most default settings are good enough. By default, the working directory is `~/source`.
You must provide `repoList`; otherwise nothing will be processed or persisted.
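
Because an empty or missing `repoList` means nothing gets processed, it can help to check the config before starting the service. The Python sketch below is only an illustration: it assumes the config is a JSON object whose `repoList` key holds an array, and it makes no assumption about what each entry looks like. The function name `load_repo_list` is hypothetical; consult `config.sample.json` for the real layout.

```python
import json
from pathlib import Path


def load_repo_list(config_path: Path) -> list:
    """Return the repoList entries, or raise if nothing would be processed."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Assumption: the top-level value is an object and repoList is a JSON array.
    repo_list = config.get("repoList") if isinstance(config, dict) else None
    if not isinstance(repo_list, list) or not repo_list:
        raise ValueError(f"{config_path}: 'repoList' is missing or empty")
    return repo_list
```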
- For a bare repo, run the following command so that fetch behaves like pull (remote branches are fetched directly into the local branches):
$ git config remote.origin.fetch 'refs/heads/*:refs/heads/*'