This repository contains the internship assignment for Gunosy Inc.
Article Classifier is an application that fetches the HTML of an article URL entered in a form, determines the article's category, and displays the result on screen.
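Enter the URL of an article chosen from https://gunosy.com/ into the form at the center of the application screen.
Press the Analyze button and the application analyzes the article's category and displays it on screen.
If the URL entered is not an article URL, exception handling kicks in and the error message "Please submit a gunosy article" is displayed.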
Two classifiers are available for determining an article's category: one based on Naive Bayes and one based on logistic regression.
Both classifiers were trained with an 8:2 split between training data and test data.
The precision, recall, F-measure (f1-score), and number of test samples (support) for each classifier are shown below.
Cross-validation scores are also reported to show that the classifiers are not overfitting to one particular dataset.
Naive Bayes classifier:

category | precision | recall | f1-score | support |
---|---|---|---|---|
IT・科学 (IT/Science) | 0.79 | 0.94 | 0.86 | 541 |
おもしろ (Funny) | 0.75 | 0.15 | 0.25 | 101 |
エンタメ (Entertainment) | 0.97 | 0.94 | 0.96 | 4039 |
グルメ (Gourmet) | 0.87 | 0.95 | 0.91 | 611 |
コラム (Column) | 0.81 | 0.87 | 0.83 | 1155 |
スポーツ (Sports) | 0.97 | 0.96 | 0.97 | 827 |
国内 (Domestic) | 0.87 | 0.82 | 0.84 | 671 |
海外 (International) | 0.84 | 0.86 | 0.85 | 336 |
avg / total | 0.91 | 0.91 | 0.91 | 8281 |
Five-fold cross-validation gave the following scores:
scores: [ 0.91063881 0.90882744 0.90725758 0.91038647 0.90736715]
with a mean of
average value: 0.908895489763
that is, an average accuracy of about 90.9% under cross-validation.
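
As a rough illustration of what five-fold cross-validation does, the sketch below splits the sample indices into five folds, holds each fold out once, and averages one score per fold. The helper names `k_fold_indices` and `cross_validate`, and the toy "predict the majority training label" scorer, are assumptions made for illustration and are not taken from this repository.

```python
from collections import Counter


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validate(labels, k=5):
    """Score a toy majority-label model (hypothetical stand-in) on each fold."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
        correct = sum(1 for i in test_idx if labels[i] == majority)
        scores.append(correct / len(test_idx))
    return scores, sum(scores) / len(scores)
```

Each sample is used for testing exactly once, so the mean over the folds depends less on one lucky or unlucky split than a single 8:2 evaluation does.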
Logistic regression classifier:

category | precision | recall | f1-score | support |
---|---|---|---|---|
IT・科学 (IT/Science) | 0.90 | 0.95 | 0.92 | 541 |
おもしろ (Funny) | 0.80 | 0.69 | 0.74 | 101 |
エンタメ (Entertainment) | 0.98 | 0.98 | 0.98 | 4039 |
グルメ (Gourmet) | 0.93 | 0.96 | 0.94 | 611 |
コラム (Column) | 0.91 | 0.89 | 0.90 | 1155 |
スポーツ (Sports) | 0.98 | 0.98 | 0.98 | 827 |
国内 (Domestic) | 0.90 | 0.87 | 0.88 | 671 |
海外 (International) | 0.88 | 0.91 | 0.89 | 336 |
avg / total | 0.95 | 0.95 | 0.95 | 8281 |
Five-fold cross-validation gave the following scores:
scores: [ 0.94046613 0.9364811 0.94119068 0.93538647 0.93913043]
with a mean of
average value: 0.938530962853
that is, an average accuracy of about 93.9% under cross-validation.
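
For readers unfamiliar with the columns in the tables above, the following minimal sketch computes per-category precision, recall, F-measure, and support from lists of true and predicted labels. The function name `per_class_report` and the toy labels are assumptions for illustration; the tables themselves were not produced by this code.

```python
def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each true label."""
    report = {}
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = (precision, recall, f1, tp + fn)  # support = true count
    return report


if __name__ == "__main__":
    # Toy data: a rare class that is mostly predicted as the common class.
    y_true = ["funny"] * 4 + ["sports"] * 2
    y_pred = ["funny", "sports", "sports", "sports", "sports", "sports"]
    for label, row in per_class_report(y_true, y_pred).items():
        print(label, row)
```

In the toy data the rare class keeps a high precision but a low recall, which is the same pattern the おもしろ (Funny) row shows for the Naive Bayes classifier.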
The execution environment is as follows:
Mac OS X: Sierra 10.12.2
Python: 3.6.1
In a terminal, run
$ brew update
$ brew install python3
$ pip install virtualenv
$ virtualenv --python=/usr/local/bin/python3 --no-site-packages env
$ source env/bin/activate
to create and activate the virtual environment.
Next, run
$ brew install mecab
$ brew install mecab-ipadic
$ git clone --depth 1 git@github.com:neologd/mecab-ipadic-neologd.git
$ ./mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -y -n
to install mecab-ipadic-neologd, which is used as the MeCab dictionary.
Then run
$ git clone git@github.com:ajingu/gunosy.git
$ cd gunosy
$ pip install -r requirements.txt
to install the required Python packages into the virtual environment.
Finally, environment variables must be written to .bash_profile. To keep the database password and similar values out of the source, they are added to .bash_profile as environment variables and read in the program through the os module.
$ cd ~
$ vim .bash_profile
to open .bash_profile, then add the lines
export GUNOSY_HOST="****"
export GUNOSY_USERNAME="****"
export GUNOSY_PASSWORD="****"
export GUNOSY_DATABASE_NAME="****"
export GUNOSY_TABLE_NAME="****"
to define the environment variables.
Each variable corresponds to the value listed below; set them to match the database you are using.
Environment variable | Value |
---|---|
GUNOSY_HOST | Host name |
GUNOSY_USERNAME | User name |
GUNOSY_PASSWORD | Password |
GUNOSY_DATABASE_NAME | Database name |
GUNOSY_TABLE_NAME | Table name |
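
Reading these variables through the os module might look like the sketch below. The function name `load_db_settings` and the choice to fail early on a missing variable are assumptions for illustration; the repository's own settings code may be organized differently.

```python
import os

# Variable names as listed in the table above.
REQUIRED_VARS = (
    "GUNOSY_HOST",
    "GUNOSY_USERNAME",
    "GUNOSY_PASSWORD",
    "GUNOSY_DATABASE_NAME",
    "GUNOSY_TABLE_NAME",
)


def load_db_settings(environ=os.environ):
    """Collect the GUNOSY_* variables (hypothetical helper), failing early if any is missing."""
    missing = [name for name in REQUIRED_VARS if name not in environ]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

Failing at startup with the list of missing names tends to be easier to debug than a database connection error that appears later.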
Note: this repository ships with trained classifier data by default, so the app also works if you start directly with $ python manage.py runserver
1. If you want to delete previously collected data before collecting data again, run the following command in the gunosychallenge repository:
$ python manage.py initialize
This command deletes all rows of the target table and initializes the database.
2. To collect data with scrapy, run the following command in the gunosychallenge repository:
$ python manage.py scrapy crawl gunosy
Data collection takes about 70 minutes and retrieves roughly 40,000 articles.
3. To train the Naive Bayes classifier, run the following command in the gunosychallenge repository:
$ python manage.py make_clf nb
Training takes about 8 minutes.
4. To start the web app, run the following command in the gunosychallenge repository:
$ python manage.py runserver
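The app is served from a local server, so once the command is running it is available at http://127.0.0.1:8000/.
As described in the overview above, enter the URL of an article chosen from https://gunosy.com/ into the central form and press the Analyze button; the app then predicts the article's category from among エンタメ (Entertainment), スポーツ (Sports), おもしろ (Funny), 国内 (Domestic), 海外 (International), コラム (Column), IT・科学 (IT/Science), and グルメ (Gourmet), and displays it on screen.
1. To train the logistic regression classifier, run the following command in the gunosychallenge repository: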
$ python manage.py make_clf logistic
Training takes about 10 minutes.
2. Starting the web app, entering an article URL, and predicting the category work exactly as in Step 1.
The application can also be tested.
・In the gunosy repository, running
$ python gunosynews/scrapy_test.py
runs the crawler tests.
・In the gunosychallenge repository, running
$ python manage.py test
runs the web app tests.
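・Collecting as many articles as possible
Relevant file: gunosy.py
To use as many articles as possible for training and testing, articles were collected from https://gunosy.com/tags.
At the time of writing there are 2,500 tags, numbered 1 to 2500, and each tag is assigned to one of eight categories: エンタメ (Entertainment), スポーツ (Sports), おもしろ (Funny), 国内 (Domestic), 海外 (International), コラム (Column), IT・科学 (IT/Science), and グルメ (Gourmet).
However, some tags have no category assigned, and some exist without any articles; the crawler is implemented to ignore such irregular tags.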
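
The rule for skipping irregular tags can be sketched independently of scrapy. In the sketch below, each tag page is assumed to have already been parsed into a dict with the hypothetical keys `tag_id`, `category`, and `articles`; this structure and the function name `usable_tags` are illustrative assumptions, not the layout used in gunosy.py.

```python
# The eight categories named in this README.
VALID_CATEGORIES = {
    "エンタメ", "スポーツ", "おもしろ", "国内",
    "海外", "コラム", "IT・科学", "グルメ",
}


def usable_tags(tag_pages):
    """Yield (tag_id, category, article_urls) for tags worth crawling.

    tag_pages is a hypothetical iterable of already-parsed tag pages.
    """
    for page in tag_pages:
        category = page.get("category")
        articles = page.get("articles") or []
        if category not in VALID_CATEGORIES:
            continue  # tag has no (known) category assigned
        if not articles:
            continue  # tag exists but lists no articles
        yield page["tag_id"], category, articles
```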
・Configuring MySQL in pipelines.py
Relevant file: pipelines.py
pipelines.py is configured with the MySQL database tied to the Django app, so that the flow from data collection to loading into the database runs smoothly.
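Relevant file: preprocess.py
Morphological analysis is performed with MeCab, a Japanese morphological analyzer, using mecab-ipadic-neologd as the dictionary.
This makes it possible to extract more appropriate feature words from news articles, which contain many proper nouns such as names of people and places. Feature words are also restricted to nouns and adjectives, so that the extracted words are more closely tied to the topic of the article.
Because the path to the dictionary differs between environments, preprocessing begins by locating the dictionary path.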
Relevant file: preprocess.py
Stop words are set by loading slothlib, a collection of Japanese stop words, from within the program.
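
The two filters applied during preprocessing (keeping only nouns and adjectives, then dropping stop words) can be sketched as follows. The sketch assumes the text has already been tokenized into `(surface, part_of_speech)` pairs, with the part-of-speech labels 名詞 (noun) and 形容詞 (adjective) that IPADIC-style dictionaries emit; the function name and the tiny stop-word set are hypothetical and stand in for the MeCab output and the slothlib list.

```python
KEPT_POS = {"名詞", "形容詞"}  # noun, adjective


def select_feature_words(tokens, stopwords):
    """Keep surfaces whose part of speech is kept and which are not stop words."""
    return [
        surface
        for surface, pos in tokens
        if pos in KEPT_POS and surface not in stopwords
    ]


if __name__ == "__main__":
    # Hypothetical tokenizer output for a short sentence.
    tokens = [("東京", "名詞"), ("は", "助詞"), ("こと", "名詞"), ("暑い", "形容詞")]
    sample_stopwords = {"こと"}  # illustrative only, not the real slothlib list
    print(select_feature_words(tokens, sample_stopwords))
```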
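Relevant files: NaiveBayes.py, Logistic.py
When a classifier is trained, the trained classifier is serialized with the dill library.
As a result, when an article URL is entered in the web app, a new article can be analyzed simply by loading the classifier that has already been built, which greatly shortens the time needed to determine the category.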
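
The train-once, load-at-request-time pattern looks roughly like the sketch below. It uses the standard-library pickle module as a stand-in so the example runs without extra packages; dill offers `dump` and `load` functions with the same calling convention, so swapping the import is the only change needed in principle. The function names, the file name, and the dict used in place of a trained model are assumptions for illustration.

```python
import os
import pickle  # stand-in for dill, which exposes dump/load in the same way
import tempfile


def save_classifier(clf, path):
    """Serialize a trained classifier object to disk."""
    with open(path, "wb") as f:
        pickle.dump(clf, f)


def load_classifier(path):
    """Restore a previously serialized classifier without retraining."""
    with open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    # A plain dict stands in for a trained model here.
    trained = {"vocabulary": ["goal", "recipe"], "weights": [0.7, -0.2]}
    path = os.path.join(tempfile.mkdtemp(), "clf.pkl")  # hypothetical file name
    save_classifier(trained, path)
    print(load_classifier(path) == trained)
```

Only load serialized files you created yourself: unpickling (with pickle or dill) can execute arbitrary code from the file.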
Relevant file: NaiveBayes.py
Laplace smoothing is implemented to avoid the "zero-frequency problem" of Naive Bayes: if a test document contains a word that never appeared in a category during training, the probability of that category becomes 0.
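
A minimal multinomial Naive Bayes with Laplace (add-alpha) smoothing is sketched below to show how the smoothing term keeps every word probability above zero. The class name `LaplaceNaiveBayes` and its methods are assumptions for illustration and do not reproduce NaiveBayes.py.

```python
import math
from collections import Counter, defaultdict


class LaplaceNaiveBayes:
    """Hypothetical minimal multinomial Naive Bayes with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        """docs: list of token lists; labels: one category per document."""
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for words, label in zip(docs, labels):
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def log_score(self, words, label):
        """Log prior plus smoothed log likelihoods of the known words."""
        total_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        denom = total_words + self.alpha * len(self.vocab)
        for w in words:
            if w not in self.vocab:
                continue  # words never seen in any training document are skipped
            # Adding alpha keeps the probability positive even when the
            # word never occurred in this category.
            score += math.log((self.word_counts[label][w] + self.alpha) / denom)
        return score

    def predict(self, words):
        return max(self.class_counts, key=lambda c: self.log_score(words, c))
```

With `alpha=0` the logarithm would be taken of zero for any word missing from a category, which is the zero-frequency problem described above.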
Relevant file: Logistic.py
Naive Bayes assumes that word occurrences are mutually independent and multiplies the conditional probabilities of the words together, which, as with the zero-frequency problem mentioned above, makes the result sensitive to words that were absent during training. For this reason a logistic regression model was adopted: it is widely used for category classification and computes its output only from words seen during training (for a feature word absent from training, the input to the logistic function is 0, so it has practically no effect).
TF-IDF is computed for each word so that every word receives an appropriate weight.
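
As a reminder of what TF-IDF weighting does, the sketch below implements the textbook form, term frequency times log(N / document frequency), in plain Python. This is an illustrative assumption: library implementations often use smoothed or normalized variants, and the exact variant used in Logistic.py is not stated here.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for words in docs:
        doc_freq.update(set(words))  # count each word once per document
    weighted = []
    for words in docs:
        counts = Counter(words)
        weighted.append({
            w: (c / len(words)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return weighted
```

A word that occurs in every document gets weight 0 under this definition, while a word concentrated in few documents is weighted up.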
Because the number of samples differs greatly between categories, the Naive Bayes classifier reaches a recall of only 0.15 on the おもしろ (Funny) category. This category has far fewer samples than the others, so articles that actually belong to it are often predicted as another category, which produces many false negatives.
To reduce this sample-size bias, class_weight, the weighting parameter of the LogisticRegression model, was set to "balanced" so that each category is weighted in inverse proportion to its number of samples. This improved the recall of the おもしろ (Funny) category by more than 50 percentage points.
Undersampling the data so that every category has the same number of samples was also considered, but that would cut the dataset to about 4,000 samples in total and sharply reduce accuracy, so the class_weight approach was chosen instead.
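
In scikit-learn, `class_weight="balanced"` assigns each class the weight n_samples / (n_classes * class_count). The sketch below reproduces that formula in plain Python; the function name is hypothetical, and the counts in the example are the support values from the tables above (test-set counts, used here only to illustrate the ratio between categories).

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


if __name__ == "__main__":
    # Support values from the results tables, used as illustrative counts.
    supports = {
        "IT・科学": 541, "おもしろ": 101, "エンタメ": 4039, "グルメ": 611,
        "コラム": 1155, "スポーツ": 827, "国内": 671, "海外": 336,
    }
    for label, weight in balanced_class_weights(supports).items():
        print(label, round(weight, 2))
```

With these counts the smallest category receives a weight roughly 40 times that of the largest, so its misclassifications cost correspondingly more during training.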
GridSearch was used to optimize the value of C, the regularization parameter of the LogisticRegression model.
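Relevant file: views.py
When a URL that is not a https://gunosy.com/ article URL is entered, the expected HTML structure cannot be found, and the Django error page is shown instead of the web app's screen.
To prevent this, exception handling was added in advance so that an error message is shown on the web app's screen whenever an unsuitable URL is entered.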
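
The idea can be sketched without Django as a function that returns either a category or the error message for the template to render. Everything here is an assumption for illustration: the names `check_article_url` and `analyze`, the `/articles/` path prefix used to recognize article URLs, and the `classify` callable standing in for the scraping and classification code. Only the error message text comes from this README.

```python
from urllib.parse import urlparse

ERROR_MESSAGE = "Please submit a gunosy article"


class NotAGunosyArticle(Exception):
    """Raised when the submitted URL cannot be treated as a Gunosy article."""


def check_article_url(url):
    """Reject URLs that are not on gunosy.com (path prefix is an assumption)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or parsed.netloc != "gunosy.com":
        raise NotAGunosyArticle(url)
    if not parsed.path.startswith("/articles/"):
        raise NotAGunosyArticle(url)
    return url


def analyze(url, classify):
    """Return the context a template would render: a category or an error."""
    try:
        category = classify(check_article_url(url))
    except Exception:
        # Broad on purpose: any validation, scraping, or parsing failure is
        # reported with the same message instead of a framework error page.
        return {"error": ERROR_MESSAGE}
    return {"category": category}
```

For example, `analyze("https://example.com/x", classify)` yields the error context, while a valid article URL yields whatever category `classify` returns.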
Relevant files: settings.py (in the gunosychallenge directory), database.py, pipelines.py
Setting the environment variables in ~/.bash_profile avoids publishing the MySQL password.