diff --git a/paper_zoo/textrecog/A Vision Transformer Based Scene Text Recognizer with Multi-grained Encoding and Decoding.yaml b/paper_zoo/textrecog/A Vision Transformer Based Scene Text Recognizer with Multi-grained Encoding and Decoding.yaml
new file mode 100644
index 000000000..2a2eecb7e
--- /dev/null
+++ b/paper_zoo/textrecog/A Vision Transformer Based Scene Text Recognizer with Multi-grained Encoding and Decoding.yaml
@@ -0,0 +1,79 @@
+Title: 'A Vision Transformer Based Scene Text Recognizer with Multi-grained Encoding and Decoding'
+Abbreviation: Qiao et al
+Tasks:
+  - TextRecog
+Venue: ICFHR
+Year: 2022
+Lab/Company:
+  - Tomorrow Advancing Life, Beijing, China
+URL:
+  Venue: 'https://link.springer.com/chapter/10.1007/978-3-031-21648-0_14'
+  Arxiv: 'https://books.google.fr/books?hl=zh-CN&lr=&id=hvmdEAAAQBAJ&oi=fnd&pg=PA198&ots=Gg_BaAnXLm&sig=gpJ2h9NjKz1PjLWSfwDpyd8eLZE&redir_esc=y#v=onepage&q&f=false'
+Paper Reading URL: N/A
+Code: N/A
+Supported In MMOCR: N/S
+PaperType:
+  - Algorithm
+Abstract: 'Recently, vision Transformer (ViT) has attracted more and more attention,
+many works introduce the ViT into concrete vision tasks and achieve impressive
+performance. However, there are only a few works focused on the applications of
+the ViT for scene text recognition. This paper takes a further step and proposes
+a strong scene text recognizer with a fully ViT-based architecture.
+Specifically, we introduce multi-grained features into both the encoder and
+decoder. For the encoder, we adopt a two-stage ViT with different grained
+patches, where the first stage extracts extent visual features with 2D
+fine-grained patches and the second stage aims at the sequence of contextual
+features with 1D coarse-grained patches. The decoder integrates Connectionist
+Temporal Classification (CTC)-based and attention-based decoding, where the
+two decoding schemes introduce different grained features into the decoder and
+benefit from each other with a deep interaction. To improve the extraction of
+fine-grained features, we additionally explore self-supervised learning for
+text recognition with masked autoencoders. Furthermore, a focusing mechanism is
+proposed to let the model target the pixel reconstruction of the text area. Our
+proposed method achieves state-of-the-art or comparable accuracies on benchmarks
+of scene text recognition with a faster inference speed and nearly 50% reduction
+of parameters compared with other recent works.'
+MODELS:
+  Architecture:
+    - CTC
+    - Attention
+    - Transformer
+  Learning Method:
+    - Self-Supervised
+    - Supervised
+  Language Modality:
+    - Implicit Language Model
+  Network Structure: 'https://user-images.githubusercontent.com/65173622/210053998-385587ef-2b0e-4c9b-a8b8-d6171261c621.png'
+  FPS:
+    DEVICE: N/A
+    ITEM: N/A
+  FLOPS:
+    DEVICE: N/A
+    ITEM: N/A
+  PARAMS: N/A
+  Experiment:
+    Training DataSets:
+      - ST
+      - MJ
+    Test DataSets:
+      Avg.: 90.5
+      IIIT5K:
+        WAICS: 96.1
+      SVT:
+        WAICS: 92.3
+      IC13:
+        WAICS: 95.0
+      IC15:
+        WAICS: 86.0
+      SVTP:
+        WAICS: 87.0
+      CUTE:
+        WAICS: 86.8
+Bibtex: '@inproceedings{qiao2022vision,
+  title={A Vision Transformer Based Scene Text Recognizer with Multi-grained Encoding and Decoding},
+  author={Qiao, Zhi and Ji, Zhilong and Yuan, Ye and Bai, Jinfeng},
+  booktitle={International Conference on Frontiers in Handwriting Recognition},
+  pages={198--212},
+  year={2022},
+  organization={Springer}
+}'
diff --git a/paper_zoo/textrecog/Levenshtein OCR.yaml b/paper_zoo/textrecog/Levenshtein OCR.yaml
new file mode 100644
index 000000000..782e76594
--- /dev/null
+++ b/paper_zoo/textrecog/Levenshtein OCR.yaml
@@ -0,0 +1,72 @@
+Title: 'Levenshtein OCR'
+Abbreviation: LevOCR
+Tasks:
+  - TextRecog
+Venue: ECCV
+Year: 2022
+Lab/Company:
+  - Alibaba DAMO Academy, Beijing, China
+URL:
+  Venue: 'https://link.springer.com/chapter/10.1007/978-3-031-19815-1_19'
+  Arxiv: 'https://arxiv.org/abs/2209.03594'
+Paper Reading URL: 'https://mp.weixin.qq.com/s/Nuc8j3V5YeaXpY64SsIeCw'
+Code: 'https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/LevOCR'
+Supported In MMOCR: N/S
+PaperType:
+  - Algorithm
+Abstract: 'A novel scene text recognizer based on Vision-Language Transformer
+(VLT) is presented. Inspired by Levenshtein Transformer in the area of NLP, the
+proposed method (named Levenshtein OCR, and LevOCR for short) explores an
+alternative way for automatically transcribing textual content from cropped
+natural images. Specifically, we cast the problem of scene text recognition as
+an iterative sequence refinement process. The initial prediction sequence
+produced by a pure vision model is encoded and fed into a cross-modal
+transformer to interact and fuse with the visual features, to progressively
+approximate the ground truth. The refinement process is accomplished via two
+basic character-level operations: deletion and insertion, which are learned with
+imitation learning and allow for parallel decoding, dynamic length change and
+good interpretability. The quantitative experiments clearly demonstrate that
+LevOCR achieves state-of-the-art performances on standard benchmarks and the
+qualitative analyses verify the effectiveness and advantage of the proposed
+LevOCR algorithm. Code will be released soon.'
+MODELS:
+  Architecture:
+    - Transformer
+  Learning Method:
+    - Supervised
+  Language Modality:
+    - Explicit Language Model
+  Network Structure: 'https://user-images.githubusercontent.com/65173622/210163468-bb6c14ba-134a-4dd5-881e-a7adb4058dcd.png'
+  FPS:
+    DEVICE: N/A
+    ITEM: N/A
+  FLOPS:
+    DEVICE: N/A
+    ITEM: N/A
+  PARAMS: N/A
+  Experiment:
+    Training DataSets:
+      - ST
+      - MJ
+    Test DataSets:
+      Avg.: 92.1
+      IIIT5K:
+        WAICS: 96.6
+      SVT:
+        WAICS: 92.9
+      IC13:
+        WAICS: 96.9
+      IC15:
+        WAICS: 86.4
+      SVTP:
+        WAICS: 88.1
+      CUTE:
+        WAICS: 91.7
+Bibtex: '@inproceedings{da2022levenshtein,
+  title={Levenshtein OCR},
+  author={Da, Cheng and Wang, Peng and Yao, Cong},
+  booktitle={European Conference on Computer Vision},
+  pages={322--338},
+  year={2022},
+  organization={Springer}
+}'
diff --git a/paper_zoo/textrecog/Multi-Granularity Prediction for Scene Text Recognition.yaml b/paper_zoo/textrecog/Multi-Granularity Prediction for Scene Text Recognition.yaml
new file mode 100644
index 000000000..e6f7d34a9
--- /dev/null
+++ b/paper_zoo/textrecog/Multi-Granularity Prediction for Scene Text Recognition.yaml
@@ -0,0 +1,74 @@
+Title: 'Multi-Granularity Prediction for Scene Text Recognition'
+Abbreviation: MGP-STR
+Tasks:
+  - TextRecog
+Venue: ECCV
+Year: 2022
+Lab/Company:
+  - Alibaba DAMO Academy, Beijing, China
+URL:
+  Venue: 'https://link.springer.com/chapter/10.1007/978-3-031-19815-1_20'
+  Arxiv: 'https://arxiv.org/abs/2209.03592'
+Paper Reading URL: N/A
+Code: 'https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/MGP-STR'
+Supported In MMOCR: N/S
+PaperType:
+  - Algorithm
+Abstract: 'Scene text recognition (STR) has been an active research topic in
+computer vision for years. To tackle this challenging problem, numerous
+innovative methods have been successively proposed and incorporating linguistic
+knowledge into STR models has recently become a prominent trend. In this work,
+we first draw inspiration from the recent progress in Vision Transformer (ViT)
+to construct a conceptually simple yet powerful vision STR model, which is built
+upon ViT and outperforms previous state-of-the-art models for scene text
+recognition, including both pure vision models and language-augmented methods.
+To integrate linguistic knowledge, we further propose a Multi-Granularity
+Prediction strategy to inject information from the language modality into the
+model in an implicit way, i.e., subword representations (BPE and WordPiece)
+widely used in NLP are introduced into the output space, in addition to the
+conventional character level representation, while no independent language model
+(LM) is adopted. The resultant algorithm (termed MGP-STR) is able to push the
+performance envelope of STR to an even higher level. Specifically, it achieves
+an average recognition accuracy of 93.35% on standard benchmarks. Code will be
+released soon.'
+MODELS:
+  Architecture:
+    - Transformer
+  Learning Method:
+    - Supervised
+  Language Modality:
+    - Implicit Language Model
+  Network Structure: 'https://user-images.githubusercontent.com/65173622/210163378-fc11a79b-fb7d-4a3f-947e-a8f6dfd14dd2.png'
+  FPS:
+    DEVICE: N/A
+    ITEM: N/A
+  FLOPS:
+    DEVICE: N/A
+    ITEM: N/A
+  PARAMS: N/A
+  Experiment:
+    Training DataSets:
+      - ST
+      - MJ
+    Test DataSets:
+      Avg.: 92.8
+      IIIT5K:
+        WAICS: 96.4
+      SVT:
+        WAICS: 94.7
+      IC13:
+        WAICS: 97.3
+      IC15:
+        WAICS: 87.2
+      SVTP:
+        WAICS: 91.0
+      CUTE:
+        WAICS: 90.3
+Bibtex: '@inproceedings{wang2022multi,
+  title={Multi-granularity Prediction for Scene Text Recognition},
+  author={Wang, Peng and Da, Cheng and Yao, Cong},
+  booktitle={European Conference on Computer Vision},
+  pages={339--355},
+  year={2022},
+  organization={Springer}
+}'
diff --git a/paper_zoo/textrecog/On Vocabulary Reliance in Scene Text Recognition.yaml b/paper_zoo/textrecog/On Vocabulary Reliance in Scene Text Recognition.yaml
new file mode 100644
index 000000000..391c45331
--- /dev/null
+++ b/paper_zoo/textrecog/On Vocabulary Reliance in Scene Text Recognition.yaml
@@ -0,0 +1,77 @@
+Title: 'On Vocabulary Reliance in Scene Text Recognition'
+Abbreviation: Wan et al
+Tasks:
+  - TextRecog
+Venue: CVPR
+Year: 2020
+Lab/Company:
+  - Megvii
+  - China University of Mining and Technology
+  - University of Rochester
+URL:
+  Venue: 'http://openaccess.thecvf.com/content_CVPR_2020/html/Wan_On_Vocabulary_Reliance_in_Scene_Text_Recognition_CVPR_2020_paper.html'
+  Arxiv: 'https://arxiv.org/abs/2005.03959'
+Paper Reading URL: N/A
+Code: N/A
+Supported In MMOCR: N/S
+PaperType:
+  - Algorithm
+Abstract: 'The pursuit of high performance on public benchmarks has been the
+driving force for research in scene text recognition, and notable progress has
+been achieved. However, a close investigation reveals a startling fact that the
+state-of-the-art methods perform well on images with words within vocabulary but
+generalize poorly to images with words outside vocabulary. We call this
+phenomenon “vocabulary reliance”. In this paper, we establish an analytical
+framework to conduct an in-depth study on the problem of vocabulary reliance
+in scene text recognition. Key findings include: (1) Vocabulary reliance is
+ubiquitous, i.e., all existing algorithms more or less exhibit such a
+characteristic; (2) Attention-based decoders prove weak in generalizing to
+words outside vocabulary and segmentation-based decoders perform well in
+utilizing visual features; (3) Context modeling is highly coupled with the
+prediction layers. These findings provide new insights and can benefit future
+research in scene text recognition. Furthermore, we propose a simple yet
+effective mutual learning strategy to allow models of two families
+(attention-based and segmentation-based) to learn collaboratively. This remedy
+alleviates the problem of vocabulary reliance and improves the overall scene
+text recognition performance.'
+MODELS:
+  Architecture:
+    - CTC
+    - Attention
+  Learning Method:
+    - Supervised
+  Language Modality:
+    - Implicit Language Model
+  Network Structure: 'https://user-images.githubusercontent.com/65173622/210054683-5d5f3117-4bee-43d6-a36c-8e645d47c2b1.png'
+  FPS:
+    DEVICE: N/A
+    ITEM: N/A
+  FLOPS:
+    DEVICE: N/A
+    ITEM: N/A
+  PARAMS: N/A
+  Experiment:
+    Training DataSets:
+      - ST
+      - MJ
+    Test DataSets:
+      Avg.: N/A
+      IIIT5K:
+        WAICS: N/A
+      SVT:
+        WAICS: N/A
+      IC13:
+        WAICS: N/A
+      IC15:
+        WAICS: N/A
+      SVTP:
+        WAICS: N/A
+      CUTE:
+        WAICS: N/A
+Bibtex: '@inproceedings{wan2020vocabulary,
+  title={On vocabulary reliance in scene text recognition},
+  author={Wan, Zhaoyi and Zhang, Jielei and Zhang, Liang and Luo, Jiebo and Yao, Cong},
+  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
+  pages={11425--11434},
+  year={2020}
+}'
diff --git a/paper_zoo/textrecog/Parallel and Robust Text Rectifier for Scene Text Recognition.yaml b/paper_zoo/textrecog/Parallel and Robust Text Rectifier for Scene Text Recognition.yaml
new file mode 100644
index 000000000..c1833a008
--- /dev/null
+++ b/paper_zoo/textrecog/Parallel and Robust Text Rectifier for Scene Text Recognition.yaml
@@ -0,0 +1,79 @@
+Title: 'Parallel and Robust Text Rectifier for Scene Text Recognition'
+Abbreviation: PRTR
+Tasks:
+  - TextRecog
+Venue: BMVC
+Year: 2022
+Lab/Company:
+  - Visual Computing Group, Ping An Property & Casualty Insurance Company, Shenzhen, China
+  - Ping An Technology (Shenzhen) Co. Ltd.
+  - School of Information and Telecommunication Engineering, Guangzhou Maritime University, Guangzhou, China
+URL:
+  Venue: 'https://bmvc2022.mpi-inf.mpg.de/0770.pdf'
+  Arxiv: 'https://bmvc2022.mpi-inf.mpg.de/0770.pdf'
+Paper Reading URL: N/A
+Code: N/A
+Supported In MMOCR: N/S
+PaperType:
+  - Algorithm
+Abstract: 'Scene text recognition (STR) aims to recognize text appearing in images.
+Current state-of-the-art STR methods usually adopt a multi-stage framework which
+uses a rectifier to iteratively rectify errors from the previous stage. However,
+the rectifiers of those models are not proficient in addressing the misalignment
+problem. To alleviate this problem, we propose a novel network named Parallel
+and Robust Text Rectifier (PRTR), which consists of a bi-directional position
+attention initial decoder and a sequence of stacked Robust Visual Semantic
+Rectifiers (RVSRs). In essence, PRTR is creatively designed as a coarse-to-fine
+architecture that exploits a sequence of rectifiers for repeatedly refining the
+prediction in a stage-wise manner. RVSR is a core component in the proposed
+model which comprises two key modules, the Dual-Path Semantic Alignment (DPSA)
+module and the Visual-Linguistic Alignment (VLA) module. DPSA can rectify the
+linguistic misalignment issues via the global semantic features that are derived
+from the recognized characters as a whole, while VLA re-aligns the linguistic
+features with visual features by an attention model to avoid the overfitting of
+linguistic features. All parts of PRTR are non-autoregressive (parallel), and
+its RVSR re-aligns its output according to the linguistic features and the
+visual features, so it is robust to misalignment errors. Extensive experiments
+on mainstream benchmarks demonstrate that the proposed model can alleviate
+the misalignment problem to a large extent and outperforms state-of-the-art
+models.'
+MODELS:
+  Architecture:
+    - Transformer
+  Learning Method:
+    - Supervised
+  Language Modality:
+    - Explicit Language Model
+  Network Structure: 'https://user-images.githubusercontent.com/65173622/210052800-ab1f29d1-de7c-43bd-8297-b13cd83e28d3.png'
+  FPS:
+    DEVICE: N/A
+    ITEM: N/A
+  FLOPS:
+    DEVICE: N/A
+    ITEM: N/A
+  PARAMS: N/A
+  Experiment:
+    Training DataSets:
+      - ST
+      - SA
+      - MJ
+    Test DataSets:
+      Avg.: 93.3
+      IIIT5K:
+        WAICS: 97.0
+      SVT:
+        WAICS: 94.4
+      IC13:
+        WAICS: 95.8
+      IC15:
+        WAICS: 86.1
+      SVTP:
+        WAICS: 89.8
+      CUTE:
+        WAICS: 96.5
+Bibtex: '@inproceedings{prtr2022parallel,
+  title={Parallel and Robust Text Rectifier for Scene Text Recognition},
+  booktitle={British Machine Vision Conference},
+  year={2022}
+}'
diff --git a/paper_zoo/textrecog/SGBANet: Semantic GAN and Balanced Attention Network for Arbitrarily Oriented Scene Text Recognition.yaml b/paper_zoo/textrecog/SGBANet: Semantic GAN and Balanced Attention Network for Arbitrarily Oriented Scene Text Recognition.yaml
new file mode 100644
index 000000000..2517e6a17
--- /dev/null
+++ b/paper_zoo/textrecog/SGBANet: Semantic GAN and Balanced Attention Network for Arbitrarily Oriented Scene Text Recognition.yaml
@@ -0,0 +1,82 @@
+Title: 'SGBANet: Semantic GAN and Balanced Attention Network for Arbitrarily Oriented Scene Text Recognition'
+Abbreviation: SGBANet
+Tasks:
+  - TextRecog
+Venue: ECCV
+Year: 2022
+Lab/Company:
+  - Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University, Shanghai, China
+  - Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia
+  - iFLYTEK Research, iFLYTEK, Hefei, China
+  - CVPR Unit, Indian Statistical Institute, Kolkata, India
+URL:
+  Venue: 'https://link.springer.com/chapter/10.1007/978-3-031-19815-1_27'
+  Arxiv: 'https://arxiv.org/abs/2207.10256'
+Paper Reading URL: N/A
+Code: N/A
+Supported In MMOCR: N/S
+PaperType:
+  - Algorithm
+Abstract: 'Scene text recognition is a challenging task due to the complex
+backgrounds and diverse variations of text instances. In this paper, we
+propose a novel Semantic GAN and Balanced Attention Network (SGBANet) to
+recognize the texts in scene images. The proposed method first generates
+the simple semantic feature using Semantic GAN and then recognizes the scene
+text with the Balanced Attention Module. The Semantic GAN aims to align the
+semantic feature distribution between the support domain and target domain.
+Different from the conventional image-to-image translation methods that
+perform at the image level, the Semantic GAN performs the generation and
+discrimination on the semantic level with the Semantic Generator Module
+(SGM) and the Semantic Discriminator Module (SDM). For target images (scene text
+images), the Semantic Generator Module generates simple semantic features
+that share the same feature distribution with support images (clear text
+images). The Semantic Discriminator Module is used to distinguish the semantic
+features between the support domain and target domain. In addition, a
+Balanced Attention Module is designed to alleviate the problem of attention
+drift. The Balanced Attention Module first learns a balancing parameter based
+on the visual glimpse vector and semantic glimpse vector, and then performs
+the balancing operation for obtaining a balanced glimpse vector.
+Experiments on six benchmarks, including regular datasets, i.e., IIIT5K, SVT,
+ICDAR2013, and irregular datasets, i.e., ICDAR2015, SVTP, CUTE80, validate the
+effectiveness of our proposed method.'
+MODELS:
+  Architecture:
+    - Attention
+  Learning Method:
+    - Supervised
+  Language Modality:
+    - Implicit Language Model
+  Network Structure: 'https://user-images.githubusercontent.com/65173622/210163800-3ecb592b-daae-450f-907b-cd239b2af1c0.png'
+  FPS:
+    DEVICE: N/A
+    ITEM: N/A
+  FLOPS:
+    DEVICE: N/A
+    ITEM: N/A
+  PARAMS: N/A
+  Experiment:
+    Training DataSets:
+      - ST
+      - MJ
+    Test DataSets:
+      Avg.: 88.21
+      IIIT5K:
+        WAICS: 95.4
+      SVT:
+        WAICS: 89.1
+      IC13:
+        WAICS: 95.1
+      IC15:
+        WAICS: 78.4
+      SVTP:
+        WAICS: 83.1
+      CUTE:
+        WAICS: 88.2
+Bibtex: '@inproceedings{zhong2022sgbanet,
+  title={SGBANet: Semantic GAN and Balanced Attention Network for Arbitrarily Oriented Scene Text Recognition},
+  author={Zhong, Dajian and Lyu, Shujing and Shivakumara, Palaiahnakote and Yin, Bing and Wu, Jiajia and Pal, Umapada and Lu, Yue},
+  booktitle={European Conference on Computer Vision},
+  pages={464--480},
+  year={2022},
+  organization={Springer}
+}'
diff --git a/paper_zoo/textrecog/Scene Text Detection and Recognition: The Deep Learning Era.yaml b/paper_zoo/textrecog/Scene Text Detection and Recognition: The Deep Learning Era.yaml
new file mode 100644
index 000000000..d7d276b77
--- /dev/null
+++ b/paper_zoo/textrecog/Scene Text Detection and Recognition: The Deep Learning Era.yaml
@@ -0,0 +1,74 @@
+Title: 'Scene Text Detection and Recognition: The Deep Learning Era'
+Abbreviation: Long et al
+Tasks:
+  - TextRecog
+  - TextDet
+Venue: IJCV
+Year: 2021
+Lab/Company:
+  - Carnegie Mellon University
+  - Megvii (Face++)
+URL:
+  Venue: 'https://link.springer.com/article/10.1007/s11263-020-01369-0'
+  Arxiv: 'https://arxiv.org/abs/1811.04256'
+Paper Reading URL: N/A
+Code: 'https://github.com/Jyouhou/SceneTextPapers'
+Supported In MMOCR: N/S
+PaperType:
+  - Survey
+Abstract: 'With the rise and development of deep learning, computer vision has
+been tremendously transformed and reshaped. As an important research area in
+computer vision, scene text detection and recognition has been inevitably
+influenced by this wave of revolution, consequently entering the era of
+deep learning. In recent years, the community has witnessed substantial
+advancements in mindset, methodology and performance. This survey is aimed at
+summarizing and analyzing the major changes and significant progress of
+scene text detection and recognition in the deep learning era. Through this
+article, we aim to: (1) introduce new insights and ideas; (2) highlight
+recent techniques and benchmarks; (3) look ahead into future trends.
+Specifically, we will emphasize the dramatic differences brought by deep
+learning and the remaining grand challenges. We expect that this review paper
+would serve as a reference book for researchers in this field.'
+MODELS:
+  Architecture:
+    - CTC
+    - Attention
+    - Transformer
+  Learning Method:
+    - Supervised
+  Language Modality:
+    - Explicit Language Model
+    - Implicit Language Model
+  Network Structure: N/A
+  FPS:
+    DEVICE: N/A
+    ITEM: N/A
+  FLOPS:
+    DEVICE: N/A
+    ITEM: N/A
+  PARAMS: N/A
+  Experiment:
+    Training DataSets: N/A
+    Test DataSets:
+      Avg.: N/A
+      IIIT5K:
+        WAICS: N/A
+      SVT:
+        WAICS: N/A
+      IC13:
+        WAICS: N/A
+      IC15:
+        WAICS: N/A
+      SVTP:
+        WAICS: N/A
+      CUTE:
+        WAICS: N/A
+Bibtex: '@article{long2021scene,
+  title={Scene text detection and recognition: The deep learning era},
+  author={Long, Shangbang and He, Xin and Yao, Cong},
+  journal={International Journal of Computer Vision},
+  volume={129},
+  number={1},
+  pages={161--184},
+  year={2021},
+  publisher={Springer}
+}'
diff --git a/paper_zoo/textrecog/Vision Transformer for Fast and Efficient Scene Text Recognition.yaml b/paper_zoo/textrecog/Vision Transformer for Fast and Efficient Scene Text Recognition.yaml
new file mode 100644
index 000000000..d9deb7d62
--- /dev/null
+++ b/paper_zoo/textrecog/Vision Transformer for Fast and Efficient Scene Text Recognition.yaml
@@ -0,0 +1,75 @@
+Title: 'Vision Transformer for Fast and Efficient Scene Text Recognition'
+Abbreviation: ViTSTR
+Tasks:
+  - TextRecog
+Venue: ICDAR
+Year: 2021
+Lab/Company:
+  - Electrical and Electronics Engineering Institute, University of the Philippines, Quezon City, Philippines
+URL:
+  Venue: 'https://link.springer.com/chapter/10.1007/978-3-030-86549-8_21'
+  Arxiv: 'https://arxiv.org/abs/2105.08582'
+Paper Reading URL: N/A
+Code: 'https://github.com/roatienza/deep-text-recognition-benchmark'
+Supported In MMOCR: N/S
+PaperType:
+  - Algorithm
+Abstract: 'Scene text recognition (STR) enables computers to read text in natural
+scenes such as object labels, road signs and instructions. STR helps machines
+make informed decisions such as what object to pick, which direction to go,
+and what is the next step of action. In the body of work on STR, the focus has
+always been on recognition accuracy. There is little emphasis placed on speed
+and computational efficiency, which are equally important, especially for
+energy-constrained mobile machines. In this paper we propose ViTSTR, an STR
+with a simple single-stage model architecture built on a compute and parameter
+efficient vision transformer (ViT). On a comparable strong baseline method such
+as TRBA with accuracy of 84.3%, our small ViTSTR achieves a competitive accuracy
+of 82.6% (84.2% with data augmentation) at 2.4× speed up, using only 43.4% of
+the number of parameters and 42.2% FLOPS. The tiny version of ViTSTR achieves
+80.3% accuracy (82.1% with data augmentation), at 2.5× the speed, requiring
+only 10.9% of the number of parameters and 11.9% FLOPS. With data augmentation,
+our base ViTSTR outperforms TRBA at 85.2% accuracy (83.7% without augmentation)
+at 2.3× the speed but requires 73.2% more parameters and 61.5% more FLOPS. In
+terms of trade-offs, nearly all ViTSTR configurations are at or near the frontiers
+to maximize accuracy, speed and computational efficiency all at the same time.'
+MODELS:
+  Architecture:
+    - Transformer
+  Learning Method:
+    - Supervised
+  Language Modality:
+    - Implicit Language Model
+  Network Structure: 'https://user-images.githubusercontent.com/65173622/210161050-476296e7-10e5-4ec9-9024-af6b5c5ee84b.png'
+  FPS:
+    DEVICE: N/A
+    ITEM: N/A
+  FLOPS:
+    DEVICE: 2080Ti
+    ITEM: 17.6e9
+  PARAMS: 85.8e6
+  Experiment:
+    Training DataSets:
+      - ST
+      - MJ
+    Test DataSets:
+      Avg.: 84.0
+      IIIT5K:
+        WAICS: 88.4
+      SVT:
+        WAICS: 87.7
+      IC13:
+        WAICS: 92.4
+      IC15:
+        WAICS: 72.6
+      SVTP:
+        WAICS: 81.8
+      CUTE:
+        WAICS: 81.3
+Bibtex: '@inproceedings{atienza2021vision,
+  title={Vision transformer for fast and efficient scene text recognition},
+  author={Atienza, Rowel},
+  booktitle={International Conference on Document Analysis and Recognition},
+  pages={319--334},
+  year={2021},
+  organization={Springer}
+}'
diff --git a/paper_zoo/textrecog/Visual-Semantic Transformer for Scene Text Recognition.yaml b/paper_zoo/textrecog/Visual-Semantic Transformer for Scene Text Recognition.yaml
new file mode 100644
index 000000000..c30bd9adf
--- /dev/null
+++ b/paper_zoo/textrecog/Visual-Semantic Transformer for Scene Text Recognition.yaml
@@ -0,0 +1,78 @@
+Title: 'Visual-Semantic Transformer for Scene Text Recognition'
+Abbreviation: VST
+Tasks:
+  - TextRecog
+Venue: BMVC
+Year: 2022
+Lab/Company:
+  - Visual Computing Group, Ping An Property & Casualty Insurance Company, Shenzhen, China
+  - Ping An Technology (Shenzhen) Co. Ltd.
+  - School of Information and Telecommunication Engineering, Guangzhou Maritime University, Guangzhou, China
+URL:
+  Venue: 'https://bmvc2022.mpi-inf.mpg.de/0772.pdf'
+  Arxiv: 'https://arxiv.org/abs/2112.00948'
+Paper Reading URL: N/A
+Code: N/A
+Supported In MMOCR: N/S
+PaperType:
+  - Algorithm
+Abstract: 'Semantic information plays an important role in scene text recognition
+(STR) as well as visual information. Although state-of-the-art models have
+achieved great improvement in STR, they usually rely on extra external language
+models to refine the semantic features through context information, and the
+separate utilization of semantic and visual information leads to biased
+results, which limits the performance of those models. In this paper, we
+propose a novel model called Visual-Semantic Transformer (VST) for text
+recognition. VST consists of several key modules, including a ConvNet, a visual
+module, two visual-semantic modules, a visual-semantic feature interaction
+module and a semantic module. VST is a conceptually much simpler model.
+Different from existing STR models, VST can efficiently extract semantic
+features without using external language models and it also allows visual
+features and semantic features to interact with each other in parallel so that
+global information from two domains can be fully exploited and more powerful
+representations can be learned. The working mechanism of VST is highly similar
+to our cognitive system, where the visual information is first captured by our
+sensory organ, and is simultaneously transformed to semantic information by our
+brain. Extensive experiments on seven public benchmarks including
+regular/irregular text recognition datasets verify the effectiveness of VST:
+it outperforms 14 other popular models on four out of seven benchmark datasets
+and yields competitive performance on the other three datasets.'
+MODELS:
+  Architecture:
+    - Transformer
+  Learning Method:
+    - Supervised
+  Language Modality:
+    - Explicit Language Model
+  Network Structure: 'https://user-images.githubusercontent.com/65173622/210052231-22092115-0eba-4c2c-9050-b8fc9aff38ca.png'
+  FPS:
+    DEVICE: N/A
+    ITEM: N/A
+  FLOPS:
+    DEVICE: N/A
+    ITEM: N/A
+  PARAMS: N/A
+  Experiment:
+    Training DataSets:
+      - ST
+      - MJ
+    Test DataSets:
+      Avg.: 92.9
+      IIIT5K:
+        WAICS: 96.7
+      SVT:
+        WAICS: 94.0
+      IC13:
+        WAICS: 96.7
+      IC15:
+        WAICS: 85.4
+      SVTP:
+        WAICS: 89.0
+      CUTE:
+        WAICS: 95.5
+Bibtex: '@article{tang2021visual,
+  title={Visual-semantic transformer for scene text recognition},
+  author={Tang, Xin and Lai, Yongquan and Liu, Ying and Fu, Yuanyuan and Fang, Rui},
+  journal={arXiv preprint arXiv:2112.00948},
+  year={2021}
+}'
diff --git a/paper_zoo/textrecog/Why You Should Try the Real Data for the Scene Text Recognition.yaml b/paper_zoo/textrecog/Why You Should Try the Real Data for the Scene Text Recognition.yaml
new file mode 100644
index 000000000..c251b6518
--- /dev/null
+++ b/paper_zoo/textrecog/Why You Should Try the Real Data for the Scene Text Recognition.yaml
@@ -0,0 +1,70 @@
+Title: 'Why You Should Try the Real Data for the Scene Text Recognition'
+Abbreviation: Loginov et al
+Tasks:
+  - TextRecog
+Venue: arXiv
+Year: 2021
+Lab/Company:
+  - Intel Corporation
+URL:
+  Venue: 'https://arxiv.org/abs/2107.13938'
+  Arxiv: 'https://arxiv.org/abs/2107.13938'
+Paper Reading URL: N/A
+Code: 'https://github.com/openvinotoolkit/training_extensions'
+Supported In MMOCR: N/S
+PaperType:
+  - Algorithm
+Abstract: 'Recent works in the text recognition area have pushed forward the
+recognition results to the new horizons. But for a long time, a lack of large
+human-labeled natural text recognition datasets has forced researchers
+to use synthetic data for training text recognition models. Even though
+synthetic datasets are very large (MJSynth and SynthText, two most famous
+synthetic datasets, have several million images each), their diversity could
+be insufficient, compared to natural datasets like ICDAR and others.
+Fortunately, the recently released text recognition annotation for the
+OpenImages V5 dataset has a number of instances comparable with synthetic
+datasets and more diverse examples. We have used this annotation with a Text
+Recognition head architecture from the Yet Another Mask Text Spotter and
+obtained results comparable to the SOTA. On some datasets we have even
+outperformed previous SOTA models. In this paper we also introduce a text
+recognition model. The model’s code is available.'
+MODELS:
+  Architecture:
+    - Attention
+  Learning Method:
+    - Supervised
+  Language Modality:
+    - Implicit Language Model
+  Network Structure: 'https://user-images.githubusercontent.com/65173622/210163669-0848839e-185f-4d8c-9de1-ac34e957d685.png'
+  FPS:
+    DEVICE: N/A
+    ITEM: N/A
+  FLOPS:
+    DEVICE: N/A
+    ITEM: N/A
+  PARAMS: N/A
+  Experiment:
+    Training DataSets:
+      - ST
+      - MJ
+      - Real
+    Test DataSets:
+      Avg.: 91.0
+      IIIT5K:
+        WAICS: 93.5
+      SVT:
+        WAICS: 94.7
+      IC13:
+        WAICS: 96.8
+      IC15:
+        WAICS: 80.2
+      SVTP:
+        WAICS: 89.9
+      CUTE:
+        WAICS: N/A
+Bibtex: '@article{loginov2021you,
+  title={Why You Should Try the Real Data for the Scene Text Recognition},
+  author={Loginov, Vladimir},
+  journal={arXiv preprint arXiv:2107.13938},
+  year={2021}
+}'