Trying to avoid a multi-byte character breaking. #55

Oliverity · 2024-01-12T13:15:33Z

Regarding issue #54: unfortunately, it seems the encoding can only be taken from the XML declaration once we have started reading the stream, so we have to assume an encoding before that point to keep characters from being split. Here I assume UTF-8, but this has not been tested on *.docx files that contain UTF-16 XML parts.
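One possible way to narrow the guess, shown only as a sketch: look at the first raw bytes for a byte-order mark before choosing a decoder. The helper name `sniffBom` and the returned labels are hypothetical and are not part of word-extractor. Note that Node's built-in Buffer decoding supports `utf16le` but has no big-endian UTF-16 decoder, so that case would need extra handling.

```javascript
// Hypothetical helper, not part of word-extractor: guess the encoding
// from a byte-order mark at the start of the raw (undecoded) bytes.
// Returns null when no BOM is present, so the caller can fall back to UTF-8.
function sniffBom(head) {
  if (head.length >= 3 && head[0] === 0xef && head[1] === 0xbb && head[2] === 0xbf) {
    return "utf8";
  }
  if (head.length >= 2 && head[0] === 0xff && head[1] === 0xfe) {
    return "utf16le";
  }
  if (head.length >= 2 && head[0] === 0xfe && head[1] === 0xff) {
    return "utf16be"; // label only: Node has no built-in decoder for this
  }
  return null;
}

// "<" encoded as UTF-16LE, preceded by its BOM.
const utf16Head = Buffer.from([0xff, 0xfe, 0x3c, 0x00]);
// A plain UTF-8 XML declaration without a BOM.
const utf8Head = Buffer.from("<?xml version=\"1.0\"?>", "utf8");

const utf16Guess = sniffBom(utf16Head);
const utf8Guess = sniffBom(utf8Head) ?? "utf8"; // assumed default
```

This only works if the first bytes are inspected as a Buffer, before any encoding is set on the stream, which is exactly the ordering problem described above.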

Tries to prevent multi-byte characters from being split. Unfortunately, setEncoding() has to be called before any content is read to avoid such splitting, which means the encoding is not yet known at that point. Right now UTF-8 is assumed, but OOXML files might be UTF-16 too. Those haven't been tested yet.
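To illustrate the failure mode, here is a minimal sketch that is not the code from this PR. It uses Node's `StringDecoder`, the same mechanism `setEncoding()` relies on for readable streams, and a tiny hand-made chunk split instead of real 4096-byte chunks.

```javascript
const { StringDecoder } = require("node:string_decoder");

// "é" takes two bytes in UTF-8 (0xC3 0xA9). Split the buffer so that
// the chunk boundary falls in the middle of that character.
const bytes = Buffer.from("café", "utf8"); // 63 61 66 c3 a9
const chunks = [bytes.subarray(0, 4), bytes.subarray(4)];

// Naive approach: decode every chunk on its own. Each half of the
// split character becomes a U+FFFD replacement character.
const naive = chunks.map((chunk) => chunk.toString("utf8")).join("");

// Decoder approach: incomplete trailing bytes are held back until the
// next chunk arrives, so the character is reassembled correctly.
const decoder = new StringDecoder("utf8");
const safe = chunks.map((chunk) => decoder.write(chunk)).join("") + decoder.end();

console.log(JSON.stringify(naive)); // contains replacement characters
console.log(JSON.stringify(safe));  // "café"
```

The catch is the same one noted above: the decoder has to be created with an encoding up front, before the XML declaration has been seen.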

fritx · 2025-03-26T08:52:30Z

Works for me. Thanks.

pnpm add 'github:Oliverity/node-word-extractor#develop'

// package.json
{
  "dependencies": {
    // ...
    "word-extractor": "github:Oliverity/node-word-extractor#develop"
  }
}

Иванов Олег added 2 commits January 12, 2024 16:05

OOXML stands for "Office Open XML", not "Open Office XML".

66d711f

Oliverity mentioned this pull request Jan 12, 2024

Broken multi-byte letters at the borders of 4096-byte chunks #54

Open
