@Oliverity Oliverity commented Jan 12, 2024

This addresses issue #54. Unfortunately, the encoding declared in the XML header can only be learned by reading the stream, so an encoding has to be assumed before reading to keep multi-byte characters from breaking. This PR assumes UTF-8; it has not been tested on *.docx files that contain UTF-16 XML parts.
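
For readers unfamiliar with the failure mode, here is a minimal sketch of how a multi-byte character can break at a chunk boundary. It is an illustration, not the code in this PR: the sample bytes and the helper names `decodeNaively` and `decodeStatefully` are made up for the example. It uses Node's `StringDecoder`, which is the same mechanism `readable.setEncoding()` relies on.

```javascript
const { StringDecoder } = require('node:string_decoder');

// "é" is two bytes in UTF-8 (0xC3 0xA9). Split it across two chunks,
// as can happen when a stream delivers data in pieces.
const chunks = [Buffer.from([0x63, 0x61, 0x66, 0xc3]), Buffer.from([0xa9])];

// Naive approach (hypothetical helper): decode every chunk on its own.
// Each half of the split character becomes U+FFFD.
function decodeNaively(buffers) {
  return buffers.map((chunk) => chunk.toString('utf8')).join('');
}

// Stateful approach (hypothetical helper): the decoder holds back an
// incomplete byte sequence until the remaining bytes arrive.
function decodeStatefully(buffers, encoding = 'utf8') {
  const decoder = new StringDecoder(encoding);
  return buffers.map((chunk) => decoder.write(chunk)).join('') + decoder.end();
}

const naive = decodeNaively(chunks);
const stateful = decodeStatefully(chunks);
console.log(JSON.stringify(naive));
console.log(JSON.stringify(stateful));
```

The stateful variant only works if the encoding is chosen before the first chunk is decoded, which is why the encoding has to be assumed up front.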

Иванов Олег added 2 commits January 12, 2024 16:05
Tries to prevent multi-byte characters from breaking. Unfortunately, setEncoding() has to be called before any content is read to avoid such breakage, which means the encoding is not known yet at that point. Right now UTF-8 is assumed, but OOXML files might be UTF-16 too. Those haven't been tested yet.
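
One possible way to handle the untested UTF-16 case, sketched here as an assumption and not as something this PR implements, is to look for a byte order mark in the first bytes before picking a decoder. The helpers `sniffEncodingFromBom` and `decodeChunks` are hypothetical names. The sketch assumes the first chunk holds at least the BOM, ignores the `encoding="..."` attribute of the XML declaration, and relies on the global `TextDecoder`; decoding `utf-16be` additionally assumes a Node build with ICU.

```javascript
// Hypothetical helper, not part of word-extractor: guess the encoding of an
// XML part from a byte order mark (BOM) at the start of its bytes.
function sniffEncodingFromBom(head) {
  if (head.length >= 3 && head[0] === 0xef && head[1] === 0xbb && head[2] === 0xbf) {
    return 'utf-8';
  }
  if (head.length >= 2 && head[0] === 0xff && head[1] === 0xfe) return 'utf-16le';
  if (head.length >= 2 && head[0] === 0xfe && head[1] === 0xff) return 'utf-16be';
  return 'utf-8'; // No BOM: fall back to the same assumption the PR makes.
}

// Hypothetical helper: decode already-collected chunks with a streaming
// TextDecoder, so characters split across chunk boundaries stay intact.
function decodeChunks(buffers) {
  const encoding = sniffEncodingFromBom(buffers[0] ?? Buffer.alloc(0));
  const decoder = new TextDecoder(encoding); // strips the BOM by default
  let text = '';
  for (const chunk of buffers) {
    text += decoder.decode(chunk, { stream: true });
  }
  return text + decoder.decode(); // flush any buffered bytes
}

// UTF-16LE sample with a BOM, split at an odd offset to cut a code unit in half.
const utf16 = Buffer.concat([Buffer.from([0xff, 0xfe]), Buffer.from('<a>é</a>', 'utf16le')]);
const decoded = decodeChunks([utf16.subarray(0, 9), utf16.subarray(9)]);
console.log(decoded);
```

The trade-off is that the decoder can no longer be attached before the first read: the first bytes have to be inspected as raw buffers before any text decoding starts.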
fritx commented Mar 26, 2025

Works for me. Thanks.

pnpm add 'github:Oliverity/node-word-extractor#develop'
// package.json
{
  "dependencies": {
    // ...
    "word-extractor": "github:Oliverity/node-word-extractor#develop"
  }
}
