Skip to content

Conversation

@aantich
Copy link

@aantich aantich commented Nov 9, 2025

  • Reads multiple spreadsheets
  • Converts tables to tables
  • Basic tests added

.gitignore Outdated
Comment on lines 9 to 12
/test-docs
doc/pptx-reader-design-v2.md
doc/pptx-reader-design.md
doc/xlsx-reader-design.md
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As with the other PR, let's leave off these .gitignore changes.

@jgm
Copy link
Owner

jgm commented Nov 9, 2025

I think the test files may need to be included in extra-source-files in pandoc.cabal.

@jgm
Copy link
Owner

jgm commented Nov 9, 2025

The new format should be added to MANUAL.txt ; see the list under --readers.

@jgm
Copy link
Owner

jgm commented Nov 9, 2025

One thing I noticed testing this locally: I got a table with hundreds of empty rows.
Generally when one starts a spreadsheet in Excel, it already comes with lots of rows, only some of which get used. It would probably make sense to strip all trailing empty rows before parsing.

@aantich
Copy link
Author

aantich commented Nov 9, 2025

One thing I noticed testing this locally: I got a table with hundreds of empty rows. Generally when one starts a spreadsheet in Excel, it already comes with lots of rows, only some of which get used. It would probably make sense to strip all trailing empty rows before parsing.

We'll fix the trailing empty rows; for empty rows in the middle of the sheet I believe it makes sense to keep them, since I may want to reference the exact cell coordinates eventually, e.g. a table like this:
image

is converted to markdown like this:

Main {#sheet-1}

Person Age Location


Anton Antich 23.0 Switzerland
James Bond 35.0 Moscow

Just a random cell

I believe this is the correct behavior.

- MANUAL updated
- trailing empty rows removed
@aantich
Copy link
Author

aantich commented Nov 9, 2025

Believe we addressed all comments. Didnt test empty rows rigorously, but on a couple of quick files it works.

cellToInlines :: XlsxCell -> [Inline]
cellToInlines cell =
let base = case cellValue cell of
TextValue t -> [Str t]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For text values, better to use B.toList (B.text t) where B is Text.Pandoc.Builder. This will convert the string into a list of Str and Space elements. Pandoc expects spaces to be represented as Space and not space characters inside a Str.

Comment on lines +31 to +42
nativeDiff :: FilePath -> Pandoc -> Pandoc -> IO (Maybe String)
nativeDiff normPath expectedNative actualNative
| expectedNative == actualNative = return Nothing
| otherwise = Just <$> do
expected <- T.unpack <$> runIOorExplode (writeNative def expectedNative)
actual <- T.unpack <$> runIOorExplode (writeNative def actualNative)
let dash = replicate 72 '-'
let diff = getDiff (lines actual) (lines expected)
return $ '\n' : dash ++
"\n--- " ++ normPath ++
"\n+++ " ++ "test" ++ "\n" ++
showDiff (1,1) diff ++ dash
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this seems to be duplicated from docx reader tests, I wonder if it makes sense to import it from there, or put it in some common place, e.g. Test.Helpers ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants