Conversation

@YuweiXiao
Contributor

This PR introduces external table support, allowing users to persist a view of an external file queried through DuckDB's readers (read_csv, read_parquet, and read_json).

Previously, users had to embed file locations and options directly in each query and use the r['xx'] syntax for column references. External tables simplify this by defining the file path and reader options once at CREATE time, enabling clean SELECT statements without the r['xx'] syntax. This also opens the door to access control on external files, such as fine-grained permissions like column-level visibility for different users.

CREATE TABLE Syntax

CREATE TABLE external_csv () USING duckdb WITH (
    duckdb_external_location = '../../data/iris.csv',
    duckdb_external_format = 'csv',
    duckdb_external_options = '{"header": true}'
);

-- Query like a regular table
SELECT * FROM external_csv;
SELECT "sepal.length" FROM external_csv;

-- Raw SQL way
SELECT r['sepal.length'] FROM read_csv('../../data/iris.csv');

Features

  • DDL Support: CREATE TABLE, DROP TABLE, ALTER TABLE ... RENAME
  • Auto Schema Inference: Column names and types inferred by DuckDB, persisted in Postgres catalog
  • Lazy loading: External tables are dynamically loaded as DuckDB views only when referenced in queries
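A sketch of the DDL surface listed above, reusing the syntax from the CREATE TABLE example earlier in this PR (the Parquet file path is illustrative, and the ALTER/DROP forms are assumed to behave like ordinary Postgres DDL against catalog metadata only):

```sql
-- Register an external Parquet file once; the schema is inferred by DuckDB
-- and persisted in the Postgres catalog.
CREATE TABLE external_parquet () USING duckdb WITH (
    duckdb_external_location = '../../data/iris.parquet',
    duckdb_external_format = 'parquet'
);

-- Rename and drop only touch catalog metadata; the backing file is untouched.
ALTER TABLE external_parquet RENAME TO iris_archive;
DROP TABLE iris_archive;
```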

@visardida

visardida commented Oct 11, 2025

This is a much needed improvement @YuweiXiao , thank you!

Does this implementation of external table support handle partitioned Parquet datasets for example, when using wildcard paths or recursive directory patterns such as:

read_parquet('/path/to/data/**/*.parquet')

In other words, if I create an external table pointing to a directory of Parquet partitions, will it automatically discover and read all matching files, or does it only support a single file path per table definition?

@YuweiXiao
Contributor Author

> Does this implementation of external table support handle partitioned Parquet datasets for example, when using wildcard paths or recursive directory patterns such as:
>
> read_parquet('/path/to/data/**/*.parquet')
>
> In other words, if I create an external table pointing to a directory of Parquet partitions, will it automatically discover and read all matching files, or does it only support a single file path per table definition?

Yes. The external table tracks the path and read options in the Postgres catalog, and file listing is triggered for each query. In theory, all functionality supported by the read_xxxx functions should also be available through external tables.

@JelteF
Collaborator

JelteF commented Oct 13, 2025

Thanks for the work on this! I also had something like this in mind, but I was thinking about using FOREIGN TABLES instead of table access methods for this. So I'm wondering why you went this route instead. (not saying that one is really better than the other, but I'm wondering what tradeoffs you considered)

@YuweiXiao
Contributor Author

> Thanks for the work on this! I also had something like this in mind, but I was thinking about using FOREIGN TABLES instead of table access methods for this. So I'm wondering why you went this route instead. (not saying that one is really better than the other, but I'm wondering what tradeoffs you considered)

Yes, FOREIGN TABLE would definitely work too. I didn't have a strong tradeoff in mind; mainly I wanted to reuse the existing codebase as much as possible, e.g., the DuckDB AM that's already properly hooked and the registered triggers.

I'll take another look at the FOREIGN TABLE approach, since it has a better semantic fit (i.e., a metadata-only table).

@JelteF
Collaborator

JelteF commented Oct 16, 2025

Thinking about it more, I do think FOREIGN TABLE is a better fit for this semantically. Because the CREATE TABLE command that you have now isn't actually creating the backing files. It's only registering some already existing external data in postgres.
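For comparison, a FOREIGN TABLE variant of the same registration might look like the following sketch. This is purely hypothetical syntax, not an interface this PR (or pg_duckdb) defines; the server name, wrapper name, and option keys are all illustrative:

```sql
-- Hypothetical FOREIGN TABLE sketch, not part of this PR.
CREATE SERVER duckdb_files FOREIGN DATA WRAPPER duckdb_fdw;

CREATE FOREIGN TABLE external_csv ()
    SERVER duckdb_files
    OPTIONS (
        location '../../data/iris.csv',
        format 'csv',
        reader_options '{"header": true}'
    );
```

The semantic advantage is the one noted above: a foreign table is explicitly a catalog-only registration of external data, so no backing storage is expected to be created or dropped with it.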

@YuweiXiao
Contributor Author

> Thinking about it more, I do think FOREIGN TABLE is a better fit for this semantically. Because the CREATE TABLE command that you have now isn't actually creating the backing files. It's only registering some already existing external data in postgres.

Yeah. I will open a discussion thread so we can define the SQL interface (usage) before implementation.

@AndrewJackson2020
Contributor

The above change (or a similar one using an FDW instead) would be great. One issue with the current syntax is that it does not play nicely with ORMs, which is a big annoyance for a lot of teams. I could also see a usage pattern with pg_duckdb where you keep "live" data in Postgres tables (or partitions) and "archive" data in S3/Parquet. It would be great to access both of these through a uniform interface.
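The live/archive pattern described above could be sketched as follows, using the external table syntax from this PR. All table names, columns, and the S3 path are illustrative, and this assumes the inferred schema of the archive matches the live table:

```sql
-- "Live" rows stay in a regular Postgres table.
CREATE TABLE events_live (id bigint, ts timestamptz, payload jsonb);

-- "Archive" rows live in Parquet on S3, registered once.
CREATE TABLE events_archive () USING duckdb WITH (
    duckdb_external_location = 's3://my-bucket/events/**/*.parquet',
    duckdb_external_format = 'parquet'
);

-- A view gives ORMs and applications one uniform interface over both.
CREATE VIEW events AS
    SELECT id, ts, payload FROM events_live
    UNION ALL
    SELECT id, ts, payload FROM events_archive;
```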

