
Use of OFFSET very inefficient with large Postgres DB #18


Description

@jeroenvandijk

When using maple to import a 40GB+ Postgres database, I noticed that queries became slower and slower, and eventually the complete Hadoop job failed, because of the use of OFFSET:
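The line in question pages each split with LIMIT/OFFSET, roughly like this (paraphrased; I'm assuming the stock Hadoop DBRecordReader pattern here, not quoting the exact maple source):

            // Paraphrased, not the exact source: OFFSET makes Postgres scan and
            // discard every skipped row, so each later split is slower than the last.
            query.append(" LIMIT ").append(split.getLength());
            query.append(" OFFSET ").append(split.getStart());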

After changing that line to the following:

            // HARDCODING PRIMARY KEY.....
            query.append(" WHERE id >= ").append(split.getStart());
            query.append(" LIMIT ").append(split.getLength());

The query time no longer grows with the offset and stays roughly constant. The above is not a generic solution (e.g. your indexed key column might not be id). Do you have suggestions for handling this situation? I'm also not sure how other JDBC databases handle OFFSET.
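For reference, a more generic fix would be keyset pagination on a configurable indexed key column: seek past the last key seen instead of skipping rows with OFFSET. A minimal standalone sketch of the idea (the table, column, and connection details are made up for illustration; this is not maple's API, and it needs the Postgres JDBC driver on the classpath):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class KeysetImport {
        public static void main(String[] args) throws Exception {
            // Illustrative connection settings, not maple configuration.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/mydb", "user", "secret")) {
                long lastKey = Long.MIN_VALUE; // position of the cursor so far
                final int pageSize = 10_000;
                while (true) {
                    // Seek past the last key instead of using OFFSET: with an
                    // index on the key column, every page costs the same no
                    // matter how deep into the table we are.
                    PreparedStatement ps = conn.prepareStatement(
                        "SELECT id, payload FROM records"
                        + " WHERE id > ? ORDER BY id LIMIT ?");
                    ps.setLong(1, lastKey);
                    ps.setInt(2, pageSize);
                    int rows = 0;
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            lastKey = rs.getLong("id"); // remember position
                            rows++;
                            // ... hand rs off to the import here ...
                        }
                    }
                    ps.close();
                    if (rows < pageSize) break; // short page means we are done
                }
            }
        }
    }

For parallel Hadoop splits, the same idea can be phrased as a bounded key range per split (WHERE id >= lo AND id < hi ORDER BY id), which also avoids OFFSET and doesn't require the ids to be dense the way the LIMIT arithmetic above does.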

Has this library been used on large Postgres DBs before? I would like to gain some insight into best practices. Even with the above optimization, my import time is around 3 hours.

Thanks for your work on maple.

Cheers,
Jeroen
