# best-practices
r
I'm trying to use tap-mysql to do an initial load (into target-postgres) of a few tables. Most of the tables extract just fine, but one table is just over 35M rows. When Meltano hits that table, it sits there appearing to do nothing for five minutes and then seemingly carries on. Looking in the logs I can see "One or more records have exceeded the max age of 5 minutes. Draining all sinks." and it proceeds to write about 300MB into the destination table (it should be about 6GB) before erroring out. When I examine the processlist on MySQL, I can see a query running that's trying to select the entire table and order by the incremental column. That column has an index on it (it's a datetime, for reference), but `explain`ing the query says that not only is it not going to use the index, MySQL doesn't even consider it a possible index.
```sql
SELECT <list of columns> FROM <table> ORDER BY `created_at` ASC;
```
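For completeness, this is the check I'm running against that query (plain MySQL, nothing Meltano-specific). My guess is that with no WHERE clause and every column selected, the optimizer simply prefers a full scan plus filesort over walking the secondary index, so it never lists the index as a candidate:

```sql
-- Ask MySQL how it plans to execute the tap's query.
EXPLAIN SELECT <list of columns> FROM <table> ORDER BY `created_at` ASC;

-- Roughly what I'm seeing: type=ALL with an empty possible_keys column,
-- i.e. a full table scan followed by a filesort on `created_at` rather
-- than an index walk.
```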
My tap configuration is pretty basic, there's nothing weird going on:
```yaml
plugins:
  extractors:
  - name: tap-mysql
    variant: transferwise
    pip_url: git+https://github.com/transferwise/pipelinewise.git#subdirectory=singer-connectors/tap-mysql
    select:
      - ${TAP_MYSQL_DATABASE}-<table>.*
    settings:
      - name: engine
        value: mysql
    config:
      session_sqls:
      - SET @@session.wait_timeout=28800
      - SET @@session.net_read_timeout=3600
      - SET @@session.innodb_lock_wait_timeout=3600
    metadata:
      '*-<table>':
        replication-method: INCREMENTAL
        replication-key: created_at
```
Am I just fundamentally missing something? Trying to read the entire table in one go seems insane to me. Do I need to set the replication-method to FULL_TABLE for the first load and manually fiddle the state or something?
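For reference, my working assumption (which may well be wrong) is that the first incremental run has no bookmark in state, so there's nothing to filter on, and that once a bookmark exists the query gains a WHERE clause on the replication key, something roughly like this (a sketch only; the timestamp is just a stand-in for the stored bookmark):

```sql
-- Hypothetical shape of a later incremental run, resuming from the bookmark.
SELECT <list of columns>
FROM <table>
WHERE `created_at` >= '2024-01-01 00:00:00'  -- stored replication-key value
ORDER BY `created_at` ASC;
```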
d
I'm not sure which fork of tap-mysql you're running, but that sounds like the default behavior for an incremental method.
> Do I need to set the replication-method to FULL_TABLE for the first load and manually fiddle the state or something?
That wouldn't actually result in any difference in the query that's run against your source db; full table will request the entire contents of the table for every extraction. Similarly, I wouldn't expect it to make any difference in how your db responds to the query (e.g. whether or not it uses your index). The lack of indexing in the query plan feels like a server-side configuration issue that's worth looking into regardless, but some flavors of tap-mysql have a `LIMIT` or `INCREMENTAL_LIMIT` property that's accepted in the config and would (as you'd expect) add a limit to your query clause (see this code block). The risks with using this approach are that 1. if all of the records yielded from a given query have the same replication key value as the one already stored as a bookmark, it'll never continue advancing, and 2. generally the db is going to need to scan the entire table anyway to order the records before applying the limit, so low-performance dbs might struggle to get through a really big table anyway. Seems like the pipelinewise version doesn't include that property, but it could be worth looking into.
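To make that first risk concrete, a limited incremental fetch would look roughly like this (the limit value and bookmark below are made up for illustration, not taken from any particular tap):

```sql
-- Sketch of a LIMIT-style incremental page: resume from the stored bookmark,
-- pull a fixed-size batch, then advance the bookmark to the max created_at seen.
SELECT <list of columns>
FROM <table>
WHERE `created_at` >= '2024-01-01 00:00:00'  -- current bookmark
ORDER BY `created_at` ASC
LIMIT 10000;

-- If more than 10,000 rows share that exact created_at value, every batch ends
-- on the same bookmark and the sync never advances past it.
```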