Data migration from TiDB Cloud to Aurora

張春華
6 min read · Feb 25, 2022

This post describes a simple approach to migrating data from TiDB Cloud to Aurora via S3. The overall process is almost the same as a common migration; the data flow is shown in the diagram above. Below I call out the differences from the common approach.

The whole process is as below:

  • Create an S3 bucket to store the dump data
  • Export data from TiDB Cloud to the S3 bucket with Dumpling (50 minutes)
  • Set up S3 access from Aurora:
    1. Create an S3 policy and role for Aurora access
    2. Attach the role to Aurora
    3. Create an S3 endpoint in Aurora’s VPC
  • Load data into Aurora from S3 (32 minutes)
  • Compare the data between TiDB Cloud and Aurora (4 minutes)

The times shown are for reference only, measured against one 31 GB test table.
Before starting the procedure, make sure the time zones of TiDB Cloud and Aurora are the same. Otherwise the data comparison phase will fail.

Create the S3 bucket to store the dump data
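The bucket can be created with the AWS CLI. A minimal sketch, assuming the bucket name tidb2aurora and the region ap-northeast-1 used in the examples later in this post; substitute your own values:

```shell
# Create the bucket that will hold the Dumpling output
aws s3 mb s3://tidb2aurora --region ap-northeast-1

# Verify the bucket is visible
aws s3 ls | grep tidb2aurora
```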

Set up VPC peering between TiDB Cloud and the workstation

Please refer to VPC Peering setup

Export data to the S3 bucket

Export the table to a single file

  • Download the tool binary (latest version: v5.4.0)
  • Unzip the binary and add it to the search path
  • Increase tikv_gc_life_time to prevent GC from running during the export. Setting it to 720h before the export is generally safe. Be sure to restore the original value after the export, since a long GC life time can degrade the TiDB cluster’s performance.
  • Set the AWS keys so that the dumpling command can push data to S3
  • Run the dumpling command to extract the data to S3
  • Restore the original tikv_gc_life_time value
$ wget https://download.pingcap.org/tidb-toolkit-v5.4.0-linux-amd64.tar.gz
$ tar xvf tidb-toolkit-v5.4.0-linux-amd64.tar.gz
$ export PATH=$(pwd)/tidb-toolkit-v5.4.0-linux-amd64/bin:$PATH
$ mysql -u root -h private-tidb.xxxx.aws.tidbcloud.com -P 4000 -p
MySQL [(none)]>select * from mysql.tidb where VARIABLE_NAME = 'tikv_gc_life_time';
+-------------------+----------------+----------------------------------------------------------------------------------------+
| VARIABLE_NAME     | VARIABLE_VALUE | COMMENT                                                                                |
+-------------------+----------------+----------------------------------------------------------------------------------------+
| tikv_gc_life_time | 10m0s          | All versions within life time will not be collected by GC, at least 10m, in Go format. |
+-------------------+----------------+----------------------------------------------------------------------------------------+
1 row in set (0.008 sec)
MySQL [(none)]> update mysql.tidb set VARIABLE_VALUE = '720h' where VARIABLE_NAME = 'tikv_gc_life_time';
Query OK, 1 row affected (0.015 sec)
Rows matched: 1 Changed: 1 Warnings: 0

MySQL [(none)]> select * from mysql.tidb where VARIABLE_NAME = 'tikv_gc_life_time';
+-------------------+----------------+----------------------------------------------------------------------------------------+
| VARIABLE_NAME     | VARIABLE_VALUE | COMMENT                                                                                |
+-------------------+----------------+----------------------------------------------------------------------------------------+
| tikv_gc_life_time | 720h           | All versions within life time will not be collected by GC, at least 10m, in Go format. |
+-------------------+----------------+----------------------------------------------------------------------------------------+
1 row in set (0.009 sec)
MySQL [(none)]> exit
Bye
$ export AWS_ACCESS_KEY_ID=xxxxxxxxxxxxx
$ export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxx
$ dumpling -u root -P 4000 -h private-tidb.xxxx.aws.tidbcloud.com -p1234Abcd --filetype csv -o "s3://tidb2aurora/dumpling/" --s3.region "ap-northeast-1"
$ mysql -u root -h private-tidb.xxxx.aws.tidbcloud.com -P 4000 -p
MySQL [(none)]> update mysql.tidb set VARIABLE_VALUE = '10m0s' where VARIABLE_NAME = 'tikv_gc_life_time';
Query OK, 1 row affected (0.015 sec)
Rows matched: 1 Changed: 1 Warnings: 0

Split the table into 256 MiB files

$ dumpling -u root -P 4000 -h private-tidb.xxxx.ap-northeast-1.prod.aws.tidbcloud.com -p1234Abcd --filetype csv -F 256MiB -t 16 -o "s3://tidb2aurora/dumpling/" --s3.region "ap-northeast-1"

Allow Aurora to access S3 storage

To let Aurora access S3, two things are needed. First, open a network path from Aurora to S3: if the database is deployed in private subnets, we have to create an S3 endpoint in Aurora’s VPC. Second, create a role and policy that grant Aurora permission to read the bucket. Once these two steps are complete, we can start the data import.

Create an S3 endpoint for Aurora
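A gateway endpoint for S3 keeps the traffic inside AWS, so instances in private subnets can reach the bucket. A sketch with the AWS CLI; the VPC and route table IDs are placeholders for your Aurora VPC:

```shell
# Create an S3 gateway endpoint in the Aurora VPC
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0123456789abcdef0 \
    --service-name com.amazonaws.ap-northeast-1.s3 \
    --route-table-ids rtb-0123456789abcdef0
```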

Create a policy for S3 bucket access

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:GetObjectVersion"
      ],
      "Resource": [
        "arn:aws:s3:::tidb2aurora",
        "arn:aws:s3:::tidb2aurora/dumpling/*"
      ]
    }
  ]
}

Create an S3 role allowing S3 bucket access, to be attached to Aurora
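With the AWS CLI this step can be sketched as follows. The policy, role, and file names are placeholders I chose for illustration; the trust policy allows the RDS service to assume the role:

```shell
# 1. Create the policy from the JSON above, saved as s3-access-policy.json
aws iam create-policy \
    --policy-name tidb2aurora-s3-access \
    --policy-document file://s3-access-policy.json

# 2. Trust policy letting the RDS service assume the role
cat > trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "rds.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

aws iam create-role \
    --role-name tidb2aurora-s3-role \
    --assume-role-policy-document file://trust.json

# 3. Attach the S3 access policy to the role
aws iam attach-role-policy \
    --role-name tidb2aurora-s3-role \
    --policy-arn arn:aws:iam::<account-id>:policy/tidb2aurora-s3-access
```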

Attach the S3 role to Aurora
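This can be done from the RDS console or with the CLI. A sketch, with the cluster identifier, account ID, and role name as placeholders:

```shell
# Associate the IAM role with the Aurora cluster
aws rds add-role-to-db-cluster \
    --db-cluster-identifier my-aurora-cluster \
    --role-arn arn:aws:iam::123456789012:role/tidb2aurora-s3-role
```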

Update the DB cluster parameter group for S3 access
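Aurora MySQL looks up the role ARN in a cluster-level parameter. A sketch, with the parameter group name as a placeholder; on Aurora MySQL 5.7 the parameter is aurora_load_from_s3_role, while newer versions also accept aws_default_s3_role:

```shell
# Point the cluster at the role that may load from S3
aws rds modify-db-cluster-parameter-group \
    --db-cluster-parameter-group-name my-aurora-params \
    --parameters "ParameterName=aurora_load_from_s3_role,ParameterValue=arn:aws:iam::123456789012:role/tidb2aurora-s3-role,ApplyMethod=immediate"
```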

Data import into Aurora from S3

MySQL [test]> load data from s3 's3://tidb2aurora/dumpling/test.ontime.000000000.csv' into table ontime FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 lines;
Query OK, 69176519 rows affected (50 min 18.441 sec)
Records: 69176519 Deleted: 0 Skipped: 0 Warnings: 0
MySQL [test]> select count(*) from ontime;
+----------+
| count(*) |
+----------+
| 69176519 |
+----------+
1 row in set (33.711 sec)

Data comparison

$ more diff.toml
check-thread-count = 8
export-fix-sql = true
check-struct-only = false
[data-sources]
[data-sources.mysql1]
host = 'arsfaf89hfam1n.yyyy.ap-northeast-1.rds.amazonaws.com'
port = 3306
user = 'master'
password = '1234Abcd'

[data-sources.tidb0]
host = 'private-tidb.xxxx.ap-northeast-1.prod.aws.tidbcloud.com'
port = 4000
user = 'root'
password = '1234Abcd'
[task]
output-dir = "./output"
source-instances = ["mysql1"]
target-instance = "tidb0"
target-check-tables = ["test.*"]

$ time sync_diff_inspector --config diff.toml
... ...
A total of 1 tables need to be compared

Comparing the table structure of ``test`.`ontime`` ... equivalent
Comparing the table data of ``test`.`ontime`` ... equivalent
___________________________________________________________
Progress [============================================================>] 100% 0/0
A total of 1 table have been compared and all are equal.
You can view the comparison details through './output/sync_diff.log'

real 4m23.111s
user 0m1.717s
sys 0m0.590s

$ more output/sync_diff.log
.. ...
[2022/02/25 02:10:44.492 +00:00] [INFO] [mysql_shard.go:349] ["will increase connection configurations for DB of instance"] ["connection limit"=3]
[2022/02/25 02:10:44.492 +00:00] [INFO] [source.go:312] ["table match check passed!!"]
[2022/02/25 02:10:44.494 +00:00] [INFO] [tidb.go:195] ["find router for tidb source"]
[2022/02/25 02:10:44.498 +00:00] [INFO] [source.go:312] ["table match check passed!!"]
[2022/02/25 02:10:44.502 +00:00] [INFO] [diff.go:361] ["The downstream is TiDB. pick it as work source first"]
Test environment:

  • EC2 instance (workstation): t2.2xlarge
  • Aurora instance: db.r5.2xlarge
  • TiDB Cloud: 3 TiKV + 2 TiDB
  • Table’s CSV file size: 31 GB

(Figure: data comparison performance test)

For more details, please see https://github.com/luyomo/OhMyTiUP/blob/main/doc/copyDataTiDB2Aurora.org
