Data migration from TiDB Cloud to Aurora

張春華
6 min read · Feb 25, 2022

This post describes a simple approach to migrating data from TiDB Cloud to Aurora via S3. The overall process is almost the same as a common migration; the data flow is shown in the diagram above. Below I call out the differences from the common approach.

The whole process is as below:

  • Create an S3 bucket to store the dump data
  • Export data from TiDB Cloud to the S3 bucket with Dumpling (50 minutes)
  • Set up S3 access from Aurora:
    1. Create an S3 policy and role for Aurora access
    2. Attach the role to Aurora
    3. Create an S3 endpoint in Aurora’s VPC
  • Load data into Aurora from S3 (32 minutes)
  • Compare the data between TiDB Cloud and Aurora (4 minutes)

The times shown are for reference only, measured against one 31 GB test table.
Before starting the procedure, make sure the time zones of TiDB Cloud and Aurora are the same. Otherwise the data comparison phase will fail.

Create the S3 bucket to store the dump data
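The bucket can be created with the AWS CLI. A minimal sketch, assuming the bucket name tidb2aurora and the region ap-northeast-1 used in the examples later in this post; substitute your own values:

```shell
# Create the bucket that will hold the Dumpling output
aws s3 mb s3://tidb2aurora --region ap-northeast-1

# Verify the bucket is visible
aws s3 ls | grep tidb2aurora
```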

Set up VPC peering between TiDB Cloud and the workstation

Please refer to VPC Peering setup

Export data to the S3 bucket

Export the table to a single file

  • Download the tool binary (latest version: v5.4.0)
  • Unzip the binary and add it to the search path
  • Increase tikv_gc_life_time to prevent GC from running during the export. Setting it to 720h before the export is generally safe. Be sure to restore the original value after the export, since a long GC life time can degrade the TiDB cluster’s performance.
  • Set the AWS keys so that the dumpling command can push data to S3
  • Run the dumpling command to extract the data to S3
  • Restore the original tikv_gc_life_time value
$ wget https://download.pingcap.org/tidb-toolkit-v5.4.0-linux-amd64.tar.gz
$ tar xvf tidb-toolkit-v5.4.0-linux-amd64.tar.gz
$ export PATH=$(pwd)/tidb-toolkit-v5.4.0-linux-amd64/bin:$PATH
$ mysql -u root -h private-tidb.xxxx.aws.tidbcloud.com -P 4000 -p
MySQL [(none)]>select * from mysql.tidb where VARIABLE_NAME = 'tikv_gc_life_time';
+-------------------+----------------+----------------------------------------------------------------------------------------+
| VARIABLE_NAME     | VARIABLE_VALUE | COMMENT                                                                                |
+-------------------+----------------+----------------------------------------------------------------------------------------+
| tikv_gc_life_time | 10m0s          | All versions within life time will not be collected by GC, at least 10m, in Go format. |
+-------------------+----------------+----------------------------------------------------------------------------------------+
1 row in set (0.008 sec)
MySQL [(none)]> update mysql.tidb set VARIABLE_VALUE = '720h' where VARIABLE_NAME = 'tikv_gc_life_time';
Query OK, 1 row affected (0.015 sec)
Rows matched: 1 Changed: 1 Warnings: 0

MySQL [(none)]> select * from mysql.tidb where VARIABLE_NAME = 'tikv_gc_life_time';
+-------------------+----------------+----------------------------------------------------------------------------------------+
| VARIABLE_NAME     | VARIABLE_VALUE | COMMENT                                                                                |
+-------------------+----------------+----------------------------------------------------------------------------------------+
| tikv_gc_life_time | 720h           | All versions within life time will not be collected by GC, at least 10m, in Go format. |
+-------------------+----------------+----------------------------------------------------------------------------------------+
1 row in set (0.009 sec)
MySQL [(none)]> exit
Bye
$ export AWS_ACCESS_KEY_ID=xxxxxxxxxxxxx
$ export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxx
$ dumpling -u root -P 4000 -h private-tidb.xxxx.aws.tidbcloud.com -p1234Abcd --filetype csv -o "s3://tidb2aurora/dumpling/" --s3.region "ap-northeast-1"
$ mysql -u root -h private-tidb.xxxx.aws.tidbcloud.com -P 4000 -p
MySQL [(none)]> update mysql.tidb set VARIABLE_VALUE = '10m0s' where VARIABLE_NAME = 'tikv_gc_life_time';
Query OK, 1 row affected (0.015 sec)
Rows matched: 1 Changed: 1 Warnings: 0

Split the table into 256 MiB files

$ dumpling -u root -P 4000 -h private-tidb.xxxx.ap-northeast-1.prod.aws.tidbcloud.com -p1234Abcd --filetype csv -F 256MiB -t 16 -o "s3://tidb2aurora/dumpling/" --s3.region "ap-northeast-1"

Allow Aurora to access S3 storage

To let Aurora access S3, two things are needed. First, open a network path from Aurora to S3: if the database is deployed in private subnets, we have to create an S3 endpoint in Aurora’s VPC. Second, create a role and policy that grant Aurora permission to read the bucket. Once these two steps are complete, we can start the data import.

Create an S3 endpoint for Aurora
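A gateway endpoint for S3 keeps the traffic inside AWS, so instances in private subnets can reach the bucket. A sketch with the AWS CLI; the VPC and route table IDs are placeholders for your Aurora VPC:

```shell
# Create an S3 gateway endpoint in the Aurora VPC
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0123456789abcdef0 \
    --service-name com.amazonaws.ap-northeast-1.s3 \
    --route-table-ids rtb-0123456789abcdef0
```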

Create a policy for S3 bucket access

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:GetObjectVersion"
      ],
      "Resource": [
        "arn:aws:s3:::tidb2aurora",
        "arn:aws:s3:::tidb2aurora/dumpling/*"
      ]
    }
  ]
}

Create an S3 role allowing S3 bucket access, to be attached to Aurora
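With the AWS CLI this step can be sketched as follows. The policy, role, and file names are placeholders I chose for illustration; the trust policy allows the RDS service to assume the role:

```shell
# 1. Create the policy from the JSON above, saved as s3-access-policy.json
aws iam create-policy \
    --policy-name tidb2aurora-s3-access \
    --policy-document file://s3-access-policy.json

# 2. Trust policy letting the RDS service assume the role
cat > trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "rds.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

aws iam create-role \
    --role-name tidb2aurora-s3-role \
    --assume-role-policy-document file://trust.json

# 3. Attach the S3 access policy to the role
aws iam attach-role-policy \
    --role-name tidb2aurora-s3-role \
    --policy-arn arn:aws:iam::<account-id>:policy/tidb2aurora-s3-access
```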

Attach the S3 role to Aurora
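This can be done from the RDS console or with the CLI. A sketch, with the cluster identifier, account ID, and role name as placeholders:

```shell
# Associate the IAM role with the Aurora cluster
aws rds add-role-to-db-cluster \
    --db-cluster-identifier my-aurora-cluster \
    --role-arn arn:aws:iam::123456789012:role/tidb2aurora-s3-role
```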

Update the DB cluster parameter group for S3 access
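Aurora MySQL looks up the role ARN in a cluster-level parameter. A sketch, with the parameter group name as a placeholder; on Aurora MySQL 5.7 the parameter is aurora_load_from_s3_role, while newer versions also accept aws_default_s3_role:

```shell
# Point the cluster at the role that may load from S3
aws rds modify-db-cluster-parameter-group \
    --db-cluster-parameter-group-name my-aurora-params \
    --parameters "ParameterName=aurora_load_from_s3_role,ParameterValue=arn:aws:iam::123456789012:role/tidb2aurora-s3-role,ApplyMethod=immediate"
```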

Data import into Aurora from S3

MySQL [test]> load data from s3 's3://tidb2aurora/dumpling/test.ontime.000000000.csv' into table ontime FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 lines;
Query OK, 69176519 rows affected (50 min 18.441 sec)
Records: 69176519 Deleted: 0 Skipped: 0 Warnings: 0
MySQL [test]> select count(*) from ontime;
+----------+
| count(*) |
+----------+
| 69176519 |
+----------+
1 row in set (33.711 sec)

Data comparison

$ more diff.toml
check-thread-count = 8
export-fix-sql = true
check-struct-only = false
[data-sources]
[data-sources.mysql1]
host = 'arsfaf89hfam1n.yyyy.ap-northeast-1.rds.amazonaws.com'
port = 3306
user = 'master'
password = '1234Abcd'

[data-sources.tidb0]
host = 'private-tidb.xxxx.ap-northeast-1.prod.aws.tidbcloud.com'
port = 4000
user = 'root'
password = '1234Abcd'
[task]
output-dir = "./output"
source-instances = ["mysql1"]
target-instance = "tidb0"
target-check-tables = ["test.*"]

$ time sync_diff_inspector --config diff.toml
... ...
A total of 1 tables need to be compared

Comparing the table structure of ``test`.`ontime`` ... equivalent
Comparing the table data of ``test`.`ontime`` ... equivalent
___________________________________________________________
Progress [============================================================>] 100% 0/0
A total of 1 table have been compared and all are equal.
You can view the comparison details through './output/sync_diff.log'

real 4m23.111s
user 0m1.717s
sys 0m0.590s

$ more output/sync_diff.log
.. ...
[2022/02/25 02:10:44.492 +00:00] [INFO] [mysql_shard.go:349] ["will increase connection configurations for DB of instance"] ["connection limit"=3]
[2022/02/25 02:10:44.492 +00:00] [INFO] [source.go:312] ["table match check passed!!"]
[2022/02/25 02:10:44.494 +00:00] [INFO] [tidb.go:195] ["find router for tidb source"]
[2022/02/25 02:10:44.498 +00:00] [INFO] [source.go:312] ["table match check passed!!"]
[2022/02/25 02:10:44.502 +00:00] [INFO] [diff.go:361] ["The downstream is TiDB. pick it as work source first"]
Test environment:

  • EC2 instance (workstation): t2.2xlarge
  • Aurora instance: db.r5.2xlarge
  • TiDB Cloud: 3 TiKV + 2 TiDB
  • Table’s CSV file size: 31 GB

(Figure: data comparison performance test)

For more details, please see https://github.com/luyomo/OhMyTiUP/blob/main/doc/copyDataTiDB2Aurora.org
