[enhancement](cloud) Prohibit changing deployment mode #40764

Merged: 9 commits merged on Nov 13, 2024

@yagagagaga yagagagaga commented Sep 12, 2024

Proposed changes

At present, a storage-compute separated (cloud) deployment and a storage-compute integrated (local) deployment cannot be converted into each other. However, if a user insists on mixing the two, nothing at the code level prevents it. The following scenarios can occur:

| Case | Node was previously in a Cloud cluster | Node was previously in a Local cluster | Node has never been in any cluster |
| -- | -- | -- | -- |
| Add BE to local cluster | Added successfully, but the error `invalid cluster id. ignore.` occurs. No negative impact on the original two clusters. | Added successfully, but the error `invalid cluster id. ignore.` occurs. No negative impact on the original two clusters. | If cloud configuration has not been added, it works normally.<br />If cloud configuration has been added, it cannot start normally. |
| Add FE to local cluster | Added successfully, but the error `Socket is closed by peer.` occurs. No negative impact on the original two clusters. | Added successfully, but the error `Socket is closed by peer.` occurs. No negative impact on the original two clusters. | If cloud configuration has not been added, it works normally.<br />If cloud configuration has been added, it cannot start normally. |
| Add BE to cloud cluster | Added successfully, but the error `invalid cluster id. ignore.` occurs. No negative impact on the original two clusters. | Added successfully, but the error `invalid cluster id. ignore.` occurs. No negative impact on the original two clusters. | If cloud configuration has not been added, the BE runs, but errors occur when executing inserts.<br />If cloud configuration has been added, it works normally. |
| Add FE to cloud cluster | Added successfully, but the error `Socket is closed by peer.` occurs. No negative impact on the original two clusters. | Added successfully, but the error `Socket is closed by peer.` occurs. No negative impact on the original two clusters. | If cloud configuration has not been added, the FE hangs with the error `Unknown meta module: cloudWarmUpJob.`<br />If cloud configuration has been added, it works normally. |

| Case | Situation |
| -- | -- |
| BE in a Local cluster adds cloud config items | Hangs |
| FE in a Local cluster adds cloud config items | Hangs |
| BE in a Cloud cluster removes cloud config items | Runs successfully, but errors occur on query or insert |
| FE in a Cloud cluster removes cloud config items | Service goes down |

This PR adds a check of Doris's deployment mode. If the deployment mode is changed later, the service shuts down and reports a clear error message.
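To illustrate the fail-fast idea, here is a minimal sketch, not the PR's actual implementation. It assumes the node records its mode in a small marker file on first start and compares against it on later starts. The file name `deploy_mode`, the function `check_deploy_mode`, and the exception type are all hypothetical.

```python
import os
import sys
import tempfile

MODE_FILE = "deploy_mode"  # hypothetical marker file name, not taken from the PR


class DeployModeChangedError(RuntimeError):
    """Raised when a node starts in a different mode than the one recorded."""


def check_deploy_mode(meta_dir: str, current_mode: str) -> None:
    """Record the mode on first start; refuse to continue if it later differs."""
    if current_mode not in ("cloud", "local"):
        raise ValueError(f"unknown deployment mode: {current_mode!r}")
    path = os.path.join(meta_dir, MODE_FILE)
    if not os.path.exists(path):
        # First start: remember the mode this node was deployed with.
        with open(path, "w", encoding="utf-8") as f:
            f.write(current_mode)
        return
    with open(path, encoding="utf-8") as f:
        recorded = f.read().strip()
    if recorded != current_mode:
        raise DeployModeChangedError(
            f"deployment mode changed from {recorded!r} to {current_mode!r}; "
            "switching between cloud and local mode is not supported"
        )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as meta_dir:
        check_deploy_mode(meta_dir, "local")  # first start records "local"
        check_deploy_mode(meta_dir, "local")  # same mode: passes silently
        try:
            check_deploy_mode(meta_dir, "cloud")  # mode changed: fail fast
        except DeployModeChangedError as err:
            print(f"fatal: {err}", file=sys.stderr)
```

The point of the sketch is that the mismatch is detected before any cluster traffic happens, so the user sees one explicit error instead of the hangs and heartbeat errors listed in the tables.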


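Proposed changes (translated from Chinese)

Currently, the storage-compute separated mode and the storage-compute integrated mode cannot be converted into each other. In most cases, deployments of the two modes should not get mixed up, but it cannot be ruled out that some users add a node to the wrong cluster by mistake. Another possibility is that a user accidentally deletes the cloud-related configuration (for example, by copying a configuration from elsewhere over the current one), which causes the node to start in local mode.

Scenarios for different nodes in different clusters:

| Case | Node was previously in another Cloud cluster | Node was previously in another Local cluster | Node has never been added to any cluster |
| -- | -- | -- | -- |
| Add BE to a Local cluster | Can be added, but heartbeats report `invalid cluster id. ignore.` Normal use of the original two clusters is not affected. | Can be added, but heartbeats report `invalid cluster id. ignore.` Normal use of the original two clusters is not affected. | If cloud-related configuration has not been added, it works normally.<br />If cloud-related configuration has been added, it starts with the cloud logic and cannot start normally. |
| Add FE to a Local cluster | Can be added, but heartbeats report `Socket is closed by peer.` Normal use of the original two FEs is not affected. | Can be added, but heartbeats report `Socket is closed by peer.` Normal use of the original two FEs is not affected. | If cloud-related configuration has not been added, it works normally.<br />If cloud-related configuration has been added, it starts with the cloud logic and cannot start normally. |
| Add BE to a Cloud cluster | Can be added, but heartbeats report `invalid cluster id. ignore.` Normal use of the original two clusters is not affected. | Can be added, but heartbeats report `invalid cluster id. ignore.` Normal use of the original two clusters is not affected. | If cloud-related configuration has not been added, it can be added successfully, but operations such as insert report errors, and it may even cause previously healthy BEs to core dump.<br />If cloud-related configuration has been added, it works normally. |
| Add FE to a Cloud cluster | Can be added, but heartbeats report `Socket is closed by peer.` Normal use of the original two FEs is not affected. | Can be added, but heartbeats report `Socket is closed by peer.` Normal use of the original two FEs is not affected. | If cloud-related configuration has not been added: when the FE has not joined the cloud cluster, it reports `failed to get local fe's type, sleep 5 s, try again.`; when it has joined the cloud cluster, reading metadata fails with `Unknown meta module: cloudWarmUpJob.` and the FE hangs.<br />If cloud-related configuration has been added, it works normally. |

| Case | Symptom |
| -- | -- |
| BE in a Local cluster adds cloud configuration | Starts with the cloud logic and hangs during startup |
| FE in a Local cluster adds cloud configuration | Starts with the cloud logic and hangs during startup |
| BE in a Cloud cluster removes cloud configuration | Starts normally, but queries and loads report errors |
| FE in a Cloud cluster removes cloud configuration | Repeatedly logs `get version from meta service failed`, then goes down |

In these cases, a node that switches between cloud and local mode should fail fast and then inform the user.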
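The accidental-overwrite scenario can be illustrated with a small sketch that derives the mode from a configuration file's contents. This is an illustration only: the description speaks of "cloud config items" without naming them, so the key names in `CLOUD_KEYS`, the sample values, and the helper functions below are assumptions.

```python
# Key names assumed for illustration; the PR text does not list them.
CLOUD_KEYS = ("meta_service_endpoint", "cloud_unique_id")


def parse_conf(text: str) -> dict:
    """Parse `key = value` lines, skipping blank lines and `#` comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        conf[key.strip()] = value.strip()
    return conf


def detect_mode(conf: dict) -> str:
    """Return "cloud" if any assumed cloud key is set to a non-empty value."""
    return "cloud" if any(conf.get(k) for k in CLOUD_KEYS) else "local"


if __name__ == "__main__":
    # Illustrative values only.
    original = "meta_service_endpoint = 127.0.0.1:5000\ncloud_unique_id = 1:1:fe\n"
    overwritten = "# copied from another host\nhttp_port = 8030\n"
    print(detect_mode(parse_conf(original)))     # cloud
    print(detect_mode(parse_conf(overwritten)))  # local
```

Under these assumptions, overwriting the file silently flips the detected mode from cloud to local, which is exactly the kind of change that should stop the node with an explicit error rather than let it start.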

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@yagagagaga

run buildall

@doris-robot

TPC-H: Total hot run time: 43530 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit ae23c3b4eb5ed7d254a5a6c34e444837fa72c137, data reload: false

------ Round 1 ----------------------------------
q1	17816	7511	7330	7330
q2	2151	199	207	199
q3	11531	1547	1469	1469
q4	10416	977	1082	977
q5	8071	3321	3225	3225
q6	250	165	162	162
q7	1061	660	642	642
q8	10153	2123	2041	2041
q9	6858	6367	6338	6338
q10	7024	2550	2549	2549
q11	438	263	257	257
q12	426	234	231	231
q13	17764	3047	3043	3043
q14	291	253	254	253
q15	585	542	524	524
q16	521	462	446	446
q17	1003	969	962	962
q18	7536	6741	6784	6741
q19	1383	1260	1251	1251
q20	615	350	329	329
q21	3940	3591	3566	3566
q22	1055	1017	995	995
Total cold run time: 110888 ms
Total hot run time: 43530 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7255	7231	7204	7204
q2	346	231	230	230
q3	2950	2923	2950	2923
q4	1949	1963	1972	1963
q5	5498	5459	5479	5459
q6	242	143	146	143
q7	2069	1680	1717	1680
q8	3278	3348	3353	3348
q9	8501	8476	8489	8476
q10	3444	3505	3473	3473
q11	583	472	472	472
q12	776	593	582	582
q13	5635	3068	3042	3042
q14	317	271	272	271
q15	567	519	521	519
q16	500	458	454	454
q17	1760	1722	1720	1720
q18	8045	7706	7569	7569
q19	1737	1724	1718	1718
q20	2065	1832	1839	1832
q21	5733	5679	5631	5631
q22	1077	1008	987	987
Total cold run time: 64327 ms
Total hot run time: 59696 ms
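The bot output does not label its columns. The sketch below checks one reading of the Round 1 rows against the reported totals: the first number is treated as the cold run, the next two as hot runs, and the last as the smaller of the two hot runs. That reading is inferred from the numbers, not from the benchmark tool's documentation.

```python
# Round 1 rows copied from the TPC-H comment above (tabs replaced by spaces).
ROUND_1 = """\
q1 17816 7511 7330 7330
q2 2151 199 207 199
q3 11531 1547 1469 1469
q4 10416 977 1082 977
q5 8071 3321 3225 3225
q6 250 165 162 162
q7 1061 660 642 642
q8 10153 2123 2041 2041
q9 6858 6367 6338 6338
q10 7024 2550 2549 2549
q11 438 263 257 257
q12 426 234 231 231
q13 17764 3047 3043 3043
q14 291 253 254 253
q15 585 542 524 524
q16 521 462 446 446
q17 1003 969 962 962
q18 7536 6741 6784 6741
q19 1383 1260 1251 1251
q20 615 350 329 329
q21 3940 3591 3566 3566
q22 1055 1017 995 995
"""

# Drop the query name and keep the four timings (ms) per row.
rows = [[int(x) for x in line.split()[1:]] for line in ROUND_1.splitlines()]

# Assumed meaning: column 0 = cold run, columns 1-2 = hot runs, column 3 = best hot run.
best_hot_matches = all(r[3] == min(r[1], r[2]) for r in rows)
total_cold = sum(r[0] for r in rows)
total_hot = sum(r[3] for r in rows)

print(best_hot_matches)       # True
print(total_cold, total_hot)  # 110888 43530
```

Both sums match the "Total cold run time" and "Total hot run time" lines reported for Round 1.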

@doris-robot

TPC-DS: Total hot run time: 195235 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ae23c3b4eb5ed7d254a5a6c34e444837fa72c137, data reload: false

query1	925	388	380	380
query2	6489	1775	1883	1775
query3	6658	219	241	219
query4	26142	24027	24084	24027
query5	5088	548	536	536
query6	258	168	165	165
query7	4589	296	311	296
query8	293	215	214	214
query9	8482	2576	2606	2576
query10	446	283	284	283
query11	16444	15466	15532	15466
query12	169	104	102	102
query13	1674	391	377	377
query14	11351	6752	6539	6539
query15	224	181	182	181
query16	7577	479	479	479
query17	1548	588	570	570
query18	1922	294	294	294
query19	199	153	153	153
query20	124	112	115	112
query21	212	105	105	105
query22	4628	4282	4443	4282
query23	34575	33674	33529	33529
query24	10117	3117	3097	3097
query25	696	431	425	425
query26	1400	162	164	162
query27	2901	284	281	281
query28	6876	2160	2119	2119
query29	987	433	429	429
query30	301	164	157	157
query31	995	782	817	782
query32	107	60	64	60
query33	747	325	314	314
query34	932	484	477	477
query35	908	764	718	718
query36	1081	918	918	918
query37	178	81	88	81
query38	4143	3932	3905	3905
query39	1449	1394	1428	1394
query40	293	120	122	120
query41	52	54	48	48
query42	124	99	103	99
query43	498	448	454	448
query44	1285	783	795	783
query45	202	174	175	174
query46	1132	801	844	801
query47	1922	1777	1779	1777
query48	369	298	301	298
query49	1125	460	459	459
query50	911	442	454	442
query51	7122	6889	6840	6840
query52	104	96	91	91
query53	263	191	188	188
query54	811	471	475	471
query55	81	80	82	80
query56	298	275	281	275
query57	1253	1098	1095	1095
query58	255	245	246	245
query59	2889	2622	2653	2622
query60	312	298	293	293
query61	131	124	127	124
query62	935	671	678	671
query63	220	194	300	194
query64	5303	687	670	670
query65	3296	3170	3185	3170
query66	1426	303	298	298
query67	15999	15530	15551	15530
query68	3190	866	855	855
query69	443	331	334	331
query70	1183	1167	1197	1167
query71	364	355	348	348
query72	5875	3271	3380	3271
query73	597	587	587	587
query74	9355	9189	9175	9175
query75	3157	3021	3031	3021
query76	1926	879	891	879
query77	446	416	433	416
query78	9458	9295	9357	9295
query79	946	911	897	897
query80	883	850	819	819
query81	453	267	271	267
query82	268	266	265	265
query83	243	200	200	200
query84	235	108	106	106
query85	656	412	393	393
query86	327	309	336	309
query87	4427	4406	4397	4397
query88	4123	4050	4060	4050
query89	393	373	388	373
query90	1310	326	324	324
query91	126	123	126	123
query92	83	78	79	78
query93	1073	1085	1060	1060
query94	614	394	369	369
query95	460	439	430	430
query96	475	480	477	477
query97	3138	3122	3140	3122
query98	237	241	231	231
query99	1574	1315	1346	1315
Total cold run time: 280484 ms
Total hot run time: 195235 ms

@doris-robot

ClickBench: Total hot run time: 30.91 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit ae23c3b4eb5ed7d254a5a6c34e444837fa72c137, data reload: false

query1	0.04	0.04	0.04
query2	0.07	0.04	0.04
query3	0.22	0.05	0.04
query4	1.67	0.07	0.07
query5	0.50	0.50	0.49
query6	1.13	0.73	0.73
query7	0.02	0.02	0.02
query8	0.05	0.04	0.05
query9	0.57	0.50	0.50
query10	0.56	0.58	0.56
query11	0.16	0.12	0.13
query12	0.15	0.12	0.13
query13	0.63	0.61	0.61
query14	1.46	1.46	1.47
query15	0.91	0.88	0.88
query16	0.36	0.36	0.36
query17	0.99	1.03	1.01
query18	0.22	0.20	0.21
query19	1.95	1.82	1.81
query20	0.01	0.01	0.01
query21	15.45	0.66	0.66
query22	4.55	8.23	1.07
query23	17.82	1.28	1.41
query24	2.26	0.22	0.22
query25	0.18	0.08	0.08
query26	0.29	0.19	0.17
query27	0.09	0.07	0.09
query28	13.19	1.14	1.11
query29	12.57	3.38	3.39
query30	0.25	0.06	0.07
query31	2.86	0.43	0.41
query32	3.24	0.51	0.51
query33	3.02	3.05	3.10
query34	15.48	4.33	4.28
query35	4.32	4.35	4.34
query36	0.70	0.49	0.49
query37	0.19	0.16	0.17
query38	0.17	0.15	0.15
query39	0.05	0.04	0.04
query40	0.16	0.13	0.14
query41	0.10	0.05	0.05
query42	0.05	0.06	0.05
query43	0.04	0.04	0.04
Total cold run time: 108.7 s
Total hot run time: 30.91 s

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.90% (9463/25648)
Line Coverage: 28.24% (77784/275394)
Region Coverage: 27.67% (40188/145265)
Branch Coverage: 24.27% (20420/84122)
Coverage Report: http://coverage.selectdb-in.cc/coverage/ae23c3b4eb5ed7d254a5a6c34e444837fa72c137_ae23c3b4eb5ed7d254a5a6c34e444837fa72c137/report/index.html

gavinchou previously approved these changes Sep 19, 2024
@github-actions bot added the `approved` label (indicates a PR has been approved by one committer) Sep 19, 2024

PR approved by at least one committer and no changes requested.


PR approved by anyone and no changes requested.


clang-tidy review says "All clean, LGTM! 👍"

@github-actions bot removed the `approved` label (indicates a PR has been approved by one committer) Oct 22, 2024

clang-tidy review says "All clean, LGTM! 👍"

gavinchou previously approved these changes Nov 11, 2024
@github-actions bot added the `approved` label (indicates a PR has been approved by one committer) Nov 11, 2024

PR approved by at least one committer and no changes requested.

@dataroaring

run buildall

dataroaring previously approved these changes Nov 11, 2024

@dataroaring dataroaring left a comment

LGTM

@yagagagaga yagagagaga dismissed stale reviews from gavinchou and dataroaring via 8a3408d November 11, 2024 03:27
@yagagagaga

run buildall


clang-tidy review says "All clean, LGTM! 👍"

@github-actions bot removed the `approved` label (indicates a PR has been approved by one committer) Nov 11, 2024

clang-tidy review says "All clean, LGTM! 👍"

@github-actions bot added the `approved` label (indicates a PR has been approved by one committer) Nov 11, 2024

PR approved by at least one committer and no changes requested.


@dataroaring dataroaring left a comment

LGTM

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 37.90% (9866/26029)
Line Coverage: 29.09% (82272/282826)
Region Coverage: 28.24% (42344/149963)
Branch Coverage: 24.79% (21444/86506)
Coverage Report: http://coverage.selectdb-in.cc/coverage/8a3408d36a865ff78b21d5e684a568e564218d22_8a3408d36a865ff78b21d5e684a568e564218d22/report/index.html

@gavinchou gavinchou merged commit fd2e58b into apache:master Nov 13, 2024
27 of 30 checks passed
github-actions bot pushed a commit that referenced this pull request Nov 13, 2024

Co-authored-by: yagagagaga <[email protected]>
dataroaring pushed a commit that referenced this pull request Nov 14, 2024
…0764 (#43891)

Cherry-picked from #40764

Co-authored-by: yagagagaga <[email protected]>
Co-authored-by: yagagagaga <[email protected]>
@yagagagaga yagagagaga deleted the core-3831 branch February 25, 2025 02:41
Labels: approved, dev/3.0.3-merged, p0_c, reviewed