MySQL的join buffer原理及如何提高查询效率-FinClip官网

MySQL的join buffer原理及如何提高查询效率

网友投稿 677 2022-08-29

MySQL的join buffer原理及如何提高查询效率

一、MySQL的join buffer**

在MySQL对于join操作的处理过程中，join buffer是一个重要的概念，也是MySQL对于table join的一个重要的优化手段。虽然这个概念实现并不复杂，但是这个是实现MySQL join连接优化的一个重要方法，在"暴力"连接的时候可以极大提高join查询的效率。

关于这个概念的权威说明当然是来自MySQL文档中对于这个概念的说明，说明的文字不多，但是言简意赅，说明了这个优化的主要实现思想：

Assume you have the following join:

Table name Type

t1 range

t2 ref

t3 ALL

The join is then done as follows:

- While rows in t1 matching range

- Read through all rows in t2 according to reference key

- Store used fields from t1, t2 in cache - If cache is full - Read through all rows in t3

- Compare t3 row against all t1, t2 combinations in cache - If row satisfies join condition, send it to client - Empty cache - Read through all rows in t3

- Compare t3 row against all stored t1, t2 combinations in cache - If row satisfies join condition, send it to client

二、join buffer cache存储空间的分配

下面函数中table_count表示的就是所有join table中在该table之前的非const table数量，因为这个table要缓存自己之前所有table中的每条记录中"需读取"(tables[i].table->read_set置位)。

其中两重循环每次执行都是复制下需要缓存的field的描述结构(及其对应的数据源)，或者说，二重循环只是为了赋值和保存元数据，而最后的cache->buff=(uchar*) my_malloc(size,MYF(0))才是真正的分配满足条件的记录内容。

static int

join_init_cache(THD *thd,JOIN_TAB *tables,uint table_count)

{

…… for (i=0 ; i < table_count ; i++)

{ bool have_bit_fields= FALSE;

uint null_fields=0,used_fields;

Field **f_ptr,*field;

MY_BITMAP *read_set= tables[i].table->read_set; for (f_ptr=tables[i].table->field,used_fields=tables[i].used_fields ;

used_fields ;

f_ptr++)

{

field= *f_ptr; if (bitmap_is_set(read_set, field->field_index))

{

used_fields--;

length+=field->fill_cache_field(copy);

……

}

cache->length=length+blobs*sizeof(char*);

cache->blobs=blobs;

*blob_ptr=0; /* End sequentel */ size=max(thd->variables.join_buff_size, cache->length); if (!(cache->buff=(uchar*) my_malloc(size,MYF(0))))

DBUG_RETURN(1); /* Don't use cache */ /* purecov: inspected */ cache->end=cache->buff+size;

reset_cache_write(cache);

DBUG_RETURN(0);

}

三、普通的多表查询实现

这个"普通"当然也可以理解为"朴素"、"直观"的意思，也是大部分情况下的执行流程。普通查询其实就是对于对于各个表格进行递归调用，和矩阵的乘法一样一样的，这个对应非常直观，也非常通用。

而这个常规的查询动作就是通过sub_select函数来实现，这个函数本质性上是执行

tsecer_select()

{ for (r = first ; r != end ; r = next)

{ if(sofartest())

{ nexttable.tsecer_select()

}

其中的sofartest()表示"使用所有当前已读取表格可以进行的判断"，也就是where中下推的表达式。例如 select * from a, b where a.a > 10 and b.b + a.a = 10，在a表读取之后，其实已经可以执行 a.a > 10的判断。当然这个是一个甚至算不上伪代码的描述方法，而真正的代码对应为：

enum_nested_loop_state sub_select(JOIN *join,JOIN_TAB *join_tab,bool end_of_records) {

……

error= (*join_tab->read_first_record)(join_tab);

rc= evaluate_join_record(join, join_tab, error);

…… while (rc == NESTED_LOOP_OK)

{

error= info->read_record(info);

rc= evaluate_join_record(join, join_tab, error);

}

…… return rc;

} static enum_nested_loop_state evaluate_join_record(JOIN *join, JOIN_TAB *join_tab, int error) {

…… if (select_cond)

{

select_cond_result= test(select_cond->val_int()); /* check for errors evaluating the condition */ if (join->thd->is_error()) return NESTED_LOOP_ERROR;

}

…… if (found)

{ enum enum_nested_loop_state rc; /* A match from join_tab is found for the current partial join. */ rc= (*join_tab->next_select)(join, join_tab+1, 0); if (rc != NESTED_LOOP_OK && rc != NESTED_LOOP_NO_MORE_ROWS) return rc; if (join->return_tab < join_tab) return NESTED_LOOP_OK; /*

Test if this was a SELECT DISTINCT query on a table that

was not in the field list; In this case we can abort if

we found a row, as no new rows can be added to the result.

*/ if (not_used_in_distinct && found_records != join->found_records) return NESTED_LOOP_NO_MORE_ROWS;

}

……

}

这里可以看到，这个地方是一个递归，用来产生一个笛卡尔叉乘集合，从程序实现和数学表达上看都非常简洁可爱。

在MySQL的实现中，tsecer_select函数中的for循环大致相当sub_select中的while循环，而tsecer_select函数中循环体内的内容被放在了evaluate_join_record函数中，其中的sofartest对应evaluate_join_record::test(select_cond->val_int())；tsecer_select中的nexttable.tsecer_select()语句对应evaluate_join_record::(*join_tab->next_select)(join, join_tab+1, 0)。

四、join buffer的select实现

当使用join buffer cache时，next_select函数指向sub_select_cache

enum_nested_loop_state sub_select_cache(JOIN *join,JOIN_TAB *join_tab,bool end_of_records) {

enum_nested_loop_state rc; if (end_of_records)

{

rc= flush_cached_records(join,join_tab,FALSE); if (rc == NESTED_LOOP_OK || rc == NESTED_LOOP_NO_MORE_ROWS)

rc= sub_select(join,join_tab,end_of_records); return rc;

} if (join->thd->killed) // If aborted by user { join->thd->send_kill_message(); return NESTED_LOOP_KILLED; /* purecov: inspected */ } if (join_tab->use_quick != 2 || test_if_quick_select(join_tab) <= 0)

{ if (!store_record_in_cache(&join_tab->cache)) return NESTED_LOOP_OK; // There is more room in cache return flush_cached_records(join,join_tab,FALSE);

}

rc= flush_cached_records(join, join_tab, TRUE); if (rc == NESTED_LOOP_OK || rc == NESTED_LOOP_NO_MORE_ROWS)

rc= sub_select(join, join_tab, end_of_records); return rc;

}

结合MySQL文档中的说明，这里的代码意义就比较明显。开始对于end_of_records的判断对应的就是

if (!store_record_in_cache(&join_tab->cache)) return NESTED_LOOP_OK; // There is more room in cache return flush_cached_records(join,join_tab,FALSE);

对应

- Store used fields from t1, t2 in cache - If cache is full

其中store_record_in_cache函数会判断cache是否已满，如果cache可以放入更多的缓存，则把之前table的组合记录存储在cache中，并返回NESTED_LOOP_OK。注意：这个地方可以说是整个cache优化的关键，因为这里并没有启动对于table的扫描。反过来说，如果cache数据已经满了，则调用flush_cached_records函数来进行下面的流程

- Read through all rows in t3

- Compare t3 row against all t1, t2 combinations in cache - If row satisfies join condition, send it to client - Empty cache

这个流程的特殊之处在于遍历的驱动是通过对于table的每一条记录来和cache中所有t1、t2组合来进行比较，来判断是否满足下推where条件(If row satisfies join condition)，则执行join_tab->next_select函数(send it to client)。

static enum_nested_loop_state

flush_cached_records(JOIN *join,JOIN_TAB *join_tab,bool skip_last)

{

……

info= &join_tab->read_record; do {//遍历t3表格所有记录 …… for (i=(join_tab->cache.records- (skip_last ? 1 : 0)) ; i-- > 0 ;)

{//遍历cache中所有t1、t2记录组合 read_cached_record(join_tab);

skip_record= FALSE; if (select && select->skip_record(join->thd, &skip_record))

{// reset_cache_write(&join_tab->cache); return NESTED_LOOP_ERROR;

} if (!skip_record)

{//满足下推的where条件 //执行下一个table的遍历 rc= (join_tab->next_select)(join,join_tab+1,0); if (rc != NESTED_LOOP_OK && rc != NESTED_LOOP_NO_MORE_ROWS)

{

reset_cache_write(&join_tab->cache); return rc;

}

……

} while (!(error=info->read_record(info)));

五、举例来说明下这个流程

这个实现的核心思想并不复杂，结合具体的例子来看就更加的简单直观。

举个例子，其中使用两个简单的table，其中分别存储一个x，和y的值，我们希望通过一个join操作来计算这两个表格中所有的满足 x x + y y == 5 * 5,也就是我们最常见的"勾三股四弦五"这样的经典勾股数数值。

mysql> create table harry (x int);

Query OK, 0 rows affected (0.03 sec)

mysql> insert harry values (1),(2),(3),(4),(5);

Query OK, 5 rows affected (0.00 sec) Records: 5 Duplicates: 0 Warnings: 0 mysql> create table tsecer (y int);

Query OK, 0 rows affected (0.01 sec)

mysql> insert tsecer values (1),(2),(3),(4),(5);

Query OK, 5 rows affected (0.00 sec) Records: 5 Duplicates: 0 Warnings: 0 mysql> explain select * from harry, tsecer where x * x + y * y = 5 * 5;

+----+-------------+--------+------+---------------+------+---------+------+------+--------------------------------+

+----+-------------+--------+------+---------------+------+---------+------+------+--------------------------------+

2 rows in set (0.00 sec)

mysql>

1、不使用joinbuffer

在不使用join buffer的情况下，对于harry表的每个x值，对应的tsecer表都要进行一次全表扫描，之后使用这个x和y的组合判断是否满足x x + y y == 5 * 5这条件。由于x总共有5个值，所以tsecer需要全表扫描的次数就是5次。

2、使用joinbuffer

对于x的每个值，tsecer表在执行的时候先是把这个值缓存到joinbuffer中，如果buffer缓冲内容非空，那么把此时的x的值存储在buffer中后直接返回；当join buffer满或者是最后一条记录的时候，此时开始启动对于tsecer表的扫描，对于tsecer表中读取的每一个记录，结合前面缓存的每一个记录，看是否满足自己判断条件。

对于我们看到的例子，这个地方harry表的5个值都在缓存中，在tsecer表的扫描过程中，对于从tsecer中读取的每一条记录，结合缓存中的“每一条”缓存，判断这个组合结果是否满足条件，如果任意一个组很满足，那么就继续next_select。

在这个使用buffer的例子中，可以看到这个地方只是对于tsecer表进行了一次扫描，而通常来说，数据库的扫描代码是最高的(因为要涉及到磁盘读取)，这样使用buffer的方式将tsecer表的扫描降低为1次，所以这个效率提高很多，特别是在涉及到的多个table，并且/或者每个table中的记录数量都很多的情况下。

3、cache可以优化的原因

本质上说，这个效率提高的原因在于提高了从table中获得的每条记录的“利用率”，在使用直观扫描方式时，table的全表扫描只是和一个组合进行匹配，而使用buffer之后则是和cache中的所有组合进行匹配。

在数字化转型中，选择合适的跨平台开发框架不仅能提高效率，还有助于确保数据安全与合规性。

677 2022-08-29

MySQL的join buffer原理及如何提高查询效率

洞察掌握android电视app开发中的安全与合规策略，提升企业运营效率

在数字化转型中，选择合适的跨平台开发框架不仅能提高效率，还有助于确保数据安全与合规性。

在数字化转型浪潮中，企业如何通过跨端开发框架提升运营效率，兼顾合规性与数据安全？

最近发表

更多内容

小程序SDK

Finclip技术文档

小程序开发

小程序容器

小程序框架

Finclip小程序平台

Finclip用户投稿

车联网

推荐文章

小程序SDK是什么意思？小程序sdk和插件有什么区别？

小程序支付功能怎么实现？

企业app开发流程是什么？

app运营模式有哪些？

小程序多端引流怎么做？

小程序生态分析的机会和威胁

Flutter入门这一篇效率文章就够了

原生与跨平台解决方案分析,跨平台软件开发技术方案

热更新技术：让软件更新变得更加轻松快速

解决方案

银行解决方案

证券解决方案

互联网解决方案

政企OA解决方案

科技解决方案

loT解决方案

信任解决方案

热评文章

AppCan:基于混合模式的移动应用开发,移动混合模

Hybrid App混合模式开发的了解

小程序容器技术助力券商数字营销突围，小程序容器化的意

用mpvue开发微信小程序基础知识（vue.js开发

小程序多端框架全面测评对比，强烈推荐！

券商app架构 - 解析券商应用程序的构建与设计