




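Research and Practice on the Intelligent Transformation of Data Analysis, Powered by LLM + Agent

CONTENTS
1 A New Era of Intelligent Data Analysis
2 Overview of OlaChat, Tencent's Intelligent Data Analysis Platform
3 Building the OlaChat Agent
4 OlaChat Key Capabilities
5 Appendix

Data analysis helps people distill knowledge from data and turn it into wisdom.
History of data analysis platforms: from reporting tools to intelligent BI.
Large language models bring entirely new possibilities to intelligent data analysis (Zhao, Wayne Xin, et al. A Survey of Large Language Models).

Challenges for third-generation data intelligence products
Problems:
1) Large models have inherent problems of their own.
2) Current solutions have low accuracy and are hard to deploy in practice.
3) Current intelligent platforms only address single-point needs such as data retrieval and data insight.
Challenges:
C1) How do we design a complete Agent/workflow that fits the data analysis domain and supports intelligent use in industry? (intelligence)
C2) How do we enrich the feature set while keeping every feature continuously iterable and optimizable? (richness)
C3) How do we integrate the features into different platforms efficiently? (easy integration, high availability)

OlaChat: building capabilities that span the whole BI workflow and improve analysis efficiency from multiple angles.
OlaChat: LLM + Agent working together to power intelligent data analysis.
S1: an Agent with human-like thinking improves the effectiveness of data insight.
S2: a toolbox design that provides rich tool capabilities.
S3: a layered architecture that is highly integrable and easy to extend.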

Tencent intelligent data analysis platform OlaChat: feature showcase
All data shown is non-business test data, for demonstration only.
Demo screens: one-stop OlaChat; dialogue system; text2SQL; text2DSL; intelligent charting; intelligent dashboards; SQL error correction.

Agent research and related applications
[Figure: timeline of agent work from 2023.3 to 2023.9. Categories: General Agent, Multi-Modal Agent, Data Agent, Education Agent, Game Agent, Embodied Agent, Multi-Agent, Evaluation Agent. Entries: AutoGPT (2023-3), HuggingGPT (2023-3), Dona (2023-3), VisualChatGPT (2023-3), CAMEL (2023-4), Stanford Town (2023-4), Voyager (2023-5), Data-Copilot (2023-5), OlaGPT (2023-5), AgentBench (2023-8), RoboAgent (2023-9).]

OlaGPT: empowering LLMs with human-like problem-solving abilities (theory)
Theory first: OlaGPT architecture diagram
OlaGPT: Empowering LLMs With Human-like Problem-Solving Abilities
OlaGPT carefully studies cognitive architecture frameworks and proposes simulating certain aspects of human cognition.
Links: OlaGPT Paper, OlaGPT Code

S1: an Agent with human-like thinking improves the effectiveness of data insight (practice)
Overall construction framework (Agent)
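
OlaChat: key platform capabilities
Intent & dialogue system: intelligent multi-task dialogue, a low-barrier human-computer interaction system
Intent & dialogue system: clarify intent through interactive communication
Intent & dialogue system: efficient, extensible intent recognition
Background: the user's intent must be fully clarified and the task type identified appropriately. Different businesses have different candidate intent sets, and no labeled data is available at the start, so a general-purpose scheme for intent data generation and intent classification is needed.
Data scheme: build cold-start and data augmentation pipelines
1) Full cold start: the LLM generates data from the question categories plus question descriptions.
2) Data augmentation: model-prediction augmentation; synonym and near-synonym augmentation; query-rewriting augmentation; back-translation augmentation.
Multi-scheme voting: improves accuracy and supports fast online fixes of bad cases
1) Classification model: train a large classifier over all question types, based on BERT, RoBERTa, ERNIE, etc.; retrain based on classification scores, or adapt to a business's sub-classification needs through strategies; aimed at long-term optimization (periodic model updates).
2) Retrieval strategy: candidate recall plus ranking; aimed at accuracy on high-frequency questions and timely extensibility.
3) LLM-based: prompt the model to output the intent; aimed at generalization across questions and timely extensibility.
Module onboarding: to plug into the pipeline, prepare the question types and a rough description of each (optionally with some cases).
Classification scheme diagram

Memory retrieval: metadata retrieval augmentation (MetaRAG)
Metadata is a combination of metrics, fields, and dimensions organized in a specific structure and hierarchy.
It does not match the basic assumptions of language models.
Specific modifiers matter, e.g. “有效” ("valid").
Field names can be very short, e.g. “播放VV” ("play VV").
Structured metadata retrieval: based on the user's query, determine which "metric + dimension" or which "table + field" combinations can fetch the data that satisfies the user's need. Retrieval must be as precise as possible, to reduce the noise passed to the LLM and to avoid fetching data that does not meet the requirement.
Memory retrieval: metadata retrieval augmentation with FlattenedRAG
Memory retrieval: metadata retrieval augmentation with StructuredRAG

S2: the toolbox provides rich functionality, taking the SQL matrix as an example
SQL capability matrix
Text2SQL: existing solutions do not fit business scenarios
Text2SQL: a model is only a way to keep approaching the upper bound set by the data
Text2SQL: semi-automated data construction pipeline
Text2SQL: the Agent safeguards model robustness (DEA-SQL)
1) Why do we need a workflow paradigm for the text-to-SQL task? Single-step prompting is limited and has several major drawbacks: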
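a) Scattered attention in the LLM reduces efficiency.
b) The LLM struggles to focus on a specific question within a large amount of text.
2) How do we design a good workflow paradigm for the text-to-SQL task?
Information determination: a two-stage method removes distracting information so that attention stays focused.
Classification and hint module: classify the different problems that cannot be handled in one generic way, so that each class can be solved with its own simple prompt.
SQL generation module: few-shot retrieval based on question templates.
Self-correction module: based on a summary of errors.
Active learning module: extends the model's capabilities based on error cases.
Links: DEA-SQL Paper, DEA-SQL Code
ACL'24: Decomposition for Enhancing Attention: Improving LLM-based Text-to-SQL through Workflow Paradigm

Text2SQL: gains from data + model + Agent

SQL correction: existing solutions mainly target fully automatic correction of SQL
Jipeng Cen, et al. SQLFixAgent: Towards Semantic-Accurate SQL Generation via Multi-Agent Collaboration
Incomplete coverage: in industry there are many non-generic syntax-error cases that need special handling (e.g. those caused by cluster instability or by unsupported SQL dialects), where users expect reasonable suggestions and solutions.
Data security: using closed-source models can lead to data leakage.
Arian Askari, et al. MAGIC: Generating Self-Correction Guideline for In-Context Text-to-SQL
SQL correction: divide and conquer, with Agent support to improve accuracy

Intelligent chart drawing: lighter workload, low-barrier visualization
Intelligent data interpretation: completing the last mile of data analysis
An open, integrated intelligent analysis platform architecture
S3: the OlaChat layered architecture

THANKS
Speaker: Xie Yuanzhen (谢苑珍), Senior Algorithm Researcher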

Text2SQL: Background & Related Work
Background: lower the technical barrier to data analysis work for colleagues in product operations and data analysis.
Challenges:
1) The LLM has a limited understanding of tables, fields, and dimension values. (Information Determination: two-stage)
2) Single-step CoT capability is limited. How do we design a good workflow paradigm for text2sql tasks? (Workflow Paradigm)
3) The capabilities of the LLM are limited. How do we maximize the potential of the model? (Check optimization, active learning)

Related Work
Supervised learning
Advantages: controllable optimization; information security.
Disadvantages: acquiring annotated text-to-SQL data is costly; training and fine-tuning the models entail significant engineering effort and consume substantial computational resources.
In-context learning
Advantages: faster learning with minimal data; reduced consumption of computing resources.
Disadvantages: uncontrollable optimization.

Decomposition for Enhancing Attention: Improving LLM-based Text-to-SQL through Workflow Paradigm
We propose DEA-SQL:
It draws on human thinking patterns, adheres to the principle of making subtasks as simple as possible, and reduces irrelevant information in each step. This widens the range of problems the LLM can solve and sharpens its attention, which improves performance.
It consists of five sub-modules that imitate the way humans commonly solve text-to-SQL tasks: Information Determination, Classification & Hint, SQL Generation, Self-Correction, Active Learning.

1) Information Determination: reduce irrelevant information
a) Elements identification: identify the elements of the problem.
b) Information filter: select the required tables and columns based on those elements.
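
The two stages of Information Determination can be sketched as below. This is a minimal illustration, not the DEA-SQL implementation: the source describes LLM-driven steps, while here the element identifier is a hypothetical keyword stand-in (`keyword_elements`) and the filter is plain substring matching. The names `Schema`, `STOP_WORDS`, `filter_schema`, and `determine_information` are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Schema = Dict[str, List[str]]  # table name -> column names

# Hypothetical stop-word list, used only by the stand-in element extractor.
STOP_WORDS = {"what", "is", "are", "the", "a", "an", "of", "and", "each",
              "in", "for", "by", "to", "how", "many"}


def keyword_elements(question: str) -> List[str]:
    """Stage 1 stand-in: treat the non-stop-words of the question as its elements."""
    words = (w.strip("?,.").lower() for w in question.split())
    return [w for w in words if w and w not in STOP_WORDS]


def filter_schema(elements: List[str], schema: Schema) -> Schema:
    """Stage 2: keep tables and columns whose names contain one of the elements."""
    kept: Schema = {}
    for table, columns in schema.items():
        matched = [c for c in columns if any(e in c.lower() for e in elements)]
        table_hit = any(e in table.lower() for e in elements)
        if matched:
            kept[table] = matched
        elif table_hit:
            # The table looks relevant but no column could be narrowed down.
            kept[table] = list(columns)
    return kept


def determine_information(
    question: str,
    schema: Schema,
    identify: Callable[[str], List[str]] = keyword_elements,
) -> Tuple[List[str], Schema]:
    """Run both stages; `identify` can be swapped for an LLM-backed function."""
    elements = identify(question)
    return elements, filter_schema(elements, schema)


example_schema = {
    "singer": ["singer_id", "name", "age"],
    "concert": ["concert_id", "venue", "year"],
    "stadium": ["stadium_id", "capacity"],
}
elements, reduced = determine_information(
    "What is the name and age of each singer?", example_schema
)
print(elements)  # ['name', 'age', 'singer']
print(reduced)   # {'singer': ['singer_id', 'name', 'age']}
```

Only the reduced schema would then be passed to later steps, which is the point of the module: less irrelevant text for the model to attend to.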

2) Classification & Hint: local solution
a) Four categories: easy, join, nested, and join-nested.
b) Hints: foreign keys, plus different hints for different categories.
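
A small sketch of the four categories and of category-specific hints follows. It rests on assumptions: the source classifies questions, whereas this example classifies an already written SQL string by its structure, which could serve for labelling a library of solved examples. The hint wording in `CATEGORY_HINTS` is hypothetical, as are the function names.

```python
import re
from typing import List


def classify_sql(sql: str) -> str:
    """Assign one of the four categories from the structure of a SQL string."""
    text = sql.lower()
    has_join = re.search(r"\bjoin\b", text) is not None
    # More than one SELECT keyword is treated as nesting (subquery or set operation).
    has_nested = len(re.findall(r"\bselect\b", text)) > 1
    if has_join and has_nested:
        return "join-nested"
    if has_join:
        return "join"
    if has_nested:
        return "nested"
    return "easy"


# Hypothetical hint wording; the source only states that hints differ by category.
CATEGORY_HINTS = {
    "easy": "Use a single table and select only the fields the question asks for.",
    "join": "Join only the tables that are needed, using the foreign keys listed.",
    "nested": "Solve the inner question first, then use it as a subquery.",
    "join-nested": "Decide the joins first, then write the subquery over the joined tables.",
}


def hints_for(category: str, foreign_keys: List[str]) -> List[str]:
    """Return the category hint, plus foreign-key hints for join categories."""
    hints = [CATEGORY_HINTS[category]]
    if "join" in category:
        hints.extend(f"Foreign key: {fk}" for fk in foreign_keys)
    return hints


print(classify_sql("SELECT name FROM singer"))  # easy
print(hints_for("join", ["concert.stadium_id = stadium.stadium_id"]))
```

Keeping the hints in a per-category table means a simple question never sees the longer hints written for join or nested questions.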

3) SQL Generation
a) The overall question prompt follows the format <F, D, H-FK, I, Q, H, S>, as shown in Figure 2.
b) Few-shot: random, question similarity, and template similarity based on the Classification module.

II. Model architecture design
Base pretrained model selection: wizard-coder and sqlcoder are the expected candidates.
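
The three few-shot selection strategies listed under SQL Generation (random, question similarity, template similarity) can be compared with a sketch like the one below. It is illustrative only: the masking rule in `to_template`, the Jaccard token overlap, and the choice to restrict every strategy to examples of the same category are assumptions, not details given in the source.

```python
import random
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    question: str
    sql: str
    category: str  # easy, join, nested, or join-nested


def to_template(question: str) -> str:
    """Mask quoted values and numbers so that only the question's shape remains."""
    masked = re.sub(r"'[^']*'|\"[^\"]*\"", "<value>", question)
    masked = re.sub(r"\b\d+(\.\d+)?\b", "<num>", masked)
    return masked.lower()


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings, in [0, 1]."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def select_few_shots(question: str, category: str, library: List[Example],
                     k: int = 2, strategy: str = "template",
                     seed: int = 0) -> List[Example]:
    """Pick k examples, preferring the ones in the same category as the question."""
    pool = [e for e in library if e.category == category] or list(library)
    if strategy == "random":
        return random.Random(seed).sample(pool, min(k, len(pool)))
    if strategy == "question":
        score = lambda e: jaccard(question.lower(), e.question.lower())
    elif strategy == "template":
        score = lambda e: jaccard(to_template(question), to_template(e.question))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(pool, key=score, reverse=True)[:k]


library = [
    Example("How many singers are older than 30?",
            "SELECT count(*) FROM singer WHERE age > 30", "easy"),
    Example("List the names of all stadiums.",
            "SELECT name FROM stadium", "easy"),
]
best = select_few_shots("How many singers are older than 45?", "easy", library, k=1)
print(best[0].sql)  # SELECT count(*) FROM singer WHERE age > 30
```

Masking literals before comparing means two questions that differ only in a number or a quoted value count as the same template.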
DEA-SQL: the structure of the method
Links: DEA-SQL Paper, DEA-SQL Code

4) Self-Correction: global correction
a) Extra fields: the LLM often selects an excessive number of fields, rather than limiting its selection to those pertinent to the question.
b) Incorrect fields: for instance, when fields bear identical names across different tables, the alias may be omitted, leading to errors.
c) Table and field association errors: there may be inconsistencies between the tables and fields used.
d) Fabricated conditions for table joins.
e) Misuse of association words: for example, there is a tendency to habitually use 'left join' in place of 'join'.
f) Group or order by errors: mistakes such as incorrect aggregation fields and conditions may be encountered.
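
One way to put the error summary above to work is to turn it into a checklist inside a correction prompt, optionally preceded by a cheap static check. The sketch below is an assumption-heavy illustration: the checklist wording, the prompt layout, and the regex-based check for unqualified shared column names are all hypothetical and are not taken from DEA-SQL.

```python
import re
from typing import Dict, List

Schema = Dict[str, List[str]]  # table name -> column names

# Hypothetical checklist wording, paraphrasing the six error types above.
ERROR_CHECKLIST = [
    "Extra fields: select only the fields the question asks for.",
    "Incorrect fields: qualify fields that share a name across tables.",
    "Table and field association errors: each field must belong to its table.",
    "Fabricated join conditions: join only on conditions the schema supports.",
    "Misuse of association words: do not use 'left join' where 'join' is meant.",
    "Group by / order by errors: check aggregation fields and conditions.",
]


def unqualified_shared_columns(sql: str, schema: Schema) -> List[str]:
    """Columns that exist in two or more referenced tables and appear without a table prefix."""
    text = sql.lower()
    used = [t for t in schema if re.search(rf"\b{re.escape(t.lower())}\b", text)]
    counts: Dict[str, int] = {}
    for table in used:
        for column in schema[table]:
            counts[column.lower()] = counts.get(column.lower(), 0) + 1
    shared = [c for c, n in counts.items() if n > 1]
    # The lookbehind rejects "table.col" and longer identifiers such as "singer_id".
    return sorted(c for c in shared
                  if re.search(rf"(?<![\w.]){re.escape(c)}\b", text))


def build_correction_prompt(question: str, sql: str, schema: Schema) -> str:
    """Assemble a self-correction prompt from the checklist and the static check."""
    lines = [
        "Check the SQL against each error type and return a corrected SQL.",
        f"Question: {question}",
        f"SQL: {sql}",
        "Error types:",
    ]
    lines += [f"{i}. {item}" for i, item in enumerate(ERROR_CHECKLIST, start=1)]
    flagged = unqualified_shared_columns(sql, schema)
    if flagged:
        lines.append("Possibly ambiguous fields: " + ", ".join(flagged))
    return "\n".join(lines)


schema = {"singer": ["id", "name"], "concert": ["id", "name", "singer_id"]}
sql = "SELECT name FROM singer JOIN concert ON singer.id = concert.singer_id"
print(unqualified_shared_columns(sql, schema))  # ['name']
```

The static check only covers the "incorrect fields" item; the remaining items are left to the model through the checklist.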

5) Active Learning: error correction
The model is more prone to errors on certain problem types (e.g., extremum problems), so some of the error cases are fixed and fed back for the model to learn from.
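
The idea of learning from fixed error cases can be sketched as a small library keyed by problem type, whose entries are added to the prompt when a question of that type arrives. This is a hedged illustration: `FixedCase`, `ErrorCaseLibrary`, the problem-type labels, and the prompt wording are hypothetical names chosen for the example, and the source does not describe how DEA-SQL stores or selects its cases.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FixedCase:
    question: str
    wrong_sql: str
    fixed_sql: str


class ErrorCaseLibrary:
    """Stores corrected mistakes per problem type and renders them for a prompt."""

    def __init__(self) -> None:
        self._cases: Dict[str, List[FixedCase]] = defaultdict(list)

    def record_fix(self, problem_type: str, case: FixedCase) -> None:
        self._cases[problem_type].append(case)

    def cases_for(self, problem_type: str, k: int = 2) -> List[FixedCase]:
        """Return the k most recently fixed cases for this problem type."""
        if k <= 0:
            return []
        return self._cases.get(problem_type, [])[-k:]

    def prompt_block(self, problem_type: str, k: int = 2) -> str:
        """Render the cases as text; empty when nothing is known for the type."""
        cases = self.cases_for(problem_type, k)
        if not cases:
            return ""
        parts = ["Previously fixed mistakes for this problem type:"]
        for case in cases:
            parts.append(
                f"Q: {case.question}\nWrong: {case.wrong_sql}\nFixed: {case.fixed_sql}"
            )
        return "\n".join(parts)


library = ErrorCaseLibrary()
library.record_fix("extremum", FixedCase(
    question="Who is the oldest singer?",
    wrong_sql="SELECT name, max(age) FROM singer",
    fixed_sql="SELECT name FROM singer ORDER BY age DESC LIMIT 1",
))
print(library.prompt_block("extremum"))
```

Because the library is just data, a bad case found online can be added without retraining anything.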

DEA-SQL: the structure of the model
DEA-SQL: the prompt of the method

DEA-SQL: Experiments
Research questions:
RQ1: How does DEA-SQL perform vs. state-of-the-art baselines?
RQ2: Does each module of our method work effectively, and how does each affect the task?
RQ3: What is the token and time cost of the method?
RQ4: How do model parameters such as the number of few-shot examples or the information filter layers affect the method?

RQ1. How does DEA-SQL perform vs. state-of-the-art baselines?
In terms of execution accuracy on the Spider dataset, our workflow-based approach outperforms the existing baselines.
On the Spider-Realistic dataset, which is closer to real-world scenarios and has more difficult question formulations, our approach is more stable and achieves better performance than other LLM-based solutions. This validates the effectiveness of the two-stage information determination we proposed, which can mitigate the impact of different question formulations to some extent.

RQ2. Does each module of our method work effectively, and how does each affect the task?
1) The effect of the sub-modules.
a) In the information determination module, reducing irrelevant information to focus the attention of LLMs effectively enhances performance on complex tasks.
b) Distinguishing problem types (classification), using simple hints for simple problems and targeted complex hints for complex problems, effectively improves the performance of the LLM.
c) The active learning and self-correction modules are designed to raise the upper limit of the original base model's capability, but may to some extent impair the ability to solve other, easy questions.
2) The effect of different few-shot schemes.
a) Random selection is slightly detrimental to overall performance, while the retrieval strategy based on question-template similarity, applied to the retrieval library combined with question classification, yields the best results.
b) Question templates also essentially provide a simple classification of the questions, and relying on question classification makes it easier to find the most relevant questions and their solutions, thereby stimulating the capabilities of LLMs.

RQ3. What is the token and time cost of the method?
It consumes less inference time, which keeps our method efficient in real applications.
In terms of token consumption, the model needs some examples to learn from in context, which leads to some increase in tokens. However, as can be seen, it is still more economical than DIN-SQL.