云数据中心认知it运维_第1页
云数据中心认知it运维_第2页
云数据中心认知it运维_第3页
云数据中心认知it运维_第4页
云数据中心认知it运维_第5页
已阅读5页,还剩24页未读 继续免费阅读

付费下载

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

黄卫IBM混合云顾问工程师云数据中心下的认知IT运维管理数字化和云化转型时代的IT运维“双模态”企业正在转向一个更加灵活的业务转变方式,促使IT运维更专注于快速的DevOps以更贴近用户需求,支撑于较少的变化,但仍然至关重要的后端系统面向用户交互转型和差异化面向系统支撑依托运营水平传统模式敏捷模式IT项目规模少,大多,小开发到上线2-3

年2-3

个月变更频率低高治理架构集中式去中心化治理工具Cloud-ready,

on-premCloud-Native目标OperationsexcellenceTransformation敏捷模式(DevOps,

Lean)传统模式(ITIL,

CMMI)Sources:

The

agile

CIO:

Mastering

digital

disruption.

http://blog.kpmg.ch/the-agile-cio-mastering-digital-disruption/

,

Gartner,

Seven

Steps

toStartingaSuccessful

DevOpsInitiative云数据中心IT运营挑战无法精确预测、跟踪、和控制云端投入云端规划不确定性–what

to

move,when,

how,costs.希望更灵活选择云服务提供商需求和负载设计匹配困难很难始终如一稳定上线复杂方案用户体验的质量很难保证问题的诊断定位和解决花费太多时间关键任务依赖手工完成Shadow

IT不同的服务提供商/工具/clouds.持续压力,降低成本,优化TCO,政策法规,更多产出规划&

第三方代理持续交付持续的可用性保证3IBM认知IT运维路线Value

In

market

todayRoadmapEvent

analytics

and

machine

learningon

event

data

in

NOI.Machine

learning

on

performance

data,OOTB

anomaly

detection,

and

insights

inPredictive

Insights.We

are

building

a

Watson

basedCognitive

Assistant

for

Cloud,

ITand

Network

Operations.(倾听、学习和交互)Text

analytics

on

IT

Operational

datain

Log

Analysis,

or

scenario

basedmodelling

and

correlation.Analytics

SolutionsCognitive

DataScientist应急救火基于洞察以提前预判基于洞察以提升效率基于Watson的认知决策A

fully

Cognitive

SolutionSee

slide

notes

ASee

slide

notes

BIBM认知IT运维:三个指导原则1持续学习2预测修正3推荐决策认知分析能力Advanced

EventAnalyticsPredictiveInsights(exploiting

Watson)Log

AnalyticsEarly

access

trial

of

Watson

for

IT

ServicesManagementAvailable

in2016IBM认知IT运维平台-Netcool

OperationsInsightNetcool提供创新的大数据智能分析,提高运维价值并在整个IT环境降低成本事件日志上下文历史挖掘和规律分析No

complex

manual

intervention

to

setup

&

maintainwith

5

times

faster

processing事件日志周期性规律分析Alerting

before

potential

issues

become

service

impacting,enabling

IT

to

shift

from

reactive

to

proactive事件日志成对成组出现分析IBM

Delivered

content

for

grouping

well

known

events.

Up

to90%

reduction

in

events

out

of

the

box事件日志相关与因果分析Discover

clusters

of

events

that

occur

together,

automaticallygroup

them

for

context事件告警发生之周期性时间规律学习某应用端口Down固定发生在每天早上6点到6点半之间,且每周、每月内发生频率均衡分布某Web容器线程池使用率在下午3-4点通常会到达99%,早上9点前及晚上8点后从未发生,且每周一到周三发生频率最高,周四到周日负载逐级下降事件告警发生之周期性时间规律学习机器学习自动识别事件间相关性可以显著提升IT运维甚至运营效率部分客户可以解决该问题,但需依赖确定的场景、关联规则和专业经验,且云化环境下不具可复制性Failure多个操作员在处理不同的事件,但结果问题是同一个同一个问题触发多个工单,导致SME效率低下根源定位需要经验、时间和足够视野高成本低效率的IT运维FailureAuto

generated

or

out

of

the

boxCorrelation

rulesEvent

Clusters事件告警发生之相关性规律学习机器自学习策略设定:强、中、弱关系学习事件告警发生之相关性规律学习8类告警在历史上成组出现5次,强关系策略可确定这8类告警只可能同时出现,可依次跟踪其中7类告警围绕核心事件(Pivot

Event)出现的相对时间,以时间线(Timeline)体现使用统计学工具,和行为学知算法,揭示指标间的数学关系结构化数据认知学习和预测:Predictive

Insight单个KPI认知分析对每个KPI学习其历史的行为当KPI偏离其历史的行为时,认为是异常周期性变化分析多KPI认知分析使用统计学方法识别KPI之间的关系,确立相关性,并确定哪个KPI指标最有可能是其相关的指标数据集合的变化根源了解正常的行为模式,并在识别到行为模式与正常的行为相异时,发送警告基于Grangercausality

test的方法进行实现由诺贝尔奖获得者,经济学家Clive

Granger提出使用类似”delayed

correlations”等统计测试以确定因果关系对大量的基于时间序列的数据进行整体分析,识别存在于这些数据中的显著因果关系基于时间序列的KPI行为学知格林杰因果关系分析(GrangerCausality)If

KPI

A

makes

the

future

predictions

of

the

value

B

significantly

better

than

using

past

values

of

B

on

their

own

then

it

isconsidered

to

be

Granger

CausingGranger,

Clive

W.

J.

"Investigating

Causal

Relations

by

Econometric

Models

and

Cross-Spectral

Methods."Econometrica

37,

424-438.格林杰因果关系分析(GrangerCausality)Sridharan,

D.,

Levitin,

D.J.

&

Menon,

V.

A

critical

role

for

the

right

fronto-insular

cortex

in

switching

between

central-executive

and

default-modenetworks.

Proceedings

of

the

National

Academy

of

Sciences

of

the

United

States

of

America

105,

12569-12574

(2008).格林杰因果关系描述,箭头指示因果,粗细描述关系强弱:认知分析模型–

如下服务模型是认知计算分析的输出而非输入多KPI相关性认知学习示例统计模型可发现指标数据间的数学关系能找出多大程度的相关性取决于多个因素,比如:指标数据的范围与种类,指标数据的获取,以及环境的稳定等。如果指标数据无相关性,分析则回归到单一指标的分析了。GGIIBBDDCCEEFFHHAA网银Core

BankingApplicationApplicationApplication网银DB2ESBJava/WASAIXRHELStorage

Network目标:自学习各指标数据间的正常算法关系业务响应时间坏好用户请求时间业务响应时间异常事件业务影响早期警告PI学习到,”业务响应时间“与“用户请求”有正相关因果关系–且随着用户负载增加而变慢。如果这一正常历史规律被破坏,比如说由于內存泄漏,造成即便用户请求数下降了,业务响应时间还很高,异常预警信号将立即发出。问题被发现,尽管这时业务服务质量仍处于“好”的区间正在发生的问题可被检测出来,尽管从阀值或监控数据的绝对值衡量仍属正常。Core

BankingApplicationApplicationApplication网银DB2ESBJava/WASAIXRHELStorageNetwork多KPI相关性认知学习和预警某银行基于PI测试系统录入历史生产数据进行分析,发现除双11外,PI报出在11月13日的指标异常明显增多多KPI相关性认知学习和预警经上下文钻取发现,11月13日上午8:30

左右报出2个指标异常预警:失败交易数和交易总时长,均和贵金属营运系统相关,并列出多个因子指标,包括贵金属交易数和交易成功时长从缓慢然后到急剧升高,随后是交易失败数和队列等待数从缓慢到急剧升高,9:30左右2个MQ队列阻塞至上限,PI

Granger分析给出指标间因果关系,可明显看出导致该问题的主要原因为贵金属并发交易数过大导致交易处理时间过长。多KPI相关性认知学习和预警报警KPI:

失败交易数反向追溯验证:行内紧急事件协助群于11月13日10点左右报出贵金属交易阻塞超时,经查为当日市场营销推广活动导致,当时紧急之下处理手段为限流和三路应用重启贵金属成功交易数交易时长多KPI相关性认知学习和预警ScenarioVerified

ResultDetail#

1.

Scenario

1:

MM

busy

ratio

100%

=>

caused

by

transfersystem

abnormal

and

lockMM

resourcesOKThe

first

predictive

Pre-alarm

is

around

22

hours

before

MES

system

outage

happened.From

the

event

list,

AEL

showed

the

relative

events

happened

sequence.

And

service

diagnose

chartdisplay

the

multi-metrics

cause

relationship.From

the

result

of

AEL

point

out

Root

Cause

Trend.

Which

help

operator

shorten

trouble

shootingtime.#

2.

Scenario

1:

LC

causes

CIM

DB

performance

shiftOKAEL

event

list

and

service

diagnose

chart

display

LC

kpi

and

CPU

kpi

relationship.

But

did

not

showclearly

CPU

shift

cause

by

LC

KPI.#

3.

Scenario

2:

SPC

loader

performance

shift

(by

block

readhigh)OKProved

busy

ratio

metrics

and

other-metrics

causal

relationship

in

the

diagnose

chart.

From

AEL,

canbe

checked

by

manual

confirmed

which

were

relative

extra

box.#

4.

Scenario

3:

RTD

Apr

failure

cases

causes

OJS

hung

(adminserver

crash

willcauses,

but

others

will

not)OKOJS

job

busy

ration

metric

related

with

RTD

other

metrics

display

from

service

diagnose

chart

andrelative

alarm

information.

But

not

display

very

clearly

to

show

root

cause

from

RTD

admin

server.#5.Scenario4:AMHSeventcausesMCSabnormal(要看到系統找不出來rootcauses)OKFrom

service

chart,

cost

table

cnt,

transfer

ob

cnt

and

carrjob

his

cnt

value

should

be

positive

trend.But

5/29

2:20

AM

carrjob

his

cnt

metric

was

anormal.

Which

decreased

(should

increased).

Whichcan

find

out

the

some

causal

relation

from

IOAPI.#6.Scenario

5:FDC

掉點和CPU

loading

有關OKOnly

display

the

alarm

as

same

ccop.

Mostly

metrics

alarm

is

single.

Some

data

lost

not

found

thecause

from

FDC

server

CPU

loading

metric.#

7.

Scenario

6:

F12GigaRpeort_X86

7/1

fhopehs

replicationdelay

=>

target

partition

tables

not

be

createdOKGet

pre-alarm

and

multi-metric

chart.

Can

display

the

DB2

table

metric

related

delay

log

metric,

butone

table

related.

No

2

tables

related

delay

log

metrics

in

the

one

alarm.#

8.

Scenario

7:

F12GigaRpeort_X86

6/10

P4

fhopehsreplication

delay

=>

check

root

causesOKNot

found

from

GigaReport

replication

delay

log

which

would

causes

by

data

lock.

From

alarms

inAEL,

display

the

delay

log

metric

related

with

data

lock

metric.

Not

very

clearly

approve

cause

fromdata

lock.基于真实历史生产数据出现过的场景,验证了PI的关键能力:针对故障的提前预警能力多指标(KPI)相互影响关系的分析和定位能力一项和MCS相关的真实场景显示,PI在整个MES系统瘫痪前22小时就发出警告多KPI相关性认知学习和预警强大的实时分析能力和伸缩性分析引擎分析引擎分析引擎PI两大主要构件:分析引擎及数据库服务器预测型洞察力方案数据库服务器普通刀片服务器上每一分析引擎支持多达500,000个指标每5分钟粒度的数据一个全装备的关键业务应用可产生多达50,000个指标数据指标数据间的多个关系可由单个分析引擎发现从实时到历史的上下文挖掘–LogAnalysisBreadth

of

Searchable

DataSearch

across

all

of

your

IT

operational

data

to

quicklyresolve

issuesExpert

AdviceAny

competitorcan

isolate

problems.

IBM

helps

clientsquickly

resolve

themIBM

Open

Platform

for

Apache

HadoopEntitlement

and

24/7

support

for

the

IBM

Apache

Hadoopdistribution

for

long-term

data

storage

and

historical

analysisInsight

Pack

EcosystemCatalog

of

apps

to

provide

additional

insights

into

domain-specific

areas挑战:为了定位影响服务故障在应用和基础架构层面的原因,需要实时上下文挖掘分析海量结构化和半结构化运维数据提升运维自动化能力:日常操作自动化,交流与知识共享,以加速解决日常运维常见问题。迅速延伸已有Netcool方案能力,和快速投资回报的实现Create

and

Execute

Runbooksand

Automated

tasks

forfaster,

more

repeatable

andconsistent

problemresolutionRunbook

AutomationEscalation

policies,

notifications,andalertanalytics

for

drivinggreater

efficiencies

inrespondingto

IT

situations

and

customerproblemsAlert

NotificationDesigned

to

work

seamlessly

with

existing

solutions

work

with

on-prem

tools

with

the

benefits

of

CloudImproveROI

-immediate

value

add

and

improved

ROI

on

existing

investmentsContinuously

drive

operational

efficiency

and

knowledge

built-in

tracking

and

analysisof

automated

IT

operational

activity敏捷运维与知识共享–NOI

RBA

&

ANSIBMOperations

Analytics

onCloudNew

SaaS

offering

that

pulls

ITOA

capabilities

(Log

Analysis

and

PredictiveInsights)

into

one

experienceCapabilities

broken

into

‘cognitive’

microservicesCognitive

microservices

can

be

consumed

within

other

ITSM

offerings

to

providemarket

differentiationInstant

gratification

baseliningAdaptive

threshold

anomalyExpert

adviceParsing

and

annotationLogr

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论