Detecting, Managing, and Diagnosing Failures with FUSE
John Dunagan, Juhan Lee (MSN), Alec Wolman
WIP

2. Goals & Target Environment
- Improve the ability of large internet portals to gain insight into failures
- Non-goals:
  - masking failures
  - use machine learning to infer abnormal behavior

3. MSN Background
- Messenger, Hotmail, Search, many other "properties"
- Large (100 million users)
- Sources of complexity:
  - multiple data-centers
  - large # of machines
  - complex internal network topology
  - diversity of applications and software infrastructure

4. The Plan
- Detecting, managing, and diagnosing failures
  - Review MSN's current approaches
  - Describe our solution at a high level

5. Detecting Failures
- Monitor system availability with heartbeats
- Monitor application availability & quality of service using synthetic requests
- Customer complaints
  - Telephone, email
- Problems:
  - These approaches provide limited coverage
    - harder to catch failures that don't affect every request
  - Data on detected failures often lacks the detail needed to suggest a remedy:
    - which front end is flaky?
    - which app component caused end-user failure?

6. Managing Failures
- Definition:
  - Ability to prioritize failures
  - Detect component service degradation
  - Characterizing app-stability
  - Capacity planning
- When server "x" fails, what is the impact of this failure?
- Better use of ops and engineering resources
- Current approach: no systematic attempt to provide this functionality

7. Our solution (in 2 steps)
- Detecting and Managing Failures
- Step 1: Instrument applications to track user requests across the "service chain"
  - Each request is tagged with a unique id
  - Service chain is composed on-the-fly with help of app instrumentation
  - For each request:
    - Collect per-hop performance information
    - Collect per-request failure status
  - Centralized data collection

8. What kinds of failures?
- We can handle:
  - Machine failures
  - Network connectivity problems
- Most:
  - Misconfiguration
  - Application bugs
- But not all:
  - Application errors where the app itself doesn't detect that there is a problem

9. Diagnosing Failures
- Assigning responsibility to a specific hw or sw component
  - Insight into internals of a component
  - Cross component interactions
- Current approach: instrument applications
  - App-specific log messages
- Problems
  - High request rates = log rollover
  - Perceived overhead = detailed logging enabled during testing, disabled in production

10. Fuse Background
- FUSE (OSDI 2004): lightweight agreement on only one thing: whether or not a failure has occurred
- Lack of a positive ack = failure

11. Step 2: Conditional Logging
- Step 2: Implement "conditional logging" to significantly reduce the overhead of collecting detailed logs across different machines in the service chain
- Step 1 provides the ability to identify a request across all participants in the service chain; FUSE provides agreement on failure status across that chain
- While fate is undecided: detailed log messages are stored in main memory
  - Common-case overhead of logging is vastly reduced
- Once the fate of the service chain is decided, we discard app logs for successful requests and save logs for failures
  - Quantity of data generated is manageable when most requests are successful

12. Example
- Benefits:
  - FUSE allows monitoring of real transactions
    - All transactions, or a sampled subset to control overhead
  - When a request fails, FUSE provides an audit trail
    - How far did it get?
    - How long did each step take?
    - Any additional application-specific context
  - FUSE can be deployed incrementally

13. Issues
- Overload policy: need to handle bursts of failures without inducing more failures
- How much effort to make apps FUSE enabled?
- Are the right components FUSE enabled?
- Identifying and filtering false positives
- Tracking request flow is non-trivial with network load balancers

14. Status
- We've implemented FUSE for MSN, integrated with the ASP.NET rendering engine
- Testing in progress
- Roll-out at end of summer

15. Backups

16. FUSE is Easy to Integrate
- Example current code on Front End:
  ReceiveRequestFromClient() SendRequestToBackEnd();
- Example code on F
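
The conditional logging idea described under Step 2 above can be illustrated with a small single-process sketch. This is not the MSN implementation and not the FUSE API: the class `ConditionalLogger` and its methods `new_request_id`, `log`, and `resolve` are hypothetical names chosen for illustration, and the cross-machine agreement on failure status that FUSE provides is reduced here to a local `succeeded` flag. The sketch only shows the buffering policy: detailed messages stay in memory while a request's fate is undecided, are dropped on success, and are kept on failure.

```python
import uuid
from collections import defaultdict


class ConditionalLogger:
    """Hypothetical helper: buffers detailed log lines per request
    until the fate of that request is known."""

    def __init__(self):
        # request id -> log lines held in main memory while fate is undecided
        self._pending = defaultdict(list)
        # request id -> log lines kept because the request failed
        self.persisted = {}

    def new_request_id(self):
        # Step 1: tag each request with a unique id.
        return uuid.uuid4().hex

    def log(self, request_id, hop, message):
        # Record a detailed message for one hop of the service chain.
        self._pending[request_id].append(f"[{hop}] {message}")

    def resolve(self, request_id, succeeded):
        # Fate decided: discard logs for successes, keep them for failures.
        # In a real deployment this decision would come from agreement
        # across the service chain; here it is just a local flag (assumption).
        lines = self._pending.pop(request_id, [])
        if not succeeded:
            self.persisted[request_id] = lines

    def pending_count(self):
        return len(self._pending)


logger = ConditionalLogger()

ok_id = logger.new_request_id()
logger.log(ok_id, "front-end", "received request")
logger.log(ok_id, "back-end", "query completed")
logger.resolve(ok_id, succeeded=True)      # logs discarded

bad_id = logger.new_request_id()
logger.log(bad_id, "front-end", "received request")
logger.log(bad_id, "back-end", "timeout")
logger.resolve(bad_id, succeeded=False)    # logs kept as an audit trail
```

After this runs, only the failed request has an entry in `logger.persisted`, which mirrors the claim that the quantity of saved data stays manageable when most requests succeed.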