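2026 Artificial Intelligence Trainer (Level 2) Practical Skills Comprehensive Exam, with Answers and Explanations

I. Task Background

A city's "City Brain" project needs to optimize taxi dispatching during the morning rush hour. The project team has collected 210 million GPS trajectory records over 30 consecutive days. The fields are: vehicle ID, timestamp, longitude, latitude, occupancy status (0 = vacant, 1 = occupied), and instantaneous speed. The data is stored in a Hive partitioned table whose partition field is `dt STRING`. Model training and deployment must be completed under limited resources (single node, 16 vCPU / 64 GB, no GPU). Requirements:

1. Build a vacancy-duration prediction model that predicts whether a vehicle will have at least 5 min of continuous vacancy within the next 15 min;
2. Build a passenger-demand heat map with a spatial resolution of 500 m × 500 m and a temporal resolution of 5 min;
3. Design an interpretability report that explains the basis of the model's decisions to the traffic management authority;
4. Serve the model as a RESTful API with an average response time < 200 ms and a 99th-percentile latency < 500 ms.

II. Data Description

1. Raw table `raw_gps`: `vehicle_id STRING, ts BIGINT, lon DOUBLE, lat DOUBLE, occupancy TINYINT, speed FLOAT`
2. Auxiliary table `road_net`: `link_id STRING, src_lon DOUBLE, src_lat DOUBLE, dst_lon DOUBLE, dst_lat DOUBLE, length_m INT, level TINYINT`
3. Weather table `weather`: `dt STRING, hour TINYINT, temp DOUBLE, rain DOUBLE, wind DOUBLE`

III. Practical Tasks (100 points, 240 min)

[Module A: Data Governance and Feature Engineering] (25 points)

1. (5 points) Write Hive SQL that performs the following data cleaning:
a) Remove drift points: records with `speed > 120 km/h` that are also more than 1 km from the previous record;
b) Remove duplicate points: when a vehicle has several records in the same second, keep only the first;
c) Write the result to a new table `gps_clean` with the same partition field.

2. (6 points) Based on `gps_clean`, build the "trip segment" table `trip`.
Definition: an occupancy change 0→1 marks the start of a trip, and 1→0 marks its end.
Output fields: `vehicle_id, start_ts, end_ts, start_lon, start_lat, end_lon, end_lat, dist_km` (distance computed with the Haversine formula).
Requirement: use the Spark DataFrame API, give complete code (Python or Scala), and explain how to prevent data skew.

3. (7 points) Design the sample table `sample` for vacancy-duration prediction:
a) Label construction: for each vehicle and each 15-min window, label = 1 if the next 15 min contain at least 5 min of continuous vacancy, otherwise 0;
b) The features must include: historical vacancy ratio (past 30, 60 and 120 min); number of orders within 500 m of the current position in the past 30 min; weather, holiday, workday and time-of-day dummy variables; road-level share (match GPS points to `road_net` and compute the share of each road level).
Give complete Spark SQL + UDF code, and explain how to keep the positive-to-negative sample ratio close to 1:1.

4. (7 points) To reduce storage, compress the features of `sample`.
a) Select the top 30 features by information gain, and write the selection code based on a decision tree;
b) Bucket the continuous features, give a method for choosing the number of buckets (based on the MDL criterion), and write a PySpark implementation.

[Module B: Model Training and Tuning] (25 points)

5. (8 points) Train a binary classifier with LightGBM. Requirements:
a) A custom evaluation function `weighted_f1`, weighted by business cost: FN cost = 5 × FP cost;
b) Hyperparameter search with Optuna over `num_leaves ∈ [20, 80], learning_rate ∈ [0.01, 0.3], max_depth ∈ [6, 12], min_child_samples ∈ [50, 500]`;
c) Give complete Python code, and explain how early stopping and cross-validation are set up.

6. (7 points) For class imbalance, run a comparison of:
a) `scale_pos_weight`;
b) `is_unbalance`;
c) Focal Loss (custom loss, γ = 2).
Report F1, Recall and Specificity of the three schemes on the validation set, and use McNemar's test to decide whether schemes b and c differ significantly (α = 0.05).

7. (5 points) Model interpretability:
a) Compute SHAP values and output a bar chart of the global top 10 importances;
b) Give partial dependence plot code for the "rainy day" feature;
c) Use LIME to explain one sample predicted as 1, visualized on a map (a static PNG is enough) with the key road segments marked.

8. (5 points) Model compression:
a) Use knowledge distillation to distill the LightGBM teacher model above into a student model of < 1 MB (TinyGBM or NN) with an inference latency < 50 ms;
b) Give a search script for the temperature parameter T, and explain how the teacher and student prediction distributions are aligned.

[Module C: Online Serving and Performance Optimization] (20 points)

9. (6 points) Wrap the model as a RESTful API:
a) Use FastAPI. Example input JSON:
```json
{"vehicle_id": "V1001", "ts": 1704067200, "lon": 120.15, "lat": 30.25, "hist": [0.8, 0.7, 0.6]}
```
b) Output:
```json
{"prob": 0.732, "shap_base": -1.2, "shap_contrib": {"rain": 0.4, "hist_30": 0.2}}
```
c) Provide a `Dockerfile` and a `docker-compose.yml`; image size < 200 MB;
d) Use a `locust` script to simulate 1000 concurrent users with an average response < 200 ms and P99 < 500 ms, and give the command used to capture the load-test report.

10. (5 points) Design a hot-update scheme for the model:
a) Use a "shadow loading" strategy in which the old and new models coexist, with zero downtime;
b) Give Redis-based A/B routing logic that supports gradual rollout by the trailing digits of `vehicle_id`;
c) Explain how to roll back, and give a rollback script.

11. (4 points) Monitoring and alerting:
a) Use Prometheus + Grafana to monitor QPS, latency, model drift (PSI) and score distribution;
b) Write the PromQL statement that raises an alert when PSI > 0.2;
c) Give the Grafana dashboard JSON (core panels only).

12. (5 points) Edge deployment:
a) Convert the model to ONNX and run on-device inference on an ARM64 Android phone with ONNX Runtime Mobile;
b) Give the Java (Android) calling code, and explain how to quantize the weights to INT8;
c) Measure on a Pixel 6 that a single inference takes < 30 ms, and give the `adb shell` command used to capture the log.

[Module D: Business Analysis and Reporting] (15 points)

13. (8 points) Based on the model output, generate the morning rush hour (07:00 to 09:00) demand heat map:
a) 500 m spatial grid, 5-min time window, heat = expected number of predicted passenger requests;
b) Use H3 (resolution 8) instead of a rectangular grid and explain the advantages;
c) Output GeoJSON, give the Python code, and take a screenshot with Kepler.gl.

14. (7 points) Write a one-page A4 interpretability report (bilingual, Chinese and English) that explains to the traffic management authority:
a) Why the model raises the vacancy warning for "rainy day + school surroundings";
b) The SHAP heat map and the road-level share explanation;
c) That false alarms are concentrated on road sections under construction, and how the data should be iterated.

[Module E: Troubleshooting and Ethical Compliance] (15 points)

15. (5 points) One day at 08:30 the model PSI jumps to 0.35, and investigation shows that the distribution of the `rain` feature has shifted to the right.
a) Give a troubleshooting checklist (at least 5 items);
b) Write SQL that verifies whether the weather table is abnormal;
c) If a switch of the weather data source is confirmed, give a correction scheme (without retraining the model).

16. (5 points) Privacy compliance:
a) Raw GPS data is sensitive and must be de-identified. Write differential-privacy perturbation code for `lon`/`lat` (ε = 1, Laplace mechanism);
b) Explain how to verify that the F1 drop after perturbation is < 2%;
c) Give a `pydp` example.

17. (5 points) Ethics review:
a) The model may worsen the difficulty of hailing a taxi in the suburbs; write the code that computes a fairness metric (Equal Opportunity Difference);
b) If the metric is > 0.1, give a mitigation strategy: reweighting + resampling;
c) Explain how to fill in the key fields of the Model Card.

IV. Answers and Explanations

[A1]
```sql
-- Dynamic-partition insert needs non-strict mode in Hive.
SET hive.exec.dynamic.partition.mode=nonstrict;

WITH dup AS (
  SELECT *,
         -- Same vehicle, same second: keep one row (arbitrary, ties have no order).
         ROW_NUMBER() OVER (PARTITION BY vehicle_id, ts ORDER BY rand()) AS rn
  FROM raw_gps
),
speed_dist AS (
  SELECT *,
         LAG(lon) OVER (PARTITION BY vehicle_id ORDER BY ts) AS prev_lon,
         LAG(lat) OVER (PARTITION BY vehicle_id ORDER BY ts) AS prev_lat,
         LAG(ts)  OVER (PARTITION BY vehicle_id ORDER BY ts) AS prev_ts
  FROM dup
  WHERE rn = 1
),
filtered AS (
  SELECT vehicle_id, ts, lon, lat, occupancy, speed, dt
  FROM speed_dist
  WHERE prev_lon IS NULL            -- first point of a vehicle: nothing to compare with
     OR NOT (
          speed > 120 * 1000 / 3600 -- 120 km/h, assuming speed is stored in m/s
          AND 2 * 6371 * asin(sqrt(
                pow(sin((lat - prev_lat) * pi() / 180 / 2), 2)
              + cos(lat * pi() / 180) * cos(prev_lat * pi() / 180)
              * pow(sin((lon - prev_lon) * pi() / 180 / 2), 2))) > 1
        )
)
INSERT OVERWRITE TABLE gps_clean PARTITION (dt)
SELECT vehicle_id, ts, lon, lat, occupancy, speed, dt
FROM filtered;
```
Explanation: deduplicate first, then compute the distance between adjacent points and drop records that are both over the speed limit and more than 1 km away from the previous point. The distance uses the Haversine formula, in km. If `speed` is stored in km/h, compare it with 120 directly.

[A2] PySpark code:
```python
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("vehicle_id").orderBy("ts")
df = (spark.table("gps_clean")
      .withColumn("prev_occ", F.lag("occupancy").over(w))
      .withColumn("start", F.when((F.col("occupancy") == 1) & (F.col("prev_occ") == 0), 1).otherwise(0))
      .withColumn("end", F.when((F.col("occupancy") == 0) & (F.col("prev_occ") == 1), 1).otherwise(0)))

starts = df.filter(F.col("start") == 1).select(
    "vehicle_id", F.col("ts").alias("start_ts"),
    F.col("lon").alias("start_lon"), F.col("lat").alias("start_lat"))
# Rename every column of `ends` so the self-join has no ambiguous names.
ends = df.filter(F.col("end") == 1).select(
    F.col("vehicle_id").alias("e_vid"), F.col("ts").alias("end_ts"),
    F.col("lon").alias("end_lon"), F.col("lat").alias("end_lat"))

# Haversine terms (straight-line distance between trip start and trip end).
dlat = F.radians(F.col("end_lat") - F.col("start_lat"))
dlon = F.radians(F.col("end_lon") - F.col("start_lon"))
a = (F.pow(F.sin(dlat / 2), 2)
     + F.cos(F.radians("start_lat")) * F.cos(F.radians("end_lat")) * F.pow(F.sin(dlon / 2), 2))

# Pair each trip start with the earliest later trip end of the same vehicle.
pair_w = Window.partitionBy("vehicle_id", "start_ts").orderBy("end_ts")
trip = (starts.join(ends, (F.col("vehicle_id") == F.col("e_vid"))
                          & (F.col("end_ts") > F.col("start_ts")), "inner")
        .withColumn("rn", F.row_number().over(pair_w))
        .filter(F.col("rn") == 1)
        .withColumn("dist_km", 2 * 6371 * F.asin(F.sqrt(a)))
        .select("vehicle_id", "start_ts", "end_ts", "start_lon", "start_lat",
                "end_lon", "end_lat", "dist_km"))
trip.write.mode("overwrite").saveAsTable("trip")
```
Preventing skew: salt `vehicle_id` (append a random suffix to the key) and aggregate in two stages, so that hot vehicles are spread over several partitions.

[A3]
```sql
-- h3_latlng_to_cell is assumed to be registered as a UDF; it is not a Spark built-in.
-- weather.dt is assumed to be formatted as 'yyyy-MM-dd'.
CREATE TABLE sample AS
WITH base AS (
  SELECT vehicle_id, ts, occupancy, lon, lat,
         CAST(FLOOR(ts / 900) * 900 AS BIGINT) AS win_start   -- 15-min window
  FROM gps_clean
),
wins AS (
  SELECT DISTINCT vehicle_id, win_start FROM base
),
runs AS (   -- maximal runs of consecutive vacant points (gaps and islands)
  SELECT vehicle_id, MIN(ts) AS run_start, MAX(ts) AS run_end
  FROM (
    SELECT vehicle_id, ts, occupancy,
           ROW_NUMBER() OVER (PARTITION BY vehicle_id ORDER BY ts)
         - ROW_NUMBER() OVER (PARTITION BY vehicle_id, occupancy ORDER BY ts) AS grp
    FROM base
  ) t
  WHERE occupancy = 0
  GROUP BY vehicle_id, grp
),
label AS (  -- 1 if a vacant run covers at least 300 s of the next 15 min
  SELECT w.vehicle_id, w.win_start,
         MAX(CASE WHEN r.run_start IS NOT NULL
                   AND LEAST(r.run_end, w.win_start + 1800)
                     - GREATEST(r.run_start, w.win_start + 900) >= 300
                  THEN 1 ELSE 0 END) AS label
  FROM wins w
  LEFT JOIN runs r
    ON r.vehicle_id = w.vehicle_id
   AND r.run_end   >= w.win_start + 900
   AND r.run_start <  w.win_start + 1800
  GROUP BY w.vehicle_id, w.win_start
),
feat_occ AS (  -- share of vacant points over the past 30 / 60 / 120 min
  SELECT w.vehicle_id, w.win_start,
         AVG(CASE WHEN b.ts >= w.win_start - 1800 THEN IF(b.occupancy = 0, 1.0, 0.0) END) AS occ_ratio_30,
         AVG(CASE WHEN b.ts >= w.win_start - 3600 THEN IF(b.occupancy = 0, 1.0, 0.0) END) AS occ_ratio_60,
         AVG(IF(b.occupancy = 0, 1.0, 0.0)) AS occ_ratio_120
  FROM wins w
  JOIN base b
    ON b.vehicle_id = w.vehicle_id
   AND b.ts >= w.win_start - 7200 AND b.ts < w.win_start
  GROUP BY w.vehicle_id, w.win_start
),
pos AS (    -- H3 cell (resolution 8) of the last point in each window
  SELECT vehicle_id, win_start, h3_latlng_to_cell(lat, lon, 8) AS h8
  FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY vehicle_id, win_start ORDER BY ts DESC) AS rn
        FROM base) t
  WHERE rn = 1
),
feat_order AS (  -- trips (table `trip` from A2) started in the same cell in the past 30 min
  SELECT p.vehicle_id, p.win_start, COUNT(t.start_ts) AS order_cnt_500m_30min
  FROM pos p
  LEFT JOIN trip t
    ON h3_latlng_to_cell(t.start_lat, t.start_lon, 8) = p.h8
   AND t.start_ts >= p.win_start - 1800 AND t.start_ts < p.win_start
  GROUP BY p.vehicle_id, p.win_start
)
SELECT f.vehicle_id, f.win_start,
       f.occ_ratio_30, f.occ_ratio_60, f.occ_ratio_120,
       o.order_cnt_500m_30min,
       w.temp, w.rain, w.wind,
       CASE WHEN w.dt IN ('2026-01-01', '2026-04-05', '2026-05-01') THEN 1 ELSE 0 END AS holiday,
       CASE WHEN dayofweek(from_unixtime(f.win_start)) IN (1, 7) THEN 0 ELSE 1 END AS workday,
       hour(from_unixtime(f.win_start)) AS hour_of_day,
       l.label
FROM feat_occ f
JOIN label l
  ON f.vehicle_id = l.vehicle_id AND f.win_start = l.win_start
LEFT JOIN feat_order o
  ON f.vehicle_id = o.vehicle_id AND f.win_start = o.win_start
LEFT JOIN weather w
  ON w.dt = from_unixtime(f.win_start, 'yyyy-MM-dd')
 AND w.hour = hour(from_unixtime(f.win_start));
```
The 500 m neighbourhood is approximated by the H3 resolution-8 cell that contains the vehicle, and an order is counted as a trip start from `trip`. Keeping the ratio: down-sample the negatives (sampling rate 0.3) and correct with weights. Because the negatives were thinned, the correction weights each retained negative by 1/0.3, which in LightGBM is equivalent to `scale_pos_weight = 0.3`; setting `scale_pos_weight = 1/0.3` would up-weight the positives a second time.

[A4]
```python
import numpy as np
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

df = spark.table("sample")
cols = [c for c in df.columns if c not in {'vehicle_id', 'win_start', 'label'}]
va = VectorAssembler(inputCols=cols, outputCol="features")
# impurity="entropy" makes the split criterion information gain (the default is gini).
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features",
                            maxDepth=5, impurity="entropy")
model = dt.fit(va.transform(df))
importances = model.featureImportances.toArray()
sorted_indices = np.argsort(importances)[::-1]
top30 = [cols[i] for i in sorted_indices[:30]]
```
Bucketing with MDL:
```python
import numpy as np
from pyspark.ml.feature import QuantileDiscretizer

def choose_buckets(df, col_name, max_buckets=50):
    """Pick the bucket count that minimizes an MDL-style score for one column."""
    best, min_score = 0, float("inf")
    for b in range(2, max_buckets + 1):
        dis = QuantileDiscretizer(numBuckets=b, inputCol=col_name, outputCol="bucket")
        bucketed = dis.fit(df).transform(df)
        # MDL score = -log-likelihood + 0.5 * k * log(n).
        # The -log-likelihood is n * H(label | bucket), in bits.
        pdf = bucketed.groupBy("bucket", "label").count().toPandas()
        n = pdf["count"].sum()
        bucket_total = pdf.groupby("bucket")["count"].transform("sum")
        neg_loglik = -np.sum(pdf["count"] * np.log2(pdf["count"] / bucket_total))
        score = neg_loglik + 0.5 * b * np.log2(n)
        if score < min_score:
            best, min_score = b, score
    return best
```
The score uses the label-conditional entropy rather than the plain histogram entropy: the histogram entropy of quantile buckets grows with the bucket count, so minimizing it together with the penalty would always return 2.

[B5]
```python
import numpy as np
import lightgbm as lgb
import optuna
from sklearn.metrics import f1_score

# X_train, y_train, X_val, y_val are assumed to be prepared beforehand.

def weighted_f1(preds, eval_data):
    y = eval_data.get_label()
    w = np.where(y == 1, 5, 1)          # FN cost = 5 x FP cost
    p = (preds > 0.5).astype(int)
    return 'wF1', f1_score(y, p, sample_weight=w), True   # True: higher is better

def objective(trial):
    param = {
        'objective': 'binary',
        'metric': 'None',               # rely on the custom metric only
        'num_leaves': trial.suggest_int('num_leaves', 20, 80),
        'learning_rate': trial.suggest_float('lr', 0.01, 0.3, log=True),
        'max_depth': trial.suggest_int('depth', 6, 12),
        'min_child_samples': trial.suggest_int('mcs', 50, 500),
        'verbose': -1,
    }
    dtrain = lgb.Dataset(X_train, label=y_train)
    dval = lgb.Dataset(X_val, label=y_val, reference=dtrain)
    # LightGBM 4.x takes early stopping as a callback, not as a keyword argument.
    model = lgb.train(param, dtrain, num_boost_round=1000,
                      valid_sets=[dval], feval=weighted_f1,
                      callbacks=[lgb.early_stopping(stopping_rounds=100, verbose=False)])
    return model.best_score['valid_0']['wF1']

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
```
Early stopping: training stops when `wF1` on the validation set has not improved for 100 consecutive rounds. Cross-validation uses 5-fold StratifiedKFold.

[B6] Example results (mean):
scale_pos_weight: F1 = 0.742, Recall = 0.688, Specificity = 0.811
is_unbalance: F1 = 0.751, Recall = 0.711, Specificity = 0.798
Focal Loss: F1 = 0.765, Recall = 0.732, Specificity = 0.802
McNemar's test:
```python
import pandas as pd
from statsmodels.stats.contingency_tables import mcnemar

# pred_b / pred_c: validation probabilities of scheme b (is_unbalance) and c (Focal Loss).
# McNemar compares the two models on the same samples, so the 2x2 table is
# "b correct?" versus "c correct?", not label versus prediction.
correct_b = (pred_b > 0.5).astype(int) == y_val
correct_c = (pred_c > 0.5).astype(int) == y_val
tb = pd.crosstab(correct_b, correct_c)
res = mcnemar(tb, exact=True)
print(res.statistic, res.pvalue)
# Example: p = 0.018 < 0.05, so the null hypothesis is rejected.
```
With p = 0.018 < 0.05 the two schemes differ significantly; since Focal Loss also has the higher F1, it is preferred over `is_unbalance`.

[B7] Global SHAP:
```python
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_val)   # some shap versions return one array per class
shap.summary_plot(shap_values, X_val, plot_type="bar", max_display=10)
```
Partial dependence (this answer uses a SHAP dependence plot in place of a classical PDP):
```python
shap.dependence_plot("rain", shap_values, X_val, interaction_index="temp")
```
LIME map:
```python
import numpy as np
import lime.lime_tabular

def predict_proba(x):
    # A LightGBM Booster returns P(label = 1); LIME expects one column per class.
    p = model.predict(x)
    return np.column_stack([1 - p, p])

l = lime.lime_tabular.LimeTabularExplainer(np.asarray(X_train), feature_names=cols,
                                           mode='classification')
exp = l.explain_instance(np.asarray(X_val)[0], predict_proba, num_features=10)
fig = exp.as_pyplot_figure()
# Highlight the road segments that correspond to the top 3 features, then save as PNG.
```

[B8] Knowledge distillation:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentNN(nn.Module):
    def __init__(self, n_in):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_in, 64), nn.ReLU(),
                                 nn.Linear(64, 32), nn.ReLU(),
                                 nn.Linear(32, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

# X, y: float tensors; bst: the LightGBM teacher. Optimizer and lr are assumptions.
T = 5.0
student = StudentNN(X.shape[1])
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
t_logit = torch.tensor(bst.predict(X.numpy(), raw_score=True), dtype=torch.float32)
soft_t = torch.sigmoid(t_logit / T)          # temperature-softened teacher probability

for epoch in range(100):
    opt.zero_grad()
    s_out = student(X)                       # student logits
    hard = F.binary_cross_entropy_with_logits(s_out, y, weight=5 * y + 1)
    # Binary case: cross-entropy against the soft targets equals the KL divergence
    # up to a constant, and replaces the softmax over the batch dimension.
    soft = F.binary_cross_entropy_with_logits(s_out / T, soft_t) * (T * T)
    loss = hard + soft
    loss.backward()
    opt.step()
```
Temperature search: scan T over [1, 10] in steps of 0.5 and keep the T with the lowest distillation loss on the validation set.

[C9] FastAPI core:
```python
from fastapi import FastAPI
import lightgbm as lgb
import numpy as np

app = FastAPI()
bst = lgb.Booster(model_file='model.txt')
cols = bst.feature_name()

@app.post("/predict")
def predict(req: dict):
    # The request must also carry temp, rain and wind (see the Locust payload).
    feat = np.array([req['hist'] + [req['temp'], req['rain'], req['wind']]])
    prob = float(bst.predict(feat)[0])
    # pred_contrib=True returns per-feature SHAP contributions plus the base value
    # in the last column, so the service does not need the shap package.
    contrib = bst.predict(feat, pred_contrib=True)[0]
    return {"prob": prob,
            "shap_base": float(contrib[-1]),
            "shap_contrib": dict(zip(cols, map(float, contrib[:-1])))}
```
Dockerfile:
```dockerfile
FROM python:3.10-slim
# The LightGBM wheel needs the OpenMP runtime, which the slim image lacks.
RUN apt-get update && apt-get install -y --no-install-recommends libgomp1 \
    && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install -r requirements.txt --no-cache-dir
COPY app.py model.txt /
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```
Image size control: `python:3.10-slim` + `uvicorn` + `lightgbm` without plotting dependencies, 187 MB in the end.
Locust:
```python
from locust import HttpUser, task

class Api(HttpUser):
    @task
    def p(self):
        self.client.post("/predict", json={
            "vehicle_id": "V1", "ts": 1704067200, "lon": 120.15, "lat": 30.25,
            "hist": [0.8, 0.7, 0.6], "temp": 18, "rain": 2.1, "wind": 3})
```
Run (assuming the service listens on localhost:8000; the script is saved as `locustfile.py` because a file named `locust.py` would shadow the package): `locust -f locustfile.py --headless --host http://localhost:8000 -u 1000 -r 100 -t 60s`

[C10] Redis routing:
```python
def route(vid):
    # The numeric tail of vehicle_id decides the bucket: 20% of traffic goes to model B.
    return "B" if int(vid[1:]) % 100 < 20 else "A"
```
20% of the traffic is routed to the new model. Rollback: change the routing function to return "A" for all traffic and reload the configuration.

[C11] PromQL. PSI cannot be read off the score histogram with a single quantile query, so it is assumed here that the service exports it as its own series `psi`. The median of the score distribution can be tracked alongside it:
```
histogram_quantile(0.5, sum by (le) (rate(score_dist_bucket[5m])))
```
Alert:
```
expr: psi > 0.2
for: 5m
labels:
  severity: critical
```

[C12] ONNX conversion:
```python
import onnxmltools
from onnxmltools.convert.common.data_types import FloatTensorType

n_feat = bst.num_feature()
onnx_model = onnxmltools.convert_lightgbm(
    bst, initial_types=[('float_input', FloatTensorType([None, n_feat]))])
onnxmltools.utils.save_model(onnx_model, "model.onnx")
```
INT8 quantization:
```python
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic("model.onnx", "8.onnx", weight_type=QuantType.QInt8)
```
Android Java:
```java
OrtEnvironment env = OrtEnvironment.getEnvironment();
OrtSession sess = env.createSession(modelPath, new OrtSession.SessionOptions());
float[][] input = {...}
```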
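The Haversine expression that A1 and A2 embed in SQL and PySpark is easier to verify in isolation. The following is a minimal sketch in plain Python, assuming the same mean Earth radius of 6371 km as the answers; the function name `haversine_km` is illustrative and does not come from the original.

```python
import math

EARTH_RADIUS_KM = 6371.0  # same mean Earth radius as used in A1 and A2


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (lon, lat) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


if __name__ == "__main__":
    # One degree of longitude on the equator is roughly 111.2 km.
    print(round(haversine_km(0.0, 0.0, 1.0, 0.0), 1))
```

Comparing a few hand-picked point pairs against this function is a cheap way to catch a dropped multiplication sign or a degree/radian mix-up before running the SQL on the full table.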
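Tasks 11 and 15 both rely on the Population Stability Index (PSI) with an alert threshold of 0.2, but the source does not show how PSI is computed. The sketch below is one common formulation, assuming quantile bins taken from a reference sample; the function name `psi`, the bin count and the smoothing constant `eps` are assumptions, not part of the original.

```python
import numpy as np


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index of `actual` against the reference sample `expected`."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior bin edges are the quantiles of the reference sample.
    inner = np.unique(np.quantile(expected, np.linspace(0, 1, n_bins + 1))[1:-1])
    k = len(inner) + 1
    # searchsorted assigns every value to one of k bins, including both open tails.
    e_idx = np.searchsorted(inner, expected, side="right")
    a_idx = np.searchsorted(inner, actual, side="right")
    e_frac = np.bincount(e_idx, minlength=k) / len(expected)
    a_frac = np.bincount(a_idx, minlength=k) / len(actual)
    # Clip empty bins so that the logarithm stays finite.
    e_frac = np.clip(e_frac, eps, None)
    a_frac = np.clip(a_frac, eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(0.0, 1.0, 5000)
    print(psi(ref, ref))        # identical samples
    print(psi(ref, ref + 1.0))  # distribution shifted to the right
```

A value computed this way per feature (for example `rain`) or per score window could then be exported as a gauge, which is what a threshold rule such as PSI > 0.2 would evaluate.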
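Task 17 asks for the Equal Opportunity Difference, which is the gap in true-positive rate between two groups. A minimal sketch follows, assuming binary labels and predictions and a binary group flag (for instance 1 = suburban, 0 = urban); the grouping and the function name are illustrative assumptions.

```python
import numpy as np


def equal_opportunity_difference(y_true, y_pred, group):
    """TPR of group 1 minus TPR of group 0 (0 means equal opportunity)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)

    def tpr(mask):
        positives = (y_true == 1) & mask
        if positives.sum() == 0:
            raise ValueError("group has no positive samples")
        return float((y_pred[positives] == 1).mean())

    return tpr(group == 1) - tpr(group == 0)


if __name__ == "__main__":
    y_true = [1, 1, 1, 1, 0, 0]
    y_pred = [1, 0, 1, 1, 0, 1]
    group = [1, 1, 0, 0, 1, 0]
    print(equal_opportunity_difference(y_true, y_pred, group))
```

The task's threshold of 0.1 would normally be applied to the absolute value of this difference, since the sign only indicates which group is favoured.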