Project Topic

Project number, Department, Student names, Email, Supervisor names

Generating explanations for a model trained on low-level data using human-interpretable features.

Explaining a model trained on low-level data with human-interpretable features.

Abstract (translated from Hebrew)

Model-agnostic explanation algorithms that attribute values to a model's features (such as SHAP and LIME) are widely used to explain the decisions of complex models, such as deep neural networks. However, such complex models achieve good performance when they are trained on low-level (or encoded) features, so in many cases the explanation algorithms produce explanations in terms of features that humans cannot understand. Recent studies have proposed methods that support the generation of human-interpretable explanations, but these methods are sometimes impractical because they require an invertible transformation function that maps the model's features to the human-interpretable features. In this work, we present Latent SHAP, a model-agnostic explanation algorithm that attributes values to the human-interpretable features (instead of the model's features), without requiring an invertible transformation function. We demonstrate the effectiveness of Latent SHAP through (1) a controlled experiment in which an invertible transformation function exists, which provides a robust quantitative evaluation of our method, and (2) classification of celebrity attractiveness (using the CelebA dataset), where no invertible transformation function exists, which provides a thorough qualitative evaluation of our method.

Abstract in English

Model-agnostic feature attribution algorithms (such as SHAP and LIME) are ubiquitous techniques for explaining the decisions of complex models, such as deep neural networks. However, since complex classification models achieve superior performance when trained on low-level (or encoded) features, in many cases the explanations generated by these algorithms are neither interpretable nor usable by humans. Methods proposed in recent studies that support the generation of human-interpretable explanations are impractical, because they require a fully invertible transformation function that maps the model's input features to the human-interpretable features. In this work, we introduce Latent SHAP, a black-box feature attribution framework that provides human-interpretable explanations without requiring a fully invertible transformation function. We demonstrate Latent SHAP's effectiveness using (1) a controlled experiment where invertible transformation functions are available, which enables robust quantitative evaluation of our method, and (2) celebrity attractiveness classification (using the CelebA dataset) where invertible transformation functions are not available, which enables thorough qualitative evaluation of our method.
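
To make the notion of model-agnostic feature attribution concrete, the following is a minimal sketch of exact Shapley value attribution over a model's own input features, computed by brute-force enumeration of feature coalitions. It is not Latent SHAP and not the SHAP library's implementation. The names `shapley_values` and `toy_model`, the all-zero baseline, and the example input are assumptions chosen for illustration only. The model is treated as a black box: only its outputs on perturbed inputs are used.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley attributions for model(x) relative to model(baseline).

    Features outside a coalition are replaced by their baseline value
    (an illustrative choice for representing "absent" features).
    Cost grows exponentially with the number of features.
    """
    n = len(x)

    def value(coalition):
        # Evaluate the black-box model with only the coalition's features "present".
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    # Hypothetical black-box model: one additive term and one interaction term.
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = shapley_values(toy_model, x, baseline)
print(attributions)
```

For this toy model the additive feature receives its full contribution and the two interacting features split their joint contribution equally, so the attributions sum to `toy_model(x) - toy_model(baseline)`. The attributions are expressed in terms of the model's input features, which is exactly the limitation the abstract describes when those inputs are low-level or encoded.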