Project Topic

Project number, department, student names, email, advisor names

Building an Embedding-Based Meta-Model for Defect Prediction

Embedding-Based Meta Model for Cross-project Defect Prediction

Hebrew Abstract

Defect prediction is the task of identifying code sections that are likely to be faulty and to cause system failures. Machine learning techniques trained on historical data can be used to identify such code sections.
Many software projects do not have enough history to support defect prediction. This gave rise to the field of cross-project defect prediction: models are trained on a set of projects that have history and then used to predict defects in a project that has none. A leading family of methods falls under transfer learning, in which a source project has historical data and a target project has no history.
Several kinds of transfer learning methods are used for cross-project defect prediction: some weight the features, others weight the instances, and so on. The drawback is that different methods have been found to work best for different software projects, so it is hard to commit to one single method for predicting defects.
In this research we propose a meta-learning approach to defect prediction. The premise of meta-learning is that no single model fits every existing software project; each project has a learning method that works best for it.
Our idea is to build a meta-model that decides, for a given software project, which transfer learning method is the most accurate for it, and thereby improves defect prediction results. This approach poses two challenges: extracting the features and extracting the labels. We extract the features in two ways.
The first uses statistical meta-features derived from the features that describe the code sections in the project, for example the average number of lines of code. The second creates an embedding for the project: we used a pre-trained model that was trained for defect prediction to generate an embedding for each code section in the project, and then took the maximum value of each embedding component across the code sections as the meta-features.
To extract the labels, we used the F1 scores achieved by several transfer learning models to determine which method is best for each project.
Different versions of the same project can be very similar in their embedded representation and still have different best-performing models. To improve the meta-model's learning, we assigned labels by majority rule: for training, we use the label that was best for the majority of the project's versions.
We ran several experiments on 802 versions of 257 software projects, using 5-fold cross-validation, oversampling to balance the data, and XGBoost as our meta-algorithm. The results show that the embedding-based representation is significantly better than the representation based on statistical meta-features. In addition, our model is significantly better than the existing models in terms of F1 score.

English Abstract

Bug prediction is a software development task that aims to identify potentially faulty code sections that can lead to system failures. Many software projects lack the historical data needed for accurate bug prediction. To address this, the field of cross-project bug prediction has emerged: transfer learning methods train models on projects with historical data and apply them to projects without such history.
The transfer learning methods used for cross-project defect prediction fall into several types: some weight the features, some weight the instances, and so on. The downside is that different methods have been found to be most suitable for different software projects, so it is hard to select a single algorithm for bug prediction.
In this research we propose a meta-learning approach to defect prediction. The premise of meta-learning is that no single model fits every software project; each project has a learning method that works best for it. We learn which learning method is best for a given project.
Our meta-model aims to improve bug prediction results by selecting the most suitable transfer learning method for each software project. This approach poses two challenges: feature extraction and labeling. We explore two feature extraction approaches: statistical meta-features based on code section characteristics, for example the average number of lines of code, and embeddings generated for each code section by a pre-trained bug prediction model. To turn the embeddings into meta-features, we took the maximum value of each embedding component across the project's code sections.
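The max-per-component step can be sketched as follows. This is a minimal illustration in plain Python, not the project's actual code: the function name `max_pool_embeddings` and the toy 4-dimensional vectors are assumptions made for the example, and the real embeddings would come from the pre-trained model.

```python
from typing import List, Sequence


def max_pool_embeddings(section_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Collapse per-section embeddings into one project-level vector.

    Each inner sequence is the embedding of one code section. The result
    holds, for every embedding component, the maximum over all sections.
    """
    if not section_embeddings:
        raise ValueError("a project needs at least one code-section embedding")
    dim = len(section_embeddings[0])
    if any(len(e) != dim for e in section_embeddings):
        raise ValueError("all embeddings must have the same dimension")
    # zip(*rows) iterates over components (columns) instead of sections (rows).
    return [max(component) for component in zip(*section_embeddings)]


# Toy data (made up): three code sections with 4-dimensional embeddings.
project = [
    [0.1, -0.3, 0.7, 0.0],
    [0.4, -0.9, 0.2, 0.5],
    [-0.2, 0.1, 0.6, 0.3],
]
meta_features = max_pool_embeddings(project)
print(meta_features)
```

The output vector has the same length as a single embedding no matter how many code sections the project contains, which is what lets projects of different sizes share one meta-feature space.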
We derive the labels from the F1 scores achieved by several transfer learning models: the best-scoring method is the label. Different versions of the same project can be very similar in their embedded representation and still have different best models. To address this, we used majority voting, labeling a project with the method that is best for most of its versions.
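A small sketch of this labeling step, under stated assumptions: the project names, method names (`method_x`, `method_y`), and F1 values below are invented for illustration, and tie-breaking (not specified in the text) simply follows the first-seen order that `Counter.most_common` uses.

```python
from collections import Counter, defaultdict
from typing import Dict, Tuple

# (project, version) -> {transfer learning method: F1 score}. All values are made up.
f1_scores: Dict[Tuple[str, str], Dict[str, float]] = {
    ("proj_a", "1.0"): {"method_x": 0.41, "method_y": 0.38},
    ("proj_a", "1.1"): {"method_x": 0.44, "method_y": 0.40},
    ("proj_a", "2.0"): {"method_x": 0.35, "method_y": 0.39},
    ("proj_b", "0.9"): {"method_x": 0.30, "method_y": 0.52},
}


def best_method(scores: Dict[str, float]) -> str:
    """Return the method with the highest F1 score for one project version."""
    return max(scores, key=scores.get)


def majority_labels(all_scores: Dict[Tuple[str, str], Dict[str, float]]) -> Dict[str, str]:
    """Label each project with the method that wins for most of its versions."""
    votes: Dict[str, Counter] = defaultdict(Counter)
    for (project, _version), scores in all_scores.items():
        votes[project][best_method(scores)] += 1
    # Assumption: ties go to the method that was counted first.
    return {project: counter.most_common(1)[0][0] for project, counter in votes.items()}


project_label = majority_labels(f1_scores)
# Every version inherits its project's majority label for meta-model training.
version_label = {key: project_label[key[0]] for key in f1_scores}
print(version_label)
```

In this toy data, version 2.0 of `proj_a` scores best with `method_y` on its own but is still labeled `method_x`, because `method_x` wins for two of the three versions.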
Our experiments, conducted on 802 versions of 257 software projects, employ 5-fold cross-validation, oversampling to balance the data, and XGBoost as the meta-classifier. The results show that the embedding-based representation significantly outperforms statistical meta-features, and that our meta-model significantly outperforms existing models in terms of F1 score.
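The evaluation protocol can be sketched in plain Python as below. Several points are assumptions rather than details given in the abstract: the folds are stratified by label, the oversampling is simple random duplication, and it is applied only to the training part of each split so that duplicated samples never reach the held-out fold. The classifier itself is left out; in the text it is XGBoost.

```python
import random
from collections import Counter, defaultdict
from typing import List, Sequence


def stratified_folds(labels: Sequence[str], k: int, seed: int = 0) -> List[List[int]]:
    """Split sample indices into k folds, spreading each class across the folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds: List[List[int]] = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the class members round-robin, continuing where the last class stopped.
        for i, idx in enumerate(indices):
            folds[(offset + i) % k].append(idx)
        offset += len(indices)
    return folds


def random_oversample(indices: Sequence[int], labels: Sequence[str], seed: int = 0) -> List[int]:
    """Duplicate minority-class samples at random until every class matches the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    target = max(len(members) for members in by_class.values())
    balanced: List[int] = []
    for label in sorted(by_class):
        members = by_class[label]
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


# Toy meta-dataset (made up): 20 project versions with imbalanced labels.
labels = ["method_x"] * 14 + ["method_y"] * 6
folds = stratified_folds(labels, k=5)
splits = []
for test_fold in folds:
    held_out = set(test_fold)
    train = [i for i in range(len(labels)) if i not in held_out]
    train_balanced = random_oversample(train, labels)
    # A meta-classifier would be fit on train_balanced and scored on test_fold here.
    splits.append((train_balanced, test_fold))

for train_balanced, test_fold in splits:
    print(Counter(labels[i] for i in train_balanced), len(test_fold))
```

Balancing inside each training split is a design choice of this sketch; oversampling before splitting would let copies of the same sample land on both sides of the split and inflate the measured score.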