Project Topic

Project number | Department | Student names | Email | Supervisor names

Shortening Documents Written in LaTeX

LaTeX - Paper Reduction

Abstract (translated from Hebrew)

Compilation-based text editors, such as Overleaf for LaTeX, let users enter text together with commands; after compilation, the text is converted into a typeset document, such as a PDF. A significant drawback of these editors, however, is that the final document cannot be previewed without compiling. As a result, when users need to shorten a document because of a page limit, they often resort to trial and error: they make changes in the text editor and check, for each change, whether the compiled document meets the page limit. Moreover, changes made to reduce the document's length may harm its appearance.
This problem poses the following challenge: which changes should be made to the document so that the overflowing elements fit and the page limit is met? In this project, we propose machine learning models to address this challenge.
As a first step, we created a repository of 329,000 varied TeX files and their corresponding PDF documents, in which each document was modified by applying different operators. Using feature extraction techniques, we converted each document into a vector of numeric values, producing a dataset on which machine learning models can be trained.
We trained and evaluated several models that predict the gain expected from applying an operator to a document, and selected the best of them. Our experiments show that the gain prediction model is more accurate than a basic heuristic that estimates the expected gain from applying an operator to a document. In addition, a greedy algorithm we developed that uses the gain prediction model is faster and requires fewer operator applications than a greedy algorithm that does not use the model.
Another part of the challenge is finding a sequence of operators expected to shorten the document. To this end, we unfolded the operator space of a given document into a search space, which made it possible to use a search algorithm to find the best sequence of operators. Experiments comparing the search algorithm with the greedy algorithm showed that the share of documents successfully shortened was higher with the search algorithm.

Abstract in English

Compilation-based text editors, such as Overleaf for LaTeX, let users enter text together with commands; after compilation, the text is converted into a typeset document, such as a PDF. However, a significant drawback of these editors is that the final document layout cannot be previewed without compiling. Consequently, when users need to shorten a document to meet a page limit, they often resort to trial and error: they make changes in the text editor and check, after each change, whether the compiled document fits within the limit. Moreover, changes made to shorten the document can degrade its appearance.
This problem poses the following challenge: which changes should be made to the document so that the overflowing elements fit and the page limit is met? In this project, we address this challenge by proposing machine learning models.
As a first step, we generated a repository of 329,000 varied TeX files and their corresponding PDF documents, each produced by applying different operators to a document. Using feature extraction techniques, we converted each document into a vector of numeric values, yielding a dataset on which machine learning models can be trained.
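To make the feature extraction step concrete, the following is a minimal sketch of turning a TeX source into a numeric vector. The specific features (word count and counts of figures, tables, section headings, and display equations) and the function name `extract_features` are assumptions chosen for illustration; the project's actual feature set is not described here.

```python
import re
from typing import List


def extract_features(tex: str) -> List[float]:
    """Map a TeX source string to a small numeric feature vector.

    The features below are hypothetical examples, not the project's real ones.
    """
    # Drop TeX comments (an unescaped % up to the end of the line).
    body = re.sub(r"(?<!\\)%.*", "", tex)
    return [
        float(len(body.split())),                                     # whitespace-separated tokens
        float(len(re.findall(r"\\begin\{figure\*?\}", body))),        # figure environments
        float(len(re.findall(r"\\begin\{table\*?\}", body))),         # table environments
        float(len(re.findall(r"\\(?:sub)*section\*?\{", body))),      # section-level headings
        float(len(re.findall(r"\\begin\{(?:equation|align)\*?\}", body))),  # display equations
    ]
```

A vector like this, computed before and after an operator is applied, could serve as one row of a training set for a gain prediction model.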
We trained and evaluated several gain prediction models, which predict the gain expected from applying an operator to a document, and chose the best of them. Our experiments show that the gain prediction model is more accurate than a basic heuristic that estimates the expected gain. Moreover, our greedy algorithm that uses the gain prediction model is faster and requires fewer operator applications than the greedy algorithm without the model.
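The idea of a greedy algorithm guided by predicted gain can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: `predict_gain` stands in for the trained model, the operator names are hypothetical, gains are assumed to be additive, and the overflow is measured in an arbitrary unit such as lines.

```python
from typing import Callable, List, Tuple


def greedy_reduce(
    overflow: float,
    operators: List[str],
    predict_gain: Callable[[str], float],
) -> Tuple[List[str], float]:
    """Choose operators by descending predicted gain until the overflow is covered.

    Returns the chosen operators and the predicted remaining overflow
    (zero or negative means the document is predicted to fit).
    """
    ranked = sorted(operators, key=predict_gain, reverse=True)
    chosen: List[str] = []
    remaining = overflow
    for op in ranked:
        if remaining <= 0:
            break  # predicted to fit already
        gain = predict_gain(op)
        if gain <= 0:
            break  # no remaining operator is predicted to help
        chosen.append(op)
        remaining -= gain
    return chosen, remaining
```

Because the operators are ranked by prediction alone, the document would need to be compiled only to confirm the final choice, which is one plausible reason a model-guided greedy algorithm needs fewer operator applications.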
Another part of the challenge is to find a sequence of operators predicted to reduce a paper's length. To this end, we unfold the operator space of a document into a search space, which makes it possible to use a search algorithm to find the best sequence of operators. Our experiments show that the search algorithm successfully shortens a higher share of papers than the greedy algorithm when both use the gain prediction model.
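Searching over operator sequences can be illustrated with a small uniform-cost search, where the cost of a state is the number of operators applied. The abstract does not say which search algorithm the project uses, so this choice, the additive-gain assumption, the `max_ops` bound, and the operator names are all assumptions made for the sketch.

```python
import heapq
from typing import Dict, Optional, Tuple


def search_operator_sequence(
    overflow: float,
    gains: Dict[str, float],
    max_ops: int = 5,
) -> Optional[Tuple[str, ...]]:
    """Find a shortest operator sequence whose predicted gains cover the overflow.

    States are expanded in order of (operators applied, remaining overflow).
    Operators are added in a fixed name order so each subset is visited once.
    Returns None if no sequence of at most max_ops operators suffices.
    """
    names = sorted(gains)
    # Heap entries: (depth, remaining overflow, next operator index, sequence so far)
    frontier = [(0, overflow, 0, ())]
    while frontier:
        depth, remaining, start, seq = heapq.heappop(frontier)
        if remaining <= 0:
            return seq  # predicted to fit within the page limit
        if depth == max_ops:
            continue
        for i in range(start, len(names)):
            name = names[i]
            heapq.heappush(
                frontier,
                (depth + 1, remaining - gains[name], i + 1, seq + (name,)),
            )
    return None
```

Unlike the greedy approach, this search considers combinations of operators, so it can find a fitting sequence in cases where always taking the single largest predicted gain first would not.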