Google Summer of Code 2020 Ideas List

On this page you will find project ideas for applications to the Google Summer of Code 2020. We encourage creativity; applicants are welcome to propose their own ideas in CDLI realm.

At the end of the summer, we need to instantiate our new Framework (beta version) for testing by users. Any project that help us get there will have priority. Domains of development concern design implementation, functionalities implementation (mostly PHP but some Python tasks), deployment scripts improvement and testing, etc.

Our design can be found here: https://scene.zeplin.io/project/5cd250d2e18bd568686fa597
The framework code is here: https://gitlab.com/cdli/framework
Reach out to get a copy of the database for testing!

See our issues tracker for tasks: https://gitlab.com/cdli/framework/issues

About the Cuneiform Digital Library Initiative

The Cuneiform Digital Library Initiative (CDLI) is driven by the mission to enable collection, preservation and accessibility of information— image files, textual annotation, and metadata—concerning all ancient Near Eastern artifacts inscribed with cuneiform. With over 334,000 artifacts in our catalogue, we house information about approximately two-thirds of all sources from cuneiform collections around the world. Our data are publicly available at https://cdli.ucla.edu, and our audience comprises primarily scholars, students, museum staff, and informal learners.

Through its long history, CDLI is now integral to the Assyriological discipline fabric itself. It is used as a tertiary source, a data hub and a research data repository. Based on Google Analytics reports, CDLI's website is visited on average by 3,000 monthly users in 10,000 sessions and 100,000 pageviews. 78% of these users are recurring visitors. The majority of users access CDLI collections and associated tools seeking information about a specific text or group of texts; insofar as these are available to us, CDLI has authoritative records about where the physical document is currently located, when and where is was originally created and deposited in ancient times, what is inscribed on the artifact and where it has been published. Search results display artifact images and associated linguistic annotations when available.

CDLI is a collaboration of developers, language scientists, machine learning engineers and cuneiform specialists who are creating a software infrastructure to process and analyze curated data. To this effect, we are actively developing two projects: Framework Update and Machine Translation and Automated Analysis of Cuneiform Languages. As part of these endeavors, we are building a natural language processing platform to empower specialists of ancient languages to undertake translation of Sumerian language texts, thus enabling data-driven study of the languages, culture, history, economy and politics of ancient Near Eastern civilizations. In this platform we are focusing on data normalization using Linked Open Data to foster best practices in data exchange, standardization and integration with other projects in digital humanities and computational philology.

The CDLI offers catalogue and text data that are downloadable from our Github data repository, image files that can be simply harvested online or obtained on demand (including higher resolution images, for research purposes only), and textual annotations that are currently being prepared by the MTAAC research team.

Potential mentors

Émilie Pagé-Perron Domain specialist / Web design (full stack) / NLP
Ilya Khait Domain specialist / NLP
Niko Schenk NLP / ML
Max Ionov NLP / ML
Amaan Iqbal Web design (front end)
Willis Monroe Domain specialist/ NLP
Gábor Zólyomi Domain specialist /scores management
Szilvia Sovegjarto Domain specialist /scores management
David Wong Web Design (full stack)
Ravneet Punia NLP / ML
Sagar Sehgal NLP / LOD / Web design (front end)
Kate Kelly Domain specialist
Rune Rattenborg Domain specialist / GIS
Sara Brumfield Domain specialist / Data science

List of potential project ideas

Extend authorization/authentication system
Bibliography management and display
Journals management and display
Design Integration
Finalizing main search and developing the advanced search
Prepare the search results display, expanded and compact
Translating the whole Ur III corpus
Sumerian - English Machine Translation
Image processing pipeline: from raw to archival and web
Score generation, browsing, and display from text witnesses
Accessibility audit and enhancement
Revamp CDLI tablet apps (Apple and Android)
Crowdsourcing API & Admin pipeline
ML-visualization
Other Ideas
Your own project

Description of potential projects

Extend authorization/authentication system

The outcome of this challenge is essential to the core of the new CDLI Framework. It encompasses authentication and authorization, including access roles and granular access outside the generic permission roles. Authentication with 2FA is already set up but access needs to be properly configured for the roles.

issues:

Outcomes: When this challenge is complete, authentication will be set up in full and access to pages which are already built will be set properly according to the appropriate level of access, and documentation will explain how to implement proper access to the new methods and views to be built in the future. The basis of an admin / editor dashboard should display links to restricted pages.

Skills required/preferred: PHP & HTML, CakePHP, 2FA

Bibliography management and display

The CDLI is moving away from its older flat database and for the occasion has designed a full model for bibliographic data management. This challenge concerns in the first place the conception of the management pages for this data, and in second the display of bibliographic data. This also includes bulk import and export of the data in bibtex and flat format, and bulk association of artifacts and publications.

issue: https://gitlab.com/cdli/framework/issues/122

Outcomes: Index and individual publication views (also showing associated artifacts) should be set up, display for publication in the views for artifacts search results views and individual artifact view will display publication data properly. Admin pages will allow add, edit, delete publications and the association with artifacts.

Skills required/preferred: Relational DBs, PHP & HTML, CakePHP, Bibtex format

Journals management and display

The CDLI hosts four journals, two peer-reviewed journals (CDL Journal and CDL Bulletin), the CDL Pre-prints repository and the CDL Notes. The journals are currently handled in Drupal, we need to develop the required controllers and views to manage and display the contents of the journals. We will be using latex templates for the CDLJ and CDLB soo. The bibliography must be link with the general bibliography of the cdli.

issue: TBD

Outcomes: Full management and display of the four CDL journals.

Skills required/preferred: PHP (CakePHP)/MVC, HTML, publication process

Design Integration

Our new design (https://scene.zeplin.io/project/5cd250d2e18bd568686fa597) has only partially been integrated to the interface of the new framework. This challenge will include both writing the css in sass for all elements of the design, styling existing pages and creating templates for generic pages to be used by anyone developing new features to the framework.

Issues: https://gitlab.com/cdli/framework/issues?label_name%5B%5D=ux

Outcomes: Full integration of the design into the framework.

Skills required/preferred: CSS/SASS, HTML, JS, Accessibility / WCAG guidelines, UX/UI design

Finalizing main search and developing the advanced search

We already have a search controller but not all features have been implemented and search queries must be heavily optimized. See Controller here: https://gitlab.com/cdli/framework/-/blob/phoenix/develop/app/cake/src/Controller/AdvancedSearchController.php

Issues: https://gitlab.com/cdli/framework/issues/37
https://gitlab.com/cdli/framework/issues/48

Outcomes: Arrays of results from simple and advanced search will give best results based on search query, custom rules and search features at a respectable short wait time.

Skills required/preferred: different search methods (eg elasticsearch), PHP, CakePHP, Regex...

Prepare the search results display, expanded and compact

Users must be able to view results in expanded and compact version with all information displayed properly. This also includes setting up downloads for the data in various formats including csv, pdf, xlsx, and rdf. This task does not include styling with css but requires that all data be interlinked and precisely displaying in the proper way. this means developing data processing routines to format contents exactly to requirements.

Issues: https://gitlab.com/cdli/framework/issues/51
https://gitlab.com/cdli/framework/issues/52

Outcomes: Expanded view of search results, compact search results for all 3 display formats, and downloads all set up properly and thoroughly tested.

Skills required/preferred: PHP, CakePHP, HTML, data formats, data conversion

Translating the whole Ur III corpus

The organization build an NMT model for Sumerian to English Translation, using a cleaned parallel dataset. But still around 1.5M raw untranslated data is available. The project aims a full translation pipeline, which will integrate NER and POS tagging of URIII languages, either using Neural Networks of Rule-Based approach to perfect the results and provide a full set of translations for the texts.

Possible Mentors:

Ravneet Punia
Niko Schenk

Project Repository

Outcomes: Translation pipeline and all Ur III texts translations.

Skills required/preferred: ML, NLP, Python

Sumerian - English Machine Translation (HARD)

As part of the MTAAC project, the organization host Sumerian data comprising 1.5 million transliteration lines and 10K parallel lines corpus (approx). We already developed a neural network-based encode-decoder architecture for English-Sumerian Machine Translation, but that leverages the parallel dataset only, which is not sufficient to achieve state of the art results. Your task is to develop a language model using the monolingual data as well as parallel data to translate Sumerian phrases to English, and vice versa.

Possible Mentors:

Niko Schenk
Ravneet Punia

Link for the Dataset: ... to come

Your Tasks & Desired (Minimum) Outcomes:

Train and evaluate different models and architectures on standard train/development/test splits.
Experiment with all possible hyperparameter settings to obtain the best performance.
Perform a quantitative and qualitative evaluation of the translations.
Better accuracy than the previous year model.
Testing different at least two NMT approaches like Cross-lingual Language Model, Dual Learning or Back-Translation.
Students with a research background will be preferred.

Getting started:

Cross-lingual Language Model
Dual Learning
Back-Translation for Unsupervised NMT

Image processing pipeline: from raw to archival and web

Create a scripted pipeline for processing of raw scans of artifacts and scans of line art to the final version images for archival storage and web display. Some scripts in Apple script and other languages exist for a part of these tasks, they can be used as models.

This Challenge has two parts: first the processing of newly digitized artifacts assets to prepare images for archival storage in the best quality possible, and secondly, the management of images from the web interface, for storage to archival and to create and maintain images for web display.

Issue: https://gitlab.com/cdli/framework/issues/87 ( for the web part )

Outcomes: 1 ) A set of scripts and instructions for scholar digitizing artifacts in collections that will assist them in producing high quality "fatcrosses" 2 ) An image manager integrated in the CDLI Framework so editors and admins can create web images, associate images with artifacts, etc.

Skills required/preferred: Image data understanding, Python, PHP/CakePHP Apple script, server side image processing software

Score generation, browsing, and display from text witnesses

Transliterations of cuneiform texts that are witnesses to literary compositions have encoded line numbers for the associated composition. Based on those line numbers, it it possible to generate "scores" or list of the same line of a composition but from different documents in which they appear.

The current system can be found here: https://cdli.ucla.edu/tools/scores/partitur-index.html

This task entails preparing a main index page of compositions and a single composition view page which will display the score for that particular composition, and the associated translation.

issue:https://gitlab.com/cdli/framework/issues/147

Accessibility audit and enhancement

CDLI has been spending a lot of energy on developing it's new Framework to share its data and tools to the public. One thing we are concerned about is the overall accessibility of the information we share. The concern is to make sure we can cater do differently abled users: be it in terms of vision, steadiness of hand, cognition, etc.

The Framework repo is https://gitlab.com/cdli/framework
branch: phoenix/develop

Outcomes: The student's deliverable would come in two parts: a complete audit of the cdli interface, and the implementation of changes in the code and interface to increase accessibility where possible, based on the audit. Most pages should hit A in the WCAG ladder, and AA in other cases.

Skills required/preferred:

Some knowledge or interest in researching accessibility standards
Proficiency in HTML/CSS (we use Bootstrap 4 with sass)
Some basic knowledge of PHP
Strong interest in UX/UI
Detailed oriented

Revamp CDLI tablet apps (Apple and Android)

CDLI always tries to find new ways to reach out to different and varied audiences. One way we do this is through our mobile application which provide daily

Read more about CDLI Tablet here: https://cdli.ucla.edu/?q=cdli-tablet

Current repo links:

https://github.com/cdli-gh/CDLI-Android-Application
https://github.com/cdli-gh/ios-cdli-tablet
https://github.com/cdli-gh/flutter_app

App stores links:

https://play.google.com/store/apps/details?id=com.cdlisolutions.cdli.cdlitablet
https://apps.apple.com/us/app/cdli-tablet/id636437023?ls=1

Outcomes:

After choosing an open-source UI software development kit such as Flutter or React Native, the student will develop a new version of our mobile applications that will include actual existing functionalities. Extra new functionalities will be decided upon with mentor and can be implemented if there is enough time.

Skills required/preferred:

Experience in coding mobile apps
Experience with UI/UX
Loves beautiful design and immersive experiences

Crowdsourcing API & Admin pipeline

Crowdsourcing infrastructure would enhance CDLI with direct contributions from the cuneiform research community. On the framework's side, the task requires creating an API capable of handling external HTTP(S) requests in a format appropriate for suggesting edits and new entries. Pending suggestions are to stand on the admin dashboard for review and approval or rejection and be available on the public website. Finally, upon the editor's action a response is to be sent back to the user and, if approved, corresponding changes are to be made in the database.

Skills required/preferred: PHP, CakePHP, HTML, HTTP(S) (e.g. with Curl)

Outcomes:

The API and pipeline should be integrated in CDLI's framework. The task involves developing a simple client server application for sending contributions and getting responses upon status update.

ML Visualization

For some of the early material in the project's corpus translation efforts are still in the early stages. Efforts to understand the material via ML-techniques are ongoing and significant value has been derived from various forms of visualization whether at the single token level, n-grams, or higher levels of analysis. This project would attempt to create a standardized visualization pipeline to enable the selection of a corpus with the project and the application of models and output convenient interactive pages to explore the selected corpus. This fits within a broader framework of reproducible scholarship within the humanities.

Skills required/preferred: ML, Python, JS

Outcomes:

This project would result in a module that could be added to the framework that would allow for the selection and processing of groups of text within the CDLI. The output of this module would consist of interactive visualizations based on standard methods of ML analysis.

Other ideas

A converter to C-ATF format from multiple other formats such as ORACC-ATF, BDTNS format, etc.
Develop the browsing pages

Your own project (bring 'em on!)

We are interested in expanding our technological capabilities in processing, analyzing and distributing (including visualization and accessibility) our catalogue and textual derived data. If you have an idea that could be reused either to reproduce your research or enhance further developments in the disciplines of Assyriology, Computational Linguistics or Computer Science, reach out to us and we can work together on preparing a project suitable for you, CDLI and GSoC.

[email protected]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Google Summer of Code 2020 Ideas List

About the Cuneiform Digital Library Initiative

Potential mentors

List of potential project ideas

Description of potential projects

Extend authorization/authentication system

Bibliography management and display

Journals management and display

Design Integration

Finalizing main search and developing the advanced search

Prepare the search results display, expanded and compact

Translating the whole Ur III corpus

Sumerian - English Machine Translation (HARD)

Image processing pipeline: from raw to archival and web

Score generation, browsing, and display from text witnesses

Accessibility audit and enhancement

Revamp CDLI tablet apps (Apple and Android)

Current repo links:

App stores links:

Outcomes:

Crowdsourcing API & Admin pipeline

Outcomes:

ML Visualization

Outcomes:

Other ideas

Your own project (bring 'em on!)

Clone this wiki locally