Metadata for Efficient Management of Digital News Articles in Multilingual News Archives

The digital preservation and management of news in low-resource languages are challenging tasks, especially in vast collections. Unique identification of individual digital objects requires well-defined attributes to assure efficient management, including access, retrieval, preservation, usability, and transformability. A metadata element set is required to capture the available attributes of the digital objects, with the goal of creating a comprehensive metadata set that contains all the necessary attributes and data about the digital news objects. The task is more challenging and complicated when the archive contains articles from low-resource and morphologically complex languages like Urdu and Arabic, which are difficult for machines to understand. The study presents the preservation and research challenges posed by low-resource languages (LRL). The metadata helps to link news articles, based on similarity, with other news articles stored in the digital news stories archive (DNSA) and ensures accessibility. In this study, we introduce a set of 28 metadata elements for the digital news stories preservation (DNSP) framework, of which 16 are explicit and 12 are implicit metadata elements. The paper presents how the DNSA is enhanced into a multilingual archive and discusses the digital news stories extractor, which addresses major issues in implementing low-resource languages and facilitates normalized format migration. The extraction results are presented in detail for a high-resource language (HRL), that is, English, and for low-resource languages, that is, Urdu and Arabic. The LRL sources encountered a higher error rate during preservation than the HRL sources: 10% and 3%, respectively. The metadata extraction results show that HRL sources support all metadata elements, whereas LRL sources have good support for explicit metadata elements and support many implicit metadata elements only with low extraction percentages.
LRLs need more detailed study for accurate news content extraction and archiving for future access.


Introduction
The Internet is the leading resource that provides information; it holds a variety of information sources covering every aspect of human life, such as weather forecasts, travel deals, and events happening locally and worldwide. This information can be accessed via the World Wide Web and web services (Khan, 2018). The exponential growth of web information is projected to exceed the capacity of the world's living brains by 2025; web information is currently measured on the order of exabytes (10^18 bytes) and zettabytes (10^21 bytes) (Emani et al., 2015; Size, 2021).
Though the WWW is a fast-growing source of information, it is fragile in nature. This fragility causes valuable scholarly, cultural, and scientific information to vanish and become inaccessible to future generations. Therefore, there is a need to preserve the information available in different forms.
The newspaper has been a main source of information for centuries. Newspapers cover information related to different aspects of human life and provide information about events happening locally and worldwide: acts of parliament, events of political importance for countries, proceedings of courts in important cases, births, deaths, marriages, sports, science, technology, and so on. Newspapers reflect the social life, behaviors, and cultural values of different communities, and hence they constitute vital scholarly information for individuals and for the community as a whole; this information must remain available to future generations. For example, a prime minister's address to the assembly after winning an election, or the measures announced in the face of an imminent foreign invasion of a country, becomes as valuable to future generations as historical manuscripts are today. According to the UNESCO universal declaration on archives, adopted at the ICA Annual General Meeting in Malta, archives play a vital role in the development of societies by safeguarding the contributions of individuals and communities (UNESCO Official, 2010). The only way to safeguard this published information is to preserve it and make it available to forthcoming generations. Several initiatives have been taken, and numerous newspaper archives have been created to preserve this published information. Most curators shared that they or their organizations preserve several newspapers and maintain digital newspaper collections. Generally, newspapers are digitized either in-house or by vendors, and some are managed as born-digital content obtained either directly from publishers or by harvesting the web (Skinner & Schultz, 2014).
The state-of-the-art review of newspaper archives shows that various approaches have been adopted for newspaper preservation, and most newspapers are digitized as single digital records. Generally, the curated digitized records are scanned from microfilm (greatly reduced photographic copies that are compact to store and must be magnified for reading) into PDF, GIF, JPG, or other graphical formats. Newspaper archives can be divided into older and newer archives. The older newspaper archives, available primarily in graphical format, are hard to index into a full-text corpus with Optical Character Recognition (OCR) technology. In contrast, the newer newspaper archives are fully indexed and allow full-text searching.
The digital news stories preservation (DNSP) framework was introduced to create a digital archive of news articles linked together, based on defined criteria, for future use (Khan & Rahman, 2015). Recently, the DNSP framework has been enriched to create a multilingual, multi-source digital news stories archive that will preserve digital news articles for the long term and for future generations. Two low-resource languages, Urdu and Arabic, have been added to the framework. Several identified aspects of low-resource languages make it hard to simply include these sources in the digital news stories archive (DNSA). The study discusses challenges related to volume, variety, and velocity during creation of the archival information package, technical challenges during creation of the archive, and challenges related to the dissemination of archived content. The absence of resources for low-resource languages, such as efficient tokenizers, dictionaries, and other basic resources, prompts heavy preprocessing during the preservation process.
The section ''Preservation Challenges in Low-Resource Languages'' and its subsections differentiate low-resource languages from high-resource languages, outline the challenges in LRLs, describe the role of metadata in information dissemination, and provide a brief overview of the Urdu and Arabic languages. The section ''Why We Need Metadata for DNSP Framework'' presents details about the digital news stories preservation framework initiative and discusses the importance of preservation, research challenges, the DNSP framework enhancement, the multilingual archive and its structure, and major issues in enhancing the extraction tool. In the section ''News Extraction Results,'' extraction quantification is comprehensively discussed. The section ''Proposed Metadata Element Set for DNSP'' presents the proposed metadata element set for the DNSP framework, explicit and implicit metadata, extraction results, and discussion. The last section concludes the findings of the study.

Preservation Challenges in Low-Resource Languages
Natural language processing (NLP) tools underwent a significant change in the 1990s, transitioning from rule-based techniques to statistical approaches, which marked the beginning of a new era of artificial intelligence. Since then, the primary focus has been on English as an international language, with only about 20 of the 7,000 languages spoken around the world being considered (Guellil et al., 2021).
Natural languages are classified into two broad categories: low-resource languages (LRL) and high-resource languages (HRL). Many data resources exist for high-resource languages, such as English, that help machines learn and understand them. By far, English is the best-resourced language compared with the other most spoken languages. Many West-European languages are well resourced, and languages such as Chinese, Japanese, and Russian are also considered high-resource languages. In contrast, low-resource languages are languages with very few or no resources available. Low-resource languages can be defined as less studied, resource-scarce, less computerized, less privileged, less commonly taught, or low-density languages (Cieri et al., 2016; Magueresse et al., 2020). Many languages are difficult to preserve because they are mostly oral, and the few written resources that exist are in physical, not electronic, form (Goyal et al., 2022). Different types of resources support natural language processing and the development of language-based systems: collections of text in various forms, such as research papers, books, email collections, and social media content; lexical, syntactic, and semantic resources, such as bags of words, dictionaries, semantic databases (e.g., WordNet), and organized dependency-tree corpora; and task-specific resources, such as part-of-speech tags, corpora for machine translation, annotated text, and named entity recognition resources.
Many language resources are costly to produce, which is why the economic inequalities between countries are reflected in language resources and the lack of research. Hence, many challenges arise in protecting these languages from being lost.
The alignment or projection technique (with three levels of alignment: document, sentence, and word) is a common technique for annotation. It is difficult to project annotations from an HRL to an LRL because of the lack of resources and the different structures of the target and source languages (Magueresse et al., 2020). Creating bags of words, datasets, and raw text collections for an LRL is difficult, yet these are necessary for any natural language processing (NLP) task and for mapping techniques (Magueresse et al., 2020). The most important resource for any language is its lexicon. Many NLP tasks depend heavily on textual material, which is lacking in LRLs, making it challenging to produce an efficient lexicon. The morphology of an evolving LRL changes, and its vocabulary extends easily; developing a comprehensive framework for morphological pattern recognition is difficult because of multiple roots (Elkateb et al., 2006). The major applications of NLP, such as question-answering systems, sentiment analysis, image-to-text mapping, machine translation, and named entity recognition, are very difficult to implement in low-resource languages. Even basic NLP tasks are difficult in low-resource languages, such as stopword identification and removal, tokenization, part-of-speech tagging, sentence parsing, lemmatization, and stemming.
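As a small illustration of why even a basic task like tokenization is harder for LRLs, the following sketch (not part of the DNSP tooling) shows a naive rule-based tokenizer. It works acceptably for English, but it assumes whitespace-delimited words, an assumption that often fails for Urdu, where spaces between words are frequently omitted or inconsistent:

```python
import re

def naive_tokenize(text):
    # Split on whitespace and common punctuation. This baseline is
    # adequate for English, but Urdu text frequently omits the space
    # between words, so whole phrases come back as a single "token".
    return [t for t in re.split(r"[\s.,;:!?]+", text) if t]

print(naive_tokenize("The archive preserves news articles."))
# -> ['The', 'archive', 'preserves', 'news', 'articles']
```

A language-aware segmenter or morphological analyzer is needed for reliable LRL tokenization, which is exactly the kind of resource LRLs lack.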
NLP systems for LRLs are comparatively time-consuming and less efficient because of the lack of resources, and building them becomes even more difficult when a machine learning system is required (Guellil et al., 2021).
Many languages are mostly oral, with very few written resources in either physical or digital form. For some, written documents exist but not even a basic resource like a dictionary. Integrated and customized systems are always a huge challenge for multilingual systems.
Dealing with all the challenges faced by low-resource languages requires extensive research in different dimensions. Urdu and Arabic are two major languages that need substantial research focus.
Urdu Language. Urdu is a popular South Asian language, with about 70 million native speakers and more than 164 million speakers worldwide (Andrabi & Wahid, 2022; Rehman et al., 2011). Urdu is the official literary language of Pakistan, is spoken and understood in other countries such as India and Bangladesh, and is closely related to Hindi. Urdu periodicals offer a wide range of work on imperative issues of South Asia spread over the 19th and 20th centuries, making their conservation precious for researchers of the language (Rafique et al., 2022).
Arabic Language. Arabic is the third most spoken language after English and Chinese. Around 292 million people speak Arabic as their first and official language in 27 states worldwide, and many more understand it as a second language (Wright, 2022). Arabic is one of the six official languages of the United Nations, alongside English, French, Spanish, Russian, and Chinese. Arabic is also becoming a popular language to learn in the Western world, and other languages have borrowed words from Arabic due to its historical significance. Arabic grammar is sometimes tough for native speakers of Indo-European languages to learn, and it is hence a challenge for machines to correctly interpret and understand the Arabic language (Kamusella, 2017; UNESCO Official, 2016).

Accessing Via Metadata
Metadata is commonly known as data about data, or information about information (Riley, 2017). Metadata helps to organize electronic resources in archives or repositories. From the perspective of most information fields, ''meta'' means an underlying definition or description. Information about structure, history, evolution, authenticity, availability, accessibility, digital signatures, copyright, reproduction, and so on is also metadata (Dashrath, 2014). Considering the scope of data it applies to, from archaeological resources, document files, images, and videos to spreadsheets and webpages, or simply big data, it is not surprising that understanding and managing metadata has become a high priority (Greenberg, 2005).
Metadata is essential in managing digital objects in libraries, archives, or digital collections. Some important roles of metadata are: resource discovery in huge collections (Greenberg, 2010); organizing electronic resources in digital libraries and collections (Habib & Balliot, 2000); enabling interoperability, the ability of different systems to exchange and jointly use information without losing content and functionality (Riley, 2017); and certifying the authenticity, reliability, integrity, and provenance of digital objects (Harran et al., 2018). Metadata also stores information about an object's physical characteristics and documents its behavior so that it can be emulated in future technologies (Riley, 2017). During the object development phase, multiple versions of the same object may be created for preservation and dissemination. Re-using data requires careful preservation and documentation of the metadata.
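To make these roles concrete, the sketch below assembles a minimal metadata record as XML, the normalized format into which the DNSP framework migrates preserved articles. The element names used here (newsArticle, title, source, and so on) are illustrative placeholders, not the framework's actual schema:

```python
import xml.etree.ElementTree as ET

def build_metadata_record(fields):
    # Wrap a flat dictionary of metadata fields in a simple XML record.
    # Element names are hypothetical, for illustration only.
    record = ET.Element("newsArticle")
    for name, value in fields.items():
        child = ET.SubElement(record, name)
        child.text = value
    return ET.tostring(record, encoding="unicode")

xml_record = build_metadata_record({
    "title": "Example headline",
    "source": "Example News",
    "language": "en",
    "publicationDate": "2022-05-01",
})
print(xml_record)
```

A structured record like this is what makes discovery, interoperability, and provenance checks possible long after the original webpage has disappeared.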

Newspaper Archive Sources
A number of archives are maintained by different organizations (government and non-government) with different scopes; archives may be small, medium, or large depending on the number of newspapers archived and the coverage in terms of time. Many sources list these digital archives in alphabetical order or under different categories. For example, the de-facto national library of the United States, ''The Library of Congress'' [https://www.loc.gov/], provides a list of newspaper archives, indexes, and morgues in its ''Newspaper and Current Periodical Reading Room'' (Library of Congress, 2022). The International Coalition on Newspapers (ICON) is a multi-institutional effort (comprising universities, colleges, and independent research libraries) that promotes the accessibility and preservation of newspaper collections from all over the world; it is supported by the Center for Research Libraries (CRL), a Global Resources Network, and provides a list of newspaper digitization projects [http://icon.crl.edu/] (Center for Research Libraries, 2022). Similarly, the ''Phillips Library of Mount St. Mary's University'' (Phillips Library, 2022) and ''The Ancestor Hunt'' (Ancestor Hunt, 2022) are other known sources that maintain comprehensive lists of newspaper archives. A common problem in all these lists is that many broken or dead links exist, or the archives' parent links have been updated. Even the Wikipedia list contains many archives without any parent link, and many entries are individual newspaper archives covering a very short period.
Low-resource languages such as Arabic have very limited digital collections, and Urdu has no such digital collection. The British Library maintains both Arabic (The British Library, 2022) and Urdu (The Internet Archive, 2022) collections, which contain very few books. Similarly, the Harvard Library maintains the Middle East and Islamic Studies library resources to safeguard the culture and heritage of the Islamic world (The Harvard Library, 2022). Digitization and preservation of old newspapers are mostly done by converting them into digital images to protect culture and heritage for future generations. A study comprehensively discussed the contributions to preservation by different countries and organizations, showing that higher-education organizations in developed and technologically advanced countries contributed more than those in developing countries (Khan et al., 2017). Mostly, low-resource languages are associated with developing countries, where the preservation of cultural assets receives very little focus. Urdu is one of the low-resource languages with no newspaper archive; only very little content has been preserved by international archives like ''The Internet Archive.'' The access mechanisms of the available archives are not sophisticated, the manipulation of contents is not easy, and the contents remain inaccessible most of the time.

Why We Need Metadata for DNSP Framework?
The World Wide Web is continuously expanding due to the ever-increasing number of information sources providing information at almost any time, making the repositories highly dynamic and in need of continuous periodic updating and preservation. Information on the internet is much more volatile and fragile than information in hard form and can vanish or be altered if not smartly and efficiently handled and archived. This information must not only be uploaded and added to the repository but also be provided with efficient access and other services. Descriptive, technical, and administrative information must ensure access to the archived digital objects (Khan, 2018).
News is among the most visited and relied-upon information in today's world. People watch online news channels and read newspapers and other articles on the internet. Various applications and gadgets are in use, and different sources continuously contribute news. These sources offer different forms and types of content, which cannot be handled by the traditional strategies used for information archival. Dissemination of the information is needed after the preservation and creation of archives. News content is lost after some time because of technological changes, hardware and software incompatibilities, or a failure to preserve the technical and content information, that is, the metadata. Older news disappears after its lifespan, whether a week, a month, or longer; eventually it vanishes, so the news domain needs a specific archival approach that ensures the preservation of news for a long time and for future generations. Ensuring the archival of news articles requires specific preservation strategies covering all technical and administrative aspects. The Digital News Stories Preservation (DNSP) framework was initiated to archive digital news from multiple sources in an organized form and create the DNSA. Metadata is created and collected because it enables and improves the use of archived news articles (Khan & Rahman, 2015).
Many metadata standards exist; some are generic and widely used as a base for other evolving standards. Generic metadata standards have limitations and do not work effectively in some specific repositories, whereas domain-specific metadata standards are designed for a particular domain and cannot perform at their best elsewhere. News repositories face the same problem: they must be preserved with news-specific metadata that enables efficient preservation of the contents and efficient access. The focus here is better access and retrieval of news from the DNSA through the DNSP framework. Even a very well-organized archive is of no use if it lacks an efficient access mechanism and fails to satisfy user queries. For this purpose, the metadata elements should be sufficiently rich to answer users' questions and support searches for the information they require.

DNSP Framework Enhancement
The primary purpose of the DNSP framework is to create a multilingual, multi-source digital news stories archive that will preserve digital news articles for the long term and for future generations. The framework is enriched with two low-resource languages, Urdu and Arabic. The challenges presented in previous sections regarding low-resource languages make it hard to simply include these sources. The absence of efficient tokenizers, dictionaries, and other basic resources prompts heavy preprocessing during preservation in the framework. The workflow and main components of the enhanced version of the DNSP framework are presented in Figure 1.

Multilingual News Archive
This section briefly introduces the Digital News Stories Archive (DNSA). The core initiative of the Digital News Stories Preservation (DNSP) framework was demonstrated at the International Conference on Asian Digital Libraries 2015 (ICADL-2015) (Khan & Rahman, 2015). The following are the significant contributions to the framework. A generic systematic approach was proposed as a web preservation model; the model contained ten steps for different types of web preservation projects, derived after analyzing 120 news archives worldwide (Khan et al., 2017; Khan & Rahman, 2019). The study created the Digital News Stories Archive (DNSA) to preserve news articles from multiple online news sources (Khan, 2018). A news extractor tool, the Digital News Stories Extractor (DNSE), was designed for the extraction of news contents and the creation of the DNSA (Khan et al., 2016). Based on different features, several content-based linking mechanisms were introduced during preservation to ensure the accessibility of the archived contents in the DNSA, using text-processing techniques such as the Common Ratio Measure for Similarity (CRMS) (Khan et al., 2018) and the role of named entities in linking (Khan et al., 2020). A comprehensive study was performed in the field of recommendation systems to understand the utility of similarity measures and refine the techniques in the DNSP framework (Feng et al., 2020). The framework can be enhanced in different directions to improve its utility (a few are discussed in future work) (Feng et al., 2020). The CRMS technique was modified to operate on news headings, reducing the extra computation over terms appearing in the news body when linking English news articles during preservation (Khan et al., 2018). The CRMS technique was then updated for linking Urdu-language news articles with English-language news articles, and the DNSA was converted to a dual-lingual archive (Khan et al., 2022). A heading-based technique was introduced for linking English news articles for efficient linkage in the DNSA within the DNSP framework (Khan et al., 2020).

Figure 1. Enhanced digital news story preservation framework for the low-resourced language ''Arabic'' (Khan et al., 2020).
The digital news stories archive (DNSA) is a news article archive created offline from multiple online sources; it preserves news stories in three different languages: English, Urdu, and Arabic. Currently, the DNSA archives digital news from three local news television network websites and seven local online newspapers in English (Khan et al., 2016), five Urdu news sources, and four Arabic online news sources. The archive is created offline locally and preserves more than 1,000 news articles in each extraction from the specified news sources after removing duplicate and old URLs.
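The duplicate and old-URL filtering step mentioned above can be sketched as a simple set-based check. This is an illustrative simplification: the real extractor would also have to account for URL normalization and source-specific patterns, which are not shown here:

```python
def filter_new_urls(extracted_urls, archived_urls):
    # Keep only URLs not already present in the archive index,
    # preserving extraction order and dropping within-batch duplicates.
    seen = set(archived_urls)
    fresh = []
    for url in extracted_urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh

archived = ["https://example.com/news/1", "https://example.com/news/2"]
batch = ["https://example.com/news/2",   # old: already archived
         "https://example.com/news/3",   # new
         "https://example.com/news/3"]   # duplicate within this batch
print(filter_new_urls(batch, archived))
# -> ['https://example.com/news/3']
```

Only the surviving URLs are then fetched, which keeps each extraction run focused on genuinely new stories.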
The high-level system architecture of the DNSP framework is presented in Figure 2. The figure shows the ingestion package, two functional mediators, the archive, and the search and retrieval module. The ingestion module extracts new news URLs from the selected news sources; the mediators extract news contents and metadata and preserve the news articles; and the search module will help to disseminate the archived contents in the future, creating the Archival Information Package (AIP), as shown in Figure 3.
Newsreaders read about a story from different sources to get a diverse, broader perspective and to authenticate the information. It is challenging to navigate a huge collection without linking mechanisms and metadata, which help to retrieve relevant news from a multilingual archive for better understanding. Sophisticated linking mechanisms, well-defined meta-elements, and indexing approaches are required to create and manage such a diverse collection.
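One simple way to illustrate content-based linking is a token-overlap (Jaccard) score between articles. Note this is a generic stand-in for illustration, not the CRMS measure the DNSP framework actually uses, and the 0.3 threshold below is an arbitrary example value:

```python
def jaccard_similarity(tokens_a, tokens_b):
    # Ratio of shared distinct tokens to all distinct tokens.
    a, b = set(tokens_a), set(tokens_b)
    if not (a or b):
        return 0.0
    return len(a & b) / len(a | b)

def link_articles(new_tokens, archive, threshold=0.3):
    # Return identifiers of archived articles similar enough to link
    # to the incoming article. `archive` maps doc id -> token list.
    return [doc_id for doc_id, tokens in archive.items()
            if jaccard_similarity(new_tokens, tokens) >= threshold]

archive = {"a1": ["election", "results", "assembly"],
           "a2": ["cricket", "match", "score"]}
print(link_articles(["election", "assembly", "address"], archive))
# -> ['a1']
```

In a multilingual archive, such a score can only be computed after translation or cross-lingual mapping of tokens, which is precisely where the LRL resource gap hurts.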

Enhancing Digital News Story Extractor (DNSE)
The Digital News Story Extractor (DNSE) is a Java-based tool for extracting digital news stories from different online news websites using the JSOUP and POI libraries. Initially, the DNSE was developed for English news sources (Khan et al., 2016) and then enhanced for Urdu news articles; it has now been enhanced further by including Arabic news sources and some features for quantification. The DNSE extracts news stories from online sources, extracts meta information (metadata), and normalizes both the news content and the related metadata into XML format for preservation in the DNSA. However, the enhancement encountered the following problems, briefly discussed below. Non-Uniform Web Structure: There are many platforms and technologies for developing web-based applications, front-end ones like HTML, CSS, Java, JavaScript and its frameworks, and back-end technologies like PHP, ASP.NET, XML, and many others. Because different technologies are used, the web structure varies, making it challenging to extract the desired information.
Recency or Maintenance of Fresh Content: The contents of dynamic web applications, such as blogs and news websites, update instantly and frequently. Maintaining the recency of news content efficiently is very important, considering access frequency and network traffic issues. Rise of Anti-Scraping Tools: The biggest challenge in the extraction of news content is the rise of anti-scraping tools, for example, Captcha, which differentiates between bots and humans. The extractor gets stuck when anti-scraping tools are implemented (Khan, 2018; Khan et al., 2020).
Unknown Host Issue: An unreliable internet connection leads to unknown-host errors, and restarting the extraction after an interruption is time-consuming. Socket Timeout: Most websites temporarily block or suspend their services when their contents are accessed frequently within a specific time period during preservation. The websites assume that a bot is sending unnecessary requests and overloading the server, and they start blocking access. Extraction is important for any digital archive and becomes especially challenging when preserving low-resource contents. The enhanced DNSE is able to deal with the above challenges efficiently.
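A minimal sketch of how unknown-host and socket-timeout failures can be absorbed is a retry wrapper with exponential backoff. The function names and policy here are illustrative assumptions, not the DNSE's actual Java implementation; the fetch callable is injected so the policy can be exercised without a network:

```python
import time

def fetch_with_retry(fetch, url, retries=3, backoff=1.0):
    # `fetch` is any callable that downloads a URL and raises OSError
    # (the base class of socket timeout and unknown-host errors) on
    # failure. Retry up to `retries` times, doubling the delay each
    # time so a temporarily blocking server is not hammered.
    delay = backoff
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError:
            if attempt == retries - 1:
                raise  # give up after the final attempt
            time.sleep(delay)
            delay *= 2
```

For example, a fetcher that fails twice with a timeout and then succeeds would return its result on the third attempt instead of aborting the whole extraction run.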

News Extraction Results
The DNSA is enriched with two low-resource languages, Urdu and Arabic, with five sources providing Urdu news articles and three online sources publishing news in Arabic. The details of the included news articles from all three languages are summarized in Table 1 below. The DNSP framework is being enhanced gradually; due to the lack of resources and of sufficient financial support, research progress is slow. Initially, three local English newspapers, Dawn News, The Tribune, and The News, were selected for testing the DNSE tool (Khan et al., 2016).
The extraction/crawling results obtained after the DNSE was enriched with the two low-resource languages, Urdu and Arabic, are keenly analyzed for shortcomings of the DNSP framework and the DNSE tool.
In Figure 4, the extraction results are visualized for all ten sources of the high-resource language, English. The results show that a few of the news sources do not update their news online frequently and could be replaced by other sources for more efficient utilization of the DNSP framework.
Assessing the frequency of extraction of new stories is important as the news stream is continuous and not periodic like printed media.
The extraction process was performed daily, or after waiting for some days before performing the extraction. The average numbers of extracted URLs and unique URLs are presented in Figure 5. The figure shows that the number of new news URLs extracted is almost equal to the number of new news stories, both for the online newspapers and among the online news channels.
Processing low-resource languages is expensive in terms of time complexity and accuracy. The main problems in extending the DNSE implementation to LRLs are the non-uniform web structure, unknown-host issues, and garbage collection. Figures 6 and 7 present the average extraction of new news articles and unique URLs, respectively.
Table 2 and Figure 8 present the error rates of URL and story extraction during preservation for both the high-resource language and the low-resource languages. The LRLs have a large error rate because of the non-uniform web structure, unknown-host issues, maintenance of fresh content, anti-scraping tools, and garbage collection.

Proposed Metadata Element Set for DNSA
Metadata is as essential as the content itself because digital content is useful only when accessible. Metadata is structured information that helps to locate a digital object in the digital archive. Some metadata may not be available explicitly with news stories but may be extracted from the text of news articles. The metadata extractor module is extended with a sub-module to collect metadata from the text of news stories.
This metadata helps link multilingual news articles, based on similarity, with other news articles stored in the archive. In the DNSP framework, 28 explicit and implicit metadata elements are extracted from the source and from the news article (if available); these are used as descriptive and administrative metadata, as shown in Tables 3 and 4, respectively.
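Explicit metadata of this kind is typically published in a page's meta tags. The stdlib sketch below shows one way such fields can be collected; the attribute conventions checked here (name or property paired with content) are common across news sites but vary by source, which is one reason explicit extraction coverage differs between them:

```python
from html.parser import HTMLParser

class MetaTagExtractor(HTMLParser):
    # Collect explicit metadata published in <meta> tags of a news page.
    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("name") or attrs.get("property")
        if key and "content" in attrs:
            self.metadata[key] = attrs["content"]

sample = ('<html><head>'
          '<meta name="author" content="A. Reporter">'
          '<meta property="article:published_time" content="2022-05-01">'
          '</head></html>')
parser = MetaTagExtractor()
parser.feed(sample)
print(parser.metadata)
```

Implicit elements, by contrast, must be inferred from the article text itself, which is why they are far more expensive to extract for LRLs.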

Metadata Extraction Results
Tables 5 and 6 present the metadata extraction results during news story preservation in the DNSA for both low-resource and high-resource languages. Well-organized news websites normally keep all the explicit metadata, and only a few descriptive metadata fields are left blank. The implicit metadata is extracted from the news stories themselves, so almost all of those meta-elements are extracted, as shown in the respective tables.
Extracting explicit metadata is easy in terms of accuracy and computation in both high-resource and low-resource languages. In contrast, implicit meta-element extraction in low-resource languages is computationally expensive and less accurate than in high-resource languages. Extracting the twelve implicit meta-elements in the proposed metadata element set is not straightforward for LRLs because of the morphological complexity of these languages.

Discussion
Existing news archives can be classified into two types: graphical-format archives and partially indexed archives. It is difficult to manipulate the contents of these archives, especially to access particular news about an event, because doing so encompasses many challenges: vast archive collections (an archive created from many sources); various sources (having different platforms); a multilingual archive (an archive created from multiple languages, that is, Urdu, Arabic, and English); and low-resource languages (access becomes more complicated for news articles in low-resource languages, such as Urdu, because, lacking sophisticated tools, the preprocessing overhead is large compared to high-resource languages, such as English).
Besides these, there are many difficulties in digital news preservation, such as: extracting news from many diverse sources and technologically different platforms; extracting implicit and explicit metadata; computing similarity values among news articles; and transforming news articles into a specific standard format for future integration and access.
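The similarity computation mentioned above can be sketched with cosine similarity over term-frequency vectors, a common baseline. This is an assumption for illustration; the excerpt does not specify which similarity measure DNSP actually uses.

```python
import math
import re
from collections import Counter

def term_vector(text):
    # Bag-of-words term frequencies; \w+ keeps Unicode word characters,
    # so it also tokenizes (naively) non-Latin scripts
    return Counter(re.findall(r"\w+", text.lower(), flags=re.UNICODE))

def cosine_similarity(a, b):
    """Cosine similarity between two texts in [0, 1]."""
    va, vb = term_vector(a), term_vector(b)
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

s1 = "Flood relief goods reached Sindh on Thursday"
s2 = "Relief goods for flood victims reached Sindh"
print(round(cosine_similarity(s1, s2), 3))
```

A score near 1 indicates near-duplicate stories, which is the kind of signal needed to link related articles across sources in the archive; for LRLs the naive tokenizer above would need to be replaced by language-aware preprocessing.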
Developing an efficient extractor for multilingual content extraction is also challenging, although it may appear easy. The development of the Digital News Stories Extractor faced many challenges. Non-uniform web structure: web platforms use different technologies and frameworks for development, such as HTML, CSS, Java, JavaScript, PHP, XML, and many others, along with their frameworks, and they use different data structures and formats to deliver news content. Preservation therefore needs more versatility and AI features to crawl content from these diverse resources. The DNSA is an effort to create a full-text archive that can benefit from new technologies and approaches. The DNSP framework is developed to overcome the different challenges facing multilingual digital archives, including low-resource languages, and the ongoing research directions are briefly outlined in the future work of this paper.

Conclusions and Future Work
The preservation of news and the creation of news archives are challenging tasks. They become even more complicated when the archive contains articles from low-resourced and morphologically complex languages like Urdu and Arabic. The study introduced a multilingual news archive for Urdu, Arabic, and English news article sources from eighteen news publishing platforms. The digital news stories extractor was enhanced to address major issues in handling low-resource languages and to facilitate normalized format migration. The extraction results of the proposed 28 meta elements, comprising sixteen explicit and twelve implicit elements, are presented in detail for the high-resource language, that is, English, and the low-resource languages, that is, Urdu and Arabic. The results showed that seventeen and ten meta elements were extracted at 100% for HRL and LRL, respectively. The LRLs encountered a higher error rate during preservation than the HRL: 10% and 3%, respectively. The framework preserved, on average, 879 news stories from ten HRL sources and 553 news stories from eight LRL sources in the digital news stories archive.
The study presents details of how the framework was enhanced; a more detailed study is needed for accurate news content extraction and archiving for future access. The framework can be extended in different dimensions in the future: a standard user interface is required to enable access to the archived contents of the DNSA; the DNSE tool needs to be developed to professional standards; the meta attributes can be extended for multilingual archives and other languages, such as Urdu, Arabic, and Pashto; more implicit meta elements can be added to the proposed set after a comprehensive review of individual sources; and language-dependent metadata attributes can be added to the meta set.
Garbage collection: inconsistency in development approaches leads to erroneous extraction, collecting unwanted data such as in-text links, tags, or other code during news extraction. Identifying and preprocessing low-resource languages: the DNSE tool deployed different libraries for the identification and preprocessing of low-resource languages, and this preprocessing is computationally expensive. Firewall blocking: a few online news sources are protected from extraction by firewalls.
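The garbage-collection problem above can be illustrated with a small cleaner that drops scripts, styles, and boilerplate containers while keeping readable story text. The tag choices are illustrative assumptions, not the DNSE tool's actual filtering rules.

```python
from html.parser import HTMLParser

class StoryTextCleaner(HTMLParser):
    """Strip unwanted markup from an extracted story fragment.

    SKIP lists container tags whose content is typically garbage
    (tracking code, navigation, share widgets); the set is an
    illustrative assumption.
    """
    SKIP = {"script", "style", "nav", "aside"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # nesting depth inside skipped containers
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

raw = ('<p>Relief goods reached <a href="/sindh">Sindh</a> today.</p>'
       '<script>trackPageView();</script><aside>Share this story</aside>')
cleaner = StoryTextCleaner()
cleaner.feed(raw)
print(" ".join(cleaner.parts))
```

Note that the anchor text "Sindh" survives while its link markup is discarded, whereas the tracking script and the share widget are dropped entirely; inconsistent source markup is exactly what makes this filtering error-prone in practice.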

Figure 4. Average new news story extraction for high-resource language ''English.''

Figure 5. Average total URLs extraction and unique URLs extraction for HRL.

Error Rate in Both HRL (English) and LRLs (Urdu and Arabic) During Extraction.

Table 3. Explicit Metadata Element set for DNSA.

Figure 8. Error rate comparison in both HRL (English) and LRLs (Urdu and Arabic) during extraction.

Table 4. Implicit Metadata Element set for DNSA.