Starting Points: Initial Methodological Approaches to Archiving Digital British Islam

The following discussion is linked to the online presentation of the first set of archived content for the Digital British Islam Project. It explains aspects of the methodological approaches and issues that arose during the research and development of these archiving research processes. This article will be updated and augmented as the project progresses.[1]

The ESRC-funded Digital British Islam project is developing an archive of the different cyber-Islamic environments in the UK.[2] The results will be used in a curated gallery, which will preserve the content for future examination, form the basis of the project’s discussions, and allow for subsequent, deeper exploration of specific sites of interest to the project.[3] These sites fit within the parameters set out in the original project bid, namely religious authority, gender, and political agency (although, naturally, the archiving also explores other areas).

While there has been some initial analysis of Muslim online activities in the UK, web archiving is an emerging field of study and has not previously been applied to this subject area. Rather than using specialist archivists, this project draws on a team familiar with the subject-specific issues associated with Islam, Muslims and the internet in the UK. This approach has required the team to develop new practical and methodological skills in data acquisition, digital preservation and analysis.[4]

As a starting point, a set of spreadsheets was developed, listing organisations, platforms, mosques, interests, groups, and affiliated bodies associated with Islam and Muslims in the UK. This thematic listing (now expanded to over 300 sites) was categorised across specific areas associated with the research project to help with information management.[5] The project team developed and expanded the tentative listings based on their prior research experience in the field, alongside the results of consultations, focus groups, workshops and community interactions. The listing includes the URLs of primary sites alongside notes on the specific affiliations of organisations and the amount of data the sites generated. Data volumes were calculated initially through a series of test crawls using Archive-it.[6] The project team has developed notes and commentaries on the URLs, which form part of the archive, introducing the content and explaining its relevance.
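The listing described above can be pictured as a simple table of seed records. The following is a minimal sketch in Python, not the project's actual spreadsheet: the column names (url, category, affiliation, test_crawl_mb, note) and the sample rows are hypothetical placeholders chosen to mirror the fields mentioned in the text.

```python
import csv
import io
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Seed:
    """One row of a seed listing (field names are illustrative assumptions)."""
    url: str
    category: str
    affiliation: str
    test_crawl_mb: float  # data volume observed in a test crawl
    note: str             # short commentary explaining relevance


# Hypothetical placeholder rows; these are not real project data.
SAMPLE_CSV = """url,category,affiliation,test_crawl_mb,note
https://example.org/,religious authority,Example Mosque Trust,120.5,Placeholder entry
https://example.com/,gender,Example Network,42.0,Placeholder entry
https://example.net/,religious authority,Example Council,310.0,Placeholder entry
"""


def load_seeds(handle):
    """Read seed records from any file-like object containing CSV text."""
    reader = csv.DictReader(handle)
    return [
        Seed(
            url=row["url"],
            category=row["category"],
            affiliation=row["affiliation"],
            test_crawl_mb=float(row["test_crawl_mb"]),
            note=row["note"],
        )
        for row in reader
    ]


def group_by_category(seeds):
    """Group seeds under their thematic category for easier management."""
    grouped = defaultdict(list)
    for seed in seeds:
        grouped[seed.category].append(seed)
    return dict(grouped)


seeds = load_seeds(io.StringIO(SAMPLE_CSV))
by_category = group_by_category(seeds)
```

Keeping the test-crawl data volume alongside each URL makes it straightforward to spot data-heavy sites before a full crawl is scheduled.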

Each crawl in the initial series contained 10 to 15 websites, broadly falling under one of the project’s categories. The team subsequently analysed the results of each crawl, considering the range and depth of the content captured and the effectiveness of specific seeds. The team sought alternatives to any broken links. Developing a balanced listing across the spectrum of Muslim understandings in the UK was a significant aspect of the project. There was recognition that key players and organisations would dominate the archive, so some emphasis was placed on other zones, including influencers and platforms representing micro areas of interest to different sectors within Muslim communities across the UK. One advantage of the composition of the project team is that its members are based outside of London, with a focus on Wales and Scotland, as well as central England, which takes the project beyond the London-centric discourse about Muslim communities.[7]
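The grouping of seeds into crawls of 10 to 15 sites can be sketched as follows. This is an illustrative Python helper, not part of Archive-it: the function name batch_seeds and its max_size parameter are assumptions, and the example URLs are placeholders. It splits one category's seed list into the smallest number of batches that respects the upper limit, keeping batch sizes as even as possible.

```python
from math import ceil


def batch_seeds(urls, max_size=15):
    """Split a list of seed URLs into evenly sized batches of at most max_size."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    if not urls:
        return []
    # Smallest number of batches that keeps every batch within the limit.
    n_batches = ceil(len(urls) / max_size)
    # Spread the seeds evenly; the first `extra` batches get one more seed.
    base, extra = divmod(len(urls), n_batches)
    batches, start = [], 0
    for index in range(n_batches):
        size = base + (1 if index < extra else 0)
        batches.append(urls[start:start + size])
        start += size
    return batches


# Placeholder seeds for one category.
category_urls = [f"https://example.org/site{i}" for i in range(32)]
crawl_batches = batch_seeds(category_urls)
```

With 32 seeds this yields three batches of 11, 11 and 10 sites. Some list lengths (16, for instance) cannot be divided into batches that all fall between 10 and 15, so in such cases the sketch favours the upper limit and produces smaller batches.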

The seed URLs were used to gather the listed sites’ content through Archive-it crawling technology. The result was a listing of the sites, indicating the depth of the content, and a copy of everything ‘spidered’. The parameters were selected so that the crawl dates fell within a tight range of one to three days. The initial crawl generated substantial data, especially from top-heavy sites containing multimedia content such as videos, lectures, and podcasts. It became necessary to narrow the crawl parameters.

A subsequent top-level crawl gathered the initial page of each site under consideration; the parameters were set so that the crawl would go one page (or level) down from this initial page. The archiving technology allows any section of a site to be included or excluded, depending on the setup. Exclusions include blocking particular data-heavy content, which avoids exceeding the data allowances included in the Archive-it package. There are opportunities to return to sites later and capture further content on multiple occasions.
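The scoping logic described above (stay on the seed's site, go no more than one level down, and skip data-heavy content) can be expressed as a small decision function. The sketch below is written in Python for illustration only; it is not Archive-it's configuration syntax, and the function name in_scope, its parameters, and the list of blocked file extensions are all assumptions.

```python
from urllib.parse import urlsplit

# Assumed examples of data-heavy file types; a real list would be project-specific.
MEDIA_SUFFIXES = (".mp4", ".mp3", ".mov", ".wav", ".zip")


def in_scope(url, seed, hops, max_hops=1,
             blocked_suffixes=MEDIA_SUFFIXES, blocked_prefixes=()):
    """Decide whether a discovered URL should be captured.

    hops is the number of links followed from the seed page to reach url.
    blocked_prefixes should be given in lower case (e.g. "/podcasts/").
    """
    target, origin = urlsplit(url), urlsplit(seed)
    # Stay on the same host as the seed.
    if target.hostname != origin.hostname:
        return False
    # Go no deeper than the permitted number of levels below the seed page.
    if hops > max_hops:
        return False
    path = target.path.lower()
    # Skip data-heavy files identified by their extension.
    if path.endswith(tuple(blocked_suffixes)):
        return False
    # Skip whole sections of the site that have been excluded.
    return not any(path.startswith(prefix) for prefix in blocked_prefixes)


seed = "https://example.org/"
decisions = {
    "about page": in_scope("https://example.org/about", seed, hops=1),
    "two levels down": in_scope("https://example.org/about/team", seed, hops=2),
    "video file": in_scope("https://example.org/lectures/talk.mp4", seed, hops=1),
    "other site": in_scope("https://other.example/", seed, hops=1),
}
```

In this sketch only the "about page" is captured. Because sites can be revisited later, a follow-up crawl could relax max_hops or remove a blocked section for a site of particular interest.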

To develop this research, it was essential to explore the different methodological considerations of online archiving. In addition, the ethical aspects had to comply fully with the requirements of funders and partner institutions. This process includes formal ethics forms from each institution and data protection impact analysis. Data security, anonymisation (where relevant), site selection issues, the avoidance of specific personal data, and the sustainability and legacy of the archive were essential factors to address in our ethics approaches, both in the initial bid and in the partner universities’ subsequent ethical reviews.

Even in test mode, the initial crawl immediately produced some exciting findings. The quantity of content generated by some platforms was extensive across different media formats. The project has particular restrictions regarding the ethical and data protection dimensions of information gathering. All content obtained is drawn from the public domain. The terms and conditions of social media sites do not generally allow for formal archiving, especially given the level of personal content involved (social media is also a data-heavy zone requiring terabytes of storage space). The project therefore focused on public-facing websites, which provide a significant foundation for analysis. This does not preclude discussion of social media in its various forms, which, due to ethical considerations, will be externally linked within research papers but not formally archived.

Substantial testing through the trial crawls was integral to the research approach, as it influenced the parameters of the discussion in many ways, especially regarding the quantity and quality of data available for use. There was also an awareness that some content could quickly disappear, especially from areas of online activity that were in some ways contentious. Capturing elements of the discourse before they might disappear from view was essential. The intention was to generate a compelling snapshot of an underrepresented sector of the study of Islam and Muslims, specifically in the United Kingdom. It would demonstrate the fluid nature of online discourse, the different responses to issues, and the diverse approaches to generating information and, in some cases, responses from users.

The World Wide Web remains the core of digital discourse, even as new platforms and technologies become popular. Individual sites reflect digital legacies in terms of the long-term development and archiving of content, offering opportunities for analysis over wide timeframes, and they may have a broader audience in terms of age range and usage than some apps and social media discourse. Their levels of searchability are significant, especially for those seeking advice from particular religious perspectives.

There is no specific manual for approaching the archiving of this subject area. The team benefitted from an Archive-it training session and support. Generic advice was available through the Archive-it websites, including training, videos and manuals. This content required many months of in-depth exploration before the work commenced, so that the team would be conversant with the technology and its potential, along with possible pitfalls. The research team refined and developed the findings of this investigation in fine-tuning its methodological and analytical approaches.

The practicalities of this research project require the investigation of different aspects of technology and approaches towards presenting findings. An external website will act as an interface for the archive’s content, which Archive-it hosts across its servers. The first set of archived content will be available in August 2023. This interface forms the basis of a framework for the narration, integrated metadata, and analysis that the initial materials require. A vital aspect of this will be using different tools to visualise the acquired materials and understand the different information flows through networks and hubs.[8]

The findings will be presented through workshops, community outreach, conference activities, and publications. The metadata and associated content will be shared through open-source platforms. Once the first complete data iteration is presented online, data visualisation techniques will be applied to explore significant linkages between and within different aspects of cyber-Islamic environments in the UK. This will be the subject of further articles.

The Digital British Islam team welcomes feedback as the archive development progresses. The intention is to present new archive sets regularly via this website as an initial step towards systematising a collection of thematic pages associated with the project. In conjunction with the other elements of the project, this will facilitate a developed understanding of the state of cyber-Islamic environments in UK contexts and be relevant to broad audiences, including communities, academics, and interested observers. This approach will complement other academic work on digital Islam, providing a granular multidisciplinary and interdisciplinary approach to the subject area, with a unique methodological approach within this developing academic field. In this way, it seeks to contribute to further studies and understandings of digital religion in general, and digital Islam and Muslim communities in particular.

[1] A more technical and detailed discussion will appear in due course. This article represents work in progress.

[2] ‘Digital British Islam: How do Cyber Islamic Environments impact everyday life?’ commenced in May 2022 and is funded by the ESRC (ES/W002175/1). It is a multi-institution project led by the University of Wales Trinity Saint David, with Coventry University and the University of Edinburgh. Cyber-Islamic environments is an umbrella concept for diverse and evolving online internet media identified (or self-identified) as affiliated with Islam and Muslims. See Gary R. Bunt, How Cyber-Islamic Environments are Transforming Religious Authority (Chapel Hill: University of North Carolina Press, 2018), 7.

[3] See the earlier discussion by Laura Jones, Digital British Islam, Insights, ‘Archiving British Muslim Digital Spaces’, July 2023, https://digitalbritishislam.com/archiving-british-muslim-digital-spaces/

[4] See Molly Bragg, Kristine Hanna, ‘Web Archiving Lifecycle Model’, March 2013, https://archive-it.org/blog/learn-more/publications/web-archiving-life-cycle-model/

[5] Prof. Gary R. Bunt is Principal Investigator for the project and leads its archiving strand. Dr Laura Jones, Dr Sadek Hamid and Dr Alamgir Ahmed work directly on the archiving elements of the project.

[6] Archive-it, 3 August 2023, https://archive-it.org

[7] Digital British Islam, ‘Digital British Islam Team’, 3 August 2023, https://digitalbritishislam.com/dbiteam/

[8] Matthew S. Weber, Peter Monge, ‘The Flow of Digital News in a Network of Sources, Authorities, and Hubs’, Journal of Communication, Volume 61, Issue 6, December 2011, Pages 1062–1081, https://doi.org/10.1111/j.1460-2466.2011.01596.x