New scans of Herculaneum papyri at insane resolutions
~15x more voxels per cm³ and many more incident energies
Together with EduceLab, we’re announcing the release of a new dataset that shatters the records set by our previous scans: an order of magnitude better resolution, more incident energies, completely new scrolls, a unique scroll + fragment combination, and a new carbon phantom scan.
We hope that this will even further accelerate the already spectacular results all of you have achieved. The race is still wide open — no one has submitted for the Grand Prize yet. These new scans may be used for the Grand Prize. Let’s read these scrolls!
Why more scans?
The Vesuvius Challenge has always been heavily data-constrained: it’s hard to get access to scrolls, and expensive to make high-resolution CT scans at particle accelerators.
The dataset that we’ve been working with so far consists of two scrolls scanned at 7.91 µm voxel size and (mostly) 54 keV incident energy, and four fragments imaged at 3.24 µm voxel size with both 54 keV and 88 keV.
The difference between the voxel sizes sounds small, but 3.24 µm results in about 15 times more voxels per cm³ than 7.91 µm! Early investigations by Ryan Chesler and Kirchhoff, Rokuss, and Hamm showed how much this difference in resolution matters for ink detection models trained on fragments:
3.24 µm vs 7.91 µm voxel sizes for the top Kaggle ink detection model
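Where does the ~15x figure come from? It follows directly from the cube of the ratio of voxel edge lengths, as this quick back-of-the-envelope check shows:

```python
# Ratio of voxel counts per unit volume between the two scan resolutions.
# A voxel of edge length s occupies s**3, so shrinking the edge length
# multiplies the voxel count in the same physical volume by the cube
# of the ratio.

coarse = 7.91  # µm voxel edge length (optical module 2)
fine = 3.24    # µm voxel edge length (optical module 3)

ratio = (coarse / fine) ** 3
print(f"{ratio:.1f}x more voxels per cm³")  # prints "14.6x more voxels per cm³"
```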
The incident energy is another variable which is even less well understood. We have no strong empirical data on how much it matters which energy level is used for the CT scans, if at all.
Finally, it’s hard to say beforehand what the state of a scroll is on the inside. It turns out that one of our scrolls was very damaged indeed, making it extremely hard to segment. We’ve effectively been working with only half the scroll data!
Even so, this dataset has already yielded spectacular results, and we believe that when we started this competition it was already the largest cultural heritage dataset ever released (in terms of file size). Imagine what we can achieve with even more and better data!
Today, we’re officially announcing a new dataset that shatters previous records: 2 scrolls and 2 fragments, scanned at both (!) 3.24 µm and 7.91 µm, and 4 (!) different energy levels. We also reimaged the lab-made carbon phantom scroll at a higher resolution than ever before.
These scans were possible because of our unique collaboration: Professor Seales’ EduceLab led the scanning effort, bringing decades of expertise and key relationships to the task. The Officina dei Papiri Ercolanesi at the national library in Naples loaned the scrolls and fragments, which they’ve preserved for centuries. Diamond Light Source again provided state-of-the-art scanning facilities at their particle accelerator. Our donors — tech entrepreneurs and venture capitalists with a soft spot for ancient history — made this financially possible. And all of you, our contestants and community, showed everyone what results we can get when releasing this data!
We’ve previously highlighted the work of professor Brent Seales (who leads EduceLab) and Dr. Stephen Parsons (who first discovered ML works for ink detection), but there is another EduceLab member we would like to properly introduce: Seth Parker. He has not only led this and previous scanning efforts, but is also the main author of the virtual unwrapping software that was a key piece of the recent breakthroughs (Volume Cartographer).
Seth has been with Dr. Seales’ group for over a decade and actually started in video production. He then taught himself programming and C++ (!), transforming into an academic researcher. He started the Volume Cartographer project (and related libraries) in 2013 to work on scans of Herculaneum scrolls from 2009, but its first success was the virtual unwrapping of the En-Gedi scroll, published in Science Advances in 2016. Collaborating with many students and researchers over the years (and now with some of you!), Seth has been the one person carrying the project through. This project stands on the shoulders of many giants, and Seth is among the tallest. Thank you, Seth!
In June 2023, Seth, Stephen, and JP visited Italy to work on the machines Seth had set up there to photograph all the trays of fragments stored at the Biblioteca Nazionale di Napoli, which we reported on back then.
We didn’t share this at the time, but we were also there to select scrolls and fragments for another CT scanning session and to capture all of the data we would need to safely mount those samples in the scanner at Diamond. For the scrolls, this meant capturing a full set of exterior photographs to produce 3D models of the scrolls using photogrammetry.
Back home, Seth and Stephen used the 3D models to design and 3D print transport and scan cases for the scrolls (made of Nylon 12 printed by HP Multi Jet Fusion).
In late September the conservators packed the samples and flew them to the Diamond particle accelerator in Oxford.
The scrolls almost didn’t make it! Severe storms caused commercial flights in Naples to be grounded. We only had a few precious days of scanning time, with the next slot being in April 2024. Nat Friedman, the instigator of the Vesuvius Challenge, quickly sprang into action, chartering a private plane. Everyone got there just in time for the scanning session!
The scanning itself went pretty smoothly. We successfully negotiated an extra day, which meant that we could do all the combinations of voxel size and incident energies that we wanted.
Prior to the scan session, Seth, Stephen, and Vesuvius Challenge volunteer Daniel Havíř worked on a reproducible reconstruction pipeline that could be run immediately after scan completion in order to catch any issues as early as possible. This was a significant improvement over the workflow from 2019, when the team had to wait days or weeks after the scan session to see the first slices.
There was one snag... on the final night of scanning, there was an explosion at a nearby recycling plant. The particle accelerator was shut down for an hour, but we were quickly back in action.
After a few sleepless nights, the team had collected all the data we needed.
In the following weeks, the team worked hard to turn the raw scans into volumes that can be read by Volume Cartographer. Today we are releasing the first volume, but there are many more to come.
2 new scrolls
PHerc 332: diameter 26 mm, length 77 mm. Has been partially unrolled physically. This is the core of the scroll that remains intact.
PHerc 1667: diameter 30 mm, length 85 mm. Like PHerc 332, it has been partially unrolled physically and this is the core that remains intact.
2 new fragments
PHerc 1667, Cr. 1, Fr. 3: 14x23 mm. A fragment believed to come from the unrolled part of PHerc 1667 (above). Shows approximately 6 characters.
PHerc 51, Cr. 4, Fr. 48: 9x28 mm. A fragment which shows approximately 22 characters.
Carbon phantom scroll with known ground truth — the same one as in this seminal paper. However, this is at the bottom of our priority list and we might not get to it before the end of the year.
More details in the technical overview. We’ll release new versions as we process the data.
All samples are from the Officina dei Papiri Ercolanesi, Biblioteca Nazionale di Napoli Vittorio Emanuele III in Naples, Italy. Scanning was done on the I12 beamline, using optical modules 2 and 3, which have pixel sizes of 7.91 µm and 3.24 µm, respectively. We scanned with monochromatic incident energies of 53 keV, 70 keV, 88 keV, and 105 keV.
This means that for each of the 4 main samples (2 scrolls + 2 fragments) we have 8 volumes (4 energies at 7.91 µm and 4 energies at 3.24 µm). For the carbon phantom we only used the 3.24 µm optical module, resulting in 4 volumes. In total, we’ll be releasing 36 volumes!
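The tally works out as stated, simply restating the counts above:

```python
samples = 4          # 2 scrolls + 2 fragments
resolutions = 2      # 7.91 µm and 3.24 µm optical modules
energies = 4         # 53, 70, 88, and 105 keV
phantom_volumes = 4  # carbon phantom: 3.24 µm only, all 4 energies

total = samples * resolutions * energies + phantom_volumes
print(total)  # prints 36
```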
Our goal with this data is to segment each sample only once and then reuse those segmentations across all of the other volumes. To that end, we will identify a 3.24 µm volume as the canonical volume for each object, meaning it will provide our reference coordinate space for segmentation. But how will we map segmentations across volumes?
As much as possible, we avoided touching or remounting the samples between scans. This means that, with a few exceptions, volumes of the same sample and resolution should naturally be pretty well aligned across energies. Across resolutions, the samples will be in approximately the same orientation, but some further alignment will be necessary. Fortunately, Volume Cartographer already has some features to this end, and we’ll be expanding on those in the coming weeks to make working across volumes as easy and seamless as possible — your help on this would be very welcome. :)
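To make the cross-resolution mapping concrete, here is a minimal sketch (the function name and the translation offset are hypothetical; the real alignment machinery lives in Volume Cartographer). Assuming the two volumes differ only by a uniform scale plus a small rigid offset recovered by an alignment step, a coordinate in the canonical 3.24 µm space maps into a 7.91 µm volume like so:

```python
import numpy as np

# Hypothetical sketch: map a voxel coordinate from the canonical
# 3.24 µm volume into a 7.91 µm volume of the same sample, assuming
# the volumes are related by a uniform scale plus a translation
# (the offset would come from an alignment step).

CANONICAL_UM = 3.24
COARSE_UM = 7.91

def to_coarse(zyx, offset=np.zeros(3)):
    """Scale a (z, y, x) index from 3.24 µm space to 7.91 µm space."""
    return np.asarray(zyx, dtype=float) * (CANONICAL_UM / COARSE_UM) + offset

print(to_coarse([1000, 2000, 3000]))  # ≈ [409.6, 819.2, 1228.8]
```

The resulting coordinates are fractional, so sampling a coarse volume at these points would still need interpolation (e.g. trilinear).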
It takes a long time to reconstruct and verify the quality of 36 scans, so many of the reconstructions are still being performed. Our priority is to make the canonical volumes for each sample available first, with particular emphasis on the 53 keV and 88 keV energies to match the work already done on the previous datasets. After those are available, we’ll focus on releasing the extra energies and resolutions. The first volume (PHerc 332, 53 keV, 3.24 µm, 4.1 TB) has already been uploaded to the data server (/full-scrolls/PHerc0332.volpkg), as well as a few initial segments.
Open Source Prizes
We currently have four Open Source prizes of $5,000 each, with a deadline of Nov 30th. We would love to see this dataset being used in open source projects that increase the likelihood of someone getting to the Grand Prize. Here are some ideas:
Anything that helps our Segmentation Team work with these scrolls more efficiently. They have a list of feature requests, but here are some more that are specific to these datasets:
15x more voxels per cm³ also means 15x more disk space, RAM, and CPU needed to segment the same area! We would love to see features in Volume Cartographer and other tools to mitigate this.
For example, operating on volumes with reduced resolution while still using the original coordinate space.
Or marking an area of interest and then only loading that area into RAM.
Even tools that make it easier to swap out parts of a volume between HDD and NVMe would be helpful.
Running open source ink detection models on this, either from the First Letters Prize, or from Kaggle. Looking at Kaggle again would be especially interesting, because we finally have scroll datasets with the same resolution as the Kaggle data!
Manually looking for patterns like crackle. Perhaps with this higher resolution we can see more?
Comparing performance of ink detection on different voxel sizes and energies, or even using multiple energies at once for ink detection.
Creating 3D ink detection models that are invariant to the orientation of papyrus. Then running that on entire scrolls, so we can find hotspots of where to segment.
Analyzing effective resolution of each volume. How much extra information do we actually gain by using a high fidelity optical module?
Our Segmentation Team has already been hard at work producing some initial segments, and we might do an extra release early next week in addition to our regular Friday releases. Stay tuned.
Thanks to everyone who has made this happen, especially Seth, Stephen, Daniel, the rest of the EduceLab team, Officina staff, Diamond staff, and our donors. We can’t wait to see what y’all will do with this!