Hello,
I am currently working with the get-files endpoint to fetch the list of files for a snapshot from the universal-internet-dataset-v2-ipv4 dataset, and I would like to clarify how to verify dataset completeness.
For context, although I’m posting this from a personal account, I’m using API credentials associated with an organization headed by my research group supervisor, whose account has been granted research access to the data.
While retrieving the files for the snapshot universal-internet-dataset-v2-ipv4_20260317, I observed the following:
- Pagination proceeded normally up to page 100
- Each page returned exactly 100 file entries
- In total, I obtained 10,000 files (indexed 0–9999) with no gaps
- On page 100, the response did not include a nextPage token
- Repeating the requests yields consistent results (same files), although the page tokens themselves change between runs
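A minimal sketch of the enumeration logic described above, in case it helps spot a mistake on my side. `fetch_page` is a hypothetical stand-in for the actual get-files request, and the `files` key is an assumption about the response shape rather than the documented schema; only the nextPage token is taken from what I observed.

```python
from typing import Callable, Optional


def list_all_files(fetch_page: Callable[[Optional[str]], dict]) -> list:
    """Follow nextPage tokens until a response no longer contains one.

    fetch_page(token) is a hypothetical callable that performs one
    get-files request and returns the parsed JSON response as a dict.
    """
    files = []
    token = None          # no token on the first request
    seen_tokens = set()   # guards against a server returning a repeated token

    while True:
        page = fetch_page(token)
        files.extend(page.get("files", []))  # "files" key is assumed

        token = page.get("nextPage")
        if not token:
            # Missing or empty token: treated here as end of listing.
            break
        if token in seen_tokens:
            raise RuntimeError("page token repeated; aborting to avoid a loop")
        seen_tokens.add(token)

    return files
```

With this logic, the loop ends purely because the token is absent, so it cannot tell a genuinely complete listing apart from one cut off by a server-side limit.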
I performed the following validation steps:
- Verified file size consistency against the expected size (sizeBytes) provided in the API response
- Confirmed file index continuity (no missing files)
- Re-requested the last pages multiple times to ensure consistency
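The first two checks can be sketched roughly as follows. This is only an illustration under assumptions: the entry keys `index` and `name` are hypothetical names I use here for the file index and file name, while `sizeBytes` is the field from the API response.

```python
import os


def check_snapshot(entries, local_dir):
    """Return a list of problems found; an empty list means all checks passed.

    entries: file entries from the listing, each assumed to be a dict with
             hypothetical keys "index" and "name", plus "sizeBytes".
    local_dir: directory containing the downloaded files.
    """
    problems = []

    # Index continuity: indices should be exactly 0..N-1 with no duplicates.
    indices = sorted(e["index"] for e in entries)
    expected = list(range(len(entries)))
    if indices != expected:
        missing = sorted(set(expected) - set(indices))
        problems.append(f"index gaps or duplicates; missing: {missing}")

    # Size consistency: the on-disk size must match the reported sizeBytes.
    for e in entries:
        path = os.path.join(local_dir, e["name"])
        if not os.path.exists(path):
            problems.append(f"{e['name']}: not downloaded")
        elif os.path.getsize(path) != e["sizeBytes"]:
            problems.append(f"{e['name']}: size differs from sizeBytes")

    return problems
```

Both checks only validate the files that were listed. Neither can detect a listing that stopped early at a page boundary, which is the case I am unsure about.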
However, I could not find in the documentation:
- Whether there is a maximum number of pages or files per snapshot
- Whether 10,000 files is an expected fixed partitioning scheme
- How to definitively confirm that the dataset is complete and not truncated by possible pagination limits
My questions are:
- Does the absence of a nextPage token reliably indicate that all files for the snapshot have been listed?
- Is there a known or fixed number of files per dataset snapshot (e.g., 10,000 files)?
- Is there any recommended way to verify completeness of a downloaded dataset snapshot?
- Are there any pagination limits (e.g., max pages or result window) that could cause incomplete enumeration without explicit indication?
I am also currently downloading another dataset snapshot (hosts-ipv4 from the same date) to compare behavior.
Any clarification would be greatly appreciated.
Thank you in advance!
Regards,
Bernardo.
