Stubble wrote: ↑Sun Aug 17, 2025 5:46 pm
HansHill wrote: ↑Sun Aug 17, 2025 5:17 pm
My tinfoil hat is pinching me too, Mr Stubble. It's times like this, with so much uncertainty around the future of the net, that I'm urging the posters here to be smart and create multiple backups and archives of all the material they are likely to need.
Aim for backups and redundancy. That can be something as basic as an old clunker of a spinning-metal HDD, a full RAID array, or ideally both.
I highly recommend a RAID. I also recommend a stock of microSD cards.
For a data RAID, I recommend a lightweight Linux distro running 'in a jar' (i.e. kept offline), on older, cheaper, abundant hardware.
Going back to microSD cards, I just can't see a reason not to keep using them the way we used 3.5" floppies.
For anyone able to assist with the above, this is the tool I used to archive (and ultimately resurrect) the old CODOH forum we have available through the present forum:
https://www.httrack.com/
It isn't perfect (it tends to miss some of the middle pages in very long threads of 26+ pages -- at least on the old forum, it only captured the first and last 13 pages of each for some reason), but it does a really good job overall of capturing the exact structure, links, etc.
I'd recommend it for any website you want to back up, so long as it's publicly available. But it's still important to save specific contributions individually, just in case the full backup fails for whatever reason.
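For individual pages, even a few lines of Python using only the standard library will do the job. This is just a sketch -- the URL and output filename are placeholders, so point them at whatever you actually want to keep:
[code]
# Save one specific page locally, in case the full mirror misses it.
import urllib.request

url = "https://www.codoh.com/articles/"  # placeholder: the page you want to keep

req = urllib.request.Request(
    url,
    # Some sites reject Python's default user agent, so present a browser-like one.
    headers={"User-Agent": "Mozilla/5.0"},
)
with urllib.request.urlopen(req) as resp:
    html = resp.read()

with open("saved_page.html", "wb") as f:  # placeholder output filename
    f.write(html)
[/code]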
[EDIT: When you're setting up a website 'scrape', you have to do a few things:
- Specify the project name, etc. (whatever you want to call it)
- Indicate the parent URL you want to scrape everything within (e.g. if you put https://www.codoh.com/, it will scrape the ENTIRE CODOH site, whereas if you put https://www.codoh.com/articles/, you only get everything within the /articles/ subdirectory)
- If you run into errors, you might want to change the browser identity (the user-agent string, e.g. Mozilla), as this is basically the browser that the software 'pretends' to be while scraping; presenting a normal browser can make it harder for websites to detect and block the scraper.
- There is a settings page where you can indicate which file types you'd like to include in the scrape. I usually add *.pdf, but you can decide what to include
- I think that's mainly it...? There is also an option to ignore 'robots.txt', but that's slightly controversial for some websites. 'robots.txt' is essentially a website's way of telling web scrapers which pages they are 'not supposed to' or 'not allowed' to scrape; these rules can simply be ignored (allowing a total scrape, where possible), but that could raise questions about copyright or just good manners, etc. For anyone who'd rather script it, there's a rough command-line equivalent sketched just below.]
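The same settings translate roughly to HTTrack's command-line tool, driven here from a small Python script. Treat it as a sketch only: it assumes httrack is installed and on your PATH, the URL, output folder and user-agent string are placeholders, and it's worth double-checking the flags against 'httrack --help' before relying on them.
[code]
# Rough command-line equivalent of the GUI steps above, run from Python.
import subprocess

cmd = [
    "httrack",
    "https://www.codoh.com/articles/",   # parent URL: only this subdirectory gets mirrored
    "-O", "codoh_articles_mirror",       # project/output folder (call it whatever you want)
    "+*.pdf",                            # also grab linked PDF files
    "-F", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # browser identity the scraper presents
    "-s0",                               # ignore robots.txt (see the caveats above)
]
subprocess.run(cmd, check=True)
[/code]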