Backup your PMs with wget before the site goes down

Discussion in 'Off Topic Discussion' started by -=FamilyGuy=-, Jun 2, 2019.

  1. -=FamilyGuy=-

    -=FamilyGuy=- Site Supporter 2049

    Joined:
    Mar 3, 2007
    Messages:
    3,097
    Likes Received:
    1,046
    You'll need the wget program. It's easily available for Linux/OSX/Windows, just google it.

    1. Install Firefox and the export cookies addon: https://addons.mozilla.org/firefox/addon/export-cookies-txt/
    2. Log in to AG, making sure to tick "stay logged in". Use the addon to export the cookies for AG to a text file; mine is called "cookies-assemblergames-com.txt".
    3. Create a new folder, place the cookies file inside and open a console/terminal in the same folder.
    4. Run the following command (copy including the conversation URL):
      wget -mkEpnp --execute robots=off --load-cookies=cookies-assemblergames-com.txt https://assemblergames.com/conversations/
    5. WAIT!
    6. Rename the newly created "assemblergames.com" folder to A.
    7. Run the following command (copy including the conversation URL):
      wget -mkEp --execute robots=off -I/attachments/ -I/data/ -I/conversations --load-cookies=cookies-assemblergames-com.txt https://assemblergames.com/conversations/
    8. Rename the folder "assemblergames.com" to B.
    9. Create a third folder called Final, copy the content of B to it.
    10. Copy the content of A over Final, overwriting/merging everything that was already there from B.
    11. Change the _bH variable to ./ (current dir) in every HTML file. On Linux and probably OSX:
      find ./Final -type f -exec sed -i -e 's/_bH = "https:\/\/assemblergames\.com\/";/".\/"/g' {} \;
    12. Fix links of attachments:
      find ./Final/conversations/ -type f -exec sed -i -e 's/"https:\/\/assemblergames\.com\/attachments\//"\.\.\/\.\.\/attachments\//g' {} \;
    13. Profits $$$
    14. Like and subscribe! /jk
    You should now have an offline backup of your conversations/PMs in the Final folder. The main html file to open with your browser is Final/conversations/index.html

    Good luck!

    PS: My backup isn't done yet, so consider this untested. Use at your own risk.

    Edit: Some absolute links might have to be converted to local ones using external scripts afterwards, currently testing. All data is safe though, just inconvenient to browse.

    Everything should work pretty well now. When you click on an attachment, it'll open a basic file browsing page with an index.html file; that file is actually your attachment. Right-click it and save as.

    EDIT: Current bash script for Linux; it should also work under Windows via cygwin or on MacOS. Put it in an empty folder along with cookies-assemblergames-com.txt and execute it.
    Testing right now from a fresh folder with my own PMs.
    TESTED! It works fine!
    Code:
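    #!/bin/bash
    
    # Pass 1: mirror the conversations only.
    # -m mirror, -k convert links, -E add .html extensions, -p page requisites, -np never go above /conversations/
    wget -mkEpnp --execute robots=off --load-cookies=cookies-assemblergames-com.txt https://assemblergames.com/conversations/
    
    mv "assemblergames.com" A
    
    # Pass 2: same mirror without -np, limited by -I to the listed directories, so attachments get fetched too.
    wget -mkEp --execute robots=off -I/attachments/ -I/data/ -I/conversations --load-cookies=cookies-assemblergames-com.txt https://assemblergames.com/conversations/
    
    mv "assemblergames.com" B
    
    # Merge both passes into Final: B first, then A on top of it.
    mkdir Final
    
    cp -rf B/* Final/
    
    cp -rf A/* Final/
    
    # Replace the text _bH = "https://assemblergames.com/"; with "./" in every file.
    find ./Final -type f -exec sed -i -e 's/_bH = "https:\/\/assemblergames\.com\/";/".\/"/g' {} \;
    
    # Point attachment links in the conversation pages at the local ../../attachments/ copies.
    find ./Final/conversations/ -type f -exec sed -i -e 's/"https:\/\/assemblergames\.com\/attachments\//"\.\.\/\.\.\/attachments\//g' {} \;
    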
     
    Last edited: Jun 3, 2019
  2. -=FamilyGuy=-

    -=FamilyGuy=- Site Supporter 2049

    There's a slight issue right now with absolute links, but I should be able to fix it using SED. Stay tuned.
    Fixed with FIND and SED, might need to adapt to your OS.

    Left to fix:
    • Some attachments might not get downloaded: exploring wget options to try and fix this
      Should be fixed as of latest edit.
    Let me know if you find another issue.
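    On adapting to your OS: as far as I know, GNU sed (Linux) accepts a bare -i, while the BSD sed shipped with macOS expects a backup suffix right after it. Here is a minimal sketch of one workaround, assuming only standard sed: attach a suffix to -i, which both variants accept, then delete the backup. The name sed_inplace is made up for this example and is not part of the procedure above.

```shell
#!/bin/sh
# sed_inplace EXPR FILE...
# Hypothetical helper: edit files in place in a form that both GNU sed and
# BSD/macOS sed accept, by attaching a backup suffix directly to -i.
sed_inplace() {
  expr=$1
  shift
  for f in "$@"; do
    # -i.bak writes the edited file and keeps the original as FILE.bak,
    # which is removed again once sed has succeeded.
    sed -i.bak -e "$expr" "$f" && rm -f "$f.bak"
  done
}

# Example call (the path is an assumption about your backup layout):
# sed_inplace 's|"https://assemblergames\.com/attachments/|"../../attachments/|g' \
#   ./Final/conversations/some-conversation.1/index.html
```

    The helper takes the sed expression first and any number of files after it, so it can be called from a loop over the files in ./Final/conversations/.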
     
    Last edited: Jun 3, 2019
  3. SONIC3D

    SONIC3D Spirited Member

    Joined:
    Oct 30, 2008
    Messages:
    151
    Likes Received:
    38
    Thanks a lot.
     
    -=FamilyGuy=- likes this.
  4. -=FamilyGuy=-

    -=FamilyGuy=- Site Supporter 2049

    You're welcome.

    I just updated the procedure. It's a tad more complicated and completely unoptimised, but it also properly saves attachments.
    Let me know if anything is unclear.
     
    Last edited: Jun 3, 2019
  5. -=FamilyGuy=-

    -=FamilyGuy=- Site Supporter 2049

    Updated first post with tuned commands and the Linux script I tested overnight.

    This should allow you to easily back up your PMs with attachments for comfortable viewing with AG's CSS and styles.

    Latest script:
    Code:
    #!/bin/bash
    
    wget -mkEpnp --execute robots=off --load-cookies=cookies-assemblergames-com.txt https://assemblergames.com/conversations/
    
    mv "assemblergames.com" A
    
    wget -mkEp --execute robots=off -I/attachments/ -I/data/ -I/conversations --load-cookies=cookies-assemblergames-com.txt https://assemblergames.com/conversations/
    
    mv "assemblergames.com" B
    
    mkdir Final
    
    cp -rf B/* Final/
    
    cp -rf A/* Final/
    
    find ./Final -type f -exec sed -i -e 's/_bH = "https:\/\/assemblergames\.com\/";/".\/"/g' {} \;
    
    find ./Final/conversations/ -type f -exec sed -i -e 's/"https:\/\/assemblergames\.com\/attachments\//"\.\.\/\.\.\/attachments\//g' {} \;
    
     
  6. T_chan

    T_chan Gutsy Member

    Joined:
    Apr 13, 2008
    Messages:
    470
    Likes Received:
    65
    Big Big Big Thanks for that !
    I would probably have lost my mails, out of pure laziness :)

    Tried on Ubuntu 19, works like a charm !

    Only tiny problem that I see is that when I click on an image to see the detailed version (eg: file:///.../Final/attachments/img_0363-jpg.35183/), nothing happens.
    When I look at the folders, I see indeed a folder /attachments/img_0363-jpg.35183/, but inside it there is the high-res .jpg file renamed as index.html

    (I executed the steps manually, not via the script)
     
    -=FamilyGuy=- likes this.
  7. -=FamilyGuy=-

    -=FamilyGuy=- Site Supporter 2049

    In my experience, the attachments aren't completely right. They do appear as index.html files on disk, probably for security purposes, but they fail to convert properly to the right filename when accessed.

    In my case, with Chrome, I see a basic file explorer with the proper index.html file in it. It might be a good idea to keep both the A and B folders in case we figure out how to completely fix it in the future. For now I consider this good enough ;)

    Edit: The find/sed command for attachments, the second one, should only be applied to the conversations folder, not the whole thing.

    Also, be aware of the difference between cp -rf A/ and cp -rf A/* if you're doing it manually.
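    A minimal sketch of that difference, using throwaway directories (the names demo, nested and merged are made up for illustration). It assumes GNU cp, where copying the directory itself nests it inside the target; BSD cp on macOS may treat the trailing slash differently. Also note that the * glob skips hidden files.

```shell
#!/bin/sh
# Throwaway demo directories; every name here is hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/A/conversations" "$demo/nested" "$demo/merged"
echo "hello" > "$demo/A/conversations/index.html"

# Copying the directory itself: with GNU cp, A ends up inside the target.
cp -rf "$demo/A/" "$demo/nested/"

# Copying the contents: files land directly in the target, which is what
# the merge into Final relies on.
cp -rf "$demo/A/"* "$demo/merged/"

# List the resulting files, relative to the demo directory.
(cd "$demo" && find nested merged -type f)
```

    With GNU cp the listing should show the file under nested/A/conversations/ in the first case and under merged/conversations/ in the second.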

    Maybe try the script in a new folder just in case.
     
    Last edited: Jun 3, 2019
  8. T_chan

    T_chan Gutsy Member

    It's more than enough indeed :)

    The problem in fact is already appearing when downloading the files, so before applying the changes.
    I just checked the source code of the pages, and that is indeed where it comes from.
    The small image is a correct & easy to understand IMG SRC...
    The bigger image is an A HREF=... /file.jpg.1234/, but it in fact points directly to the jpg file... no wonder wget has problems with that :)
     
  9. -=FamilyGuy=-

    -=FamilyGuy=- Site Supporter 2049

    Thanks! Happy to have saved some info for you, and hopefully others!

    If you have any ideas on how to fix it, I'm open to suggestions. Maybe try it with Chrome?

    I think the index.html thing is probably on purpose though. There may be exploits in pictures that can use the thumbnail generation from a file browser to execute code, so they might purposefully hide the extension of the files that the users upload themselves. Maybe we're missing some logic in the scraping that should do the apparent conversions when accessing the attachments.
     
    Last edited: Jun 3, 2019
  10. T_chan

    T_chan Gutsy Member

    Just launched the download again with the extra option --content-disposition, it does seem to do the job.
    (download not finished yet, not tested fully yet, I'm still downloading for folder A :))
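    For reference, a sketch of what the first command might look like with that option added. Where exactly the flag goes is an assumption; only the option name --content-disposition comes from the post above, and the rest is copied from the first wget command in this thread. The sketch only prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# Options from the first wget command in this thread, plus the
# --content-disposition flag (its position in the list is an assumption).
wget_opts='-mkEpnp --content-disposition --execute robots=off --load-cookies=cookies-assemblergames-com.txt'
url='https://assemblergames.com/conversations/'

# Print the full command for review; remove "echo" to actually run it.
# $wget_opts is left unquoted on purpose so it splits into separate options.
echo wget $wget_opts "$url"
```

    The second command (the one with the -I options) would presumably take the same flag.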
     
  11. -=FamilyGuy=-

    -=FamilyGuy=- Site Supporter 2049

    Interesting, update me once done!
     
  12. T_chan

    T_chan Gutsy Member

    Done.
    The images seem to be saved with the correct extension... but now the link in the conversation does not work of course, because wget "invented" a filename :)

    eg: in conversation: link to file:///.../Final/attachments/img_0348-1515-jpg.35166/
    in folder: file:///.../Final/attachments/img_0363-jpg.35183/IMG_0363.JPG

    oh well... that's good enough for me for now :)
     
  13. -=FamilyGuy=-

    -=FamilyGuy=- Site Supporter 2049

    We could build a rename rule to rename the index files to the proper filename, but I don't have the time to look into that until much later.

    Right now I won't change the proposed method.
     
  14. -=FamilyGuy=-

    -=FamilyGuy=- Site Supporter 2049

    I remember that triple dots situation when I was testing the last find/sed command at one point.

    Ensure this is the one you use:
    find ./Final/conversations/ -type f -exec sed -i -e 's/"https:\/\/assemblergames\.com\/attachments\//"\.\.\/\.\.\/attachments\//g' {} \;

    Basically, "https://assemblergames.com/attachments/ should be substituted by "../../attachments/ in the conversation html files, and in nothing else. Escaping special regular expression characters with sed can be confusing and I made an error before, which is corrected now.
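    One way to sidestep most of the escaping, as a sketch: sed accepts another character as the delimiter of the s command, so with | the slashes in the URL can stay unescaped. The function name and the sample line below are made up for illustration; the substitution itself is meant to match the command above.

```shell
#!/bin/sh
# Same substitution as the find/sed command above, but using | as the
# delimiter of the s command so the slashes in the URL need no escaping.
fix_attachment_links() {
  sed -e 's|"https://assemblergames\.com/attachments/|"../../attachments/|g'
}

# Sample line made up for illustration, not copied from a real page.
sample='<a href="https://assemblergames.com/attachments/img_0363-jpg.35183/">'
printf '%s\n' "$sample" | fix_attachment_links
# prints: <a href="../../attachments/img_0363-jpg.35183/">
```

    Only the dot in the domain still needs a backslash, because it is a regular expression wildcard.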
     
  15. T_chan

    T_chan Gutsy Member

    The triple dots are just from me cutting off the first part of the url when pasting it into a post here :)
    The script/download instructions work super well, no problem with that at all !
    It's the HTML code that is a little bit too special for Firefox, but it's not that critical at this point.
    Just like you wrote, this can be corrected one day with an extra small script if somebody's motivated enough to do this :)
    Something like:
    a- from folder ./Final/conversations/, go over all html files
    b- whenever there is a HREF ending with something like filename-jpg.1234/ and pointing to the attachments,
    c- go into that attachments folder & rename the index.html to filename.jpg
    (& same for png/zip/...)
    d- & append that filename.jpg to the original HREF link

    (With the --content-disposition I mentioned above, c- would not be needed)

    Not motivated enough to do this at this point however, the download as-is is more than enough for me :)
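    For whoever picks this up later, a rough sketch of steps a- to d-. It has not been run against a real backup. It assumes every attachment folder is named like img_0363-jpg.35183 with the real file stored as index.html inside, and that links end in a slash before the closing quote. The function name fix_attachments is made up, and the reconstructed filename is only a guess from the folder name, so the original upper/lower case is lost.

```shell
#!/bin/sh
# fix_attachments ROOT
# Hypothetical helper sketching steps a- to d- for a backup rooted at ROOT
# (e.g. ./Final). Assumes attachment folders are named "name-ext.id" and
# hold the real file saved as index.html.
fix_attachments() {
  root=$1
  for dir in "$root"/attachments/*/; do
    [ -f "${dir}index.html" ] || continue
    slug=$(basename "$dir")        # img_0363-jpg.35183
    stem=${slug%.*}                # img_0363-jpg
    case $stem in
      *-*) ;;                      # has a "-ext" part, carry on
      *) continue ;;               # unexpected name, leave it untouched
    esac
    name="${stem%-*}.${stem##*-}"  # img_0363.jpg
    # c- rename index.html to the reconstructed filename
    mv "${dir}index.html" "${dir}${name}"
    # a-, b-, d- in every conversation page, append that filename to
    # each link that points at this attachment folder
    find "$root/conversations" -type f -name '*.html' -exec \
      sed -i.bak -e "s|attachments/${slug}/\"|attachments/${slug}/${name}\"|g" {} +
  done
  # Remove the backups that -i.bak leaves behind.
  find "$root/conversations" -type f -name '*.html.bak' -exec rm -f {} +
}

# Example call, on a copy of the backup rather than the only one:
# fix_attachments ./Final
```

    It loops over the attachment folders rather than parsing the HTML first, which is slower but keeps the script short.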
     
  16. Nemesis

    Nemesis Peppy Member

    Joined:
    Mar 22, 2007
    Messages:
    323
    Likes Received:
    154
  17. tichua

    tichua Site Supporter 2012,2013,2014,2015

    Joined:
    Feb 5, 2008
    Messages:
    501
    Likes Received:
    48
    just backed up mine. Thanks a lot
     
    -=FamilyGuy=- likes this.
  18. tichua

    tichua Site Supporter 2012,2013,2014,2015

    -=FamilyGuy=-, is it possible to back up the entire site with this method?
     
  19. -=FamilyGuy=-

    -=FamilyGuy=- Site Supporter 2049

    Yes, with the first wget command line, using the main site instead of the conversations subfolder.

    Some other people are already working on backing up AG and plan on hosting the archive online in the future. So unless you really want a personal copy, I'd hold back for now.
     
    Last edited: Jun 6, 2019
