DM-14870

eups.lsst.codes sync from s3 does not update objects of identical size

    Details

      Description

      Fritz Mueller reported yesterday that the qserv-dev eups distrib tag was not updating after being published. The s3 object was confirmed to be correct but was not syncing to the k8s service. At the time this was assumed to be a random case of s3 eventual consistency taking an excessively long time, with the k8s pod always getting an old version of the object. However, more than 12 hours is too long for that explanation to be plausible.

      Upon further investigation this morning, it appears that aws s3 sync from awscli, which is used to perform the sync, does not checksum the local file to determine whether it is in sync with s3. By default it treats a local file of the same size as up to date, and it can optionally compare exact timestamps (which isn't enabled); there is no option to force checksums (i.e., an equivalent of rsync -c). This is rather unfortunate, as s3 does expose an ETag for every object, which is the md5 of the content for objects uploaded in a single part. E.g.,

      $ aws s3api head-object --bucket eups.lsst.codes --key stack/src/tags/qserv-dev.list
      {
          "AcceptRanges": "bytes",
          "LastModified": "Fri, 22 Jun 2018 01:14:45 GMT",
          "ContentLength": 2495,
          "ETag": "\"04d0d2da6b4b1107bb03453177813201\"",
          "VersionId": "null",
          "ContentType": "binary/octet-stream",
          "Metadata": {}
      }
      $ aws s3 cp s3://eups.lsst.codes/stack/src/tags/qserv-dev.list .
      download: s3://eups.lsst.codes/stack/src/tags/qserv-dev.list to ./qserv-dev.list
      $ md5sum qserv-dev.list
      04d0d2da6b4b1107bb03453177813201  qserv-dev.list
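
The checksum comparison that aws s3 sync does not perform can be sketched in a few lines of Python. This is a minimal sketch, assuming the object was uploaded in a single part so that its ETag is the plain md5 of the content. `local_md5` and `local_matches_etag` are hypothetical helpers, not part of awscli, and the ETag is passed in as a string rather than fetched from s3.

```python
import hashlib


def local_md5(path, chunk_size=1024 * 1024):
    """Return the hex md5 digest of a local file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def local_matches_etag(path, etag):
    """Compare a local file against an S3 ETag.

    Assumption: the ETag is a plain md5 (single-part upload). Multipart
    ETags contain a "-" and cannot be compared this way, so they are
    reported as a mismatch here.
    """
    etag = etag.strip('"')  # head-object returns the ETag wrapped in quotes
    if "-" in etag:
        return False
    return local_md5(path) == etag.lower()
```

For the freshly downloaded copy above, `local_matches_etag("qserv-dev.list", '"04d0d2da6b4b1107bb03453177813201"')` would be expected to hold, since the md5sum output equals the ETag.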
      

      Demonstration that the s3 object and the stale eups.lsst.codes file are the same size:

      [root@pkgroot-rc-jh4lf /]# grep BUILD= /var/www/html/stack/src/tags/qserv-dev.list
      #BUILD=b3668
      [root@pkgroot-rc-jh4lf /]# ls -la /var/www/html/stack/src/tags/qserv-dev.list
      -rw-r--r-- 1 root root 2495 Jun 21 12:42 /var/www/html/stack/src/tags/qserv-dev.list
       
      $ grep BUILD= qserv-dev.list 
      #BUILD=b3670
      $ ls -la qserv-dev.list 
      -rw-rw-r--. 1 jhoblitt jhoblitt 2495 Jun 21 18:14 qserv-dev.list
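
Why the stale file survives can be illustrated with a simplified model of the comparison. This is a sketch based on how the awscli documentation describes the default behavior and the --exact-timestamps option for s3-to-local syncs, not awscli's actual code. The function names are hypothetical, and the timestamps assume the pod's clock is in UTC and the year is 2018.

```python
from datetime import datetime, timezone


def default_needs_download(s3_size, s3_mtime, local_size, local_mtime):
    """Model of the default s3 -> local rule: a same-sized file is
    skipped unless the local copy is newer than the s3 object."""
    if s3_size != local_size:
        return True
    return local_mtime > s3_mtime


def exact_timestamps_needs_download(s3_size, s3_mtime, local_size, local_mtime):
    """Model of --exact-timestamps: a same-sized file is skipped only
    when the timestamps match exactly."""
    if s3_size != local_size:
        return True
    return local_mtime != s3_mtime


# Values from the listings above (timezone and year are assumptions).
s3_mtime = datetime(2018, 6, 22, 1, 14, 45, tzinfo=timezone.utc)
local_mtime = datetime(2018, 6, 21, 12, 42, tzinfo=timezone.utc)

print(default_needs_download(2495, s3_mtime, 2495, local_mtime))
print(exact_timestamps_needs_download(2495, s3_mtime, 2495, local_mtime))
```

Under this model the default rule leaves the stale 2495-byte file alone, because the sizes match and the local copy is older than the s3 object, while the exact-timestamp rule flags it for download.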
      

          Activity

          jhoblitt Joshua Hoblitt added a comment -

          I'm timing the performance of s3cmd sync, which does compute an md5sum to compare against the ETag by default. If it comes in under 20 minutes, I'm planning to switch over to it. Otherwise, aws s3 sync --exact-timestamps will be the fastest fix.

          jhoblitt Joshua Hoblitt added a comment -

          s3cmd appears to be far too slow to sync the entire bucket. The ultimate solution is probably to use a work queue so that only files which have changed are copied (the list of changed objects can be obtained from events on the s3 bucket).

          I bailed out after > 20mins:

          download: 's3://eups.lsst.codes/stack/osx/10.9/clang-800.0.42.1/miniconda2-4.2.12-7c8e67/ip_diffim-13.0-28-gf4bc96c+11@DarwinX86.tar.gz' -> '/var/www/html/stack/osx/10.9/clang-800.0.42.1/miniconda2-4.2.12-7c8e67/ip_diffim-13.0-28-gf4bc96c+11@DarwinX86.tar.gz'  [463 of 147924]
              65536 of 13176882     0% in    0s   467.28 kB/s^CSee ya!
           
          real	24m10.789s
          user	4m48.893s
          sys	0m22.416s
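
The work-queue idea mentioned in this comment could look roughly like the following. This is a minimal sketch, assuming the bucket publishes standard S3 event notifications (JSON with a `Records` list) to some queue. How the messages are delivered is left out, and `changes_from_event` and the sample event are hypothetical.

```python
import json
from urllib.parse import unquote_plus


def changes_from_event(body):
    """Turn one S3 event notification (a JSON string) into a list of
    (action, bucket, key) tuples, where action is "copy" or "delete"."""
    changes = []
    for record in json.loads(body).get("Records", []):
        event_name = record.get("eventName", "")
        bucket = record["s3"]["bucket"]["name"]
        # Object keys are URL-encoded in event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if event_name.startswith("ObjectCreated:"):
            changes.append(("copy", bucket, key))
        elif event_name.startswith("ObjectRemoved:"):
            changes.append(("delete", bucket, key))
    return changes


# Hypothetical, trimmed-down event for the tag file discussed in this ticket.
sample = json.dumps({
    "Records": [{
        "eventName": "ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "eups.lsst.codes"},
            "object": {"key": "stack/src/tags/qserv-dev.list", "size": 2495},
        },
    }]
})
print(changes_from_event(sample))
```

A consumer would then copy or delete only the listed keys instead of walking all 147924 objects. Decoding the key matters for this bucket, since package file names such as the ip_diffim tarball above contain `+` and `@`, which would arrive percent-encoded.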
          

          jhoblitt Joshua Hoblitt added a comment -

          Hmm. It looks like aws s3 sync --exact-timestamps is going to cause almost all files to be re-downloaded as well.

          jhoblitt Joshua Hoblitt added a comment -

          Even with the --exact-timestamps flag, awscli was able to sync thousands of files in < 10mins.

          real	6m47.635s
          user	3m55.213s
          sys	0m53.356s
          

          And a noop is ~3mins:

          [root@pkgroot-rc-jh4lf ~]# time aws s3 sync --delete --exact-timestamps "s3://${S3_BUCKET}" "$WWW_ROOT"
           
          real	2m58.257s
          user	1m51.405s
          sys	0m7.709s
          

          jhoblitt Joshua Hoblitt added a comment - - edited

          A new jenkins job named sqre/infrastructure/build-s3sync has been merged to automate the build/push of the docker image.

          jhoblitt Joshua Hoblitt added a comment -

          I've restarted the pkgroot pod and the correct qserv-dev file is now present. This must have been a long-standing bug for any new version of a file with exactly the same size as the old one. In practice that should only ever be a tag file, but I suspect it has also affected packages that get recreated/published by eups when the jenkins agent didn't already have a cached version of the package.

          $ curl -sSL https://eups.lsst.codes/stack/src/tags/qserv-dev.list | grep BUILD
          #BUILD=b3670

          fritzm Fritz Mueller added a comment -

          Working now for me – thank you for the help!


            People

            • Assignee:
              jhoblitt Joshua Hoblitt
              Reporter:
              jhoblitt Joshua Hoblitt
              Reviewers:
              Fritz Mueller
              Watchers:
              Fritz Mueller, Frossie Economou, Gabriele Comoretto, Joshua Hoblitt