Wappalyzer Technologies table has unexpected entries #1843

Closed
rockeynebhwani opened this issue Dec 27, 2020 · 22 comments

@rockeynebhwani
Contributor

rockeynebhwani commented Dec 27, 2020

Sometimes I see entries that are not in the Wappalyzer apps.json file: https://github.com/WPO-Foundation/Wappalyzer/blob/master/src/apps.json

For example, under the Ecommerce category, we have duplicate entries (with and without spaces).

  • SAP Commerce Cloud
  • SAPCommerceCloud
  • Salesforce Commerce Cloud
  • SalesforceCommerceCloud
  • Cart Functionality
  • CartFunctionality

You can see this in the output of this query:

SELECT DISTINCT app FROM `httparchive.technologies.2020_10_01_mobile` WHERE category = 'Ecommerce' ORDER BY app

Not sure why this is happening. It results in slight overcounting in some queries (for example, the total number of ecommerce platforms analyzed).

We should check why this is happening. The impact on the 2020 chapter is minimal, so I am not spending time getting to the bottom of this for now; I'm just raising an issue so that we can look into it later.

Also, if you look at the site https://jelly-pop.com/, the technologies table shows the app as 'SalesforceCommerceCloud', but the Wappalyzer Chrome extension does not show this technology for the site. Not sure why it is appearing in the technologies table.

I also noticed this under the 'Analytics' category, with entries like:
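To illustrate the overcounting, here is a minimal sketch that groups app names differing only in whitespace. The function name `groupSpacingVariants` is hypothetical, and the input is a hand-copied sample of the names listed above, not a query result.

```javascript
// Group app names that are identical once all whitespace is removed.
// Returns only the groups that contain more than one spelling.
function groupSpacingVariants(names) {
  const groups = new Map();
  for (const name of names) {
    const key = name.replace(/\s+/g, '');
    if (!groups.has(key)) groups.set(key, new Set());
    groups.get(key).add(name);
  }
  return [...groups.values()]
    .filter((variants) => variants.size > 1)
    .map((variants) => [...variants]);
}

// Sample names taken from the list in this issue.
const apps = [
  'SAP Commerce Cloud',
  'SAPCommerceCloud',
  'Salesforce Commerce Cloud',
  'Cart Functionality',
  'CartFunctionality',
];

const variants = groupSpacingVariants(apps);
console.log(JSON.stringify(variants));
// [["SAP Commerce Cloud","SAPCommerceCloud"],["Cart Functionality","CartFunctionality"]]
```

Counting distinct keys instead of distinct names would avoid the double counting, at the cost of merging any two technologies whose names really do differ only by spaces.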

  • GoogleTagManager
  • Google Tag Manager
@tunetheweb
Member

tunetheweb commented Dec 29, 2020

Dunno why this happens but can tell you the outliers are rare enough they can pretty much be ignored:

SELECT DISTINCT
  t1.category,
  t1.app,
  t1.total,
  t2.category,
  t2.app,
  t2.total
FROM
   (SELECT category, app, count(1) AS total FROM `httparchive.technologies.2020_10_01_mobile` GROUP BY category, app) t1,
   (SELECT category, app, count(1) AS total FROM `httparchive.technologies.2020_10_01_mobile` GROUP BY category, app) t2
WHERE
  REPLACE(t1.category, ' ', '') = REPLACE(t2.category, ' ', '') AND
  REPLACE(t1.app, ' ', '') = REPLACE(t2.app, ' ', '') AND
  (t1.category != t2.category OR t1.app != t2.app) AND
  t1.total >= t2.total
ORDER BY
  t1.category,
  t1.app
category	app	total	category_1	app_1	total_1
Advertising	Google AdSense	810,985	Advertising	GoogleAdSense	1
Analytics	Baidu Analytics (百度统计)	19,220	Analytics	BaiduAnalytics (百度统计)	1
Analytics	Google Analytics	4,618,469	Analytics	Google Analytics	1
Analytics	Google Analytics	4,618,469	Analytics	GoogleAnalytics	30
Analytics	GoogleAnalytics	30	Analytics	Google Analytics	1
Analytics	Tencent Analytics (腾讯分析)	236	Analytics	TencentAnalytics(腾讯分析)	1
CDN	Netlify	10,958	CDN	Netlify	1
CMS	Adobe Experience Manager	16,732	CMS	AdobeExperience Manager	1
CMS	TYPO3 CMS	38,789	CMS	TYPO3CMS	1
Ecommerce	Cart Functionality	871,039	Ecommerce	CartFunctionality	7
Ecommerce	Salesforce Commerce Cloud	3,611	Ecommerce	SalesforceCommerceCloud	2
Ecommerce	SAP Commerce Cloud	2,324	Ecommerce	SAPCommerceCloud	1
Font scripts	Font Awesome	2,286,759	Font scripts	FontAwesome	3
Font scripts	FontAwesome 4.7.0	2	Font scripts	Font Awesome  4.7.0	1
Font scripts	Google Font API	3,292,664	Font scripts	GoogleFontAPI	9
JavaScript frameworks	Gatsby	7,645	JavaScript frameworks	Gatsby	1
JavaScript frameworks	React	327,194	JavaScript frameworks	React	1
JavaScript graphics	Raphael	19,947	JavaScript graphics	Raphael	1
JavaScript libraries	jQuery Migrate	1,610,580	JavaScript libraries	jQueryMigrate	2
JavaScript libraries	jQuery UI	1,453,426	JavaScript libraries	jQuery UI	2
JavaScript libraries	Modernizr	1,084,419	JavaScript libraries	Modernizr	1
Maps	Google Maps	341,206	Maps	GoogleMaps	1
Miscellaneous	Google Code Prettify	16,660	Miscellaneous	GoogleCodePrettify	2
Miscellaneous	Swiper Slider	464,738	Miscellaneous	SwiperSlider	2
Miscellaneous	Twitter Emoji (Twemoji)	1,634,230	Miscellaneous	TwitterEmoji(Twemoji)	1
Miscellaneous	webpack	342,236	Miscellaneous	webpack	1
Operating systems	Windows Server	524,703	Operating systems	WindowsServer	20
PaaS	Netlify	10,958	PaaS	Netlify	1
PaaS	WP Engine	75,925	PaaS	WPEngine	1
Static site generator	Gatsby	7,645	Static site generator	Gatsby	1
Tag managers	Google Tag Manager	2,590,588	Tag managers	GoogleTagManager	13
UI frameworks	animate.css	497,980	UI frameworks	animate.css	1
UI frameworks	Bootstrap	1,989,296	UI frameworks	Bootstrap	4
Video players	MediaElement.js	250,379	Video players	MediaElement.js	1
Web frameworks	Microsoft ASP.NET	460,820	Web frameworks	MicrosoftASP.NET	11
Web servers	Apache Tomcat	22,123	Web servers	ApacheTomcat	2
Widgets	Facebook	1,894,439	Widgets	Facebook	1
Widgets	OWL Carousel	576,306	Widgets	OWLCarousel	4
Wikis	MediaWiki	5,801	Wikis	MediaWiki	2

tunetheweb added the analysis (Querying the dataset) and bug (Something isn't working) labels Dec 29, 2020
tunetheweb added this to the 2020 Backlog milestone Dec 29, 2020
@rockeynebhwani
Contributor Author

Thanks @barrypollard. Agreed that it's small enough to be ignored. I have ignored it for now.

@tunetheweb
Member

@pmeenan not urgent, but since you were looking at this code, do you have any ideas on this one? The numbers are very small, but it's odd that spaces are stripped only very rarely. I remember looking at the time and saw the same in WPT for those URLs (but not on the Wappalyzer website), so I think it's a WPT issue, and it was repeatable. Meant to raise an issue but forgot until you jolted my memory!

@pmeenan
Member

pmeenan commented Apr 16, 2021

@bazzadp If you can still repeat it, it would really help to have a repro case. I'm wondering if the Wappalyzer definitions were updated mid-crawl and the spaces were added.

Since then the whole wappalyzer engine was updated and changed so it will be more useful if we can see it in the May 2021 crawl.

@pmeenan
Member

pmeenan commented Apr 16, 2021

That is bizarre. I wonder if the page itself is overriding some array or other operations, because when I take the raw output from Wappalyzer and run it through the same code outside the page, it keeps the spaces, but if I run it in the console for that page, it strips them out (maybe a code page issue). At least I can reproduce it now, though, so it should be easier to fix.

@pmeenan
Member

pmeenan commented Apr 16, 2021

Ahh, looks like the pages override String.prototype.trim() and cause it to remove all of the whitespace. Since the Wappalyzer definitions don't have any trailing whitespace, I can just remove the trim operations.

Should be fixed now (well, over the next hour as the agent update rolls out).
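The failure mode described above can be reproduced with a small sketch. The override shown here is a guess at what such a page might do (the actual page code is not quoted in this thread), and `withBrokenTrim` is a hypothetical helper used only to keep the override contained.

```javascript
// Keep a reference to the built-in so it can be restored afterwards.
const nativeTrim = String.prototype.trim;

// Run fn while String.prototype.trim is replaced by a version that strips
// ALL whitespace (assumed behaviour of the offending pages).
function withBrokenTrim(fn) {
  String.prototype.trim = function () {
    return this.replace(/\s+/g, '');
  };
  try {
    return fn();
  } finally {
    String.prototype.trim = nativeTrim; // always restore the built-in
  }
}

// Code that calls trim() inside the page picks up the override.
const viaTrim = withBrokenTrim(() => 'Salesforce Commerce Cloud'.trim());

// Code that never calls trim() is unaffected, which mirrors the fix of
// removing the trim operations.
const untouched = withBrokenTrim(() => String('Salesforce Commerce Cloud'));

console.log(viaTrim);   // SalesforceCommerceCloud
console.log(untouched); // Salesforce Commerce Cloud
```

Any script injected into an arbitrary page shares that page's prototypes, so avoiding built-ins that the page may have replaced is the simplest defence when the input is already clean.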

[Image: Screen Shot 2021-04-16 at 5 47 25 PM]

@tunetheweb
Member

Why, why would anyone do this? You get all sorts when you look at 7.5 million web pages...

Good work nailing it down.

@rviscomi
Member

Thanks for tracking that down and fixing @pmeenan!

Can we close this?

@tunetheweb
Member

Was going to give it a quick check after the May crawl and then close it. There was also another issue where the technologies were all messed up, as discussed on the HTTP Archive Slack.

So I say let's leave this open as a reminder to check the technologies results after the May crawl. It's key data for the Web Almanac, so we want to make sure it's definitely sorted before our crawl month.

@tunetheweb
Member

Confirmed as all fixed in the May crawl. The same query above gives 0 results.

@rviscomi
Member

rviscomi commented Jun 16, 2021

Reopening this to track a related issue.

I noticed that "GoDaddy Website Builder" no longer has any websites detected since February:

SELECT
  _TABLE_SUFFIX AS suffix,
  COUNT(DISTINCT url) AS urls
FROM
  `httparchive.technologies.2021_*`
WHERE
  app = 'GoDaddy Website Builder'
GROUP BY
  suffix
ORDER BY
  suffix
suffix		urls
01_01_desktop	7525
01_01_mobile	11006
02_01_desktop	327

Interestingly, ismyhostfastyet.com is still able to detect GWB sites because it uses a signal from the HTTP headers and it's detecting 7k desktop pages as of May.

So I pulled out a URL from that detection and used the Wappalyzer extension for Chrome to get a true positive detection on https://finsbarandgrill.com/

[Image: screenshot of the Wappalyzer extension's detections]

However, in HA BQ and a plain WPT, we're only detecting Google Analytics:

SELECT
  *
FROM
  `httparchive.technologies.2021_05_01_desktop`
WHERE
  url = 'https://finsbarandgrill.com/'
url	category	app	info
https://finsbarandgrill.com/	Analytics	Google Analytics	""

https://webpagetest.org/jsonResult.php?test=210616_AiDcFQ_67699b36864cd25851abcfd97c53f792&pretty=1

                "detected": {
                    "Analytics": "Google Analytics"
                },
                "detected_apps": {
                    "Google Analytics": ""
                },

Wappalyzer uses a meta[name=generator] signal to detect GWB and the meta tag exists on the page as we'd expect:

https://github.com/AliasIO/wappalyzer/blob/6625a034b17965e9e30234f8a27b4f7f03e64e50/src/technologies.json#L7918-L7934

    "GoDaddy Website Builder": {
      "cats": [
        1
      ],
      "cookies": {
        "dps_site_id": ""
      },
      "icon": "godaddy.svg",
      "meta": {
        "generator": "Go Daddy Website Builder (.+)\\;version:\\1"
      },
      "pricing": [
        "mid"
      ],
      "saas": true,
      "website": "https://www.godaddy.com/websites/website-builder"
    },
// "Starfield Technologies; Go Daddy Website Builder 8.0.0000"
document.querySelector('meta[name=generator]').getAttribute('content')

@pmeenan this leads me to believe that there may be an integration bug with Wappalyzer in WPT. Would you be able to look into this?
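For reference, the `meta.generator` pattern above can be checked against the tag's content with a short sketch. This is a simplified approximation of how Wappalyzer-style patterns are read (a regex, then `\;`-separated tags such as `version:\1`), not the real engine; `parsePattern` and `detectVersion` are hypothetical names.

```javascript
// Split a pattern into its regex and its optional version template.
function parsePattern(pattern) {
  const [regex, ...tags] = pattern.split('\\;');
  const parsed = { regex: new RegExp(regex, 'i'), version: '' };
  for (const tag of tags) {
    const idx = tag.indexOf(':');
    if (tag.slice(0, idx) === 'version') parsed.version = tag.slice(idx + 1);
  }
  return parsed;
}

// Return the detected version ('' if the pattern has no version template),
// or null when the content does not match at all.
function detectVersion(pattern, content) {
  const { regex, version } = parsePattern(pattern);
  const match = regex.exec(content);
  if (!match) return null;
  // Replace \1, \2, ... in the template with the captured groups.
  return version.replace(/\\(\d)/g, (_, n) => match[Number(n)] || '');
}

// Once the JSON is parsed, "\\;" and "\\1" become "\;" and "\1"; the JS
// string literal below encodes that same value.
const pattern = 'Go Daddy Website Builder (.+)\\;version:\\1';
const content = 'Starfield Technologies; Go Daddy Website Builder 8.0.0000';

console.log(detectVersion(pattern, content)); // 8.0.0000
```

Since the pattern matches the content string, a missed detection would point at the meta tags not reaching the matcher rather than at the definition itself.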

@rviscomi rviscomi reopened this Jun 16, 2021
pmeenan added a commit to catchpoint/WebPageTest.agent that referenced this issue Jun 16, 2021
@pmeenan
Member

pmeenan commented Jun 16, 2021

The Wappalyzer checks weren't including the meta tags. Agent has been updated and will be rolling out over the next hour.

Tested it in dev here and it correctly caught Go Daddy.

"detected": {
    "CMS": "GoDaddy Website Builder 8.0.0000",
    "Analytics": "Google Analytics"
},
"detected_apps": {
    "GoDaddy Website Builder": "8.0.0000",
    "Google Analytics": ""
},

@rviscomi
Member

Thanks @pmeenan! I also noticed that the Wappalyzer extension detected React on that page, but it's not included in the new test results. Could something else be missing?

@pmeenan
Member

pmeenan commented Jun 16, 2021

Possibly the serialized DOM. It serializes the HTML but not the DOM. Taking a look now.

@pmeenan
Member

pmeenan commented Jun 16, 2021

Reached out to the Wappalyzer team to see how best to handle DOM-based detections. They are starting to migrate to it but the current engine doesn't directly support it and they have the extension doing the detections manually. Hoping there is a better way but should have something figured out soon.

pmeenan added a commit to catchpoint/WebPageTest.agent that referenced this issue Jun 17, 2021
@pmeenan
Member

pmeenan commented Jun 17, 2021

Whew. That was somewhat more painful than I expected. Had to rewrite the JS variable detection part which changed pretty significantly when the engine changed a few months back (also added the support for the DOM detections).

Here is an updated test. Change is rolling out to prod (and HA) over the next hour.

"_detected": {
    "CMS": "GoDaddy Website Builder 8.0.0000",
    "JavaScript libraries": "React 16.13.1,Lodash 4.17.5",
    "Analytics": "Google Analytics"
},
"_detected_apps": {
    "GoDaddy Website Builder": "8.0.0000",
    "React": "16.13.1",
    "Lodash": "4.17.5",
    "Google Analytics": ""
}

@rviscomi
Member

That's great, thanks for fixing!

@rviscomi
Member

rviscomi commented Jul 1, 2021

Something to keep an eye on to verify that the fix is working. Here's a query to measure the change in origins from January to May for all technologies, using the table for the CWV Technology Report dashboard:

SELECT
  app,
  SAFE_DIVIDE(may.origins - jan.origins, jan.origins) AS pct_change,
  may.origins - jan.origins AS num_change,
  jan.origins AS jan_origins,
  may.origins AS may_origins
FROM (
  SELECT
    date,
    app,
    origins
  FROM
    `httparchive.core_web_vitals.technologies`
  WHERE
    date = '2021-01-01' AND
    client = 'mobile' AND
    origins >= 1000) AS jan
JOIN (
  SELECT
    date,
    app,
    origins
  FROM
    `httparchive.core_web_vitals.technologies`
  WHERE
    date = '2021-05-01' AND
    client = 'mobile') AS may
USING (app)
ORDER BY
  pct_change ASC

Top 20 results:

app	pct_change	num_change	jan_origins	may_origins
Incapsula	-100%	-19,360	19,364	4
Google Code Prettify	-100%	-13,713	13,716	3
CKEditor	-100%	-20,169	20,174	5
Hugo	-100%	-2,732	2,733	1
Pardot	-100%	-13,050	13,056	6
Angular	-100%	-28,733	28,760	27
AlloyUI	-100%	-5,966	5,973	7
INFOnline	-100%	-2,555	2,559	4
Intercom	-100%	-16,677	16,706	29
MobX	-100%	-9,661	9,694	33
VideoJS	-100%	-51,541	51,730	189
SilverStripe	-100%	-2,843	2,856	13
Webtrends	-99%	-1,590	1,599	9
Kampyle	-99%	-1,352	1,360	8
Neto	-99%	-1,097	1,106	9
Twitter Emoji (Twemoji)	-99%	-1,245,734	1,260,312	14,578
Polymer	-95%	-1,019	1,069	50
Open Web Analytics	-95%	-5,521	5,811	290
Disqus	-95%	-26,703	28,121	1,418
Dojo	-95%	-37,429	39,498	2,069

The bug appears to be more widespread than I'd initially thought and some technologies like Angular were almost entirely wiped out.

The June dataset should have the fix partially applied, so we should start to see these rebounding. Eventually everything should be fully counted in the July crawl.

@pmeenan WDYT about adding some kind of automated testing to ensure that the Wappalyzer integration is working? Not sure if that'd be implemented on the WPT or HTTP Archive side, or if it's something easy to build as a standalone WPT API app.
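One possible shape for such a check, as a rough sketch only: compare the technologies expected on a known page against the `detected_apps` object of a WPT JSON result. The function name `missingDetections` and the expected list are hypothetical, the sample objects are copied from the results quoted earlier in this thread, and fetching the result from WPT is left out.

```javascript
// Return the expected technologies that are absent from a detected_apps object.
function missingDetections(expectedApps, detectedApps) {
  return expectedApps.filter(
    (app) => !Object.prototype.hasOwnProperty.call(detectedApps, app)
  );
}

// detected_apps before the agent fixes (only Google Analytics was found).
const before = { 'Google Analytics': '' };

// detected_apps after the agent fixes.
const after = {
  'GoDaddy Website Builder': '8.0.0000',
  'React': '16.13.1',
  'Lodash': '4.17.5',
  'Google Analytics': '',
};

// Hypothetical expectation for https://finsbarandgrill.com/.
const expected = ['GoDaddy Website Builder', 'React', 'Google Analytics'];

console.log(JSON.stringify(missingDetections(expected, before)));
// ["GoDaddy Website Builder","React"]
console.log(JSON.stringify(missingDetections(expected, after)));
// []
```

A small set of stable pages, each covering a different detection signal (meta tags, JS variables, DOM, headers), would probably be enough to catch a whole signal type going missing.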

@rviscomi
Member

Seeing detections for the 20 most affected technologies in the previous comment starting to recover in the June dataset. For example, here's a screenshot from the CWV Technology Report:

[Image: screenshot from the CWV Technology Report]
(Twemoji omitted because it's very popular and throws off the y-axis)

Given that the fix was applied late in the June crawl, detections haven't fully recovered, so I'll leave this issue open and continue to monitor this when the July crawl is available.

@rockeynebhwani
Contributor Author

@rviscomi - Can we close this now?

@rviscomi
Member

Yes we can close this now. Tracking improvements to technology detections in HTTPArchive/wappalyzer#70 instead.
