
A better .htaccess in root #2306

Closed
wants to merge 7 commits

Conversation

addison74
Contributor

This PR brings the original Magento .htaccess in the root up to date.

The explanations are more detailed for each statement. The blocks for each enabled Apache module have been ordered logically. Some parts that are no longer relevant have been removed (e.g. checking whether the browser is Netscape). New situations were considered: SSL not being terminated by the webserver, Apache behind a load balancer/proxy server, bad bots, hotlinks, direct access to OpenMage URLs that a crawler could request in a loop, and developer mode. Also, certain existing root files are now protected and no longer need to be deleted in production.

I have had this file in production for almost 1 month and I modified it during this period to be as explicit as possible. Obviously I am open to any proposals related to the content.

Regarding bad bots, it is a real issue; the robots.txt file does not even restrain the good bots. For example, analyzing bingbot activity on a daily basis, this bot is the greediest of all and it ended up generating almost 25 GB of traffic, while others like Googlebot, Baiduspider and YandexBot are much more limited. After I analyze bingbot's hourly behavior in more detail, I will run a test allowing access only in certain time intervals and see what happens.

PS - In another PR we need to create the error pages in the /errors directory. We still use the original Magento pages (404 and 503 only) with the layout from 14 years ago.

@fballiano
Contributor

I added a comment to the code (which seems nobody can see)
[Screenshot: 2022-07-17, 15:56:42]

but I don't know man, all of this upgrade can be pretty dangerous and hard to review. I would leave the .htaccess as simple as it is now and add this one as a possible "complete and advanced version", maybe in the documentation.

@addison74
Contributor Author

If you analyze the file in depth you will find that it is not dangerous or complicated at all. Each section follows in a logical order. I removed some parts that are obsolete and added others for better functionality; some of them are disabled. Each section is well commented and can be easily analyzed. This file retains its original functionality, the simplicity you mention. This PR has been in production for several months and there are no issues. If it is rejected, or its implementation is requested as a separate file (e.g. .htaccess.advanced.sample), I would like to read solid arguments.

@justinbeaty
Contributor

@addison74 for people who turn off JS/CSS merging (rightfully so with HTTP/2), it could cause caching issues that require customers to hard-reload a page.

Have you tried updating a JS or CSS file in your theme (for example) and then doing a regular reload (not a hard refresh) of the page? With your .htaccess settings you would end up not seeing your changes.

That is what @fballiano means by dangerous, because Magento doesn't version files.

@fballiano
Contributor

exactly @justinbeaty. @addison74, these extremely long cache lifetimes can create big problems and shouldn't be enabled by default.

I think this PR is good reference documentation but shouldn't be the default configuration :-)

@justinbeaty
Contributor

@fballiano offtopic, but I think the reason others cannot see your comments is that you either need to submit the review, or only use single comments.

@fballiano
Contributor

@fballiano offtopic, but I think the reason others cannot see your comments is that you either need to submit the review, or only use single comments.

ah! interesting!

@addison74
Contributor Author

addison74 commented Jul 17, 2022

Please show me your suggestions and I will do the changes.

Those Expires directives are inherited from OpenMage; I just added a few more. But CSS and JS were already there. This is the original hunk:

## Add default Expires header
## http://developer.yahoo.com/performance/rules.html#expires
    ExpiresActive On
    ExpiresByType image/jpg "access plus 1 year"
    ExpiresByType image/jpeg "access plus 1 year"
    ExpiresByType image/gif "access plus 1 year"
    ExpiresByType image/png "access plus 1 year"
    ExpiresByType text/css "access plus 1 month"
    ExpiresByType application/pdf "access plus 1 month"
    ExpiresByType text/x-javascript "access plus 1 month"
    ExpiresByType application/x-shockwave-flash "access plus 1 month"
    ExpiresByType image/x-icon "access plus 1 year"
    ExpiresDefault "access plus 2 days"

@justinbeaty
Contributor

You are right, they're already in there, added by #876. I am not sure anymore since I have been using nginx for a while now.

@fballiano
Contributor

They're not the same, guys; the ones we already have are not that good of an idea, but these new ones go too far.

I already said my suggestion: this change is too big, better to leave the .htaccess as it is (actually there's already too much stuff in there) and add this one as documentation.

In my projects I wouldn't add an .htaccess that I hadn't deeply reviewed and didn't know every single line of, so I would remove it from any deployment anyway.

@fballiano
Contributor

So I'm just saying I'm not the right person to review this; it's too technical and I don't know all of those directives, so I'd not put them in my production stores.


Options +FollowSymLinks
RewriteEngine on
Header set Strict-Transport-Security "max-age=31536000" env=HTTPS
Contributor

I don't think this should ever be set by default, especially not to 1 year. It should only be set when you're sure you have your SSL certs set up correctly.

Contributor Author

If you search for this header with Google you will see how much ink was spilled before this final version was reached. In an SSL Labs test, if the page has that header set you will get a better score. An administrator who does not set up an SSL certificate correctly cannot be called an administrator.

Contributor

Personally I use a value of 2 years as well as includeSubDomains and preload. But it's dangerous to set until you're sure your certs are working properly and all subpaths (for example reverse-proxied services) are on HTTPS.

I don't disagree that it's good to set, I would just comment it out by default.

Contributor Author

Let's say we comment it out and add an explanatory phrase. If you don't know about it, you will be penalized in SSL tests. If you do know about it, you activate it and leave it like that. I agree that there may be situations when a certificate is not renewed correctly by the automatic task. If you have a more elaborate proposal we can discuss it.

Contributor

IMO, it's especially dangerous to enable for a server admin who does not know what it does. I don't have any more elaborate proposal other than to just have it commented out by default.
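
For reference, a commented-out-by-default variant along the lines discussed here could look like the sketch below (untested; the 2-year max-age, includeSubDomains and preload flags come from the comment above and are optional):

#<IfModule mod_headers.c>
#    Header set Strict-Transport-Security "max-age=63072000; includeSubDomains; preload" env=HTTPS
#</IfModule>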


</FilesMatch>

<FilesMatch "(.*)\.(json|lock|md|sh|txt|sample)$">
Contributor

Is this blocking access to all json or txt files? I am sure that this would break part of someone's website if so.

Contributor Author

That directive blocks all files with those extensions that are found in the OM root. If someone uploads such files to other directories they will also be denied unless they modify the .htaccess files in those directories. If they were css, js, jpeg, php or html files it would be an issue; those extensions don't seem to be ones anyone would want to open in the browser.

Contributor

Maybe something like this (not tested):

<FilesMatch "^/[^/]*\.(json|lock|md|sh|txt|sample)$">

Or, just list out the files in the root dir you want to exclude, like php.ini.sample, robots.txt.sample etc.

I would be willing to bet that the blocking of json in particular will break some extension that tries to load a json file from the server in javascript.
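
A hedged sketch of that second option (untested; the file names are illustrative and would have to match what actually ships in the OpenMage root). Note that, as far as I understand it, FilesMatch compares only the file name and not the path, and rules in the root .htaccess are inherited by subdirectories unless their own .htaccess overrides them:

<IfModule mod_authz_core.c>
    <FilesMatch "^(php\.ini\.sample|robots\.txt\.sample|composer\.(json|lock)|README\.md|LICENSE\.txt)$">
        Require all denied
    </FilesMatch>
</IfModule>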

Contributor Author

I will analyze the expression you proposed because I want to limit it strictly to the root directory and only to those files. The rest of the files are controlled by the .htaccess files that exist in most subdirectories. The idea of the deny was that in production no one would have to delete these files; they could simply be left there, protected.

Member

Agree with justin on this one, especially the json file type.

BrowserMatchNoCase "coccocbot(.*)" BAD_BOT
BrowserMatchNoCase "cortex(.*)" BAD_BOT
BrowserMatchNoCase "^Custo(.*)" BAD_BOT
BrowserMatchNoCase "^curl(.*)" BAD_BOT
Contributor

I think this whole block is too opinionated... What if someone has some process they're using with curl that would now be blocked?

Contributor Author

As far as curl goes, it is debatable; that piece is taken from an online bad-bots generator, which includes curl. If the directive is not commented out, curl access from the terminal will be blocked unless you set a User-Agent. I look at the bots list daily and found that around 700 hits come through curl. Since they are not made by the team I work with, we block bots whose User-Agent starts with curl.
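
For context, a minimal sketch (assumed here, not taken from the PR) of how such a BAD_BOT flag is typically acted on under Apache 2.4:

<IfModule mod_setenvif.c>
    # Flag requests whose User-Agent starts with "curl"
    BrowserMatchNoCase "^curl" BAD_BOT
</IfModule>
<IfModule mod_authz_core.c>
    <RequireAll>
        Require all granted
        Require not env BAD_BOT
    </RequireAll>
</IfModule>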

Contributor

It being debatable is more or less what I mean by opinionated.

Curl may not be a good bot for you, but some poor soul probably has a script running in the office closet making curl requests to their Magento site to run some process. If it suddenly stops working they may not know why, especially if it was created by some developer long gone.

Also, the Ahrefs bot. Some people probably use Ahrefs' marketing tools, which would now break. Ahrefs claims they're a good bot (https://ahrefs.com/robot) since they obey robots.txt etc. But just because you do not use it, we should not block it from everyone's OpenMage site.

Maybe it's better to remove all the user agents and just leave the awk command so users can generate the list for themselves.

Contributor Author

The initial list was created by an online generator. I checked all of them with the help of the website mentioned on a line in the proposed .htaccess. Then, for two months, I added all the bots that generated high daily traffic; by that I do not mean 10 hits but over 100. I accept the objection related to curl and will comment out that line. I will add Ahrefs to robots.txt; if it continues to crawl my website I will keep that line.

Contributor

Then maybe it's better to just link the online generator so people can decide for themselves. If other reviewers are okay with a set of defaults then fine, I just think it's too opinionated.

Contributor

This list has too many false positives: ahrefs is an extremely famous SEO tool, ApacheBench is a benchmarking tool, Dalvik is the Android virtual machine name, and wget and curl are highly debatable.

Comment on lines +510 to +518
## The ultimate hotlink protection
## Before use set the YOUR_DOMAIN_NAME value in the last condition
## https://perishablepress.com/creating-the-ultimate-htaccess-anti-hotlinking-strategy/

RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{REQUEST_FILENAME} -f
RewriteCond %{REQUEST_FILENAME} \.(gif|jpe?g?|png)$ [NC]
RewriteCond %{HTTP_REFERER} !^https?://([^.]+\.)?YOUR_DOMAIN_NAME\. [NC]
Contributor

Is this going to break all images by default until the site admin notices (if they notice) that they need to update YOUR_DOMAIN_NAME in .htaccess?

Contributor

People may not want to disable hotlinking in the first place. For example affiliates who want to use product images.

Contributor Author

Please read the article in the link to understand the sequence of conditions. It is the best hotlink protection implementation I've found so far. The author considered several variants, analyzed them professionally and kept only this one.

Contributor Author

People may not want to disable hotlinking in the first place. For example affiliates who want to use product images.

For me it was more important to offer the best solution for those who need something like that. If people don't want to protect themselves against hotlinking then they don't need this block; they can comment it out. If you notice, there are situations where I have offered two denying methods, one classic and one based on redirection. It remains at the users' discretion which one they will use.

Contributor

From the link:

Implementing the de facto htaccess hotlink protection code required a simple binary decision: “do you want hotlink-protection or not?”

We should not make this decision for users, IMO. In the past I absolutely had instructions for affiliate marketers to hotlink images from my site. This would break that.

Okay, but looking at the changes again I see that both RewriteRules are commented out, so it shouldn't do anything. But what happens if the user puts some other RewriteRule at the bottom of this file? Do the previous RewriteCond lines affect that? (I actually don't know here.)
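
For what it's worth, mod_rewrite ties RewriteCond lines only to the single RewriteRule that immediately follows them, so with both RewriteRules commented out the leftover conditions would attach to the next uncommented RewriteRule further down the file. A contrived illustration (the later rule is hypothetical):

RewriteCond %{HTTP_REFERER} !^$
#RewriteRule \.(gif|jpe?g|png)$ - [F,NC,L]
# The condition above would now apply to this later, unrelated rule:
RewriteRule ^some-later-path$ /index.php [L]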

Contributor

If people don't want to protect themselves against hotlinking then they don't need this block; they can comment it out.

In general, it's best to make people opt in to changes like this. Changing it by default is the very nature of a breaking change.

Contributor Author

I understand your opinion. The hotlink block will be commented out and whoever wants this feature will have to activate it. The important thing is that it exists as a solution.

addison74 marked this pull request as draft July 17, 2022 22:13
@justinbeaty
Contributor

justinbeaty commented Jul 17, 2022

Last comment about the Expires header. Maybe we can come up with a good default set of headers?

For example, something like a very short cache (1 hour?) by default.

For these directories we can set a very long cache (because I believe all of them should be versioned):

  • /media/js
  • /media/css
  • /media/css_secure
  • /media/catalog/product/cache/

Things like these, something like 1 day:

  • /skin/*(jpg,png,gif)
  • favicon

Maybe some others too. I am not so sure about this anymore because I am using a custom frontend where all assets are versioned.
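
A rough, untested sketch of that tiered idea (the lifetimes are placeholders and the per-directory files are an assumption; the long value would only be safe where file names really are versioned):

## Root .htaccess: conservative default for un-versioned assets
<IfModule mod_expires.c>
    ExpiresActive On
    ExpiresDefault "access plus 1 hour"
</IfModule>

## In media/js/.htaccess, media/css/.htaccess, media/css_secure/.htaccess and
## media/catalog/product/cache/.htaccess: a long lifetime for versioned output
#<IfModule mod_expires.c>
#    ExpiresActive On
#    ExpiresDefault "access plus 1 year"
#</IfModule>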

@justinbeaty
Contributor

Overall I am not opposed to a new .htaccess file, but again it doesn't affect me since I've been using nginx for a while now. Any changes should be reviewed by others with a lot of Apache experience.

@addison74
Contributor Author

I converted it to a draft until we give it a final shape. There is room for improvement and I do not intend to merge it quickly. If you have any proposals, please post them for discussion.


</IfModule>

<IfModule mod_php8.c>
Contributor

justinbeaty commented Jul 17, 2022

Is this actually correct? It seems that it might just be <IfModule php_module> (or <IfModule mod_php.c>) for PHP8.

https://php.watch/versions/8.0/mod_php-rename
nextcloud/server#26569 (comment)

Contributor Author

Do you have PHP 8.x installed? If so, please check the name of the PHP module used by Apache. I haven't switched to PHP 8 yet. If it's as that guy says, we have to change it not only here but also in the .htaccess files I worked on recently. It is not critical but it must be resolved.

Contributor

I haven't migrated yet either, but I am also not using mod_php, I use php-fpm instead. And yes, I did approve the other PR for 20.0 in the js directory figuring that was the correct IfModule, but maybe it's not.

Contributor

I've tested it today: for PHP 7 we have to use "mod_php7.c", for PHP 8 we have to use "mod_php.c"; anything else won't work.
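
Based on that test, a sketch of how the guard could cover both generations (the directive inside is only a placeholder taken from the existing file):

<IfModule mod_php7.c>
    php_value max_execution_time 18000
</IfModule>
<IfModule mod_php.c>
    php_value max_execution_time 18000
</IfModule>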

ExpiresByType image/vnd.microsoft.icon "access plus 1 year"
ExpiresByType text/css "access plus 1 month"
ExpiresByType text/csv "access plus 1 month"
ExpiresByType text/html "access plus 1 month"
Contributor

These lines can be very useful but are extremely dangerous if there's no "versioning" of the static assets. I wouldn't suggest such values unless somebody knows exactly the danger they're getting into. I also think it would break the integration I always use with my preferred CDN.

Contributor

This whole block should be commented out and documented; users may want different caching strategies and these long ones can be dangerous if not handled properly.


</IfModule>

<IfModule mod_php8.c>
Contributor

I've tested it today: for PHP 7 we have to use "mod_php7.c", for PHP 8 we have to use "mod_php.c"; anything else won't work.


php_value max_execution_time 18000
#php_value max_execution_time 18000
Contributor

We should change this setting; I don't know why the heck it should be suggested to have an 18000-second max execution time :-D

## Redirect access to Advanced Search and Popular Search Terms pages

#RedirectMatch 301 /catalogsearch/advanced(.*)$ https://www.example.com
#RedirectMatch 301 /catalogsearch/term/popular(.*)$ https://www.example.com
Contributor

People know how to do a 301 redirect and we shouldn't have basic sections like this here; it's really easy to find the proper documentation in the right places.

addison74 marked this pull request as ready for review July 18, 2022 11:38
addison74 closed this Jul 18, 2022
addison74 deleted the ADDISON74-htaccess-root branch July 20, 2022 19:26
@addison74
Contributor Author

This PR collected many good ideas and it would be a pity not to capitalize on them.

I will create a new draft to be discussed. I am considering keeping the current .htaccess with small changes in comments and form, plus a new .htaccess with a lot of features so that anyone who is interested can use parts of it or even activate it completely in production. The file will return under a known name, .htaccess.sample (I deleted it from OM in a recent PR because it had the same content as .htaccess).
