You can modify URLs as they are being crawled using the ‘Remove URL Parameters’ and ‘URL Rewriting’ features in Advanced Settings, in step 4 of the crawl setup.
These features are useful for tasks such as removing URL components that complicate analysis of your website, or rewriting URLs to point to an external service, for example a lookup service that retrieves information from an API for a set of your page URLs.
Stripping URL Parameters
If you simply want to strip out parameters, list them on separate lines in the Remove Parameters option in Advanced Settings. For example, add ‘param1’ to strip every occurrence of that parameter, such as param1=1, param1=2, etc.
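To preview what stripping a parameter does to a URL, the behaviour can be approximated outside the crawler. This is a minimal sketch using Python's standard library, not Lumar's implementation; the function name strip_params and the sample URLs are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_params(url, names):
    """Return url with every query parameter whose name is in `names` removed.

    Hypothetical helper for illustration only; not part of Lumar.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in names
    ]
    # Rebuild the URL with only the parameters that were kept.
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_params("https://example.com/page?param1=1&keep=2", {"param1"}))
print(strip_params("https://example.com/page?param1=2", {"param1"}))
```

Both param1=1 and param1=2 are removed, because the match is on the parameter name and not on its value.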
URL Rewriting
The URL rewriting function allows you to use regular expressions to modify your page URLs in more complex ways.
The URL is matched by the regular expression in the ‘Match From’ column, and replaced with what you set in the ‘Match To’ column.
If you use parentheses in the Match From setting in conjunction with the variables $1, $2, etc. in the Match To setting, Lumar will insert whatever text matches the corresponding parenthetical group.
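The way parenthetical groups feed $1 and $2 can be tried out locally before a rule is added to a crawl. The sketch below uses Python's re module as a stand-in; it assumes Lumar's regular expression syntax behaves similarly for these simple patterns, which may not hold in every case. The rewrite helper and the sample URL are hypothetical.

```python
import re

def rewrite(url: str, match_from: str, match_to: str) -> str:
    """Apply one Match From / Match To style rule to a URL (illustrative only)."""
    # Python writes group references as \g<1>, \g<2>, ... rather than $1, $2, ...
    template = re.sub(r"\$(\d+)", r"\\g<\1>", match_to)
    return re.sub(match_from, template, url)

# Group 1 captures the scheme and host, group 2 captures the article number.
result = rewrite(
    "https://example.com/news/article-42",
    r"^(https?://[^/]+)/news/article-(\d+)$",
    "$1/articles/$2",
)
print(result)
```

Here $1 is replaced by the scheme and host, and $2 by the digits captured by the second group.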
Append URL (must be in this order):
Match From1: (.+\?.+)
Match To1: $1&url=someurl.com
Match From2: (^[^?]+$)
Match To2: $1?url=someurl.com
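The reason for the ordering can be seen by applying the two rules one after the other. This sketch assumes that rules run in sequence, each on the output of the previous one, and uses Python's re module as a stand-in for Lumar's regex engine. The first pattern is written with the question mark escaped as \? so that it matches a literal ? in the URL. The apply_rules helper and sample URLs are hypothetical.

```python
import re

# Rule 1 handles URLs that already have a query string, rule 2 those that do not.
RULES = [
    (r"(.+\?.+)", r"\g<1>&url=someurl.com"),
    (r"(^[^?]+$)", r"\g<1>?url=someurl.com"),
]

def apply_rules(url, rules):
    """Apply each (pattern, replacement) pair in order to the running URL."""
    for pattern, replacement in rules:
        url = re.sub(pattern, replacement, url)
    return url

with_query = apply_rules("https://example.com/page?x=1", RULES)
without_query = apply_rules("https://example.com/page", RULES)

# Reversing the order appends the parameter twice to URLs without a query string:
# rule 2 adds "?url=...", and rule 1 then matches the newly added "?".
reversed_result = apply_rules("https://example.com/page", list(reversed(RULES)))

print(with_query)
print(without_query)
print(reversed_result)
```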
Replace a domain:
Match From: (^https?://)(www\.|)exampledomain\.com
Match To: $1differentdomain.com
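A quick local check of the domain rule, again using Python's re module as an approximation of Lumar's behaviour. The dots in the pattern are escaped here so that they match only literal dots, and the sample URLs are invented. Because only $1 (the scheme) is reused, an optional www. prefix is dropped in the result.

```python
import re

PATTERN = r"(^https?://)(www\.|)exampledomain\.com"
REPLACEMENT = r"\g<1>differentdomain.com"  # Python's spelling of $1differentdomain.com

def replace_domain(url):
    """Swap exampledomain.com for differentdomain.com, keeping the scheme."""
    return re.sub(PATTERN, REPLACEMENT, url)

print(replace_domain("https://www.exampledomain.com/path"))
print(replace_domain("http://exampledomain.com/"))
```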
Change the name of a parameter:
Match From: parameter_x=([^&$]+)
Match To: parameter_y=$1
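The renaming rule can be previewed the same way. This is a sketch in Python's re module, assumed to be close enough to Lumar's engine for this pattern; the sample URL is made up. The captured group holds the parameter value, which is carried over unchanged.

```python
import re

def rename_parameter(url):
    """Rename parameter_x to parameter_y while keeping its value."""
    return re.sub(r"parameter_x=([^&$]+)", r"parameter_y=\g<1>", url)

renamed = rename_parameter("https://example.com/?parameter_x=abc&z=1")
print(renamed)
```

The pattern is not anchored to the start of a parameter name, so in this sketch a parameter such as other_parameter_x would also be rewritten.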
Change the case of a parameter:
Match From: parameter_x=([^&$]+)
Match To: parameter_x=$1
Case Options: uppercase
Change HTTPS to HTTP:
Match From: ^https:
Match To: http:
Force Trailing Slash:
Match From: ([^/])$
Match To: $1/
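It is worth testing the trailing slash rule against a few URL shapes before using it, since the pattern looks only at the final character. The sketch below uses Python's re module as a stand-in for Lumar's engine, with invented sample URLs.

```python
import re

def force_trailing_slash(url):
    """Append "/" when the URL's last character is not already a slash."""
    return re.sub(r"([^/])$", r"\g<1>/", url)

print(force_trailing_slash("https://example.com/page"))
print(force_trailing_slash("https://example.com/page/"))
# The rule does not distinguish paths from query strings:
print(force_trailing_slash("https://example.com/page?x=1"))
```

In this sketch the slash is also appended to URLs that end in a query string or a file name, so a narrower Match From pattern may be needed if such URLs are in the crawl.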
Writing regular expressions can get tricky, so contact us at support@lumar.io if you need any help with these features.