Performance analysis and tuning
October 18th, 2016
| Table of contents |  |
Docs
iometer
memory reservations/limits
iSCSI login timeout
RVTools
sanHQ
Disable Delayed ACKs
Dell DPack
esxtop
vmkernel.log
Switch analysis
| Docs |  |
Dell EqualLogic PS6000XVS Performance in a VMware View VDI Environment http://en.community.dell.com/dell-groups/dtcmedia/m/mediagallery/19861296/download.aspx
Best practices for performing storage performance tests within a virtualized environment http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2019131
| iometer |  |
Tests
Download : www.iometer.org
The total I/O per second label describes exactly what this number is a measure of. This score is a measure of the number of I/O requests completed per second. Since an I/O request is made up of moving the hard drive head to the proper location and then reading or writing a file, this number provides an excellent measure of how a drive or an array is performing.
Source : http://www.anandtech.com/show/788/11
Methodology - step 1
You can use the CGI script on the site below to get a table presenting the test results
(just paste the contents of the iometer results file into it).
http://vmktree.org/iometer/
Methodology - step 2
You can also compare your results with other people's:
see https://communities.vmware.com/thread/197844?start=525&tstart=0
The test definition file can be downloaded here: http://www.mez.co.uk/OpenPerformanceTest.icf
It uses the following options:
Worker : Worker 1
Worker type : DISK
Default target settings for worker
Number of outstanding IOs, test connection rate, transactions per connection: 64, ENABLED, 500
Disk maximum size, starting sector: 8000000, 0
Run time = 5 min
For the test, drive C: is used and the test file (8,000,000 sectors, i.e. roughly 4 GB with 512-byte sectors) is created on the first run - you need enough free space on the disk.
The controller cache size has a direct influence on the results: on systems with more than 2 GB of cache, the test file should be made larger.
Other results
10 Gb/s link
https://communities.vmware.com/thread/197844?start=525&tstart=0
STORAGE TYPE / DISK NUMBER / RAID LEVEL: EqualLogic PS6110XS Firmware 6.0.6 (1x10GB/s ISCSI port) - 7x 400GB SSDs + 17x 600GB 10K HDDs - RAID6 (accelerated), 1 spare - NO MEM
| Test name | Latency | Avg iops | Avg MBps | cpu load |
| Max Throughput-100%Read | 2.10 | 26365 | 824 | 7% |
| RealLife-60%Rand-65%Read | 5.81 | 9749 | 76 | 4% |
| Max Throughput-50%Read | 5.16 | 11275 | 352 | 5% |
| Random-8k-70%Read | 5.60 | 10175 | 79 | 4% |
Max Throughput-50% Read numbers seem a bit low, otherwise the results look decent
1 Gb/s links
https://communities.vmware.com/thread/197844?start=510&tstart=0
SERVER TYPE: Windows 2008 R2, 2 vCPU, 4GB RAM, 60GB hard disk
CPU TYPE / NUMBER: Intel X6550, 2 CPU
HOST TYPE: Dell PowerEdge R710, 64GB RAM, 2x E5-2660 (2.66 GHz), 4x1GB/s ISCSI ports
ISCSI LAN: 2x PowerConnect 6224 (MTU 9000)
STORAGE TYPE / DISK NUMBER / RAID LEVEL: EqualLogic PS6100X Firmware 6.0.5 (4x1GB/s ISCSI ports) - 22 SAS 10K 600GB - RAID10 + 2 spares - NO MEM
| Test name | Latency | Avg iops | Avg MBps | cpu load |
| Max Throughput-100%Read | 4.55 | 11910 | 372 | 1% |
| RealLife-60%Rand-65%Read | 9.31 | 4938 | 38 | 4% |
| Max Throughput-50%Read | 5.84 | 9430 | 294 | 0% |
| Random-8k-70%Read | 9.11 | 5036 | 39 | 4% |
My results
| Test name | Latency | Avg iops | Avg MBps | cpu load |
| Max Throughput-100%Read | 27.12 | 2201 | 68 | 24% |
| RealLife-60%Rand-65%Read | 13.16 | 4091 | 31 | 37% |
| Max Throughput-50%Read | 23.78 | 2414 | 75 | 25% |
| Random-8k-70%Read | 12.89 | 4182 | 32 | 34% |
Problem
I was not able to run the tests against the raw disk, because it was not detected by iometer.
As suggested here http://jrich523.wordpress.com/2011/01/18/iometer-not-showing-all-disks/, I used the administrator account, but it did not change anything.
| memory reservations/limits |  |
Per-VM memory reservations/limits are discouraged since they needlessly increase admin overhead.
Source : http://www.reddit.com/r/vmware/comments/1986s0/question_about_resource_allocation_memory_limits/
In most cases, it is not necessary to specify a limit. There are benefits and drawbacks:
Benefits: Assigning a limit is useful if you start with a small number of virtual machines and want to manage user expectations. Performance deteriorates as you add more virtual machines. You can simulate having fewer resources available by specifying a limit.
Drawbacks: You might waste idle resources if you specify a limit. The system does not allow virtual machines to use more resources than the limit, even when the system is underutilized and idle resources are available. Specify the limit only if you have good reasons for doing so.
Sources :
http://pubs.vmware.com/vsphere-4-esx-vcenter/index.jsp?topic=/com.vmware.vsphere.resourcemanagement.doc_41/getting_started_with_resource_management/c_limit.html
http://pubs.vmware.com/vcd-51/index.jsp?topic=%2Fcom.vmware.vcloud.users.doc_51%2FGUID-DAF22DCC-1FCF-4627-8258-B7B0F5C90E05.html
| iSCSI login timeout |  |
Adjust the iSCSI login timeout on ESXi 5.0
In ESXi 5.x, the iSCSI login timeout is set to 5 seconds. This means that after 5 seconds the ESXi host kills the iSCSI session if there is no response, and tries to log in again immediately after. This places additional load on the storage array, and can result in a "login storm".
The ability to change this setting from the vSphere Client has been added with VMware ESXi 5.0 Patch Release ESXi500-201112001 (2007680). To help alleviate this problem, extend the login timeout to 15 seconds, or 30 seconds if necessary.
Source : http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2007829
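The login timeout can also be checked and adjusted from the ESXi shell with esxcli; a minimal sketch, assuming a software iSCSI adapter named vmhba37 (replace it with your own adapter name):
# Show the current iSCSI adapter parameters, including LoginTimeout
esxcli iscsi adapter param get --adapter=vmhba37
# Raise the login timeout to 15 seconds
esxcli iscsi adapter param set --adapter=vmhba37 --key=LoginTimeout --value=15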
| RVTools |  |
RVTools is a Windows .NET 2.0 application which uses the VI SDK to display information about your virtual machines and ESX hosts. It interacts with VirtualCenter 2.5, ESX Server 3.5, ESX Server 3i, ESX Server 4i, VirtualCenter 4.0, ESX Server 4.0, VirtualCenter 4.1, ESX Server 4.1, VirtualCenter 5.0, VirtualCenter Appliance, ESX Server 5, VirtualCenter 5.1 and ESX Server 5.1. RVTools is able to list information about VMs, CPU, memory, disks, partitions, network, floppy drives, CD drives, snapshots, VMware Tools, resource pools, ESX hosts, HBAs, NICs, switches, ports, distributed switches, distributed ports, service consoles, VM kernels, datastores and health checks. With RVTools you can disconnect the CD-ROM or floppy drives from the virtual machines, and RVTools is able to update the VMware Tools installed inside each virtual machine to the latest version.
http://www.robware.net/
| sanHQ |  |
sanHQ can be used to analyse how the infrastructure is behaving.
For the EqualLogic array, I/O latencies can be interpreted as follows:
Average I/O Latency
One of the leading indicators of a healthy SAN is latency. Latency is the time from the receipt of the I/O request to the time that the I/O is returned to the server.
Latency must be considered in conjunction with the average I/O size, because large I/O operations take longer to process than small I/O operations.
The following guidelines apply to I/O operations with an average size of 16 KB or less:
Less than 20 ms: In general, average latencies of less than 20 ms are acceptable.
20 ms to 50 ms: Sustained average latencies between 20 ms and 50 ms should be monitored closely. You might want to reduce the workload or add additional storage resources to handle the load.
51 ms to 80 ms: Sustained average latencies between 51 ms and 80 ms should be monitored closely. Applications might experience problems and noticeable delays. You might want to reduce the workload or add additional storage resources to handle the load.
Greater than 80 ms: A sustained average latency of more than 80 ms indicates a problem, especially if this value is sustained over time. Most enterprise applications will experience problems if latencies exceed 100 ms. You should reduce the workload or add additional storage resources to handle the load.
If the average I/O operation size is greater than 16 KB, the previous latency guidelines might not apply. If latency statistics indicate a performance problem, examine the total IOPS in the pools. The storage array configuration (disk drives and RAID level) determines the maximum number of random IOPS that can be sustained. EqualLogic customer support or your channel partner can help size storage configurations for specific workloads.
Also, review the latency on your servers. If the storage does not show a high latency but the server does, the source of the problem might be the server or network infrastructure. Consult your operating system, server, or switch vendor for appropriate actions to take.
Source : http://psonlinehelp.equallogic.com/V5.1/Content/V5.1/V51/Concpts/New_concepts/CNCPTS_Idntfy_Perf_Probs.htm
| Disable Delayed ACKs |  |
A central precept of the TCP network protocol is that data sent through TCP be acknowledged by the recipient. According to RFC 813, "Very simply, when data arrives at the recipient, the protocol requires that it send back an acknowledgement of this data. The protocol specifies that the bytes of data are sequentially numbered, so that the recipient can acknowledge data by naming the highest numbered byte of data it has received, which also acknowledges the previous bytes.". The TCP packet that carries the acknowledgement is known as an ACK.
A host receiving a stream of TCP data segments can increase efficiency in both the network and the hosts by sending less than one ACK acknowledgment segment per data segment received. This is known as a delayed ACK. The common practice is to send an ACK for every other full-sized data segment and not to delay the ACK for a segment by more than a specified threshold. This threshold varies between 100ms and 500ms. ESXi/ESX uses delayed ACK because of its benefits, as do most other servers.
Sources :
http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=1002598
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2007829
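To verify whether Delayed ACK is currently enabled on the software iSCSI initiator, the iSCSI database can be dumped from the ESXi shell; a sketch, assuming an ESXi 5.x software iSCSI adapter (the exact output format may differ between builds):
# A DelayedAck value of 1 means Delayed ACK is enabled, 0 means it is disabled
vmkiscsid --dump-db | grep Delayed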
| Dell DPack |  |
DPACK = Dell Performance Analysis Collection Kit
The Dell Performance Analysis Collection Kit (DPACK) helps you optimize spending and analyze opportunities for virtualization or data center expansion.
Source : http://www.dell.com/learn/us/en/04/campaigns/dell-performance-analysis-collection-kit-dpack
The Windows DPACK collector supports adding a VMware vCenter server running 3.5 or above. DPACK uses the same protocol to gather information as VMware vSphere Client. This protocol uses HTTPS/SOAP.
Note: to be able to collect data for more than 24 hours, the collector executable must be launched from the command line with the /extended option:
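For example, from a command prompt on the collector machine (a sketch only; "dpack.exe" is a placeholder name, use the actual collector executable shipped in the DPACK download):
C:\Temp\DPACK> dpack.exe /extended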
| esxtop |  |
Main source: https://communities.vmware.com/docs/DOC-9279
Thresholds
Metrics and Thresholds
| Display | Metric | Threshold | Explanation |
| CPU | %RDY | 10 | Overprovisioning of vCPUs, excessive usage of vSMP or a limit (check %MLMTD) has been set. See Jason's explanation for vSMP VMs. |
| CPU | %CSTP | 3 | Excessive usage of vSMP. Decrease the number of vCPUs for this particular VM. This should lead to increased scheduling opportunities. |
| CPU | %SYS | 20 | The percentage of time spent by system services on behalf of the world. Most likely caused by a high-IO VM. Check other metrics and the VM for a possible root cause. |
| CPU | %MLMTD | 0 | The percentage of time the vCPU was ready to run but deliberately wasn't scheduled because that would violate the CPU limit settings. If larger than 0, the world is being throttled due to the limit on CPU. |
| CPU | %SWPWT | 5 | VM waiting on swapped pages to be read from disk. Possible cause: memory overcommitment. |
| MEM | MCTLSZ | 1 | If larger than 0, the host is forcing VMs to inflate the balloon driver to reclaim memory because the host is overcommitted. |
| MEM | SWCUR | 1 | If larger than 0, the host has swapped memory pages in the past. Possible cause: overcommitment. |
| MEM | SWR/s | 1 | If larger than 0, the host is actively reading from swap (vswp). Possible cause: excessive memory overcommitment. |
| MEM | SWW/s | 1 | If larger than 0, the host is actively writing to swap (vswp). Possible cause: excessive memory overcommitment. |
| MEM | CACHEUSD | 0 | If larger than 0, the host has compressed memory. Possible cause: memory overcommitment. |
| MEM | ZIP/s | 0 | If larger than 0, the host is actively compressing memory. Possible cause: memory overcommitment. |
| MEM | UNZIP/s | 0 | If larger than 0, the host is accessing compressed memory. Possible cause: the host was previously overcommitted on memory. |
| MEM | N%L | 80 | If less than 80, the VM experiences poor NUMA locality. If a VM has a memory size greater than the amount of memory local to each processor, the ESX scheduler does not attempt to use NUMA optimizations for that VM and remotely uses memory via the interconnect. Check GST_ND(X) to find out which NUMA nodes are used. |
| NETWORK | %DRPTX | 1 | Dropped transmit packets, hardware overworked. Possible cause: very high network utilization. |
| NETWORK | %DRPRX | 1 | Dropped receive packets, hardware overworked. Possible cause: very high network utilization. |
| DISK | GAVG | 25 | Look at DAVG and KAVG, as the sum of both is GAVG. |
| DISK | DAVG | 25 | Disk latency most likely caused by the array. (Threshold of 12 according to VMware.) |
| DISK | KAVG | 2 | Disk latency caused by the VMkernel; a high KAVG usually means queuing. Check QUED. |
| DISK | QUED | 1 | Queue maxed out. The queue depth is possibly set too low. Check with the array vendor for the optimal queue depth value. |
| DISK | ABRTS/s | 1 | Aborts issued by the guest (VM) because storage is not responding. For Windows VMs this happens after 60 seconds by default. Can be caused, for instance, when paths fail or the array is not accepting any IO for whatever reason. |
| DISK | RESETS/s | 1 | The number of commands reset per second. |
| DISK | CONS/s | 20 | SCSI reservation conflicts per second. If many SCSI reservation conflicts occur, performance could be degraded due to the lock on the VMFS. |
Source : http://www.yellow-bricks.com/esxtop/#esxtop-limiting
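To capture these counters over time instead of watching them live, esxtop can also be run in batch mode and the resulting CSV analysed afterwards (in perfmon, Excel, etc.); a minimal sketch, the interval and number of samples here are arbitrary:
# 2-second samples, 900 iterations = 30 minutes of data written to a CSV file
esxtop -b -d 2 -n 900 > /tmp/esxtop-capture.csv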
A world is an ESX Server VMkernel schedulable entity, similar to a process or thread in other operating systems.
Memory
Press "m" once esxtop is running.
At the end of the first line you will find "MEM overcommit avg".
Q: What does it mean if overcommit is not 0?
A: It means that total requested guest physical memory is more than the machine memory available. This is fine, because ballooning and page sharing allows memory overcommit.
%MLMTD
The percentage of time the world was ready to run but deliberately wasn't scheduled because that would violate the "CPU limit" settings.
Note that %MLMTD is included in %RDY.
Q: What does it mean if %MLMTD of a VM is high?
A: The VM cannot run because of the "CPU limit" setting. If you want to improve the performance of this VM, you may increase its limit. However, keep in mind that it may reduce the performance of others.
%RDY
The percentage of time the world was ready to run.
A world in the run queue is waiting for the CPU scheduler to let it run on a PCPU. %RDY accounts for the percentage of this time, so it is always smaller than 100%.
Q: How do I know CPU resource is under contention?
A: %RDY is the main indicator, but it is not sufficient by itself.
If a "CPU limit" is set in a VM's resource settings, the VM will be deliberately held from being scheduled on a PCPU when it uses up its allocated CPU resource. This may happen even when there are plenty of free CPU cycles. The time the VM is deliberately held back by the scheduler is shown by "%MLMTD", described above. Note that %RDY includes %MLMTD, so for CPU contention we will use "%RDY - %MLMTD". If "%RDY - %MLMTD" is high, e.g., larger than 20%, you may experience CPU contention.
What is the recommended threshold? Well, it depends. As a starting point, we could use 20%. If your application speed in the VM is OK, you may tolerate a higher threshold; otherwise, lower it.
Q: How do we break down 100% for the world state times?
A: A world can be in different states: either scheduled to run, ready to run but not scheduled, or not ready to run (waiting for some events).
100% = %RUN + %RDY + %CSTP + %WAIT.
Check the description of %CSTP and %WAIT below.
Q: What does it mean if %RDY of a VM is high?
A: It means the VM is possibly under resource contention. Check "%MLMTD" as well. If "%MLMTD" is high, you may raise the "CPU limit" setting for the VM. If "%RDY - %MLMTD" is high, the VM is under CPU contention.
PCPU UTIL(%)
The percentage of unhalted CPU cycles per PCPU, and its average over all PCPUs.
Q: What does it mean if PCPU UTIL% is high?
A: It means that you are using a lot of CPU resources. (a) If all of the PCPUs are near 100%, it is possible that you are overcommitting your CPU resources. You need to check %RDY of the groups in the system to verify CPU overcommitment; refer to %RDY above. (b) If some PCPUs stay near 100% but others do not, there might be an imbalance issue. Note that you should monitor the system for a few minutes to verify whether the same PCPUs are using ~100% CPU. If so, check the VM CPU affinity settings.
Q: What is the difference between "PCPU UTIL(%)" and "PCPU USED(%)"?
A: While "PCPU UTIL(%)" indicates how much time a PCPU was busy (unhalted) in the last duration, "PCPU USED(%)" shows the amount of "effective work" that has been done by this PCPU. The value of "PCPU USED(%)" can be different from "PCPU UTIL(%)" mainly for the following two reasons:
Monitoring storage performance on a per-LUN basis
Start esxtop by typing esxtop from the command line.
Press u to switch to disk view (LUN mode).
Press f to modify the fields that are displayed.
Press b, c, f, and h to toggle the fields and press Enter.
Press s and then 2 to alter the update time to every 2 seconds and press Enter.
See Analyzing esxtop columns for a description of relevant columns.
DAVG
This is the latency seen at the device driver level. It includes the roundtrip time between the HBA and the storage.
DAVG is a good indicator of performance of the backend storage. If IO latencies are suspected to be causing performance problems, DAVG should be examined. Compare IO latencies with corresponding data from the storage array. If they are close, check the array for misconfiguration or faults. If not, compare DAVG with corresponding data from points in between the array and the ESX Server, e.g., FC switches. If this intermediate data also matches DAVG values, it is likely that the storage is under-configured for the application. Adding disk spindles or changing the RAID level may help in such cases.
To view this metric: esxtop > d (disk) or u (LUN mode)
Sources :
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1008205
https://communities.vmware.com/docs/DOC-9279
Note :
All arrays perform differently, however DAVG/cmd, KAVG/cmd, and GAVG/cmd should not exceed more than 10 milliseconds (ms) for sustained periods of time.
Source : http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=1008205
| vmkernel.log |  |
grep latency /var/log/vmkernel.log
[...]
2013-12-17T08:25:30.027Z cpu10:5862146)WARNING: ScsiDeviceIO: 1224: Device naa.68b7b2dc154380a1e2f4c402000060fb performance has deteriorated. I/O latency increased from average value of 5824 microseconds to 242126 microseconds.
[...] |
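A quick way to see how often this warning occurs and which devices are affected (a sketch based on the message format above):
# Total number of latency-deterioration warnings
grep -c "performance has deteriorated" /var/log/vmkernel.log
# Same warnings counted per naa.* device identifier
grep "performance has deteriorated" /var/log/vmkernel.log | grep -o "naa\.[0-9a-f]*" | sort | uniq -c | sort -rn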
| Switch analysis |  |
show interfaces summary
SR6-EC2#show interfaces summary
*: interface is up
IHQ: pkts in input hold queue IQD: pkts dropped from input queue
OHQ: pkts in output hold queue OQD: pkts dropped from output queue
RXBS: rx rate (bits/sec) RXPS: rx rate (pkts/sec)
TXBS: tx rate (bits/sec) TXPS: tx rate (pkts/sec)
TRTL: throttle count
Interface IHQ IQD OHQ OQD RXBS RXPS TXBS TXPS TRTL
-------------------------------------------------------------------------
Vlan1 0 0 0 0 0 0 0 0 0
* Vlan101 1 0 0 0 1000 1 1000 1 0
[...]
* GigabitEthernet2/0/18 0 0 0 2683650 5447000 1829 17333000 3436 0
[...]
* GigabitEthernet2/0/47 0 0 0 416021 9926000 3195 41478000 5999 0
* GigabitEthernet2/0/48 0 0 0 3051726 36445000 5794 19286000 4102 0
[...]
* Port-channel1 0 0 0 6151397 51831000 10831 78101000 13537 0 |
One of the most telling signs is that the switches (assuming Cisco) carrying your iSCSI traffic may show high Output Queue Drops (OQD) counters when you run a show interfaces summary. This is because the packets sent out the wrong vmnic have no business being on the switch they were routed to, so the switch drops the packet.
Source : http://vmtoday.com/2012/02/vsphere-5-networking-bug-affects-software-iscsi/
For information, interfaces GigabitEthernet2/0/18 and 2/0/47 are used to reach the network core via Port-channel 1.
Note
Output drops are caused by a congested interface. For example, the traffic rate on the outgoing interface cannot accept all packets that should be sent out. The ultimate solution to resolve the problem is to increase the line speed. However, there are ways to prevent, decrease, or control output drops when you do not want to increase the line speed. You can prevent output drops only if output drops are a consequence of short bursts of data. If output drops are caused by a constant high-rate flow, you cannot prevent the drops. However, you can control them.
Source : http://www.cisco.com/en/US/products/hw/routers/ps133/products_tech_note09186a0080094791.shtml
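To check whether the output drops are still increasing rather than being an old accumulated total, the counters can be cleared and read again after a few minutes; a sketch, using one of the interfaces from the capture above:
SR6-EC2#clear counters GigabitEthernet2/0/48
SR6-EC2#show interfaces GigabitEthernet2/0/48 | include drops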
show int [type number] switching
SR6-EC2#show int GigabitEthernet2/0/48 switching
GigabitEthernet2/0/48 Vers SR5-CR1 int 5/30
[...]
Protocol Path Pkts In Chars In Pkts Out Chars Out
Other Process 256176 15370560 767094 46025640
Cache misses 0
Fast 0 0 0 0
Auton/SSE 0 0 0 0
CDP Process 128090 60073880 127863 58049802
Cache misses 0 |
Check whether the number of processed packets received is followed by a high number of cache misses. If so, this indicates that the packets, which congest the input queue, are forwarded through the router. Otherwise, these packets are destined for the router.
Source : http://www.cisco.com/en/US/products/hw/routers/ps133/products_tech_note09186a0080094791.shtml